In [1]:
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup

# Web Scraping and DataFrame Creation

## Objective:
This code scrapes an archived webpage and builds a pandas DataFrame with the rank, title, and release year of the top 50 highly-ranked films.

## Code Explanation:

1. **URL Definition:**
   - The `url` variable holds the address of the page to scrape: a Wayback Machine (web.archive.org) snapshot of the everybodywiki page "100 Most Highly-Ranked Films".

2. **Database and Table Names:**
   - `db_name` specifies the name of the database where the scraped data will be stored.
   - `table_name` specifies the name of the table within the database.

3. **DataFrame Initialization:**
   - A DataFrame (`df`) is initialized with three columns: "Average Rank", "Film", and "Year". This DataFrame will store the scraped data.

4. **Web Scraping:**
   - The code uses the `requests` library to fetch the HTML content of the specified URL.
   - `BeautifulSoup` is used to parse the HTML content.
   - The code then finds all the `<tbody>` elements in the parsed HTML. Each `<tbody>` is the body section of a table, so this yields one entry per table on the page.

5. **Data Extraction:**
   - It iterates through the rows (`<tr>`) of the first `<tbody>` found, on the assumption that this is the table holding the desired data.
   - While the count is below 50 (to limit the result to the top 50 films), it collects the cells (`<td>`) of each row. Rows with no `<td>` cells, such as a header row, are skipped and do not count toward the limit.
   - The first child (`contents[0]`) of each of the first three cells is stored in a dictionary as the average rank, film name, and year.
   - The dictionary is converted into a one-row DataFrame (`df1`) and concatenated with the main DataFrame (`df`).
   - The count is incremented after each stored row; once it reaches 50, the loop exits.

6. **Printing the DataFrame:**
   - Finally, the code prints the DataFrame containing the scraped data.
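
The parsing and extraction steps can be tried offline with a small sketch. The HTML snippet, the function name `extract_top_films`, and the `limit` parameter are illustrative assumptions and do not come from the scraped page. The sketch also departs from the notebook in two ways: it collects rows in a list and builds the DataFrame once instead of concatenating per row, and it reads cell text with `get_text(strip=True)` instead of `contents[0]`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page: a tiny table with the same
# three-column layout the explanation assumes (rank, film, year).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><th>Average Rank</th><th>Film</th><th>Year</th></tr>
    <tr><td>1</td><td>The Godfather</td><td>1972</td></tr>
    <tr><td>2</td><td>Citizen Kane</td><td>1941</td></tr>
    <tr><td>3</td><td>Casablanca</td><td>1942</td></tr>
  </tbody>
</table>
"""


def extract_top_films(html, limit=50):
    """Return up to `limit` data rows of the first <tbody> as a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("tbody")
    records = []
    for row in tables[0].find_all("tr"):
        if len(records) >= limit:
            break
        cols = row.find_all("td")
        if len(cols) == 0:
            continue  # header rows use <th>, so they have no <td> cells
        records.append({
            "Average Rank": cols[0].get_text(strip=True),
            "Film": cols[1].get_text(strip=True),
            "Year": cols[2].get_text(strip=True),
        })
    return pd.DataFrame(records, columns=["Average Rank", "Film", "Year"])


# Limit to two rows here only to show the cutoff on the small sample.
sample_df = extract_top_films(SAMPLE_HTML, limit=2)
print(sample_df)
```

All three columns come back as strings, so a numeric sort or filter on rank or year would need an explicit conversion first.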

## Output:
The output of this code is a DataFrame (`df`) containing information about the top 50 highly-ranked films, including their average rank, film name, and year.
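
Item 2 names a database and a table, but this section does not show the storage step itself. One plausible way to do it, sketched here as an assumption and not as the notebook's actual method, is `DataFrame.to_sql` with a `sqlite3` connection. The sample rows and the in-memory database are stand-ins so the sketch leaves no file behind.

```python
import sqlite3

import pandas as pd

# Hypothetical sample standing in for the scraped DataFrame.
sample_df = pd.DataFrame({
    "Average Rank": [1, 2],
    "Film": ["The Godfather", "Citizen Kane"],
    "Year": [1972, 1941],
})

table_name = "Top_50"

# An in-memory database keeps the sketch free of side effects; the notebook's
# db_name ('Movies.db') would be passed to sqlite3.connect instead.
conn = sqlite3.connect(":memory:")

# if_exists="replace" overwrites the table on re-runs; index=False drops the
# DataFrame index so only the three data columns are written.
sample_df.to_sql(table_name, conn, if_exists="replace", index=False)

# Read the rows back to confirm the round trip.
stored = pd.read_sql(f"SELECT * FROM {table_name}", conn)
conn.close()
print(stored)
```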


In [3]:
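# Archived snapshot of the page that lists the films
url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
# Database and table names for storing the scraped data
db_name = 'Movies.db'
table_name = 'Top_50'
# Empty DataFrame that accumulates one row per film
df = pd.DataFrame(columns=["Average Rank","Film","Year"])
count = 0  # number of film rows stored so far

# Fetch the page and parse the HTML
html_page = requests.get(url).text
data = BeautifulSoup(html_page, 'html.parser')

# The first <tbody> is assumed to hold the ranking table
tables = data.find_all('tbody')
rows = tables[0].find_all('tr')
for row in rows:
    if count<50:
        col = row.find_all('td')
        # Skip rows without <td> cells (e.g. a header row)
        if len(col)!=0:
            data_dict = {"Average Rank": col[0].contents[0],
                         "Film": col[1].contents[0],
                         "Year": col[2].contents[0]}
            # Wrap the row in a one-row DataFrame and append it to df
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df,df1], ignore_index=True)
            count+=1
    else:
        # Stop once 50 films have been collected
        break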

print(df)


   Average Rank                                           Film  Year
0             1                                  The Godfather  1972
1             2                                   Citizen Kane  1941
2             3                                     Casablanca  1942
3             4                         The Godfather, Part II  1974
4             5                            Singin' in the Rain  1952
5             6                                         Psycho  1960
6             7                                    Rear Window  1954
7             8                                 Apocalypse Now  1979
8             9                          2001: A Space Odyssey  1968
9            10                                  Seven Samurai  1954
10           11                                        Vertigo  1958
11           12                                    Sunset Blvd  1950
12           13                                   Modern Times  1936
13           14                   