# Hands-on Lab: Web Scraping and Extracting Data Using APIs

Web scraping extracts relevant data from web pages. When the data you need sits on a publicly accessible web page, scraping is a convenient way to collect it. It does, however, require basic knowledge of how HTML pages are structured. In this lab, you will learn how to analyze the HTML code of a web page and extract the required information from it using web scraping in Python.
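
As a minimal sketch of what analyzing the HTML involves, the snippet below parses a small, made-up HTML table and walks its `<tbody>`, `<tr>`, and `<td>` elements with BeautifulSoup. The markup and film values are illustrative assumptions, not content from the page used in this lab.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for illustration.
sample_html = """
<table>
  <tbody>
    <tr><th>Rank</th><th>Film</th><th>Year</th></tr>
    <tr><td>1</td><td>Example Film A</td><td>1990</td></tr>
    <tr><td>2</td><td>Example Film B</td><td>2005</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A table body holds rows (<tr>); each data row holds cells (<td>).
body = soup.find("tbody")
parsed_rows = []
for tr in body.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:          # The header row has only <th> cells, so skip it.
        continue
    parsed_rows.append([cell.get_text(strip=True) for cell in cells])

print(parsed_rows)
```

Every value comes back as a string, which is why the lab code converts ranks and years with `int(...)` before storing them.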

# Objectives
By the end of this lab, you will be able to:

Use the requests and BeautifulSoup libraries to extract the contents of a web page

Analyze the HTML code of a webpage to find the relevant information

Extract the relevant information and save it in the required form

# Scenario
Suppose you have been hired by a multiplex management organization to extract information on the top 50 movies with the best average rank from the web page linked below.

https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films

In [12]:
# Importing Libraries

import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies.db'
table_name = 'Top_50'
csv_path = 'top_50_films.csv'

df = pd.DataFrame(columns = ["Average Rank", "Film", "Year"])
count = 0

# Scrape the webpage

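response = requests.get(url, timeout = 30)
response.raise_for_status()        # Stop early on an HTTP error instead of parsing an error page.
html_page = response.text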
data = BeautifulSoup(html_page, 'html.parser')

# Extract table data
 
tables = data.find_all('tbody')    # Find every <tbody> element in the parsed HTML.
                                   # <tbody> is the container for the rows of an HTML table.

rows = tables[0].find_all('tr')    # Assume the table of interest is the first one on the page,
                                   # and collect all of its <tr> (table row) elements.
                                   # These rows contain the data you want to extract.


for row in rows:
    
    if count < 50:
        
        col = row.find_all('td')   # Retrieves all <td> elements (table cells) in the current row (row).
        
        if len(col) != 0:          # If len(col) == 0, the row is empty or doesn’t contain data, so it is skipped.
            
            # .contents[0] is the first child of a cell, normally its text.
            # Without it, col[n] is a BeautifulSoup Tag object, not the text inside it.
            data_dict = {
                "Average Rank": int(col[0].contents[0]),
                "Film": str(col[1].contents[0]),
                "Year": int(col[2].contents[0])
            }
            
            df1 = pd.DataFrame(data_dict, index = [0])
            
            df = pd.concat([df, df1], ignore_index = True)
            
            count += 1
            
    else:
        
        break

# Display the DataFrame

print(df)

# Save to CSV

df.to_csv(csv_path, index = False)

# Save to SQLite

conn = sqlite3.connect(db_name)
df.to_sql(table_name, conn, if_exists = 'replace', index = False)
conn.close()

   Average Rank                                               Film  Year
0             1                                      The Godfather  1972
1             2                                       Citizen Kane  1941
2             3                                         Casablanca  1942
3             4                             The Godfather, Part II  1974
4             5                                Singin' in the Rain  1952
5             6                                             Psycho  1960
6             7                                        Rear Window  1954
7             8                                     Apocalypse Now  1979
8             9                              2001: A Space Odyssey  1968
9            10                                      Seven Samurai  1954
10           11                                            Vertigo  1958
11           12                                        Sunset Blvd  1950
12           13                                    
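
Calling `pd.concat` inside a loop copies the growing DataFrame on every iteration. A common alternative, sketched here with made-up tuples standing in for the scraped cell text, is to collect plain dictionaries in a list and build the DataFrame once at the end. The names `scraped_cells`, `limit`, `records`, and `films_df` are hypothetical and not part of the lab code.

```python
import pandas as pd

# Hypothetical (rank, film, year) strings standing in for scraped cell text.
scraped_cells = [
    ("1", "Example Film A", "1990"),
    ("2", "Example Film B", "2005"),
    ("3", "Example Film C", "1975"),
]

limit = 2          # Keep only the first `limit` rows, like the top-50 cutoff.
records = []
for rank, film, year in scraped_cells:
    if len(records) >= limit:
        break
    records.append({"Average Rank": int(rank), "Film": film, "Year": int(year)})

# Build the DataFrame once, after the loop.
films_df = pd.DataFrame(records, columns=["Average Rank", "Film", "Year"])
print(films_df)
```

Because the numeric columns are created from integers in one step, they get an integer dtype rather than the generic `object` dtype.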

# Practice problems

Modify the code to extract the columns under the Film, Year, and Rotten Tomatoes' Top 100 headers.

Restrict the results to only the top 25 entries.

Filter the output to print only the films released in 2000 or later.
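
One way to approach the first problem without hard-coding column positions is to read the header row and look up each column by name. The sketch below uses an invented table fragment and assumes the header row uses `<th>` cells inside the same `<tbody>`. The real page may be structured differently, so inspect its HTML before relying on this.

```python
from bs4 import BeautifulSoup

# Hypothetical table fragment; the header names mirror the practice problem,
# but the rows are invented for illustration.
sample_html = """
<table><tbody>
<tr><th>Average Rank</th><th>Film</th><th>Year</th><th>Rotten Tomatoes' Top 100</th></tr>
<tr><td>1</td><td>Example Film A</td><td>1990</td><td>5</td></tr>
<tr><td>2</td><td>Example Film B</td><td>2005</td><td>unranked</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rows = soup.find("tbody").find_all("tr")

# Map each header's text to its column position.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
index_of = {name: i for i, name in enumerate(headers)}

wanted = ["Film", "Year", "Rotten Tomatoes' Top 100"]
extracted = []
for tr in rows[1:]:
    cells = tr.find_all("td")
    if cells:
        extracted.append(
            {name: cells[index_of[name]].get_text(strip=True) for name in wanted}
        )

print(extracted)
```

Looking columns up by header name keeps the extraction correct even if the page later adds or reorders columns.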

In [14]:
# Importing Libraries

import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://web.archive.org/web/20230902185655/https://en.everybodywiki.com/100_Most_Highly-Ranked_Films'
db_name = 'Movies2.db'
table_name = 'Top_25'
csv_path = 'top_25_films.csv'

df = pd.DataFrame(columns = ["Film", "Year", "Rotten Tomatoes' Top 100"])
count = 0

# Scrape the webpage

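response = requests.get(url, timeout = 30)
response.raise_for_status()        # Stop early on an HTTP error instead of parsing an error page.
html_page = response.text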
data = BeautifulSoup(html_page, 'html.parser')

# Extract table data
 
tables = data.find_all('tbody')    # Find every <tbody> element in the parsed HTML.
                                   # <tbody> is the container for the rows of an HTML table.

rows = tables[0].find_all('tr')    # Assume the table of interest is the first one on the page,
                                   # and collect all of its <tr> (table row) elements.
                                   # These rows contain the data you want to extract.


for row in rows:
    
    if count < 25:
        
        col = row.find_all('td')   # Retrieves all <td> elements (table cells) in the current row (row).
        
        if len(col) != 0:          # If len(col) == 0, the row is empty or doesn’t contain data, so it is skipped.
            
            # .contents[0] is the first child of a cell, normally its text.
            # Without it, col[n] is a BeautifulSoup Tag object, not the text inside it.
            data_dict = {
                "Film": str(col[1].contents[0]),
                "Year": int(col[2].contents[0]),
                "Rotten Tomatoes' Top 100": str(col[3].contents[0])
            }
            
            df1 = pd.DataFrame(data_dict, index = [0])
            
            df = pd.concat([df, df1], ignore_index = True)
            
            count += 1
            
    else:
        
        break

# Display the DataFrame

pd.set_option('display.width', 1000)  # Increase the max width of the display
pd.set_option('display.colheader_justify', 'center')  # Center align the column headers
df.style.set_properties(**{'text-align': 'center'})   # Returns a Styler for rich notebook rendering. The result is not used here,
                                                      # so it changes neither df nor the print() output below.

# Filter films released in the 2000s (year >= 2000)
filtered_df = df[df['Year'] >= 2000]

# Display the filtered DataFrame
print(filtered_df)

# Save the filtered rows (films from 2000 onward) to CSV

filtered_df.to_csv(csv_path, index = False)

# Save all 25 scraped rows (unfiltered) to SQLite

conn = sqlite3.connect(db_name)
df.to_sql(table_name, conn, if_exists = 'replace', index = False)
conn.close()

                        Film                       Year Rotten Tomatoes' Top 100
16                                       Parasite  2019                6        
18  Lord of the Rings: The Fellowship of the Ring  2001         unranked        
22                              Avengers: Endgame  2019                7        
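
To check what was actually stored, the table can be read back from SQLite, and the year filter can be applied in SQL rather than in pandas. The sketch below uses invented rows and an in-memory database (`":memory:"`) so that it leaves no file behind. The names `films_df` and `recent` are hypothetical.

```python
import sqlite3
import pandas as pd

# Invented rows for illustration only.
films_df = pd.DataFrame({
    "Film": ["Example Film A", "Example Film B", "Example Film C"],
    "Year": [1990, 2005, 2019],
})

# ":memory:" creates a temporary database that disappears when closed.
conn = sqlite3.connect(":memory:")
try:
    films_df.to_sql("Top_25", conn, if_exists="replace", index=False)
    # Apply the year filter in SQL while reading the table back.
    recent = pd.read_sql("SELECT Film, Year FROM Top_25 WHERE Year >= 2000", conn)
finally:
    conn.close()

print(recent)
```

Wrapping the work in `try`/`finally` closes the connection even if the write or the query raises an error.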
