## Webscraping IMDb 1000 Greatest Films of all Time and Exploratory Data Analysis (EDA)

#### This web scraper collects the following fields for each of the 1000 films:
* Title
* Genre
* Release Year
* Director
* Lead Actor
* Run Time
* MPA Rating
* IMDb Stars Rating
* Synopsis
* Metascore
* Votes
* Gross
* Thumbnail Image URL

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
from time import sleep
from random import randint
import re
import xlsxwriter


In [2]:
# request English-language film titles via the Accept-Language header
headers = {'Accept-Language': 'en-US, en;q=0.5'}

In [3]:
page_1_url = 'https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&ref_=adv_prv'
page_1_url_response = requests.get(page_1_url)

print(page_1_url_response.status_code)

200


In [4]:
# initialize empty lists in which to store extracted (scraped) data

title = []
genre = []
year = []
director = []
lead = []
time_minutes = []
mpa_rating = []
imdb_stars_rating = []
synopsis = []
metascore = []
votes = []
gross = []
image_url = []



After the first page, the URL of each successive page in the 1000-film list gains a `start` parameter. The first page has no `start` parameter; every later page uses the index of its first item: 101, 201, 301, 401, 501, 601, 701, 801 and 901, with the last of these covering the final 100 films in the collection.


items 1 to 100

https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&ref_=adv_prv

items 101 to 200

https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=101&ref_=adv_prv

items 201 to 300

https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=201&ref_=adv_prv

items 301 to 400

https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=301&ref_=adv_prv

items 401 to 500

https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=401&ref_=adv_prv

items 501 to 600

https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=501&ref_=adv_prv


...

items 901 to 1000

https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=901&ref_=adv_nxt


In [5]:
# create an array using: numpy.arange([start, ]stop, [step, ], dtype=None)
# consider the imdb pages list 1000 films with each of the 10 pages listing 100 films
# from above page URLs we can see that each page contains a list of 100 films
# variable pages will store the 10 page designations for each of the page URLs
pages = np.arange(1, 1001, 100)

# get list containing the 10 pages that contain 100 films each
pages

array([  1, 101, 201, 301, 401, 501, 601, 701, 801, 901])
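The URL pattern described above can be sketched as a small helper that turns the `pages` array into the ten page URLs. The names `BASE_URL` and `build_page_urls` are made up for this illustration, and the sketch assumes, as the scraping loop does, that `start=1` returns the same results as the first page with no `start` parameter.

```python
import numpy as np

# Query string shared by every page URL listed in this notebook
BASE_URL = ("https://www.imdb.com/search/title/"
            "?groups=top_1000&sort=user_rating,desc&count=100")


def build_page_urls(total=1000, per_page=100):
    """Return one URL per results page (hypothetical helper, not part of the notebook)."""
    starts = np.arange(1, total + 1, per_page)  # 1, 101, ..., 901
    return [f"{BASE_URL}&start={start}&ref_=adv_nxt" for start in starts]


page_urls = build_page_urls()
print(len(page_urls))
print(page_urls[1])
```

Building the list up front makes it easy to inspect the URLs before any request is sent.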

Loop through the pages that contain each of the page URLs

Get the page data from each of the URLs

Use BeautifulSoup to parse the HTML returned from each of the URLs

Begin inspecting web elements in the pages

Begin scraping (extracting) data from the HTML soup using the `find_all()` method. This method extracts all of the div containers whose class attribute is `lister-item mode-advanced`

<img src="images/div_container.png" alt="inspecting HTML" width="600"/>

Then use a nested for loop to iterate over these divs and scrape the data from the relevant elements inside each one
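The steps above can be tried offline on a small HTML fragment before any request is sent. The markup below is a hypothetical, heavily simplified stand-in modelled on the class names used in this notebook, not a copy of the real IMDb page, and `text_or_none` is an illustrative helper for elements that may be missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page contains many more elements
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000001/">Example Film</a>
      <span class="lister-item-year text-muted unbold">(1994)</span></h3>
  <p class="text-muted">
    <span class="certificate">R</span>
    <span class="runtime">142 min</span>
    <span class="genre">Drama</span>
  </p>
</div>
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000002/">Another Film</a>
      <span class="lister-item-year text-muted unbold">(I) (2019)</span></h3>
  <p class="text-muted">
    <span class="runtime">120 min</span>
    <span class="genre">Comedy</span>
  </p>
</div>
"""


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    tag = parent.find(name, {'class': css_class})
    return tag.text.strip() if tag is not None else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = []
for item in soup.find_all('div', {'class': 'lister-item mode-advanced'}):
    rows.append({
        'film_title': item.h3.a.text,
        'release_year': text_or_none(item, 'span', 'lister-item-year text-muted unbold'),
        'mpa_rating': text_or_none(item, 'span', 'certificate'),
        'run_time': text_or_none(item, 'span', 'runtime'),
        'genre': text_or_none(item, 'span', 'genre'),
    })

for row in rows:
    print(row)
```

Returning `None` for a missing element keeps every column the same length, which matters when the lists are later combined into a DataFrame.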

In [None]:
for page in pages:
    # keep the response in its own variable instead of overwriting the loop variable `page`
    response = requests.get("https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=" + str(page) + "&ref_=adv_nxt", headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    imdb_data = soup.find_all('div', {'class': 'lister-item mode-advanced'})
    # pause for a random 2 to 10 seconds between requests to avoid sending too many requests in a short period
    sleep(randint(2, 10))
    
    for data in imdb_data:
        
        film = data.h3.a.text
        title.append(film)
        
        genre_type = data.p.find('span', {'class': 'genre'}).text.replace('\n', ' ')
        genre.append(genre_type)
        
        release = data.h3.find('span', {'class': 'lister-item-year text-muted unbold'}).text.replace('(', ' ').replace(')', ' ')
        year.append(release)
        
        a_tag = data.find('p', {'class': ''})
        a_tag = a_tag.select('a[href^="/name/"]')
        directors = a_tag[0].text
        director.append(directors)
        
        actor = a_tag[1].text
        lead.append(actor)
        
        runtime = data.p.find('span', {'class': 'runtime'}).text.replace(' min', ' ')
        time_minutes.append(runtime)
        
        # Motion Picture Association (MPA) ratings such as G, PG, PG-13, R
        mpa_rate = data.find('span', {'class': 'certificate'})
        if mpa_rate is not None:
            # reuse the tag that was already found instead of searching again
            mpa_rate = mpa_rate.text
        mpa_rating.append(mpa_rate)
        
        star_rate = data.find('div', {'class': 'inline-block ratings-imdb-rating'}).text.replace('\n', ' ')
        imdb_stars_rating.append(star_rate)
        
        p_tag = data.find_all('p', {'class': 'text-muted'})
        syn = p_tag[1].text.replace('\n', ' ')
        synopsis.append(syn)
        
        meta = data.find('span', {'class': 'metascore'}).text if data.find('span', {'class': 'metascore'}) else '----'
        metascore.append(meta)
        
        nv_tag = data.find_all('span', attrs = {'name': 'nv'})
        
        vote = nv_tag[0].text
        votes.append(vote)
        
        grs = nv_tag[1].text if len(nv_tag) > 1 else '----'
        gross.append(grs)
        
        img_tag = data.find('div', {'class': 'lister-item-image float-left'})
        x = img_tag.img.attrs['loadlate']
        image_url.append(x)
        


#### Ensure that all columns are the same length 

In [None]:
# Pandas dataframes must have columns of the same length
print(len(title),
      len(genre),
      len(year),
      len(director),
      len(lead),
      len(time_minutes),
      len(mpa_rating),
      len(imdb_stars_rating),
      len(synopsis),
      len(metascore),
      len(votes),
      len(gross),
      len(image_url))
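The length comparison above can also be automated so that a mismatch stops the notebook instead of relying on a visual check. This is a minimal sketch; `check_equal_lengths` is a hypothetical helper, and the two short lists stand in for the scraped columns.

```python
def check_equal_lengths(columns):
    """Return each column's length, raising ValueError if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return lengths


# Stand-in data; in the notebook these would be the scraped lists
example_columns = {
    'film_title': ['Film A', 'Film B'],
    'votes': ['1,200', '3,400'],
}
print(check_equal_lengths(example_columns))
```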

#### Create Pandas DataFrame with Columns for Scraped Film Data

In [None]:
imdb_df = pd.DataFrame({
    'film_title': title,
    'genre': genre,
    'release_year': year,
    'director': director,
    'lead_actor': lead,
    'run_time_minutes': time_minutes,
    'mpa_rating': mpa_rating,
    'imdb_stars_rating': imdb_stars_rating,
    'synopsis': synopsis,
    'metascore': metascore,
    'votes': votes,
    'gross': gross,
    'image_url': image_url
    
     })

In [None]:
imdb_df.head()

#### Export Scraped IMDb 1000 Greatest Films Data to .csv File

In [None]:
# export preliminary scraped imdb data to csv
# imdb_df.to_csv('top_imdb_movies.csv')

#### Scrape Film Data to Get A Specific Film Thumbnail Image 

In [None]:
data.find_all('div', {'class': 'lister-item-image float-left'}) 

In [None]:
data.img.attrs

In [None]:
data.img.attrs['alt']

In [None]:
data.img.attrs['class']

In [None]:
# print film image url
from IPython.display import Image
# use a separate name so the image_url list built during scraping is not overwritten
thumb_src = data.find('img').attrs['src']
print(thumb_src)

In [None]:
# display thumbnail image
from IPython.display import Image
thumb_url = data.find('img').attrs['loadlate']
Image(url=thumb_url)

### Exploratory Data Analysis

#### Explore Data

In [None]:
imdb_df.dtypes

In [None]:
# check for duplicates
imdb_df.duplicated().sum()

In [None]:
# check dataset for null values
imdb_df.isnull().sum()

In [None]:
# newdf = df.copy()
final_df = imdb_df.copy()

#### Clean Data

In [None]:
# replace null values in mpa_rating with "unknown"; missing gross and metascore values are not filled here and become NaN in the numeric conversions below
final_df['mpa_rating'] = final_df['mpa_rating'].fillna("unknown")

In [None]:
# df['first_set'] = df['first_set'].str.replace('_','|')
# final_df['mpa_rating'] = final_df['mpa_rating'].map(lambda z: z.lstrip('[').rstrip(']'))
# df['col1'] = df['col1'].str.strip()
# final_df['mpa_rating'] = final_df['mpa_rating'].str.lstrip('[')
final_df.head()


In [None]:
# check dataset for null values after using the fillna method above
final_df.isnull().sum()

In [None]:
final_df.dtypes

In [None]:
final_df.head()

In [None]:
final_df.dtypes['release_year']

In [None]:
# filter for extraneous text characters in 'release_year' column
# find where the II and III characters are so we can remove them in order to convert year column to date
mask = final_df['release_year'].str.contains('II', na=False)
final_df[mask].head()

In [None]:
# df['column name'] = df['column name'].str.replace('old character','new character')
# removing every 'I' also removes 'II' and 'III', so a single literal replace is enough;
# strip the whitespace left behind by the removed parentheses and numerals
final_df['release_year'] = final_df['release_year'].str.replace('I', '', regex=False).str.strip()
final_df[mask].head()

In [None]:
mask = final_df['release_year'].str.contains(' 1994 ', na=False)
final_df[mask].head()

In [None]:
# get a count of how many NaN values in the release_year column
final_df['release_year'].isna().sum()

In [None]:
# get a count of how many NaN values in the dataframe
# df.isna().sum().sum()
final_df.isna().sum().sum()

In [None]:
# get which columns contain NaN
final_df.isna().sum()

In [None]:
# df.col1 = pd.to_numeric(df.col1, errors="coerce")
final_df.release_year = pd.to_numeric(final_df.release_year, errors="coerce")

In [None]:
final_df.dtypes['release_year']

In [None]:
# df['Year'] = pd.to_numeric(df['Year'], errors='coerce').fillna(0).astype(int)
final_df['release_year'] = pd.to_numeric(final_df['release_year'], errors='coerce').fillna(0).astype(int)
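As an alternative to the character replacement used above, the four-digit year could be pulled out directly with a regular expression. This sketch swaps in `str.extract` and pandas' nullable `Int64` dtype, which keeps unparseable values as missing rather than 0; the sample strings are made up to resemble the scraped values.

```python
import pandas as pd

# Made-up examples of the scraped strings after the parentheses were replaced by spaces
raw_years = pd.Series([' 1994 ', ' I   2019 ', ' II   2016 ', 'unknown'])

# Capture the first run of four digits; rows without a match become NaN
year_text = raw_years.str.extract(r'(\d{4})', expand=False)

# Convert to numbers, then to the nullable integer dtype so missing values stay missing
clean_years = pd.to_numeric(year_text, errors='coerce').astype('Int64')
print(clean_years)
```

Whether 0 or a missing value is the better placeholder depends on the later analysis: a 0 will distort summary statistics such as the mean release year.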

In [None]:
# get release year of nth movie
final_df.loc[302,'release_year']

In [None]:
final_df.dtypes['release_year']

In [None]:
final_df.dtypes

In [None]:
# convert 'run_time_minutes' to numeric (raw string avoids an invalid-escape warning for \d)
final_df['run_time_minutes'] = final_df['run_time_minutes'].str.extract(r'(\d+)', expand=False).astype(int)
final_df.head(3)

In [None]:
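# convert 'imdb_stars_rating' to float
# note: r'(\d+)' captures only the digits before the decimal point, so a rating of 9.3 becomes 9.0;
# the whole-number colour keys in the votes-by-rating chart further down rely on this
final_df['imdb_stars_rating'] = final_df['imdb_stars_rating'].str.extract(r'(\d+)', expand=False).astype(float)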

In [None]:
final_df.info()

In [None]:
final_df.dtypes

In [None]:
final_df['mpa_rating'].dtypes

In [None]:
# df['col_name'] = df['col_name'].str.strip('[]')
# final_df['mpa_rating'] = final_df['mpa_rating'].str.strip('[]')

In [None]:
# gross column: left strip $ and right strip M 
final_df['gross'] = final_df['gross'].map(lambda x: x.lstrip('$').rstrip('M'))
# convert gross to float and if there are dashes turn it into NaN
final_df['gross'] = pd.to_numeric(final_df['gross'], errors='coerce')
final_df.head(5)

In [None]:
final_df.dtypes

In [None]:
# convert 'metascore' to numeric
final_df['metascore'] = final_df['metascore'].str.extract(r'(\d+)', expand=False)
# convert to float and if there are dashes turn it into NaN
final_df['metascore'] = pd.to_numeric(final_df['metascore'], errors='coerce')

In [None]:
final_df.dtypes

In [None]:
# remove commas from votes and convert votes to numeric
final_df['votes'] = final_df['votes'].str.replace(',', '').astype(int)
final_df.head(22)
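The gross, metascore and votes conversions above can be checked on a handful of values before they are applied to the full dataframe. The strings below are invented to mimic the scraped format, including the '----' placeholder used for missing values, and the column name `gross_millions` is only illustrative.

```python
import pandas as pd

# Invented sample values in the same shape as the scraped strings
sample = pd.DataFrame({
    'gross': ['$28.34M', '----', '$134.97M'],
    'votes': ['2,711,294', '1,882,829', '2,682,628'],
    'metascore': ['82        ', '----', '100        '],
})

cleaned = pd.DataFrame({
    # strip the currency symbol and the M suffix; '----' cannot be parsed and becomes NaN
    'gross_millions': pd.to_numeric(
        sample['gross'].str.lstrip('$').str.rstrip('M'), errors='coerce'),
    # remove thousands separators before converting to int
    'votes': sample['votes'].str.replace(',', '', regex=False).astype(int),
    # keep only the digits; rows with no digits become NaN
    'metascore': pd.to_numeric(
        sample['metascore'].str.extract(r'(\d+)', expand=False), errors='coerce'),
})
print(cleaned)
print(cleaned.dtypes)
```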

In [None]:
final_df.dtypes

In [None]:
final_df.head()

#### Export Final Dataframe to CSV

In [None]:
final_df.to_csv('top_imdb_movies_final_df.csv')

In [None]:
# summary statistics
final_df.describe()

In [None]:
# get number of values in each column, data types
final_df.info()

In [None]:
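# create a heatmap to show correlation between quantitative variables
# restrict to numeric columns first; recent pandas versions raise an error if text columns are passed to corr()
sns.heatmap(final_df.select_dtypes(include='number').corr(), annot=True);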
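The correlation matrix behind the heatmap can be inspected as a plain table. This is a small sketch on made-up numbers, assuming that only the numeric columns should take part in the calculation.

```python
import pandas as pd

# Made-up values; the text column is excluded before computing correlations
sample = pd.DataFrame({
    'film_title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'metascore': [80.0, 90.0, 70.0, 100.0],
    'imdb_stars_rating': [8.0, 9.0, 8.0, 9.0],
})

numeric = sample.select_dtypes(include='number')
corr = numeric.corr()  # Pearson correlation by default
print(corr.round(2))
```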

In [None]:
corr_df = final_df.copy()
# corr_df.head()

In [None]:
# plot a graph showing the correlation between metascore and IMDb stars rating

# Extract the columns we need
metascore = corr_df['metascore']
star_rating = corr_df['imdb_stars_rating']

# Create the scatter plot
plt.scatter(metascore, star_rating)

# Add labels and a title
plt.xlabel('Metascore')
plt.ylabel('IMDb Stars Rating')
plt.title('Correlation between Metascore and IMDb Stars Rating for Top 1000 Films')

# Show the plot
plt.show()
   

In [None]:

fig = px.scatter(data_frame=corr_df, x='metascore', y='imdb_stars_rating',title='Correlation between Metascore and IMDb Stars Rating for Top 1000 Films')
fig.show()


In [None]:
# Plot the release years of the top 10 films

# Get the top 10 films
top_10_films = final_df.head(10)

# Extract the columns we need
release_year = top_10_films['release_year']
title = top_10_films['film_title']


# Set the figure size
plt.figure(figsize=(12,8))

# Create the scatter plot
plt.scatter(np.arange(len(release_year)), release_year)

# Add labels and title
plt.xlabel('Films')
plt.ylabel('Release Year')
plt.title('Release Years of Top 10 Films')

# Add the title of each film as a label
for i, txt in enumerate(title):
    plt.annotate(txt, (i, release_year[i]), xytext=(0, 10), textcoords='offset points', ha='center')

plt.grid(True, linestyle='-')

# Show the plot
plt.show()


In [None]:
# newdf = df.copy()
df = final_df.copy()
df.head()

In [None]:
# plot a graph of the top 20 films rated PG


# Filter the dataframe to only include films rated PG
pg_df = df[df['mpa_rating'] == 'PG']

# Sort the dataframe by the number of votes in descending order
pg_df = pg_df.sort_values(by='votes', ascending=False)

# Select the top 20 films rated PG
pg_df = pg_df.head(20)

# Create a bar chart using Plotly Express
fig = px.bar(pg_df, x='film_title', y='votes', title='Top 20 Films Rated PG')

# Display the chart
fig.show()


In [None]:
pg_df.head()

In [None]:
# plot a graph of the top 20 films rated G


# Filter the dataframe to only include films rated G
g_df = df[df['mpa_rating'] == 'G']

# Sort the dataframe by the number of votes in descending order
g_df = g_df.sort_values(by='votes', ascending=False)

# Select the top 20 films rated G
g_df = g_df.head(20)

# Create a bar chart using Plotly Express
fig = px.bar(g_df, x='film_title', y='votes', title='Top 20 Films Rated G')

# Display the chart
fig.show()


In [None]:
g_df.head(10)

In [None]:
# plot a graph of the top 20 films rated PG-13


# Filter the dataframe to only include films rated PG-13
pg13_df = df[df['mpa_rating'] == 'PG-13']

# Sort the dataframe by the number of votes in descending order
pg13_df = pg13_df.sort_values(by='votes', ascending=False)

# Select the top 20 films rated PG-13
pg13_df = pg13_df.head(20)

# Create a bar chart using Plotly Express
fig = px.bar(pg13_df, x='film_title', y='votes', title='Top 20 Films Rated PG-13')

# Display the chart
fig.show()


In [None]:
# plot a graph of the top 20 films rated R


# Filter the dataframe to only include films rated R
r_df = df[df['mpa_rating'] == 'R']

# Sort the dataframe by the number of votes in descending order
r_df = r_df.sort_values(by='votes', ascending=False)

# Select the top 20 films rated R
r_df = r_df.head(20)

# Create a bar chart using Plotly Express
fig = px.bar(r_df, x='film_title', y='votes', title='Top 20 Films Rated R')

# Display the chart
fig.show()
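The four rating charts above repeat the same filter, sort and head steps, so they could share one helper. The sketch below uses a hypothetical function name, `top_by_votes`, and a tiny made-up dataframe in place of `df`.

```python
import pandas as pd


def top_by_votes(df, rating, n=20):
    """Return the n most-voted films with the given MPA rating (illustrative helper)."""
    subset = df[df['mpa_rating'] == rating]
    return subset.sort_values(by='votes', ascending=False).head(n)


# Made-up rows standing in for the cleaned dataframe
sample = pd.DataFrame({
    'film_title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'mpa_rating': ['PG', 'R', 'PG', 'PG'],
    'votes': [1200, 5000, 3400, 800],
})

top_pg = top_by_votes(sample, 'PG', n=2)
print(top_pg['film_title'].tolist())
```

The returned dataframe can be passed to `px.bar` exactly as the per-rating dataframes are above.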

In [None]:
# newdf = df.copy()
th_df = final_df.copy()
th_df.head()

In [None]:
# Create a new column that is True if Tom Hanks is the lead actor and False otherwise
th_df['Tom Hanks'] = th_df['lead_actor'].str.contains('Tom Hanks')

# Filter the dataframe to only include films starring Tom Hanks
th_df = th_df[th_df['Tom Hanks'] == True]

# Sort the dataframe by the number of votes in descending order
th_df = th_df.sort_values(by='votes', ascending=False)

# Create a bar chart using Plotly Express
fig = px.bar(th_df, x='film_title', y='votes', title='Top Films Starring Tom Hanks')

# Display the chart
fig.show()


In [None]:
# newdf = df.copy()
spiel_df = final_df.copy()
spiel_df.head()

In [None]:
# Create a new column that is True if Steven Spielberg is the director and False otherwise
spiel_df['Steven Spielberg'] = spiel_df['director'].str.contains('Steven Spielberg')

# Filter the dataframe to only include films directed by Steven Spielberg
spiel_df = spiel_df[spiel_df['Steven Spielberg'] == True]

# Sort the dataframe by the number of votes in descending order
spiel_df = spiel_df.sort_values(by='votes', ascending=False)

# Create a bar chart using Plotly Express
fig = px.bar(spiel_df, x='film_title', y='votes', title='Top Films Directed by Steven Spielberg')

# Display the chart
fig.show()

In [None]:
final_df.head()

In [None]:
# newdf = df.copy()
ry_df = df.copy()
ry_df.head()

In [None]:
vr_df = df.copy()

In [None]:


# Group the data by rating and get the number of votes for each group
grouped = vr_df.groupby('imdb_stars_rating')['votes'].sum()

# Set the colors for each rating
colors = {1.0:'r', 2.0:'g', 3.0:'b', 4.0:'c', 5.0:'m', 6.0:'y', 7.0:'k', 8.0:'r', 9.0:'g', 10.0:'b'}

# Create the stacked bar chart
grouped.plot(kind='bar', stacked=True, color=[colors[i] for i in grouped.index])

# Add labels and a title
plt.xlabel('IMDb Stars Rating')
plt.ylabel('Number of Votes')
plt.title('Number of Votes by IMDb Stars Rating for Top 1000 Films')

# Show the plot
plt.show()



In [None]:
ax = final_df['mpa_rating'].value_counts().plot(kind='bar',
                                   figsize=(12,6),
                                   title="Number of Movies by MPA Rating")
ax.set_xlabel("MPA Rating")
ax.set_ylabel("Number of Movies")
ax.plot();

In [None]:
# Resources:
# https://www.freecodecamp.org/news/web-scraping-sci-fi-movies-from-imdb-with-python/
# https://medium.com/analytics-vidhya/detailed-tutorials-for-beginners-web-scrap-movie-database-from-multiple-pages-with-beautiful-soup-5836828d23
