## Final Project Submission

Please fill out:
* Student name: Derek Davis
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


# Scraping Movie Data: Ratings, Runtime, Genre, Release Date, Actors, Directors, Budget, and Gross Income
The websites that I will be scraping in this notebook are:
1. https://www.imdb.com/search/title/?title_type=feature&num_votes=5000,&languages=en&sort=boxoffice_gross_us,desc&start=1&explore=genres&ref_=adv_nxt
2. https://www.the-numbers.com/movie/budgets/all

The goal is to create four clean tables: a base movie table with and without the finances included, an actors table, and a directors table.

In [1]:
#Import necessary packages
import pandas as pd
import numpy as np
import requests
import re
import bleach
from time import sleep
from random import randint
from bs4 import BeautifulSoup
%matplotlib inline

First, we are going to scrape the IMDb website for all English-language feature films with at least 5,000 votes. In our initial table we are going to pull: Movie Title, Release Year, IMDb Rating, Movie Rating (certificate), Runtime, and Genre. We will combine all of these data points into a table called IMDb_df.

In [2]:
# Create a list for each of the data points being scraped.
names = []
years = []
imdb_ratings = []
ratings = []
runtimes = []
genres = []
#Develop a for-loop that will iterate through each of the pages.
for i in range(1, 10252, 50):
    url = 'https://www.imdb.com/search/title/?title_type=feature&num_votes=5000,&languages=en&sort=boxoffice_gross_us,desc&start={}&explore=genres&ref_=adv_nxt'.format(i)
    response = requests.get(url)
    sleep(randint(8,15))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
    # Extract data from each individual movie container.
    for container in movie_containers:
        # Only keep movies that have a content rating (certificate).
        if container.find('span', class_ = 'certificate') is not None:
            # The Movie Title
            name = container.h3.a.text
            names.append(name)
            # The Release Year
            year = container.h3.find('span', class_ = 'lister-item-year').text
            years.append(year)
            # The IMDb rating
            imdb = float(container.strong.text)
            imdb_ratings.append(imdb)
            # The content rating (certificate)
            rating = container.find('span', class_ = 'certificate').text
            ratings.append(rating)
            # The Movie Runtime
            runtime = container.find('span', class_ = 'runtime').text
            runtimes.append(runtime)
            # The Movie Genres
            genre = container.find('span', class_ = 'genre').text
            genres.append(genre)
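
Before running a loop like the one above, it can help to estimate how many requests it makes and how long the pauses add up to. The following is a small sketch that only reuses the `range` and `randint` bounds from the cell; the variable names are made up for illustration.

```python
# Page offsets used by the scraping loop: 1, 51, 101, ...
offsets = list(range(1, 10252, 50))
n_pages = len(offsets)

# Each iteration sleeps between 8 and 15 seconds (randint bounds are inclusive).
min_wait_minutes = n_pages * 8 / 60
max_wait_minutes = n_pages * 15 / 60

print(n_pages, offsets[0], offsets[-1])
print(round(min_wait_minutes, 1), round(max_wait_minutes, 1))
```

The sleep time alone comes to roughly half an hour at best, which is worth knowing before re-running the cell.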

In [3]:
# Create a DataFrame from the newly acquired data.
IMDb_df = pd.DataFrame({'Movie': names,
'Year': years,
'IMDb': imdb_ratings,
'Rating': ratings,
'Runtime': runtimes,
'Genre': genres
})

In [4]:
# Clean the Year, Runtime, and Genre data.
IMDb_df['Year'] = IMDb_df['Year'].str[-5:-1].astype(int)
IMDb_df['Runtime'] = IMDb_df['Runtime'].str[:-4].astype(int)
IMDb_df['Genre'] = IMDb_df['Genre'].map(lambda x: x.strip())
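
The slices above depend on the exact shape of the scraped strings. As a rough sketch, assuming the raw values look like the made-up samples below (they are illustrative, not copied from the scrape), the slice-based cleaning can be compared with a pattern-based alternative using `str.extract`:

```python
import pandas as pd

# Illustrative raw strings in the shape the slices above expect.
raw = pd.DataFrame({
    'Year': ['(2015)', '(I) (2016)'],
    'Runtime': ['138 min', '87 min'],
})

# Slice-based cleaning, as in the cell above.
year_slice = raw['Year'].str[-5:-1].astype(int)
runtime_slice = raw['Runtime'].str[:-4].astype(int)

# Pattern-based alternative: take the four-digit year and the leading digits.
year_regex = raw['Year'].str.extract(r'(\d{4})', expand=False).astype(int)
runtime_regex = raw['Runtime'].str.extract(r'(\d+)', expand=False).astype(int)

print(year_slice.tolist(), runtime_slice.tolist())
```

Both approaches agree on these samples; the pattern-based version is simply less sensitive to extra characters around the number.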

In [7]:
print(IMDb_df.info())
IMDb_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10135 entries, 0 to 10134
Data columns (total 6 columns):
Movie      10135 non-null object
Year       10135 non-null int64
IMDb       10135 non-null float64
Rating     10135 non-null object
Runtime    10135 non-null int64
Genre      10135 non-null object
dtypes: float64(1), int64(2), object(3)
memory usage: 475.2+ KB
None


Unnamed: 0,Movie,Year,IMDb,Rating,Runtime,Genre
0,Star Wars: Episode VII - The Force Awakens,2015,7.9,PG-13,138,"Action, Adventure, Sci-Fi"
1,Avengers: Endgame,2019,8.4,PG-13,181,"Action, Adventure, Drama"
2,Avatar,2009,7.8,PG-13,162,"Action, Adventure, Fantasy"
3,Black Panther,2018,7.3,PG-13,134,"Action, Adventure, Sci-Fi"
4,Avengers: Infinity War,2018,8.4,PG-13,149,"Action, Adventure, Sci-Fi"
...,...,...,...,...,...,...
10130,The Secret Life of Pets,2016,6.5,PG,87,"Animation, Adventure, Comedy"
10131,Despicable Me 2,2013,7.3,PG,98,"Animation, Adventure, Comedy"
10132,The Jungle Book,2016,7.4,PG,106,"Adventure, Drama, Family"
10133,Deadpool,2016,8.0,R,108,"Action, Adventure, Comedy"


Using the same range of movies from above, let's create a table containing all of the Movie Titles, Release Dates, Actors, and Directors. As before, we will start by appending each data point to its own list; the lists can then be combined into a table.

In [8]:
# Create a list for each of the data points being scraped.
a_names = []
a_release = []
actors = []
directors = []
#Develop a for-loop that will iterate through each of the pages.
for i in range(1, 10252, 50):
    url = 'https://www.imdb.com/search/title/?title_type=feature&num_votes=5000,&languages=en&sort=boxoffice_gross_us,desc&start={}&explore=genres&ref_=adv_nxt'.format(i)
    response = requests.get(url)
    sleep(randint(8,15))
    html_soup = BeautifulSoup(response.text, 'lxml')
    movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
    # Extract data from each individual movie container.
    for container in movie_containers:
        # Movie Title
        a_name = container.h3.a.text
        a_names.append(a_name)
        # Release Date
        year = container.h3.find('span', class_ = 'lister-item-year').text
        a_release.append(year)
        # Actors and Directors: the last four links are assumed to be the listed
        # stars; every link before them is treated as a director (reverse order).
        imdb_names_cont = container.find('p', class_ = '')
        b = imdb_names_cont.find_all('a')
        actors.append(b[-4:])
        directors.append(b[-5::-1])
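
The two slices at the end of the loop are easy to misread, so here is a small sketch of what they select. The lists stand in for the link texts of one movie and are hypothetical; the real `b` holds BeautifulSoup tags rather than strings.

```python
# Hypothetical link texts for one movie: directors first, then four stars.
links = ['Anthony Russo', 'Joe Russo',
         'Robert Downey Jr.', 'Chris Evans', 'Mark Ruffalo', 'Chris Hemsworth']

stars = links[-4:]                  # last four entries
directors_reversed = links[-5::-1]  # everything before them, walking backwards
directors_forward = links[:-4]      # same entries in page order

# With only three stars listed, the fixed -4 offset pulls a director into the cast.
short = ['Some Director', 'Star A', 'Star B', 'Star C']
short_stars = short[-4:]
short_directors = short[-5::-1]

print(stars)
print(directors_reversed, directors_forward)
print(short_stars, short_directors)
```

The second case shows the assumption baked into the slices: a movie that lists fewer than four stars would end up with a director counted as an actor and an empty director list.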

In [67]:
# Create a DataFrame containing: Movie Title, Release Date, and Actors
actors_df = pd.DataFrame({
'Movie': a_names,
'Year': a_release,
'Actors': actors})

In [68]:
# Separate each Actor from an element in a list to its own row.
actors_df1 = actors_df['Actors'].apply(pd.Series)
actors_df2 = pd.merge(actors_df, actors_df1, right_index = True, left_index = True)
actors_df2 = actors_df2.drop(['Actors'], axis = 1)
final_actors_df = actors_df2.melt(id_vars = ['Movie', 'Year'], var_name = 'Actors')
final_actors_df = final_actors_df.drop('Actors', axis=1)
final_actors_df = final_actors_df.drop_duplicates()
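
The same reshaping, one row per name, can also be written with `DataFrame.explode`. This is a sketch on a toy frame with plain strings, assuming pandas 0.25 or newer; it is an alternative to the steps above, not the method used in this notebook.

```python
import pandas as pd

# Toy frame with one list of names per movie (plain strings for illustration).
toy = pd.DataFrame({
    'Movie': ['Avatar', 'Black Panther'],
    'Year': [2009, 2018],
    'Actors': [['Sam Worthington', 'Zoe Saldana'], ['Chadwick Boseman']],
})

# explode() emits one row per list element and repeats the other columns.
long_form = (toy.explode('Actors')
                .rename(columns={'Actors': 'value'})
                .drop_duplicates()
                .reset_index(drop=True))

print(long_form)
```

Because `explode` never creates placeholder columns, movies with shorter lists do not produce empty rows that need to be filtered out afterwards.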

In [69]:
# Clean the Year value.
final_actors_df['Year'] = final_actors_df['Year'].str[-5:-1].astype(int)

In [72]:
# Clean the new 'value' field to remove tags.
final_actors_df['value'] = final_actors_df['value'].apply(lambda x: re.sub('<[^<]+?>', '', str(x)))
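
To see what the substitution does to a single value, here is a minimal sketch using the same pattern. The anchor markup is hypothetical (the `href` is made up); it only illustrates the shape of a stringified link tag.

```python
import re

TAG_PATTERN = '<[^<]+?>'  # same pattern as the cell above

# Hypothetical anchor markup; the href is made up for illustration.
raw_link = '<a href="/name/nm0000000/">Daisy Ridley</a>'
clean_name = re.sub(TAG_PATTERN, '', str(raw_link))

# A missing value becomes the literal string 'nan' once passed through str().
missing = re.sub(TAG_PATTERN, '', str(float('nan')))

print(clean_name, missing)
```

Note the second case: because every value is passed through `str()`, missing entries survive as the text `'nan'` rather than as real missing values.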

In [74]:
print(final_actors_df.info())
final_actors_df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39996 entries, 0 to 40899
Data columns (total 3 columns):
Movie    39996 non-null object
Year     39996 non-null int64
value    39996 non-null object
dtypes: int64(1), object(2)
memory usage: 1.2+ MB
None


Unnamed: 0,Movie,Year,value
0,Star Wars: Episode VII - The Force Awakens,2015,Daisy Ridley
1,Avengers: Endgame,2019,Robert Downey Jr.
2,Avatar,2009,Sam Worthington
3,Black Panther,2018,Chadwick Boseman
4,Avengers: Infinity War,2018,Robert Downey Jr.
...,...,...,...
40895,The Narrow Margin,1952,Queenie Leonard
40896,The Naked Spur,1953,Ralph Meeker
40897,Monkey Business,1952,Charles Coburn
40898,The Lavender Hill Mob,1951,Alfie Bass


In [80]:
# Create a DataFrame containing: Movie Title, Release Date, and Directors.
directors_df = pd.DataFrame({
'Movie': a_names,
'Year': a_release,
'Directors': directors})

In [81]:
# Separate each Director from an element in a list to its own row.
directors_df1 = directors_df['Directors'].apply(pd.Series)
directors_df2 = pd.merge(directors_df, directors_df1, right_index = True, left_index = True)
directors_df2 = directors_df2.drop(['Directors'], axis = 1)
final_directors_df = directors_df2.melt(id_vars = ['Movie', 'Year'], var_name = 'Directors')
final_directors_df = final_directors_df.drop('Directors', axis=1)
final_directors_df = final_directors_df.drop_duplicates()

In [82]:
# Clean the Year value.
final_directors_df['Year'] = final_directors_df['Year'].str[-5:-1].astype(int)

In [84]:
# Clean the new 'value' field to remove tags.
final_directors_df['value'] = final_directors_df['value'].apply(lambda x: re.sub('<[^<]+?>', '', str(x)))

In [85]:
final_directors_df = final_directors_df.drop_duplicates()
final_directors_df

Unnamed: 0,Movie,Year,value
0,Star Wars: Episode VII - The Force Awakens,2015,J.J. Abrams
1,Avengers: Endgame,2019,Joe Russo
2,Avatar,2009,James Cameron
3,Black Panther,2018,Ryan Coogler
4,Avengers: Infinity War,2018,Joe Russo
...,...,...,...
285562,The ABCs of Death,2012,
285694,ABCs of Death 2,2014,Alejandro Brugués
295994,ABCs of Death 2,2014,Robert Boocheck
306294,ABCs of Death 2,2014,Julian Barratt


In [21]:
# Using The Numbers webpage, scrape all Movie Titles, their Release Date, Production Budget, Domestic Gross, and Worldwide Gross.
final_budget_container = []
#Develop a for-loop that will iterate through each of the pages.
for x in range(1, 6002, 100):
    url2 = 'https://www.the-numbers.com/movie/budgets/all/{}'.format(x)
    response2 = requests.get(url2)
    sleep(randint(8,15))
    soup = BeautifulSoup(response2.text, 'html.parser')
    budget_containers = soup.find_all('td')
    # Collect every table cell; the cells are split into columns in the next step.
    for containers in budget_containers:
        final_budget_container.append(containers)

In [22]:
# Break down the data points listed above: each table row contributes six cells,
# so taking every sixth cell from a given offset selects one column.
date_release = final_budget_container[1::6]
movie_title = final_budget_container[2::6]
production_budget = final_budget_container[3::6]
domestic_gross = final_budget_container[4::6]
worldwide_gross = final_budget_container[5::6]
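
The column split above relies on every table row contributing exactly six cells, with the first cell presumably a rank column. A quick check on stand-in data (the `cells` list is made up) shows that a stride of six picks out one column, and that the chained form `[k::3][::2]` selects the same items:

```python
# Stand-in for the flat list of <td> cells: six cells per table row.
cells = [f'r{row}c{col}' for row in range(4) for col in range(6)]

for k in range(1, 6):
    two_step = cells[k::3][::2]  # every third cell, then every second of those
    one_step = cells[k::6]       # direct stride of six
    assert two_step == one_step

dates = cells[1::6]
print(dates)
```

If the site ever adds or removes a column, the stride would have to change with it, so it is worth spot-checking the first few values of each list.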

In [23]:
# Create lists for each data point.
final_date = []
final_movie = []
final_budget = []
final_domestic = []
final_worldwide = []
# Define a function to strip the tags from each value.
def budget_clean(table, new):
    for p in table:
        # bleach.clean expects text, so convert the BeautifulSoup tag to a string first.
        clean = bleach.clean(str(p), tags=[], strip=True)
        new.append(clean)

In [24]:
# Call our new function as needed.
budget_clean(date_release, final_date)
budget_clean(movie_title, final_movie)
budget_clean(production_budget, final_budget)
budget_clean(domestic_gross, final_domestic)
budget_clean(worldwide_gross, final_worldwide)

In [25]:
# Create a DataFrame containing: Movie Title, Release Date, Production Budget, Domestic Gross, and Worldwide Gross.
budget_df = pd.DataFrame({'Release Date': final_date,
'Movie': final_movie,
'Production Budget': final_budget,
'Domestic Gross': final_domestic,
'Worldwide Gross': final_worldwide,
})
print(budget_df.info())
budget_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6043 entries, 0 to 6042
Data columns (total 5 columns):
Release Date         6043 non-null object
Movie                6043 non-null object
Production Budget    6043 non-null object
Domestic Gross       6043 non-null object
Worldwide Gross      6043 non-null object
dtypes: object(5)
memory usage: 236.2+ KB
None


Unnamed: 0,Release Date,Movie,Production Budget,Domestic Gross,Worldwide Gross
0,"Apr 23, 2019",Avengers: Endgame,"$400,000,000","$858,373,000","$2,797,800,564"
1,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$379,000,000","$241,063,875","$1,045,663,875"
2,"Apr 22, 2015",Avengers: Age of Ultron,"$365,000,000","$459,005,868","$1,396,099,202"
3,"Dec 16, 2015",Star Wars Ep. VII: The Force Awakens,"$306,000,000","$936,662,225","$2,065,478,084"
4,"Apr 25, 2018",Avengers: Infinity War,"$300,000,000","$678,815,482","$2,048,359,754"
...,...,...,...,...,...
6038,Unknown,Red 11,"$7,000",$0,$0
6039,"Apr 2, 1999",Following,"$6,000","$48,482","$240,495"
6040,"Jul 13, 2005",Return to the Land of Wonders,"$5,000","$1,338","$1,338"
6041,"Sep 29, 2015",A Plague So Pleasant,"$1,400",$0,$0


In [26]:
# Convert each $ field into an integer type.
# regex=False makes sure ',' and '$' are treated as literal characters.
budget_df['Production Budget'] = budget_df['Production Budget'].str.replace(',','', regex=False).str.replace('$','', regex=False).astype(int)
budget_df['Domestic Gross'] = budget_df['Domestic Gross'].str.replace(',','', regex=False).str.replace('$','', regex=False).astype(int)
budget_df['Worldwide Gross'] = budget_df['Worldwide Gross'].str.replace(',','', regex=False).str.replace('$','', regex=False).astype(int)
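
As a small standalone sketch of the same conversion, the sample values below are taken from the table printed earlier. Passing `regex=False` keeps `$` from being read as a regular-expression anchor, and `'int64'` is spelled out here because the largest grosses do not fit in a 32-bit integer, which may matter on platforms where the default integer is 32 bits wide.

```python
import pandas as pd

money = pd.Series(['$2,797,800,564', '$7,000', '$0'])

# regex=False treats '$' and ',' as plain characters.
as_int = (money.str.replace(',', '', regex=False)
               .str.replace('$', '', regex=False)
               .astype('int64'))

print(as_int.tolist())
```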

In [27]:
# Add a Year column to the budget_df table.
budget_df['Year'] = budget_df['Release Date'].str[-4:]
# Rows whose Release Date is 'Unknown' end in 'nown'; drop them before converting.
budget_df = budget_df[budget_df['Year'] != 'nown'].copy()
budget_df['Year'] = budget_df['Year'].astype(int)
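
An alternative to slicing the last four characters is to parse the dates and let unparseable entries fall out. This is only a sketch, assuming the dates follow the 'Apr 23, 2019' pattern shown above and that English month abbreviations are recognised by the local environment:

```python
import pandas as pd

dates = pd.Series(['Apr 23, 2019', 'Unknown', 'Apr 2, 1999'])

# Unparseable entries such as 'Unknown' become NaT instead of raising.
parsed = pd.to_datetime(dates, format='%b %d, %Y', errors='coerce')

# Drop the missing years, then convert the remaining ones to integers.
years = parsed.dt.year.dropna().astype(int)

print(years.tolist())
```

This avoids depending on the literal text 'nown' and would also catch any other malformed date strings.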

In [32]:
# Merge IMDb_df and budget_df on their shared columns (inner join on Movie and Year).
IMDb_budget = pd.merge(IMDb_df, budget_df, on=['Movie', 'Year'])
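
Because the merge is an inner join on exact title and year, any movie whose title is spelled differently on the two sites is silently dropped. A minimal sketch of how to audit this, using two titles that appear in the tables above and the `indicator` option of `pd.merge` (the frame names here are illustrative):

```python
import pandas as pd

imdb = pd.DataFrame({
    'Movie': ['Avengers: Endgame', 'Star Wars: Episode VII - The Force Awakens'],
    'Year': [2019, 2015],
})
budget = pd.DataFrame({
    'Movie': ['Avengers: Endgame', 'Star Wars Ep. VII: The Force Awakens'],
    'Year': [2019, 2015],
    'Production Budget': [400000000, 306000000],
})

# An outer join with indicator=True labels each row by where it was found.
audit = pd.merge(imdb, budget, on=['Movie', 'Year'], how='outer', indicator=True)
match_counts = audit['_merge'].value_counts()
unmatched = audit.loc[audit['_merge'] != 'both', ['Movie', '_merge']]

print(match_counts)
print(unmatched)
```

Here the two spellings of the Star Wars title fail to match, so that movie would be missing from the merged table even though both sites list it.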

In [34]:
IMDb_budget.nunique()

Movie                3770
Year                   85
IMDb                   74
Rating                 13
Runtime               138
Genre                 310
Release Date         2241
Production Budget     379
Domestic Gross       3712
Worldwide Gross      3765
dtype: int64

In [35]:
# Remove Duplicates
IMDb_budget = IMDb_budget.drop_duplicates()

In [36]:
print(IMDb_budget.info())
IMDb_budget

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3828 entries, 0 to 4061
Data columns (total 10 columns):
Movie                3828 non-null object
Year                 3828 non-null int64
IMDb                 3828 non-null float64
Rating               3828 non-null object
Runtime              3828 non-null int64
Genre                3828 non-null object
Release Date         3828 non-null object
Production Budget    3828 non-null int64
Domestic Gross       3828 non-null int64
Worldwide Gross      3828 non-null int64
dtypes: float64(1), int64(5), object(4)
memory usage: 329.0+ KB
None


Unnamed: 0,Movie,Year,IMDb,Rating,Runtime,Genre,Release Date,Production Budget,Domestic Gross,Worldwide Gross
0,Avengers: Endgame,2019,8.4,PG-13,181,"Action, Adventure, Drama","Apr 23, 2019",400000000,858373000,2797800564
7,Avatar,2009,7.8,PG-13,162,"Action, Adventure, Fantasy","Dec 17, 2009",237000000,760507625,2788701337
14,Black Panther,2018,7.3,PG-13,134,"Action, Adventure, Sci-Fi","Feb 13, 2018",200000000,700059566,1346103376
21,Avengers: Infinity War,2018,8.4,PG-13,149,"Action, Adventure, Sci-Fi","Apr 25, 2018",300000000,678815482,2048359754
28,Titanic,1997,7.8,PG-13,194,"Drama, Romance","Dec 18, 1997",200000000,659363944,2208208395
...,...,...,...,...,...,...,...,...,...,...
4057,The Brain That Wouldn't Die,1962,4.5,Approved,82,"Horror, Sci-Fi","Aug 10, 1962",60000,0,0
4058,The Wrong Man,1956,7.4,Not Rated,105,"Drama, Film-Noir","Dec 23, 1956",1200000,2000000,2000000
4059,Carousel,1956,6.6,Approved,128,"Drama, Fantasy, Musical","Feb 16, 1956",3380000,0,3604
4060,The Trouble with Harry,1955,7.1,PG,99,"Comedy, Mystery","Oct 3, 1955",1200000,7000000,7000000


In [86]:
# Merge final_actors_df and budget_df on their shared columns (Movie and Year).
actors_finance = pd.merge(final_actors_df, budget_df, on=['Movie', 'Year'])

In [87]:
# Remove duplicates.
actors_finance = actors_finance.drop_duplicates()

In [88]:
print(actors_finance.info())
actors_finance

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15320 entries, 0 to 15319
Data columns (total 7 columns):
Movie                15320 non-null object
Year                 15320 non-null int64
value                15320 non-null object
Release Date         15320 non-null object
Production Budget    15320 non-null int64
Domestic Gross       15320 non-null int64
Worldwide Gross      15320 non-null int64
dtypes: int64(4), object(3)
memory usage: 957.5+ KB
None


Unnamed: 0,Movie,Year,value,Release Date,Production Budget,Domestic Gross,Worldwide Gross
0,Avengers: Endgame,2019,Robert Downey Jr.,"Apr 23, 2019",400000000,858373000,2797800564
1,Avengers: Endgame,2019,Chris Evans,"Apr 23, 2019",400000000,858373000,2797800564
2,Avengers: Endgame,2019,Mark Ruffalo,"Apr 23, 2019",400000000,858373000,2797800564
3,Avengers: Endgame,2019,Chris Hemsworth,"Apr 23, 2019",400000000,858373000,2797800564
4,Avatar,2009,Sam Worthington,"Dec 17, 2009",237000000,760507625,2788701337
...,...,...,...,...,...,...,...
15315,The Trouble with Harry,1955,Mildred Natwick,"Oct 3, 1955",1200000,7000000,7000000
15316,Niagara,1953,Marilyn Monroe,"Jan 21, 1953",1250000,2500000,2500000
15317,Niagara,1953,Joseph Cotten,"Jan 21, 1953",1250000,2500000,2500000
15318,Niagara,1953,Jean Peters,"Jan 21, 1953",1250000,2500000,2500000


In [96]:
# Merge final_directors_df and budget_df on their shared columns (Movie and Year).
directors_finance = pd.merge(final_directors_df, budget_df, on=['Movie', 'Year'])

In [97]:
# Remove duplicates
directors_finance = directors_finance.drop_duplicates()

In [101]:
# Remove the literal 'nan' strings (missing values stringified during tag cleaning) from the 'value' column.
directors_finance = directors_finance[directors_finance['value'] != 'nan']

In [102]:
print(directors_finance.info())
directors_finance

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4181 entries, 0 to 8009
Data columns (total 7 columns):
Movie                4181 non-null object
Year                 4181 non-null int64
value                4181 non-null object
Release Date         4181 non-null object
Production Budget    4181 non-null int64
Domestic Gross       4181 non-null int64
Worldwide Gross      4181 non-null int64
dtypes: int64(4), object(3)
memory usage: 261.3+ KB
None


Unnamed: 0,Movie,Year,value,Release Date,Production Budget,Domestic Gross,Worldwide Gross
0,Avengers: Endgame,2019,Joe Russo,"Apr 23, 2019",400000000,858373000,2797800564
1,Avengers: Endgame,2019,Anthony Russo,"Apr 23, 2019",400000000,858373000,2797800564
3,Avatar,2009,James Cameron,"Dec 17, 2009",237000000,760507625,2788701337
5,Black Panther,2018,Ryan Coogler,"Feb 13, 2018",200000000,700059566,1346103376
7,Avengers: Infinity War,2018,Joe Russo,"Apr 25, 2018",300000000,678815482,2048359754
...,...,...,...,...,...,...,...
8001,The Brain That Wouldn't Die,1962,Joseph Green,"Aug 10, 1962",60000,0,0
8003,The Wrong Man,1956,Alfred Hitchcock,"Dec 23, 1956",1200000,2000000,2000000
8005,Carousel,1956,Henry King,"Feb 16, 1956",3380000,0,3604
8007,The Trouble with Harry,1955,Alfred Hitchcock,"Oct 3, 1955",1200000,7000000,7000000


Finally, we will take all of the new tables we have scraped and export them to CSV files.

In [103]:
IMDb_budget.to_csv('IMDb_budgets.csv',  index=False)

In [104]:
IMDb_df.to_csv('IMDb_base.csv', index=False)

In [105]:
actors_finance.to_csv('Actors_Table.csv', index=False)

In [106]:
directors_finance.to_csv('Directors_Table.csv', index=False)

Congratulations! We have successfully scraped and cleaned all of the data we needed from our two websites. Now let's move over to our analysis notebook where we get to put all of this data to use in making our recommendations.