# Data Extraction Section 

The project will aim to compare and contrast a wide variety of popular films to hopefully ascertain if there is any relationship between genre, budget, language, etc and box office return

In [2]:
from bs4 import BeautifulSoup
from urllib import request
import json, requests, urllib
from pathlib import Path
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None )

%matplotlib inline



dir_raw = Path("raw_film_data")
dir_raw.mkdir(parents=True, exist_ok=True)


The first step in pulling the data is choosing a good set of films to study. The Internet Movie Database (IMDb.com) has a definitive list of the 250 highest rated films on the site. Although this list based on user ratings it is a good blend of commercially and critically sucessful films both in the USA and worldwide. Therefore I will scrape the titles of the IMDB Top 250 from the to use in my requests from the API. I am only taking the names of the films as strings from the page, no other data is being extracted at this point. 

$\textit{Note:}$ The only point I will make about the extracted data at this point is that IMDb is mostly skewed toward American and English speaking films which is reflected in the percent of non-english films.This is encountered and discussed more in section 2 of the assignment. There are films from Europe and Asia present in the list...just a lot more American films.

In [3]:
# Download IMDB's Top 250 data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

movies = soup.select('td.titleColumn')


imdb = []


for index in range(0, len(movies)):    
    movie_string = movies[index].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    movie_title = movie[len(str(index))+1:-7]
    data =  movie_title

    imdb.append(data)
    
    
print(len(imdb))
print(imdb)


250
['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II', 'The Dark Knight', '12 Angry Men', "Schindler's List", 'The Lord of the Rings: The Return of the King', 'Pulp Fiction', 'Il buono, il brutto, il cattivo', ' The Lord of the Rings: The Fellowship of the Ring', 'Fight Club', 'Forrest Gump', 'Inception', 'The Lord of the Rings: The Two Towers', 'Star Wars: Episode V - The Empire Strikes Back', 'The Matrix', 'Goodfellas', "One Flew Over the Cuckoo's Nest", 'Shichinin no samurai', 'Se7en', 'The Silence of the Lambs', 'Cidade de Deus', 'La vita è bella', "It's a Wonderful Life", 'Star Wars', 'Saving Private Ryan', 'Interstellar', 'Sen to Chihiro no kamikakushi', 'The Green Mile', 'Gisaengchung', 'Léon', 'Seppuku', 'The Pianist', 'Terminator 2: Judgment Day', 'The Usual Suspects', 'Back to the Future', 'Psycho', 'The Lion King', 'Modern Times', 'American History X', 'Hotaru no haka', 'City Lights', 'Whiplash', 'Gladiator', 'The Departed', 'The Intouchables', 'The Pres

This list is fine to see the list of movies we will be examining but to use then in a search query we need to format them properly. This includes removing spaces and replacing them with + symbols. In general The Movie Database seems fairly able to handle variations for example searching for "Empire Strikes Back", "Star Wars Episode 5" "Star Wars Empire" all return a result containing "Star Wars Episode V: The Empire Strikes Back"  

In [4]:
imdb_list=list(imdb)

top_250=[]
for i in range(len(imdb_list)):
    x = imdb_list[i].replace(' ','+')
    top_250.append(x)

    
print(top_250)

['The+Shawshank+Redemption', 'The+Godfather', 'The+Godfather:+Part+II', 'The+Dark+Knight', '12+Angry+Men', "Schindler's+List", 'The+Lord+of+the+Rings:+The+Return+of+the+King', 'Pulp+Fiction', 'Il+buono,+il+brutto,+il+cattivo', '+The+Lord+of+the+Rings:+The+Fellowship+of+the+Ring', 'Fight+Club', 'Forrest+Gump', 'Inception', 'The+Lord+of+the+Rings:+The+Two+Towers', 'Star+Wars:+Episode+V+-+The+Empire+Strikes+Back', 'The+Matrix', 'Goodfellas', "One+Flew+Over+the+Cuckoo's+Nest", 'Shichinin+no+samurai', 'Se7en', 'The+Silence+of+the+Lambs', 'Cidade+de+Deus', 'La+vita+è+bella', "It's+a+Wonderful+Life", 'Star+Wars', 'Saving+Private+Ryan', 'Interstellar', 'Sen+to+Chihiro+no+kamikakushi', 'The+Green+Mile', 'Gisaengchung', 'Léon', 'Seppuku', 'The+Pianist', 'Terminator+2:+Judgment+Day', 'The+Usual+Suspects', 'Back+to+the+Future', 'Psycho', 'The+Lion+King', 'Modern+Times', 'American+History+X', 'Hotaru+no+haka', 'City+Lights', 'Whiplash', 'Gladiator', 'The+Departed', 'The+Intouchables', 'The+Prestige

Now that we have a list of films we can enquire as to whether TMDB has data for. The way TMDB works is that each film has a specific ID but this ID is unknown until we enquire as to whether the film is included in the database. We make an inquiry as to whether the film is in the database. If so, it returns basic data about the film including the ID. This ID can then be re-entered as a request and returns complete info about the film. 

In [5]:
# key for the movie database
api_key = "5899dec5f66ddd50a6d23338465b766c"
# Prefix for API URLs
api_prefix = "https://api.themoviedb.org/3"


In [6]:
def fetch_film_ids(endpoint, query):
    # construct the url
    url = api_prefix
    if not endpoint.startswith("/"):
        url += "/"
    url += endpoint 
    url += "?" + "api_key=" + api_key + '&query='+query
    print("Fetching %s" % url)
    # fetch the page
    response = requests.get(url)
    jdata = response.text
    return json.loads(jdata)

def fetch_film_details(endpoint): 
    # construct the url
    url = api_prefix
    if not endpoint.startswith("/"):
        url += "/"
    url += endpoint 
    print("Fetching %s" % url)
    # fetch the page
    response = requests.get(url)
    jdata = response.text
    return json.loads(jdata)

The next step is to check whether the films are included in the API. This for loop cycles through the formatted names and fetches a search query. To determine whether a film the the actual one we are looking for we make a number of assumptions: 
-The film is well rated so will probably have a user score > 7.0 
-The film is well known so will probably have more than 100 votes. 

This method will hopefully filter out 

In [108]:
input_films = []
collected_films=[]
film_metadata = {}
film_keys = {}
for i in range(len(top_250)):
    film_data = fetch_film_ids("/search/movie", top_250[i])
    if film_data["results"]==[]:
        collected_films.append('N/A')
    else:
        for j in range(len(film_data['results'])):
            if film_data["results"][j]['vote_count']>=100 and film_data["results"][j]['vote_average']>=7.5:
                print("Found match for %s: ID=%s Film=%s" % 
                      (imdb[i], film_data["results"][j]['id'], film_data["results"][j]['title']))
                collected_films.append(film_data["results"][j]['title'])
                input_films.append(imdb[i])
                film_metadata[film_data["results"][j]['title']] = film_data["results"][j]
                film_keys[film_data["results"][j]['title']] = film_data["results"][j]['id']
                break
            
            
            else:
                print("No match found for %s" %imdb[i])
                

            
print("Found keys for %d films" % len(film_keys))
        
  

Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=The+Shawshank+Redemption
Found match for The Shawshank Redemption: ID=278 Film=The Shawshank Redemption
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=The+Godfather
Found match for The Godfather: ID=238 Film=The Godfather
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=The+Godfather:+Part+II
Found match for The Godfather: Part II: ID=240 Film=The Godfather: Part II
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=The+Dark+Knight
Found match for The Dark Knight: ID=155 Film=The Dark Knight
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=12+Angry+Men
Found match for 12 Angry Men: ID=389 Film=12 Angry Men
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&

Found match for The Departed: ID=1422 Film=The Departed
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=The+Intouchables
Found match for The Intouchables: ID=77338 Film=The Intouchables
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=The+Prestige
Found match for The Prestige: ID=1124 Film=The Prestige
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Casablanca
Found match for Casablanca: ID=289 Film=Casablanca
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Once+Upon+a+Time+in+the+West
No match found for Once Upon a Time in the West
Found match for Once Upon a Time in the West: ID=335 Film=Once Upon a Time in the West
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Rear+Window
Found match for Rear Window: ID=567 Film=Rear Window
Fetching htt

Found match for M - Eine Stadt sucht einen Mörder: ID=832 Film=M
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Vertigo
Found match for Vertigo: ID=426 Film=Vertigo
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Taare+Zameen+Par
Found match for Taare Zameen Par: ID=7508 Film=Like Stars on Earth
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Citizen+Kane
Found match for Citizen Kane: ID=15 Film=Citizen Kane
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Idi+i+smotri
Found match for Idi i smotri: ID=25237 Film=Come and See
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Jagten
Found match for Jagten: ID=103663 Film=The Hunt
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Requiem+for+a+Drea

Found match for All About Eve: ID=705 Film=All About Eve
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Some+Like+It+Hot
Found match for Some Like It Hot: ID=239 Film=Some Like It Hot
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Batman+Begins
Found match for Batman Begins: ID=272 Film=Batman Begins
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Unforgiven
Found match for Unforgiven: ID=33 Film=Unforgiven
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Bacheha-Ye+aseman
Found match for Bacheha-Ye aseman: ID=21334 Film=Children of Heaven
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Dune:+Part+One
Found match for Dune: Part One: ID=438631 Film=Dune
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b7

Found match for Tôkyô monogatari: ID=18148 Film=Tokyo Story
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=On+the+Waterfront
Found match for On the Waterfront: ID=654 Film=On the Waterfront
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Babam+ve+Oglum
Found match for Babam ve Oglum: ID=13393 Film=My Father and My Son
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
No match found for Z
Found match for Z: ID=2721 Film=Z
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Relatos+salvajes
Found match for Relatos salvajes: ID=265195 Film=Wild Tal

No match found for Gangs of Wasseypur
No match found for Gangs of Wasseypur
No match found for Gangs of Wasseypur
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=La+haine
Found match for La haine: ID=406 Film=La Haine
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Platoon
Found match for Platoon: ID=792 Film=Platoon
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Spotlight
Found match for Spotlight: ID=314365 Film=Spotlight
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Koe+no+katachi
Found match for Koe no katachi: ID=378064 Film=A Silent Voice: The Movie
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d23338465b766c&query=Rebecca
Found match for Rebecca: ID=223 Film=Rebecca
Fetching https://api.themoviedb.org/3/search/movie?api_key=5899dec5f66ddd50a6d233384

In [109]:
metadata_rows = []
for film in collected_films:
    
    row = {"Film": film, "key": film_keys[film]}
    row["Title"] = film_metadata[film]["title"]
    row["id"] = film_metadata[film]["id"]
    row["release date"] =  film_metadata[film]['release_date']
    row["vote_count"] =  film_metadata[film]['vote_count']
    row["vote_average"] =  film_metadata[film]['vote_average']
   
    metadata_rows.append(row)
x = pd.DataFrame(metadata_rows).set_index("Film")

In [106]:

x.sort_values(by=['vote_average'])

Unnamed: 0_level_0,key,Title,id,release date,vote_count,vote_average
Film,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Mad Max: Fury Road,76341,Mad Max: Fury Road,76341,2015-05-13,18580,7.5
Casino Royale,36557,Casino Royale,36557,2006-11-14,8404,7.5
Blade Runner 2049,335984,Blade Runner 2049,335984,2017-10-04,10087,7.5
The Bandit,26900,The Bandit,26900,1996-11-29,216,7.5
The Princess Bride,2493,The Princess Bride,2493,1987-09-25,3529,7.6
Amores Perros,55,Amores Perros,55,2000-06-16,1845,7.6
My Father and My Son,13393,My Father and My Son,13393,2005-11-18,199,7.6
Die Hard,562,Die Hard,562,1988-07-15,8642,7.7
Batman Begins,272,Batman Begins,272,2005-06-10,16784,7.7
Pan's Labyrinth,1417,Pan's Labyrinth,1417,2006-08-25,8516,7.7


In [110]:
for i in range(len(input_films)):
    print(input_films[i]+' = '+ collected_films[i])

The Shawshank Redemption = The Shawshank Redemption
The Godfather = The Godfather
The Godfather: Part II = The Godfather: Part II
The Dark Knight = The Dark Knight
12 Angry Men = 12 Angry Men
Schindler's List = Schindler's List
The Lord of the Rings: The Return of the King = The Lord of the Rings: The Return of the King
Pulp Fiction = Pulp Fiction
Il buono, il brutto, il cattivo = The Good, the Bad and the Ugly
 The Lord of the Rings: The Fellowship of the Ring = The Lord of the Rings: The Fellowship of the Ring
Fight Club = Fight Club
Forrest Gump = Forrest Gump
Inception = Inception
The Lord of the Rings: The Two Towers = The Lord of the Rings: The Two Towers
Star Wars: Episode V - The Empire Strikes Back = The Empire Strikes Back
The Matrix = The Matrix
Goodfellas = GoodFellas
One Flew Over the Cuckoo's Nest = One Flew Over the Cuckoo's Nest
Shichinin no samurai = Seven Samurai
Se7en = Se7en
The Silence of the Lambs = The Silence of the Lambs
Cidade de Deus = City of God
La vita è b

In [111]:
def fetch_full_film_details(film_name):
    # create the endpoint URL    
    endpoint = "/movie/%s?api_key=5899dec5f66ddd50a6d23338465b766c" % film_keys[film_name]
    film_formatted = (film_name.replace(' ','_')).replace(':','')
    # fetch the current data
    film_data = fetch_film_details(endpoint)
    # write it out to our raw dataset directory
    fname = "%s.json" % film_formatted
    out_path = dir_raw / fname
    print("Writing data to %s" % out_path)
    with open(out_path, "w") as fout:
        json.dump(film_data, fout, indent=4, sort_keys=True)


Finally, with the film IDs known we can request the actual film data and save it to a JSON file. 

In [112]:
for film in collected_films:
    fetch_full_film_details(film)

Fetching https://api.themoviedb.org/3/movie/278?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Shawshank_Redemption.json
Fetching https://api.themoviedb.org/3/movie/238?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Godfather.json
Fetching https://api.themoviedb.org/3/movie/240?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Godfather_Part_II.json
Fetching https://api.themoviedb.org/3/movie/155?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Dark_Knight.json
Fetching https://api.themoviedb.org/3/movie/389?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\12_Angry_Men.json
Fetching https://api.themoviedb.org/3/movie/424?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Schindler's_List.json
Fetching https://api.themoviedb.org/3/movie/122?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Lord_of_the_Rings_The_Retu

Fetching https://api.themoviedb.org/3/movie/10681?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\WALL·E.json
Fetching https://api.themoviedb.org/3/movie/299536?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Avengers_Infinity_War.json
Fetching https://api.themoviedb.org/3/movie/37257?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Witness_for_the_Prosecution.json
Fetching https://api.themoviedb.org/3/movie/694?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Shining.json
Fetching https://api.themoviedb.org/3/movie/324857?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Spider-Man_Into_the_Spider-Verse.json
Fetching https://api.themoviedb.org/3/movie/935?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Dr._Strangelove.json
Fetching https://api.themoviedb.org/3/movie/475557?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Joker.

Fetching https://api.themoviedb.org/3/movie/10193?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Toy_Story_3.json
Fetching https://api.themoviedb.org/3/movie/938?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\For_a_Few_Dollars_More.json
Fetching https://api.themoviedb.org/3/movie/14160?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Up.json
Fetching https://api.themoviedb.org/3/movie/89?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Indiana_Jones_and_the_Last_Crusade.json
Fetching https://api.themoviedb.org/3/movie/949?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Heat.json
Fetching https://api.themoviedb.org/3/movie/2118?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\L.A._Confidential.json
Fetching https://api.themoviedb.org/3/movie/11878?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Yojimbo.json
Fetching https://api.t

Writing data to raw_film_data\Tokyo_Story.json
Fetching https://api.themoviedb.org/3/movie/654?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\On_the_Waterfront.json
Fetching https://api.themoviedb.org/3/movie/13393?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\My_Father_and_My_Son.json
Fetching https://api.themoviedb.org/3/movie/2721?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Z.json
Fetching https://api.themoviedb.org/3/movie/265195?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Wild_Tales.json
Fetching https://api.themoviedb.org/3/movie/11778?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Deer_Hunter.json
Fetching https://api.themoviedb.org/3/movie/992?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Sherlock_Jr..json
Fetching https://api.themoviedb.org/3/movie/13223?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data

Writing data to raw_film_data\Before_Sunset.json
Fetching https://api.themoviedb.org/3/movie/17295?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Battle_of_Algiers.json
Fetching https://api.themoviedb.org/3/movie/18491?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Neon_Genesis_Evangelion_The_End_of_Evangelion.json
Fetching https://api.themoviedb.org/3/movie/2493?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\The_Princess_Bride.json
Fetching https://api.themoviedb.org/3/movie/19426?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Nights_of_Cabiria.json
Fetching https://api.themoviedb.org/3/movie/655?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Paris,_Texas.json
Fetching https://api.themoviedb.org/3/movie/110?api_key=5899dec5f66ddd50a6d23338465b766c
Writing data to raw_film_data\Three_Colors_Red.json
Fetching https://api.themoviedb.org/3/movie/238628?api_key=5899de