

# Loading Data

This notebook contains the code that I used to query the TMDb API to download the data for this project. To begin, we import the dependencies.

In [4]:
import config
import pandas as pd
import numpy as np
from IPython.display import display
import requests, json

As an example of an API query, we construct the dictionary below and use `requests` and `json` to query the API. The query below selects the first page of movies released in 2018 in English, sorted by descending order of popularity.

In [3]:
api_url = "https://api.themoviedb.org/3/discover/movie/?"
api_params = {
    "api_key" : config.api_key,
    "primary_release_year" : 2018,
    "sort_by" : "poularity.desc",
    "page" : 1,
    "with_original_language" : "en"
}
response = requests.get(api_url, params=api_params)
page = json.loads(response.text)

To reuse this pipeline, we define a function `get_movies` below that reconstructs the above query for a specific year and then returns a list of JSON responses for each page of the results.

In [None]:
def get_movies(year):
    """
    
    Queries TMDb API to get list of movies released in `year`
    
    Returns:
        a list of dictionaries, where each item corresponds to a page
        of results
    
    """
    api_url = "https://api.themoviedb.org/3/discover/movie/?"
    api_params = {
        "api_key" : config.api_key,
        "primary_release_year" : year,
        "sort_by" : "poularity.desc",
        "page" : 1,
        "with_original_language" : "en",
        "include_adult" : True
    }
    response = requests.get(api_url, params=api_params)
    response = json.loads(response.text)
    n = response["total_pages"]
    pages = []
    print("{} pages to load".format(n))
    for i in range(1, n+1):
        if i % 50 == 0:
            print("Loaded page {}".format(i))
        api_params["page"] = i
        response = requests.get(api_url, params=api_params)
        response = json.loads(response.text)
        pages.append(response)
        
    print("Finished downloading data")
    return pages

pages = get_movies(2018)

434 pages to load


In order to load the data into a CSV file, we first need to expand each page of results into a list of results. The function `expand_pages` below takes a list of pages and expands them all into one concatenated list.

In [18]:
def expand_pages(pages):
    """
    Expands the results into a single list of dictionaries
    """
    results = [page["results"] for page in pages]
    data = []
    for page in results:
        data.extend(page)
    return data

Finally, we create a dataframe from the list of results and save the results as `tmdb_api_data.csv` in the `data` folder.

In [37]:
data = expand_pages(pages)
data = pd.DataFrame(data)
print("DataFrame has {} rows".format(data.shape[0]))
display(data.head())
data.to_csv("../data/tmdb_api_data.csv", index=False)

DataFrame has 8473 rows


Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/bOGkgRGdhrBYJSLpXaxhXVstddV.jpg,"[12, 28, 14]",299536,en,Avengers: Infinity War,As the Avengers and their allies have continue...,153.811,/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg,2018-04-25,Avengers: Infinity War,False,8.3,12490
1,False,/5zfVNTrkhMu673zma6qhFzG01ig.jpg,[878],300668,en,Annihilation,"A biologist signs up for a dangerous, secret e...",29.516,/d3qcpfNwbAMCNqWDHzPQsUYiUgS.jpg,2018-02-22,Annihilation,False,6.3,4232
2,False,/zjG95oDnBcFKMPgBEmmuNVOMC90.jpg,"[35, 18]",299782,en,The Other Side of the Wind,"Surrounded by fans and skeptics, grizzled dire...",6.82,/kFky1paYEfHxfCYByEc9g7gn6Zk.jpg,2018-11-02,The Other Side of the Wind,False,7.1,55
3,False,/q9hnJ9SzwcF30seRtXEzLd5l1gw.jpg,"[18, 35, 14]",351044,en,Welcome to Marwen,When a devastating attack shatters Mark Hoganc...,61.973,/o45VIAUYDcVCGuzd43l8Sr5Dfti.jpg,2018-12-21,Welcome to Marwen,False,6.6,174
4,False,/AmO8I38bkHwKhgxPNrd6djBQyPU.jpg,"[53, 9648, 27, 14]",361292,en,Suspiria,A darkness swirls at the center of a world-ren...,41.461,/dzWTnkert9EoiPWldWJ15dnfAFl.jpg,2018-10-11,Suspiria,False,7.2,579


Finally, to get the mapping from genre IDs to strings, we query the API again and save the genres dataframe as `tmdb_genres.csv` in the `data` folder.

In [33]:
# get genre list
api_url = "https://api.themoviedb.org/3/genre/movie/list?"
api_params = {
    "api_key" : config.api_key,
    "language" : "en-US"
}
genres = requests.get(api_url, params=api_params)
genres = json.loads(genres.text)
genres = pd.DataFrame(genres["genres"])
display(genres.head())
genres.to_csv("../data/tmdb_genres.csv", index=False)

Unnamed: 0,id,name
0,28,Action
1,12,Adventure
2,16,Animation
3,35,Comedy
4,80,Crime


filtering

In [22]:
movies = pd.read_csv("../data/tmdb_api_data.csv", lineterminator="\n")

In [23]:
cols_of_interest = [
    "genre_ids", "overview", "original_title", "original_language", "poster_path", "release_date",
    "vote_average", "vote_count"
]

movies = movies.dropna(subset=cols_of_interest)
movies.shape

(7275, 14)

### Filtering `nan` Rows

Before we can movie on, we first need to get an idea of how many rows have values in the `vote_average` and `vote_count` columns that do not make sense. We only want rows where the rating is nonzero *or* where there is at least one vote.

In [24]:
print("Total number of movies: {}".format(movies.shape[0]))
print("Movies with nonzero rating: {}".format(movies[movies["vote_average"] != 0].shape[0]))
print("Movies with nonzero vote count: {}".format(movies[movies["vote_count"] != 0].shape[0]))

Total number of movies: 7275
Movies with nonzero rating: 2638
Movies with nonzero vote count: 2700


In the cell below, we filter the rows that either have a nonzero rating or at least 1 vote.

In [25]:
movies = movies[(movies["vote_average"] != 0) | (movies["vote_count"] != 0)]

In [26]:
movies.shape

(2700, 14)

images

In [27]:
from PIL import Image
import urllib
from concurrent.futures import ThreadPoolExecutor, wait

In [28]:
posters = {}

img_size = (140, 92)

def get_poster(row):
    if (row[0] + 1) % 1000 == 0:
        print(row[0] + 1)
    row = row[1]
    if not pd.isna(row["poster_path"]):
        im = Image.open(urllib.request.urlopen("http://image.tmdb.org/t/p/w185/" + row["poster_path"]))
        im = im.resize(img_size)
        arr = np.asarray(im)
        posters[str(row["id"])] = arr

In [29]:
poster_pool = ThreadPoolExecutor()
poster_futures = []
for r in movies.iterrows():
    if r[1]["id"] not in posters:
        poster_futures.append(poster_pool.submit(get_poster, row=r))
    
poster_results = wait(poster_futures)

1000
8000


In [32]:
len(posters.keys())

1876

In [34]:
from scipy import io

io.savemat("../data/posters.mat", posters)

In [35]:
posters["373209"].shape

(92, 140, 3)

In [36]:
movies.to_csv("../data/filtered_movies.csv", index=False)