# Data Gathering
Because we are starting this project from scratch, we need a way to gather the data that will support the machine learning model. Fortunately for us, a number of movie APIs exist that we can call to get data on each reviewed movie.

## Project Setup

In [1]:
# Importing the necessary Python libraries
import os
import yaml
import numpy as np
import pandas as pd
import tmdbv3api
from imdb import IMDb

In [2]:
# Loading Caelan's ratings from Google Sheets-sourced CSV
df_ratings = pd.read_csv('../data/raw/caelan-reviews.csv')

In [3]:
# Viewing the first few rows of df_ratings
df_ratings.head()

Unnamed: 0,Name,Rating,Flickable
0,Zoolander 2,7.0,Yes
1,Dope,8.5,Yes
2,The Big Short,8.0,Yes
3,Deadpool,10.0,Yes
4,The Martian,8.0,Yes


In [4]:
# Viewing the DataFrame info of df_ratings
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       121 non-null    object 
 1   Rating     121 non-null    float64
 2   Flickable  121 non-null    object 
dtypes: float64(1), object(2)
memory usage: 3.0+ KB


In [5]:
# Loading the API keys from the separate, secret YAML file
with open('../keys/keys.yml', 'r') as f:
    keys_yaml = yaml.safe_load(f)

In [6]:
# Extracting the API keys from the loaded YAML
tmdb_key = keys_yaml['api_keys']['tmdb_key']
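
The nested lookup above raises a bare `KeyError` if the YAML file is missing a section or a key. Here is a minimal sketch of a more forgiving lookup. The helper name `get_api_key` and the `TMDB_API_KEY` environment variable are assumptions made for illustration; neither is part of the original project.

```python
import os


def get_api_key(config, name, env_var=None):
    """Return an API key from a loaded config dict, or from an environment variable.

    `config` is expected to look like {'api_keys': {'tmdb_key': '...'}},
    matching the structure read from keys.yml.
    """
    # Tolerate a missing or empty 'api_keys' section
    key = ((config or {}).get('api_keys') or {}).get(name)
    if key:
        return key

    # Hypothetical fallback: read the key from an environment variable
    if env_var and os.environ.get(env_var):
        return os.environ[env_var]

    raise KeyError(f"API key '{name}' not found in config or environment.")


# An in-memory dict standing in for the parsed YAML (placeholder key value)
example_config = {'api_keys': {'tmdb_key': 'not-a-real-key'}}
example_key = get_api_key(example_config, 'tmdb_key', env_var='TMDB_API_KEY')
```

In the notebook, the same call could take `keys_yaml` in place of `example_config`.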

## Data Source #1: The Movie Database (TMDb)
The first data source we will be looking at is **The Movie Database (TMDb)**, which is highly lauded as one of the best APIs on the internet for gathering movie data. Fortunately for us, somebody created a Python wrapper that lets us interact with this API using simple Python code.

### TMDb API Key
First, you will need to sign up for an API key that will allow you to use the API. This API works on a "freemium" model, and it only costs money if you're going to use it at a mass scale. Our project is well within the free tier, so you fortunately don't have to provide anything like a credit card number. But you will have to give some basic information, including your email. I will not be sharing my API key so as to maintain the ability to use the free tier. To sign up for your own API key, please follow the sign up instructions here: [TMDb API Key Registration](https://developers.themoviedb.org/3/getting-started/introduction)

### TMDb Python Library
There are a number of different user-created Python wrappers to support TMDb, but the one that appears to be the most popular is **tmdbv3api**. The documentation for this library can be found at this link: [tmdbv3api Documentation](https://github.com/AnthonyBloomer/tmdbv3api). To install this Python library on your machine, run the following command:

```
pip install tmdbv3api
```

In [7]:
# Instantiating the TMDb objects and setting the API key
tmdb = tmdbv3api.TMDb()
tmdb_search = tmdbv3api.Search()
tmdb_movies = tmdbv3api.Movie()
tmdb.api_key = tmdb_key

### Testing Process with Single Movie
Before we craft something fancy, let's do some basic testing to see how we can get what we need using an example movie, which in this case is *The Matrix*.

In [8]:
# Getting the tmdb_id from the preliminary search
tmdb_id = tmdb_search.movies({'query': 'The Matrix'})[0]['id']
tmdb_id

603
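
Taking the first search hit works for *The Matrix*, but the top-ranked result is not guaranteed to be the movie we reviewed. Here is a minimal sketch of a slightly more careful selection, assuming each hit exposes `'id'` and `'title'` keys. The helper name `pick_best_match` and the sample hits are made up for illustration.

```python
def pick_best_match(results, title):
    """Return the id of the hit whose title matches exactly, else the first hit's id.

    `results` is a list of dict-like search hits with 'id' and 'title' keys.
    Returns None when there are no results.
    """
    if not results:
        return None

    wanted = title.strip().casefold()
    for hit in results:
        if str(hit['title']).strip().casefold() == wanted:
            return hit['id']

    # No exact title match: fall back to the top-ranked hit
    return results[0]['id']


# Made-up search hits for illustration only (the id 1 is a placeholder)
sample_hits = [
    {'id': 1, 'title': 'The Matrix Reloaded'},
    {'id': 603, 'title': 'The Matrix'},
]
best_id = pick_best_match(sample_hits, 'the matrix')
```

With these sample hits, the case-insensitive exact match on the title is preferred over the first hit in the list.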

In [9]:
# Getting the details of the movie using the tmdb_id
tmdb_details = dict(tmdb_movies.details(tmdb_id))

In [10]:
# Adding the movie_title and tmdb_id to the tmdb_details dictionary
tmdb_details['movie_title'] = 'The Matrix'
tmdb_details['tmdb_id'] = tmdb_id

In [11]:
# Checking the length of TMDb genres to see if there is a secondary genre
tmdb_genre_length = len(tmdb_details['genres'])
tmdb_genre_length

2

In [12]:
# Separating the primary_genre from the 'genres' nested child dictionary if it exists
if tmdb_genre_length == 0:
    tmdb_details['primary_genre'] = np.nan
else:
    tmdb_details['primary_genre'] = tmdb_details['genres'][0]['name']

In [13]:
# Separating the secondary_genre from the 'genres' nested child dictionary if it exists
if tmdb_genre_length >= 2:
    tmdb_details['secondary_genre'] = tmdb_details['genres'][1]['name']
else:
    tmdb_details['secondary_genre'] = np.nan
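
The two genre checks above can also be folded into one small helper. This is only a sketch: the name `split_genres` is hypothetical, and `math.nan` stands in for `np.nan` so that the example has no third-party dependencies.

```python
import math


def split_genres(genres):
    """Return (primary, secondary) genre names from a list of {'name': ...} dicts.

    Missing positions are filled with NaN.
    """
    names = [genre['name'] for genre in (genres or [])]
    # Pad with NaN so there are always at least two values to unpack
    names += [math.nan] * (2 - len(names))
    primary, secondary = names[:2]
    return primary, secondary


both = split_genres([{'name': 'Action'}, {'name': 'Science Fiction'}])
one = split_genres([{'name': 'Comedy'}])
none = split_genres([])
```

A movie with three or more genres still yields only the first two names, which matches the primary and secondary split used in this notebook.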

In [14]:
# Defining which features we need to keep from tmdb_details
tmdb_feats = ['movie_title', 'rating', 'flickable', 'tmdb_id', 'imdb_id', 'budget', 'primary_genre', 'secondary_genre', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']

In [15]:
# Slimming down tmdb_details with only the features we want to keep
tmdb_details = {key: value for key, value in tmdb_details.items() if key in tmdb_feats}
tmdb_details

{'budget': 63000000,
 'imdb_id': 'tt0133093',
 'popularity': 127.06,
 'revenue': 463517383,
 'runtime': 136,
 'vote_average': 8.2,
 'vote_count': 19853,
 'movie_title': 'The Matrix',
 'tmdb_id': 603,
 'primary_genre': 'Action',
 'secondary_genre': 'Science Fiction'}
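
Notice that `rating` and `flickable` are listed in `tmdb_feats` but are absent from the output, because the dictionary comprehension silently drops any feature that was never set. If we would rather have every feature present in a fixed order, one option is sketched below. The helper name `select_features` and the sample values are assumptions for illustration.

```python
def select_features(details, feats, fill=None):
    """Return a new dict with exactly the keys in `feats`, in that order.

    Keys absent from `details` are filled with `fill` instead of being dropped.
    """
    return {feat: details.get(feat, fill) for feat in feats}


# 'homepage' is not in the feature list, so it is discarded
sample_details = {'budget': 63000000, 'runtime': 136, 'homepage': 'ignored'}
sample_feats = ['movie_title', 'budget', 'runtime']
slimmed = select_features(sample_details, sample_feats)
```

Here `slimmed` keeps `movie_title` with a `None` placeholder, so every row ends up with the same set of columns.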

In [16]:
# Converting the tmdb_details dictionary to a Pandas DataFrame
new_matrix_entry = pd.DataFrame.from_dict([tmdb_details])
new_matrix_entry

Unnamed: 0,budget,imdb_id,popularity,revenue,runtime,vote_average,vote_count,movie_title,tmdb_id,primary_genre,secondary_genre
0,63000000,tt0133093,127.06,463517383,136,8.2,19853,The Matrix,603,Action,Science Fiction


### Creating TMDb Data Pipeline
Now that we have our test working, we can go ahead and create a pipeline that will iterate through all the movies in the `df_ratings` DataFrame and get all the TMDb data that we need to perform feature engineering later.

In [17]:
# Instantiating the TMDb objects and setting the API key
tmdb = tmdbv3api.TMDb()
tmdb_search = tmdbv3api.Search()
tmdb_movies = tmdbv3api.Movie()
tmdb.api_key = tmdb_key

In [18]:
# Defining which features we need to keep from tmdb_details
tmdb_feats = ['movie_title', 'rating', 'flickable', 'tmdb_id', 'imdb_id', 'budget', 'primary_genre', 'secondary_genre', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']

In [19]:
# Creating a new DataFrame to hold all the TMDb data
df_tmdb = pd.DataFrame(columns = tmdb_feats)

# Iterating through the df_ratings DataFrame to get the names for extracting detailed info from TMDb
for index, row in df_ratings.iterrows():
    # Extracting info from df_ratings
    movie_name = row['Name']
    rating = row['Rating']
    flickable = row['Flickable']
    
    # Performing the preliminary search
    search_results = tmdb_search.movies({'query': movie_name})
    
    # Extracting tmdb_id if search results exist
    if len(search_results) != 0:
        tmdb_id = search_results[0]['id']
    else:
        print(f'Results not found for title: {movie_name}.')
        continue
    
    # Getting the details of the movie using the tmdb_id
    tmdb_details = dict(tmdb_movies.details(tmdb_id))
    
    # Adding the df_ratings info and tmdb_id to the tmdb_details dictionary
    tmdb_details['movie_title'] = movie_name
    tmdb_details['rating'] = rating
    tmdb_details['flickable'] = flickable
    tmdb_details['tmdb_id'] = tmdb_id
    
    # Checking the length of TMDb genres to see if there is a secondary genre
    tmdb_genre_length = len(tmdb_details['genres'])
    
    # Separating the primary_genre from the 'genres' nested child dictionary if it exists
    if tmdb_genre_length == 0:
        tmdb_details['primary_genre'] = np.nan
    else:
        tmdb_details['primary_genre'] = tmdb_details['genres'][0]['name']
        
    # Separating the secondary_genre from the 'genres' nested child dictionary if it exists
    if tmdb_genre_length >= 2:
        tmdb_details['secondary_genre'] = tmdb_details['genres'][1]['name']
    else:
        tmdb_details['secondary_genre'] = np.nan
    
    # Slimming down tmdb_details with only the features we want to keep
    tmdb_details = {key: value for key, value in tmdb_details.items() if key in tmdb_feats}
    
    # Converting the tmdb_details dictionary to a Pandas DataFrame
    new_tmdb_entry = pd.DataFrame.from_dict([tmdb_details])
    
    # Appending the new movie entry to the overall df_tmdb DataFrame
    df_tmdb = pd.concat([df_tmdb, new_tmdb_entry], ignore_index = True)
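
Because the loop above talks to a live API, it is hard to test offline. Here is a minimal sketch of the same flow with the search and details calls passed in as functions, so stubs can stand in for the API. The name `collect_rows`, the stub data, and the sample ratings are all hypothetical.

```python
def collect_rows(ratings, search_fn, details_fn):
    """Build one dict per movie by calling the injected search and details functions.

    ratings:    iterable of (name, rating, flickable) tuples
    search_fn:  callable(name) -> list of hits with an 'id' key
    details_fn: callable(movie_id) -> dict of movie details
    Returns (rows, missed), where `missed` lists titles with no search results.
    """
    rows, missed = [], []
    for name, rating, flickable in ratings:
        hits = search_fn(name)
        if not hits:
            missed.append(name)
            continue

        movie_id = hits[0]['id']
        row = dict(details_fn(movie_id))
        row.update(movie_title=name, rating=rating, flickable=flickable, tmdb_id=movie_id)
        rows.append(row)
    return rows, missed


# Stub lookups standing in for the real API calls (hypothetical data)
fake_index = {'The Matrix': [{'id': 603}]}
fake_details = {603: {'runtime': 136}}

rows, missed = collect_rows(
    [('The Matrix', 9.0, 'Yes'), ('No Such Film', 5.0, 'No')],
    search_fn=lambda name: fake_index.get(name, []),
    details_fn=lambda movie_id: fake_details[movie_id],
)
```

With the real objects, `search_fn` could wrap `tmdb_search.movies` and `details_fn` could wrap `tmdb_movies.details`. The resulting `rows` list can then be passed to `pd.DataFrame(rows)` once, rather than growing a DataFrame inside the loop.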

In [20]:
# Viewing the first few rows of the full TMDb data
df_tmdb.head()

Unnamed: 0,movie_title,rating,flickable,tmdb_id,imdb_id,budget,primary_genre,secondary_genre,popularity,revenue,runtime,vote_average,vote_count
0,Zoolander 2,7.0,Yes,329833,tt1608290,50000000,Comedy,,14.443,55969000,100,4.8,1766
1,Dope,8.5,Yes,308639,tt3850214,700000,Crime,Drama,10.102,17986781,103,7.1,1179
2,The Big Short,8.0,Yes,318846,tt1596363,28000000,Comedy,Drama,27.252,133346506,131,7.3,6894
3,Deadpool,10.0,Yes,293660,tt1431045,58000000,Action,Adventure,132.891,783100000,108,7.6,25491
4,The Martian,8.0,Yes,286217,tt3659388,108000000,Drama,Adventure,54.332,630161890,144,7.7,16074


In [21]:
# Viewing the Pandas DataFrame info of the full TMDb data
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_title      121 non-null    object 
 1   rating           121 non-null    float64
 2   flickable        121 non-null    object 
 3   tmdb_id          121 non-null    object 
 4   imdb_id          121 non-null    object 
 5   budget           121 non-null    object 
 6   primary_genre    119 non-null    object 
 7   secondary_genre  107 non-null    object 
 8   popularity       121 non-null    float64
 9   revenue          121 non-null    object 
 10  runtime          121 non-null    object 
 11  vote_average     121 non-null    float64
 12  vote_count       121 non-null    object 
dtypes: float64(3), object(10)
memory usage: 12.4+ KB


In [22]:
# Saving out the TMDb data
df_tmdb.to_csv('../data/raw/tmdb_data.csv', index = False)

In [23]:
# Using a copy of df_tmdb as source for all new data in a single DataFrame
df_all_data = df_tmdb.copy()

## Data Source #2: iMDb
The next source we will be using is iMDb. While iMDb does have an official API, it doesn't seem readily accessible to the public. Fortunately, somebody created a Python alternative, which I'm guessing relies on some advanced screen scraping under the hood. We could also do our own screen scraping if we wanted to, but that seems like a lot of work when this alternative works just fine. It is called **iMDbPY** and can be installed by running the following command:
```
pip install imdbpy
```

The `README` supporting this Python library can be found at this link: [iMDbPY Documentation](https://github.com/alberanid/imdbpy).

Because TMDb already gave us so much information, the only new features we'll be adding from this source are `imdb_rating` and `imdb_votes`.

### Testing Process with Single Movie
Just as we did with TMDb, let's first do some testing with a single movie to make sure the library works as expected. Because this Python library bases its search on the iMDb ID, we're going to build off our work in TMDb, since that API nicely supplies us with the appropriate iMDb IDs.

In [24]:
# Refreshing ourselves on the TMDb output for The Matrix
new_matrix_entry

Unnamed: 0,budget,imdb_id,popularity,revenue,runtime,vote_average,vote_count,movie_title,tmdb_id,primary_genre,secondary_genre
0,63000000,tt0133093,127.06,463517383,136,8.2,19853,The Matrix,603,Action,Science Fiction


In [25]:
# Extracting the iMDb ID from the TMDb search results and removing first two "tt" characters
imdb_id = new_matrix_entry['imdb_id']
imdb_id = imdb_id[0][2:]
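
Slicing with `[2:]` assumes that every ID is a string starting with `tt`. Here is a minimal sketch of a more defensive version, which returns `None` for anything it cannot interpret. The helper name `strip_tt_prefix` is hypothetical.

```python
def strip_tt_prefix(imdb_id):
    """Return the numeric part of an iMDb ID such as 'tt0133093', or None if unusable."""
    # Missing values (None, NaN) and empty strings cannot be sliced meaningfully
    if not isinstance(imdb_id, str) or not imdb_id:
        return None

    # Only strip the prefix when it is actually present
    digits = imdb_id[2:] if imdb_id.startswith('tt') else imdb_id
    return digits if digits.isdigit() else None


matrix_digits = strip_tt_prefix('tt0133093')
```

A `None` result can then be used to skip the iMDb lookup for that movie instead of raising an error partway through a loop.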

In [26]:
# Instantiating the iMDbPY search object
imdb_search = IMDb()

In [27]:
# Using iMDbPY to get movie details using the iMDb ID
imdb_details = dict(imdb_search.get_movie(imdb_id))

In [28]:
# Adding imdb_rating and imdb_votes to new test entry for The Matrix
new_matrix_entry['imdb_rating'] = imdb_details['rating']
new_matrix_entry['imdb_votes'] = imdb_details['votes']

In [29]:
# Viewing the updated entry
new_matrix_entry

Unnamed: 0,budget,imdb_id,popularity,revenue,runtime,vote_average,vote_count,movie_title,tmdb_id,primary_genre,secondary_genre,imdb_rating,imdb_votes
0,63000000,tt0133093,127.06,463517383,136,8.2,19853,The Matrix,603,Action,Science Fiction,8.7,1755501


### Creating iMDb Data Pipeline
Now that we have our test working, we can go ahead and create a pipeline that will iterate through all the movies in the `df_all_data` DataFrame and get all the iMDb data that we need to perform feature engineering later.

In [30]:
# Instantiating the iMDbPY search object
imdb_search = IMDb()

In [31]:
# Copying the original df_tmdb DataFrame for testing purposes
df_tmdb_copy = df_all_data.copy()

In [32]:
# Iterating through each entry in df_tmdb, using the iMDb ID to extract relevant movie information
for index, row in df_all_data.iterrows():
    # Extracting movie title from the row
    movie_title = row['movie_title']
    
    # Extracting the iMDb ID from the TMDb search results and removing first two "tt" characters
    imdb_id = row['imdb_id']
    imdb_id = imdb_id[2:]
    
    # Using iMDbPY to get movie details using the iMDb ID
    imdb_details = dict(imdb_search.get_movie(imdb_id))
    
    # Adding imdb_rating and imdb_votes to movie's row if available
    if 'rating' not in imdb_details.keys():
        print(f'The following movie has no iMDb rating: {movie_title}.')
        df_all_data.loc[index, 'imdb_rating'] = np.nan
    else:
        df_all_data.loc[index, 'imdb_rating'] = imdb_details['rating']
    if 'votes' not in imdb_details.keys():
        df_all_data.loc[index, 'imdb_votes'] = np.nan
    else:
        df_all_data.loc[index, 'imdb_votes'] = imdb_details['votes']

The following movie has no iMDb rating: Raw.
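
The two membership checks in the loop can also be collapsed with `dict.get`, which keeps each field tied to its own column name. This is a minimal sketch that assumes the details have already been converted to a plain dictionary. The helper name `extract_imdb_fields` is hypothetical, and `math.nan` stands in for `np.nan`.

```python
import math


def extract_imdb_fields(details):
    """Return imdb_rating and imdb_votes from a details dict, using NaN when absent."""
    return {
        'imdb_rating': details.get('rating', math.nan),
        'imdb_votes': details.get('votes', math.nan),
    }


# Values taken from the single-movie test above
complete = extract_imdb_fields({'rating': 8.7, 'votes': 1755501})

# A details dict with neither key, as happened for one movie in the loop
partial = extract_imdb_fields({'title': 'Raw'})
```

Each returned key matches the DataFrame column it is meant for, so the values could be assigned to a row in a single pass.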


In [34]:
# Viewing the first few rows of the updated df_all_data DataFrame
df_all_data.head()

Unnamed: 0,movie_title,rating,flickable,tmdb_id,imdb_id,budget,primary_genre,secondary_genre,popularity,revenue,runtime,vote_average,vote_count,imdb_rating,imdb_votes
0,Zoolander 2,7.0,Yes,329833,tt1608290,50000000,Comedy,,14.443,55969000,100,4.8,1766,4.7,67044.0
1,Dope,8.5,Yes,308639,tt3850214,700000,Crime,Drama,10.102,17986781,103,7.1,1179,7.2,82769.0
2,The Big Short,8.0,Yes,318846,tt1596363,28000000,Comedy,Drama,27.252,133346506,131,7.3,6894,7.8,389680.0
3,Deadpool,10.0,Yes,293660,tt1431045,58000000,Action,Adventure,132.891,783100000,108,7.6,25491,8.0,947323.0
4,The Martian,8.0,Yes,286217,tt3659388,108000000,Drama,Adventure,54.332,630161890,144,7.7,16074,8.0,794224.0


In [35]:
# Extracting out only the iMDb bits to save as separate CSV file
df_imdb = df_all_data[['movie_title', 'imdb_id', 'rating', 'flickable', 'imdb_rating', 'imdb_votes']]
df_imdb

Unnamed: 0,movie_title,imdb_id,rating,flickable,imdb_rating,imdb_votes
0,Zoolander 2,tt1608290,7.0,Yes,4.7,67044.0
1,Dope,tt3850214,8.5,Yes,7.2,82769.0
2,The Big Short,tt1596363,8.0,Yes,7.8,389680.0
3,Deadpool,tt1431045,10.0,Yes,8.0,947323.0
4,The Martian,tt3659388,8.0,Yes,8.0,794224.0
...,...,...,...,...,...,...
116,Scary Stories to Tell in the Dark,tt3387520,6.2,Yes,6.2,67622.0
117,Palm Springs,tt9484998,8.9,Yes,7.4,130120.0
118,Beverly Hills Ninja,tt0118708,8.6,Yes,5.6,40302.0
119,The Girl Next Door,tt0265208,7.3,No,6.7,214740.0


In [36]:
# Saving iMDb data to its own CSV file
df_imdb.to_csv('../data/raw/imdb_data.csv', index = False)

In [37]:
# Saving df_all_data to CSV as a checkpoint spot
df_all_data.to_csv('../data/raw/all_data.csv', index = False)

## Data Source #3: Rotten Tomatoes
Throughout our work with the other data sources, I have had it in mind to use the Rotten Tomatoes critic and audience scores to support the model. Interestingly, neither the TMDb API nor the iMDb package contained this information, so we'll have to get it from its own source!