# Data Gathering
Because we are starting this project from scratch, we need a way to gather the data that will support the machine learning model. Fortunately for us, a number of different movie APIs exist that we can call to get data about each reviewed movie.

## Project Setup

In [1]:
# Importing the necessary Python libraries
import os
import yaml
import numpy as np
import pandas as pd
import tmdbv3api

In [2]:
# Loading Caelan's ratings from Google Sheets-sourced CSV
df_ratings = pd.read_csv('../data/raw/caelan-reviews.csv')

In [3]:
# Viewing the first few rows of df_ratings
df_ratings.head()

Unnamed: 0,Name,Rating,Flickable
0,Zoolander 2,7.0,Yes
1,Dope,8.5,Yes
2,The Big Short,8.0,Yes
3,Deadpool,10.0,Yes
4,The Martian,8.0,Yes


In [4]:
# Viewing the DataFrame info of df_ratings
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       121 non-null    object 
 1   Rating     121 non-null    float64
 2   Flickable  121 non-null    object 
dtypes: float64(1), object(2)
memory usage: 3.0+ KB


In [5]:
# Loading the API keys from the separate, secret YAML file
with open('../keys/keys.yml', 'r') as f:
    keys_yaml = yaml.safe_load(f)

In [6]:
# Extracting the API keys from the loaded YAML
tmdb_key = keys_yaml['api_keys']['tmdb_key']

## API #1: The Movie Database (TMDb)
The first API we will look at is **The Movie Database (TMDb)**, which is highly lauded as one of the best APIs on the internet for gathering movie data. Fortunately for us, somebody created a Python wrapper that lets us interact with this API using simple Python code.

### TMDb API Key
First, you will need to sign up for an API key that will allow you to use the API. The API follows a "freemium" model, but charges only apply if you use it at a mass scale. Our project is well within the free tier, so you don't have to provide anything like a credit card number, although you will have to give some basic information, including your email. I will not be sharing my API key so as to maintain the ability to use the free tier. To sign up for your own API key, please follow the instructions here: [TMDb API Key Registration](https://developers.themoviedb.org/3/getting-started/introduction)
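
One common way to keep a key out of a shared notebook is to read it from an environment variable and fall back to the secret YAML file only when that variable is unset. The following is a minimal sketch of that idea and is not part of the original project: the variable name `TMDB_API_KEY` and the helper `load_tmdb_key` are hypothetical, and the nested `api_keys` / `tmdb_key` layout is assumed to match the keys file loaded during project setup.

```python
import os


def load_tmdb_key(keys_yaml=None, env_var='TMDB_API_KEY'):
    """Return the TMDb API key from the environment, else from a parsed YAML dict."""
    # Hypothetical environment variable name; use whatever suits your setup
    key = os.environ.get(env_var)
    if key:
        return key

    # Fall back to the already-parsed contents of the secret YAML file
    if keys_yaml is not None:
        return keys_yaml['api_keys']['tmdb_key']

    raise RuntimeError(f'No TMDb API key found in {env_var} or the YAML file.')


# Usage with a placeholder value standing in for a real key
example_key = load_tmdb_key({'api_keys': {'tmdb_key': 'placeholder-key'}})
```

With this arrangement the notebook can run unchanged on a machine that has no `keys.yml`, as long as the environment variable is set.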

### TMDb Python Library
There are a number of different user-created Python wrappers for TMDb, but the one that appears to be the most popular is **tmdbv3api**. The documentation for this library can be found here: [tmdbv3api Documentation](https://github.com/AnthonyBloomer/tmdbv3api). To install this Python library on your machine, run the following command:

```
pip install tmdbv3api
```

In [7]:
# Instantiating the TMDb objects and setting the API key
tmdb = tmdbv3api.TMDb()
tmdb_search = tmdbv3api.Search()
tmdb_movies = tmdbv3api.Movie()
tmdb.api_key = tmdb_key

#### Testing Process with Single Movie
Before we craft something fancy, let's do some basic testing to see how we can get what we need using an example movie, which in this case is *The Matrix*.

In [8]:
# Getting the tmdb_id from the preliminary search
tmdb_id = tmdb_search.movies({'query': 'The Matrix'})[0]['id']
tmdb_id

603

In [9]:
# Getting the details of the movie using the tmdb_id
tmdb_details = dict(tmdb_movies.details(tmdb_id))

In [10]:
# Adding the movie_title and tmdb_id to the tmdb_details dictionary
tmdb_details['movie_title'] = 'The Matrix'
tmdb_details['tmdb_id'] = tmdb_id

In [11]:
# Checking the length of TMDb genres to see if there is a secondary genre
tmdb_genre_length = len(tmdb_details['genres'])
tmdb_genre_length

2

In [12]:
# Separating the primary_genre from the 'genres' nested child dictionary
tmdb_details['primary_genre'] = tmdb_details['genres'][0]['name']

In [13]:
# Separating the secondary_genre from the 'genres' nested child dictionary if it exists
if tmdb_genre_length >= 2:
    tmdb_details['secondary_genre'] = tmdb_details['genres'][1]['name']
else:
    tmdb_details['secondary_genre'] = 'No Secondary Genre'

In [14]:
# Defining which features we need to keep from tmdb_details
tmdb_feats = ['movie_title', 'tmdb_id', 'budget', 'primary_genre', 'secondary_genre', 'imdb_id', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']

In [15]:
# Slimming down tmdb_details with only the features we want to keep
tmdb_details = {key: value for key, value in tmdb_details.items() if key in tmdb_feats}
tmdb_details

{'budget': 63000000,
 'imdb_id': 'tt0133093',
 'popularity': 111.724,
 'revenue': 463517383,
 'runtime': 136,
 'vote_average': 8.2,
 'vote_count': 19846,
 'movie_title': 'The Matrix',
 'tmdb_id': 603,
 'primary_genre': 'Action',
 'secondary_genre': 'Science Fiction'}

In [16]:
# Converting the tmdb_details dictionary to a Pandas DataFrame
new_tmdb_entry = pd.DataFrame.from_dict([tmdb_details])
new_tmdb_entry

Unnamed: 0,budget,imdb_id,popularity,revenue,runtime,vote_average,vote_count,movie_title,tmdb_id,primary_genre,secondary_genre
0,63000000,tt0133093,111.724,463517383,136,8.2,19846,The Matrix,603,Action,Science Fiction


#### Creating TMDb Data Pipeline
Now that we have our test working, we can go ahead and create a pipeline that will iterate through all the movies in the `df_ratings` DataFrame and get all the TMDb data that we need to perform feature engineering later.
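
The per-movie steps tested above (adding our own fields, splitting out the genres, and keeping only the wanted features) can also be folded into one small helper that is easy to check without calling the API. This is a minimal sketch rather than the notebook's actual pipeline: the function name `flatten_tmdb_details` is hypothetical, the sample dictionary is a hand-built stand-in for an API response that reuses the values shown for *The Matrix*, and `math.nan` is used in place of `np.nan` only to keep the example free of third-party imports.

```python
import math


def flatten_tmdb_details(details, feats, **extra):
    """Add extra fields and genre columns to a details dict, then keep only `feats`."""
    flat = dict(details)
    flat.update(extra)

    # 'genres' is a list of dictionaries, each carrying a 'name' key
    genres = flat.get('genres') or []
    flat['primary_genre'] = genres[0]['name'] if len(genres) >= 1 else math.nan
    flat['secondary_genre'] = genres[1]['name'] if len(genres) >= 2 else math.nan

    # Keep only the features we care about
    return {key: value for key, value in flat.items() if key in feats}


# Hand-built stand-in for an API response; 'unused_field' is a made-up key to show filtering
sample_details = {
    'budget': 63000000,
    'runtime': 136,
    'genres': [{'name': 'Action'}, {'name': 'Science Fiction'}],
    'unused_field': 'dropped',
}
sample_feats = ['movie_title', 'tmdb_id', 'budget', 'primary_genre', 'secondary_genre', 'runtime']
flat_entry = flatten_tmdb_details(sample_details, sample_feats, movie_title='The Matrix', tmdb_id=603)
```

Keeping the transformation separate from the network calls means the genre edge cases (no genres, exactly one genre) can be exercised with small dictionaries like the one above.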

In [17]:
# Instantiating the TMDb objects and setting the API key
tmdb = tmdbv3api.TMDb()
tmdb_search = tmdbv3api.Search()
tmdb_movies = tmdbv3api.Movie()
tmdb.api_key = tmdb_key

In [27]:
# Defining which features we need to keep from tmdb_details
tmdb_feats = ['movie_title', 'rating', 'flickable', 'tmdb_id', 'imdb_id', 'budget', 'primary_genre', 'secondary_genre', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']

In [35]:
# Creating a new DataFrame to hold all the TMDb data
df_tmdb = pd.DataFrame(columns = tmdb_feats)

In [36]:
# Iterating through the df_ratings DataFrame to get the names for extracting detailed info from TMDb
for index, row in df_ratings.iterrows():
    # Extracting info from df_ratings
    movie_name = row['Name']
    rating = row['Rating']
    flickable = row['Flickable']
    
    # Performing the preliminary search
    search_results = tmdb_search.movies({'query': movie_name})
    
    # Extracting tmdb_id if search results exist
    if len(search_results) != 0:
        tmdb_id = search_results[0]['id']
    else:
        print(f'Results not found for title: {movie_name}.')
        continue
    
    # Getting the details of the movie using the tmdb_id
    tmdb_details = dict(tmdb_movies.details(tmdb_id))
    
    # Adding the df_ratings info and tmdb_id to the tmdb_details dictionary
    tmdb_details['movie_title'] = movie_name
    tmdb_details['rating'] = rating
    tmdb_details['flickable'] = flickable
    tmdb_details['tmdb_id'] = tmdb_id
    
    # Checking the length of TMDb genres to see if there is a secondary genre
    tmdb_genre_length = len(tmdb_details['genres'])
    
    # Separating the primary_genre from the 'genres' nested child dictionary if it exists
    if tmdb_genre_length == 0:
        tmdb_details['primary_genre'] = np.nan
    else:
        tmdb_details['primary_genre'] = tmdb_details['genres'][0]['name']
        
    # Separating the secondary_genre from the 'genres' nested child dictionary if it exists
    if tmdb_genre_length >= 2:
        tmdb_details['secondary_genre'] = tmdb_details['genres'][1]['name']
    else:
        tmdb_details['secondary_genre'] = np.nan
    
    # Slimming down tmdb_details with only the features we want to keep
    tmdb_details = {key: value for key, value in tmdb_details.items() if key in tmdb_feats}
    
    # Converting the tmdb_details dictionary to a Pandas DataFrame
    new_tmdb_entry = pd.DataFrame.from_dict([tmdb_details])
    
    # Appending the new movie entry to the overall df_tmdb DataFrame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    df_tmdb = pd.concat([df_tmdb, new_tmdb_entry])

In [37]:
# Viewing the first few rows of the full TMDb data
df_tmdb.head()

Unnamed: 0,movie_title,rating,flickable,tmdb_id,imdb_id,budget,primary_genre,secondary_genre,popularity,revenue,runtime,vote_average,vote_count
0,Zoolander 2,7.0,Yes,329833,tt1608290,50000000,Comedy,,12.902,55969000,100,4.8,1766
0,Dope,8.5,Yes,308639,tt3850214,700000,Crime,Drama,8.966,17986781,103,7.1,1179
0,The Big Short,8.0,Yes,318846,tt1596363,28000000,Comedy,Drama,19.466,133346506,131,7.3,6892
0,Deadpool,10.0,Yes,293660,tt1431045,58000000,Action,Adventure,115.889,783100000,108,7.6,25487
0,The Martian,8.0,Yes,286217,tt3659388,108000000,Drama,Adventure,62.166,630161890,144,7.7,16071


In [38]:
# Viewing the Pandas DataFrame info of the full TMDb data
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 121 entries, 0 to 0
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_title      121 non-null    object 
 1   rating           121 non-null    float64
 2   flickable        121 non-null    object 
 3   tmdb_id          121 non-null    object 
 4   imdb_id          121 non-null    object 
 5   budget           121 non-null    object 
 6   primary_genre    120 non-null    object 
 7   secondary_genre  108 non-null    object 
 8   popularity       121 non-null    float64
 9   revenue          121 non-null    object 
 10  runtime          121 non-null    object 
 11  vote_average     121 non-null    float64
 12  vote_count       121 non-null    object 
dtypes: float64(3), object(10)
memory usage: 13.2+ KB
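
The repeated `0` in the first column of the `head()` output and the `0 to 0` index range above show that every row kept the index label `0` from its own one-row DataFrame. One alternative, sketched below, is to collect each movie's slimmed-down dictionary in a plain list and build the DataFrame in a single call, which gives a fresh sequential index. The names `rows`, `sketch_feats`, and `df_sketch` are illustrative only, and the two entries reuse values from the `df_tmdb.head()` output rather than live API results.

```python
import pandas as pd

# A shortened feature list, just for this sketch
sketch_feats = ['movie_title', 'rating', 'tmdb_id', 'budget', 'revenue']

# Inside the real loop, each slimmed-down tmdb_details dict would be appended here
rows = []
rows.append({'movie_title': 'Zoolander 2', 'rating': 7.0, 'tmdb_id': 329833,
             'budget': 50000000, 'revenue': 55969000})
rows.append({'movie_title': 'Dope', 'rating': 8.5, 'tmdb_id': 308639,
             'budget': 700000, 'revenue': 17986781})

# Building the DataFrame once, after the loop, with the columns in a fixed order
df_sketch = pd.DataFrame(rows, columns=sketch_feats)
```

In this sketch the rows are indexed 0 and 1, and the integer-valued columns come out with an integer dtype instead of `object`.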


In [39]:
# Saving out the TMDb data
df_tmdb.to_csv('../data/raw/tmdb_data.csv', index = False)