# Data Gathering
Because we are starting this project from scratch, we need a way to gather the data that will support the machine learning model. Fortunately, a number of movie APIs exist that we can call to get data about each reviewed movie.

## Project Setup

In [1]:
# Importing the necessary Python libraries
import os
import yaml
import pandas as pd
import tmdbv3api

In [2]:
# Loading Caelan's ratings from Google Sheets-sourced CSV
df_ratings = pd.read_csv('../data/caelan-reviews.csv')

In [3]:
# Viewing the first few rows of df_ratings
df_ratings.head()

| | Name | Rating | Flickable |
|---|---|---|---|
| 0 | Zoolander 2 | 7.0 | Yes |
| 1 | Dope | 8.5 | Yes |
| 2 | The Big Short | 8.0 | Yes |
| 3 | Deadpool | 10.0 | Yes |
| 4 | The Martian | 8.0 | Yes |


In [4]:
# Viewing the DataFrame info of df_ratings
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123 entries, 0 to 122
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       123 non-null    object 
 1   Rating     123 non-null    float64
 2   Flickable  123 non-null    object 
dtypes: float64(1), object(2)
memory usage: 3.0+ KB


In [5]:
# Loading the API keys from the separate, secret YAML file
with open('../keys/keys.yml', 'r') as f:
    keys_yaml = yaml.safe_load(f)

In [6]:
# Extracting the API keys from the loaded YAML
tmdb_key = keys_yaml['api_keys']['tmdb_key']
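
The lookup above implies that `keys.yml` holds a nested mapping with an `api_keys` section containing a `tmdb_key` entry. As a minimal sketch, assuming that layout, a small helper could fail with a clearer message when the entry is missing or empty. The name `get_api_key` and the placeholder value are hypothetical and not part of the notebook.

```python
def get_api_key(keys_yaml, name, section='api_keys'):
    """Return the key called `name` from the loaded YAML mapping.

    Hypothetical helper: assumes the layout implied by the notebook,
    i.e. {'api_keys': {'tmdb_key': '...'}}.
    """
    try:
        key = keys_yaml[section][name]
    except (KeyError, TypeError) as err:
        # TypeError covers an empty YAML file, which loads as None
        raise KeyError(f"'{section}.{name}' not found in the keys file") from err
    if not key:
        raise ValueError(f"'{section}.{name}' is empty in the keys file")
    return key


# Placeholder mapping standing in for the result of yaml.safe_load
example_keys = {'api_keys': {'tmdb_key': 'YOUR_TMDB_KEY'}}
tmdb_key_example = get_api_key(example_keys, 'tmdb_key')
```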

## API #1: The Movie Database (TMDb)
The first API we will be looking at is called **The Movie Database (TMDb)**, and it is highly lauded as one of the best APIs on the internet for gathering movie data. Fortunately for us, somebody created a Python wrapper that lets us interact with this API using simple Python code.

### TMDb API Key
First, you will need to sign up for an API key that will allow you to use the API. The API follows a "freemium" model: you only pay if you use it at a large scale. Our project is well within the free tier, so you don't have to provide anything like a credit card number, but you will have to give some basic information, including your email address. I will not be sharing my own API key, so that my usage stays within the free tier. To sign up for your own API key, please follow the instructions here: [TMDb API Key Registration](https://developers.themoviedb.org/3/getting-started/introduction)

### TMDb Python Library
There are a number of different user-created Python wrappers for TMDb, but the one that appears to be the most popular is **tmdbv3api**. The documentation for this library can be found at this link: [tmdbv3api Documentation](https://github.com/AnthonyBloomer/tmdbv3api). To install it on your machine, run the following command:

```
pip install tmdbv3api
```

In [7]:
# Instantiating the TMDb objects and setting the API key
tmdb = tmdbv3api.TMDb()
tmdb_search = tmdbv3api.Search()
tmdb_movies = tmdbv3api.Movie()
tmdb.api_key = tmdb_key

#### Testing Process with Single Movie: The Matrix
Before we craft something fancy, let's do some basic testing to see how we can get what we need using an example movie, which in this case is *The Matrix*.

In [22]:
# Getting the tmdb_id from the preliminary search
tmdb_id = tmdb_search.movies({'query': 'The Matrix'})[0]['id']
tmdb_id

603
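
Indexing with `[0]` assumes that the search returns at least one match and that the first match is the intended film. Below is a minimal sketch of a more defensive lookup. It assumes the results behave like a sequence of dict-like objects with an `id` entry, as the cell above suggests, and that an empty result raises `IndexError` on indexing the way a plain list would. The helper name `first_result_id` is hypothetical.

```python
def first_result_id(results):
    """Return the 'id' of the first search result, or None if there is none."""
    try:
        return results[0]['id']
    except (IndexError, KeyError):
        return None


# Stand-in data shaped like the search output above (not a live API call)
sample_results = [{'id': 603, 'title': 'The Matrix'}]
print(first_result_id(sample_results))  # 603
print(first_result_id([]))              # None
```

Returning `None` instead of raising lets a later loop record the titles that need a manual lookup rather than stopping at the first miss.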

In [23]:
# Getting the details of the movie using the tmdb_id
tmdb_details = dict(tmdb_movies.details(tmdb_id))

In [29]:
# Adding the movie_title and tmdb_id to the tmdb_details dictionary
tmdb_details['movie_title'] = 'The Matrix'
tmdb_details['tmdb_id'] = tmdb_id

In [30]:
# Separating the genre details from the nested child dictionary in tmdb_details
tmdb_details['genre1'] = tmdb_details['genres'][0]['name']
tmdb_details['genre2'] = tmdb_details['genres'][1]['name']
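
The two lines above assume every movie has at least two genres. That holds for *The Matrix*, but it may not hold for every title in the ratings list, in which case indexing `[1]` would raise an `IndexError`. Below is a minimal sketch that pads missing genres with `None`. The helper name `split_genres` is hypothetical, and the input is assumed to be an iterable of dict-like objects with a `name` entry, as in the cell above.

```python
def split_genres(genres, n=2):
    """Return {'genre1': ..., 'genre2': ...}, using None where a genre is missing."""
    names = [genre['name'] for genre in list(genres)[:n]]
    names += [None] * (n - len(names))
    return {f'genre{i}': name for i, name in enumerate(names, start=1)}


# Stand-in data shaped like tmdb_details['genres'] (not a live API call)
print(split_genres([{'name': 'Action'}, {'name': 'Science Fiction'}]))
# {'genre1': 'Action', 'genre2': 'Science Fiction'}
print(split_genres([{'name': 'Drama'}]))
# {'genre1': 'Drama', 'genre2': None}
```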

In [31]:
# Defining which features we need to keep from tmdb_details
tmdb_feats = ['movie_title', 'tmdb_id', 'budget', 'genre1', 'genre2', 'imdb_id', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']

In [32]:
# Slimming down tmdb_details with only the features we want to keep
tmdb_details = {key: value for key, value in tmdb_details.items() if key in tmdb_feats}
tmdb_details

{'budget': 63000000,
 'imdb_id': 'tt0133093',
 'popularity': 114.152,
 'revenue': 463517383,
 'runtime': 136,
 'vote_average': 8.2,
 'vote_count': 19841,
 'movie_title': 'The Matrix',
 'tmdb_id': 603,
 'genre1': 'Action',
 'genre2': 'Science Fiction'}

In [35]:
# Converting the tmdb_details dictionary to a single-row pandas DataFrame
new_tmdb_entry = pd.DataFrame([tmdb_details])
new_tmdb_entry

| | budget | imdb_id | popularity | revenue | runtime | vote_average | vote_count | movie_title | tmdb_id | genre1 | genre2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63000000 | tt0133093 | 114.152 | 463517383 | 136 | 8.2 | 19841 | The Matrix | 603 | Action | Science Fiction |


#### Creating TMDb Data Pipeline
Now that we have our test working, we can go ahead and create a pipeline that will iterate through all the movies in the `df_ratings` DataFrame and get all the TMDb data that we need to perform feature engineering later.
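
One way such a pipeline could be structured is sketched below. This is not necessarily the approach the notebook goes on to take. The function `collect_tmdb_rows` and the stand-in `fake_search` / `fake_details` functions are hypothetical; the search and details calls are passed in as arguments so the sketch runs without network access or an API key. It assumes, as in the single-movie test, that results and details are dict-like and that an empty search result raises `IndexError` on indexing.

```python
def collect_tmdb_rows(titles, search_fn, details_fn, feats):
    """Build one dict of TMDb features per title.

    search_fn(title)    -> sequence of dict-like results with an 'id' entry
    details_fn(tmdb_id) -> dict-like movie details
    Titles with no search result are returned separately for manual review.
    """
    rows, not_found = [], []
    for title in titles:
        try:
            tmdb_id = search_fn(title)[0]['id']
        except (IndexError, KeyError):
            not_found.append(title)
            continue
        details = dict(details_fn(tmdb_id))
        details['movie_title'] = title
        details['tmdb_id'] = tmdb_id
        # Pad to two genres so movies with fewer genres do not raise IndexError
        names = [g['name'] for g in (details.get('genres') or [])][:2]
        names += [None] * (2 - len(names))
        details['genre1'], details['genre2'] = names
        # Keep only the requested features, in the requested order
        rows.append({key: details.get(key) for key in feats})
    return rows, not_found


# Stand-in functions and data so the sketch runs offline
fake_db = {603: {'budget': 63000000, 'runtime': 136,
                 'genres': [{'name': 'Action'}, {'name': 'Science Fiction'}]}}

def fake_search(title):
    return [{'id': 603}] if title == 'The Matrix' else []

def fake_details(tmdb_id):
    return fake_db[tmdb_id]

feats = ['movie_title', 'tmdb_id', 'budget', 'genre1', 'genre2', 'runtime']
rows, not_found = collect_tmdb_rows(['The Matrix', 'No Such Movie'],
                                    fake_search, fake_details, feats)
print(rows)
print(not_found)
```

With the notebook's objects, the call might look like `collect_tmdb_rows(df_ratings['Name'], lambda t: tmdb_search.movies({'query': t}), tmdb_movies.details, tmdb_feats)`, and `pd.DataFrame(rows)` would then give one row per matched movie. The first search hit is not guaranteed to be the intended film, so spot-checking the matches and the `not_found` list is advisable.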