# Data Gathering
Because we are starting this project from scratch, we need a way to gather the data that will support the machine learning model. Fortunately for us, a number of different movie APIs exist that we can call to get relevant data for each reviewed movie.

## Project Setup

In [1]:
# Importing the libraries we typically use
import os
import warnings
import yaml
import numpy as np
import pandas as pd

# Importing the special data source libraries
import tmdbv3api
from imdb import IMDb
from omdb import OMDBClient
from rotten_tomatoes_scraper.rt_scraper import MovieScraper

# Hiding any warnings
warnings.filterwarnings('ignore')

In [2]:
# Loading Caelan's ratings from Google Sheets-sourced CSV
df_ratings = pd.read_csv('../data/raw/caelan_reviews.csv')

In [3]:
# Viewing the first few rows of df_ratings
df_ratings.head()

Unnamed: 0,movie_name,biehn_scale_rating,biehn_yes_or_no
0,Zoolander 2,7.0,Yes
1,Dope,8.5,Yes
2,The Big Short,8.0,Yes
3,Deadpool,10.0,Yes
4,The Martian,8.0,Yes


In [4]:
# Viewing the DataFrame info of df_ratings
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138 entries, 0 to 137
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_name          138 non-null    object 
 1   biehn_scale_rating  137 non-null    float64
 2   biehn_yes_or_no     137 non-null    object 
dtypes: float64(1), object(2)
memory usage: 3.4+ KB


In [5]:
# Loading the API keys from the separate, secret YAML file
with open('../keys/keys.yml', 'r') as f:
    keys_yaml = yaml.safe_load(f)

In [6]:
# Extracting the API keys from the loaded YAML
tmdb_key = keys_yaml['api_keys']['tmdb_key']
omdb_key = keys_yaml['api_keys']['omdb_key']
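
The two lookups above imply that `keys.yml` holds a top-level `api_keys` mapping with `tmdb_key` and `omdb_key` entries. As a minimal sketch, a small guard can fail early with a readable message when an entry is missing. The helper name `require_api_keys` and the placeholder values are hypothetical and not part of the original notebook.

```python
def require_api_keys(config, names=("tmdb_key", "omdb_key")):
    """Return the requested keys from config['api_keys'], or raise a clear error."""
    api_keys = (config or {}).get("api_keys") or {}
    missing = [name for name in names if not api_keys.get(name)]
    if missing:
        raise KeyError(f"Missing API key(s) in keys file: {', '.join(missing)}")
    return {name: api_keys[name] for name in names}


# Stand-in for the dictionary that yaml.safe_load would return (placeholder values)
example_config = {"api_keys": {"tmdb_key": "TMDB_PLACEHOLDER", "omdb_key": "OMDB_PLACEHOLDER"}}
keys = require_api_keys(example_config)
```

In the notebook, `keys_yaml` would take the place of `example_config`.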

## Data Source #1: The Movie Database (TMDb)
The first data source we will be looking at is called **The Movie Database (TMDb)**, and it is highly lauded as one of the best APIs on the internet for gathering movie data. Fortunately for us, somebody created a Python wrapper that lets us interact with this API using simple Python code.

### TMDb API Key
First, you will need to sign up for an API key that will allow you to use the API. This API works on a "freemium" model, and costs only apply if you use it at a mass scale. Our project is well within the free tier, so you fortunately don't have to provide anything like a credit card number, but you will have to give some basic information, including your email. I will not be sharing my API key so as to maintain the ability to use the free tier. To sign up for your own API key, please follow the sign up instructions here: [TMDb API Key Registration](https://developers.themoviedb.org/3/getting-started/introduction)

### TMDb Python Library
There are a number of different user-created Python wrappers that support TMDb, but the one that appears to be the most popular is **tmdbv3api**. The documentation for this library can be found at this link: [tmdbv3api Documentation](https://github.com/AnthonyBloomer/tmdbv3api). To install this Python library on your machine, run the following command:

```
pip install tmdbv3api
```

In [7]:
# Instantiating the TMDb objects and setting the API key
tmdb = tmdbv3api.TMDb()
tmdb_search = tmdbv3api.Search()
tmdb_movies = tmdbv3api.Movie()
tmdb.api_key = tmdb_key

### Testing Process with Single Movie
Before we craft something fancy, let's do some basic testing to see how we can get what we need using an example movie, which in this case is *The Matrix*.

In [8]:
# Getting the tmdb_id from the preliminary search
tmdb_id = tmdb_search.movies({'query': 'The Matrix'})[0]['id']
tmdb_id

624860
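
Taking the first search result trusts the API's ranking, which can surface a sequel or remake instead of the intended film. A hedged sketch of a stricter selection follows: prefer an exact title match, optionally narrowed by release year. It assumes each result has been converted to a plain dictionary with `title` and `release_date` fields. The helper name `pick_best_match`, the second ID, and the release dates are made up for illustration.

```python
def pick_best_match(results, title, year=None):
    """Prefer an exact (case-insensitive) title match, optionally in a given year."""
    wanted = title.strip().lower()
    exact = [r for r in results if str(r.get("title", "")).strip().lower() == wanted]
    if year is not None:
        in_year = [r for r in exact if str(r.get("release_date", "")).startswith(str(year))]
        if in_year:
            return in_year[0]
    if exact:
        return exact[0]
    # Fall back to the API's own ranking, as the notebook does
    return results[0] if results else None


# Illustrative search results; the second ID and both dates are placeholders
sample_results = [
    {"id": 624860, "title": "The Matrix Resurrections", "release_date": "2021-12-22"},
    {"id": 1, "title": "The Matrix", "release_date": "1999-01-01"},
]
best = pick_best_match(sample_results, "The Matrix", year=1999)
```

Here `best` is the second entry, because its title matches exactly and its release date starts with the requested year.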

In [9]:
# Getting the details of the movie using the tmdb_id
tmdb_details = dict(tmdb_movies.details(tmdb_id))

In [10]:
# Adding the movie_name and tmdb_id to the tmdb_details dictionary
tmdb_details['movie_name'] = 'The Matrix'
tmdb_details['tmdb_id'] = tmdb_id

In [11]:
# Checking the length of TMDb genres to see if there is a secondary genre
tmdb_genre_length = len(tmdb_details['genres'])
tmdb_genre_length

3

In [12]:
# Separating the primary_genre from the 'genres' nested child dictionary if it exists
if tmdb_genre_length == 0:
    tmdb_details['primary_genre'] = np.nan
else:
    tmdb_details['primary_genre'] = tmdb_details['genres'][0]['name']

In [13]:
# Separating the secondary_genre from the 'genres' nested child dictionary if it exists
if tmdb_genre_length >= 2:
    tmdb_details['secondary_genre'] = tmdb_details['genres'][1]['name']
else:
    tmdb_details['secondary_genre'] = np.nan
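
The two genre cells above can be folded into one helper, which makes the "zero, one, or many genres" cases easy to check in isolation. This is a sketch only: the function name `split_genres` is hypothetical, `math.nan` stands in for `np.nan`, and the third genre in the sample is a placeholder.

```python
import math


def split_genres(genres):
    """Return (primary, secondary) genre names, using NaN for missing slots."""
    names = [g["name"] for g in genres or []]
    primary = names[0] if len(names) >= 1 else math.nan
    secondary = names[1] if len(names) >= 2 else math.nan
    return primary, secondary


# Shaped like the 'genres' list in the details dictionary; third entry is a placeholder
primary, secondary = split_genres(
    [{"name": "Science Fiction"}, {"name": "Action"}, {"name": "Placeholder"}]
)
```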

In [14]:
# Defining which features we need to keep from tmdb_details
tmdb_feats = ['movie_name', 'biehn_scale_rating', 'biehn_yes_or_no', 'tmdb_id', 'imdb_id', 'budget', 'primary_genre', 'secondary_genre', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']

In [15]:
# Slimming down tmdb_details with only the features we want to keep
tmdb_details = {key: value for key, value in tmdb_details.items() if key in tmdb_feats}
tmdb_details

{'budget': 190000000,
 'imdb_id': 'tt10838180',
 'popularity': 1794.78,
 'revenue': 148000000,
 'runtime': 148,
 'vote_average': 6.8,
 'vote_count': 2835,
 'movie_name': 'The Matrix',
 'tmdb_id': 624860,
 'primary_genre': 'Science Fiction',
 'secondary_genre': 'Action'}

In [16]:
# Converting the tmdb_details dictionary to a Pandas DataFrame
new_matrix_entry = pd.DataFrame.from_dict([tmdb_details])
new_matrix_entry

Unnamed: 0,budget,imdb_id,popularity,revenue,runtime,vote_average,vote_count,movie_name,tmdb_id,primary_genre,secondary_genre
0,190000000,tt10838180,1794.78,148000000,148,6.8,2835,The Matrix,624860,Science Fiction,Action


### Creating TMDb Data Pipeline
Now that we have our test working, we can go ahead and create a pipeline that will iterate through all the movies in the `df_ratings` DataFrame and get all the TMDb data that we need to perform feature engineering later.

In [17]:
# Instantiating the TMDb objects and setting the API key
tmdb = tmdbv3api.TMDb()
tmdb_search = tmdbv3api.Search()
tmdb_movies = tmdbv3api.Movie()
tmdb.api_key = tmdb_key

In [18]:
# Defining which features we need to keep from tmdb_details
tmdb_feats = ['movie_name', 'biehn_scale_rating', 'biehn_yes_or_no', 'tmdb_id', 'imdb_id', 'budget', 'primary_genre', 'secondary_genre', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']

In [19]:
# Creating a new DataFrame to hold all the TMDb data
df_tmdb = pd.DataFrame(columns = tmdb_feats)

# Iterating through the df_ratings DataFrame to get the names for extracting detailed info from TMDb
for index, row in df_ratings.iterrows():
    # Extracting info from df_ratings
    movie_name = row['movie_name']
    biehn_scale_rating = row['biehn_scale_rating']
    biehn_yes_or_no = row['biehn_yes_or_no']
    
    # Performing the preliminary search
    search_results = tmdb_search.movies({'query': movie_name})
    
    # Extracting tmdb_id if search results exist
    if len(search_results) != 0:
        tmdb_id = search_results[0]['id']
    else:
        print(f'Results not found for title: {movie_name}.')
        continue
    
    # Getting the details of the movie using the tmdb_id
    tmdb_details = dict(tmdb_movies.details(tmdb_id))
    
    # Adding the df_ratings info and tmdb_id to the tmdb_details dictionary
    tmdb_details['movie_name'] = movie_name
    tmdb_details['biehn_scale_rating'] = biehn_scale_rating
    tmdb_details['biehn_yes_or_no'] = biehn_yes_or_no
    tmdb_details['tmdb_id'] = tmdb_id
    
    # Checking the length of TMDb genres to see if there is a secondary genre
    tmdb_genre_length = len(tmdb_details['genres'])
    
    # Separating the primary_genre from the 'genres' nested child dictionary if it exists
    if tmdb_genre_length == 0:
        tmdb_details['primary_genre'] = np.nan
    else:
        tmdb_details['primary_genre'] = tmdb_details['genres'][0]['name']
        
    # Separating the secondary_genre from the 'genres' nested child dictionary if it exists
    if tmdb_genre_length >= 2:
        tmdb_details['secondary_genre'] = tmdb_details['genres'][1]['name']
    else:
        tmdb_details['secondary_genre'] = np.nan
    
    # Slimming down tmdb_details with only the features we want to keep
    tmdb_details = {key: value for key, value in tmdb_details.items() if key in tmdb_feats}
    
    # Converting the tmdb_details dictionary to a Pandas DataFrame
    new_tmdb_entry = pd.DataFrame.from_dict([tmdb_details])
    
    # Appending the new movie entry to the overall df_tmdb DataFrame
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    df_tmdb = pd.concat([df_tmdb, new_tmdb_entry], ignore_index = True)
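
Growing a DataFrame one row at a time inside the loop re-copies the frame on every iteration. A common alternative is to collect plain dictionaries and build the DataFrame once at the end. The sketch below shows only the record-building step in pure Python. The helper name `build_record` and the `homepage` field are hypothetical stand-ins, while the sample values are taken from the rows shown in this notebook.

```python
def build_record(details, ratings_row, keep):
    """Merge one ratings row with one details dict, keeping only the wanted keys."""
    merged = {**details, **ratings_row}
    return {key: merged.get(key) for key in keep}


keep = ["movie_name", "biehn_scale_rating", "tmdb_id", "budget"]

# Stand-ins for API responses; 'homepage' represents a field we do not keep
fake_details = [
    {"tmdb_id": 329833, "budget": 50000000, "homepage": "ignored"},
    {"tmdb_id": 308639, "budget": 700000, "homepage": "ignored"},
]
ratings_rows = [
    {"movie_name": "Zoolander 2", "biehn_scale_rating": 7.0},
    {"movie_name": "Dope", "biehn_scale_rating": 8.5},
]

records = [build_record(d, r, keep) for d, r in zip(fake_details, ratings_rows)]
```

With pandas available, `pd.DataFrame(records, columns=keep)` would then build the full frame in a single call.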

In [20]:
# Renaming some of the columns to avoid ambiguity later
tmdb_new_col_names = {
    'popularity': 'tmdb_popularity',
    'vote_average': 'tmdb_vote_average',
    'vote_count': 'tmdb_vote_count'
}

In [21]:
# Applying the new column names appropriately
df_tmdb.rename(columns = tmdb_new_col_names, inplace = True)

In [22]:
# Viewing the first few rows of the full TMDb data
df_tmdb.head()

Unnamed: 0,movie_name,biehn_scale_rating,biehn_yes_or_no,tmdb_id,imdb_id,budget,primary_genre,secondary_genre,tmdb_popularity,revenue,runtime,tmdb_vote_average,tmdb_vote_count
0,Zoolander 2,7.0,Yes,329833,tt1608290,50000000,Comedy,,17.909,55969000,100,4.8,1817
1,Dope,8.5,Yes,308639,tt3850214,700000,Crime,Drama,21.628,17986781,103,7.1,1219
2,The Big Short,8.0,Yes,318846,tt1596363,28000000,Comedy,Drama,28.406,133346506,131,7.3,7229
3,Deadpool,10.0,Yes,293660,tt1431045,58000000,Action,Adventure,191.89,783100000,108,7.6,26252
4,The Martian,8.0,Yes,286217,tt3659388,108000000,Drama,Adventure,74.555,630161890,144,7.7,16629


In [23]:
# Viewing the Pandas DataFrame info of the full TMDb data
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138 entries, 0 to 137
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_name          138 non-null    object 
 1   biehn_scale_rating  137 non-null    float64
 2   biehn_yes_or_no     137 non-null    object 
 3   tmdb_id             138 non-null    object 
 4   imdb_id             138 non-null    object 
 5   budget              138 non-null    object 
 6   primary_genre       137 non-null    object 
 7   secondary_genre     125 non-null    object 
 8   tmdb_popularity     138 non-null    float64
 9   revenue             138 non-null    object 
 10  runtime             138 non-null    object 
 11  tmdb_vote_average   138 non-null    float64
 12  tmdb_vote_count     138 non-null    object 
dtypes: float64(3), object(10)
memory usage: 14.1+ KB


In [24]:
# Using df_tmdb as source for all new data in a single DataFrame
df_all_data = df_tmdb

## Data Source #2: IMDb
The next source we will be using is IMDb. While IMDb does have an official API, it doesn't seem readily accessible to the public. Fortunately, somebody created an alternative Python library, which I'm guessing relies on some advanced screen scraping. We could also do our own screen scraping if we wanted to, but that seems like a lot of work when this alternative works just fine. The library is called **IMDbPY** and can be installed by running the following command:
```
pip install imdbpy
```

The `README` supporting this Python library can be found at this link: [IMDbPY Documentation](https://github.com/alberanid/imdbpy).

Because TMDb already gave us so much information, the only new features we'll be adding from this source are `imdb_rating` and `imdb_votes`, along with the release `year`.

### Testing Process with Single Movie
Just as we did with TMDb, let's first do some testing with a single movie to ensure that we can use this library properly. Because this Python library bases its search on the IMDb ID, we're going to build off our work in TMDb, since that API nicely supplies us with the appropriate IMDb IDs.

In [26]:
# Refreshing ourselves on the TMDb output for The Matrix
new_matrix_entry

Unnamed: 0,budget,imdb_id,popularity,revenue,runtime,vote_average,vote_count,movie_name,tmdb_id,primary_genre,secondary_genre
0,190000000,tt10838180,1794.78,148000000,148,6.8,2835,The Matrix,624860,Science Fiction,Action


In [27]:
# Extracting the IMDb ID from the TMDb search results and removing first two "tt" characters
imdb_id = new_matrix_entry['imdb_id']
imdb_id = imdb_id[0][2:]
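
Slicing off the first two characters assumes every ID really starts with `tt`. A small sketch of a more defensive version is shown below. The helper name `imdb_numeric_id` is hypothetical.

```python
def imdb_numeric_id(imdb_id):
    """Strip a leading 'tt' from an IMDb ID, leaving other strings untouched."""
    imdb_id = str(imdb_id).strip()
    return imdb_id[2:] if imdb_id.startswith("tt") else imdb_id


numeric_id = imdb_numeric_id("tt10838180")
```

An ID that has already been stripped passes through unchanged, so the helper is safe to call twice.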

In [28]:
# Instantiating the IMDbPY search object
imdb_search = IMDb()

In [29]:
# Using IMDbPY to get movie details using the IMDb ID
imdb_details = dict(imdb_search.get_movie(imdb_id))

In [30]:
# Checking the release year returned for this IMDb ID
imdb_details['year']

2021

In [31]:
# Adding imdb_rating and imdb_votes to new test entry for The Matrix
new_matrix_entry['imdb_rating'] = imdb_details['rating']
new_matrix_entry['imdb_votes'] = imdb_details['votes']

In [32]:
# Viewing the updated entry
new_matrix_entry

Unnamed: 0,budget,imdb_id,popularity,revenue,runtime,vote_average,vote_count,movie_name,tmdb_id,primary_genre,secondary_genre,imdb_rating,imdb_votes
0,190000000,tt10838180,1794.78,148000000,148,6.8,2835,The Matrix,624860,Science Fiction,Action,5.7,190094


### Creating IMDb Data Pipeline
Now that we have our test working, we can go ahead and create a pipeline that will iterate through all the movies in the `df_all_data` DataFrame and get all the IMDb data that we need to perform feature engineering later.

In [33]:
# Instantiating the IMDbPY search object
imdb_search = IMDb()

In [34]:
# Copying the original df_tmdb DataFrame for testing purposes
# (.copy() is needed; plain assignment would only create a second name for the same object)
df_tmdb_copy = df_all_data.copy()

In [35]:
# Iterating through each entry in df_all_data, using the IMDb ID to extract relevant movie information
for index, row in df_all_data.iterrows():
    # Extracting the movie title from the row
    movie_name = row['movie_name']
    
    # Extracting the IMDb ID from the TMDb search results and removing first two "tt" characters
    imdb_id = row['imdb_id']
    imdb_id = imdb_id[2:]
    
    # Using IMDbPY to get movie details using the IMDb ID
    imdb_details = dict(imdb_search.get_movie(imdb_id))
    
    # Adding imdb_rating and imdb_votes to movie's row if available
    if 'rating' not in imdb_details.keys():
        print(f'The following movie has no IMDb rating: {movie_name}.')
        df_all_data.loc[index, 'imdb_rating'] = np.nan
    else:
        df_all_data.loc[index, 'imdb_rating'] = imdb_details['rating']
    if 'votes' not in imdb_details.keys():
        df_all_data.loc[index, 'imdb_votes'] = np.nan
    else:
        df_all_data.loc[index, 'imdb_votes'] = imdb_details['votes']
    
    # Adding the year the movie debuted
    df_all_data.loc[index, 'year'] = imdb_details['year']
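
The loop above checks `rating` and `votes` for presence but indexes `year` directly, which would raise a `KeyError` for a title without one. A sketch of handling all three fields the same way with `dict.get` follows. The helper name `extract_imdb_fields` is hypothetical, and `math.nan` stands in for `np.nan`.

```python
import math


def extract_imdb_fields(details):
    """Pull rating, votes, and year from a details dict, defaulting to NaN."""
    return {
        "imdb_rating": details.get("rating", math.nan),
        "imdb_votes": details.get("votes", math.nan),
        "year": details.get("year", math.nan),
    }


# Values taken from the single-movie test earlier in this section
complete = extract_imdb_fields({"rating": 5.7, "votes": 190094, "year": 2021})
# A details dict with no rating or votes yet
partial = extract_imdb_fields({"year": 2021})
```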

In [36]:
# Viewing the first few rows of the updated df_all_data DataFrame
df_all_data.head()

Unnamed: 0,movie_name,biehn_scale_rating,biehn_yes_or_no,tmdb_id,imdb_id,budget,primary_genre,secondary_genre,tmdb_popularity,revenue,runtime,tmdb_vote_average,tmdb_vote_count,imdb_rating,imdb_votes,year
0,Zoolander 2,7.0,Yes,329833,tt1608290,50000000,Comedy,,17.909,55969000,100,4.8,1817,4.7,68135.0,2016.0
1,Dope,8.5,Yes,308639,tt3850214,700000,Crime,Drama,21.628,17986781,103,7.1,1219,7.2,83713.0,2015.0
2,The Big Short,8.0,Yes,318846,tt1596363,28000000,Comedy,Drama,28.406,133346506,131,7.3,7229,7.8,407757.0,2015.0
3,Deadpool,10.0,Yes,293660,tt1431045,58000000,Action,Adventure,191.89,783100000,108,7.6,26252,8.0,978517.0,2016.0
4,The Martian,8.0,Yes,286217,tt3659388,108000000,Drama,Adventure,74.555,630161890,144,7.7,16629,8.0,816538.0,2015.0


In [37]:
# Extracting out only the IMDb bits to save as separate CSV file
df_imdb = df_all_data[['movie_name', 'imdb_id', 'biehn_scale_rating', 'biehn_yes_or_no', 'imdb_rating', 'imdb_votes', 'year']]
df_imdb

Unnamed: 0,movie_name,imdb_id,biehn_scale_rating,biehn_yes_or_no,imdb_rating,imdb_votes,year
0,Zoolander 2,tt1608290,7.0,Yes,4.7,68135.0,2016.0
1,Dope,tt3850214,8.5,Yes,7.2,83713.0,2015.0
2,The Big Short,tt1596363,8.0,Yes,7.8,407757.0,2015.0
3,Deadpool,tt1431045,10.0,Yes,8.0,978517.0,2016.0
4,The Martian,tt3659388,8.0,Yes,8.0,816538.0,2015.0
...,...,...,...,...,...,...,...
133,The Girl Next Door,tt0265208,7.3,No,6.7,219088.0,2004.0
134,Talladega Nights: The Ballad of Ricky Bobby,tt0415306,8.9,Yes,6.6,179080.0,2006.0
135,Air Force One,tt0118571,,,6.5,191074.0,1997.0
136,Pineapple Express,tt0910936,9.7,Yes,6.9,331073.0,2008.0


## Data Source #3: The Open Movie Database (OMDb)
Throughout our work with the other data sources, I have wanted to use the Rotten Tomatoes critic and audience scores to support the model. Interestingly, neither the TMDb API nor the IMDb package contained this information, so I decided to check out **The Open Movie Database (OMDb)** to see what it had to offer. The unfortunate news is that it does not provide the Rotten Tomatoes audience score, but we can at least get the primary critic score. Additionally, this API offers the Metacritic *metascore*, so we'll go ahead and keep that as well.

### Testing Process with Single Movie
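As we have done with our other data sources so far, we're going to continue our testing with a single movie, *The Matrix.* Note that the IMDb ID carried over from the TMDb test, `tt10838180`, belongs to *The Matrix Resurrections* (2021), the first result the title search returned, so the details below describe that film.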

In [38]:
# Getting the IMDb ID from our previous work
matrix_imdb_id = new_matrix_entry['imdb_id'][0]
matrix_imdb_id

'tt10838180'

In [39]:
# Instantiating the OMDb client
omdb_client = OMDBClient(apikey = omdb_key)

In [40]:
# Getting the search results from the OMDb client using "The Matrix" IMDb key
omdb_matrix_details = omdb_client.imdbid(matrix_imdb_id)
omdb_matrix_details

{'title': 'The Matrix Resurrections',
 'year': '2021',
 'rated': 'R',
 'released': '22 Dec 2021',
 'runtime': '148 min',
 'genre': 'Action, Sci-Fi',
 'director': 'Lana Wachowski',
 'writer': 'Lana Wachowski, David Mitchell, Aleksandar Hemon',
 'actors': 'Keanu Reeves, Carrie-Anne Moss, Yahya Abdul-Mateen II',
 'plot': 'Return to a world of two realities: one, everyday life; the other, what lies behind it. To find out if his reality is a construct, to truly know himself, Mr. Anderson will have to choose to follow the white rabbit once more.',
 'language': 'English, French, Spanish, Japanese',
 'country': 'United States',
 'awards': '21 nominations',
 'poster': 'https://m.media-amazon.com/images/M/MV5BMGJkNDJlZWUtOGM1Ny00YjNkLThiM2QtY2ZjMzQxMTIxNWNmXkEyXkFqcGdeQXVyMDM2NDM2MQ@@._V1_SX300.jpg',
 'ratings': [{'source': 'Internet Movie Database', 'value': '5.7/10'},
  {'source': 'Rotten Tomatoes', 'value': '63%'},
  {'source': 'Metacritic', 'value': '63/100'}],
 'metascore': '63',
 'imdb_rat

In [41]:
# Extracting the Rotten Tomatoes critic score
rt_critic_score = None
omdb_ratings_len = len(omdb_matrix_details['ratings'])

if omdb_ratings_len == 0:
    print('The Matrix has no recorded ratings.')
else:
    # Extracting out the Rotten Tomatoes score if available
    for rater in omdb_matrix_details['ratings']:
        if rater['source'] == 'Rotten Tomatoes':
            rt_critic_score = rater['value']

# Printing out the critic score (None if no Rotten Tomatoes entry was found)
print(rt_critic_score)

63%
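
Because the score arrives as a string such as `63%`, it will need converting before it can be used as a numeric feature. A minimal sketch that finds the Rotten Tomatoes entry and returns an integer is shown below. The helper name `rt_critic_percent` is hypothetical, and the sample list mirrors the `ratings` output shown above.

```python
def rt_critic_percent(ratings):
    """Return the Rotten Tomatoes score as an int, or None if it is absent."""
    for entry in ratings or []:
        if entry.get("source") == "Rotten Tomatoes":
            return int(entry["value"].rstrip("%"))
    return None


sample_ratings = [
    {"source": "Internet Movie Database", "value": "5.7/10"},
    {"source": "Rotten Tomatoes", "value": "63%"},
    {"source": "Metacritic", "value": "63/100"},
]
score = rt_critic_percent(sample_ratings)
```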


### Creating OMDb Data Pipeline
Okay, now that we have successfully tested the API with *The Matrix*, let's go ahead and build our pipeline that will get all the Rotten Tomatoes critic scores and Metascores for each movie.

In [42]:
# Instantiating the OMDb client
omdb_client = OMDBClient(apikey = omdb_key)

In [43]:
# Iterating through all the movies to extract the proper OMDb information
for index, row in df_all_data.iterrows():
    # Extracting movie name from the row
    movie_name = row['movie_name']
    
    # Using the OMDb client to search for the movie results using the IMDb ID
    omdb_details = omdb_client.imdbid(row['imdb_id'])
    
    # Resetting the Rotten Tomatoes critic score variable
    rt_critic_score = None
    
    # Checking if the movie has any ratings populated under 'ratings'
    omdb_ratings_len = len(omdb_details['ratings'])
    
    if omdb_ratings_len == 0:
        print(f'{movie_name} has no Rotten Tomatoes critic score.')
    else:
        # Extracting out the Rotten Tomatoes score if available
        for rater in omdb_details['ratings']:
            if rater['source'] == 'Rotten Tomatoes':
                rt_critic_score = rater['value']
                
    # Populating Rotten Tomatoes critic score appropriately
    if rt_critic_score:
        df_all_data.loc[index, 'rt_critic_score'] = rt_critic_score
    else:
        df_all_data.loc[index, 'rt_critic_score'] = np.nan
        
    # Populating the Metacritic metascore appropriately
    df_all_data.loc[index, 'metascore'] = omdb_details['metascore']
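
The metascore is stored exactly as OMDb returns it, which is a string. The sketch below converts it to a number, assuming (this is an assumption, not something shown in the notebook output) that a missing metascore is reported as the string `N/A`. The helper name `parse_metascore` is hypothetical, and `math.nan` stands in for `np.nan`.

```python
import math


def parse_metascore(raw):
    """Convert a metascore string to float, mapping missing markers to NaN."""
    text = str(raw).strip()
    # Assumed markers for "no score available"
    if text in ("", "N/A", "None", "nan"):
        return math.nan
    return float(text)


metascore = parse_metascore("63")
missing = parse_metascore("N/A")
```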

In [44]:
# Showing all the data with the new OMDb data
df_all_data.head()

Unnamed: 0,movie_name,biehn_scale_rating,biehn_yes_or_no,tmdb_id,imdb_id,budget,primary_genre,secondary_genre,tmdb_popularity,revenue,runtime,tmdb_vote_average,tmdb_vote_count,imdb_rating,imdb_votes,year,rt_critic_score,metascore
0,Zoolander 2,7.0,Yes,329833,tt1608290,50000000,Comedy,,17.909,55969000,100,4.8,1817,4.7,68135.0,2016.0,22%,34
1,Dope,8.5,Yes,308639,tt3850214,700000,Crime,Drama,21.628,17986781,103,7.1,1219,7.2,83713.0,2015.0,88%,72
2,The Big Short,8.0,Yes,318846,tt1596363,28000000,Comedy,Drama,28.406,133346506,131,7.3,7229,7.8,407757.0,2015.0,89%,81
3,Deadpool,10.0,Yes,293660,tt1431045,58000000,Action,Adventure,191.89,783100000,108,7.6,26252,8.0,978517.0,2016.0,85%,65
4,The Martian,8.0,Yes,286217,tt3659388,108000000,Drama,Adventure,74.555,630161890,144,7.7,16629,8.0,816538.0,2015.0,91%,80


In [45]:
# Showing all the data attribute information collected across all 3 sources
df_all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138 entries, 0 to 137
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_name          138 non-null    object 
 1   biehn_scale_rating  137 non-null    float64
 2   biehn_yes_or_no     137 non-null    object 
 3   tmdb_id             138 non-null    object 
 4   imdb_id             138 non-null    object 
 5   budget              138 non-null    object 
 6   primary_genre       137 non-null    object 
 7   secondary_genre     125 non-null    object 
 8   tmdb_popularity     138 non-null    float64
 9   revenue             138 non-null    object 
 10  runtime             138 non-null    object 
 11  tmdb_vote_average   138 non-null    float64
 12  tmdb_vote_count     138 non-null    object 
 13  imdb_rating         138 non-null    float64
 14  imdb_votes          138 non-null    float64
 15  year                138 non-null    float64
 16  rt_criti

## Data Source #4: Rotten Tomatoes Python Library
While we were able to extract the Rotten Tomatoes critic score from OMDb, I still really want to get the Rotten Tomatoes audience score for our model. To that end, I found a Python library that will help us do just that. To install it, run the following command:

`pip3 install rotten_tomatoes_scraper`

### Testing Process with Single Movie
As we have done with our other data sources so far, we're going to continue our testing with a single movie, *The Matrix.*

In [46]:
# Using the MovieScraper object to search for the movie
matrix_scraper = MovieScraper(movie_title = 'The Matrix')

In [47]:
# Extracting the metadata from the scraper
matrix_scraper.extract_metadata()

In [48]:
# Viewing the extracted metadata
print(matrix_scraper.metadata)

{'Score_Rotten': '88', 'Score_Audience': '85', 'Rating': 'R', 'Genre': ['Sci-fi', 'Action']}


In [49]:
# Checking to see that the RT critic score from OMDb matches the scraper score
# (stripping the '%' sign rather than slicing, so that '100%' and '5%' compare correctly)
for rating in omdb_matrix_details['ratings']:
    if rating['source'] == 'Rotten Tomatoes':
        omdb_matrix_rt_score = rating['value'].rstrip('%')

print(omdb_matrix_rt_score == matrix_scraper.metadata['Score_Rotten'])

False


### Creating RT Data Pipeline
Let's finally wrap up our data collection by creating a pipeline to get the RT audience score!

In [50]:
for index, row in df_all_data.iterrows():
    # Extracting movie name from row
    movie_name = row['movie_name']
    print(movie_name)
    
    # Checking to see if the movie has a critic score from the OMDb run
    rt_critic_score_string = str(row['rt_critic_score'])
    if rt_critic_score_string == 'nan':
        df_all_data.loc[index, 'rt_audience_score'] = np.nan
        continue
    
    # Instantiating scraper object with movie title
    try:
        movie_scraper = MovieScraper(movie_title = movie_name)
    except Exception:
        df_all_data.loc[index, 'rt_audience_score'] = np.nan
        continue
    
    # Extracting the metadata about the movie
    try:
        movie_scraper.extract_metadata()
    except Exception:
        df_all_data.loc[index, 'rt_audience_score'] = np.nan
        continue
    
    # Extracting the critic and audience scores from the metadata
    rt_critic_score = movie_scraper.metadata['Score_Rotten']
    rt_audience_score = movie_scraper.metadata['Score_Audience']
    
    # Comparing the RT critic score to OMDb and saving audience score if the same
    if rt_critic_score == row['rt_critic_score'].rstrip('%'):
        df_all_data.loc[index, 'rt_audience_score'] = rt_audience_score
    else:
        df_all_data.loc[index, 'rt_audience_score'] = np.nan

Zoolander 2
Dope
The Big Short
Deadpool
The Martian
Hardcore Henry
San Andreas
Terminator
Captain America: Civil War
All the Way
The Man from U.N.C.L.E.
Chef
Independence Day: Resurgence
Suicide Squad
Steve Jobs
Tickled
Arrival
The Jungle Book
Why Him
Office Christmas Party
Assassin's Creed
Sing
Hell or High Water
Logan
Alien: Covenant
Spider-Man: Homecoming
The Belko Experiment
Guardians of the Galaxy, Vol. 2
Get Out
Don't Breathe
Hannah
Dunkirk
Fantastic Beasts and Where to Find Them
Life
Imperium
Death Note
Little Evil
Gerald's Game
The Meyerowitz Stories
The Mist
Swingers
Schindler's List
The Darkest Hour
Three Billboards Outside Ebbing Missouri
Ladybird
Futile and Stupid Gesture
The Cloverfield Paradox
Black Panther
Ready Player One
A Quiet Place
Avengers: Infinity War
The Disaster Artist
Annihilation
Sicario Day of the Soldado 
Mission Impossible: Fallout
It Comes at Night
Hush
The Conjuring 2
Annabelle Creation
The Witch
The Ritual
Unbreakable
Patient Zero
Deadpool 2
Raw
Littles
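
The loop above keeps an audience score only when the scraper's critic score agrees with the one from OMDb, which acts as a check that both sources found the same film. Comparing the first two characters of a string such as `100%` against `100` would fail, so a sketch of a comparison on normalized integers is shown below. The helper names `normalize_score` and `scores_match` are hypothetical.

```python
def normalize_score(value):
    """Turn '63%', '63', or 63 into the int 63; return None when not parseable."""
    text = str(value).strip().rstrip("%")
    return int(text) if text.isdigit() else None


def scores_match(omdb_value, scraper_value):
    """True only when both values parse and refer to the same percentage."""
    left, right = normalize_score(omdb_value), normalize_score(scraper_value)
    return left is not None and left == right


# The single-movie test earlier: OMDb reported '63%', the scraper reported '88'
matrix_check = scores_match("63%", "88")
```

A missing critic score, such as `nan`, never matches, so those rows still end up without an audience score.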

In [51]:
df_all_data.head()

Unnamed: 0,movie_name,biehn_scale_rating,biehn_yes_or_no,tmdb_id,imdb_id,budget,primary_genre,secondary_genre,tmdb_popularity,revenue,runtime,tmdb_vote_average,tmdb_vote_count,imdb_rating,imdb_votes,year,rt_critic_score,metascore,rt_audience_score
0,Zoolander 2,7.0,Yes,329833,tt1608290,50000000,Comedy,,17.909,55969000,100,4.8,1817,4.7,68135.0,2016.0,22%,34,20
1,Dope,8.5,Yes,308639,tt3850214,700000,Crime,Drama,21.628,17986781,103,7.1,1219,7.2,83713.0,2015.0,88%,72,83
2,The Big Short,8.0,Yes,318846,tt1596363,28000000,Comedy,Drama,28.406,133346506,131,7.3,7229,7.8,407757.0,2015.0,89%,81,88
3,Deadpool,10.0,Yes,293660,tt1431045,58000000,Action,Adventure,191.89,783100000,108,7.6,26252,8.0,978517.0,2016.0,85%,65,90
4,The Martian,8.0,Yes,286217,tt3659388,108000000,Drama,Adventure,74.555,630161890,144,7.7,16629,8.0,816538.0,2015.0,91%,80,91


In [52]:
# Saving df_all_data to CSV
df_all_data.to_csv('../data/raw/all_data.csv', index = False)