## Part 2

Use the TMDb API to collect budget, revenue, and MPAA rating (G/PG/PG-13/R) data for analysis. TMDb calls the rating a "certification".

In [1]:
# Install tmdbsimple (only need to run once)

# this package will make it easier to extract the data we need without manually 
# constructing the URLs for our API calls.
!pip install tmdbsimple



In [2]:
# package that provides PROGRESS BAR for processing data from returned API calls
!pip install tqdm



In [3]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os, json, math, time
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

In [4]:
# testing to see if .csv.gz file from Project 3, Part 1 actually has data
test = pd.read_csv('Data/title_basics_cleaned.csv.gz')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292582 entries, 0 to 292581
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          292582 non-null  object
 1   titleType       292582 non-null  object
 2   primaryTitle    292582 non-null  object
 3   originalTitle   292582 non-null  object
 4   isAdult         292582 non-null  int64 
 5   startYear       292582 non-null  object
 6   endYear         292582 non-null  object
 7   runtimeMinutes  292582 non-null  object
 8   genres          292582 non-null  object
dtypes: int64(1), object(8)
memory usage: 20.1+ MB


In [5]:
# check whether the basics df holds the IMDb movie ids (tconst column)
test['tconst'].value_counts()

tt0011801     1
tt3843334     1
tt3843564     1
tt3843540     1
tt3843532     1
             ..
tt14716932    1
tt14717132    1
tt14717260    1
tt14717428    1
tt9916730     1
Name: tconst, Length: 292582, dtype: int64

In [6]:
# verify startYear values in basics df
test['startYear'].value_counts().sort_values()

2029        4
2028        7
2027       16
2026       18
2025       62
2024      257
2023     1767
2000     4283
2001     4453
2002     4585
2003     4612
2004     4829
2005     5526
2006     5925
2007     6410
2008     7358
2009     8670
2010     9636
2022    10276
2011    10469
2020    11247
2012    11344
2013    12213
2015    12991
2014    13050
2021    13087
2019    13358
2016    13618
2017    13893
2018    13973
\N      74645
Name: startYear, dtype: int64

In [7]:
# Load my TMDb login credentials
with open('/Users/shenekaallen/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
    
## Display the keys of the loaded dict
login.keys()

dict_keys(['API Key', 'Authorization'])

In [8]:
# set tmdb.API_KEY to my unique TMDb "API Key (v3 auth)" entry from the JSON file
tmdb.API_KEY = login['API Key']

### Practice:  Test data extraction

In [9]:
## make a movie object for TMDb id 603 using the Movies class from tmdb
movie = tmdb.Movies(603)

In [10]:
## movie objects have an .info() method that returns the movie details as a dictionary
response = movie.info()
response

{'adult': False,
 'backdrop_path': '/y9wuhlrqSHvhTLNVNwKMKe6HZzY.jpg',
 'belongs_to_collection': {'id': 2344,
  'name': 'The Matrix Collection',
  'poster_path': '/bV9qTVHTVf0gkW0j7p7M0ILD4pG.jpg',
  'backdrop_path': '/bRm2DEgUiYciDw3myHuYFInD7la.jpg'},
 'budget': 63000000,
 'genres': [{'id': 28, 'name': 'Action'},
  {'id': 878, 'name': 'Science Fiction'}],
 'homepage': 'http://www.warnerbros.com/matrix',
 'id': 603,
 'imdb_id': 'tt0133093',
 'original_language': 'en',
 'original_title': 'The Matrix',
 'overview': 'Set in the 22nd century, The Matrix tells the story of a computer hacker who joins a group of underground insurgents fighting the vast and powerful computers who now rule the earth.',
 'popularity': 78.676,
 'poster_path': '/f89U3ADr1oiB1s9GkdPOEpXUk5H.jpg',
 'production_companies': [{'id': 79,
   'logo_path': '/tpFpsqbleCzEE2p5EgvUq6ozfCA.png',
   'name': 'Village Roadshow Pictures',
   'origin_country': 'US'},
  {'id': 174,
   'logo_path': '/IuAlhI9eVC9Z8UQWOIDdWRKSEJ.png'

In [11]:
# What was the budget of Tom and Jerry which had imdb id of "tt1361336"?
movie = tmdb.Movies('tt1361336')
info = movie.info()
info['budget']

50000000

In [12]:
# Extract the movie certification (MPAA rating) for the current id, following the package README
movie = tmdb.Movies('tt1361336')
# save the .info() and .releases() dictionaries
info = movie.info()
releases = movie.releases()
# Loop through countries in releases
for c in releases['countries']:
    # if the country abbreviation is US
    if c['iso_3166_1'] == 'US':
        ## save a "certification" key in the info dict with the certification
        info['certification'] = c['certification']


In [13]:
info['certification']

'PG'

## Setup to use TMDB API

Define functions, Specify movie Years to extract and folder to save results

### Defined Function:  get_movie_with_rating 

In [14]:
# function that 1) accepts the movie_id as an argument and
# 2) returns a dictionary of results that includes certification
def get_movie_with_rating(movie_id):
    ## Get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    ## Save the .info() and .releases() dictionaries
    movie_info = movie.info()
    releases = movie.releases()
    # Loop through countries in releases
    for c in releases['countries']:
        # if the country abbreviation is US
        if c['iso_3166_1'] == 'US':
            ## save a "certification" key in the info dict with the certification
            movie_info['certification'] = c['certification']
    return movie_info
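
The certification lookup can be checked without calling the API. The cell below is a minimal sketch, assuming the releases response has the shape used above (a 'countries' list of dicts with 'iso_3166_1' and 'certification' keys). The helper name `extract_us_certification` and the sample dict are hypothetical and not part of tmdbsimple.

```python
def extract_us_certification(releases):
    """Return the first US certification in a releases-style dict, or None."""
    for c in releases.get('countries', []):
        if c.get('iso_3166_1') == 'US':
            return c.get('certification')
    return None

# Hand-made sample mimicking the structure used above (not real API output)
sample_releases = {'countries': [
    {'iso_3166_1': 'GB', 'certification': 'U'},
    {'iso_3166_1': 'US', 'certification': 'PG'},
]}

us_cert = extract_us_certification(sample_releases)
no_cert = extract_us_certification({'countries': []})
print(us_cert, no_cert)
```

Returning None when no US entry exists avoids a KeyError later when reading 'certification' for movies without a US release.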


### Defined Function:  write_json

In [15]:
def write_json(new_data, filename):
    """Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""
    
    with open(filename,'r+') as file:
        # First we load the existing data (a list of dicts).
        file_data = json.load(file)
        ## Extend if both are lists; otherwise append the new item
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Sets file's current position at offset.
        file.seek(0)
        # convert back to json.
        json.dump(file_data, file)
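
To see the extend/append behaviour without touching the real data files, the sketch below runs the same logic against a scratch file. The temporary path and the sample records are made up for illustration; the function body is repeated only so the cell runs on its own.

```python
import json
import os
import tempfile

def write_json(new_data, filename):
    """Same logic as write_json above, repeated so this cell is self-contained."""
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)

# Hypothetical scratch file, seeded with a one-item list
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_results.json')
with open(demo_path, 'w') as f:
    json.dump([{'imdb_id': 0}], f)

# A single dict is appended as one new item
write_json({'imdb_id': 'tt0000001', 'budget': 100}, demo_path)
# A list of dicts extends the existing list item by item
write_json([{'imdb_id': 'tt0000002'}, {'imdb_id': 'tt0000003'}], demo_path)

with open(demo_path) as f:
    demo_data = json.load(f)
print(len(demo_data))
```

Because the file is rewritten from position 0 and the list only grows, the new content always covers the old content completely.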

In [16]:
# Define years to collect from movie db and store in a variable
YEARS_TO_GET = [2000,2001]

In [17]:
# Specify the folder where the data files are saved
FOLDER = "Data/"
# list current files in Data/
os.listdir(FOLDER)

['title_basics_cleaned.csv.gz',
 'title.akas.tsv.gz',
 '.DS_Store',
 'title_ratings_cleaned.csv.gz',
 'title.akas.tsv',
 'tmdb_api_results_2000.json',
 'final_tmdb_data_2000.csv.gz',
 'title.basics.tsv.gz',
 'tmdb_api_results_2001.json',
 'title.ratings.tsv.gz',
 'final_tmdb_data_2001.csv.gz',
 '.ipynb_checkpoints',
 'title_akas_cleaned.csv.gz']

### Test data extraction for 2000-2001 Year Movie Releases

### OUTER Loop to collect data by YEAR

Checks whether the JSON file for the current year exists and, if not, creates it with a placeholder entry.

Names the file after the current year and stores it in the designated FOLDER (Data/). After each year is processed, saves the results to a separate .csv.gz file for that year.

### INNER Loop to collect index and movie ID

## Error:  Filtering the Project 3, Part 1 'basics' dataframe by year returns an empty df, so movie_ids is empty and get_movie_with_rating( ) is never called in the INNER Loop.

## What am I missing?
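
One likely cause, judging from the test.info() output earlier in this notebook: startYear has dtype object, so the years are stored as strings (alongside the \N placeholder), and comparing them with the integers in YEARS_TO_GET matches nothing. The sketch below reproduces this on a tiny hand-made dataframe and shows one possible fix with pd.to_numeric. The sample rows are made up, and the diagnosis is an assumption until checked against the real file.

```python
import pandas as pd

# Tiny stand-in for the basics df: years stored as strings, '\N' for missing
toy = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002', 'tt0000003'],
    'startYear': ['2000', '2001', '\\N'],
})

YEAR = 2000

# String column compared with an int: no row matches
empty_match = toy.loc[toy['startYear'] == YEAR]

# Possible fix: convert to numbers first; '\N' becomes NaN
toy['startYear'] = pd.to_numeric(toy['startYear'], errors='coerce')
fixed_match = toy.loc[toy['startYear'] == YEAR]

print(len(empty_match), len(fixed_match))
```

In the loop, the same conversion would go right after basics is loaded and before the rows are filtered by YEAR.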

In [18]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET,desc='YEARS',position=0):
    
    #Defining the JSON file to store results for year
    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
    # Check if file exists
    file_exists = os.path.isfile(JSON_FILE)
    # If it does not exist: create it
    if not file_exists:
        print('The year', YEAR, 'file does not exist.  Creating empty file.')
        # save a list holding one placeholder dict with just "imdb_id" to the new json file.
        with open(JSON_FILE,'w') as f:
            json.dump([{'imdb_id':0}],f)
    else:
        print('The year', YEAR, 'file already exists.')
        
    # Load in the dataframe from project part 1 as basics:
    basics = pd.read_csv('Data/title_basics_cleaned.csv.gz')
    #print(basics.info())
    #Saving new year as the current df
    df = basics.loc[basics['startYear']==YEAR].copy()
    print(df)
    # saving movie ids to list
    movie_ids = df['tconst'].copy()#.to_list()
    #print(movie_ids)
    # Load existing data from json into a dataframe called "previous_df"
    previous_df = pd.read_json(JSON_FILE)
    # filter out any ids that are already in the JSON_FILE
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]
        
    # Get index and movie id from list
    # This loop uses 2 functions: 1) "get_movie_with_rating" to add the certification to the .info results 
    # and 2) "write_json" to extend/append the results to the .json file. 
    
    # INNER Loop
    for movie_id in tqdm_notebook(movie_ids_to_get,
                          desc=f'Movies from {YEAR}',
                          position=1,
                          leave=True):
        # Attempt to retrieve the data for the movie id
        try:
            temp = get_movie_with_rating(movie_id)  #This uses your pre-made function
            # Append/extend results to existing file using a pre-made function
            write_json(temp,JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming server
            time.sleep(0.02)
        # If it fails, skip this movie id and move on to the next one
        except Exception as e:
            continue

    ## Saving filtered file as csv.gz
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)

YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

The year 2000 file already exists.
Empty DataFrame
Columns: [tconst, titleType, primaryTitle, originalTitle, isAdult, startYear, endYear, runtimeMinutes, genres]
Index: []


Movies from 2000: 0it [00:00, ?it/s]

The year 2001 file already exists.
Empty DataFrame
Columns: [tconst, titleType, primaryTitle, originalTitle, isAdult, startYear, endYear, runtimeMinutes, genres]
Index: []


Movies from 2001: 0it [00:00, ?it/s]

In [19]:
# Extract 3 pieces of information for each movie:
# Revenue, Budget, Certification (G, PG, etc.)

final_year_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   imdb_id  1 non-null      int64
dtypes: int64(1)
memory usage: 136.0 bytes


In [59]:
#Once you have retrieved and saved the final results to 2 separate .csv.gz files, 
#move on to a new Exploratory Data Analysis notebook to explore the following questions.

#Exploratory Data Analysis
#1. Load in your csv.gz's of results for each year extracted.
#2. Concatenate the data into 1 dataframe for the remainder of the analysis.
#3. Once you have your data from the API, they would like you to perform some light EDA to show:
#3a. How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
#3b. Please exclude any movies with 0's for budget AND revenue from the remaining visualizations.
#3c. How many movies are there in each of the certification categories (G/PG/PG-13/R)?
#3d. What is the average revenue per certification category?
#3e. What is the average budget per certification category?

#Deliverables

#After you have joined the tmdb results into 1 dataframe in the EDA Notebook, 

#Save a final merged .csv.gz of all of the tmdb api data
#The file name should be "tmdb_results_combined.csv.gz"
#Make sure this is pushed to your github repository along with all of your code
#One code file for API calls
#One code file for EDA
#Submit the link
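
The EDA steps listed above could look roughly like the sketch below. It uses two small made-up dataframes in place of the yearly files, and it assumes the results contain 'budget', 'revenue', and 'certification' columns; none of the numbers are real TMDb data.

```python
import pandas as pd

# Made-up rows standing in for the per-year results (not real TMDb data)
df_2000 = pd.DataFrame({
    'imdb_id': ['tt0000001', 'tt0000002'],
    'budget': [100, 0],
    'revenue': [250, 0],
    'certification': ['PG', 'R'],
})
df_2001 = pd.DataFrame({
    'imdb_id': ['tt0000003', 'tt0000004'],
    'budget': [0, 300],
    'revenue': [50, 900],
    'certification': ['PG', 'R'],
})

# Step 2: concatenate the yearly results into one dataframe
combined = pd.concat([df_2000, df_2001], ignore_index=True)

# Steps 3a/3b: keep movies with budget OR revenue greater than 0
has_financials = (combined['budget'] > 0) | (combined['revenue'] > 0)
valid = combined.loc[has_financials]
n_valid = len(valid)

# Step 3c: number of movies per certification
cert_counts = valid['certification'].value_counts()

# Steps 3d/3e: average revenue and budget per certification
cert_means = valid.groupby('certification')[['revenue', 'budget']].mean()

print(n_valid)
print(cert_counts)
print(cert_means)
```

With the real data, the two dataframes would come from pd.read_csv on the final_tmdb_data_{YEAR}.csv.gz files, and combined would be saved with to_csv as "tmdb_results_combined.csv.gz".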