# **Movie Production Business Analysis - TMDB ETL**

- Yvon Bilodeau
- May 2022

## **Business Problem**

After reviewing a preview of the data from the IMDB ETL, the stakeholder realized that the IMDB data includes no financial information (e.g., budget or revenue).

Without financial data, it is not possible to analyze which movies are successful, so this gap must be addressed first.

## **The Data**

The stakeholder identified **The Movie Database (TMDB)** as a great source of financial data (https://www.themoviedb.org/). 

### **Specifications**


- The stakeholder would like the budget, revenue, and MPAA Rating (G/PG/PG-13/R), which is also called "Certification", extracted.

- The stakeholder only wants results for movies that meet all of the criteria established in the IMDB ETL.

- As a proof of concept, they requested a test extraction of movies with a start year of 2000 or 2001, with each year saved as a separate .csv.gz file.

## **Import libraries**

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os, json, math, time
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

## **API Credentials**

In [None]:
# Load API Credentials
with open('C:/Users/DELL/.secret/tmdb_api.json') as f: 
    login = json.load(f)

In [None]:
## Display the keys of the loaded dict
login.keys()

In [None]:
# Set the API_KEY variable to the "API Key(v3 auth)"
tmdb.API_KEY =  login['api-key']
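
The credentials cell above assumes a JSON file that holds the key under `api-key`. As a minimal sketch of that step with an explicit check, the hypothetical helper `load_api_key` below (not part of the original notebook) reads such a file and raises a readable error when the expected key is missing. The demo writes a throwaway file with a placeholder value rather than a real credential.

```python
import json
import os
import tempfile

def load_api_key(path, key_name="api-key"):
    """Read a JSON credentials file and return the value stored under key_name."""
    with open(path) as f:
        login = json.load(f)
    if key_name not in login:
        raise KeyError(f"{key_name!r} not found; available keys: {sorted(login)}")
    return login[key_name]

# Demonstration with a throwaway file and a placeholder key (not a real credential)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tmdb_api.json")
    with open(demo_path, "w") as f:
        json.dump({"api-key": "PLACEHOLDER"}, f)
    demo_key = load_api_key(demo_path)

print(demo_key)
```

In the notebook, the returned value would be assigned to `tmdb.API_KEY` as in the cell above.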

## **Test API call**

In [None]:
# Create a movie object with the tmdb.Movies class (603 is a TMDB movie ID)
movie = tmdb.Movies(603)

In [None]:
# Display the .info dictionary of the movie object
info = movie.info()
info

- The dictionary returned by .info() includes budget and revenue, but not the certification.

In [None]:
info['budget']

In [None]:
info['revenue']

In [None]:
info['imdb_id']

In [None]:
# Test a lookup by IMDb ID instead of TMDB ID
movie = tmdb.Movies('tt1361336')
info = movie.info()
info['budget']

In [None]:
# Example from the package README: print the movie's certification if it has a US release
response = movie.releases()
for c in movie.countries:
    if c['iso_3166_1'] == 'US':
        print(c['certification'])

## **API Call Preparation**

### **Data Folder**

In [None]:
# Create a data folder to save API call data if it doesn't exist
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

### **Define the years to retrieve**

In [None]:
# Years requested in the proof-of-concept specification
YEARS_TO_GET = [2000, 2001]

### **Custom Functions**

#### **Movie Rating Function**

In [None]:
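# Function to obtain movie info along with its US certification
def get_movie_with_rating(movie_id):
    # Create the movie object for the given id
    movie = tmdb.Movies(movie_id)
    # Start the output dictionary from the movie's .info()
    movie_info = movie.info()
    # Get the release info, which holds the certification for each country
    releases = movie.releases()
    for c in releases['countries']:
        # Keep only the US certification
        if c['iso_3166_1'] == 'US':
            movie_info['certification'] = c['certification']
    return movie_info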
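
When a movie has no US entry, the function above returns a dictionary without a `certification` key. The sketch below isolates that lookup into a hypothetical helper, `extract_us_certification`, which returns `None` in that case. It assumes the same dictionary shape the notebook reads (`countries`, `iso_3166_1`, `certification`), and the sample data is made up for illustration. Unlike the loop above, it stops at the first US entry.

```python
def extract_us_certification(releases):
    """Return the US certification from a releases-style dict, or None if absent."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US":
            # Treat an empty string the same as a missing certification
            return country.get("certification") or None
    return None

# Made-up sample data shaped like the dictionary the notebook reads from
sample_with_us = {"countries": [
    {"iso_3166_1": "FR", "certification": "U"},
    {"iso_3166_1": "US", "certification": "PG-13"},
]}
sample_without_us = {"countries": [{"iso_3166_1": "FR", "certification": "U"}]}

print(extract_us_certification(sample_with_us))     # PG-13
print(extract_us_certification(sample_without_us))  # None
```

Returning `None` explicitly would give every saved record a `certification` field, which may make the later CSV export easier to reason about.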

#### **Append Results to JSON Function**

In [None]:
# Append new results to the existing JSON file
# Adapted from:
# https://www.geeksforgeeks.org/append-to-json-file-using-python/

def write_json(new_data, filename):
    with open(filename,'r+') as file:
        # First, load the existing data (a list of records)
        file_data = json.load(file)
        # Extend when both are lists; otherwise append the single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move back to the start of the file so the old contents are overwritten
        file.seek(0)
        # Convert back to JSON and write it out
        json.dump(file_data, file)
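
To illustrate how this function behaves, the sketch below seeds a temporary file with the same placeholder list the API loop creates, then adds one record as a dict and two more as a list. The function is repeated here only so the example runs on its own. The file name and the `tt...` ids are made-up placeholders.

```python
import json
import os
import tempfile

def write_json(new_data, filename):
    """Add new_data to the JSON list stored in filename."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        # Extend when both are lists; otherwise append the single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Rewind and overwrite; the new content is never shorter than the old
        file.seek(0)
        json.dump(file_data, file)

with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "demo_results.json")
    # Seed the file the same way the API loop does for a new year
    with open(demo_file, "w") as f:
        json.dump([{"imdb_id": 0}], f)
    write_json({"imdb_id": "tt0000001"}, demo_file)                               # append
    write_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_file)   # extend
    with open(demo_file) as f:
        stored = json.load(f)

print(len(stored))  # 4
```

Because each call reloads and rewrites the whole file, the cost grows with the number of saved records. That trade-off keeps every result on disk as soon as it arrives.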

## **API Call Loop**

In [None]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET,desc='YEARS',position=0):
    # Defining the JSON file to store results for year
    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
    
    # Check if file exists
    file_exists = os.path.isfile(JSON_FILE)
    # If it does not exist: create it
    if not file_exists:
        # Save a list holding one placeholder dict with just "imdb_id" to the new JSON file
        with open(JSON_FILE,'w') as f:
            json.dump([{'imdb_id':0}],f)
            
    # Load in the dataframe from project part 1 as basics:
    basics = pd.read_csv("Data/title_basics.csv.gz")
    
    # Saving new year as the current df
    df = basics.loc[ basics['startYear']==YEAR].copy()
    
    # Saving movie ids as a Series
    movie_ids = df['tconst'].copy()
    
    # Load existing data from json into a dataframe called "previous_df"
    previous_df = pd.read_json(JSON_FILE)
    
    # filter out any ids that are already in the JSON_FILE
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    # INNER Loop
    # Iterate over the movie ids that still need to be retrieved
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        # Attempt to retrieve the data for the movie id
        try:
            temp = get_movie_with_rating(movie_id)  # Uses the custom function defined above
            # Append/extend results to existing file using a pre-made function
            write_json(temp,JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming server
            time.sleep(0.02)
            
        # If the call fails, skip this movie and move on to the next id
        except Exception:
            continue
            
    # Save the year's results as csv.gz file
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", 
                         compression="gzip", 
                         index=False)
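
The loop can be stopped and restarted because it skips ids that are already stored in the year's JSON file. The sketch below shows that filtering step with plain Python in place of the pandas `isin` expression used in the loop. The helper `ids_still_needed`, the file name, and the ids are hypothetical and for illustration only.

```python
import json
import os
import tempfile

def ids_still_needed(all_ids, json_file):
    """Return the ids from all_ids that are not yet stored in json_file."""
    with open(json_file) as f:
        already_saved = {record.get("imdb_id") for record in json.load(f)}
    return [movie_id for movie_id in all_ids if movie_id not in already_saved]

with tempfile.TemporaryDirectory() as tmp:
    results_file = os.path.join(tmp, "tmdb_api_results_demo.json")
    # Simulate an interrupted run: the placeholder record plus one saved movie
    with open(results_file, "w") as f:
        json.dump([{"imdb_id": 0}, {"imdb_id": "tt0000001"}], f)
    remaining = ids_still_needed(["tt0000001", "tt0000002", "tt0000003"], results_file)

print(remaining)  # ['tt0000002', 'tt0000003']
```

One consequence of this design is that movies whose API call failed are never written to the file, so they are requested again on the next run.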