# Analysis of Successful Movies (Notebook 2)
* Benjamin Grossmann



This notebook retrieves data from TMDB (The Movie Database), not to be confused with IMDB (which is accessed in Notebook 1).

It then preprocesses and filters the data to keep only the movies that meet the desired criteria.
The final step is to save the reduced data set.

After the reduced data set has been saved, further work on this project should proceed in Notebook 3. Working from the saved data reduces the time needed to bring the data into a project-ready state.

If the reduced data set needs to be reset to its initial condition, re-run Notebook 2.

The information wanted from the movies in TMDB is:
* budget
* revenue
* MPAA Rating, a.k.a. Certification (G/PG/PG-13/R)
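
As a rough illustration of the fields listed above, the sketch below keeps only those keys from a TMDB-style record. The sample record, its values, and the helper name `select_fields` are hypothetical and exist only for this example; the real records come from the API calls later in the notebook.

```python
# Hypothetical sample record; the values are made up for illustration only.
sample_info = {
    "imdb_id": "tt0000000",
    "title": "Example Movie",
    "budget": 1000000,
    "revenue": 2500000,
    "certification": "PG-13",
    "runtime": 101,
}

# Fields this project cares about, plus the id used to join with the IMDB data.
WANTED_FIELDS = ["imdb_id", "budget", "revenue", "certification"]


def select_fields(info, fields=WANTED_FIELDS):
    """Return only the requested keys; keys missing from the record map to None."""
    return {key: info.get(key) for key in fields}


selected = select_fields(sample_info)
print(selected)
```

Using `dict.get` means a record without a certification still produces a complete row, with `None` in that position.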

# Initial Imports and Loads

In [1]:
import numpy as np
import pandas as pd
import json
import time
import os
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

In [2]:
# Load API credentials
with open('/Users/Benjamin/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
    
tmdb.API_KEY = login['api-key']

# Load pandas dataframe with imdb ids and release years
basics = pd.read_csv('Data/title_basics.csv.gz', low_memory = False)

# Definitions

In [3]:
# define function for retrieving certification of a movie released in the US

def tmdb_info_with_certification(imdb_id):
    # Get the movie object for the current id
    movie = tmdb.Movies(imdb_id)

    # save the .info & .releases dictionaries
    info = movie.info()
    releases = movie.releases()

    # loop through countries in releases
    for c in releases['countries']:
        if c['iso_3166_1']== 'US':
            # save certification key in the info dictionary with the certification value
            info['certification'] = c['certification']
    return info
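
The country loop can be tried without any API call. The sketch below is a variation, not the function above: the helper `extract_us_certification` is hypothetical, the sample payload is invented, and it assumes (unlike the loop above, which keeps the last US entry) that the first non-empty US certification is the one wanted.

```python
def extract_us_certification(releases):
    """Return the first non-empty US certification, or None if there is none.

    Hypothetical helper: expects a dict shaped like the `releases` payload
    used above, with a 'countries' list of per-country entries.
    """
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US" and country.get("certification"):
            return country["certification"]
    return None


# Invented payload for illustration; real data comes from movie.releases().
sample_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "12"},
        {"iso_3166_1": "US", "certification": ""},
        {"iso_3166_1": "US", "certification": "PG-13"},
    ]
}

us_certification = extract_us_certification(sample_releases)
no_us_certification = extract_us_certification({"countries": []})
print(us_certification, no_us_certification)
```

Returning `None` when no US entry exists makes the missing case explicit instead of leaving the key out of the record.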

In [4]:
# From Lesson: Efficient TMDB API Calls
# Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/

def write_json(new_data, filename):
    with open(filename,'r+') as file:
        # First we load existing data into a dict.
        file_data = json.load(file)
        
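        # Extend when both are lists; otherwise append the single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)

        # Move back to the start of the file before rewriting it
        file.seek(0)
        # Convert back to JSON and overwrite the old contents
        json.dump(file_data, file)
        # Drop any leftover bytes in case the new content is shorter
        file.truncate()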
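
The append-to-file pattern can be exercised on a throwaway file. The sketch below is a self-contained stand-in under stated assumptions: `append_to_json_list` is a hypothetical name, the ids are invented, and the file lives in a temporary directory rather than in `Data/`.

```python
import json
import os
import tempfile


def append_to_json_list(new_data, filename):
    """Append one record, or extend with a list of records, in a JSON list file."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Rewrite the file from the start and cut off any leftover bytes.
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_results.json")

    # Seed the file with a one-element list, as the retrieval loop does.
    with open(path, "w") as f:
        json.dump([{"imdb_id": 0}], f)

    append_to_json_list({"imdb_id": "tt0000001"}, path)
    append_to_json_list([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], path)

    with open(path) as f:
        stored = json.load(f)

print([record["imdb_id"] for record in stored])
```

Because the whole list is read and rewritten on every call, the cost grows with the file size; that is acceptable for a few thousand records per year.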

# Destination Folder

In [5]:
# Define folder for holding data
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok = True)
os.listdir(FOLDER)

['.ipynb_checkpoints',
 'title_akas.csv.gz',
 'title_basics.csv.gz',
 'title_ratings.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json',
 'tmdb_data_2000.csv.gz',
 'tmdb_data_2001.csv.gz']

# Data Retrieval Loop

In [6]:
# Define list of years to search
search_years = [2000, 2001] 

In [7]:
for year in tqdm_notebook(search_years, desc='Searching Movies', position = 0):

    # Define file name for the selected year
    JSON_FILE = f"{FOLDER}tmdb_api_results_{year}.json"

    # Check for file existence
    file_exists = os.path.isfile(JSON_FILE)

    # If file does not exist
    if not file_exists:
        # Create a list holding one placeholder record with key 'imdb_id'
        print(f"{JSON_FILE} being created!")
        with open(JSON_FILE, 'w') as f:
            json.dump([{'imdb_id':0}], f)
    # If file exists, inform user
    else:
        print(f"{JSON_FILE} already exists!")
    
    # Create search data frame for only the selected search year
    # Pull out the ids for the movies to be searched
    search_df = basics.loc[ basics['startYear']==year ].copy()
    search_ids = search_df['tconst'].copy()
    
    # Load results json file as dataframe
    results_df = pd.read_json(JSON_FILE)
    
    # Filter out movies that are in the json file
    remaining_search_ids = search_ids[~search_ids.isin(results_df['imdb_id'])]
    
    print(f"{len(search_ids)} movies in search year {year}")
    print(f"{len(remaining_search_ids)} movies not yet found")
    
    not_found = 0
    ##########
    # Loop for API calls to retrieve data for the selected year
    for imdb_id in tqdm_notebook(remaining_search_ids,
                                desc=f"Movies from year {year}",
                                position=1,
                                leave=True):
        # Attempt to retrieve data and save it
        try:
            temp = tmdb_info_with_certification(imdb_id)
            write_json(temp, JSON_FILE)
            time.sleep(0.05)
        except Exception:
            # Count any failed lookup and move on to the next id
            not_found += 1
            continue
    ##########
    print(f"Unable to find {not_found} movies in search year {year}\n")
    
    # Create csv.gz file from the json file
    # Load json to dataframe
    tmdb_df = pd.read_json(JSON_FILE)
    # Drop the dummy zero row
    tmdb_df = tmdb_df[tmdb_df['imdb_id']!=0]
    # Save dataframe to csv.gz
    tmdb_df.to_csv(f"{FOLDER}tmdb_data_{year}.csv.gz",
                   compression='gzip',
                   index=False)
print("==========\nSearching Complete.")

Searching Movies:   0%|          | 0/2 [00:00<?, ?it/s]

Data/tmdb_api_results_2000.json already exists!
1405 movies in search year 2000
208 movies not yet found


Movies from year 2000:   0%|          | 0/208 [00:00<?, ?it/s]

Unable to find 208 movies
Data/tmdb_api_results_2001.json already exists!
1521 movies in search year 2001
242 movies not yet found


Movies from year 2001:   0%|          | 0/242 [00:00<?, ?it/s]

Unable to find 242 movies
Searching Complete.
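
The loop can resume after an interruption because ids already stored in the JSON file are filtered out before any API call is made. A minimal sketch of that filter follows, using invented ids and small in-memory stand-ins for `search_ids` and `results_df`.

```python
import pandas as pd

# Invented ids standing in for the 'tconst' column of the selected year.
search_ids = pd.Series(["tt0000001", "tt0000002", "tt0000003"], name="tconst")

# Stand-in for the results file: the placeholder row plus one retrieved movie.
results_df = pd.DataFrame({"imdb_id": [0, "tt0000002"]})

# Keep only the ids that do not appear in the stored results.
remaining_search_ids = search_ids[~search_ids.isin(results_df["imdb_id"])]

print(remaining_search_ids.tolist())
```

Note that ids whose lookup failed are never written to the file, so they stay in the remaining set and are retried on every run. This would explain why the run above reports the same number of movies as "not yet found" and "unable to find".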
