**TMDB API (Practice)** 

This practice assignment will reinforce important learning objectives from the previous lesson(s), and allow you to take on more challenging core assignments, preparing you for graduation.

Practice and tinker with this assignment until you're comfortable performing each of the tasks. Then, be sure to submit your output as described in the steps below.

**Project Planning**

As discussed in the previous lesson, for the next part of your project, you will extract financial and certification data from TMDB's API for your IMDB data set. You will use an OUTER and INNER loop: a loop within a loop!

The OUTER loop will loop through the start years included in the IMDB data, filter the title basics data for the selected year, and save the list of movie ids from that year to retrieve in the inner loop.

The INNER loop loops through every movie id from the selected year, extracts its results from the TMDB API, and appends them to a JSON file.
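The two-loop plan described above can be outlined as follows. This is only a sketch under stated assumptions: `fetch_movie` is a hypothetical stand-in for the real TMDB call, and the toy `ids_by_year` dictionary (with made-up ids) stands in for the filtered title basics data.

```python
# Hypothetical stand-in for the real API call (no network access here)
def fetch_movie(movie_id):
    return {"imdb_id": movie_id}

# Toy data standing in for the title basics table (made-up ids)
ids_by_year = {2000: ["tt0000001", "tt0000002"], 2001: ["tt0000003"]}

results_by_year = {}

# OUTER loop: one pass per start year
for year, movie_ids in ids_by_year.items():
    year_results = []
    # INNER loop: one API call per movie id from that year
    for movie_id in movie_ids:
        year_results.append(fetch_movie(movie_id))
    results_by_year[year] = year_results
```

In the real project the inner loop writes each result to that year's JSON file instead of keeping everything in memory, so an interrupted run can pick up where it stopped.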

# **For this practice assignment**

You will be practicing the inner loop of API calls for a single year's list of movies from your IMDB title basics data. Specifically, you will extract the API results for every movie with a startYear of 2000.

* **Read the instructions below, including the examples in the "Getting Started" section, before starting your work.**

* **Create a new notebook in your project repository called "Practicing TMDB API calls".**

**Preparation BEFORE the loop**
* Designate a folder to save your information.
* Define custom functions you will use for your API calls
* Load your cleaned title basics data from Part 1 of Project 2 (or query your title_basics table in your MySQL database).
* Define the year you wish to retrieve (2000) and create an empty list for appending error messages.

**Prepare the DataFrame and JSON File**
* **Use the selected year to define filenames and filter the data**
    1. Define a JSON_FILE filename to save the results in progress.
    2. Check if the file exists.
        * if it does not exist, create the JSON file using with open(...), containing only a placeholder record with the key "imdb_id"
        * if it exists, print a message saying that it already exists.

***Now that the JSON file for the results in progress exists:***
* Filter the IMDB title basics data for the selected year and save the movie IDs from that year as "movie_ids".
* Check the JSON file for previously downloaded movie IDs and filter out the movie IDs that already exist in the JSON file (to prevent duplicate API calls) by:
    * Loading the contents of the JSON file with pd.read_json.
    * Comparing the IDs found in the JSON file to your saved "movie_ids".
    * Saving the final list of "movie_ids_to_get" by filtering out movies that already exist in the JSON file.
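The idea behind these steps is a set difference: ids already in the file are skipped. The sketch below shows that idea with only the standard library, as an illustration rather than the assignment's pandas approach. The file name, the ids, and the `already_saved` variable are all made up for the demo.

```python
import json
import os
import tempfile

# Hypothetical file name and ids, used only for this illustration
json_file = os.path.join(tempfile.mkdtemp(), "tmdb_api_results_demo.json")

# Pretend an earlier run saved the placeholder record plus one movie
with open(json_file, "w") as f:
    json.dump([{"imdb_id": 0}, {"imdb_id": "tt0000001"}], f)

movie_ids = ["tt0000001", "tt0000002", "tt0000003"]

# Collect the ids that are already saved in the file
with open(json_file) as f:
    already_saved = {record["imdb_id"] for record in json.load(f)}

# Keep only the ids that still need an API call
movie_ids_to_get = [m for m in movie_ids if m not in already_saved]
print(movie_ids_to_get)
```

Because the check runs against the file on disk, rerunning the notebook after an interruption only requests the movies that are still missing.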

**Perform the Loop of API Calls**

Note: you have already written a function to combine the certification with the rest of the .info() from the TMDB API results in the Intro to TMDB API lesson.
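As a reminder of what that function does with the releases data, here is a small offline sketch. It assumes the releases dictionary has the shape the Getting Started code reads (a "countries" list whose entries carry "iso_3166_1" and "certification"). The sample values are made up, and `us_certification` is a hypothetical helper, not part of tmdbsimple.

```python
def us_certification(releases, default=None):
    """Return the US certification from a releases dict, or `default`."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US":
            return country.get("certification", default)
    return default

# Made-up sample shaped like the releases data described above
sample_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "15"},
        {"iso_3166_1": "US", "certification": "R"},
    ]
}

movie_info = {"id": 1, "title": "Example"}
movie_info["certification"] = us_certification(sample_releases)
```

One design difference worth noting: this variant always sets the "certification" key (to None when no US entry exists), which keeps every saved record's columns consistent.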

**Create a loop to make API calls for each id** in the YEAR specified. Include a progress bar using tqdm_notebook.

***For each movie id:***
* Retrieve the dictionary of results for the current ID from the API
* Append the new results to the list from the JSON file
* Save the updated JSON file back to the disk

**Save the Results to Compressed .csv**
* **After the loop**, save the final results for the year as a csv.gz file with the year in the filename.

Note: at this point, you'll have completed the inner loop that you will need for the next part of the project.

# **Getting Started**

## **Preparation BEFORE the Loop:**

**Install, Import packages**

In [1]:
pip install tmdbsimple

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import packages
import os, time, json
import pandas as pd

In [3]:
from tqdm.notebook import tqdm_notebook

In [4]:
import tmdbsimple as tmdb
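
tmdbsimple needs an API key before any call will succeed, and it is set through `tmdb.API_KEY`. The cells shown in this section do not set one, so the sketch below shows one possible way to load it. The JSON file layout, the "api-key" field name, and the `load_api_key` helper are assumptions for illustration; use whatever you set up in the previous lesson.

```python
import json
import os
import tempfile

def load_api_key(path, key_name="api-key"):
    """Read an API key from a JSON file shaped like {"api-key": "..."}."""
    with open(path) as f:
        return json.load(f)[key_name]

# Demo with a throwaway file and a fake key, so this block runs on its own
demo_path = os.path.join(tempfile.mkdtemp(), "tmdb_api.json")
with open(demo_path, "w") as f:
    json.dump({"api-key": "not-a-real-key"}, f)

api_key = load_api_key(demo_path)

# In the notebook you would point load_api_key at your own file and then set:
# tmdb.API_KEY = api_key
```

Keeping the key in a file outside the repository means it never ends up in a committed notebook.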

**Designate a folder**

You will save API call data in the data folder you created for project Part 1.

In [5]:
# Create the folder for saving files (if it doesn't exist)
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['title-akas-us-only.csv',
 'title-basics-us-only.csv',
 'title-ratings-us-only.csv',
 'title.basics.tsv.gz',
 'title.ratings.tsv.gz']

In [6]:
# If you created the data folder for part 1, you will see your csv files listed here. If not, it will just be empty [].

In [7]:
# Define your functions
def get_movie_with_rating(movie_id):
    # Get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    
    # Save the .info() and .releases() dictionaries
    movie_info = movie.info()
    releases = movie.releases()
    
    # Loop through countries in releases
    for c in releases['countries']:
        # If the country abbreviation is US
        if c['iso_3166_1'] == 'US':
            # Save a "certification" key in the info dict with the certification
            movie_info['certification'] = c['certification']
    return movie_info


def write_json(new_data, filename): 
    """Appends a list of records (new_data) to a json file (filename). 
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""  
    
    with open(filename,'r+') as file:
        # First, load the existing data (a list of records).
        file_data = json.load(file)
        # Extend if given a list of records; otherwise append the single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Sets file's current position at offset.
        file.seek(0)
        # convert back to json.
        json.dump(file_data, file)
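
To see what write_json does to the file, here is a small run against a temporary file. The function is repeated in condensed form so the block runs on its own. The `file.truncate()` call is an addition not present in the cell above (a precaution in case the rewritten content is ever shorter than the old content), and the file name and records are made up.

```python
import json
import os
import tempfile

def write_json(new_data, filename):
    """Append one record, or extend with a list of records, in a JSON list file."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()  # drop any leftover bytes from the old content

# Start from a file holding only the placeholder record
demo_file = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_file, "w") as f:
    json.dump([{"imdb_id": 0}], f)

# A single dict is appended; a list of dicts extends the saved list
write_json({"imdb_id": "tt0000001"}, demo_file)
write_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_file)

with open(demo_file) as f:
    saved = json.load(f)
print(len(saved))
```

Note that the whole file is read and rewritten on every call. That is simple and safe for a few thousand records, but it slows down as the file grows.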

In [8]:
# Load the cleaned Title Basics (from Part 1)
basics = pd.read_csv('Data/title-basics-us-only.csv')
basics

Unnamed: 0.1,Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,34802,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,61114,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,67666,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
3,86793,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"
4,93930,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002.0,,126,Drama
...,...,...,...,...,...,...,...,...,...,...
86974,10016149,tt9914942,movie,Life Without Sara Amat,La vida sense la Sara Amat,0,2019.0,,74,Drama
86975,10016544,tt9915872,movie,The Last White Witch,My Girlfriend is a Wizard,0,2019.0,,97,"Comedy,Drama,Fantasy"
86976,10016684,tt9916170,movie,The Rehearsal,O Ensaio,0,2019.0,,51,Drama
86977,10016693,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller"


In [9]:
# Set the year to filter for
YEAR = 2000

# Create an empty list for saving errors
errors = []

In [10]:
# Define the JSON file to store results for the year
JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'


# Check if the JSON file exists
file_exists = os.path.isfile(JSON_FILE)

# If it does not exist: create it
if not file_exists:
    print(f"Creating {JSON_FILE} for API results for year={YEAR}.")
    
    # Save a list holding one placeholder record with just "imdb_id" to the new json file.
    with open(JSON_FILE,'w') as f:
        json.dump([{'imdb_id':0}],f)

# If it exists, print a message
else:
    print(f'The file {JSON_FILE} already exists.')

Creating Data/tmdb_api_results_2000.json for API results for year=2000.


In [11]:
# Filter for movies from the selected startYear
df = basics.loc[basics['startYear'] == YEAR].copy()
# Save the movie ids (a pandas Series)
movie_ids = df['tconst']
movie_ids.head()

8     tt0113026
9     tt0113092
11    tt0115937
12    tt0116391
13    tt0116628
Name: tconst, dtype: object

In [12]:
# Load existing data from json into a dataframe called "previous_df"
previous_df = pd.read_json(JSON_FILE)
previous_df

Unnamed: 0,imdb_id
0,0


In [13]:
# filter out any ids that are already in the JSON_FILE
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

In [14]:
# Loop through movie_ids_to_get with a tqdm progress bar
for movie_id in tqdm_notebook(movie_ids_to_get, f"Movies from {YEAR}"):

    # Attempt to retrieve the data for the movie id
    try:
        temp = get_movie_with_rating(movie_id)  # This uses your pre-made function
        # Append/extend results to existing file using a pre-made function
        write_json(temp,JSON_FILE)
        # Short 20 ms sleep to prevent overwhelming server
        time.sleep(0.02)

    # If it fails, save the movie id and the exception to the errors list
    except Exception as e:
        errors.append([movie_id, e])

Movies from 2000:   0%|          | 0/1457 [00:00<?, ?it/s]

In [15]:
print(f"- Total errors: {len(errors)}")

- Total errors: 1457
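
When the error count equals the number of movies, as in the output above, every call failed. That usually points to one shared cause rather than many separate problems. An API key that was never set is one possibility, though the output alone does not say. Counting the saved exceptions by type is a quick way to find out. The sketch below uses a made-up `demo_errors` list in the same [movie_id, exception] shape as the loop's `errors` list.

```python
from collections import Counter

# Made-up stand-in for the `errors` list built in the loop: [movie_id, exception]
demo_errors = [
    ["tt0000001", KeyError("countries")],
    ["tt0000002", ValueError("bad id")],
    ["tt0000003", KeyError("countries")],
]

# Count how often each exception type occurred
error_counts = Counter(type(exc).__name__ for _, exc in demo_errors)
for name, count in error_counts.most_common():
    print(f"{name}: {count}")
```

If a single exception type accounts for nearly all failures, printing one of those exceptions in full is usually enough to identify the fix.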


In [16]:
# Save the final results to a csv.gz file
final_year_df = pd.read_json(JSON_FILE)

csv_fname = f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz"
final_year_df.to_csv(csv_fname, compression="gzip", index=False)
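
The JSON file still starts with the placeholder record ({"imdb_id": 0}) written when the file was created, so it is worth dropping that row before saving. With the DataFrame above, a filter such as `final_year_df[final_year_df["imdb_id"] != 0]` should do it. The sketch below shows the same idea offline with only the standard library, using made-up records and temporary file names.

```python
import csv
import gzip
import json
import os
import tempfile

folder = tempfile.mkdtemp()
json_file = os.path.join(folder, "tmdb_api_results_demo.json")
csv_fname = os.path.join(folder, "final_tmdb_data_demo.csv.gz")

# Made-up file contents: the placeholder plus two fake records
with open(json_file, "w") as f:
    json.dump([{"imdb_id": 0},
               {"imdb_id": "tt0000001", "budget": 100},
               {"imdb_id": "tt0000002", "budget": 0}], f)

with open(json_file) as f:
    records = json.load(f)

# Drop the placeholder record that was written when the file was created
records = [r for r in records if r.get("imdb_id") != 0]

# Write the remaining records as a gzip-compressed CSV
fieldnames = sorted({key for r in records for key in r})
with gzip.open(csv_fname, "wt", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)

# Read the file back to confirm it round-trips
with gzip.open(csv_fname, "rt", newline="") as f:
    rows = list(csv.DictReader(f))
```

Reading the compressed file back right after writing it is a cheap check that the saved output is usable in the next part of the project.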