---
title: "Data Collection"
format:
    html: 
        code-fold: true
---


{{< include overview.qmd >}} 

{{< include methods.qmd >}} 

# Code 

Code for this webpage can be found [here](https://github.com/dsan-5000/project-dcorc7/blob/main/technical-details/data-collection/main.ipynb).

## Python Libraries and API Connections

The first step in the data collection process is to import the appropriate Python libraries and confirm that the two API connections work properly. The Python packages that I imported, and the reason for each, are as follows:

- **tmdbv3api:** To make using the TMDB API easier and with a simpler syntax
  
- **json:** To load API keys that are stored in a JSON file at a separate location
  
- **requests:** To obtain movie data from both APIs
  
- **pandas:** To work with and store retrieved data into pandas dataframes

In [1]:
from tmdbv3api import TMDb, Genre, Discover, Movie
import json
import requests
import pandas as pd

# Load the TMDB and OMDB API keys from a local JSON file
with open("/Users/DCorc/OneDrive/Documents/Georgetown/1-Fall-2024/DSAN-5000-Data-Science-and-Analytics/Project/movie-api-keys.json") as f:
    keys = json.load(f)

# Configure the TMDB client with its API key
tmdb = TMDb()
API_KEY_TMDB = keys["TMDB"]
API_KEY_OMDB = keys["OMDB"]
tmdb.api_key = API_KEY_TMDB

movie = Movie()
genre = Genre()
discover = Discover()
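
The key-loading step above depends on an absolute path that exists on one machine only. As a minimal sketch, assuming a hypothetical environment variable `MOVIE_API_KEYS_PATH` and a hypothetical helper `load_api_keys` (neither is part of the original notebook), the same JSON file could be located and validated like this:

```python
import json
import os
from pathlib import Path


def load_api_keys(path=None, required=("TMDB", "OMDB")):
    """Load API keys from a JSON file and check the expected entries exist.

    MOVIE_API_KEYS_PATH is a hypothetical environment variable used only
    for illustration; the original notebook hard-codes the path instead.
    """
    key_path = Path(path or os.environ["MOVIE_API_KEYS_PATH"])
    with key_path.open() as f:
        keys = json.load(f)

    # Fail early if a key is absent or empty, before any API call is made
    missing = [name for name in required if not keys.get(name)]
    if missing:
        raise KeyError(f"Missing API keys in {key_path}: {missing}")
    return keys
```

Checking for both keys up front means a missing entry surfaces immediately, rather than partway through a long collection loop.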

## Accessing the TMDB API

My method of accessing and retrieving data on movies within TMDB can be broken down into four steps:

- Create a list of all genres available in the database and loop through each genre

- Within the genre loop, create another loop to cycle through a predetermined set of pages in which the movies will be ordered from highest to lowest total box office revenue. In my case, I found that 20 pages was sufficient.

- Within both of the previously mentioned loops, create another loop to sift through all movies on each page, create a dictionary populated with selected movie attributes, and append each dictionary to a common list.

- Use Pandas to convert the list full of movie dictionaries into a dataframe
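
The loop structure in these steps can be sketched without network access. In the sketch below, `collect_movies`, `fetch_page`, and the stubbed data are hypothetical stand-ins for illustration, not the project's actual code; the real API calls appear in the cell that follows.

```python
def collect_movies(genres, fetch_page, page_limit=20):
    """Walk genres -> pages -> movies and return one dict per movie.

    `fetch_page(genre_id, page)` is a hypothetical stand-in for the API
    call; it should return a list of movie dicts, or an empty list when
    there are no more results.
    """
    rows = []
    for genre in genres:
        for page in range(1, page_limit + 1):
            movies = fetch_page(genre["id"], page)
            if not movies:
                break  # no more pages for this genre
            for movie in movies:
                rows.append({"Title": movie["title"], "Genre": genre["name"]})
    return rows


# Stubbed data so the sketch runs offline; IDs and titles are illustrative.
fake_genres = [{"id": 28, "name": "Action"}, {"id": 35, "name": "Comedy"}]


def fake_fetch_page(genre_id, page):
    # Two pages of two movies each, then nothing.
    if page > 2:
        return []
    return [{"title": f"movie-{genre_id}-{page}-{i}"} for i in range(2)]


rows = collect_movies(fake_genres, fake_fetch_page, page_limit=20)
print(len(rows))  # 2 genres * 2 pages * 2 movies = 8
```

The resulting list of dictionaries is what `pd.DataFrame(...)` consumes in the fourth step. The 400 movies per genre reported later on this page are consistent with 20 pages of 20 results each.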

In [2]:
# Store all available genres in a variable
all_genres = genre.movie_list()

# Establish a page limit to search
page_limit = 20

# Create a blank movie list
movie_list = []

# Loop through all genres
for g in all_genres:
    genre_id = g["id"]
    genre_name = g["name"]
    
    # Loop through the amount of pages previously established
    for page in range(1, page_limit + 1):
        movies = discover.discover_movies({
            "with_genres": genre_id,
            "sort_by": "revenue.desc",
            "page": page,
            "include_adult": False
        })
        
        if not movies:
            break
        
        # Pull movies from each page and up to 20 pages within each genre
        for movie in movies:

            TMDB_url = f"https://api.themoviedb.org/3/movie/{movie.id}?api_key={API_KEY_TMDB}"

            age_rating_url = f"https://api.themoviedb.org/3/movie/{movie.id}/release_dates?api_key={API_KEY_TMDB}"

            keywords_url = f"https://api.themoviedb.org/3/movie/{movie.id}/keywords?api_key={API_KEY_TMDB}"

            # Request general movie details, as well as the fields that aren't pulled with the regular url
            movie_details = requests.get(TMDB_url).json()
            age_rating_response = requests.get(age_rating_url).json()
            keywords_response = requests.get(keywords_url).json()

            # Loop through age ratings for countries to determine if there is a US rating
            age_rating = None
            for country in age_rating_response.get("results", []):
                if country["iso_3166_1"] == "US":  # Change to desired country code if needed
                    age_rating = country["release_dates"][0].get("certification", None)
                    break


            # Extract keywords from the response
            keywords = [kw["name"] for kw in keywords_response.get("keywords", [])]

            # Put data for each movie in a dictionary
            movie_data = {
                "IMDB_ID": movie_details.get("imdb_id", None),
                "Title": movie.title,
                "Release_Date": movie_details.get("release_date", None),
                "Age_Rating": age_rating,
                "Overview": movie.overview,
                "Popularity": movie.popularity,
                "Genre": genre_name,
                "TMDB_Rating": movie.vote_average,
                "Budget": movie_details.get("budget", None),
                "Revenue": movie_details.get("revenue", None),
                "Keywords": keywords
            }

            # Append the movie data to the movies list
            movie_list.append(movie_data)

# Create a movie df from the movie list
columns = ["IMDB_ID", "Title", "Release_Date", "Age_Rating", "Overview", "Popularity", "Genre", "TMDB_Rating", "Budget", "Revenue", "Keywords"]
TMDB_movies_df = pd.DataFrame(movie_list, columns = columns)
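
The age-rating loop above reads only the first US release entry, so a movie whose first entry carries an empty certification string ends up with a blank rating even if a later entry has one. A possible variant, sketched under the assumption that the response has the same `results` / `release_dates` / `certification` shape the loop already relies on, returns the first non-empty certification instead. The helper name `first_us_certification` and the sample data are hypothetical.

```python
def first_us_certification(release_dates_response, country_code="US"):
    """Return the first non-empty certification for a country, else None.

    Assumes the response shape used in the loop above: a "results" list of
    per-country entries, each holding a "release_dates" list.
    """
    for country in release_dates_response.get("results", []):
        if country.get("iso_3166_1") != country_code:
            continue
        for release in country.get("release_dates", []):
            certification = release.get("certification")
            if certification:
                return certification
        return None  # country found, but no entry had a certification
    return None  # country not present in the response


# Illustrative sample, not real API output
sample = {
    "results": [
        {"iso_3166_1": "GB", "release_dates": [{"certification": "12A"}]},
        {
            "iso_3166_1": "US",
            "release_dates": [{"certification": ""}, {"certification": "PG-13"}],
        },
    ]
}
print(first_us_certification(sample))  # PG-13
```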

Below are some details about the newly created `TMDB_movies_df`:

In [4]:
print(f"Total Movie Count: {len(TMDB_movies_df)}\n")

print(f"Raw Dataset Shape: {TMDB_movies_df.shape}\n")

pd.set_option("display.max_columns", None)
TMDB_movies_df.head(5)

Total Movie Count: 7600

Raw Dataset Shape: (7600, 11)



| | IMDB_ID | Title | Release_Date | Age_Rating | Overview | Popularity | Genre | TMDB_Rating | Budget | Revenue | Keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0499549 | Avatar | 2009-12-15 | PG-13 | In the 22nd century, a paraplegic Marine is di... | 122.583 | Action | 7.583 | 237000000 | 2923706026 | [paraplegic, attachment to nature, culture cla... |
| 1 | tt4154796 | Avengers: Endgame | 2019-04-24 | PG-13 | After the devastating events of Avengers: Infi... | 114.534 | Action | 8.2 | 356000000 | 2799439100 | [superhero, time travel, space travel, time ma... |
| 2 | tt1630029 | Avatar: The Way of Water | 2022-12-14 | PG-13 | Set more than a decade after the events of the... | 137.213 | Action | 7.62 | 460000000 | 2320250281 | [dying and death, loss of loved one, alien lif... |
| 3 | tt2488496 | Star Wars: The Force Awakens | 2015-12-15 | | Thirty years after defeating the Galactic Empi... | 58.302 | Action | 7.272 | 245000000 | 2068223624 | [android, spacecraft, space opera] |
| 4 | tt4154756 | Avengers: Infinity War | 2018-04-25 | PG-13 | As the Avengers and their allies have continue... | 283.091 | Action | 8.2 | 300000000 | 2052415039 | [sacrifice, magic, superhero, based on comic, ... |


## Accessing the OMDB API

Accessing the OMDB API works slightly differently. Because a dataframe of TMDB movie details already exists, a loop queries the OMDB API by the IMDb ID of each previously retrieved TMDB movie. If a movie, or a movie attribute, does not exist within the OMDB API, it is returned as `None`. Because this process usually takes around 30 minutes, a counter tracks the loop and prints a message at every 10% of progress so that the user knows how far along it is. The additional OMDB data is stored the same way as the TMDB data: as dictionaries inside one common list. After the loop completes, the list is converted to a Pandas dataframe and concatenated column-wise with the TMDB dataframe, aligned by index. Lastly, the combined dataframe is saved as a CSV file, ready for cleaning.

In [3]:
import time

# Function used to request the additional data from OMDB API
def additional_omdb_data(parameter):
    url = f"http://www.omdbapi.com/?apikey={API_KEY_OMDB}&i={parameter}"

    response = requests.get(url)
    data = response.json()

    if data.get("Response") == "True":

        ratings = data.get("Ratings", [])
        rotten_tomatoes_score = next((r["Value"] for r in ratings if r["Source"] == "Rotten Tomatoes"), None)

        return {
            "Year": data.get("Year", None),
            "Director": data.get("Director", None),
            "Actors": data.get("Actors", None),
            "Runtime": data.get("Runtime", None),
            "Awards": data.get("Awards", None),
            "Metascore_Rating": data.get("Metascore", None),
            "IMDB_Rating": data.get("imdbRating", None),
            "Rotten_Tomatoes_Rating": rotten_tomatoes_score
        }
    
    else:
        return {"Year": None, "Director": None, "Actors": None, "Runtime": None, "Awards": None, "Metascore_Rating": None, 
                "IMDB_Rating": None, "Rotten_Tomatoes_Rating": None}


# Variable to keep track of the df length
df_length = len(TMDB_movies_df)

# Loop counter to keep track of how far along the for loop is
loop_counter = 1

# Empty list to hold the newly obtained data
additional_data = []

# Loop through all IMDb IDs in TMDB_movies_df to add OMDB data
for imdb_id in TMDB_movies_df["IMDB_ID"]:
    # Append new data into the list using the previously created function for OMDB
    additional_data.append(additional_omdb_data(imdb_id))

    # Print percentage complete
    percent_complete = loop_counter / df_length
    if loop_counter % (df_length // 10) == 0:
        print(f"Percent Complete: {percent_complete * 100}%")

    loop_counter += 1


# Convert the additional data to a DataFrame
additional_df = pd.DataFrame(additional_data)

# Concatenate the new columns onto the existing DataFrame, aligned by index
movies_df = pd.concat([TMDB_movies_df, additional_df], axis = 1)

# Save the combined raw data for the cleaning stage
movies_df.to_csv("../../data/raw-data/movies.csv", index = False)


Percent Complete: 10.0%
Percent Complete: 20.0%
Percent Complete: 30.0%
Percent Complete: 40.0%
Percent Complete: 50.0%
Percent Complete: 60.0%
Percent Complete: 70.0%
Percent Complete: 80.0%
Percent Complete: 90.0%
Percent Complete: 100.0%
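
The progress check in the loop above uses `loop_counter % (df_length // 10)`, which would divide by zero on a dataframe with fewer than ten rows, and the loop sends a request even when a movie has no IMDb ID. A minimal sketch of one alternative follows; `enrich_with_progress`, `EMPTY_OMDB_ROW`, and the injected `fetch` callable are hypothetical names for illustration, with `fetch` standing in for `additional_omdb_data`.

```python
EMPTY_OMDB_ROW = {
    "Year": None, "Director": None, "Actors": None, "Runtime": None,
    "Awards": None, "Metascore_Rating": None, "IMDB_Rating": None,
    "Rotten_Tomatoes_Rating": None,
}


def enrich_with_progress(imdb_ids, fetch, report=print):
    """Call `fetch` once per valid ID and report progress about every 10%.

    `fetch` is a stand-in for the OMDB request function; IDs that are not
    non-empty strings (None, NaN, "") get an empty row without a request.
    """
    ids = list(imdb_ids)
    total = len(ids)
    rows = []
    next_threshold = 10
    for done, imdb_id in enumerate(ids, start=1):
        if isinstance(imdb_id, str) and imdb_id:
            rows.append(fetch(imdb_id))
        else:
            rows.append(dict(EMPTY_OMDB_ROW))

        # Integer percentage; works for any non-empty input length
        percent = done * 100 // total
        if percent >= next_threshold:
            report(f"Percent Complete: {percent}%")
            next_threshold = (percent // 10 + 1) * 10
    return rows


# Usage with a fake fetch function, so no network access is needed
messages = []
result = enrich_with_progress(
    ["tt0000001", None, "tt0000003", "tt0000004"],
    fetch=lambda imdb_id: {**EMPTY_OMDB_ROW, "Year": "2000"},
    report=messages.append,
)
```

Passing `fetch` and `report` in as arguments keeps the loop testable without calling the API. Each row still lines up with its source row by position, which the index-based concatenation depends on.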


## Previewing the Raw Dataframe

After retrieving the data from both APIs, the total movie count, shape of the finalized dataframe, counts per genre, and a preview of the first 5 movies are shown below.

In [5]:
print(f"Total Movie Count: {len(movies_df)}\n")

print(f"Raw Dataset Shape: {movies_df.shape}\n")

# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"{movies_df['Genre'].value_counts()}\n")

pd.set_option("display.max_columns", None)
movies_df.head(5)

Total Movie Count: 7600

Raw Dataset Shape: (7600, 19)

Genre
Action             400
Adventure          400
Animation          400
Comedy             400
Crime              400
Documentary        400
Drama              400
Family             400
Fantasy            400
History            400
Horror             400
Music              400
Mystery            400
Romance            400
Science Fiction    400
TV Movie           400
Thriller           400
War                400
Western            400
Name: count, dtype: int64



| | IMDB_ID | Title | Release_Date | Age_Rating | Overview | Popularity | Genre | TMDB_Rating | Budget | Revenue | Keywords | Year | Director | Actors | Runtime | Awards | Metascore_Rating | IMDB_Rating | Rotten_Tomatoes_Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0499549 | Avatar | 2009-12-15 | PG-13 | In the 22nd century, a paraplegic Marine is di... | 122.583 | Action | 7.583 | 237000000 | 2923706026 | [paraplegic, attachment to nature, culture cla... | 2009 | James Cameron | Sam Worthington, Zoe Saldana, Sigourney Weaver | 162 min | Won 3 Oscars. 91 wins & 131 nominations total | 83 | 7.9 | 81% |
| 1 | tt4154796 | Avengers: Endgame | 2019-04-24 | PG-13 | After the devastating events of Avengers: Infi... | 114.534 | Action | 8.2 | 356000000 | 2799439100 | [superhero, time travel, space travel, time ma... | 2019 | Anthony Russo, Joe Russo | Robert Downey Jr., Chris Evans, Mark Ruffalo | 181 min | Nominated for 1 Oscar. 70 wins & 133 nominatio... | 78 | 8.4 | 94% |
| 2 | tt1630029 | Avatar: The Way of Water | 2022-12-14 | PG-13 | Set more than a decade after the events of the... | 137.213 | Action | 7.62 | 460000000 | 2320250281 | [dying and death, loss of loved one, alien lif... | 2022 | James Cameron | Sam Worthington, Zoe Saldana, Sigourney Weaver | 192 min | Won 1 Oscar. 75 wins & 152 nominations total | 67 | 7.5 | 76% |
| 3 | tt2488496 | Star Wars: The Force Awakens | 2015-12-15 | | Thirty years after defeating the Galactic Empi... | 58.302 | Action | 7.272 | 245000000 | 2068223624 | [android, spacecraft, space opera] | 2015 | J.J. Abrams | Daisy Ridley, John Boyega, Oscar Isaac | 138 min | Nominated for 5 Oscars. 64 wins & 140 nominati... | 80 | 7.8 | 93% |
| 4 | tt4154756 | Avengers: Infinity War | 2018-04-25 | PG-13 | As the Avengers and their allies have continue... | 283.091 | Action | 8.2 | 300000000 | 2052415039 | [sacrifice, magic, superhero, based on comic, ... | 2018 | Anthony Russo, Joe Russo | Robert Downey Jr., Chris Hemsworth, Mark Ruffalo | 149 min | Nominated for 1 Oscar. 48 wins & 81 nomination... | 68 | 8.4 | 85% |


{{< include closing.qmd >}} 