# A. Project Name:  IMDb Successful Movie.
- **Student Name:** Eduardo Galindez.
- **Coding Dojo Bootcamp:** Data Science.
  - **Stack:** Data Enrichment.
- **Date:** September 23rd, 2022.

# B. Project Objective
Our Stakeholder Wants More Data!
- After reviewing the preview of our data from Part A, the stakeholder realized that the IMDb data includes no financial information (e.g., budget or revenue).
    - Our stakeholder identified The Movie Database ([TMDB](https://www.themoviedb.org/)) as a great source of financial data. Thankfully, TMDB offers a free API for programmatic access to their data!
- The stakeholder wants us to extract the budget, revenue, and MPAA Rating (G/PG/PG-13/R), which is also called "Certification".

# C. Project Statement


### Specifications:

Our stakeholder would like you to extract and save the results for movies that meet all of the criteria established in Part A of the project.

As a proof of concept, they requested a test extraction of movies with a start year of 2000 or 2001.

Each year should be saved as a separate .csv.gz file.

Confirm that your API function works.

- To ensure our function for extracting movie data from TMDB is working, test it on these 2 movie ids: tt0848228 ("The Avengers") and tt0332280 ("The Notebook"). Make sure that the function runs without error and that it returns the correct movie's data for both test ids.

- Once you have retrieved and saved the final results to 2 separate .csv.gz files, move on to an Exploratory Data Analysis to explore the following questions.

### Exploratory Data Analysis
1. Load in the .csv.gz files of results for each year extracted.
 - Concatenate the data into 1 dataframe for the remainder of the analysis.
2. Once we have our data from the API, they would like us to perform some light EDA to show:
 - How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
 - Exclude any movies with 0's for budget AND revenue from the remaining visualizations.
 - How many movies are there in each of the certification categories (G/PG/PG-13/R)?
 - What is the average revenue per certification category?
 - What is the average budget per certification category?
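
The questions above can be sketched with pandas on a tiny hand-made table. The rows below are invented for illustration only (they are not TMDB results), and the column names `budget`, `revenue`, and `certification` are assumed to match the extracted data.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not real TMDB data).
toy_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003", "tt0000004"],
    "budget": [100, 0, 0, 50],
    "revenue": [300, 20, 0, 0],
    "certification": ["PG", "R", "R", None],
})

# Keep movies with some valid financial information (budget OR revenue > 0).
has_financials = (toy_df["budget"] > 0) | (toy_df["revenue"] > 0)
valid_df = toy_df.loc[has_financials]

# Number of movies per certification category (missing values are ignored).
counts = valid_df["certification"].value_counts()

# Average revenue and budget per certification category.
averages = valid_df.groupby("certification")[["revenue", "budget"]].mean()

print(len(valid_df))
print(counts)
print(averages)
```

Filtering with OR keeps a movie when either figure is known, which is the same as dropping only the rows where budget AND revenue are both 0.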

### Deliverable:

After we have joined the TMDB results into 1 dataframe in the EDA Notebook:

- Save a final merged .csv.gz of all of the TMDB API data.
- The file name should be "tmdb_results_combined.csv.gz".
- Make sure this is pushed to our GitHub repository along with all of the code.
- One code file for API calls.
- One code file for EDA.
- Submit the link.
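
A minimal sketch of the merge-and-save step, assuming the yearly files follow the `final_tmdb_data_{YEAR}.csv.gz` naming used later in this notebook. It writes to a temporary folder with made-up ids so that it does not touch the real `Data/` folder.

```python
import glob
import os
import tempfile

import pandas as pd

# Work in a temporary folder so the sketch does not touch the real Data/ folder.
folder = tempfile.mkdtemp()

# Hypothetical per-year results, saved as one .csv.gz per year.
yearly_ids = {2000: ["tt0000001"], 2001: ["tt0000002", "tt0000003"]}
for year, ids in yearly_ids.items():
    pd.DataFrame({"imdb_id": ids}).to_csv(
        os.path.join(folder, f"final_tmdb_data_{year}.csv.gz"),
        compression="gzip",
        index=False,
    )

# Load every yearly file and concatenate them into one dataframe.
files = sorted(glob.glob(os.path.join(folder, "final_tmdb_data_*.csv.gz")))
combined_df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Save the merged results under the file name the deliverable asks for.
combined_path = os.path.join(folder, "tmdb_results_combined.csv.gz")
combined_df.to_csv(combined_path, compression="gzip", index=False)
```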

# D. Project Development

## 1.- Libraries & Functions

In [2]:
# Libraries.
import numpy as np
import pandas as pd
import tmdbsimple as tmdb 
import matplotlib.pyplot as plt
import seaborn as sns
import os, time, json
os.makedirs('Data',exist_ok=True)

from tqdm.notebook import tqdm_notebook

In [3]:
# Function to get the certification.
def get_movie_certification(movie_id):
    movie = tmdb.Movies(movie_id)
    info = movie.info()
    releases = movie.releases()
    
    for c in releases['countries']:
        if c['iso_3166_1'] == "US":
            info['certification'] = c['certification']     
    return info
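
The lookup inside the function above can be checked without calling the API. The sketch below is not part of the original notebook: `extract_us_certification` is a hypothetical helper, and `sample_releases` is a hand-made payload that only mimics the `countries` list the function iterates over.

```python
def extract_us_certification(releases):
    """Return the first US certification in a releases payload, or None."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US":
            return country.get("certification")
    return None


# Hypothetical payload shaped like the dictionary iterated over above.
sample_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "12A"},
        {"iso_3166_1": "US", "certification": "PG-13"},
    ]
}

print(extract_us_certification(sample_releases))  # PG-13
print(extract_us_certification({"countries": []}))  # None
```

Unlike the loop above, this helper stops at the first US entry and returns `None` when there is none, which makes the missing case explicit.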

In [4]:
# Function to create our .json file.
##  Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/

def write_json(new_data, filename):
    """Append new_data (a record or a list of records) to the JSON list stored in filename."""
    with open(filename, 'r+') as file:
        # Load the existing data (a list of records).
        file_data = json.load(file)
        # Extend when both are lists; otherwise append a single record.
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move back to the start of the file before rewriting it.
        file.seek(0)
        # Convert back to JSON and write.
        json.dump(file_data, file)
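
The append-to-file pattern can be tried on a throwaway file. This is a sketch with a hypothetical name (`append_to_json`), not the notebook's own function. The `truncate()` call is an addition: it guards against leftover bytes if the rewritten content were ever shorter than the old content.

```python
import json
import os
import tempfile


def append_to_json(new_data, filename):
    """Append a record (or extend with a list of records) to a JSON list on disk."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Rewrite the file from the start and drop any leftover bytes.
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()


# Start from a file holding one placeholder record, as the extraction loop does.
path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(path, "w") as file:
    json.dump([{"imdb_id": 0}], file)

append_to_json({"imdb_id": "tt0000001"}, path)
append_to_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], path)

with open(path) as file:
    stored = json.load(file)
print(len(stored))  # 4
```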

## 2.-  Data & Connection

### 2.1.- API connection


In [5]:
# Loading API credentials.
with open('/Users/eduar/.secret/tmdb_api.json', 'r') as file:
    login = json.load(file)
login.keys()

dict_keys(['api-key'])

In [6]:
# Import credentials.
tmdb.API_KEY =  login['api-key']

In [7]:
# Checking the connection with 'The Avengers'.
the_avengers_movie = tmdb.Movies('tt0848228')
the_avengers_info = the_avengers_movie.info()
the_avengers_info['budget']

220000000

In [8]:
# Checking the connection with 'The Notebook'.
the_notebook_movie = tmdb.Movies('tt0332280')
the_notebook_info = the_notebook_movie.info()
the_notebook_info['budget']

29000000

### 2.2.- Mounting and loading data

In [9]:
# Specify folder for saving data.
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['.ipynb_checkpoints',
 'Chunk data per database',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'final_tmdb_data_2002.csv.gz',
 'final_tmdb_data_2003.csv.gz',
 'final_tmdb_data_2004.csv.gz',
 'final_tmdb_data_2006.csv.gz',
 'final_tmdb_data_2007.csv.gz',
 'final_tmdb_data_2008.csv.gz',
 'final_tmdb_data_2009.csv.gz',
 'final_tmdb_data_2010.csv.gz',
 'final_tmdb_data_2011.csv.gz',
 'final_tmdb_data_2012.csv.gz',
 'final_tmdb_data_2013.csv.gz',
 'final_tmdb_data_2014.csv.gz',
 'final_tmdb_data_2015.csv.gz',
 'final_tmdb_data_2016.csv.gz',
 'final_tmdb_data_2017.csv.gz',
 'final_tmdb_data_2018.csv.gz',
 'final_tmdb_data_2019.csv.gz',
 'final_tmdb_data_2020.csv.gz',
 'final_tmdb_data_2021.csv.gz',
 'final_tmdb_data_2022.csv.gz',
 'genres.csv.gz',
 'Original data',
 'title_akas_combined.csv.gz',
 'title_basics.csv.gz',
 'title_basics_combined.csv.gz',
 'title_genres.csv.gz',
 'title_ratings.csv.gz',
 'title_ratings_combined.csv.gz',
 'tmbd_data.csv.gz',
 'tmdb_api_result

In [10]:
# Load in the dataframe from Part A:
basics_df = pd.read_csv('./Data/title_basics_combined.csv.gz', low_memory = False)

In [11]:
# Create the required lists for our function.
YEARS_TO_GET = range(2000, 2023)
errors = [ ]

In [12]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    # Defining the JSON file to store results for year.
    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
    # Check if the file exists.
    file_exists = os.path.isfile(JSON_FILE)
    # If it does exist: notify me.
    if file_exists:
        print(f'{YEAR} {JSON_FILE} already exists.')
    # If it does not exist: create it.
    else:
        # Save a list with one placeholder record ("imdb_id": 0) to the new json file.
        with open(JSON_FILE,'w') as file:
            json.dump([{'imdb_id':0}], file)

    # Saving new year as the current df.        
    df = basics_df.loc[basics_df['startYear'] == YEAR].copy()
    # Saving movie ids to list.
    movie_ids = df['tconst'].copy()
    
    # Load existing data from json into a dataframe called "previous_df"
    previous_df = pd.read_json(JSON_FILE)
    
    # Filter out any ids that are already in the JSON_FILE.
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    
    # Start of INNER Loop (only runs for years whose JSON file did not already exist).
    if not file_exists:
        for movie_id in tqdm_notebook(movie_ids_to_get,
                                      desc=f'Movies from {YEAR}',
                                      position=1,
                                      leave=True):
            try:
                # Retrieve the data for the movie id.
                temp = get_movie_certification(movie_id)  
                # Append/extend results to existing file using a pre-made function.
                write_json(temp,JSON_FILE)
                # Short 20 ms sleep to prevent overwhelming server.
                time.sleep(0.02)

            except Exception as e:
                errors.append([movie_id, e])

        final_year_df = pd.read_json(JSON_FILE)
        final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)

print(f"- Total errors: {len(errors)}")

YEARS:   0%|          | 0/23 [00:00<?, ?it/s]

2000 Data/tmdb_api_results_2000.json already exists.
2001 Data/tmdb_api_results_2001.json already exists.
2002 Data/tmdb_api_results_2002.json already exists.
2003 Data/tmdb_api_results_2003.json already exists.
2004 Data/tmdb_api_results_2004.json already exists.
2005 Data/tmdb_api_results_2005.json already exists.
2006 Data/tmdb_api_results_2006.json already exists.
2007 Data/tmdb_api_results_2007.json already exists.
2008 Data/tmdb_api_results_2008.json already exists.
2009 Data/tmdb_api_results_2009.json already exists.
2010 Data/tmdb_api_results_2010.json already exists.
2011 Data/tmdb_api_results_2011.json already exists.
2012 Data/tmdb_api_results_2012.json already exists.
2013 Data/tmdb_api_results_2013.json already exists.
2014 Data/tmdb_api_results_2014.json already exists.
2015 Data/tmdb_api_results_2015.json already exists.
2016 Data/tmdb_api_results_2016.json already exists.
2017 Data/tmdb_api_results_2017.json already exists.
2018 Data/tmdb_api_results_2018.json already e

In [13]:
# Let's load data from 2000 & 2001.
#movies_from_2000_df = pd.read_csv('./Data/final_tmdb_data_2000.csv.gz', low_memory = False)
#movies_from_2001_df = pd.read_csv('./Data/final_tmdb_data_2001.csv.gz', low_memory = False)

In [14]:
# Let's concatenate both datasets.

## Upload new datasets.
#movies_from_2000 = pd.read_csv('./Data/final_tmdb_data_2000.csv.gz')
#movies_from_2001 = pd.read_csv('./Data/final_tmdb_data_2001.csv.gz')

## Concatenate them.
#movies_from_2000_and_2001_df = pd.concat([movies_from_2000, movies_from_2001])
#movies_from_2000_and_2001_df.head(5)

In [15]:
# Check for invalid data in 'imdb_id'.
#movies_from_2000_and_2001_df[movies_from_2000_and_2001_df['imdb_id'] == '0']

In [16]:
# Let's drop rows with values=0.
#movies_from_2000_and_2001_df.drop(index=movies_from_2000_df.index[0], axis=0, inplace=True)
#movies_from_2000_and_2001_df

In [17]:
# Have a general look of the data.
#movies_from_2000_and_2001_df.info()

**Notes:**
- At this point we have identified some missing data. Focusing on our target columns, there are no missing values in 'Budget' or 'Revenue', but 'Certification' has 1,701 missing values (~68%).
- Let's see how this could affect our analysis in Section 3.
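
A missing-value count and share like the ones quoted above can be computed as follows. This is a sketch on invented rows, assuming a column named `certification`.

```python
import pandas as pd

# Hypothetical rows; the real dataframe comes from the TMDB extraction.
sample_df = pd.DataFrame({
    "budget": [10, 0, 5, 0],
    "revenue": [0, 0, 7, 3],
    "certification": ["R", None, None, "PG"],
})

# isna() flags missing entries; sum() counts them and mean() gives their share.
missing_count = sample_df["certification"].isna().sum()
missing_share = sample_df["certification"].isna().mean()

print(f"{missing_count} missing values ({missing_share:.0%})")
```

Empty fields in a .csv.gz file are read by `pd.read_csv` as missing values by default, so the same check should apply after loading the saved results.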

In [18]:
# Statistical summary.
#movies_from_2000_and_2001_df.describe()

In [19]:
# Download our concatenated database.
#movies_from_2000_and_2001_df.to_csv(f'./Data/tmdb_results_combined.csv.gz', compression='gzip', index=False)

### 2.3.- Get Budget, Revenue and Certification per movie tested in Section 2.1.

#### 2.3.1.- The Avengers

#### 2.3.2.- The Notebook

## 3.- Visual Data Exploration

### 3.1.- How many movies had at least some valid financial information (values > 0 for budget OR revenue)?
Please exclude any movies with 0's for budget AND revenue from the remaining visualizations.

**Notes:**
- Even after filtering, our 'certification' column still has missing data.
- Filtering reduced the share of missing values to 12.6%.

### 3.2.- How many movies are there in each of the certification categories (G/PG/PG-13/R)?

### 3.3.- What is the average revenue per certification category?

### 3.4.- What is the average budget per certification category?

### 3.5.- What is the average ROI (Return on Investment) per certification category?
- In Section 2.3, we calculated the ROI of The Avengers and The Notebook.
- In this section, we analyze whether that financial performance is typical, in order to evaluate which MPAA rating (certification) tends to be most profitable.
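
A minimal sketch of an average-ROI calculation per certification. The ROI definition used here, (revenue - budget) / budget, is an assumption, since the formula is not stated in this notebook. The rows are invented for illustration only.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not real TMDB figures).
roi_df = pd.DataFrame({
    "certification": ["PG", "PG", "R", "R"],
    "budget": [100, 200, 100, 50],
    "revenue": [300, 200, 50, 100],
})

# Assumed definition: ROI = (revenue - budget) / budget.
# Rows with budget == 0 are dropped first to avoid dividing by zero.
roi_df = roi_df.loc[roi_df["budget"] > 0].copy()
roi_df["roi"] = (roi_df["revenue"] - roi_df["budget"]) / roi_df["budget"]

# Average ROI per certification category.
avg_roi = roi_df.groupby("certification")["roi"].mean()
print(avg_roi)
```

A mean ROI is sensitive to low-budget outliers, so comparing it with the median per category may give a more robust picture.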

# E. Conclusions

- In 2000 and 2001, R-rated movies accounted for almost 67% of the movies made, but the R rating was not among the top three in profitability.
- During the same time period, PG movies accounted for 8% of films made and were identified as the most profitable.
- Only 8.86% of projects had a positive ROI, which raises doubts about the data's reliability. A more in-depth study may be needed.