# READ BEFORE EXECUTING
https://www.themoviedb.org/
* *You will need to request an api key from the above website*
* *You will need to create a config.py file* to hold your api key for TMDB. 
    * Have it read tmdb_key = "YOUR KEY"
* *You will need to unzip "movie_title.csv.zip".* 
    * movie_title.csv was too large for me to push. 
    * You can store it in you local branch, then delete it once you are done with it. 
    * When you unzip it, the csv file will populate in the same folder (Resources) as the zip...which is where the script will think it is.
<br><br>
* "PUT YOUR YEAR HERE"
    * This is in the 3rd cell down. *You will need to update this with your assigned year.*
* You will need to do two pulls: 
    * 1) to get the TMDB ID
    * 2) to get the movie info based off the TMBD ID.
    * This will take a while
        * I'm guessing let it run and check back in an hour to make sure there are now errors
        * If there are no errors, it should be able to run clean through.

#### IF THERE ARE ERRORS
* Please take a screenshot of the ENTIRE error message
* For Pull 1:
    * print(response_tmdb_id.count())
    * print(error_count.count())
    * Add the 2 count numbers together
    * print(movies[THE NUMBER HERE])
        * That should be the title that thew it off
* For Pull 2:
    * Please go into the cell below where the CSV is saved and run it
    * Open it up and see what the last title that rna successfully and see what comes next
        * That should be the title that thew it off

#### Make sure to zip all 3 CSVs this script will create. 
* We need all 3 so we can tell what was dropped when/where.
* If they are not zipped, there might be issues pushing to the main branch due to size. You can right click and there should be a compress or zip option.

### FOR THE LOVE OF ALL THAT IS GOOD AND HOLY MAKE SURE YOU ARE ON YOUR BRANCH OF GIT (please and thank you!)

In [26]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import requests
from config import tmdb_key

In [27]:
file_path = 'Resources/movie_title.csv'

In [28]:
# *****2015*****

year = 2015 # <------ YOUR YEAR GOES HERE!!!!!

In [31]:
movie_titles_df = pd.read_csv(file_path)
movie_titles_df.head()

movies_df = movie_titles_df[['tconst', 'primaryTitle', 'startYear']]
movies_df = movies_df[(movies_df.startYear == year)]

beginning_number = movies_df.tconst.count()
movies_df.head()

Unnamed: 0,tconst,primaryTitle,startYear
10,tt0191476,Fed Up,2015
15,tt0283440,Short Time Heroes,2015
16,tt0297400,Snowblind,2015
23,tt0337926,Chatô - The King of Brazil,2015
25,tt0346045,Transeúntes,2015


# Making the dataframe into a list + editing for error prevention

In [32]:
# change title name to have + instead of ' '
movies_df['primaryTitle'] = movies_df['primaryTitle'].str.replace(" ", "+")


# ******Error 1: need to remove # from the beginning of titles for TMDB to work


# variable cause starswith() wasn't happy with '#'
pound_sign = '#'

# make dataframe for pound sign = True (startswith() returns True/False)
replace_pound_df = movies_df.iloc[:, 0:3]
replace_pound_df.primaryTitle = replace_pound_df.primaryTitle.str.startswith(pound_sign)

# make df for ONLY the True values + primaryTitle from movie_df
pound_true_df = replace_pound_df.loc[replace_pound_df.primaryTitle == True]
pound_true_df['TITLE'] = movies_df['primaryTitle']

# Fix titles to not have # in the front & clean up columns
pound_true_df['TITLE'] = pound_true_df['TITLE'].str.replace(pound_sign, "")
pound_true_clean_df = pound_true_df.drop(columns=['primaryTitle', 'startYear'])
pound_true_clean_df = pound_true_clean_df.rename(columns={'TITLE': 'primaryTitle'})

# Merge 2 dfs, replace blank primaryTitle_y values with na so you can do fillna into a 
# nice new clean has correct info column & delete primaryTitle_y/x
titles_combined_df = pd.merge(movies_df, pound_true_clean_df, how='outer', on='tconst')
titles_combined_df['primaryTitle_y'] = titles_combined_df['primaryTitle_y'].str.replace(" ", "nan")
titles_combined_df["primaryTitle"] = titles_combined_df["primaryTitle_y"].fillna(titles_combined_df["primaryTitle_x"])
titles_fixed_df = titles_combined_df.drop(columns=['primaryTitle_y', 'primaryTitle_x'])


# FINALLY make movie titles into a list so you can run it
movies = titles_fixed_df['primaryTitle'].tolist()


# TMDB call 1 - get TMDB ID numbers

In [33]:
url_tmdb_id = "https://api.themoviedb.org/3/search/movie?api_key=" + tmdb_key + "&query="

response_tmdb_id = []
str_year = "&y=" + str(year)

error_count = 0

for movie in movies: 
    movie_data = requests.get(url_tmdb_id + movie + str_year).json()
    
    if (movie_data['total_results'] == 1):
        response_tmdb_id.append(movie_data['results'][0]['id']) 
    else:
        error_count += 1
        
print(f"A total of {error_count} movies could not be found.")

A total of 10438 movies could not be found.


#### Save results as a CSV

In [34]:
file_outpath = f"Resources/TMDB_pull_1_{year}_error_count{error_count}.csv"

TMDB_df = pd.DataFrame(response_tmdb_id,columns=['ID'],dtype=object)
TMDB_df.to_csv(file_outpath)

# TMDB call 2 - use TMDB ID numbers to get movie info

In [35]:
url_tmdb_movie = "https://api.themoviedb.org/3/movie/"

# Make columns to import info into
TMDB_df['imdb_id'] = " "
TMDB_df['release_date'] = " "
TMDB_df['budget'] = " "
TMDB_df['revenue'] = " "
TMDB_df['genres'] = " "
TMDB_df['original_language'] = " "
TMDB_df['original_title'] = " "
TMDB_df['origin_country'] = " "
TMDB_df['production_countries name'] = " "
TMDB_df['spoken_languages name'] = " "

In [36]:
error_count_info = 0

for index, row in TMDB_df.iterrows(): 
    movie_data = requests.get(url_tmdb_movie + str(TMDB_df.ID[index]) + "?api_key=" + tmdb_key).json()
    try:
        TMDB_df.loc[index, 'imdb_id'] = movie_data['imdb_id']
        TMDB_df.loc[index, 'release_date'] = movie_data['release_date']
        TMDB_df.loc[index, 'budget'] = movie_data['budget']
        TMDB_df.loc[index, 'revenue'] = movie_data['revenue']
        TMDB_df.loc[index, 'original_language'] = movie_data['spoken_languages'][0]['name']
        TMDB_df.loc[index, 'original_title'] = movie_data['original_title']
        TMDB_df.loc[index, 'origin_country'] = movie_data['production_countries'][0]['iso_3166_1']
        TMDB_df.loc[index, 'production_countries name'] = movie_data['production_countries'][0]['name']
        TMDB_df.loc[index, 'spoken_languages name'] = movie_data['spoken_languages'][0]['name']
        TMDB_df.loc[index, 'genres'] = movie_data['genres'][0]['name']    
    except (IndexError, KeyError):
        error_count_info +=1

#### Save results as a CSV

In [37]:
file_outpath_2 = f"Resources/TMDB_pull_2_{year}_error_count{error_count_info}.csv"

TMDB_df.to_csv(file_outpath_2)

# CLEANING if budget = 0, revenue = 0, IMDB_id not found
* This is to help keep the file size down by dropping rows we cannot use or cannot match up

In [38]:
movie_info_pulled_df = TMDB_df.copy()
movie_info_pulled_df.head()

Unnamed: 0,ID,imdb_id,release_date,budget,revenue,genres,original_language,original_title,origin_country,production_countries name,spoken_languages name
0,540873,tt0283440,2015-10-21,0,0,Science Fiction,Deutsch,Kurzzeithelden,DE,Germany,Deutsch
1,368103,tt6449458,2015-11-19,0,0,Drama,Português,"Chatô, O Rei do Brasil",BR,Brazil,Português
2,353326,tt0787524,2016-04-08,0,11472454,Drama,,The Man Who Knew Infinity,GB,United Kingdom,
3,306819,tt0810819,2015-01-01,15000000,64191523,Drama,Français,The Danish Girl,DE,Germany,Français
4,252838,tt0884732,2015-01-16,23000000,79799880,Comedy,English,The Wedding Ringer,US,United States of America,English


In [39]:
movie_info_pulled_df = movie_info_pulled_df[movie_info_pulled_df.budget != 0]
movie_info_pulled_df = movie_info_pulled_df[movie_info_pulled_df.revenue != 0]
movie_info_pulled_df = movie_info_pulled_df.dropna(subset=['imdb_id'])

final_number = movie_info_pulled_df.imdb_id.count()

#### Save results as a CSV

In [40]:
total_errors = beginning_number - final_number

file_outpath_FINAL = f"Resources/TMDB_pull_FINAL_{year}_dropped_movies_{total_errors}.csv"
movie_info_pulled_df.to_csv(file_outpath_FINAL)