# Phase 1 Final Project 


* Student Names: Adina Steinman, Chinh Ho, Andrew Banner 
* Instructor Name: Yish Lim

## Introduction 

Throughout Phase 1, we have learned various ways to explore datasets. This notebook focuses on the data cleaning portion of our project: learning how to use the TMDB API, navigating and exploring the data it returns, selecting the data relevant to our analyses, and constructing DataFrames.


## Part I: Getting Started

Our first step is to import the packages needed for our analyses. We import "requests" to call the API with the requests.get() method, "json" to read data in JSON format, "pandas" to load our data into DataFrames, "time" to add a delay between API calls when pulling large sets of data, and, most importantly, "tmdbsimple", the TMDB-specific wrapper that will help us explore and analyze this API.

We will then create a function that reads a local JSON file, which lets us retrieve the API key we have created without displaying the key itself in the notebook (as we don't want to give our password away!).



In [1]:
# import the necessary packages for the data cleaning process
import requests
import json
import pandas as pd
import time 
import tmdbsimple as tmdb

In [2]:
# create a function "get_keys" to retrieve API key from file on local computer

def get_keys(path):
    with open(path) as f:
        return json.load(f)

In [3]:
# retrieve API key

keys = get_keys("/Users/adinasteinman/.secret/movies_api.json")
api_key = keys['api_key']

In [4]:
# set API key equal to TMDB key to be used in tmdbsimple wrapper 

tmdb.API_KEY = api_key
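
As a small extension of the get_keys function above, the sketch below checks that the file actually contains a key before it is used. The name load_api_key, the "field" parameter, and the placeholder key are our own assumptions for illustration; the file layout (a JSON object with an 'api_key' entry) follows the cells above.

```python
import json
import os
import tempfile


def load_api_key(path, field="api_key"):
    """Read a JSON credentials file and return the value stored under `field`."""
    with open(os.path.expanduser(path)) as f:
        keys = json.load(f)
    # Fail early with a clear message instead of sending an empty key to the API.
    if field not in keys or not keys[field]:
        raise KeyError(f"'{field}' is missing or empty in {path}")
    return keys[field]


# Demonstration with a throwaway file and a placeholder key (not a real credential).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "movies_api.json")
    with open(demo_path, "w") as f:
        json.dump({"api_key": "PLACEHOLDER_KEY"}, f)
    demo_key = load_api_key(demo_path)

print(demo_key)
```

With a real credentials file, the returned value could be assigned to tmdb.API_KEY exactly as in the cell above.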

## Part II: Data Exploration

Our next step is to get to know the data available through tmdbsimple. We will investigate it using the wrapper's "movies", "search" and "discover" calls. We will also call the API directly using the requests.get method, then compare the two approaches and decide which to use for the remainder of our analysis. The goal of this section is to try different ways of extracting data and to use this exercise to help determine which data we would like to analyze.

In [5]:
# explore the data available when using the tmdb.Movies call

tmdb.Movies(5).info()

{'adult': False,
 'backdrop_path': '/5J9lhMZHVQRfH8BV4hyehwptzKp.jpg',
 'belongs_to_collection': None,
 'budget': 4000000,
 'genres': [{'id': 80, 'name': 'Crime'}, {'id': 35, 'name': 'Comedy'}],
 'homepage': '',
 'id': 5,
 'imdb_id': 'tt0113101',
 'original_language': 'en',
 'original_title': 'Four Rooms',
 'overview': "It's Ted the Bellhop's first night on the job...and the hotel's very unusual guests are about to place him in some outrageous predicaments. It seems that this evening's room service is serving up one unbelievable happening after another.",
 'popularity': 14.543,
 'poster_path': '/uZSmxBLIuZ8gpadjAWNdA5aQDAc.jpg',
 'production_companies': [{'id': 14,
   'logo_path': '/m6AHu84oZQxvq7n1rsvMNJIAsMu.png',
   'name': 'Miramax',
   'origin_country': 'US'},
  {'id': 59,
   'logo_path': '/yH7OMeSxhfP0AVM6iT0rsF3F4ZC.png',
   'name': 'A Band Apart',
   'origin_country': 'US'}],
 'production_countries': [{'iso_3166_1': 'US',
   'name': 'United States of America'}],
 'release_date'

In [6]:
# create an instance of the TMDB "search" class

search = tmdb.Search()

In [7]:
# explore the TMDB search function by searching for companies matching "Sony Pictures"

response1 = search.company(query='Sony Pictures')
response1

{'page': 1,
 'results': [{'id': 82346,
   'logo_path': '/jqgK6CSkPrEsIv6Nk390JaBcXYF.png',
   'name': 'Sony Pictures',
   'origin_country': 'JP'},
  {'id': 34,
   'logo_path': '/GagSvqWlyPdkFHMfQ3pNq6ix9P.png',
   'name': 'Sony Pictures',
   'origin_country': 'US'},
  {'id': 134941,
   'logo_path': None,
   'name': 'Sony Pictures',
   'origin_country': ''},
  {'id': 11073,
   'logo_path': '/wHs44fktdoj6c378ZbSWfzKsM2Z.png',
   'name': 'Sony Pictures Television',
   'origin_country': 'US'},
  {'id': 30692,
   'logo_path': '/4ZSWXR7TfxwJbXGcnOAf617AC3h.png',
   'name': 'Sony Pictures Imageworks',
   'origin_country': 'CA'},
  {'id': 8285,
   'logo_path': None,
   'name': 'Sony Pictures Studio',
   'origin_country': ''},
  {'id': 58,
   'logo_path': '/voYCwlBHJQANtjvm5MNIkCF1dDH.png',
   'name': 'Sony Pictures Classics',
   'origin_country': 'US'},
  {'id': 2251,
   'logo_path': '/6l16UFSkZ1oPpyBYaILgffFZlTc.png',
   'name': 'Sony Pictures Animation',
   'origin_country': 'US'},
  {'id': 

In [8]:
# use a for loop to explore the 'origin_country' of the results

for company in response1['results']:
    print(company['origin_country'])

JP
US

US
CA

US
US

US

US
US
US
FR

KR
BR
IN
US
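
The printed list above is hard to read at a glance, so a tally can help. The sketch below counts origin countries with collections.Counter, using the first eight company records from the search output above (other fields omitted). The label "unknown" for empty strings is our own choice, not something the API returns.

```python
from collections import Counter

# First eight company records from the search output above (other fields omitted).
sample_results = [
    {"name": "Sony Pictures", "origin_country": "JP"},
    {"name": "Sony Pictures", "origin_country": "US"},
    {"name": "Sony Pictures", "origin_country": ""},
    {"name": "Sony Pictures Television", "origin_country": "US"},
    {"name": "Sony Pictures Imageworks", "origin_country": "CA"},
    {"name": "Sony Pictures Studio", "origin_country": ""},
    {"name": "Sony Pictures Classics", "origin_country": "US"},
    {"name": "Sony Pictures Animation", "origin_country": "US"},
]

# Empty strings are relabelled "unknown" so they show up clearly in the tally.
country_counts = Counter(
    company["origin_country"] or "unknown" for company in sample_results
)
print(country_counts.most_common())
```

The same expression should work on response1['results'] directly, since each record there carries an 'origin_country' field.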


In [9]:
# create an instance of the TMDB discover class

discover = tmdb.Discover()

In [10]:
# test the discover method to find titles for films in the year 2016

response2 = discover.movie(year=2016)
for value in response2['results']:
    print(value['title'])

The Last: Naruto the Movie
Terrifier
Your Name.
Harry Potter and the Prisoner of Azkaban
A Silent Voice
Dragon Ball Z: Resurrection 'F'
Harry Potter and the Half-Blood Prince
ABCs of Death 2 1/2
Deadpool
Yu-Gi-Oh!: The Dark Side of Dimensions
Hacksaw Ridge
Scream at the Devil
Doctor Strange
The Maze Runner
Harry Potter and the Deathly Hallows: Part 2
The Purge: Election Year
Harry Potter and the Deathly Hallows: Part 1
The Hobbit: The Battle of the Five Armies
Sing
Containment


In [11]:
# Find the length of the results 
# This should tell us how many results we are able to explore in each API call 
len(response2['results'])

20

In [12]:
# Explore a specific genre. Look at comedy, for example, which has the genre id 35

comedy = discover.movie(with_genres=35)
comedy['results'][0:2]

[{'popularity': 872.93,
  'vote_count': 43,
  'video': False,
  'poster_path': '/5aL71e0XBgHZ6zdWcWeuEhwD2Gw.jpg',
  'id': 721656,
  'adult': False,
  'backdrop_path': '/5gTQmnGYKxDfmUWJ9GUWqrszRxN.jpg',
  'original_language': 'en',
  'original_title': 'Happy Halloween Scooby-Doo!',
  'genre_ids': [16, 35, 80, 9648, 10751],
  'title': 'Happy Halloween Scooby-Doo!',
  'vote_average': 7.8,
  'overview': 'Scooby-Doo and the gang team up with their pals, Bill Nye The Science Guy and Elvira Mistress of the Dark, to solve this mystery of gigantic proportions and save Crystal Cove!',
  'release_date': '2020-10-06'},
 {'popularity': 772.926,
  'vote_count': 91,
  'video': False,
  'poster_path': '/xqvX5A24dbIWaeYsMTxxKX5qOfz.jpg',
  'id': 660982,
  'adult': False,
  'backdrop_path': '/75ooojtgiKYm5LcCczbCexioZze.jpg',
  'original_language': 'en',
  'original_title': "American Pie Presents: Girls' Rules",
  'genre_ids': [35],
  'title': 'American Pie Presents: Girls Rules',
  'vote_average': 6.

In [13]:
# Explore how to use the TMDb API using the requests.get method 
response3 = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' +  api_key + '&sort_by=popularity.desc')
popularity = response3.json()
top_popular = popularity['results']
top_popular[0:2]

[{'popularity': 3130.643,
  'vote_count': 143,
  'video': False,
  'poster_path': '/7D430eqZj8y3oVkLFfsWXGRcpEG.jpg',
  'id': 528085,
  'adult': False,
  'backdrop_path': '/5UkzNSOK561c2QRy2Zr4AkADzLT.jpg',
  'original_language': 'en',
  'original_title': '2067',
  'genre_ids': [18, 878, 53],
  'title': '2067',
  'vote_average': 5.8,
  'overview': 'A lowly utility worker is called to the future by a mysterious radio signal, he must leave his dying wife to embark on a journey that will force him to face his deepest fears in an attempt to change the fabric of reality and save humankind from its greatest environmental crisis yet.',
  'release_date': '2020-10-01'},
 {'popularity': 1724.443,
  'vote_count': 114,
  'video': False,
  'poster_path': '/elZ6JCzSEvFOq4gNjNeZsnRFsvj.jpg',
  'id': 741067,
  'adult': False,
  'backdrop_path': '/aO5ILS7qnqtFIprbJ40zla0jhpu.jpg',
  'original_language': 'en',
  'original_title': 'Welcome to Sudden Death',
  'genre_ids': [28, 12, 18, 53],
  'title': 'We
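
The requests.get call above builds its URL by string concatenation. A minimal sketch of an alternative is shown below: the query string is assembled from a dictionary with the standard library's urlencode. The helper name build_discover_url and the placeholder key are assumptions for illustration; the endpoint and the sort_by parameter are taken from the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/discover/movie"


def build_discover_url(api_key, **params):
    """Return the discover URL with the API key and any extra query parameters."""
    query = {"api_key": api_key, **params}
    return BASE_URL + "?" + urlencode(query)


# "PLACEHOLDER_KEY" stands in for a real key; no request is sent here.
demo_url = build_discover_url("PLACEHOLDER_KEY", sort_by="popularity.desc", page=1)
print(demo_url)
```

The requests library can do the same encoding itself when the dictionary is passed through the params argument of requests.get, which avoids hand-built query strings altogether.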

## Part III: Accessing the Data

We will now access the data that we will use for the remainder of our project. During our exploration phase, we learned that we can pull one page at a time, which will provide us with 20 results. We want, however, to be able to access **10,000 rows** of data across 500 pages. Our first step in this phase will be to create a loop that will append all 500 pages of results to a list. 

In [14]:
# Pull all 10,000 rows of data across all 500 pages. Append these to a list called "finallist",
# which will contain the movie data sorted by most popular first.

finallist = []
for page in range(1, 501):
    url = 'http://api.themoviedb.org/3/discover/movie?&sort_by=popularity.desc&offset=20&page={}&api_key='.format(page)
    req = requests.get(url + api_key).json()
    results = req['results']
    finallist.extend(results)
    time.sleep(2)  # pause between calls when pulling this many pages
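
The paging logic above can also be sketched as a reusable helper that can be checked without calling the API. In the sketch below, collect_pages and fake_fetch are hypothetical names of our own; fake_fetch stands in for the real request and simply returns 20 dummy records per page, matching the page size we observed earlier.

```python
import time


def collect_pages(fetch_page, n_pages, delay=0.0):
    """Call fetch_page(page) for pages 1..n_pages and flatten the 'results' lists."""
    rows = []
    for page in range(1, n_pages + 1):
        payload = fetch_page(page)
        rows.extend(payload.get("results", []))
        time.sleep(delay)  # pause between calls; 0 here because nothing is requested
    return rows


def fake_fetch(page):
    """Stand-in for the API: returns 20 dummy records per page."""
    start = (page - 1) * 20
    return {"page": page, "results": [{"id": start + i} for i in range(20)]}


demo_rows = collect_pages(fake_fetch, n_pages=500)
print(len(demo_rows))
```

With the real API, fetch_page could wrap the requests.get(...).json() call from the cell above, with delay=2 to keep the same pause between requests.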

In [15]:
# Find the length of this new list to confirm our 10,000 rows were retrieved 
len(finallist)

10000

We see here that this worked! The length of our final list is 10,000. We will now turn this into a DataFrame and explore what variables are available to us.

In [18]:
# Put this new data into a dataframe and explore the information available
df = pd.DataFrame(finallist)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
popularity           10000 non-null float64
vote_count           10000 non-null int64
video                10000 non-null bool
poster_path          9861 non-null object
id                   10000 non-null int64
adult                10000 non-null bool
backdrop_path        9432 non-null object
original_language    10000 non-null object
original_title       10000 non-null object
genre_ids            10000 non-null object
title                10000 non-null object
vote_average         10000 non-null float64
overview             10000 non-null object
release_date         9993 non-null object
dtypes: bool(2), float64(2), int64(2), object(8)
memory usage: 957.2+ KB


While this is a comprehensive set of variables, we are still missing an important feature: movie revenue. We will now use a for loop to create a dictionary of key-value pairs, where the key is the movie ID and the value is that movie's revenue.

In [19]:
# Note that the "revenue" column is not in the above DataFrame.
# Retrieve the revenue data for the items in this list.
# Note that the ids 627494 and 726682 are excluded because these URLs contain corrupt data.
# The corrupt data caused an error in our for loop; with these two ids excluded, we are able
# to retrieve revenue data for 9,998 films.

excluded_ids = {627494, 726682}
movie_dict_final = {}

for item in finallist:
    if item['id'] not in excluded_ids:
        tmp = tmdb.Movies(item['id']).info()
        movie_dict_final[tmp['id']] = tmp['revenue']
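
Hard-coding the two problem ids works, but it requires finding them by hand first. A hedged alternative is sketched below: the loop catches the failure, records the id, and carries on. The names collect_revenue and fake_fetch_info are our own; fake_fetch_info simulates tmdb.Movies(movie_id).info() and raises an error for the two excluded ids. We do not know which exception the real call raised, so the sketch catches Exception broadly, which is an assumption rather than a recommendation.

```python
def collect_revenue(movie_ids, fetch_info):
    """Return ({id: revenue}, [failed ids]) using fetch_info(id) -> dict."""
    revenue_by_id = {}
    failed_ids = []
    for movie_id in movie_ids:
        try:
            info = fetch_info(movie_id)
            revenue_by_id[info["id"]] = info["revenue"]
        except Exception:  # broad on purpose: any failure just skips this id
            failed_ids.append(movie_id)
    return revenue_by_id, failed_ids


def fake_fetch_info(movie_id):
    """Stand-in for tmdb.Movies(movie_id).info(); fails for the two excluded ids."""
    if movie_id in {627494, 726682}:
        raise ValueError("simulated bad response")
    known = {528085: 0, 337401: 57000000}  # revenue values shown in this notebook
    return {"id": movie_id, "revenue": known.get(movie_id, 0)}


demo_revenue, demo_failed = collect_revenue(
    [528085, 627494, 337401, 726682], fake_fetch_info
)
print(demo_revenue)
print(demo_failed)
```

Keeping the list of failed ids makes it easy to report exactly which films were skipped, rather than relying on a comment.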
    

In [20]:
print(movie_dict_final)

{528085: 0, 741067: 0, 497582: 0, 337401: 57000000, 724989: 0, 721656: 0, 694919: 0, 660982: 0, 718444: 139757, 697064: 0, 734309: 0, 539885: 152812, 581392: 35878266, 592350: 29900850, 438396: 0, 475430: 0, 635302: 0, 606234: 0, 677638: 0, 721452: 0, 605116: 0, 621870: 0, 474350: 473093228, 594328: 0, 726739: 0, 495764: 201858461, 722603: 0, 617505: 0, 590223: 0, 532067: 0, 746957: 0, 475557: 1074251311, 613504: 0, 38700: 419074646, 703134: 0, 385103: 9430580, 547016: 0, 446893: 1946164, 425001: 8982106, 454626: 306766470, 531499: 0, 611395: 0, 603119: 0, 632618: 0, 601165: 0, 572154: 0, 667141: 0, 516486: 0, 715658: 0, 508439: 103181419, 493065: 0, 601844: 0, 338762: 30234182, 664767: 0, 520763: 0, 330457: 1450026933, 713825: 0, 489326: 106270, 640882: 0, 611605: 0, 704630: 0, 440249: 0, 347201: 0, 744676: 0, 631132: 0, 735110: 0, 643550: 0, 604578: 0, 738215: 0, 716258: 0, 354912: 800526015, 709621: 0, 655431: 0, 671145: 0, 568160: 186965409, 619592: 215668, 299536: 2046239637, 6802

In [21]:
# Find the length of the dictionary that contains the key:value pairs for movie id and revenue

len(movie_dict_final)

9998

Great! We now have a dictionary that includes 9,998 rows of data for revenue. We will now turn this information into a DataFrame as well.

In [22]:
# Convert this dictionary to a DataFrame and explore the first 10 rows 
revenue = pd.DataFrame.from_dict(movie_dict_final,orient='index', columns=['revenue'])
revenue.head(10)

|        | revenue  |
|--------|----------|
| 528085 | 0        |
| 741067 | 0        |
| 497582 | 0        |
| 337401 | 57000000 |
| 724989 | 0        |
| 721656 | 0        |
| 694919 | 0        |
| 660982 | 0        |
| 718444 | 139757   |
| 697064 | 0        |


In [23]:
# Find information on this dataframe
revenue.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9998 entries, 528085 to 20309
Data columns (total 1 columns):
revenue    9998 non-null int64
dtypes: int64(1)
memory usage: 156.2 KB



## Part IV: Joining Tables and Cleaning DataFrames

Our final step in the data cleaning process is to join the two newly created DataFrames into one clear DataFrame that includes all of the information we would like to consider for our analysis. We will then remove unnecessary columns to produce a final clean table and save it to a new csv file, so that it can be accessed in our next notebook, where we will conduct our exploratory analysis and create data visualizations.

In [None]:
# Join the two tables together on movie id
total_df = df.merge(revenue, left_on="id", right_index=True)
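
DataFrame.merge performs an inner join by default, so any movie whose id is missing from the revenue table is dropped from the result. Given the two ids excluded earlier, we would therefore expect total_df to have 9,998 rows rather than 10,000. The plain-Python sketch below illustrates that inner-join behaviour on three records; it is not how pandas implements the merge, and the placeholder title for id 627494 is ours because that record is not shown in this notebook.

```python
# Three movie records; id 627494 has no entry in the revenue lookup below.
movies = [
    {"id": 528085, "original_title": "2067"},
    {"id": 627494, "original_title": "(not shown in this notebook)"},
    {"id": 741067, "original_title": "Welcome to Sudden Death"},
]
revenue_by_id = {528085: 0, 741067: 0}

# Inner join: keep a movie only if its id appears on both sides.
merged = [
    {**movie, "revenue": revenue_by_id[movie["id"]]}
    for movie in movies
    if movie["id"] in revenue_by_id
]
dropped_ids = [m["id"] for m in movies if m["id"] not in revenue_by_id]

print(len(merged))
print(dropped_ids)
```

If the unmatched rows should be kept instead, passing how="left" to the merge would retain them with a missing revenue value.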

In [None]:
# View the first two rows of the DataFrame to see that revenue was joined correctly
total_df.head(2)

In [None]:
# Drop poster path, backdrop path and overview as these columns contain text we will not use. 
# Drop original_title since we will use the "title" column as a reference instead.

total_df = total_df.drop(['poster_path', 'backdrop_path', 'overview', 'original_title'], axis=1)

In [None]:
total_df.head()

In [None]:
# Save the new DataFrame to csv file 
total_df.to_csv('Moviesdata_revenue.csv')


## Part V: Data Visualization and Analysis

In [None]:
new_df = total_df[total_df['revenue'] != 0]
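
Before filtering, it is worth checking how much data the zero-revenue filter removes. The sketch below applies the same "revenue != 0" condition to the first ten id and revenue pairs printed earlier in this notebook. We assume, without confirmation from the API, that a revenue of 0 usually means the figure is missing rather than that the film earned nothing.

```python
# First ten id: revenue pairs from the dictionary printed earlier in this notebook.
sample_revenue = {
    528085: 0, 741067: 0, 497582: 0, 337401: 57000000, 724989: 0,
    721656: 0, 694919: 0, 660982: 0, 718444: 139757, 697064: 0,
}

# Same condition as the DataFrame filter above: keep only non-zero revenue.
nonzero = {movie_id: rev for movie_id, rev in sample_revenue.items() if rev != 0}
share_zero = 1 - len(nonzero) / len(sample_revenue)

print(len(nonzero), f"{share_zero:.0%}")
```

On the full table, the equivalent check would compare len(new_df) with len(total_df) to see how many films remain for the revenue analysis.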