# Tasks

The stakeholder's first question is: does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?

- They want you to perform a statistical test to get a mathematically supported answer.
- They want you to report whether you found a significant difference between ratings.
    - If so, what was the p-value of your analysis?
    - And which rating earns the most revenue?
- They want you to prepare a visualization that supports your finding.

It is then up to you to think of 2 additional hypotheses to test that your stakeholder may want to know.

Some example hypotheses you could test:

- Do movies that are over 2.5 hours long earn more revenue than movies that are 1.5 hours long (or less)?
- Do movies released in 2020 earn less revenue than movies released in 2018?
    - How do the years compare for movie ratings?
- Do some movie genres earn more revenue than others?
- Are some genres higher rated than others?
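
As a rough illustration of the first question, the sketch below runs a permutation test on the one-way ANOVA F statistic using only the standard library. It is not the method prescribed by the assignment, only one way to get a p-value for "do the rating groups differ in mean revenue?". The function names `f_statistic` and `permutation_p_value` are hypothetical, and the revenue figures are made-up toy numbers, not TMDB data.

```python
import random
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = mean(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Share of label shufflings whose F statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # add-one correction keeps the estimate away from exactly zero
    return (hits + 1) / (n_permutations + 1)

# Made-up revenues in millions of dollars, for illustration only (not TMDB data)
toy_revenue = {
    "G": [120, 95, 140, 110],
    "PG": [200, 180, 220, 210],
    "PG-13": [400, 380, 420, 450],
    "R": [90, 60, 75, 100],
}
p_value = permutation_p_value(list(toy_revenue.values()))
top_rating = max(toy_revenue, key=lambda r: mean(toy_revenue[r]))
print(f"p-value: {p_value:.4f}, highest mean revenue: {top_rating}")
```

With real data, the groups would come from the revenue column split by certification, and a library routine for ANOVA or a non-parametric alternative may be a better fit, depending on how skewed the revenues are.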


In [1]:
import json
import os
import time

import pandas as pd
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

# load TMDB credentials from a local JSON file
with open('/Users/clove/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)

login.keys()

dict_keys(['client-id', 'api-key'])

In [2]:
tmdb.API_KEY = login['api-key']

In [3]:
FOLDER = "Data/"
os.listdir(FOLDER)

['.ipynb_checkpoints', 'title_basics.csv.gz', 'tmdb_api_results.json']

In [4]:
basics = pd.read_csv('Data/title_basics.csv.gz', low_memory=False)
basics.head(3)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama


In [5]:
# fetch movie API data for 2010 through 2019
errors = []
YEARS_TO_GET = [2010,2011,2012,2013,2014,2015,2016,2017,2018,2019]

# define helper functions
def get_movie_with_rating(movie_id):
    """Return TMDB info for a movie, plus its US certification when one is listed."""
    movie = tmdb.Movies(movie_id)
    info = movie.info()
    releases = movie.releases()
    for c in releases['countries']:
        if c['iso_3166_1'] == 'US':
            info['certification'] = c['certification']
    return info

def write_json(new_data, filename):
    """Append new_data (a record or a list of records) to the JSON list in filename."""
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # rewind and overwrite the file with the combined list
        file.seek(0)
        json.dump(file_data, file)
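
The append-to-file pattern used by `write_json` can be tried without calling the API. The sketch below is a standalone variant under a hypothetical name, `append_to_json_list`, run against a temporary file seeded with the same placeholder record the notebook uses. The added `file.truncate()` call is an assumption on my part: it is not needed when the list only grows, but it guards against leftover bytes if the rewritten content is ever shorter.

```python
import json
import os
import tempfile

def append_to_json_list(new_data, filename):
    """Hypothetical stand-in for write_json: append to the JSON list in `filename`."""
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        if isinstance(new_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()  # assumption: drop any trailing bytes from the old content

# Seed a temporary file with the placeholder record, as the year loop does
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_results.json')
with open(demo_path, 'w') as f:
    json.dump([{'imdb_id': 0}], f)

# Append one record, then a list of records
append_to_json_list({'imdb_id': 'tt0848228'}, demo_path)
append_to_json_list([{'imdb_id': 'tt0332280'}], demo_path)

with open(demo_path) as f:
    saved = json.load(f)
saved_ids = [record['imdb_id'] for record in saved]
print(saved_ids)
```

Because every call rereads and rewrites the whole file, the cost grows with the file size. That trade-off buys the ability to resume after an interruption without losing earlier results.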

In [6]:
# confirm api call works
test_ids = ["tt0848228", "tt0332280",]
results = []
for movie_id in test_ids:
    movie_info = get_movie_with_rating(movie_id)
    results.append(movie_info)    
pd.DataFrame(results)

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,"{'id': 86311, 'name': 'The Avengers Collection...",220000000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",https://www.marvel.com/movies/the-avengers,24428,tt0848228,en,The Avengers,...,1518815515,143,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Some assembly required.,The Avengers,False,7.71,29107,PG-13
1,False,/qom1SZSENdmHFNZBXbtJAU0WTlC.jpg,,29000000,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",http://www.newline.com/properties/notebookthe....,11036,tt0332280,en,The Notebook,...,115603229,123,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Behind every great love is a great story.,The Notebook,False,7.879,10561,PG-13
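
The loop inside `get_movie_with_rating` keeps whichever US entry it sees last, even when that entry's certification is an empty string, and leaves the key unset when there is no US entry at all. A minimal sketch of a stricter lookup is shown below. The helper name `extract_us_certification` is hypothetical, and the sample dictionary is a hand-written stand-in that only mirrors the `countries`, `iso_3166_1`, and `certification` fields used above.

```python
def extract_us_certification(releases):
    """Return the first non-empty US certification, or None if there is none."""
    for country in releases.get('countries', []):
        if country.get('iso_3166_1') == 'US' and country.get('certification'):
            return country['certification']
    return None

# Hand-written stand-in for a releases response; the values are illustrative only
sample_releases = {
    'countries': [
        {'iso_3166_1': 'GB', 'certification': '12A'},
        {'iso_3166_1': 'US', 'certification': ''},
        {'iso_3166_1': 'US', 'certification': 'PG-13'},
    ]
}
us_cert = extract_us_certification(sample_releases)
no_cert = extract_us_certification({'countries': []})
print(us_cert, no_cert)
```

Returning `None` explicitly makes missing certifications easy to filter before grouping revenue by rating.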


In [9]:
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    JSON_FILE = f'{FOLDER}tmdb_api_results{YEAR}.json'
    # create the results file with a placeholder record if it does not exist yet
    if not os.path.isfile(JSON_FILE):
        with open(JSON_FILE, 'w') as f:
            json.dump([{'imdb_id': 0}], f)

    # select this year's movies and skip any IDs that were already saved
    df = basics.loc[basics['startYear'] == YEAR].copy()
    movie_ids = df['tconst'].copy()
    previous_df = pd.read_json(JSON_FILE)
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            temp = get_movie_with_rating(movie_id)
            write_json(temp, JSON_FILE)
            time.sleep(0.02)
        except Exception as e:
            errors.append([movie_id, e])

    # save this year's results as a compressed CSV
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}part_4_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)
print(f"- Total errors: {len(errors)}")

YEARS:   0%|          | 0/10 [00:00<?, ?it/s]

Movies from 2010:   0%|          | 0/3867 [00:00<?, ?it/s]

Movies from 2011:   0%|          | 0/4230 [00:00<?, ?it/s]

Movies from 2012:   0%|          | 0/4525 [00:00<?, ?it/s]

Movies from 2013:   0%|          | 0/4717 [00:00<?, ?it/s]

Movies from 2014:   0%|          | 0/4915 [00:00<?, ?it/s]

Movies from 2015:   0%|          | 0/5062 [00:00<?, ?it/s]

Movies from 2016:   0%|          | 0/5261 [00:00<?, ?it/s]

Movies from 2017:   0%|          | 0/5646 [00:00<?, ?it/s]

Movies from 2018:   0%|          | 0/5797 [00:00<?, ?it/s]

Movies from 2019:   0%|          | 0/5880 [00:00<?, ?it/s]

- Total errors: 11638
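
The notebook does not show what the collected errors were. One way to inspect them is to tally the exception types stored in the `errors` list. The sketch below assumes the same `[movie_id, exception]` shape; the helper name `summarize_errors` is hypothetical, and the entries are placeholders made up for illustration, not real results.

```python
from collections import Counter

def summarize_errors(errors):
    """Count how often each exception type appears in a list of [movie_id, exception]."""
    return Counter(type(exc).__name__ for _, exc in errors)

# Placeholder entries shaped like the `errors` list; IDs and messages are made up
example_errors = [
    ["tt0000001", ValueError("placeholder")],
    ["tt0000002", KeyError("placeholder")],
    ["tt0000003", ValueError("placeholder")],
]

summary = summarize_errors(example_errors)
for name, count in summary.most_common():
    print(f"{name}: {count}")
```

Running `summarize_errors(errors)` on the real list would show whether the failures share one cause or several, which helps decide whether any IDs are worth retrying.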


# API extraction complete