# An Examination of Trends in Film



## Introduction

# Data Retrieval

## Libraries and Data Retrieval 

### Library Importations

In [329]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tmdbsimple as tmdb
import requests
from tqdm.notebook import tqdm
import pprint
import time
import csv
tmdb.API_KEY = '26fb51ea2574691bcb3a7e5d0ce700a2'

pp = pprint.PrettyPrinter(indent=2)

### Datasets and Retrieval

In [330]:
discover = tmdb.Discover()

id_list = []

for pages in tqdm(range(1,501), desc='ID Append'):
    page = discover.movie(language='en-US', sort_by='revenue.desc', page=pages, 
                          certification_country='US', certification_lte='NC-17', 
                          certification_gte='G', primary_release_date_gte='2010-01-01', 
                          with_original_language='en')
    for film in page['results']:
        id_list.append(film['id'])
    time.sleep(.16)

HBox(children=(FloatProgress(value=0.0, description='ID Append', max=500.0, style=ProgressStyle(description_wi…




In [331]:
def movie_info(list_id):
    movie = tmdb.Movies(list_id).info()
    ratings = tmdb.Movies(list_id).release_dates()

    iso = 'iso_3166_1'
    for result in ratings['results']:
        if result[iso] == 'US':
            if result['release_dates'][0]['certification'] == '':
                rating = result['release_dates'][1]['certification']
            else: 
                rating = result['release_dates'][0]['certification']
    
    time.sleep(.05)
    movie_dict = {
        'title': movie['original_title'],
        'release_date': movie['release_date'],
        'rating': rating,
        'imdb_id': movie['imdb_id'],
        'tmbd_id': movie['id'],
        'runtime': movie['runtime'],
        'genres': [[genre.get('name') for genre in movie['genres']]],
        'budget': movie['budget'],
        'revenue': movie['revenue']
    }
    
    return movie_dict

In [332]:
film_dict = []

for movie in tqdm(id_list):
    try:
        film_dict.append(movie_info(movie))
    except:
        film_dict.append('Missing')
        


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))




In [333]:
len(film_dict)

1000

In [343]:
miss = 'Missing'
for count, film in enumerate(film_dict):
    if film == miss:
        film_dict.pop(count)
    if film_dict[count]['rating'] == '':
        film_dict.pop(count)

In [344]:
len(film_dict)

490

In [345]:
film = pd.DataFrame.from_records(film_dict)

In [337]:
film['release_date'] = pd.to_datetime(film['release_date'])

In [338]:
print(film['rating'].value_counts())
display(film)

PG-13    446
R        334
PG       185
G         17
           1
Name: rating, dtype: int64


Unnamed: 0,title,release_date,rating,imdb_id,tmbd_id,runtime,genres,budget,revenue
0,Avengers: Endgame,2019-04-24,PG-13,tt4154796,299534,181,"[[Adventure, Science Fiction, Action]]",356000000,2797800564
1,Star Wars: The Force Awakens,2015-12-15,PG-13,tt2488496,140607,136,"[[Action, Adventure, Science Fiction, Fantasy]]",245000000,2068223624
2,Avengers: Infinity War,2018-04-25,PG-13,tt4154756,299536,149,"[[Adventure, Action, Science Fiction]]",300000000,2046239637
3,Jurassic World,2015-06-06,PG-13,tt0369610,135397,124,"[[Action, Adventure, Science Fiction, Thriller]]",150000000,1671713208
4,The Lion King,2019-07-12,PG,tt6105098,420818,118,"[[Adventure, Family, Music]]",260000000,1656943394
...,...,...,...,...,...,...,...,...,...
978,Diary of a Wimpy Kid: The Long Haul,2017-05-19,PG,tt6003368,417830,91,"[[Comedy, Family]]",22000000,40120144
979,"As Above, So Below",2014-08-14,R,tt2870612,256274,93,"[[Horror, Thriller]]",5000000,40100000
980,Monte Carlo,2011-07-01,PG,tt1067774,59860,109,"[[Adventure, Comedy, Romance]]",20000000,39667665
981,Sin City: A Dame to Kill For,2014-08-20,R,tt0458481,189,102,"[[Crime, Action, Thriller]]",65000000,39407616


# Data Analysis

## Exploratory Data Analysis

The data has been stored into the `film` dataframe and the `release_date` column as been converted into a `datetime object`. Taking a look at the `data types` and `Non-Null Count` of the dataframe gives a good idea about what the first steps to further clean the data will be. 

First we see that after dropping the missing values, before being converted into a dataframe, that we are left with a total of 9,955 entries of the 10,000 scraped.

Next we see that in the `release_date` and `runtime` columns we are facing a handful of `Non-Null` entries. 

In [207]:
film.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 483 entries, 0 to 482
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   title         483 non-null    object        
 1   release_date  483 non-null    datetime64[ns]
 2   rating        483 non-null    object        
 3   runtime       483 non-null    int64         
 4   genres        483 non-null    object        
 5   budget        483 non-null    int64         
 6   revenue       483 non-null    int64         
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 26.5+ KB


Finding the `Null` count within the two columns missing values as well as the percentage of `Null` values. The dataset is rather large and therefore having a relatively few `null` values will open options for further refining. 

In [63]:
print('release_date Null count and percentage null:', film['release_date'].isna().sum(), 
      round(film['release_date'].isna().sum()/ len(film)*100, 3))
print('runtime Null count and percentage null:', film['runtime'].isna().sum(), 
      round(film['runtime'].isna().sum()/ len(film)*100, 3))

release_date Null count and percentage null: 7 0.07
runtime Null count and percentage null: 63 0.633


With consistency of data taken into consideration we can refer to the percentage counts of `Null` values within the columns observed to make decision on what actions to take to rectify the values. 

Considering the relatively low count of `null` values when compared to the dataset as a whole it is reasonable to consider dropping the `null` values as a valid strategy to deal with them. 

In [67]:
film.dropna(inplace=True)
len(film)

9885

In [73]:
print(round((len(film)/9955)*100, 2),'% of dataset remaining')

99.3 % of dataset remaining


In [74]:
film.head()

Unnamed: 0,title,release_date,runtime,genres,budget,revenue
0,Avengers: Endgame,2019-04-24,181.0,"[[Adventure, Science Fiction, Action]]",356000000,2797800564
1,Star Wars: The Force Awakens,2015-12-15,136.0,"[[Action, Adventure, Science Fiction, Fantasy]]",245000000,2068223624
2,Avengers: Infinity War,2018-04-25,149.0,"[[Adventure, Action, Science Fiction]]",300000000,2046239637
3,Jurassic World,2015-06-06,124.0,"[[Action, Adventure, Science Fiction, Thriller]]",150000000,1671713208
4,The Lion King,2019-07-12,118.0,"[[Adventure, Family, Music]]",260000000,1656943394


Now the `null` values within the dataframe has been dropped and we are left with over 99% of our data left. The next task in order to get the data into a usable form is to convert the `runtime` column from a `float` type into an `int` type. 

In [75]:
film.astype({'runtime': 'int64'})

Unnamed: 0,title,release_date,runtime,genres,budget,revenue
0,Avengers: Endgame,2019-04-24,181,"[[Adventure, Science Fiction, Action]]",356000000,2797800564
1,Star Wars: The Force Awakens,2015-12-15,136,"[[Action, Adventure, Science Fiction, Fantasy]]",245000000,2068223624
2,Avengers: Infinity War,2018-04-25,149,"[[Adventure, Action, Science Fiction]]",300000000,2046239637
3,Jurassic World,2015-06-06,124,"[[Action, Adventure, Science Fiction, Thriller]]",150000000,1671713208
4,The Lion King,2019-07-12,118,"[[Adventure, Family, Music]]",260000000,1656943394
...,...,...,...,...,...,...
9950,Untitled George Carlin Documentary,2022-12-31,0,[[Documentary]],0,0
9951,Instrucciones Para Recuperar la Memoria,2016-06-01,9,"[[Drama, Documentary]]",0,0
9952,Pretend That You Love Me,2020-06-12,94,[[]],50,0
9953,Iraq: A Veritable Imposture,2017-12-17,0,[[]],0,0


To further our goal of cleaning for an accurate representation of our data it is imperative to find further discrepancies and outliers within our data that will invalidate further calculations that we wish to perform on the dataset.

In [77]:
film.describe()

Unnamed: 0,runtime,budget,revenue
count,9885.0,9885.0,9885.0
mean,64.909762,8765671.0,29240870.0
std,91.067874,28972650.0,121320100.0
min,0.0,0.0,0.0
25%,6.0,0.0,0.0
50%,81.0,0.0,0.0
75%,105.0,700000.0,3324330.0
max,7200.0,380000000.0,2797801000.0
