# Milestone 4

Author: Berkeley Willis  
Class: Data Preparation 540-T303  
Date: 2020-05-07  


### Setup to pull data from the API

First we are going to set some basic parameters to give us a starting point, and import any required libraries.

In [1]:
# Libraries to get first
import pandas as pd
import numpy as np
import re
import datetime
import json
import requests

API_KEY="3a91d615aae2025a8c553ec5b95ea810"
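
As an aside, the key does not have to live in the notebook itself. Below is a minimal sketch that reads it from the environment instead. The function name `load_api_key` and the variable name `TMDB_API_KEY` are my own choices for illustration, not something the API requires.

```python
import os

def load_api_key(env_var="TMDB_API_KEY", default=None):
    """Return the API key stored in `env_var`, or `default` if it is unset.

    Both the helper and the variable name are hypothetical.
    """
    key = os.environ.get(env_var, default)
    if not key:
        raise RuntimeError("Set the {} environment variable first".format(env_var))
    return key
```

With this in place the cell above could use `API_KEY = load_api_key()` and the rest of the notebook would stay the same.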


### Helper functions

Make a couple of quick helper functions that will download and parse each page of records and bind them all together.

In [2]:
# Function that will parse a response
def parseResponse(p_json):
    # Set up all the lists that will create the data frame
    popularity_l = []
    ids = []
    video_l = []
    vote_counts = []
    vote_avgs = []
    titles = []
    release_dates = []
    original_langauges = []
    adult_indicators = []
    overviews = []
    
    # For each json object, get the information that is relevant
    for item in p_json['results']:
        popularity_l.append(item['popularity'])
        ids.append(item['id'])
        video_l.append(item['video'])
        vote_counts.append(item['vote_count'])
        vote_avgs.append(item['vote_average'])
        titles.append(item['title'])
        release_dates.append(item['release_date'])
        original_langauges.append(item['original_language'])
        adult_indicators.append(item['adult'])
        overviews.append(item['overview'])
    
    # Then combine them into a single data frame
    tmp_dict = {
        "id":ids,"popularity":popularity_l,"video":video_l,"votes":vote_counts,"vote_average":vote_avgs,
        "title":titles,"release_date":release_dates,"original_langauge":original_langauges,"adult":adult_indicators,
        "overview":overviews
    }
    return pd.DataFrame(tmp_dict)

# Helper to pull some number of records; the default is 1000
def downloadMovieData(api_key, num_records=1000):
    # Setup some basic tracking variables 
    current_page = 1
    record_count = 0
    # Base URI that will be used
    API_ENDPOINT = "https://api.themoviedb.org/3/movie/top_rated"
    # Lastly a list to store the resulting dataframes that will be combined later
    pd_list = []
    # Repeat until we have at least num_records records (1000 by default)
    while record_count < num_records:
        print("Downloading page {}".format(current_page))
        # Make the request and decode the JSON body; requests builds the query string from params
        query = {"api_key": api_key, "language": "en-US", "page": current_page}
        json_obj = requests.get(API_ENDPOINT, params=query).json()
        pd_list.append(parseResponse(json_obj))
        record_count += len(json_obj['results'])
        current_page += 1
    return pd.concat(pd_list, sort=False)
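
The loop in `downloadMovieData` keeps requesting pages until the record count is reached, so it relies on every page returning results. Here is a rough sketch of the same paging idea with two stopping guards added. It is not the function used in this notebook: `collect_results` and `fetch_page` are hypothetical names, the `total_pages` field is assumed to be present in the response, and unlike the original it trims the result to exactly `num_records`. A fake fetcher stands in for the API so it runs without a network call.

```python
def collect_results(fetch_page, num_records=1000):
    """Gather result dicts page by page until num_records are collected.

    fetch_page is a hypothetical callable: page number -> decoded JSON dict.
    """
    records = []
    page = 1
    while len(records) < num_records:
        payload = fetch_page(page)
        results = payload.get("results", [])
        if not results:
            break  # nothing came back, so stop rather than loop forever
        records.extend(results)
        # Assumes the payload reports `total_pages`; if it is missing, keep going
        if page >= payload.get("total_pages", page + 1):
            break
        page += 1
    return records[:num_records]


# Stand-in for the real API: 3 pages of 2 records each
def fake_fetch(page):
    data = {
        1: [{"id": 1}, {"id": 2}],
        2: [{"id": 3}, {"id": 4}],
        3: [{"id": 5}, {"id": 6}],
    }
    return {"results": data.get(page, []), "total_pages": 3}


print(len(collect_results(fake_fetch, num_records=5)))    # 5
print(len(collect_results(fake_fetch, num_records=100)))  # 6
```

The second call asks for more records than exist and still returns, which is the case the plain `while` loop would not handle.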

### Run the download to pull initial data

For this I am going to stick with the default count and pull the top 1000 movies from this second, community-driven source.

In [3]:
moviedb_df = downloadMovieData(API_KEY)

# Let's look at the data a bit and see how we want to operate on it
display(moviedb_df.head())

Downloading page 1
Downloading page 2
Downloading page 3
Downloading page 4
Downloading page 5
Downloading page 6
Downloading page 7
Downloading page 8
Downloading page 9
Downloading page 10
Downloading page 11
Downloading page 12
Downloading page 13
Downloading page 14
Downloading page 15
Downloading page 16
Downloading page 17
Downloading page 18
Downloading page 19
Downloading page 20
Downloading page 21
Downloading page 22
Downloading page 23
Downloading page 24
Downloading page 25
Downloading page 26
Downloading page 27
Downloading page 28
Downloading page 29
Downloading page 30
Downloading page 31
Downloading page 32
Downloading page 33
Downloading page 34
Downloading page 35
Downloading page 36
Downloading page 37
Downloading page 38
Downloading page 39
Downloading page 40
Downloading page 41
Downloading page 42
Downloading page 43
Downloading page 44
Downloading page 45
Downloading page 46
Downloading page 47
Downloading page 48
Downloading page 49
Downloading page 50


Unnamed: 0,id,popularity,video,votes,vote_average,title,release_date,original_langauge,adult,overview
0,618344,132.19,False,303,8.8,Justice League Dark: Apokolips War,2020-05-05,en,False,Earth is decimated after intergalactic tyrant ...
1,640344,14.401,False,260,8.8,Me contro Te: Il film - La vendetta del Signor S,2020-01-17,it,False,Luì and Sofì fight the terrible Signor S once ...
2,19404,21.785,False,2277,8.8,Dilwale Dulhania Le Jayenge,1995-10-20,hi,False,"Raj is a rich, carefree, happy-go-lucky second..."
3,278,39.008,False,16106,8.7,The Shawshank Redemption,1994-09-23,en,False,Framed in the 1940s for the double murder of h...
4,537061,15.44,False,272,8.7,Steven Universe: The Movie,2019-09-02,en,False,"Two years after the events of ""Change Your Min..."


#### 1. Add a few extra and required fields.

We are again going to need a clean version of the title that we can merge on, so we apply the same sanitizing function to the title field. We also need to add ranking numbers.

In [4]:
# Attach the ranking
moviedb_df['ranking'] = range(1,moviedb_df.shape[0]+1)

# Create clean version of the title once more
def sanitize_str(dirty_string):
    # Remove special characters, trim whitespace, lowercase, collapse repeated spaces, then swap spaces for '_'
    # (raw string so that \s is read as a regex class rather than a string escape)
    return re.sub(' ', '_', re.sub(' +', ' ', re.sub(r'[^A-Za-z0-9\s]+', '', dirty_string).strip().lower()))

# Run full sanitization on the title field
moviedb_df['clean_title'] = moviedb_df.title.apply(sanitize_str)

# Take a quick look at what we've got now
display(moviedb_df.head())

Unnamed: 0,id,popularity,video,votes,vote_average,title,release_date,original_langauge,adult,overview,ranking,clean_title
0,618344,132.19,False,303,8.8,Justice League Dark: Apokolips War,2020-05-05,en,False,Earth is decimated after intergalactic tyrant ...,1,justice_league_dark_apokolips_war
1,640344,14.401,False,260,8.8,Me contro Te: Il film - La vendetta del Signor S,2020-01-17,it,False,Luì and Sofì fight the terrible Signor S once ...,2,me_contro_te_il_film_la_vendetta_del_signor_s
2,19404,21.785,False,2277,8.8,Dilwale Dulhania Le Jayenge,1995-10-20,hi,False,"Raj is a rich, carefree, happy-go-lucky second...",3,dilwale_dulhania_le_jayenge
3,278,39.008,False,16106,8.7,The Shawshank Redemption,1994-09-23,en,False,Framed in the 1940s for the double murder of h...,4,the_shawshank_redemption
4,537061,15.44,False,272,8.7,Steven Universe: The Movie,2019-09-02,en,False,"Two years after the events of ""Change Your Min...",5,steven_universe_the_movie
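
To make the cleaning rule concrete, here is a small sketch that runs the same steps as `sanitize_str` on a few titles. The first two come from the table above; `"Amélie"` and `"WALL·E"` are examples I added to show that anything outside `A-Za-z0-9` and whitespace is dropped outright, accented letters included.

```python
import re

def sanitize_str(dirty_string):
    # Same steps as the notebook helper: drop special characters, trim,
    # lowercase, collapse repeated spaces, then replace spaces with '_'
    stripped = re.sub(r'[^A-Za-z0-9\s]+', '', dirty_string).strip().lower()
    return re.sub(' ', '_', re.sub(' +', ' ', stripped))

examples = ["Blow-Up", "Steven Universe: The Movie", "Amélie", "WALL·E"]
for title in examples:
    print(title, "->", sanitize_str(title))
# Blow-Up -> blowup
# Steven Universe: The Movie -> steven_universe_the_movie
# Amélie -> amlie
# WALL·E -> walle
```

This matters for the later merge: two sources only line up on `clean_title` if both were cleaned with exactly the same rule.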


#### 2. Confirm that there are no duplicates

The ID field in the movie DB data is the primary identifier, so we can use it to confirm that we don't have duplicates.

In [5]:
print("Currently have {} records before de-dup".format(moviedb_df.shape[0]))
moviedb_df = moviedb_df.drop_duplicates(subset=["id"])
print("Left with {} records after de-dup".format(moviedb_df.shape[0]))

# So we still have all our records but we can confirm that there are no duplicates

Currently have 1000 records before de-dup
Left with 1000 records after de-dup
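
Comparing row counts before and after works, but it doesn't show which rows were duplicated. A small sketch on a toy frame (the data here is made up for illustration) shows two other standard pandas checks, `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# Toy data with one repeated id, purely for illustration
toy = pd.DataFrame({"id": [278, 19404, 278], "title": ["A", "B", "A again"]})

# Quick yes/no answer
print(toy["id"].is_unique)

# Show every row involved in a duplicate, not just the later copies
dupes = toy[toy["id"].duplicated(keep=False)]
print(dupes)

# Same de-dup call as in the cell above
deduped = toy.drop_duplicates(subset=["id"])
print(deduped["id"].is_unique)
```

`drop_duplicates` keeps the first occurrence by default, so in the toy frame the row titled "A" survives and "A again" is dropped.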


#### 3. Re-ordering and adding new field

Something that I am noticing here is that the average vote isn't exactly the best indicator, since some movies have a high average vote but a low number of votes and low popularity.

In [6]:
# Going to assign ranking by popularity and then by number of votes
moviedb_df = moviedb_df.sort_values(by=['popularity'],ascending=False)
moviedb_df['popularity_ranking'] = range(1,moviedb_df.shape[0]+1)

# By vote counts
moviedb_df = moviedb_df.sort_values(by=['votes'],ascending=False)
moviedb_df['vote_count_ranking'] = range(1,moviedb_df.shape[0]+1)

# Re-sort by the ranking given by the db (descending, so rank 1000 comes first)
moviedb_df = moviedb_df.sort_values(by=['ranking'],ascending=False)

display(moviedb_df.head())

Unnamed: 0,id,popularity,video,votes,vote_average,title,release_date,original_langauge,adult,overview,ranking,clean_title,popularity_ranking,vote_count_ranking
19,5967,10.439,False,323,7.5,The Umbrellas of Cherbourg,1964-02-19,fr,False,This simple romantic tragedy begins in 1957. G...,1000,the_umbrellas_of_cherbourg,680,808
18,11471,10.912,False,212,7.5,A Better Tomorrow,1986-08-02,zh,False,A reforming ex-gangster tries to reconcile wit...,999,a_better_tomorrow,631,976
17,522,11.399,False,1378,7.5,Ed Wood,1994-09-27,en,False,"The mostly true story of the legendary ""worst ...",998,ed_wood,598,404
16,1052,10.51,False,647,7.5,Blow-Up,1966-12-18,en,False,A successful mod photographer in London whose ...,997,blowup,671,597
15,3035,20.187,False,775,7.5,Frankenstein,1931-11-21,en,False,Dr Henry Frankenstein is obsessed with assembl...,996,frankenstein,226,552
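
The sort-then-number approach above can also be written without re-sorting the frame. This is only a sketch of an alternative, not what the cell above does: it uses `Series.rank` on a toy frame with made-up values. With `method="first"` ties are broken by row order, which may differ from how `sort_values` happens to order tied rows.

```python
import pandas as pd

# Toy values, made up for illustration
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "popularity": [10.4, 20.2, 11.4, 15.0],
    "votes": [323, 775, 1378, 212],
})

# Rank 1 = highest value; the row order of the frame is left untouched
toy["popularity_ranking"] = toy["popularity"].rank(method="first", ascending=False).astype(int)
toy["vote_count_ranking"] = toy["votes"].rank(method="first", ascending=False).astype(int)

print(toy)
```

Here `popularity_ranking` comes out as 4, 1, 3, 2 and `vote_count_ranking` as 3, 2, 1, 4, reading down the rows.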


#### 4. Convert values

Some of the fields aren't typed or interpreted correctly. The main problem is the date field; the rest are fine.

In [7]:
# Dtypes before conversion
print("Data types before conversion")
print(moviedb_df.dtypes)
print("")

# The date field needs to be converted to a datetime type so that it is treated as a date
moviedb_df['release_date'] = pd.to_datetime(moviedb_df.release_date)

# Values should be the same, but pandas now treats the column as a datetime
print("Data types after conversion")
print(moviedb_df.dtypes)
display(moviedb_df.head())

Data types before conversion
id                      int64
popularity            float64
video                    bool
votes                   int64
vote_average          float64
title                  object
release_date           object
original_langauge      object
adult                    bool
overview               object
ranking                 int64
clean_title            object
popularity_ranking      int64
vote_count_ranking      int64
dtype: object

Data types after conversion
id                             int64
popularity                   float64
video                           bool
votes                          int64
vote_average                 float64
title                         object
release_date          datetime64[ns]
original_langauge             object
adult                           bool
overview                      object
ranking                        int64
clean_title                   object
popularity_ranking             int64
vote_count_ranking             int64
dtype: object

Unnamed: 0,id,popularity,video,votes,vote_average,title,release_date,original_langauge,adult,overview,ranking,clean_title,popularity_ranking,vote_count_ranking
19,5967,10.439,False,323,7.5,The Umbrellas of Cherbourg,1964-02-19,fr,False,This simple romantic tragedy begins in 1957. G...,1000,the_umbrellas_of_cherbourg,680,808
18,11471,10.912,False,212,7.5,A Better Tomorrow,1986-08-02,zh,False,A reforming ex-gangster tries to reconcile wit...,999,a_better_tomorrow,631,976
17,522,11.399,False,1378,7.5,Ed Wood,1994-09-27,en,False,"The mostly true story of the legendary ""worst ...",998,ed_wood,598,404
16,1052,10.51,False,647,7.5,Blow-Up,1966-12-18,en,False,A successful mod photographer in London whose ...,997,blowup,671,597
15,3035,20.187,False,775,7.5,Frankenstein,1931-11-21,en,False,Dr Henry Frankenstein is obsessed with assembl...,996,frankenstein,226,552
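
The conversion above went through cleanly on this pull. If a later pull contained a blank or malformed `release_date`, one option would be to coerce bad values to `NaT` instead of raising. The sketch below assumes the dates arrive as `YYYY-MM-DD` strings, as they do in the table above, and uses a made-up series:

```python
import pandas as pd

# Made-up input: one good date, one empty string, one malformed value
raw = pd.Series(["1994-09-23", "", "not-a-date"])

# errors="coerce" turns anything that can't be parsed into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed.isna().tolist())
print(parsed.dt.year)
```

The `isna()` check gives a quick count of how many rows would need attention before any date arithmetic.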


#### 5. Limiting columns and renaming columns

Again, a small number of fields need to be renamed. Excess fields would also be removed at this point, but that isn't really an issue this time.

In [8]:
# Now finalize the structure and the naming of everything

final_moviesdb_df = moviedb_df[["id","ranking","popularity_ranking","vote_count_ranking","popularity","votes","vote_average",
                                      "title","clean_title","release_date","original_langauge","adult","overview"]]

final_moviesdb_df.columns = ["moviedb_id","moviedb_ranking","moviedb_pop_ranking","movie_dbvote_count_ranking","moviedb_popularity","moviedb_votes","moviedb_vote_average",
                                      "moviedb_title","clean_title","moviedb_release_date","original_langauge","adult","moviedb_overview"]

display(final_moviesdb_df.head())

Unnamed: 0,moviedb_id,moviedb_ranking,moviedb_pop_ranking,movie_dbvote_count_ranking,moviedb_popularity,moviedb_votes,moviedb_vote_average,moviedb_title,clean_title,moviedb_release_date,original_langauge,adult,moviedb_overview
19,5967,1000,680,808,10.439,323,7.5,The Umbrellas of Cherbourg,the_umbrellas_of_cherbourg,1964-02-19,fr,False,This simple romantic tragedy begins in 1957. G...
18,11471,999,631,976,10.912,212,7.5,A Better Tomorrow,a_better_tomorrow,1986-08-02,zh,False,A reforming ex-gangster tries to reconcile wit...
17,522,998,598,404,11.399,1378,7.5,Ed Wood,ed_wood,1994-09-27,en,False,"The mostly true story of the legendary ""worst ..."
16,1052,997,671,597,10.51,647,7.5,Blow-Up,blowup,1966-12-18,en,False,A successful mod photographer in London whose ...
15,3035,996,226,552,20.187,775,7.5,Frankenstein,frankenstein,1931-11-21,en,False,Dr Henry Frankenstein is obsessed with assembl...
