### Milestone 2: Assembling training data, due Wednesday, April 12, 2017

We are aware that you have little time this week, due to the midterm. So this milestone is a bit easier to achieve than the others. The goal for this week is to prepare the data for the modeling phase of the project. You should end up with a typical data setup of training data X and data labels Y.

The exact form of X and Y depends on the ideas you had previously. In general, though, Y should involve the genre of a movie, and X the features you want to include to predict the genre. Remember from the lecture that more features do not necessarily mean better prediction performance. Use your application knowledge and the insights you gathered from your genre pair analysis and additional EDA to design Y. Do you want to include all genres? Are there genres that you assume to be easier to separate than others? Are there genres that could be grouped together? There is no one right answer here. We are looking for your insight, so be sure to describe your decision process in your notebook.
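
One possible way to encode such a choice of Y is a multi-hot matrix with one column per (possibly merged) genre. The sketch below is only an illustration: `GENRE_GROUPS` and `encode_labels` are hypothetical names, and merging Mystery into Thriller is an arbitrary example, not a recommendation.

```python
# Hypothetical grouping: map some genre names onto a coarser label.
# Genres missing from the mapping keep their own name.
GENRE_GROUPS = {"Mystery": "Thriller"}

def encode_labels(genre_lists, groups=GENRE_GROUPS):
    """Return (label_names, Y) where Y[i][j] is 1 if movie i carries label j."""
    # Apply the grouping and drop duplicates created by merging
    merged = [sorted(set(groups.get(g, g) for g in names)) for names in genre_lists]
    label_names = sorted(set(label for names in merged for label in names))
    column = {label: j for j, label in enumerate(label_names)}
    Y = []
    for names in merged:
        row = [0] * len(label_names)
        for label in names:
            row[column[label]] = 1
        Y.append(row)
    return label_names, Y

labels, Y = encode_labels([["Horror", "Mystery"], ["Romance"], ["Thriller", "Horror"]])
```

Keeping the grouping in a single dictionary makes the decision easy to document and to change when the genre pair analysis suggests a different split.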

In preparation for the deep learning part, we strongly encourage you to have two sets of training data X, one with the metadata and one with the movie posters. Make sure to have a common key, like the movie ID, to be able to link the two sets together. Also be mindful of your request rate when you obtain the posters: time your requests and choose the poster resolution you actually need. In most cases w500 should be sufficient, and a lower resolution will probably be fine as well.
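
A minimal sketch of how posters could be tied to the metadata through the movie ID. It assumes the commonly used TMDB image base URL (the authoritative value is returned by TMDB's configuration endpoint). The helper names, the `posters` folder, and the default size `w185` are illustrative assumptions.

```python
import os

# Assumed image base URL; check TMDB's configuration endpoint for the current value.
IMAGE_BASE_URL = "https://image.tmdb.org/t/p/"

def poster_url(poster_path, size="w185"):
    """Build the download URL for a poster_path such as '/abc.jpg'."""
    return IMAGE_BASE_URL + size + poster_path

def poster_filename(movie_id, poster_path, folder="posters"):
    """Name the local file after the movie ID so it can be joined to the metadata."""
    extension = os.path.splitext(poster_path)[1]
    return os.path.join(folder, str(movie_id) + extension)

url = poster_url("/9ZdlnjePpUhxRWz69KjqhIe7H5U.jpg")
path = poster_filename(316021, "/9ZdlnjePpUhxRWz69KjqhIe7H5U.jpg")
```

Because the file name carries the movie ID, the image set and the metadata table can be linked without a separate lookup table.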

The notebook to submit this week should at least include:

- Discussion about the imbalanced nature of the data and how you want to address it
- Description of your data
- What does your choice of Y look like?
- Which features do you choose for X and why? 
- How do you sample your data, how many samples, and why?

*Important*: You do not need to upload the data itself to Canvas.
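
For the imbalance discussion, a quick look at how often each genre occurs is a natural starting point. The sketch below computes per-genre frequencies and inverse-frequency weights. This is one common option among several (resampling is another), and the function names are hypothetical.

```python
from collections import Counter

def genre_frequencies(genre_lists):
    """Fraction of movies carrying each genre (a movie can carry several)."""
    counts = Counter(g for names in genre_lists for g in set(names))
    total = float(len(genre_lists))
    return {genre: n / total for genre, n in counts.items()}

def inverse_frequency_weights(frequencies):
    """Weight each genre by 1/frequency, rescaled so the most common genre gets weight 1."""
    top = max(frequencies.values())
    return {genre: top / f for genre, f in frequencies.items()}

freqs = genre_frequencies([["Drama"], ["Drama", "Romance"], ["Drama"], ["Horror"]])
weights = inverse_frequency_weights(freqs)
```

In this toy input Drama appears in three of four movies, so the rarer genres receive a proportionally larger weight.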

In [2]:
### Let's import some libraries!
import imdb
# import httplib
import json
import requests
import pandas as pd
import numpy as np
import time
import wget
import matplotlib
# import random
# from sklearn import cluster
# from sklearn import metrics
# from IPython.display import IFrame
# from sklearn.decomposition import PCA
%matplotlib inline

In [115]:
## Get the genre codes from TMDB
payload = {'api_key': '9290a6fe9125b32e7bbe5512036be0d0'}
r = requests.get('https://api.themoviedb.org/3/genre/movie/list', params=payload)

genres = pd.DataFrame.from_dict(r.json()["genres"])
genres = genres.set_index("id")
print genres

                  name
id                    
28              Action
12           Adventure
16           Animation
35              Comedy
80               Crime
99         Documentary
18               Drama
10751           Family
14             Fantasy
36             History
27              Horror
10402            Music
9648           Mystery
10749          Romance
878    Science Fiction
10770         TV Movie
53            Thriller
10752              War
37             Western


In [116]:
genres = genres["name"].to_dict()
genres

{12: u'Adventure',
 14: u'Fantasy',
 16: u'Animation',
 18: u'Drama',
 27: u'Horror',
 28: u'Action',
 35: u'Comedy',
 36: u'History',
 37: u'Western',
 53: u'Thriller',
 80: u'Crime',
 99: u'Documentary',
 878: u'Science Fiction',
 9648: u'Mystery',
 10402: u'Music',
 10749: u'Romance',
 10751: u'Family',
 10752: u'War',
 10770: u'TV Movie'}

In [103]:
### Queries to TMDB
    
# Query the TMDB discover endpoint year by year, fetching up to page_limit pages per year
def get_movies(years, page_limit):
    # Outer loop for every year in range
    for i, year in enumerate(years):
        start = time.time()
        
        # Define initial API parameters for genre and year 
        payload = {'api_key': '9290a6fe9125b32e7bbe5512036be0d0',
                   'sort_by':'popularity.desc',
                   'primary_release_year': year,
                   'page': 1,
                   'language':'en-US',
                   'with_genres':"878|27|10749"}

        r = requests.get('https://api.themoviedb.org/3/discover/movie?', params=payload)
        print 'For ', year, ' there are ', r.json()['total_results'], ' total results across ', r.json()['total_pages'], ' total pages.'
        
        # For first year, create the data frame. Otherwise, add first page to it.
        if i == 0:
            tmdb_movies = pd.io.json.json_normalize(r.json()['results'])
        else:
            tmdb_movies = pd.concat([tmdb_movies, pd.io.json.json_normalize(r.json()['results'])])
        
        # Set max pages to the smaller of page_limit and the total number of pages
        if r.json()['total_pages'] < page_limit:
            page_max = r.json()['total_pages']
        else:
            page_max = page_limit
        
        # Wait function for polite API querying
        delay = time.time()-start
        if delay < 0.25:
            time.sleep(0.25-delay)
        
        if page_max > 1:
            # Inner loop for every page up to max, starting with page 2.
            for page in range(2, page_max+1):
                start = time.time()

                payload = {'api_key': '9290a6fe9125b32e7bbe5512036be0d0',
                       'sort_by':'popularity.desc',
                       'primary_release_year': year,
                       'page': page,
                       'language':'en-US',
                       'with_genres':"878|27|10749"}
                
                r = requests.get('https://api.themoviedb.org/3/discover/movie?', params=payload)
                
                tmdb_movies = pd.concat([tmdb_movies, pd.io.json.json_normalize(r.json()['results'])])
                
                delay = time.time()-start
                if delay < 0.25:
                    time.sleep(0.25-delay)

    return tmdb_movies
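
The two request blocks in `get_movies` differ only in the page number, so the paging and the wait logic could be factored into one loop. The following is a sketch, not the notebook's code: `fetch_page` is a hypothetical callable standing in for the `requests.get(...).json()` call, which also lets the loop be tried without network access.

```python
import time

def fetch_all_pages(fetch_page, page_limit, min_interval=0.25):
    """Collect 'results' from pages 1..min(total_pages, page_limit).

    fetch_page(page) must return a dict with 'results' and 'total_pages',
    shaped like the JSON body of a discover response.
    """
    results = []
    page, page_max = 1, 1
    while page <= page_max:
        start = time.time()
        body = fetch_page(page)
        results.extend(body["results"])
        page_max = min(body["total_pages"], page_limit)
        page += 1
        # Wait so that consecutive requests are at least min_interval apart
        elapsed = time.time() - start
        if page <= page_max and elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return results

# Stand-in for the real API call: 7 pages with one fake movie each
fake = lambda page: {"results": [{"id": page}], "total_pages": 7}
rows = fetch_all_pages(fake, page_limit=5, min_interval=0.0)
```

Collecting plain dictionaries and building the data frame once at the end also avoids calling `pd.concat` repeatedly inside the loop.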

In [104]:
movies = get_movies(range(1930,2017), 5)
movies.shape

For  1930  there are  122  total results across  7  total pages.
For  1931  there are  122  total results across  7  total pages.
For  1932  there are  139  total results across  7  total pages.
For  1933  there are  149  total results across  8  total pages.
For  1934  there are  146  total results across  8  total pages.
For  1935  there are  166  total results across  9  total pages.
For  1936  there are  141  total results across  8  total pages.
For  1937  there are  168  total results across  9  total pages.
For  1938  there are  138  total results across  7  total pages.
For  1939  there are  123  total results across  7  total pages.
For  1940  there are  126  total results across  7  total pages.
For  1941  there are  144  total results across  8  total pages.
For  1942  there are  132  total results across  7  total pages.
For  1943  there are  103  total results across  6  total pages.
For  1944  there are  104  total results across  6  total pages.
For  1945  there are  86 

(8662, 14)

In [126]:
movies.tail(20)

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/2pqYl9JHfzBuZbnwf3m59s4tvDT.jpg,"[18, 9648, 10749]",316021,en,Frank & Lola,"A psychosexual noir love story, set in Las Veg...",2.0406,/9ZdlnjePpUhxRWz69KjqhIe7H5U.jpg,2016-12-09,Frank & Lola,False,6.1,25
1,False,/wZKnVnWOFYUy8U58btL5uG4U372.jpg,"[35, 10749]",366514,fr,Un homme à la hauteur,"Diane is a well-known lawyer, divorced for thr...",2.028859,/jJUKI7seMw2S5DKZSBfUPiWKzfU.jpg,2016-05-04,Up for Love,False,5.9,135
2,False,/6OmrokBxCpkjGmBbYu9WW5sRPha.jpg,"[28, 27]",410988,en,The Stakelander,When his home of New Eden is destroyed by a re...,2.022309,/fded6orgVzm4EFwZYcNfZJU3Ven.jpg,2016-10-14,Stake Land II: The Stakelander,False,5.3,14
3,False,/6nnK9s4z96lh5KnNOcLMjXo9je5.jpg,"[18, 10749]",332872,es,Julieta,The film spans 30 years in Julieta’s life from...,2.021858,/mwG0YDtmDtEw7371WLIBxmU0xXT.jpg,2016-04-08,Julieta,False,6.8,172
4,False,/jw15rAjCHoNJkovcLOu6xb7ecmO.jpg,"[35, 10749]",369299,es,¿Qué culpa tiene el niño?,"Maru, after becoming pregnant from a drunken o...",2.016717,/eWPD4RO7QQjiD8evdnQ0pUmGC33.jpg,2016-05-13,Why Blame It on the Child?,False,6.2,46
5,False,/gLMrWlscDZCEB72PZF4eqaRphEM.jpg,"[53, 12, 878, 9648]",401222,en,ISRA 88,When a scientist and a pilot volunteer for a h...,2.012973,/rZTJZkSnbEOiXLxdh2Wrn6mdCSJ.jpg,2016-08-08,ISRA 88,False,4.8,8
6,False,/ts2FEI39w6NBOhyqGxPJTidNLuz.jpg,"[35, 27, 53]",320413,en,The Greasy Strangler,"Ronnie runs a Disco walking tour with his son,...",2.003028,/6ZBHD1ZvYuP0uhYuOeTMN4Kgxvl.jpg,2016-10-07,The Greasy Strangler,False,5.2,29
7,False,/eafe4lNtWchWfCRXFnpcpKgLomk.jpg,"[53, 27]",364116,en,The Other Side of the Door,"Grieving over the loss of her son, a mother st...",2.000022,/hLpbfpkh5MBt86LokeyoReK0yCH.jpg,2016-02-25,The Other Side of the Door,False,5.1,202
8,False,/pT9vBaWZqrTC0ExZF9OCsa6p5zK.jpg,[27],421601,en,Bornless Ones,"With the help of her friends, Emily moves to a...",1.97414,/4V52bmZg4tR96hiuNX3r5h08GvN.jpg,2016-10-06,Bornless Ones,False,5.1,10
9,False,/iig2oRVfYZdIuY5ygT16pi2G5oz.jpg,"[18, 10749]",376166,fr,Mal de pierres,An adaptation of Milena Agus' eponymous novel ...,1.941451,/8XAykE6hBjPvg1hbZoPruRSQEW4.jpg,2016-10-19,From the Land of the Moon,False,6.6,36


In [137]:
movies.to_csv("movies_from_1930.csv", encoding = "utf-8")

In [106]:
movies["original_language"].value_counts()

en    6842
fr     355
ja     274
it     267
es     198
de     183
ru     122
cn      47
da      38
sv      37
hi      35
cs      34
zh      30
ko      26
pt      24
tr      23
el      17
nl      14
pl      11
hu       8
th       7
ta       7
sr       7
fi       6
tl       6
ab       6
ar       5
he       5
ml       4
no       3
sh       2
id       2
nb       2
ka       2
mr       1
bn       1
sl       1
te       1
sk       1
wo       1
ms       1
bs       1
ca       1
ur       1
is       1
vi       1
eo       1
Name: original_language, dtype: int64

In [131]:
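def name_genre(genre_ids):
    # Translate a list of TMDB genre ids into genre names.
    # Returns None (not a partial list) if any id is missing from `genres`.
    names = []
    for genre_id in genre_ids:
        if genre_id not in genres:
            return None
        else:
            names.append(genres[genre_id])
    return names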

In [132]:
name_genre([27, 28])

[u'Horror', u'Action']
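
`name_genre` returns `None` as soon as one id is missing from `genres`, which discards the names that were found. If partial lists are preferable, a variant could skip unknown ids instead. This is a sketch of an alternative. `name_known_genres` and `genre_lookup` are hypothetical names, and 99999 is a made-up id used only to show the skipping.

```python
def name_known_genres(genre_ids, genres):
    """Translate genre ids to names, dropping ids missing from the genres dict."""
    return [genres[genre_id] for genre_id in genre_ids if genre_id in genres]

genre_lookup = {27: "Horror", 28: "Action"}
names = name_known_genres([27, 99999, 28], genre_lookup)
```

Which behaviour is appropriate depends on whether a movie with an unrecognised genre should stay in the training data at all.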

In [135]:
movies["genre_names"] = movies.apply(lambda x: name_genre(x["genre_ids"]),axis=1)

In [149]:
def includes_genre(target_genre, genre_names):
    # name_genre returns None for movies with an unknown genre id; without this
    # guard the `in` test fails with "argument of type 'NoneType' is not iterable".
    if genre_names is None:
        return 0
    if target_genre in genre_names:
        return 1
    else:
        return 0

In [151]:
movies["Science Fiction"] = movies.apply(lambda x: includes_genre("Science Fiction", x["genre_names"]),axis=1)
