## AC209b / CS109b Final Project - Milestone 2 Part 01
Yujiao Chen, Brian Ho, Jonathan Jay // 04/12/2017

_We are aware that you have little time this week, due to the midterm. So this milestone is a bit easier to achieve than the others. The goal for this week is to prepare the data for the modeling phase of the project. You should end up with a typical data setup of training data X and data labels Y._

_The exact form of X and Y depends on the ideas you had previously. In general though Y should involve the genre of a movie, and X the features you want to include to predict the genre. Remember from the lecture that more features does not necessarily equal better prediction performance. Use your application knowledge and the insight you gathered from your genre pair analysis and additional EDA to design Y. Do you want to include all genres? Are there genres that you assume to be easier to separate than others? Are there genres that could be grouped together? There is no one right answer here. We are looking for your insight, so be sure to describe your decision process in your notebook._

_In preparation for the deep learning part we strongly encourage you to have two sets of training data X, one with the metadata and one with the movie posters. Make sure to have a common key, like the movie ID, to be able to link the two sets together. Also be mindful of the data rate when you obtain the posters. Time your requests and choose which poster resolution you need. In most cases w500 should be sufficient, and probably a lower resolution will be fine._

The notebook to submit this week should at least include:

- Discussion about the imbalanced nature of the data and how you want to address it
- Description of your data
- What does your choice of Y look like?
- Which features do you choose for X and why? 
- How do you sample your data, how many samples, and why?

*Important*: You do not need to upload the data itself to Canvas.

---

## Proposal

### 1. Primary research question: genre prediction
 - How well do poster images predict three genres (romance, horror and sci-fi)?
 - How well does the text of movie titles and descriptions predict these genres?

**Data setup (Y):**
In Milestone 1 we found that the movie databases contain around 20 genre classes, some of which are much more common than others; that many movies are assigned multiple genre classes; and that a movie's genre assignments can vary across databases. For this project, however, we will sample only movies assigned to one of these three classes (excluding movies assigned to more than one of the three, but including movies which have also been assigned additional genres). Each genre will constitute exactly 1/3 of our sample. These movies are selected by popularity, on our assumption that more-popular movies are more representative of their genres, and balanced by year from 1954-present, for reasons of interest in part 2 below. Although the full dataset is over 8000 movies, we will select only the top 2000 movies by popularity, balanced across years, and with 20% reserved for testing. The subsampled size is based on (i) the maximum we believe is computationally feasible for question 1a, and (ii) our a priori belief that less-popular movies from within each genre may be less "pure" representations of the genre. The 80-20 split between training and testing data reflects accepted practice within the discipline. We will account for genre in assigning movies to the testing set, yielding perfectly balanced training and testing sets that will make it easier to detect trends in classifier performance.
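
The sampling scheme described above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `balanced_sample`, the `genre` label column (one label per movie, already assigned), and the `n_per_class` parameter are hypothetical, and the sketch does not attempt the balancing across years.

```python
import pandas as pd

def balanced_sample(movies, label_col="genre", n_per_class=5,
                    test_frac=0.2, seed=0):
    """Keep the n most popular movies per genre, then split each genre
    into training and testing rows so both sets stay balanced."""
    top = (movies.sort_values("popularity", ascending=False)
                 .groupby(label_col, sort=False)
                 .head(n_per_class)
                 .reset_index(drop=True))  # unique index is needed for drop()
    # Draw the test rows separately within each genre (stratified split).
    test = top.groupby(label_col).sample(frac=test_frac, random_state=seed)
    train = top.drop(test.index)
    return train, test
```

Because the test rows are drawn within each genre, a sample with equal genre counts yields training and testing sets that are each balanced as well.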

**Comments:**
This sampling method eliminates the imbalance problem in the broader database, while allowing us to answer the focused research question of how well CNNs can learn to distinguish among these genres. The particular genres were chosen using the correlation matrix we produced in Milestone 1, finding little overlap among these genres, and based on their sociological relevance. Choosing three classes allows us to consider relative distances among the classes and compare predictor performance in distinguishing among them--e.g., our hypothesis is that among these classes, sci-fi and horror are comparatively closer to each other than to romance, and will therefore be slightly harder to predict accurately.

**Data setup (X):**
We are most interested in movie posters and movie titles & descriptions as predictors. We propose to optimize predictions using primarily (or exclusively) these predictors, rather than attempting to optimize predictions using whatever additional predictor data we might be able to access. Our initial thinking is to run models using these predictors separately (i.e. (a) poster vs. genre and (b) description vs. genre).  

**Comments:**
We prefer this approach because it will allow us to consider, in more depth, the relationship between each predictor and the genre classification. These features are all constructed with the intention of conveying information about the movie to prospective viewers (as opposed to, for example, language or director). We expect these features have a true relationship with genre, allowing the possibility of good classification accuracy — our submission this week includes an exploratory modeling exercise to predict horror vs. romance using PCA/SVM with test set accuracy of 87%, demonstrating the feasibility of the general approach. We also believe that the nature of this relationship represents an interesting research question: i.e. how effectively do title/descriptions and posters convey genre, and how well can algorithms learn to detect this relationship? Thus, while including additional features might (or might not) improve classification accuracy, they are not as relevant to the research questions that interest us most. 



### 2. Secondary research question: poster age identification
 - Can CNNs predict a movie's release decade based on its poster?

**Brief discussion:**
Time permitting, we would like to set up a distinct classification task in which we sample from one genre (most likely sci-fi) and train a CNN to predict decade (e.g. 1960s/70s/80s/90s/00s/10s). We think this task may be of greatest substantive interest for science fiction movies, where we believe posters are representative of the era's visualizations of alternative realities. Have these changed over time in ways that a CNN can learn to identify? A priori we think color combinations may be especially predictive of decade. 

## Data Collection
Getting TMDB metadata. Poster collection is occurring separately, and for reasons of size the posters are not uploaded to Canvas.
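
Since the posters are fetched separately, here is a minimal sketch of how a poster URL might be built from the `poster_path` field in the metadata, together with a small wait helper in the spirit of the 0.25-second delay used for the metadata queries. The helper names `poster_url` and `wait_remaining` are hypothetical, and the image base URL and size label are assumptions that should be checked against TMDB's configuration endpoint.

```python
import time

# Assumed TMDB image base URL; verify against the API's configuration data.
IMAGE_BASE = "https://image.tmdb.org/t/p/"

def poster_url(poster_path, size="w500"):
    """Build an image URL from a `poster_path` value such as '/abc.jpg'."""
    if not poster_path:
        return None  # some movies have no poster
    return "{}{}{}".format(IMAGE_BASE, size, poster_path)

def wait_remaining(start, min_interval=0.25):
    """Sleep until at least `min_interval` seconds have passed since `start`.
    Returns the number of seconds slept."""
    remaining = min_interval - (time.time() - start)
    if remaining > 0:
        time.sleep(remaining)
        return remaining
    return 0.0
```

Keeping the movie `id` alongside each downloaded file (for example in the file name) would provide the common key for linking posters to the metadata.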

In [1]:
## Some code to get data
## Let's import some libraries!
import imdb
import json
import requests
import pandas as pd
import numpy as np
import time
import matplotlib
%matplotlib inline

In [2]:
## Get the genre codes from TMDB
payload = {'api_key': '9290a6fe9125b32e7bbe5512036be0d0'}
r = requests.get('https://api.themoviedb.org/3/genre/movie/list', params=payload)

genres = pd.DataFrame.from_dict(r.json()["genres"])
genres = genres.set_index("id")
print genres

                  name
id                    
28              Action
12           Adventure
16           Animation
35              Comedy
80               Crime
99         Documentary
18               Drama
10751           Family
14             Fantasy
36             History
27              Horror
10402            Music
9648           Mystery
10749          Romance
878    Science Fiction
10770         TV Movie
53            Thriller
10752              War
37             Western


In [3]:
genres = genres["name"].to_dict()
genres

{12: u'Adventure',
 14: u'Fantasy',
 16: u'Animation',
 18: u'Drama',
 27: u'Horror',
 28: u'Action',
 35: u'Comedy',
 36: u'History',
 37: u'Western',
 53: u'Thriller',
 80: u'Crime',
 99: u'Documentary',
 878: u'Science Fiction',
 9648: u'Mystery',
 10402: u'Music',
 10749: u'Romance',
 10751: u'Family',
 10752: u'War',
 10770: u'TV Movie'}

In [4]:
### Queries to TMDB
    
# Query TMDB's discover endpoint by year and page, most popular first
def get_movies(years, page_limit, genre):
    # Outer loop for every year in range
    for i, year in enumerate(years):
        start = time.time()
        
        # Define initial API parameters for genre and year 
        payload = {'api_key': '9290a6fe9125b32e7bbe5512036be0d0',
                   'sort_by':'popularity.desc',
                   'primary_release_year': year,
                   'page': 1,
                   'language':'en-US',
                   'with_genres': genre} #"878|27|10749"

        r = requests.get('https://api.themoviedb.org/3/discover/movie?', params=payload)
        print 'For ', year, ' there are ', r.json()['total_results'], ' total results across ', r.json()['total_pages'], ' total pages.'
        
        # For first year, create the data frame. Otherwise, add first page to it.
        if i == 0:
            tmdb_movies = pd.io.json.json_normalize(r.json()['results'])
        else:
            tmdb_movies = pd.concat([tmdb_movies, pd.io.json.json_normalize(r.json()['results'])])
        
        # Set max pages to the smaller of page_limit or the total number of pages
        if r.json()['total_pages'] < page_limit:
            page_max = r.json()['total_pages']
        else:
            page_max = page_limit
        
        # Wait function for polite API querying
        delay = time.time()-start
        if delay < 0.25:
            time.sleep(0.25-delay)
        
        if page_max > 1:
            # Inner loop for every page up to max, starting with page 2.
            for page in range(2, page_max+1):
                start = time.time()

                payload = {'api_key': '9290a6fe9125b32e7bbe5512036be0d0',
                       'sort_by':'popularity.desc',
                       'primary_release_year': year,
                       'page': page,
                       'language':'en-US',
                       'with_genres': genre}#"878|27|10749"}
                
                r = requests.get('https://api.themoviedb.org/3/discover/movie?', params=payload)
                
                tmdb_movies = pd.concat([tmdb_movies, pd.io.json.json_normalize(r.json()['results'])])
                
                delay = time.time()-start
                if delay < 0.25:
                    time.sleep(0.25-delay)

    return tmdb_movies

In [5]:
# Get science fiction movies from 1930
movies_scifi = get_movies(range(1930,2017), 2, 878)
movies_scifi.shape

For  1930  there are  2  total results across  1  total pages.
For  1931  there are  4  total results across  1  total pages.
For  1932  there are  6  total results across  1  total pages.
For  1933  there are  7  total results across  1  total pages.
For  1934  there are  7  total results across  1  total pages.
For  1935  there are  10  total results across  1  total pages.
For  1936  there are  13  total results across  1  total pages.
For  1937  there are  7  total results across  1  total pages.
For  1938  there are  8  total results across  1  total pages.
For  1939  there are  9  total results across  1  total pages.
For  1940  there are  11  total results across  1  total pages.
For  1941  there are  6  total results across  1  total pages.
For  1942  there are  5  total results across  1  total pages.
For  1943  there are  7  total results across  1  total pages.
For  1944  there are  7  total results across  1  total pages.
For  1945  there are  5  total results across  1  to

(2613, 14)

In [6]:
# Get horror movies from 1930
movies_horror = get_movies(range(1930,2017), 2, 27)
movies_horror.shape

For  1930  there are  2  total results across  1  total pages.
For  1931  there are  10  total results across  1  total pages.
For  1932  there are  23  total results across  2  total pages.
For  1933  there are  19  total results across  1  total pages.
For  1934  there are  10  total results across  1  total pages.
For  1935  there are  16  total results across  1  total pages.
For  1936  there are  15  total results across  1  total pages.
For  1937  there are  4  total results across  1  total pages.
For  1938  there are  4  total results across  1  total pages.
For  1939  there are  15  total results across  1  total pages.
For  1940  there are  17  total results across  1  total pages.
For  1941  there are  12  total results across  1  total pages.
For  1942  there are  18  total results across  1  total pages.
For  1943  there are  16  total results across  1  total pages.
For  1944  there are  19  total results across  1  total pages.
For  1945  there are  17  total results acr

(2771, 14)

In [7]:
# Get romance movies from 1930
movies_romance = get_movies(range(1930,2017), 2, 10749)
movies_romance.shape

For  1930  there are  119  total results across  6  total pages.
For  1931  there are  111  total results across  6  total pages.
For  1932  there are  113  total results across  6  total pages.
For  1933  there are  125  total results across  7  total pages.
For  1934  there are  131  total results across  7  total pages.
For  1935  there are  146  total results across  8  total pages.
For  1936  there are  121  total results across  7  total pages.
For  1937  there are  160  total results across  8  total pages.
For  1938  there are  126  total results across  7  total pages.
For  1939  there are  104  total results across  6  total pages.
For  1940  there are  106  total results across  6  total pages.
For  1941  there are  129  total results across  7  total pages.
For  1942  there are  113  total results across  6  total pages.
For  1943  there are  87  total results across  5  total pages.
For  1944  there are  83  total results across  5  total pages.
For  1945  there are  66  t

(3480, 14)

In [8]:
# Join our individual genre data frames
movies = pd.concat([movies_scifi, movies_horror, movies_romance])
print movies.shape
movies.head()

# To balance classes, let's only use data from years where there are at least 20 films per genre
movies["release_date"] = pd.to_datetime(movies["release_date"])
movies = movies[movies["release_date"] >= "1954"]
movies.head()

(8864, 14)


Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/AvYDpn52RVu9bRfbtVka9mpyYOt.jpg,"[53, 27, 878]",1678,ja,ゴジラ,Japan is thrown into a panic after several shi...,2.604126,/qGTJmSFytUDBtbJ7jFSYdQF1lNM.jpg,1954-11-03,Godzilla,False,6.7,152
1,False,/q6U0bHOWjD0F1fuzhNoH2a2ARr9.jpg,"[27, 9648, 878]",11071,en,Them!,Nuclear tests in the desert result in the grow...,1.84608,/ndYEgAzS01AlrUr4yetiJYOVMz1.jpg,1954-06-16,Them!,False,6.9,92
2,False,/jUfk0M6RQEJngy89afkAqxlASeI.jpg,"[12, 18, 878]",173,en,"20,000 Leagues Under the Sea",A ship sent to investigate a wave of mysteriou...,1.661457,/zANct2xGMgj6qZbCBetYfeOaFP.jpg,1954-12-23,"20,000 Leagues Under the Sea",False,6.8,128
3,False,/qNYG1cUHjFzSnkag3UNsWPxUMx1.jpg,[878],24212,en,Devil Girl from Mars,"An uptight, leather-clad female alien, armed w...",1.264081,/8L8gcfjM6vnRnYOig6GDOxADU7A.jpg,1954-05-01,Devil Girl from Mars,False,3.5,6
4,False,/kPhQN685slqtnhnFA7DuGe1ty2f.jpg,"[53, 878]",63333,en,Gog,A mechanical brain is programmed to sabotage t...,1.214214,/beMHrXHnf2UC3AF05R1KggJlRGf.jpg,1954-06-04,Gog,False,5.8,10
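
The proposal excludes movies assigned to more than one of the three target genres, and the rows above show that such overlaps do occur (Godzilla carries both 27 and 878). A minimal sketch of one way to derive a single label from `genre_ids` follows; the names `TARGET_GENRES` and `target_label` and the short label strings are hypothetical.

```python
# TMDB genre ids for the three target classes (from the genre table above).
TARGET_GENRES = {878: "scifi", 27: "horror", 10749: "romance"}

def target_label(genre_ids):
    """Return the class label if exactly one target genre is present,
    otherwise None (no target genre, or more than one)."""
    hits = [TARGET_GENRES[g] for g in genre_ids if g in TARGET_GENRES]
    return hits[0] if len(hits) == 1 else None
```

Applied to the data frame, this might look like `movies["genre"] = movies["genre_ids"].apply(target_label)` followed by dropping rows where the label is missing. A movie returned by more than one genre query appears more than once after the concatenation, so dropping duplicates on `id` first would also be sensible.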


In [10]:
# Export dataframe
movies.to_csv("movies_from_1954.csv", encoding = "utf-8")

## Word Cloud and PCA Prep

Here we accomplished several tasks: 1) we collected horror and romance movies from the top 10000 movie database, analyzed the contents of each movie's title and overview, and created word clouds showing the most common words in the two genres; 2) we created a corpus from these movies, filtered it to the most frequent words, and used a long boolean vector to indicate each word's appearance in each movie's title or overview; 3) we conducted PCA, chose the first PCs that explain 80% of the variance in the data, cleaned the data format, and output .csv files for further PCA and SVM study in R.
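
As a rough illustration of step 2, the boolean word vectors could be built along the following lines. This is a sketch under assumptions rather than the code in the part 02 notebook: the function names, the simple regular-expression tokenizer, and the choice to rank words by the number of movies they appear in are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(texts, max_words=1000, stopwords=()):
    """Return the most frequent words, counted once per movie."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)) - set(stopwords))
    return [word for word, _ in doc_freq.most_common(max_words)]

def to_boolean_vectors(texts, vocabulary):
    """One row per movie; entry j is 1 if vocabulary[j] appears in the text."""
    rows = []
    for text in texts:
        present = set(tokenize(text))
        rows.append([int(word in present) for word in vocabulary])
    return rows
```

Each text here would be a movie's title and overview joined into one string, so a word counts as present if it occurs in either field.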

**See milestone02_yujiaochen_brianho_jonjay_part02.ipynb**

## Transition to R for PCA and SVM

See the attached Rmd file. In R we accomplished several tasks: 1) from the dataframe of 527 horror movies and 570 romance movies prepared by the Python code (please refer to Milestone_2_Part_1.pdf), we randomly chose 300 movies as the test set and used the rest as the training set; 2) we used PCA to extract the first 150 PCs, which explain 80% of the variance in the data, and projected the training and testing data to get PC scores for each set; 3) we employed an SVM with a radial basis function kernel to classify the horror and romance movies based on their PC scores, with the parameters (gamma and cost) found through tuning; 4) the final prediction accuracy on the test set using this model is 87%, which is a satisfactory result. 
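
The actual analysis lives in the Rmd file. Purely as an illustration of step 2, the sketch below fits PCA on training data only, keeps the smallest number of components reaching a variance target, and projects held-out data onto them. It uses NumPy's SVD, and the names `fit_pca` and `project` are hypothetical rather than taken from the R code.

```python
import numpy as np

def fit_pca(X_train, var_target=0.80):
    """Fit PCA on the training matrix and keep the fewest components
    whose cumulative explained variance reaches `var_target`."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), var_target)) + 1
    k = min(k, len(s))
    return mean, Vt[:k]

def project(X, mean, components):
    """Center X with the training mean and return its PC scores."""
    return (X - mean) @ components.T
```

Fitting on the training rows alone and reusing the same mean and components for the test rows keeps information from the test set out of the model. For the classification step, scikit-learn's `SVC(kernel="rbf")` would be one Python counterpart to the radial-basis SVM, with its `C` and `gamma` arguments playing the role of the tuned cost and gamma.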

**See milestone02_yujiaochen_brianho_jonjay_part03.rmd**