# Movie genre classification

## Questions
We need to address the following problems before we do movie genre classification:
1. TMDB and IMDB have different movie genre lists, which can create issues for prediction. Which genre list do we use? What do we do if they disagree?
2. A movie can have more than one genre. The data from TMDB and IMDB do not indicate which genre is the main one when a movie has several. However, when we predict a movie's genre, we may want our response to be a single genre. What should we predict: one genre or a list of genres?
3. If we aim to tag a movie with multiple genres, what metrics do we use to evaluate our methods? Accuracy is no longer a simple 0-1 evaluation.

## Additional data set
The Amazon API provides genre tags with frequency weighting. We can use it to find a movie's primary genre.

## Approach
Below are some of our thoughts on the questions in the first section.

### TMDB genre vs IMDB genre
Issues:
1. TMDB has a shorter genre list than IMDB. Although we aim to pull information for the same set of movies, the genre lists extracted from the IMDB and TMDB data still differ.
2. Some genre distinctions are not based on plot. For example, the "Foreign" genre can refer to any film made outside the country.
3. Some genres have too few movies.
Possible solutions:
1. We can check whether the genre classifications are similar in IMDB and TMDB by looking at the percentage breakdown of each genre. If the results differ significantly between the two databases, we will have to explore what causes the difference. We can then merge the lists, or just use the IMDB genre list if its genres are not that different from TMDB's.
2. We should perhaps remove some genres based on movie release date or country.
3. For genres with few movies, we can merge them into another genre. Ex. if we do cluster analysis and find that Noir is indeed close to Crime, we can move the Noir movies into Crime.
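
The percentage comparison in solution 1 could look roughly like the sketch below. It is only an illustration: the share values are made up, and `compare_shares` and its `rename` argument are hypothetical names, not part of our notebook.

```python
def compare_shares(tmdb_share, imdb_share, rename=None):
    """Return {genre: imdb_share - tmdb_share} for genres present in both."""
    rename = rename or {}
    # Bring TMDB names in line with IMDB names before comparing.
    tmdb_share = {rename.get(g, g): s for g, s in tmdb_share.items()}
    common = sorted(set(tmdb_share) & set(imdb_share))
    return {g: round(imdb_share[g] - tmdb_share[g], 6) for g in common}

# Toy shares (fraction of movies tagged with each genre); values are made up.
tmdb_share = {"Drama": 0.50, "Comedy": 0.32, "Science Fiction": 0.10}
imdb_share = {"Drama": 0.46, "Comedy": 0.36, "Sci-Fi": 0.08, "Biography": 0.04}

diff = compare_shares(tmdb_share, imdb_share, rename={"Science Fiction": "Sci-Fi"})
print(diff)
```

Genres with a large absolute difference would be the ones to investigate first; genres that exist in only one database (here "Biography") are left out of the comparison and need separate handling.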

### What is our response variable?
Issues:
1. Movies have between 2 and 3 genres on average in IMDB and TMDB.
Unfortunately, neither database tags the primary genre of a movie.
Possible Solutions:
1. We create clusters of genres. Ex. cluster 1 consists of Action, Drama, Romance; cluster 2 consists of Horror. A movie can then be assigned to cluster 1. Some researchers have used this method.
2. We will have an output vector Y = [Y_horror, Y_romance, Y_drama, ...].
3. We can output only one Y, the primary genre. But to do this, we will need to get data from Amazon.
Currently, we prefer option 2.
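
As a minimal sketch of option 2, a movie's genre list can be turned into a 0/1 output vector as follows. The genre order in `GENRES` and the function name `to_label_vector` are assumptions made for illustration.

```python
# Hypothetical genre order for the output vector; the real list comes from the data.
GENRES = ["Horror", "Romance", "Drama"]

def to_label_vector(movie_genres, genres=GENRES):
    """Return a 0/1 list with one entry per genre (multi-hot encoding)."""
    present = set(movie_genres or [])  # treat a missing genre list as empty
    return [1 if g in present else 0 for g in genres]

y = to_label_vector(["Drama", "Horror"])
print(y)  # [1, 0, 1]
```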

### Multi-label classification
We will use multi-label classification. Before we discuss which methods to use (ex. KNN, SVM), we should first consider how we will treat the response variable.

1. Binary Relevance (BR)
We separate the task into independent binary problems, one for each genre.
However, this ignores label dependence.
Ex. a movie tagged as Drama may be likely to also be tagged as Action, and a movie tagged as Horror may be likely to also be tagged as Romance.
If the two classes of a genre (Yes/No) have very uneven sizes in the training set, the classifier will lean toward the class with more movies.
A label correction strategy can help to improve accuracy.
For example, suppose our prediction is [Y_horror, Y_romance, Y_drama] = [1,1,0], a combination that does not really occur in the training set. We find the most likely matching vector that does occur and may change our prediction to [1,0,1].
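
One possible reading of the label correction idea is sketched below: if a predicted vector never occurs in the training set, replace it with the closest vector that does. Using Hamming distance and breaking ties by training frequency are our assumptions here, not a fixed definition of the strategy.

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def correct_labels(prediction, training_vectors):
    """Snap a prediction to the closest label vector observed in training.

    Ties are broken by how often the vector occurs in training (assumption).
    """
    seen = {}
    for v in training_vectors:
        seen[tuple(v)] = seen.get(tuple(v), 0) + 1
    if tuple(prediction) in seen:
        return list(prediction)  # already a known combination
    best = min(seen, key=lambda v: (hamming_distance(v, prediction), -seen[v]))
    return list(best)

# Toy training labels in the order [Y_horror, Y_romance, Y_drama]
train = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
print(correct_labels([1, 1, 0], train))  # [1, 0, 1]
```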

2. Classifier Chains (CC)
We separate each genre into its own problem, but include the previous predictions as predictors.
For example, X is our predictor for Y_horror. Next, X and Y_horror are our predictors for Y_romance.
However, errors may be propagated down the chain.
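
The chaining step can be sketched as below. To keep the example self-contained it uses a tiny 1-nearest-neighbour base learner and made-up numeric features; any binary classifier could take its place, and the function names are hypothetical.

```python
def nearest_label(train_rows, train_labels, row):
    """1-nearest-neighbour base learner (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(r, row)) for r in train_rows]
    return train_labels[dists.index(min(dists))]

def chain_predict(X_train, Y_train, x):
    """Predict labels one at a time, feeding earlier labels forward.

    For label j the training features are X plus the true labels 0..j-1;
    at prediction time the chain appends its own earlier predictions.
    """
    n_labels = len(Y_train[0])
    predicted = []
    for j in range(n_labels):
        rows = [list(xi) + list(yi[:j]) for xi, yi in zip(X_train, Y_train)]
        target = [yi[j] for yi in Y_train]
        predicted.append(nearest_label(rows, target, list(x) + predicted))
    return predicted

# Toy data: two made-up numeric features, labels [Y_horror, Y_romance]
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
Y_train = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(chain_predict(X_train, Y_train, [0.85, 0.15]))  # [1, 0]
```

Because each step consumes the earlier predictions, a wrong Y_horror would also shift the input for Y_romance, which is the error propagation mentioned above.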

3. Label Powerset (LP)
Instead of having a separate Y_i for each genre i, we predict a single Y. Y has 2^I possible values, where I is the number of genres.
For example, if Y_horror = 1, Y_romance = 0, Y_drama = 1, then Y = [101].
However, data imbalance can be an issue, because many label combinations occur rarely or not at all.
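
A minimal sketch of the label powerset encoding, with toy label vectors, shows how quickly the class counts can become uneven. The helper name `to_powerset_class` is an assumption for illustration.

```python
from collections import Counter

def to_powerset_class(label_vector):
    """Join a 0/1 label vector into one class string, e.g. [1, 0, 1] -> '101'."""
    return "".join(str(bit) for bit in label_vector)

# Toy label vectors in the order [Y_horror, Y_romance, Y_drama]
labels = [[1, 0, 1], [0, 0, 1], [0, 0, 1], [0, 1, 1], [0, 0, 1]]
class_counts = Counter(to_powerset_class(v) for v in labels)
print(class_counts.most_common())
```

In this toy sample the class "001" has three movies while "101" and "011" have one each, and five of the 2^3 = 8 possible classes never appear.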

4. Neural network
Each output node corresponds to one genre label.

We aim to try out methods 1 to 4 with different algorithms.
We also need to examine label dependence by looking at the co-occurrence frequency of genres. It will give us a better picture of which method to use.
We can also transform Y to [Y_OR, Y_AND, Y_XOR].
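
The co-occurrence check mentioned above could be sketched like this, assuming each movie comes with a list of genre names (or `None` when the genres are missing). The function name and the toy data are illustrative only.

```python
from collections import Counter
from itertools import combinations

def genre_cooccurrence(genre_lists):
    """Count how many movies carry each unordered pair of genres."""
    pairs = Counter()
    for genres in genre_lists:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(genres or [])), 2):
            pairs[(a, b)] += 1
    return pairs

# Toy genre lists; the real ones would come from the 'genres' column.
movies = [["Drama", "Romance"], ["Drama", "Crime", "Thriller"],
          ["Crime", "Thriller"], None, ["Drama"]]
cooc = genre_cooccurrence(movies)
print(cooc[("Crime", "Thriller")])  # 2
```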


### Classification Algorithms

Below is the list of algorithms. We will likely add more algorithms later.
1. KNN
2. Kmeans
3. Naive Bayes
4. SVM
5. LDA

### Classification Evaluation
Suppose the response variables are Y_horror, Y_romance, Y_drama, ..., one Y_i for each genre i. We will need to find appropriate ways to evaluate the classification methods.

We can calculate several measurements and evaluate by more than one of them.
1. Compare bit-wise
This can be too lenient
2. Compare vector-wise
This can be too strict
3. Hamming loss (for BR)
4. 0/1 loss (for CC)
5. Precision
6. Recall
7. F-measures

For some of the measurements above, we can use macro-averaging (which gives equal weight to every class) or micro-averaging (which weights each class by its example frequency).
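
To make the difference between these measurements concrete, here is a small sketch on toy label matrices. The function names are our own, and the data are made up so that one rare label is always predicted wrongly.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label bits that are wrong (bit-wise comparison)."""
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / float(len(Y_true) * len(Y_true[0]))

def zero_one_loss(Y_true, Y_pred):
    """Fraction of movies whose whole label vector is not exactly right."""
    return sum(list(yt) != list(yp) for yt, yp in zip(Y_true, Y_pred)) / float(len(Y_true))

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0

def f1_scores(Y_true, Y_pred):
    """Return (macro F1, micro F1) over all labels."""
    counts = []
    for j in range(len(Y_true[0])):
        tp = sum(yt[j] == 1 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
        fp = sum(yt[j] == 0 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
        fn = sum(yt[j] == 1 and yp[j] == 0 for yt, yp in zip(Y_true, Y_pred))
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / len(counts)   # average of per-label F1
    micro = f1(*[sum(col) for col in zip(*counts)])     # F1 of pooled counts
    return macro, micro

Y_true = [[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 0, 0]]
Y_pred = [[1, 0, 1], [0, 0, 1], [0, 0, 1], [1, 1, 0]]
print(hamming_loss(Y_true, Y_pred), zero_one_loss(Y_true, Y_pred))
print(f1_scores(Y_true, Y_pred))
```

On this toy data only 2 of 12 bits are wrong (Hamming loss about 0.17), yet half of the movies have an imperfect vector (0/1 loss 0.5). The macro F1 (about 0.67) is pulled down by the one rare label that is never predicted correctly, while the micro F1 (about 0.83) is dominated by the frequent labels.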



## Useful resources
* [IMDB genre guide](http://www.imdb.com/help/search?domain=helpdesk_faq&index=2&file=genres)


In [2]:
import tmdbsimple as tmdb
tmdb.API_KEY = "71e259894a515060876bab2a33d6bdc9"

In [3]:
import imdb as ib
from imdb import IMDb
import pandas as pd
from PIL import Image
from StringIO import StringIO
import requests
import os
import time
from shutil import copyfile
import types
import numpy as np

In [4]:
dir_python_notebook = os.getcwd()
dir_movie_project = os.path.abspath(os.path.join(dir_python_notebook, os.pardir))
dir_data = os.path.join(dir_movie_project, 'data')

# Load data

For now, use the data saved in the files instead of calling the API.

In [5]:
# os.path.join keeps the paths portable across operating systems
tmdb_filename = os.path.join(dir_data, 'drv_tmdb_movie_details.json')
imdb_filename = os.path.join(dir_data, 'drv_imdb_movie_info.json')
tmdb_movies = pd.read_json(tmdb_filename)
imdb_movies = pd.read_json(imdb_filename)

In [40]:
def get_genre(movies, key):
    """Return the sorted list of distinct genres found in column `key`."""
    genre_set = set()
    for g in movies[movies[key].notnull()][key].tolist():
        if g is not None:
            genre_set = genre_set.union(set(g))
    return sorted(genre_set)

In [41]:
def get_genere_num (row, column_name):
    if row[column_name] is None :
        return 0
    else:
        return len(row[column_name])

In [42]:
def is_genre (row, column_name, genre):
    """check if that movie is in this genre as a movie can have more than 1 genre"""
    if row[column_name] is None :
        return 0
    else:
        if genre in row[column_name] :
            return 1
        else:
            return 0

In [43]:
tmdb_genre = get_genre(tmdb_movies,u'genres')
tmdb_movies[u'genre_num'] = tmdb_movies.apply(lambda row: get_genere_num(row,u'genres'), axis=1)

for g in tmdb_genre: 
    tmdb_movies[g] = tmdb_movies.apply(lambda row: is_genre(row,u'genres',g), axis=1)

In [44]:
np.mean(tmdb_movies[u'genre_num'])

2.255087358684481

In [45]:
tmdb_genre_df = tmdb_movies[tmdb_genre].mean(axis=0)
tmdb_genre_df

Action             0.180473
Adventure          0.123535
Animation          0.025488
Comedy             0.324974
Crime              0.156835
Documentary        0.037410
Drama              0.498869
Family             0.049538
Fantasy            0.072765
Foreign            0.021583
History            0.045427
Horror             0.099692
Music              0.024666
Mystery            0.063926
Romance            0.142652
Science Fiction    0.100925
TV Movie           0.005550
Thriller           0.233299
War                0.027338
Western            0.020144
dtype: float64

In [58]:
tmdb_genre_df[tmdb_genre_df < 0.05]

Animation      0.025488
Documentary    0.037410
Family         0.049538
Foreign        0.021583
History        0.045427
Music          0.024666
TV Movie       0.005550
War            0.027338
Western        0.020144
dtype: float64

In [46]:
imdb_genre = get_genre(imdb_movies,u'genres')
imdb_movies[u'genre_num'] = imdb_movies.apply(lambda row: get_genere_num(row,u'genres'), axis=1)

for g in imdb_genre: 
    imdb_movies[g] = imdb_movies.apply(lambda row: is_genre(row,u'genres',g), axis=1)

In [51]:
# number of movies with no genre
sum(imdb_movies[u'genres'].isnull())

209

In [47]:
imdb_genre_df = imdb_movies[imdb_genre].mean(axis=0)
imdb_genre_df

Action         0.154828
Adult          0.005870
Adventure      0.114323
Animation      0.054104
Biography      0.037081
Comedy         0.355396
Crime          0.153067
Documentary    0.065600
Drama          0.459642
Family         0.084532
Fantasy        0.081646
Film-Noir      0.010664
History        0.032042
Horror         0.108600
Music          0.043587
Musical        0.029009
Mystery        0.076460
News           0.000147
Reality-TV     0.000098
Romance        0.188876
Sci-Fi         0.082233
Short          0.032531
Sport          0.033216
Talk-Show      0.000098
Thriller       0.201106
War            0.040603
Western        0.023188
dtype: float64

In [57]:
imdb_genre_df[imdb_genre_df < 0.05]

Adult         0.005870
Biography     0.037081
Film-Noir     0.010664
History       0.032042
Music         0.043587
Musical       0.029009
News          0.000147
Reality-TV    0.000098
Short         0.032531
Sport         0.033216
Talk-Show     0.000098
War           0.040603
Western       0.023188
dtype: float64

In [52]:
np.mean(imdb_movies[u'genre_num'])

2.4685451521377555

## Observations on genre list

1. The genre lists of IMDB and TMDB are very similar so far, perhaps because we extracted the same movies from both.
2. A movie has a little more than 2 genres on average (about 2.3 in TMDB and 2.5 in IMDB).
3. Some genres (ex. Biography and War) each account for less than 5% of movies, so a one-vs-all approach may not work well for them.

## Compare Genre from IMDB, TMDB

In [21]:
# sorted() returns a list on both Python 2 and Python 3
tmdb_genre = sorted(tmdb_genres.values())
tmdb_genre

['Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'TV Movie',
 'Thriller',
 'War',
 'Western']

**IMDB genre list:** http://www.imdb.com/help/search?domain=helpdesk_faq&index=2&file=genres

In [87]:
imdb_genre = ['Action','Adult','Adventure','Animation','Biography','Comedy','Crime','Documentary','Drama','Family',
              'Fantasy','Film Noir','Game-Show','History','Horror','Musical','Music','Mystery','News','Reality-TV',
              'Romance','Sci-Fi','Short','Sport','Talk-Show','Thriller','War','Western']

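IMDB has more genres than TMDB. Most TMDB genres have an identically named counterpart in IMDB. The exceptions are 'Science Fiction' (TMDB), which is 'Sci-Fi' in IMDB, and 'TV Movie', which does not appear in the IMDB list above.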

In [88]:
genre = list(set(imdb_genre + tmdb_genre))

In [89]:
# remove the duplicate: 'Sci-Fi' (IMDB) is the same genre as 'Science Fiction' (TMDB)
genre.remove('Sci-Fi')

In [90]:
#complete genres
genre.sort()
genre

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film Noir',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Science Fiction',
 'Short',
 'Sport',
 'TV Movie',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [91]:
tmdb_genres

{12: 'Adventure',
 14: 'Fantasy',
 16: 'Animation',
 18: 'Drama',
 27: 'Horror',
 28: 'Action',
 35: 'Comedy',
 36: 'History',
 37: 'Western',
 53: 'Thriller',
 80: 'Crime',
 99: 'Documentary',
 878: 'Science Fiction',
 9648: 'Mystery',
 10402: 'Music',
 10749: 'Romance',
 10751: 'Family',
 10752: 'War',
 10770: 'TV Movie'}

In [92]:
imdb_genres = {12: 'Adventure',
 14: 'Fantasy',
 16: 'Animation',
 18: 'Drama',
 27: 'Horror',
 28: 'Action',
 35: 'Comedy',
 36: 'History',
 37: 'Western',
 53: 'Thriller',
 80: 'Crime',
 99: 'Documentary',
 878: 'Sci-Fi',
 9648: 'Mystery',
 10402: 'Music',
 10749: 'Romance',
 10751: 'Family',
 10752: 'War',
 10770: 'TV Movie',
 1: 'Adult',
 2: 'Biography',
 3: 'Film Noir',
 4: 'Game-Show',
 5: 'Musical',
 6: 'News',
 7: 'Reality-TV',
 8: 'Short',
 9: 'Sport',
10: 'Talk-Show'}


In [93]:
genre_dic = tmdb_genres.copy()
genre_dic.update(imdb_genres )

In [94]:
# As a dictionary: most ids are the same as in TMDB (after the update, id 878 maps to 'Sci-Fi' instead of 'Science Fiction')
genre_dic

{1: 'Adult',
 2: 'Biography',
 3: 'Film Noir',
 4: 'Game-Show',
 5: 'Musical',
 6: 'News',
 7: 'Reality-TV',
 8: 'Short',
 9: 'Sport',
 10: 'Talk-Show',
 12: 'Adventure',
 14: 'Fantasy',
 16: 'Animation',
 18: 'Drama',
 27: 'Horror',
 28: 'Action',
 35: 'Comedy',
 36: 'History',
 37: 'Western',
 53: 'Thriller',
 80: 'Crime',
 99: 'Documentary',
 878: 'Sci-Fi',
 9648: 'Mystery',
 10402: 'Music',
 10749: 'Romance',
 10751: 'Family',
 10752: 'War',
 10770: 'TV Movie'}

In [95]:
items = imdb_genres.items()
df_imdb = pd.DataFrame({'keys': [i[0] for i in items], 'values': [i[1] for i in items]})

In [96]:
df_imdb

| | keys | values |
|---|---|---|
| 0 | 10752 | War |
| 1 | 1 | Adult |
| 2 | 2 | Biography |
| 3 | 3 | Film Noir |
| 4 | 4 | Game-Show |
| 5 | 5 | Musical |
| 6 | 6 | News |
| 7 | 7 | Reality-TV |
| 8 | 8 | Short |
| 9 | 9 | Sport |


In [97]:
items = tmdb_genres.items()
df_tmdb = pd.DataFrame({'keys': [i[0] for i in items], 'values': [i[1] for i in items]})

In [98]:
items = genre_dic.items()
df_genre = pd.DataFrame({'keys': [i[0] for i in items], 'values': [i[1] for i in items]})

In [99]:
df1 = pd.merge(df_genre, df_tmdb, how='left', on=['keys'])
df1 = df1.sort_values(['values_x'])

In [100]:
df2 = pd.merge(df1, df_imdb,how='left', on=['keys'])
df2.columns = ['id','genres','tmdb_genre','imdb_genre']

In [101]:
df2

| | id | genres | tmdb_genre | imdb_genre |
|---|---|---|---|---|
| 0 | 28 | Action | Action | Action |
| 1 | 1 | Adult | | Adult |
| 2 | 12 | Adventure | Adventure | Adventure |
| 3 | 16 | Animation | Animation | Animation |
| 4 | 2 | Biography | | Biography |
| 5 | 35 | Comedy | Comedy | Comedy |
| 6 | 80 | Crime | Crime | Crime |
| 7 | 99 | Documentary | Documentary | Documentary |
| 8 | 18 | Drama | Drama | Drama |
| 9 | 10751 | Family | Family | Family |


**Summary:**

1. IMDB covers all TMDB genres except "TV Movie".
2. The only shared genre that has different names in IMDB and TMDB is "Science Fiction" (TMDB), which is "Sci-Fi" in IMDB.
3. Combining all genres from IMDB and TMDB, we have 29 genres in total: the 28 IMDB genres plus "TV Movie".
4. For TMDB, since genres are stored as {"id": "genre"} pairs, we can use "genre_dic" to convert them to the same names as IMDB (i.e., 'Science Fiction' from TMDB will be 'Sci-Fi' after conversion).
5. For IMDB, we can use its genres as they are, since they already match the genres we defined.
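
The conversion in point 4 could be sketched as below, assuming each TMDB movie carries a list of genre ids. Only three entries of `genre_dic` are copied here to keep the example short, and `tmdb_ids_to_names` is a hypothetical helper name.

```python
# A few entries copied from the genre_dic shown above (id -> IMDB-style name).
genre_dic = {18: "Drama", 27: "Horror", 878: "Sci-Fi"}

def tmdb_ids_to_names(genre_ids, mapping=genre_dic):
    """Translate TMDB genre ids to unified names, skipping unknown ids."""
    return [mapping[i] for i in genre_ids if i in mapping]

print(tmdb_ids_to_names([878, 18]))  # ['Sci-Fi', 'Drama']
```

Skipping unknown ids silently is a design choice for this sketch; in practice we may prefer to log them so that unexpected genres are not lost.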



# Steps to obtain genre list

1. Filter the data set to keep only movies that have data in both TMDB and IMDB; the IMDB and TMDB genres must not be null.
2. Use the genre list from IMDB.
3. Out of the 28 genres, we should merge genres with a really small number of movies (ex. less than 5% of movies) into another genre. The final result is the unique genre list.
4. Convert each genre in the unique genre list to a separate data field, and use 1/0 to indicate whether a movie has this genre.
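
Steps 3 and 4 might be sketched as follows. The merge map is purely hypothetical (it reuses the Noir-to-Crime example from the Approach section); the real merge targets would have to come from the clustering analysis.

```python
# Hypothetical merge map for rare genres; the real targets would come
# from the clustering analysis, not from this sketch.
MERGE_INTO = {"Film-Noir": "Crime"}

def merge_rare(genres, merge_map=MERGE_INTO):
    """Replace rare genres by their merge target and drop duplicates."""
    return sorted({merge_map.get(g, g) for g in genres or []})

def to_indicator_fields(genres, unique_genres):
    """Return one 1/0 field per genre in the unique genre list."""
    present = set(genres)
    return {g: int(g in present) for g in unique_genres}

unique_genres = ["Crime", "Drama", "Thriller"]
merged = merge_rare(["Film-Noir", "Crime", "Drama"])
print(merged)  # ['Crime', 'Drama']
print(to_indicator_fields(merged, unique_genres))
```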

## Future Work:


1. Up to now, we have kept the genres from IMDB and TMDB as they are. Some genres have a very limited number of movies, so we plan to merge minor genres into major ones. One possible way to merge genres is to cluster them, and we will do this part in milestone 2.