# Project 3


# Movie Genre Classification

Classify a movie genre based on its plot.

<img src="moviegenre.png"
     style="float: left; margin-right: 10px;" />




https://www.kaggle.com/c/miia4201-202019-p3-moviegenreclassification/overview

### Data

Input:
- movie plot

Output:
- probability that the movie belongs to each genre


### Evaluation

- 20% API
- 30% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 50% Performance in the Kaggle competition (The grade for each group will be proportional to the ranking it occupies in the competition. The group in the first place will obtain 5 points, for each position below, 0.25 points will be subtracted, that is: first place: 5 points, second: 4.75 points, third place: 4.50 points ... eleventh place: 2.50 points, twelfth place: 2.25 points).
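
The ranking rule above is a simple linear formula. A minimal sketch of it follows; the function name `kaggle_grade` and its parameters are illustrative assumptions, not part of the course material, and the statement does not say what happens below a grade of zero, so no floor is applied here.

```python
def kaggle_grade(rank: int, top: float = 5.0, step: float = 0.25) -> float:
    """Grade for the Kaggle component: 5 points for first place,
    minus 0.25 points for each position below it."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top - step * (rank - 1)


# reproduce the positions listed in the statement
for position in (1, 2, 3, 11, 12):
    print(position, kaggle_grade(position))
```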

- The project must be carried out in the groups assigned for module 4.
- Use clear and rigorous procedures.
- The project is due on July 12, 2020, at 11:59 pm, through Sicua + (upload the API and the report in PDF format).
- No projects will be received after the deadline or by any means other than the one established.




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [1]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [2]:
dataTraining = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [3]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [4]:
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


In [5]:
%%time

import re
def pre_process(text):
    # lowercase
    text = text.lower()
    # tags
    text = re.sub('</?.*?>', ' <> ', text)
    # special characters and digits
    text=re.sub('(\\d|\\W)+',' ',text)
    # remove punctuation
    #text = re.sub('[.;:!\'?,\"()\[\]]', '', text)
    #text = [REPLACE.sub('', line) for line in text]
    
    return text
dataTraining['plot_low'] = dataTraining['plot'].apply(pre_process)

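import nltk
nltk.download('stopwords')  # corpus required by stopwords.words() below
nltk.download('wordnet')
from nltk.corpus import stopwords


# a set keeps the per-word membership test in remove_stop_words fast
english_stop_words = set(stopwords.words('english'))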
def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split() 
                      if word not in english_stop_words])
        )
    return removed_stop_words

dataTraining['plot_low_rm'] = remove_stop_words(dataTraining['plot_low'])


from nltk.stem.porter import PorterStemmer
def get_stemmed_text(corpus):
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

dataTraining['plot_low_rm_stem'] = get_stemmed_text(dataTraining['plot_low_rm'])


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sebtc\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Wall time: 16.6 s


In [6]:
dataTraining[['plot','plot_low','plot_low_rm','plot_low_rm_stem']]

Unnamed: 0,plot,plot_low,plot_low_rm,plot_low_rm_stem
3107,most is the story of a single father who takes...,most is the story of a single father who takes...,story single father takes eight year old son w...,stori singl father take eight year old son wor...
900,a serial killer decides to teach the secrets o...,a serial killer decides to teach the secrets o...,serial killer decides teach secrets satisfying...,serial killer decid teach secret satisfi caree...
6724,"in sweden , a female blackmailer with a disfi...",in sweden a female blackmailer with a disfigur...,sweden female blackmailer disfiguring facial s...,sweden femal blackmail disfigur facial scar me...
4704,"in a friday afternoon in new york , the presi...",in a friday afternoon in new york the presiden...,friday afternoon new york president tredway co...,friday afternoon new york presid tredway corpo...
2582,"in los angeles , the editor of a publishing h...",in los angeles the editor of a publishing hous...,los angeles editor publishing house carol hunn...,lo angel editor publish hous carol hunnicut go...
...,...,...,...,...
8417,""" our marriage , their wedding . "" it ' s l...",our marriage their wedding it s lesson number...,marriage wedding lesson number one newly engag...,marriag wed lesson number one newli engag coup...
1592,"the wandering barbarian , conan , alongside ...",the wandering barbarian conan alongside his go...,wandering barbarian conan alongside goofy rogu...,wander barbarian conan alongsid goofi rogu pal...
1723,"like a tale spun by scheherazade , kismet fol...",like a tale spun by scheherazade kismet follow...,like tale spun scheherazade kismet follows rem...,like tale spun scheherazad kismet follow remar...
7605,"mrs . brisby , a widowed mouse , lives in a...",mrs brisby a widowed mouse lives in a cinder b...,mrs brisby widowed mouse lives cinder block ch...,mr brisbi widow mous live cinder block childre...
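
To make the cleaning step easier to follow, here is a small sketch of what the two regular expressions in `pre_process` do to a single string. The function name `clean_text` and the sample sentence are assumptions made for illustration, and the tag pattern is written with literal angle brackets.

```python
import re


def clean_text(text: str) -> str:
    # lowercase, as in pre_process
    text = text.lower()
    # replace anything that looks like an HTML tag with a placeholder
    text = re.sub(r'</?.*?>', ' <> ', text)
    # collapse every run of digits and non-word characters into one space
    text = re.sub(r'(\d|\W)+', ' ', text)
    return text


sample = "In Sweden, a <b>female</b> blackmailer (1941)!"
cleaned = clean_text(sample)
print(repr(cleaned))  # 'in sweden a female blackmailer '
```

Without the tag substitution, the letter inside `<b>` would survive the second pattern and end up as a stray token in the cleaned text.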


### Create count vectorizer


In [7]:
vect = CountVectorizer(ngram_range=(1,2),lowercase=True,max_features=10000)

X_dtm = vect.fit_transform(dataTraining['plot_low_rm_stem'])
X_dtm.shape

(7895, 10000)
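
As a rough illustration of what `ngram_range=(1,2)` produces, the sketch below fits a `CountVectorizer` on a two-document toy corpus. The corpus and the `toy_` names are assumptions for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two already-stemmed toy documents
toy_corpus = ["serial killer teach secret", "serial killer escap"]

toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_dtm = toy_vect.fit_transform(toy_corpus)

# the vocabulary holds 5 unigrams and 4 bigrams
print(sorted(toy_vect.vocabulary_))
print(toy_dtm.shape)  # (2, 9)
```

In the notebook, `max_features=10000` additionally keeps only the 10,000 most frequent terms, which is why the document-term matrix has exactly 10,000 columns.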

In [8]:
print(vect.get_feature_names()[:50])  # scikit-learn >= 1.2: use get_feature_names_out()

['aaron', 'abandon', 'abbey', 'abbi', 'abbott', 'abduct', 'abe', 'abel', 'abigail', 'abil', 'abl', 'abl convinc', 'abl escap', 'abl find', 'abl get', 'abl make', 'abner', 'aboard', 'aboard ship', 'abort', 'abound', 'abraham', 'abraham lincoln', 'abroad', 'abruptli', 'absenc', 'absent', 'absolut', 'absorb', 'absurd', 'abus', 'abus husband', 'academ', 'academi', 'academi award', 'accept', 'accept invit', 'accept job', 'accept offer', 'access', 'accid', 'accident', 'accident death', 'accident kill', 'acclaim', 'accommod', 'accompani', 'accomplic', 'accomplish', 'accord']


### Create y

In [9]:
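import ast

# parse the stringified lists (e.g. "['Short', 'Drama']") without executing arbitrary code
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)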

le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])


In [10]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
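
The target construction can be sketched on a toy example: each genre list is parsed from its string form and then turned into one binary column per genre. The sample rows and the `toy_` and `mlb` names are assumptions; `ast.literal_eval` is used here as a safe way to parse the stringified lists.

```python
import ast
from sklearn.preprocessing import MultiLabelBinarizer

# genres arrive as strings that look like Python lists
raw_genres = ["['Short', 'Drama']", "['Comedy', 'Crime', 'Horror']", "['Drama']"]
toy_genres = [ast.literal_eval(item) for item in raw_genres]

mlb = MultiLabelBinarizer()
toy_y = mlb.fit_transform(toy_genres)

print(list(mlb.classes_))  # columns are the sorted genre names
print(toy_y)
```

Column `j` of the indicator matrix corresponds to `mlb.classes_[j]`, which is what ties predicted probabilities back to genre names.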

In [11]:
from sklearn.model_selection import GridSearchCV, cross_val_score

In [12]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

### Train multi-class multi-label model

In [13]:
%%time

param_grid_ad = {'estimator__max_depth': [3,9,15], 
                 'estimator__n_estimators':[1,10,100,200],
                 'estimator__max_features': [0.4,0.7,0.9,1]}  # floats are fractions; the int 1 means a single feature per split
#param_grid_ad = {'estimator__n_estimators': [1,2], 'estimator__max_depth':[1,2],'estimator__max_features': [0.7]}

cv=10

#model_ad = RandomForestClassifier()
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1,random_state=42))
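# note: for multi-label targets, 'accuracy' is exact-match (subset) accuracy, not the macro AUC computed later
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid_ad, cv=cv, scoring='accuracy')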

grid_search.fit(X_train, y_train_genres)


Wall time: 3h 47min 46s


GridSearchCV(cv=10,
             estimator=OneVsRestClassifier(estimator=RandomForestClassifier(n_jobs=-1,
                                                                            random_state=42)),
             param_grid={'estimator__max_depth': [3, 9, 15],
                         'estimator__max_features': [0.4, 0.7, 0.9, 1],
                         'estimator__n_estimators': [1, 10, 100, 200]},
             scoring='accuracy')
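
The long wall time is easier to understand by counting the fits. The sketch below counts them with scikit-learn's `ParameterGrid`; the figure of 24 labels is taken from the submission column list further down, and the variable names are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'estimator__max_depth': [3, 9, 15],
              'estimator__n_estimators': [1, 10, 100, 200],
              'estimator__max_features': [0.4, 0.7, 0.9, 1]}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 4 combinations
n_folds = 10                                   # cv=10
n_labels = 24                                  # one forest per genre in OneVsRestClassifier

n_search_fits = n_candidates * n_folds         # OneVsRest fits during the search
n_forests = n_search_fits * n_labels           # random forests trained in total

print(n_candidates, n_search_fits, n_forests)  # 48 480 11520
```

This count excludes the final refit on the whole training split that `GridSearchCV` performs by default.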

In [20]:
grid_search.best_params_
#{'estimator__max_depth': 15,
# 'estimator__max_features': 0.4,
# 'estimator__n_estimators': 200}

{'estimator__max_depth': 15,
 'estimator__max_features': 0.4,
 'estimator__n_estimators': 200}

In [21]:
best_max_depth = grid_search.best_params_['estimator__max_depth']
best_n_estimators = grid_search.best_params_['estimator__n_estimators']
best_max_features = grid_search.best_params_['estimator__max_features']

In [25]:
results___=grid_search.cv_results_
pd.DataFrame.from_dict(results___)[['rank_test_score','param_estimator__max_depth','param_estimator__n_estimators','mean_test_score']]


Unnamed: 0,rank_test_score,param_estimator__max_depth,param_estimator__n_estimators,mean_test_score
0,46,3,1,0.04292
1,26,3,10,0.055586
2,37,3,100,0.051427
3,39,3,200,0.050861
4,47,3,1,0.04065
5,24,3,10,0.060693
6,21,3,100,0.063907
7,23,3,200,0.060882
8,43,3,1,0.046511
9,22,3,10,0.060882


In [27]:
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, 
                                                 n_estimators=best_n_estimators, 
                                                 max_depth=best_max_depth, 
                                                 max_features = best_max_features,
                                                 random_state=42))

In [28]:
%%time
clf.fit(X_train, y_train_genres)


Wall time: 2min 13s


OneVsRestClassifier(estimator=RandomForestClassifier(max_depth=15,
                                                     max_features=0.4,
                                                     n_estimators=200,
                                                     n_jobs=-1,
                                                     random_state=42))

In [29]:
%%time
y_pred_genres = clf.predict_proba(X_test)

Wall time: 2.27 s


In [31]:
%%time
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

Wall time: 40 ms


0.790535280124351
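
The gap between the grid search scores (around 0.05) and this AUC (around 0.79) comes from the two metrics measuring different things. A toy sketch with made-up labels and scores, assuming a 0.5 threshold for the hard predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 4 movies, 2 genres (toy data)
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.25], [0.1, 0.3]])

# macro AUC: one ranking-based AUC per genre, then the unweighted mean
macro_auc = roc_auc_score(y_true, y_score, average='macro')

# subset accuracy: a row counts only if every genre is predicted correctly
y_hard = (y_score >= 0.5).astype(int)
subset_acc = accuracy_score(y_true, y_hard)

print(macro_auc, subset_acc)  # 0.875 0.75
```

With 24 genres, exact-match accuracy is very strict, so low values during the search do not contradict a reasonable macro AUC. Passing `scoring='roc_auc'` to the search is one possible alternative, but it was not used in this notebook.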

### Predict the testing dataset

In [None]:
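# apply the same cleaning, stop-word removal and stemming used to fit `vect`
dataTesting['plot_low'] = dataTesting['plot'].apply(pre_process)
dataTesting['plot_low_rm'] = remove_stop_words(dataTesting['plot_low'])
dataTesting['plot_low_rm_stem'] = get_stemmed_text(dataTesting['plot_low_rm'])
X_test_dtm = vect.transform(dataTesting['plot_low_rm_stem'])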

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

y_pred_test_genres = clf.predict_proba(X_test_dtm)


In [None]:
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)

In [None]:
res.head()

In [None]:
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
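
A quick way to sanity-check the submission layout is to build a tiny frame the same way and inspect the header. In this sketch the three genres, the probabilities and the `toy_` names are assumptions; deriving the column names from the fitted binarizer's classes, rather than typing them by hand, is one option for keeping the column order aligned with `predict_proba`.

```python
import io

import numpy as np
import pandas as pd

toy_classes = ['Action', 'Comedy', 'Drama']         # stand-in for le.classes_
toy_cols = ['p_' + genre for genre in toy_classes]  # same naming as `cols`

# one row of probabilities per test movie, one column per genre
toy_proba = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.9]])
toy_res = pd.DataFrame(toy_proba, index=pd.Index([1, 4]), columns=toy_cols)

# write to an in-memory buffer instead of a file
buffer = io.StringIO()
toy_res.to_csv(buffer, index_label='ID')
header = buffer.getvalue().splitlines()[0]
print(header)  # ID,p_Action,p_Comedy,p_Drama
```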