# Project 3


# Movie Genre Classification

Classify a movie genre based on its plot.

<img src="moviegenre.png"
     style="float: left; margin-right: 10px;" />




https://www.kaggle.com/c/miia4201-202019-p3-moviegenreclassification/overview

### Data

Input:
- movie plot

Output:
- Probability that the movie belongs to each genre


### Evaluation

- 20% API
- 30% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 50% Performance in the Kaggle competition. Each group's grade depends on its final ranking: the group in first place obtains 5 points, and 0.25 points are subtracted for each position below it (first place: 5 points, second: 4.75 points, third: 4.50 points ... eleventh: 2.50 points, twelfth: 2.25 points).

• The project must be carried out in the groups assigned for module 4.
• Use clear and rigorous procedures.
• The project is due on July 12, 2020, at 11:59 pm, through Sicua + (upload the API and the report in PDF format).
• No projects will be accepted after the deadline or through any channel other than the one established.
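
The ranking rule in the evaluation list above can be written as a one-line formula. The following is a minimal sketch; the function name `kaggle_grade` is hypothetical, and flooring the result at 0 is an assumption, since the rule does not state what happens far down the ranking.

```python
def kaggle_grade(position: int) -> float:
    """Grade for a final leaderboard position (1 = first place)."""
    if position < 1:
        raise ValueError("position must be 1 or greater")
    # 5 points for first place, minus 0.25 for each position below it.
    # The floor at 0 is an assumption; the stated rule gives no minimum.
    return max(0.0, 5.0 - 0.25 * (position - 1))


for position in (1, 2, 3, 11, 12):
    print(position, kaggle_grade(position))
```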




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [40]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [41]:
dataTraining = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [42]:
dataTraining.head()

|  | year | title | plot | genres | rating |
|---|---|---|---|---|---|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [43]:
dataTesting.head()

|  | year | title | plot |
|---|---|---|---|
| 1 | 1999 | Message in a Bottle | who meets by fate , shall be sealed by fate .... |
| 4 | 1978 | Midnight Express | the true story of billy hayes , an american c... |
| 5 | 1996 | Primal Fear | martin vail left the chicago da ' s office to ... |
| 6 | 1950 | Crisis | husband and wife americans dr . eugene and mr... |
| 7 | 1959 | The Tingler | the coroner and scientist dr . warren chapin ... |


In [44]:
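import re

# The NLTK 'stopwords' and 'wordnet' corpora must be available locally,
# e.g. via nltk.download('stopwords') and nltk.download('wordnet').
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer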

In [46]:
def pre_process(text):
    # lowercase
    text = text.lower()
    # replace HTML-style tags with a placeholder
    text = re.sub('</?.*?>', ' <> ', text)
    # collapse runs of digits and non-word characters into a single space
    text = re.sub('(\\d|\\W)+', ' ', text)
    # remove punctuation
    #text = re.sub('[.;:!\'?,\"()\[\]]', '', text)
    #text = [REPLACE.sub('', line) for line in text]
    
    return text

dataTraining['plot_low'] = dataTraining['plot'].apply(pre_process)
dataTesting['plot_low'] = dataTesting['plot'].apply(pre_process)



###############################################################################

# Remove stop words

# A set gives constant-time membership checks in the loop below.
english_stop_words = set(stopwords.words('english'))

def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split() 
                      if word not in english_stop_words])
        )
    return removed_stop_words

dataTraining['plot_low_rm'] = remove_stop_words(dataTraining['plot_low'])
dataTesting['plot_low_rm'] = remove_stop_words(dataTesting['plot_low'])

###############################################################################

# Stemming:


def get_stemmed_text(corpus):
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

dataTraining['plot_low_rm_stem'] = get_stemmed_text(dataTraining['plot_low_rm'])
dataTesting['plot_low_rm_stem'] = get_stemmed_text(dataTesting['plot_low_rm'])

###############################################################################

# Lemmatization:


def lemma(texto):
  lemmatizador = WordNetLemmatizer()
  return [' '.join([lemmatizador.lemmatize(word) for word in review.split()]) for review in texto]

dataTraining['plot_low_rm_lemma'] = lemma(dataTraining['plot_low_rm'])
dataTesting['plot_low_rm_lemma'] = lemma(dataTesting['plot_low_rm'])
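
To make the effect of the first two steps above concrete, here is a small standalone sketch that runs the same regular expressions on one made-up sentence. It assumes the tag pattern is meant to match literal angle brackets. `SAMPLE_STOP_WORDS` is a hypothetical stand-in for NLTK's English stop-word list, and the function names `clean` and `drop_stop_words` are illustrative only.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook
# uses NLTK's full English list instead.
SAMPLE_STOP_WORDS = {"a", "the", "of", "is", "who", "his", "in", "to"}


def clean(text: str) -> str:
    """Lowercase, replace tags, then collapse digits and non-word characters."""
    text = text.lower()
    text = re.sub(r"</?.*?>", " <> ", text)
    text = re.sub(r"(\d|\W)+", " ", text)
    return text


def drop_stop_words(text: str) -> str:
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(word for word in text.split() if word not in SAMPLE_STOP_WORDS)


sample = "In 1954, the President of Tredway Corp. <b>dies</b>!"
cleaned = clean(sample)
filtered = drop_stop_words(cleaned)
print(repr(cleaned))
print(repr(filtered))
```

Note that the second pattern removes digits as well as punctuation, so years mentioned in a plot do not survive this step.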


In [47]:
dataTraining[['plot','plot_low','plot_low_rm','plot_low_rm_stem','plot_low_rm_lemma']]

|  | plot | plot_low | plot_low_rm | plot_low_rm_stem | plot_low_rm_lemma |
|---|---|---|---|---|---|
| 3107 | most is the story of a single father who takes... | most is the story of a single father who takes... | story single father takes eight year old son w... | stori singl father take eight year old son wor... | story single father take eight year old son wo... |
| 900 | a serial killer decides to teach the secrets o... | a serial killer decides to teach the secrets o... | serial killer decides teach secrets satisfying... | serial killer decid teach secret satisfi caree... | serial killer decides teach secret satisfying ... |
| 6724 | in sweden , a female blackmailer with a disfi... | in sweden a female blackmailer with a disfigur... | sweden female blackmailer disfiguring facial s... | sweden femal blackmail disfigur facial scar me... | sweden female blackmailer disfiguring facial s... |
| 4704 | in a friday afternoon in new york , the presi... | in a friday afternoon in new york the presiden... | friday afternoon new york president tredway co... | friday afternoon new york presid tredway corpo... | friday afternoon new york president tredway co... |
| 2582 | in los angeles , the editor of a publishing h... | in los angeles the editor of a publishing hous... | los angeles editor publishing house carol hunn... | lo angel editor publish hous carol hunnicut go... | los angeles editor publishing house carol hunn... |
| ... | ... | ... | ... | ... | ... |
| 8417 | " our marriage , their wedding . " it ' s l... | our marriage their wedding it s lesson number... | marriage wedding lesson number one newly engag... | marriag wed lesson number one newli engag coup... | marriage wedding lesson number one newly engag... |
| 1592 | the wandering barbarian , conan , alongside ... | the wandering barbarian conan alongside his go... | wandering barbarian conan alongside goofy rogu... | wander barbarian conan alongsid goofi rogu pal... | wandering barbarian conan alongside goofy rogu... |
| 1723 | like a tale spun by scheherazade , kismet fol... | like a tale spun by scheherazade kismet follow... | like tale spun scheherazade kismet follows rem... | like tale spun scheherazad kismet follow remar... | like tale spun scheherazade kismet follows rem... |
| 7605 | mrs . brisby , a widowed mouse , lives in a... | mrs brisby a widowed mouse lives in a cinder b... | mrs brisby widowed mouse lives cinder block ch... | mr brisbi widow mous live cinder block childre... | mr brisby widowed mouse life cinder block chil... |


### Create count vectorizer


In [48]:
vect = CountVectorizer(max_features=5000, min_df=0.0005)
X_dtm = vect.fit_transform(dataTraining['plot_low_rm_stem'])
voca_coun_vec = vect.vocabulary_ # CountVectorizer vocabulary (term -> column index)

print(X_dtm.shape)

(7895, 5000)
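
scikit-learn interprets a float `min_df` as a proportion of documents rather than an absolute count. The arithmetic below is a small sketch of what `min_df=0.0005` means for the 7895 training plots. The variable names are illustrative. The 5000 columns in the printed shape suggest that the `max_features` cap, rather than `min_df` alone, determined the final vocabulary size.

```python
import math

n_documents = 7895   # number of rows in X_dtm, from the shape printed above
min_df = 0.0005      # proportion passed to CountVectorizer

# A term is kept only if its document frequency is at least
# min_df * n_documents; document frequencies are whole numbers,
# so the effective minimum is that product rounded up.
threshold = min_df * n_documents
min_documents = math.ceil(threshold)
print(min_documents)
```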


In [49]:
print(voca_coun_vec)

{'stori': 4273, 'singl': 4075, 'father': 1635, 'take': 4412, 'eight': 1408, 'year': 4980, 'old': 3133, 'son': 4159, 'work': 4949, 'railroad': 3579, 'bridg': 543, 'day': 1113, 'boy': 513, 'meet': 2816, 'woman': 4941, 'board': 476, 'train': 4591, 'drug': 1345, 'abus': 15, 'goe': 1889, 'engin': 1473, 'room': 3823, 'tell': 4450, 'stay': 4249, 'edg': 1393, 'nearbi': 3008, 'lake': 2510, 'ship': 4027, 'come': 874, 'lift': 2602, 'though': 4505, 'suppos': 4361, 'arriv': 242, 'hour': 2146, 'later': 2527, 'happen': 1996, 'see': 3947, 'tri': 4615, 'warn': 4850, 'abl': 8, 'approach': 213, 'fall': 1610, 'gear': 1846, 'attempt': 278, 'lower': 2670, 'leav': 2556, 'horrif': 2136, 'choic': 756, 'crush': 1056, 'peopl': 3285, 'complet': 901, 'fact': 1599, 'die': 1222, 'save': 3901, 'addict': 46, 'look': 2646, 'window': 4922, 'movi': 2956, 'end': 1465, 'man': 2722, 'wander': 4841, 'new': 3040, 'citi': 780, 'longer': 2644, 'hold': 2105, 'small': 4114, 'babi': 314, 'run': 3847, 'parallel': 3224, 'name': 2985

In [8]:
print(vect.get_feature_names()[:50])

['aaron', 'abandon', 'abbey', 'abbi', 'abbott', 'abduct', 'abe', 'abel', 'abigail', 'abil', 'abl', 'abl convinc', 'abl escap', 'abl find', 'abl get', 'abl make', 'abner', 'aboard', 'aboard ship', 'abort', 'abound', 'abraham', 'abraham lincoln', 'abroad', 'abruptli', 'absenc', 'absent', 'absolut', 'absorb', 'absurd', 'abu', 'abus', 'abus husband', 'academ', 'academi', 'academi award', 'accept', 'accept invit', 'accept job', 'accept offer', 'access', 'accid', 'accident', 'accident death', 'accident kill', 'acclaim', 'accommod', 'accompani', 'accomplic', 'accomplish']
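
The `vocabulary_` mapping printed above assigns each term a column index in alphabetical order, and each row of the document-term matrix holds the term counts for one plot. The following pure-Python sketch reproduces that idea on three made-up stemmed snippets. It illustrates the resulting data structure and is not scikit-learn's implementation. The snippets and variable names are hypothetical.

```python
from collections import Counter

# Made-up stemmed snippets standing in for rows of 'plot_low_rm_stem'.
docs = [
    "stori singl father take son work",
    "serial killer teach secret",
    "father son train bridg train",
]

# Each distinct term gets a column index in alphabetical order.
terms = sorted({term for doc in docs for term in doc.split()})
vocabulary = {term: index for index, term in enumerate(terms)}

# One row per document, one column per term, each cell a term count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(term, 0) for term in terms])

print(vocabulary["train"], matrix[2][vocabulary["train"]])
```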


### Create y

In [50]:
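import ast

# Parse the string form of each genre list. ast.literal_eval accepts only
# Python literals, which makes it safer than eval on data read from a file.
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)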

le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [51]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
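
Each row of `y_genres` is a multi-hot vector with one column per genre, set to 1 when the movie carries that genre. The sketch below builds the same kind of indicator matrix in pure Python for three genre strings copied from the `dataTraining.head()` output, assuming each string is a Python list literal. It mirrors the sorted column order of `MultiLabelBinarizer.classes_` but is not the library's implementation.

```python
import ast

raw_genres = ["['Short', 'Drama']", "['Comedy', 'Crime', 'Horror']", "['Drama']"]

# Parse the string form of each list without executing arbitrary code.
parsed = [ast.literal_eval(item) for item in raw_genres]

# Columns are the sorted distinct labels.
classes = sorted({genre for genres in parsed for genre in genres})

# One row per movie, 1 where the movie has the genre and 0 otherwise.
indicator = [[int(label in genres) for label in classes] for genres in parsed]

print(classes)
for row in indicator:
    print(row)
```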

### XGBoost

In [52]:
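# .toarray() returns a plain ndarray; .todense() returns np.matrix, which recent scikit-learn releases do not accept
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm.toarray(), y_genres, test_size=0.33, random_state=42)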
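
With a float `test_size` and no `train_size`, `train_test_split` rounds the test count up and assigns the remaining rows to training. The short calculation below sketches the resulting split sizes for the 7895 training plots. The variable names are illustrative.

```python
import math

n_samples = 7895   # rows in X_dtm
test_size = 0.33   # fraction held out for validation

# The held-out count is the fraction of the rows, rounded up;
# every remaining row goes to the training split.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```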

In [55]:
from xgboost import XGBClassifier

In [78]:
xgb = OneVsRestClassifier(XGBClassifier(random_state=42,n_estimators=100, learning_rate=0.3, max_depth=7))
xgb.fit(X_train, y_train_genres)





OneVsRestClassifier(estimator=XGBClassifier(base_score=None, booster=None,
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None, gamma=None,
                                            gpu_id=None, importance_type='gain',
                                            interaction_constraints=None,
                                            learning_rate=0.3,
                                            max_delta_step=None, max_depth=5,
                                            min_child_weight=None, missing=nan,
                                            monotone_constraints=None,
                                            n_estimators=100, n_jobs=None,
                                            num_parallel_tree=None,
                                            random_state=42, reg_alpha=None,
                                            

In [80]:
y_pred_genresxg = xgb.predict_proba(X_test)
roc_auc_score(y_test_genres, y_pred_genresxg, average='macro')

0.8434843032865246
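
With `average='macro'`, the score above is the unweighted mean of one ROC AUC per genre column. The pure-Python sketch below shows that computation on a tiny made-up example, using the pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as half). The function names and data are hypothetical, and this is not scikit-learn's implementation.

```python
def binary_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true_rows, score_rows):
    """Unweighted mean of the per-label AUCs."""
    n_labels = len(y_true_rows[0])
    per_label = [
        binary_auc([row[j] for row in y_true_rows], [row[j] for row in score_rows])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Four made-up movies, two made-up genre columns.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.3, 0.6], [0.4, 0.7], [0.1, 0.8]]
print(macro_auc(y_true, y_score))
```

Because every genre contributes equally to the mean, rare genres weigh as much as common ones in this metric.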

In [79]:
df = pd.DataFrame(y_test_genres)
df.to_csv("y_test.csv")

In [81]:
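# Convert to a dense array so the test features use the same representation as the training matrix
X_test_dtm = vect.transform(dataTesting['plot_low_rm_stem']).toarray()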

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

y_pred_test_genres = xgb.predict_proba(X_test_dtm)

In [82]:
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_xgb1.csv', index_label='ID')
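
The order of the submission columns has to match the order of the label columns the model was trained on, which for `MultiLabelBinarizer` is the sorted list of genres. The sketch below derives the column names from a genre list instead of typing them by hand, then writes a two-row file in the same `ID` plus `p_<genre>` layout. The probabilities are made up, and `classes` stands in for `le.classes_`.

```python
import csv
import io

# Stand-in for le.classes_: the 24 genres in sorted order.
classes = sorted([
    "Action", "Adventure", "Animation", "Biography", "Comedy", "Crime",
    "Documentary", "Drama", "Family", "Fantasy", "Film-Noir", "History",
    "Horror", "Music", "Musical", "Mystery", "News", "Romance", "Sci-Fi",
    "Short", "Sport", "Thriller", "War", "Western",
])
cols = ["p_" + genre for genre in classes]

# Made-up probabilities for two test IDs, one value per genre.
rows = {1: [0.1] * len(cols), 4: [0.2] * len(cols)}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID"] + cols)
for movie_id, probabilities in rows.items():
    writer.writerow([movie_id] + probabilities)

header = buffer.getvalue().splitlines()[0]
print(header)
```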