![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, in your respective working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. While working on it, follow the instructions given in the "Project 2 guide: Movie genre classification".

**Submission**: The project must be submitted during week 8. However, it is important that you make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF format to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains the title of a movie, its release year, its plot synopsis (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The idea is to use these data to predict, given the plot synopsis, the probability that a movie belongs to each of the genres.

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: test-set prediction for the Kaggle submission
This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.
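
As a rough illustration of that format, the sketch below builds a tiny submission table and writes it as CSV. The genre names, IDs, and probabilities are made-up toy values; only the `p_<genre>` column naming and the `ID` index label are taken from the cells later in this notebook.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy values: 2 test movies and 3 genres
# (the real file has one column per genre in the dataset).
genres = ["Action", "Comedy", "Drama"]
test_ids = [1, 4]
probs = np.array([[0.10, 0.70, 0.40],
                  [0.05, 0.20, 0.90]])

# One row per test movie, one "p_<genre>" column per genre.
submission = pd.DataFrame(probs, index=test_ids,
                          columns=[f"p_{g}" for g in genres])

# index_label="ID" names the index column in the CSV header.
buffer = io.StringIO()
submission.to_csv(buffer, index_label="ID")
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer here only keeps the example free of side effects; passing a file path to `to_csv` works the same way.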

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import libraries
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import ast
import seaborn as sns
import re,string
import nltk
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer
import tensorflow as tf
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
import multiprocessing
from sklearn.model_selection import RepeatedKFold

In [None]:
# Load data from the zipped .csv files
Train = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
Test = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [None]:
dataTraining = Train
dataTesting = Test

In [None]:
# Preview the training data
dataTraining.head()

In [None]:
# Preview the test data
dataTesting.head()

#### Data preprocessing

In [None]:
dataTraining.info()

In [None]:
dataTraining.isnull().sum()

In [None]:
generos = []

for i in dataTraining['genres']:
    list_genres = ast.literal_eval(i)
    for j in list_genres:
        generos.append(j)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,5))
sns.countplot(x=generos)    # Count of number of descriptions for each class label
plt.xticks(rotation=60)
plt.show()

In [None]:
plt.figure(figsize=(12,5))
pd.DataFrame(generos).value_counts().plot.barh()
plt.show()
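
The genre frequencies plotted above can also be tallied without plotting. This is a minimal sketch on hypothetical rows that mimic the string-encoded `genres` column; it is not the notebook's actual data.

```python
import ast
from collections import Counter

# Hypothetical rows mimicking the string-encoded 'genres' column.
raw_genres = [
    "['Comedy', 'Crime', 'Horror']",
    "['Drama']",
    "['Comedy', 'Drama']",
]

# literal_eval parses each string into a Python list without executing code;
# the generator flattens all lists so Counter sees one genre at a time.
genre_counts = Counter(g for row in raw_genres for g in ast.literal_eval(row))
print(genre_counts.most_common())
```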

In [None]:
dataTraining['plot length']=dataTraining['plot'].apply(lambda x:len(x))   # calculating length of plot for each row
dataTraining.head()

In [12]:
# import re,string
# import nltk
# #nltk.download('stopwords')    # uncomment if stopwords are not downloaded
# from nltk.corpus import stopwords
# from nltk.stem import LancasterStemmer

# stemmer=LancasterStemmer()
# stopwords=stopwords.words('english')

# def clean_text(text):
#     text=re.sub('-',' ',text.lower())   # replace `word-word` as `word word`
#     text = ' '.join([stemmer.stem(word) for word in text.split() if word not in stopwords]) # remove stopwords and stem other words
#     text=re.sub(f'[{string.digits}]',' ',text)  # remove digits
#     return re.sub(f'[{re.escape(string.punctuation)}]','',text) # remove punctuations

In [13]:
# dataTraining['cleaned plot']=dataTraining['plot'].apply(clean_text)  # clean text for all rows of description
# dataTesting['cleaned plot']=dataTesting['plot'].apply(clean_text)
# dataTraining['cleaned plot length']=dataTraining['cleaned plot'].apply(lambda x:len(x))  # calculate length of cleaned text
# dataTraining.head()

In [14]:
# (dataTraining['cleaned plot length']>1500).value_counts()

In [15]:
# # Remove extremely long descriptions: outliers
# print('Dataframe size (before removal): ',len(dataTraining))
# filt=dataTraining['cleaned plot length']>1500
# dataTraining.drop(dataTraining[filt].index,axis=0,inplace=True)     # filter rows having cleaned description length > 1500
# print('Dataframe size (after removal): ',len(dataTraining))
# print(f'Removed rows: {filt.sum()}')

In [16]:
# dataTraining.drop(columns=['title','plot','plot length','cleaned plot length'],axis=1,inplace=True)     # drop unnecessary columns for model
# dataTesting.drop(columns=['title','plot'],axis=1,inplace=True)
# dataTraining.head()

In [None]:
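# Define the predictors (X)
from nltk.corpus import stopwords
stop_words_en = stopwords.words('english')   # separate name, so the imported module is not shadowed

# Note: with lowercase=False, capitalized stop words (e.g. 'The') are not filtered, because the NLTK list is lowercase
vect = CountVectorizer(max_features=1500, lowercase=False, stop_words=stop_words_en)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape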
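
The vectorizer above combines `lowercase=False` with a lowercase stop-word list. The sketch below, on a made-up sentence and a two-word stop list standing in for the NLTK list, shows how that combination behaves: stop-word matching is exact, so a capitalized "The" survives unless the text is lowercased first.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy document and a tiny lowercase stop list (stand-in for the NLTK list).
docs = ["The killer and the clerk"]
stop_words = ["the", "and"]

# Same stop list, with and without lowercasing the text first.
cased = CountVectorizer(lowercase=False, stop_words=stop_words).fit(docs)
lowered = CountVectorizer(lowercase=True, stop_words=stop_words).fit(docs)

vocab_cased = sorted(cased.vocabulary_)
vocab_lowered = sorted(lowered.vocabulary_)
print(vocab_cased)    # 'The' is kept: it does not match the lowercase 'the'
print(vocab_lowered)  # both 'The' and 'the' are removed
```

Whether keeping case helps or hurts on this dataset is an empirical question; the example only shows what the setting does to the vocabulary.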

#### Lemmatization

In [18]:
# # Import libraries
# from nltk.stem import WordNetLemmatizer
# wordnet_lemmatizer = WordNetLemmatizer()
# import nltk
# nltk.download('wordnet')

In [19]:
# # Define a list with the vocabulary of the document-term matrix
# words = list(vect.vocabulary_.keys())

In [20]:
# # Define a function that takes text as a parameter and returns a list of lemmas
# def split_into_lemmas(text):
#     text = text.lower()
#     words = text.split()
#     return [wordnet_lemmatizer.lemmatize(word) for word in words]

In [21]:
# # Build document-term matrices with CountVectorizer, using 'split_into_lemmas' as the analyzer
# vect_lemas = CountVectorizer(analyzer=split_into_lemmas)

In [22]:
# # Model performance after lemmatizing the text
# vect_lemas

In [None]:
# Define the target variable (y)
# ast.literal_eval parses the list literal without executing arbitrary code (unlike eval)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])
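
To make the shape of the target concrete, here is a minimal sketch of the binarization step on two hypothetical rows. `MultiLabelBinarizer` stores the labels in sorted order in `classes_`, so prefixed column names can be derived from it rather than typed by hand; the `p_` prefix mirrors the submission columns used later.

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows mimicking the string-encoded 'genres' column.
raw = ["['Comedy', 'Crime', 'Horror']", "['Drama']"]
parsed = [ast.literal_eval(s) for s in raw]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(parsed)   # one row per movie, one 0/1 column per genre

# classes_ holds the genres in sorted order; reuse it to name output columns.
cols = [f"p_{g}" for g in mlb.classes_]
print(list(mlb.classes_))
print(y)
print(cols)
```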

In [24]:
# Split the predictors (X) and the target (y) into training and test sets using train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [25]:
# # Hyperparameter grid evaluated
# # ==============================================================================
# param_grid = {'n_estimators'  : [50, 100, 500, 1000],
#               'max_features'  : ['auto', 'sqrt', 'log2'],
#               'max_depth'     : [None, 1, 3, 5, 10, 20],
#               'subsample'     : [0.5, 1],
#               'learning_rate' : [0.001, 0.01, 0.1]
#              }

# # Grid search with cross-validation
# # ==============================================================================
# grid = GridSearchCV(
#         estimator  = xgb.XGBClassifier(random_state=123),
#         param_grid = param_grid,
#         scoring    = 'accuracy',
#         n_jobs     = multiprocessing.cpu_count() - 1,
#         cv         = RepeatedKFold(n_splits=3, n_repeats=1, random_state=123), 
#         refit      = True,
#         verbose    = 0,
#         return_train_score = True
#        )

# grid.fit(X = X_train, y = y_train_genres)

# # Results
# # ==============================================================================
# resultados = pd.DataFrame(grid.cv_results_)
# resultados.filter(regex = '(param*|mean_t|std_t)') \
#     .drop(columns = 'params') \
#     .sort_values('mean_test_score', ascending = False) \
#     .head(4)

In [26]:
# # Best hyperparameters found by cross-validation
# # ==============================================================================
# print("----------------------------------------")
# print("Best hyperparameters found (cv)")
# print("----------------------------------------")
# print(grid.best_params_, ":", grid.best_score_, grid.scoring)

In [27]:
# Model definition and training
clf = OneVsRestClassifier(xgb.XGBClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)
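
For readers unfamiliar with the one-vs-rest wrapper, the sketch below shows what it produces for a multi-label target. It uses `LogisticRegression` on synthetic data as a lightweight stand-in for the XGBoost classifier above; the data and the two labels are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data: 40 samples, 5 features, 2 independent binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
Y = np.column_stack([(X[:, 0] > 0).astype(int),
                     (X[:, 1] > 0).astype(int)])

# One binary classifier is fitted per label column.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, Y)

# predict_proba returns one probability per (sample, label);
# rows need not sum to 1 because the labels are not mutually exclusive.
proba = ovr.predict_proba(X)
print(len(ovr.estimators_), proba.shape)
```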

In [28]:
# Predict probabilities with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Report model performance (macro-averaged ROC AUC)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8025316749427879
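
The macro average used above is the unweighted mean of the per-genre AUCs. The following sketch checks this on a small made-up example with two labels; the labels and scores are toy values, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy multi-label ground truth and predicted scores (4 samples, 2 labels).
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.2, 0.5]])

# AUC computed separately for each label column, then averaged.
per_label = [roc_auc_score(y_true[:, j], y_score[:, j])
             for j in range(y_true.shape[1])]
macro_manual = float(np.mean(per_label))

# The same quantity from scikit-learn's built-in macro averaging.
macro_sklearn = roc_auc_score(y_true, y_score, average="macro")
print(per_label, macro_manual, macro_sklearn)
```

Because every genre counts equally in this average, rare genres influence the score as much as common ones.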

In [29]:
# Transform the predictors (X) of the test set
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [30]:
# Save predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

| ID | p_Action | p_Adventure | p_Animation | p_Biography | p_Comedy | p_Crime | p_Documentary | p_Drama | p_Family | p_Fantasy | ... | p_Musical | p_Mystery | p_News | p_Romance | p_Sci-Fi | p_Short | p_Sport | p_Thriller | p_War | p_Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.05231 | 0.039072 | 0.00042 | 0.002565 | 0.151989 | 0.026575 | 0.000664 | 0.470647 | 0.024712 | 0.037654 | ... | 0.009205 | 0.027514 | 1.7e-05 | 0.987415 | 0.013618 | 0.00028 | 0.00053 | 0.086193 | 0.00102 | 0.002179 |
| 4 | 0.025965 | 0.005199 | 0.004291 | 0.575535 | 0.045121 | 0.307576 | 0.294935 | 0.944089 | 0.003839 | 0.002557 | ... | 0.002849 | 0.00996 | 0.000656 | 0.011166 | 0.002386 | 0.011194 | 0.003384 | 0.153691 | 0.034821 | 0.003528 |
| 5 | 0.001451 | 0.000202 | 4e-05 | 0.000369 | 0.021729 | 0.419026 | 7.4e-05 | 0.680908 | 4.1e-05 | 0.001767 | ... | 0.000193 | 0.327308 | 2.3e-05 | 0.151924 | 0.003713 | 0.001086 | 0.004907 | 0.570237 | 0.000352 | 0.000962 |
| 6 | 0.025874 | 0.04513 | 5.3e-05 | 0.001681 | 0.043484 | 0.051594 | 0.001757 | 0.856734 | 0.001092 | 0.009028 | ... | 0.000552 | 0.14444 | 3e-05 | 0.061414 | 0.054991 | 1.2e-05 | 0.000174 | 0.845745 | 0.038545 | 6.5e-05 |
| 7 | 0.109577 | 0.030596 | 0.001107 | 0.00199 | 0.148409 | 0.061959 | 0.000511 | 0.075539 | 0.01606 | 0.459454 | ... | 0.007466 | 0.026427 | 2.3e-05 | 0.003859 | 0.16818 | 0.000784 | 0.002018 | 0.260326 | 0.002429 | 0.001002 |
