![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie Genre Classification

The purpose of this project is for you to put into practice, in your respective working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. As you work, follow the instructions given in the "Guía del proyecto 2: Clasificación de género de películas" (Project 2 guide: movie genre classification).

**Submission**: The project must be submitted during week 8. However, it is important that you make progress in week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The goal is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [1]:
# Library imports
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [2]:
# Load data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [3]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [4]:
# Preview the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


## Preprocesamiento de Datos

In [5]:
# Data cleaning

import re
from string import digits
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer('english')

def clean_text(texto):
    # Drop apostrophes, replace every non-letter with a space, and lowercase
    texto = re.sub("'", "", texto)
    texto = re.sub("[^a-zA-Z]", " ", texto)
    texto = texto.lower()

    # Remove digits (a no-op after the regex above, kept as a safeguard)
    remove_digits = str.maketrans('', '', digits)
    texto = texto.translate(remove_digits)

    return texto

def tokenization(texto):
    # Split on runs of non-word characters
    return re.split(r'\W+', texto)

def stemming(texto):
    # Stem each token and return the stemmed list
    # (the original returned the unstemmed input by mistake)
    return [stemmer.stem(word) for word in texto]


wordnet_lemmatizer = WordNetLemmatizer()

def split_into_lemmas(text):
    text = text.lower()
    words = text.split()
    return [wordnet_lemmatizer.lemmatize(word) for word in words]


dataTraining['plot'] = dataTraining['plot'].apply(lambda x: clean_text(x))
#dataTraining['plot'] = dataTraining['plot'].apply(lambda x: tokenization(x.lower()))
#dataTraining['plot'] = dataTraining['plot'].apply(lambda x: stemming(x))

#dataTraining['plot']=[" ".join(plot) for plot in dataTraining['plot'].values]

dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,in sweden a female blackmailer with a disfi...,"['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,in a friday afternoon in new york the presi...,['Drama'],7.4
2582,1990,Narrow Margin,in los angeles the editor of a publishing h...,"['Action', 'Crime', 'Thriller']",6.6
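
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same substitutions as `clean_text` to a made-up sentence. The function name `clean_plot` and the sample text are illustrative and not part of the notebook.

```python
import re

def clean_plot(text):
    # Same steps as clean_text: drop apostrophes,
    # replace every non-letter with a space, then lowercase
    text = re.sub("'", "", text)
    text = re.sub("[^a-zA-Z]", " ", text)
    return text.lower()

sample = "In 1941, Dr. Who's face was seen!"  # made-up example text
cleaned = clean_plot(sample)
tokens = cleaned.split()  # whitespace split, as split_into_lemmas does
print(tokens)
```

Punctuation and digits become spaces, which is why the cleaned plots in the table above show wider gaps where commas used to be.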


In [6]:
# Define the target variable (y)
import ast

# Parse each genre list safely with literal_eval (instead of eval)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
mle = MultiLabelBinarizer()
y_genres = mle.fit_transform(dataTraining['genres'])

In [7]:
y_genres.shape

(7895, 24)
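
As a small illustration of what the binarizer produces, the sketch below runs `MultiLabelBinarizer` on three made-up genre lists. The toy data is an assumption for illustration only. The real target has 24 columns, one per genre.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up genre lists standing in for dataTraining['genres']
toy_genres = [
    ['Comedy', 'Crime', 'Horror'],
    ['Drama'],
    ['Crime', 'Drama'],
]

binarizer = MultiLabelBinarizer()
y_toy = binarizer.fit_transform(toy_genres)

print(binarizer.classes_)  # column order: genres sorted alphabetically
print(y_toy)               # one row per movie, one 0/1 column per genre
```

Because the columns follow the sorted order of `classes_`, the probability columns written for Kaggle later in the notebook are listed alphabetically as well.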

## Hyperparameter Tuning

In [None]:
pipeline = Pipeline(
    [
        ## Preprocessing of the predictor variables (vectorization):
        #("CV", CountVectorizer()), 
        # TfidfVectorizer rather than TfidfTransformer: the grid tunes vectorizer parameters (analyzer, max_df, ...)
        ('tfidf', TfidfVectorizer()),
        
        ## Models to tune
        #("OVR_RF", OneVsRestClassifier(RandomForestClassifier())),
        ("OVR_LR", OneVsRestClassifier(LogisticRegression())),
        #("OVR_NB", OneVsRestClassifier(MultinomialNB())),
        #("OVR_XGB", OneVsRestClassifier(XGBClassifier())),
    ])

In [None]:
# Parameters to tune:

parameters = {
    'OVR_LR__estimator__penalty' : ['l1','l2'],
    'OVR_LR__estimator__class_weight' : ['balanced'],
    'OVR_LR__estimator__solver' : ['saga'],
    'OVR_LR__estimator__n_jobs' : [-1],
    'tfidf__analyzer' : [split_into_lemmas],
    'tfidf__stop_words' : [None, "english"],
    #'tfidf__ngram_range': ((1, 1), (1, 2), (2, 2)), 
    "tfidf__max_df": (0.068,0.07,0.073),
    'tfidf__max_features': (4878,4880,4883),    
    'tfidf__norm': ('l1', 'l2'),
    'tfidf__use_idf': (True, False),
    'tfidf__smooth_idf' : [True, False]
}

In [None]:
from time import time

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring = "f1_macro")

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)
t0 = time()
# Fit on the raw plots and the binarized genres: the pipeline vectorizes the text itself
grid_search.fit(dataTraining['plot'], y_genres)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

## Random Forest

In [63]:
# Define the predictor variables (X)

# TfidfVectorizer
vect = TfidfVectorizer(max_features = 10000)
X_dtm = vect.fit_transform(dataTraining['plot'])
display(X_dtm.shape)

# Split the predictors (X) and the target (y) into training and test sets with train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

(7895, 10000)

In [64]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

OneVsRestClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=10,
                                                     max_features='auto',
                                                     max_leaf_nodes=None,
                                                     max_samples=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                              

In [65]:
# Predict per-genre probabilities with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print model performance (macro-averaged AUC)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.811770019789752
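
The metric used throughout is the macro-averaged AUC, which is the unweighted mean of the per-genre AUCs. A minimal sketch with made-up labels and scores for two genres:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and scores: 4 movies, 2 genres
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.2, 0.5]])

# AUC computed separately for each genre column
per_label = [roc_auc_score(y_true[:, j], y_score[:, j])
             for j in range(y_true.shape[1])]

# Macro average as computed by scikit-learn on the full matrices
macro = roc_auc_score(y_true, y_score, average='macro')
print(per_label, macro)
```

Since every genre counts equally, rare genres influence this score as much as frequent ones.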

## Logistic Regression with Multilabel

In [8]:
# Define the predictor variables (X)

# TfidfVectorizer
vect = TfidfVectorizer(max_features = 10000, analyzer = split_into_lemmas, norm = 'l2', smooth_idf = False, use_idf = True)
X_dtm = vect.fit_transform(dataTraining['plot'])
display(X_dtm.shape)

# Split the predictors (X) and the target (y) into training and test sets with train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

(7895, 10000)

In [9]:
# Model definition and training
clf = OneVsRestClassifier(LogisticRegression(n_jobs=-1, penalty='l2', solver='lbfgs', class_weight='balanced'))
clf.fit(X_train, y_train_genres)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight='balanced',
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto', n_jobs=-1,
                                                 penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [10]:
# Predict per-genre probabilities with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print model performance (macro-averaged AUC)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8820557310152739

## Naive Bayes

In [24]:
# Define the predictor variables (X)

# CountVectorizer
vect = CountVectorizer(max_df=0.0925, max_features=5000, analyzer = split_into_lemmas) #stop_words='english'
X_dtm = vect.fit_transform(dataTraining['plot'])
display(X_dtm.shape)

# Split the predictors (X) and the target (y) into training and test sets with train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

(7895, 5000)

In [25]:
# Model definition and training
clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train, y_train_genres)

OneVsRestClassifier(estimator=MultinomialNB(alpha=1.0, class_prior=None,
                                            fit_prior=True),
                    n_jobs=None)

In [26]:
# Predict per-genre probabilities with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print model performance (macro-averaged AUC)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8485746973166446

## XGBoost

In [42]:
# Define the predictor variables (X)

# TfidfVectorizer
vect = TfidfVectorizer(max_features = 10000, analyzer = split_into_lemmas, norm = 'l2', smooth_idf = False, use_idf = True)

X_dtm = vect.fit_transform(dataTraining['plot'])
display(X_dtm.shape)

# Split the predictors (X) and the target (y) into training and test sets with train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

(7895, 10000)

In [43]:
# Model definition and training
clf = OneVsRestClassifier(XGBClassifier(learning_rate=0.1, colsample_bytree=0.5, gamma=4, random_state=1))
clf.fit(X_train, y_train_genres)

OneVsRestClassifier(estimator=XGBClassifier(base_score=None, booster=None,
                                            callbacks=None,
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=0.5,
                                            early_stopping_rounds=None,
                                            enable_categorical=False,
                                            eval_metric=None, gamma=4,
                                            gpu_id=None, grow_policy=None,
                                            importance_type=None,
                                            interaction_constraints=None,
                                            learning_rate=0.1, max_bin=None,
                                            max_cat_to_onehot=None,
                                            max_delta_step=None, max_depth=None,
          

In [44]:
# Predict per-genre probabilities with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print model performance (macro-averaged AUC)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8394698041502177

## LSTM 

In [24]:
import keras
from keras import backend as K
from keras.models import Sequential
# Import the layers from keras.layers; the recurrent/core/embeddings submodules are not available in newer Keras releases
from keras.layers import LSTM, Dense, Dropout, Embedding
from keras.preprocessing import sequence

from livelossplot import PlotLossesKeras
%matplotlib inline

In [31]:
X_LSTM = dataTraining['plot'].tolist()

# Build a character-level vocabulary from the training plots (index 0 is reserved for padding)
voc = set(''.join(X_LSTM))
vocabulary = {x: idx + 1 for idx, x in enumerate(sorted(voc))}

# Maximum sequence length, in characters
max_len = 1000
X_LSTM = [x[:max_len] for x in X_LSTM]

# Convert characters to integers and pad to max_len
X_LSTM = [[vocabulary[x1] for x1 in x if x1 in vocabulary] for x in X_LSTM]
X_LSTM = sequence.pad_sequences(X_LSTM, maxlen=max_len)


# Split the predictors (X) and the target (y) into training and test sets with train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_LSTM, y_genres, test_size=0.33, random_state=42)
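
The character encoding and padding in the cell above can be sketched without Keras. The example below uses plain NumPy and two made-up strings. It mimics what `pad_sequences` does by default, which is to pad with zeros at the front. The helper name `encode_and_pad` is hypothetical.

```python
import numpy as np

texts = ["cab", "a bad cab"]  # made-up stand-ins for the plots
vocab = {ch: idx + 1 for idx, ch in enumerate(sorted(set(''.join(texts))))}
max_len = 6

def encode_and_pad(text, vocab, max_len):
    # Keep the first max_len characters and map each one to its integer id
    ids = [vocab[ch] for ch in text[:max_len] if ch in vocab]
    # Left-pad with zeros, the default behavior of pad_sequences
    padded = np.zeros(max_len, dtype=int)
    if ids:
        padded[-len(ids):] = ids
    return padded

X_toy = np.stack([encode_and_pad(t, vocab, max_len) for t in texts])
print(vocab)
print(X_toy)
```

Index 0 never appears in the vocabulary, so the embedding layer can treat it purely as padding.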

In [32]:
model = Sequential()
model.add(Embedding(len(vocabulary) + 1, 128, input_length=max_len))
model.add(LSTM(21, dropout=0.5))
model.add(Dropout(0.5))
model.add(Dense(24, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.summary() 

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 1000, 128)         3584      
                                                                 
 lstm_2 (LSTM)               (None, 21)                12600     
                                                                 
 dropout_2 (Dropout)         (None, 21)                0         
                                                                 
 dense_2 (Dense)             (None, 24)                528       
                                                                 
Total params: 16,712
Trainable params: 16,712
Non-trainable params: 0
_________________________________________________________________
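
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes. The vocabulary size of 27 (26 lowercase letters plus the space) is an assumption inferred from the embedding's 3,584 parameters and from the cleaning step.

```python
vocab_size = 27      # assumption: 26 lowercase letters plus the space
embedding_dim = 128
lstm_units = 21
n_genres = 24

# Embedding: one vector per vocabulary index, plus index 0 for padding
embedding_params = (vocab_size + 1) * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense output layer: weights plus one bias per genre
dense_params = lstm_units * n_genres + n_genres

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```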


In [33]:
model.fit(X_train, y_train_genres, validation_data=(X_test, y_test_genres), batch_size=24, epochs=10, verbose=1, callbacks=[PlotLossesKeras()])

Epoch 1/10
 18/221 [=>............................] - ETA: 3:46 - loss: 0.6429 - accuracy: 0.1736

KeyboardInterrupt: 

## Data OOT - Testing 

In [22]:
# Transform the predictor variables (X) of the test set

# Apply the same cleaning used on the training plots, so the fitted vectorizer sees consistent input
dataTesting['plot'] = dataTesting['plot'].apply(lambda x: clean_text(x))

X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [23]:
# Save the predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_LR_TVec_Blc.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.267194,0.279288,0.108277,0.085547,0.568156,0.250839,0.034905,0.486035,0.177305,0.477951,...,0.401186,0.273624,0.008773,0.892269,0.141252,0.0567,0.081294,0.340726,0.070689,0.123154
4,0.441787,0.107861,0.188399,0.51747,0.408021,0.709902,0.337557,0.641156,0.117255,0.099274,...,0.195617,0.164389,0.019732,0.151704,0.092027,0.0781,0.066774,0.514413,0.166083,0.153833
5,0.232943,0.12753,0.058718,0.327611,0.229498,0.822622,0.058519,0.752264,0.066478,0.119403,...,0.110351,0.878031,0.008146,0.261019,0.405813,0.056154,0.162329,0.682154,0.170601,0.165427
6,0.409141,0.415869,0.067779,0.133755,0.304626,0.218998,0.039754,0.645831,0.166866,0.212028,...,0.188706,0.277358,0.011966,0.448434,0.427896,0.034617,0.128468,0.50772,0.42312,0.171304
7,0.156751,0.181676,0.122124,0.1432,0.31343,0.234045,0.106658,0.363503,0.215756,0.462822,...,0.105153,0.462379,0.01048,0.346148,0.74912,0.074563,0.057744,0.503922,0.077563,0.093795
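
Each row of this output is a vector of per-genre probabilities. One way to read it is to rank the genres for a movie. The sketch below uses made-up values and a four-genre subset. The helper name `top_genres` is hypothetical.

```python
import numpy as np

genres = ['Action', 'Comedy', 'Drama', 'Romance']  # illustrative subset
probs = np.array([0.27, 0.57, 0.49, 0.89])         # made-up probabilities

def top_genres(probs, genres, k=2):
    # Indices of the k largest probabilities, highest first
    order = np.argsort(probs)[::-1][:k]
    return [(genres[i], float(probs[i])) for i in order]

ranking = top_genres(probs, genres, k=2)
print(ranking)
```

The probabilities of one movie need not sum to 1, because each genre is scored by its own one-vs-rest classifier.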


## Model Deployment

In [11]:
import joblib

import os
os.chdir('..')

joblib.dump(vect, 'vect_movies.pkl', compress=3)
joblib.dump(clf, 'model_movies.pkl', compress=3)

['model_movies.pkl']
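
The saved artifacts can later be reloaded to score new plots. The sketch below shows one possible shape for such a helper. It trains toy stand-ins on made-up data so that it runs on its own. The function `predict_genre_probabilities` is hypothetical and is not the `predict_proba` imported from `m09_model_deployment`, whose implementation is not shown in this notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for the fitted vectorizer and classifier (made-up data)
plots = ["a clown tells jokes", "a ghost haunts a house",
         "a funny ghost tells jokes", "a family mourns a loss"]
genres = [['Comedy'], ['Horror'], ['Comedy', 'Horror'], ['Drama']]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(genres)
vect = TfidfVectorizer()
clf = OneVsRestClassifier(LogisticRegression()).fit(vect.fit_transform(plots), y)

# Persist both artifacts, using the same file names as the notebook
artifacts_dir = tempfile.mkdtemp()
joblib.dump(vect, os.path.join(artifacts_dir, 'vect_movies.pkl'), compress=3)
joblib.dump(clf, os.path.join(artifacts_dir, 'model_movies.pkl'), compress=3)

def predict_genre_probabilities(plot, artifacts_dir, classes):
    """Hypothetical helper: load both artifacts and score a single plot."""
    loaded_vect = joblib.load(os.path.join(artifacts_dir, 'vect_movies.pkl'))
    loaded_clf = joblib.load(os.path.join(artifacts_dir, 'model_movies.pkl'))
    probs = loaded_clf.predict_proba(loaded_vect.transform([plot]))[0]
    return {genre: float(p) for genre, p in zip(classes, probs)}

result = predict_genre_probabilities("a ghost tells jokes", artifacts_dir,
                                     binarizer.classes_)
print(result)
```

A vectorizer built with a custom analyzer such as `split_into_lemmas` can only be unpickled where that function is importable, so a deployment module would need to define or import it before loading.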

In [None]:
# Library imports
import werkzeug
werkzeug.cached_property = werkzeug.utils.cached_property

from flask import Flask, request
from flask_restplus import Api, Resource, fields

# Import the model's prediction function
from model_deployment.m09_model_deployment import predict_proba

# Flask application definition
app = Flask(__name__)

# Flask API definition
api = Api(
    app, 
    version='1.0', 
    title='Movie Genre Classification',
    description='API for movie genre classification')

ns = api.namespace('predict', 
     description='Movies Classifier')

# API arguments or parameters
# Note: the query parameter is named 'URL', but it carries the plot text to classify
parser = api.parser()
parser.add_argument(
    'URL', 
    type=str, 
    required=True, 
    help='Description to be analyzed', 
    location='args')

resource_fields = api.model('Resource', {
    'result': fields.String,
})

In [None]:
# Resource class that serves the model
@ns.route('/')
class PhishingApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        
        return {
         "result": predict_proba(args['URL'])
        }, 200

In [None]:
# Run the application, serving the model locally on port 5000
app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)
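
Once the application is running, the endpoint can be queried with a GET request. The sketch below only builds the request URL with the standard library. It assumes that the app is reachable at `localhost:5000` and that the `predict` namespace with route `/` resolves to the path `/predict/`. The plot text is a made-up example.

```python
from urllib.parse import urlencode

# Assumption: the app runs locally on port 5000 and serves /predict/
base_url = "http://localhost:5000/predict/"
plot = "a serial killer decides to teach the secrets of his satisfying career"

# The API reads the plot from the query parameter named 'URL'
query_url = base_url + "?" + urlencode({"URL": plot})
print(query_url)
```

Opening the printed URL in a browser, or requesting it with any HTTP client, should return a JSON object with a `result` field if the service is up.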

In [None]:
class ExampleApi(Resource):
    @api.doc(parser = parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        result = self.calculate(args)
        
        return result, 200
    
    def calculate(self, args):
        # Assumes the parser defines a numeric 'number' argument
        value = args['number']
        output = (value/2)*5
        
        print('result=', output)
        
        return {
            'result': output
        }