![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie Genre Classification

The purpose of this project is for you to put into practice, within your working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. While developing it, follow the instructions given in the "Project 2 Guide: Movie Genre Classification".

**Submission**: The project must be submitted during week 8. However, it is important that you make progress in week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF format to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The idea is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.
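
Because a movie can carry several genres, the target for each movie can be thought of as a row of 0/1 indicators, one per genre, and a model then outputs one probability per indicator. A minimal hand-rolled sketch of that encoding follows. The first genre list is the example above; the other two lists and all variable names are illustrative assumptions, not part of the project data.

```python
# Toy genre lists: the first is the example above, the rest are illustrative.
genres_per_movie = [
    ['Comedy', 'Crime', 'Horror'],
    ['Drama'],
    ['Short', 'Drama'],
]

# One column per distinct genre, in alphabetical order.
classes = sorted({g for genres in genres_per_movie for g in genres})

# Each movie becomes a row of 0/1 flags, one flag per genre.
indicator = [[1 if c in genres else 0 for c in classes]
             for genres in genres_per_movie]

print(classes)
for row in indicator:
    print(row)
```

Each column of this matrix is a separate yes/no question, which is why the predictions are reported as one probability per genre.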

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: predicting the test set for submission to Kaggle
This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import libraries
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [3]:
# Load the data from the (zipped) .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [4]:
# Preview of the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [5]:
# Preview of the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


#### Libraries

In [6]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
# Download NLTK resources
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    WordNetLemmatizer().lemmatize('test')
except LookupError:
    nltk.download('wordnet')
try:
    nltk.word_tokenize("example")
except LookupError:
    nltk.download('punkt')
    nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\CYBER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\CYBER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\CYBER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\CYBER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
# Preprocessing function
def preprocess_text(text):
    if isinstance(text, str):  # Make sure the input is a string
        text = text.lower()
        text = re.sub(r'[^a-z\s]', '', text)  # Remove every character other than lowercase letters and whitespace
        tokens = nltk.word_tokenize(text)
        # Lemmatize, dropping stopwords and words of two characters or fewer
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 2]
        return ' '.join(tokens)
    return ''  # Handle cases where the text is not a string
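
The replacement string passed to `re.sub` matters for hyphenated words and contractions. The sketch below, on a made-up sentence, compares deleting non-letter characters (an empty replacement, as in `preprocess_text`) with replacing them by a space. `strip_chars` and `space_chars` are hypothetical helper names used only for this illustration.

```python
import re

def strip_chars(text):
    # Delete every character that is not a lowercase letter or whitespace.
    return re.sub(r'[^a-z\s]', '', text.lower())

def space_chars(text):
    # Replace those characters with a space, then collapse repeated whitespace.
    spaced = re.sub(r'[^a-z\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', spaced).strip()

sample = "His four-year-old nephew doesn't know."
print(strip_chars(sample))  # his fouryearold nephew doesnt know
print(space_chars(sample))  # his four year old nephew doesn t know
```

In this dataset the plots already have spaces around punctuation (for example `year - old`), so the two variants mostly agree here; the difference would show up on text that is not pre-tokenized.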

Simple tokenization.
    Still to be checked: 

In [None]:
# Apply to the data
dataTraining['plot_clean'] = dataTraining['plot'].apply(preprocess_text)
dataTesting['plot_clean'] = dataTesting['plot'].apply(preprocess_text)

                                                   plot  \
3107  most is the story of a single father who takes...   
900   a serial killer decides to teach the secrets o...   
6724  in sweden ,  a female blackmailer with a disfi...   
4704  in a friday afternoon in new york ,  the presi...   
2582  in los angeles ,  the editor of a publishing h...   

                                             plot_clean  
3107  story single father take eight year old son wo...  
900   serial killer decides teach secret satisfying ...  
6724  sweden female blackmailer disfiguring facial s...  
4704  friday afternoon new york president tredway co...  
2582  los angeles editor publishing house carol hunn...  


In [17]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating,plot_clean
3107,2003,Most,"most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender . a day before , the boy meets a woman boarding a train , a drug abuser . at the bridge , the father goes into the engine room , and tells his son to stay at the edge of the nearby lake . a ship comes , and the bridge is lifted . though it is supposed to arrive an hour later , the train happens to arrive . the son sees this , and tries to warn his father , who is not able to see this . just as the oncoming train approaches , his son falls into the drawbridge gear works while attempting to lower the bridge , leaving the father with a horrific choice . the father then lowers the bridge , the gears crushing the boy . the people in the train are completely oblivious to the fact a boy died trying to save them , other than the drug addict woman , who happened to look out her train window . the movie ends , with the man wandering a new city , and meets the woman , no longer a drug addict , holding a small baby . other relevant narratives run in parallel , namely one of the female drug - addict , and they all meet at the climax of this tumultuous film .","['Short', 'Drama']",8.0,story single father take eight year old son work railroad drawbridge bridge tender day boy meet woman boarding train drug abuser bridge father go engine room tell son stay edge nearby lake ship come bridge lifted though supposed arrive hour later train happens arrive son see try warn father able see oncoming train approach son fall drawbridge gear work attempting lower bridge leaving father horrific choice father lower bridge gear crushing boy people train completely oblivious fact boy died trying save drug addict woman happened look train window movie end man wandering new city meet woman longer drug addict holding small baby relevant narrative run parallel namely one female drug addict meet climax tumultuous film
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets of his satisfying career to a video store clerk .,"['Comedy', 'Crime', 'Horror']",5.6,serial killer decides teach secret satisfying career video store clerk
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfiguring facial scar meets a gentleman who lives beyond his means . they become accomplices in blackmail , and she falls in love with him , bitterly resigned to the impossibility of his returning her affection . her life changes when one of her victims proves to be the wife of a plastic surgeon , who catches her in his apartment , but believes her to be a jewel thief rather than a blackmailer . he offers her the chance to look like a normal woman again , and she accepts , despite the agony of multiple operations . meanwhile , her gentleman accomplice forms an evil scheme to rid himself of the one person who stands in his way to a fortune - his four - year - old - nephew .","['Drama', 'Film-Noir', 'Thriller']",7.2,sweden female blackmailer disfiguring facial scar meet gentleman life beyond mean become accomplice blackmail fall love bitterly resigned impossibility returning affection life change one victim prof wife plastic surgeon catch apartment belief jewel thief rather blackmailer offer chance look like normal woman accepts despite agony multiple operation meanwhile gentleman accomplice form evil scheme rid one person stand way fortune four year old nephew
4704,1954,Executive Suite,"in a friday afternoon in new york , the president of the tredway corporation avery bullard has just had a meeting with investment bankers and sends a telegram scheduling a meeting at the furniture factory in millburgh , pennsylvania , at six pm with his executives . bullard has never appointed an executive vice - president for the corporation after the death of the previous one but when he is getting a taxi , he has a stroke and dies on the street . a thief steals his wallet to get his money and his body goes to the morgue without identification . the investment banker george nyle caswell sees bullard ' s body from his window and decides to use the information to make money , asking a broker to sell as much tredway stocks as possible until the end of the day , with the intention of buying them back monday morning by a lower price making profit . meanwhile the executives unsuccessfully wait for bullard in the meeting room . when they learn that bullard is dead , the ambitions accountant vp and controller loren phineas shaw releases to the press the balance of tredway showing profit and assumes temporarily the leadership of the company , expecting to be elected the next president by the seven - member board . however , the vp for design and development mcdonald "" don "" walling and the vp and treasurer frederick y . alderson oppose to shaw . there is a struggle in the corporation for the position of president and shaw blackmails the vp for sales josiah walter dudley that is married and has a mistress , his secretary eva bardeman , to get his vote . caswell needs to cover the N , N stocks he sold and shaw promises to give to him the stocks for the price he sold if he is elected president . the vp for manufacturing jesse q . grimm is near to retire but is a close friend of frederick and supports him . therefore the heir of tredway and bullard ' s mistress julia o . tredway will be responsible to give the casting vote . 
but she is disenchanted with the corporation . who will be elected the next president ?",['Drama'],7.4,friday afternoon new york president tredway corporation avery bullard meeting investment banker sends telegram scheduling meeting furniture factory millburgh pennsylvania six pm executive bullard never appointed executive vice president corporation death previous one getting taxi stroke dy street thief steal wallet get money body go morgue without identification investment banker george nyle caswell see bullard body window decides use information make money asking broker sell much tredway stock possible end day intention buying back monday morning lower price making profit meanwhile executive unsuccessfully wait bullard meeting room learn bullard dead ambition accountant vp controller loren phineas shaw release press balance tredway showing profit assumes temporarily leadership company expecting elected next president seven member board however vp design development mcdonald walling vp treasurer frederick alderson oppose shaw struggle corporation position president shaw blackmail vp sale josiah walter dudley married mistress secretary eva bardeman get vote caswell need cover n n stock sold shaw promise give stock price sold elected president vp manufacturing jesse q grimm near retire close friend frederick support therefore heir tredway bullard mistress julia tredway responsible give casting vote disenchanted corporation elected next president
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing house carol hunnicut goes to a blind date with the lawyer michael tarlow , who has embezzled the powerful mobster leo watts . carol accidentally witnesses the murder of michel by leo ' s hitman . the scared carol sneaks out of michael ' s room and hides in an isolated cabin in canada . meanwhile the deputy district attorney robert caulfield and sgt . dominick benti discover that carol is a witness of the murder and they report the information to caulfield ' s chief martin larner and they head by helicopter to canada to convince carol to testify against leo . however they are followed and the pilot and benti are murdered by the mafia . caulfield and carol flees and they take a train to vancouver . caulfield hides carol in his cabin and he discloses that there are three hitman in the train trying to find carol and kill her . but they do not know her and caulfield does not know who might be the third killer from the mafia and who has betrayed him in his office .","['Action', 'Crime', 'Thriller']",6.6,los angeles editor publishing house carol hunnicut go blind date lawyer michael tarlow embezzled powerful mobster leo watt carol accidentally witness murder michel leo hitman scared carol sneak michael room hide isolated cabin canada meanwhile deputy district attorney robert caulfield sgt dominick benti discover carol witness murder report information caulfield chief martin larner head helicopter canada convince carol testify leo however followed pilot benti murdered mafia caulfield carol flees take train vancouver caulfield hide carol cabin discloses three hitman train trying find carol kill know caulfield know might third killer mafia betrayed office


## Cleaning

## Tokenization

## Stop Word Removal

## Lemmatization

## Multi-label Classification

In [None]:
# Definition of the predictor variables (X)
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

In [None]:
# Definition of the target variable (y): parse the stringified genre lists safely
import ast
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [None]:
# Split the predictors (X) and the target (y) into training and test sets using the train_test_split function
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [None]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [None]:
# Prediction with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Report model performance (macro-averaged ROC AUC on the held-out split)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')
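
The `average='macro'` option scores each genre column separately and then takes the unweighted mean. A minimal pure-Python sketch of that idea, on made-up labels and scores, uses the pairwise definition of AUC: the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counting one half. `binary_auc` and `macro_auc` are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
def binary_auc(y_true, scores):
    # Split the scores by true label.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    # Count correctly ordered (positive, negative) pairs; ties count 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    # Compute one AUC per label column, then average them with equal weight.
    n_labels = len(Y_true[0])
    aucs = []
    for j in range(n_labels):
        col_true = [row[j] for row in Y_true]
        col_score = [row[j] for row in Y_score]
        aucs.append(binary_auc(col_true, col_score))
    return sum(aucs) / n_labels

# Made-up example with 4 movies and 2 genres.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.35, 0.6]]

print(macro_auc(Y_true, Y_score))  # (1.0 + 0.75) / 2 = 0.875
```

Because the mean is unweighted, a rare genre weighs as much as a frequent one in the final score.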

In [None]:
# Transform the predictor variables X of the test set
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Prediction on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [None]:
# Save the predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()
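
The column list above is typed by hand. A hedged alternative is to derive it from the fitted binarizer: `MultiLabelBinarizer` stores its labels in sorted order in `classes_`, and the indicator columns it produces (and therefore the probability columns of a one-vs-rest model trained on them) follow that order. The sketch below uses toy genre lists and the hypothetical names `le_toy` and `cols_toy`.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists standing in for the parsed 'genres' column.
toy_genres = [['Short', 'Drama'], ['Comedy', 'Crime', 'Horror'], ['Drama']]

le_toy = MultiLabelBinarizer()
le_toy.fit(toy_genres)

# Prefix each learned class with 'p_' to build the submission column names.
cols_toy = ['p_' + g for g in le_toy.classes_]
print(cols_toy)
```

In the notebook this would correspond to `cols = ['p_' + g for g in le.classes_]`, assuming the training data contains every genre in the hand-typed list.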