![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, within your working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. While developing it, follow the instructions given in the "Project 2 guide: Movie genre classification".

**Submission**: The project must be submitted during week 8. However, it is important that you make progress in week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The idea is to use these data to predict, given the synopsis, the probability that a movie belongs to each genre.
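
Because a movie can carry several genres at once, the target is a multi-label indicator matrix rather than a single class column. The following is a minimal sketch of that encoding, assuming scikit-learn's `MultiLabelBinarizer`; the `toy_genres` lists are hypothetical and are not read from the dataset files.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre lists for three movies (illustration only)
toy_genres = [
    ['Comedy', 'Crime', 'Horror'],
    ['Drama'],
    ['Crime', 'Drama'],
]

mlb_toy = MultiLabelBinarizer()
y_toy = mlb_toy.fit_transform(toy_genres)   # one row per movie, one column per genre

print(list(mlb_toy.classes_))   # column order (sorted genre names)
print(y_toy)                    # 1 where the movie has that genre, 0 otherwise
```

Each column of `y_toy` can then be treated as its own binary problem, which is what lets a model output one probability per genre.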

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: test-set prediction for Kaggle submission
This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.

In [18]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [19]:
# Library imports
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [20]:
# Load the data from the zipped .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

### Exploratory data analysis

In [21]:
# Preview of the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,"most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender . a day before , the boy meets a woman boarding a train , a drug abuser . at the bridge , the father goes into the engine room , and tells his son to stay at the edge of the nearby lake . a ship comes , and the bridge is lifted . though it is supposed to arrive an hour later , the train happens to arrive . the son sees this , and tries to warn his father , who is not able to see this . just as the oncoming train approaches , his son falls into the drawbridge gear works while attempting to lower the bridge , leaving the father with a horrific choice . the father then lowers the bridge , the gears crushing the boy . the people in the train are completely oblivious to the fact a boy died trying to save them , other than the drug addict woman , who happened to look out her train window . the movie ends , with the man wandering a new city , and meets the woman , no longer a drug addict , holding a small baby . other relevant narratives run in parallel , namely one of the female drug - addict , and they all meet at the climax of this tumultuous film .","['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets of his satisfying career to a video store clerk .,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfiguring facial scar meets a gentleman who lives beyond his means . they become accomplices in blackmail , and she falls in love with him , bitterly resigned to the impossibility of his returning her affection . her life changes when one of her victims proves to be the wife of a plastic surgeon , who catches her in his apartment , but believes her to be a jewel thief rather than a blackmailer . he offers her the chance to look like a normal woman again , and she accepts , despite the agony of multiple operations . meanwhile , her gentleman accomplice forms an evil scheme to rid himself of the one person who stands in his way to a fortune - his four - year - old - nephew .","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the president of the tredway corporation avery bullard has just had a meeting with investment bankers and sends a telegram scheduling a meeting at the furniture factory in millburgh , pennsylvania , at six pm with his executives . bullard has never appointed an executive vice - president for the corporation after the death of the previous one but when he is getting a taxi , he has a stroke and dies on the street . a thief steals his wallet to get his money and his body goes to the morgue without identification . the investment banker george nyle caswell sees bullard ' s body from his window and decides to use the information to make money , asking a broker to sell as much tredway stocks as possible until the end of the day , with the intention of buying them back monday morning by a lower price making profit . meanwhile the executives unsuccessfully wait for bullard in the meeting room . when they learn that bullard is dead , the ambitions accountant vp and controller loren phineas shaw releases to the press the balance of tredway showing profit and assumes temporarily the leadership of the company , expecting to be elected the next president by the seven - member board . however , the vp for design and development mcdonald "" don "" walling and the vp and treasurer frederick y . alderson oppose to shaw . there is a struggle in the corporation for the position of president and shaw blackmails the vp for sales josiah walter dudley that is married and has a mistress , his secretary eva bardeman , to get his vote . caswell needs to cover the N , N stocks he sold and shaw promises to give to him the stocks for the price he sold if he is elected president . the vp for manufacturing jesse q . grimm is near to retire but is a close friend of frederick and supports him . therefore the heir of tredway and bullard ' s mistress julia o . tredway will be responsible to give the casting vote . 
but she is disenchanted with the corporation . who will be elected the next president ?",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing house carol hunnicut goes to a blind date with the lawyer michael tarlow , who has embezzled the powerful mobster leo watts . carol accidentally witnesses the murder of michel by leo ' s hitman . the scared carol sneaks out of michael ' s room and hides in an isolated cabin in canada . meanwhile the deputy district attorney robert caulfield and sgt . dominick benti discover that carol is a witness of the murder and they report the information to caulfield ' s chief martin larner and they head by helicopter to canada to convince carol to testify against leo . however they are followed and the pilot and benti are murdered by the mafia . caulfield and carol flees and they take a train to vancouver . caulfield hides carol in his cabin and he discloses that there are three hitman in the train trying to find carol and kill her . but they do not know her and caulfield does not know who might be the third killer from the mafia and who has betrayed him in his office .","['Action', 'Crime', 'Thriller']",6.6


In [22]:
# Preview of the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate . theresa osborne is running along the beach when she stumbles upon a bottle washed up on the shore . inside is a message , reading the letter she feels so moved and yet she felt as if she has violated someone ' s thoughts . in love with a man she has never met , theresa tracks down the author of the letter to a small town in wilmington , two lovers with crossed paths . but yet one can ' t let go of their past ."
4,1978,Midnight Express,"the true story of billy hayes , an american college student who is caught smuggling drugs out of turkey and thrown into prison ."
5,1996,Primal Fear,"martin vail left the chicago da ' s office to become a successful criminal lawyer , that success predicated on working on high profile cases . as such , he fights to get the case of naive nineteen year old rural kentuckian aaron stampler , an altar boy accused of the vicious bludgeoning death of archbishop rushman of chicago . the story that aaron tells marty is that he , abused by his father , was in the room when the murder was committed by a third party , a shadowy figure he did not see , before he blacked out , which commonly happens to him . not remembering anything during the blackout period , he awoke covered in the archbishop ' s blood , his fright the reason he ran from the police . he also states that he had no reason to kill the archbishop , who he loved as the father he wished he had . marty doesn ' t care if he is guilty or innocent , but needs to know the truth to defend him adequately . unlike the rest of the world , marty does believe his story , he who hopes he can use aaron ' s general appearance of being an innocent to his advantage . the powerful state attorney , john shaughnessy , who marty has had many a moral run - in , wants a first degree murder conviction and the death penalty in this case . he appoints to the case janet venable , who still has bad feelings toward marty , an ex - lover , their six month relationship which ended badly . although the case looks to be a slam dunk for janet , her career may be made or broken by its outcome . in building his case , marty comes across some major pieces of information , some pertaining to the archbishop himself , and one uncovered by dr . molly arrington about aaron , she a psychiatrist hired by marty to assess aaron ' s mental state . these pieces of information as a collective pose a problem for marty in how to mount a credible and legitimate defense for his client . 
it is more of a moral dilemma for marty if only because he believes the life of a young man , who he believes in , is at stake ."
6,1950,Crisis,"husband and wife americans dr . eugene and mrs . helen ferguson - he a renowned neurosurgeon - are traveling through latin america for a vacation . when they make the decision to return to new york earlier than expected , they find they are being detained by the military in the country they are in . ultimately , they learn the reason is that president raoul farrago , the tyrannical military dictator of the country , has been diagnosed with a brain tumor and will die without an operation to remove it , farrago choosing gene as the doctor to lead the surgical team . because of the volatile politics within the country and for his own safety as revolutionary forces would like to see him dead , farrago refuses to go to a hospital for the operation , instead it to be done at his home . despite not particularly liking farrago or his ways , gene agrees purely in his oath as a doctor . however , he ends up being caught in the middle between farrago / his brutal regime and the revolutionaries , each side who is willing to use him and helen to get what they want , namely the life or death of farrago ."
7,1959,The Tingler,"the coroner and scientist dr . warren chapin is researching the shivering effect of fear with his assistant david morris . dr . warren is introduced to ollie higgins , the relative of a criminal sentenced to the electric chair , while making the autopsy of the corpse , and he makes a comment about the tingler - effect to him . ollie asks for a lift to dr . warner , and introduces his deaf - mute wife martha higgins , who manages a theater of their own . dr . warner returns home , where he lives with his unfaithful and evil wife isabel stevens chapin and her sweet sister lucy stevens . dr . warner , upset with the situation with his wife , threatens and uses her as a subject of his experiment . when martha dies of fear , dr . warner makes her autopsy and finds a creature that lives inside every human being , feeds with fear and is controlled by the scream . once martha was not able to scream , the tingler was not rendered harmless and became enormous . when the living being escapes , dr . warner and ollie chase it in a crowded movie theater ."


In [23]:
nulos = dataTraining.isnull().sum()               # Count null values per column
duplicados = dataTraining.duplicated().sum()      # Count fully duplicated rows
titulos_unicos = dataTraining['title'].nunique()  # Number of unique movie titles

print(f'Null values:\n{nulos}')
print(f'\nDuplicated rows: {duplicados}')
print(f'\nNumber of unique titles: {titulos_unicos}')

Null values:
year      0
title     0
plot      0
genres    0
rating    0
dtype: int64

Duplicated rows: 1

Number of unique titles: 7729
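
Called without arguments, `duplicated()` only flags rows that are identical in every column, whereas a repeated title may simply be a remake with a different year and plot. The sketch below separates the two cases on a hypothetical toy frame (the rows are invented for illustration); whether to drop the exact duplicate before training is a judgment call this notebook leaves open.

```python
import pandas as pd

# Hypothetical rows: one remake pair and one exact duplicate
toy = pd.DataFrame({
    'year':  [1957, 2007, 1990, 1990],
    'title': ['3:10 to Yuma', '3:10 to Yuma', 'Narrow Margin', 'Narrow Margin'],
    'plot':  ['plot a', 'plot b', 'plot c', 'plot c'],
})

exact_dupes = int(toy.duplicated().sum())                  # rows identical in all columns
same_title = int(toy.duplicated(subset=['title']).sum())   # rows whose title was already seen
toy_dedup = toy.drop_duplicates().reset_index(drop=True)   # removes only the identical rows

print(exact_dupes, same_title, len(toy_dedup))
```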


In [24]:
dataTraining.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7895 entries, 3107 to 215
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    7895 non-null   int64  
 1   title   7895 non-null   object 
 2   plot    7895 non-null   object 
 3   genres  7895 non-null   object 
 4   rating  7895 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 370.1+ KB


In [25]:
# List of movies whose title appears more than once
dataTraining[dataTraining.duplicated(subset=['title'], keep=False)].sort_values('title')

Unnamed: 0,year,title,plot,genres,rating
10885,1916,"20,000 Leagues Under the Sea","captain nemo has built a fantastic submarine for his mission of revenge . he has traveled over N , N leagues in search of charles denver - a man who caused the death of princess daaker . seeing what he had done , denver took the daughter to his yacht and sailed away . he abandoned her and a sailor on a mysterious island and has come back after all these years to see if she is still alive and if the nightmares he has will stop . the daughter has been found by five survivors of a union army balloon that crashed near the island . at sea , professor aronnax was aboard the ship ' abraham lincoln ' when nemo rammed it and threw the professor , his daughter and two others into the water . prisoners at first , they are now treated as guests to view the underwater world and to hunt under the waves . nemo will also tells them about the nautilus and the revenge that has driven him for all these years .","['Action', 'Adventure', 'Sci-Fi']",7.1
10745,1954,"20,000 Leagues Under the Sea","in N , a monster is terrorizing the seas , sinking ships ! three unlikely companions board a warship in search of the beast , only to find out the hard way it is an engine of destruction : a submarine boat . the trio is captured by the vessel ' s commander , captain nemo , and accompany him on a journey of adventure and discovery as the nautilus and her occupants travel some N , N leagues under the sea .","['Adventure', 'Drama', 'Family', 'Fantasy', 'Sci-Fi']",7.2
9182,1957,3:10 to Yuma,"when the charming outlaw ben wade is captured after the heist of a stagecoach , the stage line owner mr . butterfield offers us$ N to the man that escorts the bandit to the city of contention to take the N : N pm train to yuma to be sent to trial . the rancher dan evans is broken and needs the money to save his cattle and support his family and accepts the assignment . during their journey , dan saves the life of ben when a vigilante tries to execute the criminal . meanwhile ben ' s gang split to find where ben is and then rescues their boss . when they find that ben is trapped in a hotel room , they put the place under siege and dan can not find any man to help him .","['Drama', 'Thriller', 'Western']",7.6
2583,2007,3:10 to Yuma,"rancher dan evans heads into bisbee to clear up issues concerning the sale of his land when he witnesses the closing events of a stagecoach robbery led by famed outlaw ben wade . shortly thereafter , wade is captured by the law in bisbee and evans finds himself one of the escorts who will take wade to the N : N to yuma train in contention for the reward of $ N . evans ' s effort to take wade to the station is in part an effort to save his land but also part of an inner battle to determine whether he can be more than just a naive rancher in the eyes of his impetuous and gunslinging son william evans . the transport to contention is hazardous and filled with ambushes by indians , pursuits by wade ' s vengeful gang and wade ' s own conniving and surreptitious demeanor that makes the ride all the more intense .","['Adventure', 'Crime', 'Drama', 'Western']",7.8
8062,1938,A Christmas Carol,"in the nineteenth century , in london , the bitter , greedy and cranky ebenezer scrooge hates christmas and people . he runs his business exploiting his employee bob cratchit and spends unfriendly treatment to his nephew fred and acquaintances . in the christmas eve , he is visited by the doomed chained ghost of his former partner jacob marley , who died seven years ago and tells him that three spirits would visit him that night . the first one , the spirit of past christmas , recalls his happy childhood and coming of age ; the spirit of the present christmas shows him the poor situation of bob ' s family and the happiness of fred and his fiancée bessy ; and the spirit of future christmas shows his fate . scrooge finds that life is good and finds redemption changing thoughts about christmas , bob , tiny tim , his nephew and people in general .","['Drama', 'Family', 'Fantasy']",7.5
...,...,...,...,...,...
8107,1959,Warlock,"the town of warlock is plagued by a gang of thugs , leading the inhabitants to hire clay blaisdell , a famous gunman , to act as marshal . when blaisdell appears , he is accompanied by his friend tom morgan , a club - footed gambler who is unusually protective of blaisdell ' s life and reputation . however , johnny gannon , one of the thugs who has reformed , volunteered to accept the post of official deputy sheriff in rivalry to blaisdell ; and a woman arrives in town accusing blaisdell and morgan of having murdered her fiancé . the stage is set for a complex set of moral and personal conflicts .","['Romance', 'Western']",7.3
4314,2000,Where the Heart Is,"having lived her entire life in a trailer , novalee nation is a pregnant , superstitious and uneducated seventeen year old . with the exception of her boyfriend and the unborn baby ' s father willie jack pickens , she is all alone in the world with no money of her own . on the drive from their current home in tennessee to bakersfield , california where they are planning on moving , willie jack abandons novalee in the small town of sequoyah , oklahoma . with no food or money , novalee ends up secretly living in the town ' s wal - mart . as her stay in sequoyah is extended , novalee is befriended and ultimately assisted by a variety of townsfolk , including a recovering alcoholic named sister husband , single mother and nurse lexi coop who names her children after snack foods , walmart ' s contract photographer named moses whitecotten , and the temporary librarian named forney hull , an academically brilliant man originally from new england . as much as they help novalee , she in turn has a profound affect on them as each is dealing with an issue which stops them from having a truly fulfilled life . forney ends up falling in love with novalee , who feels she has to make a decision of how she can best help him . she is assisted in this decision by an unlikely source .","['Comedy', 'Drama', 'Romance']",6.8
9969,1990,Where the Heart Is,"stewart mcbain ( coleman ) is a real - estate mogul who spends his living blowing up old buildings to make room to erect new buildings . all goes as planned for a new subdivision , until a group of protesters object to the destruction of one lonely , ugly building , called the dutch house . typically , the media is sent to the scene of the protest , and mcbain appears on tv in a bad way . his children - daphne ( thurman ) , chloe ( amis ) , and jimmy ( hewlett ) - ridicule him for appearing on tv , and as a reward for their remarks , he drops them off at the dutch house with $ N apiece , and tells them they ' re on their own . they must find jobs if they expect to make money to stay warm . mcbain and his wife , jean watch from afar as their children adapt to their new lifestyle , meeting new friends , and inviting others into their new home , including a decrepit bum .","['Comedy', 'Drama']",6.2
508,1939,Wuthering Heights,"the story of unfortunate lovers heathcliff and cathy who , despite a deep affection for one another , are forced by circumstance and prejudice to live their apart . heathcliff and cathy first meet as children when her father brings the abandoned boy to live with them . when the old man dies several years later cathy ' s brother , now the master of the estate , turns heathcliff out forcing him to live with the servants and working as a stable boy . the barrier of class comes between them and she eventually marries a rich neighbor , mr . edgar linton , at which point heathcliff disappears . he returns several years later , now a rich man but little can be done .","['Drama', 'Romance']",7.7


#### Libraries

In [27]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TreebankWordTokenizer

from itertools import chain
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import ast

# Step 1: Parse each genre string (e.g. "['Short', 'Drama']") into a list and flatten all the lists.
# Without parsing, chain.from_iterable would iterate over the characters of each string.
genre_lists = dataTraining['genres'].map(lambda g: g if isinstance(g, list) else ast.literal_eval(g))
frecuencia_generos = list(chain.from_iterable(genre_lists))

# Step 2: Count the frequency of each genre
conteo_generos = Counter(frecuencia_generos)

# Step 3: Sort the genres by frequency
sorted_genres = conteo_generos.most_common()
genres, counts = zip(*sorted_genres)

# Step 4: Draw the bar chart
plt.figure(figsize=(12, 6))
bars = plt.bar(genres, counts, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Genre')
plt.ylabel('Number of movies')
plt.title('Genre frequency in the dataset')
plt.tight_layout()

# Step 5: Add a count label above each bar
for bar, count in zip(bars, counts):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 5,
             str(count), ha='center', va='bottom', fontsize=9)

plt.show()

In [None]:
# Download the NLTK resources only if they are missing
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    WordNetLemmatizer().lemmatize('test')
except LookupError:
    nltk.download('wordnet')
try:
    nltk.word_tokenize("example")
except LookupError:
    nltk.download('punkt')
    nltk.download('punkt_tab')

In [None]:
# Initialize the preprocessing tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()


In [None]:
# Preprocessing function
def preprocess_text_simple(text):
    if isinstance(text, str): # Make sure the input is a string
        text = text.lower()
        text = re.sub(r'[^a-z\s]', '', text) # Remove every character that is not a lowercase letter or whitespace
        tokens = nltk.word_tokenize(text)
        tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 2] # Lemmatize; drop stop words and tokens of two characters or fewer
        return ' '.join(tokens)
    return '' # Handle inputs that are not strings

Simple tokenization.
    Still to check: 
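
One quick way to sanity-check the cleaning steps is to run them on a single sentence and read the result. The sketch below is a dependency-free approximation of `preprocess_text_simple`. It assumes a tiny hypothetical stop-word set (`TOY_STOP_WORDS`) in place of NLTK's list, splits on whitespace instead of calling `nltk.word_tokenize`, and skips lemmatization, so its output will not match the real function exactly.

```python
import re

# Hypothetical, tiny stop-word set for illustration only (NLTK's English list is much longer)
TOY_STOP_WORDS = {'the', 'his', 'and', 'for'}

def preprocess_text_toy(text):
    """Lowercase, keep only letters and whitespace, drop stop words and tokens of <= 2 characters."""
    if not isinstance(text, str):
        return ''
    text = re.sub(r'[^a-z\s]', '', text.lower())
    tokens = [w for w in text.split() if w not in TOY_STOP_WORDS and len(w) > 2]
    return ' '.join(tokens)

sample = "Bullard ' s body goes to the morgue , 4 - year - old !"
cleaned = preprocess_text_toy(sample)
print(cleaned)   # bullard body goes morgue year old
```

The sample shows that possessive fragments such as `s`, digits and punctuation disappear, which is the behavior to confirm on real plots before relying on `plot_clean`.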

In [None]:
# Apply the preprocessing to both datasets
dataTraining['plot_clean'] = dataTraining['plot'].apply(preprocess_text_simple)
dataTesting['plot_clean'] = dataTesting['plot'].apply(preprocess_text_simple)

In [None]:
dataTraining.head()

### Multi-label classification

In [None]:
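import ast

# Parse the genre strings into lists; otherwise MultiLabelBinarizer would treat each character as a label
genre_lists = dataTraining['genres'].map(lambda g: g if isinstance(g, list) else ast.literal_eval(g))

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(genre_lists)

genre_labels = mlb.classes_

# Build a new DataFrame with one binary column per genre
genre_df = pd.DataFrame(genre_matrix, columns=genre_labels)

# Join it to the original DataFrame (optional)
dataTraining_bin = pd.concat([dataTraining.reset_index(drop=True), genre_df], axis=1)
dataTraining_bin.head()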


## Model training

In [None]:
# Definition of the predictor variables (X)
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape
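
To make the document-term matrix concrete, here is a minimal sketch of `CountVectorizer` on two hypothetical plots. The sentences and the `max_features=5` cap are chosen only for illustration; the cell above uses 1000.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical plots (illustration only)
toy_plots = [
    'a serial killer teaches a video store clerk',
    'a killer hides on a train',
]

vect_toy = CountVectorizer(max_features=5)   # keep at most the 5 most frequent terms
X_toy = vect_toy.fit_transform(toy_plots)    # sparse matrix: rows = plots, columns = terms

print(X_toy.shape)                  # (2, 5)
print(sorted(vect_toy.vocabulary_)) # the terms that were kept
```

With the default settings, single-character tokens such as `a` are ignored, and `max_features` keeps the terms with the highest corpus-wide counts.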

In [None]:
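# Definition of the target variable (y)
import ast
# ast.literal_eval parses the list literal safely (unlike eval) and is skipped if the value is already a list
dataTraining['genres'] = dataTraining['genres'].map(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])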

In [None]:
# Split the predictors (X) and the target (y) into training and test sets using the train_test_split function
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [None]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [None]:
# Predicted probabilities from the classification model
y_pred_genres = clf.predict_proba(X_test)

# Model performance (macro-averaged ROC AUC)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')
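
With `average='macro'`, the score is the unweighted mean of the ROC AUC computed separately for each genre column, so rare genres count as much as frequent ones. A minimal sketch on hypothetical values (two labels, four samples):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted scores for 2 labels
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.4, 0.7]])

# AUC of each label on its own, then the macro average reported by scikit-learn
per_label = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])]
macro = roc_auc_score(y_true, y_score, average='macro')

print(per_label)   # [1.0, 0.75]
print(macro)       # 0.875
```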

## Applying the model to the test data

In [None]:
# Transform the predictors (X) of the test set with the fitted vectorizer
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Prediction on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [None]:
# Save the predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()
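
Before uploading, it can help to run a few basic checks on the prediction frame. The helper below, `check_submission`, is a hypothetical sketch: the checks (column order, no missing values, probabilities within [0, 1]) are assumptions drawn from the format produced in this notebook, not rules published by the competition.

```python
import pandas as pd

def check_submission(df, expected_cols):
    """Return a list of problems found in a prediction frame (empty list if none)."""
    problems = []
    if list(df.columns) != list(expected_cols):
        problems.append('columns differ from the expected order')
    if df.isnull().values.any():
        problems.append('missing values present')
    values = df.to_numpy(dtype=float)
    if ((values < 0) | (values > 1)).any():
        problems.append('values outside [0, 1]')
    return problems

# Toy frame with two of the genre columns (hypothetical probabilities)
toy_cols = ['p_Action', 'p_Adventure']
toy_res = pd.DataFrame([[0.1, 0.7], [0.5, 0.2]], index=[1, 4], columns=toy_cols)
toy_res.index.name = 'ID'

print(check_submission(toy_res, toy_cols))   # []
```

In the notebook itself, the equivalent call would be `check_submission(res, cols)`.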