![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, within your working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. While developing it, follow the instructions given in the "Guía del proyecto 2: Clasificación de género de películas" (Project 2 guide: Movie genre classification).

**Submission**: The project must be submitted during week 8. However, it is important to make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF format to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The goal is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: predicting the test set for submission to Kaggle
This section shows the format in which the prediction results must be saved so that they can be uploaded to the Kaggle competition.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Library imports
import pandas as pd
import os
import numpy as np
import nltk
import string
import joblib
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [3]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [4]:
# Preview of the training data
dataTraining.head()

| | year | title | plot | genres | rating |
|---|---|---|---|---|---|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [5]:
# Preview of the test data
dataTesting.head()

| | year | title | plot |
|---|---|---|---|
| 1 | 1999 | Message in a Bottle | who meets by fate , shall be sealed by fate .... |
| 4 | 1978 | Midnight Express | the true story of billy hayes , an american c... |
| 5 | 1996 | Primal Fear | martin vail left the chicago da ' s office to ... |
| 6 | 1950 | Crisis | husband and wife americans dr . eugene and mr... |
| 7 | 1959 | The Tingler | the coroner and scientist dr . warren chapin ... |


## Data preprocessing

In [6]:
# Define the target variable (y)
# ast.literal_eval parses the stringified lists without executing arbitrary code
import ast
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [7]:
print(le.classes_)

['Action' 'Adventure' 'Animation' 'Biography' 'Comedy' 'Crime'
 'Documentary' 'Drama' 'Family' 'Fantasy' 'Film-Noir' 'History' 'Horror'
 'Music' 'Musical' 'Mystery' 'News' 'Romance' 'Sci-Fi' 'Short' 'Sport'
 'Thriller' 'War' 'Western']


In [8]:
y_genres[0]

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0])

The genres variable has 24 unique classes. They are transformed into a binary indicator matrix (one column per genre) to reflect that the problem is multilabel: a movie can belong to several genres at once.
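To make the encoding concrete, here is a minimal sketch on three toy genre lists (the lists and the names `toy_genres`, `mlb_toy`, `Y_toy`, and `recovered` are illustrative and not part of the project data). It shows how `MultiLabelBinarizer` sorts the classes alphabetically, produces one column per class, and can map the matrix back to genre names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists (illustrative only, not taken from dataTraining)
toy_genres = [["Short", "Drama"], ["Comedy", "Crime", "Horror"], ["Drama"]]

mlb_toy = MultiLabelBinarizer()
Y_toy = mlb_toy.fit_transform(toy_genres)

# Classes are stored in sorted order; each column of Y_toy matches one class
print(mlb_toy.classes_)
print(Y_toy)

# inverse_transform recovers the genre names from the indicator rows
recovered = mlb_toy.inverse_transform(Y_toy)
print(recovered)
```

The same `inverse_transform` call is what allows a deployed model to return genre names instead of column positions.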

### Standardization

In [9]:
# Convert all texts to lowercase
dataTraining['plot_n'] = dataTraining['plot'].str.lower()

In [10]:
# Remove punctuation marks
dataTraining['plot_n'] = dataTraining['plot_n'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

### *Stopword* removal

In [11]:
no_stopw = CountVectorizer(stop_words='english')
stop_words = set(no_stopw.get_stop_words())

### Lemmatization

In [12]:
wordnet_lemmatizer = WordNetLemmatizer()

### Application

In [13]:
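def procesa(plot):
    # Tokenize the plot, which is already lowercase and free of punctuation
    tokeni = nltk.word_tokenize(plot)
    # Drop English stopwords
    tokeni = [token for token in tokeni if token not in stop_words]
    # Lemmatize each token as a verb (pos='v')
    tokeni = [wordnet_lemmatizer.lemmatize(token, pos='v') for token in tokeni]
    # Rebuild a single string so it can be passed to the vectorizer
    return ' '.join(tokeni)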

In [14]:
dataTraining['plot_n'] = dataTraining['plot_n'].apply(procesa)

In [15]:
vect = CountVectorizer(max_features=4000, ngram_range=(1,3))

In [16]:
X_dtm = vect.fit_transform(dataTraining['plot_n'])

In [17]:
X_dtm.shape

(7895, 4000)

In [18]:
# Split the predictors (X) and the target variable (y) into training and test sets using the train_test_split function
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=210525)

In [19]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [20]:
# Prediction with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print the model performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8062448424600887
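The score above is a macro average: the ROC AUC is computed separately for each genre column and the per-genre values are averaged with equal weight. The following sketch illustrates this on a tiny two-label example (the arrays `y_true` and `y_score` are made up for illustration and are unrelated to the project data).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up multilabel ground truth and predicted probabilities (4 samples, 2 labels)
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.7]])

# average=None returns one AUC per label
per_label = roc_auc_score(y_true, y_score, average=None)

# average='macro' is the unweighted mean of the per-label AUCs
macro = roc_auc_score(y_true, y_score, average='macro')

print(per_label, macro)
```

Inspecting the per-label values on the validation split can reveal which genres pull the macro score down, which the single aggregate number hides.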

In [22]:
y_test_genres

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [11]:
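# Transform the predictors (X) of the Kaggle test set, applying the same preprocessing used on the training plots
dataTesting['plot_n'] = dataTesting['plot'].str.lower().apply(lambda x: x.translate(str.maketrans('', '', string.punctuation))).apply(procesa)
X_test_dtm = vect.transform(dataTesting['plot_n'])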

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Prediction on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [None]:
# Save predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()
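As a sanity check on the submission layout, the sketch below builds a tiny file with the same structure: an `ID` index column followed by one `p_<genre>` probability column per class. The three classes, the IDs, and the probabilities are placeholders; in the notebook the names would come from `le.classes_` and the index from `dataTesting.index`.

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for le.classes_, dataTesting.index and predict_proba output
classes = ["Action", "Comedy", "Drama"]
ids = [1, 4, 5]
probs = np.array([[0.10, 0.20, 0.70],
                  [0.50, 0.30, 0.20],
                  [0.05, 0.90, 0.40]])

# Deriving the column names from the classes keeps them aligned with the model output
cols = [f"p_{c}" for c in classes]
res_demo = pd.DataFrame(probs, index=ids, columns=cols)

# With no path, to_csv returns the CSV text, which is convenient for inspection
csv_text = res_demo.to_csv(index_label="ID")
print(csv_text)
```

Building `cols` from the fitted binarizer instead of typing the list by hand avoids a silent mismatch if the class order ever changes.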

# Model deployment

In [81]:
data = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)

In [64]:
# ast.literal_eval parses the stringified lists without executing arbitrary code
import ast
data['genres'] = data['genres'].map(ast.literal_eval)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(data['genres'])

In [21]:
# Save the fitted binarizer so that the genre names can be recovered once the model is deployed
joblib.dump(mlb, 'model_deployment/generos.pkl')


In [39]:
# Text preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def procesar(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 3. Tokenize
    tokens = nltk.word_tokenize(text)
    # 4. Remove stopwords
    tokens = [t for t in tokens if t not in stop_words]
    # 5. Lemmatize (as verbs)
    lemmatized = [lemmatizer.lemmatize(t, pos='v') for t in tokens]
    return ' '.join(lemmatized)


data['plot_n'] = data['plot'].apply(procesar)

In [40]:
# Vectorization
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(data['plot_n'])

# Train a one-vs-rest logistic regression (one binary classifier per genre)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000, random_state=21052025))
clf.fit(X, y)

In [22]:
# Save the model together with the vectorizer that was fitted for it
joblib.dump(clf, 'model_deployment/modelo.pkl')
joblib.dump(vectorizer, 'model_deployment/vectorizer.pkl')

['model_deployment/vectorizer.pkl']

In [82]:
import pandas as pd
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

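# Genre processing (parse only if the column still holds strings, so the cell can be re-run safely)
import ast
data['genres'] = data['genres'].map(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)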
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(data['genres'])

# Save the genre binarizer for reference
import joblib
joblib.dump(mlb, 'model_deployment/genre_binarizer.pkl')

# Text preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def procesar(texto):
    texto = texto.lower()
    texto = texto.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(texto)
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [lemmatizer.lemmatize(t, pos='v') for t in tokens]
    return ' '.join(tokens)

data['plot_clean'] = data['plot'].apply(procesar)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\carol\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\carol\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\carol\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [83]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Vectorization
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(data['plot_clean'])

# Train the multilabel model
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)

# Save the model and the vectorizer
joblib.dump(clf, 'model_deployment/genre_classifier.pkl')
joblib.dump(vectorizer, 'model_deployment/vectorizer.pkl')


['model_deployment/vectorizer.pkl']

In [23]:
from model_deployment.m09_model_deployment import predict_genres



In [13]:
# app.py
from flask import Flask
from flask_restx import Api, Resource, fields
import joblib
from model_deployment.m09_model_deployment import predict_genres

app = Flask(__name__)
api = Api(app, version='1.0', title='Movie genre classification', description='Predicts the genres of a movie given its synopsis.')

ns = api.namespace('predict', description='Genre classification')

parser = api.parser()

parser.add_argument('plot', type=str, required=True, help='Synopsis of the movie', location='args')

resource_fields = api.model('Resource', {
    'genres': fields.List(fields.String),
})

@ns.route('/')
class GenreApi(Resource):
    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        plot = args['plot']
        genres = predict_genres(plot)
        return {"genres": genres}, 200

if __name__ == '__main__':
    app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)


 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://192.168.80.18:5000
Press CTRL+C to quit
127.0.0.1 - - [21/May/2025 20:23:13] "GET /predict/?plot=summer%20,%20%20%20N%20%20,%20%20new%20york%20city%20.%20%20depressed%20in%20part%20because%20of%20a%20dysfunctional%20home%20life%20the%20result%20of%20money%20problems%20and%20feeling%20like%20a%20loser%20,%20%20luke%20shapiro%20,%20%20a%20drug%20dealer%20who%20has%20just%20graduated%20from%20high%20school%20,%20%20has%20for%20the%20last%20several%20months%20bartered%20weed%20for%20the%20services%20of%20therapist%20dr%20.%20%20jeff%20squires%20,%20%20an%20aging%20hippie%20,%20%20in%20his%20want%20to%20talk%20to%20somebody%20,%20%20anybody%20who%20will%20listen%20,%20%20in%20a%20general%20way%20.%20%20dr%20.%20%20squires%20is%20the%20stepfather%20of%20the%20object%20of%20luke%20'%20s%20masturbatory%20fantasies%20,%20%20classmate%20stephanie%20squires%20,%20%20who%20sees%20luke%20no%20more%20than%
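The request logged above passes the whole synopsis in the `plot` query parameter, so it must be URL-encoded. The sketch below builds such a URL with the standard library and decodes it again as a check; the base URL assumes the local development server shown in the log, and the variable names are illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed local address of the development server shown in the log above
base_url = "http://127.0.0.1:5000/predict/"
plot = ("tinker bell journey far north of never land to patch things up "
        "with her friend terence and restore a pixie dust tree .")

# urlencode escapes spaces and punctuation so the text is safe in a query string
url = f"{base_url}?{urlencode({'plot': plot})}"
print(url)

# Round-trip check: decoding the query string recovers the original synopsis
decoded = parse_qs(urlsplit(url).query)["plot"][0]
print(decoded == plot)
```

Very long synopses produce very long URLs, so sending the text in a POST body may be a more robust design than a GET query parameter, although that would require changing the endpoint.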

In [24]:
print(predict_genres("summer ,   N  ,  new york city .  depressed in part because of a dysfunctional home life the result of money problems and feeling like a loser ,  luke shapiro ,  a drug dealer who has just graduated from high school ,  has for the last several months bartered weed for the services of therapist dr .  jeff squires ,  an aging hippie ,  in his want to talk to somebody ,  anybody who will listen ,  in a general way .  dr .  squires is the stepfather of the object of luke ' s masturbatory fantasies ,  classmate stephanie squires ,  who sees luke no more than a casual acquaintance .  she knows about the dealer / buyer relationship between luke and her stepfather ,  but not the therapy sessions ,  while dr .  squires has no idea of luke ' s romantic interest in stephanie .  with all their respective friends away for the summer ,  stephanie suggests to luke that they hang out together for the summer as friends .  concurrently ,  luke and dr .  squires truly do become more like friends ,  constantly high ones ,  than therapist / patient .  that friendship becomes complicated not only as luke needs advice on how to navigate the fantasy of stephanie as his girlfriend into potential reality without divulging to dr .  squires it is stephanie about who he is talking ,  but also as dr .  squires unburdens himself to luke about his own problems ,  such as the unsatisfying life that his marriage to stephanie ' s mother has become .  specifically with luke and stephanie ,  a question becomes :  if luke is able to get his wish of stephanie as his girlfriend ,  what will happen come september when their pre - summer friends return and as they all move onto the next phase of their lives post - high school ."))

['Drama']


In [25]:
print(predict_genres('the coroner and scientist dr .  warren chapin is researching the shivering effect of fear with his assistant david morris .  dr .  warren is introduced to ollie higgins ,  the relative of a criminal sentenced to the electric chair ,  while making the autopsy of the corpse ,  and he makes a comment about the tingler - effect to him .  ollie asks for a lift to dr .  warner ,  and introduces his deaf - mute wife martha higgins ,  who manages a theater of their own .  dr .  warner returns home ,  where he lives with his unfaithful and evil wife isabel stevens chapin and her sweet sister lucy stevens .  dr .  warner ,  upset with the situation with his wife ,  threatens and uses her as a subject of his experiment .  when martha dies of fear ,  dr .  warner makes her autopsy and finds a creature that lives inside every human being ,  feeds with fear and is controlled by the scream .  once martha was not able to scream ,  the tingler was not rendered harmless and became enormous .  when the living being escapes ,  dr .  warner and ollie chase it in a crowded movie theater .'))

[]


In [26]:
print(predict_genres('tinker bell journey far north of never land to patch things up with her friend terence and restore a pixie dust tree .'))

[]


In [27]:
print(predict_genres("now that tony stark has revealed to the world that he is iron man ,  the entire world is now eager to get their hands on his hot technology  -  whether it ' s the united states government ,  weapons contractors ,  or someone else .  that someone else happens to be ivan vanko  -  the son of now deceased anton vanko ,  howard stark ' s former partner .  stark had vanko banished to russia for conspiring to commit treason against the us ,  and now ivan wants revenge against tony  -  and he ' s willing to get it at any cost .  but after being humiliated in front of the senate armed forces committee ,  rival weapons contractor justin hammer sees ivan as the key to upping his status against stark enterprises after an attack on the monaco  N  .  but an ailing tony has to figure out a way to save himself ,  get vanko ,  and get hammer before the government shows up and takes his beloved suits away .  and can he figure out what a mysterious figure named nick fury wants with him ?'"))

[]
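The empty lists returned above are what a fixed cutoff produces when no genre is confident enough. The source of `predict_genres` is not shown here, so this is an assumption: if it relies on `clf.predict`, `OneVsRestClassifier` keeps a label only when its binary estimator crosses its decision threshold (probability 0.5 for logistic regression). One possible remedy, sketched with a hypothetical helper `labels_from_probs`, is to fall back to the most probable genre when nothing passes the cutoff.

```python
import numpy as np

def labels_from_probs(probs, classes, threshold=0.5, min_labels=1):
    """Hypothetical helper: keep classes above the threshold, else the top ones."""
    probs = np.asarray(probs, dtype=float)
    # Classes whose probability reaches the cutoff
    chosen = [c for c, p in zip(classes, probs) if p >= threshold]
    if len(chosen) < min_labels:
        # Fallback: take the min_labels most probable classes instead of returning nothing
        top = np.argsort(probs)[::-1][:min_labels]
        chosen = [classes[i] for i in top]
    return chosen

demo_classes = ["Action", "Drama", "Horror"]
# No probability reaches 0.5, so the fallback returns the most probable class
print(labels_from_probs([0.20, 0.45, 0.10], demo_classes))
# Two probabilities pass the cutoff, so both classes are returned
print(labels_from_probs([0.70, 0.60, 0.10], demo_classes))
```

In a real deployment the probabilities would come from `clf.predict_proba` and the class names from the saved binarizer; the cutoff and the fallback size are tuning choices to validate against held-out data.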


In [89]:
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [62]:
dataTesting.to_csv('data_testing.csv')