## Movie Genre Classification

Classify a movie genre based on its plot.

<img src="https://raw.githubusercontent.com/sergiomora03/AdvancedTopicsAnalytics/main/notebooks/img/moviegenre.png"
     style="float: left; margin-right: 10px;" />



### Data

Input:
- movie plot

Output:
- probability that the movie belongs to each genre


### Evaluation

- 30% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 30% Code with the data processing and models developed that support the reported results.
- 30% Presentation of no more than 15 minutes with the main results of the project.
- 10% Model performance achieved. Metric: "AUC".

- The project must be carried out in groups of 4 people.
- Use clear and rigorous procedures.
- The project is due on March 15th, 2024, at 11:59 pm, by email with a link to the GitHub repository.
- No projects will be received after the deadline or by any means other than the one established.




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Data loading

In [21]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
import re
import nltk
from nltk.stem import WordNetLemmatizer


In [2]:
dataTraining = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [3]:
dataTraining.head()

| | year | title | plot | genres | rating |
|---|---|---|---|---|---|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [33]:
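dataTraining.describe  # no parentheses: displays the bound method (with the frame's repr); call describe() for summary statistics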

<bound method NDFrame.describe of       year                              title  \
3107  2003                               Most   
900   2008          How to Be a Serial Killer   
6724  1941                     A Woman's Face   
4704  1954                    Executive Suite   
2582  1990                      Narrow Margin   
...    ...                                ...   
8417  2010                 Our Family Wedding   
1592  1984                Conan the Destroyer   
1723  1955                             Kismet   
7605  1982                 The Secret of NIMH   
215   2009  Tinker Bell and the Lost Treasure   

                                                   plot  \
3107  most is the story of a single father who takes...   
900   a serial killer decides to teach the secrets o...   
6724  in sweden ,  a female blackmailer with a disfi...   
4704  in a friday afternoon in new york ,  the presi...   
2582  in los angeles ,  the editor of a publishing h...   
...                    

In [4]:
dataTesting.head()

| | year | title | plot |
|---|---|---|---|
| 1 | 1999 | Message in a Bottle | who meets by fate , shall be sealed by fate .... |
| 4 | 1978 | Midnight Express | the true story of billy hayes , an american c... |
| 5 | 1996 | Primal Fear | martin vail left the chicago da ' s office to ... |
| 6 | 1950 | Crisis | husband and wife americans dr . eugene and mr... |
| 7 | 1959 | The Tingler | the coroner and scientist dr . warren chapin ... |


In [31]:
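dataTesting.describe  # no parentheses: displays the bound method (with the frame's repr); call describe() for summary statistics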

<bound method NDFrame.describe of        year                       title  \
1      1999         Message in a Bottle   
4      1978            Midnight Express   
5      1996                 Primal Fear   
6      1950                      Crisis   
7      1959                 The Tingler   
...     ...                         ...   
11263  2008       The Fifth Commandment   
11265  2003       Coffee and Cigarettes   
11269  1957                    Pal Joey   
11270  2002  Jonah: A VeggieTales Movie   
11275  1993           Man's Best Friend   

                                                    plot  
1      who meets by fate ,  shall be sealed by fate ....  
4      the true story of billy hayes ,  an american c...  
5      martin vail left the chicago da ' s office to ...  
6      husband and wife americans dr .  eugene and mr...  
7      the coroner and scientist dr .  warren chapin ...  
...                                                  ...  
11263  in bangkok ,  an assassin who

## Data cleaning

### Identifying stopwords

In [5]:
from nltk.corpus import stopwords
# Download the stopword lists
nltk.download('stopwords')

# Get the English stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
print(stop_words)

{'most', "you've", 'your', 'them', 'wouldn', 'these', 'few', 'no', 'very', 'themselves', 'didn', 'needn', 'theirs', 'other', 'll', 'do', 'with', "hadn't", 'under', "haven't", 'will', 'their', 'his', 're', 'shouldn', 'd', 'as', "should've", "shan't", 'just', 'ourselves', 'down', 'not', 'any', 'own', 'are', 'myself', 'i', 'this', 'by', 'above', 'about', 'more', 'himself', 'and', 'don', 'but', 'our', "that'll", 'after', 'the', 'had', "aren't", 'being', 'if', 'all', 'each', 'a', 'can', 'm', 'out', "shouldn't", 'because', 'does', 'my', 'or', 'why', 'in', 'when', 'having', 'while', 'who', 'against', 'y', 'isn', 'was', 'doesn', 'then', 'than', 'during', 'me', 'where', 'below', "weren't", 'o', 'an', 'hers', "couldn't", 'until', 'between', 'over', 'have', 'to', 'what', 'doing', 'it', 'of', "doesn't", "she's", 'hasn', 'mustn', 'weren', 'has', 'ma', 'hadn', 'which', 'ours', 'those', 'same', 'haven', 'so', 'there', 'won', 'yours', 'up', 'you', 'did', "wasn't", 'nor', 'she', 'been', "mustn't", 'her

### Removing symbols

In [7]:
DtaTraining = dataTraining['plot']

In [8]:
import re
# Lowercase each plot, drop the listed punctuation marks, and collapse runs of whitespace
DtaTrainingLimpia = [
    ' '.join(re.sub(r"[.,;:!-]", '', doc.lower()).split())
    for doc in DtaTraining
]
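
The cleaning step above can be sketched as a small standalone function. This is a minimal illustration under assumptions: the helper name `clean_plot` and the sample sentence are hypothetical, and only the characters in the original character class are removed, so apostrophes, question marks and parentheses survive.

```python
import re

# Same character class as the notebook cell: . , ; : ! and -
PUNCTUATION = re.compile(r"[.,;:!-]")

def clean_plot(text):
    """Lowercase, strip the listed punctuation, collapse whitespace (hypothetical helper)."""
    no_punct = PUNCTUATION.sub("", text.lower())
    return " ".join(no_punct.split())

# Made-up sample in the style of the dataset's pre-tokenized plots
sample = "In Sweden ,  a female blackmailer -- with a scar ; meets Dr . Who ?"
cleaned = clean_plot(sample)
print(cleaned)
```

Symbols outside the character class, such as the trailing `?` here, remain as separate tokens, so extending the class may be worth considering if they add noise to the vocabulary.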

In [9]:
for linea in DtaTrainingLimpia[:5]:
    print(linea)

most is the story of a single father who takes his eight year old son to work with him at the railroad drawbridge where he is the bridge tender a day before the boy meets a woman boarding a train a drug abuser at the bridge the father goes into the engine room and tells his son to stay at the edge of the nearby lake a ship comes and the bridge is lifted though it is supposed to arrive an hour later the train happens to arrive the son sees this and tries to warn his father who is not able to see this just as the oncoming train approaches his son falls into the drawbridge gear works while attempting to lower the bridge leaving the father with a horrific choice the father then lowers the bridge the gears crushing the boy the people in the train are completely oblivious to the fact a boy died trying to save them other than the drug addict woman who happened to look out her train window the movie ends with the man wandering a new city and meets the woman no longer a drug addict holding a sm

## Lemmatization and stopword removal

In [10]:
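nltk.download('wordnet', quiet=True)  # corpus required by WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def tokenize_and_lemmatize(text):
    """Split on whitespace, drop English stopwords, and lemmatize the remaining tokens."""
    tokens = text.split()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.lower() not in stop_words]
    return ' '.join(lemmatized_tokens)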

In [11]:
DTraining_lematizado_token = [tokenize_and_lemmatize(doc) for doc in DtaTrainingLimpia]

# DTraining_lematizado_token now holds the lemmatized text
print(DTraining_lematizado_token[:5])

['story single father take eight year old son work railroad drawbridge bridge tender day boy meet woman boarding train drug abuser bridge father go engine room tell son stay edge nearby lake ship come bridge lifted though supposed arrive hour later train happens arrive son see try warn father able see oncoming train approach son fall drawbridge gear work attempting lower bridge leaving father horrific choice father lower bridge gear crushing boy people train completely oblivious fact boy died trying save drug addict woman happened look train window movie end man wandering new city meet woman longer drug addict holding small baby relevant narrative run parallel namely one female drug addict meet climax tumultuous film', 'serial killer decides teach secret satisfying career video store clerk', 'sweden female blackmailer disfiguring facial scar meet gentleman life beyond mean become accomplice blackmail fall love bitterly resigned impossibility returning affection life change one victim p

## Vectorization

In [12]:
vectNgramas = CountVectorizer(ngram_range=(1, 4))
X_dtm = vectNgramas.fit_transform(DTraining_lematizado_token)
X_dtm.shape

(7895, 1481576)
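
The shape above shows that `ngram_range=(1, 4)` yields roughly 1.48 million columns for 7,895 plots. The sketch below enumerates word n-grams for one short document to show where that growth comes from. It is illustrative only: `word_ngrams` is a hypothetical helper that splits on whitespace, whereas `CountVectorizer` uses its own token pattern (by default it drops single-character tokens).

```python
def word_ngrams(text, ngram_range=(1, 4)):
    """Return every contiguous word n-gram with n inside ngram_range (inclusive)."""
    tokens = text.split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

doc = "serial killer decides teach secret"
grams = word_ngrams(doc)
print(len(grams))   # 5 unigrams + 4 bigrams + 3 trigrams + 2 four-grams
print(grams[-2:])
```

Most long n-grams occur in a single document, so the matrix is extremely sparse; restricting the range or setting `min_df` or `max_features` are common ways to keep it manageable.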

## Building the target vector y

In [13]:
import ast

# literal_eval safely parses the stringified genre lists (avoids eval on file contents)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)

le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [14]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
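
To make the structure of `y_genres` concrete, here is a dependency-free sketch of the same multi-hot encoding on three genre strings taken from the training preview. It assumes, as `MultiLabelBinarizer` does, that the columns follow the sorted class names; the variable names are hypothetical.

```python
import ast

# Genre cells as they appear in the CSV: strings that look like Python lists
raw_genres = ["['Short', 'Drama']", "['Comedy', 'Crime', 'Horror']", "['Drama']"]

# Parse each cell into a real list without executing arbitrary code
parsed = [ast.literal_eval(cell) for cell in raw_genres]

# One column per distinct genre, in sorted order
classes = sorted({genre for row in parsed for genre in row})

# 1 where the movie has the genre, 0 otherwise
matrix = [[1 if genre in row else 0 for genre in classes] for row in parsed]

print(classes)
for row in matrix:
    print(row)
```

Each row can contain several ones, which is what makes the task multi-label rather than multi-class.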

In [15]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

## Train multi-class multi-label model

In [16]:
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))

In [17]:
clf.fit(X_train, y_train_genres)

In [18]:
y_pred_genres = clf.predict_proba(X_test)

In [19]:
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7199462527370218
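
As a reminder of what the macro-averaged score measures, the sketch below computes a per-label AUC as the fraction of (positive, negative) pairs ranked correctly, counting ties as one half, and then takes the unweighted mean across labels. It is a toy reimplementation for intuition on made-up data, not a substitute for `roc_auc_score`; the function names are hypothetical.

```python
def binary_auc(y_true, scores):
    """AUC for one label: probability that a positive outranks a negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-label AUCs (columns are labels)."""
    n_labels = len(Y_true[0])
    per_label = [
        binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels

# Toy data: 4 movies, 2 genres
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]
result = macro_auc(Y_true, Y_score)
print(result)
```

Because every genre counts equally in the macro average, rare genres influence the final score as much as frequent ones.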

## 1. TF-IDF

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=3000)
X_tfidf = tfidf.fit_transform(DTraining_lematizado_token)

# Split the data again with the new features
XTFIDF_train, XTFIDF_test, yTFIDF_train_genres, yTFIDF_test_genres = train_test_split(X_tfidf, y_genres, test_size=0.33, random_state=42)

# Train the model on the TF-IDF features
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(XTFIDF_train, yTFIDF_train_genres)

yTFIDF_pred_genres = clf.predict_proba(XTFIDF_test)

roc_auc_tfidf = roc_auc_score(yTFIDF_test_genres, yTFIDF_pred_genres, average='macro')
print(f"AUC with TF-IDF: {roc_auc_tfidf}")


AUC with TF-IDF: 0.8094747961022325
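
For intuition about the features behind this score, the sketch below reproduces the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is a toy version on made-up documents with whitespace tokenization and no `max_features` cap; the helper name is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["train bridge father", "train killer", "killer video store"]
vocab, rows = tfidf_rows(docs)
first = dict(zip(vocab, rows[0]))
print(first["bridge"] > first["train"])  # rarer term gets the larger weight
```

Down-weighting terms that appear in many plots, combined with the much smaller 3,000-column vocabulary, is one plausible reason these features behave differently from the raw n-gram counts.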


### 1.1 TF-IDF + Word2Vec

In [26]:
from gensim.models import Word2Vec

In [28]:
# 1. Build the TF-IDF matrix
tfidf = TfidfVectorizer(max_features=3000)
X_tfidf = tfidf.fit_transform(DTraining_lematizado_token)

# 2. Split the data
XTFIDF_W2V_train, XTFIDF_W2V_test, yTFIDF_W2V_train_genres, yTFIDF_W2V_test_genres = train_test_split(X_tfidf, y_genres, test_size=0.33, random_state=42)

# 3. Train the model on the TF-IDF features
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(XTFIDF_W2V_train, yTFIDF_W2V_train_genres)

yTFIDF_W2V_pred_genres = clf.predict_proba(XTFIDF_W2V_test)

# 4. Compute the AUC with TF-IDF
roc_auc_tfidf = roc_auc_score(yTFIDF_W2V_test_genres, yTFIDF_W2V_pred_genres, average='macro')
print(f"AUC with TF-IDF: {roc_auc_tfidf}")

# 5. Train the Word2Vec model
# Note: gensim expects `sentences` to be an iterable of token lists; each document here is a
# plain string, so it is consumed character by character rather than word by word
model_w2v = Word2Vec(sentences=DTraining_lematizado_token, vector_size=100, window=10, min_count=1, workers=4)

# 6. Function that builds a weighted average document vector from Word2Vec embeddings
def get_weighted_average_vector(doc, model, tfidf_vectorizer):
    words = doc.split()
    tfidf_weights = tfidf_vectorizer.transform([doc]).toarray()[0]
    word_vectors = np.array([model.wv[word] for word in words if word in model.wv])

    if len(word_vectors) == 0:
        return np.zeros(model.vector_size)  # Return a zero vector if no word is in the vocabulary

    # Note: tfidf_weights is indexed by vocabulary column, whereas i is the position of the word
    # in the document, so these values are not the TF-IDF weights of the words themselves
    valid_weights = np.array([tfidf_weights[i] for i, word in enumerate(words) if word in model.wv])
    # Check that at least one weight is non-zero
    if np.sum(valid_weights) == 0:
        return np.zeros(model.vector_size)  # Return a zero vector if all weights are zero

    weighted_avg_vector = np.average(word_vectors, axis=0, weights=valid_weights)
    return weighted_avg_vector

# 7. Compute the weighted average vector for each document
document_vectors_train = np.array([get_weighted_average_vector(doc, model_w2v, tfidf) for doc in DTraining_lematizado_token])

# 8. Split the data using the Word2Vec vectors
X_w2v_train, X_w2v_test, y_w2v_train_genres, y_w2v_test_genres = train_test_split(document_vectors_train, y_genres, test_size=0.33, random_state=42)

# 9. Train the model on the Word2Vec features
clf_w2v = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf_w2v.fit(X_w2v_train, y_w2v_train_genres)

# 10. Predict and compute the AUC for Word2Vec
y_w2v_pred_genres = clf_w2v.predict_proba(X_w2v_test)
roc_auc_w2v = roc_auc_score(y_w2v_test_genres, y_w2v_pred_genres, average='macro')
print(f"AUC with Word2Vec: {roc_auc_w2v}")


AUC with TF-IDF: 0.8094745899132644
AUC with Word2Vec: 0.5022579890360185
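
A TF-IDF-weighted average is easier to get right when each weight is looked up by word rather than by position in the document. The sketch below shows that lookup with plain dictionaries standing in for the embedding model and the TF-IDF weights. All names and values are hypothetical toy data; with a fitted `TfidfVectorizer`, one option would be to map each word to its column through the `vocabulary_` attribute.

```python
def weighted_average_vector(words, embeddings, weights, size):
    """Average the embeddings of known words, weighting each by its own weight."""
    total = [0.0] * size
    weight_sum = 0.0
    for word in words:
        weight = weights.get(word, 0.0)
        if word in embeddings and weight > 0:
            for k, value in enumerate(embeddings[word]):
                total[k] += weight * value
            weight_sum += weight
    if weight_sum == 0:
        return [0.0] * size  # no usable word: fall back to a zero vector
    return [value / weight_sum for value in total]

# Toy 2-dimensional embeddings and per-word weights (assumed values)
embeddings = {"train": [1.0, 0.0], "bridge": [0.0, 1.0]}
weights = {"train": 1.0, "bridge": 3.0}

vec = weighted_average_vector("train bridge unknown".split(), embeddings, weights, size=2)
print(vec)
```

Words missing from either lookup are simply skipped, so the result stays a proper weighted mean over the words that have both an embedding and a weight.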


## 2. LightGBM (n-gram vectorization)

In [22]:
pip install lightgbm

Note: you may need to restart the kernel to use updated packages.





In [23]:
import lightgbm as lgb

clf_lgb = OneVsRestClassifier(lgb.LGBMClassifier(n_estimators=100, max_depth=10, random_state=42))
clf_lgb.fit(X_train, y_train_genres)

y_pred_genres_lgb = clf_lgb.predict_proba(X_test)

roc_auc_lgb = roc_auc_score(y_test_genres, y_pred_genres_lgb, average='macro')
print(f"AUC with LightGBM: {roc_auc_lgb}")


[LightGBM] [Info] Number of positive: 880, number of negative: 4409
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.013346 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 53112
[LightGBM] [Info] Number of data points in the train set: 5289, number of used features: 1000
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.166383 -> initscore=-1.611481
[LightGBM] [Info] Start training from score -1.611481
[LightGBM] [Info] Number of positive: 684, number of negative: 4605
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011964 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 53112
[LightGBM] [Info] Number of data points in the train set: 5289, number of used features: 1000
[LightGBM] [Info] [b

## 3. Word2Vec model

In [None]:
pip install gensim

In [19]:
from gensim.models import Word2Vec
from sklearn.preprocessing import StandardScaler

# Tokenize the data for Word2Vec
#dataTraining['tokenized_plot'] = dataTraining['plot'].apply(lambda x: x.split())

# Train Word2Vec
# Note: the documents are plain strings (the tokenization above is commented out), so both
# Word2Vec and the averaging loop below iterate over characters rather than words
word2vec = Word2Vec(sentences=DTraining_lematizado_token, vector_size=100, window=5, min_count=1, workers=4)
X_word2vec = np.array([np.mean([word2vec.wv[word] for word in words if word in word2vec.wv] or [np.zeros(100)], axis=0)
                       for words in DTraining_lematizado_token])

# Scale the features
scaler = StandardScaler()
X_word2vec_scaled = scaler.fit_transform(X_word2vec)

# Split the data
X_trainW2V, X_testW2V, y_trainW2V_genres, y_testW2V_genres = train_test_split(X_word2vec_scaled, y_genres, test_size=0.33, random_state=42)

# Train the model
clf_w2v = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf_w2v.fit(X_trainW2V, y_trainW2V_genres)

y_pred_genres_w2v = clf_w2v.predict_proba(X_testW2V)

roc_auc_w2v = roc_auc_score(y_testW2V_genres, y_pred_genres_w2v, average='macro')
print(f"AUC with Word2Vec: {roc_auc_w2v}")


AUC with Word2Vec: 0.5888642570227681
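
Word-embedding models such as gensim's `Word2Vec` take an iterable of token lists as `sentences`. The dependency-free sketch below, on two made-up documents, shows the difference between iterating over a string and iterating over its tokens, which may help explain the weak Word2Vec scores reported in this notebook.

```python
plots = ["serial killer decides", "story single father"]

# Iterating a string yields single characters...
as_characters = [list(doc) for doc in plots]

# ...while splitting first yields the word tokens an embedding model expects
as_tokens = [doc.split() for doc in plots]

char_vocab = sorted({ch for chars in as_characters for ch in chars})
word_vocab = sorted({tok for toks in as_tokens for tok in toks})

print(as_characters[0][:6])
print(as_tokens[0])
print(len(char_vocab), len(word_vocab))
```

Passing `as_tokens` (or the result of `doc.split()` for each lemmatized plot) would give the model a vocabulary of words; the AUC under that change is not reported here and would need to be measured.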


## 4. N-grams + Random Forest with hyperparameter optimization

In [29]:
from sklearn.model_selection import RandomizedSearchCV

# Define the hyperparameter search space
param_dist = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Create the base model
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# Randomized hyperparameter search. No `scoring` is passed, so candidates are ranked by the
# classifier's default score (subset accuracy for multi-label targets), not by AUC
random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                   n_iter=10, cv=3, verbose=1, n_jobs=-1, random_state=42)

# Run the search; the best configuration is refit on the full training split
random_search.fit(X_train, y_train_genres)

# Predict: for multi-label targets predict_proba returns one (n_samples, 2) array per genre
y_pred_genres_optimized = random_search.predict_proba(X_test)
# Keep the positive-class probability (second column) of each genre and stack to (n_samples, n_genres)
y_pred_genres_optimized = np.array([proba[:, 1] for proba in y_pred_genres_optimized]).T

# Compute the AUC
print(roc_auc_score(y_test_genres, y_pred_genres_optimized, average='macro'))


Fitting 3 folds for each of 10 candidates, totalling 30 fits
0.8309297816518653
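
The reshaping step before the AUC call can be hard to picture, so here is the same operation written with plain lists. A multi-output random forest returns one probability table per label, and the score function expects one row per sample. The numbers are made up and the variable names are hypothetical; the notebook does the equivalent with `np.array([...]).T`.

```python
# Stand-in for predict_proba output on 3 samples and 2 labels:
# one table per label, each row = [P(label absent), P(label present)]
per_label_proba = [
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],  # label 0
    [[0.3, 0.7], [0.5, 0.5], [0.8, 0.2]],  # label 1
]

# Keep the positive-class column of each table: shape (n_labels, n_samples)
positive = [[row[1] for row in table] for table in per_label_proba]

# Transpose to (n_samples, n_labels), the layout the AUC function expects
scores = [list(sample) for sample in zip(*positive)]

print(scores)
```

Indexing column 1 assumes every label has both classes in the training data; a label that never occurs would produce a single-column table and need separate handling.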
