# LSTM Model for Movie Genre Prediction

## Description:

This notebook demonstrates the implementation of a Long Short-Term Memory (LSTM) neural network model for predicting movie genres. The primary goal is to develop a deep learning architecture that classifies movie genres from textual data (movie titles and plot summaries) together with features derived from that text, such as sentiment scores.

In this notebook, we'll begin by preprocessing the textual data, which involves tokenization, lemmatization, and encoding of movie genres. We'll then split the dataset into training and hold-out sets to train and evaluate the LSTM model. The model architecture consists of a frozen pre-trained GloVe embedding layer, a single LSTM layer, and a dense output layer with sigmoid activation to predict multiple genres simultaneously.

Furthermore, we'll incorporate additional features such as sentiment analysis scores and dominant topics extracted from the text to enhance the predictive performance of the model. The training process involves optimizing the model's parameters using the binary cross-entropy loss function and the Adam optimizer.
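
To make the loss concrete, here is a minimal NumPy sketch of binary cross-entropy applied to independent sigmoid outputs, one per genre. The logits, labels, and three-genre setup are made up for illustration, and the clipping constant `eps` is an assumption; Keras computes an equivalent quantity internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average over all samples and genres.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return losses.mean()

# Hypothetical example: 2 movies, 3 genres (say Action, Comedy, Drama).
logits = np.array([[2.0, -1.0, 0.5],
                   [-3.0, 1.5, 0.0]])
y_true = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])

y_prob = sigmoid(logits)  # each genre gets its own probability
loss = binary_cross_entropy(y_true, y_prob)
print(y_prob.round(3))
print("loss:", round(float(loss), 4))
```

Unlike a softmax, the probabilities in a row do not have to sum to 1, which is what lets a movie belong to several genres at once.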


**Author:** [Caique Matos]
**Date:** [04/16/24]


In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


from tensorflow.keras.layers import Input, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.metrics import confusion_matrix, classification_report

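nltk.download('stopwords')
nltk.download('wordnet')
# Tokenizer models required by nltk.word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')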

from sklearn.metrics import accuracy_score

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
path = 'C:/Users/caiqu/OneDrive/Competitions/sony/input/'

In [None]:
df_train = pd.read_csv(path+'df_train_under.csv')
df_valid = pd.read_csv(path+"df_validation_under.csv")
df_test = pd.read_csv(path+'df_test.csv')

In [None]:
df_train.columns

## Preprocessing and Extracting Value from Text Columns


The chosen approach was to combine all text columns (the title and the plot summary columns), then tokenize and lemmatize the result.

In [None]:
# Text preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if isinstance(text, str):
        tokens = nltk.word_tokenize(text.lower())
        filtered_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words and token.isalnum()]
        return ' '.join(filtered_tokens)
    else:
        return ''

df_train['processed_title'] = df_train['SRC_TITLE_NM'].fillna('').apply(preprocess_text)
df_train['processed_plot'] = df_train['PLOT_SUMMARY'].fillna('').apply(preprocess_text)
df_train['processed_plot_outline'] = df_train['PLOT_OUTLINE'].fillna('').apply(preprocess_text)
df_train['processed_plot_medium'] = df_train['PLOT_MEDIUM'].fillna('').apply(preprocess_text)
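# Join the fields with spaces so the last word of one field is not fused with the first word of the next
df_train['combined_text'] = df_train['processed_title'] + ' ' + df_train['processed_plot'] + ' ' + df_train['processed_plot_outline'] + ' ' + df_train['processed_plot_medium']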


In [None]:
# Genre preprocessing: split the pipe-separated string and one-hot encode each genre
df_train['SRC_GENRE'] = df_train['SRC_GENRE'].apply(lambda x: x.split('|'))
mlb = MultiLabelBinarizer()
genres_encoded = mlb.fit_transform(df_train['SRC_GENRE'])

# Load the pre-trained GloVe embeddings
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
# Tokenize and pad the text sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train['combined_text'])
sequences = tokenizer.texts_to_sequences(df_train['combined_text'])
word_index = tokenizer.word_index
padded_sequences = pad_sequences(sequences, maxlen=300)
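
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence, so a text longer than 300 tokens keeps only its last 300. The NumPy function below is a hypothetical stand-in (`pad_pre` is not a Keras function) that mimics this default on toy data.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty text stays all padding
        trimmed = seq[-maxlen:]             # keep the last `maxlen` tokens
        out[row, -len(trimmed):] = trimmed  # right-align; padding fills the left
    return out

toy_sequences = [[5, 8], [1, 2, 3, 4, 5, 6], []]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

If the opening of a plot summary carries more signal than its end, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.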

In [None]:
# Build the embedding matrix (row i holds the GloVe vector of the word with index i)
embedding_matrix = np.zeros((len(word_index) + 1, 100))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
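
Words that do not appear in GloVe keep an all-zero row, and since the embedding layer is created with `trainable=False`, those rows stay zero during training. A quick coverage check shows how much of the vocabulary is affected. The sketch below uses a toy vocabulary and toy 3-dimensional vectors; in the notebook, `word_index` and `embeddings_index` would take their place.

```python
import numpy as np

# Toy stand-ins for the tokenizer vocabulary and the GloVe lookup.
toy_word_index = {"detective": 1, "murder": 2, "zzyzx": 3}
toy_embeddings = {
    "detective": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "murder": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
dim = 3

matrix = np.zeros((len(toy_word_index) + 1, dim))  # row 0 is reserved for padding
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is None:
        missing.append(word)  # out-of-vocabulary: the row stays zero
    else:
        matrix[i] = vector

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage: {coverage:.1%}, missing: {missing}")
```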

## LSTM Modelling

In [None]:
# Define inputs
input_text = Input(shape=(300,))
input_sentiment = Input(shape=(9,))

# Word embedding layer
embedding_text = Embedding(len(word_index) + 1, 100, weights=[embedding_matrix], input_length=300, trainable=False)(input_text)

# LSTM layer
lstm_text = LSTM(128, dropout=0.2, recurrent_dropout=0.2)(embedding_text)

# Concatenate LSTM output with sentiment input
concatenated = Concatenate()([lstm_text, input_sentiment])

# Output layer
output = Dense(len(mlb.classes_), activation='sigmoid')(concatenated)

# Define model
model = Model(inputs=[input_text, input_sentiment], outputs=output)

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Split into training and hold-out sets; a single call keeps text, auxiliary features, and labels row-aligned
aux_features = df_train[['PLOT_SUMMARY_SENTIMENT_ENCODED', 'SRC_TITLE_NM_SENTIMENT_ENCODED',
       'PLOT_OUTLINE_SENTIMENT_ENCODED', 'PLOT_MEDIUM_SENTIMENT_ENCODED',
       'PLOT_SUMMARY_DOMINANT_TOPIC', 'PLOT_OUTLINE_DOMINANT_TOPIC',
       'PLOT_MEDIUM_DOMINANT_TOPIC', 'SRC_TITLE_NM_DOMINANT_TOPIC','RATING_AVG']]
X_train_text, X_test_text, X_train_sentiment, X_test_sentiment, y_train, y_test = train_test_split(
    padded_sequences, aux_features, genres_encoded, test_size=0.2, random_state=42)

# Train model
model.fit([X_train_text, X_train_sentiment], y_train, batch_size=64, epochs=20, validation_data=([X_test_text, X_test_sentiment], y_test))

### Exploring Training/Validating Phase

In [None]:
# Predict on the hold-out split
y_pred_probs = model.predict([X_test_text, X_test_sentiment])
y_pred_classes = np.argmax(y_pred_probs, axis=1)
predicted_genres = mlb.classes_[y_pred_classes]

# Compute overall accuracy: top-1 predicted genre vs. the first true genre in mlb.classes_ order
accuracy = accuracy_score(np.argmax(y_test, axis=1), y_pred_classes)

# Compute per-genre accuracy (each sigmoid output thresholded at 0.5)
genre_accuracy = {}
for i, genre in enumerate(mlb.classes_):
    genre_accuracy[genre] = accuracy_score(y_test[:, i], y_pred_probs[:, i] > 0.5)

# Build a DataFrame with the predictions and the true values
df_predictions = pd.DataFrame({
    'Real_Genre': mlb.inverse_transform(y_test),
    'Predicted_Genre': predicted_genres
})

# Display results
print('Overall Accuracy:', accuracy)
print('Genre-wise Accuracy:\n')
for genre, acc in genre_accuracy.items():
    print(f'{genre}: {acc}')
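
The argmax above keeps a single genre per movie, although the sigmoid output layer scores every genre independently. As an alternative (not what this notebook does), the probabilities can be thresholded to recover several genres per movie. The sketch below is self-contained, with made-up class names and probabilities; `decode_genres` and its top-1 fallback are illustrative choices.

```python
import numpy as np

def decode_genres(probs, classes, threshold=0.5):
    """Return a tuple of genres per row; fall back to the top-1 genre if none pass."""
    decoded = []
    for row in probs:
        picked = np.flatnonzero(row > threshold)
        if picked.size == 0:
            picked = np.array([row.argmax()])
        decoded.append(tuple(classes[picked].tolist()))
    return decoded

toy_classes = np.array(["Action", "Comedy", "Drama"])
toy_probs = np.array([[0.9, 0.1, 0.7],
                      [0.2, 0.3, 0.4]])
decoded = decode_genres(toy_probs, toy_classes)
print(decoded)
```

In the notebook, `y_pred_probs` and `mlb.classes_` would play the roles of `toy_probs` and `toy_classes`.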


### Out-of-Sample Validation

In [None]:
# Text preprocessing for the validation data


df_valid['processed_title'] = df_valid['SRC_TITLE_NM'].fillna('').apply(preprocess_text)
df_valid['processed_plot'] = df_valid['PLOT_SUMMARY'].fillna('').apply(preprocess_text)
df_valid['processed_plot_outline'] = df_valid['PLOT_OUTLINE'].fillna('').apply(preprocess_text)
df_valid['processed_plot_medium'] = df_valid['PLOT_MEDIUM'].fillna('').apply(preprocess_text)


# Join the fields with spaces so words at field boundaries are not fused together
df_valid['combined_text'] = df_valid['processed_title'] + ' ' + df_valid['processed_plot'] + ' ' + df_valid['processed_plot_outline'] + ' ' + df_valid['processed_plot_medium']


# Tokenize and pad the text sequences for the validation data
sequences_valid = tokenizer.texts_to_sequences(df_valid['combined_text'])
padded_sequences_valid = pad_sequences(sequences_valid, maxlen=300)


# Genre preprocessing for the validation set (reuses the binarizer fitted on the training data)
df_valid['SRC_GENRE'] = df_valid['SRC_GENRE'].apply(lambda x: x.split('|'))
genres_encoded_valid = mlb.transform(df_valid['SRC_GENRE'])

# Predict on the validation set
y_pred_probs_valid = model.predict([padded_sequences_valid, df_valid[['PLOT_SUMMARY_SENTIMENT_ENCODED', 'SRC_TITLE_NM_SENTIMENT_ENCODED',
       'PLOT_OUTLINE_SENTIMENT_ENCODED', 'PLOT_MEDIUM_SENTIMENT_ENCODED',
       'PLOT_SUMMARY_DOMINANT_TOPIC', 'PLOT_OUTLINE_DOMINANT_TOPIC',
       'PLOT_MEDIUM_DOMINANT_TOPIC', 'SRC_TITLE_NM_DOMINANT_TOPIC','RATING_AVG']]])
y_pred_classes_valid = np.argmax(y_pred_probs_valid, axis=1)
predicted_genres_valid = mlb.classes_[y_pred_classes_valid]

# Add the predictions as a column of df_valid
df_valid['Predicted_Genre'] = predicted_genres_valid

# Compute overall accuracy on the validation set: top-1 predicted genre vs. the first true genre in mlb.classes_ order
accuracy_valid = accuracy_score(np.argmax(genres_encoded_valid, axis=1), y_pred_classes_valid)

# Compute per-genre accuracy on the validation set (each sigmoid output thresholded at 0.5)
genre_accuracy_valid = {}
for i, genre in enumerate(mlb.classes_):
    genre_accuracy_valid[genre] = accuracy_score(genres_encoded_valid[:, i], y_pred_probs_valid[:, i] > 0.5)

# Display results
print('Overall Accuracy (Validation):', accuracy_valid)
print('\n\nGenre-wise Accuracy (Validation):\n')
for genre, acc in genre_accuracy_valid.items():
    print(f'{genre}: {acc}')
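
Per-genre accuracy can look high for a rare genre even when the model never predicts it, because the many true negatives dominate the score. Precision, recall, and F1 per genre give a fuller picture. The NumPy sketch below uses made-up labels and predictions, and `per_label_prf` is a hypothetical helper; `sklearn.metrics.classification_report`, already imported at the top of the notebook, reports the same quantities for multi-label indicator arrays.

```python
import numpy as np

def per_label_prf(y_true, y_pred):
    """Precision, recall and F1 for each column of two binary indicator matrices."""
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    # Guard against division by zero when a genre is never predicted or never present.
    precision = np.divide(tp, tp + fp, out=np.zeros(tp.shape), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros(tp.shape), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros(tp.shape), where=denom > 0)
    return precision, recall, f1

# Toy data: 4 movies, 2 genres; the second genre is rare and never predicted.
toy_true = np.array([[1, 0], [1, 0], [0, 0], [0, 1]])
toy_pred = np.array([[1, 0], [0, 0], [0, 0], [0, 0]])

accuracy = (toy_true == toy_pred).mean(axis=0)
precision, recall, f1 = per_label_prf(toy_true, toy_pred)
print("accuracy: ", accuracy)
print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1.round(3))
```

Both genres reach the same accuracy here, yet the second one has zero recall, which accuracy alone would hide.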


### Predictions for the Test Set

In [None]:
df_test['processed_title'] = df_test['SRC_TITLE_NM'].fillna('').apply(preprocess_text)
df_test['processed_plot'] = df_test['PLOT_SUMMARY'].fillna('').apply(preprocess_text)
df_test['processed_plot_outline'] = df_test['PLOT_OUTLINE'].fillna('').apply(preprocess_text)
df_test['processed_plot_medium'] = df_test['PLOT_MEDIUM'].fillna('').apply(preprocess_text)


# Join the fields with spaces so words at field boundaries are not fused together
df_test['combined_text'] = df_test['processed_title'] + ' ' + df_test['processed_plot'] + ' ' + df_test['processed_plot_outline'] + ' ' + df_test['processed_plot_medium']


# Tokenize and pad the text sequences for the test data
sequences_test = tokenizer.texts_to_sequences(df_test['combined_text'])
padded_sequences_test = pad_sequences(sequences_test, maxlen=300)


# Predict on the test set
y_pred_probs_test = model.predict([padded_sequences_test, df_test[['PLOT_SUMMARY_SENTIMENT_ENCODED', 'SRC_TITLE_NM_SENTIMENT_ENCODED',
       'PLOT_OUTLINE_SENTIMENT_ENCODED', 'PLOT_MEDIUM_SENTIMENT_ENCODED',
       'PLOT_SUMMARY_DOMINANT_TOPIC', 'PLOT_OUTLINE_DOMINANT_TOPIC',
       'PLOT_MEDIUM_DOMINANT_TOPIC', 'SRC_TITLE_NM_DOMINANT_TOPIC','RATING_AVG']]])
y_pred_classes_test = np.argmax(y_pred_probs_test, axis=1)
predicted_genres_test = mlb.classes_[y_pred_classes_test]

# Add the predictions as a column of df_test
df_test['Predicted_Genre'] = predicted_genres_test

In [None]:
df_test[[ 'ID','SRC_TITLE_ID','Predicted_Genre']]

In [None]:
df_test['Predicted_Genre'].value_counts()

In [None]:
df_test[[ 'ID','Predicted_Genre']].to_csv('output/LSTM_US_test_set_predctions.csv', index=False)