<a href="https://colab.research.google.com/github/clemgi0/movie-analyser_deep-learning-proyecto/blob/main/03_arquitectura_de_linea_de_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Analyser | Deep Learning Final Project

In this serie of notebook, we will follow my avances for this project. Let's begin by defining it. Basically, what I want to achieve is to create a deep learning IA model using Keras and Tensorflow that could predict the success of a movie through it's resume, and some other possible input datas like the name of the movie, it's director or it's genre.

### DATASETS
https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

Features of the first dataset :

Here are it's features:

0 Poster_Link - Link of the poster that imdb using

1 Series_Title - Name of the movie

2 Released_Year - Year at which that movie released

3 Certificate - Certificate earned by that movie

4 Runtime - Total runtime of the movie

5 Genre - Genre of the movie

6 IMDB_Rating - Rating of the movie at IMDB site

7 Overview - mini story/ summary

8 Meta_score - Score earned by the movie

9 Director - Name of the Director

10, 11, 12, 13 Star1,Star2,Star3,Star4 - Name of the Stars

14 No_of_votes - Total number of votes

15 Gross - Money earned by that movie



https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset

Features of the second dataset :

0 Filmtv_id - Movie id

1 Title - Name of the movie

2 Year - Movie year of release

3 Genre - Movie genre

4 Duration - Movie duration (in min)

5 Country - Countries where the movie was filmed

6 Directors - Name of movie directors

7 Actors - Name of movie actors

8 Avg_vote - Average rating (by critics and public)

9 Critics_vote - Average vote of the critics

10 Public_vote - Average vote of the public

11 Total_vote - Total votes expressed by critics and public

12 Overview - Movie description

13 Notes - Movie notes

14 Humor - Movie humor score given by filmtv

15 Rythm - Movie rythm score given by filmtv

16 Tension - Movie tension score given by filmtv

17 Erotism - Movie erotism score given by filmtv

In [2]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import kagglehub
import os
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from nltk import word_tokenize
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder

## Base architecture
In this thirs notebook, we will focus on realising a basic architecture to test out our datas. Then, we will try to implement a more complicated model architecture to intent to get the best results possible.

### Data preparation

**Here we select the dataset that we want to use (run two times so it shows "Using Colab cache...).**

In [4]:
path = kagglehub.dataset_download("harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows")
dataset = "IMDB"

Using Colab cache for faster access to the 'imdb-dataset-of-top-1000-movies-and-tv-shows' dataset.


In [29]:
path = kagglehub.dataset_download("stefanoleone992/filmtv-movies-dataset")
dataset = "FilmTV"

Using Colab cache for faster access to the 'filmtv-movies-dataset' dataset.


Here we import the datas we want and shuffle them for the reason we saw on the first notebook. We also withdraw the features that interests us.

In [30]:
files_in_path = os.listdir(path)
csv_files = [f for f in files_in_path if f.endswith('.csv')]

if csv_files:
    data_file = os.path.join(path, csv_files[0])
    df = pd.read_csv(data_file)

    df = df.sample(frac=1, random_state=42).reset_index(drop=True) # Shuffle the datas to avoid linear IMDB rating

    data = df.to_numpy()
    if dataset == "IMDB":
      data = data[:, [1, 5, 7, 9, 2, 6, 8]] # Title / Genre / Overview / Director / Average score / IMDB rating / meta-score
    else:
      data = data[:, [1, 3, 12, 6, 8]] # Title / Genre / Overview / Director / Average votes

    print("Data shape:", data[:3,:])
else:
    print("No CSV files found in the specified path. Please specify which file to load if it's not a CSV or has a different extension.")

Data shape: [['Svadba' 'Comedy'
  'Mishka and Tania, friends since school, are getting married. But something is wrong because the girl leaves for Moscow and disappears for a few years. Having vanished the dreams of becoming a model, she decides to return to the country to marry the good Mishka, muscular and solid worker with a clean face, still in love with her. At this point the film tells about the wedding preparations, the ceremony, the dramas and the trafficking that goes through it.'
  'Pavel Lungin' 7.0]
 ['The Phantom of Crestwood' 'Thriller'
  'Pushed by the beautiful Jenny Wren, banker Priam Andes throws a party at Crestwood, his summer residence. The girl asks Priam to also invite three wealthy men whom she intends to pluck, but her plans will be unexpectedly upset by an inexplicable death ...'
  'J. Walter Ruben' 6.5]
 ['Ragazzi della marina' 'War'
  'The cruiser "Raimondo Montecuccoli" leaves Livorno with the cadets of the Naval Academy. Among them, three sailors are worri

Here, we prepocess the datas as we did it in second notebook by cleaning, shortening, tokenizing and normalizing them (takes approximately 1min for FilmTV dataset).

In [85]:
# Remove the rows containing nan values
df2 = pd.DataFrame(data)
df2 = df2.replace("nan", np.nan)   # if "nan" is a chain, we delete the row
df2 = df2.dropna()
data_clean = df2.to_numpy()


# Set variables
if dataset == "IMDB":
  ov_max_features = 5000
  ti_max_features = 1000
  ge_max_features = 23
else:
  ov_max_features = 50000
  ti_max_features = 25000
  ge_max_features = 34


# Downloading the english stopwords dictionnary and creating the remove_stowords function
nltk.download('punkt_tab')
nltk.download('stopwords')
stopwords_en = nltk.corpus.stopwords.words('english')
stopwords_all = stopwords_en
stopwords_all += nltk.corpus.stopwords.words('spanish')
stopwords_all += nltk.corpus.stopwords.words('french')
stopwords_all += nltk.corpus.stopwords.words('italian')
stopwords_all += nltk.corpus.stopwords.words('german')
stopwords_all = set(stopwords_all)

def remove_stopwords(text_list, language): # For the Overview since they are only in English
    cleaned_texts = []
    for text in text_list:
      if language == "english":
        tokens = [word.lower() for word in nltk.word_tokenize(text) if word.lower() not in stopwords_en]
      else:
        tokens = [word.lower() for word in nltk.word_tokenize(text) if word.lower() not in stopwords_all]
      cleaned_texts.append(' '.join(tokens))
    return cleaned_texts


# Tokenize Overview
data_clean[:,2] = remove_stopwords(data_clean[:,2], "english")
ov_tokenizer = Tokenizer(num_words=ov_max_features, split=' ')
ov_tokenizer.fit_on_texts(data_clean[:,2])
ov_tokenizer.word_index.update({'<pad>': 0})
ov_tokenized = ov_tokenizer.texts_to_sequences(data_clean[:,2])


# Tokenize Title
data_clean[:,0] = remove_stopwords(data_clean[:,0], "all")
ti_tokenizer = Tokenizer(num_words=ti_max_features, oov_token="<UNK>")
ti_tokenizer.fit_on_texts(data_clean[:,0])
ti_tokenized = ti_tokenizer.texts_to_sequences(data_clean[:,0])


# Labelize Director
directors_raw = data_clean[:, 3]
le_director = LabelEncoder()
di_labelized = le_director.fit_transform(directors_raw)


# Tokenize Genre
genres_split = [g.lower().split(", ") for g in data_clean[:,1]]
ge_tokenizer = Tokenizer(num_words=ge_max_features, oov_token="<UNK>")
ge_tokenizer.fit_on_texts([" ".join(g) for g in genres_split])
ge_tokenized = ge_tokenizer.texts_to_sequences([" ".join(g) for g in genres_split])


# Taking average and normalization of the scores
if dataset == "IMDB":
  data_clean[:,4] = (data_clean[:,5] + data_clean[:,6] / 10.0) / 2.0
sc_normalized = data_clean[:,4] / 10.0


# Padding the tokenized features
ov_max_len = max(len(x) for x in ov_tokenized)
ti_max_len = max(len(x) for x in ti_tokenized)
ge_max_len = max(len(x) for x in ge_tokenized)

ov_padded = pad_sequences(ov_tokenized, maxlen=ov_max_len)
ti_padded = pad_sequences(ti_tokenized, maxlen=ti_max_len)
ge_padded = pad_sequences(ge_tokenized, maxlen=ge_max_len)


# Splitting the datas
nb_train_data = int(0.8*len(data_clean[:,0]))

x_train_overview = ov_padded[:nb_train_data,:]
x_train_title = ti_padded[:nb_train_data,:]
x_train_director = di_labelized[:nb_train_data]
x_train_genre = ge_padded[:nb_train_data,:]
y_train_score = np.array(sc_normalized[:nb_train_data], dtype=np.float32)

x_test_overview = ov_padded[nb_train_data:,:]
x_test_title = ti_padded[nb_train_data:,:]
x_test_director = di_labelized[nb_train_data:]
x_test_genre = ge_padded[nb_train_data:,:]
y_test_score = np.array(sc_normalized[nb_train_data:], dtype=np.float32)
print("Final data prepared for the", dataset, "dataset:\n\nOverview\n", ov_padded[:3,:],"\nTitle\n", ti_padded[:3,:],"\nDirector\n" ,di_labelized[:3],"\nGenre\n" ,ge_padded[:3,:],"\nScore\n" ,sc_normalized[:3])

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Final data prepared for the FilmTV dataset:

Overview
 [[    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0  

In [32]:
print("Maximum length of Overview / Title / Genre for the", dataset, "dataset:", ov_max_len, "/", ti_max_len, "/", ge_max_len)

Maximum length of Overview / Title / Genre for the FilmTV dataset: 322 / 13 / 3


Here, we finish to prepare our datas that will train our model. For the IMDB dataset, the maximum length of an shortened overview is 34 for the IMDB dataset so we choose a padding sequence of 40 to be sure that we don't loose word and we follow the same logic for the FilmTV dataset (for which the max_len is 331).

### First model

Here, we create our first model, something really simple with a GlobalAveragePooling1D layer, just to see what we can get from the preprocessing of our datas.

In [33]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense

# --- Architecture ---
def getModel1(max_len, max_features, embedding_dim):
  inputs = Input(shape=(max_len,))
  x = Embedding(max_features, embedding_dim)(inputs)
  x = GlobalAveragePooling1D()(x)
  x = Dense(64, activation='relu')(x)
  outputs = Dense(1, activation='sigmoid')(x)

  return Model(inputs, outputs)

In [34]:
embedding_dim = 64

model = getModel1(ov_max_len, ov_max_features, embedding_dim)

model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

model.summary()

In [48]:
model.fit(x_train_overview, y_train_score, epochs=1, batch_size=256, verbose=1)

[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 76ms/step - loss: 0.0201 - mae: 0.1153


<keras.src.callbacks.history.History at 0x7a326c7d2030>

In [51]:
loss, mae = model.evaluate(x_test_overview, y_test_score, verbose=1)
print("MSE (loss) :", loss)
print("MAE :", mae)

[1m249/249[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0200 - mae: 0.1139
MSE (loss) : 0.019620675593614578
MAE : 0.11314806342124939


Finally, we can see that this model is doing a pretty miserable job at predicting our movies score but we knew it would be like that since it can't encapsulate the sense of the Overview without an RNN layer. So, this is what our next model will include.

In [52]:
y_pred = model.predict(x_test_overview)

for i in range(10):
    print("Overview:", data[800+i, 2][:80], "...")
    print("Real rating :", y_test_score[i], " – Prediction :", y_pred[i][0])
    print("---")

[1m249/249[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step
Overview: Bill and Cassel, leaders of two rival gangs unjustly accused of murdering three  ...
Real rating : 0.43  – Prediction : 0.57448816
---
Overview: Woody Grant (Bruce Dern), a short-tempered old man who has turned everyone away  ...
Real rating : 0.52  – Prediction : 0.5836494
---
Overview: A road movie on a red van around Sicily in search of the new oral narrators who  ...
Real rating : 0.41  – Prediction : 0.5791759
---
Overview: A young London woman arrives at an isolated castle to visit relatives she hasn't ...
Real rating : 0.46  – Prediction : 0.57828987
---
Overview: The bloodthirsty count this time is a famous scientist who invented a virus capa ...
Real rating : 0.53  – Prediction : 0.57887983
---
Overview: After thirty years spent in Geneva as master builder in the Boyer civil engineer ...
Real rating : 0.5  – Prediction : 0.58501166
---
Overview: Nick Naylor's job is to represent the tobacco mu

### Second model

Input 1 : overview → Embedding → LSTM → Dense(64)

Input 2 : title → Embedding → LSTM → Dense(16)

Input 3 : genre → Embedding → Flatten → Dense(8)

Input 4 : director → Embedding → Flatten → Dense(16)

Concatenate the 4 branches

Dense(64) → Output (score)

In [102]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dense, Concatenate, Flatten)

def getModel2(embedding_dim):
  # ---- Inputs ----
  overview_in = Input(shape=(ov_max_len,), name="overview_input")
  title_in    = Input(shape=(ti_max_len,), name="title_input")
  genre_in    = Input(shape=(ge_max_len,), name="genre_input")
  director_in = Input(shape=(1,), name="director_input")

  # ---- Overview branch (RNN) ----
  x = Embedding(ov_max_features, embedding_dim, name="ov_embedding")(overview_in)
  x = LSTM(64, return_sequences=False, name="ov_LSTM")(x)
  x = Dense(64, activation='relu', name="ov_dense64")(x)

  # ---- Title embedding ----
  title_emb = Embedding(ti_max_features, output_dim=16, name="ti_embedding")(title_in)
  title_emb = LSTM(16, return_sequences=False, name="ti_LSTM")(title_emb)
  title_emb = Dense(16, activation='relu', name="ti_dense16")(title_emb)

  # ---- Genre embedding ----
  genre_emb = Embedding(ge_max_features, output_dim=8, name="ge_embedding")(genre_in)
  genre_emb = Flatten(name="ge_flatten")(genre_emb)
  genre_emb = Dense(8, activation='relu', name="ge_dense8")(genre_emb)

  # ---- Director dense ----
  director_emb = Embedding(len(le_director.classes_), output_dim=16)(director_in)
  director_emb = Flatten(name="di_flatten")(director_emb)
  director_emb = Dense(16, activation='relu', name="di_dense16")(director_emb)

  # ---- Fusion ----
  merged = Concatenate()([
      x,
      title_emb,
      genre_emb,
      director_emb
  ])

  # ---- Final regressor ----
  hidden = Dense(64, activation='relu', name="hidden_dense64")(merged)
  output = Dense(1, activation='sigmoid', name="score_output")(hidden)

  return Model(
      inputs=[overview_in, title_in, genre_in, director_in],
      outputs=output
  )

In [109]:
embedding_dim = 64

model = getModel2(embedding_dim)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()

In [110]:
model.fit([x_train_overview, x_train_title, x_train_genre, x_train_director], y_train_score, epochs=1, batch_size=256, verbose=1)

[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m126s[0m 956ms/step - loss: 0.0191 - mae: 0.1120


<keras.src.callbacks.history.History at 0x7a32381691f0>

In [111]:
loss, mae = model.evaluate([x_test_overview, x_test_title, x_test_genre, x_test_director], y_test_score, verbose=1)
print("MSE (loss) :", loss)
print("MAE :", mae)

[1m249/249[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 55ms/step - loss: 0.0142 - mae: 0.0943
MSE (loss) : 0.01394168846309185
MAE : 0.09310586005449295


In [112]:
y_pred = model.predict([x_test_overview, x_test_title, x_test_genre, x_test_director])

for i in range(10):
    print("Overview:", data[nb_train_data+i, 2][:80], "...")
    print("Real rating :", y_test_score[i], " – Prediction :", y_pred[i][0])
    print("---")

[1m249/249[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 54ms/step
Overview: In an old theater of the sixties, the actress Piera Degli Esposti retraces her l ...
Real rating : 0.43  – Prediction : 0.63440853
---
Overview: Martin and Laura Burney have been married for more than three years, but for her ...
Real rating : 0.52  – Prediction : 0.55669725
---
Overview: We are in 1694, in England. Neville, an ambitious painter, accepts a curious con ...
Real rating : 0.41  – Prediction : 0.49020493
---
Overview: Over the course of a single day, you follow the sun's course from the highest mo ...
Real rating : 0.46  – Prediction : 0.4715481
---
Overview: Twenty years after finishing her studies, Anna has to go on a reunion with her f ...
Real rating : 0.53  – Prediction : 0.62028277
---
Overview: In Ajo City, in southern Arizona, an irrepressible proliferation of rabbits dest ...
Real rating : 0.5  – Prediction : 0.56113744
---
Overview: A gravedigger, while about to dig a grave wit