<a href="https://colab.research.google.com/github/clemgi0/movie-analyser_deep-learning-proyecto/blob/main/03_arquitectura_de_linea_de_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Analyser | Deep Learning Final Project

In this serie of notebook, we will follow my avances for this project. Let's begin by defining it. Basically, what I want to achieve is to create a deep learning IA model using Keras and Tensorflow that could predict the success of a movie through it's resume, and some other possible input datas like the name of the movie, it's director or it's genre.

### DATASETS
https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

Features of the first dataset :

Here are it's features:

0 Poster_Link - Link of the poster that imdb using

1 Series_Title - Name of the movie

2 Released_Year - Year at which that movie released

3 Certificate - Certificate earned by that movie

4 Runtime - Total runtime of the movie

5 Genre - Genre of the movie

6 IMDB_Rating - Rating of the movie at IMDB site

7 Overview - mini story/ summary

8 Meta_score - Score earned by the movie

9 Director - Name of the Director

10, 11, 12, 13 Star1,Star2,Star3,Star4 - Name of the Stars

14 No_of_votes - Total number of votes

15 Gross - Money earned by that movie



https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset

Features of the second dataset :

0 Filmtv_id - Movie id

1 Title - Name of the movie

2 Year - Movie year of release

3 Genre - Movie genre

4 Duration - Movie duration (in min)

5 Country - Countries where the movie was filmed

6 Directors - Name of movie directors

7 Actors - Name of movie actors

8 Avg_vote - Average rating (by critics and public)

9 Critics_vote - Average vote of the critics

10 Public_vote - Average vote of the public

11 Total_vote - Total votes expressed by critics and public

12 Overview - Movie description

13 Notes - Movie notes

14 Humor - Movie humor score given by filmtv

15 Rythm - Movie rythm score given by filmtv

16 Tension - Movie tension score given by filmtv

17 Erotism - Movie erotism score given by filmtv

In [65]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import kagglehub
import os
import nltk
from nltk import word_tokenize

### Base architecture
In this thirs notebook, we will focus on realising a basic architecture to test out our datas. Then, we will try to implement a more complicated model architecture to intent to get the best results possible.

**Here we select the dataset that we want to use.**

In [98]:
path = kagglehub.dataset_download("harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows")
dataset = "IMDB"

Using Colab cache for faster access to the 'imdb-dataset-of-top-1000-movies-and-tv-shows' dataset.


In [118]:
path = kagglehub.dataset_download("stefanoleone992/filmtv-movies-dataset")
dataset = "FilmTV"

Using Colab cache for faster access to the 'filmtv-movies-dataset' dataset.


Here we import the datas we want and shuffle them for the reason we saw on the first notebook. We also withdraw the features that interests us.

In [119]:
files_in_path = os.listdir(path)
csv_files = [f for f in files_in_path if f.endswith('.csv')]

if csv_files:
    data_file = os.path.join(path, csv_files[0])
    df = pd.read_csv(data_file)

    df = df.sample(frac=1, random_state=42).reset_index(drop=True) # Shuffle the datas to avoid linear IMDB rating

    data = df.to_numpy()
    if dataset == "IMDB":
      data = data[:, [1, 5, 7, 9, 6, 8]] # Title / Genre / Overview / Director / IMDB rating / meta-score
    else:
      data = data[:, [1, 3, 12, 6, 9, 10]] # Title / Genre / Overview / Director / Critics votes / Public votes

    print("Data shape:", data[:3,:])
else:
    print("No CSV files found in the specified path. Please specify which file to load if it's not a CSV or has a different extension.")

Data shape: [['Svadba' 'Comedy'
  'Mishka and Tania, friends since school, are getting married. But something is wrong because the girl leaves for Moscow and disappears for a few years. Having vanished the dreams of becoming a model, she decides to return to the country to marry the good Mishka, muscular and solid worker with a clean face, still in love with her. At this point the film tells about the wedding preparations, the ceremony, the dramas and the trafficking that goes through it.'
  'Pavel Lungin' 7.0 7.0]
 ['The Phantom of Crestwood' 'Thriller'
  'Pushed by the beautiful Jenny Wren, banker Priam Andes throws a party at Crestwood, his summer residence. The girl asks Priam to also invite three wealthy men whom she intends to pluck, but her plans will be unexpectedly upset by an inexplicable death ...'
  'J. Walter Ruben' 6.0 7.0]
 ['Ragazzi della marina' 'War'
  'The cruiser "Raimondo Montecuccoli" leaves Livorno with the cadets of the Naval Academy. Among them, three sailors a

Here, we prepocess the datas as we did it in second notebook by cleaning, shortening, tokenizing and normalizing them.

In [120]:
# Remove the rows containing nan values
df2 = pd.DataFrame(data)
df2 = df2.replace("nan", np.nan)   # if "nan" is a chain, we delete the row
df2 = df2.dropna()
data_clean = df2.to_numpy()


# Downloading the english stopwords dictionnary
nltk.download('punkt_tab')
nltk.download('stopwords')
stopwords_en = nltk.corpus.stopwords.words('english')


# Remove the stopwords from nltk english stopwords dictionnary to get a clean dataset
cleaned_texts = []
for text in data_clean[:,2]:
    tokens = [word.lower() for word in nltk.word_tokenize(text) if word.lower() not in stopwords_en]
    cleaned_texts.append(' '.join(tokens))


# Tokenize the cleaned dataset of movie's Overview
if dataset == "IMDB":
  max_features = 5000
else:
  max_features = 50000
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(cleaned_texts)
tokenizer.word_index.update({'<pad>': 0})
X_cleaned = tokenizer.texts_to_sequences(cleaned_texts)


# Retrieve the differents data's
if dataset == "IMDB":
  #x_train = data[:800, [0, 1, 3]] # Name of the movie / Genre / Director
  x_train = X_cleaned[:800] # Overview
  y_train = data_clean[:800, [4, 5]] # IMDB rating / meta-score

  #x_test = data[800:, [0, 1, 2, 3]] # Name of the movie / Genre / Director
  x_test = X_cleaned[800:] # Overview
  y_test = data_clean[800:, [4, 5]] # IMDB rating / meta-score

  # Normalization of the goal's datas
  y_train[:, 0] = y_train[:, 0] / 10.0   # IMDB rating
  y_train[:, 1] = y_train[:, 1] / 100.0  # Meta-score

  y_test[:, 0] = y_test[:, 0] / 10.0
  y_test[:, 1] = y_test[:, 1] / 100.0
else:
  #x_train = data[:33000, [0, 1, 3]] # Name of the movie / Genre / Director
  x_train = X_cleaned[:33000] # Overview
  y_train = data_clean[:33000, [4, 5]] / 10.0 # IMDB rating / meta-score

  #x_test = data[33000:, [0, 1, 2, 3]] # Name of the movie / Genre / Director
  x_test = X_cleaned[33000:] # Overview
  y_test = data_clean[33000:, [4, 5]] /10.0 # IMDB rating / meta-score

print("\nx_train :\n", x_train[:3], "\nx_test :\n", x_test[:3])
if dataset == "IMDB":
  print("\ny_train :\nIMDB rating | meta-score\n", y_train[:3,:], "\ny_test :\nIMDB rating | meta-score\n",y_test[:3,:])
else:
  print("\ny_train :\nCritic votes | Public votes\n", y_train[:3,:], "\ny_test :\nCritic votes | Public votes\n",y_test[:3,:])

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



x_train :
 [[27083, 10282, 37, 166, 75, 687, 95, 210, 609, 15, 171, 2126, 901, 14, 10744, 251, 419, 915, 31, 104, 165, 270, 84, 27083, 8281, 2307, 718, 3548, 116, 138, 7, 196, 20, 404, 378, 4384, 2866, 4836, 1211, 81], [2322, 60, 2227, 17829, 2414, 19809, 10745, 1774, 318, 35621, 360, 2487, 15, 368, 19809, 16, 3791, 33, 322, 97, 1171, 35622, 505, 1442, 898, 2579, 40], [19810, 10746, 35623, 5, 171, 8282, 10747, 4524, 2716, 125, 33, 4189, 2286, 243, 135, 544, 4457, 2171, 953, 6, 1094, 10748, 402, 922, 16, 2867, 997, 459, 243]] 
x_test :
 [[5956, 8494, 23568, 68, 736, 1684, 5155, 333, 522, 13, 2088, 25, 157, 692, 4434, 18498, 297, 605, 242, 42, 4344, 592, 287], [4775, 1393, 2200, 68, 61, 37733, 13, 81, 15604, 1288, 35, 76, 42, 4, 8, 70, 336, 1831, 24, 153, 38, 524, 107, 658, 1543, 10810, 6248, 17, 7173, 8571, 27, 136, 948, 4413, 82, 7065, 2595, 6115], [2540, 20538, 182, 2462, 470, 12535, 1761, 1021, 271, 50, 3694, 295, 403, 367, 1154, 7092, 3214, 12535, 595, 384, 20538, 1265, 1682, 1783,

In [121]:
print("Maximum length of an Overview for the dataset chosen:",max(len(x) for x in x_train))

Maximum length of an Overview for the dataset chosen: 209


Here, we finish to prepare our datas that will train our model. For the IMDB dataset, the maximum length of an shortened overview is so we chose a padding sequence of 40 to be sure that we don't loose word and we follow the same logic for the FilmTV dataset

In [122]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

if dataset == "IMDB":
  max_features = 5000
else:
  max_features = 50000
embedding_dim = 64
if dataset == "IMDB":
  max_len = 40
else:
  max_len = 340

X_train_seq = pad_sequences(x_train, maxlen=max_len)
X_test_seq  = pad_sequences(x_test, maxlen=max_len)

y_train_reg = y_train[:, 0].astype(np.float32)
y_test_reg  = y_test[:, 0].astype(np.float32)

Here, we create our first model, something really simple with a GlobalAveragePooling1D layer, just to see what we can get from the preprocessing of our datas.

In [123]:
# --- Architecture ---
inputs = Input(shape=(max_len,))
x = Embedding(max_features, embedding_dim)(inputs)
x = GlobalAveragePooling1D()(x)
x = Dense(64, activation='relu')(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs, outputs)

In [124]:
model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

model.summary()

In [125]:
model.fit(X_train_seq, y_train_reg, epochs=15, batch_size=128, verbose=1)

Epoch 1/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 65ms/step - loss: 0.0267 - mae: 0.1330
Epoch 2/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 50ms/step - loss: 0.0249 - mae: 0.1287
Epoch 3/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - loss: 0.0253 - mae: 0.1282
Epoch 4/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - loss: 0.0262 - mae: 0.1321
Epoch 5/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - loss: 0.0248 - mae: 0.1284
Epoch 6/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - loss: 0.0249 - mae: 0.1280
Epoch 7/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 47ms/step - loss: 0.0249 - mae: 0.1275
Epoch 8/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 48ms/step - loss: 0.0247 - mae: 0.1276
Epoch 9/15
[1m24/24[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 49ms/step - l

<keras.src.callbacks.history.History at 0x7841b07854c0>

In [126]:
loss, mae = model.evaluate(X_test_seq, y_test_reg, verbose=1)
print("MSE (loss) :", loss)
print("MAE :", mae)

[1m1006/1006[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3ms/step - loss: 0.0251 - mae: 0.1269
MSE (loss) : 0.024928240105509758
MAE : 0.126577690243721


Finally, we can see that this model is doing a pretty miserable at predicting our movies score but we knew it would be like that since it can't encapsulate the sense of the Overview without an RNN layer. So, this is what our next model will include.

In [127]:
y_pred = model.predict(X_test_seq)

for i in range(10):
    print("Overview:", data[800+i, 2][:80], "...")
    print("Real rating :", y_test_reg[i], " – Prediction :", y_pred[i][0])
    print("---")



[1m1006/1006[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 3ms/step
Overview: Bill and Cassel, leaders of two rival gangs unjustly accused of murdering three  ...
Real rating : 0.631  – Prediction : 0.5855587
---
Overview: Woody Grant (Bruce Dern), a short-tempered old man who has turned everyone away  ...
Real rating : 0.6  – Prediction : 0.58299434
---
Overview: A road movie on a red van around Sicily in search of the new oral narrators who  ...
Real rating : 0.567  – Prediction : 0.58289796
---
Overview: A young London woman arrives at an isolated castle to visit relatives she hasn't ...
Real rating : 0.694  – Prediction : 0.58223045
---
Overview: The bloodthirsty count this time is a famous scientist who invented a virus capa ...
Real rating : 0.667  – Prediction : 0.58130246
---
Overview: After thirty years spent in Geneva as master builder in the Boyer civil engineer ...
Real rating : 0.333  – Prediction : 0.57736945
---
Overview: Nick Naylor's job is to represent the to