<a href="https://colab.research.google.com/github/clemgi0/movie-analyser_deep-learning-proyecto/blob/main/02_preprocesado.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Analyser | Deep Learning Final Project

In this series of notebooks, I document my progress on this project. Let's begin by defining it. The goal is to build a deep learning model with Keras and TensorFlow that predicts the success of a movie from its synopsis, along with other possible inputs such as the movie's title, its director, or its genre.





### DATASETS
https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

Features of the first dataset:

0 Poster_Link - Link to the poster used by IMDb

1 Series_Title - Name of the movie

2 Released_Year - Year in which the movie was released

3 Certificate - Certificate earned by that movie

4 Runtime - Total runtime of the movie

5 Genre - Genre of the movie

6 IMDB_Rating - Rating of the movie at IMDB site

7 Overview - mini story/ summary

8 Meta_score - Score earned by the movie

9 Director - Name of the Director

10, 11, 12, 13 Star1,Star2,Star3,Star4 - Name of the Stars

14 No_of_votes - Total number of votes

15 Gross - Money earned by that movie



https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset

Features of the second dataset:

0 Filmtv_id - Movie id

1 Title - Name of the movie

2 Year - Movie year of release

3 Genre - Movie genre

4 Duration - Movie duration (in min)

5 Country - Countries where the movie was filmed

6 Directors - Name of movie directors

7 Actors - Name of movie actors

8 Avg_vote - Average rating (by critics and public)

9 Critics_vote - Average vote of the critics

10 Public_vote - Average vote of the public

11 Total_vote - Total votes expressed by critics and public

12 Overview - Movie description

13 Notes - Movie notes

14 Humor - Movie humor score given by filmtv

15 Rythm - Movie rhythm score given by filmtv

16 Tension - Movie tension score given by filmtv

17 Erotism - Movie erotism score given by filmtv

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import kagglehub
import os
import nltk
from nltk import word_tokenize

### Preprocessing of the data
In this second notebook, we focus on preprocessing the data that we collected and explored previously.

Here we import the data we want and shuffle it, for the reason we saw in the first notebook. We also extract the features that interest us.

**Here we select the dataset that we want to use by running only one of the two cells below.**

In [None]:
path = kagglehub.dataset_download("harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows")
dataset = "IMDB"

Using Colab cache for faster access to the 'imdb-dataset-of-top-1000-movies-and-tv-shows' dataset.


In [None]:
path = kagglehub.dataset_download("stefanoleone992/filmtv-movies-dataset")
dataset = "FilmTV"

Using Colab cache for faster access to the 'filmtv-movies-dataset' dataset.


For the IMDB dataset, we will calculate the average score later on; for now, we put the release year in that column as a placeholder value.

In [None]:
files_in_path = os.listdir(path)
csv_files = [f for f in files_in_path if f.endswith('.csv')]

if csv_files:
    data_file = os.path.join(path, csv_files[0])
    df = pd.read_csv(data_file)

    df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle the rows so they are no longer ordered by IMDB rating

    data = df.to_numpy()
    if dataset == "IMDB":
        # Title / Genre / Overview / Director / Released_Year (placeholder for the average score) / IMDB rating / Meta-score
        data = data[:, [1, 5, 7, 9, 2, 6, 8]]
    else:
        # Title / Genre / Overview / Director / Average vote
        data = data[:, [1, 3, 12, 6, 8]]

    print("First 3 rows:", data[:3, :])
else:
    print("No CSV files found in the specified path. Please specify which file to load if it's not a CSV or has a different extension.")

First 3 rows: [['Svadba' 'Comedy'
  'Mishka and Tania, friends since school, are getting married. But something is wrong because the girl leaves for Moscow and disappears for a few years. Having vanished the dreams of becoming a model, she decides to return to the country to marry the good Mishka, muscular and solid worker with a clean face, still in love with her. At this point the film tells about the wedding preparations, the ceremony, the dramas and the trafficking that goes through it.'
  'Pavel Lungin' 7.0]
 ['The Phantom of Crestwood' 'Thriller'
  'Pushed by the beautiful Jenny Wren, banker Priam Andes throws a party at Crestwood, his summer residence. The girl asks Priam to also invite three wealthy men whom she intends to pluck, but her plans will be unexpectedly upset by an inexplicable death ...'
  'J. Walter Ruben' 6.5]
 ['Ragazzi della marina' 'War'
  'The cruiser "Raimondo Montecuccoli" leaves Livorno with the cadets of the Naval Academy. Among them, three sailors are worri

Here, we drop the rows in which any of the selected features has a NaN value.

In [None]:
df2 = pd.DataFrame(data)
df2 = df2.replace("nan", np.nan)   # Convert the literal string "nan" to a real NaN so that dropna() removes the row
df2 = df2.dropna()
data_clean = df2.to_numpy()

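As a rough illustration of the filtering rule above, here is a minimal sketch in plain Python rather than pandas. The helper `is_missing` and the toy rows are hypothetical; the only point is that a row is kept when none of its values is `None`, a float NaN, or the literal string "nan".

```python
import math

def is_missing(value):
    """Treat None, float NaN, and the literal string "nan" as missing."""
    if value is None or value == "nan":
        return True
    return isinstance(value, float) and math.isnan(value)

# Hypothetical rows: title, genre, score
toy_rows = [
    ["Svadba", "Comedy", 7.0],
    ["Untitled", "nan", 6.5],
    ["No score", "Drama", float("nan")],
]

# Keep a row only if every one of its values is present
kept = [row for row in toy_rows if not any(is_missing(v) for v in row)]
print(kept)  # [['Svadba', 'Comedy', 7.0]]
```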


---



In [None]:
nltk.download('punkt_tab')
nltk.download('stopwords')
stopwords_en = set(nltk.corpus.stopwords.words('english'))

# Build the multilingual set separately so that the English-only set is not modified
stopwords_all = set(stopwords_en)
for lang in ('spanish', 'french', 'italian', 'german'):
    stopwords_all.update(nltk.corpus.stopwords.words(lang))

def remove_stopwords(text_list, language):
    """Lowercase, tokenize, and drop stopwords; "english" is used for the Overviews, which are only in English."""
    active_stopwords = stopwords_en if language == "english" else stopwords_all
    cleaned_texts = []
    for text in text_list:
        tokens = [word.lower() for word in nltk.word_tokenize(text) if word.lower() not in active_stopwords]
        cleaned_texts.append(' '.join(tokens))
    return cleaned_texts

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Tokenizing the Overview

After downloading the English stopwords from NLTK, we remove them from the Overview feature to keep only the meaningful words (if the cell below fails, re-run the dataset download cell and the cells that follow it).

In [None]:
data_clean[:,2] = remove_stopwords(data_clean[:,2], "english")
print("First 3 rows of the cleaned Overviews:\n\n",data_clean[:3,2])

First 3 rows of the cleaned Overviews:

 ['mishka tania , friends since school , getting married . something wrong girl leaves moscow disappears years . vanished dreams becoming model , decides return country marry good mishka , muscular solid worker clean face , still love . point film tells wedding preparations , ceremony , dramas trafficking goes .'
 'pushed beautiful jenny wren , banker priam andes throws party crestwood , summer residence . girl asks priam invite three wealthy men intends pluck , plans unexpectedly upset inexplicable death ...'
 'cruiser `` raimondo montecuccoli `` leaves livorno cadets naval academy . among , three sailors worried problems different nature . authority superiors common life board improves character crew , managing solve personal problems .']

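To make the filtering step concrete, here is a minimal sketch in plain Python. It assumes a tiny hand-written stop list (`TOY_STOPWORDS`) and a whitespace split instead of NLTK's stopword corpus and `word_tokenize`, so punctuation handling differs from the cell above. The helper name `strip_stopwords` is hypothetical.

```python
# Toy illustration of stopword filtering; not the NLTK-based function used above.
TOY_STOPWORDS = {"and", "are", "the", "is", "for", "a", "but", "because"}  # assumed, tiny stop list

def strip_stopwords(text, stopwords=TOY_STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = [word.lower() for word in text.split()]
    return " ".join(word for word in tokens if word not in stopwords)

cleaned = strip_stopwords("The girl leaves for Moscow and disappears")
print(cleaned)  # girl leaves moscow disappears
```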

Here, we tokenize our Overviews with a max_features of 5000 or 50000, given that there are 5060 different words in our IMDB dictionary and 60064 in our FilmTV dictionary. We might want to reduce max_features to keep only the most frequent words if doing so improves our model's accuracy.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

if dataset == "IMDB":
  ov_max_features = 5000
else:
  ov_max_features = 50000

ov_tokenizer = Tokenizer(num_words=ov_max_features, split=' ')
ov_tokenizer.fit_on_texts(data_clean[:,2])
ov_tokenizer.word_index.update({'<pad>': 0})
ov_tokenized = ov_tokenizer.texts_to_sequences(data_clean[:,2])

print("First 3 rows of the cleaned Overview:\n", data_clean[:,2][:3])
print("\nFirst 3 rows of the tokenized Overview:\n", ov_tokenized[:3])

First 3 rows of the cleaned Overview:
 ['mishka tania , friends since school , getting married . something wrong girl leaves moscow disappears years . vanished dreams becoming model , decides return country marry good mishka , muscular solid worker clean face , still love . point film tells wedding preparations , ceremony , dramas trafficking goes .'
 'pushed beautiful jenny wren , banker priam andes throws party crestwood , summer residence . girl asks priam invite three wealthy men intends pluck , plans unexpectedly upset inexplicable death ...'
 "cruiser `` raimondo montecuccoli '' leaves livorno cadets naval academy . among , three sailors worried problems different nature . authority superiors common life board improves character crew , managing solve personal problems ."]

First 3 rows of the tokenized Overview:
 [[28433, 9714, 34, 149, 65, 646, 92, 191, 556, 16, 169, 2223, 878, 13, 10929, 230, 403, 914, 28, 93, 165, 268, 84, 28433, 8857, 2430, 745, 3320, 110, 131, 7, 208, 19, 41

In [None]:
# Calculate the number of unique words in the vocabulary
unique_words_count = len(ov_tokenizer.word_index) - 1 # Subtract 1 for the <pad> token
print(f"The number of different words in the {dataset} dataset is: {unique_words_count}")
print("\nList of some of the words that have been tokenized from the Overview:\n", list(ov_tokenizer.word_index)[:20])

The number of different words in the FilmTV dataset is: 63444

List of some of the words that have been tokenized from the Overview:
 ["'s", 'two', 'one', 'young', 'life', "''", 'love', 'however', 'old', 'new', 'family', 'woman', 'years', 'father', 'time', 'girl', 'world', 'wife', 'film', 'find']

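The effect of a vocabulary cap such as `num_words` can be sketched in plain Python. This is an illustration under assumptions, not the Keras implementation: the helpers `build_word_index` and `to_sequences` are hypothetical, ties between equally frequent words are broken alphabetically here, and words above the cap are simply dropped (as happens when no out-of-vocabulary token is set).

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # Assumption: ties are broken alphabetically to keep the result deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Keep only words whose index is strictly below num_words."""
    return [[word_index[w] for w in text.split() if word_index[w] < num_words]
            for text in texts]

toy_texts = ["love story love", "war story", "love war drama"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=3)
print(toy_index)      # {'love': 1, 'story': 2, 'war': 3, 'drama': 4}
print(toy_sequences)  # [[1, 2, 1], [2], [1]]
```

With a cap of 3, only the two most frequent words survive, which is why the cap is chosen relative to the size of the dictionary.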

#### Tokenizing the Title

In this section, we tokenize the Title feature. Since the titles in the FilmTV dataset are in different languages, we remove the stopwords of all the main languages present (English, Italian, French, German and Spanish). Just like before, we set the max features slightly below the total number of words in the dictionary for each dataset (after removing the stopwords: IMDB 1326 words / FilmTV 26153 words).

In [None]:
data_clean[:,0] = remove_stopwords(data_clean[:,0], "all")
print("First 3 rows of the cleaned Titles:\n\n",data_clean[:3,0])

First 3 rows of the cleaned Titles:

 ['svadba' 'phantom crestwood' 'ragazzi marina']


In [None]:
if dataset == "IMDB":
  ti_max_features = 1000
else:
  ti_max_features = 25000

ti_tokenizer = Tokenizer(num_words=ti_max_features, oov_token="<UNK>")
ti_tokenizer.fit_on_texts(data_clean[:,0])
ti_tokenized = ti_tokenizer.texts_to_sequences(data_clean[:,0])

print("First 3 rows of the Title:\n", data_clean[:,0][:3])
print("\nFirst 3 rows of the tokenized Title:\n", ti_tokenized[:3])

First 3 rows of the Title:
 ['svadba' 'phantom crestwood' 'ragazzi marina']

First 3 rows of the tokenized Title:
 [[10099], [560, 10100], [561, 2678]]


Here, we can see that tokens such as potter or 2 appear many times, so we can reasonably expect them to provide valuable information to our model.

In [None]:
# Calculate the number of unique words in the vocabulary
unique_words_count = len(ti_tokenizer.word_index) - 1 # Subtract 1 for the <UNK> token
print(f"The number of different words in the {dataset} dataset is: {unique_words_count}")
print("\nList of some of the words that have been tokenized from the Title:\n", list(ti_tokenizer.word_index)[:20])

The number of different words in the FilmTV dataset is: 28448

List of some of the words that have been tokenized from the Title:
 ['<UNK>', "'s", '2', 'love', 'christmas', 'night', 'last', 'dead', 'story', "'", 'ii', 'one', 'life', '3', 'girl', 'house', 'little', 'black', 'time', 'death']


#### Labeling the Director

In this section, we label the Director feature. Since splitting a full name into two separate names would be meaningless, we label each director by their full name. There is no need to remove stopwords here, since we only have names. The number of different directors is 843 for the IMDB dataset and 13185 for the FilmTV dataset.

In [None]:
from sklearn.preprocessing import LabelEncoder

directors_raw = data_clean[:, 3]

le_director = LabelEncoder()
di_labelized = le_director.fit_transform(directors_raw)

In [None]:
# Number of encoded rows; len(le_director.classes_) would give the number of distinct directors
print(f"The number of labeled director entries in the {dataset} dataset is: {len(di_labelized)}")
print("\nFirst directors and their labels:\n", directors_raw[:10], "\nfor\n", di_labelized[:10])

The number of labeled director entries in the FilmTV dataset is: 39742

First directors and their labels:
 ['Pavel Lungin' 'J. Walter Ruben' 'Francesco De Robertis' 'Keoni Waxman'
 'Francesco Invernizzi' 'Aamir Khan' 'Elie Chouraqui' 'Tara Wood'
 'Paolo Sorrentino' 'Sam Pillsbury'] 
for
 [10798  6033  4409  7779  4422    14  3786 13262 10594 12262]

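The idea behind label encoding can be sketched in a few lines of plain Python. This is only an illustration: the helper `fit_label_map` is hypothetical and mimics the sorted-classes convention of scikit-learn's `LabelEncoder` on a toy list of names.

```python
def fit_label_map(names):
    """Map each distinct name to an integer, assigning labels in sorted order."""
    return {name: label for label, name in enumerate(sorted(set(names)))}

toy_directors = ["Paolo Sorrentino", "Aamir Khan", "Paolo Sorrentino", "Tara Wood"]
label_map = fit_label_map(toy_directors)
toy_labels = [label_map[name] for name in toy_directors]
print(toy_labels)      # [1, 0, 1, 2]
print(len(label_map))  # 3 distinct directors, although there are 4 rows
```

The same director always receives the same label, so the number of distinct labels is at most the number of rows.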

#### Labeling the Genre

In this section, we first split the genres to get the list of genres for each movie, since in the IMDB dataset a movie can have multiple genres. Then we tokenize them and keep all of them in the tokenizer, since there are very few different genres (23 in the IMDB dataset and 34 in the FilmTV dataset).

In [None]:
genres_split = [g.lower().split(", ") for g in data_clean[:,1]]

In [None]:
if dataset == "IMDB":
  ge_max_features = 23
else:
  ge_max_features = 34

ge_tokenizer = Tokenizer(num_words=ge_max_features + 2, oov_token="<UNK>")  # +2: index 0 is reserved and index 1 is <UNK>; only indices below num_words are kept
ge_tokenizer.fit_on_texts([" ".join(g) for g in genres_split])
ge_tokenized = ge_tokenizer.texts_to_sequences([" ".join(g) for g in genres_split])


In [None]:
# Calculate the number of unique words in the vocabulary
unique_words_count = len(ge_tokenizer.word_index) - 1 # Subtract 1 for the <UNK> token
print(f"The number of different words in the {dataset} dataset is: {unique_words_count}")
print("\nList of some of the words that have been tokenized from the Genre:\n", list(ge_tokenizer.word_index)[:20])

The number of different words in the FilmTV dataset is: 34

List of some of the words that have been tokenized from the Genre:
 ['<UNK>', 'drama', 'comedy', 'thriller', 'horror', 'action', 'documentary', 'adventure', 'western', 'animation', 'romantic', 'sci', 'fi', 'biography', 'fantasy', 'crime', 'musical', 'war', 'grotesque', 'spy']


#### Preparation of the scores

Finally, since the IMDB dataset has two different scores, we simply take the average of both. In the FilmTV dataset, we already have an average score. Note that the meta-score is a 0 to 100 rating, so we divide it by 10 before adding it.

In [None]:
if dataset == "IMDB":
  print("IMDB score:", data_clean[:3,5], "\nmeta score:", data_clean[:3,6])
  data_clean[:,4] = (data_clean[:,5] + data_clean[:,6] / 10.0) / 2.0
  print("\n\nAverage score:", data_clean[:3,4])


Then, we normalize our scores by dividing them by 10.

In [None]:
sc_normalized = data_clean[:,4] / 10.0
print("Normalized score:", sc_normalized[:10])

Normalized score: [0.7 0.65 0.6 0.35 0.49000000000000005 0.72 0.38 0.73 0.6599999999999999
 0.43]

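As a worked example of the score preparation, the sketch below averages a 0 to 10 IMDB rating with a 0 to 100 meta-score and then scales the result to the 0 to 1 range. The function name `normalized_score` and the input ratings are hypothetical.

```python
def normalized_score(imdb_rating, meta_score):
    """Average a 0-10 IMDB rating with a 0-100 meta-score, then scale to 0-1."""
    average = (imdb_rating + meta_score / 10.0) / 2.0
    return average / 10.0

example = normalized_score(8.0, 70.0)  # hypothetical ratings
print(example)  # 0.75
```

Here the meta-score 70 becomes 7.0, the average with 8.0 is 7.5, and dividing by 10 gives 0.75.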

#### Separating the data for training and testing

First, we pad the tokenized data to each feature's maximum length so that the arrays are homogeneous.

In [None]:
ov_max_len = max(len(x) for x in ov_tokenized)
ti_max_len = max(len(x) for x in ti_tokenized)
ge_max_len = max(len(x) for x in ge_tokenized)

print("Maximum length for the Overview for the", dataset, "dataset:", ov_max_len)
print("Maximum length for the Title for the", dataset, "dataset:", ti_max_len)
print("Maximum length for the Genre for the", dataset, "dataset:", ge_max_len)

Maximum length for the Overview for the FilmTV dataset: 322
Maximum length for the Title for the FilmTV dataset: 13
Maximum length for the Genre for the FilmTV dataset: 3


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

ov_padded = pad_sequences(ov_tokenized, maxlen=ov_max_len)
ti_padded = pad_sequences(ti_tokenized, maxlen=ti_max_len)
ge_padded = pad_sequences(ge_tokenized, maxlen=ge_max_len)

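The padding step can be pictured with a small plain-Python sketch. It assumes the default behaviour of `pad_sequences`, which adds zeros at the front and, if a sequence is too long, drops tokens from the front. The helper `pad_pre` is hypothetical and only meant for a positive `maxlen`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

toy_sequences = [[560, 10100], [10099], [5, 6, 7, 8]]
toy_padded = pad_pre(toy_sequences, maxlen=3)
print(toy_padded)  # [[0, 560, 10100], [0, 0, 10099], [6, 7, 8]]
```

Every row ends up with the same length, which is what allows the sequences to be stacked into a single 2D array.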
Finally, we prepare the data for training our model as follows:


80% of the rows are used to train the model, at first using only the Overview feature and the IMDB rating to train and evaluate.

20% of the rows are used to test our model's accuracy. Since the data has been shuffled, there shouldn't be any order bias.

In [None]:
print("Number of values for the", dataset, "dataset:", len(data_clean[:,0]))

Number of values for the FilmTV dataset: 39742

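The 80/20 split described above boils down to one cut index applied to every feature array. Here is a minimal sketch on a toy list, where `split_rows` is a hypothetical helper and the rows are assumed to be already shuffled.

```python
def split_rows(rows, train_fraction=0.8):
    """Return (train, test) by slicing an already shuffled sequence."""
    cut = int(train_fraction * len(rows))
    return rows[:cut], rows[cut:]

toy_rows = list(range(10))  # stand-in for already shuffled rows
toy_train, toy_test = split_rows(toy_rows)
print(len(toy_train), len(toy_test))  # 8 2
```

With the 39742 FilmTV rows reported above, the same rule gives 31793 training rows and 7949 test rows.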

In [None]:
nb_train_data = int(0.8*len(data_clean[:,0]))

x_train_overview = ov_padded[:nb_train_data,:]
x_train_title = ti_padded[:nb_train_data,:]
x_train_director = di_labelized[:nb_train_data]
x_train_genre = ge_padded[:nb_train_data,:]
y_train_score = sc_normalized[:nb_train_data]

x_test_overview = ov_padded[nb_train_data:,:]
x_test_title = ti_padded[nb_train_data:,:]
x_test_director = di_labelized[nb_train_data:]
x_test_genre = ge_padded[nb_train_data:,:]
y_test_score = sc_normalized[nb_train_data:]

print("Final data prepared for the", dataset, "dataset:\n\nOverview:\n", ov_padded[:3,:],"\nTitle\n", ti_padded[:3,:],"\nDirector\n" ,di_labelized[:3],"\nGenre\n" ,ge_padded[:3,:],"\nScore\n" ,sc_normalized[:3])

Final data prepared for the FilmTV dataset:

Overview:
 [[    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0 