<a href="https://colab.research.google.com/github/clemgi0/movie-analyser_deep-learning-proyecto/blob/main/02_preprocesado.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Analyser | Deep Learning Final Project

In this series of notebooks, I document my progress on this project. Let's begin by defining it: the goal is to build a deep learning model with Keras and TensorFlow that predicts the success of a movie from its summary, and possibly from other inputs such as the movie's title, its director, or its genre.

### Data preprocessing
In this second notebook, we will focus on the preprocessing of the data that we have collected and explored previously.


---


As a reminder, here are the features of this first dataset:

0 Poster_Link - Link to the poster that IMDb uses

1 Series_Title - Name of the movie

2 Released_Year - Year in which the movie was released

3 Certificate - Certificate earned by that movie

4 Runtime - Total runtime of the movie

5 Genre - Genre of the movie

6 IMDB_Rating - Rating of the movie on the IMDb site

7 Overview - Short summary of the story

8 Meta_score - Score earned by the movie

9 Director - Name of the Director

10, 11, 12, 13 Star1,Star2,Star3,Star4 - Name of the Stars

14 No_of_votes - Total number of votes

15 Gross - Money earned by that movie
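
Positional indices are easy to mix up, so the positions of the columns used later in this notebook can be derived from the names listed above. This is a minimal sketch: the `COLUMNS` list and the `positions` helper are illustrative names, not part of the dataset or of any library.

```python
# Column names in the order listed above (positions 0 to 15).
COLUMNS = [
    "Poster_Link", "Series_Title", "Released_Year", "Certificate",
    "Runtime", "Genre", "IMDB_Rating", "Overview", "Meta_score",
    "Director", "Star1", "Star2", "Star3", "Star4", "No_of_votes", "Gross",
]


def positions(names, columns=COLUMNS):
    """Return the positional index of each requested column name."""
    return [columns.index(name) for name in names]


# The six columns this notebook works with, in the order it uses them.
selected = positions(
    ["Series_Title", "Genre", "Overview", "Director", "IMDB_Rating", "Meta_score"]
)
print(selected)  # [1, 5, 7, 9, 6, 8]
```

Looking names up this way raises a `ValueError` on a misspelled column instead of silently selecting the wrong one.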

In [15]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import kagglehub
import os
import nltk
from nltk import word_tokenize

Here we load the data again and shuffle it, for the reason we saw in the first notebook. We also extract the features we are interested in.

In [16]:
path = kagglehub.dataset_download("harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows")

files_in_path = os.listdir(path)
csv_files = [f for f in files_in_path if f.endswith('.csv')]

if csv_files:
    data_file = os.path.join(path, csv_files[0])
    df = pd.read_csv(data_file)

    df = df.sample(frac=1, random_state=42).reset_index(drop=True) # Shuffle the rows so they are no longer ordered by IMDB rating

    data = df.to_numpy()
    data = data[:, [1, 5, 7, 9, 6, 8]] # Name of the movie / Genre / Overview / Director / IMDB rating / meta-score
    print("First 3 rows:", data[:3,:])
else:
    print("No CSV files found in the specified path. Please specify which file to load if it's not a CSV or has a different extension.")

Using Colab cache for faster access to the 'imdb-dataset-of-top-1000-movies-and-tv-shows' dataset.
First 3 rows: [['Trois couleurs: Bleu' 'Drama, Music, Mystery'
  'A woman struggles to find a way to live her life after the death of her husband and child.'
  'Krzysztof Kieslowski' 7.9 85.0]
 ['Captain America: The Winter Soldier' 'Action, Adventure, Sci-Fi'
  'As Steve Rogers struggles to embrace his role in the modern world, he teams up with a fellow Avenger and S.H.I.E.L.D agent, Black Widow, to battle a new threat from history: an assassin known as the Winter Soldier.'
  'Anthony Russo' 7.7 70.0]
 ['Wreck-It Ralph' 'Animation, Adventure, Comedy'
  'A video game villain wants to be a hero and sets out to fulfill his dream, but his quest brings havoc to the whole arcade where he lives.'
  'Rich Moore' 7.7 72.0]]
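
The effect of shuffling before a positional split can be illustrated on synthetic ratings. The sketch below does not use the real dataset: the ratings are made up, and `random.Random(42).shuffle` stands in for `df.sample(frac=1, random_state=42)`, so the resulting order is not the one pandas produces.

```python
import random
from statistics import mean

# Synthetic ratings sorted from best to worst (NOT the real IMDB data).
ratings = [9.0 - 0.002 * i for i in range(1000)]


def split_gap(values, n_train=800):
    """Mean of the first n_train values minus the mean of the remaining ones."""
    return mean(values[:n_train]) - mean(values[n_train:])


# Without shuffling, the last 200 rows are all the lowest-rated ones.
sorted_gap = split_gap(ratings)

# With a seeded shuffle, both parts cover the whole rating range.
shuffled = ratings[:]
random.Random(42).shuffle(shuffled)
shuffled_gap = split_gap(shuffled)

print(f"gap without shuffling: {sorted_gap:.3f}")
print(f"gap after shuffling:   {shuffled_gap:.3f}")
```

On sorted data the train and test means differ by a full rating point, while after shuffling the gap should be close to zero.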


In [17]:
nltk.download('punkt_tab')
nltk.download('stopwords')
stopwords_en = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


After downloading the English stopwords from NLTK, we remove them from the Overview feature so that only the meaningful words remain.

In [18]:
def remove_stopwords(text_list):
    cleaned_texts = []
    for text in text_list:
        tokens = [word.lower() for word in nltk.word_tokenize(text) if word.lower() not in stopwords_en]
        cleaned_texts.append(' '.join(tokens))
    return cleaned_texts

cleaned_descriptions = remove_stopwords(data[:,2])
print("First 3 rows of the cleaned Overviews:\n",cleaned_descriptions[:3])

First 3 rows of the cleaned Overviews:
 ['woman struggles find way live life death husband child .', 'steve rogers struggles embrace role modern world , teams fellow avenger s.h.i.e.l.d agent , black widow , battle new threat history : assassin known winter soldier .', 'video game villain wants hero sets fulfill dream , quest brings havoc whole arcade lives .']


Here, we tokenize the Overviews with max_features set to 5000, given that our vocabulary contains 5586 distinct words. We may want to lower max_features so that only the most frequent words are kept, if doing so improves the model's accuracy.

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer

max_features = 5000
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(cleaned_descriptions)
tokenizer.word_index.update({'<pad>': 0})
X_cleaned = tokenizer.texts_to_sequences(cleaned_descriptions)

print("First 3 rows of the cleaned Overview:\n", cleaned_descriptions[:3])
print("\nFirst 3 rows of the tokenized Overview:\n", X_cleaned[:3])

First 3 rows of the cleaned Overview:
 ['woman struggles find way live life death husband child .', 'steve rogers struggles embrace role modern world , teams fellow avenger s.h.i.e.l.d agent , black widow , battle new threat history : assassin known winter soldier .', 'video game villain wants hero sets fulfill dream , quest brings havoc whole arcade lives .']

First 3 rows of the tokenized Overview:
 [[10, 82, 14, 65, 68, 4, 60, 135, 69], [2219, 2220, 82, 2221, 508, 659, 6, 921, 228, 1312, 55, 1313, 229, 922, 660, 509, 66, 109, 413, 56, 7, 271, 230, 136, 414, 923, 272], [2222, 197, 1314, 415, 2223, 79, 510, 329, 165, 273, 1315, 924, 2224, 31]]
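
The output above shows that punctuation tokens disappear and that `s.h.i.e.l.d` becomes six separate one-letter tokens. The following pure-Python sketch approximates what the Keras `Tokenizer` appears to do with its default settings: strip the characters in its `filters` argument, rank words by frequency, and drop indices that are not below `num_words`. The helper names are illustrative, and this is an approximation, not the library's code.

```python
from collections import Counter

# Characters the Keras Tokenizer strips by default (its `filters` argument).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in split_words(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, index, num_words=None):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [
        [index[w] for w in split_words(text)
         if w in index and (num_words is None or index[w] < num_words)]
        for text in texts
    ]


texts = ["young man finds love .", "young man , s.h.i.e.l.d agent"]
index = build_index(texts)
print(split_words(texts[1]))
print(to_sequences(texts, index, num_words=3))
```

Under this reading, a cap of 5000 keeps the indices 1 to 4999, which is consistent with the largest indices visible in the tokenized Overviews.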


In [20]:
# Calculate the number of unique words in the vocabulary
unique_words_count = len(tokenizer.word_index) - 1 # Subtract 1 for the <pad> token
print(f"The number of different words in the dataset is: {unique_words_count}")

The number of different words in the dataset is: 5586
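
The tokenizer cell reserves index 0 for `<pad>`, and the tokenized Overviews have different lengths (9, 27, and 14 tokens for the first three). A model that expects fixed-length input would need them padded to a common length. The sketch below is a pure-Python illustration that assumes padding and truncation on the left, which is the default behaviour of Keras' `pad_sequences`. The `pad` helper is a hypothetical name.

```python
def pad(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + list(seq))
    return padded


sequences = [[10, 82, 14], [2219, 2220], [2222, 197, 1314, 415]]
for row in pad(sequences):
    print(row)
```

Each row comes out with four entries, with zeros filling the missing positions at the start.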


In [21]:
tokenizer.word_index

{"'s": 1,
 'young': 2,
 'man': 3,
 'life': 4,
 'two': 5,
 'world': 6,
 'new': 7,
 'family': 8,
 'war': 9,
 'woman': 10,
 'story': 11,
 'love': 12,
 'one': 13,
 'find': 14,
 'old': 15,
 'must': 16,
 'finds': 17,
 'boy': 18,
 'help': 19,
 'father': 20,
 'wife': 21,
 'becomes': 22,
 'girl': 23,
 'american': 24,
 'years': 25,
 'friends': 26,
 'son': 27,
 'former': 28,
 'year': 29,
 'three': 30,
 'lives': 31,
 'city': 32,
 'town': 33,
 'murder': 34,
 'time': 35,
 'mother': 36,
 'team': 37,
 'tries': 38,
 'school': 39,
 'home': 40,
 'small': 41,
 'mysterious': 42,
 "''": 43,
 'group': 44,
 'crime': 45,
 'friend': 46,
 'people': 47,
 'daughter': 48,
 'become': 49,
 'men': 50,
 'police': 51,
 'ii': 52,
 'day': 53,
 'search': 54,
 's': 55,
 'battle': 56,
 'get': 57,
 'back': 58,
 'high': 59,
 'death': 60,
 'first': 61,
 'past': 62,
 'york': 63,
 'takes': 64,
 'way': 65,
 'agent': 66,
 'u': 67,
 'live': 68,
 'child': 69,
 'leads': 70,
 'set': 71,
 'journey': 72,
 'german': 73,
 'save': 74,
 'tog

Finally, we prepare the data for training our model as follows:


The first 800 rows will be used to train the model, using at first only the Overview feature and the IMDB rating to train and evaluate.

The remaining 200 rows will be used to test the model's accuracy. Since the data has been shuffled, there should be no ordering bias.

In [22]:
#x_train = data[:800, [0, 1, 3]] # Name of the movie / Genre / Director
x_train = X_cleaned[:800] # Overview
y_train = data[:800, [4, 5]] # IMDB rating / meta-score

#x_test = data[800:, [0, 1, 3]] # Name of the movie / Genre / Director
x_test = X_cleaned[800:] # Overview
y_test = data[800:, [4, 5]] # IMDB rating / meta-score
print("x_train :\n", x_train[:3], "\nx_test :\n", x_test[:3])
print("\ny_train :\nIMDB rating meta-score\n", y_train[:3,:], "\ny_test :\nIMDB rating meta-score\n",y_test[:3,:])

x_train :
 [[10, 82, 14, 65, 68, 4, 60, 135, 69], [2219, 2220, 82, 2221, 508, 659, 6, 921, 228, 1312, 55, 1313, 229, 922, 660, 509, 66, 109, 413, 56, 7, 271, 230, 136, 414, 923, 272], [2222, 197, 1314, 415, 2223, 79, 510, 329, 165, 273, 1315, 924, 2224, 31]] 
x_test :
 [[1227, 146, 1308, 108, 4876, 1654, 5, 416, 57, 207, 34], [2161, 300, 914, 100, 237, 1296, 1297, 95, 4877, 4878], [11, 2181, 91, 128, 802, 2101, 505, 101, 582, 4879, 1242, 4880, 4881, 4882, 662, 4883, 4884, 2050]]

y_train :
IMDB rating meta-score
 [[7.9 85.0]
 [7.7 70.0]
 [7.7 72.0]] 
y_test :
IMDB rating meta-score
 [[7.9 88.0]
 [7.6 41.0]
 [7.8 nan]]
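
The last `y_test` row shown above has `nan` as its meta-score, so some targets are missing. One possible way to locate such rows before training, which is not taken from this notebook, is sketched below on toy values shaped like `y_test`. The `has_missing` helper is a hypothetical name.

```python
import math


def has_missing(row):
    """True if any value in the row is a float NaN."""
    return any(isinstance(v, float) and math.isnan(v) for v in row)


# Toy targets shaped like y_test above: [IMDB rating, meta-score].
targets = [[7.9, 88.0], [7.6, 41.0], [7.8, float("nan")]]

missing_rows = [i for i, row in enumerate(targets) if has_missing(row)]
complete = [row for row in targets if not has_missing(row)]

print("rows with a missing value:", missing_rows)
print("rows kept:", len(complete))
```

Whether to drop these rows, fill them in, or train on the IMDB rating alone is a modelling choice left open here.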
