### **TMDB Score Prediction** - Regression with deep learning
- **Date**: Mar 6, 2024  
- **Task**: Create a model to predict movie score based on text and numeric inputs 
- **Procedure**: Analyze data with pandas, create nn model in TensorFlow
- **Dataset source**: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata/data 
- **References**: https://github.com/PhilChodrow/PIC16B/blob/7d12d32e070e7ff3840b971c0ce4185ef1911796/discussion/tmdb.ipynb#L758

In [1]:
# Step 0. Load libraries and custom functions
# Matrices and datasets ------------------------------------------------
import pandas as pd
import numpy as np
# Graphics -------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns
# Text processors
import re
import string
#import nltk
#from nltk.corpus import stopwords
#nltk.download('stopwords')
from wordcloud import WordCloud
# Machine Learning -----------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer
# Deep Learning --------------------------------------------------------
import keras
import tensorflow as tf
from keras import layers
from keras.layers import TextVectorization
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [2]:
# Step 1. Load data
# 1.1 Read csv and get basic info
df_raw = pd.read_csv('../data/02_TMDB_5000_movies.csv')
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [3]:
# 1.2 Get a sample
df_raw.sample(5, random_state=2024)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
2182,0,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",,13682,[],en,Pooh's Heffalump Movie,Who or what exactly is a Heffalump? The lovabl...,9.03154,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2005-02-11,0,68.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,There's something new in the Hundred Acre Wood.,Pooh's Heffalump Movie,6.4,88
3274,8000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 53, ""nam...",,13154,"[{""id"": 1794, ""name"": ""yakuza""}, {""id"": 12670,...",en,Showdown in Little Tokyo,"An American with a Japanese upbringing, Chris ...",8.403859,"[{""name"": ""Original Pictures"", ""id"": 4234}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1991-08-23,2275557,79.0,"[{""iso_639_1"": ""ja"", ""name"": ""\u65e5\u672c\u8a...",Released,One's a warrior. One's a wise guy. They're two...,Showdown in Little Tokyo,5.7,95
1003,49000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 35, ""nam...",,9548,"[{""id"": 578, ""name"": ""rock and roll""}, {""id"": ...",en,The Adventures of Ford Fairlane,"Ford ""Mr. Rock n' Roll Detective"" Fairlane is ...",2.808428,"[{""name"": ""Twentieth Century Fox Film Corporat...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1990-07-11,20423389,104.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Kojak. Columbo. Dirty Harry. Wimps.,The Adventures of Ford Fairlane,6.2,71
1383,32000000,"[{""id"": 18, ""name"": ""Drama""}]",,13920,"[{""id"": 5565, ""name"": ""biography""}, {""id"": 605...",en,Radio,"High school football coach, Harold Jones befri...",9.254647,"[{""name"": ""Revolution Studios"", ""id"": 497}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2003-10-24,52277485,109.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,His courage made them champions.,Radio,6.8,141
2724,18339750,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 36, ""name...",http://www.downfallthefilm.com/,613,"[{""id"": 220, ""name"": ""berlin""}, {""id"": 351, ""n...",de,Der Untergang,"In April of 1945, Germany stands at the brink ...",32.445895,"[{""name"": ""Degeto Film"", ""id"": 986}, {""name"": ...","[{""iso_3166_1"": ""AT"", ""name"": ""Austria""}, {""is...",2004-09-08,92180910,156.0,"[{""iso_639_1"": ""hu"", ""name"": ""Magyar""}, {""iso_...",Released,"April 1945, a nation awaits its...Downfall",Downfall,7.7,1037


Some of the columns contains nested json data, other contains unique 
information like ids or names, so let's transform our dataset. 

In [4]:
# 2. Preprocess data
# 2.1 Create a interim dataset for transformations, drop unused columns and NAs
df_interim = df_raw.copy()
df_interim = df_interim.drop(columns=['id','original_title','title','vote_count','original_language','homepage'])
df_interim = df_interim.dropna()

Now we can concatenate each id value in json format to form a id collection
and process it for prediction

In [5]:
# 2.2 Concatenate strings in json format and drop changed columns
df_interim['genres_c'] = df_interim['genres'].apply(lambda x: ' '.join([str(y['id']) for y in eval(x)]))
df_interim['keywords_c'] = df_interim['keywords'].apply(lambda x: ' '.join([str(y['id']) for y in eval(x)]))
df_interim['producers_c'] = df_interim['production_companies'].apply(lambda x: ' '.join([str(y['id']) for y in eval(x)]))
df_interim['countries_c'] = df_interim['production_countries'].apply(lambda x: ' '.join([str(y['iso_3166_1']) for y in eval(x)]))
df_interim['languages_c'] = df_interim['spoken_languages'].apply(lambda x: ' '.join([str(y['iso_639_1']) for y in eval(x)]))
df_interim = df_interim.drop(columns=['genres','keywords','production_companies','production_countries','spoken_languages'])

In [6]:
# 2.3 View results and current shape
display(df_interim.sample(2))
df_interim.shape

Unnamed: 0,budget,overview,popularity,release_date,revenue,runtime,status,tagline,vote_average,genres_c,keywords_c,producers_c,countries_c,languages_c
2214,20000000,Oscar and Peter land a career-making opportuni...,7.553773,1999-10-22,0,98.0,Released,All's fair in the war of love.,5.4,35 10749,237 1691 2301 2679 2683 4480 5265 6281 6351 97...,79 6194,AU US,en
1873,25500000,"Two brothers, on either side of the law, face ...",12.756344,2013-08-22,2415472,128.0,Released,Crime runs in the family.,6.0,53 80 18,,856 2490 2612 2908 5358 7454 9015 10611 11261 ...,FR US,es en it


(3959, 14)

Now some information comes as numeric like budget. But since values in 
budgets are quite large, we can apply some transformations like log. To 
avoid zeros, we can add 1 to all values.

In [7]:
# 2.4 Transform scale in numeric variables
df_interim['budget_log'] = np.log(df_interim['budget']+1)
df_interim['revenue_log'] = np.log(df_interim['revenue']+1)

About the date, we can use a point of reference, like the year of the 
oldest movie as starting point. 

In [8]:
# 2.5 Transform the date
df_interim['Year_t'] = df_interim['release_date'].apply(lambda x: float(str(x)[0:4]) if (str(x)[0:4])!='' else 2000)
df_interim['Month_t'] = df_interim['release_date'].apply(lambda x: float(str(x)[5:7]) if (str(x)[5:7])!='' else 1)
df_interim = df_interim.drop(columns=['release_date'])
df_interim['Year_diff'] = df_interim['Year_t'] - min(df_interim['Year_t'])

Finally, we create our final dataset

In [9]:
# 2.6 Create final dataset
df = df_interim.drop(['budget','revenue'], axis=1).copy()
display(df.sample(3, random_state=2024))
df.shape

Unnamed: 0,overview,popularity,runtime,status,tagline,vote_average,genres_c,keywords_c,producers_c,countries_c,languages_c,budget_log,revenue_log,Year_t,Month_t,Year_diff
2996,"John Matrix, the former leader of a special co...",34.224204,90.0,Released,Somewhere... somehow... someone's going to pay.,6.4,28 12 53,1930 3568 5600 5905 11107,306 396 1885,US,en,16.118096,17.867296,1985.0,10.0,69.0
2540,A horror comedy based on the ancient legend ab...,31.565117,98.0,Released,You don't want to be on his list.,5.9,27 35 14,657 1442 1991 3373 5570 10794 11183 14755 1479...,33 923,US,de en,16.523561,17.935339,2015.0,11.0,99.0
2992,Musical adaptation of Charles Dickens' Oliver ...,8.305998,153.0,Released,Much Much More Than a Musical!,7.0,18 10751 10402,3430 4344 8250 13014,441 1807 3632,GB,en,16.118096,17.437258,1968.0,9.0,52.0


(3959, 16)

#### **Create the model and train**

In [10]:
# Step 3. Create the model based on the dataset
# 3.1 Split the dataset into training and testing sets
X =  df.drop(columns='vote_average').copy()
y = df['vote_average'].copy()/10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2024)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=2024)
print(f'Dimensions in train: {X_train.shape}, validation: {X_val.shape} and test: {X_test.shape}')

Dimensions in train: (2533, 15), validation: (634, 15) and test: (792, 15)


Now we'll use tensorflow datasets, with the following function:

In [11]:
# 3.2 Create tensorflow datasets with batches
def make_data(X,y):
    # Be careful with nan values, they are not supported by tf.data.Dataset.from_tensor_slices
    # And don't forget there are two parentheses!!!!
    return tf.data.Dataset.from_tensor_slices(
        (
        {
            'genres_c': X['genres_c'],
            'keywords_c': X['keywords_c'],
            'overview': X['overview'],
            'producers_c': X['producers_c'],
            'countries_c': X['countries_c'],
            'languages_c': X['languages_c'],
            'tagline': X['tagline'],
            'scalars': X[['budget_log','revenue_log','popularity','runtime','Year_diff','Month_t']]
        },
        {
            'vote_average':y
        }
        )
    )
train = make_data(X_train, y_train).batch(20)
val = make_data(X_val, y_val).batch(20)
test = make_data(X_test, y_test).batch(20)

2024-03-06 19:01:23.562915: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Max
2024-03-06 19:01:23.562937: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2024-03-06 19:01:23.562943: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2024-03-06 19:01:23.562974: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-03-06 19:01:23.562989: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [12]:
# 3.3 Create support functions for text vectorization
def standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    no_punctuation = tf.strings.regex_replace(lowercase, '[%s]' % re.escape(string.punctuation), '')
    stripped = tf.strings.strip(no_punctuation)
    return stripped

def create_vectorized_layer(train, feature):
    vectorize_layer = TextVectorization(
        standardize=standardization,
        max_tokens=2000,
        output_mode='int',
        output_sequence_length=500)
    vectorize_layer.adapt(train.map(lambda x,y: x[feature]))
    return vectorize_layer

In [13]:
# 3.4 Vectorize text fields
vectorized_genres = create_vectorized_layer(train, 'genres_c')
vectorized_keywords = create_vectorized_layer(train, 'keywords_c')
vectorized_overview = create_vectorized_layer(train, 'overview')
vectorized_producers = create_vectorized_layer(train, 'producers_c')
vectorized_countries = create_vectorized_layer(train, 'countries_c')
vectorized_languages = create_vectorized_layer(train, 'languages_c')
vectorized_tagline = create_vectorized_layer(train, 'tagline')

2024-03-06 19:01:23.681954: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


In [14]:
# 3.5 Create support functions for inputs
def create_string_input(name):
    return keras.Input(
        shape=(1,),
        name = name,
        dtype = 'string'
    ) 

def create_numeric_input(name):
    return keras.Input(
        shape=(6,),
        name="scalars",
        dtype="float64"
    )

In [15]:
# 3.6 Create inputs
genres_input = create_string_input('genres_c')
keywords_input = create_string_input('keywords_c')
overview_input = create_string_input('overview')
producers_input = create_string_input('producers_c')
countries_input = create_string_input('countries_c')
languages_input = create_string_input('languages_c')
tagline_input = create_string_input('tagline')
scalar_input = create_numeric_input('scalars')

In [16]:
# 3.7 Create individual neural network architectures
# Genres
genres_features = vectorized_genres(genres_input)
genres_features = layers.Embedding(2000, 3, name='embedding_genres')(genres_features)
genres_features = layers.Dropout(0.2)(genres_features)
genres_features = layers.GlobalAveragePooling1D()(genres_features)
genres_features = layers.Dropout(0.2)(genres_features)
genres_features = layers.Dense(32, activation='sigmoid')(genres_features)
# Keywords
keywords_features = vectorized_genres(keywords_input)
keywords_features = layers.Embedding(2000, 3, name='embedding_keywords')(keywords_features)
keywords_features = layers.Dropout(0.2)(keywords_features)
keywords_features = layers.GlobalAveragePooling1D()(keywords_features)
keywords_features = layers.Dropout(0.2)(keywords_features)
keywords_features = layers.Dense(32, activation='sigmoid')(keywords_features)
# Overview
overview_features = vectorized_genres(overview_input)
overview_features = layers.Embedding(2000, 3, name='embedding_overview')(overview_features)
overview_features = layers.Dropout(0.2)(overview_features)
overview_features = layers.GlobalAveragePooling1D()(overview_features)
overview_features = layers.Dropout(0.2)(overview_features)
overview_features = layers.Dense(32, activation='sigmoid')(overview_features)
# Producers
producers_features = vectorized_genres(producers_input)
producers_features = layers.Embedding(2000, 3, name='embedding_producers')(producers_features)
producers_features = layers.Dropout(0.2)(producers_features)
producers_features = layers.GlobalAveragePooling1D()(producers_features)
producers_features = layers.Dropout(0.2)(producers_features)
producers_features = layers.Dense(32, activation='sigmoid')(producers_features)
# Countries
countries_features = vectorized_genres(countries_input)
countries_features = layers.Embedding(2000, 3, name='embedding_countries')(countries_features)
countries_features = layers.Dropout(0.2)(countries_features)
countries_features = layers.GlobalAveragePooling1D()(countries_features)
countries_features = layers.Dropout(0.2)(countries_features)
countries_features = layers.Dense(32, activation='sigmoid')(countries_features)
# Languages
languages_features = vectorized_genres(languages_input)
languages_features = layers.Embedding(2000, 3, name='embedding_languages')(languages_features)
languages_features = layers.Dropout(0.2)(languages_features)
languages_features = layers.GlobalAveragePooling1D()(languages_features)
languages_features = layers.Dropout(0.2)(languages_features)
languages_features = layers.Dense(32, activation='sigmoid')(languages_features)
# Tagline
tagline_features = vectorized_tagline(tagline_input)
tagline_features = layers.Embedding(2000, 3, name='embedding_tagline')(tagline_features)
tagline_features = layers.Dropout(0.2)(tagline_features)
tagline_features = layers.GlobalAveragePooling1D()(tagline_features)
tagline_features = layers.Dropout(0.2)(tagline_features)
tagline_features = layers.Dense(32, activation='sigmoid')(tagline_features)
# Scalars
scalar_features = layers.Dense(32, activation='sigmoid')(scalar_input)

In [17]:
# 3.8 Create main architecture
main = layers.concatenate([genres_features, keywords_features, 
                           overview_features, producers_features,
                           countries_features, languages_features,
                           tagline_features, scalar_features])
main = layers.Dense(32)(main)
output = layers.Dense(1, name='vote_average', activation='sigmoid')(main)

model = keras.Model(
    inputs=[genres_input, keywords_input,
            overview_input, producers_input,
            countries_input, languages_input,
            tagline_input, scalar_input],
    outputs=output)

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 genres_c (InputLayer)       [(None, 1)]                  0         []                            
                                                                                                  
 keywords_c (InputLayer)     [(None, 1)]                  0         []                            
                                                                                                  
 overview (InputLayer)       [(None, 1)]                  0         []                            
                                                                                                  
 producers_c (InputLayer)    [(None, 1)]                  0         []                            
                                                                                              

In [18]:
# 3.9 Train the model
model.compile(
    optimizer='adam',
    loss='mse'
)

history = model.fit(
    train,
    validation_data=val,
    epochs=50)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


In [19]:
# 3.10 Calculate metrics
rmse = model.evaluate(test)
print(f'RMSE for test: {rmse:.5f}')

RMSE for test: 0.00720


### References
[1] https://github.com/PhilChodrow/PIC16B/blob/7d12d32e070e7ff3840b971c0ce4185ef1911796/discussion/tmdb.ipynb#L758