### **TMDB Score Prediction** - Regression with deep learning
- **Date**: Mar 6, 2024  
- **Task**: Create a model to predict a movie's score from text and numeric inputs 
- **Procedure**: Analyze the data with pandas, then build a neural network model in TensorFlow
- **Dataset source**: https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format/data   
- **References**: https://github.com/PhilChodrow/PIC16B/blob/7d12d32e070e7ff3840b971c0ce4185ef1911796/discussion/tmdb.ipynb#L758

In [2]:
# Step 0. Load libraries and custom functions
# Matrices and datasets ------------------------------------------------
import pandas as pd
import numpy as np
# Graphics -------------------------------------------------------------
import matplotlib.pyplot as plt
import seaborn as sns
# Text processors
import re
import string
#import nltk
#from nltk.corpus import stopwords
#nltk.download('stopwords')
from wordcloud import WordCloud
# Machine Learning -----------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer
# Deep Learning --------------------------------------------------------
import keras
import tensorflow as tf
from keras import layers
from keras.layers import TextVectorization
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [3]:
# Step 1. Load data
# 1.1 Read csv and get basic info
df_raw = pd.read_csv('../data/02_TMDB_5000_movies.csv')
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [4]:
# 1.2 Get a sample
df_raw.sample(10, random_state=2024)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
2182,0,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",,13682,[],en,Pooh's Heffalump Movie,Who or what exactly is a Heffalump? The lovabl...,9.03154,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2005-02-11,0,68.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,There's something new in the Hundred Acre Wood.,Pooh's Heffalump Movie,6.4,88
3274,8000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 53, ""nam...",,13154,"[{""id"": 1794, ""name"": ""yakuza""}, {""id"": 12670,...",en,Showdown in Little Tokyo,"An American with a Japanese upbringing, Chris ...",8.403859,"[{""name"": ""Original Pictures"", ""id"": 4234}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1991-08-23,2275557,79.0,"[{""iso_639_1"": ""ja"", ""name"": ""\u65e5\u672c\u8a...",Released,One's a warrior. One's a wise guy. They're two...,Showdown in Little Tokyo,5.7,95
1003,49000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 35, ""nam...",,9548,"[{""id"": 578, ""name"": ""rock and roll""}, {""id"": ...",en,The Adventures of Ford Fairlane,"Ford ""Mr. Rock n' Roll Detective"" Fairlane is ...",2.808428,"[{""name"": ""Twentieth Century Fox Film Corporat...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1990-07-11,20423389,104.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Kojak. Columbo. Dirty Harry. Wimps.,The Adventures of Ford Fairlane,6.2,71
1383,32000000,"[{""id"": 18, ""name"": ""Drama""}]",,13920,"[{""id"": 5565, ""name"": ""biography""}, {""id"": 605...",en,Radio,"High school football coach, Harold Jones befri...",9.254647,"[{""name"": ""Revolution Studios"", ""id"": 497}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2003-10-24,52277485,109.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,His courage made them champions.,Radio,6.8,141
2724,18339750,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 36, ""name...",http://www.downfallthefilm.com/,613,"[{""id"": 220, ""name"": ""berlin""}, {""id"": 351, ""n...",de,Der Untergang,"In April of 1945, Germany stands at the brink ...",32.445895,"[{""name"": ""Degeto Film"", ""id"": 986}, {""name"": ...","[{""iso_3166_1"": ""AT"", ""name"": ""Austria""}, {""is...",2004-09-08,92180910,156.0,"[{""iso_639_1"": ""hu"", ""name"": ""Magyar""}, {""iso_...",Released,"April 1945, a nation awaits its...Downfall",Downfall,7.7,1037
3340,7000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 10749, ""n...",,713,"[{""id"": 128, ""name"": ""love triangle""}, {""id"": ...",en,The Piano,"After a long voyage from Scotland, pianist Ada...",17.681707,"[{""name"": ""New South Wales Film & Television O...","[{""iso_3166_1"": ""NZ"", ""name"": ""New Zealand""}, ...",1993-05-19,116700000,121.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,,The Piano,7.1,281
463,0,"[{""id"": 10749, ""name"": ""Romance""}, {""id"": 18, ...",,161795,"[{""id"": 9673, ""name"": ""love""}, {""id"": 14638, ""...",en,Déjà Vu,L.A. shop owner Dana and Englishman Sean meet ...,0.605645,"[{""name"": ""Rainbow Film Company, The"", ""id"": 2...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1998-04-22,0,117.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Your future is set...,Déjà Vu,8.0,1
4168,0,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 53, ""name...",,356987,"[{""id"": 230912, ""name"": ""supervivencia""}]",en,Abandoned,When their yacht capsizes during a storm; four...,3.068463,"[{""name"": ""Making Movies"", ""id"": 71702}]","[{""iso_3166_1"": ""NZ"", ""name"": ""New Zealand""}]",2015-08-30,0,82.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,Abandoned,5.8,27
4057,2160000,"[{""id"": 18, ""name"": ""Drama""}]",,43610,[],en,The Valley of Decision,Mary Rafferty comes from a poor family of stee...,0.1813,"[{""name"": ""Metro-Goldwyn-Mayer (MGM)"", ""id"": 8...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1945-06-01,9132000,119.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,The Valley of Decision,5.8,4
4456,800000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 10749, ""n...",http://www.lhp.com.sg/victor/,25461,"[{""id"": 10183, ""name"": ""independent film""}]",en,Raising Victor Vargas,"The film follows Victor, a Lower East Side tee...",3.643662,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2002-05-16,2816116,88.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,,Raising Victor Vargas,7.8,13


Some columns contain nested JSON data, and others contain unique 
identifiers such as ids or names, so let's transform the dataset. 

In [5]:
# Step 2. Preprocess data
# 2.1 Create an interim dataset for transformations, drop unused columns and NAs
df_interim = df_raw.copy()
df_interim = df_interim.drop(columns=['id','original_title','title','vote_count','original_language','homepage'])
df_interim = df_interim.dropna()

Now we can concatenate the id values stored in each JSON-formatted column
into a space-separated id collection and process it for prediction.

In [6]:
# 2.2 Concatenate strings in JSON format and drop the original columns
import json  # json.loads parses the strings as data instead of executing them like eval()
df_interim['genres_c'] = df_interim['genres'].apply(lambda x: ' '.join([str(y['id']) for y in json.loads(x)]))
df_interim['keywords_c'] = df_interim['keywords'].apply(lambda x: ' '.join([str(y['id']) for y in json.loads(x)]))
df_interim['producers_c'] = df_interim['production_companies'].apply(lambda x: ' '.join([str(y['id']) for y in json.loads(x)]))
df_interim['countries_c'] = df_interim['production_countries'].apply(lambda x: ' '.join([str(y['iso_3166_1']) for y in json.loads(x)]))
df_interim['languages_c'] = df_interim['spoken_languages'].apply(lambda x: ' '.join([str(y['iso_639_1']) for y in json.loads(x)]))
df_interim = df_interim.drop(columns=['genres','keywords','production_companies','production_countries','spoken_languages'])

In [7]:
# 2.3 View results and current shape
display(df_interim.sample(2))
df_interim.shape

Unnamed: 0,budget,overview,popularity,release_date,revenue,runtime,status,tagline,vote_average,genres_c,keywords_c,producers_c,countries_c,languages_c
3384,7000000,The lives of three women have a commonality: a...,5.517597,2009-11-07,0,125.0,Released,,6.7,18 10749,8018 9838 10707,1092 10892,ES US,en
1766,27000000,"In her many years as a social worker, Emily Je...",23.917982,2009-08-13,29000000,109.0,Released,Some cases should never be opened.,6.1,27 9648 53,516 703 2438 6152 9826 11857 14751 14819 15017...,838 10039 11581 11582,CA US,en


(4803, 14)
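
As a quick check of step 2.2, the same concatenation can be tried on a single value. This is a minimal sketch: the sample string follows the `genres` pattern shown earlier but is made up for illustration, `join_ids` is a hypothetical helper name, and `json.loads` is used to parse the string.

```python
import json

def join_ids(raw, key="id"):
    """Join the `key` field of every object in a JSON list into one string."""
    return " ".join(str(item[key]) for item in json.loads(raw))

# Illustrative inputs (not rows from the dataset)
sample_genres = '[{"id": 16, "name": "Animation"}, {"id": 18, "name": "Drama"}]'
sample_empty = "[]"

genres_joined = join_ids(sample_genres)
empty_joined = join_ids(sample_empty)
print(repr(genres_joined), repr(empty_joined))
```

An empty list becomes an empty string rather than a missing value, which is why `keywords_c` can be blank for some rows.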

Some of the information, such as the budget, is numeric. Since budget 
values are quite large, we can apply a log transformation. To avoid 
taking the log of zero, we add 1 to every value first.

In [8]:
# 2.4 Transform scale in numeric variables
df_interim['budget_log'] = np.log(df_interim['budget']+1)
df_interim['revenue_log'] = np.log(df_interim['revenue']+1)
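
The effect of the shift can be seen on a few values. The sketch below uses made-up budgets (only 800000 and 8000000 appear in the sample above) and compares `np.log(x + 1)` with NumPy's equivalent `np.log1p(x)`.

```python
import numpy as np

# Illustrative budgets, including a zero that plain np.log could not handle
budgets = np.array([0, 800_000, 8_000_000], dtype=float)

log_plus_one = np.log(budgets + 1)   # the transformation used in step 2.4
log1p_values = np.log1p(budgets)     # same result, computed in one call

print(log_plus_one)
print(np.allclose(log_plus_one, log1p_values))
```

A zero budget maps to exactly 0.0, so movies with no reported budget stay at the bottom of the scale instead of producing `-inf`.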

For the date, we can use a point of reference, such as the release year 
of the oldest movie, as the starting point. 

In [9]:
# 2.5 Transform the date
df_interim['Year_t'] = df_interim['release_date'].apply(lambda x: float(str(x)[0:4]) if (str(x)[0:4])!='' else 2000)
df_interim['Month_t'] = df_interim['release_date'].apply(lambda x: float(str(x)[5:7]) if (str(x)[5:7])!='' else 1)
df_interim = df_interim.drop(columns=['release_date'])
df_interim['Year_diff'] = df_interim['Year_t'] - min(df_interim['Year_t'])
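
The string slicing in step 2.5 can also be expressed with pandas' datetime parsing. This is a sketch under assumptions: the three dates are illustrative (two of them appear in the sample above), and the fallback values 2000 and 1 mirror the defaults used in the cell above.

```python
import pandas as pd

# Illustrative release dates, including one missing value
dates = pd.Series(["2005-02-11", "1991-08-23", None])

# Unparseable or missing dates become NaT instead of raising an error
parsed = pd.to_datetime(dates, errors="coerce")

year = parsed.dt.year.fillna(2000).astype(float)   # default year for missing dates
month = parsed.dt.month.fillna(1).astype(float)    # default month for missing dates
year_diff = year - year.min()                      # years since the oldest movie

print(pd.DataFrame({"Year_t": year, "Month_t": month, "Year_diff": year_diff}))
```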

Finally, we build the dataset used for modeling.

In [11]:
# 2.6 Create final dataset
df = df_interim.drop(['budget','revenue'], axis=1).copy()
display(df.sample(3, random_state=2024))
df.shape

Unnamed: 0,overview,popularity,runtime,status,tagline,vote_average,genres_c,keywords_c,producers_c,countries_c,languages_c,budget_log,revenue_log,Year_t,Month_t,Year_diff
2182,Who or what exactly is a Heffalump? The lovabl...,9.03154,68.0,Released,There's something new in the Hundred Acre Wood.,6.4,16 10751,,2,US,en,0.0,0.0,2005.0,2.0,89.0
3274,"An American with a Japanese upbringing, Chris ...",8.403859,79.0,Released,One's a warrior. One's a wise guy. They're two...,5.7,28 53,1794 12670 18098,4234 6194,US,ja en,15.894952,14.637736,1991.0,8.0,75.0
1003,"Ford ""Mr. Rock n' Roll Detective"" Fairlane is ...",2.808428,104.0,Released,Kojak. Columbo. Dirty Harry. Wimps.,6.2,28 35 53 80 9648,578 837 2570 5540 9826 155790,306 1885,US,en,17.707331,16.832191,1990.0,7.0,74.0


(4803, 16)

#### **Create the model and train**

In [12]:
# Step 3. Create the model based on the dataset
# 3.1 Split the dataset into training and testing sets
X =  df.drop(columns='vote_average').copy()
y = df['vote_average'].copy()/10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2024)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=2024)
print(f'Dimensions in train: {X_train.shape}, validation: {X_val.shape} and test: {X_test.shape}')

Dimensions in train: (3073, 15), validation: (769, 15) and test: (961, 15)
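
Applying `test_size=0.2` twice does not give an 80/20 split of the training data overall. The arithmetic below is a sketch that reproduces the printed shapes, assuming the held-out size is rounded up at each split (which matches the output above).

```python
import math

n_rows = 4803
test_fraction = 0.2

# First split: hold out 20% of all rows for testing
n_test = math.ceil(n_rows * test_fraction)
n_remaining = n_rows - n_test

# Second split: hold out 20% of the remaining rows for validation
n_val = math.ceil(n_remaining * test_fraction)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
print([round(n / n_rows, 2) for n in (n_train, n_val, n_test)])
```

The resulting proportions are roughly 64% train, 16% validation and 20% test.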


In [28]:
X_train

Unnamed: 0,overview,popularity,runtime,status,tagline,genres_c,keywords_c,producers_c,countries_c,languages_c,budget_log,revenue_log,Year_t,Month_t,Year_diff
2088,When their computer hacker friend accidentally...,9.662715,90.0,Released,You are now infected.,27 53,236 2157 4959 6614 9714,7405,US,en pl,17.453097,17.213626,2006.0,8.0,90.0
4660,Give Me Shelter is a documentary to raise awar...,0.278981,90.0,Released,,99,196315,,US,en,0.000000,0.000000,2014.0,6.0,98.0
2544,"On a bet, a gridiron hero at John Hughes High ...",23.423082,89.0,Released,They served you Breakfast. They gave you Pie. ...,35,240 2283 5091 6270 6275 9755 9986 10791 166229...,333 441 2882,US,en,16.588099,18.012236,2001.0,12.0,85.0
1553,Two homicide detectives are on a desperate hun...,79.579532,127.0,Released,Seven deadly sins. Seven ways to die.,80 9648 53,476 703 1470 2231 3597 3857 3927 3932 4138 414...,12 4286 65394,US,en,17.312018,19.606424,1995.0,9.0,79.0
1279,"Alex Rider thinks he is a normal school boy, u...",16.282962,93.0,Released,Rule the school. Save the world.,12 28 10751,392 3272 3650 4391 213102 223438,2268 7289 22514,DE GB US,en,17.504390,16.990972,2006.0,7.0,90.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3735,"Set in 1954, a group of Florida high schoolers...",12.891276,94.0,Released,Keep an eye out for the funniest movie about g...,35,255 293 572 1196 2389 2483 6593 10250 10873 11...,2124 10313,US CA,en,15.201805,18.649633,1981.0,11.0,65.0
417,"Inspired by the E.C. comics of the 1950s, Geor...",13.661289,120.0,Released,The Most Fun You'll Ever Have... BEING SCARED!,27 35 14,1299 3335 5404 7172 7325 9706 9717 10292 11321...,6194 14179 14180,US,en it,15.894952,16.861401,1982.0,11.0,66.0
3550,"This Canadian made comedy/drama, set in Hamilt...",1.688495,95.0,Released,He's hoping for a miracle. He doesn't have a p...,35 18,6075 10183,803 15689 64570,CA,en,0.000000,0.000000,2004.0,9.0,88.0
2951,"In Greenwich Village in the early 1960s, gifte...",36.319205,105.0,Released,,18 10402,6382 6706 10228 14512 159944 193419 208992,694 5490 20664,US FR GB,en,16.213406,17.310056,2013.0,10.0,97.0


Now we'll build TensorFlow datasets from these splits with the following function:

In [52]:
X_train['overview'].filter(items=[2088]).to_list()

['When their computer hacker friend accidentally channels a mysterious wireless signal, a group of co-eds rally to stop a terrifying evil from taking over the world.']

In [46]:
def make_data(X,y):
    return tf.data.Dataset.from_tensor_slices(
        {
            'genres_c': X['genres_c'],
            'keywords_c': X['keywords_c'],
            'overview': X['overview'],
            #'producers_c': X['producers_c'],
            #'countries_c': X['countries_c'],
            #'languages_c': X['languages_c'],
            #'tagline': X['tagline'],
            #'scalars': X[['budget_log','revenue_log','popularity','runtime','Year_diff','Month_t']]
        }
    )
train = make_data(X_train, y_train)
val = make_data(X_val, y_val)
test = make_data(X_test, y_test)

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
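
The `Unsupported object type float` message usually points to a text column that mixes Python strings with float `NaN` values. `df_raw.info()` above reports 4800 non-null `overview` entries out of 4803, so that column is a likely candidate. Here is a minimal sketch of how to check for this, using a small stand-in DataFrame (`X_demo` is hypothetical, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Stand-in for X_train: one clean text column and one with a missing value
X_demo = pd.DataFrame({
    "genres_c": ["16 10751", "18"],
    "overview": ["Who or what exactly is a Heffalump?", np.nan],
})

# Count the Python types stored in each column; a 'float' entry flags NaN
type_counts = {
    col: X_demo[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    for col in X_demo.columns
}
missing_per_column = X_demo.isna().sum().to_dict()

print(type_counts)
print(missing_per_column)
```

Filling the gaps with an empty string (`fillna("")`) before building the dataset is one way to make such columns convertible.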

In [37]:
y_train

2088    0.50
4660    0.00
2544    0.55
1553    0.81
1279    0.51
        ... 
3735    0.61
417     0.67
3550    0.73
2951    0.72
1435    0.60
Name: vote_average, Length: 3073, dtype: float64

In [53]:
def make_data(X):
    return tf.data.Dataset.from_tensor_slices(
        {
            "overview": X["overview"]
        }
    )
train = make_data(X_train)

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
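
One possible fix, assuming missing text values are the cause of the error: fill them with empty strings before slicing. The sketch below also returns the labels alongside the features, since the first `make_data` above accepts `y` but never uses it. The `text_cols` parameter and the `X_demo`/`y_demo` stand-ins are hypothetical, added only for illustration.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

def make_data(X, y, text_cols=("genres_c", "keywords_c", "overview")):
    """Build a dataset of (features dict, label) pairs from text columns."""
    features = {
        # Replace NaN with "" so every entry is a string before conversion
        col: X[col].fillna("").astype(str).to_numpy()
        for col in text_cols
    }
    labels = y.to_numpy(dtype="float32")
    return tf.data.Dataset.from_tensor_slices((features, labels))

# Stand-in data with a missing overview, mimicking the failing case
X_demo = pd.DataFrame({
    "genres_c": ["16 10751", "99"],
    "keywords_c": ["", "196315"],
    "overview": ["Who or what exactly is a Heffalump?", np.nan],
})
y_demo = pd.Series([0.64, 0.0])

demo = make_data(X_demo, y_demo)
print(demo.element_spec)
```

Whether an empty string is the right stand-in for a missing overview is a modeling choice; dropping those rows before the split is an alternative.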

In [33]:
y_train

2088    0.50
4660    0.00
2544    0.55
1553    0.81
1279    0.51
        ... 
3735    0.61
417     0.67
3550    0.73
2951    0.72
1435    0.60
Name: vote_average, Length: 3073, dtype: float64

In [32]:
tf.data.Dataset.from_tensor_slices({'genres_c':X_train['genres_c']})

<_TensorSliceDataset element_spec={'genres_c': TensorSpec(shape=(), dtype=tf.string, name=None)}>

### References
[1] https://github.com/PhilChodrow/PIC16B/blob/7d12d32e070e7ff3840b971c0ce4185ef1911796/discussion/tmdb.ipynb#L758