# Predictions on the TMDB Dataset: Vote Average Classification

This machine learning project predicts properties of movies in the TMDB datasets. The objective of this notebook is to classify a movie's vote average using several machine learning models.

Some columns contain non-numerical data that could be very useful for finding patterns
in the data, so these features must be encoded before a model can use them. Features such
as genres and production companies contain lists of elements; `MultiLabelBinarizer` from
sklearn's preprocessing module converts each list into a one-hot representation. This does,
however, create a new column for each unique label, which can greatly increase the size of
the dataframe and heavily impact training time.
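
As a small illustration of the encoding described above, here is a minimal sketch on toy data. The dataframe `toy` and its rows are made up for the example; only `MultiLabelBinarizer` itself comes from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data standing in for the parsed 'genres' column (hypothetical rows).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Action", "Drama"], ["Comedy"], ["Action", "Comedy"]],
})

mlb_demo = MultiLabelBinarizer()
onehot = pd.DataFrame(
    mlb_demo.fit_transform(toy["genres"]),   # one 0/1 column per unique label
    columns=mlb_demo.classes_,
    index=toy.index,
)

# The list column is replaced by one column per label, so the frame gets wider
# as the label vocabulary grows.
toy_encoded = toy.drop(columns="genres").join(onehot)
print(toy_encoded)
```

With three distinct genres the frame gains three columns; with thousands of distinct cast or crew names it would gain thousands, which is the size problem mentioned above.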

We had the idea of one-hot encoding the genres, using PCA to reduce the resulting number of columns (dimensions), and then feeding the reduced features into the models or the neural network. However, given the time available, we could not do that.

For this reason, when using any model
containing cast or crew we will limit them to the first five elements in an attempt to keep the
dataset uncluttered.
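
The PCA idea mentioned above was not implemented in this notebook. The following is only a sketch of how it might look, run on randomly generated labels; the label names, the number of rows and the choice of 5 components are all assumptions made for the example.

```python
import random
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer

rnd = random.Random(0)
labels = [f"genre_{k}" for k in range(20)]   # hypothetical label names

# Each toy "movie" gets between 1 and 4 distinct random labels.
rows = [rnd.sample(labels, rnd.randint(1, 4)) for _ in range(200)]

onehot = MultiLabelBinarizer().fit_transform(rows)   # dense 0/1 matrix

# Project the one-hot columns onto 5 principal components.
pca = PCA(n_components=5, random_state=0)
reduced = pca.fit_transform(onehot)

print(onehot.shape, "->", reduced.shape)
print("explained variance kept:", pca.explained_variance_ratio_.sum())
```

The explained variance ratio gives a rough idea of how much information the reduced columns retain, which would help when choosing the number of components.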

### Predicting vote average based on classification.

For vote average, we deleted rows where the value was zero.
We predicted vote average from revenue, popularity, vote_count and genres. To find meaningful relationships we had to encode the genres: the original data is in JSON format, so we first converted it into an array of genres and then encoded that array. We encoded only the genres, since encoding every list-valued feature would create a very large number of columns.
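
To make the classification target concrete, here is a minimal sketch on made-up values of how a continuous vote average can be turned into class labels: unrated rows are dropped, the score is rounded, and rare classes are filtered out. The notebook later uses a threshold of 10 rows per class; `min_count = 2` here is only so the toy data shows the effect.

```python
import pandas as pd

# Hypothetical scores; 0.0 marks an unrated movie.
toy = pd.DataFrame({"vote_average": [0.0, 6.4, 6.6, 7.2, 5.8, 0.0, 8.9]})

# Drop unrated movies (a vote_average of 0).
rated = toy[toy["vote_average"] != 0].copy()

# Round to the nearest integer to obtain a class label.
rated["vote_round"] = rated["vote_average"].round().astype("int32")

# Keep only classes that have at least `min_count` rows.
min_count = 2
counts = rated["vote_round"].map(rated["vote_round"].value_counts())
rated = rated[counts >= min_count]
print(rated)
```

The single movie rounded to 9 is removed because its class has only one member, which is the same reasoning the notebook applies to sparsely populated score classes.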

We tried several classification methods: KNN, Random Forest, Gaussian Naive Bayes, Gaussian Mixtures combined with support vector machines, and a neural network.



## Plan

This is the plan that we are going to follow.

1. Read the CSV files.
2. Remove columns that have little correlation with the target or that we don't need.
3. Check for null values and handle them, either by filling them with the median or mean or, in the worst case, by dropping them.
4. Define X and y, and use cross-validation.
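
Steps 3 and 4 of the plan can be sketched as follows. This is an illustration on randomly generated data, not the notebook's actual pipeline: the frame `toy`, the injected missing values and the choice of a 5-fold split with KNN are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical toy frame; column names mirror the TMDB ones for readability.
toy = pd.DataFrame({
    "popularity": rng.normal(size=100),
    "vote_count": rng.normal(size=100),
})
toy["label"] = (toy["popularity"] + toy["vote_count"] > 0).astype(int)
toy.loc[[3, 10, 42], "popularity"] = np.nan   # inject some missing values

# Step 3: locate nulls, then fill them with the column median.
print(toy.isnull().sum())
toy["popularity"] = toy["popularity"].fillna(toy["popularity"].median())

# Step 4: define X and y and estimate accuracy with 5-fold cross-validation.
X, y = toy[["popularity", "vote_count"]], toy["label"]
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("fold accuracies:", scores)
```

Cross-validation reports one accuracy per fold, which gives a more stable estimate than a single train/test split.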

In [1]:
import json
import numpy as np
import pandas as pd
import os
import sys
import matplotlib.pyplot as plt
import matplotlib as mlp
from sklearn.preprocessing import MultiLabelBinarizer
from keras.utils import to_categorical
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

## Encoding Non-numerical Features 


We are going to encode the non-numerical features such as crew, cast and genres.
We are going to use one-hot encoding; however, this creates many columns, so we keep only the first five elements of the cast, crew and production company lists.

Before we can encode the data, we first need to convert the JSON data from the dataset to an appropriate format for one-hot encoding.

The convert() function below converts the JSON-formatted data in the dataframe to Python lists.
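
Before the full function, here is a minimal sketch of the core idea on a single cell value. The helper `names_from_json` and its `limit` parameter are hypothetical and are not part of the notebook; unlike convert(), this sketch does not sort the names or strip spaces from them.

```python
import json

def names_from_json(raw, limit=None):
    """Parse a TMDB-style JSON string and return the 'name' of each entry.

    `limit` (hypothetical parameter) keeps only the first `limit` names.
    """
    items = json.loads(raw)                      # list of {"id": ..., "name": ...}
    names = [item["name"] for item in items]
    return names if limit is None else names[:limit]

raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(names_from_json(raw_genres))           # ['Action', 'Adventure']
print(names_from_json(raw_genres, limit=1))  # ['Action']
```

Applied to a whole column, something like `df[c].apply(names_from_json)` would yield list-valued cells that `MultiLabelBinarizer` can consume directly.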

In [None]:
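def encode_list(dataf, feature):
    # One-hot encode a list-valued column. Relies on the global `mlb`
    # (a MultiLabelBinarizer with sparse_output=True) being defined before the call.
    # pop() removes `feature` from dataf in place, so the result holds only the encoded columns.
    encoded_matrix = mlb.fit_transform(dataf.pop(feature))
    enc_df = dataf.join(pd.DataFrame.sparse.from_spmatrix(encoded_matrix,
                                                     index = dataf.index,
                                                     columns = mlb.classes_))
    return enc_df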

def convert(df, columns): 
    for c in columns:
        # Convert json format to python list
        df[c]=df[c].apply(json.loads)
        
        # Keep only the first 5 names for cast, crew and production companies
        if c in ('cast', 'crew', 'production_companies'):
            for index,i in zip(df.index,df[c]):
                limit = min(5, len(i))   # never index past the end of a short list
                
                temp_list=[]
                for j in range(limit):
                    # Json format of 'id' & 'name'
                    temp_list.append((i[j]['name'])) 
                df.loc[index,c]= str(temp_list)

        # For any other columns
        else:    
            for index,i in zip(df.index,df[c]):
                temp_list=[]
                for j in range(len(i)):
                    temp_list.append((i[j]['name'])) 
                df.loc[index,c]= str(temp_list)
    
         
        df[c] = df[c].str.strip('[]')       # Remove Sqr Brackets
        df[c] = df[c].str.replace(' ','')   # Remove empty space 
        df[c] = df[c].str.replace("'",'')   # Remove quotations
        df[c] = df[c].str.split(',')        # Format into list
        
        # Sort elements 
        for i,j in zip(df[c],df.index):
            temp_list = i
            temp_list.sort()
            df.loc[j,c]=str(temp_list)
            
        df[c] = df[c].str.strip('[]')       
        df[c] = df[c].str.replace(' ','')    
        df[c] = df[c].str.replace("'",'')   
       
        # Format into list; an originally empty list becomes ['']
        df[c] = df[c].str.split(',')
            
    return df



## Loading  the dataset 

We are going to load the dataset by merging the two files on movie id.
Then we will remove the columns that are not necessary for our analysis.

In [None]:
path1 = "./"
filename_read = os.path.join(path1,"tmdb_5000_movies.csv")
movie = pd.read_csv(filename_read,na_values=['NA','?'])

path2 = "./"
filename_read = os.path.join(path2,"tmdb_5000_credits.csv")
credit = pd.read_csv(filename_read,na_values=['NA','?'])

movies = movie.merge(credit, left_on='id', right_on='movie_id', how='left')

movies = movies.drop(columns=['homepage','original_language','title_y', 'title_x',
                              'overview','production_countries','release_date',
                              'spoken_languages','status','tagline'])


test = convert(movies,  ['genres', 'keywords', 'production_companies', 'cast', 'crew'])

### Cleaning data to predict classification

We are cleaning the data for classification and removing entries with a score of 0.
We then try to delete rows whose cast, crew and production companies columns are all empty.

In [None]:
# Removing entries with empty cast/crew/companies
drop = []
for i in test.index:
    if (test['production_companies'][i] == [''] and test['cast'][i] == [''] and 
       test['crew'][i] == ['']): 
        drop.append(i)
test = test.drop(drop, axis = 0)

# Removing entries with a score of 0
mask_avg = (test['vote_average'] != 0)
test = test[mask_avg]
temp = test.copy()

test.shape
test.dtypes

temp['vote_round'] = temp.vote_average.round()
temp = temp.drop(columns = 'vote_average')

countCol = temp.groupby("vote_round")["vote_round"].transform(len)
mask = (countCol >= 10)
temp = temp[mask]

temp.vote_round.value_counts()

mlb = MultiLabelBinarizer(sparse_output=True) 
enc_df = encode_list(temp, feature = 'genres')




encoded = enc_df.copy()

encoded.dtypes
encoded.shape
# Remove unecessary columns
encoded = encoded.drop(columns=['id', 'keywords', 'original_title','movie_id',
                                'production_companies', 'cast', 'crew', 'runtime']) 
encoded.isnull().any()
# Converting datatypes for encoded genres (one column per label learned by mlb)
for name in mlb.classes_:
    encoded[name] = np.asarray(encoded[name]).astype('float32')

U = encoded.drop(columns=['vote_round'])
v = encoded['vote_round'].astype('int32')

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(v)
onehot_encoder = OneHotEncoder()   # default sparse output, densified below
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_v = onehot_encoder.fit_transform(integer_encoded).toarray()   # dense one-hot targets

U.shape
v.shape
onehot_v.shape
onehot_v[:10]
v[:10]

encoded['vote_round'].value_counts()

v_flat = v.values.ravel()
v_flat = to_categorical(v_flat) 
v_flat.shape

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(U, onehot_v, test_size=0.25, random_state=5)

X_train.shape
y_train.shape

X_test.shape
y_test.shape
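
The target encoding in the cell above maps each rounded score to a class index and then to a one-hot row. Here is a minimal sketch of that mapping on made-up scores; `np.eye` is swapped in for `OneHotEncoder` only to keep the example short.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

votes = np.array([6, 7, 5, 7, 8, 6])   # toy rounded vote averages

le = LabelEncoder()
class_idx = le.fit_transform(votes)    # 5->0, 6->1, 7->2, 8->3

# One row per sample, one column per class.
onehot = np.eye(len(le.classes_))[class_idx]

# argmax recovers the class index; inverse_transform maps it back to the score.
recovered = le.inverse_transform(onehot.argmax(axis=1))
print(recovered)
```

This matters when reading predictions later: the argmax of a network output is a class index, and `inverse_transform` is what turns it back into a rounded vote average.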

## Neural Network

In [None]:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.optimizers import SGD, Adam

In [None]:
############################################## NEURAL NETWORK ###################################################

from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import SGD

callbacks = [EarlyStopping(monitor='loss', patience=2),
             ModelCheckpoint(filepath='best_model.h5', monitor='loss', save_best_only=True)]

opt = SGD(learning_rate = 0.01, momentum=0.9, decay=0.01)

model = Sequential()
model.add(Dense(25, input_dim = U.shape[1], activation = 'sigmoid')) #Input Layer -> Hidden Layer 1
model.add(Dense(17, activation = 'sigmoid')) # Hidden Layer 1 -> Hidden Layer 2
model.add(Dense(y_train.shape[1], activation = 'softmax')) # Hidden Layer 2 -> Output Layer (one unit per class)
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy']) # Choosing Loss & Optimizer 
net = model.fit(X_train, y_train, verbose = 2, epochs = 200)#, callbacks = callbacks, batch_size = 64) # Training data
pred = model.predict(X_test)  # make predictions 
pred = np.argmax(pred,axis=1) # now pick the most likely outcome
y_compare = np.argmax(y_test,axis=1) 

#calculate accuracy
score = metrics.accuracy_score(y_compare, pred) 
print("Accuracy score: {}".format(score))

for key, value in net.history.items() :
    print (key)
    
plt.plot(net.history['loss'])
#plt.plot(net.history['val_loss'])
plt.title('Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()


plt.plot(net.history['accuracy'])
plt.title('Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['accuracy'], loc='upper left')
plt.show()

In [None]:
# Plotting of Confusion Matrix
from sklearn.metrics import confusion_matrix
import seaborn as sn
cm = confusion_matrix(y_compare, pred)
plt.figure(figsize=(10,7))
sn.set(font_scale=1.4) # for label size
sn.heatmap(cm, annot=True, annot_kws={"size": 16}) # font size

plt.show()
pred[:10]

y_test.shape
print(y_test[:10])
print(pred[:20])

## K-Nearest Neighbors Classification

In [None]:
# K-Nearest Neighbors Classification

# Doesn't work too well in higher dimensions, as distances become less informative there
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(algorithm = 'auto',           # Let sklearn pick the search algorithm
                            metric = 'minkowski',        # Distance metric, e.g. Minkowski
                            n_neighbors = 7,             # Number of neighbors that vote
                            p = 2,                       # Manhattan (1) / Euclidean (2) distance
                            weights = 'distance')        # Closer neighbors get a larger weight

knn.fit(X_train, y_train)
#print("Predictions form the classifier:")
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))


# Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(criterion = 'entropy', max_features = 'sqrt',   # 'sqrt' is what 'auto' meant for classifiers
                             n_estimators = 100, n_jobs = 4, oob_score = True,
                             max_depth = 30) #min_samples_leaf = 2)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

## Gaussian Naive Bayes

In [None]:

                                    
U = encoded.drop(columns=['vote_round'])
v = encoded['vote_round'].astype('int32')
# run this for naive bayes 
X_train, X_test, y_train, y_test = train_test_split(U, v, test_size=0.25, random_state=3)
 
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

## Gaussian Mixture Model 

In [None]:

from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(covariance_type = 'tied', n_components= 2, n_init = 10,
                      verbose = 2, verbose_interval = 5, random_state=4).fit(U)
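proba = gmm.predict_proba(U)     # component posteriors used as features for the SVM
svm = SVC().fit(proba, v)
y_pred = svm.predict(proba)      # predict from the same features the SVM was trained on
print("Accuracy Score: ", metrics.accuracy_score(v, y_pred))   # note: measured on the training data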
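
The cell above scores the model on the same rows it was trained on. A held-out variant might look like the sketch below, which uses synthetic data from `make_classification` as a stand-in for `U` and `v`; the sample sizes, the three classes and the names `gmm_demo` and `svm_demo` are assumptions made for the example.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for U and v (hypothetical data, not the TMDB features).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the mixture on the training rows only.
gmm_demo = GaussianMixture(n_components=2, covariance_type="tied",
                           n_init=10, random_state=4).fit(X_tr)

# Train the SVM on the component posteriors of the training rows ...
svm_demo = SVC().fit(gmm_demo.predict_proba(X_tr), y_tr)

# ... and predict from the posteriors of the held-out rows.
y_hat = svm_demo.predict(gmm_demo.predict_proba(X_te))
acc = metrics.accuracy_score(y_te, y_hat)
print("Held-out accuracy:", acc)
```

Because the mixture and the SVM never see the test rows during fitting, the printed accuracy is comparable with the train/test scores reported for the other models.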