## Netflix Movie Rating Predictor

This notebook describes the MALIS project by Eirik Morken and Gijs Wissing.
The goal of our project was to predict the personal rating Gijs would give to an unseen movie, based on his Netflix viewing history and open data from sources such as IMDB. The biggest challenge of this project is dealing with unlabeled data, as Gijs did not provide any ratings for the movies he has viewed on Netflix. Unlabeled data is a very common real-world problem for data scientists. We solved it with an unsupervised clustering algorithm combined with a custom-made voting system.

#### The Data
To start the project, all the necessary data had to be brought together. This posed an initial challenge, as the IMDB dataset is larger than the RAM in our machines can handle directly. On top of this, the labels used by IMDB and Netflix are completely different. For this reason we used iterative line processors to integrate the IMDB data into our working dataset. The cell below shows an example of one of these line processors. No further code of this kind is run in this notebook, as importing, filtering and cleaning the data is quite time-consuming. Beyond this point, a saved, cleaned dataset is imported instead.

In [None]:
import csv


def readAndFilterIMDBdata():
    # Iterate over the IMDB title file row by row and keep only movie-like titles
    movieList = []
    with open("ProjectData/title_basics.tsv", encoding='utf-8') as ratingsFile:
        rd = csv.reader(ratingsFile, delimiter="\t", quotechar='"')
        for row in rd:
            # Column 1 holds the IMDB titleType
            if row[1] in ("movie", "tvMovie", "short"):
                movieList.append(row)
    return movieList


def readTsvFile(tf):
    # Despite the name, this reads a comma-separated file into a list of rows
    tsvList = []
    with open(tf) as ratingsFile:
        rd = csv.reader(ratingsFile, delimiter=",")
        for row in rd:
            tsvList.append(row)
    return tsvList

##### Next, the main part of the code is reviewed. The following steps are taken:
  *  All the categorical features are encoded with 1-of-K encoding
  *  All the numerical features are encoded as binary features as well
  *  The binary features are used to fit a k-modes clustering algorithm
  *  The clusters are identified using a voting system based on the watched and unwatched criteria
  
*Note: the kmodes library can be installed using "pip install kmodes".*

In [2]:
import pandas as pd
import numpy as np
from kmodes.kmodes import KModes
data = pd.read_csv('ProjectData/cleanedData.csv')
data = data.replace('\\N',0)
data = data.fillna(value=0)
data.startYear = data.startYear.values.astype(int)
data.runtimeMinutes = data.runtimeMinutes.values.astype(int)

#### Encoding categorical features
The first step in preparing the data for the k-modes algorithm is to apply 1-of-K encoding to the categorical features. The only categorical feature in the dataset is the movie genre. The dataset contains 27 different genres in total, which translates to 27 new features under 1-of-K encoding. Each movie can have multiple genres, which is a small complication: such a movie simply gets a 1 in several genre columns.
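As a minimal sketch of how multi-genre rows can be handled, pandas' `Series.str.get_dummies` splits each comma-separated string and creates one indicator column per genre. The toy titles and genres below are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Toy data (hypothetical); genres are stored as one comma-separated string per movie.
toy = pd.DataFrame({
    "primaryTitle": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action,Drama", "Comedy", "Drama,Comedy"],
})

# One indicator column per distinct genre; a row may contain several 1s.
genre_dummies = toy["genres"].str.get_dummies(sep=",")
encoded = pd.concat([toy, genre_dummies], axis="columns")
print(encoded)
```

A movie with two genres ends up with two 1s in its row, so no separate handling of multi-genre titles is needed.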

In [None]:
import csv

# Read the cleaned, comma-separated dataset into a list of rows
def readTsvFile(tf):
    tsvList = []
    with open(tf) as ratingsFile:
        rd = csv.reader(ratingsFile, delimiter=",")
        for row in rd:
            tsvList.append(row)
    return tsvList

imdbTable = readTsvFile("ProjectData/cleanedData.csv")


#Removing all IMDB movies with no votes
poppedMoviesWithNoVotes = []
for n in range(len(imdbTable)):
    if imdbTable[n][-1] != "":
        poppedMoviesWithNoVotes.append(imdbTable[n])
imdbTable = poppedMoviesWithNoVotes


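def mergeWatchTime():
    # Add a placeholder column that will hold the Netflix watch date
    for x in range(len(imdbTable)):
        imdbTable[x].append('\\N')
    mergeList = []
    for mov in imdbTable:
        # netflixLocalData (rows of [title, watchDate]) is assumed to be loaded beforehand
        for nflix in netflixLocalData:
            # Match on either the primary or the original title
            if nflix[0] == mov[2] or nflix[0] == mov[3]:
                mov[-1] = nflix[1]

        mergeList.append(mov)
    return mergeList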

mergedWatchTimeList = mergeWatchTime()
mergedWatchTimeList.pop(0)


df = pd.DataFrame(data=mergedWatchTimeList)
df.columns = ['tconst','titleType','primaryTitle','originalTitle','isAdult','startYear','endYear','runtimeMinutes','genres','averageRating','numVotes','watchDate']
genresDf = df['genres'].str.get_dummies(sep=',')
merged = pd.concat([df,genresDf],axis='columns')
#print(merged)



#### Encoding numerical features
In the next cell the numerical features are encoded into binary features. This is necessary to be able to use both numerical and categorical features in the k-modes algorithm. Both the year of release and the average rating on IMDB were binned on a linear scale, while the number of votes was binned on a logarithmic scale. We considered this the most logical way to distribute the features. The exact scaling was also validated and tweaked, as elaborated later in this notebook.
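The binning idea can be sketched on a small toy frame. The interval boundaries mirror those used in this notebook (5-year steps from 1975, powers of ten for votes); the four example rows are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "startYear": [1960, 1978, 1999, 2012],
    "numVotes": [0, 15, 2500, 1_200_000],
})

# Linear scale: one indicator per 5-year interval (lower, lower + 5].
for lower in range(1975, 2015, 5):
    col = f"{lower}_{lower + 5}"
    toy[col] = ((toy.startYear > lower) & (toy.startYear <= lower + 5)).astype(int)

# Logarithmic scale: one indicator per decade of vote counts (10^i, 10^(i+1)].
for i in range(7):
    col = f"10^{i}_votes"
    toy[col] = ((toy.numVotes > 10**i) & (toy.numVotes <= 10**(i + 1))).astype(int)

# Show only the indicator columns, transposed for readability.
print(toy.loc[:, "1975_1980":].T)
```

Each row gets at most one 1 per feature group, so a numerical value becomes a small set of categorical-style indicators that the Hamming distance can compare.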

In [3]:
data['OaF-release'] = np.ones(len(data.startYear.values))# OaF=="Old as F*"
data.loc[data.startYear>1975,'OaF-release']=0
for i in range(1975,2015,5):
    title = str(i)+'_'+str(i+5)
    data[title] = np.zeros(len(data.startYear.values))
    data.loc[(data.startYear>i)&(data.startYear<=(i+5)),title]=1
data['noVotes'] = np.zeros(len(data.numVotes.values))
data.loc[data.numVotes<=1,'noVotes']=1
for i in range(7):
    title = '10^'+str(i)+'_votes'
    data[title]= np.zeros(len(data.numVotes.values))
    data.loc[(data.numVotes>10**i)&(data.numVotes<=10**(i+1)),title] = 1
data['noRating'] = np.zeros(len(data.averageRating.values))
data.loc[data.averageRating==0,'noRating']=1
for i in range(10):
    title = str(i)+'_rating'
    data[title] = np.zeros(len(data.averageRating.values))
    data.loc[(data.averageRating > i) & (data.averageRating <= (i + 1)), title] = 1
print(data)

          tconst titleType                               primaryTitle  \
0      tt0111161     movie                   The Shawshank Redemption   
1      tt0468569     movie                            The Dark Knight   
2      tt1375666     movie                                  Inception   
3      tt0110912     movie                               Pulp Fiction   
4      tt0109830     movie                               Forrest Gump   
5      tt0068646     movie                              The Godfather   
6      tt1345836     movie                      The Dark Knight Rises   
7      tt0848228     movie                               The Avengers   
8      tt0120689     movie                             The Green Mile   
9      tt0071562     movie                     The Godfather: Part II   
10     tt0169547     movie                            American Beauty   
11     tt2015381     movie                    Guardians of the Galaxy   
12     tt0434409     movie                         

#### "Training" the k-modes model
In the next cell the k-modes model is defined and executed. Since this is an unsupervised clustering algorithm, it shouldn't strictly be considered training. This model uses the Hamming distance, which is defined as the number of features that differ between two rows. The rest of the model works like k-means: the total distance between each cluster centroid (here the mode) and its cluster members is minimized.
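To make the distance concrete, here is a small sketch of the Hamming distance and of a cluster mode for binary features. It only illustrates the idea; the internals of the kmodes library may differ, and the feature values are hypothetical.

```python
import numpy as np

def hamming(a, b):
    """Number of positions at which two equal-length feature vectors differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

# Three toy movies described by five binary features (hypothetical values).
members = np.array([
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
])

# For binary features the per-column mode is the majority value.
# Ties are resolved towards 0 here, an arbitrary choice for this sketch.
mode = (members.sum(axis=0) * 2 > len(members)).astype(int)

# Cost of the cluster: summed Hamming distance of its members to the mode.
cost = sum(hamming(row, mode) for row in members)
print(mode, cost)
```

The `cost` values printed in the verbose output of the fitting cell are sums of this kind over all clusters.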

In [4]:
# define the k-modes model
km = KModes(n_clusters=10, init='Huang', n_init=11, verbose=1)
# fit the clusters to the encoded features columns
clusters = km.fit_predict(data.iloc[:,11:])
# get an array of cluster modes
kmodes = km.cluster_centroids_
shape = kmodes.shape

data['cluster'] = clusters
print(data[['originalTitle','cluster']])

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 381, cost: 2689.0
Run 1, iteration: 2/100, moves: 134, cost: 2689.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 2, iteration: 1/100, moves: 518, cost: 2719.0
Run 2, iteration: 2/100, moves: 122, cost: 2719.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 3, iteration: 1/100, moves: 480, cost: 2958.0
Run 3, iteration: 2/100, moves: 133, cost: 2890.0
Run 3, iteration: 3/100, moves: 46, cost: 2890.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 4, iteration: 1/100, moves: 142, cost: 2599.0
Run 4, iteration: 2/100, moves: 0, cost: 2599.0
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 5, iteration: 1/100, moves: 485, cost: 2557.0
Run 5, iteration: 2/100, moves: 98, cost: 2557.0
Init: initializing centroids
Init: initializing cluste

#### Voting the cluster rating

Before the clustering result is passed to the multiple regression algorithm, the new cluster feature has to be weighted. The weight of each cluster is the number of movies assigned to that cluster divided by the total number of movies.
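A minimal sketch of this weighting, assuming the cluster labels live in a `cluster` column as in the fitting cell. The toy assignments and the column name `clusterWeight` are hypothetical.

```python
import pandas as pd

# Hypothetical cluster assignments for eight movies.
toy = pd.DataFrame({"cluster": [0, 0, 0, 1, 1, 2, 2, 2]})

# Weight of a cluster = movies in the cluster / total number of movies.
weights = toy["cluster"].value_counts(normalize=True).sort_index()

# Attach the weight of its cluster to every movie as a new feature.
toy["clusterWeight"] = toy["cluster"].map(weights)
print(weights)
```

By construction the weights sum to 1, so they can be read as the share of the dataset that falls into each cluster.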
