# Movie Recommendation System

To start, we will import the libraries used in this data science project and load the essential datasets into pandas Data Frames.

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

moviesDF = pd.read_csv('../data/movies.csv')
ratingsDF = pd.read_csv('../data/ratings.csv')
tagsDF = pd.read_csv('../data/tags.csv')
linksDF = pd.read_csv('../data/links.csv')

In [2]:
moviesDF.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratingsDF.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
tagsDF.head(5)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## Gaining value using Data Exploration

### Idea 1 (Selection):

After looking further into the tags Data Frame, I considered label encoding, since many algorithms are unable to work with text data. However, I realized that the tag column has too many distinct values for us to apply label encoding efficiently. Although I believe the tags would be very valuable in a clustering model built with the help of Natural Language Processing, I do not currently have the Natural Language Processing knowledge to apply it effectively, and I also think it would be overkill for this project.

I will also not be using the links Data Frame, as I am not sure how it would add value to our clustering model. Therefore, I will select only features from the movies and ratings Data Frames to train our model.

### Idea 2 (Transformation):

Looking at the movies Data Frame, we notice that the genres column contains several attributes describing each movie. As many algorithms are unable to work with categorical text data, we can apply one hot encoding to transform these features. Because these attributes have far fewer distinct values than the tag column in the tags Data Frame, I think one hot encoding is suitable in this case.

I will transform the Data Frame by first giving every possible genre its own column, then using one hot encoding to signify whether each movie has that genre, and finally dropping the original genres column. For example, a movie that previously had title: X and genres: Adventure|Comedy will be transformed to title: X, Adventure: 1, Comedy: 1, with every other genre column set to 0.

In [5]:
# Provided at data/README.txt
genresListSet = {'Action', 'Adventure', 'Animation', 
                 'Children', 'Comedy', 'Crime', 
                 'Documentary', 'Drama', 'Fantasy',
                 'Film-Noir', 'Mystery', 'Horror', 
                 'Musical', 'Romance', 'Sci-Fi',  
                 'Thriller', 'War', 'Western', 
                 'IMAX', '(no genres listed)'
                }

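# Start with every genre column set to 0
for genre in genresListSet:
    moviesDF[genre] = 0

# Set the column to 1 for each genre listed on the movie
for index, row in moviesDF.iterrows():
    genresPerMovie = row['genres'].split('|')
    for genre in genresPerMovie:
        moviesDF.at[index, genre] = 1

# The original text column is no longer needed
del moviesDF['genres']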
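
As a point of comparison, the same one hot encoding can be sketched with pandas' built-in `Series.str.get_dummies`, which avoids the row-by-row loop. The `sampleMoviesDF` frame and the `genreDummies` and `encodedDF` names below are hypothetical stand-ins for illustration, not part of the notebook. One difference to keep in mind: this approach only creates columns for genres that actually appear in the data, whereas the loop above creates a column for every entry in `genresListSet`.

```python
import pandas as pd

# Hypothetical stand-in for moviesDF, using rows in the same format as movies.csv
sampleMoviesDF = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# Split each genres string on '|' and build one 0/1 column per genre found
genreDummies = sampleMoviesDF['genres'].str.get_dummies(sep='|')

# Swap the original genres column for the indicator columns
encodedDF = pd.concat([sampleMoviesDF.drop(columns=['genres']), genreDummies], axis=1)
print(encodedDF)
```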

### Idea 3 (Derive Statistical Summary):

Using the ratings Data Frame, we can add value by calculating the average rating and the number of ratings per movie and adding them as features on the movies Data Frame.

In [6]:
def getAverageRating(movieID):
    # Mean rating across all users for one movie (NaN if it has no ratings)
    return ratingsDF.loc[ratingsDF['movieId'] == movieID, 'rating'].mean()

def getRatingCount(movieID):
    # Number of ratings the movie received
    return ratingsDF.loc[ratingsDF['movieId'] == movieID, 'rating'].count()

moviesDF['rating'] = moviesDF['movieId'].map(getAverageRating)
moviesDF['ratingCount'] = moviesDF['movieId'].map(getRatingCount)
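
The per-movie lookup above scans the whole ratings Data Frame once for every movie. A possible alternative, sketched here on small made-up frames, is to compute both statistics in a single `groupby` pass and merge them back. The names `sampleRatingsDF`, `sampleMoviesDF`, `ratingStats`, and `mergedDF` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-ins for ratingsDF and moviesDF
sampleRatingsDF = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [1, 3, 1, 1],
    'rating': [4.0, 4.0, 5.0, 3.0],
})
sampleMoviesDF = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})

# One pass over the ratings: mean and count per movieId
ratingStats = (sampleRatingsDF.groupby('movieId')['rating']
               .agg(rating='mean', ratingCount='count')
               .reset_index())

# Left merge keeps movies that have no ratings (their rating stays NaN)
mergedDF = sampleMoviesDF.merge(ratingStats, on='movieId', how='left')

# Movies without ratings get a count of 0, matching the per-movie lookup
mergedDF['ratingCount'] = mergedDF['ratingCount'].fillna(0).astype(int)
print(mergedDF)
```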

Now our data is ready for the clustering model!

In [7]:
moviesDF

Unnamed: 0,movieId,title,Children,Action,Adventure,Mystery,Film-Noir,Romance,Sci-Fi,Comedy,...,Drama,Thriller,War,Musical,Western,Documentary,Crime,(no genres listed),rating,ratingCount
0,1,Toy Story (1995),1,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,3.920930,215
1,2,Jumanji (1995),1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3.431818,110
2,3,Grumpier Old Men (1995),0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,3.259615,52
3,4,Waiting to Exhale (1995),0,0,0,0,0,1,0,1,...,1,0,0,0,0,0,0,0,2.357143,7
4,5,Father of the Bride Part II (1995),0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,3.071429,49
5,6,Heat (1995),0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,3.946078,102
6,7,Sabrina (1995),0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,3.185185,54
7,8,Tom and Huck (1995),1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2.875000,8
8,9,Sudden Death (1995),0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3.125000,16
9,10,GoldenEye (1995),0,1,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,3.496212,132


## Building the Clustering Model

For this project, I will use the K-means clustering model to build the movie recommendation system. Before training the model, we need to choose the number of clusters (K). I will pick K using my intuition: as there are 20 genres, there should be at least 20 clusters. From briefly looking through the dataset, there are also many genres that are heavily linked together, such as Animation and Children. Because of that, I will add another 5 clusters, making 25 in total. I will also prepare the data to be clustered.

In [8]:
numKClusters = 25
dataToCluster = moviesDF.drop(columns=['title', 'movieId', 'rating', 'ratingCount'])
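
The choice of 25 clusters here is based on intuition. One optional way to sanity-check a value of K, which this notebook does not do, is to compare the K-means inertia (within-cluster sum of squared distances) across several candidate values and look for the point where it stops dropping quickly. The sketch below runs on randomly generated binary data as a stand-in for `dataToCluster`; `sampleData`, `inertiaByK`, the candidate range, and the fixed `random_state` are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for dataToCluster: 200 movies x 6 binary genre flags
sampleData = rng.integers(0, 2, size=(200, 6))

inertiaByK = {}
for k in range(2, 9):
    # n_init and random_state are set explicitly so repeated runs are comparable
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(sampleData)
    inertiaByK[k] = model.inertia_

# Lower inertia means tighter clusters; it generally shrinks as K grows
for k, inertia in inertiaByK.items():
    print(k, round(inertia, 1))
```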

### Training and Prediction with K-means

In [11]:
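# Reuse the cluster count chosen above instead of hard-coding it
kmeans = KMeans(n_clusters=numKClusters)
# Fit the model and get one cluster label per movie
predictionCluster = kmeans.fit_predict(dataToCluster)
moviesDF['predictionKMeans'] = predictionCluster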

Unnamed: 0,movieId,title,Children,Action,Adventure,Mystery,Film-Noir,Romance,Sci-Fi,Comedy,...,Thriller,War,Musical,Western,Documentary,Crime,(no genres listed),rating,ratingCount,predictionKMeans
0,1,Toy Story (1995),1,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,3.920930,215,19
1,2,Jumanji (1995),1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,3.431818,110,7
2,3,Grumpier Old Men (1995),0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,3.259615,52,6
3,4,Waiting to Exhale (1995),0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,2.357143,7,13
4,5,Father of the Bride Part II (1995),0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,3.071429,49,1
5,6,Heat (1995),0,1,0,0,0,0,0,0,...,1,0,0,0,0,1,0,3.946078,102,8
6,7,Sabrina (1995),0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,3.185185,54,6
7,8,Tom and Huck (1995),1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,2.875000,8,7
8,9,Sudden Death (1995),0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,3.125000,16,2
9,10,GoldenEye (1995),0,1,1,0,0,0,0,0,...,1,0,0,0,0,0,0,3.496212,132,8
