# Movie Recommendation System

To start off, we will import the libraries this data science project needs and load the essential datasets into pandas Data Frames.

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

moviesDF = pd.read_csv('../data/movies.csv')
ratingsDF = pd.read_csv('../data/ratings.csv')
tagsDF = pd.read_csv('../data/tags.csv')
linksDF = pd.read_csv('../data/links.csv')

In [2]:
moviesDF.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratingsDF.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
tagsDF.head(5)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## Gaining value using Data Exploration

### Idea 1 (Selection):

After looking further into the tags Data Frame, I considered label encoding the tag column, since many algorithms cannot work with text data. However, the tags vary too much for label encoding to be effective. Although I believe the tags would be valuable in a clustering model built with the help of Natural Language Processing, I do not currently have the NLP knowledge to apply it effectively, and I think it would be overkill for this project.

I will also not use the links Data Frame, as I am not sure how it would add value to our clustering model. Therefore, I will select only features from the movies and ratings Data Frames to train our model.

### Idea 2 (Transformation):

Looking at the movies Data Frame, we notice that the genres column contains several attributes describing each movie. As many algorithms are unable to work with categorical text data, we can apply one hot encoding to transform these features. Because these attributes have far less variance than the tag column in the tags Data Frame, I think one hot encoding is suitable in this case.

I will transform the Data Frame by first creating a column for every possible genre and dropping the genres column. Each new column then holds a one hot encoded value indicating whether the movie has that genre. For example, a movie that previously had title: X and genres: Adventure|Comedy will be transformed to title: X, Adventure: 1, Comedy: 1, with every other genre column set to 0.

In [5]:
# Provided at data/README.txt
genresListSet = {'Action', 'Adventure', 'Animation', 
                 'Children', 'Comedy', 'Crime', 
                 'Documentary', 'Drama', 'Fantasy',
                 'Film-Noir', 'Mystery', 'Horror', 
                 'Musical', 'Romance', 'Sci-Fi',  
                 'Thriller', 'War', 'Western', 
                 'IMAX', '(no genres listed)'
                }

for genres in genresListSet:
    moviesDF[genres] = 0

for index, row in moviesDF.iterrows():
    genresPerMovie = row['genres'].split('|') 
    for genres in genresPerMovie:
        moviesDF.at[index, genres] = 1

del moviesDF['genres']
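The loop above writes one cell at a time, which can be slow on large Data Frames. As a rough sketch of a vectorized alternative, pandas can split and encode the column in one call with `Series.str.get_dummies`. The `toyMoviesDF` rows below are made up for illustration and stand in for `moviesDF` before its genres column is dropped. Note one difference: this approach only creates columns for genres that actually occur in the data, whereas the loop creates all 20.

```python
import pandas as pd

# Hypothetical stand-in for moviesDF (made-up rows, for illustration only).
toyMoviesDF = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['X', 'Y', 'Z'],
    'genres': ['Adventure|Comedy', 'Comedy', '(no genres listed)'],
})

# Split each genres string on '|' and build one 0/1 indicator column per genre.
genreFlags = toyMoviesDF['genres'].str.get_dummies(sep='|')

# Attach the indicator columns and drop the original text column.
encodedDF = pd.concat([toyMoviesDF.drop(columns=['genres']), genreFlags], axis=1)
print(encodedDF)
```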

### Idea 3 (Derive Statistical Summary):

Using the ratings Data Frame, we can add value by calculating the average rating per movie and adding it as a feature on the movies Data Frame.

In [6]:
def getAverageRating(movieID):
    return ratingsDF.loc[ratingsDF['movieId'] == movieID]['rating'].mean()

averageRating = moviesDF['movieId'].map(lambda x: getAverageRating(x))

moviesDF['rating'] = averageRating
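The cell above filters the whole ratings Data Frame once per movie. A possible alternative, sketched here on made-up data, is to compute every average in a single `groupby` pass and then map the result onto the movie ids. The names `toyRatingsDF`, `toyMoviesDF` and `meanRatingByMovie` are hypothetical and only serve this example.

```python
import pandas as pd

# Hypothetical stand-ins for ratingsDF and moviesDF (made-up rows).
toyRatingsDF = pd.DataFrame({
    'userId': [1, 2, 3, 1],
    'movieId': [10, 10, 20, 20],
    'rating': [4.0, 5.0, 3.0, 2.0],
})
toyMoviesDF = pd.DataFrame({'movieId': [10, 20, 30], 'title': ['A', 'B', 'C']})

# One pass over the ratings: a Series of mean ratings indexed by movieId.
meanRatingByMovie = toyRatingsDF.groupby('movieId')['rating'].mean()

# Look up each movie's mean; movies without any rating get NaN.
toyMoviesDF['rating'] = toyMoviesDF['movieId'].map(meanRatingByMovie)
print(toyMoviesDF)
```

As in the original cell, a movie that nobody rated ends up with a missing (NaN) rating.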

Now our data is ready for the clustering model!

In [7]:
moviesDF

Unnamed: 0,movieId,title,Drama,(no genres listed),Action,Mystery,War,Animation,Comedy,Musical,...,Crime,Documentary,Film-Noir,Fantasy,Thriller,Horror,Western,IMAX,Sci-Fi,rating
0,1,Toy Story (1995),0,0,0,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,3.920930
1,2,Jumanji (1995),0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,3.431818
2,3,Grumpier Old Men (1995),0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,3.259615
3,4,Waiting to Exhale (1995),1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,2.357143
4,5,Father of the Bride Part II (1995),0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,3.071429
5,6,Heat (1995),0,0,1,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,3.946078
6,7,Sabrina (1995),0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,3.185185
7,8,Tom and Huck (1995),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2.875000
8,9,Sudden Death (1995),0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.125000
9,10,GoldenEye (1995),0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,3.496212


## Building the Clustering Model

For this project, I will use the K-means clustering model to build the movie recommendation system because it is easy to implement and fast to train.

After fitting the model and assigning each movie to its cluster, I will recommend the 10 highest rated movies from the same cluster as a given movie.

### Choosing the Number of Clusters (K)

Before training the model, we need to decide how many clusters (K) to use. I will pick K by intuition. As there are 20 genre labels (including '(no genres listed)'), there should be at least 20 clusters. From briefly looking through the dataset, many genres are also heavily linked, such as Animation and Children. To account for such combinations, I will add another 5 clusters, making 25 in total. I will also prepare the data to be clustered.

In [8]:
numKClusters = 25
dataToCluster = moviesDF.drop(columns=['title', 'movieId', 'rating'])
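One data-driven cross-check on an intuition-based K is the elbow method: fit K-means for a range of K values and look for the point where the inertia (within-cluster sum of squared distances) stops dropping sharply. The sketch below runs on random 0/1 data, not on the movie features, so its numbers say nothing about this dataset. The candidate range, `n_init` and `random_state` values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 0/1 feature matrix standing in for dataToCluster.
rng = np.random.default_rng(0)
toyData = rng.integers(0, 2, size=(200, 6))

# Fit one model per candidate K and record its inertia.
candidateKs = range(2, 11)
inertias = []
for k in candidateKs:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toyData)
    inertias.append(model.inertia_)

# Inspect (or plot) inertia against K and look for the bend in the curve.
for k, inertia in zip(candidateKs, inertias):
    print(k, round(inertia, 1))
```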

### Training K-means and Predicting Clusters

In [9]:
kmeans = KMeans(n_clusters=numKClusters)
predictionCluster = kmeans.fit_predict(dataToCluster)
# fit_predict returns one cluster label per row, in the same order as moviesDF
moviesDF['predictionKMeans'] = predictionCluster

### Testing our Movie Recommendation System

In [10]:
def recommendMovies(movieTitle):
    clusterNumber = moviesDF[moviesDF['title'] == movieTitle]['predictionKMeans'].values[0]
    clusterMovies = moviesDF[moviesDF['predictionKMeans'] == clusterNumber]
    clusterMoviesTop10 = clusterMovies.sort_values(by='rating', ascending=False).head(10)
    return clusterMoviesTop10

In [11]:
recommendMovies('Jumanji (1995)')

Unnamed: 0,movieId,title,Drama,(no genres listed),Action,Mystery,War,Animation,Comedy,Musical,...,Documentary,Film-Noir,Fantasy,Thriller,Horror,Western,IMAX,Sci-Fi,rating,predictionKMeans
9700,185031,Alpha (2018),0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,4.5,24
2735,3673,Benji the Hunted (1987),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4.5,24
3638,4993,"Lord of the Rings: The Fellowship of the Ring,...",0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,4.106061,24
4137,5952,"Lord of the Rings: The Two Towers, The (2002)",0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,4.021277,24
9428,166203,Sapphire Blue (2014),0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,4.0,24
8659,121007,Space Buddies (2009),0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,4.0,24
6805,60818,Hogfather (Terry Pratchett's Hogfather) (2006),0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,4.0,24
4269,6232,Born Free (1966),1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,4.0,24
7426,80748,Alice in Wonderland (1933),0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,4.0,24
9016,140359,Doctor Who: The Waters of Mars (2009),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,4.0,24


In [12]:
recommendMovies('Titanic (1997)')

Unnamed: 0,movieId,title,Drama,(no genres listed),Action,Mystery,War,Animation,Comedy,Musical,...,Documentary,Film-Noir,Fantasy,Thriller,Horror,Western,IMAX,Sci-Fi,rating,predictionKMeans
9022,140627,Battle For Sevastopol (2015),1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,5.0,2
4109,5889,"Cruel Romance, A (Zhestokij Romans) (1984)",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5.0,2
4675,6983,Jane Eyre (1944),1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5.0,2
8334,107771,Only Lovers Left Alive (2013),1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,5.0,2
5532,26587,"Decalogue, The (Dekalog) (1989)",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5.0,2
2835,3792,Duel in the Sun (1946),1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,5.0,2
5345,8911,Raise Your Voice (2004),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5.0,2
2319,3073,"Sandpiper, The (1965)",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5.0,2
2234,2969,"Man and a Woman, A (Un homme et une femme) (1966)",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5.0,2
3504,4788,Moscow Does Not Believe in Tears (Moskva sleza...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5.0,2


## Conclusion

As we can see from the two examples above, the recommended movies largely share genres with the movie given as input.

However, I notice some concerns that I would like to address:
1. The movies ranked in the top 10 are not the most well known. A likely reason is that I ranked movies solely by average rating and did not take into account the number of people who rated them. For example, a movie with a 5 star average from 100 ratings should rank higher than a movie with a 5 star average from 1 rating. A possible fix is to weight each movie's rating by the number of people who rated it.
2. The way I chose the number of clusters (K) may not be reliable. Although I chose it based on intuition and convenience, for a larger scale, real-world Data Science project I would try the elbow method, as it bases the decision on an analysis of the data.
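For the first concern, one common way to weight by rating count is a shrinkage (Bayesian style) average, which pulls movies with few ratings toward the global mean. This is only a sketch of one option, not the method used in this notebook. The data is made up, and `minVotes` is a hypothetical tuning parameter controlling how strongly low-count movies are pulled toward the mean.

```python
import pandas as pd

# Hypothetical ratings: movie 1 has five ratings, movie 2 a single 5.0, movie 3 three low ones.
toyRatingsDF = pd.DataFrame({
    'movieId': [1, 1, 1, 1, 1, 2, 3, 3, 3],
    'rating': [5.0, 5.0, 5.0, 5.0, 4.0, 5.0, 2.0, 3.0, 3.0],
})

# Per-movie mean rating and number of ratings.
stats = toyRatingsDF.groupby('movieId')['rating'].agg(['mean', 'count'])
globalMean = toyRatingsDF['rating'].mean()
minVotes = 3  # hypothetical: acts like minVotes extra ratings at the global mean

# weighted = (count * mean + minVotes * globalMean) / (count + minVotes)
stats['weighted'] = (
    (stats['count'] * stats['mean'] + minVotes * globalMean)
    / (stats['count'] + minVotes)
)
print(stats.sort_values(by='weighted', ascending=False))
```

In this toy data, movie 2 has the highest raw average but only one rating, so its weighted score falls below that of movie 1. Sorting by such a column instead of the raw average is one way the top 10 could favor widely rated movies.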