In this notebook, we use two clustering models, k-means and DBSCAN, to explore how movie ratings and tags can be used to find similar movies.
<br>
<br>
First we import our packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn import preprocessing
import matplotlib.pyplot as plt

Now read in the 3 data sets

In [2]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
tags = pd.read_csv('tags.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [5]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


We want to relate each movie to its rating, tag, and genre.<br>
Let's begin cleaning the data by dropping the timestamp columns; they are unlikely to be useful here, beyond showing how reviewers' tastes change over the decades.
<br>
Then we merge ratings and tags on movieId and userId.

In [6]:
rating_movieId = ratings.drop(columns=['timestamp'])
tag_movieId = tags.drop(columns=['timestamp'])

df = pd.merge(left=rating_movieId, right=tag_movieId, on=['movieId', 'userId'])
df.head()

Unnamed: 0,userId,movieId,rating,tag
0,2,60756,5.0,funny
1,2,60756,5.0,Highly quotable
2,2,60756,5.0,will ferrell
3,2,89774,5.0,Boxing story
4,2,89774,5.0,MMA
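
Because pd.merge performs an inner join by default, only user/movie pairs that appear in both frames survive, and a rating is repeated once for each tag that user gave the movie. The sketch below uses small invented frames (the `toy_` names and values are assumptions, not rows from the data set) to show the effect:

```python
import pandas as pd

# Toy frames with invented values, for illustration only
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.0, 5.0],
})
toy_tags = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [10, 10],
    "tag": ["funny", "classic"],
})

# Inner join (the default): keeps only user/movie pairs present in both frames
toy_merged = pd.merge(toy_ratings, toy_tags, on=["movieId", "userId"])
print(toy_merged)
```

User 1's rating of movie 10 appears twice (once per tag), while the ratings that have no matching tag are dropped.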


We label encode the tags so that our clustering models, which need numeric input, can use them

In [7]:
label_encoder = preprocessing.LabelEncoder()
df['tag'] = label_encoder.fit_transform(df['tag'])
df.head()

Unnamed: 0,userId,movieId,rating,tag
0,2,60756,5.0,911
1,2,60756,5.0,227
2,2,60756,5.0,1528
3,2,89774,5.0,72
4,2,89774,5.0,316
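
To see what the codes above mean, here is a minimal sketch on a handful of tags. LabelEncoder sorts the unique labels and uses each label's position as its code, so the numbers reflect sort order rather than any similarity between tags. That is worth keeping in mind when the codes are later fed to distance-based models. The `toy_tags` list is an illustrative assumption.

```python
from sklearn.preprocessing import LabelEncoder

toy_tags = ["funny", "Highly quotable", "will ferrell", "funny"]
encoder = LabelEncoder()
codes = encoder.fit_transform(toy_tags)

# classes_ holds the sorted unique labels; a code is a position in that list
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the codes back to the original strings
print(list(encoder.inverse_transform(codes)))
```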


We can add more information by merging in each movie's genre, which should help the models relate movies to each other in a way that is intuitive for us<br>

In [8]:
df = pd.merge(df, movies, on=["movieId"])
df.head()

Unnamed: 0,userId,movieId,rating,tag,title,genres
0,2,60756,5.0,911,Step Brothers (2008),Comedy
1,2,60756,5.0,227,Step Brothers (2008),Comedy
2,2,60756,5.0,1528,Step Brothers (2008),Comedy
3,62,60756,3.5,746,Step Brothers (2008),Comedy
4,62,60756,3.5,911,Step Brothers (2008),Comedy


We have to label encode both the title and genre columns from the movies data frame so they can be used by our models

In [9]:
df['titleCode'] = label_encoder.fit_transform(df['title'])
df['genres'] = label_encoder.fit_transform(df['genres'])
df.head()

Unnamed: 0,userId,movieId,rating,tag,title,genres,titleCode
0,2,60756,5.0,911,Step Brothers (2008),216,1232
1,2,60756,5.0,227,Step Brothers (2008),216,1232
2,2,60756,5.0,1528,Step Brothers (2008),216,1232
3,62,60756,3.5,746,Step Brothers (2008),216,1232
4,62,60756,3.5,911,Step Brothers (2008),216,1232
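
Label encoding treats each full genre string (for example Adventure|Children|Fantasy) as one opaque category. As a possible alternative, not used in this notebook, the pipe-separated genres could be split into one 0/1 column per genre, so that two movies sharing a genre agree in that column. A minimal sketch, where `toy_movies` and `genre_flags` are illustrative names:

```python
import pandas as pd

# Two rows copied from movies.head() above
toy_movies = pd.DataFrame({
    "title": ["Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Children|Fantasy", "Comedy|Romance"],
})

# One indicator column per genre found in the pipe-separated strings
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")
print(genre_flags)
```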


## Clustering

Let's apply kmeans using rating, tag, and genre

In [10]:
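# Note: kdata is a second name for df, not a copy, so the Cluster column added below also shows up in df
kdata = df
# Ask k-means for 20 clusters; with no random_state set, labels can differ between runs
kmeans = KMeans(n_clusters=20)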

cluster_value = kmeans.fit_predict(kdata[['rating', 'tag', 'genres']])

kdata['Cluster'] = cluster_value
kdata.head()

Unnamed: 0,userId,movieId,rating,tag,title,genres,titleCode,Cluster
0,2,60756,5.0,911,Step Brothers (2008),216,1232,7
1,2,60756,5.0,227,Step Brothers (2008),216,1232,8
2,2,60756,5.0,1528,Step Brothers (2008),216,1232,10
3,62,60756,3.5,746,Step Brothers (2008),216,1232,12
4,62,60756,3.5,911,Step Brothers (2008),216,1232,7
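
K-means relies on Euclidean distance, so a column with a large numeric range (tag codes run into the thousands) can dominate one with a small range (ratings stop at 5). One common remedy, sketched here on synthetic data rather than on the notebook's frame, is to standardise the columns first. The value ranges, the cluster count of 5, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rating / tag-code / genre-code columns
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.uniform(0.5, 5.0, size=200),   # rating-like scale
    rng.integers(0, 1500, size=200),   # tag-code-like scale
    rng.integers(0, 300, size=200),    # genre-code-like scale
])

# Rescale each column to zero mean and unit variance before clustering
scaled = StandardScaler().fit_transform(features)

# random_state fixes the initialisation so reruns give the same labels
kmeans_scaled = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans_scaled.fit_predict(scaled)
print(np.bincount(labels))  # number of points per cluster
```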


Now let's test the results of kmeans

In [11]:
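def getSimilarMoviesKmeans(movieName):
    """Return up to five distinct titles from the same k-means cluster as movieName."""
    # Cluster of the first row matching the title (rows of one movie can land in different clusters)
    k = kdata.loc[kdata['title'] == movieName, 'Cluster'].values[0]
    kcluster = kdata.loc[kdata['Cluster'] == k]
    # Keep one row per title; the queried movie itself may appear in the result
    kcluster = kcluster.drop_duplicates(subset=['title'], keep='first')

    return pd.DataFrame(kcluster['title'].values).head(5)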

In [12]:
getSimilarMoviesKmeans('Jumanji (1995)')

Unnamed: 0,0
0,The Interview (2014)
1,Jumanji (1995)
2,Braveheart (1995)
3,Toy Story 2 (1999)
4,Gladiator (2000)


In [13]:
getSimilarMoviesKmeans('Toy Story (1995)')

Unnamed: 0,0
0,Braveheart (1995)
1,Toy Story 2 (1999)
2,Gladiator (2000)
3,"Lord of the Rings: The Return of the King, The..."
4,"Animatrix, The (2003)"
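
Judging clusters by eye from a few titles is subjective. One way to put a number on cluster quality is the silhouette score, which ranges from -1 to 1, with higher values meaning points sit closer to their own cluster than to neighbouring ones. The sketch below runs on synthetic blobs, not on the movie data, and the candidate values of k are arbitrary assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with four generated groups, for illustration only
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = {}
for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette over all points for this choice of k
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

The same loop could in principle be pointed at the rating, tag, and genre columns to compare different cluster counts.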


We immediately see that these clusters aren't providing great results.<br>
<br>
Now we try density-based spatial clustering (DBSCAN).<br>The title column holds strings, which DBSCAN cannot interpret, so we drop it for this model.

In [24]:
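# Drop the string column; every remaining column is numeric
dbsc = df.drop(columns=['title'])

# Default parameters (eps=0.5, min_samples=5); a label of -1 marks a noise point
dbscan = DBSCAN()
model = dbscan.fit_predict(dbsc)

# Caution: model has one label per row of df (one user/tag/rating row), not one per movie,
# so this assignment pairs labels with movies by row position only.
# dbsc_movies is another name for movies, so the DBSC column is added to movies as well.
dbsc_movies = movies
dbsc_movies['DBSC'] = pd.DataFrame(model)

dbsc_movies.head()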

Unnamed: 0,movieId,title,genres,DBSC
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,-1.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,-1.0
2,3,Grumpier Old Men (1995),Comedy|Romance,-1.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,-1.0
4,5,Father of the Bride Part II (1995),Comedy,-1.0
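
DBSCAN returns one label per input row, and the input here has one row per user/tag/rating combination rather than one per movie. A possible way to get a label that belongs to each title is to aggregate to one row per movie before clustering. The sketch below does this on invented rows; the column choices, the use of the mean rating, and the eps and min_samples values are all assumptions, not tuned settings.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Invented rows standing in for the merged rating/tag frame
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "C", "D"],
    "rating": [5.0, 4.5, 5.0, 4.0, 4.5, 0.5],
    "genres": [3, 3, 3, 3, 3, 40],
})

# One row per movie, so each label can be attached to a title directly
per_movie = toy.groupby("title", as_index=False).agg(
    mean_rating=("rating", "mean"),
    genres=("genres", "first"),
)

scaled = StandardScaler().fit_transform(per_movie[["mean_rating", "genres"]])

# eps and min_samples are illustrative guesses, not tuned values
per_movie["DBSC"] = DBSCAN(eps=0.5, min_samples=2).fit_predict(scaled)
print(per_movie)
```

In this toy data, movies A, B, and C end up in one cluster, while D, which is far from the others, receives the noise label -1.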


Now we can test this model with some more movies

In [31]:
def getSimilarMoviesDBSC(movieName):
    """Return up to five titles that share movieName's DBSCAN label."""
    cluster = dbsc_movies.loc[dbsc_movies['title'] == movieName, 'DBSC'].values[0]
    similarCluster = dbsc_movies.loc[dbsc_movies['DBSC'] == cluster]
    return pd.DataFrame(similarCluster['title'].values).head(5)


In [32]:
getSimilarMoviesDBSC('Jumanji (1995)')

Unnamed: 0,0
0,Toy Story (1995)
1,Jumanji (1995)
2,Grumpier Old Men (1995)
3,Waiting to Exhale (1995)
4,Father of the Bride Part II (1995)


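At first glance this looks slightly better, but the five titles are simply the first five rows of movies.csv. All of them carry the label -1, which DBSCAN assigns to noise points, so they do not form a real cluster and the result should be read with caution.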