# Weekend Movie Trip
In this project we build a clustering model on a movie dataset. The movies dataset has a movieId, a title, and a genres column. We will one-hot encode the genres column so that it can be used with our clustering model. We also have a ratings dataset that corresponds to the movies dataset; we will average the ratings for each movie and add the result to our movies dataset. Finally, we have a corresponding tags dataset, from which we will find the most frequent tags for each movie and add them to the corresponding movie in our movies dataset. These tags will be label encoded for use with our clusterer.

Let's start by importing the libraries we need.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MultiLabelBinarizer
from collections import Counter
from sklearn.cluster import KMeans
from sklearn import metrics


Let's read the datasets into pandas.

In [2]:
movie = pd.read_csv("ml-25m/movies.csv", index_col = "movieId")
ratings = pd.read_csv("ml-25m/ratings.csv", index_col = "userId")
tags = pd.read_csv("ml-25m/tags.csv",index_col = "userId")


Now let's take a look at our datasets.

In [3]:
movie.head()

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings = ratings.drop(columns = ["timestamp"])
ratings.head()

userId,movieId,rating
1,296,5.0
1,306,3.5
1,307,5.0
1,665,5.0
1,899,3.5


In [5]:
tags.head()

userId,movieId,tag,timestamp
3,260,classic,1439472355
3,260,sci-fi,1439472256
4,1732,dark comedy,1573943598
4,1732,great dialogue,1573943604
4,7569,so bad it's good,1573943455


In [6]:
print ("\nmovies:")
print (movie.isnull().sum())
print ("\nratings:")
print (ratings.isnull().sum())
print ("\ntags:")
print (tags.isnull().sum())


movies:
title     0
genres    0
dtype: int64

ratings:
movieId    0
rating     0
dtype: int64

tags:
movieId       0
tag          16
timestamp     0
dtype: int64


Let's fill the missing values in the tags data.

In [7]:
tags.fillna('none',inplace=True)

I think it would be valuable to have an average rating for each movie, so that movies can be clustered by rating score.

Here we group the ratings by movie and average them.

In [8]:
avg_ratings = ratings.groupby('movieId').mean()

Now we add this column to our movies dataset. The assignment aligns on movieId, so a movie with no ratings ends up with a missing avg_rating.

In [9]:
movie["avg_rating"] = avg_ratings
movie.head()

movieId,title,genres,avg_rating
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.893708
2,Jumanji (1995),Adventure|Children|Fantasy,3.251527
3,Grumpier Old Men (1995),Comedy|Romance,3.142028
4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.853547
5,Father of the Bride Part II (1995),Comedy,3.058434
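
The averaging and index-aligned assignment can be illustrated on a few made-up rows. This is only a sketch: `toy_ratings`, `toy_movies`, and `toy_avg` are hypothetical names, and the sketch selects the `rating` column explicitly before averaging, which is a small variation on the cell above.

```python
import pandas as pd

# Made-up ratings, indexed by userId like the real ratings table
toy_ratings = pd.DataFrame(
    {"movieId": [1, 1, 2, 2, 2], "rating": [4.0, 5.0, 3.0, 3.5, 2.5]},
    index=pd.Index([10, 11, 10, 12, 13], name="userId"),
)
# Made-up movies, indexed by movieId; movie 3 has no ratings at all
toy_movies = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index([1, 2, 3], name="movieId"),
)

# One mean rating per movieId (a Series indexed by movieId)
toy_avg = toy_ratings.groupby("movieId")["rating"].mean()

# Assignment aligns on the index, so movie 3 receives NaN
toy_movies["avg_rating"] = toy_avg
print(toy_movies)
```

Movies without any ratings keep a missing value here, which is why a fill step is needed before clustering.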


We now have the average rating for each movie. I think it would also be helpful for our model to recognize genres, so we will one-hot encode them.

In [10]:
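# binarizer that learns the set of unique genres (sparse output keeps memory use low)
mlb = MultiLabelBinarizer(sparse_output=True)
# split "A|B|C" into lists, one-hot encode them, and join the new columns onto movie
movie = movie.join(
            pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(movie.pop("genres").str.split('|')),
                index=movie.index,
                columns=mlb.classes_))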
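
To see what this encoding produces, here is a minimal sketch on two rows copied from the output above. The names `toy`, `mlb_toy`, `encoded`, and `toy_encoded` are hypothetical, and the sketch uses dense output for readability instead of the sparse output used in the real cell.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Two rows taken from the movies table shown earlier
toy = pd.DataFrame(
    {"title": ["Jumanji (1995)", "Grumpier Old Men (1995)"],
     "genres": ["Adventure|Children|Fantasy", "Comedy|Romance"]},
    index=pd.Index([2, 3], name="movieId"),
)

mlb_toy = MultiLabelBinarizer()  # dense output is fine for a tiny example
encoded = pd.DataFrame(
    mlb_toy.fit_transform(toy["genres"].str.split("|")),
    index=toy.index,
    columns=mlb_toy.classes_,  # genres in sorted order
)

# Replace the genres string with one 0/1 column per genre
toy_encoded = toy.drop(columns="genres").join(encoded)
print(toy_encoded)
```

Each movie gets a 1 in every genre column it belongs to and a 0 elsewhere.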


Another useful feature is the tags. Since there are so many unique tag values, we will not one-hot encode them. However, I believe it would be useful to have the 3 most frequent tags associated with each movie.

First we find the 3 most frequent tags for each movie.

In [11]:
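# count how often each tag was applied to each movie, then keep the 3 most frequent per movie
tag1 = tags.groupby(["movieId","tag"]).size().sort_values().groupby(level=0).tail(3)
tag1 = tag1.sort_index()

# drop the counts and keep only the (movieId, tag) pairs as columns
tag1 = tag1.to_frame().drop(columns = [0]).reset_index(level=[0,1])
tag1.head()
# collect the top tags into a dict: movieId -> list of tags
mov = {}
for tag in range(len(tag1["movieId"])):
    if tag1.at[tag,"movieId"] in mov:
        mov[tag1.at[tag,"movieId"]].append(tag1.at[tag,"tag"])
    else:
        mov[tag1.at[tag,"movieId"]] = [tag1.at[tag,"tag"]]
print(mov)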
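
The top-3 selection is easier to follow on a small made-up table. In this sketch `toy_tags`, `counts`, `top3`, and `top_tags` are hypothetical names, and the dictionary is built with a comprehension rather than the explicit loop above; the tag counts are chosen to avoid ties, since the order of tied tags is not guaranteed.

```python
import pandas as pd

# Made-up tag applications: movie 1 has four distinct tags, movie 2 has one
toy_tags = pd.DataFrame({
    "movieId": [1] * 10 + [2],
    "tag": ["pixar"] * 4 + ["animation"] * 3 + ["toys"] * 2 + ["kids"] + ["fantasy"],
})

# Number of times each tag was applied to each movie
counts = toy_tags.groupby(["movieId", "tag"]).size()

# Sort ascending by count, keep the last (largest) 3 per movie, then restore index order
top3 = counts.sort_values().groupby(level=0).tail(3).sort_index()

# movieId -> list of its top tags (alphabetical within a movie after sort_index)
top_tags = {
    int(movie_id): list(group.index.get_level_values("tag"))
    for movie_id, group in top3.groupby(level=0)
}
print(top_tags)
```

Movie 1 loses its least frequent tag, while movie 2 keeps its single tag, which is why the next step has to pad movies with fewer than three tags.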




In [12]:
# Build one row per movie; movies with fewer than 3 tags repeat their last tag.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
rows = []
for key, value in mov.items():
    padded = value + [value[-1]] * (3 - len(value))
    rows.append({'movieId': key, 'tag1': padded[0], 'tag2': padded[1], 'tag3': padded[2]})
movdata = pd.DataFrame(rows, columns=['movieId', 'tag1', 'tag2', 'tag3'])

In [13]:
movdata = movdata.set_index("movieId")
movdata.head()
#result = pd.concat([movie,movdata] ,axis=[1])

movieId,tag1,tag2,tag3
1,Pixar,animation,pixar
2,Robin Williams,fantasy,time travel
3,Jack Lemmon,fishing,sequel
4,CLV,characters,chick flick
5,family,pregnancy,steve martin


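Let's now label encode the tags. Stacking the three tag columns into one categorical series means the same tag gets the same integer code in every column.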

In [14]:
stacked = movdata.stack().astype('category')
unstacked = stacked.cat.codes.unstack()
unstacked.fillna(0)

movieId,tag1,tag2,tag3
1,6622,9377,19533
2,7079,14101,22953
3,4035,14437,21264
4,1532,11460,11541
5,14045,19869,22181
...,...,...,...
208813,17899,17899,17899
208933,10486,12757,12757
209035,4162,12093,17644
209037,11425,12093,15082
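
As a small illustration of the stack / category / unstack step, here is a sketch on made-up tags. The names `toy_top`, `stacked_toy`, and `codes` are hypothetical.

```python
import pandas as pd

# Made-up top-3 tags for two movies; movie 2 has a single tag repeated as padding
toy_top = pd.DataFrame(
    {"tag1": ["animation", "fantasy"],
     "tag2": ["pixar", "fantasy"],
     "tag3": ["toys", "fantasy"]},
    index=pd.Index([1, 2], name="movieId"),
)

# Stack the three columns into one series so that all tags share one category set
stacked_toy = toy_top.stack().astype("category")

# Replace each tag by its integer category code and restore the original shape
codes = stacked_toy.cat.codes.unstack()
print(codes)
```

The codes follow the sorted order of the categories, so they identify a tag but their numeric distance carries no meaning of its own.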


In [15]:
result = pd.concat([movie, unstacked], axis=1)
# movies without tags get code 0 in all three tag columns
tag_cols = ["tag1", "tag2", "tag3"]
result[tag_cols] = result[tag_cols].fillna(0).astype(int)
result.head()

movieId,title,avg_rating,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,tag1,tag2,tag3
1,Toy Story (1995),3.893708,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,6622,9377,19533
2,Jumanji (1995),3.251527,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,7079,14101,22953
3,Grumpier Old Men (1995),3.142028,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,4035,14437,21264
4,Waiting to Exhale (1995),2.853547,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1532,11460,11541
5,Father of the Bride Part II (1995),3.058434,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,14045,19869,22181


Now that we have the genres, most frequent tags, and average rating, we can build a clustering model. The model uses a large number of clusters so that there are fewer movies in each cluster. Fewer movies per cluster should give better recommendations, because my recommendation system randomly selects movies from the same cluster as the given input.

In [16]:
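# fill any remaining missing values (e.g. avg_rating for movies with no ratings) with 1
result = result.fillna(1)
result.isnull().sum()
# 100 clusters, k-means++ initialisation, best of 15 restarts
model = KMeans(n_clusters=100, init='k-means++', max_iter=500, n_init=15)
# cluster on every column except the title; returns one label per row, in row order
pred = model.fit_predict(result.drop(columns=["title"]))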
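
For readers new to `KMeans`, the following sketch shows `fit_predict` on made-up data with two obvious groups. The names `rng`, `X`, `km`, and `labels` are hypothetical, and `random_state` is set only to make the toy run repeatable; the cell above does not set it.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two tight, well-separated blobs of 20 points each
X = pd.DataFrame(
    np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))]),
    columns=["f1", "f2"],
)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)

# One integer label per row of X, returned in the same order as the rows
labels = km.fit_predict(X)
print(labels)
```

The label numbers themselves are arbitrary; only which rows share a label matters.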


Next we add our predictions to our result dataset.

In [17]:
# pred holds one label per row of result, in row order, so assign it by position
result['pred_cat'] = pred
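
Because `movieId` values are not a contiguous 0..n-1 range, it matters whether labels are attached by position or by index. The sketch below uses made-up labels to show the difference; `toy_result` and `toy_pred` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Three movies with a gap in the movieId index
toy_result = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index([1, 2, 5], name="movieId"),
)
toy_pred = np.array([7, 3, 9])  # made-up cluster labels, in row order

# A NumPy array is assigned by position: row i receives label i
toy_result["by_position"] = toy_pred

# A Series is aligned on index labels 0..2 instead, so movieId 1 picks up
# the label of the second row and movieId 5 finds no match at all
toy_result["by_index"] = pd.Series(toy_pred)
print(toy_result)
```

With index alignment the labels drift away from their rows as soon as the index has gaps, and unmatched rows become NaN.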

In [18]:
result.head()

movieId,title,avg_rating,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Mystery,Romance,Sci-Fi,Thriller,War,Western,tag1,tag2,tag3,pred_cat
1,Toy Story (1995),3.893708,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,6622,9377,19533,87.0
2,Jumanji (1995),3.251527,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,7079,14101,22953,7.0
3,Grumpier Old Men (1995),3.142028,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,4035,14437,21264,58.0
4,Waiting to Exhale (1995),2.853547,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,1532,11460,11541,13.0
5,Father of the Bride Part II (1995),3.058434,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,14045,19869,22181,74.0


Now we will write a function that returns recommendations for a given movie title. It randomly chooses movies from the same cluster as the given input.

In [19]:
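def recommend_movies(title):
    # look up the cluster of the input movie (first match if the title is duplicated)
    inmovie = result[result["title"] == title]
    movCat = int(inmovie['pred_cat'].iloc[0])
    # sample up to 10 movies from the same cluster
    df = result.loc[result['pred_cat'] == movCat]
    df = df.sample(min(10, len(df)))

    return list(df['title'])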

In [24]:
recs = recommend_movies("Jumanji (1995)")
print(recs)

['Ghost in the Machine (1993)', 'Sahara (2005)', 'Bikini Beach (1964)', 'Jalla! Jalla! (2000)', 'Day of the Locust, The (1975)', 'Further Gesture, A (1996)', 'Ten Commandments, The (1956)', 'Stay (2005)', 'From Dusk Till Dawn (1996)', 'Kagemusha (1980)']
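
One possible variation, sketched on made-up data, is to exclude the input movie from its own recommendations and to fail clearly on an unknown title. The function `recommend_from_cluster` and its `frame`, `n`, and `seed` parameters are hypothetical and are not part of the notebook above.

```python
import pandas as pd

# Made-up clustered movies: A, B, C share cluster 0; D is alone in cluster 1
toy = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "pred_cat": [0, 0, 0, 1]},
    index=pd.Index([1, 2, 3, 4], name="movieId"),
)

def recommend_from_cluster(frame, title, n=10, seed=None):
    """Return up to n random titles from the cluster of `title`, excluding `title`."""
    match = frame.loc[frame["title"] == title, "pred_cat"]
    if match.empty:
        raise KeyError(f"title not found: {title!r}")
    cluster = match.iloc[0]
    # Other movies in the same cluster
    peers = frame[(frame["pred_cat"] == cluster) & (frame["title"] != title)]
    # Never ask for more rows than the cluster contains
    return peers.sample(min(n, len(peers)), random_state=seed)["title"].tolist()

toy_recs = recommend_from_cluster(toy, "A", n=10, seed=0)
print(toy_recs)
```

Passing a `seed` makes the random sample repeatable, which can help when comparing recommendations between runs.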
