# DSC 630 Week 10
## Arbaz Khan

For this recommender system, we will be utilizing Cosine Similarity alongside the list of tags generated by the user. First, we will clean the data by mergining these different datasets based on userID and movieID, and removing the timestamp column since it will not impact our focus in this recommender.

In [193]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

# Read in each dataframe
movies = pd.read_csv("movies.csv", encoding_errors='ignore', low_memory=False)
tags = pd.read_csv("tags.csv", encoding_errors='ignore', low_memory=False)
ratings = pd.read_csv("ratings.csv", encoding_errors='ignore', low_memory=False)

# Remove timestamp column, since time of rating 
# is not a consequential element in our current analysis
tags.drop('timestamp', axis=1, inplace=True)
ratings.drop('timestamp', axis=1, inplace=True)

# Merge based on movieId and userID
df = pd.merge(tags, ratings, on=['movieId', 'userId'], how="inner")
df

Unnamed: 0,userId,movieId,tag,rating
0,2,60756,funny,5.0
1,2,60756,Highly quotable,5.0
2,2,60756,will ferrell,5.0
3,2,89774,Boxing story,5.0
4,2,89774,MMA,5.0
...,...,...,...,...
3471,606,6107,World War II,4.0
3472,606,7382,for katie,4.5
3473,610,3265,gun fu,5.0
3474,610,3265,heroic bloodshed,5.0


Now, we will merge these based on movieId and userID, giving us a dataset with all of the different reviews on each movie, organized by movieId and then userId.

In [194]:
# Merge based on movieId and userID
df = pd.merge(movies, df, on='movieId', how="inner")
df

Unnamed: 0,movieId,title,genres,userId,tag,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,336,pixar,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,474,pixar,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,567,fun,3.5
3,2,Jumanji (1995),Adventure|Children|Fantasy,62,fantasy,4.0
4,2,Jumanji (1995),Adventure|Children|Fantasy,62,magic board game,4.0
...,...,...,...,...,...,...
3471,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,62,star wars,4.0
3472,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,anime,3.5
3473,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,comedy,3.5
3474,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,184,gintama,3.5


Following this, we want to combine reviews from the same userID on the same movie. So, we will combine based on userID, and then use the .agg function alongside pandas groupby() in order to make all of the tags submitted by the user into one comma-separated list that is under one row instead of multiple by the same user.

In [195]:
# This gets most common value from a group, so we can take the userId 
# for a series of columns
def stringMode(series):
    # Get mode value
    mode_values = series.mode()  
    return mode_values.iloc[0] if not mode_values.empty else None

# Group by userID and tag and concatenate values
df_grouped = df.groupby('movieId', as_index=False).agg({'tag': ', '.join,
                                                       'userId': stringMode,})

# Merge once again with movies csv to add tag and title to each
df = pd.merge(movies, df_grouped, on='movieId', how="inner")
df

Unnamed: 0,movieId,title,genres,tag,userId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"pixar, pixar, fun",336
1,2,Jumanji (1995),Adventure|Children|Fantasy,"fantasy, magic board game, Robin Williams, game",62
2,3,Grumpier Old Men (1995),Comedy|Romance,"moldy, old",289
3,5,Father of the Bride Part II (1995),Comedy,"pregnancy, remake",474
4,7,Sabrina (1995),Comedy|Romance,remake,474
...,...,...,...,...,...
1459,183611,Game Night (2018),Action|Comedy|Crime|Horror,"Comedy, funny, Rachel McAdams",62
1460,184471,Tomb Raider (2018),Action|Adventure|Fantasy,"adventure, Alicia Vikander, video game adaptation",62
1461,187593,Deadpool 2 (2018),Action|Comedy|Sci-Fi,"Josh Brolin, Ryan Reynolds, sarcasm",62
1462,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,"Emilia Clarke, star wars",62


Now, we have a datset containing information on the movieId, userId, and the list of tags that user submitted for that movie. Based on this, we will create a MultiLabelBinarizer. This is because each movie has several tags corresponding to it, and so using the multi-label binarizer rather than pd.get_dummies() allows us to utilize lists with multiple separate values for our recommender.

In [196]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer

# Convert tag into a list
df['tag'] = df['tag'].apply(lambda x: x.split(','))

# Use MLB instead of get_dummies() because each value is a list containing multiple strings (the tags)
mlb = MultiLabelBinarizer()
tagFit = pd.DataFrame(mlb.fit_transform(df['tag']), columns=mlb.classes_)

# Merge with df
df = df.drop('tag', axis=1).join(tagFit)

Now, we will utilize a cosine similarity based on the list of tags submitted for each movie in order to recommend a movie. We will use the function recommend_movies to get the top 10 similar movies based on this cosine similarity in order to make recommendations.

In [197]:
# Compute similarity matrix
sim = cosine_similarity(tagFit)

# Convert to DataFrame for easier interpretation
simdf = pd.DataFrame(sim, index=df['movieId'], columns=df['movieId'])

# Given movie ID, the dataframe with similar movie values, our original dataframe,
# outputs a list of 10 movies of cosine similarity as recommendations
def recommend_movies(movie_id, simdf, df):
    # Based on movie ID, get 10 most similar movies from simdf
    similarMov = simdf[movie_id].iloc[1:11]
    # List the indices (movieId) of these movies
    recommend = similarMov.index.tolist()
    # Output them based on title instead of movieId
    return df[df['movieId'].isin(recommend)]['title']

movie = input("Input the name of the movie and its year in parenthesis i.e.: Pulp Fiction (1994): ")

id = df.loc[df['title'] == movie, 'movieId']

print(recommend_movies(id, simdf, df))

Input the name of the movie and its year in parenthesis i.e.: Pulp Fiction (1994):  Pulp Fiction (1994)


1                         Jumanji (1995)
2                Grumpier Old Men (1995)
3     Father of the Bride Part II (1995)
4                         Sabrina (1995)
5         American President, The (1995)
6                           Nixon (1995)
7                          Casino (1995)
8           Sense and Sensibility (1995)
9                      Get Shorty (1995)
10                        Copycat (1995)
Name: title, dtype: object


Here, we have an example where I entered Pulp Fiction and got a list of 10 movies recommended based on the tags assigned to this movie.