Preface:
In an era where information is abundant and easily accessible, the sheer volume of available literature can be overwhelming for readers seeking quality recommendations. 

The Problem:
Current recommender systems are content based filters which usually rely on multiple features of each movie which can be extremely vague or non specific enough which need to be hard-engineered. Data on such features is limited and usually inaccurate. Furthermore, it leads to the recommendation of extremely niche movies and provides mainly novelty movies, rather than allowing a reader to explore popular movies from a range of genres and expand their interests.

The Solution:
Collaborative Filtering is the solution to these probelms. It doesn't rely on feature engineering, rather we are able to find latent features within the data when implementing matrix factorisation to create predictions. Furthermore, collaborative filtering enables users similar to one another to provide recommendations, enabling a user to explore movies from a range of genres.


The solution will include three different approaches:
- Using user-to-user collaborative filtering (CF)
- Item-to-item CF
- Matrix factorisation to find latent features

The input will be the name of a user, and the output should be 5 movie recommendations

In [96]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import random
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
from sklearn.model_selection import ParameterGrid



The first step will include data preperation. 
- Ensure the data is prepared in the correct format and remove any unessary data

In [97]:
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv', nrows = 10000)
links = pd.read_csv('links.csv')
tags = pd.read_csv('tags.csv')

In [98]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
87580,292731,The Monroy Affaire (2022),Drama
87581,292737,Shelter in Solitude (2023),Comedy|Drama
87582,292753,Orca (2023),Drama
87583,292755,The Angry Breed (1968),Drama


In [99]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077
4,1,32,5.0,943228858
...,...,...,...,...
9995,66,58559,4.0,1449462234
9996,66,59315,4.0,1449462462
9997,66,59725,4.0,1449463282
9998,66,59784,4.0,1449462794


In [100]:
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
87580,292731,26812510,1032473.0
87581,292737,14907358,986674.0
87582,292753,12388280,948139.0
87583,292755,64027,182776.0


In [101]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,22,26479,Kevin Kline,1583038886
1,22,79592,misogyny,1581476297
2,22,247150,acrophobia,1622483469
3,34,2174,music,1249808064
4,34,2174,weird,1249808102
...,...,...,...,...
2000067,162279,90645,Rafe Spall,1320817734
2000068,162279,91079,Anton Yelchin,1322337407
2000069,162279,91079,Felicity Jones,1322337400
2000070,162279,91658,Rooney Mara,1325828398


Step 1: Determine methodology
- Approach #1: Find similar users and recommend their top movies (user-to-user CF)
- Approach #2: Find similar movies based on ratings given by other users (item-to-item CF)
- Approach #3: Matrix factorisation to discover latent features

Lets evaluate each methodology.

Approach #1: User to User Collaborative Filtering (CF)
Process: This approach identifies users with similar tastes and recommends movies that those users have rated highly.
Pros:
- Captures nuanced user preferences by comparing entire user profiles.
- Can introduce users to a broader range of movies outside their usual preferences.
Cons:
- Can suffer from the "cold start" problem for new users with few ratings.
- Computationally expensive as it requires comparing each user with every other user.

Approach #2: Item to Item Collaborative Filtering (CF)
Process: This approach finds movies similar to ones the user has rated highly, based on ratings from other users.
Pros:
- Provides more consistent recommendations by focusing on item similarities.
Cons:
- May not capture the full range of a user's preferences.
- Can be limited in diversity, recommending similar types of movies repeatedly.

Approach #3: Matrix Factorisation
Process: This approach decomposes (factors) the user-item ratings matrix into two separate matrices. The first matrix, called the user-feature matrix, assigns a value to each user's preference for certain latent features, such as genre or action. The second matrix, known as the item-feature matrix, represents how strongly each movie exhibits those latent features.
Pros:
- It discovers latent features like genre preferences or actor preferences leading to more accurate predictions
- It models user preferences and item characteristics directly, matrix factorization can offer highly personalized recommendations 
Cons:
- The latent features generated by matrix factorization are often abstract and difficult to interpre
- the initial factorization process can be computationally intensive, especially with very large datasets
- Without proper regularization, matrix factorization models can overfit the training data

From our evaluation, Approach #1: User to User CF seems to be be more appropriate because we can overcome the cold-start problem by hardcoding parameters to ensure movies with plenty of ratings are chosen. Furthermore we can employ packages to reduce the computation required. This method can introduce users to a wider variety of movies, as it leverages the tastes of similar users, offering more personalized and diverse recommendations. 

However, item-to-item CF will provide more predictable and similar recommendations to the target movie due to the similarity finding similarity between the movies rather than the user. 

Lastly, Matrix factorization can uncover hidden factors that explain the patterns in user-item interactions, leading to more nuanced and accurate recommendations. For example, in movie recommendations, it might discover latent features like genre preferences or actor preferences that aren’t explicitly stated. We can overcome its downsides of it being computationally expensive by employing stochastic gradient descent when finding k, the number of latent features. Furthermore we can regularise the data beforehand to prevent overfitting.


1) The approach for method #1 user-to-user CF

- 1.1 Quanitfy similarity between the target user and the rest of the users
    - use cosine distance between users
    - euclidean distance
    
Note: The image is simplified into 2 dimensions. Each dimension represents the movie for which a rating is given. Each vector represents the userratings that has been given to each movie.

<img src="Cosine_sim.png" style="height:200px" />

- 1.2 Find the top 5 rated movies by the most similar user and recommend it to the target user

2) The approach for method #2 item-to-item CF
- 2.1 Find a group of the top 10 rated items by the target user
- 2.2 Use of the of movies from the group of movies as the target movie to qunaitify similarity
    - using cosine distance
    - euclidean distance
- 2.3 Lastly recommend the 5 most similar items (movies) to the user


3) The approach for method #3 matrix factorisation
- 3.1 Define the (hyper)parameters that will be used latent features, learning rate, learning cycles
- 3.2 Create an outer loop for latent_features, then training cycles, inner loops to traverse the ratings matrix
- 3.3 within the loop predict, find error of each prediction and update the latent feature matrices
- 3.4 record the mean squared error of the aggregated preduction matrix with the corresponding latent feature vectors and dimensions
- 3.5 recommend the movies with the highest predicted ratings=


Step 2: Clean and keep relevant data

In [102]:
#drop unessesary columns 
ratings = ratings.drop('timestamp', axis = 1)
movies = movies.drop('genres', axis = 1)

In [103]:
#Include the movie title name in the ratings table
ratings = pd.merge(ratings, movies, on='movieId')
ratings

Unnamed: 0,userId,movieId,rating,title
0,1,17,4.0,Sense and Sensibility (1995)
1,3,17,5.0,Sense and Sensibility (1995)
2,15,17,4.5,Sense and Sensibility (1995)
3,28,17,4.0,Sense and Sensibility (1995)
4,29,17,4.0,Sense and Sensibility (1995)
...,...,...,...,...
9995,66,8974,4.0,"SpongeBob SquarePants Movie, The (2004)"
9996,66,37729,4.0,Corpse Bride (2005)
9997,66,44193,4.5,She's the Man (2006)
9998,66,56152,4.5,Enchanted (2007)


In [104]:
#Keep movies which have more than 50 ratings to overcome cold start problem
ratings_count = ratings['movieId'].value_counts()

#movies with more than 10 ratings
movies_greater_than_10 = ratings_count[ratings_count > 10].index
#use the list that is created above to keep the movie_id's with more than 10 ratings
ratings = ratings[ratings['movieId'].isin(movies_greater_than_10)]

ratings

Unnamed: 0,userId,movieId,rating,title
0,1,17,4.0,Sense and Sensibility (1995)
1,3,17,5.0,Sense and Sensibility (1995)
2,15,17,4.5,Sense and Sensibility (1995)
3,28,17,4.0,Sense and Sensibility (1995)
4,29,17,4.0,Sense and Sensibility (1995)
...,...,...,...,...
5934,36,8961,3.5,"Incredibles, The (2004)"
5935,44,8961,3.5,"Incredibles, The (2004)"
5936,52,8961,4.0,"Incredibles, The (2004)"
5937,64,8961,4.0,"Incredibles, The (2004)"


In [105]:
#find out how many movies we are providing recommendations
num_movies = ratings['movieId'].nunique()
print(num_movies)

124


Step 3: Create a table to represent the ratings data of each user for a movie

In [106]:
#We will create pivot table to represent the relevant data (users,movies,ratings)
user_ratings = ratings.pivot_table(values = 'rating', index = 'userId', columns = 'title')
#fill all the null entries with 0's to produce cosine similarity later on
user_ratings = user_ratings.fillna(0)
user_ratings

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),Apocalypse Now (1979),Apollo 13 (1995),...,Toy Story (1995),Traffic (2000),Trainspotting (1996),True Lies (1994),Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),"Usual Suspects, The (1995)",Waterworld (1995),While You Were Sleeping (1995),X-Men (2000)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,5.0,0.0,0.0,5.0,5.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,5.0,0.0
3,0.0,0.5,4.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,...,0.0,0.0,0.0,2.5,0.0,0.0,0.0,3.0,4.0,3.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,5.0,0.0,0.0,0.0,3.0,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,5.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,4.0,0.0
63,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,4.5,...,0.0,0.0,0.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0
64,0.0,0.0,0.0,4.5,4.5,0.0,4.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,2.5
65,2.0,1.0,1.0,3.0,1.0,4.0,4.0,3.5,5.0,3.0,...,2.0,3.5,2.0,3.5,2.0,2.0,4.0,2.0,0.0,1.0


In [107]:
#since we are executing user-to-user CF, it calculates the similarity row-wise, meaning the similarity is shown between each user in the array below
similarity_matrix = cosine_similarity(user_ratings)
similarity_matrix


array([[1.        , 0.04975509, 0.31751663, ..., 0.39130546, 0.47672793,
        0.04588684],
       [0.04975509, 1.        , 0.51025086, ..., 0.1272402 , 0.36529003,
        0.07209938],
       [0.31751663, 0.51025086, 1.        , ..., 0.35433623, 0.58618336,
        0.19756082],
       ...,
       [0.39130546, 0.1272402 , 0.35433623, ..., 1.        , 0.58118164,
        0.22609752],
       [0.47672793, 0.36529003, 0.58618336, ..., 0.58118164, 1.        ,
        0.24978076],
       [0.04588684, 0.07209938, 0.19756082, ..., 0.22609752, 0.24978076,
        1.        ]])

In [108]:
#convert the similarity between the users into a dataframe
#here we can see the similarity between each user
user_similarity = pd.DataFrame(similarity_matrix,index = user_ratings.index, columns = user_ratings.index)
user_similarity


userId,1,2,3,4,5,6,7,8,9,10,...,57,58,59,60,61,62,63,64,65,66
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.000000,0.049755,0.317517,0.100377,0.087049,0.000000,0.113585,0.334042,0.302323,0.348697,...,0.201033,0.252395,0.467832,0.222722,0.137361,0.280362,0.255164,0.391305,0.476728,0.045887
2,0.049755,1.000000,0.510251,0.000000,0.567943,0.000000,0.605253,0.012119,0.068248,0.307515,...,0.000000,0.000000,0.139378,0.126575,0.422349,0.341625,0.172674,0.127240,0.365290,0.072099
3,0.317517,0.510251,1.000000,0.142971,0.592698,0.090200,0.459953,0.131085,0.093504,0.565526,...,0.128099,0.281695,0.404079,0.197606,0.412525,0.552969,0.342408,0.354336,0.586183,0.197561
4,0.100377,0.000000,0.142971,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.133897,...,0.000000,0.375293,0.115556,0.000000,0.000000,0.181369,0.000000,0.114939,0.060405,0.000000
5,0.087049,0.567943,0.592698,0.000000,1.000000,0.072347,0.611342,0.098520,0.068748,0.383932,...,0.000000,0.000000,0.139690,0.178514,0.678532,0.541449,0.303091,0.066452,0.406379,0.061160
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,0.280362,0.341625,0.552969,0.181369,0.541449,0.104022,0.345556,0.178362,0.079910,0.477059,...,0.305888,0.068067,0.303196,0.281478,0.496390,1.000000,0.478126,0.331458,0.513820,0.042210
63,0.255164,0.172674,0.342408,0.000000,0.303091,0.226795,0.210599,0.200141,0.204169,0.425816,...,0.059546,0.000000,0.213965,0.147051,0.257681,0.478126,1.000000,0.233114,0.427604,0.000000
64,0.391305,0.127240,0.354336,0.114939,0.066452,0.118660,0.207697,0.321094,0.121329,0.663763,...,0.153465,0.258096,0.443467,0.397985,0.148386,0.331458,0.233114,1.000000,0.581182,0.226098
65,0.476728,0.365290,0.586183,0.060405,0.406379,0.155902,0.370380,0.460297,0.355550,0.756009,...,0.211712,0.237465,0.640431,0.435787,0.400052,0.513820,0.427604,0.581182,1.000000,0.249781


In [109]:

user = user_similarity.loc[10]
sim_user = user[user < 0.99].idxmax()

print(sim_user)

28


In [110]:
#Input is userId, and we want output to be 3 items
#1. get the most similar user to the inputted user
#2. get the 5 highest rated movies by the most similar user and recommend it
def get_rec_user(userId,num_rec,sim_table,cosine):
    #getting the inputted userID as a row
    user = sim_table.loc[userId]
    #column (movie) with the highest value
    #this part of the code is for cosine similarity as the similarity measure (need to find greatest similarity)
    if cosine == True:
        #the similarity needs to be below 0.99 to ensure the target user is not returned
        sim_user = user[user < 0.99].idxmax()
        max_value_below_1 = user[user < 0.99].max()
        print(max_value_below_1)
    else:
    #this part of the code is for euclidean distance as the similarity measure (as we need to find the least distance)
        sim_user = user[user >0].idxmin()
        min_value_greater_0 = user[user >0].min()
        print(min_value_greater_0)
    print(sim_user)
    
    #here we need to filter the list movies that the user has not yet seen to provide those as recommendations
    unseen_movies = user_ratings.loc[userId]
    #an movie that is unseen is one that has a rating of 0
    unseen_movies = unseen_movies[unseen_movies==0].reset_index()
    unseen_movies.columns = ['title','rating']
    
    #once we have the userId, we can find the top 5 rated movies by that user
    sim_user_ratings = ratings[ratings['userId'] == sim_user]
    # Sort by rating in descending order
    sim_user_sorted = sim_user_ratings.sort_values(by='rating', ascending=False)

    # Get the top 5 ratings and corresponding movies for those that are unseen
    sim_user_sorted = sim_user_sorted[sim_user_sorted['title'].isin(unseen_movies['title'])]
    top_ratings = sim_user_sorted.head(num_rec)
    recommendations=top_ratings['title'].reset_index(drop=True)
    recommendations.index = np.arange(1, len(recommendations) + 1)
    print(recommendations)

get_rec_user(5,10,user_similarity,cosine = True)

0.8040900107264123
49
1                          Babe (1995)
2                Mrs. Doubtfire (1993)
3                         Speed (1994)
4                     Mask, The (1994)
5    Terminator 2: Judgment Day (1991)
Name: title, dtype: object


In [111]:
get_rec_user(5,10,user_similarity,cosine = True)

0.8040900107264123
49
1                          Babe (1995)
2                Mrs. Doubtfire (1993)
3                         Speed (1994)
4                     Mask, The (1994)
5    Terminator 2: Judgment Day (1991)
Name: title, dtype: object


Now that we have the 5 recommendations using cosine similarity as our distance measure, lets use euclidean distance.

In [112]:
# Calculate Euclidean distances
user_distances_matrix = euclidean_distances(user_ratings.values)
user_distances_matrix

array([[ 0.        , 27.64054992, 28.73586609, ..., 25.90849282,
        29.39812919, 22.79802623],
       [27.64054992,  0.        , 24.77397828, ..., 30.89093718,
        31.86298793, 22.73213584],
       [28.73586609, 24.77397828,  0.        , ..., 30.76524013,
        28.16025568, 28.18687638],
       ...,
       [25.90849282, 30.89093718, 30.76524013, ...,  0.        ,
        27.80287755, 26.15339366],
       [29.39812919, 31.86298793, 28.16025568, ..., 27.80287755,
         0.        , 32.28002478],
       [22.79802623, 22.73213584, 28.18687638, ..., 26.15339366,
        32.28002478,  0.        ]])

In [113]:
#convert distances array into a dataframe
user_distances_df = pd.DataFrame(user_distances_matrix, index = user_ratings.index, columns = user_ratings.index)
user_distances_df

userId,1,2,3,4,5,6,7,8,9,10,...,57,58,59,60,61,62,63,64,65,66
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.000000,27.640550,28.735866,19.849433,26.324893,21.748563,25.922963,22.152878,20.808652,32.549962,...,20.273135,20.988092,24.228083,24.994999,25.806976,29.137605,25.144582,25.908493,29.398129,22.798026
2,27.640550,0.000000,24.773978,20.396078,18.248288,21.977261,17.435596,27.143139,24.041631,33.458930,...,22.293497,24.176435,30.577770,26.659895,21.260292,28.053520,26.650516,30.890937,31.862988,22.732136
3,28.735866,24.773978,0.000000,27.708302,22.666054,28.543826,25.588083,31.408598,30.294389,29.133314,...,28.368116,27.390692,29.491524,31.088583,26.659895,26.263092,28.757608,30.765240,28.160256,28.186876
4,19.849433,20.396078,27.708302,0.000000,19.261360,9.219544,19.235384,18.648056,14.832397,33.339166,...,9.949874,12.509996,25.787594,20.389949,19.595918,27.184554,21.453438,25.927784,33.064331,12.399597
5,26.324893,18.248288,22.666054,19.261360,0.000000,20.346990,16.763055,25.134637,23.130067,31.630681,...,21.260292,23.227139,29.966648,25.134637,15.394804,23.537205,23.837995,31.260998,30.761177,21.880356
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,29.137605,28.053520,26.263092,27.184554,23.537205,28.035692,27.549955,30.294389,30.116441,31.709620,...,26.267851,29.790938,31.622777,29.219001,24.596748,0.000000,25.519600,31.052375,30.302640,29.609965
63,25.144582,26.650516,28.757608,21.453438,23.837995,21.053503,25.342652,25.169426,23.157072,31.188940,...,22.743131,25.074888,29.820295,27.055499,24.784067,25.519600,0.000000,29.563491,30.757113,24.413111
64,25.908493,30.890937,30.765240,25.927784,31.260998,26.518861,28.917987,26.673957,28.288690,25.421448,...,26.367594,26.062425,27.463612,25.865034,30.103986,31.052375,29.563491,0.000000,27.802878,26.153394
65,29.398129,31.862988,28.160256,33.064331,30.761177,32.897568,31.484123,29.563491,31.068473,23.307724,...,32.453813,32.622845,25.869867,30.347982,30.955613,30.302640,30.757113,27.802878,0.000000,32.280025


In [129]:
#we can use the previous function we created
get_rec_user(5,10,user_similarity, cosine=False)

0.009405849800627291
53
1             Star Wars: Episode IV - A New Hope (1977)
2                     Terminator 2: Judgment Day (1991)
3                                      Gladiator (2000)
4     Lord of the Rings: The Fellowship of the Ring,...
5                               Dark Knight, The (2008)
6                                      Inception (2010)
7                                       Die Hard (1988)
8                                Terminator, The (1984)
9                                    Matrix, The (1999)
10        Lord of the Rings: The Two Towers, The (2002)
Name: title, dtype: object


The first method above uses cosine_similarity to find the most similar user and their highest rated movies (user-to-user)
Second method will be item to item similarity
fourth will be matrix factorisation

Method #2
Now that we have completed user-to-user similarity, we can move onto using item-to-item similarity.
This method is deconstructed into the following steps
- find a target - group or singular movie that the target user has rated highly
- develop the item similarity matrix
- find other movies in the similarity matrix that have a high similarity rating as the target movies
- recommend the movies

In [115]:
#transpose the user_ratings matrix to get the individual movies as rows
item_ratings = user_ratings.T

In [116]:
item_similarity_matrix = cosine_similarity(item_ratings)
item_similarity_matrix

array([[1.        , 0.24813052, 0.27075978, ..., 0.27023046, 0.17852775,
        0.44689137],
       [0.24813052, 1.        , 0.61057142, ..., 0.51778254, 0.47816659,
        0.39843629],
       [0.27075978, 0.61057142, 1.        , ..., 0.57695845, 0.70018828,
        0.3075099 ],
       ...,
       [0.27023046, 0.51778254, 0.57695845, ..., 1.        , 0.58110903,
        0.16703095],
       [0.17852775, 0.47816659, 0.70018828, ..., 0.58110903, 1.        ,
        0.18120466],
       [0.44689137, 0.39843629, 0.3075099 , ..., 0.16703095, 0.18120466,
        1.        ]])

In [117]:
item_similarity = pd.DataFrame(item_similarity_matrix, index = item_ratings.index, columns = item_ratings.index)
item_similarity

title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),Apocalypse Now (1979),Apollo 13 (1995),...,Toy Story (1995),Traffic (2000),Trainspotting (1996),True Lies (1994),Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),"Usual Suspects, The (1995)",Waterworld (1995),While You Were Sleeping (1995),X-Men (2000)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2001: A Space Odyssey (1968),1.000000,0.248131,0.270760,0.535611,0.284892,0.478252,0.502932,0.353440,0.766790,0.371834,...,0.379654,0.500172,0.698194,0.217723,0.461270,0.235654,0.644888,0.270230,0.178528,0.446891
Ace Ventura: Pet Detective (1994),0.248131,1.000000,0.610571,0.372857,0.220111,0.090041,0.203490,0.349249,0.259033,0.612340,...,0.249027,0.196762,0.374007,0.396197,0.285106,0.341366,0.286715,0.517783,0.478167,0.398436
Aladdin (1992),0.270760,0.610571,1.000000,0.303820,0.156198,0.210089,0.326061,0.306433,0.220915,0.673079,...,0.397965,0.228706,0.446466,0.585375,0.423909,0.297398,0.450792,0.576958,0.700188,0.307510
Alien (1979),0.535611,0.372857,0.303820,1.000000,0.570330,0.427878,0.401418,0.442348,0.555979,0.282490,...,0.396348,0.418439,0.569383,0.168155,0.501870,0.368434,0.503866,0.113424,0.091325,0.549593
Aliens (1986),0.284892,0.220111,0.156198,0.570330,1.000000,0.372627,0.291604,0.435085,0.407876,0.241408,...,0.339807,0.485103,0.361917,0.220556,0.551955,0.280765,0.365067,0.094411,0.101356,0.427490
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Twister (1996),0.235654,0.341366,0.297398,0.368434,0.280765,0.236754,0.193833,0.294265,0.237593,0.287431,...,0.409054,0.401399,0.367164,0.352392,0.511335,1.000000,0.308093,0.238314,0.073683,0.305300
"Usual Suspects, The (1995)",0.644888,0.286715,0.450792,0.503866,0.365067,0.621988,0.582512,0.558632,0.614631,0.397750,...,0.517775,0.648186,0.646970,0.299264,0.600647,0.308093,1.000000,0.270885,0.201705,0.528173
Waterworld (1995),0.270230,0.517783,0.576958,0.113424,0.094411,0.166668,0.202326,0.111694,0.206669,0.681996,...,0.191931,0.112229,0.283352,0.597728,0.351581,0.238314,0.270885,1.000000,0.581109,0.167031
While You Were Sleeping (1995),0.178528,0.478167,0.700188,0.091325,0.101356,0.074039,0.118777,0.084642,0.085199,0.640215,...,0.248453,0.077110,0.187196,0.642353,0.245064,0.073683,0.201705,0.581109,1.000000,0.181205


In [118]:
#now that we have our similarity matrix lets find a set of movies that we should find similar movies for
#for the user get the top 10 rated movies
def get_rec_item(userId, sim_table, ratings_df,num_suggestions,cosine):
    #first get the user's top 10 rated movies
    filtered_ratings = ratings_df[ratings_df['userId']==userId].sort_values(by=['rating'],ascending=False).head(10)

    #get the names of the movies in a list
    top_10 = filtered_ratings['title'].tolist()

    #as we aim to provide modular recommendations, we will use a randomly chosen movie from their top 10 movies to be the target movie
    global target_movie
    target_movie = random.choice(top_10)
    print('Target movie:',target_movie)

    #We need to recommend movies that the user has not seen yet
    #create a list of movies the user has not seen (rating = 0)
    unseen_movies = user_ratings.loc[userId]
    unseen_movies = unseen_movies[unseen_movies==0].reset_index()
    unseen_movies.columns = ['title','rating']
    
    #now lets find similar movies to our target, from our similarity matrix
    if cosine == True:
        sim_with_target = sim_table.loc[target_movie].to_frame(name='sim_score').sort_values(by=['sim_score'],ascending=False)
    #now we will filter the top 5 similar values
    #remove the first row because that is the target
    else:
        sim_with_target = sim_table.loc[target_movie].to_frame(name='sim_score').sort_values(by=['sim_score'],ascending=True)

    #manipulate the df to show the the top unseen movies
    sim_with_target = sim_with_target.reset_index()
    top_5=sim_with_target[sim_with_target['title'].isin(unseen_movies['title'])].reset_index().drop(['sim_score','index'],axis = 1).head(num_suggestions)
    top_5.index = np.arange(1, len(top_5) + 1)
    
    print(top_5)

get_rec_item(5,item_similarity,ratings,10,cosine = True)


Target movie: Batman (1989)
                                title
1               Mrs. Doubtfire (1993)
2                   Home Alone (1990)
3         Sleepless in Seattle (1993)
4                        Speed (1994)
5                Trainspotting (1996)
6             Schindler's List (1993)
7                         Babe (1995)
8   Terminator 2: Judgment Day (1991)
9                    Mask, The (1994)
10                Pretty Woman (1990)


Now lets compare the results of recommendations between cosine similarity and euclidean distance measures.
First using euclidean distance for user-to-user CF

Finally, we can employ euclidean distance for item-to-item CF

In [119]:
item_distances_matrix = euclidean_distances(item_ratings.values)
item_distances_matrix

array([[ 0.        , 14.94991639, 18.33712082, ..., 15.15750639,
        16.83745824, 13.78404875],
       [14.94991639,  0.        , 12.61942946, ...,  9.367497  ,
        10.79351657, 11.3137085 ],
       [18.33712082, 12.61942946,  0.        , ..., 13.01921657,
        11.368817  , 16.45448267],
       ...,
       [15.15750639,  9.367497  , 13.01921657, ...,  0.        ,
        10.03742995, 13.82931669],
       [16.83745824, 10.79351657, 11.368817  , ..., 10.03742995,
         0.        , 14.73091986],
       [13.78404875, 11.3137085 , 16.45448267, ..., 13.82931669,
        14.73091986,  0.        ]])

In [120]:
item_similarity = pd.DataFrame(item_distances_matrix, index = item_ratings.index, columns = item_ratings.index)
item_similarity


title,2001: A Space Odyssey (1968),Ace Ventura: Pet Detective (1994),Aladdin (1992),Alien (1979),Aliens (1986),"Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)",American Beauty (1999),American History X (1998),Apocalypse Now (1979),Apollo 13 (1995),...,Toy Story (1995),Traffic (2000),Trainspotting (1996),True Lies (1994),Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Twister (1996),"Usual Suspects, The (1995)",Waterworld (1995),While You Were Sleeping (1995),X-Men (2000)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2001: A Space Odyssey (1968),0.000000,14.949916,18.337121,13.720423,16.248077,16.416455,17.449928,16.830033,10.087121,19.006578,...,18.241436,13.892444,10.943034,19.235384,18.179659,15.676415,14.474115,15.157506,16.837458,13.784049
Ace Ventura: Pet Detective (1994),14.949916,0.000000,12.619429,13.573872,13.820275,18.801596,19.710403,14.688431,15.419144,15.141004,...,17.937391,14.525839,13.294736,15.149257,19.065676,11.213831,18.261982,9.367497,10.793517,11.313708
Aladdin (1992),18.337121,12.619429,0.000000,17.748239,18.728321,20.934421,20.742469,18.316659,19.352002,14.317821,...,18.641352,18.268826,15.700318,14.654351,19.293781,16.232683,18.282505,13.019217,11.368817,16.454483
Alien (1979),13.720423,13.573872,17.748239,0.000000,12.459936,17.036725,18.901058,15.475788,13.765900,20.099751,...,17.874563,14.790199,12.903488,19.640519,17.457090,14.106736,16.800298,16.324828,17.442764,12.298374
Aliens (1986),16.248077,13.820275,18.728321,12.459936,0.000000,17.262677,19.899749,14.958275,15.239751,20.031226,...,18.103867,13.209845,14.941553,18.289341,16.431677,13.991069,18.357560,15.402922,16.340135,12.903488
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Twister (1996),15.676415,11.213831,16.232683,14.106736,13.991069,18.020821,20.242282,15.700318,16.186414,18.761663,...,16.658331,13.238202,13.874437,15.945219,16.830033,0.000000,18.391574,12.649111,15.091388,12.874393
"Usual Suspects, The (1995)",14.474115,18.261982,18.282505,16.800298,18.357560,15.716234,17.449928,16.209565,15.173991,20.609464,...,17.993054,14.282857,14.343988,20.784610,17.117243,18.391574,0.000000,18.661458,19.937402,15.937377
Waterworld (1995),15.157506,9.367497,13.019217,16.324828,15.402922,18.540496,20.018741,17.233688,16.294171,14.089003,...,18.788294,15.755951,14.525839,13.047988,18.594354,12.649111,18.661458,0.000000,10.037430,13.829317
While You Were Sleeping (1995),16.837458,10.793517,11.368817,17.442764,16.340135,20.186629,21.482551,18.350749,18.268826,14.500000,...,18.808243,17.029386,16.271140,12.529964,20.186629,15.091388,19.937402,10.037430,0.000000,14.730920


In [121]:
get_rec_item(5,item_similarity,ratings,5,cosine = False)

Target movie: Fugitive, The (1993)
                         title
1  Sleepless in Seattle (1993)
2        Mrs. Doubtfire (1993)
3                 Speed (1994)
4             Mask, The (1994)
5            Home Alone (1990)


Next steps
1. Order the code where it does user-to-user CF then item-to-item (make it written, )
2. Compare euclidean vs cosine similarity for each method (need to compare it) {AT THE END COMPARISONS ARE MADE}


The next method is Matrix Factorisation


In [122]:
#data prep from dataframe to numpy array format
ratings_matrix = user_ratings.astype(np.float32)
matrix = ratings_matrix.values
users, movies = ratings_matrix.index, ratings_matrix.columns


#define the parameters and hyperparameters
num_users, num_movies = matrix.shape
#hyperparameter: number of latent features we want to test for (50 values, every even number of latent features will be tested)
#latent_features_list = list(range(2,102,2))
#hyperparameter: learning rate
#learning_rate = [0.01,0.1]

#define the parameter grid to find the optimum hyperparameters
parameters_grid = { 
    'latent_features': [5], #,10,30,40,50,60
    'learning_rates': [0.001] #,0.01
}
#0.01, 30,40,50,60
#since there are 100 different combinations of parameters that could be used it will be compuationally expensive
#hence will cap the number of combinations to 20 
#eppchs = 1000 because our learning rate is low, hence it needs to compensate for the low learning rate
n_epochs = 500
best_U = np.empty((num_users,num_movies))
best_V = np.empty((num_users,num_movies))
best_mse = 99999
best_params = None

for parameters in ParameterGrid(parameters_grid):
    latent_features = parameters_grid['latent_features']
    learning_rates = parameters_grid['learning_rates']

    #in matrix factorisation we have our user-feature matrix and movie-feature matrix U and V respectively which we will define
    #we will create matrices of random values that are normally distributed with the corresponding sizes
    #the scale is the stdev of which the random values are picked 1/latent_features ensures they are small so it is easy for convergence
    U = np.random.normal(scale=1.0/parameters['latent_features'], size=(num_users, parameters['latent_features']))
    V = np.random.normal(scale=1.0/parameters['latent_features'], size=(num_movies, parameters['latent_features']))
    #an epoch is 1 training cycle
    for epoch in range(n_epochs):
        for i in range(num_users):
            for j in range(num_movies):
                #the condition below checks if the ratings are non-zero (user has seen these movies)
                if matrix[i,j] > 0:
                    #establish a prediction for each rating using the values we have in the U,V matrices
                    predicted = np.dot(U[i,:],V[j,:])
                    #from the predicted compute the error (how far off the prediction was)
                    error = matrix[i,j] - predicted

                    #with the error and learning rate, we update the values in the U,V matrix to get closer to the actual
                    #updates the latent feature vector for user i: error * V[j,:] represents gradient of loss function
                    #update the latent vectors
                    U[i,:] = U[i,:] + parameters['learning_rates'] * error * V[j,:]
                    V[j,:] = V[j,:] + parameters['learning_rates'] * error * U[i,:]

    #At this point we can calculate the total error of predictions made for the latent feature and learning rate used
    ratings_predicted = np.dot(U, V.T) #need to multiplied by transposed T due to 
     #the reason is why the inner matrix > 0 to check for error for rated movies
    mse = mean_squared_error(matrix[matrix>0],ratings_predicted[matrix>0])

    #track the best parameters and latent-feature matrices
    if mse < best_mse:
        best_mse = mse
        best_params = parameters
        best_U = U
        best_V = V

print('Best MSE:',best_mse)
print('Best parameters:',best_params)


Best MSE: 0.29563948904448323
Best parameters: {'latent_features': 5, 'learning_rates': 0.001}


In [123]:
#using the optimal U and V latent feature matrices we can multiply them to get our predicted ratings dataframe
predicted_ratings = np.dot(best_U, best_V.T)
predicted_ratings = pd.DataFrame(predicted_ratings, index = users, columns=movies)
predicted_ratings.reset_index()
print(predicted_ratings)

title   2001: A Space Odyssey (1968)  Ace Ventura: Pet Detective (1994)  \
userId                                                                    
1                           3.668292                           2.811275   
2                           2.711093                           1.009282   
3                           3.549617                           1.902477   
4                           2.201812                           1.244610   
5                           4.221524                           2.901475   
...                              ...                                ...   
62                          3.694405                           1.789117   
63                          4.743833                           2.935345   
64                          4.256119                           2.781355   
65                          2.643150                           1.382579   
66                          4.380695                           3.038089   

title   Aladdin (1992)  

In [124]:
#now that we have our predicted ratings dataframe, we can develop our function to provide recommendations
def recommend_MF(predicted_df, userID):
    #firstly lets filter data for the specific user, and the movies the user has not seen
    unseen_movies = user_ratings.loc[userID]
    unseen_movies = unseen_movies[unseen_movies==0].reset_index()
    
    #now lets filter out the specific user (row) from our predicted ratings df
    predicted = predicted_df.loc[userID].reset_index()
    #now filter the movies that have not been seen
    predicted = predicted[predicted['title'].isin(unseen_movies['title'])].sort_values(by=userID,ascending = False)
    #lastly we will provide the top 5 movies
    predicted = predicted.head().drop(columns = [userID],axis =1)
    predicted.index= np.arange(1, len(predicted) + 1)
    print(predicted)

recommend_MF(predicted_df=predicted_ratings, userID = 20)

                         title
1    American History X (1998)
2               Platoon (1986)
3  English Patient, The (1996)
4          Shining, The (1980)
5               Titanic (1997)


In summary we have employed a basket of recommender methods with specific distance metrics:
1. User to User CF 
    - cosine similarity
    - euclidean distance
2. Item to Item CF
    - cosine similarity
    - euclidean distance
3. Matrix Factorisation

Now we will carry on evaluate each method we employed
1. For User to User CF and Item to Item CF, we will develop predictions for movies that have already been rated and carry out the following evaluative metrics
 - precision, recall, f1 score
2. For Matrix Factorisation we will use 
 - MSE and MAE

In [125]:
#Evaluation for User to User CF
def eval_U_U(userId, similarity_table,user_ratings_matrix,cosine):
    
    #####evaluate only for user number 10#######

    #getting the user and the similarity measure between the target user and the most similar user
    user = similarity_table.loc[userId]
    if cosine == True:
        sim_user = user[user < 0.99].idxmax()
        max_value_below_1 = user[user < 0.99].max()
        print('cosine similarity with target user',userId,':',max_value_below_1)
    else:
        sim_user = user[user >0].idxmin()
        min_value_greater_0 = user[user >0].min()
        print('euclidean dustance with target user:',userId,':',min_value_greater_0)
    print('most similar user:',sim_user)

    #get the list of seen movies by target user
    seen_movies = user_ratings_matrix.loc[userId]
    #an movie that is unseen is one that has a rating of 0
    seen_movies = seen_movies[seen_movies!=0].reset_index()
    seen_movies.columns = ['title','rating']

    #now get teh seen movies by the most similar user
    user2_seen_movies = user_ratings_matrix.loc[sim_user]
    user2_seen_movies = user2_seen_movies[user2_seen_movies!=0].reset_index()
    user2_seen_movies.columns = ['title', 'rating']
    
    #lets join in the two dataframes based on the titles that they have both seen
    common_movies = pd.merge(seen_movies,user2_seen_movies,on='title',suffixes=('_target_user', '_similar_user'))
    print(common_movies)
    
    #calculate root mean squared error
    rmse = np.sqrt(mean_squared_error(common_movies['rating_target_user'],common_movies['rating_similar_user']))
    print('root mean squared error:',rmse)
    #calculate mean absolute error
    mae = mean_absolute_error(common_movies['rating_target_user'],common_movies['rating_similar_user'])
    print('mean absolute error:',mae)
    

eval_U_U(5,user_similarity,user_ratings,True)

cosine similarity with target user 5 : 0.8040900107264123
most similar user: 49
                                                title  rating_target_user  \
0                   Ace Ventura: Pet Detective (1994)                 3.0   
1                                      Aladdin (1992)                 3.0   
2                                    Apollo 13 (1995)                 3.0   
3                                       Batman (1989)                 4.0   
4                               Batman Forever (1995)                 3.0   
5                         Beauty and the Beast (1991)                 3.0   
6                                   Braveheart (1995)                 4.0   
7                     Clear and Present Danger (1994)                 4.0   
8                                  Cliffhanger (1993)                 4.0   
9                                 Crimson Tide (1995)                 4.0   
10                          Dances with Wolves (1990)                 3.0

In [126]:
eval_U_U(5,user_similarity,user_ratings,False)

euclidean dustance with target user: 5 : 0.009405849800627291
most similar user: 53
                              title  rating_target_user  rating_similar_user
0  Shawshank Redemption, The (1994)                 1.0                  3.5
root mean squared error: 2.5
mean absolute error: 2.5


In [127]:
def eval_U_M(userId, similarity_table,user_ratings_matrix,target_movie,cosine):
    #now we will evaluate user to movie filtering
    
    #from the specific user in our example #10, we will take the target movie and find the top 10 predictions from a list of seen movies
    #then caluclagte the rmse and mae
    movies_user = user_ratings_matrix.loc[userId]
    #an movie that is unseen is one that has a rating of 0
    seen_movies = movies_user[movies_user!=0].reset_index()
    unseen_movies = movies_user[movies_user == 0].reset_index()
    unseen_movies.columns = ['title_unseen','sim_score_unseen']
    seen_movies.columns = ['title_seen','sim_score_seen']

    if cosine == True:
        sim_with_target = similarity_table.loc[target_movie].to_frame(name='sim_score').sort_values(by=['sim_score'],ascending=False)

    else:
        sim_with_target = similarity_table.loc[target_movie].to_frame(name='sim_score').sort_values(by=['sim_score'],ascending=True)
    
    sim_with_target = sim_with_target.reset_index()
    #get the most similar 
    top_seen=sim_with_target[sim_with_target['title'].isin(seen_movies['title_seen'])].head(20)
    top_unseen = sim_with_target[sim_with_target['title'].isin(unseen_movies['title_unseen'])].head(20)
       
    #The method of evaluation will take the seen movies and unseen movies list and compare the similarity score
    #take the similarity rating of unseen and seen and find the mae, and rmse

    #now that we have the lists, we will compare the top 20 similarity scores with rmse and mae
    #evaluation finding mae and rmse of similarity scores 

    mae = mean_squared_error(top_seen['sim_score'],top_unseen['sim_score'])
    rmse = np.sqrt(mean_squared_error(top_seen['sim_score'],top_unseen['sim_score']))
    print('mean absolute error:',mae)
    print('root mean squared error:',rmse)
    
eval_U_M(5,item_similarity,user_ratings,target_movie,True)

mean absolute error: 41.17880855678028
root mean squared error: 6.417071649653001


An extension could be implement precision@K and recall@k

How precision@k works: Precision@K measures the proportion of relevant items among the top-K recommended items. It answers the question: "Of the top-K items recommended to a user, how many are actually relevant?. precision@k = Number of relevant items in top-K/K. We can take the number of relveant items in the top-k as those movies with rating 4 or above. We will take k as 10 as it is the number of recommendations that we expect the user to interract with.
        
How recall@k works: Recall@K measures the proportion of relevant items that are included in the top-K recommendations. It answers the question: "Of all the relevant items for a user, how many are in the top-K recommendations?". recall@k = Number of relevant items in top-K/total number of relevant items

Another extension would be to standardize rating values. The standardization process would adjust the rating values unique to each user because the benchmark that two different users use to rate a movie as a 5 could vary. Hence, standardization ensures that ratings are comparable across users by normalizing them relative to each user's rating tendencies. This process helps to reduce bias introduced by individual rating scales and allows for more accurate comparison and aggregation of ratings across different users.