# Simple Collaborative Filtering

This notebook is an overly simplified illustration of collaborative filtering on a user-item matrix, based on the critics data from Toby Segaran's "Programming Collective Intelligence" book. Below is a preview of the critics dataset.

In [1]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

critics = pd.read_csv("../data/critics/critics.csv")
print(critics.shape)
critics.head()

(42, 3)


,User,Movie,Rating
0,Lisa Rose,Lady in the Water,2.5
1,Lisa Rose,Snakes on a Plane,3.5
2,Lisa Rose,Just My Luck,3.0
3,Lisa Rose,Superman Returns,3.5
4,Lisa Rose,"You, Me and Dupree",2.5


## Intuition and overall process

The overall intuition of naive collaborative filtering is to:
1. Identify "similar" users in the user-item matrix
2. Estimate the user's rating for an unknown item based on the ratings of those similar users

There are many ways of estimating the "similarity" between users.

## Estimating similarity

### User item matrix

To make the data easier to read, let's pivot the critics dataset into a user-item matrix, with one row per user and one column per movie.

In [2]:
critics_matrix = critics.pivot(index = "User", columns = "Movie", values = "Rating")
critics_matrix

User,Just My Luck,Lady in the Water,Snakes on a Plane,Superman Returns,The Night Listener,"You, Me and Dupree"
Claudia Puig,3.0,,3.5,4.0,4.5,2.5
Gene Seymour,1.5,3.0,3.5,5.0,3.0,3.5
Jack Matthews,,3.0,4.0,5.0,3.0,3.5
Lisa Rose,3.0,2.5,3.5,3.5,3.0,2.5
Michael Phillips,,2.5,3.0,3.5,4.0,
Mick LaSalle,2.0,3.0,4.0,3.0,3.0,2.0
Toby,,,4.5,4.0,,1.0


Suppose we want to predict the rating Toby would give to "The Night Listener". The first step is to measure how similar Toby is to every other user.

### Estimating similarities

There are many ways of calculating similarities between the users. We will focus on the following measures in this notebook:
- Pearson correlation
- Cosine distance (one minus the cosine similarity)
- Euclidean distance

For the user-item matrix above, each measure is calculated between every other user's row vector and Toby's row vector. Because not every user has rated every movie, each comparison uses only the movies that both users have rated.
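For reference, the three measures can be written out as follows. The notation is an assumption introduced here, not taken from the original notebook: $I_{ab}$ is the set of movies rated by both users $a$ and $b$, $R_{a,i}$ is user $a$'s rating of movie $i$, and $\bar{R}_a$ is the mean of user $a$'s ratings over $I_{ab}$ only.

$$
    r_{a,b} = \frac{\sum_{i \in I_{ab}} (R_{a,i} - \bar{R}_a)(R_{b,i} - \bar{R}_b)}{\sqrt{\sum_{i \in I_{ab}} (R_{a,i} - \bar{R}_a)^2} \, \sqrt{\sum_{i \in I_{ab}} (R_{b,i} - \bar{R}_b)^2}}
$$

$$
    d^{\text{cos}}_{a,b} = 1 - \frac{\sum_{i \in I_{ab}} R_{a,i} \, R_{b,i}}{\sqrt{\sum_{i \in I_{ab}} R_{a,i}^2} \, \sqrt{\sum_{i \in I_{ab}} R_{b,i}^2}}
$$

$$
    d^{\text{euc}}_{a,b} = \sqrt{\sum_{i \in I_{ab}} (R_{a,i} - R_{b,i})^2}
$$

Pearson correlation grows with similarity, while the two distances shrink as users become more alike.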

In [3]:
def calculate_pearson_correlation(matrix: pd.DataFrame, user: str) -> dict:
    remaining_user_list = list(matrix.index)
    remaining_user_list.remove(user)

    user_corr = {}

    for u in remaining_user_list:
        target_user_vector = matrix.loc[user].to_numpy()
        user_vector = matrix.loc[u].to_numpy()
        # True where both users have rated the movie (neither value is NaN)
        both_rated = ~np.logical_or(np.isnan(target_user_vector), np.isnan(user_vector))

        # Keep only the co-rated movies
        target_user_vector = np.compress(both_rated, target_user_vector)
        user_vector = np.compress(both_rated, user_vector)

        # The off-diagonal entry of the 2x2 correlation matrix is Pearson's r
        corr_matrix = np.corrcoef(target_user_vector, user_vector)
        user_corr[u] = float(corr_matrix[0, 1])
    
    return user_corr

pearson_correlation = calculate_pearson_correlation(critics_matrix, "Toby")
pearson_correlation


{'Claudia Puig': 0.8934051474415642,
 'Gene Seymour': 0.3812464258315117,
 'Jack Matthews': 0.6628489803598702,
 'Lisa Rose': 0.9912407071619304,
 'Michael Phillips': -0.9999999999999999,
 'Mick LaSalle': 0.924473451641905}
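
As a cross-check, the same correlations can probably be obtained with pandas alone, since `DataFrame.corr` correlates columns pairwise and skips rows where either value is missing. The sketch below is a self-contained illustration rather than part of the original notebook: it rebuilds the ratings by hand from the matrix shown earlier, with users as columns. It also makes the -1 for Michael Phillips easier to interpret: he and Toby share only two rated movies, and any two distinct points are perfectly correlated (positively or negatively).

```python
import numpy as np
import pandas as pd

# Ratings copied from the user-item matrix shown earlier (NaN = not rated)
movies = ["Just My Luck", "Lady in the Water", "Snakes on a Plane",
          "Superman Returns", "The Night Listener", "You, Me and Dupree"]
ratings = {
    "Claudia Puig":     [3.0, np.nan, 3.5, 4.0, 4.5, 2.5],
    "Gene Seymour":     [1.5, 3.0, 3.5, 5.0, 3.0, 3.5],
    "Jack Matthews":    [np.nan, 3.0, 4.0, 5.0, 3.0, 3.5],
    "Lisa Rose":        [3.0, 2.5, 3.5, 3.5, 3.0, 2.5],
    "Michael Phillips": [np.nan, 2.5, 3.0, 3.5, 4.0, np.nan],
    "Mick LaSalle":     [2.0, 3.0, 4.0, 3.0, 3.0, 2.0],
    "Toby":             [np.nan, np.nan, 4.5, 4.0, np.nan, 1.0],
}
# Movies as rows, users as columns, so that corr() compares users
matrix = pd.DataFrame(ratings, index=movies)

# Pairwise Pearson correlation between user columns, ignoring missing pairs
toby_corr = matrix.corr(method="pearson")["Toby"].drop("Toby")
print(toby_corr.round(4))
```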

In [4]:
from scipy.spatial import distance

def calculate_cosine_distance(matrix: pd.DataFrame, user: str) -> dict:
    remaining_user_list = list(matrix.index)
    remaining_user_list.remove(user)

    cosine = {}

    for u in remaining_user_list:
        target_user_vector = matrix.loc[user].to_numpy()
        user_vector = matrix.loc[u].to_numpy()
        # True where both users have rated the movie (neither value is NaN)
        both_rated = ~np.logical_or(np.isnan(target_user_vector), np.isnan(user_vector))

        # Keep only the co-rated movies
        target_user_vector = np.compress(both_rated, target_user_vector)
        user_vector = np.compress(both_rated, user_vector)

        # scipy returns the cosine DISTANCE, i.e. 1 - cosine similarity
        cosine_measure = distance.cosine(target_user_vector, user_vector)
        cosine[u] = float(cosine_measure)

    return cosine

cosine_distance = calculate_cosine_distance(critics_matrix, "Toby")
cosine_distance

{'Claudia Puig': 0.04459416410942463,
 'Gene Seymour': 0.08594164421271389,
 'Jack Matthews': 0.06819475196784597,
 'Lisa Rose': 0.04710725503783031,
 'Michael Phillips': 0.009169831955701091,
 'Mick LaSalle': 0.026383689432199042}

In [5]:
def calculate_euclidean_distance(matrix: pd.DataFrame, user: str) -> dict:
    remaining_user_list = list(matrix.index)
    remaining_user_list.remove(user)

    euclidean = {}

    for u in remaining_user_list:
        target_user_vector = matrix.loc[user].to_numpy()
        user_vector = matrix.loc[u].to_numpy()
        # True where both users have rated the movie (neither value is NaN)
        both_rated = ~np.logical_or(np.isnan(target_user_vector), np.isnan(user_vector))

        # Keep only the co-rated movies
        target_user_vector = np.compress(both_rated, target_user_vector)
        user_vector = np.compress(both_rated, user_vector)

        euclidean_measure = distance.euclidean(target_user_vector, user_vector)
        euclidean[u] = float(euclidean_measure)

    return euclidean

euclidean_distance = calculate_euclidean_distance(critics_matrix, "Toby")
euclidean_distance

{'Claudia Puig': 1.8027756377319948,
 'Gene Seymour': 2.8722813232690143,
 'Jack Matthews': 2.73861278752583,
 'Lisa Rose': 1.8708286933869704,
 'Michael Phillips': 1.5811388300841898,
 'Mick LaSalle': 1.5}
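
Note that the cosine and Euclidean values above are distances: a smaller value means a more similar user, the opposite of Pearson correlation. A minimal sketch of turning them into similarities follows. The two conversions are assumptions chosen for illustration, not something the original notebook does: `1 - d` undoes scipy's definition of cosine distance, and `1 / (1 + d)` is one common way to map a non-negative distance into the range (0, 1].

```python
# Distances to Toby, copied from the notebook outputs above
cosine_distance = {
    "Claudia Puig": 0.04459416410942463,
    "Gene Seymour": 0.08594164421271389,
    "Jack Matthews": 0.06819475196784597,
    "Lisa Rose": 0.04710725503783031,
    "Michael Phillips": 0.009169831955701091,
    "Mick LaSalle": 0.026383689432199042,
}
euclidean_distance = {
    "Claudia Puig": 1.8027756377319948,
    "Gene Seymour": 2.8722813232690143,
    "Jack Matthews": 2.73861278752583,
    "Lisa Rose": 1.8708286933869704,
    "Michael Phillips": 1.5811388300841898,
    "Mick LaSalle": 1.5,
}

# Assumed conversions (illustrative only): larger result = more similar
cosine_similarity = {u: 1.0 - d for u, d in cosine_distance.items()}
euclidean_similarity = {u: 1.0 / (1.0 + d) for u, d in euclidean_distance.items()}

# The most similar user is the one with the smallest distance
most_similar_cosine = max(cosine_similarity, key=cosine_similarity.get)
most_similar_euclidean = max(euclidean_similarity, key=euclidean_similarity.get)
print(most_similar_cosine, "|", most_similar_euclidean)
```

Both conversions preserve the ranking of users; they only flip its direction so that the values can serve as weights.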

### Estimating ratings

With the similarity measures calculated, we can use them to estimate the rating Toby would give to "The Night Listener" (or to any other movie he has not rated). There are two naive approaches:

1. Find the most similar K users to Toby and calculate an average of their applicable ratings of "The Night Listener"
2. Using the similarity measures as weights, calculate a weighted average amongst ALL existing ratings of "The Night Listener"

Approach 1 raises several problems:
- What is a fair K, especially when there are only 6 other users?
- Once the K most similar users are selected, the similarity values themselves are no longer used.
- What if none of the selected "similar" users has rated "The Night Listener"?
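
To make the sensitivity to K concrete, here is a minimal sketch of approach 1. The helper `top_k_average` is hypothetical (it is not part of the original notebook), and it uses the Pearson values and ratings shown elsewhere in this notebook. It restricts the candidates to users who actually rated the movie, which is one assumed way of sidestepping the last problem in the list above.

```python
import heapq

# Pearson correlation with Toby, copied from the notebook output
similarity = {
    "Claudia Puig": 0.8934051474415642,
    "Gene Seymour": 0.3812464258315117,
    "Jack Matthews": 0.6628489803598702,
    "Lisa Rose": 0.9912407071619304,
    "Michael Phillips": -0.9999999999999999,
    "Mick LaSalle": 0.924473451641905,
}
# Ratings of "The Night Listener" from the user-item matrix
ratings = {
    "Claudia Puig": 4.5, "Gene Seymour": 3.0, "Jack Matthews": 3.0,
    "Lisa Rose": 3.0, "Michael Phillips": 4.0, "Mick LaSalle": 3.0,
}

def top_k_average(similarity, ratings, k):
    """Plain average over the k most similar users who rated the item."""
    rated = [u for u in similarity if u in ratings]
    nearest = heapq.nlargest(k, rated, key=similarity.get)
    if not nearest:
        return None  # no candidate user has rated the item
    return sum(ratings[u] for u in nearest) / len(nearest)

# The estimate moves with the choice of K
for k in (1, 3, 6):
    print(k, top_k_average(similarity, ratings, k))
```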

We will proceed with approach 2. Toby's predicted rating for "The Night Listener" is calculated using the following weighted average, where the sums run over every user other than Toby who has rated the movie:

$$
    \text{Rating}_{\text{Toby},\,\text{The Night Listener}} = \sum_{u \neq \text{Toby}} \frac{R_{u,\,\text{The Night Listener}} \cdot S_{\text{Toby},\,u}}{\sum_{v \neq \text{Toby}} S_{\text{Toby},\,v}}
$$

R = rating given by the respective user to The Night Listener \
S = similarity measure between the respective user and Toby (the weighting assumes that a larger S means a more similar user)

In [6]:
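# .copy() so the new columns are added to an independent DataFrame rather than a slice of critics
the_night_listener = critics[(critics["Movie"] == "The Night Listener") & (critics["User"] != "Toby")].copy()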

the_night_listener["pearson"] = the_night_listener["User"].map(pearson_correlation)
the_night_listener["cosine"] = the_night_listener["User"].map(cosine_distance)
the_night_listener["euclidean"] = the_night_listener["User"].map(euclidean_distance)

pearson_denominator = the_night_listener["pearson"].sum()
cosine_denominator = the_night_listener["cosine"].sum()
euclidean_denominator = the_night_listener["euclidean"].sum()

the_night_listener

,User,Movie,Rating,pearson,cosine,euclidean
5,Lisa Rose,The Night Listener,3.0,0.991241,0.047107,1.870829
11,Gene Seymour,The Night Listener,3.0,0.381246,0.085942,2.872281
17,Michael Phillips,The Night Listener,4.0,-1.0,0.00917,1.581139
23,Claudia Puig,The Night Listener,4.5,0.893405,0.044594,1.802776
29,Mick LaSalle,The Night Listener,3.0,0.924473,0.026384,1.5
35,Jack Matthews,The Night Listener,3.0,0.662849,0.068195,2.738613


In [7]:
# Pearson estimation

float((the_night_listener["Rating"] * the_night_listener["pearson"]).sum()/pearson_denominator)

3.1192015867855516

In [8]:
# cosine estimation

float((the_night_listener["Rating"] * the_night_listener["cosine"]).sum()/cosine_denominator)

3.2703035530787545

In [9]:
# Euclidean estimation

float((the_night_listener["Rating"] * the_night_listener["euclidean"]).sum()/euclidean_denominator)

3.346549247112904
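
The cosine and Euclidean estimates above use the raw distances as weights, so users who are further from Toby count for more. The sketch below is an illustration under stated assumptions, not the notebook's method: a hypothetical helper `weighted_average` first reproduces the Pearson and cosine results from plain dictionaries, then repeats the cosine estimate with the assumed conversion `1 - distance` so that closer users receive larger weights.

```python
def weighted_average(ratings, weights):
    """Weighted mean of ratings; weights are expected to grow with similarity."""
    users = [u for u in ratings if u in weights]
    total_weight = sum(weights[u] for u in users)
    if total_weight == 0:
        raise ValueError("weights sum to zero; the estimate is undefined")
    return sum(ratings[u] * weights[u] for u in users) / total_weight

# Ratings of "The Night Listener" and measures copied from the notebook outputs
ratings = {
    "Lisa Rose": 3.0, "Gene Seymour": 3.0, "Michael Phillips": 4.0,
    "Claudia Puig": 4.5, "Mick LaSalle": 3.0, "Jack Matthews": 3.0,
}
pearson = {
    "Claudia Puig": 0.8934051474415642, "Gene Seymour": 0.3812464258315117,
    "Jack Matthews": 0.6628489803598702, "Lisa Rose": 0.9912407071619304,
    "Michael Phillips": -0.9999999999999999, "Mick LaSalle": 0.924473451641905,
}
cosine_distance = {
    "Claudia Puig": 0.04459416410942463, "Gene Seymour": 0.08594164421271389,
    "Jack Matthews": 0.06819475196784597, "Lisa Rose": 0.04710725503783031,
    "Michael Phillips": 0.009169831955701091, "Mick LaSalle": 0.026383689432199042,
}

print(weighted_average(ratings, pearson))          # about 3.119, as in the notebook
print(weighted_average(ratings, cosine_distance))  # about 3.270, distance used as weight

# Assumed conversion: cosine similarity = 1 - cosine distance
cosine_similarity = {u: 1.0 - d for u, d in cosine_distance.items()}
print(weighted_average(ratings, cosine_similarity))
```

With Pearson weights, a negatively correlated user such as Michael Phillips contributes a negative weight, which pulls the estimate away from his rating; whether that behavior is desirable is a separate design choice.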

## Implementation

Below is a code snippet showing a simple implementation of what was discussed above. The prediction it returns matches the cosine-based estimate computed earlier.

In [10]:
import sys
sys.path.append("../src")

from naive.user_collaborative_filter import UserCollaborativeFilter

ucf = UserCollaborativeFilter(critics, user_column="User", item_column="Movie", rating_column="Rating")
ucf.fit()
ucf.predict(user="Toby", item="The Night Listener")

3.2703035530787545