# Collaborative Filtering

In collaborative filtering, we predict a user's **ratings** from the ratings given by similar users, or from the ratings given to similar items.


In [1]:
import numpy as np
import pandas as pd



## Preparing Dataset

Below we have a rating matrix. The rows are users (critics), and the columns are movies. We want to predict the ratings for the movies that a user has not rated yet, e.g. for `Toby`. The unknown ratings show up as `NaN`.

In [2]:
critics = {
    "Lisa Rose": {
        "Lady in the Water": 2.5,
        "Snakes on a Plane": 3.5,
        "Just My Luck": 3.0,
        "Superman Returns": 3.5,
        "You, Me and Dupree": 2.5,
        "The Night Listener": 3.0,
    },
    "Gene Seymour": {
        "Lady in the Water": 3.0,
        "Snakes on a Plane": 3.5,
        "Just My Luck": 1.5,
        "Superman Returns": 5.0,
        "The Night Listener": 3.0,
        "You, Me and Dupree": 3.5,
    },
    "Michael Phillips": {
        "Lady in the Water": 2.5,
        "Snakes on a Plane": 3.0,
        "Superman Returns": 3.5,
        "The Night Listener": 4.0,
    },
    "Claudia Puig": {
        "Snakes on a Plane": 3.5,
        "Just My Luck": 3.0,
        "The Night Listener": 4.5,
        "Superman Returns": 4.0,
        "You, Me and Dupree": 2.5,
    },
    "Mick LaSalle": {
        "Lady in the Water": 3.0,
        "Snakes on a Plane": 4.0,
        "Just My Luck": 2.0,
        "Superman Returns": 3.0,
        "The Night Listener": 3.0,
        "You, Me and Dupree": 2.0,
    },
    "Jack Matthews": {
        "Lady in the Water": 3.0,
        "Snakes on a Plane": 4.0,
        "The Night Listener": 3.0,
        "Superman Returns": 5.0,
        "You, Me and Dupree": 3.5,
    },
    "Toby": {
        "Snakes on a Plane": 4.5,
        "You, Me and Dupree": 1.0,
        "Superman Returns": 4.0,
    },
}

df = pd.DataFrame(critics).T
df

Unnamed: 0,Lady in the Water,Snakes on a Plane,Just My Luck,Superman Returns,"You, Me and Dupree",The Night Listener
Lisa Rose,2.5,3.5,3.0,3.5,2.5,3.0
Gene Seymour,3.0,3.5,1.5,5.0,3.5,3.0
Michael Phillips,2.5,3.0,,3.5,,4.0
Claudia Puig,,3.5,3.0,4.0,2.5,4.5
Mick LaSalle,3.0,4.0,2.0,3.0,2.0,3.0
Jack Matthews,3.0,4.0,,5.0,3.5,3.0
Toby,,4.5,,4.0,1.0,


We convert all `NaN` to 0.

In [3]:
df.fillna(0, inplace=True)

In [4]:
df

Unnamed: 0,Lady in the Water,Snakes on a Plane,Just My Luck,Superman Returns,"You, Me and Dupree",The Night Listener
Lisa Rose,2.5,3.5,3.0,3.5,2.5,3.0
Gene Seymour,3.0,3.5,1.5,5.0,3.5,3.0
Michael Phillips,2.5,3.0,0.0,3.5,0.0,4.0
Claudia Puig,0.0,3.5,3.0,4.0,2.5,4.5
Mick LaSalle,3.0,4.0,2.0,3.0,2.0,3.0
Jack Matthews,3.0,4.0,0.0,5.0,3.5,3.0
Toby,0.0,4.5,0.0,4.0,1.0,0.0


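We won't rely on the default `.corr()` method from `pandas`, because it treats the zeros as real ratings. The default result is shown first for comparison.
When calculating the Pearson correlation between two users, we want to skip every movie that either of them has not rated (the zeros).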

In [5]:
df.T.corr()

Unnamed: 0,Lisa Rose,Gene Seymour,Michael Phillips,Claudia Puig,Mick LaSalle,Jack Matthews,Toby
Lisa Rose,1.0,0.396059,0.510754,0.701287,0.594089,0.331618,0.795744
Gene Seymour,0.396059,1.0,0.531008,0.236088,0.411765,0.958785,0.703861
Michael Phillips,0.510754,0.531008,1.0,0.328336,0.783869,0.604105,0.374818
Claudia Puig,0.701287,0.236088,0.328336,1.0,0.152763,0.170544,0.389391
Mick LaSalle,0.594089,0.411765,0.783869,0.152763,1.0,0.564764,0.640828
Jack Matthews,0.331618,0.958785,0.604105,0.170544,0.564764,1.0,0.687269
Toby,0.795744,0.703861,0.374818,0.389391,0.640828,0.687269,1.0


In [6]:
def custom_pearson_correlation(m, n):
    # Skip zeros (unrated).
    mask = np.logical_and(m > 0, n > 0)

    m = m[mask]
    n = n[mask]

    return pd.Series(m).corr(pd.Series(n))


df.T.corr(custom_pearson_correlation)

Unnamed: 0,Lisa Rose,Gene Seymour,Michael Phillips,Claudia Puig,Mick LaSalle,Jack Matthews,Toby
Lisa Rose,1.0,0.396059,0.40452,0.566947,0.594089,0.747018,0.991241
Gene Seymour,0.396059,1.0,0.204598,0.31497,0.411765,0.963796,0.381246
Michael Phillips,0.40452,0.204598,1.0,1.0,-0.258199,0.13484,-1.0
Claudia Puig,0.566947,0.31497,1.0,1.0,0.566947,0.028571,0.893405
Mick LaSalle,0.594089,0.411765,-0.258199,0.566947,1.0,0.211289,0.924473
Jack Matthews,0.747018,0.963796,0.13484,0.028571,0.211289,1.0,0.662849
Toby,0.991241,0.381246,-1.0,0.893405,0.924473,0.662849,1.0


Below we observe how the score changes when we exclude the unrated (zero-rated) movies:

In [7]:
all_ratings = df.T["Lisa Rose"].corr(df.T["Michael Phillips"])
rated_only = df.T["Lisa Rose"].corr(
    df.T["Michael Phillips"], custom_pearson_correlation
)
all_ratings, rated_only

(0.5107539184552492, 0.40451991747794525)
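
To make the masked correlation concrete, here is a minimal pure-Python sketch that computes Pearson's r over the co-rated movies only. The helper name `pearson_co_rated` is hypothetical (it is not part of the notebook), returning `nan` for fewer than two shared ratings is an assumption, and the ratings are copied from the `critics` dictionary above.

```python
from math import sqrt

# Ratings copied from the `critics` dictionary above.
lisa = {
    "Lady in the Water": 2.5, "Snakes on a Plane": 3.5, "Just My Luck": 3.0,
    "Superman Returns": 3.5, "You, Me and Dupree": 2.5, "The Night Listener": 3.0,
}
michael = {
    "Lady in the Water": 2.5, "Snakes on a Plane": 3.0,
    "Superman Returns": 3.5, "The Night Listener": 4.0,
}
toby = {"Snakes on a Plane": 4.5, "You, Me and Dupree": 1.0, "Superman Returns": 4.0}


def pearson_co_rated(a, b):
    """Pearson correlation over the movies rated by both users (hypothetical helper)."""
    shared = [movie for movie in a if movie in b]
    if len(shared) < 2:
        # Assumption: with fewer than two shared ratings the score is undefined.
        return float("nan")

    x = [a[movie] for movie in shared]
    y = [b[movie] for movie in shared]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        # A user with identical ratings on all shared movies has no variance.
        return float("nan")
    return cov / sqrt(var_x * var_y)


lisa_michael = pearson_co_rated(lisa, michael)
michael_toby = pearson_co_rated(michael, toby)
print(round(lisa_michael, 4), round(michael_toby, 4))
```

The first value agrees with `rated_only` above (about 0.4045). Michael Phillips and Toby share only two movies, so their correlation can only be +1 or -1 (here -1.0, as in the table); scores built on such a small overlap should be read with caution.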

In [8]:
def similar_to(df, user, n=5):
    """
    Finding the top-n users is as simple as just computing the pearson correlation scores,
    and returning the sorted result.
    """
    return sorted(
        df.T.corr(custom_pearson_correlation)[user].drop(user).items(),
        key=lambda t: t[1],
        reverse=True,
    )[:n]

In [9]:
similar_to(df, "Toby")

[('Lisa Rose', 0.9912407071619304),
 ('Mick LaSalle', 0.924473451641905),
 ('Claudia Puig', 0.8934051474415642),
 ('Jack Matthews', 0.6628489803598703),
 ('Gene Seymour', 0.3812464258315117)]

In [10]:
# For item based collaborative filtering, we just transpose the df.
similar_to(df.T, "Just My Luck")

[('The Night Listener', 0.5555555555555556),
 ('Snakes on a Plane', -0.3333333333333333),
 ('Superman Returns', -0.42289003161103106),
 ('You, Me and Dupree', -0.4856618642571827),
 ('Lady in the Water', -0.944911182523068)]

In [11]:
def recommend(df, user):
    similarity_scores = similar_to(df, user)
    recs = []

    # Only select movies that the user has not rated (stored as 0).
    not_watched = df.columns[df.loc[user] == 0]

    for movie in not_watched:
        # Ratings for the movie from other users.
        rated_by_user = dict(df[movie].fillna(0))

        sum_weight = 0
        sum_rating = 0

        # Use a separate name so the `user` argument is not shadowed.
        for other, weight in similarity_scores:
            # Ignore users that did not rate this movie.
            rating = rated_by_user[other]
            if rating == 0:
                continue

            sum_weight += weight
            sum_rating += weight * rating

        recs.append((movie, sum_rating / sum_weight))

    # Sort by rating, in descending order (highest to lowest rating)
    return sorted(recs, key=lambda t: t[1], reverse=True)

In [12]:
recommend(df, "Toby")

[('The Night Listener', 3.3477895267131013),
 ('Lady in the Water', 2.8325499182641622),
 ('Just My Luck', 2.5309807037655645)]
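
The predicted score is a similarity-weighted average of the other users' ratings. As a hedged check of the arithmetic, the sketch below recomputes Toby's score for `The Night Listener` by hand; the weights are the (rounded) similarity scores from `similar_to(df, "Toby")`, and the ratings come from the rating matrix above.

```python
# Similarity to Toby, rounded from the `similar_to(df, "Toby")` output.
weights = {
    "Lisa Rose": 0.99124,
    "Mick LaSalle": 0.92447,
    "Claudia Puig": 0.89341,
    "Jack Matthews": 0.66285,
    "Gene Seymour": 0.38125,
}

# Ratings for "The Night Listener" from the same five users.
night_listener = {
    "Lisa Rose": 3.0,
    "Mick LaSalle": 3.0,
    "Claudia Puig": 4.5,
    "Jack Matthews": 3.0,
    "Gene Seymour": 3.0,
}

# Weighted average: sum(weight * rating) / sum(weight).
weighted_sum = sum(weights[name] * night_listener[name] for name in weights)
predicted = weighted_sum / sum(weights.values())
print(round(predicted, 2))  # 3.35
```

Because the result is divided by the sum of the weights, negative or near-zero similarities can distort the average; this sketch, like `recommend`, does not guard against that.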

In [13]:
recommend(df, "Michael Phillips")

[('Just My Luck', 2.963951538816175),
 ('You, Me and Dupree', 2.8153523713809516)]

## Cold Start

A cold start problem occurs when we do not have enough information about a new user to provide recommendations. Collaborative filtering cannot solve this on its own, because there are no ratings from which to compute similarity. What we can do instead is use existing information about the product to make recommendations, e.g. how popular or trending a product is.
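
As one possible illustration of a popularity-based fallback, the sketch below ranks movies by how many ratings they received, breaking ties by mean rating. The function name `popular_movies` and the ranking rule are assumptions made for this example, and the data is a small subset of the `critics` dictionary.

```python
from collections import defaultdict

# A small subset of the `critics` ratings, kept short for illustration.
ratings = {
    "Lisa Rose": {"Snakes on a Plane": 3.5, "Just My Luck": 3.0, "Superman Returns": 3.5},
    "Gene Seymour": {"Snakes on a Plane": 3.5, "Just My Luck": 1.5, "Superman Returns": 5.0},
    "Toby": {"Snakes on a Plane": 4.5, "Superman Returns": 4.0},
}


def popular_movies(ratings):
    """Rank movies by rating count, then by mean rating (hypothetical helper)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for movie, rating in user_ratings.items():
            totals[movie] += rating
            counts[movie] += 1

    stats = [(movie, counts[movie], totals[movie] / counts[movie]) for movie in counts]
    # Most-rated first; among equally rated movies, the higher mean wins.
    return sorted(stats, key=lambda t: (t[1], t[2]), reverse=True)


ranking = popular_movies(ratings)
for movie, count, mean in ranking:
    print(movie, count, round(mean, 2))
```

A new user with no ratings would simply be shown the top of this list until enough ratings are available to compute similarities.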


Below, we show one such approach: we convert the ratings to `like` and `dislike`, and calculate the probability that a user will like a movie.

We convert the ratings into like/dislike. Any rating of 2.5 or below is treated as a dislike. Note that unrated movies are stored as 0, so they are counted as dislikes too.

In [14]:
likes = np.where(df <= 2.5, 0, 1)
like_df = pd.DataFrame(likes, index=df.index, columns=df.columns)
like_df

Unnamed: 0,Lady in the Water,Snakes on a Plane,Just My Luck,Superman Returns,"You, Me and Dupree",The Night Listener
Lisa Rose,0,1,1,1,0,1
Gene Seymour,1,1,0,1,1,1
Michael Phillips,0,1,0,1,0,1
Claudia Puig,0,1,1,1,0,1
Mick LaSalle,1,1,0,1,0,1
Jack Matthews,1,1,0,1,1,1
Toby,0,1,0,1,0,0


In [15]:
like_df.count(axis=0)

Lady in the Water     7
Snakes on a Plane     7
Just My Luck          7
Superman Returns      7
You, Me and Dupree    7
The Night Listener    7
dtype: int64

In [None]:
like_df.sum(axis=0)

Lady in the Water     3
Snakes on a Plane     7
Just My Luck          2
Superman Returns      7
You, Me and Dupree    2
The Night Listener    6
dtype: int64

We can see that `Superman Returns` (like `Snakes on a Plane`) is liked by all 7 users. However, we cannot conclude that the probability of a new user liking it is 100%.

We can estimate the probability that a user will like it by adding two implicit votes, one like and one dislike, and seeing how the ratio changes (this is add-one, or Laplace, smoothing):

```
# For Superman Returns
prob = (7 + 1) / (7 + 2)
     ≈ 0.89
```
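
The effect of the two implicit votes is easiest to see by comparing movies with different numbers of ratings. The helper name `smoothed_like_probability` below is hypothetical and only wraps the formula shown above.

```python
def smoothed_like_probability(likes, total):
    """Add one implicit like and one implicit dislike before taking the ratio."""
    return (likes + 1) / (total + 2)


print(round(smoothed_like_probability(7, 7), 2))  # 0.89
print(round(smoothed_like_probability(1, 1), 2))  # 0.67
print(round(smoothed_like_probability(0, 0), 2))  # 0.5
```

A movie liked by 7 out of 7 users scores higher than one liked by its single rater, even though both have a raw like ratio of 100%, and a movie with no votes at all starts at 50%.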

In [None]:
sorted(
    ((like_df.sum(axis=0) + 1) / (like_df.count(axis=0) + 2)).items(),
    key=lambda t: t[1],
    reverse=True,
)