# KNN Item-Item Collaborative Filtering

Item-Item CF is similar to User-User CF but works from the item side. Instead of finding similar users, this method takes the items the current user has already interacted with, measures how similar each of them is to the target item, and predicts the rating as a similarity-weighted average of the user's actual ratings for those items.

As explained in [Introduction to recommender systems](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada):
> The user-user method is based on the search of similar users in terms of interactions with items. As, in general, every user have only interacted with a few items, it makes the method pretty sensitive to any recorded interactions (high variance). On the other hand, as the final recommendation is only based on interactions recorded for users similar to our user of interest, we obtain more personalized results (low bias).

> Conversely, the item-item method is based on the search of similar items in terms of user-item interactions. As, in general, a lot of users have interacted with an item, the neighbourhood search is far less sensitive to single interactions (lower variance). As a counterpart, interactions coming from every kind of users (even users very different from our reference user) are then considered in the recommendation, making the method less personalised (more biased). Thus, this approach is less personalized than the user-user approach but more robust.

In this notebook, we take a simple and straightforward approach to implementing KNN Item-Item CF. We formulate the problem as predicting the rating that user U would give to item I, based on the recorded ratings.

Specifically, we will:
- Collect input containing user-item ratings
- Build the item-item similarity matrix
- Measure similarities between items using their plain user rating vectors (cosine similarity)
- Predict the rating of user U for item I: take all items the current user has rated, identify the N nearest item neighbors among them, and compute the average of the user's ratings weighted by how similar each item is to the target item I
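
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `src.model`: the helper names (`build_user_item_matrix`, `cosine_item_similarity`, `predict_one`) are hypothetical, the sigmoid is assumed to be the standard logistic function, and for brevity the sketch uses every item the user has rated rather than only the N nearest neighbors.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def build_user_item_matrix(user_indices, item_indices, ratings, n_users, n_items):
    """Dense user x item matrix; 0 means "no rating recorded"."""
    matrix = np.zeros((n_users, n_items))
    matrix[user_indices, item_indices] = ratings
    return matrix


def cosine_item_similarity(user_item_matrix):
    """Cosine similarity between item columns, with self-similarity set to 0."""
    norms = np.linalg.norm(user_item_matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items without ratings
    normalized = user_item_matrix / norms
    similarity = normalized.T @ normalized
    np.fill_diagonal(similarity, 0.0)
    return similarity


def predict_one(user_item_matrix, item_similarity, user, item):
    """Sigmoid of the similarity-weighted average of the user's own ratings."""
    user_ratings = user_item_matrix[user]
    rated = user_ratings != 0
    weights = item_similarity[item, rated]
    if weights.sum() == 0:
        return sigmoid(0.0)  # no usable neighbor: fall back to a neutral 0.5
    return sigmoid(np.dot(weights, user_ratings[rated]) / weights.sum())


# Mock data (same values as the test cell later in this notebook)
user_indices = [0, 0, 1, 1, 2, 2, 2]
item_indices = [0, 1, 1, 2, 3, 1, 2]
ratings = [1, 4, 4, 5, 3, 2, 4]

ui = build_user_item_matrix(user_indices, item_indices, ratings, n_users=3, n_items=4)
sim = cosine_item_similarity(ui)
preds = [predict_one(ui, sim, u, i) for u, i in zip([0, 1, 2], [2, 2, 0])]
print(np.round(preds, 4))
```

On the mock data this gives roughly 0.982, 0.982 and 0.881, in line with the fitted model's predictions in the test cell later in this notebook.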

# Set up

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys

import mlflow
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from loguru import logger
from pydantic import BaseModel

load_dotenv()

sys.path.insert(0, "..")

from src.eval import (
    create_label_df,
    create_rec_df,
    log_classification_metrics,
    log_ranking_metrics,
    merge_recs_with_target,
)
from src.id_mapper import IDMapper
from src.math_utils import sigmoid
from src.model import Item2ItemCollaborativeFiltering
from src.train_utils import map_indice
from src.viz import blueq_colors

# Controller

In [3]:
class Args(BaseModel):
    testing: bool = False
    log_to_mlflow: bool = True
    experiment_name: str = "FSDS RecSys - L4 - Reco Algo"
    run_name: str = "003-cf-i2i"
    notebook_persist_dp: str = None
    random_seed: int = 41

    user_col: str = "user_id"
    item_col: str = "parent_asin"
    rating_col: str = "rating"
    timestamp_col: str = "timestamp"

    top_K: int = 100
    top_k: int = 10

    batch_size: int = 128

    def init(self):
        self.notebook_persist_dp = os.path.abspath(f"data/{self.run_name}")

        if not os.environ.get("MLFLOW_TRACKING_URI"):
            logger.warning(
                "Environment variable MLFLOW_TRACKING_URI is not set. Setting self.log_to_mlflow to false."
            )
            self.log_to_mlflow = False

        if self.log_to_mlflow:
            logger.info(
                f"Setting up MLflow experiment {self.experiment_name} - run {self.run_name}..."
            )

            mlflow.set_experiment(self.experiment_name)
            mlflow.start_run(run_name=self.run_name)

        return self


args = Args().init()

print(args.model_dump_json(indent=2))

2024-09-24 06:41:17.696 | INFO     | __main__:init:29 - Setting up MLflow experiment FSDS RecSys - L4 - Reco Algo - run 003-cf-i2i...


{
  "testing": false,
  "log_to_mlflow": true,
  "experiment_name": "FSDS RecSys - L4 - Reco Algo",
  "run_name": "003-cf-i2i",
  "notebook_persist_dp": "/home/jupyter/frostmourne/reco-algo/notebooks/data/003-cf-i2i",
  "random_seed": 41,
  "user_col": "user_id",
  "item_col": "parent_asin",
  "rating_col": "rating",
  "timestamp_col": "timestamp",
  "top_K": 100,
  "top_k": 10,
  "batch_size": 128
}


# Implement

In [4]:
def init_model(n_users, n_items):
    model = Item2ItemCollaborativeFiltering(n_users, n_items)
    return model

# Test implementation

In [5]:
# Mock data
user_indices = [0, 0, 1, 1, 2, 2, 2]
item_indices = [0, 1, 1, 2, 3, 1, 2]
ratings = [1, 4, 4, 5, 3, 2, 4]
n_users = len(set(user_indices))
n_items = len(set(item_indices))

val_user_indices = [0, 1, 2]
val_item_indices = [2, 1, 2]
val_ratings = [2, 4, 5]

print("Mock User IDs:", user_indices)
print("Mock Item IDs:", item_indices)
print("Ratings:", ratings)

model = init_model(n_users, n_items)

users = [0, 1, 2]
items = [2, 2, 0]
predictions = model.predict(users, items)
print(predictions)

Mock User IDs: [0, 0, 1, 1, 2, 2, 2]
Mock Item IDs: [0, 1, 1, 2, 3, 1, 2]
Ratings: [1, 4, 4, 5, 3, 2, 4]
[0.5 0.5 0.5]


In [6]:
model.fit(user_indices, item_indices, ratings)
predictions = model.predict(users, items)
print(predictions)

[0.98201379 0.98201379 0.88079708]


#### 🧐 Go into details

In [7]:
model.user_item_matrix.T

array([[1., 0., 0.],
       [4., 4., 2.],
       [0., 5., 4.],
       [0., 0., 3.]])

In [8]:
model.item_similarity

array([[0.        , 0.66666667, 0.        , 0.        ],
       [0.66666667, 0.        , 0.72881089, 0.33333333],
       [0.        , 0.72881089, 0.        , 0.62469505],
       [0.        , 0.33333333, 0.62469505, 0.        ]])
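
The off-diagonal entries above are consistent with the cosine similarity between two item rating vectors (the rows of `model.user_item_matrix.T`), with missing ratings treated as 0. The diagonal is 0, presumably so that an item never counts as its own neighbor. As a worked check for items 1 and 2, whose rating vectors are $(4, 4, 2)$ and $(0, 5, 4)$:

$$
\text{sim}(i, j) = \frac{\mathbf{r}_i \cdot \mathbf{r}_j}{\lVert \mathbf{r}_i \rVert \, \lVert \mathbf{r}_j \rVert},
\qquad
\text{sim}(1, 2) = \frac{4 \cdot 0 + 4 \cdot 5 + 2 \cdot 4}{\sqrt{4^2 + 4^2 + 2^2} \, \sqrt{0^2 + 5^2 + 4^2}} = \frac{28}{6 \sqrt{41}} \approx 0.7288
$$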

In [9]:
item = 3
user = 1

# Compute prediction using weighted average of ratings from similar items
sim_scores = model.item_similarity[item]
print(f"{sim_scores=}")

sim_scores=array([0.        , 0.33333333, 0.62469505, 0.        ])


In [10]:
# Only consider items that have been rated by the current user
item_ratings = model.user_item_matrix[user, :]
print(f"Ratings of current user for all items:\n{item_ratings=}")
sim_scores = sim_scores[item_ratings != 0]
print(
    f"Cosine similarity score of target item towards all other items where current user has rated:\n{sim_scores}"
)
item_ratings = item_ratings[item_ratings != 0]

Ratings of current user for all items:
item_ratings=array([0., 4., 5., 0.])
Cosine similarity score of target item towards all other items where current user has rated:
[0.33333333 0.62469505]


In [11]:
# Weighted average of ratings
print(f"Weighted sum: {np.dot(sim_scores, item_ratings)}")
print(f"Normalization factor: {np.sum(sim_scores)}")
print(f"Predicted rating: {np.dot(sim_scores, item_ratings) / np.sum(sim_scores)}")
print(
    f"Predicted rating - sigmoid: {sigmoid(np.dot(sim_scores, item_ratings) / np.sum(sim_scores))}"
)

Weighted sum: 4.456808571105455
Normalization factor: 0.9580283808877577
Predicted rating: 4.652063195638892
Predicted rating - sigmoid: 0.9905482923878774


In [12]:
recommendations = model.recommend(val_user_indices, k=2)


In [13]:
recommendations

{'user_indice': [0, 0, 1, 1, 2],
 'recommendation': [2, 3, 3, 0, 0],
 'score': [0.9820137900379085,
  0.9820137900379085,
  0.9905482923878774,
  0.9820137900379085,
  0.8807970779778823]}
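
The output above is consistent with a recommender that scores every item the user has not rated yet and keeps the `k` highest scores. The sketch below illustrates that reading. It is an assumption about the behavior, not the `recommend` method from `src.model`: `recommend_for_user` is a hypothetical helper, and the two matrices are copied from the cell outputs shown earlier.

```python
import numpy as np


def sigmoid(x):
    """Standard logistic function (assumed to match src.math_utils.sigmoid)."""
    return 1.0 / (1.0 + np.exp(-x))


def recommend_for_user(user_item_matrix, item_similarity, user, k):
    """Return up to k (item, score) pairs for items the user has not rated."""
    user_ratings = user_item_matrix[user]
    rated = user_ratings != 0
    # Weighted sum and total weight over the user's rated items, for all items at once
    weighted_sum = item_similarity[:, rated] @ user_ratings[rated]
    total_weight = item_similarity[:, rated].sum(axis=1)
    raw = np.divide(
        weighted_sum,
        total_weight,
        out=np.zeros_like(weighted_sum),
        where=total_weight > 0,  # items with no similar rated item keep a raw score of 0
    )
    scores = sigmoid(raw)
    candidates = np.flatnonzero(~rated)  # only items the user has not rated
    ranked = candidates[np.argsort(-scores[candidates], kind="stable")][:k]
    return [(int(i), float(scores[i])) for i in ranked]


# Matrices copied from the outputs of model.user_item_matrix and model.item_similarity
ui = np.array([[1, 4, 0, 0], [0, 4, 5, 0], [0, 2, 4, 3]], dtype=float)
sim = np.array(
    [
        [0.0, 0.66666667, 0.0, 0.0],
        [0.66666667, 0.0, 0.72881089, 0.33333333],
        [0.0, 0.72881089, 0.0, 0.62469505],
        [0.0, 0.33333333, 0.62469505, 0.0],
    ]
)

recs = {user: recommend_for_user(ui, sim, user, k=2) for user in range(3)}
for user, pairs in recs.items():
    print(user, pairs)
```

User 2 receives a single recommendation because item 0 is the only item that user has not rated.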

# Prep data

In [14]:
train_df = pd.read_parquet("../data/train_features_neg_df.parquet")
val_df = pd.read_parquet("../data/val_features_neg_df.parquet")
idm = IDMapper().load("../data/idm.json")
# val_timestamp = 1628643414042  # https://amazon-reviews-2023.github.io/data_processing/5core.html
assert (val_df[args.timestamp_col].min() - train_df[args.timestamp_col].max()) > 0
val_timestamp = train_df[args.timestamp_col].max() + 1
print(f"{val_timestamp=}")

val_timestamp=np.int64(1628641464793)


In [15]:
user_ids = train_df[args.user_col].values
item_ids = train_df[args.item_col].values
unique_user_ids = list(set(user_ids))
unique_item_ids = list(set(item_ids))
n_users = len(unique_user_ids)
n_items = len(unique_item_ids)

logger.info(f"{len(unique_user_ids)=:,.0f}, {len(unique_item_ids)=:,.0f}")

2024-09-24 06:41:22.250 | INFO     | __main__:<module>:8 - len(unique_user_ids)=20,366, len(unique_item_ids)=4,696


In [16]:
train_df = train_df.pipe(map_indice, idm, args.user_col, args.item_col)
val_df = val_df.pipe(map_indice, idm, args.user_col, args.item_col)

user_indices = [idm.get_user_index(user_id) for user_id in user_ids]
item_indices = [idm.get_item_index(item_id) for item_id in item_ids]
ratings = train_df[args.rating_col].values.tolist()

val_user_indices = [idm.get_user_index(user_id) for user_id in val_df[args.user_col]]
val_item_indices = [idm.get_item_index(item_id) for item_id in val_df[args.item_col]]
val_ratings = val_df[args.rating_col].values.tolist()

# Train

In [17]:
model = init_model(n_users, n_items)

#### Predict before train

In [18]:
user_id = val_df.sample(1)[args.user_col].values[0]
test_df = val_df.loc[lambda df: df[args.user_col].eq(user_id)]
test_df

|  | user_id | parent_asin | rating | timestamp | user_indice | item_indice | main_category | title | description | categories | price | item_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | AHMJVCKVHJIT2R5NWWV4HG4TDH6A | B007W8S2MG | 0.0 | 1643407022270 | 11374 | 1775 | Video Games | Persona 4 Golden - PlayStation Vita | [From the Manufacturer, Following in the foots... | [Video Games, Legacy Systems, PlayStation Syst... | 127.95 | [3834, 4416, 4552, 3762, 2860, 4586, 2254, 211... |
| 132 | AHMJVCKVHJIT2R5NWWV4HG4TDH6A | B07C2XYDW8 | 5.0 | 1643407022270 | 11374 | 2483 | Video Games | Dynasty Warriors 9 - Xbox One | [In Dynasty Warriors 9, you will experience an... | [Video Games, Xbox One, Games] | 53.26 | [3834, 4416, 4552, 3762, 2860, 4586, 2254, 211... |


In [19]:
item_id = test_df.loc[lambda df: df[args.rating_col].gt(0)][args.item_col].values[0]
logger.info(
    f"Test predicting before training with {args.user_col} = {user_id} and {args.item_col} = {item_id}"
)
user_indice = idm.get_user_index(user_id)
item_indice = idm.get_item_index(item_id)

model.predict([user_indice], [item_indice])

2024-09-24 06:41:22.952 | INFO     | __main__:<module>:2 - Test predicting before training with user_id = AHMJVCKVHJIT2R5NWWV4HG4TDH6A and parent_asin = B07C2XYDW8


array([0.5])

#### Training loop

In [20]:
model.fit(user_indices, item_indices, ratings)

# Predict

In [21]:
logger.info(
    f"Test predicting after training with {args.user_col} = {user_id} and {args.item_col} = {item_id}"
)
model.predict([user_indice], [item_indice])

2024-09-24 06:41:29.812 | INFO     | __main__:<module>:1 - Test predicting after training with user_id = AHMJVCKVHJIT2R5NWWV4HG4TDH6A and parent_asin = B07C2XYDW8


array([0.99330715])

# Evaluate

## Ranking metrics

In [22]:
recommendations = model.recommend(val_user_indices, k=args.top_K)


In [23]:
recommendations_df = pd.DataFrame(recommendations).pipe(create_rec_df, idm)
recommendations_df

|  | user_indice | recommendation | score | rec_ranking | user_id | parent_asin |
|---|---|---|---|---|---|---|
| 0 | 12853 | 16 | 0.993307 | 1.0 | AGXTHABHPC3XO4VAMCFM2TQR3GFQ | B0050SYY5E |
| 1 | 12853 | 4095 | 0.993307 | 2.0 | AGXTHABHPC3XO4VAMCFM2TQR3GFQ | B00KUZEFBK |
| 2 | 12853 | 1008 | 0.993307 | 3.0 | AGXTHABHPC3XO4VAMCFM2TQR3GFQ | B00N2KKSNO |
| 3 | 12853 | 1224 | 0.993307 | 4.0 | AGXTHABHPC3XO4VAMCFM2TQR3GFQ | B00CMQTVUA |
| 4 | 12853 | 292 | 0.993307 | 5.0 | AGXTHABHPC3XO4VAMCFM2TQR3GFQ | B0037Z0HEE |
| ... | ... | ... | ... | ... | ... | ... |
| 189795 | 14633 | 4027 | 0.993307 | 196.0 | AHAKU6TTWIHJPZIODW7MGC52M2DA | B079FPFV3X |
| 189796 | 14633 | 4034 | 0.993307 | 197.0 | AHAKU6TTWIHJPZIODW7MGC52M2DA | B07NQNQ7WN |
| 189797 | 14633 | 4078 | 0.993307 | 198.0 | AHAKU6TTWIHJPZIODW7MGC52M2DA | B00HM3QANO |
| 189798 | 14633 | 4051 | 0.993307 | 199.0 | AHAKU6TTWIHJPZIODW7MGC52M2DA | B00JM3R6M6 |
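
Every score visible in the table above is 0.993307. That value matches the logistic sigmoid evaluated at 5, which is what a weighted average of ratings produces whenever all of the ratings being averaged are 5. The short check below assumes `src.math_utils.sigmoid` is the standard logistic function. It shows how tightly the sigmoid compresses the 1 to 5 rating scale, which makes tied scores likely.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function (assumed to match src.math_utils.sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


# Sigmoid of each possible rating value on a 1-5 scale
scores = {rating: sigmoid(rating) for rating in range(1, 6)}
for rating, score in scores.items():
    print(rating, round(score, 6))
# 1 0.731059
# 2 0.880797
# 3 0.952574
# 4 0.982014
# 5 0.993307
```

Ratings of 4 and 5 end up only about 0.011 apart after the sigmoid, so ranking by the raw weighted average instead of its sigmoid would give the same order while keeping the scores easier to tell apart.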


In [24]:
label_df = create_label_df(val_df)
label_df

|  | user_id | parent_asin | rating | rating_rank |
|---|---|---|---|---|
| 1727 | AEOY2365QPPEVDTOXL6N7ZA4NSAA | B00PDRZG9U | 5.0 | 1.0 |
| 451 | AFGHX4VLP6P5XORLDJX3LZKUAAZA | B00Z9TJBUW | 5.0 | 1.0 |
| 204 | AFCH2PDOFM2S3622QFV6PHCHGMCA | B00KSQHX1K | 5.0 | 1.0 |
| 1344 | AEURBISVS35ALE7YQLR5L4K7AHCA | B07QQ8N7LL | 1.0 | 1.0 |
| 334 | AEMA3SW3WPNLEH3IACW23K2ZSUFA | B09JDLC31H | 4.0 | 1.0 |
| ... | ... | ... | ... | ... |
| 1332 | AFB6FYPPCN33UMUU5536IHXNOHCQ | B002I0K3CK | 0.0 | 18.0 |
| 910 | AESD4RLWUKM6JTD6SNNWYLHLLQQA | B07BLRF329 | 0.0 | 18.0 |
| 695 | AG4RCXKPTC6QRORJLUSBY4SO2IAA | B001ELJDWA | 0.0 | 18.0 |
| 1177 | AFB6FYPPCN33UMUU5536IHXNOHCQ | B003S6N7OO | 0.0 | 19.0 |


In [25]:
eval_df = merge_recs_with_target(recommendations_df, label_df, k=args.top_K)
eval_df

|  | user_indice | recommendation | score | rec_ranking | user_id | parent_asin | rating | rating_rank |
|---|---|---|---|---|---|---|---|---|
| 80 | 9912.0 | 2380.0 | 0.993307 | 1 | AE2AZ2MNROPF33U6SS53VI22OXJA | B008J35YFQ | 0 |  |
| 38 | 9912.0 | 1640.0 | 0.993307 | 2 | AE2AZ2MNROPF33U6SS53VI22OXJA | B001EYUQKU | 0 |  |
| 96 | 9912.0 | 2363.0 | 0.993307 | 3 | AE2AZ2MNROPF33U6SS53VI22OXJA | B00GH7UA32 | 0 |  |
| 152 | 9912.0 | 1561.0 | 0.993307 | 4 | AE2AZ2MNROPF33U6SS53VI22OXJA | B0764PLDQV | 0 |  |
| 6 | 9912.0 | 2552.0 | 0.993307 | 5 | AE2AZ2MNROPF33U6SS53VI22OXJA | B00005R5PO | 0 |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191628 | 2049.0 | 766.0 | 0.993307 | 196 | AHZNHP6OKXRZV2UJMYDPLWCKFKEA | B09V3885Y3 | 0 |  |
| 191555 | 2049.0 | 765.0 | 0.993307 | 197 | AHZNHP6OKXRZV2UJMYDPLWCKFKEA | B00MOR0FLG | 0 |  |
| 191591 | 2049.0 | 764.0 | 0.993307 | 198 | AHZNHP6OKXRZV2UJMYDPLWCKFKEA | B01953Z1QA | 0 |  |
| 191533 | 2049.0 | 3573.0 | 0.993307 | 199 | AHZNHP6OKXRZV2UJMYDPLWCKFKEA | B00DR8V74A | 0 |  |


In [26]:
ranking_report = log_ranking_metrics(args, eval_df)



## Classification metrics

In [27]:
val_user_indices = val_df["user_indice"].values
val_item_indices = val_df["item_indice"].values

In [28]:
classifications = model.predict(val_user_indices, val_item_indices)

In [29]:
eval_classification_df = val_df.assign(
    classification_proba=classifications,
    label=lambda df: df[args.rating_col].gt(0).astype(int),
)
eval_classification_df

|  | user_id | parent_asin | rating | timestamp | user_indice | item_indice | main_category | title | description | categories | price | item_sequence | classification_proba | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AGXTHABHPC3XO4VAMCFM2TQR3GFQ | B00TEDK8FQ | 0.0 | 1643101921864 | 12853 | 2636 | Video Games | Ortz PS4 Vertical Stand with Cooling Fan [Keep... | [] | [Video Games, PlayStation 4, Accessories, Case... |  | [111, 3920, 3879, 3261, 3402, 1230, 2239, 3974... | 0.985923 | 0 |
| 1 | AESD4RLWUKM6JTD6SNNWYLHLLQQA | B07BMRGKX2 | 0.0 | 1653590691326 | 141 | 253 | Video Games | Agony - PlayStation 4 | [Agony is a first-person, survival horror game... | [Video Games, PlayStation 4, Games] | 28.0 | [99, 4672, 4434, 1551, 1561, 2497, 3615, 3196,... | 0.992141 | 0 |
| 2 | AEXFEQ7QOP6EHDEZ3K6NN27MQ7KA | B0774N9JKW | 0.0 | 1651718479413 | 13969 | 1217 | Video Games | Sword Art Online: Hollow Realization - PlaySta... | ["Link start" into SWORD ART ONLINE -Hollow Re... | [Video Games, PlayStation 4, Games] | 18.11 | [1016, 3999, 2944, 742, 3161, 3580, 2267, 3623... | 0.500000 | 0 |
| 3 | AGDAPPCYV472FOUKDGAHZRW766GA | B07B416X7V | 0.0 | 1649310595659 | 18594 | 3519 | Video Games | Burnout Paradise Remastered - Xbox One [Digita... | [Make action your middle name as you rule the ... | [Video Games, Xbox One, Games] |  | [3097, 2497, 3203, 3937, 3803, 2323, 2310, 751... | 0.500000 | 0 |
| 4 | AG2KBJG5DMEIISPJVF3OVMRB4ALA | B001D8Q5MA | 0.0 | 1636861380056 | 5042 | 3122 | Video Games | Grand Theft Auto IV [Online Game Code] | [From the Manufacturer, What does the American... | [Video Games, PC] |  | [-1, -1, -1, -1, -1, 3602, 22, 1193, 2486, 1293] | 0.974227 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1893 | AGVBR47VU2BE4EVWFAXOO26SOWSA | B0C39GFK7P | 1.0 | 1647361584062 | 388 | 4128 | Computers | Logitech G640 Large Cloth Gaming Mouse Pad, Op... | [The cloth surface of G640 provides ideal surf... | [Video Games, PC, Accessories, Gaming Mice] | 29.99 | [1938, 2662, 262, 3903, 3610, 1896, 1372, 4160... | 0.855999 | 1 |
| 1894 | AFUWPAK6VCGEL2OVIL2YGZNFQJZQ | B08N6NCR3Q | 4.0 | 1642699950266 | 4205 | 3269 | Video Games | Thrustmaster T 16000M SPACE SIM DUO STICK (PC) | [The THRUSTMASTER T.16000M FCS Space Sim Duo c... | [Video Games, PC, Accessories, Controllers, Fl... | 119.51 | [-1, -1, -1, -1, 1058, 3558, 377, 1187, 2169, ... | 0.993307 | 1 |
| 1895 | AFH63KLSVQQYRNFS7NLQGD3GSP3A | B094YHB1QK | 5.0 | 1652564728981 | 20004 | 4190 | Video Games | PlayStation DualSense Wireless Controller – Ga... | [Plot a course for astronomical adventures on ... | [Video Games, PlayStation 5, Accessories, Cont... | 74.99 | [-1, 832, 3126, 1490, 4335, 2035, 1270, 605, 3... | 0.993307 | 1 |
| 1896 | AFPPTJOEUPVXA5C63SNRGID3EQNA | B0BVVTQ5JP | 4.0 | 1635968491390 | 4984 | 2257 | Computers | Logitech G502 HERO High Performance Wired Gami... | [Logitech updated its iconic G502 gaming mouse... | [Video Games, PC, Accessories, Gaming Mice] | 45.87 | [-1, -1, -1, -1, -1, 4269, 4366, 396, 3060, 464] | 0.973120 | 1 |


In [30]:
classification_report = log_classification_metrics(
    args,
    eval_classification_df,
    target_col="label",
    prediction_col="classification_proba",
)


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





# Clean up

In [31]:
all_params = [args]

if args.log_to_mlflow:
    for params in all_params:
        params_dict = params.model_dump()  # Pydantic v2 replacement for the deprecated .dict()
        params_ = {f"{params.__repr_name__()}.{k}": v for k, v in params_dict.items()}
        mlflow.log_params(params_)

    mlflow.end_run()

2024/09/24 06:41:39 INFO mlflow.tracking._tracking_service.client: 🏃 View run 003-cf-i2i at: http://localhost:5003/#/experiments/2/runs/0a87ffb1a39a4550ad421945c7ead14b.
2024/09/24 06:41:39 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5003/#/experiments/2.


# Appendix

## Model returning same score for every user-item in top 100