# Item-based Collaborative Filtering

Core idea
“If two movies get similar rating patterns from many users, then someone who liked one of those movies will probably like the other as well.”

How it works
  1. For every movie the target user has rated, find similar movies (e.g., by cosine similarity of rating vectors).
  2. Score each similar movie, weighting it by how highly the user rated the original movie and by how strong the similarity is.
  3. Rank the unseen movies by the aggregated scores.
  4. Recommend the top-ranked ones to the user.
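
The steps above can be written compactly. The notation here is an assumption introduced for illustration: $\mathbf{r}_i$ is the vector of ratings that movie $i$ received across users, $I_u$ is the set of movies user $u$ has rated, and $r_{u,j}$ is the rating user $u$ gave movie $j$. This is one common formulation, not necessarily the exact one a given library uses.

```latex
\mathrm{sim}(i,j) = \frac{\mathbf{r}_i \cdot \mathbf{r}_j}{\lVert \mathbf{r}_i \rVert \, \lVert \mathbf{r}_j \rVert},
\qquad
\mathrm{score}(u,i) = \sum_{j \in I_u} \mathrm{sim}(i,j)\, r_{u,j}
```

Unseen movies are then ranked by $\mathrm{score}(u,i)$. A rating-prediction variant divides the sum by $\sum_{j \in I_u} \lvert \mathrm{sim}(i,j) \rvert$, which turns the score into a weighted average on the rating scale.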

Example
Many users who liked Inception also liked Interstellar and The Matrix.
Alice rated Inception and The Matrix highly but hasn’t watched Interstellar.
Because both of Alice’s liked movies point to Interstellar as a close neighbour, the system recommends Interstellar to Alice.
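
To make the example concrete, here is a minimal NumPy sketch. The rating matrix is hypothetical toy data (it is not taken from the dataset used below), and the helper names `cosine_item_similarity` and `score_unseen` are made up for this illustration. It uses an unnormalised similarity-weighted sum as the aggregated score.

```python
import numpy as np

# Hypothetical toy ratings (rows: users, columns: movies); 0 means "not rated".
movies = ["Inception", "The Matrix", "Interstellar", "Titanic"]
ratings = np.array([
    [5.0, 4.0, 5.0, 1.0],   # user A
    [4.0, 5.0, 4.0, 0.0],   # user B
    [1.0, 0.0, 2.0, 5.0],   # user C
    [5.0, 5.0, 0.0, 0.0],   # Alice: has not rated Interstellar or Titanic
])

def cosine_item_similarity(r):
    """Cosine similarity between the rating columns (one column per movie)."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unrated movies
    unit = r / norms
    sim = unit.T @ unit
    np.fill_diagonal(sim, 0.0)       # a movie is not its own neighbour
    return sim

def score_unseen(r, sim, user):
    """Similarity-weighted sum of the user's ratings; rated movies are masked out."""
    rated = r[user] > 0
    scores = sim[:, rated] @ r[user, rated]
    scores[rated] = -np.inf
    return scores

sim = cosine_item_similarity(ratings)
scores = score_unseen(ratings, sim, user=3)   # row 3 is Alice
best = movies[int(np.argmax(scores))]
print(best)
```

In this toy data Interstellar is strongly similar to both movies Alice rated, while Titanic is only weakly similar to them, so Interstellar gets the highest aggregated score.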

In [1]:
# Load datasets
import pandas as pd
movies = pd.read_csv("../data/csv/movies.csv")
ratings = pd.read_csv("../data/csv/ratings.csv")

In [2]:
# Merge ratings with movie titles
movies_ratings = ratings.merge(movies[['movieId', 'title', 'genres']], on='movieId', how='left')

print(movies_ratings.shape)
movies_ratings.head()

(25000095, 6)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,296,5.0,1147880044,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
1,1,306,3.5,1147868817,Three Colors: Red (Trois couleurs: Rouge) (1994),Drama
2,1,307,5.0,1147868828,Three Colors: Blue (Trois couleurs: Bleu) (1993),Drama
3,1,665,5.0,1147878820,Underground (1995),Comedy|Drama|War
4,1,899,3.5,1147868510,Singin' in the Rain (1952),Comedy|Musical|Romance


## Option 1: Filter to “Active” Users and/or “Popular” Movies

We filter because the full dataset (about 25 million ratings) is too computationally expensive to process on a personal laptop.

In [None]:
# Keep users with at least 500 ratings
user_counts = movies_ratings['userId'].value_counts()
active_users = user_counts[user_counts >= 500].index

# Keep movies with at least 1000 ratings
movie_counts = movies_ratings['movieId'].value_counts()
popular_movies = movie_counts[movie_counts >= 1000].index

# Filter the DataFrame
movies_ratings_filtered = movies_ratings[
    movies_ratings['userId'].isin(active_users) &
    movies_ratings['movieId'].isin(popular_movies)
]

print(movies_ratings_filtered.shape)
movies_ratings_filtered.head()

(2305789, 6)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
23893,187,1,3.5,1277374478,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
23894,187,2,3.5,1277374864,Jumanji (1995),Adventure|Children|Fantasy
23895,187,3,3.0,1277839361,Grumpier Old Men (1995),Comedy|Romance
23897,187,19,4.5,1277373060,Ace Ventura: When Nature Calls (1995),Comedy
23898,187,32,3.5,1277372429,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
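
Note that the filter above counts ratings on the unfiltered data, so after restricting to active users a "popular" movie can end up with fewer than 1000 ratings (in the `item_stats()` output further down, item 4 has 292). If both thresholds should hold on the final data, one option is to repeat the filter until nothing changes. The helper below is a hypothetical sketch of that idea (`k_core_filter` is not part of pandas or LensKit), shown on toy data.

```python
import pandas as pd

def k_core_filter(df, min_user_ratings, min_item_ratings,
                  user_col="userId", item_col="movieId"):
    """Repeatedly drop sparse users and items until both thresholds hold."""
    while True:
        # Per-row counts of how often that row's user / item occurs in df
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= min_user_ratings) & (item_counts >= min_item_ratings)
        if keep.all():
            return df
        df = df[keep]

# Toy interactions: user 3 and movie 30 are too sparse and get removed
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
})
filtered = k_core_filter(toy, min_user_ratings=2, min_item_ratings=2)
print(filtered)
```

This shrinks the data further than the single-pass filter, so whether it is worth it depends on how strictly the thresholds need to hold.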


### Lenskit implementation

In [4]:
from lenskit.data import from_interactions_df

# convert df to a Dataset (new in LensKit 2025.2.0)
# https://lkpy.lenskit.org/stable/guide/data/
lk_dataset = from_interactions_df(movies_ratings_filtered, 
                                   user_col='userId', 
                                   item_col='movieId', 
                                   rating_col='rating', 
                                   timestamp_col='timestamp')
# Inspect the interactions as a pandas DataFrame (users and items are renumbered to 0-based codes)
pd_lk_dataset = lk_dataset.interaction_matrix(format='pandas')
pd_lk_dataset

Unnamed: 0,user_num,item_num,rating,timestamp,title,genres
0,0,0,3.5,1277374478,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,0,1,3.5,1277374864,Jumanji (1995),Adventure|Children|Fantasy
2,0,2,3.0,1277839361,Grumpier Old Men (1995),Comedy|Romance
3,0,16,4.5,1277373060,Ace Ventura: When Nature Calls (1995),Comedy
4,0,27,3.5,1277372429,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller
...,...,...,...,...,...,...
2305784,2674,2106,4.0,1545875212,"Three Billboards Outside Ebbing, Missouri (2017)",Crime|Drama
2305785,2674,2107,3.5,1546134016,Coco (2017),Adventure|Animation|Children
2305786,2674,2108,3.5,1537240233,Star Wars: The Last Jedi (2017),Action|Adventure|Fantasy|Sci-Fi
2305787,2674,2110,3.5,1549163417,Deadpool 2 (2018),Action|Comedy|Sci-Fi


In [5]:
# We can also get some statistics from the Dataset object
lk_dataset.item_stats()
# lk_dataset.user_stats()



item_id,record_count,user_count,rating_count,mean_rating,count,first_time,last_time
1,2421,2421,2421,3.868030,2421,855002560,1573921592
2,1928,1928,1928,3.022822,1928,854478093,1573622111
3,932,932,932,2.774678,932,855000435,1573255519
4,292,292,292,2.467466,292,861979082,1566146945
5,944,944,944,2.574153,944,854394351,1572444967
...,...,...,...,...,...,...,...
177765,480,480,480,3.843750,480,1510507326,1573921425
179819,696,696,696,3.431753,696,1513174120,1573951151
187541,434,434,434,3.573733,434,1528981148,1573951136
187593,568,568,568,3.617077,568,1526426308,1574008997


In [6]:
# split into test and train sets
from lenskit.splitting import sample_users, SampleFrac

# DOCS: https://lkpy.lenskit.org/stable/api/lenskit.splitting.sample_users
# Sample 1000 test users; for each of them, SampleFrac(0.2) holds out a random
# 20% of their interactions for testing and the rest stay in the training set.
# len(split.test) therefore counts test users, not held-out interactions.
split = sample_users(lk_dataset, rng=42, method=SampleFrac(0.2), size=1000)

print(f"Train size: {split.train.interaction_count}, Test size: {len(split.test)}")

Train size: 2132486, Test size: 1000
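
To illustrate what a per-user fractional holdout does, here is a plain-pandas sketch. It is not LensKit's implementation: the function name `holdout_per_user` is hypothetical, and unlike `sample_users` it holds out ratings for every user instead of first sampling a subset of test users. It assumes the DataFrame has a unique index.

```python
import pandas as pd

def holdout_per_user(df, frac=0.2, seed=42, user_col="userId"):
    """Hold out a random fraction of each user's rows as test data."""
    test = df.groupby(user_col, group_keys=False).sample(frac=frac, random_state=seed)
    train = df.drop(test.index)   # everything not held out stays in training
    return train, test

# Toy data: two users with five ratings each
toy = pd.DataFrame({
    "userId":  [1] * 5 + [2] * 5,
    "movieId": list(range(10, 15)) + list(range(20, 25)),
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.5, 5.0, 2.5],
})
train, test = holdout_per_user(toy, frac=0.2)
print(len(train), len(test))
```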


In [7]:
# Build recommendation pipeline and train
from lenskit.knn import ItemKNNScorer
from lenskit.pipeline import RecPipelineBuilder
from lenskit.basic import UnratedTrainingItemsCandidateSelector
from lenskit import recommend

# 1. Initialize the pipeline builder
# DOCS: https://lkpy.lenskit.org/stable/api/pipeline.html#
builder = RecPipelineBuilder()

# 2. Add the item-item CF scoring model 
# DOCS: https://lkpy.lenskit.org/stable/api/lenskit.knn.item.html#lenskit.knn.item.ItemKNNScorer
scorer = ItemKNNScorer(k=20) 
builder.scorer(scorer)
# Training described: https://github.com/lenskit/lkpy/blob/16e5fc7dc8056dc3c55d2349c7bfa21565f4fe40/src/lenskit/knn/item.py#L131

# 3. Set the candidate selector to filter out items the user has rated
builder.candidate_selector(UnratedTrainingItemsCandidateSelector())
# DOCS: https://lkpy.lenskit.org/stable/api/lenskit.basic.html#lenskit.basic.UnratedTrainingItemsCandidateSelector

# 4. Set the ranker to produce Top-N recommendations (e.g., Top-10)
builder.ranker(n=10) 

# 5. Build the pipeline
pipe = builder.build("Simple ItemKNN Pipeline")

# 6. Train
pipe.train(split.train)



In [8]:
# batch recommend to users in test set
from lenskit.batch import recommend as batch_recommend

# https://lkpy.lenskit.org/stable/guide/batch
rec = batch_recommend(pipe, list(split.test.keys()), n=10) 

In [26]:
# define functions to measure performance
from lenskit.metrics import RunAnalysis, Precision, Recall, Hit, NDCG
from sklearn.metrics import mean_squared_error
from lenskit.data import ItemListCollection, UserIDKey

analysis = RunAnalysis()
analysis.add_metric(Precision())
analysis.add_metric(Recall())
analysis.add_metric(NDCG())
analysis.add_metric(Hit())

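def measure_performance(test: ItemListCollection, rec: ItemListCollection[UserIDKey]):
  """Return a one-row DataFrame of mean top-N metrics plus MSE / RMSE.

  MSE and RMSE are computed only on (user, item) pairs that appear in both
  the recommendations and the test data, so they cover recommended items only.
  """
  df_rec = rec.to_df()
  df_test = test.to_df()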

  # keep only the columns we need and join on user & item
  hits = (
    df_test[['user_id', 'item_id', 'rating']]
      .merge(df_rec[['user_id', 'item_id', 'score']],
            on=['user_id', 'item_id'],
            how='inner')          # drop pairs without predictions
  )

  mse  = mean_squared_error(hits['rating'], hits['score'])
  rmse = mse ** 0.5 

  # Measure the recommendations against the test data
  results = analysis.measure(rec, test)
  metrics = results.list_metrics().mean()             # Series: metric → mean value

  # build single-row DataFrame and append MSE / RMSE
  df = metrics.to_frame().T                        # rows → columns
  df['MSE']  = mse
  df['RMSE'] = rmse
  return df

measure_performance(split.test, rec)


Unnamed: 0,Precision,Recall,NDCG,Hit,MSE,RMSE
0,0.493084,0.028456,0.085661,0.949533,0.282754,0.531746
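
For intuition about the ranking metrics in this table, the sketch below computes precision, recall, and hit for a single user's top-N list. It follows the textbook definitions; the function name `topn_metrics` is hypothetical and LensKit's implementations may treat edge cases differently.

```python
def topn_metrics(recommended, relevant):
    """Precision, recall and hit indicator for one user's recommendation list."""
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    hit = 1.0 if hits > 0 else 0.0
    return precision, recall, hit

# Toy example: 4 recommendations, 8 held-out test items, 2 of them recommended
precision, recall, hit = topn_metrics([1, 2, 3, 4], [2, 4, 9, 10, 11, 12, 13, 14])
print(precision, recall, hit)
```

With only ten recommendations per user, recall cannot exceed 10 divided by the number of held-out items, which would be consistent with a low recall next to a fairly high precision when users have many test ratings.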


In [10]:
# TODO: input: userid and interactions df
# TODO: output: movieID, title, genres

# test recommendations for a specific user
user_id = lk_dataset.users.index[3]
rec = recommend(pipe, user_id, n=10)
df_rec = rec.to_df()

output_columns = ['movieId', 'title', 'genres']

print("Recommendations for user", user_id)
user_rec = df_rec.merge(
  movies[output_columns],   # just the needed cols
  left_on='item_id',
  right_on='movieId',
  how='left'
)[output_columns]

# Movies the user has already seen
seen = movies_ratings[movies_ratings['userId'] == user_id].sort_values('rating', ascending=False)[output_columns]

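# Which recommendations accidentally overlap (should be empty!)
# Compare against the movieId column; isin(seen) on the whole DataFrame would test against its column names
rec_seen = user_rec[user_rec['movieId'].isin(seen['movieId'])]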

print('Already seen recommendations (should be empty):\n', rec_seen)
assert rec_seen.empty, 'Candidate selector failed - user got already-seen movies'

user_rec

Recommendations for user 548
Already seen recommendations (should be empty):
 Empty DataFrame
Columns: [movieId, title, genres]
Index: []


Unnamed: 0,movieId,title,genres
0,858,"Godfather, The (1972)",Crime|Drama
1,1201,"Good, the Bad and the Ugly, The (Buono, il bru...",Action|Adventure|Western
2,904,Rear Window (1954),Mystery|Thriller
3,2019,Seven Samurai (Shichinin no samurai) (1954),Action|Adventure|Drama
4,1203,12 Angry Men (1957),Drama
5,908,North by Northwest (1959),Action|Adventure|Mystery|Romance|Thriller
6,922,Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),Drama|Film-Noir|Romance
7,1193,One Flew Over the Cuckoo's Nest (1975),Drama
8,912,Casablanca (1942),Drama|Romance
9,954,Mr. Smith Goes to Washington (1939),Drama


In [11]:
seen

Unnamed: 0,movieId,title,genres
68476,318,"Shawshank Redemption, The (1994)",Crime|Drama
69493,7022,Battle Royale (Batoru rowaiaru) (2000),Action|Drama|Horror|Thriller
69076,4011,Snatch (2000),Comedy|Crime|Thriller
68700,1748,Dark City (1998),Adventure|Film-Noir|Sci-Fi|Thriller
69584,8533,"Notebook, The (2004)",Drama|Romance
...,...,...,...
69561,7976,Ken Park (2002),Drama
68635,1431,Beverly Hills Ninja (1997),Action|Comedy
70009,52715,Kickin It Old Skool (2007),Comedy
69622,8906,Cannibal Holocaust (1980),Horror


In [12]:
# TODO: check recommending based on passing movies df to recommend 

# Cross Validation

In [13]:
# Base for pipeline


In [45]:
# perform a crossfold-validation 
from collections import defaultdict
from lenskit.data import MutableItemListCollection, UserIDKey
from lenskit.splitting import crossfold_users

# why only tuning this hyperparameter: https://lkpy.lenskit.org/stable/api/lenskit.knn.item.html#lenskit.knn.item.ItemKNNScorer.train
param_grid = [1e-6, 0.01, 0.05, 0.1, 0.5, 0.9] # 1e-6 is default
results = defaultdict(list) 

# https://lkpy.lenskit.org/stable/api/lenskit.splitting.crossfold_users.html#lenskit.splitting.crossfold_users
folds = list(crossfold_users(lk_dataset, partitions=5, method=SampleFrac(0.2), rng=42))

for p in param_grid:
  all_test = MutableItemListCollection(UserIDKey)
  all_rec = MutableItemListCollection(UserIDKey)
  print(f'\n=== min_sim = {p} ===')

  # Build a fresh pipeline for this min_sim value; it is cloned and retrained for each fold
  builder = RecPipelineBuilder()
  builder.candidate_selector(UnratedTrainingItemsCandidateSelector())
  builder.ranker(n=10) 
  scorer = ItemKNNScorer(min_sim=p) 
  builder.scorer(scorer)
  pipe = builder.build(f"CV ItemKNN Pipeline {p}")

  for f, split in enumerate(folds):
      print(f"=== fold {f} ===")
      print(f"=== Train size: {split.train.interaction_count}, Test size: {len(split.test)} ===")

      algo = pipe.clone()
      algo.train(split.train)

      # Generate top-10 recommendations for each user in the test set of this fold
      user_ids = [k.user_id for k in split.test.keys()]
      print(f"=== Generating recommendations for {len(user_ids)} users ===")
      rec = batch_recommend(algo, user_ids, n=10)

      # Pool this fold's test data and recommendations with those of the other folds
      all_test.add_from(split.test)
      all_rec.add_from(rec)
      print(f"=== recommendations: {len(rec)} ===")

  results[p].append({
      'test': all_test,
      'rec': all_rec
  })



=== min_sim = 1e-06 ===
=== fold 0 ===
=== Train size: 2213110, Test size: 535 ===




=== Generating recommendations for 535 users ===
=== recommendations: 535 ===
=== fold 1 ===
=== Train size: 2213899, Test size: 535 ===
=== Generating recommendations for 535 users ===
=== recommendations: 535 ===
=== fold 2 ===
=== Train size: 2213502, Test size: 535 ===
=== Generating recommendations for 535 users ===
=== recommendations: 535 ===
=== fold 3 ===
=== Train size: 2214799, Test size: 535 ===
=== Generating recommendations for 535 users ===
=== recommendations: 535 ===
=== fold 4 ===
=== Train size: 2212513, Test size: 535 ===
=== Generating recommendations for 535 users ===
=== recommendations: 535 ===

=== min_sim = 0.01 ===
=== fold 0 ===
=== Train size: 2213110, Test size: 535 ===
=== Generating recommendations for 535 users ===
=== recommendations: 535 ===
=== fold 1 ===
=== Train size: 2213899, Test size: 535 ===
=== Generating recommendations for 535 users ===
=== recommendations: 535 ===
=== fold 2 ===
=== Train size: 2213502, Test size: 535 ===
=== Generating re
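
The fold sizes in the log (535 test users in each of 5 folds) reflect that the users are partitioned so that each one is a test user in exactly one fold. A minimal NumPy sketch of such a partition follows; it is an illustration and not LensKit's code, and `partition_users` is a hypothetical helper.

```python
import numpy as np

def partition_users(user_ids, n_partitions=5, seed=42):
    """Shuffle the user ids and split them into disjoint, near-equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(user_ids))
    return [fold.tolist() for fold in np.array_split(shuffled, n_partitions)]

folds_demo = partition_users(range(10), n_partitions=5)
for i, fold in enumerate(folds_demo):
    print(f"fold {i}: test users {sorted(fold)}")
```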

In [46]:
for min_sim, res in results.items():
  print(f"\n=== min_sim = {min_sim} ===")
  print(measure_performance(res[0]['test'], res[0]['rec']))
  print('---')


=== min_sim = 1e-06 ===
   Precision    Recall      NDCG       Hit      MSE      RMSE
0   0.481645  0.028342  0.085109  0.962991  0.31477  0.561044
---

=== min_sim = 0.01 ===
   Precision    Recall      NDCG       Hit      MSE      RMSE
0   0.481645  0.028342  0.085109  0.962991  0.31477  0.561044
---

=== min_sim = 0.05 ===
   Precision    Recall      NDCG       Hit      MSE      RMSE
0   0.481645  0.028342  0.085109  0.962991  0.31477  0.561044
---

=== min_sim = 0.1 ===
   Precision    Recall      NDCG       Hit      MSE      RMSE
0    0.48071  0.028275  0.084873  0.962243  0.31479  0.561062
---

=== min_sim = 0.5 ===
   Precision   Recall      NDCG       Hit       MSE      RMSE
0    0.63276  0.03807  0.112106  0.997009  0.381956  0.618026
---

=== min_sim = 0.9 ===


  df_rec = rec.to_df()


ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
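
The error message says that `mean_squared_error` received an empty array, meaning no (user, item) pair had both a test rating and a predicted score. One way to keep the evaluation loop running in that case is a guarded RMSE that returns NaN. The helper below is a hypothetical NumPy sketch, not part of scikit-learn or LensKit.

```python
import numpy as np

def safe_rmse(y_true, y_pred):
    """RMSE that returns NaN instead of raising when there is nothing to compare."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.size == 0:
        return float("nan")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(safe_rmse([3.0, 5.0], [4.0, 4.0]))
print(safe_rmse([], []))
```

A NaN in the results table makes it visible that a parameter value produced no usable predictions, without aborting the remaining evaluations.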

In [49]:
results[0.9][0]['rec'].to_df()



Unnamed: 0,user_id,item_id,score,rank
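
The empty recommendation table for `min_sim = 0.9` is plausibly explained by the threshold itself: if neighbours below `min_sim` are discarded, a very high threshold can leave no neighbours at all, so no item can be scored. The sketch below illustrates that behaviour on hypothetical similarity values. It assumes a simple "threshold, then keep the top k" rule and is not LensKit's actual code.

```python
import numpy as np

def select_neighbours(sims, min_sim=1e-6, k=20):
    """Indices of up to k most similar rated items whose similarity is >= min_sim."""
    sims = np.asarray(sims, dtype=float)
    eligible = np.flatnonzero(sims >= min_sim)          # apply the threshold
    order = np.argsort(-sims[eligible], kind="stable")  # most similar first
    return eligible[order][:k]

# Hypothetical similarities between one candidate movie and four rated movies
sims_to_rated = [0.62, 0.05, 0.31, 0.48]
loose = select_neighbours(sims_to_rated, min_sim=0.1, k=2)
strict = select_neighbours(sims_to_rated, min_sim=0.9, k=2)
print(loose.tolist(), strict.tolist())
```

With the loose threshold the two strongest neighbours are kept; with the strict one the neighbourhood is empty and the candidate would receive no score.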
