<a href="https://colab.research.google.com/github/dashatenoff/recsys-vk/blob/main/notebooks/tem_based_cf_jaccard_v4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Item-based Collaborative Filtering (Jaccard)

This notebook implements a simple item-based collaborative filtering
recommender system using Jaccard similarity.

The model is based on implicit user-item interactions (views) from the VK-LSVD dataset.
The implementation focuses on clarity and conceptual correctness rather than efficiency.


In [1]:
import os

import pandas as pd
from google.colab import drive

# Mount Google Drive so the dataset files are reachable from Colab.
drive.mount('/content/drive')
os.listdir('/content/drive/MyDrive')

# Load the train and test interaction logs.
train = pd.read_parquet('/content/drive/MyDrive/VK/train.parquet')
test = pd.read_parquet('/content/drive/MyDrive/VK/test.parquet')

Mounted at /content/drive


# Item similarity function

In [2]:
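# user_id -> set of viewed items, and item_id -> set of users who viewed it.
user_items = train.groupby('user_id')['item_id'].apply(set)
item_users = train.groupby('item_id')['user_id'].apply(set)

def jaccard(item1, item2):
  """Jaccard similarity between the audiences of two items."""
  users1 = item_users.get(item1, set())
  users2 = item_users.get(item2, set())

  # An item with no known viewers gets zero similarity.
  if not users1 or not users2:
    return 0.0

  return len(users1 & users2) / len(users1 | users2)


# Restrict similarity computation to the 2000 most-viewed items.
popular_items = train['item_id'].value_counts().head(2000).index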
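
As a quick sanity check of the formula J(A, B) = |A ∩ B| / |A ∪ B|, here is a minimal sketch on made-up data. The `toy_item_users` mapping and the `toy_jaccard` helper are hypothetical and exist only for illustration; they are not part of the VK-LSVD data or of the notebook's pipeline.

```python
# Hypothetical toy data: item -> set of users who viewed it.
toy_item_users = {
    "a": {1, 2, 3},
    "b": {2, 3, 4},
    "c": {5},
}

def toy_jaccard(item1, item2, item_users=toy_item_users):
    """Jaccard similarity between the audiences of two items."""
    users1 = item_users.get(item1, set())
    users2 = item_users.get(item2, set())
    if not users1 or not users2:
        return 0.0
    return len(users1 & users2) / len(users1 | users2)

print(toy_jaccard("a", "b"))  # 2 shared users out of 4 distinct -> 0.5
print(toy_jaccard("a", "c"))  # no shared users -> 0.0
print(toy_jaccard("a", "z"))  # unknown item -> 0.0
```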


# Precompute item similarities

In [3]:
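from collections import defaultdict

# item -> list of (similar_item, similarity) pairs
item_sim = defaultdict(list)
items = list(popular_items)

# Compare every unordered pair of popular items exactly once.
for i in range(len(items)):
  for j in range(i+1, len(items)):
    i1, i2 = items[i], items[j]
    sim = jaccard(i1, i2)
    # Keep only pairs above a small threshold, stored in both directions.
    if sim >= 0.01:
      item_sim[i1].append((i2, sim))
      item_sim[i2].append((i1, sim))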
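
The double loop above evaluates `jaccard` for every pair of the 2000 popular items, which is roughly two million calls. One possible alternative, sketched here and not taken from the notebook, counts co-occurrences while iterating over each user's item set and then derives the similarity as |A ∩ B| / (|A| + |B| - |A ∩ B|). The helper name `cooccurrence_jaccard` and the toy data are hypothetical, and the function assumes a plain dict (for example `user_items.to_dict()`), not a pandas Series. Whether it is actually faster depends on how many popular items a typical user has viewed.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_jaccard(user_items, allowed_items, min_sim=0.01):
    """Build item -> [(other_item, sim)] from a dict user_id -> set of item_ids.

    Hypothetical helper: only pairs that co-occur for at least one user
    are ever touched, so pairs with zero overlap cost nothing.
    """
    allowed = set(allowed_items)
    item_count = Counter()  # number of users per item
    pair_count = Counter()  # number of users shared by an item pair

    for items in user_items.values():
        kept = sorted(items & allowed)
        item_count.update(kept)
        pair_count.update(combinations(kept, 2))

    item_sim = defaultdict(list)
    for (i1, i2), inter in pair_count.items():
        union = item_count[i1] + item_count[i2] - inter
        sim = inter / union
        if sim >= min_sim:
            item_sim[i1].append((i2, sim))
            item_sim[i2].append((i1, sim))
    return item_sim

# Hypothetical toy interactions: user -> set of viewed items.
toy_user_items = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"a"}, 4: {"b", "c"}}
toy_sim = cooccurrence_jaccard(toy_user_items, ["a", "b", "c"])
print(toy_sim["a"])  # "a" and "b" share 2 of 4 distinct users -> [('b', 0.5)]
```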


# Recommendation function for a SINGLE user

In [4]:

def recommend_for_user(user_id, user_items, item_sim, k=10):
  scores = {}

  # Items the user has already viewed are never recommended.
  seen_items = user_items.get(user_id, set())
  for item in seen_items:
    for other_item, sim in item_sim.get(item, []):
      if other_item in seen_items:
        continue
      # Accumulate similarity from every seen item that lists this candidate.
      scores[other_item] = scores.get(other_item, 0) + sim

  # Rank candidates by total score, highest first.
  ranked_items = sorted(scores.items(), key=lambda x: x[1], reverse=True)

  return [item for item, score in ranked_items[:k]]
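
To make the scoring rule concrete: each unseen candidate accumulates the similarities contributed by every seen item that lists it as a neighbor. The following is a minimal sketch on made-up data; `score_candidates`, `toy_item_sim`, and `toy_seen` are hypothetical names used only for this illustration.

```python
def score_candidates(seen_items, item_sim):
    """Sum the similarities of unseen neighbors over all seen items."""
    scores = {}
    for item in seen_items:
        for other_item, sim in item_sim.get(item, []):
            if other_item not in seen_items:
                scores[other_item] = scores.get(other_item, 0.0) + sim
    return scores

# Hypothetical precomputed neighbors: item -> [(other_item, similarity)].
toy_item_sim = {
    "a": [("b", 0.5), ("c", 0.25)],
    "b": [("a", 0.5), ("c", 0.5), ("d", 0.125)],
}
toy_seen = {"a", "b"}

scores = score_candidates(toy_seen, toy_item_sim)
top = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:10]
print(top)  # [('c', 0.75), ('d', 0.125)]
```

Item "c" ranks first because both seen items vote for it (0.25 + 0.5), while "d" is supported only by "b".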


In [5]:
test_user_id = user_items.index[0]
recommend_for_user(test_user_id, user_items, item_sim, k=10)




[329362081,
 178221643,
 377181597,
 225067555,
 434582233,
 31692283,
 319579439,
 547469448,
 26828817,
 566979306]

# MAP@10

In [6]:
test_user = test['user_id'].unique()
recs = []

for user_id in test_user:
  recs_items = recommend_for_user(user_id, user_items, item_sim, k=10)
  for item in recs_items:
    recs.append({
        'user_id' : user_id,
        'recs' : item
    })


In [9]:
submission = pd.DataFrame(recs)
submission.to_csv('submission_item-based.csv', index=False)
submission.head(20)


     user_id       recs
0  506947605  353095663
1  506947605  263869106
2  506947605  418868193
3  506947605  270728763
4  506947605  222689569
5  506947605  225067555
6  506947605  178221643
7  506947605  164213967
8  506947605  377181597
9  506947605  434650826


In [11]:
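# Ground-truth items and ranked predictions per user.
rel = test.groupby('user_id')['item_id'].apply(set)
pred = submission.groupby('user_id')['recs'].apply(list)

aps = []
# Only users with at least one recommendation appear in pred.
for user_id in pred.index:
  relevant = rel.get(user_id, set())
  hits = 0
  score = 0.0
  for rank, item_id in enumerate(pred[user_id], start=1):
    if item_id in relevant:
      hits += 1
      score += hits / rank  # precision at this rank
  # Normalize by the best achievable number of hits in the top 10.
  aps.append(score / min(len(relevant), 10))
map10 = sum(aps) / len(aps)
map10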

0.017041417254831888
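
The loop above computes, for each user, the sum of precision values at every rank where a relevant item appears, divided by min(number of relevant items, 10), and then averages over users. A standalone sketch of the per-user part follows. The function name `average_precision_at_k` is hypothetical, and returning 0.0 for a user with no relevant items is an assumption that the notebook itself does not make.

```python
def average_precision_at_k(predicted, relevant, k=10):
    """AP@k for one user: predicted is a ranked list, relevant is a set."""
    if not relevant:
        return 0.0  # assumption: no ground truth means zero score
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant), k)

# Relevant items found at ranks 1 and 3: (1/1 + 2/3) / 2
print(average_precision_at_k(["a", "x", "b", "y"], {"a", "b"}))
```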

Item-based collaborative filtering with Jaccard similarity
outperforms the TopPop baseline on MAP@10:
0.0170 versus 0.0073, roughly a 2.3× improvement.