# ALS applications

## Dzen dataset

The data comes from [dzen.ru](https://dzen.ru/) and consists of likes that users gave to text articles.

### Columns
1. item_id - unique id of an item (article)
2. user_id - unique id of a user
3. source_id - unique id of an author. Two items with the same source_id come from the same author
4. name - title of the article
5. item_ids - list of the item ids that a user liked (the raw dataset stores one row per user_id)
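
As a small illustration of the raw format described above, the sketch below parses a made-up two-user CSV (the ids are invented, not taken from the dataset) and flattens it into one row per like. The name `pairs` is illustrative only.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the raw ratings file: one row per user,
# with the liked item ids stored as a stringified Python list.
csv_text = 'user_id,item_ids\n11,"[3, 5]"\n22,"[5]"\n'

toy = pd.read_csv(io.StringIO(csv_text), converters={"item_ids": ast.literal_eval})

# One (user_id, item_id) row per like
pairs = toy.explode("item_ids").rename(columns={"item_ids": "item_id"})
print(pairs)
```

The `converters` argument is what turns the stored string into a real list; without it, `item_ids` would stay a plain string.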

In [2]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
from tqdm.notebook import tqdm
import ast

In [3]:
item_names = pd.read_csv("zen_item_to_name.csv")
item_sources = pd.read_csv("zen_item_to_source.csv")
dataset = pd.read_csv("zen_ratings.csv", converters={"item_ids": ast.literal_eval})

In [4]:
item_names

Unnamed: 0,id,name
0,94962,Что обычно ожидало русских казачек в руках у к...
1,3972,Почему Россия решила строить новую скоростную ...
2,94644,"5 неприличных фактов об Андрее Макаревиче, кот..."
3,82518,"Что стало с красавицей Хмельницкой, которую му..."
4,53264,"Понять и Простить: Почему угонщики, бежавшие и..."
...,...,...
104498,36769,"Плюс один источник мифа о рыцарях, неспособных..."
104499,9190,Мой сад - малоуходный
104500,52731,Купил первую в жизни циркулярную пилу. Честный...
104501,72660,Решили предложить Марине помощь в лечении ч.10


In [5]:
item_sources

Unnamed: 0,id,source
0,94962,2919814402697966089
1,3972,3263022753228392991
2,94644,-3857390427602554682
3,82518,-9036908390349249792
4,53264,3353856219169766284
...,...,...
104498,36769,3818746211375738614
104499,9190,4975535765688979937
104500,52731,3720366796439288909
104501,72660,-7860042973720636310


In [6]:
dataset

Unnamed: 0,user_id,item_ids
0,993675863667353526,"[15267, 61075, 81203, 17066, 25471, 88427, 638..."
1,4250619547882954185,"[4555, 94644, 84972, 17774, 94962, 78217, 2485..."
2,3847785305345691076,"[1898, 26703, 16525, 86939, 55017, 31069, 4035..."
3,1785181112918558233,"[75601, 102458, 28716, 100694, 5757, 47104, 60..."
4,5078748097863903181,"[72260, 40825, 2615, 42549, 379, 100818, 56827..."
...,...,...
75905,4954138831959898373,"[11881, 55520, 63054, 48015, 66952, 103830, 21..."
75906,4967793435819938014,"[74697, 11830, 63858, 87245, 41956, 62089, 686..."
75907,7137764184903122777,"[10353, 1775, 103680, 29704, 9782, 13295, 9975..."
75908,2624987805086334956,"[24324, 18854, 73319, 66641, 64078, 97387, 426..."


In [None]:
%pip install ipywidgets

In [7]:
total_interactions_count = dataset.item_ids.map(len).sum()
user_coo = np.zeros(total_interactions_count, dtype=np.int64)
item_coo = np.zeros(total_interactions_count, dtype=np.int64)
pos = 0

for user_id, item_ids in enumerate(tqdm(dataset.item_ids)):
    user_coo[pos : pos + len(item_ids)] = user_id
    item_coo[pos : pos + len(item_ids)] = item_ids
    pos += len(item_ids)

# ndarray.max() is much faster than the built-in max() on large arrays
shape = (user_coo.max() + 1, item_coo.max() + 1)
user_item_matrix = sp.coo_matrix(
    (np.ones(len(user_coo)), (user_coo, item_coo)), shape=shape
)
user_item_matrix = user_item_matrix.tocsr()
sp.save_npz("data_train.npz", user_item_matrix)
# Free memory; later cells need only data_train.npz
del user_coo
del item_coo
del dataset
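
For comparison, the same coordinate arrays can be built without a Python-level loop. This is a minimal sketch on toy data (the `item_lists` values are made up), not a replacement for the cell above.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-in for dataset.item_ids: one list of liked item ids per user
item_lists = [[2, 0], [1], [0, 1, 2]]

lengths = np.fromiter(map(len, item_lists), dtype=np.int64)
# Repeat each row index once per like of that user
rows = np.repeat(np.arange(len(item_lists)), lengths)
cols = np.concatenate(item_lists)

toy_matrix = sp.coo_matrix((np.ones(len(rows)), (rows, cols))).tocsr()
print(toy_matrix.toarray())
```

As in the loop version, the row index is the position of the user in the table, not the original `user_id`.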


In [3]:
# You can start here if the precomputing step has already been run
user_item_matrix = sp.load_npz("data_train.npz")

In [4]:
user_item_matrix

<75910x104503 sparse matrix of type '<class 'numpy.float64'>'
	with 5792423 stored elements in Compressed Sparse Row format>

In [5]:
def sparce_matrix_report(matrix):
    print("Size of raw data:", matrix.data.nbytes / 10**6, "Mb")
    print("Feedback matrix size:", matrix.shape)

In [6]:
sparce_matrix_report(user_item_matrix)

Size of raw data: 46.339384 Mb
Feedback matrix size: (75910, 104503)
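
To put the reported numbers in context, the arithmetic below uses only the shape and the stored-element count printed above. The report counts the `data` array only; the 4-byte index size is an assumption about how SciPy stores the column indices here.

```python
n_users, n_items, nnz = 75910, 104503, 5792423

# Share of user-item pairs that actually carry a like
density = nnz / (n_users * n_items)

# float64 values only, as in the report above
data_mb = nnz * 8 / 10**6

# Assumption: column indices are stored as 4-byte integers
indices_mb = nnz * 4 / 10**6

# Memory a dense float64 matrix of the same shape would need
dense_gb = n_users * n_items * 8 / 10**9

print(f"density: {density:.2e}")
print(f"data: {data_mb} MB, indices (assumed int32): {indices_mb} MB")
print(f"dense equivalent: {dense_gb:.1f} GB")
```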


In [7]:
item_weights = np.array(user_item_matrix.tocsc().sum(0))[0]
top_to_bottom_order = np.argsort(-item_weights)
item_mapping = np.empty(top_to_bottom_order.shape, dtype=int)
item_mapping[top_to_bottom_order] = np.arange(len(top_to_bottom_order))
total_item_count = (item_weights > 0).sum()
total_user_count = user_item_matrix.shape[0]


def build_debug_dataset(user_item_matrix, item_pct: float, user_pct: float):
    """Get given percent of top rated items and given percent of random users"""
    user_count = int(total_user_count * user_pct)
    item_count = int(total_item_count * item_pct)
    item_ids = top_to_bottom_order[:item_count]
    user_ids = np.random.choice(
        np.arange(user_item_matrix.shape[0]), size=user_count, replace=False
    )
    train = user_item_matrix[user_ids]
    train = train[:, item_ids]
    return train

In [8]:
debug_dataset = build_debug_dataset(user_item_matrix, 0.05, 0.05)

sparce_matrix_report(debug_dataset)

Size of raw data: 1.10536 Mb
Feedback matrix size: (3795, 5019)


In [9]:
type(debug_dataset[1])

scipy.sparse._csr.csr_matrix

This is useful for debugging (just to save time).

**Final answers should use the full dataset!**

## Split dataset matrix (5 points)

Split the matrix as follows: for a random 20% of users, remove one like. The removed likes are the test data; everything else is the train data. (10 points)

In [14]:
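def split_data(ratings):
    """Hold out one random like for a random 20% of users.

    Returns (train, test): train is `ratings` without the held-out likes,
    test contains only the held-out likes.
    """
    n_users, n_items = ratings.shape
    test_user_indices = np.random.choice(n_users, int(n_users * 0.2), replace=False)
    test_users, test_items = [], []

    for user_idx in test_user_indices:
        item_indices = ratings[user_idx].nonzero()[1]
        if len(item_indices) > 0:
            test_users.append(user_idx)
            test_items.append(np.random.choice(item_indices))

    # Test matrix: only the held-out (user, item) pairs
    test_matrix = sp.csr_matrix(
        (
            np.ones(len(test_users)),
            (np.array(test_users, dtype=np.int64), np.array(test_items, dtype=np.int64)),
        ),
        shape=ratings.shape,
    )
    # Train matrix: the original likes minus the held-out ones
    train_matrix = (ratings - test_matrix).tocsr()
    train_matrix.eliminate_zeros()

    return train_matrix, test_matrix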

In [43]:
train_ratings, test_ratings = split_data(user_item_matrix)
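
A quick sanity check for a leave-one-out split can be sketched as follows. It assumes the test matrix stores only the held-out likes; `check_split` is a hypothetical helper, and the matrices are toy data.

```python
import numpy as np
import scipy.sparse as sp


def check_split(full, train, test):
    """Return True if train and test partition the likes stored in `full`."""
    recombined = (train + test).tocsr()
    same = (recombined != full.tocsr()).count_nonzero() == 0
    # At most one held-out like per user
    one_per_user = test.tocsr().getnnz(axis=1).max() <= 1
    # No like appears in both parts
    disjoint = train.multiply(test).count_nonzero() == 0
    return bool(same and one_per_user and disjoint)


full = sp.csr_matrix(np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]))
test = sp.csr_matrix(np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]))
train = (full - test).tocsr()
train.eliminate_zeros()

print(check_split(full, train, test))
```

Passing the unsplit matrix as `train` fails the check, because the held-out like would then be counted twice.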

## Implement ALS, IALS (10 points each)

Note that due to the size of the data you need to implement the algorithms with _sparse matrices_!
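
One way to see why sparse matrices are enough: with one factor matrix fixed, a regularized least-squares update of the other has a closed form in which the ratings appear only through a sparse-dense product. The sketch below checks this on toy data against a dense reference. It assumes the variant in which missing entries are treated as zeros, and all names and sizes are illustrative.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_users, n_items, k, lam = 6, 5, 3, 0.1

A_dense = (rng.random((n_users, n_items)) < 0.4).astype(float)  # toy likes
A = sp.csr_matrix(A_dense)
Y = rng.standard_normal((n_items, k))  # fixed item factors

# Sparse path: X = A Y (Y^T Y + lam I)^{-1}; only A @ Y touches the big matrix
X_sparse = np.linalg.solve(Y.T @ Y + lam * np.eye(k), (A @ Y).T).T

# Dense reference: the same ridge problem written as an augmented least squares
Y_aug = np.vstack([Y, np.sqrt(lam) * np.eye(k)])
A_aug = np.hstack([A_dense, np.zeros((n_users, k))])
X_dense = np.linalg.lstsq(Y_aug, A_aug.T, rcond=None)[0].T

print(np.allclose(X_sparse, X_dense))
```

The result of `A @ Y` is a small dense array of shape `(n_users, k)`, so nothing of size `n_users x n_items` is ever materialized.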

In [26]:
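lam = 1e-4
k = 20
max_iter = 30
tol = 1e-5


def calc_oposite_vectors(Y, A, lam=lam):
    """Solve min_X ||A - X Y^T||^2 + lam * ||X||^2 for X with Y fixed.

    Closed form: X = A Y (Y^T Y + lam I)^{-1}. Only the sparse-dense product
    A @ Y touches the ratings; everything else is k x k.
    """
    B = Y.T @ Y + lam * np.eye(Y.shape[1])
    C = A @ Y
    # B is symmetric, so solving B X^T = C^T avoids an explicit inverse
    return np.linalg.solve(B, C.T).T


In [31]:
def als(ratings, k: int, lam: float):
    """Alternating Least Squares algorithm

    Missing entries of the ratings matrix are treated as zeros.

    Args:
        ratings: sparse matrix of ratings
        k: size of embeddings
        lam: regularization term

    Returns:
        two matrices: of user embeddings and of item embeddings
    """

    user_embeddings, item_embeddings = (
        np.random.randn(ratings.shape[0], k),
        np.random.randn(ratings.shape[1], k),
    )

    for _ in range(max_iter):
        user_embeddings = calc_oposite_vectors(item_embeddings, ratings, lam)
        item_embeddings = calc_oposite_vectors(user_embeddings, ratings.T, lam)

        # Disabled: on the full dataset this would build a dense
        # n_users x n_items matrix.
        # error = np.linalg.norm(ratings - user_embeddings @ item_embeddings.T)
        # if error < tol:
        #     break

    return user_embeddings, item_embeddings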

## Compute MRR@100 metric for test users (10 points)

For ALS and IALS algorithms.

**Don't forget to use the full dataset!**
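
As a worked example of the metric, suppose the held-out items of three test users land at ranks 1, 4 and 150 (made-up values). With a cutoff of 100 the last user contributes 0, so MRR@100 = (1 + 1/4 + 0) / 3. The helper name below is illustrative only.

```python
def reciprocal_rank_at_k(rank, k=100):
    """1/rank if the item is within the top k, otherwise 0."""
    return 1.0 / rank if rank <= k else 0.0


ranks = [1, 4, 150]  # hypothetical 1-based ranks of the held-out items
mrr = sum(reciprocal_rank_at_k(r) for r in ranks) / len(ranks)
print(mrr)
```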

In [50]:
def mrr100(user_embeddings, item_embeddings, test_ratings):
    """Mean reciprocal rank at 100 over users that have a held-out like.

    The nonzero entries of test_ratings are taken as the held-out likes.
    A held-out item ranked below position 100 contributes 0.
    """
    n_users = test_ratings.shape[0]
    reciprocal_ranks = []

    for u in range(n_users):
        # Column indices of this user's held-out items
        test_items = test_ratings[u].nonzero()[1]
        if len(test_items) == 0:
            continue  # not a test user

        # Scores of all items for this user
        similarities = item_embeddings @ user_embeddings[u]

        # 1-based rank of the best held-out item among all items
        rank = (similarities > similarities[test_items].max()).sum() + 1

        reciprocal_ranks.append(1.0 / rank if rank <= 100 else 0.0)

    mrr_value = np.mean(reciprocal_ranks)

    return mrr_value

In [44]:
user_als_embeddings, item_als_embeddings = als(train_ratings, k=20, lam=1e-4)

In [None]:
mrr_als = mrr100(user_als_embeddings, item_als_embeddings, test_ratings)
print(mrr_als)