# ALS applications

## Implement ALS, IALS (20 points each)

In [1]:
import numpy as np

In [2]:
def als(ratings: np.ndarray, k: int):
    '''Alternating Least Squares algorithm

    Args:
        ratings: matrix of ratings
        k: size of embeddings
        
    Returns:
        two matrices: of user embeddings and of item embeddings
    '''
    # your code here
    return user_embeddings, item_embeddings

In [None]:
def ials(ratings: np.ndarray, k: int):
    '''Implicit Alternating Least Squares algorithm

    Args:
        ratings: matrix of ratings
        k: size of embeddings

    Returns:
        two matrices: of user embeddings and of item embeddings
    '''
    # your code here
    return user_embeddings, item_embeddings
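
As a starting point for the two stubs above, here is a minimal sketch of one ALS half-step on a small dense matrix. It is not the required solution: it assumes every cell of `ratings` is observed, and the helper name `als_half_step` and the parameter `reg` (L2 regularization strength) are hypothetical. With one factor matrix fixed, the other has a closed-form ridge-regression solution. IALS would additionally weight each cell by a confidence value.

```python
import numpy as np


def als_half_step(ratings, fixed, reg):
    """Solve for the free factor matrix while `fixed` is held constant.

    Assumption of this sketch: `ratings` is dense and fully observed.
    """
    k = fixed.shape[1]
    gram = fixed.T @ fixed + reg * np.eye(k)  # (k, k), symmetric positive definite
    rhs = fixed.T @ ratings.T                 # (k, n_rows)
    return np.linalg.solve(gram, rhs).T       # (n_rows, k)


rng = np.random.default_rng(0)
ratings = rng.random((6, 5))  # toy data, not the Dzen dataset
k, reg = 2, 0.1
users = rng.normal(size=(6, k))
items = rng.normal(size=(5, k))

losses = []
for _ in range(10):
    users = als_half_step(ratings, items, reg)    # update users, items fixed
    items = als_half_step(ratings.T, users, reg)  # update items, users fixed
    loss = np.sum((ratings - users @ items.T) ** 2) + reg * (
        np.sum(users**2) + np.sum(items**2)
    )
    losses.append(loss)
```

Each half-step minimizes the regularized squared error exactly with respect to one factor matrix, so the recorded loss should never increase between sweeps.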

## Read Dzen dataset (no points, already implemented)

The data comes from [dzen.ru](http://dzen.ru/) and consists of likes that users gave to text articles.

### Columns
1. item_id - unique id of an item (article)
2. user_id - unique id of a user
3. source_id - unique id of an author. If two items have the same source_id, they come from the same author
4. The name of an item is the title of the article
5. The raw dataset maps each user_id to the list of item_ids that the user liked

### Data files

Download the following files to the current directory:

1. Dataset: https://disk.yandex.ru/d/uUx1MMsZUR87Sw
2. Item names: https://disk.yandex.ru/d/_ZMXsmki-OtLJA
3. item_id and source_id links: https://disk.yandex.ru/d/GCryohhLbYPFoA

In [1]:
import pandas as pd
import scipy.sparse as sp
from tqdm.notebook import tqdm

In [2]:
all_names = pd.read_json("item_id_to_name.json", lines=False)
item_links = pd.read_json("item_id_to_source_id.json", lines=False)
dataset = pd.read_json("dataset_zen.json", lines=False)

In [3]:
all_names

Unnamed: 0,id,name
0,94962,Что обычно ожидало русских казачек в руках у к...
1,3972,Почему Россия решила строить новую скоростную ...
2,94644,"5 неприличных фактов об Андрее Макаревиче, кот..."
3,82518,"Что стало с красавицей Хмельницкой, которую му..."
4,53264,"Понять и Простить: Почему угонщики, бежавшие и..."
...,...,...
104498,36769,"Плюс один источник мифа о рыцарях, неспособных..."
104499,9190,Мой сад - малоуходный
104500,52731,Купил первую в жизни циркулярную пилу. Честный...
104501,72660,Решили предложить Марине помощь в лечении ч.10


In [4]:
item_links

Unnamed: 0,id,source
0,94962,2919814402697966089
1,3972,3263022753228392991
2,94644,-3857390427602554682
3,82518,-9036908390349249792
4,53264,3353856219169766284
...,...,...
104498,36769,3818746211375738614
104499,9190,4975535765688979937
104500,52731,3720366796439288909
104501,72660,-7860042973720636310


In [5]:
dataset

Unnamed: 0,user_id,item_ids
0,993675863667353526,"[15267, 61075, 81203, 17066, 25471, 88427, 638..."
1,4250619547882954185,"[4555, 94644, 84972, 17774, 94962, 78217, 2485..."
2,3847785305345691076,"[1898, 26703, 16525, 86939, 55017, 31069, 4035..."
3,1785181112918558233,"[75601, 102458, 28716, 100694, 5757, 47104, 60..."
4,5078748097863903181,"[72260, 40825, 2615, 42549, 379, 100818, 56827..."
...,...,...
75905,4954138831959898373,"[11881, 55520, 63054, 48015, 66952, 103830, 21..."
75906,4967793435819938014,"[74697, 11830, 63858, 87245, 41956, 62089, 686..."
75907,7137764184903122777,"[10353, 1775, 103680, 29704, 9782, 13295, 9975..."
75908,2624987805086334956,"[24324, 18854, 73319, 66641, 64078, 97387, 426..."


In [6]:
total_interactions_count = dataset.item_ids.map(len).sum()
user_coo = np.zeros(total_interactions_count, dtype=np.int64)
item_coo = np.zeros(total_interactions_count, dtype=np.int64)
pos = 0

for user_id, item_ids in enumerate(tqdm(dataset.item_ids)):
    user_coo[pos : pos + len(item_ids)] = user_id
    item_coo[pos : pos + len(item_ids)] = item_ids
    pos += len(item_ids)
shape = (user_coo.max() + 1, item_coo.max() + 1)
user_item_matrix = sp.coo_matrix(
    (np.ones(len(user_coo)), (user_coo, item_coo)), shape=shape
)
user_item_matrix = user_item_matrix.tocsr()
sp.save_npz("data_train.npz", user_item_matrix)
# Free memory. From here on only data_train.npz is needed
del user_coo
del item_coo
del dataset


In [7]:
# You can start here if the precomputation above has already been done
user_item_matrix = sp.load_npz("data_train.npz")

In [8]:
user_item_matrix

<75910x104503 sparse matrix of type '<class 'numpy.float64'>'
	with 5792423 stored elements in Compressed Sparse Row format>

In [9]:
item_weights = np.array(user_item_matrix.tocsc().sum(0))[0]
top_to_bottom_order = np.argsort(-item_weights)
item_mapping = np.empty(top_to_bottom_order.shape, dtype=int)
item_mapping[top_to_bottom_order] = np.arange(len(top_to_bottom_order))
total_item_count = (item_weights > 0).sum()
total_user_count = user_item_matrix.shape[0]


def build_dataset(user_item_matrix, item_pct, user_pct):
    user_count, item_count = int(total_user_count * user_pct), int(
        total_item_count * item_pct
    )
    item_ids = top_to_bottom_order[:item_count]
    user_ids = np.random.choice(
        np.arange(user_item_matrix.shape[0]), size=user_count, replace=False
    )
    train = user_item_matrix[user_ids]
    train = train[:, item_ids]
    return train

In [10]:
small_dataset = build_dataset(user_item_matrix, 0.05, 0.05)

The small dataset is useful for debugging (it saves time).
Final answers should use the full dataset.

In [11]:
small_dataset

<3795x5019 sparse matrix of type '<class 'numpy.float64'>'
	with 137275 stored elements in Compressed Sparse Row format>

## Split dataset matrix (10 points)

Split in the following way: for a random 20% of users, remove one like. The removed likes form the test data; everything else is the train data.

In [3]:
def split_data(ratings):
    # your code here
    return train_matrix, test_matrix

In [None]:
train_ratings, test_ratings = split_data(...)
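
One possible shape for such a split is sketched below on a toy matrix. The function name `split_one_like_sketch` and its parameters are hypothetical, and the sketch makes one assumption the task does not state: only users with at least two likes are eligible, so that no test user ends up with an empty train row.

```python
import numpy as np
import scipy.sparse as sp


def split_one_like_sketch(ratings, test_fraction=0.2, seed=0):
    """Hold out one random like for a random subset of users."""
    rng = np.random.default_rng(seed)
    ratings = sp.csr_matrix(ratings)
    n_users = ratings.shape[0]
    # Assumption: a user needs at least two likes to give one away.
    eligible = np.flatnonzero(np.diff(ratings.indptr) >= 2)
    n_test = min(int(n_users * test_fraction), len(eligible))
    test_users = rng.choice(eligible, size=n_test, replace=False)

    rows, cols, vals = [], [], []
    for u in test_users:
        liked = ratings.indices[ratings.indptr[u] : ratings.indptr[u + 1]]
        item = rng.choice(liked)
        rows.append(u)
        cols.append(item)
        vals.append(ratings[u, item])
    test = sp.csr_matrix((vals, (rows, cols)), shape=ratings.shape)
    train = (ratings - test).tocsr()
    train.eliminate_zeros()  # drop the held-out cells from the stored entries
    return train, test


# Toy data: 10 users, 6 items, 4 likes per user.
demo = sp.csr_matrix((np.arange(60).reshape(10, 6) % 3 > 0).astype(float))
demo_train, demo_test = split_one_like_sketch(demo)
```

On this toy input, two users (20% of ten) each lose exactly one like, and `demo_train + demo_test` reproduces `demo`.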

## Compute MRR@100 metric for test users (10 points)

Compute it for both the ALS and the IALS algorithm.

In [None]:
def mrr(predictions, test_ratings):
    # your code here
    return mrr_value

In [None]:
mrr_als = mrr(als_predictions, test_ratings)
print(mrr_als)

In [None]:
mrr_ials = mrr(ials_predictions, test_ratings)
print(mrr_ials)
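
For reference, MRR@k averages the reciprocal rank of the held-out item over test users, counting it as zero when the item falls outside the top k. The sketch below is one way to compute it under stated assumptions: `scores` is a dense user-by-item score matrix (for example `user_embeddings @ item_embeddings.T`), each test user has exactly one held-out item, and items already present in train have been masked out of `scores` by the caller. The name `mrr_at_k_sketch` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def mrr_at_k_sketch(scores, test_ratings, k=100):
    """Mean reciprocal rank of the held-out item, truncated at rank k."""
    test_ratings = sp.csr_matrix(test_ratings)
    reciprocal_ranks = []
    # Only users with a held-out like contribute to the metric.
    for u in np.flatnonzero(np.diff(test_ratings.indptr)):
        held_out = test_ratings.indices[test_ratings.indptr[u]]
        # Rank = 1 + number of items scored strictly higher than the held-out one.
        rank = 1 + int(np.sum(scores[u] > scores[u, held_out]))
        reciprocal_ranks.append(1.0 / rank if rank <= k else 0.0)
    return float(np.mean(reciprocal_ranks)) if reciprocal_ranks else 0.0


# Toy check: user 0 holds out item 2 (rank 2), user 1 holds out item 1 (rank 1).
demo_scores = np.array([[0.9, 0.1, 0.5], [0.2, 0.8, 0.3], [0.1, 0.2, 0.3]])
demo_held_out = sp.csr_matrix(([1.0, 1.0], ([0, 1], [2, 1])), shape=(3, 3))
demo_mrr = mrr_at_k_sketch(demo_scores, demo_held_out)         # (1/2 + 1) / 2
demo_mrr_at_1 = mrr_at_k_sketch(demo_scores, demo_held_out, k=1)  # (0 + 1) / 2
```

A dense score matrix is only practical for the small debugging dataset; on the full data the scores would have to be computed in user batches.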

## Adjust hyperparameters of ALS and IALS to maximize MRR (30 points)

The main hyperparameters are the regularization strength and, in the implicit case, the weights.
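
A plain grid search is one simple way to organize the tuning. The sketch below is illustrative only: `grid_search_sketch`, the parameter names `reg` and `alpha`, and the grid values are all hypothetical, and `toy_evaluate` is a stand-in for "train the model on the train data and return MRR@100 on the test data".

```python
import itertools


def grid_search_sketch(evaluate, grid):
    """Try every combination in `grid`; return the best params and score."""
    best_params, best_score = None, float("-inf")
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)  # higher is better (e.g. MRR@100)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(reg, alpha):
    # Hypothetical stand-in with a known optimum at reg=0.1, alpha=10.
    return -((reg - 0.1) ** 2) - ((alpha - 10) ** 2) / 100


grid = {"reg": [0.01, 0.1, 1.0], "alpha": [1, 10, 40]}
best_params, best_score = grid_search_sketch(toy_evaluate, grid)
```

Because each evaluation means a full training run, it is usually worth exploring the grid on the small dataset first and confirming the most promising settings on the full one.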

In [None]:
# your code here

Optimal parameters of ALS are:

....

Optimal parameters of IALS are:

....

## Get similarities from item2item CF and SLIM (10 points each)

Item2item can be taken from the first homework; SLIM was implemented in class.

You need to compute only item similarities, not predictions for users.
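
As an illustration of what "item similarities only" means, the sketch below computes cosine similarities between item columns of a sparse user-item matrix. It assumes cosine similarity as the item2item measure, which may differ from the variant used in the first homework, and the function name `cosine_item_similarities_sketch` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def cosine_item_similarities_sketch(user_item):
    """Item-by-item cosine similarity with a zeroed diagonal (sparse result)."""
    user_item = sp.csr_matrix(user_item, dtype=np.float64)
    # Column norms; items without any likes keep norm 1 to avoid division by zero.
    norms = np.sqrt(np.asarray(user_item.multiply(user_item).sum(axis=0)).ravel())
    norms[norms == 0] = 1.0
    normalized = user_item @ sp.diags(1.0 / norms)
    sims = (normalized.T @ normalized).tolil()
    sims.setdiag(0)  # an item should not be listed as its own neighbor
    sims = sims.tocsr()
    sims.eliminate_zeros()
    return sims


# Toy data: items 0 and 1 are liked by the same users, item 2 by someone else.
demo_likes = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
demo_sims = cosine_item_similarities_sketch(demo_likes)
```

For ALS and IALS, the same measure can be applied to the rows of the item embedding matrix, which gives all four methods a comparable similarity matrix.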

In [None]:
i2i_similarities = ... # your code here

In [None]:
slim_similarities = ... # your code here

## Compare similarities from four algorithms (20 points)

* plot distributions
* compute metrics (which you think are relevant)
* look at several top similar lists

Draw a conclusion about how these methods differ in computing similarities.
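
One metric that may be relevant here is the overlap between top-k neighbor lists produced by two methods. The sketch below computes the mean Jaccard overlap on small dense similarity matrices; the function names and the choice of Jaccard are assumptions of this example, not part of the assignment.

```python
import numpy as np


def top_k_neighbors(similarities, k):
    """Indices of the k most similar items per row, excluding the item itself."""
    sims = np.array(similarities, dtype=float)  # copy, so the input is untouched
    np.fill_diagonal(sims, -np.inf)
    return np.argsort(-sims, axis=1)[:, :k]


def mean_top_k_jaccard(sims_a, sims_b, k=10):
    """Average Jaccard overlap of top-k neighbor sets across all items."""
    top_a, top_b = top_k_neighbors(sims_a, k), top_k_neighbors(sims_b, k)
    overlaps = []
    for row_a, row_b in zip(top_a, top_b):
        a, b = set(row_a.tolist()), set(row_b.tolist())
        overlaps.append(len(a & b) / len(a | b))
    return float(np.mean(overlaps))


# Toy similarity matrices whose nearest neighbors disagree for every item.
demo_a = np.array([[0, 3, 2, 1], [3, 0, 2, 1], [2, 1, 0, 3], [1, 2, 3, 0]])
demo_b = np.array([[0, 1, 2, 3], [1, 0, 2, 3], [3, 2, 0, 1], [3, 2, 1, 0]])
same = mean_top_k_jaccard(demo_a, demo_a, k=2)
different = mean_top_k_jaccard(demo_a, demo_b, k=1)
```

A value near 1 means two methods retrieve nearly the same neighbors, while a value near 0 means they disagree. Sparse similarity matrices would need a row-wise top-k instead of the dense `argsort` used here.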

In [4]:
# your code here

Conclusion:

....