### Матричные факторизации

В данной работе вам предстоит познакомиться с практической стороной матричных разложений.
Работа поделена на 4 задания:
1. Вам необходимо реализовать SVD разложения используя SGD на explicit данных
2. Вам необходимо реализовать матричное разложения используя ALS на implicit данных
3. Вам необходимо реализовать матричное разложения используя BPR(pair-wise loss) на implicit данных
4. Вам необходимо реализовать матричное разложения используя WARP(list-wise loss) на implicit данных

Мягкий дедлайн 28 Сентября (пишутся замечания, выставляется оценка, есть возможность исправить до жесткого дедлайна)

Жесткий дедлайн 5 Октября (Итоговая проверка)

In [1]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

from lightfm.datasets import fetch_movielens

In [379]:
from pathlib import Path
from tqdm import tqdm
from collections import defaultdict
from sklearn.metrics import roc_auc_score

В данной работе мы будем работать с explicit датасетом movieLens, в котором представленны пары user_id movie_id и rating выставленный пользователем фильму

Скачать датасет можно по ссылке https://grouplens.org/datasets/movielens/1m/

In [3]:
dataset_dir = Path('..', 'data', 'ml-1m')

In [4]:
ratings = pd.read_csv(dataset_dir / 'ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [5]:
movie_info = pd.read_csv(dataset_dir / 'movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python')

Explicit данные

In [6]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


Для того, чтобы преобразовать текущий датасет в Implicit, давайте считать что позитивная оценка это оценка >=4

In [7]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [8]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
3,1,3408,4
4,1,2355,5
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4
10,1,595,5
11,1,938,4
12,1,2398,4


Удобнее работать с sparse матричками, давайте преобразуем DataFrame в CSR матрицы

In [9]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()

В качестве примера воспользуемся ALS разложением из библиотеки implicit

Зададим размерность латентного пространства равным 64, это же определяет размер user/item эмбедингов

In [10]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)

В качестве loss здесь всеми любимый RMSE

In [11]:
model.fit(user_item_t_csr)

HBox(children=(HTML(value=''), FloatProgress(value=0.0), HTML(value='')))




Построим похожие фильмы по 1 movie_id = Истории игрушек

In [12]:
movie_info.head(5)

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [13]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

Как мы видим, симилары действительно оказались симиларами.

Качество симиларов часто является хорошим способом проверить качество алгоритмов.

P.S. Если хочется поглубже разобраться в том как разные алгоритмы формируют разные латентные пространства, рекомендую загружать полученные вектора в tensorBoard и смотреть на сформированное пространство

In [14]:
get_similars(1, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '2315    Babe: Pig in the City (1998)',
 '360    Lion King, The (1994)',
 '1838    Mulan (1998)',
 '2618    Tarzan (1999)',
 '1526    Hercules (1997)']

Давайте теперь построим рекомендации для юзеров

Как мы видим юзеру нравится фантастика, значит и в рекомендациях ожидаем увидеть фантастику

In [15]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [16]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

Получилось! 

Мы действительно порекомендовали пользователю фантастику и боевики, более того встречаются продолжения тех фильмов, которые он высоко оценил

In [17]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [18]:
get_recommendations(4, model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '1182    Aliens (1986)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '2502    Matrix, The (1999)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1179    Princess Bride, The (1987)',
 '847    Godfather, The (1972)',
 '1892    Rain Man (1988)',
 '3402    Close Encounters of the Third Kind (1977)']

Теперь ваша очередь реализовать самые популярные алгоритмы матричных разложений

Что будет оцениваться:
1. Корректность алгоритма
2. Качество получившихся симиларов
3. Качество итоговых рекомендаций для юзера

### Шаг 0: вспомогательные классы для данных и моделей

In [232]:
class EvaluationData:

    def __init__(self, ratings, explicit=True, train_val_test_split=(0.8, 0.1, 0.1)):
        train_ratio, val_ratio, test_ratio = train_val_test_split
        self.n_ratings = len(ratings)
        self.n_train = int(self.n_ratings * train_ratio)
        self.n_val = int(self.n_ratings * val_ratio)
        self.n_test = self.n_ratings - self.n_train - self.n_val

        self.n_users = max(ratings.user_id) + 1
        self.n_movies = max(ratings.movie_id) + 1
        self.present_users = ratings.user_id.unique()
        self.present_movies = ratings.movie_id.unique()
        
        index = np.random.permutation(ratings.index)
        self.train_ratings = ratings.loc[index[:self.n_train]]
        self.val_ratings = ratings.loc[index[self.n_train:self.n_train + self.n_val]]
        self.test_ratings = ratings.loc[index[self.n_train + self.n_val:]]

In [233]:
eval_data = EvaluationData(ratings)

In [151]:
evaluator.n_users, evaluator.n_movies

(6041, 3953)

In [394]:
from abc import ABC

def movie_name(movie_id):
    return f'{movie_info[movie_info["movie_id"] == movie_id]["name"].iloc[0]} | ' \
           f'{movie_info[movie_info["movie_id"] == movie_id]["category"].iloc[0]}'

class Model(ABC):
    def __init__(self, n_users, n_movies, dim=64, training_epochs=10, lr=0.001, reg=0.1, early_stopping_epochs=3):
        self.dim = dim
        self.epochs = training_epochs
        self.early_stopping_epochs = early_stopping_epochs
        self.lr = lr
        self.reg = reg
        self.n_users = n_users
        self.n_movies = n_movies
        self.uniform_range = 1 / np.sqrt(dim)
        self.user_embedding = np.zeros((n_users, dim))
        self.movie_embedding = np.zeros((n_users, dim))
        self.user_bias = np.zeros(dim)
        self.movie_bias = np.zeros(dim)
        self.bias = 0
        self.historical_data = None
        
    
    def _init_embeddings(self, data):
        self.user_embedding = np.random.uniform(0, self.uniform_range, (self.n_users, self.dim))
        self.movie_embedding = np.random.uniform(0, self.uniform_range, (self.n_movies, self.dim))
        self.user_bias = np.zeros(self.n_users)
        self.movie_bias = np.zeros(self.n_movies)
        self.bias = data['rating'].mean()
    
    def _store_state(self):
        self._best_state = (
            self.user_embedding.copy(), 
            self.movie_embedding.copy(), 
            self.user_bias.copy(),
            self.movie_bias.copy(),
            self.bias
        )
        
    def _restore_state(self):
        self.user_embedding, self.movie_embedding, self.user_bias, self.movie_bias, self.bias  = self._best_state
    
    def _predict_rating(self, u, m):
        return np.dot(self.user_embedding[u], self.movie_embedding[m]) + self.user_bias[u] + self.movie_bias[m] + self.bias
    
    def _update_embeddings(self, u, m, error):
        user_correction = self.lr * (error * self.movie_embedding[m] + self.reg * self.user_embedding[u])
        movie_correction = self.lr * (error * self.user_embedding[u] + self.reg * self.movie_embedding[m])
        user_bias_correction = self.lr * (error + self.reg * self.user_bias[u])
        movie_bias_correction = self.lr * (error + self.reg * self.movie_bias[m])
        self.user_embedding[u] -= user_correction
        self.movie_embedding[m] -= movie_correction
        self.user_bias[u] -=  user_bias_correction
        self.movie_bias[m] -= movie_bias_correction
    
    def _create_implicit_data(self, data):
        return data.loc[(data['rating'] >= 4)]
        
    def _implicit_to_csr(self, implicit_data):
        users = implicit_ratings["user_id"]
        movies = implicit_ratings["movie_id"]
        user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
        user_item_t_csr = user_item.T.tocsr()
        user_item_csr = user_item.tocsr()
        return user_item_csr
    
    def print_history(self, user_id):
        print(f'User {user_id} watched:')
        for movie_id in self.historical_data.getrow(user_id).indices:
            print(movie_name(movie_id))
        print()
        
    def print_recommendations(self, user_id, k=10):
        watched = {x for x in self.historical_data.getrow(user_id).indices}
        predicted_ratings = np.dot(self.movie_embedding, self.user_embedding[user_id]) / np.linalg.norm(self.movie_embedding, axis=1)
        sorted_indices = np.argsort(predicted_ratings)
        predicted = 0
        print(f'Recommended movies for user {user_id}')
        for movie_id in reversed(sorted_indices):
            if movie_id not in watched:
                predicted += 1
                print(movie_name(movie_id))
            if k <= predicted:
                break
                
    def print_similar_movies(self, movie_id, k=10):
        predicted_ratings = np.dot(self.movie_embedding, self.movie_embedding[movie_id]) / np.linalg.norm(self.movie_embedding, axis=1)
        sorted_indices = np.argsort(predicted_ratings)
        print(f'Similar movies to movie {movie_name(movie_id)}')
        for movie_id in list(reversed(sorted_indices))[:k]:
            print(movie_name(movie_id))

### Задание 1. Не использую готовые решения, реализовать SVD разложение используя SGD на explicit данных

In [395]:
class SvdSgd(Model):
    
    def __init__(self, n_users, n_movies, dim=64, training_epochs=10, lr=1e-3, reg=1e-5, early_stopping_epochs=3):
        super().__init__(n_users, n_movies, dim, training_epochs, lr, reg, early_stopping_epochs)
    
    def fit(self, train_data, val_data):
        self._init_embeddings(train_data)
        self._store_state()
        self.historical_data = self._implicit_to_csr(self._create_implicit_data(train_data))
        best_rmse, best_epoch = np.inf, -1
        for epoch in range(self.epochs):
            print(f'Training epoch {epoch + 1}:', flush=True)
            indices = np.random.permutation(train_data.index)
            total_se = 0
            
            for ind in tqdm(indices):
                row = train_data.loc[ind]
                user, movie, rating = row['user_id'], row['movie_id'], row['rating']
                prediction = self._predict_rating(user, movie)
                error = prediction - rating
                self._update_embeddings(user, movie, error)
                total_se += error ** 2
            
            mean_se = total_se / len(indices)
            rmse = np.sqrt(mean_se)
            validation_rmse = self.evaluate(val_data)
            
            print(f'After epoch {epoch + 1}:\n'
                  f'Training RMSE = {rmse:.4f}\n'
                  f'Validation RMSE = {validation_rmse:.4f}')
            if validation_rmse < best_rmse:
                best_rmse = validation_rmse
                best_epoch = epoch
                print(f'RMSE improved! Storing the parameters.')
                self._store_state()
            elif self.early_stopping_epochs <= epoch - best_epoch:
                print(f'No validation improvements for {epoch - best_epoch} epochs. Stopping.')
                self._restore_state()
                break
        print(f'Best validation RMSE: {best_rmse:.4f}')
            
    def evaluate(self, data):
        print(f'Running evaluation:', flush=True)
        total_se = 0
        for ind in tqdm(data.index):
            row = data.loc[ind]
            user, movie, rating = row['user_id'], row['movie_id'], row['rating']
            prediction = self._predict_rating(user, movie)
            error = prediction - rating
            total_se += error ** 2
        mean_se = total_se / len(data.index)
        rmse = np.sqrt(mean_se)
        return rmse

In [396]:
model = SvdSgd(eval_data.n_users, eval_data.n_movies, training_epochs=20, lr=0.005, reg=0.01)

In [397]:
model.fit(eval_data.train_ratings, eval_data.val_ratings)

Training epoch 1:


100%|██████████| 800167/800167 [01:22<00:00, 9662.52it/s] 

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11672.15it/s]

After epoch 1:
Training RMSE = 0.9864
Validation RMSE = 0.9375
RMSE improved! Storing the parameters.
Training epoch 2:



100%|██████████| 800167/800167 [01:24<00:00, 9435.36it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11797.99it/s]

After epoch 2:
Training RMSE = 0.9248
Validation RMSE = 0.9201
RMSE improved! Storing the parameters.
Training epoch 3:



100%|██████████| 800167/800167 [01:23<00:00, 9549.50it/s] 

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 12031.32it/s]

After epoch 3:
Training RMSE = 0.9113
Validation RMSE = 0.9136
RMSE improved! Storing the parameters.
Training epoch 4:



100%|██████████| 800167/800167 [01:25<00:00, 9336.31it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11691.28it/s]

After epoch 4:
Training RMSE = 0.9043
Validation RMSE = 0.9095
RMSE improved! Storing the parameters.
Training epoch 5:



100%|██████████| 800167/800167 [01:23<00:00, 9579.63it/s] 

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11697.34it/s]

After epoch 5:
Training RMSE = 0.8992
Validation RMSE = 0.9068
RMSE improved! Storing the parameters.
Training epoch 6:



100%|██████████| 800167/800167 [01:24<00:00, 9417.04it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11776.49it/s]

After epoch 6:
Training RMSE = 0.8944
Validation RMSE = 0.9041
RMSE improved! Storing the parameters.
Training epoch 7:



100%|██████████| 800167/800167 [01:25<00:00, 9385.06it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11866.34it/s]

After epoch 7:
Training RMSE = 0.8896
Validation RMSE = 0.9011
RMSE improved! Storing the parameters.
Training epoch 8:



100%|██████████| 800167/800167 [01:24<00:00, 9443.68it/s] 

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11627.23it/s]

After epoch 8:
Training RMSE = 0.8843
Validation RMSE = 0.8980
RMSE improved! Storing the parameters.
Training epoch 9:



100%|██████████| 800167/800167 [01:27<00:00, 9155.75it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11440.27it/s]

After epoch 9:
Training RMSE = 0.8784
Validation RMSE = 0.8944
RMSE improved! Storing the parameters.
Training epoch 10:



100%|██████████| 800167/800167 [01:25<00:00, 9353.45it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11561.78it/s]

After epoch 10:
Training RMSE = 0.8717
Validation RMSE = 0.8909
RMSE improved! Storing the parameters.
Training epoch 11:



100%|██████████| 800167/800167 [01:24<00:00, 9458.15it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11619.88it/s]

After epoch 11:
Training RMSE = 0.8640
Validation RMSE = 0.8868
RMSE improved! Storing the parameters.
Training epoch 12:



100%|██████████| 800167/800167 [01:25<00:00, 9334.90it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11569.85it/s]

After epoch 12:
Training RMSE = 0.8553
Validation RMSE = 0.8827
RMSE improved! Storing the parameters.
Training epoch 13:



100%|██████████| 800167/800167 [01:26<00:00, 9267.55it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11991.73it/s]

After epoch 13:
Training RMSE = 0.8455
Validation RMSE = 0.8782
RMSE improved! Storing the parameters.
Training epoch 14:



100%|██████████| 800167/800167 [01:25<00:00, 9401.81it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11921.95it/s]

After epoch 14:
Training RMSE = 0.8349
Validation RMSE = 0.8741
RMSE improved! Storing the parameters.
Training epoch 15:



100%|██████████| 800167/800167 [01:24<00:00, 9430.65it/s] 

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11814.70it/s]

After epoch 15:
Training RMSE = 0.8236
Validation RMSE = 0.8701
RMSE improved! Storing the parameters.
Training epoch 16:



100%|██████████| 800167/800167 [01:25<00:00, 9324.62it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11643.82it/s]

After epoch 16:
Training RMSE = 0.8118
Validation RMSE = 0.8669
RMSE improved! Storing the parameters.
Training epoch 17:



100%|██████████| 800167/800167 [01:26<00:00, 9259.16it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11694.59it/s]

After epoch 17:
Training RMSE = 0.7995
Validation RMSE = 0.8640
RMSE improved! Storing the parameters.
Training epoch 18:



100%|██████████| 800167/800167 [01:26<00:00, 9208.40it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11901.89it/s]

After epoch 18:
Training RMSE = 0.7870
Validation RMSE = 0.8620
RMSE improved! Storing the parameters.
Training epoch 19:



100%|██████████| 800167/800167 [01:26<00:00, 9248.22it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11636.83it/s]

After epoch 19:
Training RMSE = 0.7743
Validation RMSE = 0.8603
RMSE improved! Storing the parameters.
Training epoch 20:



100%|██████████| 800167/800167 [01:25<00:00, 9332.75it/s]

Running evaluation:



100%|██████████| 100020/100020 [00:08<00:00, 11509.53it/s]

After epoch 20:
Training RMSE = 0.7615
Validation RMSE = 0.8596
RMSE improved! Storing the parameters.
Best validation RMSE: 0.8596





In [398]:
print(f"RMSE on test data: {model.evaluate(eval_data.test_ratings):.4f}")

Running evaluation:


100%|██████████| 100022/100022 [00:08<00:00, 11681.19it/s]

RMSE on test data: 0.8603





In [399]:
def print_unbiased_recommendations(model, user_id, k=10):
    watched = {x for x in model.historical_data.getrow(user_id).indices}
    predicted_ratings = np.dot(model.movie_embedding, model.user_embedding[user_id])
    sorted_indices = np.argsort(predicted_ratings)
    predicted = 0
    print(f'Recommended unbiased movies for user {user_id}')
    for movie_id in reversed(sorted_indices):
        if movie_id not in watched:
            predicted += 1
            print(predicted_ratings[movie_id], movie_name(movie_id))
        if k <= predicted:
            break
    print()

In [400]:
model.print_history(4)

print_unbiased_recommendations(model, 4)
model.print_recommendations(4)

User 4 watched:
Star Wars: Episode IV - A New Hope (1977) | Action|Adventure|Fantasy|Sci-Fi
Jurassic Park (1993) | Action|Adventure|Sci-Fi
Die Hard (1988) | Action|Thriller
E.T. the Extra-Terrestrial (1982) | Children's|Drama|Fantasy|Sci-Fi
Raiders of the Lost Ark (1981) | Action|Adventure
Good, The Bad and The Ugly, The (1966) | Action|Western
Alien (1979) | Action|Horror|Sci-Fi|Thriller
Terminator, The (1984) | Action|Sci-Fi|Thriller
Jaws (1975) | Action|Horror
Rocky (1976) | Action|Drama
Saving Private Ryan (1998) | Action|Drama|War
King Kong (1933) | Action|Adventure|Horror
Run Lola Run (Lola rennt) (1998) | Action|Crime|Romance
Goldfinger (1964) | Action
Fistful of Dollars, A (1964) | Action|Western
Thelma & Louise (1991) | Action|Drama
Hustler, The (1961) | Drama
Mad Max (1979) | Action|Sci-Fi

Recommended unbiased movies for user 4
0.7415357031205381 Room with a View, A (1986) | Drama|Romance
0.6829720269647984 Nashville (1975) | Drama|Musical
0.6669970430090664 On the Waterfron

In [401]:
model.print_similar_movies(1)

Similar movies to movie Toy Story (1995) | Animation|Children's|Comedy
Toy Story (1995) | Animation|Children's|Comedy
Toy Story 2 (1999) | Animation|Children's|Comedy
Babe (1995) | Children's|Comedy|Drama
Big (1988) | Comedy|Fantasy
Bug's Life, A (1998) | Animation|Children's|Comedy
Chicken Run (2000) | Animation|Children's|Comedy
Return of the Pink Panther, The (1974) | Comedy
Mulan (1998) | Animation|Children's
Aladdin (1992) | Animation|Children's|Comedy|Musical
Cocoon (1985) | Comedy|Sci-Fi


In [402]:
model.print_history(7)

print_unbiased_recommendations(model, 7)
model.print_recommendations(7)

User 7 watched:
Heat (1995) | Action|Crime|Thriller
Braveheart (1995) | Action|Drama|War
Clear and Present Danger (1994) | Action|Adventure|Thriller
True Lies (1994) | Action|Adventure|Comedy|Romance
Demolition Man (1993) | Action|Sci-Fi
Fugitive, The (1993) | Action|Thriller
In the Line of Fire (1993) | Action|Thriller
Jurassic Park (1993) | Action|Adventure|Sci-Fi
Terminator 2: Judgment Day (1991) | Action|Sci-Fi|Thriller
Mission: Impossible (1996) | Action|Adventure|Mystery
Rock, The (1996) | Action|Adventure|Thriller
Supercop (1992) | Action|Thriller
Star Wars: Episode V - The Empire Strikes Back (1980) | Action|Adventure|Drama|Sci-Fi|War
Godfather: Part II, The (1974) | Action|Crime|Drama
Back to the Future (1985) | Comedy|Sci-Fi
Face/Off (1997) | Action|Sci-Fi|Thriller
Men in Black (1997) | Action|Adventure|Comedy|Sci-Fi
Hunt for Red October, The (1990) | Action|Thriller
Tomorrow Never Dies (1997) | Action|Romance|Thriller
Exorcist, The (1973) | Horror
Saving Private Ryan (1998) 

IndexError: single positional indexer is out-of-bounds

### Задание 2. Не использую готовые решения, реализовать матричное разложение используя ALS на implicit данных

In [391]:
class ALS(Model):

    def __init__(
        self, n_users, n_movies, 
        dim=64, training_epochs=10, lr=1e-3, reg=1e-5, early_stopping_epochs=3,
        alpha=10
    ):
        super().__init__(n_users, n_movies, dim, training_epochs, lr, reg, early_stopping_epochs)
        self.alpha = alpha
        
    def fit(self, train_data, val_data, unique_users, unique_movies):
        self._init_embeddings(train_data)
        self.bias = 0
        self._store_state()
        self.historical_data = self._implicit_to_csr(self._create_implicit_data(train_data))
        best_rmse, best_epoch = np.inf, -1

        for epoch in range(self.epochs):
            print(f'Training epoch {epoch + 1}:', flush=True)
            
            print(f'Recomputing user embeddings...', flush=True)
            movie_normalizer = (self.movie_embedding * self.movie_embedding).sum(axis=1)
            for user_id in tqdm(unique_users):
                user_watched = self.historical_data.getrow(user_id).toarray()[0]
                confidence = user_watched * self.alpha + 1
                normalizer = (movie_normalizer * confidence + self.reg).sum()
                total_movie_embedding = (confidence.reshape(-1, 1) * self.movie_embedding).sum(axis=0)
                self.user_embedding[user_id] = total_movie_embedding / normalizer
                
                
            print(f'Recomputing movie embeddings...', flush=True)
            user_normalizer = (self.user_embedding * self.user_embedding).sum(axis=1)
            for movie_id in tqdm(unique_movies):
                movies_watched_by = self.historical_data.getcol(movie_id).toarray()[:, 0]
                confidence = movies_watched_by * self.alpha + 1
                normalizer = (user_normalizer * confidence + self.reg).sum()
                total_user_embedding = (confidence.reshape(-1, 1) * self.user_embedding).sum(axis=0)
                self.movie_embedding[movie_id] = total_user_embedding / normalizer

            validation_rmse = self.evaluate(val_data)
            print(f'After epoch {epoch + 1}:\n'
                  f'Validation RMSE = {validation_rmse:.4f}')
            if validation_rmse < best_rmse:
                best_rmse = validation_rmse
                best_epoch = epoch
                print(f'RMSE improved! Storing the parameters.')
                self._store_state()
            elif self.early_stopping_epochs <= epoch - best_epoch:
                print(f'No validation improvements for {epoch - best_epoch} epochs. Stopping.')
                self._restore_state()
                break
            print()
            
        print(f'Best validation RMSE: {best_rmse:.4f}')
        
    def evaluate(self, data):
        print(f'Running evaluation:', flush=True)
        start_time = time()
        proc = cpu_count()
        chunks = np.array_split(data.index, proc)
        with Parallel(proc) as pool:
            results = pool(
                delayed(self.evaluate_parallel)(data.loc[chunk])
                for chunk in chunks
            )
        total_se = sum(result[0] for result in results)
        count = sum(result[1] for result in results)
        mean_se = total_se / count
        rmse = np.sqrt(mean_se)
        end_time = time()
        print(f'Evaluated in {end_time - start_time:.1f} sec. RMSE = {rmse:.4f}')
        return rmse
    
    def evaluate_parallel(self, data):
        total_se = 0
        count = 0
        for ind in tqdm(data.index):
            row = data.loc[ind]
            user, movie, rating = row['user_id'], row['movie_id'], row['rating']
            if rating >= 4:
                prediction = self._predict_rating(user, movie)
                error = 1 - prediction
                total_se += error ** 2
                count += 1
        return total_se, count

In [392]:
als_model = ALS(eval_data.n_users, eval_data.n_movies, training_epochs=10, reg=0.01, alpha=1000)

In [393]:
als_model.fit(eval_data.train_ratings, eval_data.val_ratings, eval_data.present_users, eval_data.present_movies)

Training epoch 1:
Recomputing user embeddings...


100%|██████████| 6040/6040 [00:02<00:00, 2264.28it/s]

Recomputing movie embeddings...



100%|██████████| 3706/3706 [00:05<00:00, 707.84it/s]

Running evaluation:





Evaluated in 2.2 sec. RMSE = 0.0046
After epoch 1:
Validation RMSE = 0.0046
RMSE improved! Storing the parameters.

Training epoch 2:
Recomputing user embeddings...


100%|██████████| 6040/6040 [00:02<00:00, 2192.99it/s]

Recomputing movie embeddings...



100%|██████████| 3706/3706 [00:05<00:00, 723.02it/s]

Running evaluation:





Evaluated in 1.5 sec. RMSE = 0.0009
After epoch 2:
Validation RMSE = 0.0009
RMSE improved! Storing the parameters.

Training epoch 3:
Recomputing user embeddings...


100%|██████████| 6040/6040 [00:02<00:00, 2142.42it/s]

Recomputing movie embeddings...



100%|██████████| 3706/3706 [00:05<00:00, 627.14it/s]

Running evaluation:





Evaluated in 1.6 sec. RMSE = 0.0009
After epoch 3:
Validation RMSE = 0.0009

Training epoch 4:
Recomputing user embeddings...


100%|██████████| 6040/6040 [00:02<00:00, 2025.06it/s]

Recomputing movie embeddings...



100%|██████████| 3706/3706 [00:05<00:00, 627.90it/s]

Running evaluation:





Evaluated in 1.6 sec. RMSE = 0.0009
After epoch 4:
Validation RMSE = 0.0009

Training epoch 5:
Recomputing user embeddings...


100%|██████████| 6040/6040 [00:02<00:00, 2063.73it/s]

Recomputing movie embeddings...



100%|██████████| 3706/3706 [00:05<00:00, 690.11it/s]

Running evaluation:





Evaluated in 1.5 sec. RMSE = 0.0009
After epoch 5:
Validation RMSE = 0.0009
No validation improvements for 3 epochs. Stopping.
Best validation RMSE: 0.0009


In [389]:
als_model.print_similar_movies(1)

Similar movies to movie Toy Story (1995) | Animation|Children's|Comedy
Toy Story (1995) | Animation|Children's|Comedy
Toy Story 2 (1999) | Animation|Children's|Comedy
Groundhog Day (1993) | Comedy|Romance
Babe (1995) | Children's|Comedy|Drama
Snowriders (1996) | Documentary
Race the Sun (1996) | Drama
Adrenalin: Fear the Rush (1996) | Action|Sci-Fi
Autopsy (Macchie Solari) (1975) | Horror
Harlem (1993) | Drama
Boys (1996) | Drama


In [390]:
als_model.print_history(4)

als_model.print_recommendations(4)

User 4 watched:
Star Wars: Episode IV - A New Hope (1977) | Action|Adventure|Fantasy|Sci-Fi
Jurassic Park (1993) | Action|Adventure|Sci-Fi
Die Hard (1988) | Action|Thriller
E.T. the Extra-Terrestrial (1982) | Children's|Drama|Fantasy|Sci-Fi
Raiders of the Lost Ark (1981) | Action|Adventure
Good, The Bad and The Ugly, The (1966) | Action|Western
Alien (1979) | Action|Horror|Sci-Fi|Thriller
Terminator, The (1984) | Action|Sci-Fi|Thriller
Jaws (1975) | Action|Horror
Rocky (1976) | Action|Drama
Saving Private Ryan (1998) | Action|Drama|War
King Kong (1933) | Action|Adventure|Horror
Run Lola Run (Lola rennt) (1998) | Action|Crime|Romance
Goldfinger (1964) | Action
Fistful of Dollars, A (1964) | Action|Western
Thelma & Louise (1991) | Action|Drama
Hustler, The (1961) | Drama
Mad Max (1979) | Action|Sci-Fi

Recommended movies for user 4
Boys and Girls (2000) | Comedy|Romance
Bossa Nova (1999) | Comedy
Allnighter, The (1987) | Comedy|Romance
Ayn Rand: A Sense of Life (1997) | Documentary
Cup, 

In [369]:
als_model.print_history(250)

als_model.print_recommendations(250)

User 250 watched:
Forrest Gump (1994) | Comedy|Romance|War
Jurassic Park (1993) | Action|Adventure|Sci-Fi
Mission: Impossible (1996) | Action|Adventure|Mystery
Escape from New York (1981) | Action|Adventure|Sci-Fi|Thriller
Alien (1979) | Action|Horror|Sci-Fi|Thriller
Nikita (La Femme Nikita) (1990) | Thriller
Donnie Brasco (1997) | Crime|Drama
Lost World: Jurassic Park, The (1997) | Action|Adventure|Sci-Fi|Thriller
Airplane! (1980) | Comedy
From Russia with Love (1963) | Action
Romeo Must Die (2000) | Action|Romance
Frequency (2000) | Drama|Thriller

Recommended movies for user 250
Pitch Black (2000) | Action|Sci-Fi
Highlander: Endgame (2000) | Action|Adventure|Fantasy
U-571 (2000) | Action|Thriller
Knock Off (1998) | Action
Art of War, The (2000) | Action
Mission to Mars (2000) | Sci-Fi
Aces: Iron Eagle III (1992) | Action|War
Starship Troopers (1997) | Action|Adventure|Sci-Fi|War
Predator 2 (1990) | Action|Sci-Fi|Thriller
Face/Off (1997) | Action|Sci-Fi|Thriller


In [370]:
als_model.print_history(7)

als_model.print_recommendations(7)

User 7 watched:
Heat (1995) | Action|Crime|Thriller
Braveheart (1995) | Action|Drama|War
Clear and Present Danger (1994) | Action|Adventure|Thriller
True Lies (1994) | Action|Adventure|Comedy|Romance
Demolition Man (1993) | Action|Sci-Fi
Fugitive, The (1993) | Action|Thriller
In the Line of Fire (1993) | Action|Thriller
Jurassic Park (1993) | Action|Adventure|Sci-Fi
Terminator 2: Judgment Day (1991) | Action|Sci-Fi|Thriller
Mission: Impossible (1996) | Action|Adventure|Mystery
Rock, The (1996) | Action|Adventure|Thriller
Supercop (1992) | Action|Thriller
Star Wars: Episode V - The Empire Strikes Back (1980) | Action|Adventure|Drama|Sci-Fi|War
Godfather: Part II, The (1974) | Action|Crime|Drama
Back to the Future (1985) | Comedy|Sci-Fi
Face/Off (1997) | Action|Sci-Fi|Thriller
Men in Black (1997) | Action|Adventure|Comedy|Sci-Fi
Hunt for Red October, The (1990) | Action|Thriller
Tomorrow Never Dies (1997) | Action|Romance|Thriller
Exorcist, The (1973) | Horror
Saving Private Ryan (1998) 

In [371]:
als_model.print_history(6015)

als_model.print_recommendations(6015)

User 6015 watched:
Toy Story (1995) | Animation|Children's|Comedy
Babe (1995) | Children's|Comedy|Drama
Clerks (1994) | Comedy
Pulp Fiction (1994) | Crime|Drama
Manhattan Murder Mystery (1993) | Comedy|Mystery
Much Ado About Nothing (1993) | Comedy|Romance
Aladdin (1992) | Animation|Children's|Comedy|Musical
Fish Called Wanda, A (1988) | Comedy
English Patient, The (1996) | Drama|Romance|War
Strictly Ballroom (1992) | Comedy|Romance
Annie Hall (1977) | Comedy|Romance
Dead Poets Society (1989) | Drama
Groundhog Day (1993) | Comedy|Romance
Heathers (1989) | Comedy
Indiana Jones and the Last Crusade (1989) | Action|Adventure
When Harry Met Sally... (1989) | Comedy|Romance
Hercules (1997) | Adventure|Animation|Children's|Comedy|Musical
Mulan (1998) | Animation|Children's
Roger & Me (1989) | Comedy|Documentary
Lady and the Tramp (1955) | Animation|Children's|Comedy|Musical|Romance
Little Mermaid, The (1989) | Animation|Children's|Comedy|Musical|Romance
101 Dalmatians (1961) | Animation|Chil

### Задание 3. Не использую готовые решения, реализовать матричное разложение BPR на implicit данных

In [None]:
# class BPR:

#     def __init__(self,):
    
#     def evaluate(self, data):
#         print(f'Running evaluation:', flush=True)
#         end_time = time()
#         y_true =  defaultdict(list)
#         y_predicted = defaultdict(list)
#         for ind in tqdm(data.index):
#             row = data.loc[ind]
#             user, movie, rating = row['user_id'], row['movie_id'], row['rating']
#             if rating >= 4:
#                 y_true[user].append(1)
#             else:
#                 y_true[user].append(0)
#             prediction = np.dot(self.movie_embedding[movie], self.user_embedding[user]) / np.linalg.norm(self.movie_embedding[movie])
#             y_predicted[user].append(prediction)
        
#         total_roc_auc = 0
#         count = 0
#         for user in y_true.keys():
#             y_t = y_true[user]
#             if 0 in y_t and 1 in y_t:
#                 y_p = y_predicted[user]
#                 total_roc_auc += roc_auc_score(y_t, y_p)
#                 count += 1
#         return total_roc_auc / count

### Задание 4. Не использую готовые решения, реализовать матричное разложение WARP на implicit данных