### Matrix Factorization

In this assignment you will get hands-on experience with matrix factorization.
The assignment consists of 4 tasks:
1. Implement SVD factorization using SGD on explicit data
2. Implement matrix factorization using ALS on implicit data
3. Implement matrix factorization using BPR (pair-wise loss) on implicit data
4. Implement matrix factorization using WARP (list-wise loss) on implicit data

Soft deadline: September 28 (feedback is given, a grade is assigned, and you can make corrections before the hard deadline)

Hard deadline: October 5 (final review)

In [1]:
!pip install -q implicit lightfm



In [1]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp
import random

from lightfm.datasets import fetch_movielens
from tqdm.autonotebook import tqdm
from sklearn.neighbors import KDTree

In this assignment we will work with the explicit MovieLens dataset, which contains (user_id, movie_id) pairs together with the rating the user gave the movie.

The dataset can be downloaded from https://grouplens.org/datasets/movielens/1m/

In [3]:
!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip -O data.zip
!unzip data.zip

--2020-10-07 21:36:35--  http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘data.zip’


2020-10-07 21:36:37 (6.71 MB/s) - ‘data.zip’ saved [5917549/5917549]

Archive:  data.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         


In [30]:
ratings = pd.read_csv('ml-1m/ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [31]:
movie_info = pd.read_csv('ml-1m/movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python')

In [32]:
ratings['user_id'] -= 1
ratings['movie_id'] -= 1
movie_info['movie_id'] -= 1

Explicit data

In [35]:
ratings.head(10)

,user_id,movie_id,rating
0,0,1192,5
1,0,660,3
2,0,913,3
3,0,3407,4
4,0,2354,5
5,0,1196,3
6,0,1286,5
7,0,2803,5
8,0,593,4
9,0,918,4


To convert the dataset to implicit feedback, let's treat a rating >= 4 as a positive interaction.

In [36]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [37]:
implicit_ratings.head(10)

,user_id,movie_id,rating
0,0,1192,5
3,0,3407,4
4,0,2354,5
6,0,1286,5
7,0,2803,5
8,0,593,4
9,0,918,4
10,0,594,5
11,0,937,4
12,0,2397,4


Sparse matrices are more convenient to work with, so let's convert the DataFrame to CSR matrices.

In [38]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()
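
To make the conversion above concrete, here is a minimal sketch on a made-up 3x4 interaction set. The `toy_*` names and values are assumptions for illustration, not MovieLens data. It shows how a COO matrix built from index pairs becomes a CSR matrix whose rows can be sliced per user.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy interactions: (user, movie) pairs with implicit value 1
toy_users = np.array([0, 0, 1, 2])
toy_movies = np.array([1, 3, 0, 3])

toy_coo = sp.coo_matrix(
    (np.ones_like(toy_users), (toy_users, toy_movies)), shape=(3, 4)
)
toy_csr = toy_coo.tocsr()        # rows = users, cheap per-user slicing
toy_csr_t = toy_coo.T.tocsr()    # rows = movies, cheap per-movie slicing

# Column indices of the movies that user 0 interacted with
user0_movies = toy_csr[0].indices
print(sorted(user0_movies.tolist()))  # [1, 3]
```

Passing `shape` explicitly is optional, but it guards against the matrix being silently truncated when the highest ids have no interactions.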

As an example, let's use the ALS factorization from the implicit library.

We set the dimensionality of the latent space to 64; this also determines the size of the user/item embeddings.

In [39]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)

The loss here is squared-error based, close kin to everyone's favorite RMSE.

In [40]:
model.fit(user_item_t_csr)


Let's find movies similar to the first movie, Toy Story (movie_id = 0 after the shift to zero-based ids).

In [41]:
movie_info.head(5)

,movie_id,name,category
0,0,Toy Story (1995),Animation|Children's|Comedy
1,1,Jumanji (1995),Adventure|Children's|Fantasy
2,2,Grumpier Old Men (1995),Comedy|Romance
3,3,Waiting to Exhale (1995),Comedy|Drama
4,4,Father of the Bride Part II (1995),Comedy


In [42]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

As the output below shows, the similar items really are similar.

The quality of similar items is often a good way to check the quality of an algorithm.

P.S. If you want to dig deeper into how different algorithms form different latent spaces, I recommend loading the resulting vectors into TensorBoard and looking at the resulting space.
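
One possible way to get vectors into TensorBoard's Embedding Projector is to write them out as tab-separated files: one file with a vector per row and one with a label per row. The sketch below assumes this route. The helper name `export_for_projector`, the file names, and the random stand-in factors are all hypothetical and only illustrate the file layout.

```python
import numpy as np
import pandas as pd

def export_for_projector(item_factors, names, vectors_path, metadata_path):
    """Write embeddings and labels as tab-separated files, one row per item."""
    item_factors = np.asarray(item_factors)
    if len(names) != item_factors.shape[0]:
        raise ValueError("need exactly one label per embedding row")
    np.savetxt(vectors_path, item_factors, delimiter="\t")
    # A single-column metadata file is written without a header row
    pd.Series(names).to_csv(metadata_path, sep="\t", index=False, header=False)

# Hypothetical stand-in for learned item factors (e.g. an (n_items, 64) matrix)
demo_factors = np.random.default_rng(0).normal(size=(3, 4))
demo_names = ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"]
export_for_projector(demo_factors, demo_names, "vectors.tsv", "metadata.tsv")
```

With a trained model, the item-factor matrix and the matching movie names would take the place of the stand-in arrays; the rows of both files must be in the same order.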

In [43]:
get_similars(0, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '2315    Babe: Pig in the City (1998)',
 '360    Lion King, The (1994)',
 '1526    Hercules (1997)',
 '2692    Iron Giant, The (1999)',
 '1838    Mulan (1998)']

Now let's build recommendations for users.

As the history below shows, the user likes science fiction, so we expect to see science fiction in the recommendations.

In [44]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [45]:
get_user_history(3, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

It worked!

We did indeed recommend science fiction and action movies to the user; moreover, the list includes sequels to movies they rated highly.

In [46]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [47]:
get_recommendations(3, model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '1182    Aliens (1986)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '2502    Matrix, The (1999)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1884    French Connection, The (1971)',
 '3402    Close Encounters of the Third Kind (1977)',
 '1892    Rain Man (1988)',
 '847    Godfather, The (1972)']

Now it's your turn to implement the most popular matrix factorization algorithms.

What will be graded:
1. Correctness of the algorithm
2. Quality of the resulting similar items
3. Quality of the final recommendations for a user

### Task 1. Without using ready-made solutions, implement SVD factorization using SGD on explicit data

In [29]:
def get_movies(ids):
    return movie_info.set_index('movie_id').loc[ids]

In [18]:
SEED = 7
np.random.seed(SEED)
random.seed(SEED)

In [23]:
class BaseRecommender:
    def __init__(self, 
                 ratings,
                 is_implicit=False,
                 emb_size=64, 
                 lr=0.01, 
                 reg=0.01,
                 reg_bias=0.1, 
                 max_iter=int(1e7), 
                 check_interval=int(1e5),
                 eps=1e-7):
        
        if is_implicit:
            self.R = sp.coo_matrix((np.ones_like(ratings.user_id), (ratings.user_id, ratings.movie_id))).tocsr()
        else:
            self.R = sp.coo_matrix((ratings.rating, (ratings.user_id, ratings.movie_id))).tocsr()
        self.n_users, self.n_items = self.R.shape
        self.nnz = self.R.nnz
        self.nonzero = self.R.nonzero()

        self.unique_items = ratings['movie_id'].unique()
        self.unique_users = ratings['user_id'].unique() 
          
        self.emb_size = emb_size
        self.lr = lr
        self.reg = reg
        self.reg_bias = reg_bias
        self.max_iter = max_iter
        self.check_interval = check_interval
        self.eps = eps

        self.P = np.random.uniform(0, 1 / np.sqrt(self.emb_size), size=(self.n_users, self.emb_size))  
        self.Q = np.random.uniform(0, 1 / np.sqrt(self.emb_size), size=(self.n_items, self.emb_size))

        self.P_bias = np.array(self.R.mean(axis=1)).flatten()
        self.Q_bias = np.array(self.R.mean(axis=0)).flatten()
        self.global_bias = self.R.mean()
        
    def predict(self, users, items):
        raise NotImplementedError()

    def build_rec(self):
        raise NotImplementedError()

    def early_stopping(self, iteration):
        rmse = np.linalg.norm(self.build_rec()[self.nonzero] - self.R[self.nonzero]) / (self.nnz ** 0.5)
        print(f'#{iteration} RMSE: {rmse:.4}')
        return rmse < self.eps
        
    def similar_items(self, item_id, k=10):
        return np.argsort(np.linalg.norm(self.Q - self.Q[item_id], axis=1))[:k]

    def recommend(self, user_id, k=10):
        known_items = set(self.R[user_id].nonzero()[1])
        unknown_items = list(set(self.unique_items) - known_items)
        items_to_rec = sorted(unknown_items,
                              key=lambda item: self.predict(user_id, item), 
                              reverse=True)
        return items_to_rec[:k]

In [24]:
class SVD(BaseRecommender):
    def __init__(self, 
                 ratings=ratings,
                 is_implicit=False,
                 emb_size=64, 
                 lr=0.01, 
                 reg=0.01,
                 reg_bias=0.01, 
                 max_iter=int(1e7), 
                 check_interval=int(1e5),
                 eps=1e-7):
        super().__init__(ratings, is_implicit, emb_size, lr, reg, reg_bias, max_iter, check_interval, eps)
        
    def predict(self, user, item):
        return self.P[user] @ self.Q[item].T + self.global_bias + self.P_bias[user] + self.Q_bias[item]

    def build_rec(self):
        return self.global_bias + self.P_bias[:, None] + self.Q_bias[None, :] + self.P @ self.Q.T

    def fit(self):
        pbar = tqdm(range(self.max_iter), total=self.max_iter)
        users_nz, items_nz = self.nonzero
        for cur_iter in pbar:
            ind = np.random.randint(len(users_nz))
            user = users_nz[ind]
            item = items_nz[ind]
            if cur_iter % self.check_interval == 0:
                if self.early_stopping(cur_iter):
                    break
            
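            # Prediction error for the sampled (user, item) rating
            error = self.predict(user, item) - self.R[user, item]

            # SGD step on the L2-regularized squared error; the copy keeps the
            # item update from seeing the already-updated user vector
            P_user_copy = self.P[user].copy()
            self.P[user] -= self.lr * (error * self.Q[item] + self.reg * self.P[user])
            self.Q[item] -= self.lr * (error * P_user_copy + self.reg * self.Q[item])
            del P_user_copy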

            self.global_bias -= self.lr * error
            self.P_bias[user] -= self.lr * (error + self.reg_bias * self.P_bias[user])
            self.Q_bias[item] -= self.lr * (error + self.reg_bias * self.Q_bias[item])
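
For reference, the updates in `fit` above correspond to one SGD step on a regularized squared error for a single observed rating $r_{ui}$. The notation is introduced here for illustration only: $p_u$, $q_i$, $b_u$, $b_i$, $\mu$, $\eta$, $\lambda$, $\lambda_b$ map to `P[user]`, `Q[item]`, `P_bias[user]`, `Q_bias[item]`, `global_bias`, `lr`, `reg`, `reg_bias`.

```latex
\begin{aligned}
\hat r_{ui} &= \mu + b_u + b_i + p_u^\top q_i, \qquad e_{ui} = \hat r_{ui} - r_{ui},\\
p_u &\leftarrow p_u - \eta\,(e_{ui}\, q_i + \lambda\, p_u),\\
q_i &\leftarrow q_i - \eta\,(e_{ui}\, p_u^{\text{old}} + \lambda\, q_i),\\
\mu &\leftarrow \mu - \eta\, e_{ui},\\
b_u &\leftarrow b_u - \eta\,(e_{ui} + \lambda_b\, b_u),\\
b_i &\leftarrow b_i - \eta\,(e_{ui} + \lambda_b\, b_i).
\end{aligned}
```

Here $p_u^{\text{old}}$ is the user vector before its own update, which is what `P_user_copy` holds in the code.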
            

In [25]:
svd = SVD(ratings, emb_size=64, lr=0.01, 
          reg=1e-4, reg_bias=1e-2, max_iter=int(2e7), check_interval=int(5e6), eps=1e-5)
svd.fit()

#0 RMSE: 2.576
#5000000 RMSE: 0.8377
#10000000 RMSE: 0.6983
#15000000 RMSE: 0.6225



In [26]:
svd.early_stopping(int(2e7))

#20000000 RMSE: 0.5887


False

In [48]:
get_movies(svd.similar_items(0))

movie_id,name,category
0,Toy Story (1995),Animation|Children's|Comedy
3113,Toy Story 2 (1999),Animation|Children's|Comedy
3904,"Specials, The (2000)",Comedy
946,My Man Godfrey (1936),Comedy
1612,Star Maps (1997),Drama
830,Stonewall (1995),Drama
74,Big Bully (1996),Comedy|Drama
3291,"Big Combo, The (1955)",Film-Noir
2979,Men Cry Bullets (1997),Drama
637,Jack and Sarah (1995),Romance


In [None]:
get_user_history(3, implicit_ratings)

In [50]:
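# Caution: svd.recommend returns movie_id values, whereas .loc selects by row label,
# so the rows shown are not the recommended movies; get_movies(svd.recommend(3)) maps ids correctly.
movie_info.loc[svd.recommend(3)]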

,movie_id,name,category
911,922,Citizen Kane (1941),Drama
857,867,Death in Brunswick (1991),Comedy
952,963,Angel and the Badman (1947),Western
3434,3502,Solaris (Solyaris) (1972),Drama|Sci-Fi
909,920,My Favorite Year (1982),Comedy
912,923,2001: A Space Odyssey (1968),Drama|Mystery|Sci-Fi|Thriller
526,529,Second Best (1994),Drama
3094,3162,Topsy-Turvy (1999),Drama
1206,1223,Henry V (1989),Drama|War
16,16,Sense and Sensibility (1995),Drama|Romance


### Task 2. Without using ready-made solutions, implement matrix factorization using ALS on implicit data

In [51]:
class ALS(BaseRecommender):
    def __init__(self, 
                 ratings=implicit_ratings,
                 is_implicit=True,
                 emb_size=64, 
                 lr=1e-3, 
                 reg=1e-3,
                 max_iter=100, 
                 check_interval=10,
                 eps=1e-7):
        super().__init__(ratings, is_implicit, emb_size, lr, reg, 0, max_iter, check_interval, eps)

    def predict(self, user, item):
        return self.P[user] @ self.Q[item]

    def build_rec(self):
        return self.P @ self.Q.T

    def fit(self):
        pbar = tqdm(range(self.max_iter), total=self.max_iter)
        for cur_iter in pbar:
            if cur_iter % self.check_interval == 0:
                if self.early_stopping(cur_iter):
                    break

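            # Full-batch gradient step on the squared error over the dense
            # prediction matrix (a gradient-descent update, not the closed-form
            # least-squares solve that gives ALS its name)
            error = self.build_rec()
            error[self.nonzero] -= 1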
            self.P -= self.lr * (error @ self.Q + self.reg * self.P)
            self.Q -= self.lr * (error.T @ self.P + self.reg * self.Q)
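
For comparison with the gradient-style updates above, here is a minimal sketch of the closed-form alternating step that classic ALS uses: with one factor matrix held fixed, the other is the solution of a ridge regression. This is a simplified, unweighted variant on a tiny dense matrix. It treats every zero as an observed 0 with equal weight, which is an assumption of this sketch and not how the implicit library weights its entries. The names `als_step`, `als_fit`, and `toy_R` are hypothetical.

```python
import numpy as np

def als_step(R, fixed, reg):
    """Solve min_X ||R - X @ fixed.T||^2 + reg * ||X||^2 in closed form."""
    k = fixed.shape[1]
    gram = fixed.T @ fixed + reg * np.eye(k)       # (k, k), shared by all rows
    return np.linalg.solve(gram, fixed.T @ R.T).T  # (n_rows, k)

def als_fit(R, emb_size=2, reg=0.1, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(R.shape[0], emb_size))
    Q = rng.normal(scale=0.1, size=(R.shape[1], emb_size))
    losses = []
    for _ in range(n_iter):
        P = als_step(R, Q, reg)    # update users with items fixed
        Q = als_step(R.T, P, reg)  # update items with users fixed
        err = R - P @ Q.T
        losses.append((err ** 2).sum() + reg * ((P ** 2).sum() + (Q ** 2).sum()))
    return P, Q, losses

# Hypothetical 3 users x 4 items implicit matrix
toy_R = np.array([[1., 0., 1., 0.],
                  [1., 1., 0., 0.],
                  [0., 0., 1., 1.]])
P, Q, losses = als_fit(toy_R)
```

Because each half-step minimizes the regularized objective exactly over one block, the recorded loss cannot increase from one iteration to the next, and no learning rate is needed.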

In [52]:
als = ALS(emb_size=64, max_iter=1000, check_interval=100)
als.fit()

#0 RMSE: 0.7515
#100 RMSE: 0.6852
#200 RMSE: 0.6447
#300 RMSE: 0.6407
#400 RMSE: 0.6396
#500 RMSE: 0.6391
#600 RMSE: 0.6388
#700 RMSE: 0.6386
#800 RMSE: 0.6385
#900 RMSE: 0.6384



In [53]:
get_movies(als.similar_items(0))

movie_id,name,category
0,Toy Story (1995),Animation|Children's|Comedy
3113,Toy Story 2 (1999),Animation|Children's|Comedy
2354,"Bug's Life, A (1998)",Animation|Children's|Comedy
587,Aladdin (1992),Animation|Children's|Comedy|Musical
363,"Lion King, The (1994)",Animation|Children's|Musical
2320,Pleasantville (1998),Comedy
2760,"Iron Giant, The (1999)",Animation|Children's
594,Beauty and the Beast (1991),Animation|Children's|Musical
1906,Mulan (1998),Animation|Children's
1922,There's Something About Mary (1998),Comedy


In [56]:
get_user_history(3, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

In [55]:
get_movies(als.recommend(3))

movie_id,name,category
588,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller
2570,"Matrix, The (1999)",Action|Sci-Fi|Thriller
1290,Indiana Jones and the Last Crusade (1989),Action|Adventure
1303,Butch Cassidy and the Sundance Kid (1969),Action|Comedy|Western
1195,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War
1199,Aliens (1986),Action|Sci-Fi|Thriller|War
3470,Close Encounters of the Third Kind (1977),Drama|Sci-Fi
109,Braveheart (1995),Action|Drama|War
2528,Planet of the Apes (1968),Action|Sci-Fi
1286,Ben-Hur (1959),Action|Adventure|Drama


### Task 3. Without using ready-made solutions, implement BPR matrix factorization on implicit data

In [57]:
class BPR(BaseRecommender):
    def __init__(self, 
                 ratings=implicit_ratings, 
                 is_implicit=True,
                 emb_size=30, 
                 lr=1e-3, 
                 reg=1e-3,
                 max_iter=10, 
                 check_interval=1,
                 eps=1e-7,
                 n_samples=3):
        super().__init__(ratings, is_implicit, emb_size, lr, reg, 0, max_iter, check_interval, eps)
        self.pos_neg_pairs = {}
        self.build_pos_neg_pairs(ratings)
        self.n_samples = n_samples

    def predict(self, user, item):
        return self.P[user] @ self.Q[item]

    def build_rec(self):
        return self.P @ self.Q.T
        
    def build_pos_neg_pairs(self, ratings):
        for user in tqdm(self.unique_users, total=len(self.unique_users)):         
            pos_items = list(ratings[ratings['user_id'] == user]['movie_id'])
            neg_items = list(set(self.unique_items) - set(pos_items))
            self.pos_neg_pairs[user] = (pos_items, neg_items)

    def loss(self, size=20):
        res = count = 0
        for u in np.random.choice(self.unique_users, size=size, replace=False):
            pos_items, neg_items = self.pos_neg_pairs[u]
            for p in pos_items:
                for n in np.random.choice(neg_items, size=5, replace=False):
                    res += np.log(1. + np.exp(self.predict(u, n) - self.predict(u, p)))
                    count += 1
        return res / count
           
    def fit(self):
        for cur_iter in range(self.max_iter):
            print(f'Iteration #{cur_iter}')
            if cur_iter % self.check_interval == 0:
                if self.early_stopping(cur_iter):
                    break
            
            pbar = tqdm(self.unique_users, total=len(self.unique_users))
            for user in pbar:
                user_pos_items, user_neg_items = self.pos_neg_pairs[user]
                for pos in user_pos_items:
                    neg_samples = np.random.choice(user_neg_items, size=self.n_samples, replace=False)
                    
                    for neg in neg_samples:                        
                        r_pos = self.P[user] @ self.Q[pos].T
                        r_neg = self.P[user] @ self.Q[neg].T

                        # Sigmoid of the negated score gap: the scale of the BPR gradient
                        L = 1 / (1 + np.exp(r_pos - r_neg))

                        P_user_copy = self.P[user].copy()
                        self.P[user] += self.lr * (L * (self.Q[pos] - self.Q[neg]) - self.reg * self.P[user])
                        self.Q[pos] += self.lr * (L * P_user_copy - self.reg * self.Q[pos])
                        self.Q[neg] += self.lr * (L * (-P_user_copy) - self.reg * self.Q[neg])

                if not user % 1000: 
                    loss = self.loss()
                    pbar.set_postfix({'loss': loss})
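
The update inside `fit` above is a gradient-ascent step on the regularized BPR log-likelihood for one (user, positive, negative) triple $(u, i, j)$. The symbols are introduced here for illustration: $\hat x_{ui} = p_u^\top q_i$ is the predicted score, $\sigma$ is the logistic function, $\eta$ is `lr`, and $\lambda$ is `reg`.

```latex
\begin{aligned}
\ell_{uij} &= \ln \sigma(\hat x_{ui} - \hat x_{uj})
  - \tfrac{\lambda}{2}\left(\lVert p_u\rVert^2 + \lVert q_i\rVert^2 + \lVert q_j\rVert^2\right),\\
g &= 1 - \sigma(\hat x_{ui} - \hat x_{uj}) = \frac{1}{1 + e^{\hat x_{ui} - \hat x_{uj}}},\\
p_u &\leftarrow p_u + \eta\,\bigl(g\,(q_i - q_j) - \lambda\, p_u\bigr),\\
q_i &\leftarrow q_i + \eta\,\bigl(g\, p_u^{\text{old}} - \lambda\, q_i\bigr),\\
q_j &\leftarrow q_j + \eta\,\bigl(-g\, p_u^{\text{old}} - \lambda\, q_j\bigr).
\end{aligned}
```

Here $g$ is the quantity the code stores in `L`, and the `loss` method reports the average of $-\ln \sigma(\hat x_{ui} - \hat x_{uj})$ over sampled triples.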

In [58]:
bpr = BPR(ratings=implicit_ratings, is_implicit=True, emb_size=64, max_iter=5, lr=1e-2, reg=1e-4, n_samples=5)
bpr.fit()

Iteration #0
#0 RMSE: 0.751
Iteration #1
#1 RMSE: 2.796
Iteration #2
#2 RMSE: 2.591
Iteration #3
#3 RMSE: 2.732
Iteration #4
#4 RMSE: 2.955

In [59]:
get_movies(bpr.similar_items(0))

movie_id,name,category
0,Toy Story (1995),Animation|Children's|Comedy
1196,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance
3113,Toy Story 2 (1999),Animation|Children's|Comedy
1264,Groundhog Day (1993),Comedy|Romance
1269,Back to the Future (1985),Comedy|Sci-Fi
2354,"Bug's Life, A (1998)",Animation|Children's|Comedy
1258,Stand by Me (1986),Adventure|Comedy|Drama
2986,Who Framed Roger Rabbit? (1988),Adventure|Animation|Film-Noir
2715,Ghostbusters (1984),Comedy|Horror
2917,Ferris Bueller's Day Off (1986),Comedy


In [60]:
get_user_history(3, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

In [61]:
get_movies(bpr.recommend(3))

movie_id,name,category
1195,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War
857,"Godfather, The (1972)",Action|Crime|Drama
2857,American Beauty (1999),Comedy|Drama
592,"Silence of the Lambs, The (1991)",Drama|Thriller
2570,"Matrix, The (1999)",Action|Sci-Fi|Thriller
588,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller
607,Fargo (1996),Crime|Drama|Thriller
2761,"Sixth Sense, The (1999)",Thriller
1196,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance
1209,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War


### Task 4. Without using ready-made solutions, implement WARP matrix factorization on implicit data

In [62]:
class WARP(BPR):
    def __init__(self, 
                 ratings=implicit_ratings, 
                 is_implicit=True,
                 emb_size=64, 
                 lr=1e-3, 
                 reg=1e-3,
                 max_iter=10, 
                 check_interval=1,
                 eps=1e-7,
                 n_samples=3):
        super().__init__(ratings, is_implicit, emb_size, lr, reg, max_iter, check_interval, eps, n_samples)

    def fit(self):
        for cur_iter in range(self.max_iter):
            print(f'Iteration #{cur_iter}')
            if cur_iter % self.check_interval == 0:
                if self.early_stopping(cur_iter):
                    break
            
            pbar = tqdm(self.unique_users, total=len(self.unique_users))
            for user in pbar:
                pos_items, neg_items = self.pos_neg_pairs[user]
                for pos in pos_items:
                    rank = 0
                    for neg in np.random.permutation(neg_items):
                        rank += 1
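                        # Margin violation: the sampled negative scores within 1 of the positive
                        if self.predict(user, pos) < self.predict(user, neg) + 1:
                            P_user_copy = self.P[user].copy()
                            # The fewer samples needed to find a violator, the larger the weight
                            weight = np.log(len(neg_items) / rank)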

                            self.P[user] += self.lr * (weight * (self.Q[pos] - self.Q[neg]) - self.reg * P_user_copy)
                            self.Q[pos] += self.lr * (weight * P_user_copy - self.reg * self.Q[pos])
                            self.Q[neg] += self.lr * (weight * (- P_user_copy) - self.reg * self.Q[neg])
                            break
                
                if not user % 1000: 
                    loss = self.loss()
                    pbar.set_postfix({'loss': loss})
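
The sampling logic in `fit` above can be isolated into a small function, which makes the rank-based weight easier to inspect. The sketch below mirrors the `log(len(neg_items) / rank)` weight used in the class; other formulations of WARP use different rank-to-weight functions. The function name `sample_warp_violation` and the toy scores are hypothetical.

```python
import numpy as np

def sample_warp_violation(pos_score, neg_scores, rng, margin=1.0):
    """Return (index, weight) of the first sampled margin violator, or None."""
    n_neg = len(neg_scores)
    for trials, idx in enumerate(rng.permutation(n_neg), start=1):
        if pos_score < neg_scores[idx] + margin:
            # Few trials needed => many violators => the positive is ranked low
            return idx, np.log(n_neg / trials)
    return None  # the positive already beats every negative by the margin

rng = np.random.default_rng(0)
neg_scores = np.array([0.2, 0.9, 0.1, 0.4])
print(sample_warp_violation(pos_score=3.0, neg_scores=neg_scores, rng=rng))  # None
idx, weight = sample_warp_violation(pos_score=0.5, neg_scores=neg_scores, rng=rng)
```

In the second call every negative violates the margin, so the first sample already triggers an update and the weight takes its maximum value, `log(4)`.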

In [63]:
warp = WARP(emb_size=64, max_iter=5, lr=1e-3, reg=1e-3, n_samples=5)
warp.fit()

Iteration #0
#0 RMSE: 0.7512
Iteration #1
#1 RMSE: 1.621
Iteration #2
#2 RMSE: 1.946
Iteration #3
#3 RMSE: 2.012
Iteration #4
#4 RMSE: 2.046

In [64]:
get_movies(warp.similar_items(0))

movie_id,name,category
0,Toy Story (1995),Animation|Children's|Comedy
2354,"Bug's Life, A (1998)",Animation|Children's|Comedy
3113,Toy Story 2 (1999),Animation|Children's|Comedy
587,Aladdin (1992),Animation|Children's|Comedy|Musical
1264,Groundhog Day (1993),Comedy|Romance
1269,Back to the Future (1985),Comedy|Sci-Fi
2715,Ghostbusters (1984),Comedy|Horror
2917,Ferris Bueller's Day Off (1986),Comedy
1196,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance
2796,Big (1988),Comedy|Fantasy


In [65]:
get_user_history(3, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

In [66]:
get_movies(warp.recommend(3))

movie_id,name,category
1195,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War
2570,"Matrix, The (1999)",Action|Sci-Fi|Thriller
2761,"Sixth Sense, The (1999)",Thriller
1209,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War
592,"Silence of the Lambs, The (1991)",Drama|Thriller
857,"Godfather, The (1972)",Action|Crime|Drama
588,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller
317,"Shawshank Redemption, The (1994)",Drama
2857,American Beauty (1999),Comedy|Drama
607,Fargo (1996),Crime|Drama|Thriller
