### Matrix factorizations

In this assignment you will get hands-on experience with matrix factorizations.
The assignment consists of 4 tasks:
1. Implement SVD factorization using SGD on explicit data
2. Implement matrix factorization using ALS on implicit data
3. Implement matrix factorization using BPR (pair-wise loss) on implicit data
4. Implement matrix factorization using WARP (list-wise loss) on implicit data

Soft deadline: September 28 (feedback is given, a grade is assigned, and you can make corrections before the hard deadline)

Hard deadline: October 5 (final review)

In [1]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

from lightfm.datasets import fetch_movielens

In this assignment we will work with the explicit MovieLens dataset, which contains (user_id, movie_id) pairs together with the rating the user gave to the movie.

The dataset can be downloaded from https://grouplens.org/datasets/movielens/1m/

In [2]:
ratings = pd.read_csv('./ml-1m/ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [3]:
movie_info = pd.read_csv('./ml-1m/movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python')

Explicit data

In [4]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


To turn the current dataset into an implicit one, let's treat a rating >= 4 as positive feedback.

In [5]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [6]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
3,1,3408,4
4,1,2355,5
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4
10,1,595,5
11,1,938,4
12,1,2398,4


Sparse matrices are more convenient to work with, so let's convert the DataFrame to CSR matrices.

In [7]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()
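
A small sketch of what this conversion does, on toy data (the `toy_*` names are illustrative and not part of the notebook). Because the raw ids are used directly as row and column indices, the matrix shape is inferred from the largest id, and row 0 and column 0 simply stay empty.

```python
import numpy as np
import scipy.sparse as sp

# Toy interactions: (user_id, movie_id) pairs; ids start at 1, as in MovieLens.
toy_users = np.array([1, 1, 2, 3])
toy_movies = np.array([2, 4, 2, 1])

# One stored value of 1 per positive interaction; shape is inferred from the max ids.
toy_user_item = sp.coo_matrix(
    (np.ones_like(toy_users), (toy_users, toy_movies))
).tocsr()

print(toy_user_item.shape)  # (4, 5): one extra row/column for the unused index 0
liked_by_user_1 = sorted(toy_user_item[1].indices.tolist())
```

Indexing by raw id keeps lookups simple, at the cost of a few empty rows and columns wherever ids are missing.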

As an example, we'll use the ALS factorization from the implicit library.

We set the dimensionality of the latent space to 64; this is also the size of the user/item embeddings.

In [8]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)



The loss here is a squared-error one, a close relative of everyone's favorite RMSE.
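
For reference, implicit-feedback ALS is usually described (in the formulation of Hu, Koren and Volinsky) as minimizing a confidence-weighted squared error. The notation below is an assumption introduced here, not taken from the notebook: $x_u$ and $y_i$ are user and item factors, $r_{ui}$ is the raw interaction value, $\alpha$ scales the confidence and $\lambda$ is the regularization strength. Whether the library reports exactly this quantity as its training loss is not confirmed by the notebook.

```latex
\min_{X,\,Y}\;
\sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
\;+\; \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr),
\qquad
p_{ui} = \mathbb{1}[r_{ui} > 0],
\quad
c_{ui} = 1 + \alpha\, r_{ui}
```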

In [9]:
model.fit(user_item_t_csr)





Let's find movies similar to movie_id = 1, Toy Story.

In [10]:
movie_info.head(5)

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [11]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

As we can see, the similar items really do turn out to be similar.

The quality of similar items is often a good way to check the quality of an algorithm.

P.S. If you want to dig deeper into how different algorithms shape different latent spaces, I recommend loading the resulting vectors into TensorBoard and inspecting the resulting space.
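
One possible way to get vectors into TensorBoard's Embedding Projector is to write them as tab-separated files: one file with a vector per row and one metadata file with a label per row (a single-column metadata file carries no header). The helper `export_for_projector` and the toy data below are hypothetical, a minimal sketch rather than part of the notebook.

```python
import os
import tempfile

import numpy as np


def export_for_projector(vectors, labels, out_dir):
    """Write vectors.tsv and metadata.tsv (one row per item) into out_dir."""
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    np.savetxt(vec_path, vectors, delimiter="\t")
    with open(meta_path, "w", encoding="utf-8") as f:
        for label in labels:
            f.write(f"{label}\n")
    return vec_path, meta_path


# Toy stand-in for item factors: 3 items, 4 latent dimensions.
toy_vectors = np.arange(12, dtype=float).reshape(3, 4)
toy_labels = ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"]

out_dir = tempfile.mkdtemp()
vec_path, meta_path = export_for_projector(toy_vectors, toy_labels, out_dir)
```

With a trained model, the item factor matrix (one row per movie) and the matching movie names would take the place of the toy arrays.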

In [12]:
get_similars(1, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '2315    Babe: Pig in the City (1998)',
 '360    Lion King, The (1994)',
 '1526    Hercules (1997)',
 '2692    Iron Giant, The (1999)',
 '2252    Pleasantville (1998)']

Now let's build recommendations for users.

As we can see, the user likes science fiction, so we expect to see science fiction in the recommendations as well.

In [13]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [14]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

It worked!

We really did recommend science fiction and action movies to the user; moreover, the list includes sequels to movies they rated highly.

In [15]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [16]:
get_recommendations(4, model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '1182    Aliens (1986)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '2502    Matrix, The (1999)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1179    Princess Bride, The (1987)',
 '3402    Close Encounters of the Third Kind (1977)',
 '847    Godfather, The (1972)',
 '1892    Rain Man (1988)']

Now it's your turn to implement the most popular matrix factorization algorithms.

What will be graded:
1. Correctness of the algorithm
2. Quality of the resulting similar items
3. Quality of the final recommendations for a user

In [17]:
from tqdm.notebook import tqdm
from sklearn.metrics.pairwise import cosine_similarity

### Task 1. Without using off-the-shelf solutions, implement SVD factorization using SGD on explicit data

In [18]:
class SVD:

    def __init__(self, max_iter, n_components, lr, decay, eval_each):
        self.max_iter = max_iter
        self.n_components = n_components
        self.lr = lr
        self.decay = decay
        self.eval_each = eval_each

    def fit(self, ratings: pd.DataFrame):
        emb2user = np.unique(ratings.user_id)
        emb2movie = np.unique(ratings.movie_id)
        user2emb = {user: emb for emb, user in enumerate(emb2user)}
        movie2emb = {movie: emb for emb, movie in enumerate(emb2movie)}

        ratings_np = ratings.rating.to_numpy()
        user_ids_np = ratings.user_id.map(user2emb.get).to_numpy()
        movie_ids_np = ratings.movie_id.map(movie2emb.get).to_numpy()

        W = np.random.uniform(0.0, 1.0 / np.sqrt(self.n_components), (len(emb2user), self.n_components))
        H = np.random.uniform(0.0, 1.0 / np.sqrt(self.n_components), (self.n_components, len(emb2movie)))
        BW = np.zeros(len(emb2user))
        BH = np.zeros(len(emb2movie))
        mu = ratings.rating.mean()

        for k in tqdm(range(1, self.max_iter + 1)):
            user_id, movie_id, rating = ratings.iloc[np.random.randint(0, len(ratings))]
            i, j = user2emb[user_id], movie2emb[movie_id]

            error = W[i] @ H[:, j] + BW[i] + BH[j] + mu - rating

            Wi_update = self.lr * (error * H[:, j] + self.decay * W[i])
            H[:, j] -= self.lr * (error * W[i] + self.decay * H[:, j])
            W[i] -= Wi_update

            BW[i] -= self.lr * (error + self.decay * BW[i])
            BH[j] -= self.lr * (error + self.decay * BH[j])

            if k % self.eval_each == 0:
                predictions = W @ H + BW.reshape(-1, 1) + BH + mu
                rmse = np.linalg.norm(predictions[user_ids_np, movie_ids_np] - ratings_np) / np.sqrt(len(ratings))
                print(f'Iter: {k:08}, RMSE: {rmse:.3f}')

        self.W, self.H, self.BW, self.BH, self.mu = W, H, BW, BH, mu
        self.movie2emb, self.emb2movie, self.user2emb = movie2emb, emb2movie, user2emb

    def similar_items(self, movie_id, n=10):
        j = self.movie2emb[movie_id]
        distances = cosine_similarity(self.H[:, j].reshape(1, -1), self.H.T)
        most_similar = distances[0].argsort()[-n:][::-1]
        return [(self.emb2movie[emb],) for emb in most_similar]

    def recommend(self, user_id, _, n=20):
        i = self.user2emb[user_id]
        ratings = self.W[i] @ self.H + self.BW[i] + self.BH + self.mu
        recommended_embs = ratings.argsort()
        return [(self.emb2movie[emb],) for emb in recommended_embs 
                if self.emb2movie[emb] not in 
                implicit_ratings[implicit_ratings.user_id == user_id].movie_id.to_numpy()][-n:][::-1]
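
The per-sample step in `fit` above can be read as stochastic gradient descent on a regularized squared error. As a sketch in notation introduced here (not in the notebook): $w_i$ is row `W[i]`, $h_j$ is column `H[:, j]`, $b_i$ and $c_j$ stand for `BW[i]` and `BH[j]`, $\eta$ is `lr` and $\lambda$ is `decay`. All right-hand sides use the values from before the step.

```latex
e_{ij} = w_i^{\top} h_j + b_i + c_j + \mu - r_{ij}
\\[4pt]
w_i \leftarrow w_i - \eta\,(e_{ij}\, h_j + \lambda\, w_i), \qquad
h_j \leftarrow h_j - \eta\,(e_{ij}\, w_i + \lambda\, h_j)
\\[4pt]
b_i \leftarrow b_i - \eta\,(e_{ij} + \lambda\, b_i), \qquad
c_j \leftarrow c_j - \eta\,(e_{ij} + \lambda\, c_j)
```

These are the gradients of $\tfrac{1}{2} e_{ij}^2 + \tfrac{\lambda}{2}\bigl(\lVert w_i \rVert^2 + \lVert h_j \rVert^2 + b_i^2 + c_j^2\bigr)$ with respect to each parameter.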

In [19]:
%%time

svd = SVD(max_iter=12000000, n_components=300, lr=1e-2, decay=1e-2, eval_each=500000)

svd.fit(ratings)


Iter: 00500000, RMSE: 0.931
Iter: 01000000, RMSE: 0.913
Iter: 01500000, RMSE: 0.905
Iter: 02000000, RMSE: 0.899
Iter: 02500000, RMSE: 0.894
Iter: 03000000, RMSE: 0.887
Iter: 03500000, RMSE: 0.880
Iter: 04000000, RMSE: 0.871
Iter: 04500000, RMSE: 0.861
Iter: 05000000, RMSE: 0.849
Iter: 05500000, RMSE: 0.836
Iter: 06000000, RMSE: 0.822
Iter: 06500000, RMSE: 0.807
Iter: 07000000, RMSE: 0.791
Iter: 07500000, RMSE: 0.774
Iter: 08000000, RMSE: 0.756
Iter: 08500000, RMSE: 0.737
Iter: 09000000, RMSE: 0.718
Iter: 09500000, RMSE: 0.699
Iter: 10000000, RMSE: 0.679
Iter: 10500000, RMSE: 0.659
Iter: 11000000, RMSE: 0.639
Iter: 11500000, RMSE: 0.619
Iter: 12000000, RMSE: 0.600

CPU times: user 33min 1s, sys: 58.5 s, total: 33min 59s
Wall time: 32min 27s


In [20]:
movie_info[movie_info.movie_id == 1]

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy


In [21]:
get_similars(1, svd)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '584    Aladdin (1992)',
 '1838    Mulan (1998)',
 '591    Beauty and the Beast (1991)',
 '3327    Muppet Movie, The (1979)',
 '2618    Tarzan (1999)',
 '2020    Rescuers Down Under, The (1990)',
 '1636    Truman Show, The (1998)']

In [22]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

In [23]:
get_recommendations(4, svd)

['847    Godfather, The (1972)',
 '892    Rear Window (1954)',
 '900    Casablanca (1942)',
 '2836    Sanjuro (1962)',
 '1203    Godfather: Part II, The (1974)',
 '1950    Seven Samurai (The Magnificent Seven) (Shichin...',
 '1234    Treasure of the Sierra Madre, The (1948)',
 '3366    Double Indemnity (1944)',
 '1876    On the Waterfront (1954)',
 '1194    Third Man, The (1949)',
 "1176    One Flew Over the Cuckoo's Nest (1975)",
 '664    World of Apu, The (Apur Sansar) (1959)',
 '1189    To Kill a Mockingbird (1962)',
 '911    Citizen Kane (1941)',
 '1399    Hearts and Minds (1996)',
 '886    Philadelphia Story, The (1940)',
 '3560    Gold Rush, The (1925)',
 '2961    Yojimbo (1961)',
 '3026    Grapes of Wrath, The (1940)',
 '896    North by Northwest (1959)']

### Task 2. Without using off-the-shelf solutions, implement matrix factorization using ALS on implicit data

In [24]:
# Binarize the feedback: every remaining (positive) interaction gets the value 1.
implicit_ratings = implicit_ratings.assign(rating=1)

In [25]:
class ALS:

    def __init__(self, max_iter, n_components, lr, decay, eval_each):
        self.max_iter = max_iter
        self.n_components = n_components
        self.lr = lr
        self.decay = decay
        self.eval_each = eval_each

    def fit(self, ratings: pd.DataFrame):
        emb2user = np.unique(ratings.user_id)
        emb2movie = np.unique(ratings.movie_id)
        user2emb = {user: emb for emb, user in enumerate(emb2user)}
        movie2emb = {movie: emb for emb, movie in enumerate(emb2movie)}

        ratings_np = ratings.rating.to_numpy()
        user_ids_np = ratings.user_id.map(user2emb.get).to_numpy()
        movie_ids_np = ratings.movie_id.map(movie2emb.get).to_numpy()

        W = np.random.uniform(0.0, 1.0 / np.sqrt(self.n_components), (len(emb2user), self.n_components))
        H = np.random.uniform(0.0, 1.0 / np.sqrt(self.n_components), (self.n_components, len(emb2movie)))

        for k in tqdm(range(1, self.max_iter + 1)):
            predictions = W @ H
            
            error = predictions.copy()
            error[user_ids_np, movie_ids_np] -= ratings_np
            
            if k % 2 == 0:
                W -= self.lr * (error @ H.T + self.decay * W)
            else:
                H -= self.lr * (W.T @ error + self.decay * H)

            if k % self.eval_each == 0:
                rmse = np.linalg.norm(predictions[user_ids_np, movie_ids_np] - ratings_np) / np.sqrt(len(ratings))
                print(f'Iter: {k:03}, RMSE: {rmse:.3f}')

        self.W, self.H = W, H
        self.movie2emb, self.emb2movie, self.user2emb = movie2emb, emb2movie, user2emb

    def similar_items(self, movie_id, n=10):
        j = self.movie2emb[movie_id]
        distances = cosine_similarity(self.H[:, j].reshape(1, -1), self.H.T)
        most_similar = distances[0].argsort()[-n:][::-1]
        return [(self.emb2movie[emb],) for emb in most_similar]

    def recommend(self, user_id, _, n=20):
        i = self.user2emb[user_id]
        ratings = self.W[i] @ self.H
        recommended_embs = ratings.argsort()
        return [(self.emb2movie[emb],) for emb in recommended_embs 
                if self.emb2movie[emb] not in 
                implicit_ratings[implicit_ratings.user_id == user_id].movie_id.to_numpy()][-n:][::-1]
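
The class above alternates gradient steps between `W` and `H`. For comparison, textbook ALS solves each half-step in closed form as a ridge regression. Below is a minimal dense sketch of that variant for the same unweighted objective (unobserved cells treated as zeros); `als_step`, `sq_loss` and the toy matrix are hypothetical names introduced for illustration.

```python
import numpy as np


def als_step(R, fixed, reg):
    """Solve for one side's factors while the other side is held fixed.

    R: (n_rows, n_cols) dense target matrix; fixed: (n_cols, k) factors.
    Returns the (n_rows, k) ridge-regression solution R @ fixed @ inv(gram).
    """
    k = fixed.shape[1]
    gram = fixed.T @ fixed + reg * np.eye(k)        # (k, k), symmetric
    return np.linalg.solve(gram, fixed.T @ R.T).T   # (n_rows, k)


def sq_loss(R, U, V, reg):
    """Regularized squared reconstruction error over all cells."""
    return np.sum((R - U @ V.T) ** 2) + reg * (np.sum(U ** 2) + np.sum(V ** 2))


rng = np.random.default_rng(0)
R = (rng.random((6, 5)) < 0.4).astype(float)  # toy binary interaction matrix
U = rng.normal(scale=0.1, size=(6, 2))        # user factors
V = rng.normal(scale=0.1, size=(5, 2))        # item factors

losses = [sq_loss(R, U, V, reg=0.1)]
for _ in range(5):
    U = als_step(R, V, reg=0.1)     # update users with items fixed
    V = als_step(R.T, U, reg=0.1)   # update items with users fixed
    losses.append(sq_loss(R, U, V, reg=0.1))
```

Each half-step exactly minimizes the loss over one block of parameters, so the recorded loss cannot increase and no learning rate is needed.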

In [26]:
%%time

als = ALS(max_iter=260, n_components=300, lr=1e-3, decay=1e-3, eval_each=10)

als.fit(implicit_ratings)


Iter: 010, RMSE: 0.868
Iter: 020, RMSE: 0.840
Iter: 030, RMSE: 0.829
Iter: 040, RMSE: 0.823
Iter: 050, RMSE: 0.817
Iter: 060, RMSE: 0.804
Iter: 070, RMSE: 0.788
Iter: 080, RMSE: 0.771
Iter: 090, RMSE: 0.756
Iter: 100, RMSE: 0.743
Iter: 110, RMSE: 0.732
Iter: 120, RMSE: 0.721
Iter: 130, RMSE: 0.711
Iter: 140, RMSE: 0.701
Iter: 150, RMSE: 0.691
Iter: 160, RMSE: 0.680
Iter: 170, RMSE: 0.670
Iter: 180, RMSE: 0.659
Iter: 190, RMSE: 0.648
Iter: 200, RMSE: 0.637
Iter: 210, RMSE: 0.627
Iter: 220, RMSE: 0.616
Iter: 230, RMSE: 0.605
Iter: 240, RMSE: 0.595
Iter: 250, RMSE: 0.585
Iter: 260, RMSE: 0.575

CPU times: user 14min 35s, sys: 7min 38s, total: 22min 13s
Wall time: 3min 40s


In [27]:
movie_info[movie_info.movie_id == 1]

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy


In [28]:
get_similars(1, als)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 '584    Aladdin (1992)',
 "2286    Bug's Life, A (1998)",
 '591    Beauty and the Beast (1991)',
 '2225    Antz (1998)',
 '1526    Hercules (1997)',
 '360    Lion King, The (1994)',
 '1838    Mulan (1998)',
 '2618    Tarzan (1999)']

In [29]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

In [30]:
get_recommendations(4, als)

['1182    Aliens (1986)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '585    Terminator 2: Judgment Day (1991)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '453    Fugitive, The (1993)',
 '1179    Princess Bride, The (1987)',
 '2880    Dr. No (1962)',
 '2879    From Russia with Love (1963)',
 '1884    French Connection, The (1971)',
 '537    Blade Runner (1982)',
 '1267    Ben-Hur (1959)',
 '3458    Predator (1987)',
 '3634    Mad Max 2 (a.k.a. The Road Warrior) (1981)',
 '847    Godfather, The (1972)',
 '2875    Dirty Dozen, The (1967)',
 '2571    Superman (1978)',
 '2460    Planet of the Apes (1968)',
 '1568    Hunt for Red October, The (1990)',
 '108    Braveheart (1995)']

### Task 3. Without using off-the-shelf solutions, implement BPR matrix factorization on implicit data

In [31]:
class BPR:

    def __init__(self, max_iter, n_components, lr, decay, eval_each):
        self.max_iter = max_iter
        self.n_components = n_components
        self.lr = lr
        self.decay = decay
        self.eval_each = eval_each

    def fit(self, ratings: pd.DataFrame):
        n_ratings = len(ratings)

        emb2user = np.unique(ratings.user_id)
        emb2movie = np.unique(ratings.movie_id)
        user2emb = {user: emb for emb, user in enumerate(emb2user)}
        movie2emb = {movie: emb for emb, movie in enumerate(emb2movie)}

        user2seen = {user2emb[user_id]: set(ratings[ratings.user_id == user_id].movie_id) for user_id in emb2user}
        user2unseen = [[movie2emb[movie_id] for movie_id in movie2emb if movie_id not in user2seen[emb]] for emb in range(len(emb2user))]
        
        perm = np.random.permutation(n_ratings)
        ratings_np = ratings.rating.to_numpy()[perm]
        user_ids_np = ratings.user_id.map(user2emb.get).to_numpy()[perm]
        movie_ids_np = ratings.movie_id.map(movie2emb.get).to_numpy()[perm]

        W = np.random.uniform(0.0, 1.0 / np.sqrt(self.n_components), (len(emb2user), self.n_components))
        H = np.random.uniform(0.0, 1.0 / np.sqrt(self.n_components), (self.n_components, len(emb2movie)))

        for k in tqdm(range(1, self.max_iter + 1)):
            if k % n_ratings == 0:
                perm = np.random.permutation(len(ratings))
                user_ids_np = user_ids_np[perm]
                movie_ids_np = movie_ids_np[perm]
                ratings_np = ratings_np[perm]

            u = user_ids_np[k % n_ratings]
            i = movie_ids_np[k % n_ratings]
            
            for _ in range(4):
                j = np.random.choice(user2unseen[u])

                x_uij = W[u] @ H[:, i] - W[u] @ H[:, j]
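                # sigmoid(-x_uij): derivative of ln(sigmoid(x_uij)) with respect to x_uij
                sigmoid = 1 / (1 + np.exp(x_uij))

                w_u = W[u].copy()  # keep the pre-update user vector for the item gradients
                W[u] += self.lr * (sigmoid * (H[:, i] - H[:, j]) - self.decay * W[u])
                H[:, i] += self.lr * (sigmoid * w_u - self.decay * H[:, i])
                H[:, j] += self.lr * (-sigmoid * w_u - self.decay * H[:, j])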
            
            if k % self.eval_each == 0:
                rmse = np.linalg.norm((W @ H)[user_ids_np, movie_ids_np] - ratings_np) / np.sqrt(n_ratings)
                print(f'Iter: {k:06}, RMSE: {rmse:.3f}')

        self.W, self.H = W, H
        self.movie2emb, self.emb2movie, self.user2emb = movie2emb, emb2movie, user2emb

    def similar_items(self, movie_id, n=10):
        j = self.movie2emb[movie_id]
        distances = cosine_similarity(self.H[:, j].reshape(1, -1), self.H.T)
        most_similar = distances[0].argsort()[-n:][::-1]
        return [(self.emb2movie[emb],) for emb in most_similar]

    def recommend(self, user_id, _, n=20):
        i = self.user2emb[user_id]
        ratings = self.W[i] @ self.H
        recommended_embs = ratings.argsort()
        return [(self.emb2movie[emb],) for emb in recommended_embs
                if self.emb2movie[emb] not in
                implicit_ratings[implicit_ratings.user_id == user_id].movie_id.to_numpy()][-n:][::-1]
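
To see a single BPR update in isolation, here is a minimal sketch of one gradient-ascent step on ln σ(x_uij) with L2 decay, where x_uij is the score gap between a positive item i and a sampled negative item j. The function `bpr_step` and the random toy vectors are hypothetical, introduced only for illustration.

```python
import numpy as np


def bpr_step(w_u, h_i, h_j, lr, decay):
    """One ascent step on ln(sigmoid(x_uij)) minus an L2 penalty.

    All three gradients are computed from the vectors as they were
    before the step, then applied together.
    """
    x_uij = w_u @ (h_i - h_j)
    g = 1.0 / (1.0 + np.exp(x_uij))  # sigmoid(-x_uij)
    new_w_u = w_u + lr * (g * (h_i - h_j) - decay * w_u)
    new_h_i = h_i + lr * (g * w_u - decay * h_i)
    new_h_j = h_j + lr * (-g * w_u - decay * h_j)
    return new_w_u, new_h_i, new_h_j


rng = np.random.default_rng(0)
w_u, h_i, h_j = rng.normal(size=(3, 8))  # user, positive item, negative item

before = w_u @ (h_i - h_j)
w_u, h_i, h_j = bpr_step(w_u, h_i, h_j, lr=0.01, decay=0.0)
after = w_u @ (h_i - h_j)  # the positive item's margin over the negative grows
```

With `decay=0.0` and a small learning rate, one step always widens the gap between the positive and the negative score, which is the quantity BPR pushes up.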

In [32]:
%%time

bpr = BPR(max_iter=820000, n_components=300, lr=1e-3, decay=5e-3, eval_each=50000)

bpr.fit(implicit_ratings)


Iter: 050000, RMSE: 0.728
Iter: 100000, RMSE: 0.707
Iter: 150000, RMSE: 0.686
Iter: 200000, RMSE: 0.666
Iter: 250000, RMSE: 0.646
Iter: 300000, RMSE: 0.627
Iter: 350000, RMSE: 0.608
Iter: 400000, RMSE: 0.590
Iter: 450000, RMSE: 0.572
Iter: 500000, RMSE: 0.556
Iter: 550000, RMSE: 0.541
Iter: 600000, RMSE: 0.528
Iter: 650000, RMSE: 0.517
Iter: 700000, RMSE: 0.508
Iter: 750000, RMSE: 0.502
Iter: 800000, RMSE: 0.498

CPU times: user 16min 29s, sys: 35.9 s, total: 17min 5s
Wall time: 15min 53s


In [33]:
movie_info[movie_info.movie_id == 1]

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy


In [34]:
get_similars(1, bpr)

['0    Toy Story (1995)',
 '1959    Saving Private Ryan (1998)',
 '589    Silence of the Lambs, The (1991)',
 '2502    Matrix, The (1999)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '2789    American Beauty (1999)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '847    Godfather, The (1972)',
 '604    Fargo (1996)',
 '1180    Raiders of the Lost Ark (1981)']

In [35]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

In [36]:
get_recommendations(4, bpr)

['2789    American Beauty (1999)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '589    Silence of the Lambs, The (1991)',
 '2502    Matrix, The (1999)',
 '2693    Sixth Sense, The (1999)',
 '1192    Star Wars: Episode VI - Return of the Jedi (1983)',
 '604    Fargo (1996)',
 '585    Terminator 2: Judgment Day (1991)',
 "523    Schindler's List (1993)",
 '315    Shawshank Redemption, The (1994)',
 '847    Godfather, The (1972)',
 '108    Braveheart (1995)',
 '1179    Princess Bride, The (1987)',
 '1250    Back to the Future (1985)',
 '1575    L.A. Confidential (1997)',
 '2327    Shakespeare in Love (1998)',
 '293    Pulp Fiction (1994)',
 '2928    Being John Malkovich (1999)',
 '1245    Groundhog Day (1993)',
 '0    Toy Story (1995)']

### Task 4. Without using off-the-shelf solutions, implement WARP matrix factorization on implicit data

In [37]:
class WARP:

    def __init__(self, max_iter, n_components, lr, decay, eval_each):
        self.max_iter = max_iter
        self.n_components = n_components
        self.lr = lr
        self.decay = decay
        self.eval_each = eval_each

    def fit(self, ratings: pd.DataFrame):
        n_ratings = len(ratings)

        emb2user = np.unique(ratings.user_id)
        emb2movie = np.unique(ratings.movie_id)
        user2emb = {user: emb for emb, user in enumerate(emb2user)}
        movie2emb = {movie: emb for emb, movie in enumerate(emb2movie)}

        user2seen = {user2emb[user_id]: set(ratings[ratings.user_id == user_id].movie_id) for user_id in emb2user}
        user2unseen = [[movie2emb[movie_id] for movie_id in movie2emb if movie_id not in user2seen[emb]] for emb in range(len(emb2user))]

        perm = np.random.permutation(n_ratings)
        ratings_np = ratings.rating.to_numpy()[perm]
        user_ids_np = ratings.user_id.map(user2emb.get).to_numpy()[perm]
        movie_ids_np = ratings.movie_id.map(movie2emb.get).to_numpy()[perm]

        W = np.random.uniform(0.0, 1.0 / np.sqrt(self.n_components), (len(emb2user), self.n_components))
        H = np.random.uniform(0.0, 1.0 / np.sqrt(self.n_components), (self.n_components, len(emb2movie)))

        for k in tqdm(range(1, self.max_iter + 1)):
            if k % n_ratings == 0:
                perm = np.random.permutation(len(ratings))
                user_ids_np = user_ids_np[perm]
                movie_ids_np = movie_ids_np[perm]
                ratings_np = ratings_np[perm]

            u = user_ids_np[k % n_ratings]
            i = movie_ids_np[k % n_ratings]

            score_i = W[u] @ H[:, i]
            for q, j in enumerate(np.random.permutation(user2unseen[u]), 1):
                score_j = W[u] @ H[:, j]
                if score_i < score_j + 1:
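                    # Rank estimate: number of candidate negatives divided by the draws needed;
                    # the floor keeps it >= 1, so the log weight never goes negative.
                    modulator = np.log(max(1, len(user2unseen[u]) // q))
                    w_u = W[u].copy()  # keep the pre-update user vector for the item gradients
                    W[u] -= self.lr * (modulator * (H[:, j] - H[:, i]) + self.decay * W[u])
                    H[:, i] -= self.lr * (-modulator * w_u + self.decay * H[:, i])
                    H[:, j] -= self.lr * (modulator * w_u + self.decay * H[:, j])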
                    break
                    
            if k % self.eval_each == 0:
                rmse = np.linalg.norm((W @ H)[user_ids_np, movie_ids_np] - ratings_np) / np.sqrt(n_ratings)
                print(f'Iter: {k:06}, RMSE: {rmse:.3f}')

        self.W, self.H = W, H
        self.movie2emb, self.emb2movie, self.user2emb = movie2emb, emb2movie, user2emb

    def similar_items(self, movie_id, n=10):
        j = self.movie2emb[movie_id]
        distances = cosine_similarity(self.H[:, j].reshape(1, -1), self.H.T)
        most_similar = distances[0].argsort()[-n:][::-1]
        return [(self.emb2movie[emb],) for emb in most_similar]

    def recommend(self, user_id, _, n=20):
        i = self.user2emb[user_id]
        ratings = self.W[i] @ self.H
        recommended_embs = ratings.argsort()
        return [(self.emb2movie[emb],) for emb in recommended_embs
                if self.emb2movie[emb] not in
                implicit_ratings[implicit_ratings.user_id == user_id].movie_id.to_numpy()][-n:][::-1]
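
The sampling idea behind WARP can be sketched separately from the factor updates: negatives are drawn in random order until one scores within the margin of the positive item, and the number of draws gives a rough estimate of the positive item's rank. The helper `warp_sample` is hypothetical, and the logarithmic weight is an assumption used here as a stand-in for the harmonic-sum weight of the original WARP formulation.

```python
import numpy as np


def warp_sample(score_pos, neg_scores, rng, margin=1.0):
    """Draw negatives in random order until one violates the margin.

    Returns (index of the violating negative, number of draws, weight),
    or None if no negative violates the margin.
    """
    n_neg = len(neg_scores)
    for q, j in enumerate(rng.permutation(n_neg), 1):
        if score_pos < neg_scores[j] + margin:
            rank_estimate = n_neg // q  # few draws needed => high estimated rank
            weight = np.log(max(1.0, rank_estimate))
            return int(j), q, float(weight)
    return None


rng = np.random.default_rng(0)
neg_scores = np.array([0.1, 0.2, 5.0, 0.3])

# With a positive score of 2.0, only the negative at index 2 violates the margin.
result = warp_sample(2.0, neg_scores, rng)

# A positive score far above every negative yields no update at all.
no_violation = warp_sample(10.0, neg_scores, rng)
```

A violation found on the first draw suggests the positive item is ranked low and gets the largest weight, while one found only after many draws gets a weight close to zero.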

In [38]:
%%time

warp = WARP(max_iter=250000, n_components=300, lr=1e-3, decay=5e-3, eval_each=50000)

warp.fit(implicit_ratings)


Iter: 050000, RMSE: 0.694
Iter: 100000, RMSE: 0.637
Iter: 150000, RMSE: 0.580
Iter: 200000, RMSE: 0.533
Iter: 250000, RMSE: 0.512

CPU times: user 2min 58s, sys: 10.3 s, total: 3min 8s
Wall time: 2min 41s


In [39]:
movie_info[movie_info.movie_id == 1]

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy


In [40]:
get_similars(1, warp)

['0    Toy Story (1995)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '2789    American Beauty (1999)',
 "523    Schindler's List (1993)",
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '2693    Sixth Sense, The (1999)',
 '585    Terminator 2: Judgment Day (1991)',
 '1192    Star Wars: Episode VI - Return of the Jedi (1983)',
 '1180    Raiders of the Lost Ark (1981)',
 '604    Fargo (1996)']

In [41]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

In [42]:
get_recommendations(4, warp)

['2789    American Beauty (1999)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '589    Silence of the Lambs, The (1991)',
 '2502    Matrix, The (1999)',
 '604    Fargo (1996)',
 '2693    Sixth Sense, The (1999)',
 '585    Terminator 2: Judgment Day (1991)',
 '315    Shawshank Redemption, The (1994)',
 "523    Schindler's List (1993)",
 '1192    Star Wars: Episode VI - Return of the Jedi (1983)',
 '847    Godfather, The (1972)',
 '108    Braveheart (1995)',
 '1179    Princess Bride, The (1987)',
 '293    Pulp Fiction (1994)',
 '1250    Back to the Future (1985)',
 '1575    L.A. Confidential (1997)',
 '2327    Shakespeare in Love (1998)',
 '453    Fugitive, The (1993)',
 '352    Forrest Gump (1994)',
 '0    Toy Story (1995)']