### Matrix Factorizations

In this assignment you will get hands-on experience with matrix factorizations.
The assignment is split into 4 tasks:
1. Implement an SVD factorization trained with SGD on explicit data
2. Implement a matrix factorization trained with ALS on implicit data
3. Implement a matrix factorization trained with BPR on implicit data
4. Implement a matrix factorization trained with WARP on implicit data

Soft deadline: October 13 (feedback is written, a grade is assigned, and you can still make fixes before the hard deadline)

Hard deadline: October 20 (final review)

In [34]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

from lightfm.datasets import fetch_movielens

In this assignment we will work with the explicit MovieLens dataset, which contains (user_id, movie_id) pairs together with the rating the user gave the movie.

The dataset can be downloaded from https://grouplens.org/datasets/movielens/1m/

In [3]:
ratings = pd.read_csv('ml-1m/ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [4]:
ratings.head()

| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 1 | 1193 | 5 |
| 1 | 1 | 661 | 3 |
| 2 | 1 | 914 | 3 |
| 3 | 1 | 3408 | 4 |
| 4 | 1 | 2355 | 5 |


In [5]:
movie_info = pd.read_csv('ml-1m/movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python', encoding='latin-1')

Explicit data

In [5]:
ratings.head(10)

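| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 1 | 1193 | 5 |
| 1 | 1 | 661 | 3 |
| 2 | 1 | 914 | 3 |
| 3 | 1 | 3408 | 4 |
| 4 | 1 | 2355 | 5 |
| 5 | 1 | 1197 | 3 |
| 6 | 1 | 1287 | 5 |
| 7 | 1 | 2804 | 5 |
| 8 | 1 | 594 | 4 |
| 9 | 1 | 919 | 4 |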


To convert the current dataset into an implicit one, let's treat a rating >= 4 as positive feedback.

In [7]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [8]:
implicit_ratings.head(10)

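| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 1 | 1193 | 5 |
| 3 | 1 | 3408 | 4 |
| 4 | 1 | 2355 | 5 |
| 6 | 1 | 1287 | 5 |
| 7 | 1 | 2804 | 5 |
| 8 | 1 | 594 | 4 |
| 9 | 1 | 919 | 4 |
| 10 | 1 | 595 | 5 |
| 11 | 1 | 938 | 4 |
| 12 | 1 | 2398 | 4 |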


Sparse matrices are more convenient to work with, so let's convert the DataFrame into CSR matrices.

In [9]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()
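To make the conversion above concrete, here is a small sketch on made-up data (the `toy_*` names and values are assumptions for illustration only). It shows that, when no shape is passed, `coo_matrix` sizes each axis as the largest index plus one, so 1-based ids leave row 0 and column 0 empty. This is likely why the shape printed further down has one more row and column than the largest ids.

```python
import numpy as np
import scipy.sparse as sp

# Toy (user_id, movie_id) pairs with 1-based ids, mimicking the MovieLens layout.
toy_users = np.array([1, 1, 2, 4])
toy_movies = np.array([3, 5, 3, 1])

# Without an explicit shape, each axis gets size max index + 1,
# so row 0 and column 0 exist but stay empty.
toy_coo = sp.coo_matrix((np.ones_like(toy_users), (toy_users, toy_movies)))
toy_csr = toy_coo.tocsr()

print(toy_csr.shape)               # (5, 6)
print(sorted(toy_csr[1].indices))  # column ids of the items user 1 interacted with
```

CSR gives fast access to a user's row, while the transposed CSR gives fast access to an item's row, which is why the notebook keeps both.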

As an example, let's use the ALS factorization from the implicit library.

We set the dimensionality of the latent space to 64; this is also the size of the user/item embeddings.

In [10]:
user_item_csr.shape

(6041, 3953)

In [11]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)



The loss used here is everyone's favorite, RMSE.
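For reference, RMSE over a set of values can be computed as in the sketch below. The helper `rmse` is a hypothetical name introduced for this example; how exactly the library defines its reported training loss is not shown in this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Squared errors are 1, 0 and 4, so the result is sqrt(5 / 3).
print(rmse([5, 3, 4], [4, 3, 2]))
```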

In [12]:
model.fit(user_item_t_csr)

100%|██████████| 100/100 [00:28<00:00,  3.54it/s, loss=0.0135]


Let's find movies similar to movie_id = 1, Toy Story.

In [13]:
movie_info.head(5)

| | movie_id | name | category |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [14]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

As we can see, the similar items really do turn out to be similar.

The quality of similar items is often a good way to check the quality of an algorithm.

P.S. If you want to dig deeper into how different algorithms shape different latent spaces, I recommend loading the resulting vectors into TensorBoard and inspecting the resulting space.

In [15]:
get_similars(1, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '2315    Babe: Pig in the City (1998)',
 '360    Lion King, The (1994)',
 '1838    Mulan (1998)',
 '1526    Hercules (1997)',
 '2618    Tarzan (1999)']
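One common way to obtain similar items from learned item embeddings is to rank them by cosine similarity. The sketch below illustrates that idea on a tiny hand-written matrix; `top_k_similar` and `toy_item_factors` are hypothetical names, and this is not necessarily how `model.similar_items` works internally.

```python
import numpy as np

def top_k_similar(item_factors, item_idx, k=3):
    """Return the indices of the k rows most cosine-similar to row item_idx."""
    norms = np.linalg.norm(item_factors, axis=1, keepdims=True)
    unit = item_factors / np.maximum(norms, 1e-12)  # guard against zero rows
    scores = unit @ unit[item_idx]
    return np.argsort(-scores)[:k]

toy_item_factors = np.array([
    [1.0, 0.0],   # item 0
    [0.9, 0.1],   # item 1, points almost the same way as item 0
    [0.0, 1.0],   # item 2, orthogonal to item 0
    [-1.0, 0.0],  # item 3, opposite to item 0
])

# The query item itself comes first, just as Toy Story leads the list above.
print(top_k_similar(toy_item_factors, item_idx=0, k=3))  # [0 1 2]
```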

Now let's build recommendations for users.

As we can see, the user likes science fiction, so we expect to see science fiction in the recommendations as well.

In [16]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [17]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

It worked!

We really did recommend science fiction and action movies to the user; moreover, the list includes sequels to movies they rated highly.

In [18]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [19]:
get_recommendations(4, model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '2502    Matrix, The (1999)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '1182    Aliens (1986)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '3402    Close Encounters of the Third Kind (1977)',
 '847    Godfather, The (1972)',
 '2460    Planet of the Apes (1968)',
 '2880    Dr. No (1962)']
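With a factorization model, recommendations for a user are typically produced by scoring every item with the dot product of the user and item embeddings and dropping items the user has already interacted with. A minimal sketch of that step, using hypothetical names (`recommend_top_k`, `toy_user_vec`, `toy_items`) and made-up vectors:

```python
import numpy as np

def recommend_top_k(user_vec, item_factors, seen_items, k=2):
    """Score all items by dot product and return the top-k unseen item indices."""
    scores = (item_factors @ user_vec).astype(float)
    seen_idx = np.fromiter(seen_items, dtype=np.int64)
    scores[seen_idx] = -np.inf  # never recommend what the user already has
    return np.argsort(-scores)[:k]

toy_user_vec = np.array([1.0, 0.5])
toy_items = np.array([
    [1.0, 1.0],   # item 0: score 1.5
    [2.0, 0.0],   # item 1: score 2.0, but already seen
    [0.0, 1.0],   # item 2: score 0.5
    [0.5, 0.5],   # item 3: score 0.75
])

print(recommend_top_k(toy_user_vec, toy_items, seen_items={1}, k=2))  # [0 3]
```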

Now it's your turn to implement the most popular matrix factorization algorithms.

What will be graded:
1. Correctness of the algorithm
2. Quality of the resulting similar items
3. Quality of the final recommendations for a user

### Task 1. Without using off-the-shelf solutions, implement an SVD factorization using SGD on explicit data

In [5]:
from methods import SVD

svd = SVD(latent_size=32, epochs=15)
svd.fit(ratings)

100%|██████████| 1000209/1000209 [01:18<00:00, 12817.05it/s]


epoch 1: RMSE = 1.2158


100%|██████████| 1000209/1000209 [01:18<00:00, 12814.65it/s]


epoch 2: RMSE = 0.9288


100%|██████████| 1000209/1000209 [01:17<00:00, 12845.14it/s]


epoch 3: RMSE = 0.9111


100%|██████████| 1000209/1000209 [01:17<00:00, 12894.37it/s]


epoch 4: RMSE = 0.8931


100%|██████████| 1000209/1000209 [01:17<00:00, 12875.43it/s]


epoch 5: RMSE = 0.8687


100%|██████████| 1000209/1000209 [01:17<00:00, 12851.63it/s]


epoch 6: RMSE = 0.8429


100%|██████████| 1000209/1000209 [01:17<00:00, 12873.21it/s]


epoch 7: RMSE = 0.8192


100%|██████████| 1000209/1000209 [01:17<00:00, 12836.78it/s]


epoch 8: RMSE = 0.7992


100%|██████████| 1000209/1000209 [01:17<00:00, 12916.65it/s]


epoch 9: RMSE = 0.7830


100%|██████████| 1000209/1000209 [01:17<00:00, 12871.37it/s]


epoch 10: RMSE = 0.7697


100%|██████████| 1000209/1000209 [01:18<00:00, 12767.78it/s]


epoch 11: RMSE = 0.7589


100%|██████████| 1000209/1000209 [01:18<00:00, 12788.25it/s]


epoch 12: RMSE = 0.7499


100%|██████████| 1000209/1000209 [01:18<00:00, 12788.93it/s]


epoch 13: RMSE = 0.7422


100%|██████████| 1000209/1000209 [01:17<00:00, 12849.10it/s]


epoch 14: RMSE = 0.7355


100%|██████████| 1000209/1000209 [01:17<00:00, 12846.28it/s]

epoch 15: RMSE = 0.7302
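The `methods` module used in this task is not shown in the notebook. As a rough illustration of the technique, the sketch below trains a biased matrix factorization with plain SGD on a handful of toy ratings. The function name `fit_sgd_mf`, its parameters, and the toy data are assumptions made for this example, not the interface of `methods.SVD`.

```python
import numpy as np

def fit_sgd_mf(triples, n_users, n_items, latent_size=8, epochs=100,
               lr=0.02, lambd=0.01, seed=0):
    """Biased MF: r_ui ~ mu + b_u + b_i + <p_u, q_i>, trained with SGD."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.1, (n_users, latent_size))
    V = rng.normal(0, 0.1, (n_items, latent_size))
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    mu = np.mean([r for _, _, r in triples])
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for idx in rng.permutation(len(triples)):
            u, i, r = triples[idx]
            err = r - (mu + bu[u] + bi[i] + U[u] @ V[i])
            sq_err += err ** 2
            # Gradient step on the squared error with L2 regularization.
            bu[u] += lr * (err - lambd * bu[u])
            bi[i] += lr * (err - lambd * bi[i])
            pu = U[u].copy()  # keep the old user vector for the item update
            U[u] += lr * (err * V[i] - lambd * U[u])
            V[i] += lr * (err * pu - lambd * V[i])
        history.append(float(np.sqrt(sq_err / len(triples))))
    return U, V, bu, bi, mu, history

# Toy (user, item, rating) triples with 0-based ids.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
U, V, bu, bi, mu, history = fit_sgd_mf(toy_ratings, n_users=3, n_items=3)
print(f'first epoch RMSE = {history[0]:.4f}, last epoch RMSE = {history[-1]:.4f}')
```

The RMSE tracked here is a running training error accumulated during each epoch, so it is only a rough progress indicator; a held-out split would be needed to judge generalization.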





In [18]:
from methods import Test

test = Test(movie_info, svd.U, svd.V, svd.users, svd.items, svd.U_bias, svd.V_bias)
movie, similars = test.get_similars(item_id=1)

print(f'Similars to movie: {movie}')
for _, name, cat in similars.values:
    print(f'- {name:<60} | {cat}')

Similars to movie: Toy Story (1995)
- Little Princess, A (1995)                                    | Children's|Drama
- Aladdin (1992)                                               | Animation|Children's|Comedy|Musical
- Beauty and the Beast (1991)                                  | Animation|Children's|Musical
- Small Faces (1995)                                           | Drama
- Best Men (1997)                                              | Action|Comedy|Crime|Drama
- Hercules (1997)                                              | Adventure|Animation|Children's|Comedy|Musical
- Mulan (1998)                                                 | Animation|Children's
- Bug's Life, A (1998)                                         | Animation|Children's|Comedy
- Tarzan (1999)                                                | Animation|Children's
- Toy Story 2 (1999)                                           | Animation|Children's|Comedy


In [19]:
user_id = 4

recs = test.get_recommendations(user_id=user_id, k=15)
print(f'Recommendations for user: {user_id}')

for _, name, cat in recs.values:
    print(f'- {name:<60} | {cat}')

Recommendations for user: 4
- Casablanca (1942)                                            | Drama|Romance|War
- Maltese Falcon, The (1941)                                   | Film-Noir|Mystery
- It's a Wonderful Life (1946)                                 | Drama
- 12 Angry Men (1957)                                          | Drama
- To Kill a Mockingbird (1962)                                 | Drama
- Sting, The (1973)                                            | Comedy|Crime
- Young Frankenstein (1974)                                    | Comedy|Horror
- Gandhi (1982)                                                | Drama
- M*A*S*H (1970)                                               | Comedy|War
- Full Monty, The (1997)                                       | Comedy
- As Good As It Gets (1997)                                    | Comedy|Drama
- Shakespeare in Love (1998)                                   | Comedy|Romance
- Talented Mr. Ripley, The (1999)                           

### Task 2. Without using off-the-shelf solutions, implement a matrix factorization using ALS on implicit data

In [8]:
from methods import ALS

als = ALS(latent_size=64, epochs=30, lambd=1e-4)
als.fit(ratings)

epoch 1: RMSE = 0.7297
epoch 2: RMSE = 0.6552
epoch 3: RMSE = 0.6451
epoch 4: RMSE = 0.6418
epoch 5: RMSE = 0.6404
epoch 6: RMSE = 0.6397
epoch 7: RMSE = 0.6392
epoch 8: RMSE = 0.6389
epoch 9: RMSE = 0.6388
epoch 10: RMSE = 0.6386
epoch 11: RMSE = 0.6385
epoch 12: RMSE = 0.6384
epoch 13: RMSE = 0.6384
epoch 14: RMSE = 0.6383
epoch 15: RMSE = 0.6383
epoch 16: RMSE = 0.6383
epoch 17: RMSE = 0.6382
epoch 18: RMSE = 0.6382
epoch 19: RMSE = 0.6382
epoch 20: RMSE = 0.6382
epoch 21: RMSE = 0.6382
epoch 22: RMSE = 0.6382
epoch 23: RMSE = 0.6382
epoch 24: RMSE = 0.6381
epoch 25: RMSE = 0.6381
epoch 26: RMSE = 0.6381
epoch 27: RMSE = 0.6381
epoch 28: RMSE = 0.6381
epoch 29: RMSE = 0.6381
epoch 30: RMSE = 0.6381
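Since `methods.ALS` itself is not shown, here is a minimal dense sketch of one well-known formulation of ALS for implicit feedback, where observed interactions get a higher confidence weight and each user (or item) vector is the solution of a small regularized linear system. The names `als_step` and `fit_implicit_als`, the confidence form `1 + alpha * r`, and the toy matrix are assumptions for this example; a real implementation would work on sparse matrices.

```python
import numpy as np

def als_step(X, Y, R, alpha, lambd):
    """Recompute every row of X with Y held fixed (one half of an ALS sweep)."""
    k = Y.shape[1]
    YtY = Y.T @ Y
    for u in range(R.shape[0]):
        r_u = R[u]                # binary preferences of row u
        c_u = 1.0 + alpha * r_u   # confidence: higher for observed entries
        # Solve (Y^T C_u Y + lambd * I) x_u = Y^T C_u r_u.
        A = YtY + (Y.T * (c_u - 1.0)) @ Y + lambd * np.eye(k)
        b = (Y.T * c_u) @ r_u
        X[u] = np.linalg.solve(A, b)
    return X

def fit_implicit_als(R, latent_size=2, epochs=20, alpha=10.0, lambd=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(0, 0.1, (R.shape[0], latent_size))  # user factors
    Y = rng.normal(0, 0.1, (R.shape[1], latent_size))  # item factors
    for _ in range(epochs):
        X = als_step(X, Y, R, alpha, lambd)    # users, with items fixed
        Y = als_step(Y, X, R.T, alpha, lambd)  # items, with users fixed
    return X, Y

# Two groups of users, each liking its own pair of items.
R = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
X, Y = fit_implicit_als(R)
scores = X @ Y.T
print(np.round(scores, 2))
```

Each half-sweep has a closed-form solution, which is why ALS needs no learning rate, unlike the SGD-based methods in the other tasks.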


In [16]:
from methods import Test

test = Test(movie_info, als.U, als.V, als.users, als.items)
movie, similars = test.get_similars(item_id=1)

print(f'Similars to movie: {movie}')
for _, name, cat in similars.values:
    print(f'- {name:<60} | {cat}')

Similars to movie: Toy Story (1995)
- Babe (1995)                                                  | Children's|Comedy|Drama
- Lion King, The (1994)                                        | Animation|Children's|Musical
- Aladdin (1992)                                               | Animation|Children's|Comedy|Musical
- Beauty and the Beast (1991)                                  | Animation|Children's|Musical
- Hercules (1997)                                              | Adventure|Animation|Children's|Comedy|Musical
- Mulan (1998)                                                 | Animation|Children's
- Bug's Life, A (1998)                                         | Animation|Children's|Comedy
- Babe: Pig in the City (1998)                                 | Children's|Comedy
- Iron Giant, The (1999)                                       | Animation|Children's
- Toy Story 2 (1999)                                           | Animation|Children's|Comedy


In [17]:
user_id = 4

recs = test.get_recommendations(user_id=user_id, k=15)
print(f'Recommendations for user: {user_id}')

for _, name, cat in recs.values:
    print(f'- {name:<60} | {cat}')

Recommendations for user: 4
- Star Wars: Episode IV - A New Hope (1977)                    | Action|Adventure|Fantasy|Sci-Fi
- Terminator 2: Judgment Day (1991)                            | Action|Sci-Fi|Thriller
- Die Hard (1988)                                              | Action|Thriller
- E.T. the Extra-Terrestrial (1982)                            | Children's|Drama|Fantasy|Sci-Fi
- Raiders of the Lost Ark (1981)                               | Action|Adventure
- Aliens (1986)                                                | Action|Sci-Fi|Thriller|War
- Good, The Bad and The Ugly, The (1966)                       | Action|Western
- Alien (1979)                                                 | Action|Horror|Sci-Fi|Thriller
- Terminator, The (1984)                                       | Action|Sci-Fi|Thriller
- Indiana Jones and the Last Crusade (1989)                    | Action|Adventure
- Butch Cassidy and the Sundance Kid (1969)                    | Action|Comedy|Western
- Ja

### Task 3. Without using off-the-shelf solutions, implement a BPR matrix factorization on implicit data

In [11]:
from methods import BPR

bpr = BPR(learning_rate=1e-1, lambd=1e-7, epochs=5)
bpr.fit(ratings)

 20%|██        | 1/5 [00:26<01:47, 26.78s/it]

epoch 1: AUC = 0.8838


 40%|████      | 2/5 [00:53<01:20, 26.74s/it]

epoch 2: AUC = 0.9093


 60%|██████    | 3/5 [01:20<00:53, 26.84s/it]

epoch 3: AUC = 0.9319


 80%|████████  | 4/5 [01:47<00:26, 26.83s/it]

epoch 4: AUC = 0.9425


100%|██████████| 5/5 [02:13<00:00, 26.78s/it]

epoch 5: AUC = 0.9455
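As an illustration of the idea behind BPR, the sketch below samples, for each observed (user, item) pair, one unobserved item and takes a gradient step that pushes the observed item's score above the sampled one. It also computes a per-user AUC like the metric logged above. `fit_bpr`, `mean_auc`, the hyperparameters, and the toy data are assumptions for this example and do not reflect the actual `methods.BPR` code.

```python
import numpy as np

def fit_bpr(positives, n_users, n_items, latent_size=4, epochs=200,
            lr=0.05, lambd=1e-3, seed=0):
    """BPR-MF sketch. Assumes every user has at least one unobserved item."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.1, (n_users, latent_size))
    V = rng.normal(0, 0.1, (n_items, latent_size))
    seen = [set() for _ in range(n_users)]
    for u, i in positives:
        seen[u].add(i)
    for _ in range(epochs):
        for idx in rng.permutation(len(positives)):
            u, i = positives[idx]
            j = int(rng.integers(n_items))
            while j in seen[u]:  # resample until j is unobserved for u
                j = int(rng.integers(n_items))
            pu, qi, qj = U[u].copy(), V[i].copy(), V[j].copy()
            x_uij = pu @ (qi - qj)
            g = 1.0 / (1.0 + np.exp(x_uij))  # derivative of ln(sigmoid(x)) at x_uij
            U[u] += lr * (g * (qi - qj) - lambd * pu)
            V[i] += lr * (g * pu - lambd * qi)
            V[j] += lr * (-g * pu - lambd * qj)
    return U, V

def mean_auc(U, V, positives):
    """Share of (observed, unobserved) pairs ranked correctly, averaged over users."""
    scores = U @ V.T
    seen = {}
    for u, i in positives:
        seen.setdefault(u, set()).add(i)
    aucs = []
    for u, items in seen.items():
        pos = sorted(items)
        neg = [j for j in range(V.shape[0]) if j not in items]
        if not neg:
            continue
        aucs.append((scores[u, pos][:, None] > scores[u, neg][None, :]).mean())
    return float(np.mean(aucs))

# Users 0 and 1 like items 0 and 1; users 2 and 3 like items 2 and 3.
toy_positives = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
U, V = fit_bpr(toy_positives, n_users=4, n_items=4)
bpr_auc = mean_auc(U, V, toy_positives)
print(f'training AUC = {bpr_auc:.4f}')
```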





In [14]:
from methods import Test

test = Test(movie_info, bpr.U, bpr.V, bpr.users, bpr.items)
movie, similars = test.get_similars(item_id=1)

print(f'Similars to movie: {movie}')
for _, name, cat in similars.values:
    print(f'- {name:<60} | {cat}')

Similars to movie: Toy Story (1995)
- Lion King, The (1994)                                        | Animation|Children's|Musical
- Brady Bunch Movie, The (1995)                                | Comedy
- Home Alone (1990)                                            | Children's|Comedy
- Aladdin (1992)                                               | Animation|Children's|Comedy|Musical
- Matilda (1996)                                               | Children's|Comedy
- Hercules (1997)                                              | Adventure|Animation|Children's|Comedy|Musical
- 101 Dalmatians (1961)                                        | Animation|Children's
- Babe: Pig in the City (1998)                                 | Children's|Comedy
- Toy Story 2 (1999)                                           | Animation|Children's|Comedy
- Sister Act (1992)                                            | Comedy|Crime


In [15]:
user_id = 4

recs = test.get_recommendations(user_id=user_id, k=15)
print(f'Recommendations for user: {user_id}')

for _, name, cat in recs.values:
    print(f'- {name:<60} | {cat}')

Recommendations for user: 4
- Star Wars: Episode IV - A New Hope (1977)                    | Action|Adventure|Fantasy|Sci-Fi
- Schindler's List (1993)                                      | Drama|War
- Terminator 2: Judgment Day (1991)                            | Action|Sci-Fi|Thriller
- Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | Sci-Fi|War
- Godfather, The (1972)                                        | Action|Crime|Drama
- Monty Python and the Holy Grail (1974)                       | Comedy
- Star Wars: Episode V - The Empire Strikes Back (1980)        | Action|Adventure|Drama|Sci-Fi|War
- Raiders of the Lost Ark (1981)                               | Action|Adventure
- Star Wars: Episode VI - Return of the Jedi (1983)            | Action|Adventure|Romance|Sci-Fi|War
- Butch Cassidy and the Sundance Kid (1969)                    | Action|Comedy|Western
- Jaws (1975)                                                  | Action|Horror
- Saving Private R

### Task 4. Without using off-the-shelf solutions, implement a WARP matrix factorization on implicit data

In [31]:
from methods import WARP

warp = WARP(learning_rate=1e-2, lambd=1e-7, max_neg=10, epochs=10)
warp.fit(ratings)

100%|██████████| 575281/575281 [02:10<00:00, 4411.73it/s]


epoch 1: AUC = 0.8915


100%|██████████| 575281/575281 [02:07<00:00, 4512.56it/s]


epoch 2: AUC = 0.9244


100%|██████████| 575281/575281 [01:47<00:00, 5354.76it/s]


epoch 3: AUC = 0.9412


100%|██████████| 575281/575281 [01:33<00:00, 6161.84it/s]


epoch 4: AUC = 0.9498


100%|██████████| 575281/575281 [01:23<00:00, 6866.91it/s]


epoch 5: AUC = 0.9554


100%|██████████| 575281/575281 [01:16<00:00, 7475.54it/s]


epoch 6: AUC = 0.9587


100%|██████████| 575281/575281 [01:11<00:00, 8018.72it/s]


epoch 7: AUC = 0.9611


100%|██████████| 575281/575281 [01:08<00:00, 8446.49it/s]


epoch 8: AUC = 0.9628


100%|██████████| 575281/575281 [01:04<00:00, 8852.08it/s]


epoch 9: AUC = 0.9640


100%|██████████| 575281/575281 [01:02<00:00, 9131.60it/s]


epoch 10: AUC = 0.9651
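A rough sketch of the WARP idea follows: for each observed pair, negatives are sampled until one violates the margin (at most `max_neg` tries), the number of tries gives a rank estimate, and the hinge update is scaled by a weight that grows with that rank. The names `fit_warp` and `mean_auc`, the harmonic rank weight, the clamp of the rank estimate to at least 1, and the toy data are all assumptions for this example rather than the `methods.WARP` implementation.

```python
import numpy as np

def fit_warp(positives, n_users, n_items, latent_size=4, epochs=200,
             lr=0.05, lambd=1e-3, max_neg=10, seed=0):
    """WARP sketch. Assumes every user has at least one unobserved item."""
    rng = np.random.default_rng(seed)
    U = rng.normal(0, 0.1, (n_users, latent_size))
    V = rng.normal(0, 0.1, (n_items, latent_size))
    seen = [set() for _ in range(n_users)]
    for u, i in positives:
        seen[u].add(i)
    for _ in range(epochs):
        for idx in rng.permutation(len(positives)):
            u, i = positives[idx]
            pos_score = U[u] @ V[i]
            for n_trials in range(1, max_neg + 1):
                j = int(rng.integers(n_items))
                while j in seen[u]:  # draw only from unobserved items
                    j = int(rng.integers(n_items))
                if 1.0 + U[u] @ V[j] - pos_score > 0:  # margin violated
                    # Few trials needed means the positive is ranked low,
                    # so the rank estimate and the update weight are larger.
                    rank_est = max(1, (n_items - 1) // n_trials)
                    weight = np.sum(1.0 / np.arange(1, rank_est + 1))
                    pu, qi, qj = U[u].copy(), V[i].copy(), V[j].copy()
                    U[u] += lr * (weight * (qi - qj) - lambd * pu)
                    V[i] += lr * (weight * pu - lambd * qi)
                    V[j] += lr * (-weight * pu - lambd * qj)
                    break  # one update per positive pair
    return U, V

def mean_auc(U, V, positives):
    """Share of (observed, unobserved) pairs ranked correctly, averaged over users."""
    scores = U @ V.T
    seen = {}
    for u, i in positives:
        seen.setdefault(u, set()).add(i)
    aucs = []
    for u, items in seen.items():
        pos = sorted(items)
        neg = [j for j in range(V.shape[0]) if j not in items]
        if not neg:
            continue
        aucs.append((scores[u, pos][:, None] > scores[u, neg][None, :]).mean())
    return float(np.mean(aucs))

# Users 0 and 1 like items 0 and 1; users 2 and 3 like items 2 and 3.
toy_positives = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
U, V = fit_warp(toy_positives, n_users=4, n_items=4)
warp_auc = mean_auc(U, V, toy_positives)
print(f'training AUC = {warp_auc:.4f}')
```

Once no sampled negative violates the margin, the pair produces no update at all, so a correctly ranked positive costs only the sampling tries.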


In [32]:
from methods import Test

test = Test(movie_info, warp.U, warp.V, warp.users, warp.items)
movie, similars = test.get_similars(item_id=1)

print(f'Similars to movie: {movie}')
for _, name, cat in similars.values:
    print(f'- {name:<60} | {cat}')

Similars to movie: Toy Story (1995)
- Get Shorty (1995)                                            | Action|Comedy|Drama
- Babe (1995)                                                  | Children's|Comedy|Drama
- Clueless (1995)                                              | Comedy|Romance
- Star Wars: Episode IV - A New Hope (1977)                    | Action|Adventure|Fantasy|Sci-Fi
- Aladdin (1992)                                               | Animation|Children's|Comedy|Musical
- Silence of the Lambs, The (1991)                             | Drama|Thriller
- Groundhog Day (1993)                                         | Comedy|Romance
- Bug's Life, A (1998)                                         | Animation|Children's|Comedy
- Toy Story 2 (1999)                                           | Animation|Children's|Comedy
- Chicken Run (2000)                                           | Animation|Children's|Comedy


In [33]:
user_id = 4

recs = test.get_recommendations(user_id=user_id, k=15)
print(f'Recommendations for user: {user_id}')

for _, name, cat in recs.values:
    print(f'- {name:<60} | {cat}')

Recommendations for user: 4
- Star Wars: Episode IV - A New Hope (1977)                    | Action|Adventure|Fantasy|Sci-Fi
- Silence of the Lambs, The (1991)                             | Drama|Thriller
- Godfather, The (1972)                                        | Action|Crime|Drama
- Die Hard (1988)                                              | Action|Thriller
- Star Wars: Episode V - The Empire Strikes Back (1980)        | Action|Adventure|Drama|Sci-Fi|War
- Raiders of the Lost Ark (1981)                               | Action|Adventure
- Star Wars: Episode VI - Return of the Jedi (1983)            | Action|Adventure|Romance|Sci-Fi|War
- Alien (1979)                                                 | Action|Horror|Sci-Fi|Thriller
- Godfather: Part II, The (1974)                               | Action|Crime|Drama
- Terminator, The (1984)                                       | Action|Sci-Fi|Thriller
- Rocky (1976)                                                 | Action|Drama
- Sa