# Stage L4

**Task:** develop a baseline and implement the chosen solution.

**Outcome:** the baseline and a first implementation of the chosen solution are ready.

In [None]:
!pip install lightfm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp310-cp310-linux_x86_64.whl size=879174 sha256=5329fea634d087b7ccbe2405ad4cc791d9eed6eba23291d0b3ad377db2498524
  Stored in directory: /root/.cache/pip/wheels/4f/9b/7e/0b256f2168511d8fa4dae4fae0200fdbd729eb424a912ad636
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


In [1]:
pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import pandas as pd
import numpy as np
from numpy import load
from scipy.sparse import csr_matrix, coo_matrix
from lightfm import LightFM  # needed for the candidate models in section I
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import pickle  # needed to save the models and the predictions
import itertools
import random
import lightgbm

In [3]:
path = "/content/drive/MyDrive/WB School/data.csv.gzip"
df = pd.read_csv(path, compression='gzip')
df["order_ts"] = pd.to_datetime(df["order_ts"])

# 0. Data preprocessing

We separate out users with few orders. They will be recommended popular items.

In [4]:
def extract_reluctant_users(df, threshold=5, both=False):
  """Drop users with at most `threshold` orders (duplicate rows are removed first).

  With both=True, the rows of the dropped ("reluctant") users are returned as well.
  """
  df = df.drop_duplicates()
  # Number of orders per (user, item) pair
  df_count = df.groupby(["user_id", "item_id"], as_index=False).count().rename(columns={"order_ts": "counter"})

  # Total number of orders per user
  df_count_users = df_count.groupby("user_id", as_index=False)["counter"].sum()
  users = df_count_users.loc[df_count_users.counter <= threshold, "user_id"].values

  df_reluctants = df[df.user_id.isin(users)]
  df = df[~df.user_id.isin(users)]

  if both:
    return df_reluctants, df
  else:
    return df

In [5]:
df_new = extract_reluctant_users(df, threshold=20)

Drop rarely ordered items:

In [6]:
def drop_rare_items(df, threshold=2):

  df_temp = df.drop_duplicates()
  df_count = df_temp.groupby(["user_id", "item_id"], as_index=False).count().rename(columns={"order_ts": "counter"})
  df_count_items = df_count.groupby("item_id", as_index=False)["counter"].sum()

  items = df_count_items.loc[df_count_items.counter <= threshold, "item_id"].values
  df = df[~df.item_id.isin(items)]

  return df

In [7]:
df_new = drop_rare_items(df_new, threshold=10)

The train/test split and the cross-validation follow the scheme proposed in [this paper](https://arxiv.org/abs/1805.09557):



In [8]:
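def train_test(df, by, test_weeks=1, test_size=0.2):

  if by == "time":

    # 13 is hard-coded; presumably the number of weeks covered by the data
    n_folds = 13 / test_weeks

    # The last 1/n_folds of the time span becomes the test period
    delta = (df["order_ts"].max() - df["order_ts"].min()) / n_folds
    edge = df["order_ts"].max() - delta

    train = df.loc[df["order_ts"] <= edge]
    test = df.loc[df["order_ts"] > edge]

    return train, test

  elif by == "percents":

    # Positional split: the last `test_size` share of rows, in their current order, is the test set
    train_size = 1 - test_size
    idx = int(len(df) * train_size)

    train = df[:idx]
    test = df[idx:]

    return train, test

  else:
    raise ValueError('by must be "time" or "percents"')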

In [9]:
train_global, test_global = train_test(df_new, by="time", test_weeks=1)
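The time-based split gives a single hold-out period. For cross-validation, the same idea could be repeated over several cut-off dates. The sketch below is only an assumption about how such folds might be built, not the exact scheme from the cited paper; the helper `time_folds` and the toy timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def time_folds(timestamps, n_folds=3, test_weeks=1):
    """Yield (train_idx, test_idx); each test window covers the `test_weeks`
    weeks that immediately follow the data used for training."""
    window = timedelta(weeks=test_weeks)
    last = max(timestamps)
    for fold in range(n_folds, 0, -1):
        test_end = last - (fold - 1) * window
        test_start = test_end - window
        # Everything up to the start of the test window is available for training
        train_idx = [i for i, ts in enumerate(timestamps) if ts <= test_start]
        test_idx = [i for i, ts in enumerate(timestamps) if test_start < ts <= test_end]
        yield train_idx, test_idx


# Toy data: one order per day for 28 days
start = datetime(2023, 1, 1)
order_ts = [start + timedelta(days=d) for d in range(28)]
folds = list(time_folds(order_ts, n_folds=3, test_weeks=1))
```

Each later fold trains on a longer history, so no fold ever sees orders from its own test window.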

Keep only the users who placed orders in both the train_global and the test_global periods:

In [10]:
def common_only(df1, df2, column="user_id"):

  # Values of `column` that occur in both DataFrames
  common = set(df1[column]).intersection(df2[column])

  df1_new = df1[df1[column].isin(common)]
  df2_new = df2[df2[column].isin(common)]

  return df1_new, df2_new

In [11]:
train_global, test_global = common_only(train_global, test_global, column="user_id")
train_global = extract_reluctant_users(train_global)

Split train_global into local train and test sets:

In [12]:
train_local, test_local = train_test(train_global, by="percents", test_size=0.2)
train_local = extract_reluctant_users(train_local)

train_local = drop_rare_items(train_local, threshold=20)

train_local, test_local = common_only(train_local, test_local, column="user_id")
train_local, test_local = common_only(train_local, test_local, column="item_id")

train_local, test_local = common_only(train_local, test_local, column="user_id")
train_local, test_local = common_only(train_local, test_local, column="item_id")

Build the sparse interaction matrix.

In [13]:
def csr_matrix_via_encoder(train, test): # The DataFrames must already be grouped!

  user_encoder, item_encoder = LabelEncoder(), LabelEncoder()

  users_final = set(train.user_id.unique()).intersection(set(test.user_id.unique()))
  user_encoder.fit(list(users_final))

  all_items = set(train.item_id.unique()).union(set(test.item_id.unique()))
  item_encoder.fit(list(all_items))

  train["user_new_id"] = user_encoder.transform(train["user_id"])
  test["user_new_id"] = user_encoder.transform(test["user_id"])

  train["item_new_id"] = item_encoder.transform(train["item_id"])
  test["item_new_id"] = item_encoder.transform(test["item_id"])

  matrix_shape = len(user_encoder.classes_), len(item_encoder.classes_)

  train_sparse = coo_matrix((list(train.counter.astype(np.float32)),
                            (list(train.user_new_id.astype(np.int64)),
                              list(train.item_new_id.astype(np.int64)))), shape=matrix_shape)

  train_csr = train_sparse.tocsr()

  test_sparse = coo_matrix((list(test.counter.astype(np.float32)),
                           (list(test.user_new_id.astype(np.int64)),
                            list(test.item_new_id.astype(np.int64)))), shape=matrix_shape)

  test_csr = test_sparse.tocsr()

  return train_csr, test_csr, users_final, all_items, train, test

In [14]:
train_local_grouped = train_local.groupby(["user_id", "item_id"], as_index=False).count().rename(columns={"order_ts": "counter"})
test_local_grouped = test_local.groupby(["user_id", "item_id"], as_index=False).count().rename(columns={"order_ts": "counter"})

train_local_grouped = train_local_grouped.sort_values("user_id")
test_local_grouped = test_local_grouped.sort_values("user_id")

train_local_csr, test_local_csr, users_final, all_items, train, test = csr_matrix_via_encoder(train_local_grouped, test_local_grouped)

# I. Candidate selection

We train the models that will select candidates: [WARP loss MF](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37180.pdf), [BPR Optimized MF](https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf), [LMF](https://web.stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf), and [WARP k-OS loss MF](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41534.pdf). Their optimal parameters for the Recall@K metric were tuned beforehand by cross-validation on train_local.



In [None]:
model_warp = LightFM(no_components=16,
                     learning_schedule="adagrad",
                     loss="warp",
                     learning_rate=0.05,
                     item_alpha=0.00005,
                     user_alpha=0.00005,
                     max_sampled=30)

model_warp.fit(train_local_csr, epochs=20)

<lightfm.lightfm.LightFM at 0x7f89509b6860>

In [None]:
with open("model_warp_new.pkl", "wb") as f:
    pickle.dump(model_warp, f)

In [None]:
model_bpr = LightFM(no_components=14,
                    learning_schedule="adagrad",
                    loss="bpr",
                    learning_rate=0.03,
                    item_alpha=0.00001,
                    user_alpha=0.0001)

model_bpr.fit(train_local_csr, epochs = 20)

<lightfm.lightfm.LightFM at 0x7fe624042440>

In [None]:
with open("model_bpr.pkl", "wb") as f:
    pickle.dump(model_bpr, f)

In [None]:
model_lmf = LightFM(no_components=13,
                    learning_schedule="adagrad",
                    loss="logistic",
                    learning_rate=0.019,
                    item_alpha=0.0001,
                    user_alpha=0.00001)

model_lmf.fit(train_local_csr, epochs = 20)

<lightfm.lightfm.LightFM at 0x7fe624041360>

In [None]:
with open("model_lmf.pkl", "wb") as f:
    pickle.dump(model_lmf, f)

In [None]:
model_warp_kos = LightFM(no_components=13,
                         k=3,
                         n=11,
                         learning_schedule="adagrad",
                         loss="warp-kos",
                         learning_rate=0.027,
                         item_alpha=0.00001,
                         user_alpha=0.00014,
                         max_sampled=42)

model_warp_kos.fit(train_local_csr, epochs=20)

<lightfm.lightfm.LightFM at 0x7fe6240439a0>

In [None]:
with open("model_warp_kos.pkl", "wb") as f:
    pickle.dump(model_warp_kos, f)

From the trained models we extract, for each user, the item_id, rank, and score of the top N items with the highest scores. Computing the scores is extremely expensive, so they were precomputed in advance with the function below.

These scores let us rank the candidates of each model.

In [15]:
def scores_calculation(user_embeddings, item_embeddings, user_biases, item_biases, items_number=50, top=50):

  # Scores of every user for the first N items: dot product plus the user and item biases
  first_N_scores = user_embeddings.dot(item_embeddings[:items_number].T) + user_biases.reshape(-1,1) + item_biases[:items_number].reshape(1,-1)

  pairs = list()

  # Number the first N items explicitly so the indices are not lost; they do not match the original item_id
  for i in range(len(first_N_scores)):
    pairs.append(list(enumerate(first_N_scores[i])))

  # Keep the N (=items_number) highest-scored items per user; these are the model's candidates.
  # The loop starts at `top`, so `top` is expected to equal `items_number`.
  for u in tqdm(range(len(user_embeddings))):
    for i in range(top, len(item_embeddings)):
      # Score of user u for item i, using the bias of user u
      score = user_embeddings[u].dot(item_embeddings[i]) + user_biases[u] + item_biases[i]
      pairs[u].append((i, score))
      pairs[u] = sorted(pairs[u], key=lambda x: x[-1], reverse=True)
      pairs[u].pop()  # drop the lowest-scored pair so the list keeps N items

  return pairs
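The nested loop re-sorts each user's list for every item, which is a large part of why scoring is so slow. A vectorized alternative is sketched below under the same scoring assumption as the function above (dot product plus user and item biases). The helper `top_n_candidates`, its `batch` parameter, and the random toy embeddings are hypothetical and are not part of the notebook.

```python
import numpy as np


def top_n_candidates(user_emb, item_emb, user_biases, item_biases, top=50, batch=1024):
    """Return (items, scores), each of shape (n_users, top), sorted by descending score."""
    n_users = user_emb.shape[0]
    top = min(top, item_emb.shape[0])
    items = np.empty((n_users, top), dtype=np.int64)
    scores = np.empty((n_users, top), dtype=np.float32)
    for start in range(0, n_users, batch):
        stop = min(start + batch, n_users)
        # Scores of a block of users against all items
        block = (user_emb[start:stop] @ item_emb.T
                 + user_biases[start:stop, None]
                 + item_biases[None, :])
        # argpartition moves the `top` largest scores into the last columns (unordered)
        part = np.argpartition(block, -top, axis=1)[:, -top:]
        part_scores = np.take_along_axis(block, part, axis=1)
        # Only these `top` columns are sorted, in descending order
        order = np.argsort(-part_scores, axis=1)
        items[start:stop] = np.take_along_axis(part, order, axis=1)
        scores[start:stop] = np.take_along_axis(part_scores, order, axis=1)
    return items, scores


# Toy embeddings: 5 users, 30 items, 4 latent components
rng = np.random.default_rng(0)
user_emb = rng.normal(size=(5, 4)).astype(np.float32)
item_emb = rng.normal(size=(30, 4)).astype(np.float32)
user_biases = rng.normal(size=5).astype(np.float32)
item_biases = rng.normal(size=30).astype(np.float32)
items, scores = top_n_candidates(user_emb, item_emb, user_biases, item_biases, top=3, batch=2)
```

Processing users in batches keeps the dense score block small enough to fit in memory, at the cost of one matrix product per batch.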

In [16]:
def candidates_extraction(model_type, users_test, items_test, top=50, precomputed_scores=True):

  base = "/content/drive/MyDrive/WB School/"

  if precomputed_scores:
    # Load the (item, score) pairs computed in advance for this model
    pairs = load(base + "pairs_" + model_type + ".npy")
  else:
    item_emb = load(base + "item_emb_" + model_type + ".npy")
    user_emb = load(base + "user_emb_" + model_type + ".npy")
    user_biases = load(base + "user_biases_" + model_type + ".npy")
    item_biases = load(base + "item_biases_" + model_type + ".npy")

    pairs = scores_calculation(user_emb, item_emb, user_biases, item_biases, items_number=top, top=top)

  # (user, item) -> (score, rank); ranks start at 1
  model_dict = dict()
  for user, user_data in enumerate(pairs):
    for rank, (item, score) in enumerate(user_data):
      model_dict[(user, item)] = (score, rank + 1)

  model_pairs = list(model_dict.keys())

  return model_pairs, model_dict

In [17]:
users_test = sorted(list(set(coo_matrix(train_local_csr).row)))
items_test = sorted(list(set(coo_matrix(train_local_csr).col)))

Some variables are no longer needed:

In [18]:
del train_local
del df
del train_local_grouped
del test_local_grouped
del all_items
del users_final
del train_local_csr
# del model_warp
# del user_biases_warp
# del item_biases_warp
# del item_emb_warp
# del user_emb_warp
del df_new
del train_global

In [19]:
warp_pairs, warp_dict = candidates_extraction("warp_new", users_test, items_test, top=50, precomputed_scores=True) # 3.6 GB
bpr_pairs, bpr_dict = candidates_extraction("bpr_new", users_test, items_test, top=50, precomputed_scores=True) # 3.9 GB
# lmf_pairs, lmf_dict, lmf_user_biases_series, lmf_item_biases_series, lmf_user_emb = candidates_extraction(model_lmf, "lmf", users_test, items_test, top=50, precomputed_scores=True)
# warp_kos_pairs, warp_kos_dict = candidates_extraction("warp_kos", users_test, items_test, top=50, precomputed_scores=True)

Assemble a dataset from these candidates.

In [20]:
total_pairs = list(set(warp_pairs).union(set(bpr_pairs)))
# total_pairs = list(set(total_pairs).union(set(lmf_pairs)))
# total_pairs = list(set(total_pairs).union(set(warp_kos_pairs)))

In [21]:
del warp_pairs
del bpr_pairs
# del lmf_pairs
# del warp_kos_pairs

In [22]:
data_all_pairs = [pair +
                  warp_dict.get(pair, (np.nan, np.nan)) +
                  bpr_dict.get(pair, (np.nan, np.nan))  for pair in tqdm(total_pairs)]

100%|██████████| 17666447/17666447 [00:51<00:00, 344797.51it/s]
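To make the row layout concrete, here is a toy version of the same tuple concatenation. The two small dictionaries are made-up stand-ins for `warp_dict` and `bpr_dict`; only the mechanics match the cell above.

```python
import math

# Hypothetical candidate dictionaries: (user, item) -> (score, rank)
warp_dict = {(0, 10): (1.8, 1), (0, 11): (1.2, 2)}
bpr_dict = {(0, 11): (0.9, 1), (0, 12): (0.4, 2)}

total_pairs = sorted(set(warp_dict) | set(bpr_dict))
missing = (math.nan, math.nan)

# Tuple concatenation yields one flat row per pair:
# (user, item, warp_score, warp_rank, bpr_score, bpr_rank)
rows = [pair + warp_dict.get(pair, missing) + bpr_dict.get(pair, missing)
        for pair in total_pairs]
```

A pair proposed by only one model gets NaN in the other model's score and rank columns, which is why the missing values have to be handled before boosting.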


In [23]:
del warp_dict
del bpr_dict
# del lmf_dict
# del warp_kos_dict

In [24]:
data_all_pairs_df = pd.DataFrame(data_all_pairs,
                                 columns=["user_id", "item_id", "warp_score", "warp_rank",
                                                                "bpr_score", "bpr_rank"])

In [25]:
del data_all_pairs

Loading the precomputed scores from the .npy format changes the data types, so we set the dtypes manually:

In [26]:
def change_dtype(df):

    for column in df.columns:
        if column.endswith("id"):
            df[column] = df[column].astype(np.int32)
        else:
            df[column] = df[column].astype(np.float32)

    return df

In [27]:
data_all_pairs_df = change_dtype(data_all_pairs_df)
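One plausible explanation for the dtype change, offered as an assumption rather than something the notebook states: a NumPy array has a single dtype, so (item, score) pairs that mix integers and floats are stored entirely as floats. The toy round trip below illustrates this with an in-memory buffer standing in for the file on disk.

```python
import io
import numpy as np

# One user's candidates as (item, score) pairs, the layout returned by scores_calculation
pairs = [[(3, 0.75), (7, 0.5)]]

# Round trip through the .npy format
buffer = io.BytesIO()
np.save(buffer, np.array(pairs))
buffer.seek(0)
loaded = np.load(buffer)

item = loaded[0, 0, 0]  # the item index comes back as a float, hence the explicit casts
```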

Fill in the missing values before training the boosting model:

In [28]:
def fill_nans(df, top):

    for column in df.columns:
        if column.endswith("score"):
            # A single random constant is drawn per column and used for all of its missing values
            df[column] = df[column].fillna(random.uniform(0, 1))
        elif column.endswith("rank"):
            # A rank beyond the top list, to push these items further down
            df[column] = df[column].fillna(random.randint(top, (top + 100)))

    return df

In [29]:
predictions = fill_nans(data_all_pairs_df, top=50)

The second-level model is gradient boosting. Ranking is treated here as a binary classification task, so we need to build a target of 0s and 1s, where 1 means that the user ordered the item.

We collect all user purchases from the test period into the list purchases, which is then converted into a dictionary.

In [30]:
purchases = list()

for k in tqdm(range(test_local_csr.shape[0])):
    cx = coo_matrix(test_local_csr[k])
    purchased_items, user_id = [], []
    user_id.append(k)

    for i,j,v in zip(cx.row, cx.col, cx.data):
        purchased_items.append(j)
    for i in list(itertools.product(user_id, purchased_items)):
        purchases.append(i)

100%|██████████| 224642/224642 [00:59<00:00, 3791.70it/s]


In [31]:
def purchases2dict(purchases):

    data_true = {}
    for i in tqdm(purchases):
        curr, item = i[0], int(i[1])

        if curr not in data_true:
            data_true[curr] = list()
            data_true[curr].append(item)
        else:
            data_true[curr].append(item)

    for i in tqdm(data_true.keys()):
        data_true[i] = set(data_true[i])

    return data_true

In [32]:
data_true = purchases2dict(purchases)

100%|██████████| 1661221/1661221 [00:05<00:00, 292918.38it/s]
100%|██████████| 224642/224642 [00:03<00:00, 71179.21it/s] 


In [33]:
del purchases

Restore the original identifiers of the items and users, which were encoded for model training:

In [34]:
items_dict = dict(zip(train.item_new_id, train.item_id))
users_dict = dict(zip(train.user_new_id, train.user_id))

In [35]:
predictions["user_id"] = predictions["user_id"].map(users_dict)
predictions["item_id"] = predictions["item_id"].map(items_dict)

In [36]:
del items_dict
del users_dict

In [37]:
del train

Build a dataset in the usual form for boosting:

In [38]:
test["target"] = 1

# Left join: candidates that were actually ordered in the test period get target 1
dataset = pd.merge(predictions,
                   test[["user_id", "item_id", "target"]].drop_duplicates(),
                   how="left",
                   on=["user_id", "item_id"])

# Candidates absent from the test period get target 0
dataset["target"] = dataset["target"].fillna(0)

In [39]:
del predictions

In [40]:
dataset = dataset.dropna()

In [None]:
dataset.target.value_counts(normalize=True)

0.0    0.961707
1.0    0.038293
Name: target, dtype: float64

In [None]:
dataset.head()

| | user_id | item_id | warp_score | warp_rank | bpr_score | bpr_rank | target |
|---|---|---|---|---|---|---|---|
| 0 | 608003 | 407 | 0.785074 | 143.0 | 0.023496 | 36.0 | 0.0 |
| 1 | 31409 | 82 | 1.58543 | 24.0 | 0.041741 | 19.0 | 0.0 |
| 2 | 620822 | 1069 | 1.595243 | 42.0 | 0.542877 | 145.0 | 1.0 |
| 3 | 520135 | 347 | 1.424931 | 34.0 | 0.542877 | 145.0 | 0.0 |
| 4 | 852904 | 180 | 1.853096 | 11.0 | 0.06926 | 10.0 | 1.0 |


As usual for boosting, separate the target from the features:

In [41]:
Y = dataset.pop("target")
X = dataset

In [42]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.7, random_state=42)

X_train = x_train[["warp_score", "warp_rank", "bpr_score", "bpr_rank"]]
X_test = x_test[["warp_score", "warp_rank", "bpr_score", "bpr_rank"]]
train_data, test_data = lightgbm.Dataset(X_train, y_train), lightgbm.Dataset(X_test, y_test)

For cross-validation, instead of this single train/test split, the data should be split into K folds and the boosting parameters tuned on them.
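A minimal sketch of such a K-fold split follows. Keeping all candidate rows of one user in the same fold is an added assumption, made so that validation users are unseen during training; the helper `user_kfold` is hypothetical. scikit-learn's `GroupKFold` offers comparable grouping if a library implementation is preferred.

```python
import random


def user_kfold(user_ids, n_splits=5, seed=42):
    """Yield (train_idx, valid_idx); all rows of a user land in the same fold."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    # Assign shuffled users to folds round-robin
    fold_of_user = {user: i % n_splits for i, user in enumerate(users)}
    for fold in range(n_splits):
        train_idx = [i for i, u in enumerate(user_ids) if fold_of_user[u] != fold]
        valid_idx = [i for i, u in enumerate(user_ids) if fold_of_user[u] == fold]
        yield train_idx, valid_idx


# Toy candidate table: six users with two candidate rows each
user_ids = [u for u in range(6) for _ in range(2)]
folds = list(user_kfold(user_ids, n_splits=3))
```

The index lists can be used to slice the feature table and the target before building one `lightgbm.Dataset` pair per fold.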

In [43]:
del X
del Y
del x_train, y_train
del X_train, X_test
del dataset

# II. Ranking

Train the gradient boosting model:

In [44]:
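params = {"objective": "binary",
          "boosting": "gbdt",
          "metric": "binary_logloss",
          "verbose": 1,
          "learning_rate": 0.001}

model = lightgbm.train(params,
                       train_data,
                       valid_sets=[test_data],
                       num_boost_round=200,
                       # Log the validation metric every round; recent LightGBM releases
                       # no longer accept the verbose_eval argument
                       callbacks=[lightgbm.log_evaluation(period=1)])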



[LightGBM] [Info] Number of positive: 473018, number of negative: 11893494
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 614
[LightGBM] [Info] Number of data points in the train set: 12366512, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.038250 -> initscore=-3.224613
[LightGBM] [Info] Start training from score -3.224613
[1]	valid_0's binary_logloss: 0.16278
[2]	valid_0's binary_logloss: 0.162751
[3]	valid_0's binary_logloss: 0.162721
[4]	valid_0's binary_logloss: 0.162692
[5]	valid_0's binary_logloss: 0.162663
[6]	valid_0's binary_logloss: 0.162634
[7]	valid_0's binary_logloss: 0.162605
[8]	valid_0's binary_logloss: 0.162576
[9]	valid_0's binary_logloss: 0.162547
[10]	valid_0's binary_logloss: 0.162518
[11]	valid_0's binary_logloss: 0.16249
[12]	valid_0's binary_logloss: 0.162461
[13]	valid_0's binary_logloss: 0.162433
[14]	valid_0's binary_logloss: 0.162405
[15]	valid_0's binary_logloss: 0.162376
[16]	valid_0's b

In [45]:
feature_cols = ["warp_score", "warp_rank", "bpr_score", "bpr_rank"]

# Deduplicate the (user, item) pairs on the frame itself; calling drop_duplicates(inplace=True)
# on a column selection only modifies a temporary copy
lgb_test = x_test.drop_duplicates(subset=["user_id", "item_id"]).copy()
lgb_test["lgb_score"] = model.predict(lgb_test[feature_cols], num_iteration=model.best_iteration)
lgb_test = lgb_test.sort_values("lgb_score", ascending=False)

# Keep the 20 highest-scored items for every user
dataset_predicted = dict()
for user, group in tqdm(lgb_test.groupby("user_id")):
    dataset_predicted[user] = list(group.item_id)[:20]

100%|██████████| 224642/224642 [00:20<00:00, 11124.07it/s]
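The ranked lists can be scored against the test purchases with Recall@K, the metric used to tune the candidate models. The sketch below uses one common definition (hits among the top K divided by the number of purchased items, averaged over users); the helper `recall_at_k` and the toy dictionaries are hypothetical. Both dictionaries must use the same id space: in this notebook `data_true` is built from encoded matrix indices, while `dataset_predicted` holds the restored original ids, so one of them would need to be remapped first.

```python
def recall_at_k(predicted, actual, k=20):
    """Mean over users of |top-k predicted items that were purchased| / |purchased items|."""
    recalls = []
    for user, purchased in actual.items():
        if not purchased:
            continue  # users without purchases do not contribute
        top_k = predicted.get(user, [])[:k]
        hits = len(set(top_k) & set(purchased))
        recalls.append(hits / len(purchased))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Toy data in the same layout as dataset_predicted (user -> ranked list of items)
# and data_true (user -> set of purchased items)
predicted = {1: [10, 11, 12], 2: [20, 21, 22]}
actual = {1: {10, 12, 13, 14}, 2: {21}}
score = recall_at_k(predicted, actual, k=2)
```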


In [48]:
from numpy import save

In [51]:
save("dataset_predicted_new.npy", dataset_predicted, allow_pickle=True)

In [53]:
with open('dataset_predicted.pkl', 'wb') as f:
    pickle.dump(dataset_predicted, f)

In [54]:
with open('/content/drive/MyDrive/WB School/dataset_predicted.pkl', 'rb') as f:
    dataset_predicted = pickle.load(f)

# References #

1. Hybrid approach and the idea for the train/test split:
[A Hybrid Approach to Music Playlist Continuation Based on Playlist-Song Membership](https://arxiv.org/abs/1805.09557).

2. Hybrid model have lower Precision@K compare to pure CF: Issue on [Github](https://github.com/lyst/lightfm/issues/486)

3. Weston, Jason, Samy Bengio, and Nicolas Usunier. “Wsabie: Scaling up to large vocabulary image annotation.” IJCAI. Vol. 11. 2011.
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37180.pdf

4. Weston, J., Yee, H., & Weiss, R. J. (2013, October). Learning to rank recommendations with the k-order statistic loss. In Proceedings of the 7th ACM Conference on Recommender Systems (pp. 245-248). [dl.acm.org](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41534.pdf)


5. Rendle, Steffen, et al. “BPR: Bayesian personalized ranking from implicit feedback.” Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009. [arxiv.org](https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf)

6. Johnson, C. C. (2014). Logistic matrix factorization for implicit feedback data. Advances in Neural Information Processing Systems, 27(78), 1-9. [stanford.edu](https://web.stanford.edu/~rezab/nips2014workshop/submits/logmat.pdf)

7. Ben Frederickson. Distance Metrics for Fun and Profit. [Blog post about implicit](https://www.benfrederickson.com/distance-metrics/)