# Course project. Two-level recommender system on the Retail Hero competition data


### Basics

- Deadline: December 29, 23:59
- Target metric: precision@5
- Baseline solution: MainRecommender
- Submit a link to a GitHub repository with the solution. The repository must contain a file recommendations.csv (user_id | [rec_1, rec_2, ...]) with the recommendations. rec_i are real item ids (from retail_train.csv)

Hints:

Start by simply trying different MainRecommender parameters:

- N in the top-N items used to build the user-item matrix (currently top-5000)
- Different weights in the user-item matrix (0/1, number of purchases, log(number of purchases + 1), purchase amount, ...)
- Different matrix weightings (TF-IDF, BM25, which has its own parameters)
- Different ways of blending recommendations (note the baseline: the user's past purchases)
- Build an MVP, a minimum viable product (even plain top-popular), and then improve it
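
The weighting options from the hints can be compared on a toy purchase log. This is a minimal sketch, not the `MainRecommender` code: the `purchases` frame and the `build_user_item` helper are hypothetical, and only binary, count, and log-count weights are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy purchase log; column names follow retail_train.csv.
purchases = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 3],
    'item_id':  [10, 10, 20, 10, 30, 20],
    'quantity': [1, 3, 1, 2, 1, 5],
})

def build_user_item(df, weighting='count'):
    """Pivot a purchase log into a dense user-item matrix (illustrative helper)."""
    counts = pd.pivot_table(df, index='user_id', columns='item_id',
                            values='quantity', aggfunc='sum', fill_value=0)
    if weighting == 'binary':
        # 0/1: did the user ever buy the item
        return (counts > 0).astype(float)
    if weighting == 'log':
        # log(number of purchases + 1) dampens heavy buyers
        return np.log1p(counts.astype(float))
    # default: raw purchased quantity
    return counts.astype(float)

binary = build_user_item(purchases, 'binary')
counts = build_user_item(purchases, 'count')
logged = build_user_item(purchases, 'log')
```

Each variant can then be passed through the same candidate model and compared by recall@k on the validation window.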

---

### Business applications

- Two-level systems are used in many companies
- Often there are more than 2 levels
- Go from simpler heuristics/models to more complex ones
- Features from first-level models (embeddings, biases from ALS) can be used in subsequent models

In addition, solutions based on two-level recommendations took all of the top-10 places in the X5 Retail Hero competition.

### How to select candidates?

There are many options. *MainRecommender* helps here, although it does not yet implement all possible ways of generating candidates.

- Generate the top-k candidates
- Measure candidate quality with **recall@k**
- recall@k shows what share of the purchased items the model managed to recover (recommend)
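
To make the two metrics concrete, here is a minimal sketch. The functions in `src.metrics` are not shown in this notebook and may differ in details (for example, in how empty inputs are handled), so the names below carry a `_sketch` suffix and are illustrative only.

```python
def recall_at_k_sketch(recommended, bought, k=5):
    """Share of bought items that appear among the first k recommendations."""
    bought = set(bought)
    if not bought:
        return 0.0
    hits = len(bought & set(recommended[:k]))
    return hits / len(bought)

def precision_at_k_sketch(recommended, bought, k=5):
    """Share of the first k recommendations that were actually bought."""
    bought = set(bought)
    top_k = list(recommended[:k])
    if not top_k:
        return 0.0
    hits = sum(item in bought for item in top_k)
    return hits / k

recommended = [1, 2, 3, 4, 5, 6]
bought = [2, 5, 9, 10]
recall = recall_at_k_sketch(recommended, bought, k=5)        # 2 of 4 bought items found
precision = precision_at_k_sketch(recommended, bought, k=5)  # 2 of 5 recommendations hit
```

Recall is normalized by the number of purchased items, which is why it suits candidate selection. Precision is normalized by k, which is why it suits the final top-5 list.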

----

Pipeline:
1. Recommend 50 candidate items with classical methods
2. Evaluate recall@k of the candidate output (the output of the first-level models)
3. Build a user-item dataset from the candidate recommendations
4. Label each row of that dataset with a bought / not bought target from the interaction history
5. Train LightGBM on this dataset to predict whether the user will buy the given item
6. It is also recommended to get familiar with LightAutoML and try it

# Practical part

I customized the code in src and utils; recommenders2 contains methods that return user and item embeddings. They are useful for the second-level model.

In [None]:
!pip install implicit

Collecting implicit
  Downloading implicit-0.4.8.tar.gz (1.1 MB)

In [None]:
from google.colab import drive
drive.mount('/content/drive')
root = '/content/drive/My Drive/Colab Notebooks/rec_sys/data/'

Mounted at /content/drive


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrices
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als
from implicit.nearest_neighbours import bm25_weight, tfidf_weight


# Second-level model
from lightgbm import LGBMClassifier

import pickle

In [None]:
!pip install lightautoml

Collecting lightautoml
  Downloading LightAutoML-0.3.2-py3-none-any.whl (294 kB)
Collecting catboost
  Downloading catboost-1.0.3-cp37-none-manylinux1_x86_64.whl (76.3 MB)
Collecting json2html
  Downloading json2html-1.3.0.tar.gz (7.0 kB)
Collecting importlib-metadata<2.0,>=1.0
  Downloading importlib_metadata-1.7.0-py2.py3-none-any.whl (31 kB)
Collecting lightgbm<3.0,>=2.3
  Downloading lightgbm-2.3.1-py2.py3-none-manylinux1_x86_64.whl (1.2 MB)
Collecting torch<1.9
  Downloading torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl (804.1 MB)
Collecting dataclasses==0.6
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Collecting cmaes
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Collecting poetry-core<2.0.0,>=1.0.0
  Downloading 

In [2]:
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.tasks.common_metric import mean_quantile_error

In [88]:
import os, sys
#module_path = os.path.abspath(os.path.join(os.pardir))
#if module_path not in sys.path:
#     sys.path.append(module_path)
#sys.path.append('/content/drive/My Drive/Colab Notebooks/rec_sys')

# Our own helper functions
from src.metrics import precision_at_k, recall_at_k
from src.utils import prefilter_items
from src.recommenders2 import MainRecommender

### Load the data and split it for out-of-time validation: the last 9 weeks, of which 6 weeks go to level 1 and 3 weeks to level 2

In [89]:
data = pd.read_csv('retail_train.csv')#(root+'retail_train.csv')
item_features = pd.read_csv('product.csv')#(root+'product.csv')
user_features = pd.read_csv('hh_demographic.csv')#(root+'hh_demographic.csv')

# column processing
#data.columns = [col.lower() for col in data.columns]
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)


# The training and validation scheme matters!
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks --
# tune the size of the 2nd dataset (6 weeks) --> learning curve (recall@k as a function of dataset size)
val_lvl_1_size_weeks = 6
val_lvl_2_size_weeks = 3

data_train_lvl_1 = data[data['week_no'] < data['week_no'].max() - (val_lvl_1_size_weeks + val_lvl_2_size_weeks)]
data_val_lvl_1 = data[(data['week_no'] >= data['week_no'].max() - (val_lvl_1_size_weeks + val_lvl_2_size_weeks)) &
                      (data['week_no'] < data['week_no'].max() - (val_lvl_2_size_weeks))]

data_train_lvl_2 = data_val_lvl_1.copy()  # For clarity. Later we will add changes and the two will differ
data_val_lvl_2 = data[data['week_no'] >= data['week_no'].max() - val_lvl_2_size_weeks]

data_train_lvl_1.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
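
The same split logic can be checked on a toy frame. This is a sketch that mirrors the comparisons used above; `split_out_of_time` is a hypothetical helper and the toy data has one row per week.

```python
import pandas as pd

def split_out_of_time(df, lvl_1_weeks=6, lvl_2_weeks=3, week_col='week_no'):
    """Split by week number using the same boundaries as the cell above."""
    last_week = df[week_col].max()
    lvl_1_start = last_week - (lvl_1_weeks + lvl_2_weeks)
    lvl_2_start = last_week - lvl_2_weeks
    train_lvl_1 = df[df[week_col] < lvl_1_start]
    val_lvl_1 = df[(df[week_col] >= lvl_1_start) & (df[week_col] < lvl_2_start)]
    val_lvl_2 = df[df[week_col] >= lvl_2_start]
    return train_lvl_1, val_lvl_1, val_lvl_2

# Toy data: weeks 1..20, a single user.
toy = pd.DataFrame({'week_no': range(1, 21), 'user_id': 1})
train_1, val_1, val_2 = split_out_of_time(toy)
```

With these comparisons the three parts do not overlap and cover all rows. Note that `>= max - 3` is inclusive, so on the toy data the last window holds four distinct week numbers (17 to 20) rather than three.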


In [90]:
n_items_before = data_train_lvl_1['item_id'].nunique()

data_train_lvl_1 = prefilter_items(data_train_lvl_1, item_features=item_features, take_n_popular=5000)

n_items_after = data_train_lvl_1['item_id'].nunique()
print('Decreased # items from {} to {}'.format(n_items_before, n_items_after))

Decreased # items from 83685 to 5001


First-level model. MainRecommender

In [91]:
recommender = MainRecommender(data_train_lvl_1)

  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/5001 [00:00<?, ?it/s]

In [92]:
recommender

<src.recommenders2.MainRecommender at 0x1c707cde550>

### Ways to obtain candidates

All of these options can later be combined into one

(!) If a model recommends fewer than N items, the recommendations are padded with top-popular items up to N

### Measuring recall@k


In [93]:
result_lvl_1 = data_val_lvl_1.groupby('user_id')['item_id'].unique().reset_index()
result_lvl_1.columns=['user_id', 'actual']
result_lvl_1.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,2,"[15830248, 838136, 839656, 861272, 866211, 870..."


In [94]:
result_lvl_1.shape

(2154, 2)

In [95]:
users_lvl_1 = pd.DataFrame(data_train_lvl_1['user_id'].unique(),columns = ['user_id'])
users_lvl_1.shape

(2497, 1)

In [96]:
%%time
K_num = 50
result_lvl_1['als_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_als_recommendations(x, N=K_num))
result_lvl_1['own_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_own_recommendations(x, N=K_num))
result_lvl_1['sim_items'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_similar_items_recommendation(x, N=K_num))
#result_lvl_1['sim_users'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_similar_users_recommendation(x, N=K_num))

Wall time: 1min 15s
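
One thing worth checking in the cell above: assigning a Series to a DataFrame column aligns on the index, not on `user_id`. `users_lvl_1` (train users, 2497 rows) and `result_lvl_1` (validation users, 2154 rows) are different frames, so a row of `result_lvl_1` may receive recommendations generated for another user. A minimal sketch of the effect on toy data, with a hypothetical `fake_recs` function standing in for the recommender:

```python
import pandas as pd

def fake_recs(user_id, N=2):
    """Stand-in for a recommender call: encodes the user id into the item ids."""
    return [user_id * 100 + i for i in range(N)]

val = pd.DataFrame({'user_id': [2, 3, 5]})
train_users = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})

# Index-aligned assignment: row i of `val` gets the result for row i of `train_users`.
val['recs_by_index'] = train_users['user_id'].apply(fake_recs)

# Applying to the frame's own user_id column keeps rows and users in sync.
val['recs_by_user'] = val['user_id'].apply(fake_recs)
```

In this notebook that would mean applying the recommender to `result_lvl_1['user_id']`, restricted to users seen in training if the recommender cannot handle cold users. How much this changes the reported recall has not been verified here.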


In [97]:
%%time
result_lvl_1['sim_users'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_similar_users_recommendation(x, N=K_num))

KeyboardInterrupt: 

## Recall for choosing the first-level model (the candidate-generation model)

In [100]:
def calculate_recall_k(data, K): #data - pandas df
    for column in data.columns[2:]:
        yield column, data.apply(lambda row: recall_at_k(row[column], row['actual'], k=K), axis=1).mean()

In [101]:
recall_results = pd.DataFrame(sorted(calculate_recall_k(result_lvl_1, 50), key=lambda x: x[1],reverse=True), columns = ['Candidate_model','Recall'])
recall_results

Unnamed: 0,Candidate_model,Recall
0,als_rec,0.02412
1,sim_items,0.022766
2,own_rec,0.012843


# Baseline: first-level models, precision@5

In [102]:
def calculate_precision_k(data, K): #data - pandas df
    for column in data.columns[2:]:
        yield column, data.apply(lambda row: precision_at_k(row[column], row['actual'], k=K), axis=1).mean()

In [103]:
precision_results = pd.DataFrame(sorted(calculate_precision_k(result_lvl_1, 5), key=lambda x: x[1],reverse=True), columns = ['Model','Precision'])
precision_results

Unnamed: 0,Model,Precision
0,als_rec,0.047075
1,sim_items,0.03584
2,own_rec,0.008078


### Best baseline result: MainRecommender ALS (ALS recommendations with the default BM25 weighting, padded with top-popular items) reaches precision@5 = 0.047075. In the tables above ALS also has the highest recall@50 (0.02412), with similar_items close behind (0.022766); similar_items is the method used below to select candidates for the second-level model.

In [104]:
## Add top-popular items, stack several candidate lists

### Training the second-level model on the selected candidates

- Train on data_train_lvl_2
- Train *only* on the selected candidates: I generate the top-50 candidates with get_als_recommendations. If a user bought fewer than 50 items, get_als_recommendations pads the recommendations with top-popular items

In [105]:
users_lvl_2 = pd.DataFrame(data_train_lvl_2['user_id'].unique())
users_lvl_2.columns = ['user_id']

# Warm start only for now
train_users = data_train_lvl_1['user_id'].unique()
users_lvl_2 = users_lvl_2[users_lvl_2['user_id'].isin(train_users)]

users_lvl_2['candidates'] = users_lvl_2['user_id'].apply(lambda x: recommender.get_als_recommendations(x, N=50))
s = users_lvl_2.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'item_id'

users_lvl_2 = users_lvl_2.drop('candidates', axis=1).join(s)
users_lvl_2['flag'] = 1
targets_lvl_2 = data_train_lvl_2[['user_id', 'item_id']].copy()
targets_lvl_2['target'] = 1  # purchases only

targets_lvl_2 = users_lvl_2.merge(targets_lvl_2, on=['user_id', 'item_id'], how='left')

targets_lvl_2['target'].fillna(0, inplace= True)
targets_lvl_2.drop('flag', axis=1, inplace=True)

In [106]:
users_lvl_2.shape[0]

107600

In [107]:
users_lvl_2['user_id'].nunique()

2152

In [108]:
targets_lvl_2.shape

(112884, 3)

In [109]:
targets_lvl_2['target'].mean()

0.12832642358527338
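
`targets_lvl_2` has 112884 rows while the candidate frame has 107600. A left merge repeats a candidate row once per matching purchase row, so a user-item pair bought several times shows up several times. Below is a sketch on toy data, assuming one row per pair is wanted. It also uses `DataFrame.explode` as a shorter way to unroll the candidate lists. All frames here are illustrative.

```python
import pandas as pd

candidates = pd.DataFrame({'user_id': [1, 2],
                           'candidates': [[10, 20], [10, 30]]})
purchases = pd.DataFrame({'user_id': [1, 1, 2],
                          'item_id': [10, 10, 30]})  # user 1 bought item 10 twice

# One row per (user, candidate item).
pairs = (candidates.explode('candidates')
                   .rename(columns={'candidates': 'item_id'})
                   .reset_index(drop=True))
pairs['item_id'] = pairs['item_id'].astype('int64')  # explode leaves object dtype

# Merging with the raw purchase log duplicates the repeated pair.
naive = pairs.merge(purchases.assign(target=1), on=['user_id', 'item_id'], how='left')

# De-duplicating purchases first keeps exactly one row per candidate pair.
bought = purchases.drop_duplicates().assign(target=1)
labeled = pairs.merge(bought, on=['user_id', 'item_id'], how='left')
labeled['target'] = labeled['target'].fillna(0).astype(int)
```

Here `naive` has five rows and `labeled` has four. Whether de-duplication improves the final metric in this project is not tested here.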

## Feature generation. Add user and item features plus their ALS embeddings from a MainRecommender method (which I added myself)

**user_id features:**
    - Average basket value
    - Average purchase amount per item in each category
    - Number of purchases in each category
    - Purchase frequency, times per month
    - Share of purchases made on weekends
    - Share of purchases made in the morning/afternoon/evening

**item_id features:**
    - Number of purchases per week
    - Average number of purchases of one item in the category per week
    - (Number of purchases per week) / (Average number of purchases of one item in the category per week)
    - Price (can be computed from retail_train.csv)
    - Price / Average item price in the category

**user_id - item_id pair features:**
    - (Average purchase amount per item in each category (taking the category of item_id)) - (Price of item_id)
    - (Number of purchases of the given category by the user per week) - (Average number of purchases of that category by all users per week)
    - (Number of purchases of the given category by the user per week) / (Average number of purchases of that category by all users per week)
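
Two of the listed features, sketched on toy data: item price relative to the mean price in its category, and a user's purchases in a category relative to the average across users. The frames `tx` and `items` are hypothetical, `department` is used as the category, and the per-week normalization is left out for brevity.

```python
import pandas as pd

tx = pd.DataFrame({
    'user_id':     [1, 1, 2, 2],
    'item_id':     [10, 20, 10, 30],
    'quantity':    [2, 1, 1, 4],
    'sales_value': [4.0, 3.0, 2.0, 4.0],
})
items = pd.DataFrame({'item_id': [10, 20, 30],
                      'department': ['GROCERY', 'GROCERY', 'DRUG']})

# Item price = total sales value / total quantity.
price = tx.groupby('item_id')[['sales_value', 'quantity']].sum().reset_index()
price['price'] = price['sales_value'] / price['quantity']
price = price.merge(items, on='item_id', how='left')

# Price / mean item price within the same department.
dep_mean_price = price.groupby('department')['price'].transform('mean')
price['price_to_dep_mean'] = price['price'] / dep_mean_price

# User's quantity in a department minus the mean over users who bought there.
tx_dep = tx.merge(items, on='item_id', how='left')
user_dep = (tx_dep.groupby(['user_id', 'department'])['quantity'].sum()
                  .rename('user_dep_qty').reset_index())
dep_mean_qty = user_dep.groupby('department')['user_dep_qty'].transform('mean')
user_dep['qty_vs_dep_mean'] = user_dep['user_dep_qty'] - dep_mean_qty
```

`price` joins to the candidate frame on `item_id`. `user_dep` joins on `user_id` and the candidate item's `department`.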

## Feature generation

In [110]:
#Max week
MAX_WEEK = data['week_no'].max()

In [111]:
# Transaction data
t_data = data_train_lvl_2.copy()
df_augm = targets_lvl_2
t_data = t_data.merge(item_features[['item_id','department']], on='item_id',how='left')

In [112]:
# average basket value per user
avg_basket = (t_data.groupby(['user_id', 'basket_id'])['sales_value'].sum().reset_index()).groupby('user_id')['sales_value'].mean().reset_index()
avg_basket.columns = ['user_id', 'avg_basket']

In [113]:
# Average quantity bought by the user per category
avg_user_qty_per_department = (t_data.groupby(['user_id', 'department'])['quantity'].sum().reset_index()).groupby('user_id')['quantity'].mean().reset_index()
avg_user_qty_per_department.columns = ['user_id', 'avg_user_qty_per_department']

In [114]:
# Number of weeks since the user's last purchase
last_activity = t_data.groupby(['user_id'])['week_no'].max().reset_index()
last_activity.columns = ['user_id', 'inactivity']
last_activity['inactivity'] = MAX_WEEK - last_activity['inactivity']

In [115]:
# item price
price = t_data.groupby(['item_id'])[['sales_value','quantity']].sum().reset_index()
price['price'] = price['sales_value']/price['quantity']
price.drop(['sales_value','quantity'], axis=1,inplace=True)


In [116]:
# Average quantity of the item purchased within its category
qty_purch_in_department = (t_data.groupby(['item_id', 'department'])['quantity'].sum().reset_index()).groupby('item_id')['quantity'].mean().reset_index()
qty_purch_in_department.columns = ['item_id', 'avg_count_item_dep']

In [117]:
items_emb = recommender.items_embedings()
users_emb = recommender.user_embedings()
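
`items_embedings()` and `user_embedings()` are custom methods of `recommenders2.MainRecommender`, and their code is not shown here. Judging by the resulting columns (`user0` to `user19`, `item0` to `item19`), each returns a frame with an id column and one column per latent factor. Below is a hypothetical sketch of such a conversion. A random matrix stands in for the ALS factors, and the row-index-to-id mapping is an assumption.

```python
import numpy as np
import pandas as pd

def factors_to_frame(factors, row_to_id, id_col, prefix):
    """Turn an (n_rows, n_factors) matrix into a frame keyed by id (illustrative)."""
    cols = [f'{prefix}{i}' for i in range(factors.shape[1])]
    frame = pd.DataFrame(factors, columns=cols)
    frame.insert(0, id_col, [row_to_id[row] for row in range(factors.shape[0])])
    return frame

rng = np.random.default_rng(0)
item_factors = rng.normal(size=(3, 20))                    # stand-in for ALS item factors
row_to_item_id = {0: 1004906, 1: 1033142, 2: 999999}       # sample ids, assumed mapping
items_emb_sketch = factors_to_frame(item_factors, row_to_item_id, 'item_id', 'item')
```

A frame of this shape merges onto the candidate dataset by `item_id` (or `user_id`).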

In [118]:
df_augm = df_augm.merge(avg_basket, on='user_id',how='left')
df_augm = df_augm.merge(avg_user_qty_per_department, on='user_id',how='left')
df_augm = df_augm.merge(last_activity, on='user_id',how='left')
df_augm = df_augm.merge(user_features, on='user_id', how='left')
df_augm = df_augm.merge(users_emb, on='user_id', how='left')

In [119]:
df_augm = df_augm.merge(price[['item_id','price']], on='item_id',how='left')
df_augm = df_augm.merge(qty_purch_in_department, on='item_id',how='left')
df_augm = df_augm.merge(item_features, on='item_id', how='left')
df_augm = df_augm.merge(items_emb, on='item_id', how='left')

In [120]:
df_augm.shape

(112884, 61)

In [121]:
df_augm.head()

Unnamed: 0,user_id,item_id,target,avg_basket,avg_user_qty_per_department,inactivity,age_desc,marital_status_code,income_desc,homeowner_desc,...,item10,item11,item12,item13,item14,item15,item16,item17,item18,item19
0,2070,1082185,1.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,...,0.016263,0.004875,0.018303,0.018178,0.017284,0.012828,0.010776,0.013405,0.008766,0.026451
1,2070,1107553,0.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,...,0.007785,0.006604,-0.00743,0.019037,0.022661,0.000413,0.004594,0.002029,0.012358,-0.003668
2,2070,1085604,1.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,...,-0.005059,0.0125,-0.000715,-0.002108,0.014751,-0.002767,0.006061,0.007394,0.010498,-0.002009
3,2070,879755,0.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,...,-0.004387,0.007124,0.005094,-1.1e-05,0.023788,0.003137,0.011089,0.009344,0.003565,-0.00266
4,2070,844165,0.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,...,-0.002054,0.010486,0.010291,0.012439,0.019254,-0.005544,0.015886,0.0129,0.002635,-0.000749


In [122]:
targets_lvl_2 = df_augm

In [123]:
def preprocessing(data):
    #Max week
    MAX_WEEK = data['week_no'].max()
    users_lvl_2 = pd.DataFrame(data['user_id'].unique())
    users_lvl_2.columns = ['user_id']

    train_users = data_train_lvl_1['user_id'].unique()
    users_lvl_2 = users_lvl_2[users_lvl_2['user_id'].isin(train_users)]

    users_lvl_2['candidates'] = users_lvl_2['user_id'].apply(lambda x: recommender.get_similar_items_recommendation(x, N=50))
    s = users_lvl_2.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
    s.name = 'item_id'

    users_lvl_2 = users_lvl_2.drop('candidates', axis=1).join(s)
    users_lvl_2['flag'] = 1
    # NOTE: labels are taken from the global data_train_lvl_2, not from the `data` argument
    targets_lvl_2 = data_train_lvl_2[['user_id', 'item_id']].copy()
    targets_lvl_2['target'] = 1  # purchases only

    targets_lvl_2 = users_lvl_2.merge(targets_lvl_2, on=['user_id', 'item_id'], how='left')

    targets_lvl_2['target'].fillna(0, inplace= True)
    targets_lvl_2.drop('flag', axis=1, inplace=True)

    # feature augmenting and combining
    t_data = data.copy()
    df_augm = targets_lvl_2
    t_data = t_data.merge(item_features[['item_id','department']], on='item_id',how='left')
    # average basket value per user
    avg_basket = (t_data.groupby(['user_id', 'basket_id'])['sales_value'].sum().reset_index()).groupby('user_id')['sales_value'].mean().reset_index()
    avg_basket.columns = ['user_id', 'avg_basket']
    # average quantity bought by the user per category
    avg_user_qty_per_department = (t_data.groupby(['user_id', 'department'])['quantity'].sum().reset_index()).groupby('user_id')['quantity'].mean().reset_index()
    avg_user_qty_per_department.columns = ['user_id', 'avg_user_qty_per_department']
    # number of weeks since the user's last purchase
    last_activity = t_data.groupby(['user_id'])['week_no'].max().reset_index()
    last_activity.columns = ['user_id', 'inactivity']
    last_activity['inactivity'] = MAX_WEEK - last_activity['inactivity']
    # item price
    price = t_data.groupby(['item_id'])[['sales_value','quantity']].sum().reset_index()
    price['price'] = price['sales_value']/price['quantity']
    price.drop(['sales_value','quantity'], axis=1,inplace=True)
    # average quantity of the item purchased within its category
    qty_purch_in_department = (t_data.groupby(['item_id', 'department'])['quantity'].sum().reset_index()).groupby('item_id')['quantity'].mean().reset_index()
    qty_purch_in_department.columns = ['item_id', 'avg_count_item_dep']

    items_emb = recommender.items_embedings()
    users_emb = recommender.user_embedings()

    df_augm = df_augm.merge(avg_basket, on='user_id',how='left')
    df_augm = df_augm.merge(avg_user_qty_per_department, on='user_id',how='left')
    df_augm = df_augm.merge(last_activity, on='user_id',how='left')
    df_augm = df_augm.merge(user_features, on='user_id', how='left')
    df_augm = df_augm.merge(users_emb, on='user_id', how='left')

    df_augm = df_augm.merge(price[['item_id','price']], on='item_id',how='left')
    df_augm = df_augm.merge(qty_purch_in_department, on='item_id',how='left')
    df_augm = df_augm.merge(item_features, on='item_id', how='left')
    df_augm = df_augm.merge(items_emb, on='item_id', how='left')

    return df_augm

In [124]:
cat_feats= ['age_desc', 'marital_status_code', 'income_desc', 'homeowner_desc', 'hh_comp_desc',
       'household_size_desc', 'kid_category_desc','manufacturer',
       'department', 'brand', 'commodity_desc', 'sub_commodity_desc',
       'curr_size_of_product']

In [125]:
data_lvl_2 = preprocessing(data_train_lvl_2)
data_lvl_2.shape


(110790, 61)

In [126]:
data_lvl_2[cat_feats] = data_lvl_2[cat_feats].astype('category')


In [127]:
%%time
data_test_2 = preprocessing(data_val_lvl_2)
data_test_2.shape


Wall time: 20.8 s


(105175, 61)

In [128]:
data_test_2[cat_feats] = data_test_2[cat_feats].astype('category')

In [129]:
data_test_2.head(2)

Unnamed: 0,user_id,item_id,target,avg_basket,avg_user_qty_per_department,inactivity,age_desc,marital_status_code,income_desc,homeowner_desc,...,item10,item11,item12,item13,item14,item15,item16,item17,item18,item19
0,338,825541,0.0,31.249333,17.777778,0,,,,,...,0.02079,0.000753,0.003201,0.006491,0.001149,0.009943,0.0031,0.003446,0.000391,0.010165
1,338,1041259,0.0,31.249333,17.777778,0,,,,,...,0.013776,-6.8e-05,0.002941,0.002281,0.009453,0.006515,0.006137,0.00499,-0.011734,0.002524


In [130]:
data_test_2

Unnamed: 0,user_id,item_id,target,avg_basket,avg_user_qty_per_department,inactivity,age_desc,marital_status_code,income_desc,homeowner_desc,...,item10,item11,item12,item13,item14,item15,item16,item17,item18,item19
0,338,825541,0.0,31.249333,17.777778,0,,,,,...,0.020790,0.000753,0.003201,0.006491,0.001149,0.009943,0.003100,0.003446,0.000391,0.010165
1,338,1041259,0.0,31.249333,17.777778,0,,,,,...,0.013776,-0.000068,0.002941,0.002281,0.009453,0.006515,0.006137,0.004990,-0.011734,0.002524
2,338,999999,0.0,31.249333,17.777778,0,,,,,...,0.012894,0.012746,0.019134,0.026457,0.019328,0.012612,0.015685,0.020780,0.012358,0.021689
3,338,930603,0.0,31.249333,17.777778,0,,,,,...,0.018984,0.002257,0.000400,0.005708,0.003646,0.008600,0.009417,0.000598,0.001112,0.010342
4,338,995785,0.0,31.249333,17.777778,0,,,,,...,0.018001,0.007041,0.013621,0.009948,0.003298,0.012808,0.019093,0.007730,0.007261,0.018080
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105170,832,950894,0.0,30.740000,5.333333,0,,,,,...,0.005473,0.002213,0.013172,0.008284,0.010178,-0.000677,0.009946,0.006324,0.008090,0.000063
105171,832,831628,0.0,30.740000,5.333333,0,,,,,...,0.017119,0.007072,0.003028,-0.006057,0.011859,0.007613,0.013201,0.008140,-0.001519,0.012809
105172,832,9526757,0.0,30.740000,5.333333,0,,,,,...,0.007468,0.004817,0.011809,0.002766,0.014223,0.001093,-0.000729,0.006975,0.001153,0.001175
105173,832,1070820,0.0,30.740000,5.333333,0,,,,,...,0.015168,0.001780,0.006407,0.013054,0.003679,0.006869,-0.003436,0.000867,0.008752,0.009875


In [131]:
def get_recomendations(test_data, test_preds, data_val_lvl_2):
    test_data['predict'] = test_preds

    test_data.sort_values(['user_id', 'predict'], ascending=False, inplace=True)

    result = test_data.groupby('user_id').head(5)

    recs = result.groupby('user_id')['item_id']
    recomendations = []
    for user, preds in recs:
        recomendations.append({'user_id': user, 'recomendations': preds.tolist()})

    recomendations = pd.DataFrame(recomendations)

    result_lvl_2 = data_val_lvl_2.groupby('user_id')['item_id'].unique().reset_index()
    result_lvl_2.columns=['user_id', 'actual']

    result_lvl_2 = result_lvl_2.merge(recomendations)
    
    return result_lvl_2
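
Because the candidate frame can contain repeated user-item rows, `head(5)` may return the same item more than once, as in some rows of the recommendation tables in this notebook. Below is a sketch of a variant that removes duplicates before taking the top k. `top_k_per_user` is a hypothetical helper, not part of `src`.

```python
import pandas as pd

def top_k_per_user(scored, k=5):
    """Rank candidates by score within each user, keeping each item once."""
    ranked = (scored.sort_values(['user_id', 'predict'], ascending=[True, False])
                    .drop_duplicates(['user_id', 'item_id']))
    return (ranked.groupby('user_id')['item_id']
                  .apply(lambda items: items.head(k).tolist())
                  .reset_index(name='recomendations'))

scored = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'item_id': [10, 10, 20, 30, 10, 40],
    'predict': [0.9, 0.9, 0.5, 0.7, 0.2, 0.8],
})
recs = top_k_per_user(scored, k=2)
```

On the toy data user 1 gets items 10 and 30 rather than item 10 twice. The column name `recomendations` keeps the spelling used elsewhere in the notebook.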

In [132]:
data_lvl_2.columns

Index(['user_id', 'item_id', 'target', 'avg_basket',
       'avg_user_qty_per_department', 'inactivity', 'age_desc',
       'marital_status_code', 'income_desc', 'homeowner_desc', 'hh_comp_desc',
       'household_size_desc', 'kid_category_desc', 'user0', 'user1', 'user2',
       'user3', 'user4', 'user5', 'user6', 'user7', 'user8', 'user9', 'user10',
       'user11', 'user12', 'user13', 'user14', 'user15', 'user16', 'user17',
       'user18', 'user19', 'price', 'avg_count_item_dep', 'manufacturer',
       'department', 'brand', 'commodity_desc', 'sub_commodity_desc',
       'curr_size_of_product', 'item0', 'item1', 'item2', 'item3', 'item4',
       'item5', 'item6', 'item7', 'item8', 'item9', 'item10', 'item11',
       'item12', 'item13', 'item14', 'item15', 'item16', 'item17', 'item18',
       'item19'],
      dtype='object')

In [133]:
X_train = data_lvl_2.drop(['user_id', 'item_id','target'], axis=1)
y_train = data_lvl_2['target']  # 1-d target, as expected by the sklearn-style fit

In [134]:
%%time
lgb = LGBMClassifier(objective='binary', max_depth=7, categorical_column=cat_feats)
lgb.fit(X_train, y_train)

train_preds = lgb.predict(X_train)


Wall time: 4.02 s


In [135]:
train_preds

array([0., 1., 0., ..., 0., 0., 0.])

Take the top-k predictions for each user, ranked by predicted probability

In [136]:
train_preds.mean()

0.03924541926166621

In [137]:
len(train_preds)

110790

In [138]:
X_train['preds'] = train_preds

In [139]:
X_train[['user_id', 'item_id','target']] = targets_lvl_2[['user_id', 'item_id','target']]


In [140]:
X_train[X_train.preds == 1]['user_id'].nunique()

864

In [141]:
X_train['user_id'].nunique()

2111

In [142]:
X_test = data_lvl_2.drop(['target'], axis=1)

In [143]:
lgb_test_pred = lgb.predict_proba(X_test.drop(['user_id','item_id'],axis=1))[:,1]

In [144]:
X_test['predict'] = lgb.predict_proba(X_test.drop(['user_id','item_id'],axis=1))[:,1]
X_test['predict']

0         0.157140
1         0.631722
2         0.053971
3         0.166427
4         0.078078
            ...   
110785    0.016575
110786    0.007284
110787    0.011959
110788    0.000814
110789    0.002817
Name: predict, Length: 110790, dtype: float64

In [145]:
result_test_1 = get_recomendations(X_test, lgb_test_pred, data_val_lvl_2)
result_test_1

Unnamed: 0,user_id,actual,recomendations
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[979707, 9297615, 9297615, 5577022, 981760]"
1,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[1070820, 1037863, 1037863, 1037863, 970030]"
2,7,"[840386, 889774, 898068, 909714, 929067, 95347...","[1106523, 1133018, 1058997, 1096036, 1019247]"
3,8,"[835098, 872137, 910439, 924610, 992977, 10412...","[1106523, 1133018, 981760, 1044078, 1092026]"
4,9,"[864335, 990865, 1029743, 9297474, 10457112, 8...","[1126899, 862349, 994928, 849843, 826249]"
...,...,...,...
1910,2496,[6534178],"[1106523, 1106523, 995785, 1133018, 826249]"
1911,2497,"[1016709, 9835695, 1132298, 16809501, 845294, ...","[1055646, 995785, 995785, 981760, 899624]"
1912,2498,"[15716530, 834484, 901776, 914190, 958382, 972...","[1126899, 979707, 1133018, 1058997, 879755]"
1913,2499,"[867188, 877580, 902396, 914190, 951590, 95813...","[1126899, 899624, 5568378, 5568378, 5568378]"


In [146]:
result_test_1.apply(lambda row: precision_at_k(row['recomendations'], row['actual'], k=5), axis=1).mean()

0.1470496083550898

LAMA (LightAutoML) as the second-level model

In [158]:
TASK = Task('reg', loss='mse', metric=mean_quantile_error, greater_is_better=False)
TIMEOUT = 100
N_THREADS = 4
MEMORY_LIMIT = 7
N_FOLDS = 5
RANDOM_STATE = 21
TARGET_NAME = 'target'
TEST_SIZE=0.2

In [159]:
roles = {'target': TARGET_NAME, 'drop': ['user_id', 'item_id']}

In [160]:
lama_model = TabularAutoML(task=TASK,
                            timeout=TIMEOUT,
                            cpu_limit = N_THREADS,
                            memory_limit = MEMORY_LIMIT,
                            gpu_ids='all',
                            reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
                             
                            general_params={'use_algos': [ ['lgb_tuned', 'cb_tuned','xgb'], ['lgb_tuned'] ]},
                             
                            tuning_params={'max_tuning_iter': 15},
)

In [161]:
%%time
train_pred = lama_model.fit_predict(data_lvl_2, roles = roles)

INFO:optuna.storages._in_memory:A new study created in memory with name: no-name-6bc4d226-8764-4b40-89a7-968b99d3eb54
INFO:optuna.study.study:Trial 0 finished with value: -0.051862075600984346 and parameters: {'feature_fraction': 0.6872700594236812, 'num_leaves': 244}. Best is trial 0 with value: -0.051862075600984346.


Wall time: 3min 40s


In [162]:
train_preds = lama_model.predict(targets_lvl_2.drop('target',axis=1))

In [163]:
train_preds

array([[ 0.35238102],
       [ 0.16125195],
       [ 0.4985717 ],
       ...,
       [ 0.00186583],
       [-0.02600814],
       [-0.01780427]], dtype=float32)

In [164]:
test_preds = lama_model.predict(data_test_2.drop('target',axis=1))


In [165]:
test_preds = test_preds.data

In [166]:
result_test_2 = get_recomendations(data_test_2, test_preds, data_val_lvl_2)
result_test_2

Unnamed: 0,user_id,actual,recomendations
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[5577022, 979707, 9297615, 9297615, 7409918]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[1106523, 13842214, 1055646, 962229, 1092026]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[1037863, 1037863, 1037863, 845208, 970030]"
3,7,"[840386, 889774, 898068, 909714, 929067, 95347...","[1106523, 1133018, 1058997, 12301100, 995965]"
4,8,"[835098, 872137, 910439, 924610, 992977, 10412...","[1044078, 1092026, 1106523, 1133018, 1105301]"
...,...,...,...
2036,2496,[6534178],"[1106523, 1106523, 826249, 995785, 1133018]"
2037,2497,"[1016709, 9835695, 1132298, 16809501, 845294, ...","[1055646, 970202, 995785, 995785, 1040807]"
2038,2498,"[15716530, 834484, 901776, 914190, 958382, 972...","[1092026, 5568378, 879635, 1065593, 9297615]"
2039,2499,"[867188, 877580, 902396, 914190, 951590, 95813...","[5568378, 5568378, 5568378, 5568378, 899624]"


In [167]:
result_test_2.apply(lambda row: precision_at_k(row['recomendations'], row['actual'], k=5), axis=1).mean()

0.1521803037726585

In [157]:
result_test_2.to_csv('recomendations_als.csv')
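
The assignment asks for a `recommendations.csv` file with a `user_id` column and a list of recommended item ids. Below is a sketch of writing only those two columns, assuming a frame shaped like `result_test_2`. The toy frame and the temporary path are illustrative.

```python
import ast
import os
import tempfile

import pandas as pd

# Toy frame with the same columns as result_test_2.
result = pd.DataFrame({
    'user_id': [1, 3],
    'actual': [[821867, 834484], [835476]],
    'recomendations': [[5577022, 979707], [1106523, 13842214]],
})

path = os.path.join(tempfile.mkdtemp(), 'recommendations.csv')
# Drop the validation-only `actual` column and the numeric index.
result[['user_id', 'recomendations']].to_csv(path, index=False)

# Lists are stored as their string form; parse them back when reading.
loaded = pd.read_csv(path, converters={'recomendations': ast.literal_eval})
```

Writing with `index=False` avoids the extra `Unnamed: 0` column that appears when such a file is read back.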