# Webinar 6. Two-level recommendation models


### Why two levels?
- Classical classification models (lightgbm) often outperform recommender models (als, lightfm)
- There is a lot of data and a lot of predictions to make (# items * # users) --> lightgbm cannot handle that volume
- Recommender models can!

We select the top-N (200) *candidates* with a simple model (als) --> rerank them with a complex model (lightgbm)
and keep the top-k (10).
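
The two-stage idea can be sketched on toy data. Everything below (the random scores, `cheap_scores`, `expensive_score`) is a hypothetical stand-in for ALS and lightgbm, not the models used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, top_n, top_k = 1000, 200, 10

# Stage 1: a cheap model scores every item for one user (stand-in for ALS).
cheap_scores = rng.normal(size=n_items)
candidates = np.argsort(-cheap_scores)[:top_n]  # indices of the top-N candidates

# Stage 2: an expensive model scores only the candidates (stand-in for lightgbm).
def expensive_score(item_ids):
    # Hypothetical reranker: the cheap score plus an item-specific correction.
    return cheap_scores[item_ids] + 0.5 * np.sin(item_ids)

rerank_scores = expensive_score(candidates)
final = candidates[np.argsort(-rerank_scores)[:top_k]]  # top-k after reranking
```

The expensive model is evaluated on 200 items per user instead of all 1000, which is the whole point of the first stage.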

---

### Business applications

If you have not yet read the [article](https://habr.com/ru/company/hh/blog/347276/) on recommender systems and search at hh.ru, be sure to read it.

- Two-level systems are used in many companies
- There are often more than 2 levels
- We move from simpler heuristics/models to more complex ones
- Features from first-level models (embeddings, biases from ALS) can be used in later models

Solutions based on two-level recommendations also took all top-10 places in the X5 Retail Hero competition.

- [Slides](https://github.com/aprotopopov/retailhero_recommender/blob/master/slides/retailhero_recommender.pdf) and [code](https://github.com/aprotopopov/retailhero_recommender) of the 2nd-place solution
- [Code](https://vk.com/away.php?utf=1&to=https%3A%2F%2Fgithub.com%2Fmike-chesnokov%2Fx5_retailhero_2020_recs) of the 9th-place solution

### How do we select candidates?

There are many options. *MainRecommender* will help here. So far it implements only some of the possible ways to generate candidates.

- Generate the top-k candidates
- Measure candidate quality with **recall@k**
- recall@k shows what share of the purchased items our model managed to surface (recommend)
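
A minimal sketch of recall@k as described in the list above. The function name `recall_at_k_sketch` is hypothetical. The notebook itself imports `recall_at_k` from `src.metrics`, whose exact implementation may differ.

```python
def recall_at_k_sketch(recommended, bought, k=50):
    """Share of bought items that appear among the first k recommendations."""
    bought = set(bought)
    if not bought:
        return 0.0
    hits = len(set(recommended[:k]) & bought)
    return hits / len(bought)

# Of the 4 bought items, only item 20 is within the first 3 recommendations.
example = recall_at_k_sketch([10, 20, 30, 40], [20, 40, 50, 60], k=3)
```

Here `example` equals 0.25: one hit out of four purchased items.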

----

Pipeline:
1. Recommend 50 candidate items with classical methods
2. Evaluate recall@k of the candidate output (the output of the first-level models)
3. Build a user-item dataset from the candidate recommendations
4. Label each row of this dataset with a target (bought / not bought) from the interaction history
5. Train lightGBM on this dataset to predict whether the user will buy the given item 

# Practical part

The code for src, utils, and metrics can be downloaded from [this](https://github.com/geangohn/recsys-tutorial) GitHub repository

In [None]:
!pip install implicit

Collecting implicit
  Downloading implicit-0.4.8.tar.gz (1.1 MB)
     |████████████████████████████████| 1.1 MB 4.2 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: implicit
  Building wheel for implicit (PEP 517) ... done
  Created wheel for implicit: filename=implicit-0.4.8-cp37-cp37m-linux_x86_64.whl size=4606675 sha256=01399d58548193bae63ffe68cd55b682f61cc950f1c0ccc989752bb7bac839f5
  Stored in directory: /root/.cache/pip/wheels/88/e6/34/25e73cccbaf1a961154bb562a5f86123b68fdbf40e306073d6
Successfully built implicit
Installing collected packages: implicit
Successfully installed implicit-0.4.8


In [None]:
from google.colab import drive
drive.mount('/content/drive')
root = '/content/drive/My Drive/Colab Notebooks/rec_sys/data/'

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrices
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als
from implicit.nearest_neighbours import bm25_weight, tfidf_weight


# Second-level model
from lightgbm import LGBMClassifier



In [None]:
!pip install lightautoml

Collecting lightautoml
  Downloading LightAutoML-0.3.2-py3-none-any.whl (294 kB)

In [None]:
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.tasks.common_metric import mean_quantile_error


In [None]:
import os, sys
#module_path = os.path.abspath(os.path.join(os.pardir))
#if module_path not in sys.path:
#     sys.path.append(module_path)
sys.path.append('/content/drive/My Drive/Colab Notebooks/rec_sys')

# Our own helper functions
from src.metrics import precision_at_k, recall_at_k
from src.utils import prefilter_items
from src.recommenders2 import MainRecommender

In [None]:
data = pd.read_csv(root+'retail_train.csv')
item_features = pd.read_csv(root+'product.csv')
user_features = pd.read_csv(root+'hh_demographic.csv')

# column processing
#data.columns = [col.lower() for col in data.columns]
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': 'item_id'}, inplace=True)
user_features.rename(columns={'household_key': 'user_id'}, inplace=True)


# The train/validation scheme matters!
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks -- 
# tune the size of the 2nd dataset (6 weeks) --> learning curve (recall@k as a function of dataset size)
val_lvl_1_size_weeks = 6
val_lvl_2_size_weeks = 3

data_train_lvl_1 = data[data['week_no'] < data['week_no'].max() - (val_lvl_1_size_weeks + val_lvl_2_size_weeks)]
data_val_lvl_1 = data[(data['week_no'] >= data['week_no'].max() - (val_lvl_1_size_weeks + val_lvl_2_size_weeks)) &
                      (data['week_no'] < data['week_no'].max() - (val_lvl_2_size_weeks))]

data_train_lvl_2 = data_val_lvl_1.copy()  # For clarity. We will modify it later, so the two will diverge
data_val_lvl_2 = data[data['week_no'] >= data['week_no'].max() - val_lvl_2_size_weeks]

data_train_lvl_1.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
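
The split above can be wrapped in a small helper and checked on toy data. This is a sketch that mirrors the same comparisons. The function name `time_split` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def time_split(df, lvl_1_weeks=6, lvl_2_weeks=3, week_col='week_no'):
    """Split interactions into three consecutive periods by week number."""
    max_week = df[week_col].max()
    lvl_1_start = max_week - (lvl_1_weeks + lvl_2_weeks)
    lvl_2_start = max_week - lvl_2_weeks
    train_1 = df[df[week_col] < lvl_1_start]
    val_1 = df[(df[week_col] >= lvl_1_start) & (df[week_col] < lvl_2_start)]
    val_2 = df[df[week_col] >= lvl_2_start]
    return train_1, val_1, val_2

# One row per week, weeks 1..20
toy = pd.DataFrame({'week_no': range(1, 21), 'user_id': 1})
train_1, val_1, val_2 = time_split(toy)
```

On this toy frame the middle block covers weeks 11 to 16 (6 weeks), while the last block covers weeks 17 to 20. Because the boundary uses `>= max - 3`, the final period spans four distinct week numbers rather than three.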


In [None]:
n_items_before = data_train_lvl_1['item_id'].nunique()

data_train_lvl_1 = prefilter_items(data_train_lvl_1, item_features=item_features, take_n_popular=5000)

n_items_after = data_train_lvl_1['item_id'].nunique()
print('Decreased # items from {} to {}'.format(n_items_before, n_items_after))

Decreased # items from 83685 to 5001


In [None]:
recommender = MainRecommender(data_train_lvl_1)



  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/5001 [00:00<?, ?it/s]

In [None]:
recommender

<src.recommenders2.MainRecommender at 0x7f5983785b50>

### Ways to obtain candidates

All of these options can later be combined into one

(!) If a model recommends fewer than N items, the recommendations are padded with the most popular items up to N
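
The padding rule can be sketched as follows. `pad_with_popular` is a hypothetical helper written for illustration. How MainRecommender implements this internally is not shown in this notebook.

```python
def pad_with_popular(recs, popular, n):
    """Extend recs with popular items not already present until it has n items."""
    recs = list(recs)[:n]
    seen = set(recs)
    for item in popular:
        if len(recs) >= n:
            break
        if item not in seen:  # avoid recommending the same item twice
            recs.append(item)
            seen.add(item)
    return recs

# Item 7 is already recommended, so padding continues with items 1 and 2.
padded = pad_with_popular([5, 7], popular=[7, 1, 2, 3], n=4)
```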

In [None]:
recommender.get_als_recommendations(2375, N=5)

[899624, 871756, 1044078, 1106523, 844179]

In [None]:
recommender.get_own_recommendations(2375, N=5)

[948640, 918046, 847962, 907099, 873980]

In [None]:
recommender.get_similar_items_recommendation(2375, N=5)

[1046545, 1044078, 844179, 1078652, 15778319]

In [None]:
recommender.get_similar_users_recommendation(2375, N=5)

[1097398, 1096573, 835351, 861494, 821741]

In [None]:
recommender.overall_top_purchases[:5]

[1029743, 1106523, 5569230, 916122, 844179]

### Measuring recall@k

This will be in the homework: 

A) Try different ways of generating candidates. Which of them give the highest recall@k?
- For now we select 50 candidates (k=50)
- Quality is measured on data_val_lvl_1: the 6 weeks following the training period

Do own recommendations + top-popular give the best recall?  

B)* How does recall@k depend on k? Plot this dependence for one candidate generation scheme for k = {20, 50, 100, 200, 500}  
C)* Based on the previous question, which value of k do you think is the most reasonable?


In [None]:
result_lvl_1 = data_val_lvl_1.groupby('user_id')['item_id'].unique().reset_index()
result_lvl_1.columns=['user_id', 'actual']
result_lvl_1.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,2,"[15830248, 838136, 839656, 861272, 866211, 870..."


In [None]:
users_lvl_1 = pd.DataFrame(data_train_lvl_1['user_id'].unique(),columns = ['user_id'])

In [None]:
K_num = 50
result_lvl_1['als_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_als_recommendations(x, N=K_num))
result_lvl_1['own_rec'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_own_recommendations(x, N=K_num))
result_lvl_1['sim_items'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_similar_items_recommendation(x, N=K_num))
result_lvl_1['sim_users'] = users_lvl_1['user_id'].apply(lambda x: recommender.get_similar_users_recommendation(x, N=K_num))

In [None]:
result_lvl_1.head(3)

Unnamed: 0,user_id,actual,als_rec,own_rec,sim_items,sim_users
0,1,"[853529, 865456, 867607, 872137, 874905, 87524...","[899624, 871756, 1044078, 1106523, 844179, 556...","[948640, 918046, 847962, 907099, 873980, 88469...","[1046545, 1044078, 844179, 1078652, 15778319, ...","[1097398, 1096573, 835351, 861494, 821741, 714..."
1,2,"[15830248, 838136, 839656, 861272, 866211, 870...","[5582712, 940947, 941198, 826597, 12731714, 10...","[1101378, 8090570, 857176, 947013, 1065979, 10...","[1074754, 865026, 1061688, 12301109, 901062, 9...","[1112825, 9527417, 963365, 1089568, 1115800, 9..."
2,4,"[883932, 970760, 1035676, 1055863, 1097610, 67...","[1102416, 963686, 5569172, 948670, 951164, 110...","[963686, 1057168, 908314, 9859017, 1120261, 10...","[959345, 1074754, 883616, 999779, 1038663, 861...","[998519, 5566800, 9392700, 5572803, 979674, 11..."
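
One thing worth checking in the table above: the recommendation columns are computed from `users_lvl_1['user_id']` and then assigned to `result_lvl_1`. pandas aligns such an assignment on the row index, not on `user_id`, and the first row (user 1) indeed starts with the same ALS items that were returned for user 2375 earlier. The toy sketch below reproduces the effect. `fake_recs` is a hypothetical stand-in for the recommender methods.

```python
import pandas as pd

result = pd.DataFrame({'user_id': [1, 2, 4]})          # validation users, sorted
train_users = pd.DataFrame({'user_id': [2375, 1, 2]})  # train users, order of appearance

def fake_recs(user_id):
    # Hypothetical stand-in for recommender.get_als_recommendations
    return [user_id * 10, user_id * 10 + 1]

# A Series built from another frame is aligned by index position:
result['by_index'] = train_users['user_id'].apply(fake_recs)
# Applying to the frame's own user_id column keeps rows and users in sync:
result['by_user'] = result['user_id'].apply(fake_recs)
```

In the first row, `by_index` holds the items of user 2375 while `by_user` holds the items of user 1. In the real notebook, applying the recommender to validation users would probably also need a warm-start filter, since users absent from the training data may be unknown to the model.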


## Computing recall to choose the first-level model (the candidate selection model)

In [None]:
def calculate_recall_k(data, K): #data - pandas df
    for column in data.columns[2:]:
        yield column, data.apply(lambda row: recall_at_k(row[column], row['actual'], k=K), axis=1).mean()

In [None]:
recall_results = pd.DataFrame(sorted(calculate_recall_k(result_lvl_1, 50), key=lambda x: x[1],reverse=True), columns = ['Candidate_model','Recall'])
recall_results

Unnamed: 0,Candidate_model,Recall
0,sim_items,0.015172
1,als_rec,0.012269
2,own_rec,0.010785
3,sim_users,0.002375


# Baseline: first-level models, precision@5

In [None]:
def calculate_precision_k(data, K): #data - pandas df
    for column in data.columns[2:]:
        yield column, data.apply(lambda row: precision_at_k(row[column], row['actual'], k=K), axis=1).mean()

In [None]:
precision_results = pd.DataFrame(sorted(calculate_precision_k(result_lvl_1, 5), key=lambda x: x[1],reverse=True), columns = ['Model','Precision'])
precision_results

Unnamed: 0,Model,Precision
0,als_rec,0.026277
1,sim_items,0.016992
2,own_rec,0.004457
3,sim_users,0.002971
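
For reference, a minimal sketch of precision@k. The name `precision_at_k_sketch` is hypothetical. The notebook uses `precision_at_k` from `src.metrics`, which may handle edge cases differently.

```python
def precision_at_k_sketch(recommended, bought, k=5):
    """Share of the first k recommendations that were actually bought."""
    top = recommended[:k]
    if not top:
        return 0.0
    bought = set(bought)
    return sum(item in bought for item in top) / len(top)

# Items 2 and 5 were bought, so 2 of the 5 recommendations are hits.
p = precision_at_k_sketch([1, 2, 3, 4, 5], [2, 5, 9], k=5)
```

With this definition a repeated item in the recommendation list is counted once per occurrence, so duplicated recommendations can inflate or deflate the value.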


### The best baseline metric is MainRecommender ALS (ALS recommendations with the default bm25 weighting, padded with the most popular items): precision@5 = 0.026277. The similar_items method gives the best recall, so we use it to select candidates for the second model.

In [None]:
## Add top-popular items, stack several candidate lists

### Training the second-level model on the selected candidates

- Train on data_train_lvl_2
- Train *only* on the selected candidates: generate the top-50 candidates with get_similar_items_recommendation. If fewer than 50 items are found for a user, the recommendations are padded with the most popular items

In [None]:
users_lvl_2 = pd.DataFrame(data_train_lvl_2['user_id'].unique())
users_lvl_2.columns = ['user_id']

# Warm-start users only for now
train_users = data_train_lvl_1['user_id'].unique()
users_lvl_2 = users_lvl_2[users_lvl_2['user_id'].isin(train_users)]

users_lvl_2['candidates'] = users_lvl_2['user_id'].apply(lambda x: recommender.get_similar_items_recommendation(x, N=50))
s = users_lvl_2.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'item_id'

users_lvl_2 = users_lvl_2.drop('candidates', axis=1).join(s)
users_lvl_2['flag'] = 1
targets_lvl_2 = data_train_lvl_2[['user_id', 'item_id']].copy()
targets_lvl_2['target'] = 1  # purchases only

targets_lvl_2 = users_lvl_2.merge(targets_lvl_2, on=['user_id', 'item_id'], how='left')

targets_lvl_2['target'].fillna(0, inplace= True)
targets_lvl_2.drop('flag', axis=1, inplace=True)

In [None]:
users_lvl_2.shape[0]

107550

In [None]:
users_lvl_2['user_id'].nunique()

2151

In [None]:
targets_lvl_2.shape

(109412, 3)

In [None]:
targets_lvl_2['target'].mean()

0.057452564618140606
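
Note that the row count grows from 107550 candidate pairs to 109412 rows after the merge. A left merge adds rows only when the right-hand side repeats a key, so some (user_id, item_id) pairs occur more than once in the purchase history. A sketch of the same labeling step with deduplicated purchases is shown below. The toy frames are assumptions for illustration, and this is not the code that produced the numbers above.

```python
import pandas as pd

candidates = pd.DataFrame({'user_id': [1, 2],
                           'candidates': [[10, 20], [20, 30]]})
purchases = pd.DataFrame({'user_id': [1, 1, 2],
                          'item_id': [10, 10, 99]})  # user 1 bought item 10 twice

# One row per (user, candidate item)
pairs = candidates.explode('candidates').rename(columns={'candidates': 'item_id'})
pairs['item_id'] = pairs['item_id'].astype('int64')

# Deduplicate purchases so the left merge keeps exactly one row per pair
positives = purchases[['user_id', 'item_id']].drop_duplicates().assign(target=1)

labeled = pairs.merge(positives, on=['user_id', 'item_id'], how='left')
labeled['target'] = labeled['target'].fillna(0).astype(int)
```

After this step `labeled` has exactly as many rows as `pairs`, and only the pair (1, 10) is marked as a purchase.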

## Feature generation. We add user and item features, plus their ALS embeddings from a built-in method of the MainRecommender class (added it myself)

**user_id features:**
    - Average basket value
    - Average spend on 1 item in each category
    - Number of purchases in each category
    - Purchase frequency, times per month
    - Share of purchases on weekends
    - Share of purchases in the morning/afternoon/evening

**item_id features**:
    - Number of purchases per week
    - Average number of purchases of 1 item in the category per week
    - (Number of purchases per week) / (Average number of purchases of 1 item in the category per week)
    - Price (can be computed from retail_train.csv)
    - Price / Average price of an item in the category
    
**user_id - item_id pair features**
    - (Average spend on 1 item in each category (take the category of item_id)) - (Price of item_id)
    - (Number of purchases by the user in the given category per week) - (Average number of purchases by all users in the given category per week)
    - (Number of purchases by the user in the given category per week) / (Average number of purchases by all users in the given category per week)
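
As an illustration of the item features listed above, here is a sketch of "Price" and "Price / Average price of an item in the category" on a toy transaction table. The column names follow the dataset, but the data and the feature name `price_to_dep_mean` are made up for the example.

```python
import pandas as pd

tx = pd.DataFrame({
    'user_id':     [1, 1, 2, 2],
    'item_id':     [10, 11, 10, 12],
    'department':  ['GROCERY', 'GROCERY', 'GROCERY', 'DELI'],
    'quantity':    [2, 1, 1, 1],
    'sales_value': [4.0, 3.0, 2.0, 5.0],
})

# Item price = total sales value / total quantity
item = tx.groupby(['item_id', 'department'], as_index=False)[['sales_value', 'quantity']].sum()
item['price'] = item['sales_value'] / item['quantity']

# Price relative to the mean price of items in the same department
dep_mean = item.groupby('department')['price'].transform('mean')
item['price_to_dep_mean'] = item['price'] / dep_mean
```

A value above 1 means the item is more expensive than the average item in its department. The resulting frame can be joined to the candidate pairs on `item_id`.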

## Feature generation

In [None]:
#Max week
MAX_WEEK = data['week_no'].max()

In [None]:
# Transaction data
t_data = data_train_lvl_2.copy()
df_augm = targets_lvl_2
t_data = t_data.merge(item_features[['item_id','department']], on='item_id',how='left')

In [None]:
# average basket value per user
avg_basket = (t_data.groupby(['user_id', 'basket_id'])['sales_value'].sum().reset_index()).groupby('user_id')['sales_value'].mean().reset_index()
avg_basket.columns = ['user_id', 'avg_basket']

In [None]:
# Average quantity purchased by the user per department
avg_user_qty_per_department = (t_data.groupby(['user_id', 'department'])['quantity'].sum().reset_index()).groupby('user_id')['quantity'].mean().reset_index()
avg_user_qty_per_department.columns = ['user_id', 'avg_user_qty_per_department']

In [None]:
# Number of weeks since the user's last purchase
last_activity = t_data.groupby(['user_id'])['week_no'].max().reset_index()
last_activity.columns = ['user_id', 'inactivity']
last_activity['inactivity'] = MAX_WEEK - last_activity['inactivity']

In [None]:
# item price
price = t_data.groupby(['item_id'])[['sales_value', 'quantity']].sum().reset_index()
price['price'] = price['sales_value']/price['quantity']
price.drop(['sales_value','quantity'], axis=1,inplace=True)






In [None]:
# Average quantity purchased of 1 item within its department
qty_purch_in_department = (t_data.groupby(['item_id', 'department'])['quantity'].sum().reset_index()).groupby('item_id')['quantity'].mean().reset_index()
qty_purch_in_department.columns = ['item_id', 'avg_count_item_dep']

In [None]:
items_emb = recommender.items_embedings()
users_emb = recommender.user_embedings()

In [None]:
df_augm = df_augm.merge(avg_basket, on='user_id',how='left')
df_augm = df_augm.merge(avg_user_qty_per_department, on='user_id',how='left')
df_augm = df_augm.merge(last_activity, on='user_id',how='left')
df_augm = df_augm.merge(user_features, on='user_id', how='left')
df_augm = df_augm.merge(users_emb, on='user_id', how='left')

In [None]:
df_augm = df_augm.merge(price[['item_id','price']], on='item_id',how='left')
df_augm = df_augm.merge(qty_purch_in_department, on='item_id',how='left')
df_augm = df_augm.merge(item_features, on='item_id', how='left')
df_augm = df_augm.merge(items_emb, on='item_id', how='left')

In [None]:
df_augm.shape

(109412, 61)

In [None]:
df_augm.head()

Unnamed: 0,user_id,item_id,target,avg_basket,avg_user_qty_per_department,inactivity,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user0,user1,user2,user3,user4,user5,user6,user7,user8,user9,user10,user11,user12,user13,user14,user15,user16,user17,user18,user19,price,avg_count_item_dep,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,item0,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,item11,item12,item13,item14,item15,item16,item17,item18,item19
0,2070,1074754,0.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,-0.351166,4.982241,-5.68894,-5.264252,2.756481,6.626905,-2.647804,7.997111,2.834755,15.22492,1.299232,-3.03147,7.495206,-4.875326,-1.963318,1.260753,7.779292,4.163882,1.150063,3.741691,2.623429,35.0,1075.0,GROCERY,National,COOKIES/CONES,SANDWICH COOKIES,18 OZ,0.00153,0.009203,0.002945,0.013068,0.009493,0.008357,0.004851,0.010391,-0.004096,0.010596,0.00308,0.009981,0.013004,0.004926,0.008039,0.006033,0.012813,0.002828,0.007759,0.011154
1,2070,834117,1.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,-0.351166,4.982241,-5.68894,-5.264252,2.756481,6.626905,-2.647804,7.997111,2.834755,15.22492,1.299232,-3.03147,7.495206,-4.875326,-1.963318,1.260753,7.779292,4.163882,1.150063,3.741691,3.961429,70.0,69.0,GROCERY,Private,WATER - CARBONATED/FLVRD DRINK,NON-CRBNTD DRNKING/MNERAL WATE,405.6 OZ,0.010308,0.013603,0.008107,0.017457,0.00848,-0.000568,0.008023,0.003728,0.00933,0.014328,0.010369,0.010456,0.009388,0.010321,-0.001326,0.014896,0.011944,0.005493,0.007047,0.010671
2,2070,950202,0.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,-0.351166,4.982241,-5.68894,-5.264252,2.756481,6.626905,-2.647804,7.997111,2.834755,15.22492,1.299232,-3.03147,7.495206,-4.875326,-1.963318,1.260753,7.779292,4.163882,1.150063,3.741691,3.24375,8.0,69.0,DELI,Private,SANDWICHES,SANDWICHES - (COLD),,0.001707,0.010348,0.000246,-0.008168,0.000958,0.003499,0.003874,0.004472,-0.005055,0.011642,0.00825,1.1e-05,3.5e-05,-0.001752,0.00564,0.007829,0.007748,0.004278,-0.000112,-0.00628
3,2070,896862,0.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,-0.351166,4.982241,-5.68894,-5.264252,2.756481,6.626905,-2.647804,7.997111,2.834755,15.22492,1.299232,-3.03147,7.495206,-4.875326,-1.963318,1.260753,7.779292,4.163882,1.150063,3.741691,2.564918,61.0,1425.0,MEAT-PCKGD,National,BACON,ECONOMY,1 LB,0.008252,0.006021,0.00344,-0.000289,0.016306,0.014222,0.021572,0.010293,0.005467,0.00429,0.007416,-0.001788,0.010506,0.010282,0.001235,0.001552,0.006282,0.007574,-0.002281,0.004111
4,2070,857215,0.0,14.355581,1755.0,4,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown,-0.351166,4.982241,-5.68894,-5.264252,2.756481,6.626905,-2.647804,7.997111,2.834755,15.22492,1.299232,-3.03147,7.495206,-4.875326,-1.963318,1.260753,7.779292,4.163882,1.150063,3.741691,,,3020.0,MEAT,National,PORK,LOIN - CHOPS BONELESS,,0.004908,0.00997,-0.005454,0.006295,0.011874,0.004349,0.003341,0.004118,0.013964,0.00755,0.01351,-0.007624,0.012444,0.000708,-0.006718,-0.007277,0.001595,0.010028,0.005139,0.005538


In [None]:
targets_lvl_2 = df_augm

In [None]:
def preprocessing(data):
    #Max week
    MAX_WEEK = data['week_no'].max()
    users_lvl_2 = pd.DataFrame(data['user_id'].unique())
    users_lvl_2.columns = ['user_id']

    train_users = data_train_lvl_1['user_id'].unique()
    users_lvl_2 = users_lvl_2[users_lvl_2['user_id'].isin(train_users)]

    users_lvl_2['candidates'] = users_lvl_2['user_id'].apply(lambda x: recommender.get_similar_items_recommendation(x, N=50))
    s = users_lvl_2.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
    s.name = 'item_id'

    users_lvl_2 = users_lvl_2.drop('candidates', axis=1).join(s)
    users_lvl_2['flag'] = 1
    # Note: targets come from the global data_train_lvl_2, not from the `data` argument
    targets_lvl_2 = data_train_lvl_2[['user_id', 'item_id']].copy()
    targets_lvl_2['target'] = 1  # purchases only

    targets_lvl_2 = users_lvl_2.merge(targets_lvl_2, on=['user_id', 'item_id'], how='left')

    targets_lvl_2['target'].fillna(0, inplace= True)
    targets_lvl_2.drop('flag', axis=1, inplace=True)

    # feature augmenting and combining
    t_data = data.copy()
    df_augm = targets_lvl_2
    t_data = t_data.merge(item_features[['item_id','department']], on='item_id',how='left')
    # average basket value per user
    avg_basket = (t_data.groupby(['user_id', 'basket_id'])['sales_value'].sum().reset_index()).groupby('user_id')['sales_value'].mean().reset_index()
    avg_basket.columns = ['user_id', 'avg_basket']
    # Average quantity purchased by the user per department
    avg_user_qty_per_department = (t_data.groupby(['user_id', 'department'])['quantity'].sum().reset_index()).groupby('user_id')['quantity'].mean().reset_index()
    avg_user_qty_per_department.columns = ['user_id', 'avg_user_qty_per_department']
    # Number of weeks since the user's last purchase
    last_activity = t_data.groupby(['user_id'])['week_no'].max().reset_index()
    last_activity.columns = ['user_id', 'inactivity']
    last_activity['inactivity'] = MAX_WEEK - last_activity['inactivity']
    # item price
    price = t_data.groupby(['item_id'])[['sales_value', 'quantity']].sum().reset_index()
    price['price'] = price['sales_value']/price['quantity']
    price.drop(['sales_value','quantity'], axis=1,inplace=True)
    # Average quantity purchased of 1 item within its department
    qty_purch_in_department = (t_data.groupby(['item_id', 'department'])['quantity'].sum().reset_index()).groupby('item_id')['quantity'].mean().reset_index()
    qty_purch_in_department.columns = ['item_id', 'avg_count_item_dep']

    items_emb = recommender.items_embedings()
    users_emb = recommender.user_embedings()

    df_augm = df_augm.merge(avg_basket, on='user_id',how='left')
    df_augm = df_augm.merge(avg_user_qty_per_department, on='user_id',how='left')
    df_augm = df_augm.merge(last_activity, on='user_id',how='left')
    df_augm = df_augm.merge(user_features, on='user_id', how='left')
    df_augm = df_augm.merge(users_emb, on='user_id', how='left')

    df_augm = df_augm.merge(price[['item_id','price']], on='item_id',how='left')
    df_augm = df_augm.merge(qty_purch_in_department, on='item_id',how='left')
    df_augm = df_augm.merge(item_features, on='item_id', how='left')
    df_augm = df_augm.merge(items_emb, on='item_id', how='left')

    return df_augm

In [None]:
cat_feats= ['age_desc', 'marital_status_code', 'income_desc', 'homeowner_desc', 'hh_comp_desc',
       'household_size_desc', 'kid_category_desc','manufacturer',
       'department', 'brand', 'commodity_desc', 'sub_commodity_desc',
       'curr_size_of_product']

In [None]:
data_lvl_2 = preprocessing(data_train_lvl_2)
data_lvl_2.shape





(109412, 61)

In [None]:
data_lvl_2[cat_feats] = data_lvl_2[cat_feats].astype('category')


In [None]:
data_test_2 = preprocessing(data_val_lvl_2)
data_test_2.shape





(103839, 61)

In [None]:
data_test_2[cat_feats] = data_test_2[cat_feats].astype('category')

In [None]:
data_test_2.head(2)

Unnamed: 0,user_id,item_id,target,avg_basket,avg_user_qty_per_department,inactivity,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user0,user1,user2,user3,user4,user5,user6,user7,user8,user9,user10,user11,user12,user13,user14,user15,user16,user17,user18,user19,price,avg_count_item_dep,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,item0,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,item11,item12,item13,item14,item15,item16,item17,item18,item19
0,338,952163,0.0,31.249333,17.777778,0,,,,,,,,9.976629,-0.845635,-2.177976,-0.4995,10.481974,6.647415,-0.666489,4.956534,-2.375327,-5.726499,8.511136,-3.590917,0.488225,-9.144132,-1.430664,8.756859,-2.631341,-0.845286,-4.183084,2.461438,,,69.0,PRODUCE,Private,POTATOES,POTATOES RUSSET (BULK&BAG),5 LB,0.010701,0.008569,0.005877,0.005629,0.018151,0.012672,0.015969,0.006975,0.00314,0.007838,0.009924,0.002384,0.008565,0.01447,0.000567,0.007493,0.009607,0.008212,0.002927,0.009365
1,338,8090440,0.0,31.249333,17.777778,0,,,,,,,,9.976629,-0.845635,-2.177976,-0.4995,10.481974,6.647415,-0.666489,4.956534,-2.375327,-5.726499,8.511136,-3.590917,0.488225,-9.144132,-1.430664,8.756859,-2.631341,-0.845286,-4.183084,2.461438,7.130909,22.0,69.0,DELI,Private,CHICKEN/POULTRY,CHIX: ROTISSERIE (HOT),48OZ,0.009893,0.012002,0.004965,0.002938,0.010651,0.012009,-0.001561,0.005882,0.01202,0.011422,0.008172,0.006357,0.010233,-0.008466,0.008523,0.018981,0.002752,-0.000458,0.00644,0.012033


In [None]:
data_test_2

(109412, 61)

In [None]:
def get_recomendations(test_data, test_preds, data_val_lvl_2):
    test_data['predict'] = test_preds

    test_data.sort_values(['user_id', 'predict'], ascending=False, inplace=True)

    result = test_data.groupby('user_id').head(5)

    recs = result.groupby('user_id')['item_id']
    recomendations = []
    for user, preds in recs:
        recomendations.append({'user_id': user, 'recomendations': preds.tolist()})

    recomendations = pd.DataFrame(recomendations)

    result_lvl_2 = data_val_lvl_2.groupby('user_id')['item_id'].unique().reset_index()
    result_lvl_2.columns=['user_id', 'actual']

    result_lvl_2 = result_lvl_2.merge(recomendations)
    
    return result_lvl_2

In [None]:
data_lvl_2.columns

Index(['user_id', 'item_id', 'target', 'avg_basket',
       'avg_user_qty_per_department', 'inactivity', 'age_desc',
       'marital_status_code', 'income_desc', 'homeowner_desc', 'hh_comp_desc',
       'household_size_desc', 'kid_category_desc', 'user0', 'user1', 'user2',
       'user3', 'user4', 'user5', 'user6', 'user7', 'user8', 'user9', 'user10',
       'user11', 'user12', 'user13', 'user14', 'user15', 'user16', 'user17',
       'user18', 'user19', 'price', 'avg_count_item_dep', 'manufacturer',
       'department', 'brand', 'commodity_desc', 'sub_commodity_desc',
       'curr_size_of_product', 'item0', 'item1', 'item2', 'item3', 'item4',
       'item5', 'item6', 'item7', 'item8', 'item9', 'item10', 'item11',
       'item12', 'item13', 'item14', 'item15', 'item16', 'item17', 'item18',
       'item19'],
      dtype='object')

In [None]:
X_train = data_lvl_2.drop(['user_id', 'item_id','target'], axis=1)
y_train = data_lvl_2[['target']]

In [None]:
lgb = LGBMClassifier(objective='binary', max_depth=7, categorical_column=cat_feats)
lgb.fit(X_train, y_train)

train_preds = lgb.predict(X_train)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().




categorical_feature in param dict is overridden.



In [None]:
train_preds

array([0., 0., 0., ..., 0., 0., 0.])

Take the top-k predictions for each user, ranked by probability

In [None]:
train_preds.mean()

0.02028113918034585

In [None]:
len(train_preds)

109412

In [None]:
X_train['preds'] = train_preds

In [None]:
X_train[['user_id', 'item_id','target']] = targets_lvl_2[['user_id', 'item_id','target']]


In [None]:
X_train[X_train.preds == 1]['user_id'].nunique()

438

In [None]:
X_train['user_id'].nunique()

2151

In [None]:
# Features here come from data_lvl_2 (built from data_train_lvl_2); data_test_2 is not used
X_test = data_lvl_2.drop(['target'], axis=1)

In [None]:
lgb_test_pred = lgb.predict_proba(X_test.drop(['user_id','item_id'],axis=1))[:,1]

In [None]:
X_test['predict'] = lgb.predict_proba(X_test.drop(['user_id','item_id'],axis=1))[:,1]
X_test['predict']

0         0.033072
1         0.204962
2         0.057412
3         0.056086
4         0.000287
            ...   
109407    0.009456
109408    0.001759
109409    0.011264
109410    0.013708
109411    0.003872
Name: predict, Length: 109412, dtype: float64

In [None]:
result_test_1 = get_recomendations(X_test, lgb_test_pred, data_val_lvl_2)
result_test_1

Unnamed: 0,user_id,actual,recomendations
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[9297615, 9297615, 940947, 940947, 940947]"
1,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[986912, 5569230, 878715, 1105488, 1105488]"
2,7,"[840386, 889774, 898068, 909714, 929067, 95347...","[1070820, 866211, 1110039, 5587656, 1044078]"
3,8,"[835098, 872137, 910439, 924610, 992977, 10412...","[976199, 1044078, 1044078, 1104343, 1087102]"
4,9,"[864335, 990865, 1029743, 9297474, 10457112, 8...","[1029743, 1106523, 916122, 5569230, 5569230]"
...,...,...,...
1910,2496,[6534178],"[899624, 916122, 916122, 1044078, 1044078]"
1911,2497,"[1016709, 9835695, 1132298, 16809501, 845294, ...","[899624, 1040807, 848319, 866211, 8090537]"
1912,2498,"[15716530, 834484, 901776, 914190, 958382, 972...","[1070820, 1070820, 1044078, 1044078, 827578]"
1913,2499,"[867188, 877580, 902396, 914190, 951590, 95813...","[1029743, 5568378, 5568378, 5568378, 5568378]"
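
Several rows above contain the same item more than once (for example `[9297615, 9297615, 940947, 940947, 940947]`), most likely because the scored frame holds repeated (user_id, item_id) rows. A sketch of selecting the top-k distinct items per user is shown below. `top_k_unique` and the toy scores are hypothetical and are not part of the notebook's code.

```python
import pandas as pd

scored = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'item_id': [10, 10, 20, 30, 40, 50],
    'predict': [0.9, 0.9, 0.5, 0.7, 0.2, 0.8],
})

def top_k_unique(df, k=2):
    """Top-k distinct items per user, ordered by descending score."""
    ranked = (df.sort_values('predict', ascending=False)
                .drop_duplicates(['user_id', 'item_id']))
    return ranked.groupby('user_id')['item_id'].apply(lambda s: s.head(k).tolist())

top = top_k_unique(scored, k=2)
```

For user 1 the duplicated item 10 is kept once, so the second slot goes to item 30 rather than to a repeat.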


In [None]:
result_test_1.apply(lambda row: precision_at_k(row['recomendations'], row['actual'], k=5), axis=1).mean()

0.09702349869451664

Conclusion: precision@5 improved substantially over the baseline (from 0.026 to 0.097)