# Course project


## **Key points**
- Deadline: May 31, 23:59
- Target metric: precision@5 > 0.235
- Baseline solution: [MainRecommender](https://github.com/geangohn/recsys-tutorial/blob/master/src/recommenders.py)
- Submit a link to a GitHub repository with your solution. The solution must clearly show the metric on the new test set from retail_test1.csv: generate recommendations for every user in that file and compute precision@5 against their actual purchases.

**!! We do not handle user cold start: the users are assumed to be the same across all datasets, so any user who is missing from the training data must be excluded from the test set.**
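
For reference, precision@k can be sketched as follows. This is a minimal illustration rather than the course's `metrics.py`, which is not shown here; the name `precision_at_k_sketch` and the toy lists are assumptions. The argument order (recommended items first, actual purchases second) mirrors the calls made later in this notebook.

```python
import numpy as np

def precision_at_k_sketch(recommended_list, bought_list, k=5):
    """Share of the top-k recommended items that the user actually bought."""
    top_k = np.asarray(recommended_list)[:k]
    if len(top_k) == 0:
        return 0.0
    # True for every recommended item that appears among the purchases
    hits = np.isin(top_k, np.asarray(bought_list))
    # Assumption: divide by the number of items actually recommended (at most k)
    return float(hits.sum()) / len(top_k)

# Two of the five recommended items (2 and 5) were bought
example_precision = precision_at_k_sketch([1, 2, 3, 4, 5], [2, 5, 9], k=5)
```

Averaging this value over all test users gives the number that is compared against the 0.235 target.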


**Hints:** 

Start by simply trying different MainRecommender parameters:  
- N in the top-N items used to build the user-item matrix (currently top-5000)  
- Different weights in the user-item matrix (0/1, number of purchases, log(number of purchases + 1), purchase amount, ...)  
- Different matrix weighting schemes (TF-IDF, BM25, which has tunable parameters)  
- Different ways of blending recommendations (note the baseline: the user's past purchases)  
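
The weighting options for the user-item matrix can be sketched with plain pandas and NumPy. This is only an illustration on toy data: the helper `build_user_item` and its `weighting` argument are hypothetical, and how MainRecommender builds its matrix internally is not shown in this notebook. The column names `user_id`, `item_id` and `quantity` match the dataset.

```python
import numpy as np
import pandas as pd

def build_user_item(df, weighting="count"):
    """Pivot transactions into a dense user-item matrix with the chosen weights."""
    # Number of purchase rows per (user, item) pair; missing pairs become 0
    counts = pd.pivot_table(df, index="user_id", columns="item_id",
                            values="quantity", aggfunc="count",
                            fill_value=0).astype(float)
    if weighting == "count":
        return counts
    if weighting == "binary":
        return (counts > 0).astype(float)
    if weighting == "log":
        return np.log1p(counts)  # log(number of purchases + 1)
    raise ValueError(f"unknown weighting: {weighting}")

toy = pd.DataFrame({"user_id": [1, 1, 1, 2],
                    "item_id": [10, 10, 20, 10],
                    "quantity": [1, 2, 1, 1]})
binary_matrix = build_user_item(toy, "binary")
log_matrix = build_user_item(toy, "log")
```

A dense pivot is fine for a toy example; on the full dataset a sparse matrix (such as `csr_matrix`, imported below) is the more practical choice.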

Build an MVP (minimum viable product) first, even if it is just top-popular, and then improve it.

If you build a two-level model, keep an eye on validation. 

In [1]:
# !pip install implicit==0.4.4

# Import libs

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrices
from scipy.sparse import csr_matrix

# Matrix factorization
from implicit import als

# Second-level (ranking) model
from lightgbm import LGBMClassifier

import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

# import module we'll need to import our custom module
from shutil import copyfile

# copy our file into the working directory (make sure it has .py suffix)
copyfile(src = "../input/recommend85/metrics.py", dst = "../working/metrics.py")
copyfile(src = "../input/recommend85/utils.py", dst = "../working/utils.py")
copyfile(src = "../input/recommend85/recommenders.py", dst = "../working/recommenders.py")

# Our own helper functions
from metrics import precision_at_k, recall_at_k
from utils import prefilter_items
from recommenders import MainRecommender

import warnings
warnings.filterwarnings("ignore")

## Read data

In [3]:
PATH_DATA = "../input/recommend85/"

In [4]:
data = pd.read_csv(os.path.join(PATH_DATA,'retail_train.csv'))
item_features = pd.read_csv(os.path.join(PATH_DATA,'product.csv'))
user_features = pd.read_csv(os.path.join(PATH_DATA,'hh_demographic.csv'))

# Set global const

In [5]:
ITEM_COL = 'item_id'
USER_COL = 'user_id'
ACTUAL_COL = 'actual'

# N = Neighbors
N_PREDICT = 50
# N_PREDICT = 100
# N_PREDICT = 30

# Process features dataset

In [6]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': ITEM_COL}, inplace=True)
user_features.rename(columns={'household_key': USER_COL }, inplace=True)

# Split dataset for train, eval, test

In [7]:
# The training/validation scheme matters!
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks -- 
# choose the size of the 2nd dataset (6 weeks) --> learning curve (recall@k as a function of dataset size)


VAL_MATCHER_WEEKS = 6
VAL_RANKER_WEEKS = 3

In [8]:
# data for training the matching model
data_train_matcher = data[data['week_no'] < data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)]

# data for validating the matching model
data_val_matcher = data[(data['week_no'] >= data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)) &
                      (data['week_no'] < data['week_no'].max() - (VAL_RANKER_WEEKS))]

# data for training the ranking model
data_train_ranker = data_val_matcher.copy()  # Kept separate for clarity; later changes will make the two diverge

# data for testing the ranking and matching models
data_val_ranker = data[data['week_no'] >= data['week_no'].max() - VAL_RANKER_WEEKS]
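
The same split can be wrapped in a small helper and checked on toy data. This is a sketch under stated assumptions: `split_by_weeks` is a hypothetical name, and the toy frame contains only a `week_no` column. With the `>=` boundary used above, the last slice covers the week numbers from `max - 3` through `max`, which is four distinct values.

```python
import pandas as pd

def split_by_weeks(df, matcher_weeks=6, ranker_weeks=3, week_col="week_no"):
    """Split transactions by time into matcher-train, matcher-val and ranker-val."""
    last_week = df[week_col].max()
    ranker_start = last_week - ranker_weeks
    matcher_start = last_week - (matcher_weeks + ranker_weeks)
    train_matcher = df[df[week_col] < matcher_start]
    val_matcher = df[(df[week_col] >= matcher_start) & (df[week_col] < ranker_start)]
    val_ranker = df[df[week_col] >= ranker_start]
    return train_matcher, val_matcher, val_ranker

# Toy data: one row per week, weeks 1..20
toy_weeks = pd.DataFrame({"week_no": range(1, 21)})
train_m, val_m, val_r = split_by_weeks(toy_weeks)
```

The three slices are disjoint and together cover every row, which is the property the validation scheme relies on.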

In [9]:
# build a combined dataset for the first level (matching)
df_join_train_matcher = pd.concat([data_train_matcher, data_val_matcher])

In [10]:
def print_stats_data(df_data, name_df):
    print(name_df)
    print(f"Shape: {df_data.shape} Users: {df_data[USER_COL].nunique()} Items: {df_data[ITEM_COL].nunique()}")

In [11]:
# user and item counts differ across the splits; next we switch to warm start (known users only)
print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (2108779, 12) Users: 2498 Items: 83685
val_matcher
Shape: (169711, 12) Users: 2154 Items: 27649
train_ranker
Shape: (169711, 12) Users: 2154 Items: 27649
val_ranker
Shape: (118314, 12) Users: 2042 Items: 24329


In [12]:
data_val_matcher.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
2104867,2070,40618492260,594,1019940,1,1.0,311,-0.29,40,86,0.0,0.0
2107468,2021,40618753059,594,840361,1,0.99,443,0.0,101,86,0.0,0.0


# Prefilter items

In [13]:
n_items_before = data_train_matcher['item_id'].nunique()

# data_train_matcher = prefilter_items(data_train_matcher, item_features=item_features, take_n_popular=5000)
# data_train_matcher = prefilter_items(data_train_matcher, item_features=item_features, take_n_popular=1000)
data_train_matcher = prefilter_items(data_train_matcher, item_features=item_features, take_n_popular=500)

n_items_after = data_train_matcher['item_id'].nunique()
print('Decreased # items from {} to {}'.format(n_items_before, n_items_after))

Decreased # items from 83685 to 501


# Make cold-start to warm-start

In [14]:
# find the users that are present in the matcher training set
common_users = data_train_matcher.user_id.values

data_val_matcher = data_val_matcher[data_val_matcher.user_id.isin(common_users)]
data_train_ranker = data_train_ranker[data_train_ranker.user_id.isin(common_users)]
data_val_ranker = data_val_ranker[data_val_ranker.user_id.isin(common_users)]

print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (861404, 13) Users: 2495 Items: 501
val_matcher
Shape: (169615, 12) Users: 2151 Items: 27644
train_ranker
Shape: (169615, 12) Users: 2151 Items: 27644
val_ranker
Shape: (118282, 12) Users: 2040 Items: 24325


# Init/train recommender

In [15]:
recommender = MainRecommender(data_train_matcher)


# Eval recall of matching

In [16]:
result_eval_matcher = data_val_matcher.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_matcher.columns=[USER_COL, ACTUAL_COL]
result_eval_matcher.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,2,"[15830248, 838136, 839656, 861272, 866211, 870..."


In [17]:
%%time

result_eval_matcher['own_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))
result_eval_matcher['sim_item_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_items_recommendation(x, N=N_PREDICT))
result_eval_matcher['als_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_als_recommendations(x, N=N_PREDICT))

CPU times: user 34.4 s, sys: 28.9 s, total: 1min 3s
Wall time: 17.7 s


In [18]:
%%time
# result_eval_matcher['sim_user_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_users_recommendation(x, N=50))

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 6.68 µs


In [19]:
def calc_recall(df_data, top_k):
    for col_name in df_data.columns[2:]:
        yield col_name, df_data.apply(lambda row: recall_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()

In [20]:
def calc_precision(df_data, top_k):
    for col_name in df_data.columns[2:]:
        yield col_name, df_data.apply(lambda row: precision_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()
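
Since `recall_at_k` comes from `metrics.py`, which is not shown here, the following is a minimal sketch of what such a metric computes. The name `recall_at_k_sketch` and the toy lists are assumptions; the argument order follows the calls above.

```python
import numpy as np

def recall_at_k_sketch(recommended_list, bought_list, k=50):
    """Share of the user's actual purchases that appear in the top-k recommendations."""
    bought = np.asarray(bought_list)
    if len(bought) == 0:
        return 0.0
    top_k = np.asarray(recommended_list)[:k]
    # True for every purchased item that was recommended in the top-k
    hits = np.isin(bought, top_k)
    return float(hits.sum()) / len(bought)

# Top-3 is [1, 2, 3]; only item 2 of the four purchases is covered
example_recall = recall_at_k_sketch([1, 2, 3, 4], [2, 4, 7, 8], k=3)
```

Recall is the natural metric for the matching stage, because the ranker can only reorder items that the matcher has already retrieved.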

### Recall@50 of matching

In [21]:
TOPK_RECALL = 50
# TOPK_RECALL = 100

In [22]:
sorted(calc_recall(result_eval_matcher, TOPK_RECALL), key=lambda x: x[1],reverse=True)

# [('own_rec', 0.06867126166497624),
#  ('als_rec', 0.05334769036466487),
#  ('sim_item_rec', 0.041631651342292325)]

[('own_rec', 0.06867126166497624),
 ('als_rec', 0.05316982442892741),
 ('sim_item_rec', 0.040692781037539136)]

### Precision@5 of matching

In [23]:
TOPK_PRECISION = 5

In [24]:
sorted(calc_precision(result_eval_matcher, TOPK_PRECISION), key=lambda x: x[1],reverse=True)

# [('own_rec', 0.2293816829381683),
#  ('als_rec', 0.09567642956764295),
#  ('sim_item_rec', 0.0702928870292887)]

[('own_rec', 0.2293816829381683),
 ('als_rec', 0.09576940957694097),
 ('sim_item_rec', 0.06452812645281265)]

# Ranking part

### Training the second-level model on the selected candidates

- Train on data_train_ranker
- Train *only* on the selected candidates
- *As an example*, I generate the top-50 candidates with get_own_recommendations
- (!) If a user has bought fewer than 50 items, get_own_recommendations pads the recommendations with top-popular items
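
The padding behaviour mentioned in the last point can be sketched as follows. This is an assumption about the general idea, not the actual MainRecommender code, which is not shown in this notebook; `pad_with_popular` and the toy item ids are hypothetical.

```python
def pad_with_popular(own_items, popular_items, n=50):
    """Return up to n items: the user's own first, then popular ones not already present."""
    # De-duplicate the user's own items while keeping their order
    result = list(dict.fromkeys(own_items))[:n]
    seen = set(result)
    for item in popular_items:
        if len(result) >= n:
            break
        if item not in seen:
            result.append(item)
            seen.add(item)
    return result

# The user has only two own items; the rest is filled from the popular list
padded = pad_with_popular([5, 7], [7, 1, 2, 3], n=4)
```

Skipping items that are already present keeps each user's candidate list free of duplicates.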

In [25]:
# -- older purchases -- | -- 6 weeks -- | -- 3 weeks -- 

## Preparing the training data

In [26]:
# take the users from the ranking training set
df_match_candidates = pd.DataFrame(data_train_ranker[USER_COL].unique())
df_match_candidates.columns = [USER_COL]

In [27]:
# collect candidates from the first stage (matcher)
df_match_candidates['candidates'] = df_match_candidates[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

In [28]:
df_match_candidates.head(2)

Unnamed: 0,user_id,candidates
0,2070,"[1016800, 917033, 926905, 913210, 5569374, 933..."
1,2021,"[1013928, 6534077, 896862, 1000753, 883932, 10..."


In [29]:
# expand the candidate lists into one row per item
df_items = df_match_candidates.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
df_items.name = 'item_id'

In [30]:
df_match_candidates = df_match_candidates.drop('candidates', axis=1).join(df_items)
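
The same list-to-rows expansion can be written with `DataFrame.explode`, available in pandas 0.25 and later. The sketch below uses a toy frame with hypothetical names (`toy_candidates`, `long_candidates`) and is an alternative to the `apply`/`stack` approach above, not a required change.

```python
import pandas as pd

toy_candidates = pd.DataFrame({
    "user_id": [2070, 2021],
    "candidates": [[1016800, 917033], [1013928]],
})

# One row per (user, candidate item); the list column becomes a scalar column
long_candidates = (
    toy_candidates
    .explode("candidates")
    .rename(columns={"candidates": "item_id"})
    .reset_index(drop=True)
)
```

Note that `explode` leaves the new column with `object` dtype, so a cast such as `astype(int)` may be needed before merging on `item_id`.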

In [31]:
df_match_candidates.head(4)

Unnamed: 0,user_id,item_id
0,2070,1016800
0,2070,917033
0,2070,926905
0,2070,913210


### Check warm start

In [32]:
print_stats_data(df_match_candidates, 'match_candidates')

match_candidates
Shape: (107550, 2) Users: 2151 Items: 499


### Building the ranking training set from the stage-1 candidates 

In [33]:
df_ranker_train = data_train_ranker[[USER_COL, ITEM_COL]].copy()
df_ranker_train['target'] = 1  # purchases only

df_ranker_train.head()

Unnamed: 0,user_id,item_id,target
2104867,2070,1019940,1
2107468,2021,840361,1
2107469,2021,856060,1
2107470,2021,869344,1
2107471,2021,896862,1


In [34]:
df_ranker_train = df_match_candidates.merge(df_ranker_train, on=[USER_COL, ITEM_COL], how='left')

# drop duplicates
df_ranker_train = df_ranker_train.drop_duplicates(subset=[USER_COL, ITEM_COL])

# candidates without a matching purchase get target 0
df_ranker_train['target'] = df_ranker_train['target'].fillna(0)

In [35]:
df_ranker_train.target.value_counts()

0.0    94071
1.0     8956
Name: target, dtype: int64

In [36]:
df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target
0,2070,1016800,0.0
1,2070,917033,0.0


(!) Each user has up to 50 candidate item_ids (duplicate user-item pairs were dropped above)

In [37]:
df_ranker_train['target'].mean()

0.08692866918380619

## Preparing features for model training

### Descriptive features

In [38]:
item_features.head(2)

Unnamed: 0,item_id,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product
0,25671,2,GROCERY,National,FRZN ICE,ICE - CRUSHED/CUBED,22 LB
1,26081,2,MISC. TRANS.,National,NO COMMODITY DESCRIPTION,NO SUBCOMMODITY DESCRIPTION,


In [39]:
user_features.head(2)

Unnamed: 0,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc,user_id
0,65+,A,35-49K,Homeowner,2 Adults No Kids,2,None/Unknown,1
1,45-54,A,50-74K,Homeowner,2 Adults No Kids,2,None/Unknown,7


In [40]:
df_ranker_train = df_ranker_train.merge(item_features, on='item_id', how='left')
df_ranker_train = df_ranker_train.merge(user_features, on='user_id', how='left')

df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2070,1016800,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
1,2070,917033,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown


### Behavioral features

##### Behavioral features must be computed from all the data that precedes data_val_ranker

In [41]:
df_join_train_matcher.head()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
2,2375,26984851472,1,1036325,1,0.99,364,-0.3,1631,1,0.0,0.0
3,2375,26984851472,1,1082185,1,1.21,364,0.0,1631,1,0.0,0.0
4,2375,26984851472,1,8160430,1,1.5,364,-0.39,1631,1,0.0,0.0


In [42]:
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('sales_value').sum().rename('total_item_sales_value'), how='left',on=ITEM_COL)
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().rename('total_quantity_value'), how='left',on=ITEM_COL)
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg(USER_COL).count().rename('user_freq'), how='left',on=USER_COL)
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL, USER_COL]).agg('quantity').sum().rename('item_user_quantity'), how='left',on=[ITEM_COL, USER_COL])
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL, USER_COL]).agg('week_no').sum().rename('item_user_week_no'), how='left',on=[ITEM_COL, USER_COL])
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL, USER_COL]).agg('trans_time').count().rename('item_user_trans_time'), how='left',on=[ITEM_COL, USER_COL])
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL]).agg('basket_id').mean().rename('item_basket_id_mean'), how='left',on=[ITEM_COL])
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL, USER_COL]).agg('sales_value').sum().rename('item_user_sales_value'), how='left',on=[ITEM_COL, USER_COL])


# these features lowered precision
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg(USER_COL).count().rename('item_freq'), how='left',on=ITEM_COL)
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('sales_value').sum().rename('total_user_sales_value'), how='left',on=USER_COL)
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('quantity').sum().rename('user_quantity_per_week')/df_join_train_matcher.week_no.nunique(), how='left',on=USER_COL)
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('quantity').sum().rename('user_quantity_per_basktet')/df_join_train_matcher.basket_id.nunique(), how='left',on=USER_COL)
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg(USER_COL).count().rename('item_freq_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=ITEM_COL)
# df_ranker_train['average_check'] = df_ranker_train.total_user_sales_value/df_ranker_train.user_freq
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL, USER_COL]).agg('day').sum().rename('item_user_day'), how='left',on=[ITEM_COL, USER_COL])
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL, USER_COL]).agg('store_id').sum().rename('item_user_store_id'), how='left',on=[ITEM_COL, USER_COL])
# df_ranker_train = df_ranker_train.merge(pd.get_dummies(df_ranker_train.department, dtype=int), right_index=True, left_index=True)
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL]).agg('store_id').count().rename('item_store_id'), how='left',on=[ITEM_COL])
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL]).agg('trans_time').count().rename('item_trans_time'), how='left',on=[ITEM_COL])
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=[ITEM_COL]).agg('basket_id').std().rename('item_basket_id_std'), how='left',on=[ITEM_COL])
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('coupon_disc').mean().rename('item_coupon_disc'), how='left',on=ITEM_COL)
# disc_items = df_join_train_matcher[df_join_train_matcher['coupon_disc'] != 0]
# df_ranker_train = df_ranker_train.merge(disc_items.groupby(by=ITEM_COL).agg('coupon_disc').count().rename('share_disc_items')/df_ranker_train.total_quantity_value.nunique(), how='left',on=ITEM_COL)
# df_ranker_train = df_ranker_train.merge(disc_items.groupby(by=USER_COL).agg('coupon_disc').count().rename('share_disc_items_u')/df_ranker_train.total_quantity_value.nunique(), how='left',on=USER_COL)


# these features did not change the metric
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().rename('item_quantity_per_week')/df_join_train_matcher.week_no.nunique(), how='left',on=ITEM_COL)
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().rename('item_quantity_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=ITEM_COL)
# df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg(USER_COL).count().rename('user_freq_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=USER_COL)
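
Several of the aggregates above share the same grouping keys, so they can be computed in one pass with named aggregation (pandas 0.25 and later) and merged once. The sketch below runs on toy data; the frames `history` and `candidates` are hypothetical stand-ins for `df_join_train_matcher` and `df_ranker_train`, while the feature names are taken from the cell above.

```python
import pandas as pd

history = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "item_id": [10, 10, 10, 20],
    "quantity": [1, 2, 1, 3],
    "sales_value": [1.5, 3.0, 1.5, 6.0],
})
candidates = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 20, 20]})

# User-item features computed in a single groupby
user_item_feats = history.groupby(["user_id", "item_id"], as_index=False).agg(
    item_user_quantity=("quantity", "sum"),
    item_user_sales_value=("sales_value", "sum"),
)
# Item-level feature
item_feats = history.groupby("item_id", as_index=False).agg(
    total_quantity_value=("quantity", "sum"),
)

features = (
    candidates
    .merge(user_item_feats, on=["user_id", "item_id"], how="left")
    .merge(item_feats, on="item_id", how="left")
)
```

Pairs the user never bought get NaN in the user-item columns, as in the output shown below.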

In [43]:
df_ranker_train[df_ranker_train.columns[15:]]

Unnamed: 0,kid_category_desc,total_item_sales_value,total_quantity_value,user_freq,item_user_quantity,item_user_week_no,item_user_trans_time,item_basket_id_mean,item_user_sales_value
0,None/Unknown,2234.13,745,1996,10.0,331.0,5.0,3.257372e+10,25.76
1,None/Unknown,1434.55,481,1996,6.0,248.0,4.0,3.255729e+10,15.00
2,None/Unknown,2420.71,786,1996,9.0,331.0,5.0,3.239287e+10,23.26
3,None/Unknown,5406.18,1364,1996,11.0,540.0,9.0,3.394303e+10,40.61
4,None/Unknown,4006.56,1391,1996,6.0,225.0,4.0,3.226843e+10,17.26
...,...,...,...,...,...,...,...,...,...
103022,None/Unknown,11175.05,3769,897,,,,3.255426e+10,
103023,None/Unknown,7543.64,2165,897,,,,3.350955e+10,
103024,None/Unknown,9904.63,3458,897,,,,3.282426e+10,
103025,None/Unknown,6833.06,2169,897,,,,3.286465e+10,


In [44]:
# count how many items each user has in each category (department)
# list of departments
department_list = df_ranker_train.department.value_counts().reset_index()['index'].tolist()
df_departments = pd.DataFrame(columns=department_list)

# per-user item counts in each department
user_depart_count = df_ranker_train.groupby('user_id')['department'].value_counts()

# build the per-user frame and merge the counts into df_ranker_train as user features
for user in df_ranker_train.user_id.unique():
    temp = pd.DataFrame(user_depart_count[user]).rename(columns={'department': user}).T
    df_departments = df_departments.append(temp, ignore_index=False)

df_departments.reset_index(inplace=True)
df_departments.rename(columns={'index': 'user_id'}, inplace=True)

df_ranker_train = df_ranker_train.merge(df_departments, on='user_id', how='left')
df_ranker_train.fillna(0, inplace=True)
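
The per-user loop above relies on `DataFrame.append`, which was removed in pandas 2.0. A loop-free way to get the same kind of per-user department counts is `pd.crosstab`. This is a sketch on toy data with hypothetical names (`toy_rows`, `dept_counts`); the column order may differ from the `department_list` order used above.

```python
import pandas as pd

toy_rows = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "department": ["GROCERY", "GROCERY", "MEAT", "GROCERY"],
})

# Rows: users, columns: departments, values: number of rows per pair (0 if none)
dept_counts = pd.crosstab(toy_rows["user_id"], toy_rows["department"]).reset_index()
dept_counts.columns.name = None  # drop the leftover "department" axis label

toy_with_counts = toy_rows.merge(dept_counts, on="user_id", how="left")
```

Because `crosstab` fills missing user-department pairs with 0, no separate `fillna` is needed for these columns.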

In [45]:
# # Working with ALS item embeddings 

# model_item_factors = pd.DataFrame(recommender.model.item_factors)
# item_id_df = pd.DataFrame()
# item_id_df[ITEM_COL] = data_train_matcher['item_id'].unique()
# model_item_factors_df = pd.concat([item_id_df, model_item_factors], axis=1)

# lowers precision
# df_ranker_train = df_ranker_train.merge(model_item_factors_df, how='left',on=[ITEM_COL])

In [46]:
# df_ranker_train.drop('share_disc_items', axis=1, inplace=True)

In [47]:
df_ranker_train.head()

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,GROCERY,PRODUCE,MEAT,MEAT-PCKGD,DELI,DRUG GM,PASTRY,SEAFOOD-PCKGD,FLORAL,SEAFOOD
0,2070,1016800,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,28,9,2,7,0,4,0,0,0,0
1,2070,917033,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,28,9,2,7,0,4,0,0,0,0
2,2070,926905,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,28,9,2,7,0,4,0,0,0,0
3,2070,913210,1.0,2,GROCERY,National,WATER - CARBONATED/FLVRD DRINK,NON-CRBNTD DRNKING/MNERAL WATE,405.6 OZ,45-54,...,28,9,2,7,0,4,0,0,0,0
4,2070,5569374,0.0,1208,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,28,9,2,7,0,4,0,0,0,0


In [48]:
X_train = df_ranker_train.drop('target', axis=1)
y_train = df_ranker_train[['target']]

In [49]:
cat_feats = X_train.columns[2:].tolist()
X_train[cat_feats] = X_train[cat_feats].astype('category')

## Обучение модели ранжирования

In [50]:
# %%time
# lgb = LGBMClassifier(objective='binary',
#                      max_depth=8,
#                      n_estimators=100,
#                      learning_rate=0.1,
#                      categorical_column=cat_feats,
#                      n_jobs=-1)

# lgb.fit(X_train, y_train)

In [51]:
# # Search for the best model parameters

# params ={'n_estimators': [100, 200, 250],
#          'max_depth': [5, 8, 10],
#          'learning_rate': [0.1, 0.5]
#             }

# best = []

# for lossf in params.get('n_estimators'):
#     for depth in params.get('max_depth'):
#         for lrate in params.get('learning_rate'):
#             lgb = LGBMClassifier(n_estimators=lossf, 
#                                     random_state=42,
#                                     max_depth=depth, 
#                                     learning_rate=lrate,
#                                     objective='binary',
#                                     categorical_column=cat_feats,
#                                     n_jobs=-1)
#             lgb.fit(X_train, y_train)
#             train_preds = lgb.predict_proba(X_train)
#             df_ranker_predict = df_ranker_train.copy()
#             df_ranker_predict['proba_item_purchase'] = train_preds[:,1]
#             result_eval_ranker['own_rec'] = result_eval_ranker[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))
#             result_eval_ranker['reranked_own_rec'] = result_eval_ranker[USER_COL].apply(lambda user_id: rerank(user_id))
#             best.append([sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), lossf, depth, lrate])
#             print(sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), lossf, depth, lrate)

# best_result = 0
# list_ = []
# for item in range(len(best)):
#     if best_result < best[item][0][0][1]:
#         best_result = best[item][0][0][1]
#         list_.append(best[item])
# print(list_[-1])
# print(best_result)

# [[('reranked_own_rec', 0.2685117493472585), ('own_rec', 0.19333333333333336)], 100, 8, 0.1]
# 0.2685117493472585

In [52]:
%%time
from catboost import CatBoost, Pool

df_bin_feat = pd.get_dummies(X_train)

catbst = CatBoost(params ={'loss_function': 'RMSE',
                           'iterations': 2000,
                           'depth': 10,
#                           "task_type":"GPU",
            })
catbst.fit(df_bin_feat, y_train, silent=True)

train_preds = catbst.predict(df_bin_feat, prediction_type="Probability")

CPU times: user 24min 33s, sys: 15 s, total: 24min 48s
Wall time: 7min 21s


In [53]:
# Search for the best model parameters

# params ={'loss_function': ['RMSE', 'Logloss', 'MAE', 'CrossEntropy'],
#          'iterations': [500, 1000, 2000],
#          'depth': [5, 8, 10]
#             }

# best = []

# for lossf in params.get('loss_function'):
#     for iters in params.get('iterations'):
#         for depth in params.get('depth'):
#             catbst = CatBoost(params={'loss_function': lossf, 'iterations': iters, 'depth': depth})
#             catbst.fit(df_bin_feat, y_train, silent=True)
#             train_preds = catbst.predict(df_bin_feat, prediction_type="Probability")
#             df_ranker_predict = df_ranker_train.copy()
#             df_ranker_predict['proba_item_purchase'] = train_preds[:,1]
#             result_eval_ranker['own_rec'] = result_eval_ranker[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))
#             result_eval_ranker['reranked_own_rec'] = result_eval_ranker[USER_COL].apply(lambda user_id: rerank(user_id))
#             best.append([sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), lossf, iters, depth])

# best_result = 0
# list_ = []
# for item in range(len(best)):
#     if best_result < best[item][0][0][1]:
#         best_result = best[item][0][0][1]
#         list_.append(best[item])
# print(list_[-1])
# print(best_result)

# [[('reranked_own_rec', 0.27237597911227157), ('own_rec', 0.19333333333333336)], 'RMSE', 2000, 10]
# 0.27237597911227157

In [54]:
# train_preds = lgb.predict_proba(X_train)

In [55]:
df_ranker_predict = df_ranker_train.copy()

In [56]:
df_ranker_predict['proba_item_purchase'] = train_preds[:,1]

# Evaluation on test dataset

In [57]:
result_eval_ranker = data_val_ranker.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_ranker.columns=[USER_COL, ACTUAL_COL]
result_eval_ranker.head(2)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."


## Eval matching on test dataset

In [58]:
%%time
result_eval_ranker['own_rec'] = result_eval_ranker[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))

CPU times: user 2.21 s, sys: 4.74 ms, total: 2.21 s
Wall time: 2.21 s


In [59]:
# measure the precision of the matching model alone, to see how ranking affects the metrics

sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True)

[('own_rec', 0.19333333333333336)]

## Eval re-ranked matched result on test dataset
Recall the df_match_candidates set, which was produced by get_own_recommendations for our users. The set of users is fixed and identical, so the predictions are identical as well, and we can reuse this dataframe for re-ranking.


In [60]:
def rerank(user_id):
    return df_ranker_predict[df_ranker_predict[USER_COL]==user_id].sort_values('proba_item_purchase', ascending=False).head(5).item_id.tolist()
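
`rerank` filters and sorts the whole prediction frame on every call. An alternative is to sort once and build a per-user lookup, sketched below on toy data. The names `build_rerank_lookup`, `toy_pred` and `lookup` are hypothetical; the column names follow `df_ranker_predict`.

```python
import pandas as pd

toy_pred = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "item_id": [10, 20, 30, 10, 40],
    "proba_item_purchase": [0.2, 0.9, 0.5, 0.7, 0.1],
})

def build_rerank_lookup(df_pred, top_k=5):
    """Map each user_id to its top_k item_ids by predicted purchase score."""
    ranked = df_pred.sort_values("proba_item_purchase", ascending=False)
    # groupby keeps the sorted row order inside each group
    return (ranked.groupby("user_id")["item_id"]
                  .apply(lambda s: s.head(top_k).tolist())
                  .to_dict())

lookup = build_rerank_lookup(toy_pred, top_k=2)
```

With such a lookup, `lookup.get(user_id, [])` returns an empty list for users without predictions, which matches what `rerank` returns for them.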

In [61]:
result_eval_ranker['reranked_own_rec'] = result_eval_ranker[USER_COL].apply(lambda user_id: rerank(user_id))

In [62]:
# compare the metrics with and without ranking; then add features and compare again
# to a first approximation, the metrics should improve when the second stage is used

print(*sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('reranked_own_rec', 0.2742558746736293)
('own_rec', 0.19333333333333336)


# Evaluation on the test set for the course project

In [63]:
# df_transactions = pd.read_csv('../input/recommend85/transaction_data.csv')

In [64]:
# df_transactions.info()

In [65]:
df_test = pd.read_csv('../input/recommend85/retail_test1.csv')

In [66]:
result_test = df_test.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_test.columns=[USER_COL, ACTUAL_COL]
result_test.head(2)

Unnamed: 0,user_id,actual
0,1,"[880007, 883616, 931136, 938004, 940947, 94726..."
1,2,"[820165, 820291, 826784, 826835, 829009, 85784..."


Take the top-k predictions for each user, ranked by probability

# Computing precision@5 on the new test set

In [67]:
# drop users we have no information about
result_test = result_test[result_test.user_id.isin(common_users)]
result_test['own_rec'] = result_test[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))
result_test['reranked_own_rec'] = result_test[USER_COL].apply(lambda user_id: rerank(user_id))

In [68]:
print(*sorted(calc_precision(result_test, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('reranked_own_rec', 0.21936936936936938)
('own_rec', 0.16675517790759425)


In [69]:
# baseline
# ('reranked_own_rec', 0.13344594594594594)
# ('own_rec', 0.04719065321295805)


# feats
# ('reranked_own_rec', 0.19605855855855858)
# ('own_rec', 0.12278279341476367)


# top-500
# ('reranked_own_rec', 0.20777027027027026)
# ('own_rec', 0.16675517790759425)


# catboost
# ('reranked_own_rec', 0.21621621621621623)
# ('own_rec', 0.16675517790759425)


# catboost + features
# ('reranked_own_rec', 0.21936936936936938)
# ('own_rec', 0.16675517790759425)

### **Final metric: 0.219**

1. The metric was improved by:
    - generating new features
    - removing features that lowered precision@5
    - reducing the number of popular items from 5000 to 500
    - using CatBoost instead of LGBMClassifier
    - tuning the parameters of the ranking model


2. Variants that were tried but did not improve the result:
    - changing the number of selected neighbors (30, 100)
    - adding ALS item embeddings as features for the ranking model
    - weighting the matrix with tfidf_weight
    - changing the parameters of matrix weighting with bm25_weight (K1=200, B=0.6)