# Course project


## **Основное**
- Дедлайн - 31 мая 23:59
- Целевая метрика precision@5 > 0.235
- Бейзлайн решения - [MainRecommender](https://github.com/geangohn/recsys-tutorial/blob/master/src/recommenders.py)
- Сдаем ссылку на github с решением. В решении должны быть отчетливо видна метрика на новом тестовом сете из файла retail_test1.csv, то есть вам нужно для всех юзеров из этого файла выдать выши рекомендации, и посчитать на actual покупках precision@5. 

**!! Мы не рассматриваем холодный старт для пользователя, все наши пользователя одинаковы во всех сетах, поэтому нужно позаботиться об их исключении из теста.**


**Hints:** 

Сначала просто попробуйте разные параметры MainRecommender:  
- N в топ-N товарах при формировании user-item матирцы (сейчас топ-5000)  
- Различные веса в user-item матрице (0/1, кол-во покупок, log(кол-во покупок + 1), сумма покупки, ...)  
- Разные взвешивания матрицы (TF-IDF, BM25 - у него есть параметры)  
- Разные смешивания рекомендаций (обратите внимание на бейзлайн - прошлые покупки юзера)  

Сделайте MVP - минимально рабочий продукт - (пусть даже top-popular), а потом его улучшайте

Если вы делаете двухуровневую модель - следите за валидацией 

!pip install implicit==0.4.4

# Import libs

In [132]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Для работы с матрицами
from scipy.sparse import csr_matrix

# Матричная факторизация
from implicit import als

# Модель второго уровня
from lightgbm import LGBMClassifier

import os, sys
module_path = os.path.abspath(os.path.join(os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

# Написанные нами функции
from metrics import precision_at_k, recall_at_k
from utils import prefilter_items
from recommenders import MainRecommender

## Read data

In [133]:
PATH_DATA = "data"

In [134]:
data = pd.read_csv(os.path.join(PATH_DATA,'retail_train.csv'))
item_features = pd.read_csv(os.path.join(PATH_DATA,'product.csv'))
user_features = pd.read_csv(os.path.join(PATH_DATA,'hh_demographic.csv'))

# Set global const

In [135]:
ITEM_COL = 'item_id'
USER_COL = 'user_id'
ACTUAL_COL = 'actual'

# N = Neighbors
N_PREDICT = 50 

# Process features dataset

In [136]:
# column processing
item_features.columns = [col.lower() for col in item_features.columns]
user_features.columns = [col.lower() for col in user_features.columns]

item_features.rename(columns={'product_id': ITEM_COL}, inplace=True)
user_features.rename(columns={'household_key': USER_COL }, inplace=True)

# Split dataset for train, eval, test

In [137]:
# Важна схема обучения и валидации!
# -- давние покупки -- | -- 6 недель -- | -- 3 недель -- 
# подобрать размер 2-ого датасета (6 недель) --> learning curve (зависимость метрики recall@k от размера датасета)


VAL_MATCHER_WEEKS = 6
VAL_RANKER_WEEKS = 3

In [138]:
# берем данные для тренировки matching модели
data_train_matcher = data[data['week_no'] < data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)]

# берем данные для валидации matching модели
data_val_matcher = data[(data['week_no'] >= data['week_no'].max() - (VAL_MATCHER_WEEKS + VAL_RANKER_WEEKS)) &
                      (data['week_no'] < data['week_no'].max() - (VAL_RANKER_WEEKS))]

# берем данные для тренировки ranking модели
data_train_ranker = data_val_matcher.copy()  # Для наглядности. Далее мы добавим изменения, и они будут отличаться

# берем данные для теста ranking, matching модели
data_val_ranker = data[data['week_no'] >= data['week_no'].max() - VAL_RANKER_WEEKS]

In [139]:
# сделаем объединенный сет данных для первого уровня (матчинга)
df_join_train_matcher = pd.concat([data_train_matcher, data_val_matcher])

In [140]:
def print_stats_data(df_data, name_df):
    print(name_df)
    print(f"Shape: {df_data.shape} Users: {df_data[USER_COL].nunique()} Items: {df_data[ITEM_COL].nunique()}")

In [141]:
print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (2108779, 12) Users: 2498 Items: 83685
val_matcher
Shape: (169711, 12) Users: 2154 Items: 27649
train_ranker
Shape: (169711, 12) Users: 2154 Items: 27649
val_ranker
Shape: (118314, 12) Users: 2042 Items: 24329


In [142]:
# выше видим разброс по пользователям и товарам и дальше мы перейдем к warm-start (только известные пользователи)

In [143]:
data_val_matcher.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
2104867,2070,40618492260,594,1019940,1,1.0,311,-0.29,40,86,0.0,0.0
2107468,2021,40618753059,594,840361,1,0.99,443,0.0,101,86,0.0,0.0


# Prefilter items

In [144]:
n_items_before = data_train_matcher['item_id'].nunique()

data_train_matcher = prefilter_items(data_train_matcher, item_features=item_features, take_n_popular=5000)

n_items_after = data_train_matcher['item_id'].nunique()
print('Decreased # items from {} to {}'.format(n_items_before, n_items_after))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['price'] = data['sales_value'] / (np.maximum(data['quantity'], 1))


Decreased # items from 83685 to 5001


# Make cold-start to warm-start

In [145]:
# ищем общих пользователей
common_users = list(set(data_train_matcher.user_id.values)&(set(data_val_matcher.user_id.values))&set(data_val_ranker.user_id.values))

data_train_matcher = data_train_matcher[data_train_matcher.user_id.isin(common_users)]
data_val_matcher = data_val_matcher[data_val_matcher.user_id.isin(common_users)]
data_train_ranker = data_train_ranker[data_train_ranker.user_id.isin(common_users)]
data_val_ranker = data_val_ranker[data_val_ranker.user_id.isin(common_users)]

print_stats_data(data_train_matcher,'train_matcher')
print_stats_data(data_val_matcher,'val_matcher')
print_stats_data(data_train_ranker,'train_ranker')
print_stats_data(data_val_ranker,'val_ranker')

train_matcher
Shape: (784420, 13) Users: 1915 Items: 4999
val_matcher
Shape: (163261, 12) Users: 1915 Items: 27118
train_ranker
Shape: (163261, 12) Users: 1915 Items: 27118
val_ranker
Shape: (115989, 12) Users: 1915 Items: 24042


In [146]:
def popularity_recommendation(data, n=5):
    """Топ-n популярных товаров"""
    
    popular = data.groupby('item_id')['sales_value'].sum().reset_index()
    popular.sort_values('sales_value', ascending=False, inplace=True)
    popular=popular.loc[popular.item_id !=999999]
    
    recs = popular.head(n).item_id
    
    return recs.tolist()

def get_recommendations(user, model, N=5):
    res = [id_to_itemid[rec[0]] for rec in model.recommend(userid=userid_to_id[user],  # userid - id от 0 до N
                           user_items=csr_matrix(user_item_matrix).tocsr(),   # на вход user-item matrix
                           N=N, # кол-во рекомендаций 
                           filter_already_liked_items=False, 
                           filter_items=[itemid_to_id[999999]], 
                           recalculate_user=False)]
    return res

In [147]:
result = data_train_matcher.groupby('user_id')['item_id'].unique().reset_index()
result.columns=['user_id', 'actual']

popular_recs = popularity_recommendation(data_train_matcher, n=5)

result['popular_recommendation'] = result['user_id'].apply(lambda x: popular_recs)

In [148]:
print("Precision:",result.apply(lambda row: precision_at_k(row['popular_recommendation'], row['actual']), axis=1).mean())

Precision: 0.437597911227154


In [149]:
user_item_matrix = pd.pivot_table(data_train_matcher, index='user_id', columns='item_id', 
               values='quantity', 
               aggfunc='count', fill_value=0)

user_item_matrix = user_item_matrix.astype(float)

sparse_user_item = csr_matrix(user_item_matrix).tocsr()

userids = user_item_matrix.index.values
itemids = user_item_matrix.columns.values

matrix_userids = np.arange(len(userids))
matrix_itemids = np.arange(len(itemids))

id_to_itemid = dict(zip(matrix_itemids, itemids))
id_to_userid = dict(zip(matrix_userids, userids))

itemid_to_id = dict(zip(itemids, matrix_itemids))
userid_to_id = dict(zip(userids, matrix_userids))

In [150]:
from implicit.bpr import BayesianPersonalizedRanking
from implicit.als import AlternatingLeastSquares
from hyperopt import fmin, tpe, hp, Trials

In [151]:
space = [hp.randint('N', 90, 200),
         hp.uniform('regular', 0.001, 0.05),
         hp.randint('IT', 40, 80)]
mmm=0
def f(args):
    N, regular ,IT = args
    
    
    model = AlternatingLeastSquares(factors=N, 
                                regularization=regular,
                                iterations=IT, 
                                calculate_training_loss=True, 
                                num_threads=0)
    #model = BayesianPersonalizedRanking(factors=N, 
    #                            regularization=regular,
    #                            learning_rate=0.01,
    #                            iterations=IT, 
    #                            num_threads=4)

    model.fit(csr_matrix(user_item_matrix).T.tocsr(),  # На вход item-user matrix
          show_progress=True)
    mmm=+1
    result['BPR_test_'+str(mmm)] = result['user_id'].apply(lambda x: get_recommendations(x, model=model, N=50))
    PRECISION=result.apply(lambda row: precision_at_k(row['BPR_test_'+str(mmm)], row['actual'], k=50), axis=1).mean()
    print(f'PRECISION= {PRECISION} and N, regular ,IT= {N, regular ,IT}')
    
    return 1-PRECISION

In [152]:
%%time

trials = Trials()

best = fmin(f, space, algo = tpe.suggest, max_evals=15, trials=trials)
print ('TPE result: ', best)

  0%|                                    | 0/15 [00:00<?, ?trial/s, best loss=?]

  0%|          | 0/64 [00:00<?, ?it/s]

PRECISION= 0.7181096605744126 and N, regular ,IT= (176, 0.004629717406149191, 64)
  7%|▍      | 1/15 [05:24<1:15:39, 324.28s/trial, best loss: 0.2818903394255874]

  0%|          | 0/62 [00:00<?, ?it/s]

PRECISION= 0.5988929503916449 and N, regular ,IT= (100, 0.042896591299486135, 62)
 13%|▉      | 2/15 [10:52<1:10:47, 326.70s/trial, best loss: 0.2818903394255874]

  0%|          | 0/76 [00:00<?, ?it/s]

PRECISION= 0.7019634464751959 and N, regular ,IT= (163, 0.02238088775110946, 76)
 20%|█▍     | 3/15 [16:30<1:06:19, 331.66s/trial, best loss: 0.2818903394255874]

  0%|          | 0/49 [00:00<?, ?it/s]

PRECISION= 0.6881775456919059 and N, regular ,IT= (153, 0.01729693767645488, 49)
 27%|██▍      | 4/15 [21:48<59:51, 326.50s/trial, best loss: 0.2818903394255874]

  0%|          | 0/60 [00:00<?, ?it/s]

PRECISION= 0.7252219321148825 and N, regular ,IT= (182, 0.03314366451922475, 60)
 33%|███      | 5/15 [27:09<54:03, 324.38s/trial, best loss: 0.2747780678851175]

  0%|          | 0/64 [00:00<?, ?it/s]

PRECISION= 0.6888981723237598 and N, regular ,IT= (153, 0.0011539009879272056, 64)
 40%|███▌     | 6/15 [32:24<48:11, 321.31s/trial, best loss: 0.2747780678851175]

  0%|          | 0/40 [00:00<?, ?it/s]

PRECISION= 0.6674151436031333 and N, regular ,IT= (139, 0.045337046655640416, 40)
 47%|████▏    | 7/15 [37:29<42:06, 315.78s/trial, best loss: 0.2747780678851175]

  0%|          | 0/55 [00:00<?, ?it/s]

PRECISION= 0.7152584856396867 and N, regular ,IT= (174, 0.022787824898998107, 55)
 53%|████▊    | 8/15 [42:44<36:50, 315.78s/trial, best loss: 0.2747780678851175]

  0%|          | 0/48 [00:00<?, ?it/s]

PRECISION= 0.6237180156657964 and N, regular ,IT= (113, 0.01700159740381922, 48)
 60%|█████▍   | 9/15 [47:59<31:32, 315.43s/trial, best loss: 0.2747780678851175]

  0%|          | 0/57 [00:00<?, ?it/s]

PRECISION= 0.6631436031331593 and N, regular ,IT= (135, 0.0010708482507872305, 57)
 67%|█████▎  | 10/15 [53:17<26:21, 316.28s/trial, best loss: 0.2747780678851175]

  0%|          | 0/59 [00:00<?, ?it/s]

PRECISION= 0.6064020887728461 and N, regular ,IT= (104, 0.03918142548928643, 59)
 73%|█████▊  | 11/15 [58:36<21:08, 317.14s/trial, best loss: 0.2747780678851175]

  0%|          | 0/79 [00:00<?, ?it/s]

PRECISION= 0.6400731070496084 and N, regular ,IT= (122, 0.03824201755745853, 79)
 80%|████▊ | 12/15 [1:04:14<16:10, 323.48s/trial, best loss: 0.2747780678851175]

  0%|          | 0/72 [00:00<?, ?it/s]

PRECISION= 0.7020992167101827 and N, regular ,IT= (163, 0.039864802019382355, 72)
 87%|█████▏| 13/15 [1:09:43<10:49, 324.92s/trial, best loss: 0.2747780678851175]

  0%|          | 0/69 [00:00<?, ?it/s]

PRECISION= 0.7364699738903394 and N, regular ,IT= (192, 0.008393368204954604, 69)
 93%|█████▌| 14/15 [1:15:03<05:23, 323.51s/trial, best loss: 0.2635300261096606]

  0%|          | 0/76 [00:00<?, ?it/s]

PRECISION= 0.6654516971279374 and N, regular ,IT= (137, 0.04461031443739603, 76)
100%|██████| 15/15 [1:20:29<00:00, 321.98s/trial, best loss: 0.2635300261096606]
TPE result:  {'IT': 69, 'N': 192, 'regular': 0.008393368204954604}
CPU times: user 4h 20min 51s, sys: 1h 20min 35s, total: 5h 41min 26s
Wall time: 1h 20min 29s


In [153]:
# best {'IT': 47, 'N': 62}
model = BayesianPersonalizedRanking(factors=62, 
                                regularization=0.01,
                                learning_rate=0.01,
                                iterations=47, 
                                num_threads=4)

model.fit(csr_matrix(user_item_matrix).T.tocsr(), show_progress=True)

  0%|          | 0/47 [00:00<?, ?it/s]

In [154]:
%%time
result['bpr_bm50'] = result['user_id'].apply(lambda x: get_recommendations(x, model=model, N=50))

print("Precision:",result.apply(lambda row: precision_at_k(row['bpr_bm50'], row['actual']), axis=1).mean())

Precision: 0.5803655352480418
CPU times: user 15min 41s, sys: 4min 55s, total: 20min 36s
Wall time: 5min 1s


In [155]:
# best {'IT': 57, 'N': 198}
model = AlternatingLeastSquares(factors=198, 
                                regularization=0.0366,
                                iterations=57, 
                                calculate_training_loss=True, 
                                num_threads=0)

model.fit(csr_matrix(user_item_matrix).T.tocsr(),  # На вход item-user matrix
          show_progress=True)

  0%|          | 0/57 [00:00<?, ?it/s]

In [156]:
%%time
result['als_bm50'] = result['user_id'].apply(lambda x: get_recommendations(x, model=model, N=50))

print("Precision:",result.apply(lambda row: precision_at_k(row['als_bm50'], row['actual']), axis=1).mean())

Precision: 0.9727415143603133
CPU times: user 15min 45s, sys: 4min 54s, total: 20min 39s
Wall time: 5min 1s


In [157]:
result.head(5)

Unnamed: 0,user_id,actual,popular_recommendation,BPR_test_1,bpr_bm50,als_bm50
0,1,"[825123, 999999, 845307, 852014, 856942, 99102...","[1029743, 916122, 5569230, 1106523, 844179]","[986912, 1029743, 1062002, 5569374, 940947, 10...","[912704, 852856, 1022097, 854852, 908318, 5568...","[8090521, 1029743, 1004906, 5569374, 10149640,..."
1,6,"[851819, 851903, 863447, 876232, 907099, 99079...","[1029743, 916122, 5569230, 1106523, 844179]","[1044078, 12301109, 965267, 896613, 878996, 11...","[930118, 878996, 866871, 896613, 965267, 10237...","[12301109, 844179, 1044078, 874972, 1023720, 8..."
2,7,"[999999, 1020581, 1029743, 1040183, 1068504, 1...","[1029743, 916122, 5569230, 1106523, 844179]","[1122358, 893018, 1029743, 866211, 1106523, 11...","[1029743, 916122, 866211, 878996, 1126899, 893...","[1122358, 1044078, 12810393, 985999, 1106523, ..."
3,8,"[999999, 841220, 860501, 888543, 902094, 90813...","[1029743, 916122, 5569230, 1106523, 844179]","[12301109, 1004906, 1029743, 5569230, 844179, ...","[916122, 844179, 12810393, 1029743, 985999, 90...","[12301109, 823704, 1029743, 12810393, 1081177,..."
4,9,"[882190, 949294, 999999, 1070820, 5568845, 556...","[1029743, 916122, 5569230, 1106523, 844179]","[1029743, 5569230, 1070820, 8090521, 862799, 8...","[1029743, 5569230, 8090521, 8090537, 1106523, ...","[1029743, 1070820, 5569230, 862799, 893018, 80..."


In [158]:
def calc_recall(df_data, top_k):
    for col_name in df_data.columns[2:]:
        yield col_name, df_data.apply(lambda row: recall_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()
        
def calc_precision(df_data, top_k):
    for col_name in df_data.columns[2:]:
        yield col_name, df_data.apply(lambda row: precision_at_k(row[col_name], row[ACTUAL_COL], k=top_k), axis=1).mean()

In [159]:
TOPK_RECALL = 50

In [160]:
sorted(calc_recall(result, TOPK_RECALL), key=lambda x: x[1],reverse=True)

[('als_bm50', 0.3346946480306026),
 ('BPR_test_1', 0.288437643542503),
 ('bpr_bm50', 0.12181038694375651),
 ('popular_recommendation', 0.019817168923891626)]

In [161]:
TOPK_PRECISION = 5

In [162]:
sorted(calc_precision(result, TOPK_PRECISION), key=lambda x: x[1],reverse=True)

[('als_bm50', 0.9727415143603133),
 ('BPR_test_1', 0.9473629242819843),
 ('bpr_bm50', 0.5803655352480418),
 ('popular_recommendation', 0.437597911227154)]

# Init/train recommender

In [163]:
recommender = MainRecommender(data_train_matcher)

  0%|          | 0/15 [00:00<?, ?it/s]

  0%|          | 0/4999 [00:00<?, ?it/s]

In [164]:
result_eval_matcher = data_val_matcher.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_matcher.columns=[USER_COL, ACTUAL_COL]
result_eval_matcher.head(2)

Unnamed: 0,user_id,actual
0,1,"[853529, 865456, 867607, 872137, 874905, 87524..."
1,6,"[1024306, 1102949, 6548453, 835394, 940804, 96..."


In [165]:
%%time
# для понятности расписано все в строчку, без функций, ваша задача уметь оборачивать все это в функции
result_eval_matcher['own_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_own_recommendations(x, N=N_PREDICT))
result_eval_matcher['sim_item_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_similar_items_recommendation(x, N=N_PREDICT))
result_eval_matcher['als_rec'] = result_eval_matcher[USER_COL].apply(lambda x: recommender.get_als_recommendations(x, N=N_PREDICT))

popular_recs = popularity_recommendation(data_val_matcher, n=50)
result_eval_matcher['best_rec']=result_eval_matcher[USER_COL].apply(lambda x: popular_recs )

CPU times: user 1min 22s, sys: 27.4 s, total: 1min 49s
Wall time: 23 s


In [166]:
result_eval_matcher['als_50']=result_eval_matcher['user_id'].apply(lambda x: get_recommendations(x, model=model, N=50))

In [167]:
# объединение TOP-50 2-х рекомендательных систем (дополняем популярными товарами)
top50 = popularity_recommendation(data_val_matcher, n=N_PREDICT)
data_list=[]
for i in range(len(result_eval_matcher)):
    intersect=list(np.intersect1d(result_eval_matcher['own_rec'][i], result_eval_matcher['als_rec'][i]))
    if len(intersect)<N_PREDICT:
        intersect+= top50[:(N_PREDICT-len(intersect))]
                            
    data_list.append(intersect)
result_eval_matcher.insert(2, 'own_als', data_list)
result_eval_matcher.head()

Unnamed: 0,user_id,actual,own_als,own_rec,sim_item_rec,als_rec,best_rec,als_50
0,1,"[853529, 865456, 867607, 872137, 874905, 87524...","[856942, 1104349, 1124029, 5577022, 8090541, 8...","[856942, 9297615, 5577022, 877391, 9655212, 10...","[842762, 1007512, 904833, 5577022, 888210, 983...","[1135983, 1125943, 8293439, 1037332, 5569993, ...","[6534178, 6533889, 1029743, 6534166, 6533765, ...","[8090521, 1029743, 1004906, 5569374, 10149640,..."
1,6,"[1024306, 1102949, 6548453, 835394, 940804, 96...","[1037337, 1084036, 1098844, 13002975, 13003092...","[13003092, 995598, 923600, 972416, 1084036, 11...","[948650, 5569845, 9835606, 941361, 1074754, 11...","[854852, 878996, 1026118, 933637, 863632, 9652...","[6534178, 6533889, 1029743, 6534166, 6533765, ...","[12301109, 844179, 1044078, 874972, 1023720, 8..."
2,7,"[836281, 843306, 845294, 914190, 920456, 93886...","[6533878, 7147142, 9338009, 9803591, 10285022,...","[998519, 894360, 7147142, 9338009, 896666, 939...","[1012587, 5565612, 1044078, 995478, 896027, 83...","[10285022, 1039627, 1044188, 1098694, 1100140,...","[6534178, 6533889, 1029743, 6534166, 6533765, ...","[1122358, 1044078, 12810393, 985999, 1106523, ..."
3,8,"[868075, 886787, 945611, 1005186, 1008787, 101...","[981660, 6534178, 6533889, 1029743, 6534166, 6...","[12808385, 939860, 981660, 7410201, 5577022, 6...","[5569845, 5569374, 1044078, 937526, 1011459, 8...","[916122, 981660, 851528, 1070845, 1029743, 844...","[6534178, 6533889, 1029743, 6534166, 6533765, ...","[12301109, 823704, 1029743, 12810393, 1081177,..."
4,9,"[883616, 1029743, 1039126, 1051323, 1082772, 1...","[893018, 896085, 1029743, 6039859, 6534030, 65...","[872146, 918046, 9655676, 985622, 1056005, 109...","[832442, 1074754, 901062, 904493, 852080, 7138...","[891516, 1028238, 9935616, 861899, 1029743, 90...","[6534178, 6533889, 1029743, 6534166, 6533765, ...","[1029743, 1070820, 5569230, 862799, 893018, 80..."


In [168]:
sorted(calc_precision(result_eval_matcher, TOPK_PRECISION), key=lambda x: x[1],reverse=True)

[('own_rec', 0.18872062663185377),
 ('own_als', 0.1807832898172324),
 ('als_50', 0.1639686684073107),
 ('als_rec', 0.12741514360313316),
 ('best_rec', 0.1139425587467363),
 ('sim_item_rec', 0.06099216710182768)]

In [169]:
sorted(calc_precision(result_eval_matcher, TOPK_PRECISION), key=lambda x: x[1],reverse=True)

[('own_rec', 0.18872062663185377),
 ('own_als', 0.1807832898172324),
 ('als_50', 0.1639686684073107),
 ('als_rec', 0.12741514360313316),
 ('best_rec', 0.1139425587467363),
 ('sim_item_rec', 0.06099216710182768)]

# Ranking part

### Обучаем модель 2-ого уровня на выбранных кандидатах

- Обучаем на data_train_ranking
- Обучаем *только* на выбранных кандидатах
- Я *для примера* сгенерирую топ-50 кадидиатов через get_own_recommendations
- (!) Если юзер купил < 50 товаров, то get_own_recommendations дополнит рекоммендации топ-популярными

In [170]:
# -- давние покупки -- | -- 6 недель -- | -- 3 недель -- 

## Подготовка данных для трейна

In [171]:
# взяли пользователей из трейна для ранжирования
df_match_candidates = pd.DataFrame(data_train_ranker[USER_COL].unique())
df_match_candidates.columns = [USER_COL]

In [172]:
df_match_candidates=df_match_candidates.merge(result_eval_matcher[['user_id', 'own_als']], how='left', on='user_id')
df_match_candidates.rename(columns={'own_als': 'candidates'}, inplace=True)

In [173]:
df_match_candidates.head(2)

Unnamed: 0,user_id,candidates
0,2070,"[917033, 926905, 970866, 1008814, 1016800, 101..."
1,2021,"[883932, 950935, 1025535, 1096635, 1119454, 98..."


In [174]:
# разворачиваем товары
df_items = df_match_candidates.apply(lambda x: pd.Series(x['candidates']), axis=1).stack().reset_index(level=1, drop=True)
df_items.name = 'item_id'

In [175]:
df_match_candidates = df_match_candidates.drop('candidates', axis=1).join(df_items)

In [176]:
df_match_candidates.head(4)

Unnamed: 0,user_id,item_id
0,2070,917033
0,2070,926905
0,2070,970866
0,2070,1008814


### Check warm start

In [177]:
print_stats_data(df_match_candidates, 'match_candidates')

match_candidates
Shape: (95750, 2) Users: 1915 Items: 3182


### Создаем трейн сет для ранжирования с учетом кандидатов с этапа 1 

In [178]:
df_ranker_train = data_train_ranker[[USER_COL, ITEM_COL]].copy()
df_ranker_train['target'] = 1  # тут только покупки 

df_ranker_train.head()

Unnamed: 0,user_id,item_id,target
2104867,2070,1019940,1
2107468,2021,840361,1
2107469,2021,856060,1
2107470,2021,869344,1
2107471,2021,896862,1


In [179]:
df_ranker_train = df_match_candidates.merge(df_ranker_train, on=[USER_COL, ITEM_COL], how='left')

# чистим дубликаты
df_ranker_train = df_ranker_train.drop_duplicates(subset=[USER_COL, ITEM_COL])

df_ranker_train['target'].fillna(0, inplace= True)

In [180]:
df_ranker_train.target.value_counts()

0.0    84848
1.0    10357
Name: target, dtype: int64

In [181]:
df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target
0,2070,917033,0.0
1,2070,926905,0.0


In [182]:
df_ranker_train['target'].mean()

0.10878630324037603

In [183]:
df_ranker_train = df_ranker_train.merge(item_features, on='item_id', how='left')
df_ranker_train = df_ranker_train.merge(user_features, on='user_id', how='left')

df_ranker_train.head(2)

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,marital_status_code,income_desc,homeowner_desc,hh_comp_desc,household_size_desc,kid_category_desc
0,2070,917033,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown
1,2070,926905,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,U,50-74K,Unknown,Unknown,1,None/Unknown


In [184]:
df_join_train_matcher.head()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0
2,2375,26984851472,1,1036325,1,0.99,364,-0.3,1631,1,0.0,0.0
3,2375,26984851472,1,1082185,1,1.21,364,0.0,1631,1,0.0,0.0
4,2375,26984851472,1,8160430,1,1.5,364,-0.39,1631,1,0.0,0.0


In [185]:
df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('sales_value').sum().rename('total_item_sales_value'), how='left',on=ITEM_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().rename('total_quantity_value'), how='left',on=ITEM_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg(USER_COL).count().rename('item_freq'), how='left',on=ITEM_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg(USER_COL).count().rename('user_freq'), how='left',on=USER_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('sales_value').sum().rename('total_user_sales_value'), how='left',on=USER_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().rename('item_quantity_per_week')/df_join_train_matcher.week_no.nunique(), how='left',on=ITEM_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('quantity').sum().rename('user_quantity_per_week')/df_join_train_matcher.week_no.nunique(), how='left',on=USER_COL)


df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg('quantity').sum().rename('item_quantity_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=ITEM_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg('quantity').sum().rename('user_quantity_per_baskter')/df_join_train_matcher.basket_id.nunique(), how='left',on=USER_COL)


df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=ITEM_COL).agg(USER_COL).count().rename('item_freq_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=ITEM_COL)

df_ranker_train = df_ranker_train.merge(df_join_train_matcher.groupby(by=USER_COL).agg(USER_COL).count().rename('user_freq_per_basket')/df_join_train_matcher.basket_id.nunique(), how='left',on=USER_COL)



In [186]:
df_ranker_train.head()

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,total_quantity_value,item_freq,user_freq,total_user_sales_value,item_quantity_per_week,user_quantity_per_week,item_quantity_per_basket,user_quantity_per_baskter,item_freq_per_basket,user_freq_per_basket
0,2070,917033,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,481,267,1996,5754.86,5.285714,1218.32967,0.001962,0.452137,0.001089,0.00814
1,2070,926905,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,786,553,1996,5754.86,8.637363,1218.32967,0.003205,0.452137,0.002255,0.00814
2,2070,970866,0.0,5612,GROCERY,National,SUGARS/SWEETNERS,SWEETENERS,9.7 OZ,45-54,...,158,152,1996,5754.86,1.736264,1218.32967,0.000644,0.452137,0.00062,0.00814
3,2070,1008814,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,215,150,1996,5754.86,2.362637,1218.32967,0.000877,0.452137,0.000612,0.00814
4,2070,1016800,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,745,450,1996,5754.86,8.186813,1218.32967,0.003038,0.452137,0.001835,0.00814


In [187]:
X_train = df_ranker_train.drop('target', axis=1)
y_train = df_ranker_train[['target']]

In [188]:
cat_feats = X_train.columns[2:].tolist()
X_train[cat_feats] = X_train[cat_feats].astype('category')

In [189]:
cat_feats

['manufacturer',
 'department',
 'brand',
 'commodity_desc',
 'sub_commodity_desc',
 'curr_size_of_product',
 'age_desc',
 'marital_status_code',
 'income_desc',
 'homeowner_desc',
 'hh_comp_desc',
 'household_size_desc',
 'kid_category_desc',
 'total_item_sales_value',
 'total_quantity_value',
 'item_freq',
 'user_freq',
 'total_user_sales_value',
 'item_quantity_per_week',
 'user_quantity_per_week',
 'item_quantity_per_basket',
 'user_quantity_per_baskter',
 'item_freq_per_basket',
 'user_freq_per_basket']

## Обучение модели ранжирования

In [190]:
%%time
lgb = LGBMClassifier(objective='binary',
                     max_depth=47,
                     n_estimators=1145,
                     learning_rate=0.18924,
                     categorical_column=cat_feats)

lgb.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


CPU times: user 2min 9s, sys: 5.2 s, total: 2min 15s
Wall time: 12.4 s


LGBMClassifier(categorical_column=['manufacturer', 'department', 'brand',
                                   'commodity_desc', 'sub_commodity_desc',
                                   'curr_size_of_product', 'age_desc',
                                   'marital_status_code', 'income_desc',
                                   'homeowner_desc', 'hh_comp_desc',
                                   'household_size_desc', 'kid_category_desc',
                                   'total_item_sales_value',
                                   'total_quantity_value', 'item_freq',
                                   'user_freq', 'total_user_sales_value',
                                   'item_quantity_per_week',
                                   'user_quantity_per_week',
                                   'item_quantity_per_basket',
                                   'user_quantity_per_baskter',
                                   'item_freq_per_basket',
                                   'user_fre

In [191]:
train_preds = lgb.predict_proba(X_train)

In [192]:
df_ranker_predict = df_ranker_train.copy()

In [193]:
df_ranker_predict['proba_item_purchase'] = train_preds[:,1]

In [194]:
df_ranker_predict.head()

Unnamed: 0,user_id,item_id,target,manufacturer,department,brand,commodity_desc,sub_commodity_desc,curr_size_of_product,age_desc,...,item_freq,user_freq,total_user_sales_value,item_quantity_per_week,user_quantity_per_week,item_quantity_per_basket,user_quantity_per_baskter,item_freq_per_basket,user_freq_per_basket,proba_item_purchase
0,2070,917033,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,267,1996,5754.86,5.285714,1218.32967,0.001962,0.452137,0.001089,0.00814,0.002117
1,2070,926905,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,553,1996,5754.86,8.637363,1218.32967,0.003205,0.452137,0.002255,0.00814,0.002533
2,2070,970866,0.0,5612,GROCERY,National,SUGARS/SWEETNERS,SWEETENERS,9.7 OZ,45-54,...,152,1996,5754.86,1.736264,1218.32967,0.000644,0.452137,0.00062,0.00814,0.000912
3,2070,1008814,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,150,1996,5754.86,2.362637,1218.32967,0.000877,0.452137,0.000612,0.00814,0.000675
4,2070,1016800,0.0,103,GROCERY,National,SOFT DRINKS,SOFT DRINKS 12/18&15PK CAN CAR,12 OZ,45-54,...,450,1996,5754.86,8.186813,1218.32967,0.003038,0.452137,0.001835,0.00814,0.001704


## Подборка параметров

In [195]:

#LGBMClassifier
space = [hp.randint('N', 300, 1000), 
        hp.randint('D', 10, 35),
        hp.uniform('rate', 0.10, 0.3)]


mmm=0
def f(args):    
    N, D,rate = args
    
    
    lgb = LGBMClassifier(objective='binary',
                     max_depth=D,
                     n_estimators=N,
                     learning_rate=rate,
                     categorical_column=cat_feats)
    
    lgb.fit(X_train, y_train)
    
    mmm=+1
    train_preds = lgb.predict_proba(X_train)
    df_ranker_predict = df_ranker_train.copy()
    df_ranker_predict['proba_item_purchase'] = train_preds[:,1]
    
    result_eval_ranker['reranked_own_rec_'+str(mmm)] = result_eval_ranker[USER_COL].apply(lambda user_id: df_ranker_predict[df_ranker_predict[USER_COL]==user_id].sort_values('proba_item_purchase', ascending=False).head(5).item_id.tolist())
    
    PRECISION=result_eval_ranker.apply(lambda row: precision_at_k(row['reranked_own_rec_'+str(mmm)], row['actual']), axis=1).mean()
    
    print(f'PRECISION= {PRECISION}, n,d,rate={N, D, rate}')
    
    return 1-PRECISION

In [196]:
%%time

trials = Trials()

best = fmin(f, space, algo = tpe.suggest, max_evals=20, trials=trials)
print ('TPE result: ', best)

  0%|                                    | 0/20 [00:00<?, ?trial/s, best loss=?]

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.30475195822454315, n,d,rate=(521, 33, 0.2101512855757734)          
  5%|▌         | 1/20 [00:11<03:29, 11.04s/trial, best loss: 0.6952480417754568]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.311958224543081, n,d,rate=(817, 17, 0.2911764390719052)            
 10%|█          | 2/20 [00:25<03:57, 13.21s/trial, best loss: 0.688041775456919]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.31300261096605747, n,d,rate=(909, 24, 0.20798795093853278)         
 15%|█▌        | 3/20 [00:41<04:03, 14.31s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.3079895561357702, n,d,rate=(562, 24, 0.2905103278863146)           
 20%|██        | 4/20 [00:51<03:22, 12.66s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.307467362924282, n,d,rate=(884, 23, 0.21803256668455)              
 25%|██▌       | 5/20 [01:06<03:23, 13.60s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.3057963446475196, n,d,rate=(814, 28, 0.25238201148642436)          
 30%|███       | 6/20 [01:20<03:09, 13.56s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.29712793733681464, n,d,rate=(596, 20, 0.16093533375809227)         
 35%|███▌      | 7/20 [01:29<02:36, 12.05s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.309869451697128, n,d,rate=(769, 27, 0.20025385564430304)           
 40%|████      | 8/20 [01:41<02:27, 12.27s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.31039164490861626, n,d,rate=(672, 24, 0.2512407428956941)          
 45%|████▌     | 9/20 [01:55<02:20, 12.74s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.31174934725848563, n,d,rate=(941, 14, 0.25035237288665924)         
 50%|████▌    | 10/20 [02:13<02:23, 14.33s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.30224543080939953, n,d,rate=(491, 31, 0.24526069389072247)         
 55%|████▉    | 11/20 [02:23<01:57, 13.04s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.307780678851175, n,d,rate=(559, 24, 0.27994222220500914)           
 60%|█████▍   | 12/20 [02:35<01:40, 12.50s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.29002610966057446, n,d,rate=(492, 31, 0.16117699564072324)         
 65%|█████▊   | 13/20 [02:43<01:19, 11.41s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.29973890339425585, n,d,rate=(836, 11, 0.13820246164870914)         
 70%|██████▎  | 14/20 [02:56<01:10, 11.72s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.3030809399477807, n,d,rate=(595, 18, 0.1013808328970103)           
 75%|██████▊  | 15/20 [03:06<00:56, 11.20s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.292532637075718, n,d,rate=(745, 13, 0.15503242394005037)           
 80%|███████▏ | 16/20 [03:18<00:45, 11.41s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.3005744125326371, n,d,rate=(995, 12, 0.16497952879299757)          
 85%|███████▋ | 17/20 [03:35<00:39, 13.17s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.309869451697128, n,d,rate=(914, 34, 0.28879572683857824)           
 90%|████████ | 18/20 [03:51<00:28, 14.15s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.30255874673629246, n,d,rate=(913, 11, 0.12644002832231546)         
 95%|████████▌| 19/20 [04:04<00:13, 13.63s/trial, best loss: 0.6869973890339425]

  return flags.sum() / len(recommended_list)

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)




PRECISION= 0.2982767624020888, n,d,rate=(667, 33, 0.14610273498830495)          
100%|█████████| 20/20 [04:13<00:00, 12.70s/trial, best loss: 0.6869973890339425]
TPE result:  {'D': 24, 'N': 909, 'rate': 0.20798795093853278}
CPU times: user 37min 7s, sys: 1min 40s, total: 38min 48s
Wall time: 4min 13s


  return flags.sum() / len(recommended_list)



In [197]:
result_eval_ranker = data_val_ranker.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_eval_ranker.columns=[USER_COL, ACTUAL_COL]
result_eval_ranker.head(2)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."


In [198]:
result_eval_ranker=result_eval_ranker.merge(result_eval_matcher[['user_id', 'own_als']], how='left', on='user_id')
result_eval_ranker

Unnamed: 0,user_id,actual,own_als
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[856942, 1104349, 1124029, 5577022, 8090541, 8..."
1,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[1037337, 1084036, 1098844, 13002975, 13003092..."
2,7,"[840386, 889774, 898068, 909714, 929067, 95347...","[6533878, 7147142, 9338009, 9803591, 10285022,..."
3,8,"[835098, 872137, 910439, 924610, 992977, 10412...","[981660, 6534178, 6533889, 1029743, 6534166, 6..."
4,9,"[864335, 990865, 1029743, 9297474, 10457112, 8...","[893018, 896085, 1029743, 6039859, 6534030, 65..."
...,...,...,...
1910,2496,[6534178],"[1120928, 6534178, 6533889, 1029743, 6534166, ..."
1911,2497,"[1016709, 9835695, 1132298, 16809501, 845294, ...","[870515, 1050741, 6534178, 6533889, 1029743, 6..."
1912,2498,"[15716530, 834484, 901776, 914190, 958382, 972...","[5565356, 8119004, 6534178, 6533889, 1029743, ..."
1913,2499,"[867188, 877580, 902396, 914190, 951590, 95813...","[941797, 1015280, 1060872, 5570048, 6534178, 6..."


In [199]:
# Используем товары, вероятность которых больше 50%. Если их менее 5ти, то добавляем из списка рекомендованных. 

populal_rec=popularity_recommendation(data, n=5)
def rerank(user_id):
    df_=df_ranker_predict[df_ranker_predict[USER_COL]==user_id].sort_values('proba_item_purchase', ascending=False)
    df_=list(set(df_.loc[df_.proba_item_purchase >0.5].item_id.tolist()))
    if len(df_)<5:
        df_=df_+populal_rec[:(5-len(df_))]
    return df_[:5]

In [200]:
result_eval_ranker['reranked_own_rec'] = result_eval_ranker[USER_COL].apply(lambda user_id: rerank(user_id))

In [201]:
print(*sorted(calc_precision(result_eval_ranker, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('reranked_own_rec', 0.2978590078328982)
('own_als', 0.13994778067885116)


# Оценка на тесте для выполнения курсового проекта

In [202]:
df_test = pd.read_csv('retail_test1.csv')

In [203]:
df_test.head()

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,1340,41652823310,664,912987,1,8.49,446,0.0,52,96,0.0,0.0
1,588,41652838477,664,1024426,1,6.29,388,0.0,8,96,0.0,0.0
2,2070,41652857291,664,995242,5,9.1,311,-0.6,46,96,0.0,0.0
3,1602,41665647035,664,827939,1,7.99,334,0.0,1741,96,0.0,0.0
4,1602,41665647035,664,927712,1,0.59,334,-0.4,1741,96,0.0,0.0


In [62]:
result_test = df_test.groupby(USER_COL)[ITEM_COL].unique().reset_index()
result_test.columns=[USER_COL, ACTUAL_COL]
result_test.head(2)

Unnamed: 0,user_id,actual
0,1,"[880007, 883616, 931136, 938004, 940947, 94726..."
1,2,"[820165, 820291, 826784, 826835, 829009, 85784..."


In [204]:
# ищем общих пользователей
common_users1= list(set(df_test.user_id.values)&set(common_users))

# оставляем общих пользователей
result_test = result_test[result_test.user_id.isin(common_users1)]

In [205]:
result_test['reranked_own_rec'] = result_test[USER_COL].apply(lambda user_id: rerank(user_id))

In [206]:
print(*sorted(calc_precision(result_test, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('reranked_own_rec', 0.2675886951292844)


### Полученная метрика 0.267 больше целевой метрики 0.235, цель проекта достигнута

In [63]:
result_test['reranked_own_rec'] = result_test[USER_COL].apply(lambda user_id: rerank(user_id))

In [64]:
print(*sorted(calc_precision(result_test, TOPK_PRECISION), key=lambda x: x[1], reverse=True), sep='\n')

('reranked_own_rec', 0.16531531531531532)


  return flags.sum() / len(recommended_list)


In [75]:
sorted(calc_precision(result_eval_matcher, TOPK_PRECISION), key=lambda x: x[1],reverse=True)

[('own_rec', 0.2097629009762901),
 ('als_rec', 0.09474662947466295),
 ('sim_item_rec', 0.06517898651789865)]