## ***Related Products***

## Table of Content

[Overview](#overview)   
[Data Understanding](#data_understanding)   
[Data Preparation](#data_preparation)      
[Modeling for Recommendation](#recom)  
[Evaluation](#evaluation)  
[Further Ideas](#further)   
[Thanks](#thanks)   

## **Overview <a class="anchor" id="purpose"></a>**

In this project, a recommendation system is designed by utilizing Apriori Algorithm, Cosine Similarity and Popularity based product recommendation strategies. 

Almost all of the steps are explained in comment lines.

## **Data Understanding <a class="anchor" id="data_understanding"></a>**

In this part, the dataset provided by HepsiBurada is loaded an examined.

In [293]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from sklearn.model_selection import train_test_split
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import json

**Loading Dataset**

❗️Please do not forget to download files and update the *events_data_path* and *meta_data_path* variables accordingly.

In [270]:
# read events dataset which has been converted to .csv format. 
events_data_path = None
data = pd.read_csv('/content/drive/MyDrive/hepsiburada/events.csv')

# drop empty rows
data = data.dropna()

# convert eventtime to datetime type for later usage
data['eventtime'] = pd.to_datetime(data['eventtime'])

# remove duplicated items in one session
data = data.sort_values(by=['sessionid'])
data = data.drop_duplicates(subset=['sessionid', 'productid'], keep='first')

# split dataset into train and test to measure the model performance later
train_data, test_data = train_test_split(data, test_size = 0.4, shuffle=False)

# use data var. name for convenient usage
data = train_data


# read meta.json file and create a data frame for product information
meta_data_path = None
f = open('/content/drive/MyDrive/hepsiburada/meta.json')
meta_data = json.load(f)
product_df = pd.DataFrame(meta_data['meta'])

# extract product ids
product_ids = product_df.productid.to_list()

**Analysis and Statistics of Dataset**

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163485 entries, 278045 to 64908
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype              
---  ------      --------------   -----              
 0   Unnamed: 0  163485 non-null  int64              
 1   event       163485 non-null  int64              
 2   sessionid   163485 non-null  object             
 3   eventtime   163485 non-null  datetime64[ns, UTC]
 4   price       163485 non-null  float64            
 5   productid   163485 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(1), int64(2), object(2)
memory usage: 8.7+ MB


In [4]:
data.describe()

Unnamed: 0.1,Unnamed: 0,event,price
count,163485.0,163485.0,163485.0
mean,199430.181136,1.0,13.052104
std,111046.96929,0.0,16.158111
min,8.0,1.0,0.25
25%,104495.0,1.0,3.99
50%,203153.0,1.0,8.4
75%,295607.0,1.0,16.99
max,387655.0,1.0,599.0


In [5]:
len(data)

163485

In [6]:
product_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10236 entries, 0 to 10235
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   productid    10235 non-null  object
 1   brand        9777 non-null   object
 2   category     10235 non-null  object
 3   subcategory  10235 non-null  object
 4   name         10235 non-null  object
dtypes: object(5)
memory usage: 400.0+ KB


In [7]:
product_df.describe()

Unnamed: 0,productid,brand,category,subcategory,name
count,10235,9777,10235,10235,10235
unique,10235,789,20,132,10123
top,HBV00000PNGJG,Carrefour,Atıştırmalık,Saç Bakımı,Carrefour Yumurta 30'lu M Boy
freq,1,396,1113,556,3


## **Data Preperation <a class="anchor" id="data_preparation"></a>**

 Several different operations such as dataset splitting, removing duplicates, extracting uniques etc. are performed in this part. These process eased the way to manage the project. In addition to that, it provided an insight to dataset which appears to be very large. 

In this dataset, each **'session'** is considered as a cart(market basket). All of the other steps are followed based on this assumption.

In [281]:
# create a merged dataframe including information of events and products
data = data.merge(product_df[['productid','subcategory', 'category', 'name']], on='productid', how='left')
data.dropna(inplace=True, subset=['subcategory', 'category'])

# create a merged test dataframe including information of events and products
test_data = test_data.merge(product_df[['productid','subcategory', 'category', 'name']], on='productid', how='left')
data.dropna(inplace=True, subset=['subcategory', 'category'])

data.head(2)

Unnamed: 0.1,Unnamed: 0,event,sessionid,eventtime,price,productid,subcategory,category,name
0,278045,1,000280f4-62fc-4dcd-b51d-c66ac14d7d8c,2020-06-07 14:30:58.804000+00:00,9.99,HBV00000NE1WT,Zeytin,Kahvaltılık ve Süt,Fora Gemlik Doğal Yağlı Salamura Siyah Zeytin ...
1,322275,1,0002e53b-1f60-4309-8380-31ca03de51f8,2020-06-06 17:51:18.003000+00:00,22.48,HBV00000NVZGQ,Kırmızı Et,"Et, Balık, Şarküteri",Dana Antrikot 250 gr


In [9]:
# extract number of unique products, subcategories and categories
len(data['productid'].unique()), len(data['subcategory'].unique()), len(data['category'].unique())

(9348, 132, 20)

In [10]:
# show an example session with bought products and eventtime
example_session = data[data['sessionid'] == '0002e53b-1f60-4309-8380-31ca03de51f8'][['name', 'eventtime']]

# print example
print(example_session)

                             name                        eventtime
1            Dana Antrikot 250 gr 2020-06-06 17:51:18.003000+00:00
2  Aytaç Şipşak Macar Salam 60 gr 2020-06-06 17:52:42.480000+00:00


In [11]:
# create a set of unique items bought in one session
transactions = data.groupby(['sessionid']).name.unique()

transactions.head()

sessionid
000280f4-62fc-4dcd-b51d-c66ac14d7d8c    [Fora Gemlik Doğal Yağlı Salamura Siyah Zeytin...
0002e53b-1f60-4309-8380-31ca03de51f8    [Dana Antrikot 250 gr, Aytaç Şipşak Macar Sala...
0002ef34-6bee-4953-874b-8298ec26b625      [Papatya Ekmek 500 gr, Tereyağlı Kruvasan 5'Li]
000618de-d415-408c-863e-6124db43f529                                 [Dana Biftek 250 gr]
000770d6-c2d4-4ad2-bb2c-b35274bc5e7e                                 [Pınar Tereyağ 1 kg]
Name: name, dtype: object

In [12]:
# convert the pandas series to list of lists
transactions = transactions.tolist()

In the below cell, the session(cart) statistics are observed to develop a proper algorithm.

In [13]:
print('Number of transactions: ', len(transactions))

counts = [len(transaction) for transaction in transactions]
print('Number of unique items in each transaction: ', counts)

print('Median number of items in a transaction: ', np.median(counts))

print('Mean number of items in a transaction: ', np.mean(counts))

print('Max number of items in a transaction: ', np.max(counts) )

Number of transactions:  32797
Number of unique items in each transaction:  [1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 4, 55, 14, 11, 1, 3, 3, 1, 6, 2, 1, 2, 6, 1, 4, 1, 45, 2, 4, 1, 1, 1, 1, 3, 2, 1, 1, 3, 1, 2, 1, 1, 2, 22, 1, 1, 2, 1, 6, 1, 1, 1, 17, 1, 2, 2, 1, 1, 1, 1, 3, 20, 1, 1, 4, 3, 7, 6, 1, 4, 3, 6, 7, 2, 1, 1, 8, 2, 7, 8, 1, 13, 1, 4, 5, 1, 2, 1, 7, 13, 3, 11, 1, 16, 2, 3, 4, 14, 1, 1, 2, 10, 1, 1, 1, 18, 2, 3, 21, 1, 6, 1, 2, 1, 33, 4, 7, 8, 1, 28, 11, 1, 7, 1, 13, 6, 2, 1, 7, 4, 1, 12, 4, 29, 11, 3, 4, 5, 2, 8, 5, 7, 6, 2, 3, 1, 24, 3, 46, 19, 2, 14, 6, 4, 1, 7, 1, 2, 5, 1, 12, 1, 1, 13, 1, 8, 3, 5, 12, 24, 1, 1, 1, 2, 1, 1, 1, 28, 2, 2, 25, 5, 3, 6, 1, 6, 1, 3, 14, 1, 2, 1, 1, 8, 1, 26, 1, 4, 3, 15, 8, 1, 7, 1, 10, 2, 14, 13, 2, 1, 1, 2, 1, 1, 1, 1, 3, 15, 1, 6, 2, 6, 1, 10, 3, 10, 1, 5, 1, 8, 9, 2, 15, 1, 2, 1, 1, 20, 1, 1, 1, 1, 10, 8, 7, 1, 16, 1, 3, 9, 9, 3, 1, 1, 1, 23, 2, 17, 1, 3, 1, 1, 23, 1, 5, 2, 1, 1, 2, 4, 1, 1, 2,

## **Modeling for Recommendation <a class="anchor" id="recom"></a>**

A combined model is develoeped to build a system that can make recommendations based on the products added by customer in a session. 

Apriori Algorithms, which is commonly used to do market basket analysis. By applying this algorithm to train set, I obtained thousands of rules that represents the pattern in the shopping process. These rules are extracted by using Lift score which bidrectionally carries information about probability of a product being purchased when the other pair product is purchased. I thought this relation could give an idea about people's purchase habit. After, antecedent and consequent products are extracted, these are sort based on the confidence score.

Since there are thousands of unique products in this dataset, some of these products are not occured in antecedents of assocation rules. So, another approach is implemented to make a recommendation based on a product. Thus, Cosine Similarity based recommendation is applied to the products which did not occur in assocation rules. Different from apriori algorithm, the cart is analyzed by using the textual description created for each product. Textual descriptions are composed of category + subcategory information of each product. 'Name' feature of products is not included in textual description as it might be too spesific to recommend a similar product. 'Name' feature is discarded as it could have cause the recommendation of exactly similar product. And finally, popoularity based recommendation is applied to dataset in the case of no recommendations from 'Apriori' and 'Similarity' strategies. The most popoular n items are recommended. 

These strategies are prioritized based on the quality of recommendations as follows;

1. Apriori Algorithm
2. Cosine Similarity
3. Popularity. 

Firstly, apriori algorithm is used to extract 10 recommendations and if there is not any recommendations obtained from apriori, the similarity is applied. And if there is still insufficent number of product recommendations,  the  popularity approach is applied. 

In [295]:
from mlxtend.preprocessing import TransactionEncoder

# create an encoder for apriori algorithm
encoder = TransactionEncoder()

# fit encoder to transactions
encoder.fit(transactions)

# create a one-hot encoded(True, Flase)  occurence matrix for sessions as rows and products as columns
occurence_matrix = encoder.transform(transactions)

# convert array to pandas dataframe.
occurence_matrix = pd.DataFrame(occurence_matrix, columns = encoder.columns_)

occurence_matrix.head()

Unnamed: 0,17091 Command Orta Boy Şeffaf Askı,17092 Command Küçük Boy Şeffaf Askı,3 Mucizevi Kil Saç Kremi 360 ml,3'lü Ekonomik Topraklı Grup Priz Anahtarlı 5 m,"3'Ü 1 Arada Onarıcı Ve Koruyucu Şampuan, Saç Kremi, Bakım Kürü 470 ml","3'Ü 1 Arada Şampuan, Saç Bakım Kürü Ve Saç Bakım Kremi Saç Dökülmelerine",30 Etkinlik - Betül Tok,360 Medium 1+1 Diş Fırçası,360° Gold 1+1 Soft Diş Fırçası,360° Optik Beyaz 1+1 Orta Diş Fırçası,360°Optik Beyaz Soft 1+1 Diş Fırçası,365 Aktivite,3Dw Luxe Intense Blast Diş Macunu 75 ml,5 No Nergiz Saksı (Kahve),6 Mucizevi Yağ Besleyici Saç Kremi 360 ml,6 Mucizevi Yağ Besleyici Saç Maskesi 300 ml,6 Mucizevi Yağ Besleyici Şampuan 360 ml,7 No Nergiz Saksı (Lila),7/24 Belirgin Bukle Bakım Kremi 300 ml,7/24 Onarıcı Bakım Bakım Kremi 300 ml,7Days Kruvasan Kakaolu 185 gr,7Days Kruvasan Vişneli Aile 185 gr,90 Etkinlik - Betül Tok,ACE Ultra Yoğun Kıvamlı Çamaşır Suyu Dağ Tazeliği 3Kg,Ace Power Jel Okaliptus 3 kg,Ace Ultra Yağ Çözücü Ferahlık Etkisi 750 Ml,Ace Çamaşır Suyu 1000 gr,Ace Çamaşır Suyu 2 lt,Ace Çamaşır Suyu 4 L,Ace Çamaşır Suyu Bahar Kokulu 4 L,Ace Çamaşır Suyu Ekstra Hijyen 4 lt,Ace Çamaşır Suyu Lavanta 2000 gr,Ace Çamaşır Suyu Lavanta 4 L,Ace Çamaşır Suyu Lavanta 4 Lt,Ace Çamaşır Suyu Lavanta Düşleri 2 lt,Ace Çamaşır Suyu Ultra Köpük Ferahlık 700 Ml,Ace Çamaşır Suyu Ultra Köpük Çiçek Kokulu 700 Ml,Ace Çamaşır Suyu Ultra Çiçek Kokulu 810 Gr,"Acetum Balsamik Sirkesi, 500Ml",Acqua Panna Dogal Mınerallı Su Cam 250 ml,...,Şölen Biscolata Mood 125 Gr,Şölen Biscolata Mood 40 Gr,Şölen Biscolata Nirvana Fındıklı 100 Gr,Şölen Biscolata Nirvana Roll Fındıklı 28 Gr,Şölen Biscolata Nirvana Roll Hindistan Cevizli 22.5 Gr,Şölen Biscolata Pia Kek Portakallı 100 Gr,Şölen Biscolata Pia Kek Çikolatalı 100 Gr,Şölen Biscolata Pia Çikolata 100 Gr,Şölen Biscolata Stix Pirinç Patlaklı 34 Gr,Şölen Biscolata Stix Sade 40 Gr,Şölen Biscolata Tria Fındıklı 100 Gr,Şölen Biscolata Veni Fındıklı Gofret 110 Gr,Şölen Biscolata Veni Çikolatalı 110 Gr,Şölen Boom Bastic Gofret Fındık Kremalı 32 Gr,Şölen Boombastic Burger 120 gr,Şölen Boombastic Marshmelow Çikolata Kek Bar 40 g,Şölen Boombastik Hindistan Cevizli 35Gr,Şölen Boombastıc Marshmallow'lu Kek Bar 40 Gr,Şölen Dökme Şeker kg,Şölen Hoppo Çilekli 40Gr,Şölen Jelly Şeker Dökme kg,Şölen Lokkum Meyveli 350 gr,Şölen Luppo Kakaolu Sandviç 184 Gr,Şölen Luppo Sandviç Kek 184 Gr,Şölen Luppo Sandviç Kek 184 g,Şölen Matmazel Çikolata 300 g,Şölen Minis 117Gr,Şölen Nirvana Hindistan Cevizli 100Gr,Şölen Nirvana Role Fındıklı 28Gr,Şölen Nutymax Antep Fıstıklı 44 Gr,Şölen Ozmo Burger 36 Gr,Şölen Ozmo Cornet MP 4x25 g,Şölen Ozmo Egg Oyuncaklı Yumurta 3'lü,Şölen Ozmo Go Beyaz Ve Bitter Çikolata 30 Gr,Şölen Ozmo Ogopogo Kek 30 Gr,Şölen Ozmo Yumurta Oyuncaklı 3*20 Gr,Şölen Ozmofun Sütlü Çikolata,Şölen Pia Portakal 100G,Şölen Stix Pirinç Patlaklı 34G,Şölen Tual grold 400 gr
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [15]:
# find mean occurence ratio for all the products
occurence_matrix.mean(axis=0)

17091 Command Orta Boy Şeffaf Askı                                       0.000091
17092 Command Küçük Boy Şeffaf Askı                                      0.000061
3 Mucizevi Kil Saç Kremi 360 ml                                          0.000091
3'lü Ekonomik Topraklı Grup Priz Anahtarlı 5 m                           0.000030
3'Ü 1 Arada Onarıcı Ve Koruyucu Şampuan, Saç Kremi, Bakım Kürü 470 ml    0.000030
                                                                           ...   
Şölen Ozmo Yumurta Oyuncaklı 3*20 Gr                                     0.000152
Şölen Ozmofun Sütlü Çikolata                                             0.001067
Şölen Pia Portakal 100G                                                  0.000091
Şölen Stix Pirinç Patlaklı 34G                                           0.000122
Şölen Tual grold 400 gr                                                  0.000091
Length: 9260, dtype: float64

In [16]:
# show total distribution of products
occurence_matrix.sum(axis=1).value_counts()

1      13829
2       4499
3       2568
4       1853
5       1494
       ...  
105        1
83         1
144        1
76         1
183        1
Length: 86, dtype: int64

In [17]:
# apply apriori algorithm to data with min support threshold of 0.001
frequent_itemsets = apriori(occurence_matrix, min_support = 0.001, max_len = 2, use_colnames = True)

# Print frequent itemsets.
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.001768,(Activia Shot Ahududu & Hibiskus 80 Ml)
1,0.001342,(Activia Shot limon Zencefil Matcha 80 Ml)
2,0.003903,(Alabalık 1 kg)
3,0.001555,"(Algida Carte D""or Selection Çikolata Karnaval..."
4,0.001525,(Algida Cornetto Classico Kaymak 125 ml)
...,...,...
2133,0.002470,"(Ülker Çikolatalı Gofret 5x36 gr, Ülker Çokona..."
2134,0.001189,"(İthal Muz 500 gr, İthal Ananas)"
2135,0.001433,"(Şeftali Paket 500 gr, İthal Ananas)"
2136,0.004726,"(Şeftali Paket 500 gr, İthal Muz 500 gr)"


In [18]:
# Recover association rules using support and a minimum threshold of 0.0001.
rules = association_rules(frequent_itemsets, metric = 'support', min_threshold = 0.0001)

# Print rules header.
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Ankara Burgu Makarna 500 gr),(Ankara Fiyonk Makarna 500 gr),0.005366,0.003964,0.001799,0.335227,84.572684,0.001778,1.498311
1,(Ankara Fiyonk Makarna 500 gr),(Ankara Burgu Makarna 500 gr),0.003964,0.005366,0.001799,0.453846,84.572684,0.001778,1.82116
2,(Avokado Adet),(Domates Salkım 500 gr),0.005702,0.041925,0.001006,0.176471,4.209241,0.000767,1.163377
3,(Domates Salkım 500 gr),(Avokado Adet),0.041925,0.005702,0.001006,0.024,4.209241,0.000767,1.018748
4,(Avokado Adet),(Limon Lamas 500 gr),0.005702,0.044089,0.00125,0.219251,4.972881,0.000999,1.224351


In [19]:
# Recover association rules using confidence threshold of 0.01.
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)

# Print rules.
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Ankara Burgu Makarna 500 gr),(Ankara Fiyonk Makarna 500 gr),0.005366,0.003964,0.001799,0.335227,84.572684,0.001778,1.498311
1,(Ankara Fiyonk Makarna 500 gr),(Ankara Burgu Makarna 500 gr),0.003964,0.005366,0.001799,0.453846,84.572684,0.001778,1.821160
2,(Avokado Adet),(Domates Salkım 500 gr),0.005702,0.041925,0.001006,0.176471,4.209241,0.000767,1.163377
3,(Domates Salkım 500 gr),(Avokado Adet),0.041925,0.005702,0.001006,0.024000,4.209241,0.000767,1.018748
4,(Avokado Adet),(Limon Lamas 500 gr),0.005702,0.044089,0.001250,0.219251,4.972881,0.000999,1.224351
...,...,...,...,...,...,...,...,...,...
2317,(İthal Ananas),(Şeftali Paket 500 gr),0.005610,0.032046,0.001433,0.255435,7.970975,0.001253,1.300026
2318,(Şeftali Paket 500 gr),(İthal Muz 500 gr),0.032046,0.025917,0.004726,0.147479,5.690418,0.003896,1.142591
2319,(İthal Muz 500 gr),(Şeftali Paket 500 gr),0.025917,0.032046,0.004726,0.182353,5.690418,0.003896,1.183829
2320,(Şeftali Paket 500 gr),(İçim Peynir Kaşar 600 Gr),0.032046,0.012105,0.001067,0.033302,2.751116,0.000679,1.021927


**COSINE SIMILARITY**

In [20]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# creating a textual description utilizing category and subcategory informaiton
product_df['description'] = product_df[['category', 'subcategory']].astype(str).agg(' '.join, axis=1)

# create a TF-IDF. vectorizer to represent textual description in a vector form
tfidf = TfidfVectorizer()

# construct a CSR matrix for representing description of each product instance
description_matrix = tfidf.fit_transform(product_df['description'])

# check the shape (number_of_instances x feature_length)
print(description_matrix.shape)

# Tfidf vectorizer normalizes the vectors so the A_transpose x A gives similarity between matrices calculated by linear_kernel()
similarity_matrix = linear_kernel(description_matrix,description_matrix)
print(similarity_matrix)

# create a id2index mapping between product_id and indices in product_df as it eases to access to other info
name2index = pd.Series(product_df.index,index = product_df['name'])

(10236, 232)
[[1.         0.         0.         ... 0.09546435 0.10367016 0.07299755]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.         0.         1.         ... 0.         0.         0.        ]
 ...
 [0.09546435 0.         0.         ... 1.         0.51662463 0.38238724]
 [0.10367016 0.         0.         ... 0.51662463 1.         0.25448435]
 [0.07299755 0.         0.         ... 0.38238724 0.25448435 1.        ]]


In [247]:
def recommend_by_apriori(product_in_cart, rules):
  # this method returns the products that have ten highest confidence scores according to the apriori algorithm
  similarity_relation = rules[rules["antecedents"].apply(lambda x: product_in_cart in str(x))]
  sorted_relation = similarity_relation.sort_values(['confidence'], ascending=False)[0:10]

  return sorted_relation


def recommend_by_similarity(product_in_cart, name2index, similarity_matrix ):
  # find index of product
  product_index = name2index[product_in_cart]
  
  # this score list includes mapping of index and similarity score of current product
  similarity_score = list(enumerate(similarity_matrix[product_index]))
  # sort in similarity scores in descending order
  similarity_score = sorted(similarity_score, key=lambda x: x[1], reverse=True)

  # extract the 10 most similar products to the current product
  similarity_score = similarity_score[0:10]

  # return product names according to the scores
  product_indices = [i[0] for i in similarity_score]

  return product_df['name'].iloc[product_indices], similarity_score

def recommend_the_most_popular(n):
  # this method returns the most popular item
  return data['name'].value_counts()[0:n].index.to_list()


def recommendation_system(test_cart):

  # print the cart content
  print('CART CONTENT: ')
  for cart_item in test_cart: print(cart_item)
    
  # create a list for apriori dataframes, cosine_similarity recommendations and dict for mapping similarity items to scores
  recommendation_data_frames  = list()
  similarity_recoms = list()
  sim_item_score_mapping = {}
  non_apriori_items = list()
  # find recommendation products for each product in cart according to the proper strategy(apriori, similarity, popularity)
  for product in test_cart:

    # at first apply apriori recommendation strategy which recommends more proper products
    recom_frame = recommend_by_apriori(product, rules)
    recom_frame = recom_frame[recom_frame["consequents"].apply(lambda x: product not in str(x))]

    # if apriori relation is found for a product add sub-dataframe to the list 
    if len(recom_frame) != 0:
      recommendation_data_frames.append(recom_frame)
    # if there is no apriori relation rule then group those products to apply similarity
    else: 
      non_apriori_items.append(product)

    # define required data structures for storage
    all_recoms = list()
    all_scores = list()
    popularity_recoms = list()
    apriori_recoms = list()
    apriori_scores = list()
    similarity_scores = list()
    popularity_scores = list()
    recommendation_strategy = list()

    # if sub-dataframes are found for each product merge dataframes to create a general list of recom. products
    if len(recommendation_data_frames) != 0:
      # sort instances by confidence score
      apriori_recoms = pd.concat(recommendation_data_frames, ignore_index=True).sort_values(['confidence'], ascending=False)
      # drop the instances that recommends the same consequent product
      apriori_recoms = apriori_recoms.drop_duplicates(subset=['consequents'], keep='first')
      
      apriori_recoms = apriori_recoms[['consequents', 'confidence']]
      # if more than 10 products are recommended by apriori, extract them and add to total recom. products and scores lists
      if len(apriori_recoms) >= 10:
        apriori_scores = apriori_recoms.confidence[0:10].to_list()
        apriori_recoms = apriori_recoms.consequents[0:10]
     
      # otherwise get the founds
      else:
        apriori_scores = apriori_recoms.confidence[0:len(apriori_recoms)].to_list()
        apriori_recoms = apriori_recoms.consequents[0:len(apriori_recoms)]

      all_recoms = [''.join(list(i)) for i in apriori_recoms]
      all_scores = [j for j in apriori_scores]
      # store strategy type
      recommendation_strategy = ['apriori' for j in apriori_scores]

    # if there is not enough recoms. by apriori(less then 10 or 0) apply similarity
    if len(apriori_recoms) < 10 or len(recommendation_data_frames) == 0: 
      items = non_apriori_items
      if len(non_apriori_items) == 0:
        items = test_cart

      for item in items:
        # extract similarity based recommendation products and similarity score
        sim_recom, sim_scores = recommend_by_similarity(item, name2index, similarity_matrix)
        sim_item_socre_mapping = map_sim_item_score(sim_item_score_mapping, sim_recom, sim_scores)
        similarity_recoms.append(sim_recom)

      # if there is not enough recommendations by apriori apply similarity to rest
      sim_recom_count = 10 - len(apriori_recoms)
      # clean duplicate recommendations or items already in cart
      clean_similarity_recoms = clean_similarity_list(apriori_recoms, test_cart, similarity_recoms)
      clean_similarity_recom_scores = [ (i, sim_item_socre_mapping.get(i)) for sim_list in clean_similarity_recoms for i in sim_list] 
      # if there are more products required to be recommended based on similarity and not enough recommendations
      # use popularity based recommendation
      if len(clean_similarity_recom_scores) < sim_recom_count:
        sim_recom_count = len(clean_similarity_recom_scores)-1
        popularity_recom_count = 10 - sim_recom_count
        # create recoms. for the rest by using popularity based recommendation
        popularity_recoms.extend(recommend_the_most_popular(popularity_recom_count))
        popularity_scores.extend([1.0 for i in popularity_recoms])
      # sort similarity recoms by score
      clean_similarity_recom_scores_sorted = sorted(clean_similarity_recom_scores, key=lambda x: x[1][1].any())[0:sim_recom_count]

      sim_recoms_scores = list(zip(*clean_similarity_recom_scores_sorted))
      all_recoms.extend(sim_recoms_scores[0])
      similarity_scores = [i[1] for i in sim_recoms_scores[1]]

      for j in sim_recoms_scores[0]: recommendation_strategy.append('similarity')
      all_scores.extend(similarity_scores)

      # if there is a recommendation based on popularity add them to total
      if len(popularity_recoms)!=0:
        for j in popularity_recoms: recommendation_strategy.append('popularity')
        all_recoms.extend(popularity_recoms)
        all_scores.extend(popularity_scores)

    # store result in dataframe so that the strategy can be observed
    result = {'products_you_might_also_like':all_recoms, 'recommendation_strategy':recommendation_strategy, 'scores':all_scores}
    result_df = pd.DataFrame(result)

  return  result_df

def map_sim_item_score(sim_item_score_mapping, sim_recom, sim_scores):
  # create a mapping dictionary between recommended product and similarity score
  sim_recom = sim_recom.to_list()
  for i in range(len(sim_recom)):
    sim_item_score_mapping[sim_recom[i]] = sim_scores[i]
  return sim_item_score_mapping


def remove_duplicate_recoms(similarity_recoms):
  # this method removes duplicated recommendations
  seen_items = []
  clean_list = []
  for sim_list in similarity_recoms:
    clean_sim_list = sim_list.to_list()
    for item in sim_list:
      if item not in seen_items:
        seen_items.append(item)
      else:
        clean_sim_list.remove(item)
    clean_list.append(clean_sim_list)

  return clean_list

def clean_similarity_list(apriori_recoms, cart, similarity_recoms):
  #this method cleans dataset from duplicate recoms and also products already in cart
  similarity_recoms_without_duplicates = remove_duplicate_recoms(similarity_recoms)
  
  clean_recoms = list()
  for sim_list in similarity_recoms_without_duplicates:
    clean_list = sim_list
    for item in sim_list:
      if (item in cart) or (item in apriori_recoms):
        clean_list.remove(item)
    clean_recoms.append(clean_list)

  return clean_recoms


## **Evaluation <a class="anchor" id="evaluation"></a>**

In the below cell, output of the developed strategy is demonstrated. This dataframe included recommended products, strategy and score for each product based on the strategy. 

In [263]:
# Test the model using target session no.
sessions = test_transactions.index
num_sessions = len(sessions)
session_no = 3092
test_session = sessions[session_no]
test_cart = test_transactions[test_session]

recommendation_system(test_cart)


CART CONTENT: 
Dana Tas Kebabı 500 gr
Pınar Sosis 430 g 10'lu


Unnamed: 0,products_you_might_also_like,recommendation_strategy,scores
0,Dana Biftek 250 gr,apriori,0.287081
1,Dana Sote 500 gr,apriori,0.267943
2,Dana Kıyma (%14-%20 Yağ) 250 gr,apriori,0.177033
3,Kuzu Saç Kavurma 500 gr,similarity,1.0
4,Dana Kuşbaşı 500 gr,similarity,1.0
5,Dana + Kuzu Karışım Kıyma 250 gr,similarity,1.0
6,Dilimli Dana Pirzola 500 gr,similarity,1.0
7,Dana Kontrafile 500 gr,similarity,1.0
8,Dana Nuar 500 gr,similarity,1.0
9,Torku Blok Kavurma 100 gr,similarity,1.0


In [None]:
# It can also be tested using a for loop.
for session_no in range(10,20):
  try:
    print(recommendation_system(test_transactions, session_no))
  except ValueError as e:
    print(e)
    continue

In below cell, I tried to measure the overall performance of this approach. For this purpose, I customized the a Top 10 accuracy logic to measure the quality of recommendations. I evaluated this model by simulating the shopping. By simulating the shopping the performance of this model is evaluated with top10 and sub category accuracies.


 Top 10 accuracy: This metric is measured by generating recommendations for products in shopping cart. The act.ve shopping cart is extended one by one. In each time, some of the products are considered as active cart products and recommendations are generated based on that and if any of the recommendations are in the rest of the actual shopping cart, it identified as hit, and it identified as miss otherwise.  It showed the hit rate in the shopping cart. 

 Sub-category accuracy: It is implemented in the same Top 10 accuracy is implemented. However, instead of direct product comparison category of recommended product is considered. 


As expected sub-cateogry accuracy performed better. I believe this strategy is applicable to real life. 


In [287]:
all_data = pd.concat([data, test_data], ignore_index=True)

In [None]:
# create a product to subcategory mapping
product2subcategory = all_data.set_index('name').to_dict()['subcategory']

Custom evaluation metrics are tested in below cell. Results are as follows:

Top 10 item accuracy: 0.0

Sub-category accuracy: 0.42


The results show that this methodology can be further developed. Eventhough, it is does not have any machine learning application in this model. Sub-category metric has shown that it performed considerably well. 

In [296]:
def get_sub_category(data, recoms):
  subcats = []
  for product in recoms.products_you_might_also_like.values:
    subcats.append(product2subcategory[product])
  return subcats

def evaluate(test_carts,product2subcategory):
  top10_list = list()
  subcat_list = list()
  for cart in test_carts:
    true_gt = 0 
    true_subcat = 0
    for i in range(1, len(cart)):
      current_list = cart[0:i]
      try:
        recoms = recommendation_system(current_list)
      except ValueError as e:
        print(e)
        break
      
      subcategories = get_sub_category(data,recoms)
      label = cart[i]
      label_subcat = product2subcategory[label]
      if label in recoms:
        true_gt += 1
      if label_subcat in subcategories:
        true_subcat += 1

    num_recoms = len(cart)-1
    top10_acc = true_gt/num_recoms
    top10_list.append(top10_acc)
    subcat_acc = true_subcat / num_recoms
    subcat_list.append(subcat_acc)
  top10_acc = sum(top10_list)/len(top10_list)
  subcat_acc = sum(subcat_list) / len(subcat_list)
  print('Top 10 acc: ', top10_acc)
  print('Sub-category acc: ',subcat_acc)




# Test the evalution metric
sessions = test_transactions.index
num_sessions = len(sessions)
session_no = 606

test_carts = []
for i in range(5, 100):
  test_session = sessions[i]
  test_cart = test_transactions[test_session]
  if len(test_cart)>=2:

    test_carts.append(test_cart)


evaluate(test_carts, product2subcategory)


CART CONTENT: 
Carrefour Hindi Salam 50 gr
CART CONTENT: 
Carrefour Hindi Salam 50 gr
Carrefour Yarım Yağlı Süt 1 lt
CART CONTENT: 
Carrefour Hindi Salam 50 gr
Carrefour Yarım Yağlı Süt 1 lt
Carrefour Sarı Leblebi 150 gr
CART CONTENT: 
Aptamil 5 Çocuk Devam Sütü 800 g 2 Yaş+ Akıllı Kutu
CART CONTENT: 
Eti Form Kepekli zeytinli Kraker 28 g
CART CONTENT: 
Eti Form Kepekli zeytinli Kraker 28 g
Marc Çamaşır Makinesi Temizleyicisi 2x250 ml
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
CART CONTENT: 
Dana Döş Sarma 500 gr
CART CONTENT: 
Doritos Mısır Cipsi Nacho Peynirli 113 gr
CART CONTENT: 
Doritos Mısır Cipsi Nacho Peynirli 113 gr
Cheetos Mısır Çerezi Peynirli 18 Gr
CART CONTENT: 
Doritos Mısır Cipsi Nacho Peynirli 113 gr
Cheetos Mısır Çerezi Peynirli 18 Gr
Cheetos Çerez Fıstıklı 43 gr
CART CONTENT: 
Doritos Mısır Cipsi Nacho Peynirli 113 gr
Cheetos Mısır Çerezi Peynirli 18 Gr
Cheetos Çerez Fıstıklı 43 gr
Lays Patates Cipsi Baharat Çeşnili 107

## **Further Ideas <a class="anchor" id="further"></a>**

This problem can be further investigated by applyling deep learning techniques. For example, we could also train an LSTM upon our data. Because if we think of sessions as sentences the problem can be reduced to next word prediction and our vocab would be the product ids. With that way it is possible to train a LSTM to predict next recommendation. We could also look for similarities between test session and training sessions and we could recommend according to the most similar carts.

I also realized another point that cosine similarity strategy can be very useful the recomend a exact/very similar product from other vendors in a e-commerce platform. It is fascinating that such a fundamental mathematical operation can solve this kind of a problem very fast and easy. 

## **Thanks <a class="anchor" id="thanks"></a>**

I would like to thank HepsiBurada for this case study. I have not had an oppurtunity to work on recommender systems before and working on this study was very beneficial. 