#  Hepsiburada Recommendation Team Data Scientist Assignment



**Task**: Recommend products similar to the ones already in a customer's cart.

Importing necessary libraries. 

In [1]:
import pandas as pd
import json
import string
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument



## Getting The Data

There are two files: 

- **Events.json:** contains events of type 'cart', each recording a product being added to a cart.
- **Meta.json:** contains information about the products. 

In [2]:
def read_json_to_df(file_path, json_key):
    """ Returns a dataframe given the path of a json file and the key of the json object """
    
    with open(file_path) as f:
        data_json = json.load(f)
    return pd.DataFrame.from_dict(data_json[json_key], orient='columns')

In [3]:
events_file_path, meta_file_path = 'data/events.json', 'data/meta.json'

events = read_json_to_df(events_file_path, 'events')
meta = read_json_to_df(meta_file_path, 'meta')

In [63]:
events.shape, meta.shape

((387650, 5), (10235, 5))

Pickling the **meta** dataframe to use it on the server.

In [6]:
meta.to_pickle('pickled_dfs/meta.pkl')

## Data Discovery

After a **quick glance** at the data, the following is noted:

- The product features are text, so an NLP technique may be of use.
- At least one category value, 'Pet Shop', is written in English.
- The two datasets may need to be merged on the productid feature at some point.
- The "brand" name appears to be repeated in the "name" column. 

In [7]:
events.head()

Unnamed: 0,event,sessionid,eventtime,price,productid
0,cart,a0655eee-1267-4820-af21-ad8ac068ff7a,2020-06-01T08:59:16.406Z,14.48,HBV00000NVZE8
1,cart,d2ea7bd3-9235-4a9f-a9ea-d7f296e71318,2020-06-01T08:59:46.580Z,49.9,HBV00000U2B18
2,cart,5e594788-78a0-44dd-8e66-37022d48f691,2020-06-01T08:59:33.308Z,1.99,OFIS3101-080
3,cart,fdfeb652-22fa-4153-b9b5-4dfa0dcaffdf,2020-06-01T08:59:31.911Z,2.25,HBV00000NVZBW
4,cart,9e9d4f7e-898c-40fb-aae9-256c40779933,2020-06-01T08:59:33.888Z,9.95,HBV00000NE0T4


In [8]:
meta.head()

Unnamed: 0,productid,brand,category,subcategory,name
0,HBV00000AX6LR,Palette,Kişisel Bakım,Saç Bakımı,Palette Kalıcı Doğal Renkler 10-4 PAPATYA
1,HBV00000BSAQG,Best,Pet Shop,Kedi,Best Pet Jöle İçinde Parça Etli Somonlu Konser...
2,HBV00000JUHBA,Tarım Kredi,Temel Gıda,"Bakliyat, Pirinç, Makarna",Türkiye Tarım Kredi Koop.Yeşil Mercimek 1 kg
3,HBV00000NE0QI,Namet,"Et, Balık, Şarküteri",Şarküteri,Namet Fıstıklı Macar Salam 100 gr
4,HBV00000NE0UQ,Muratbey,Kahvaltılık ve Süt,Peynir,Muratbey Burgu Peyniri 250 gr


### Data Types

As shown below, every column has dtype **object**. The price column should be float instead. This column could be used in several ways. For instance, since customers tend to like **whole numbers**, a recommended item may be more attractive if its price complements the current cart total and brings it up to a whole number.

In [9]:
events.dtypes

event        object
sessionid    object
eventtime    object
price        object
productid    object
dtype: object

In [10]:
meta.dtypes

productid      object
brand          object
category       object
subcategory    object
name           object
dtype: object
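
The whole-number idea above can be sketched as a simple ranking rule. This is only an illustration under assumptions that are not part of the notebook: the helper name `rank_by_round_total`, the candidate ids and the 0.52 price are hypothetical, while the other prices are taken from `events.head()`.

```python
import math

def rank_by_round_total(cart_total, candidates):
    """Sort (productid, price) pairs by how far cart_total + price
    falls short of the next whole number (smallest gap first)."""
    def gap(price):
        total = round(cart_total + price, 2)          # work in two decimals
        return round(math.ceil(total) - total, 2)     # distance to next whole number
    return sorted(candidates, key=lambda item: gap(item[1]))

# Hypothetical candidates for a cart that currently totals 14.48
candidates = [("A", 1.99), ("B", 2.25), ("C", 0.52)]
ranked = rank_by_round_total(14.48, candidates)
print(ranked)
```

Such a rule would only make sense as a tie-breaker among products that are already similar to the cart contents.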

### Missing values

There seem to be missing values that will need to be handled in some way. The following shows the number of null values in each column. 

In [64]:
def col_null_count(data):
    """counts the null values for each column"""
    
    for column in data.columns:
        null_vals_num = data[column].isnull().values.sum()
        print(f'{column}: {null_vals_num}')

In [12]:
col_null_count(events)

event: 0
sessionid: 0
eventtime: 0
price: 6
productid: 6


In [13]:
col_null_count(meta)

productid: 1
brand: 459
category: 1
subcategory: 1
name: 1


In [14]:
meta.loc[(meta['productid'].isnull())]

Unnamed: 0,productid,brand,category,subcategory,name
5092,,,,,


### Value Counts

Looking at the value counts, some initial notes are: 

- Some users probably put similar products in their carts, so a user-based recommendation technique may be considered. 
- The sets of categories and subcategories are small, so a content-based recommendation technique may be considered. 

In [15]:
events['productid'].value_counts()

HBV00000NVZGU      17082
HBV00000NVZBI       5557
HBV00000OE7X7       5070
HBV00000NVZBY       3824
HBV00000O2S62       3704
                   ...  
ZYLINSALNDT006         1
HBV00000U2BB3          1
HBV00000BRUEP          1
HBV00000QU3SM          1
ZYECZACI9221098        1
Name: productid, Length: 10235, dtype: int64

In [16]:
meta['category'].value_counts()

Atıştırmalık            1113
Ev Bakım ve Temizlik    1106
Kişisel Bakım           1059
Kahvaltılık ve Süt       974
Temel Gıda               961
İçecekler                890
Sağlık ve Kozmetik       862
Bebek                    508
Oyuncak ve Kırtasiye     407
Et, Balık, Şarküteri     382
Ev Yaşam ve Bahçe        371
Tatlı                    363
Pet Shop                 275
Pratik Yemekler          217
Meyve ve Sebze           185
Fırın                    182
Dondurma                 128
Organik ve Diyet         105
Spor, Outdoor ve Oto      88
Su                        59
Name: category, dtype: int64

In [17]:
meta['subcategory'].value_counts()

Saç Bakımı                    556
Çikolata, Gofret ve Barlar    377
Bisküvi ve Kekler             359
Çamaşır Yıkama                286
Duş Jelleri ve Sabunlar       276
                             ... 
Poşet                           1
Tatlandırıcılar                 1
Kaykay ve Paten                 1
Dart                            1
Ev Tekstili                     1
Name: subcategory, Length: 132, dtype: int64
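
The user-based idea noted above is not implemented in this notebook. As a rough sketch of what a first step could look like, the snippet below counts how often two products share a session. The function name `co_occurrence` and the toy session and product ids are hypothetical; in the notebook the pairs would come from the `sessionid` and `productid` columns of `events`.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_occurrence(session_product_pairs):
    """Count, for every pair of products, the number of sessions containing both."""
    carts = defaultdict(set)
    for session_id, product_id in session_product_pairs:
        carts[session_id].add(product_id)

    pair_counts = Counter()
    for items in carts.values():
        # sorted() gives each unordered pair one canonical key
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Toy data: two sessions, three products
toy_events = [("s1", "A"), ("s1", "B"), ("s2", "A"), ("s2", "B"), ("s2", "C")]
counts = co_occurrence(toy_events)
print(counts.most_common(3))
```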

## Data Processing

### Handling Missing Values

Missing values can be handled in several ways, such as deleting the affected rows or imputing the missing values. As seen earlier, the number of missing values in the **events** dataset is small: price and productid each have 6 nulls among 387,650 rows. 

Deleting those rows is therefore a reasonable way to handle them. This is done in the following cells for the **events** dataset.  

In [18]:
events.dropna(inplace=True)
events.reset_index(drop=True, inplace=True)

In [19]:
col_null_count(events)

event: 0
sessionid: 0
eventtime: 0
price: 0
productid: 0


For the **meta** dataset, the **brand** column, which has the most NaN values, won't be used, since the brand is already repeated in the **name** column. Therefore, only a single row, the one where all values are NaN, will be removed. 

In [20]:
meta.loc[(meta['productid'].isnull())]

Unnamed: 0,productid,brand,category,subcategory,name
5092,,,,,


In [21]:
meta.drop(meta.index[(meta['productid'].isnull())], inplace=True)
meta.reset_index(drop=True, inplace=True)

In [22]:
col_null_count(meta)

productid: 0
brand: 458
category: 0
subcategory: 0
name: 0


### Processing Text Features

To process the text in both **category** and **subcategory**, it helps to list all unique values, since there aren't many. There is a pattern in the text: words are separated by either a comma or the conjunction "ve".


In [23]:
meta['category'].unique()

array(['Kişisel Bakım', 'Pet Shop', 'Temel Gıda', 'Et, Balık, Şarküteri',
       'Kahvaltılık ve Süt', 'Su', 'Fırın', 'Meyve ve Sebze',
       'Oyuncak ve Kırtasiye', 'Ev Bakım ve Temizlik', 'Atıştırmalık',
       'Tatlı', 'Bebek', 'Ev Yaşam ve Bahçe', 'Sağlık ve Kozmetik',
       'Pratik Yemekler', 'Organik ve Diyet', 'İçecekler', 'Dondurma',
       'Spor, Outdoor ve Oto'], dtype=object)

In [24]:
meta['subcategory'].unique()

array(['Saç Bakımı', 'Kedi', 'Bakliyat, Pirinç, Makarna', 'Şarküteri',
       'Peynir', 'Yoğurt', 'Su', 'Tatlı ve Tuzlu Kurabiyeler', 'Meyve',
       'Ekmekler', 'Gazete ve Dergi', 'Mutfak Ve Banyo Ürünleri',
       'Çikolata, Gofret ve Barlar', 'Tatlı Malzemeleri',
       'Oda Kokusu ve Koku Gidericiler', 'Devam Sütleri ve Ek Gıdalar',
       'Oyuncak', 'Yapıştırıcı ve Etiketler', 'Haşere Öldürücüler',
       'Duş Jelleri ve Sabunlar', 'Ağız Bakım', 'Pil', 'Hijyenik Pedler',
       'Konserve', 'Hazır Yemekler', 'Kalemler', 'Tıraş Ürünleri',
       'El, Yüz ve Vücut Bakımı', 'Diyet Ürünler',
       'Dondurulmuş Sebze, Meyve', 'Çay', 'Gazsız İçecekler',
       'Bisküvi ve Kekler', 'Sıvı Yağ', 'Süt', 'Bebek Bakım ve Sağlığı',
       'Ev Temizlik Ürünleri', 'Meyve Suyu', 'Kahve',
       'Balık ve Deniz Mahsülleri', 'Sebze', 'Kümes Hayvanları',
       'Islak Mendil', 'Krem Çikolata ve Ezme', 'Bebek Bezi',
       'Baharat, Harç ve Bulyon', 'Parfüm, Deodorant',
       'Yufka, Taze Hamur ve M

To process the text in the **name** column, other things should be taken into consideration:
- There may be punctuation marks other than **commas**.
- There are stopwords other than **ve**, such as **içinde**.
- There are **numbers** in the text, so those should be handled as well.
- There are **units** in the text. 

In [25]:
meta['name'].head()

0            Palette Kalıcı Doğal Renkler 10-4 PAPATYA
1    Best Pet Jöle İçinde Parça Etli Somonlu Konser...
2         Türkiye Tarım Kredi Koop.Yeşil Mercimek 1 kg
3                    Namet Fıstıklı Macar Salam 100 gr
4                        Muratbey Burgu Peyniri 250 gr
Name: name, dtype: object

The following function handles the concerns stated above:

In [26]:
# turkish stop words list 
with open('stop-words_tr.txt') as f:
    stop_words = f.read().splitlines()

In [27]:
def has_numbers(word):
    """return true if the input string has a number"""
    return any(char.isdigit() for char in word)

def is_unit(word):
    """ Assumes a unit is 2 chars or less. A better approach would be using a list of units"""
    return len(word) <= 2

def separate_words(sent):
    """ split words of a given sentence """
    
    sent = sent.translate(str.maketrans('', '', string.punctuation)) # remove punctuation
    splitted = sent.lower().split(' ')
    return [word.strip() for word in splitted if word not in stop_words and not has_numbers(word)
            and not is_unit(word)]

In [28]:
meta_processed = meta.copy()

In [29]:
meta_processed['category'] = meta_processed['category'].apply(lambda sent: separate_words(sent))

In [30]:
meta_processed['subcategory'] = meta_processed['subcategory'].apply(lambda sent: separate_words(sent))

In [31]:
meta_processed['name'] = meta_processed['name'].apply(lambda sent: separate_words(sent))

This is what they look like now:

In [32]:
meta_processed.head()

Unnamed: 0,productid,brand,category,subcategory,name
0,HBV00000AX6LR,Palette,"[kişisel, bakım]","[saç, bakımı]","[palette, kalıcı, doğal, renkler, papatya]"
1,HBV00000BSAQG,Best,"[pet, shop]",[kedi],"[best, pet, jöle, i̇çinde, parça, etli, somonl..."
2,HBV00000JUHBA,Tarım Kredi,"[temel, gıda]","[bakliyat, pirinç, makarna]","[türkiye, tarım, kredi, koopyeşil, mercimek]"
3,HBV00000NE0QI,Namet,"[balık, şarküteri]",[şarküteri],"[namet, fıstıklı, macar, salam]"
4,HBV00000NE0UQ,Muratbey,"[kahvaltılık, süt]",[peynir],"[muratbey, burgu, peyniri]"
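
Two artifacts are visible in the output above: `i̇çinde` survives even though **içinde** was named as a stopword, and `koopyeşil` is a fusion of "Koop." and "Yeşil". The first happens because Python's `str.lower()` turns "İ" into "i" followed by a combining dot, and the second because punctuation is deleted rather than replaced by a space. A minimal sketch of a variant that addresses both is given below. It is not the function used in this notebook: the names `turkish_lower` and `separate_words_tr` and the two-word stop list are assumptions for illustration.

```python
import string

# Hypothetical stop-word set; the notebook loads its own from stop-words_tr.txt
STOP_WORDS = {"ve", "içinde"}

# Map every ASCII punctuation character to a space so "Koop.Yeşil" splits in two
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def turkish_lower(text):
    """Lowercase using Turkish rules for dotted and dotless I."""
    return text.replace("İ", "i").replace("I", "ı").lower()

def separate_words_tr(sent, stop_words=STOP_WORDS):
    """Split a sentence into tokens, dropping stop words, tokens with digits,
    and tokens of two characters or fewer (treated as units)."""
    tokens = turkish_lower(sent.translate(PUNCT_TO_SPACE)).split()
    return [
        t for t in tokens
        if t not in stop_words
        and not any(c.isdigit() for c in t)
        and len(t) > 2
    ]

print(separate_words_tr("Best Pet Jöle İçinde Parça Etli"))
print(separate_words_tr("Türkiye Tarım Kredi Koop.Yeşil Mercimek 1 kg"))
```

One trade-off of the Turkish casing rule is that English words are affected too: "Ice" becomes "ıce". Whether that matters depends on how many English product names the catalog contains.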


### Converting Data Types

Here, only the price column's type is changed to float.

In [33]:
events['price'] = events['price'].astype(float)

## Approach 1: Content-Based Filtering

The aim is to build a recommendation system that recommends products related or similar to the ones in the cart. The focus is on the meta dataset, since it contains the features of the products.


This could be done in many ways. A couple of methods are implemented in the following cells.

### Method 1: CountVectorizer


For this method, all text features are combined. Then, each product's bag of words is converted to a vector whose entries count how often each word of the vocabulary occurs in that product's text, where the vocabulary is built from the entire text.

In [34]:
meta_m1 = meta_processed.copy()

In [35]:
# combining the text features to form a bag of words
meta_m1['Bow'] = (meta_m1['category'] + meta_m1['subcategory'] + meta_m1['name']).apply(lambda ls: ' '.join(ls))
meta_m1 = meta_m1[['productid', 'Bow']]

In [36]:
meta_m1.head()

Unnamed: 0,productid,Bow
0,HBV00000AX6LR,kişisel bakım saç bakımı palette kalıcı doğal ...
1,HBV00000BSAQG,pet shop kedi best pet jöle i̇çinde parça etli...
2,HBV00000JUHBA,temel gıda bakliyat pirinç makarna türkiye tar...
3,HBV00000NE0QI,balık şarküteri şarküteri namet fıstıklı macar...
4,HBV00000NE0UQ,kahvaltılık süt peynir muratbey burgu peyniri


Transforming the text into vectors. 

In [37]:
count = CountVectorizer()
count_matrix = count.fit_transform(meta_m1['Bow'])

Calculating the similarities between all products using the cosine similarity method. 

In [38]:
sim_matrix = cosine_similarity(count_matrix, count_matrix)
sim_matrix.shape

(10235, 10235)
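
To make the computation concrete, the sketch below reproduces count vectors and cosine similarity in plain Python for two bags of words. It is a re-implementation for illustration, not the scikit-learn code path used above. The function name `cosine_from_counts` is hypothetical, and the example strings are what the preprocessing would be expected to produce for two apple products, which is an assumption.

```python
import math
from collections import Counter

def cosine_from_counts(text_a, text_b):
    """Cosine similarity between the word-count vectors of two strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())   # shared words only
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

bow_1 = "meyve sebze meyve granny smith elma"
bow_2 = "meyve sebze meyve elma granny smith organik"
bow_3 = "meyve sebze meyve mandalina"

print(round(cosine_from_counts(bow_1, bow_2), 6))
print(round(cosine_from_counts(bow_1, bow_3), 6))
```

Because "meyve" appears in both the category and the subcategory, it is counted twice, so products in the same subcategory already start with a sizeable similarity before any word from the name matches.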

In [39]:
# converting the matrix to a dataframe
m1_sim_mat = pd.DataFrame(sim_matrix)
m1_sim_mat.columns = meta['productid'].values
m1_sim_mat.insert(0, 'productid', meta['productid'])

In [40]:
m1_sim_mat.head()

Unnamed: 0,productid,HBV00000AX6LR,HBV00000BSAQG,HBV00000JUHBA,HBV00000NE0QI,HBV00000NE0UQ,HBV00000NE1NR,HBV00000NH2LJ,HBV00000NVZ7D,HBV00000NVZCG,...,ZYBICN9286868,ZYECZACI9300200,ZYECZACI9470301,ZYFSAN6010014,ZYHPETICEDIY014,ZYHPREISBBKL008,ZYNES11470137,ZYPAREX1909309,ZYPAREX2107986,ZYPYON6690
0,HBV00000AX6LR,1.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,...,0.0,0.100504,0.125988,0.0,0.0,0.0,0.0,0.100504,0.096225,0.096225
1,HBV00000BSAQG,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,HBV00000JUHBA,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.239046,0.0,0.572078,0.0,0.0,0.0,0.0
3,HBV00000NE0QI,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,HBV00000NE0UQ,0.0,0.0,0.0,0.0,1.0,0.408248,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Here, we pickle the dataframe shown above to use it on a server.

In [41]:
m1_sim_mat.to_pickle("pickled_dfs/cb_m1.pkl")

The similarity matrix is ready, and for any product in **meta** the top n items to recommend can be determined. In the following cells, two examples test the method. 

In [42]:
def m1_recommend_products(product_id, n_products):
    """ Returns the ids of the n_products most similar products to product_id, along with their similarities """
    product_profile = m1_sim_mat[m1_sim_mat['productid'] == product_id]
    product_profile_dict = product_profile.to_dict()
    del product_profile_dict['productid'] # deleting the product_id (1st column)
    # each sub_dict maps the row index to a similarity; take its single value
    productid_sim_tuples = [(key, sub_dict[list(sub_dict)[0]]) for key, sub_dict in product_profile_dict.items()]
    # sort by similarity, descending; position 0 is assumed to be the product itself and is skipped
    top_n_products = sorted(productid_sim_tuples, key=lambda pair: pair[1],
                  reverse=True)[1:n_products+1]
    product_ids, similarities = list(zip(*top_n_products))
    return product_ids, similarities

#### Example: 1

In [43]:
product_ids, similarities = m1_recommend_products('HBV00000NVZCG', 10)

##### Input Product

In [44]:
meta[meta['productid'] == 'HBV00000NVZCG']

Unnamed: 0,productid,brand,category,subcategory,name
8,HBV00000NVZCG,,Meyve ve Sebze,Meyve,Granny Smith Elma 500 gr


##### Recommended Products

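The method appears to work well: all ten recommended products are fruits, and nine of them are apples. The similarity column in the table below should be read with care, however. `isin` returns the rows in the order in which they appear in **meta**, while `similarities` is sorted in descending order, so the scores are assigned by position and do not necessarily belong to the product on the same row. This is why the first row, a mandarin, seems to have the highest similarity; counting shared tokens suggests that 0.942809 belongs to "Elma Granny Smith Organik 750 gr" and 0.721688 to the mandarin. The method itself remains simple and depends only on term frequencies. 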

In [45]:
output_products = meta[meta['productid'].isin(product_ids)].copy()
output_products.reset_index(drop=True, inplace=True)
output_products['similarity'] = similarities 
output_products

Unnamed: 0,productid,brand,category,subcategory,name,similarity
0,HBV00000PUQE0,,Meyve ve Sebze,Meyve,Mandalina kg 500 gr,0.942809
1,HBV00000O2SGQ,,Meyve ve Sebze,Meyve,Elma Starking 500 gr,0.875
2,HBV00000NVZCK,,Meyve ve Sebze,Meyve,Elma Granny Ekonomik 500 gr,0.801784
3,HBV00000O2SHB,,Meyve ve Sebze,Meyve,Elma Golden Organik 750 gr,0.801784
4,HBV00000NVZCE,,Meyve ve Sebze,Meyve,Elma Starking 500 gr,0.801784
5,HBV00000O2SHN,,Meyve ve Sebze,Meyve,Elma Amasya Paket 500 gr,0.801784
6,HBV00000O2SH1,,Meyve ve Sebze,Meyve,Elma Fujı Paket 500 gr,0.75
7,HBV00000O2SJF,,Meyve ve Sebze,Meyve,Elma Granny Smith Organik 750 gr,0.75
8,HBV00000O2SHK,,Meyve ve Sebze,Meyve,Amasya Elma 500 gr,0.75
9,HBV00000NVZCM,,Meyve ve Sebze,Meyve,Golden Elma 500 gr,0.721688
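
The cell above selects rows with `isin`, which keeps them in the row order of `meta`, and then assigns the sorted `similarities` by position. A minimal sketch of pairing each score with its own product by id is shown below. The helper name `rank_ordered_rows`, the placeholder ids and the scores are hypothetical; `rows` stands for the product records as a list of dicts.

```python
def rank_ordered_rows(rows, product_ids, similarities):
    """Return one record per recommended id, in ranking order,
    with that id's own similarity attached."""
    by_id = {row["productid"]: row for row in rows}
    return [
        {**by_id[pid], "similarity": sim}
        for pid, sim in zip(product_ids, similarities)
        if pid in by_id
    ]

# Placeholder catalog in "meta order" and a ranking that disagrees with that order
rows = [
    {"productid": "P1", "name": "first row in meta"},
    {"productid": "P2", "name": "second row in meta"},
]
ranked = rank_ordered_rows(rows, ("P2", "P1"), (0.9, 0.7))
print(ranked)
```

With pandas, a comparable result could be obtained by indexing on the id, for example `meta.set_index('productid').loc[list(product_ids)]`, assuming product ids are unique in `meta`.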


#### Example: 2

In [46]:
product_ids, similarities = m1_recommend_products('ZYBICN9286868', 10)

##### Input Product

In [47]:
meta[meta['productid'] == 'ZYBICN9286868']

Unnamed: 0,productid,brand,category,subcategory,name
10225,ZYBICN9286868,Lipton,İçecekler,Gazsız İçecekler,LİPTON IİCE TEA ŞEFTALI AROMALI TNK 500 ML


##### Recommended Products

For this example, the results are well related to the input product: all ten are iced teas from the same subcategory, and most are peach flavoured. 

In [48]:
output_products = meta[meta['productid'].isin(product_ids)].copy()
output_products.reset_index(drop=True, inplace=True)
output_products['similarity'] = similarities 
output_products

Unnamed: 0,productid,brand,category,subcategory,name,similarity
0,ZYHPPEPSIGZS019,LIPTON ICE TEA,İçecekler,Gazsız İçecekler,"Lipton Ice Tea Şeftali 1,5 L",0.889499
1,ZYHPCOCACGZS010,Fuse Tea,İçecekler,Gazsız İçecekler,Fuse Tea Şeftali Pet 1 Lt,0.859338
2,ZYBICN9287068,Lipton,İçecekler,Gazsız İçecekler,LİPTON İCE TEA DOUBLE ŞEFTALİ & KAYISI AROMALI...,0.787726
3,HBV00000NFHOB,Lipton,İçecekler,Gazsız İçecekler,Lipton Ice Tea Şeftali 4 x 250 ml,0.7396
4,ZYBICN9286873,Lipton,İçecekler,Gazsız İçecekler,Lipton İce Tea Şeftali-Kayısı 1.5 Lt,0.701646
5,HBV00000PQKHY,Lipton,İçecekler,Gazsız İçecekler,Lipton İce Tea Şeftali Aromalı 6*330 Ml,0.701646
6,ZYBICN9286869,Lipton,İçecekler,Gazsız İçecekler,LİPTON İCE TEA LİMON AROMALI TNK 500 ML,0.64715
7,ZYBICN9286870,Lipton,İçecekler,Gazsız İçecekler,LİPTON İCE TEA DOUBLE ÇİLEK & KAVUN AROMALI TN...,0.64715
8,ZYBICN9310832,Lipton,İçecekler,Gazsız İçecekler,Lipton İce Tea Şeftali 2 Lt,0.64715
9,HBV00000PQKHW,Lipton,İçecekler,Gazsız İçecekler,Lipton İce Tea Şeftali Aromalı 500 Ml,0.64715


### Method 2: Doc2Vec


Another method that could help find products similar to the ones in the cart involves Doc2Vec. 

To find the similarity between the combined sentences (category + subcategory + name), Doc2Vec is used here to assign a vector to each document, where each product's combined features form one document. 

After assigning each product a vector, cosine similarity can be used to find the similarity between any two documents/products.

In [49]:
meta_m2 = meta_processed.copy()

Here, the features are combined in one feature. 

In [50]:
meta_m2['list_of_words'] = meta_m2['category'] + meta_m2['subcategory'] + meta_m2['name']

Now, list_of_words contains the combined lists of words.

In [51]:
meta_m2.head()

Unnamed: 0,productid,brand,category,subcategory,name,list_of_words
0,HBV00000AX6LR,Palette,"[kişisel, bakım]","[saç, bakımı]","[palette, kalıcı, doğal, renkler, papatya]","[kişisel, bakım, saç, bakımı, palette, kalıcı,..."
1,HBV00000BSAQG,Best,"[pet, shop]",[kedi],"[best, pet, jöle, i̇çinde, parça, etli, somonl...","[pet, shop, kedi, best, pet, jöle, i̇çinde, pa..."
2,HBV00000JUHBA,Tarım Kredi,"[temel, gıda]","[bakliyat, pirinç, makarna]","[türkiye, tarım, kredi, koopyeşil, mercimek]","[temel, gıda, bakliyat, pirinç, makarna, türki..."
3,HBV00000NE0QI,Namet,"[balık, şarküteri]",[şarküteri],"[namet, fıstıklı, macar, salam]","[balık, şarküteri, şarküteri, namet, fıstıklı,..."
4,HBV00000NE0UQ,Muratbey,"[kahvaltılık, süt]",[peynir],"[muratbey, burgu, peyniri]","[kahvaltılık, süt, peynir, muratbey, burgu, pe..."


The lists of words are first used to tag the documents, that is, to give each document a tag/identifier (here its row position) for the vector to be obtained. 

Then, Doc2Vec is trained on the tagged documents with some specified hyperparameters, which could be tuned.  

In [52]:
list_of_words = meta_m2['list_of_words']

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(list_of_words)]

In [53]:
vec_size = 200
model = Doc2Vec(documents, vector_size=vec_size, workers=4)

In [54]:
def m2_recommend_products(n_products):
    """ Returns the ids of the n_products most similar products to the global rand_prod1_vec, along with the similarities """

    product_sim = []
    for index, row in meta_m2[['productid', 'list_of_words']].iterrows():
        # infer a fresh vector for every product and compare it with the query vector
        vec = model.infer_vector(row['list_of_words'])
        sim = cosine_similarity(rand_prod1_vec.reshape(1,vec_size), vec.reshape(1, vec_size))
        product_sim.append((row['productid'], sim[0][0]))

    # sort by similarity, descending; position 0 is assumed to be the query product and is skipped
    top_n_products = sorted(product_sim, key=lambda pair: pair[1], reverse=True)[1:n_products+1]
    product_ids, similarities = list(zip(*top_n_products))
    return product_ids, similarities

#### Example: 1

##### Input Product

In [55]:
rand_prod1 = meta_processed.sample(n=1, random_state = 100)
rand_prod1

Unnamed: 0,productid,brand,category,subcategory,name
2454,PTINQUIK-078,Quik,"[pet, shop]",[kuş],"[quik, kumlu, tünek]"


Here, a vector is inferred for this randomly chosen product. 

In [56]:
rand_prod1 = rand_prod1['category'] + rand_prod1['subcategory'] + rand_prod1['name']
rand_prod1_vec = model.infer_vector(rand_prod1.values[0])

##### Recommended Products

For this example, the recommended products are mostly related to the input product: eight of the ten are Pet Shop products, although a breakfast cereal and a salami also appear. 

In [57]:
product_ids, similarities = m2_recommend_products(10)

In [58]:
output_products = meta[meta['productid'].isin(product_ids)].copy()
output_products.reset_index(drop=True, inplace=True)
output_products['similarity'] = similarities 
output_products

Unnamed: 0,productid,brand,category,subcategory,name,similarity
0,HBV00000PQJQW,Jungle,Pet Shop,Kedi,Jungle Somonlu Kısır Kedi Maması 500 g,0.927541
1,HBV00000HWAX6,Bestpet,Pet Shop,Kedi,Bestpet Somonlu Kısırlaştırılmış Kedi Maması 1 Kg,0.922346
2,HBV00000NVZ6O,Quik,Pet Shop,Kuş,Quik Kuş Tüneği 4'lü,0.921395
3,HBV00000PQJ0L,Felix,Pet Shop,Kedi,Felix Balıklı Mama 100 g,0.919299
4,HBV00000PQJV7,Purina One,Pet Shop,Kedi,Purina One Sterilcat Sığır Etli Kedi Maması 800 g,0.917747
5,ZYNES12214275,Nesfit,Kahvaltılık ve Süt,Müsli ve Kahvaltılık Gevrek,Nesfit Ballı Bademli 400 gr,0.914566
6,PTINJNG-011,Jungle,Pet Shop,Kuş,Jungle Papağan Yemi 500 Gr,0.913718
7,HBV00000NE197,Namet,"Et, Balık, Şarküteri",Şarküteri,Namet Hindi Etli Salam 250 gr,0.910658
8,PTINJNG-012,Jungle,Pet Shop,Kemirgenler,Jungle Tavşan Yemi 500 Gr,0.909935
9,HBV00000ASP48,Bestpet,Pet Shop,Kedi,Bestpet Mix Karışıklı Etli Yetişkin Kedi Mamas...,0.908937


#### Example: 2

##### Input Product

In [59]:
rand_prod1 = meta_processed.sample(n=1, random_state = 55)
rand_prod1

Unnamed: 0,productid,brand,category,subcategory,name
752,HBV00000NG8KC,Universal,"[spor, outdoor, oto]","[spor, topları]","[universal, voleybol]"


In [60]:
rand_prod1 = rand_prod1['category'] + rand_prod1['subcategory'] + rand_prod1['name']
rand_prod1_vec = model.infer_vector(rand_prod1.values[0])

##### Recommended Products

For this example, the results are much less convincing. Most of the listed products, such as the second one (a sausage), are not directly related to the input product (a volleyball). The input product itself (HBV00000NG8KC) also appears in the list, which means it did not rank first and so was not removed by the `[1:]` slice. Further analysis may be conducted to understand why such results are obtained. It is also worth keeping in mind that Doc2Vec does not work very well on small datasets; the published paper itself uses tens of thousands to millions of texts, so the results would likely become more meaningful as the dataset grows. Finally, the hyperparameters of the model may be tuned and the results compared to obtain more reliable recommendations. 

In [61]:
product_ids, similarities = m2_recommend_products(10)

In [62]:
output_products = meta[meta['productid'].isin(product_ids)].copy()
output_products.reset_index(drop=True, inplace=True)
output_products['similarity'] = similarities 
output_products

Unnamed: 0,productid,brand,category,subcategory,name,similarity
0,HBV00000PQM69,Yurt,Pratik Yemekler,Hazır Yemekler,Yurt Haşlanmış Barbunya 800 g,0.936752
1,HBV00000U2B2S,Ersan,"Et, Balık, Şarküteri",Balık ve Deniz Mahsülleri,Erşan Et Dana 400 gr Sosis,0.928542
2,HBV00000NG8KC,Universal,"Spor, Outdoor ve Oto",Spor Topları,Universal CV302 N5 Voleybol Topu,0.925511
3,HBV00000PVAMC,Ogx,Kişisel Bakım,Saç Bakımı,Ogx Kırılma Karşıtı Keratin Oil Şampuan 385 ml,0.925216
4,AILEHSBR53378,Nerf,Oyuncak ve Kırtasiye,Oyuncak,Nerf N-Strike Elite Firestrike,0.924199
5,HBV00000U27CL,Nestle,Kahvaltılık ve Süt,Müsli ve Kahvaltılık Gevrek,Nestle Nesquik Gevrek 225 Gr,0.923787
6,ZYULKERSSU004,Saka,Su,Su,Saka Su 6x500 ml,0.922939
7,OFISFAB2500,Faber-Castell,Oyuncak ve Kırtasiye,Kalemler,Faber-Castell Goldfaber Dereceli Kurşun Kalem HB,0.922829
8,HBV00000O2SC1,Aytaç,"Et, Balık, Şarküteri",Şarküteri,Aytaç Piliç Salam 600 gr,0.922063
9,HBV00000QX25B,Yurt,Pratik Yemekler,Hazır Yemekler,Yurt Fasulye Pilaki 400 gr,0.921431
