# Learning Product Vector Embedding: Word2vec and Fasttext 

This notebook shows you how to learn effective vector representation of products with Word2vec and Fasttext. This method is primarily designed for modelling words. 

### Primer on word2vec

The general approach is to predict whether or not a word appears in some context. Similar words tend to appear in similar contexts. Here, context physically manifests as a bag-of-words. We then set up a training paradigm to predict word-context association. At the end of the training process, we obtain a bunch of word vectors Ww in space W and context vectors c in C, where similar words appear close to each other in W.

In reality, words in sentences appear in an order, and together construct a meaningful sentence. However, it makes statistical methods for NLP difficult. So we came up with the bag-of-word approach. Bag-of-word, as the name suggests, does not preserve the order of the words appearing in the sentence. It works okay-ish so far. 

### Word2vec for product vector learning
Applying it to ecommerce, however, reduce the downside of bag-of-word approach by an order of magnitude, as the sequence of items appear in a shopping cart is a lot less important than words in a sentence (EG: you have a shopping list and wonder in a mall to get whatever you want, at the end of the day, getting the list filled is the goal).


### What to do with the vectors
    - Similar product search: this is the main use case
    - Use it in downstream model: :)
    
That's what I'm trying to do with this notebook.

In [1]:
import pandas as pd 
import gensim
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import warnings
warnings.filterwarnings('ignore')

In [2]:
sale = pd.read_csv('./input/transactions_train.csv')
prod = pd.read_csv('./input/articles.csv')


In [3]:
sale.shape

(31788324, 5)

In [4]:
sale.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [5]:
sale_hist = sale.groupby('customer_id')['article_id'].apply(list)
#remove `standalone` items, as there is no context vector for it to cling on
sale_hist  = sale_hist[ sale_hist.apply(len) > 1]

# Word2vec
- plain old word2vec trainning on sale data

In [6]:
model_w2v = gensim.models.Word2Vec(sentences=sale_hist, vector_size=100, window=10, min_count=2, workers=8)

In [32]:
def prod_similarity(prod_id,topn=10,model=model_w2v,mapper=lambda x:x,inv_mapper=lambda x:x):
    """
    params:
        prod_id: the article id
        topn: how many results to display
        model: the word2vec/fasttext model
        mapper: callable function, use for fasttext 
        inv_mapper: callable function, use for fasttext
    return:
        pandas Dataframe
    """
    pid = mapper(prod_id)
    sims = model.wv.most_similar(pid, topn=topn)
    sims += (pid,1.),
    
    sims = {inv_mapper(k):v for k,v in sims}
    
    matched = prod[prod.article_id.isin(sims.keys())]
    matched['similarity'] = matched.article_id.map(sims)
    return matched.sort_values('similarity',ascending=False)


In [33]:
prod_similarity(524939008)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,perceived_colour_value_id,perceived_colour_value_name,perceived_colour_master_id,perceived_colour_master_name,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc,rep,similarity
8658,524939008,524939,Left eye jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,1,Dusty Light,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Loose-knit jumper in a nylon blend with droppe...,aid_524939008-type_252-app_1010016-color_13-de...,1.0
34121,640597003,640597,Hanna,252,Sweater,Garment Upper body,1010016,Solid,73,Dark Blue,4,Dark,2,Blue,1646,Knitwear Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1003,Knitwear,"V-neck jumper in a soft, loose knit with dropp...",aid_640597003-type_252-app_1010016-color_73-de...,0.907355
8303,522297001,522297,Margit Winter,94,Sneakers,Shoes,1010016,Solid,9,Black,4,Dark,5,Black,3929,Divided Shoes,D,Divided,2,Divided,52,Divided Accessories,1020,Shoes,Hi-tops in imitation suede with imitation leat...,aid_522297001-type_94-app_1010016-color_9-dept...,0.903446
11235,546610002,546610,Dido Jumper,252,Sweater,Garment Upper body,1010021,Lace,9,Black,4,Dark,5,Black,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Wide jumper in a soft, loose, fine knit with a...",aid_546610002-type_252-app_1010021-color_9-dep...,0.89915
33681,639478002,639478,Maybe you need a cardigan,245,Cardigan,Garment Upper body,1010020,Contrast,43,Dark Red,4,Dark,3,Orange,1626,Knitwear,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1003,Knitwear,"Cardigan in a soft, rib-knit, marled cotton bl...",aid_639478002-type_245-app_1010020-color_43-de...,0.896368
29984,626588006,626588,James Biker,262,Jacket,Garment Upper body,1010016,Solid,13,Beige,2,Medium Dusty,1,Mole,1201,Outwear,A,Ladieswear,1,Ladieswear,19,Womens Jackets,1007,Outdoor,Biker jacket in grained imitation leather with...,aid_626588006-type_262-app_1010016-color_13-de...,0.895824
38734,658684001,658684,Tiger jacket,262,Jacket,Garment Upper body,1010016,Solid,19,Greenish Khaki,2,Medium Dusty,20,Khaki green,1244,Outdoor/Blazers,D,Divided,2,Divided,53,Divided Collection,1007,Outdoor,Thin jacket in woven fabric with a drawstring ...,aid_658684001-type_262-app_1010016-color_19-de...,0.895648
14372,562582001,562582,PUFFY CREW V 5,254,Top,Garment Upper body,1010016,Solid,73,Dark Blue,4,Dark,2,Blue,1660,Jersey,A,Ladieswear,1,Ladieswear,6,Womens Casual,1005,Jersey Fancy,Top in light sweatshirt fabric with dropped sh...,aid_562582001-type_254-app_1010016-color_73-de...,0.893439
13323,557561001,557561,ENID TUBE,80,Scarf,Accessories,1010004,Check,12,Light Beige,1,Dusty Light,11,Beige,3409,Scarves,C,Ladies Accessories,1,Ladieswear,65,Womens Big accessories,1019,Accessories,Tube scarf in woven fabric.,aid_557561001-type_80-app_1010004-color_12-dep...,0.892133
12784,554757004,554757,Winona,245,Cardigan,Garment Upper body,1010010,Melange,19,Greenish Khaki,4,Dark,20,Khaki green,1626,Knitwear,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1003,Knitwear,"Knee-length cardigan in a soft, fine knit with...",aid_554757004-type_245-app_1010010-color_19-de...,0.88987


# Fasttext model

### Drawback of word2vec 
Word2vec took over the world around 2015. However, it still has its drawbacks. The most prominent one is that it could not construct new vectors for unseen words. This is important, as social network companies like Facebook have to deal with a lot of junk words appearing online, as people typign is usually sloppry.

In languages like English, words have roots. Example: the word `Recommendation` is a noun, rooting from the verb `recommend`. So is `Recommending`. They are morphics of the same root. The suffixes `-tion` and `-ing` also share meaning across other words as well.


So if we can take advantage of those shared sub-word information, we can build better word vectors.That's when fasttext comes into the scene. But building a parsing method to break down words like `Recommendation` to meaningful subwords is costly and we are kinda lazy. So we smash it to equal sized chunks called ngrams and use some of them to help us build a better vector.

# Fasttext for ecommerce data

In our case, we'll re-encode our article_id or product id with its property, with the hope that we can build a better vector. 

# Evaluation

To measure better-ness, it would require doing benchmarks and stuff and I'm kinda lazy. Luckily, this dataset has higher level information like category and such, we can use this information to have a feel for the quahahlity of the embedding. Similar products should be found in similar category/deparment/garment_group_name etc

Seriously, where can I find such benchmarks ?



## Mapping product code to string representation

In [9]:
for c in ['product_code', 'prod_name', 'product_type_no',
       'product_type_name', 'product_group_name', 'graphical_appearance_no',
       'graphical_appearance_name', 'colour_group_code', 'colour_group_name',
       'perceived_colour_value_id', 'perceived_colour_value_name',
       'perceived_colour_master_id', 'perceived_colour_master_name',
       'department_no', 'department_name', 'index_code', 'index_name',
       'index_group_no', 'index_group_name', 'section_no', 'section_name',
       'garment_group_no', 'garment_group_name', 'detail_desc']:
    prod[c]  = prod[c].astype(str)


In [10]:
col_identifier = {
    'article_id':'aid_',
    'product_type_no':'type_',
    'graphical_appearance_no':'app_',
    'colour_group_code':'color_',
    'department_no':'dept_',
    'index_code':'idx_',
    'index_group_no':'idg_',
    'section_no':'sec_',
    'garment_group_no':'garm_',
    
}
# squash all properties into a string, delimited by `-`
def _id2rep(row):
    prop = []
    for k,v in col_identifier.items():
        prop += (v + str(row[k])),
    return '-'.join(prop)

prod['rep'] = prod.apply(_id2rep,axis=1)



In [52]:
#sample encoded value
prod.rep.values[1]

'aid_108775044-type_253-app_1010016-color_10-dept_1676-idx_A-idg_1-sec_16-garm_1002'

In [11]:
id2rep = prod[['article_id','rep']].set_index('article_id').to_dict()['rep']
rep2id = {v:k for k,v in id2rep.items()}

In [46]:
sale_hist_ft = []
for order in sale_hist:
    sale_hist_ft += [id2rep[item] for item in order],


In [47]:
model_ft = gensim.models.FastText(sentences=sale_hist_ft, 
                                  vector_size=100, 
                                  window=10,
                                  min_count=2, 
                                  workers=8,)

In [48]:
from functools import partial
prod_similarity_ft = partial(prod_similarity,
                                model=model_ft,
                                mapper=lambda x: id2rep[x],
                                inv_mapper=lambda x: rep2id[x])


In [49]:
prod_similarity_ft2(524939008,topn=200)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,perceived_colour_value_id,perceived_colour_value_name,perceived_colour_master_id,perceived_colour_master_name,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc,rep,similarity
8658,524939008,524939,Left eye jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,1,Dusty Light,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Loose-knit jumper in a nylon blend with droppe...,aid_524939008-type_252-app_1010016-color_13-de...,1.0
43818,678280003,678280,Luve Jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,4,Dark,11,Beige,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Jumper in loose-knit tape yarn with a slightly...,aid_678280003-type_252-app_1010016-color_13-de...,1.0
58087,718747003,718747,Betsy oversize v-neck,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,1,Dusty Light,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Oversized jumper in a soft, fine knit with a V...",aid_718747003-type_252-app_1010016-color_13-de...,1.0
6396,505230006,505230,Zelda Jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,2,Medium Dusty,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Wide polo-neck jumper knitted in bouclé yarn w...,aid_505230006-type_252-app_1010016-color_13-de...,0.999999
46493,685902005,685902,Shelly Poloneck,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,2,Medium Dusty,11,Beige,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Fitted jumper in a soft rib knit with a polo n...,aid_685902005-type_252-app_1010016-color_13-de...,0.999999
9344,532877002,532877,Sweetheart Jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,2,Medium Dusty,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Jumper in a soft, fluffy knit with roll edges ...",aid_532877002-type_252-app_1010016-color_13-de...,0.999999
46927,687270002,687270,Valencia Jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,1,Dusty Light,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Fitted top in a soft rib knit with a V-neck fr...,aid_687270002-type_252-app_1010016-color_13-de...,0.999994
57092,715665002,715665,Twist knit LS,252,Sweater,Garment Upper body,1010016,Solid,53,Dark Pink,2,Medium Dusty,4,Pink,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Fitted top in a soft, fine knit with long slee...",aid_715665002-type_252-app_1010016-color_53-de...,0.998516
58989,721756001,721756,Louise LS,252,Sweater,Garment Upper body,1010016,Solid,53,Dark Pink,2,Medium Dusty,4,Pink,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Jumper in a soft, rib-knit viscose blend with ...",aid_721756001-type_252-app_1010016-color_53-de...,0.99851
43823,678280015,678280,Luve jumper,252,Sweater,Garment Upper body,1010016,Solid,20,Other Yellow,5,Bright,8,Yellow,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Jumper in loose-knit tape yarn with a slightly...,aid_678280015-type_252-app_1010016-color_20-de...,0.998435


# Fasttext with proper hash function


In [41]:

from gensim.models.fasttext_inner import ft_hash_bytes

def ft_ngram_hashes_custom(word, minn, maxn, num_buckets):
    """
        Note that this is a monkey patched version of a cython-based function, done 
        just to demonstrate an idea. The best way to do this is to do a gensim PR.
    """
    grams = [str.encode(x) for x in word.split('-')]
    hashes = [ft_hash_bytes(n) % num_buckets for n in grams]
    return hashes

gensim.models.fasttext.ft_ngram_hashes = ft_ngram_hashes2

In [43]:
ft_ngram_hashes_custom('aid_717490002-type_255-app_1010016-color_10-dept_1643-idx_D-idg_2-sec_51-garm_1002',2,3,2000)

[910, 1738, 1424, 628, 1477, 851, 100, 405, 1018]

In [18]:
model_ft2 = gensim.models.FastText(sentences=sale_hist_ft, 
                                  vector_size=100, 
                                  window=10,
                                  min_count=2, 
                                  workers=8,)

In [None]:
prod_similarity_ft2 = partial(prod_similarity,
                              model=model_ft2,
                              mapper=lambda x: id2rep[x],
                              inv_mapper=lambda x: rep2id[x])


In [53]:
prod_similarity_ft2(524939008,topn=50)

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,perceived_colour_value_id,perceived_colour_value_name,perceived_colour_master_id,perceived_colour_master_name,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc,rep,similarity
8658,524939008,524939,Left eye jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,1,Dusty Light,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Loose-knit jumper in a nylon blend with droppe...,aid_524939008-type_252-app_1010016-color_13-de...,1.0
43818,678280003,678280,Luve Jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,4,Dark,11,Beige,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Jumper in loose-knit tape yarn with a slightly...,aid_678280003-type_252-app_1010016-color_13-de...,1.0
58087,718747003,718747,Betsy oversize v-neck,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,1,Dusty Light,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Oversized jumper in a soft, fine knit with a V...",aid_718747003-type_252-app_1010016-color_13-de...,1.0
6396,505230006,505230,Zelda Jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,2,Medium Dusty,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Wide polo-neck jumper knitted in bouclé yarn w...,aid_505230006-type_252-app_1010016-color_13-de...,0.999999
46493,685902005,685902,Shelly Poloneck,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,2,Medium Dusty,11,Beige,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Fitted jumper in a soft rib knit with a polo n...,aid_685902005-type_252-app_1010016-color_13-de...,0.999999
9344,532877002,532877,Sweetheart Jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,2,Medium Dusty,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Jumper in a soft, fluffy knit with roll edges ...",aid_532877002-type_252-app_1010016-color_13-de...,0.999999
46927,687270002,687270,Valencia Jumper,252,Sweater,Garment Upper body,1010016,Solid,13,Beige,1,Dusty Light,1,Mole,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Fitted top in a soft rib knit with a V-neck fr...,aid_687270002-type_252-app_1010016-color_13-de...,0.999994
57092,715665002,715665,Twist knit LS,252,Sweater,Garment Upper body,1010016,Solid,53,Dark Pink,2,Medium Dusty,4,Pink,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Fitted top in a soft, fine knit with long slee...",aid_715665002-type_252-app_1010016-color_53-de...,0.998516
58989,721756001,721756,Louise LS,252,Sweater,Garment Upper body,1010016,Solid,53,Dark Pink,2,Medium Dusty,4,Pink,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,"Jumper in a soft, rib-knit viscose blend with ...",aid_721756001-type_252-app_1010016-color_53-de...,0.99851
43823,678280015,678280,Luve jumper,252,Sweater,Garment Upper body,1010016,Solid,20,Other Yellow,5,Bright,8,Yellow,5963,Tops Knitwear DS,D,Divided,2,Divided,58,Divided Selected,1003,Knitwear,Jumper in loose-knit tape yarn with a slightly...,aid_678280015-type_252-app_1010016-color_20-de...,0.998435


# Possible upgrades ? 
- Adding color indicators maybe ? 
- Params tuning of Fasttext ?
