### Product2vec: Instacart Item Embeddings

This is a naive example of how to generate item embeddings from the Instacart dataset.

In [2]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from gensim.test.utils import get_tmpfile
from datetime import datetime
%matplotlib inline

In [18]:
# append our repeated orders with our prior orders, to get the complete t-log.

o1 = pd.read_csv('./data/order_products__train.csv')
o2 = pd.read_csv('./data/order_products__prior.csv')
orders = pd.concat([o1, o2], ignore_index=True)

products = pd.read_csv('data/products.csv')

In [19]:
relevant_cols = ['order_id','product_name']

# Downsample while testing for faster iteration on syntax; run the full dataset (sample_size = 1) before committing.
sample_size = 1

baskets = (orders
           .merge(products,on='product_id',how='left')
           .sample(frac=sample_size)
          )[relevant_cols]

# memory management on my local computer
del orders, products

In [20]:
baskets.sort_values(['order_id']).head(20)

| | order_id | product_name |
|---|---|---|
| 4 | 1 | Lightly Smoked Sardines in Olive Oil |
| 3 | 1 | Cucumber Kirby |
| 6 | 1 | Organic Hass Avocado |
| 1 | 1 | Organic 4% Milk Fat Whole Milk Cottage Cheese |
| 0 | 1 | Bulgarian Yogurt |
| 7 | 1 | Organic Whole String Cheese |
| 2 | 1 | Organic Celery Hearts |
| 5 | 1 | Bag of Organic Bananas |
| 1384623 | 2 | Original Unflavored Gelatine Mix |
| 1384620 | 2 | Coconut Butter |


### Embedding Size

The embedding size matters because we use gensim's word2vec implementation for this task. Word2vec's learning task is to predict a missing word from a window of surrounding words, using a single hidden layer. We don't care about the quality of that prediction itself; the hidden-layer weights are the product embeddings we will use. The number of neurons in the hidden layer, which is the length of each embedding vector, is a tunable parameter. Unfortunately, there isn't much solid guidance on how to select it, but eyeballing the resulting embeddings gives a rough sense of the quality of fit. Some people recommend using the fourth root of the number of unique tokens in the corpus, which I'll try.

Another tunable parameter is the context window: how many words around the target word are used for the prediction task. Because the items in a basket have no meaningful order, we want the window to span the whole basket, so we set it to the size of the largest basket, 145.
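
To make the window choice concrete, here is a minimal sketch of how a symmetric context window picks neighbours. The helper `context_items` is hypothetical and written only for illustration; it is not gensim's internal code and ignores training details such as any random shrinking of the window. The point is that once the window is at least the basket length minus one, every other item in the basket lands in the context, whatever the item order.

```python
def context_items(basket, position, window):
    """Return the items within `window` positions of basket[position]."""
    start = max(0, position - window)
    end = min(len(basket), position + window + 1)
    return [item
            for i, item in enumerate(basket[start:end], start=start)
            if i != position]

basket = ["Limes", "Lemons", "Organic Hass Avocado", "Bag of Organic Bananas"]

# A narrow window only sees immediate neighbours, so the result depends on item order.
narrow = context_items(basket, 0, window=1)

# A window of len(basket) - 1 or more covers the whole basket from any position.
wide = context_items(basket, 0, window=len(basket) - 1)

print(narrow)
print(wide)
```

Using the largest basket size as the window therefore guarantees full coverage for every basket, at the cost of more computation per training example.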

In [21]:
num_items = baskets.product_name.nunique()
# fourth root of the vocabulary size, truncated to a plain Python int
embedding_size = int(num_items ** 0.25)
print('''Let's use vectors of length {n} for {tokens} products'''.format(n=embedding_size, tokens=num_items))

# number of distinct products in the largest order
biggest_basket = int(baskets.groupby('order_id').product_name.nunique().max())
print('''The biggest basket (window in our algorithm) will be {}'''.format(biggest_basket))

Let's use vectors of length 14 for 49685 products
The biggest basket (window in our algorithm) will be 145


### Shaping our Data
The gensim implementation of word2vec expects each document to be a list of tokens. Traditionally, each document is a list of words; in this case, each basket is a list of products. We will use the product name rather than the product ID as the token, which costs more memory but makes interpretation easier.
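
As a small illustration of the target shape, the sketch below groups a handful of (order_id, product_name) rows into one list per basket using only the standard library. The rows are copied from the sample output earlier in this notebook; the variable names `rows`, `basket_lists`, and `sentences` are hypothetical.

```python
from collections import defaultdict

# A few (order_id, product_name) rows copied from the sample output above.
rows = [
    (1, "Bulgarian Yogurt"),
    (1, "Cucumber Kirby"),
    (2, "Coconut Butter"),
    (1, "Organic Hass Avocado"),
    (2, "Original Unflavored Gelatine Mix"),
]

# Collect product names per order; each value is one "document" for word2vec.
basket_lists = defaultdict(list)
for order_id, product_name in rows:
    basket_lists[order_id].append(product_name)

# Word2Vec takes an iterable of token lists, so only the values are needed.
sentences = list(basket_lists.values())
print(sentences)
```

The order IDs themselves never reach the model; they only decide which products end up in the same list.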

In [23]:
baskets

| | order_id | product_name |
|---|---|---|
| 15854604 | 1526854 | Salted Caramel Greek Nonfat Yogurt |
| 29196199 | 2932874 | Honey Almond Butter Single |
| 11154781 | 1031594 | Organic Ginger Root |
| 26696970 | 2669586 | Organic Bosc Pear |
| 30049449 | 3023079 | Non Fat Raspberry Yogurt |
| ... | ... | ... |
| 25142377 | 2505655 | Extra Virgin Olive Oil |
| 27358394 | 2739053 | Organic Whole Strawberries |
| 16770616 | 1623554 | Chocolate Cashew Milk |
| 12391114 | 1161953 | Soft Pretzel Burger Buns |


In [24]:
#df_of_basket_lists = (baskets
#        .groupby('order_id')
#        .apply(lambda baskets :
#                baskets.product_name
#                .tolist()
#               )
#       )
df_of_basket_lists = baskets.groupby('order_id')['product_name'].agg(list)

#memory management
del(baskets)

In [25]:
df_of_basket_lists.head()

order_id
1    [Cucumber Kirby, Organic Whole String Cheese, ...
2    [All Natural No Stir Creamy Almond Butter, Coc...
3    [Lemons, Total 2% with Strawberry Lowfat Greek...
4    [Plain Pre-Sliced Bagels, Sugarfree Energy Dri...
5    [American Slices Cheese, Wafer, Chocolate, Bag...
Name: product_name, dtype: object

In [27]:
model = Word2Vec(df_of_basket_lists, vector_size=embedding_size, window=biggest_basket)

In [29]:
def cosine_similarity(word_u, word_v, model):
    """
    Look up the embeddings of two products in our word2vec model and
    compute the cosine similarity between them.

    Arguments:
        word_u - product name (str) present in the model vocabulary
        word_v - product name (str) present in the model vocabulary
        model - trained gensim Word2Vec model

    Returns:
        cosine similarity between the embeddings of word_u and word_v
    """
    # get embeddings from gensim model
    u = model.wv[word_u]
    v = model.wv[word_v]

    # compute similarity
    dot = np.dot(u, v)
    norm_u = np.sqrt(np.sum(u * u))
    norm_v = np.sqrt(np.sum(v * v))
    similarity = dot / (norm_u * norm_v)

    return similarity
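
To help read the scores that follow, here is a minimal standard-library sketch of the same formula applied to hand-checkable vectors. The function `cosine` and the 2-d vectors are hypothetical and exist only for this illustration: parallel vectors should score close to 1, perpendicular vectors 0, and opposite vectors close to -1.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-d vectors chosen so the answers are easy to verify by hand.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # anti-parallel vectors

print(same_direction, orthogonal, opposite)
```

With a trained model, gensim's `model.wv.similarity(word_u, word_v)` should return the same value as the hand-written function and can serve as a cross-check.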

In [30]:
# a pair of near-identical items
cosine_similarity('Organic Whole Milk','Organic Reduced Fat Milk',model)

0.9441413

In [31]:
# a pair of very different items
cosine_similarity('Bag of Organic Bananas','Party Tumblers',model)

-0.40276483

In [32]:
# a pair of similar items within a department
cosine_similarity('Bag of Organic Bananas','Limes',model)

0.38589585

In [33]:
cosine_similarity('Lemons','Limes',model)

-0.21006641

In [37]:
cosine_similarity('Bag of Organic Bananas','Organic Whole Strawberries',model)

0.6742805
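
Pairwise checks only go so far; a common way to eyeball an embedding is to list an item's nearest neighbours. The sketch below does this over a small dictionary of hypothetical 2-d vectors (invented for illustration, not taken from the trained model), and the helpers `cosine` and `nearest` are likewise assumptions. With the real model, gensim's `model.wv.most_similar` serves the same purpose.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(item, vectors, topn=2):
    """Rank every other item by cosine similarity to `item`, highest first."""
    scores = [(other, cosine(vectors[item], vec))
              for other, vec in vectors.items() if other != item]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors, invented for this example only.
toy_vectors = {
    "Organic Whole Milk": (1.0, 0.1),
    "Organic Reduced Fat Milk": (0.9, 0.2),
    "Limes": (0.1, 1.0),
    "Party Tumblers": (-1.0, -0.2),
}

neighbours = nearest("Organic Whole Milk", toy_vectors)
print([name for name, _ in neighbours])
```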