# Embeddings










**Ariel Rossanigo**

git clone git@github.com:arielrossanigo/embeddings.git

### Who am I?

* Ariel Rossanigo
* Artificial Intelligence teacher at UCSE-DAR
* Developer, Data Scientist
* Co-Founder of Bloom AI

### Embeddings 

<div><img src="./imgs/mnist_embedding.png" width="40%" style="float: right; margin: 10px;" align="right"></div>


#### *Mapping of a discrete (categorical) variable to a vector of continuous numbers preserving some meaningful properties*




### NLP & Word2Vec

#### How can we preprocess words in order to solve NLP tasks?

* One hot encoding 

* Word embeddings



#### One hot encoding 


* Each word is a vector of length *V*, where *V* is the vocabulary size
* Simple
* Huge vectors when *V* is large (and typically it is)
* Loses *semantic* meaning

<div><img src="./imgs/one_hot_encoding.png" width="35%" style="float: right; margin: 10px;"></div>
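As a minimal sketch of the points above (the four-word vocabulary is a made-up toy example), one-hot vectors can be built with plain Python. Note that any two different words have a dot product of 0, which is why this encoding carries no notion of similarity:

```python
# Hypothetical toy vocabulary; real vocabularies have many thousands of words.
vocabulary = ["cat", "dog", "car", "truck"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector of length V with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("dog"))  # [0, 1, 0, 0]

# "cat" is exactly as unrelated to "dog" as it is to "truck".
print(dot(one_hot("cat"), one_hot("dog")))    # 0
print(dot(one_hot("cat"), one_hot("truck")))  # 0
```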



#### Word2Vec

* Maps each word to a low-dimensional vector of real numbers
* Preserves semantic relationships
* Assumes that words that share similar contexts will be similar to each other

* In the original paper [1], the authors present two models: Continuous Bag of Words (CBOW) and Skip-Gram


<div><img src="./imgs/word2vec.png" height="20%" ></div>

### Word2Vec Basics

* The basic idea is to use the context in which each word appears

<div><img src="./imgs/context_window.png" width="50%" ></div>
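A rough sketch of how a context window turns a sentence into (target, context) training pairs. The function name `context_pairs` and the example sentence are illustrative assumptions, and a fixed window is used for simplicity:

```python
def context_pairs(tokens, window_size):
    """Yield (target, context) pairs for every word within window_size of the target."""
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


sentence = "the quick brown fox jumps".split()
pairs = list(context_pairs(sentence, window_size=1))
for pair in pairs:
    print(pair)
```

With `window_size=1` each word is paired only with its immediate neighbours; a larger window produces more pairs per word.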


### Word2Vec Basics

* Two different ways of looking at the problem

<div><img src="./imgs/skipgram_nn.png" width="40%" style="float: left; margin: 10px;"></div>
<div><img src="./imgs/skipgram_dual.png" width="50%" style="float: right; margin: 10px;"></div>


### Word2Vec Skip-Gram with Negative Sampling

* An optimization of the second approach
* Instead of using only observed pairs (w, c), it creates observed pairs (w, c) labeled 1 and sampled negative pairs (w, nc) labeled 0, where nc is a context not observed with w, and trains a binary classifier on them

* Other improvements are described in the original paper [2], and a *simpler* explanation can be found in [3]
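A simplified sketch of how the labeled training examples could be built. The helper `build_training_examples` and the toy data are assumptions for illustration, and negatives are drawn uniformly here, whereas Word2Vec draws them from a frequency-based distribution:

```python
import random


def build_training_examples(pairs, vocabulary, num_negatives, seed=0):
    """Label each observed (w, c) pair 1 and add num_negatives sampled (w, nc) pairs labeled 0."""
    rng = random.Random(seed)
    observed = set(pairs)
    examples = []
    for word, context in pairs:
        examples.append((word, context, 1))
        # Candidate negatives: anything that is not the word itself
        # and was never observed as a context of that word.
        candidates = [v for v in vocabulary if v != word and (word, v) not in observed]
        for negative in rng.choices(candidates, k=num_negatives):
            examples.append((word, negative, 0))
    return examples


# Hypothetical toy data
vocabulary = ["milk", "bread", "butter", "soap", "shampoo"]
pairs = [("milk", "bread"), ("bread", "milk"), ("soap", "shampoo"), ("shampoo", "soap")]

examples = build_training_examples(pairs, vocabulary, num_negatives=2)
for example in examples:
    print(example)
```

The classifier then only has to tell real pairs from sampled ones, instead of computing a softmax over the whole vocabulary.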



### How can I obtain the "embedding"?

* In the first approach, it is the output of W1
* In the second approach, it is the output of *target_embedding_layer*, or a combination of it with *context_embedding_layer*; alternatively, we can force both layers to share exactly the same weights by using a siamese network.

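For the first approach, a small sketch of why "the output of W1" is just a table lookup: multiplying a one-hot vector by W1 selects one row of W1. The weights below are made-up numbers, not trained values:

```python
vocabulary = ["cat", "dog", "car"]

# Hypothetical weights W1 with shape (V, embedding_size) = (3, 2)
W1 = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
]


def one_hot(index, size):
    """One-hot row vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector of length V by a V x d matrix."""
    columns = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(columns)
    ]


index = vocabulary.index("dog")
hidden = vector_times_matrix(one_hot(index, len(vocabulary)), W1)

print(hidden)     # [0.8, 0.2]
print(W1[index])  # [0.8, 0.2], the same row: the embedding of "dog"
```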



### Some astonishing facts about semantics...

* Before Word2Vec, the semantic relationships captured were simple; for instance, *France* is similar to *Italy* and to other country names...

* The Word2Vec paper shows that simple algebraic operations on the vectors can answer analogy questions:

    * vector(biggest) - vector(big) + vector(small) gives a vector whose closest word is "smallest"
    * vector(Paris) - vector(France) + vector(Germany) ~= "Berlin"
    * vector(impossibly) - vector(possibly) + vector(ethical) ~= "unethical"

<div><img src="./imgs/embeddings_can_fail.png" width="80%" style="float: middle; margin: 10px;"></div>
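A toy sketch of the analogy arithmetic, answering "a is to b as c is to ?" with the word closest to vector(b) - vector(a) + vector(c). The 3-dimensional vectors are hand-made so that the analogy holds; they are assumptions for illustration, not real Word2Vec output:

```python
import math

# Hypothetical hand-made vectors: (is-French, is-German, is-capital)
toy_vectors = {
    "france":  [1.0, 0.0, 0.0],
    "paris":   [1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "berlin":  [0.0, 1.0, 1.0],
    "rome":    [0.2, 0.2, 1.0],
    "banana":  [0.3, 0.3, -0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c), excluding the inputs."""
    query = [
        vb - va + vc
        for va, vb, vc in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])
    ]
    candidates = [word for word in toy_vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(query, toy_vectors[word]))


print(analogy("france", "paris", "germany"))  # berlin
```

With real embeddings the result is only approximate, and the closest word is not always the expected one.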

## Other uses of word embeddings: Prod2Vec[4]

* Same ideas, but instead of sentences composed of words we have orders composed of products...
* With the dual-vector approach we can find both similar products and complementary products, which is useful for recommendations like:
    * if you want to buy this... other options are these
    * if you bought this... maybe you would like to buy these things too

<div><img src="./imgs/skipgram_dual.png" width="80%" style="float: middle; margin: 10px;"></div>
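One possible way to read the two kinds of recommendation out of the dual vectors, sketched under assumptions: similar products are compared target-to-target, while complementary products are scored by the dot product between one product's target vector and another product's context vector. All vectors and product names below are made up:

```python
import math

# Hypothetical 2-D vectors; a real model would learn them from orders.
target_vectors = {
    "cola":   [1.0, 0.1],
    "pepsi":  [0.9, 0.2],
    "chips":  [0.1, 1.0],
    "nachos": [0.2, 0.9],
}
# In this toy setup, snacks are the typical context of drinks and vice versa.
context_vectors = {
    "cola":   [0.1, 1.0],
    "pepsi":  [0.2, 0.9],
    "chips":  [1.0, 0.1],
    "nachos": [0.9, 0.2],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def most_similar(product):
    """Substitute: the product whose target vector is closest to this one's."""
    others = [p for p in target_vectors if p != product]
    return max(others, key=lambda p: cosine(target_vectors[product], target_vectors[p]))


def most_complementary(product):
    """Complement: the product most likely to appear in this one's context."""
    others = [p for p in context_vectors if p != product]
    return max(others, key=lambda p: dot(target_vectors[product], context_vectors[p]))


print(most_similar("cola"))        # pepsi
print(most_complementary("cola"))  # chips
```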


### Let's see some code...

We will use the *gensim* library, which provides Word2Vec implementations.

In [1]:
# !pip install gensim
# !pip install pandas
# !pip install matplotlib
# !pip install pyarrow

In [2]:
import pandas as pd

In [4]:
df = pd.read_parquet('sample_of_orders.parquet')

In [5]:
df.shape

(1816652, 3)

In [6]:
# Group each order (txn_id) into the list of distinct products it contains
sentences = df.groupby('txn_id').product_name.unique()

In [7]:
sentences.head()

txn_id
1000003160592747140146510920181013246      [ball park cheese franks            14 oz, pri...
10000043206055511401520209201903101016     [private label - fz seafood, private label - c...
1000006510601821720148820220190126466      [private label - internal analgesics, private ...
10000428105966328001159720181127959524     [bigelow bnft, private label - butter/butter b...
100004402061220148011101220190526441034    [lil potato savoury herb, bubba burger origina...
Name: product_name, dtype: object

In [8]:
# Convert each order to a plain Python list; the result is a NumPy array of lists
as_arrays = sentences.map(list).values

In [9]:
as_arrays[0:2]

array([list(['ball park cheese franks            14 oz', 'private label - rfg salad/coleslaw', 'private label - soup', 'private label - dough/biscuit dough - rfg', 'bar-s smokehouse franks']),
       list(['private label - fz seafood', 'private label - carbonated beverages', 'jack link beef orginal jerky     2.85 oz', 'jlytm blast o butter 10.5 oz', 'edys sc light vanilla bean cup'])],
      dtype=object)

In [10]:
import gensim



Some relevant parameters of `gensim.models.Word2Vec`, as described in the gensim documentation:

- sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
- window (int, optional) – Maximum distance between the current and predicted word within a sentence.
- min_count (int, optional) – Ignores all words with total frequency lower than this.
- hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
- negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
- ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
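To make the effect of `ns_exponent` concrete, here is a small sketch that raises word counts to the exponent and normalizes them, following the description above. The counts are invented, and the helper `negative_sampling_probabilities` is not part of gensim:

```python
def negative_sampling_probabilities(counts, ns_exponent):
    """Return P(word) proportional to count ** ns_exponent."""
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical product frequencies
counts = {"cola": 900, "chips": 90, "caviar": 10}

for exponent in (1.0, 0.75, 0.0):
    probabilities = negative_sampling_probabilities(counts, exponent)
    rounded = {word: round(p, 3) for word, p in probabilities.items()}
    print(exponent, rounded)
```

With an exponent of 1.0 the rare product is almost never drawn as a negative, with 0.0 every product is equally likely, and 0.75 sits in between.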

In [13]:
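# Train the Word2Vec model
window_size = 20
epochs = 5
embedding_size = 100

# Drop products that appear fewer than min_count times in the dataset
min_count = 5  
number_of_negative_samples = 7
# 0 samples all products equally as negatives, regardless of frequency
ns_exponent = 0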

model = gensim.models.Word2Vec(
    sentences=as_arrays,
    sg=1,  # use skipgram
    vector_size=embedding_size, 
    window=window_size, 
    min_count=min_count, 
    workers=4,
    hs=0,
    negative=number_of_negative_samples,
    ns_exponent=ns_exponent, 
    epochs=epochs
)

In [14]:
word_vectors = model.wv

In [15]:
# product = 'edys sc light vanilla bean cup'
product = 'coke classic  nr 20 oz'

In [16]:
word_vectors[product]

array([-0.37786067,  0.10014538, -0.3283386 , -0.2734145 ,  0.32060197,
       -0.5418468 , -0.05063178,  0.08003001,  0.11638877, -0.20754491,
       -0.11380889, -0.30682042,  0.12010448,  0.1645465 ,  0.51134473,
        0.18624924,  0.05772192,  0.19505486, -0.3844798 , -0.64751995,
       -0.1735677 , -0.47491857, -0.44679746, -0.09748479, -0.4362577 ,
        0.19383126, -0.40425456, -0.01894782, -0.1811316 , -0.10646037,
        0.49161908, -0.13181087, -0.12184609, -0.18782769, -0.16105498,
        0.05481014, -0.18293212,  0.19886328, -0.04804575, -0.42942217,
        0.01384899,  0.06187457, -0.25201553, -0.18278848,  0.15408042,
       -0.00554037, -0.3577863 , -0.39891836,  0.12177671,  0.1775741 ,
        0.14827101, -0.12403344, -0.14512165, -0.36569324,  0.09821078,
       -0.26531336,  0.2678501 ,  0.05258592, -0.28740278, -0.28290296,
       -0.27937248, -0.07517467,  0.17154448, -0.09534959, -0.32005057,
        0.28445038,  0.25396556,  0.47513235, -0.51146364,  0.15

In [17]:
word_vectors.similar_by_word(product, topn=10)

[('coke sprite nr 20 oz', 0.9018566608428955),
 ('dr ppr nr 20 oz', 0.8920454382896423),
 ('pepsi nr 20 oz', 0.8843024373054504),
 ('mt dew nr 20 oz', 0.8816001415252686),
 ('m&m pnut choc candy ks 1 ct', 0.8590784668922424),
 ('snkers almond mixed shpr 1 ct', 0.854461669921875),
 ('slmjim jim giant .97 oz', 0.8429577946662903),
 ('snkers king size 2 - piece bar 1 ct', 0.8365886807441711),
 ('reese pnt btr cup kng sz 1 ct', 0.8359562754631042),
 ('fanta orange 20 oz', 0.829744279384613)]
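The scores above are cosine similarities between embedding vectors. As a rough sketch of what such a ranking does (not gensim's actual implementation, and with made-up 2-D vectors), the same kind of `(name, score)` list can be produced by hand:

```python
import math

# Hypothetical 2-D embeddings
toy_embeddings = {
    "cola":   [1.0, 0.0],
    "pepsi":  [0.8, 0.6],
    "sprite": [0.6, 0.8],
    "soap":   [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similar_by_name(name, topn):
    """Rank every other item by cosine similarity to the given one."""
    scores = [
        (other, cosine(toy_embeddings[name], vector))
        for other, vector in toy_embeddings.items()
        if other != name
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


ranking = similar_by_name("cola", topn=2)
print(ranking)
```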

In [18]:
products = []
vectors = []
for product in df.product_name.unique():
    try:
        vectors.append(word_vectors[product])
        products.append(product)
    except KeyError:
        # Products dropped by min_count have no vector; skip them
        pass

In [19]:
df_products = pd.DataFrame({'product_name': products})

In [20]:
df_vectors = pd.DataFrame(vectors)

In [21]:
df_products.to_csv('products_metadata.tsv', sep='\t', index=False, header=False)
df_vectors.to_csv('products_embeddings.tsv', sep='\t', index=False, header=False)

* Now the two TSV files can be loaded into the Embedding Projector at https://projector.tensorflow.org/

## More uses for a future talk

* Encoding categorical variables with Cat2Vec. TL;DR: use a supervised method to train the embeddings


### Thanks! Questions?


<div style="float: left;"><img src="../common/imgs/man-qmark.jpg" width="300" align="middle"></div> 

<div>
<div>
  <img src="../common/imgs/gmail-1162901_960_720.png" style="width: 30px; float: left; vertical-align:middle; margin: 0px;">
  <span style="line-height:30px; vertical-align:middle; margin-left: 10px;">arielrossanigo@gmail.com</span>
</div>
<div>
  <img src="../common/imgs/twitter-312464_960_720.png" style="width: 30px; float: left; vertical-align:middle; margin: 0px;">
  <span style="line-height:30px; vertical-align:middle; margin-left: 10px;">@arielrossanigo</span>
</div>
<div>
  <img src="../common/imgs/github-154769__340.png" style="width: 30px; float: left; vertical-align:middle; margin: 0px;">
  <span style="line-height:30px; vertical-align:middle; margin-left: 10px;">https://github.com/arielrossanigo</span>
</div>
<div>
  <img src="../common/imgs/Linkedin_icon.svg" style="width: 30px; float: left; vertical-align:middle; margin: 0px;">
  <span style="line-height:30px; vertical-align:middle; margin-left: 10px;">https://www.linkedin.com/in/arielrossanigo/</span>
</div>

</div>


# References

[1] Efficient Estimation of Word Representations in Vector Space (https://arxiv.org/abs/1301.3781)

[2] Distributed Representations of Words and Phrases and their Compositionality (https://arxiv.org/abs/1310.4546)

[3] word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method (https://arxiv.org/abs/1402.3722)

[4] E-commerce in Your Inbox: Product Recommendations at Scale (https://arxiv.org/abs/1606.07154)
