# Sentence embeddings: from the beginning to BERT
<img src='images/bert-ernie-sesame-street.jpg' style='width:800px'/>


### A brief overview of developments in NLP
* word2vec: word embeddings
    * fastText and GloVe
    * doc2vec

* seq2seq neural networks
    * Encoder-decoder networks
    * Attention
    * Transformer model
    * BERT

### word2vec 
https://arxiv.org/abs/1301.3781

Represent each word by the words that are used in the same context.

“Show me your friends, and I'll tell you who you are.”

<img src='images/word2vec.png' style='width:600px'/>
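
To make "words used in the same context" concrete, here is a minimal sketch of how (center, context) training pairs can be collected with a sliding window, as in the skip-gram variant of word2vec. The helper name `skipgram_pairs` and the window size are illustrative assumptions; the reference implementation adds further details (such as subsampling of frequent words) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:4])
```

A word2vec model is then trained to predict the context word from the center word, so words that share many contexts end up with similar vectors.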

## fastText and GloVe

* Useful packages for training word2vec-style word embedding models
* Provide pretrained models for a range of languages
<img src='images/fasttext.png' style='width:400px'/>

### Setting up fastText
```bash

%%bash 
pip install -U tensorflow==1.14.0 gensim==3.8.0 fasttext==0.9.1 

wget 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip'
mkdir fasttext 
unzip 'wiki-news-300d-1M.vec.zip' -d fasttext
rm 'wiki-news-300d-1M.vec.zip'
```


In [8]:
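import re

import numpy as np


def cosine_sim(s1, s2):
    """Cosine similarity of two vectors; the small epsilon guards against division by zero."""
    return np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2) + 1e-9)


def get_fasttext_wv(word, model, vec_len=300):
    """Look up the vector for `word`; out-of-vocabulary words map to a zero vector."""
    try:
        wv = model[word]
    except KeyError:
        wv = np.zeros(vec_len)
    return wv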


def fasttext_doc2vec(string, model, stop_words=(), weights=None, smoothing_factor=1, n_dim=300):
    """
    Sentence embedding from a word embedding model and a sentence string.
    args:
    string(str): the string to be embedded
    model(gensim.models.keyedvectors.Word2VecKeyedVectors): the model
    stop_words(iterable): optional stop words to be removed
    weights(dict): optional dict with weights of words
    smoothing_factor(int): smooths the sentence vector. Every word without an explicit
    weight gets this value as its weight. smoothing_factor=0 is equivalent (up to a
    scaling of the result) to ignoring words that have no explicit weight.
    Set to 1 if unsure.
    """
    weights = weights or {}
    # A set ignores repeated uses of a word in the doc. Stop words are dropped here,
    # before vectors and weights are built, so that the two lists stay aligned.
    words = [w for w in set(re.findall(r"[\w'-]+", string.lower())) if w not in stop_words]

    if words:
        word_vectors = [get_fasttext_wv(word, model, n_dim) for word in words]
        word_weights = [weights.get(word, smoothing_factor) for word in words]
        # Mean of the unit-normalised word vectors, each scaled by its weight
        se = np.mean([vec / (np.linalg.norm(vec) + 1e-9) * weight
                      for vec, weight in zip(word_vectors, word_weights)], axis=0)
    else:
        print('Zero doc2vec vector for: ' + string)
        se = np.zeros(n_dim)
    return se



In [9]:
import re

import numpy as np
from gensim.models import KeyedVectors

# Load the pretrained vectors from the word2vec text format
filename = 'fasttext/wiki-news-300d-1M.vec'
fasttext_model = KeyedVectors.load_word2vec_format(filename)



In [29]:
# Texts to compare
string_1 = 'What is your favourite hobby?'
string_2 = 'Wat do you really like to do in your spare time?'


str_1_emb = fasttext_doc2vec(string_1, fasttext_model)
str_2_emb = fasttext_doc2vec(string_2, fasttext_model)

s1_s2_sim_fasttext = cosine_sim(str_1_emb, str_2_emb)

print('Document similarity between \n\n(1) {s1}\n(2) {s2}\nis: {sim}'.format(s1=string_1,
                                                                       s2=string_2,
                                                                       sim=s1_s2_sim_fasttext))

Document similarity between 

(1) What is your favourite hobby?
(2) Wat do you really like to do in your spare time?
is: 0.8130461087004068
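
The `weights` argument of `fasttext_doc2vec` is left empty in the run above. One common way to fill it, assumed here rather than taken from the notebook, is inverse document frequency (IDF), so that words occurring in many documents count less. The helper `idf_weights` and the toy corpus are hypothetical.

```python
import math
import re
from collections import Counter


def idf_weights(corpus):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Lowercased word tokens; the set counts each word once per document
        doc_freq.update(set(re.findall(r"[\w'-]+", doc.lower())))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}


corpus = [
    "What is your favourite hobby?",
    "What do you do in your spare time?",
    "What time is it?",
]
weights = idf_weights(corpus)
print(weights["what"], weights["hobby"])
```

The resulting dict can be passed as `fasttext_doc2vec(string_1, fasttext_model, weights=weights)`. With this formula a word that appears in every document gets weight 1, which matches the default `smoothing_factor=1` used for words without an explicit weight.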


## doc2vec
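* Building representations of whole sentences or documents is not straightforward
* Bag of words (BoW)
* Weighted mean vectors
* Doc2VecC - https://arxiv.org/pdf/1707.02377.pdf
* Word Mover's Distance (WMD) - http://proceedings.mlr.press/v37/kusnerb15.pdf
* Word order and homonyms are hard to handle (Swedish examples: banan, "banana" or "the track"; rabatt, "discount" or "flower bed"; springa, "to run" or "a narrow gap")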
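
The word-order problem in the last bullet can be illustrated with a small bag-of-words sketch. The helpers `bow` and `bow_cosine` are hypothetical names introduced for this example, and whitespace splitting stands in for real tokenisation.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words representation: word -> count; word order is discarded."""
    return Counter(text.lower().split())


def bow_cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


bow_a = bow("the dog bites the man")
bow_b = bow("the man bites the dog")
bow_c = bow("a cat sleeps")
print(bow_a == bow_b)  # same words, opposite meaning
print(bow_cosine(bow_a, bow_b), bow_cosine(bow_a, bow_c))
```

The two sentences with opposite meaning receive identical representations. Averaged word vectors share this limitation, since a mean does not depend on the order of its terms.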

## seq2seq
* RNN to RNN
* Two modules: encoder and decoder
<img src='images/encdec.gif' style='width:600px'/>

## Encoder-decoder network
<img src='images/enc-dec network.PNG' style='width:1000px'/>

### Encoder-decoder network with attention
<img src='images/attention layer.PNG' style='width:1000px'/>
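
As a rough sketch of what the attention layer computes: at each decoding step, the decoder state is compared with every encoder state, the scores are turned into weights with a softmax, and the weighted sum of encoder states is passed on as a context vector. Dot-product scoring is assumed here for simplicity; the network in the figure may use a different scoring function. The one-hot encoder states are toy values chosen so the result is easy to read.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)


def attend(decoder_state, encoder_states):
    """Dot-product attention of one decoder state over all encoder states."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # non-negative, sums to 1
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights


encoder_states = np.eye(5, 8)               # 5 source tokens, hidden size 8 (toy values)
decoder_state = 3.0 * encoder_states[2]     # constructed to resemble source position 2
context, weights = attend(decoder_state, encoder_states)
print(weights.round(3), weights.argmax())
```

Most of the attention weight falls on the source position that the decoder state resembles, so the context vector is dominated by that encoder state.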

### Attention is all you need – Transformers 
> https://arxiv.org/abs/1706.03762 

> http://jalammar.github.io/illustrated-transformer/


<img src='images/the_transformer_3.png' style='width:1000px'/>
<img src='images/transformer_4.png' style='width:500px'/>

### Transformer multi-head attention

<img src='images/transform20fps.gif' style='width:400px'/>

* No RNN modules, so parallelization is easier
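
The parallelization point can be seen in a minimal NumPy sketch of scaled dot-product attention with several heads: all positions are handled in a few matrix multiplications, with no step-by-step recurrence. The random matrices stand in for learned projection weights, the sizes are arbitrary toy values, and the final output projection of the paper is omitted.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all positions at once."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))   # toy token representations

head_outputs = []
for _ in range(n_heads):
    # Random projections stand in for the learned W_Q, W_K, W_V of each head
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    out, attn = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
    head_outputs.append(out)

multi_head = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model)
print(multi_head.shape)
```

Each head attends with its own projections, and the head outputs are concatenated back to the model dimension.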


# BERT 
> "Bidirectional Encoder Representations from Transformers"

The focus is on the representations produced by the encoder part, which play roughly the same role as doc2vec.
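
The encoder produces one vector per token, so some pooling step is needed to obtain a single sentence vector. A common choice is to average the token vectors while ignoring padding. The sketch below assumes that strategy; the bert-serving-server documentation describes mean pooling (`REDUCE_MEAN`) as its default, but check this against the version you install. The function `masked_mean_pool` and the toy arrays are hypothetical stand-ins for real encoder output.

```python
import numpy as np


def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors over real tokens only, ignoring padding positions."""
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)  # (batch, seq_len, 1)
    summed = (token_vectors * mask).sum(axis=1)                    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                 # avoid division by zero
    return summed / counts


# Toy stand-in for encoder output: 2 sentences, 4 positions, hidden size 3
token_vectors = np.arange(24, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 0],    # last position is padding
                           [1, 1, 0, 0]])   # last two positions are padding
sentence_vectors = masked_mean_pool(token_vectors, attention_mask)
print(sentence_vectors)
```

The result has one fixed-size vector per sentence regardless of sentence length, which is what makes it comparable with cosine similarity.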





```bash
%%bash 
pip install bert-serving-server==1.9.6 bert-serving-client==1.9.6
wget 'https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip'
mkdir bert 
unzip 'uncased_L-12_H-768_A-12.zip' -d bert
rm 'uncased_L-12_H-768_A-12.zip'
```


##### Start the BERT server (in a separate terminal)
```bash
bert-serving-start -model_dir '/home/gurra/coding/bert/bert_tutorial/bert/uncased_L-12_H-768_A-12' -num_worker=1 
```

In [31]:
from bert_serving.client import BertClient
bc=BertClient()

In [32]:


s1 = bc.encode([string_1])
s2 = bc.encode([string_2])
s1_s2_sim_bert = cosine_sim(s1[0],s2[0])

print('Document similarity between \n\n(1){s1}\n(2){s2}\nis: {sim}'.format(s1=string_1,
                                                                       s2=string_2,
                                                                       sim=s1_s2_sim_bert))

Document similarity between 

(1)What is your favourite hobby?
(2)Wat do you really like to do in your spare time?
is: 0.8376639668328283


### Further research

* XLNet - https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/
    