# Sentence embedding - from the start to BERT

<img src='images/bert-ernie-sesame-street.jpg' style='width:800px'/>



# Parts of the notebook

### Word2vec – embeddings

* fastText and GloVe
* doc2vec
* Coding example - using fastText

### seq2seq neural networks

* Encoder-decoder networks
* Attention
* Transformer model
* BERT
* Coding example - using BERT


# word2vec 
https://arxiv.org/abs/1301.3781

Represent each word by the words that are used in the same context.

"Show me your friends, and I'll tell you who you are."

<img src='images/word2vec.png' style='width:600px'/>
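
To make "words used in the same context" concrete, here is a minimal sketch of how (center, context) training pairs can be collected with a sliding window, in the spirit of the skip-gram variant of word2vec. The function name `context_pairs` and the window size are illustrative assumptions and not part of any library.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for all tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the man is robbing the bank".split()
pairs = context_pairs(tokens, window=1)
print(pairs[:4])
```

A word2vec model is then trained so that words sharing many context words end up with similar vectors.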

### fastText and GloVe

* Useful packages for training word embedding models of the word2vec kind
* Offer pretrained models for many languages
<img src='images/fasttext.png' style='width:400px'/>

### Setting up fastText
```bash

%%bash 
pip install -U tensorflow==1.14.0 gensim==3.8.0 fasttext==0.9.1 keras==2.2.5

wget 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip'
mkdir fasttext 
unzip 'wiki-news-300d-1M.vec.zip' -d fasttext
rm 'wiki-news-300d-1M.vec.zip'
```


### Using fastText for document similarity

In [116]:
import re

import numpy as np
from gensim.models import KeyedVectors


def cosine_sim(s1, s2):
    # Cosine similarity; the small constant avoids division by zero
    return np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2) + 1e-9)


def get_fasttext_wv(word, model, vec_len=300):
    # Look up a word vector; out-of-vocabulary words map to the zero vector
    try:
        wv = model[word]
    except KeyError:
        wv = np.zeros(vec_len)
    return wv


def fasttext_doc2vec(string, model, stop_words=None, weights=None, smoothing_factor=1, n_dim=300):
    """
    Sentence embedding from a word embedding model and a string.
    args:
    string(str): the string to be embedded
    model(gensim.models.keyedvectors.Word2VecKeyedVectors): the model
    stop_words(list): optional stop words to be removed
    weights(dict): optional dict with per-word weights
    smoothing_factor(int): weight given to every word without an explicit weight.
    smoothing_factor=0 is equivalent to ignoring words that have no explicit
    weight. Set to 1 if unsure.
    n_dim(int): dimensionality of the word vectors
    """
    # None defaults avoid sharing one mutable list/dict between calls
    stop_words = set(stop_words or [])
    weights = weights or {}
    string = ' '.join(string.split(' ')[0:256])  # only take the first 256 words (to make it fair with BERT)
    # The set ignores repeated uses of a word in the doc
    words = [w for w in set(re.findall(r"[\w'-]+", string.lower())) if w not in stop_words]

    if words:
        # Vectors and weights come from the same filtered list, so they stay aligned
        word_vectors = [get_fasttext_wv(word, model, n_dim) for word in words]
        word_weights = [weights.get(word, smoothing_factor) for word in words]
        se = np.mean([vec / (np.linalg.norm(vec) + 1e-9) * weight
                      for vec, weight in zip(word_vectors, word_weights)], axis=0)
    else:
        print('Zero doc2vec vector for: ' + string)
        se = np.zeros(n_dim)
    return se



In [65]:
filename = 'fasttext/wiki-news-300d-1M.vec'
fasttext_model = KeyedVectors.load_word2vec_format(filename)

In [117]:
## texts to compare
string_1 = 'The man is robbing the bank'
string_2 = 'The guy is stealing from the financial institution'

str_1_emb = fasttext_doc2vec(string_1, fasttext_model)
str_2_emb = fasttext_doc2vec(string_2, fasttext_model)

s1_s2_sim_fasttext = cosine_sim(str_1_emb, str_2_emb)

print('Document similarity between \n\n(1) {s1}\n(2) {s2}\nis: {sim}'.format(s1=string_1,
                                                                       s2=string_2,
                                                                       sim=s1_s2_sim_fasttext))

Document similarity between 

(1) The man is robbing the bank
(2) The guy is stealing from the financial institution
is: 0.893739540349151
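
The `weights` argument of `fasttext_doc2vec` is not used in the example above. As a rough, dependency-free illustration of what per-word weights do to an averaged embedding, the sketch below uses made-up 2-dimensional vectors. `toy_vectors`, `weighted_mean_embedding` and the weight values are assumptions chosen for illustration; like the function above, it normalizes each word vector, scales it by its weight, and divides by the number of words.

```python
import math

# Hypothetical 2-dimensional "word vectors", invented for this illustration
toy_vectors = {
    "man": [1.0, 0.0],
    "bank": [0.0, 1.0],
}


def unit(vec):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + 1e-9) for x in vec]


def weighted_mean_embedding(words, weights=None, default_weight=1.0):
    """Mean of unit word vectors, each scaled by its weight."""
    weights = weights or {}
    dim = len(next(iter(toy_vectors.values())))
    total = [0.0] * dim
    for word in words:
        w = weights.get(word, default_weight)
        for k, x in enumerate(unit(toy_vectors[word])):
            total[k] += w * x
    return [x / len(words) for x in total]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-9)


plain = weighted_mean_embedding(["man", "bank"])
bank_heavy = weighted_mean_embedding(["man", "bank"], weights={"bank": 3.0})
print(cosine(plain, toy_vectors["bank"]))
print(cosine(bank_heavy, toy_vectors["bank"]))
```

Giving "bank" a larger weight pulls the document vector towards the "bank" direction. In practice the weights could come from something like TF-IDF scores, which is a suggestion rather than something this notebook does.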


### Using fastText to analyze IMDB reviews

```bash

%%bash
wget http://mng.bz/0tIo
mkdir imdb 
unzip '0tIo' -d imdb
rm '0tIo'

```

In [118]:
import keras
import os
import numpy as np
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers.normalization import BatchNormalization

from random import shuffle


def load_raw_imdb(imdb_dir = '/home/gurra/code/private/bert/imdb', data_type='train'):
    # imdb_dir is machine-specific; point it at the directory that contains the train folder
    data_dir = os.path.join(imdb_dir, data_type)

    labels = []
    texts = []

    for label_type in ['neg', 'pos']:
        dir_name = os.path.join(data_dir, label_type)
        for fname in os.listdir(dir_name):
            if fname.endswith('.txt'):
                # The context manager closes the file even if reading fails
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    texts.append(f.read())
                # Negative reviews get label 0, positive reviews label 1
                labels.append(0 if label_type == 'neg' else 1)
    return texts, labels

def make_simple_model(input_dim):       
    model = Sequential()
    model.add(Dense(200, input_dim=input_dim))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(rate=0.5))
    model.add(Dense(100))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(rate=0.5))
    model.add(Dense(80))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(rate=0.5))
    model.add(Dense(50))
    model.add(Activation('relu'))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    return model


def shuffle_data(X, y):
    y=np.asarray(y)
    assert X.shape[0] == y.shape[0], 'data is not the same length'
    ind = list(range(X.shape[0]))
    shuffle(ind)
    return X[ind,:].copy(), y[ind].copy()


def compile_and_train(model, X_train, y_train):
    optimizer = optimizers.RMSprop(lr=0.0003, rho=0.9, epsilon=None, decay=0.001)
    model.compile(optimizer=optimizer, 
                   loss='binary_crossentropy', 
                   metrics=['accuracy'])
    
    hist = model.fit(X_train, y_train, epochs=100, batch_size=256, validation_split=0.1, shuffle=True, verbose = 0)
    for i in hist.history.keys():
        print('%s %.3f' % (i,hist.history[i][-1]))
    
    print('Best validation accuracy: %.3f ' % max(hist.history['val_acc']))
    return model, hist



In [119]:
train_texts, y_train = load_raw_imdb(data_type='train')
X_train = np.asarray([fasttext_doc2vec(sent, fasttext_model) for sent in train_texts])
X_fasttext, y_fasttext = shuffle_data(X_train, y_train)

In [120]:
word2vec_model = make_simple_model(X_train.shape[1])
w2v_mod, w2v_hist = compile_and_train(word2vec_model, X_fasttext, y_fasttext)

val_loss 0.332
val_acc 0.860
loss 0.297
acc 0.879
Best validation accuracy: 0.862 


### Summary of doc2vec
* Creating representations of whole sentences/documents is not straightforward
* BoW (bag of words)
* Weighted mean vectors
* Doc2VecC - https://arxiv.org/pdf/1707.02377.pdf
* WMD (Word Mover's Distance) - http://proceedings.mlr.press/v37/kusnerb15.pdf
* Hard to handle word order and homonyms (Swedish examples: banan = banana / the track, rabatt = discount / flower bed, springa = to run / a crack)
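
The word-order problem in the last bullet can be shown with a small sketch. The helper `bag_of_words` is a hypothetical name used only here; it counts words and throws away their positions.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring where it occurs."""
    return Counter(text.lower().split())


a = "the dog bites the man"
b = "the man bites the dog"
print(bag_of_words(a) == bag_of_words(b))
```

The two sentences mean different things but get identical representations. Any average of word vectors depends only on which words occur, so it has the same blind spot. A single vector per word also cannot tell the different senses of a homonym apart.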

# seq2seq 

* RNN to RNN
* Two modules: encoder and decoder
<img src='images/encdec.gif' style='width:600px'/>

## Encoder-decoder network
<img src='images/enc-dec network.PNG' style='width:1000px'/>

### Encoder-decoder network with attention
<img src='images/attention layer.PNG' style='width:1000px'/>

### Attention is all you need – Transformers 
> https://arxiv.org/abs/1706.03762 

> http://jalammar.github.io/illustrated-transformer/


<img src='images/the_transformer_3.png' style='width:1000px'/>
<img src='images/transformer_4.png' style='width:500px'/>

### Transformer multi-head attention

<img src='images/transform20fps.gif' style='width:400px'/>

* No RNN modules, so the computation is easier to parallelize
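
As a rough sketch of what a single attention head computes, the code below implements scaled dot-product attention, softmax(QKᵀ/√d_k)·V, in plain Python. It is a simplification under stated assumptions: one head only, and the toy input vectors are used directly as queries, keys and values, whereas the real Transformer first applies learned linear projections and runs several heads side by side.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Each argument is a list of float vectors, one vector per token."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:  # every query is independent of the others
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three tokens with hypothetical 2-dimensional vectors
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

No query depends on the result of another, so all positions can be processed at once as matrix products. An RNN, by contrast, has to step through the sequence one token at a time.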


# BERT 

> "Bidirectional Encoder Representations from Transformers"

The focus is on the representations produced by the encoder part, which play roughly the same role as doc2vec.
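
The encoder produces one vector per token, so some pooling step is needed to get a single vector per sentence. One common choice is to average the token vectors while skipping padding positions. The sketch below shows that idea with made-up numbers; `mean_pool` and the toy data are assumptions for illustration, not the internals of any particular library. bert-serving-server exposes this choice through its `-pooling_strategy` option.

```python
def mean_pool(token_vectors, mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    dim = len(token_vectors[0])
    n_real = sum(mask)
    if n_real == 0:
        return [0.0] * dim
    return [
        sum(vec[j] for vec, keep in zip(token_vectors, mask) if keep) / n_real
        for j in range(dim)
    ]


# Hypothetical encoder output: 4 positions, the last one is padding
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(token_vectors, mask)
print(sentence_vector)  # [3.0, 4.0]
```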





### Getting started

```bash
%%bash 
pip install bert-serving-server==1.9.6 bert-serving-client==1.9.6
wget 'https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip'
mkdir bert 
unzip 'uncased_L-12_H-768_A-12.zip' -d bert
rm 'uncased_L-12_H-768_A-12.zip'
```


##### Start the BERT server (in a separate terminal)
```bash
bert-serving-start -model_dir '/home/gurra/coding/bert/bert_tutorial/bert/uncased_L-12_H-768_A-12' -num_worker 8 -batch_size 10 -max_seq_len 256 
```

### Using BERT for sentence similarity

In [6]:
from bert_serving.client import BertClient
bc=BertClient(check_length=False)

In [106]:
string_1 = 'The man is robbing the bank'
string_2 = 'The guy is stealing from the financial institution'


s1 = bc.encode([string_1])
s2 = bc.encode([string_2])
s1_s2_sim_bert = cosine_sim(s1[0],s2[0])

print('Document similarity between \n\n(1){s1}\n(2){s2}\nis: {sim}'.format(s1=string_1,
                                                                       s2=string_2,
                                                                       sim=s1_s2_sim_bert))

Document similarity between 

(1)The man is robbing the bank
(2)The guy is stealing from the financial institution
is: 0.9162930302304907


### Using BERT for IMDB reviews

In [62]:

import pickle

# X_train_bert = bc.encode(train_texts)  # alternative: compute the embeddings instead of loading them
with open('X_bert.pkl', 'rb') as f:
    X_train_bert = pickle.load(f)
X_bert, y_bert = shuffle_data(X_train_bert, y_train)

In [101]:
bert_model = make_simple_model(X_train_bert.shape[1])
bert_mod, _ = compile_and_train(bert_model, X_bert, y_bert)

val_loss 0.277
val_acc 0.892
loss 0.217
acc 0.918
Best validation accuracy: 0.894 


In [122]:
print('fastText comparison:')
for i in w2v_hist.history.keys():
    print('%s %.3f' % (i,w2v_hist.history[i][-1]))

print('Best validation accuracy: %.3f ' % max(w2v_hist.history['val_acc']))

fastText comparison:
val_loss 0.332
val_acc 0.860
loss 0.297
acc 0.879
Best validation accuracy: 0.862 


### Further research

* XLNet - https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/