# Language-agnostic BERT Sentence Embedding (LaBSE)

LaBSE is a recent transformer model developed by Google to create language-agnostic sentence embeddings.

According to their results, LaBSE outperforms LASER on most multilingual benchmarks, with the added advantage of running well on Windows 😂.

In this notebook I will cover the basics of how to get sentence embeddings using LaBSE. I hope this will spark some ideas for the project.

[Official Google blog](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html)

In [1]:
# We will import the model from the transformers library. Make sure you have it installed along with PyTorch!
!pip install transformers

Collecting transformers

  Downloading transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting regex!=2019.12.17
  Downloading regex-2021.4.4-cp37-cp37m-win_amd64.whl (269 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.2-cp37-cp37m-win_amd64.whl (2.0 MB)
Installing collected packages: huggingface-hub, regex, sacremoses, tokenizers, transformers
Successfully installed huggingface-hub-0.0.8 regex-2021.4.4 sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.6.1


# Model Architecture

![image.png](attachment:image.png)

LaBSE follows a dual-encoder architecture in which the source text (the text to be translated) and the target text (its translation) are encoded separately by a shared transformer embedding network. The model is then trained on a translation ranking task, which forces the representations of paraphrases and translations to be close together.


![TranslationRanking.gif](attachment:TranslationRanking.gif)
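
To make the translation ranking idea a bit more concrete, here is a minimal NumPy sketch of an in-batch softmax ranking loss: each source sentence should score its own translation higher than every other target in the batch. This is one common formulation and an assumption on my part, not necessarily the exact objective used to train LaBSE. The function name `translation_ranking_loss` and the `margin` parameter are hypothetical.

```python
import numpy as np


def translation_ranking_loss(source_emb, target_emb, margin=0.0):
    """In-batch softmax ranking loss over cosine similarities (illustrative only).

    Row i of source_emb and row i of target_emb are assumed to be a translation
    pair; every other row in the batch acts as a negative example.
    """
    # L2-normalize so that the dot product equals the cosine similarity.
    src = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    tgt = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    scores = src @ tgt.T  # shape: (batch, batch)

    # Optional additive margin: lower the score of the true pairs only.
    scores = scores - margin * np.eye(len(scores))

    # Numerically stable log-softmax over each row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # The true translation of row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Toy batch: four orthogonal "sentences" whose translations are identical vectors.
source = np.eye(4)
aligned_loss = translation_ranking_loss(source, np.eye(4))
shuffled_loss = translation_ranking_loss(source, np.eye(4)[::-1])
print(aligned_loss < shuffled_loss)  # True
```

The loss is lower when each source row lines up with its own translation than when the targets are shuffled, which is the behaviour the ranking task rewards.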

### Usage
Using the model:

In [2]:
import torch
from transformers import BertModel, BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE")
model = BertModel.from_pretrained("setu4993/LaBSE")
model = model.eval()

In [3]:

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
italian_sentences = [
    "cane",
    "I cuccioli sono carini.",
    "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.",
]
italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True)
english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    italian_outputs = model(**italian_inputs)
    english_outputs = model(**english_inputs)

To get the sentence embeddings, use the pooler output:

In [4]:
italian_embeddings = italian_outputs.pooler_output
english_embeddings = english_outputs.pooler_output

To compare sentences, L2-normalize the embeddings before calculating the similarity, so that the dot product becomes the cosine similarity:

In [5]:
import torch.nn.functional as F


def similarity(embeddings_1, embeddings_2):
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )

print(similarity(italian_embeddings, english_embeddings))

tensor([[0.6318, 0.1176, 0.1488],
        [0.3069, 0.8598, 0.3252],
        [0.4435, 0.3607, 0.9543]])
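
One way to read this matrix: rows are the Italian sentences and columns are the English ones, so the largest value in each row points at the closest English sentence. A small sketch using the values printed above (copied by hand into a NumPy array):

```python
import numpy as np

# Similarity matrix printed above: rows = Italian sentences, columns = English sentences.
sim = np.array([
    [0.6318, 0.1176, 0.1488],
    [0.3069, 0.8598, 0.3252],
    [0.4435, 0.3607, 0.9543],
])

# Index of the most similar English sentence for each Italian sentence.
best_match = sim.argmax(axis=1)
print(best_match)  # [0 1 2]
```

Each Italian sentence is matched to its own translation, since the diagonal holds the largest value in every row.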


In [7]:
italian_embeddings

tensor([[ 0.3342,  0.1322,  0.0324,  ...,  0.2717,  0.0606,  0.2922],
        [-0.6099, -0.4250, -0.1817,  ..., -0.2334, -0.1287, -0.3527],
        [ 0.2083, -0.4750,  0.0386,  ..., -0.0012, -0.1445, -0.6810]])

In [47]:
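import os

import config  # assumed to be a project-local module that defines preprocess_data_dir
import pandas as pd

d = "en-fi"
# Build the path with os.path.join instead of a hard-coded backslash.
csv_path = os.path.join(config.preprocess_data_dir + d, "scores.csv")
df = pd.read_csv(csv_path).head(1000)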

In [48]:
reference = list(df.reference)
translation = list(df.translation)

In [49]:
reference_inputs = tokenizer(reference, return_tensors="pt", padding=True)
translation_inputs = tokenizer(translation, return_tensors="pt", padding=True)

with torch.no_grad():
    reference_outputs = model(**reference_inputs)
    translation_outputs = model(**translation_inputs)
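
The cell above pushes all 1000 sentences through the model in a single forward pass, which pads every sentence to the longest one and can use a lot of memory. A possible alternative is to encode in smaller batches. The sketch below only shows the batching logic; `batched`, `encode_in_batches` and `fake_encode` are hypothetical helpers, and the stand-in encoder is there purely so the example runs without the model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def encode_in_batches(sentences, encode_fn, batch_size=32):
    """Apply `encode_fn` to each batch and collect one embedding per sentence."""
    embeddings = []
    for batch in batched(sentences, batch_size):
        embeddings.extend(encode_fn(batch))
    return embeddings


def fake_encode(batch):
    """Stand-in encoder for illustration: 'embeds' a sentence as its length."""
    return [[float(len(sentence))] for sentence in batch]


print(encode_in_batches(["a", "bb", "ccc"], fake_encode, batch_size=2))
# [[1.0], [2.0], [3.0]]
```

With the real model, `encode_fn` would tokenize the batch, run the model under `torch.no_grad()` and return the rows of `pooler_output`.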

In [50]:
reference_embeddings = reference_outputs.pooler_output
translation_embeddings = translation_outputs.pooler_output

In [51]:
train_features = similarity(reference_embeddings, translation_embeddings)
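
Note that `similarity` returns the full matrix, so `train_features` has one row per reference and one column per translation. If a single score per (reference, translation) pair is wanted instead, it is the diagonal of that matrix. A minimal NumPy sketch, using random arrays as stand-ins for the real embeddings (the helper name `paired_cosine_similarity` is hypothetical):

```python
import numpy as np


def paired_cosine_similarity(a, b):
    """Cosine similarity between row i of `a` and row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 768))  # stand-in for reference_embeddings
hyp = rng.normal(size=(5, 768))  # stand-in for translation_embeddings

paired = paired_cosine_similarity(ref, hyp)  # shape (5,)

# The same numbers appear on the diagonal of the full similarity matrix.
ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
hyp_n = hyp / np.linalg.norm(hyp, axis=1, keepdims=True)
full = ref_n @ hyp_n.T  # shape (5, 5)
print(np.allclose(paired, np.diag(full)))  # True
```

On the PyTorch tensors used in this notebook, `torch.nn.functional.cosine_similarity(a, b, dim=1)` computes the same per-row score.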

In [52]:
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
import numpy as np

In [53]:
normalizer = preprocessing.Normalization()

In [54]:
normalizer.adapt(np.array(train_features))
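
For intuition, `adapt` computes the mean and variance of each feature column, and the layer then standardizes its inputs with them. A rough NumPy equivalent on a made-up array (the values are purely illustrative):

```python
import numpy as np

# Toy feature matrix: 3 samples, 2 features on very different scales.
features = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
])

# Per-feature statistics, which is what the normalization layer learns.
mean = features.mean(axis=0)
std = features.std(axis=0)  # population standard deviation

# Standardize: every column ends up with mean 0 and standard deviation 1.
normalized = (features - mean) / std
print(normalized[:, 0])
```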

In [55]:
linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

[<tf.Tensor 'Placeholder:0' shape=(None, 1000) dtype=float32>]


In [56]:
linear_model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [57]:
train_labels = np.array(df['avg-score'])

In [58]:
np.array(train_features)

array([[0.923994  , 0.23361918, 0.238406  , ..., 0.21755192, 0.2699421 ,
        0.19083962],
       [0.24992196, 0.9634627 , 0.2595322 , ..., 0.4010892 , 0.43251735,
        0.3170369 ],
       [0.24613653, 0.21969862, 0.88448304, ..., 0.3633506 , 0.25411075,
        0.22023405],
       ...,
       [0.11102992, 0.2835458 , 0.30057135, ..., 0.73496926, 0.1262648 ,
        0.11551017],
       [0.30467045, 0.46393263, 0.23715556, ..., 0.21917142, 0.9345061 ,
        0.17550147],
       [0.17516404, 0.19602442, 0.23605244, ..., 0.13643321, 0.12034281,
        0.7815325 ]], dtype=float32)

In [59]:
history = linear_model.fit(
    np.array(train_features), train_labels, 
    epochs=100,
    # suppress logging
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2)

[<tf.Tensor 'sequential_1/normalization_2/truediv:0' shape=(32, 1000) dtype=float32>]
[<tf.Tensor 'sequential_1/normalization_2/truediv:0' shape=(32, 1000) dtype=float32>]
[<tf.Tensor 'sequential_1/normalization_2/truediv:0' shape=(None, 1000) dtype=float32>]
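
Keep in mind that `validation_split` takes the held-out samples from the end of the arrays, before any shuffling. The sketch below reproduces that split by hand and computes the mean absolute error, which is the loss this model is compiled with. The helpers `holdout_split` and `mean_absolute_error` are hypothetical names, and the data is a toy example.

```python
import numpy as np


def holdout_split(features, labels, validation_split=0.2):
    """Split off the last `validation_split` fraction of the samples."""
    split_at = int(len(features) * (1.0 - validation_split))
    train = (features[:split_at], labels[:split_at])
    val = (features[split_at:], labels[split_at:])
    return train, val


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


toy_features = np.arange(10, dtype=float).reshape(10, 1)
toy_labels = np.arange(10, dtype=float)

(train_x, train_y), (val_x, val_y) = holdout_split(toy_features, toy_labels)
print(len(train_x), len(val_x))  # 8 2
print(mean_absolute_error([1.0, 2.0], [2.0, 4.0]))  # 1.5
```

Because the split is positional, any ordering in the CSV (for example by annotator count) carries over into the validation set.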


In [60]:
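import matplotlib.pyplot as plt


def plot_loss(history):
    """Plot training and validation loss per epoch."""
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    # Scores are on a 0-100 scale, so only pin the lower bound of the axis.
    plt.ylim(bottom=0)
    plt.xlabel('Epoch')
    plt.ylabel('Error [avg-score]')
    plt.legend()
    plt.grid(True)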

In [61]:
df['linear_model'] = linear_model.predict(
    np.array(train_features), verbose=0)

[<tf.Tensor 'sequential_1/normalization_2/truediv:0' shape=(None, 1000) dtype=float32>]


In [62]:
df

Unnamed: 0.1,Unnamed: 0,source,reference,translation,z-score,avg-score,annotators,linear_model
0,0,"You can turn yourself into a pineapple, a dog ...",voit muuttaa itsesi ananasta koirasta tai roy...,voit muuttaa itsesi ananakseksi koiraksi tai ...,-0.286195,34.20,5,46.566574
1,1,Also shot were three men: two 29-year-olds and...,my s ammuttiin kolme miest kaksi vuotiait...,my s kolmea miest ammuttiin kahta vuotias...,0.547076,58.40,5,95.484039
2,2,The information is stored at the cash register...,tiedot tallennetaan kassakoneisiin joka tapauk...,tiedot kuitenkin tallentuvat kassoilla joka ta...,1.122476,74.60,5,80.595642
3,3,Xinhua says that there were traces of hydrochl...,xinhua kertoo ett xinyin n ytteest oli sunn...,xinhua kertoo ett xinyin sunnuntaina antamas...,0.383095,53.60,5,52.604256
4,4,"MacDonald, who was brought on board CBC's comm...",voitaisiin kuulla cbd n kommenttitiimin toimi...,macdonaldin joka tuli cbc n selostajatiimiin ...,-0.493065,32.25,4,59.188179
...,...,...,...,...,...,...,...,...
995,995,"I realised that we had to soon rely on plan B,...",tajusin ett meid n on pian luottaa suunnitel...,tajusin ett t m h n menee nyt plan b n puole...,0.426938,73.00,1,40.940754
996,996,"Kendall, who is an Estee Lauder brand ambassad...",kendall joka on estee lauder tuotemerkin suur...,kendall joka on estee lauderin br ndil hettil...,0.634702,79.00,1,43.255928
997,997,"Treats says nude calendar for 'women, as well ...",kohtelee sanoo alaston kalenterin naiset sek ...,treats sanoo ett alastonkalenteri on naisil...,0.080664,63.00,1,63.666637
998,998,The other two victims were not in the car and ...,kaksi uhria ei autossa ja poliisi tutkii onko ...,kaksi muuta uhria eiv t olleet autossa ja pol...,-0.542628,45.00,1,71.633583
