<a href="https://colab.research.google.com/github/finardi/tutos/blob/master/Sentence_BERT_DirtyTalk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install ftfy
!pip install datasets
!pip install magic_timer
!pip install transformers
!pip install sentencepiece
!pip install sentence_transformers

# <font color='darkorange'>**Problem**: </font> Build a high-quality dense representation for sentences.

---

### <font color='lightyellow'>**Models and Architecture**: </font> 
#### <font color='whiteblue'> >>> Transformers $-$ key points:</font> 
- #### Positional Encoding
- #### Self-attention
- #### Multi-head attention

#### <font color='magenta'> >>> BERT $-$ key points:</font> 
- #### Transformer Architecture
- #### Pre-training from unlabeled text
- #### Bi-directional contextual models 
- #### Next Sentence Prediction (a sentence-pair pre-training objective, related to but distinct from textual entailment)

### <font color='darkorange'>**Challenge**: </font> Transformers produce embeddings at the word/token level, not at the sentence level. 

---

### **<font color='lightgreen'>Which task should we fine-tune on?</font>**
> #### Natural Language Inference (**<font color='lightgreen'>NLI</font>**) is the task of determining whether a `"hypothesis"` is *true* (entailment), *false* (contradiction) or *undetermined* (neutral) given a `"premise"`. Example:

| Premise | Label | Hypothesis |  
|----------|-------|----------|
| The scientist prepared a new solution | entailment | The solution created by the scientist is good |  
| A tall scientist and a short one are smiling | neutral | Scientists laugh when they see F$_1=0$ |  
| Hadoop is slow | contradiction | The query on Hadoop is running fast |  


#### def $:=$ [NLI](http://nlpprogress.com/english/natural_language_inference.html)
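
To make the three labels concrete, here is a minimal sketch of how NLI rows are usually stored. The integer encoding (0, 1, 2) follows the convention of the Hugging Face `snli` and `glue`/`mnli` datasets; the names `NLI_LABELS`, `examples` and `positives` are illustrative assumptions, and the rows are simply the table above in English.

```python
# Integer label convention of the Hugging Face `snli` and `glue`/`mnli` datasets
NLI_LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Toy rows in the (premise, hypothesis, label) layout; illustrative data only
examples = [
    {"premise": "The scientist prepared a new solution",
     "hypothesis": "The solution created by the scientist is good",
     "label": 0},
    {"premise": "A tall scientist and a short one are smiling",
     "hypothesis": "Scientists laugh when they see F1=0",
     "label": 1},
    {"premise": "Hadoop is slow",
     "hypothesis": "The query on Hadoop is running fast",
     "label": 2},
]

for ex in examples:
    print(f"{NLI_LABELS[ex['label']]:>13}: {ex['premise']!r} -> {ex['hypothesis']!r}")

# Entailment rows can be reused as (anchor, positive) pairs for a sentence encoder
positives = [(ex["premise"], ex["hypothesis"]) for ex in examples if ex["label"] == 0]
print(len(positives), "positive pair(s)")
```

Only the entailment rows survive the last filter, which is the kind of subset a ranking-style loss needs.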


# **<font color='lightpink'>NLI and STS Datasets</font>**
- #### SNLI
 - #### The Stanford Natural Language Inference (SNLI) Corpus contains around 550k hypothesis/premise pairs. Models are evaluated based on accuracy. 

- #### MNLI
 - #### The Multi-Genre Natural Language Inference (MultiNLI) corpus contains around 433k hypothesis/premise pairs. It is similar to the SNLI corpus, but covers a range of genres of spoken and written text and supports cross-genre evaluation. 

- #### STSb
 - #### STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection of datasets includes text from image captions, news headlines and user forums. 





# **<font color='light magenta'>Translating the Datasets</font>**

In [None]:
from transformers import MarianMTModel, MarianTokenizer
from transformers import logging
logging.set_verbosity_error()

import datasets
from ftfy import fix_encoding
from magic_timer import MagicTimer

import torch
from spacy.lang.en import English


def pickle_file(path, data=None):
    import pickle
    if data is None:
        with open(path, 'rb') as f:
            return pickle.load(f)
    if data is not None:
        with open(path, 'wb') as handle:
            pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

MANUAL_SEED = 341
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def deterministic(rep=True):
    # Seed every RNG that PyTorch uses and force deterministic cuDNN kernels
    torch.manual_seed(MANUAL_SEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(MANUAL_SEED)
        torch.cuda.manual_seed_all(MANUAL_SEED)
        torch.backends.cudnn.enabled = False
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
        print(f'Experimento deterministico, seed: {MANUAL_SEED} -- ', end = '')
        print(f'Existe {torch.cuda.device_count()} GPU {torch.cuda.get_device_name(0)} disponível.')
    else:
        print('Device CPU')

deterministic()

Device CPU


In [None]:
snli = datasets.load_dataset('snli', split='train')
mnli = datasets.load_dataset('glue', 'mnli', split='train')

In [None]:
snli

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 550152
})

In [None]:
mnli

Dataset({
    features: ['premise', 'hypothesis', 'label', 'idx'],
    num_rows: 392702
})

In [None]:
mnli_like_snli = {'premise': mnli['premise'], 'hypothesis': mnli['hypothesis'], 'label': mnli['label']}
mnli_like_snli = datasets.Dataset.from_dict(mnli_like_snli)
mnli_like_snli

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 392702
})

In [None]:
snli = snli.cast(mnli_like_snli.features)
dataset_train = datasets.concatenate_datasets([snli, mnli_like_snli])
dataset_train

  0%|          | 0/56 [00:00<?, ?ba/s]

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 942854
})

In [None]:
print(f"before filter: {len(dataset_train)} rows")
# Keep only entailment pairs (label 0); they are used later as (anchor, positive) pairs
dataset_train = dataset_train.filter(lambda x: x['label'] == 0)
print(f"after: {len(dataset_train)} rows")

before filter: 942854 rows


  0%|          | 0/943 [00:00<?, ?ba/s]

after: 314315 rows


In [None]:
dataset_train

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 314315
})

In [None]:
model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
marian_tokenizer = MarianTokenizer.from_pretrained(model_name)
marian_model = MarianMTModel.from_pretrained(model_name)

In [None]:
nlp = English()
# spaCy v3 API; on spaCy v2 use nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.add_pipe('sentencizer')

def chunkstring_spacy(text):
    # Split the text into sentences and prefix each one with the Marian target-language token
    doc = nlp(str(text))
    return ['>>pt_br<<' + ' ' + sent.text for sent in doc.sents]

def translate(aux_sent):
    max_length = 512
    num_beams = 1

    sentence = chunkstring_spacy(aux_sent)

    marian_model.to(device)
    marian_model.eval()

    # Tokenize all sentence chunks as one padded batch (replaces the deprecated prepare_seq2seq_batch)
    tokenized_text = marian_tokenizer(sentence, max_length=max_length, padding=True,
                                      truncation=True, return_tensors='pt').to(device)

    # Pass the attention mask as well, so padded positions are ignored during generation
    with torch.no_grad():
        translated = marian_model.generate(input_ids=tokenized_text['input_ids'],
                                           attention_mask=tokenized_text['attention_mask'],
                                           max_length=max_length,
                                           num_beams=num_beams,
                                           early_stopping=True,
                                           do_sample=False)

    tgt_text = [fix_encoding(marian_tokenizer.decode(t, skip_special_tokens=True)) for t in translated]
    return ' '.join(tgt_text)

In [None]:
deterministic()        

path_base = '/content/drive/MyDrive/Dirty-Talks/Sentence-Transformers/data/'

CONTINUE_FROM = 0

timer = MagicTimer()  
translated_premise, translated_hypothesis = [], []
for idx, (premise, hypothesis) in enumerate(
    zip(
        dataset_train['premise'][CONTINUE_FROM:], 
        dataset_train['hypothesis'][CONTINUE_FROM:],
        ), start=CONTINUE_FROM
    ):
    
    translated_premise.append(translate(premise))
    translated_hypothesis.append(translate(hypothesis))
    
    if (idx > CONTINUE_FROM and idx%1000==0) or (idx==len(dataset_train)-1):
        print(f'\tprocessed {idx}/{len(dataset_train)} samples. Time elapsed: {timer}')
        pickle_file(path_base+'translated_premise_MARIAN_'+str(idx), translated_premise)
        pickle_file(path_base+'translated_hypothesis_MARIAN_'+str(idx), translated_hypothesis)
        translated_premise, translated_hypothesis = [], []

In [None]:
sts = datasets.load_dataset('glue', 'stsb', split='validation')

In [None]:
deterministic()        

sentence1, sentence2 = [],[]
timer = MagicTimer()  
for sent1, sent2 in zip(sts['sentence1'], sts['sentence2']):
    sentence1.append(translate(sent1))
    sentence2.append(translate(sent2))

print(f'\tSTS translated Time elapsed: {timer}')

sts_PT = {
    'sentence1':sentence1, 
    'sentence2':sentence2,
    'label':sts['label'], 
    'idx':sts['idx'],
    }    

pickle_file(path_base+'STS_PT', sts_PT)
STS_PT = pickle_file(path_base+'STS_PT')

STS_PT = datasets.Dataset.from_dict(STS_PT)

In [None]:
import os
import glob
import random
import pandas as pd

path_base = '/content/drive/MyDrive/Dirty-Talks/Sentence-Transformers/data/'

os.chdir(path_base)
premise, hypothesis = [],[]
for i, (p_file, h_file) in enumerate(zip(
    sorted(glob.glob("translated_premise_MARIAN_*"), key=os.path.getmtime), 
    sorted(glob.glob("translated_hypothesis_MARIAN_*"), key=os.path.getmtime))):
    if p_file[-3:] == h_file[-3:]:
        premise += pickle_file(p_file)
        hypothesis += pickle_file(h_file)

    else:
        print(f'not append p_file: {p_file}')
        print(f'not append h_file: {h_file}')

assert len(premise) == len(hypothesis)

#-----------------------------------------------------------------------------------
df = pd.DataFrame({'premise':premise, 'hypothesis':hypothesis})
df = df[~df.duplicated()].sample(frac=1, random_state=MANUAL_SEED)

mnli_snli_PT = {'premise':df.premise.to_list(), 'hypothesis':df.hypothesis.to_list()}

pickle_file(path_base+'SNLI_MNLI_POSITIVES_PT', mnli_snli_PT)
mnli_snli_PT = pickle_file(path_base+'SNLI_MNLI_POSITIVES_PT')

mnli_snli_PT = datasets.Dataset.from_dict(mnli_snli_PT)

K = random.randrange(len(premise)-10)
for i, (p_, h_) in enumerate(zip(premise, hypothesis)):
    if i>=K: 
        print(f'\tPremise:    {p_}')
        print(f'\tHypothesis: {h_}\n')
    if i==K+10: break 

print(f"len(premise): {len(mnli_snli_PT['premise'])} -- len(hypothesis): {len(mnli_snli_PT['hypothesis'])}")

mnli_snli_PT

	Premise:    Mas então o que filme não é mais
	Hypothesis: Hoje em dia, todos os filmes são assim.

	Premise:    Membros da LMA ajudarão os grupos a desenvolver planos de negócios e estratégias de marketing a longo prazo.
	Hypothesis: Membros da AML estão trabalhando nos planos de negócios.

	Premise:    Quando o povo está com fome e tudo o que nos pedem é de entregas, por que nosso povo não pode pedir uma entrega?
	Hypothesis: Por que não podemos alimentar nosso próprio povo?

	Premise:    Sim, não é ridículo.
	Hypothesis: Não é ridículo?

	Premise:    Se os executivos da CNN caíssem frequentemente, estariam mortos e, portanto, incapazes de exigir programas tão chatos.
	Hypothesis: Executivos da CNN não poderão exigir programas chatos se baterem com frequência.

	Premise:    A fórmula permite saborear pequenas diferenças e adaptações.
	Hypothesis: A fórmula permite que alguém saboreie pequenas diferenças e adaptações.

	Premise:    Acho que é por isso que os atores e atrizes ganham mi

Dataset({
    features: ['premise', 'hypothesis'],
    num_rows: 313629
})

In [None]:
# Load Train_data:
data_train = pickle_file(path_base+'SNLI_MNLI_POSITIVES_PT')
data_train = datasets.Dataset.from_dict(data_train)
data_train


# **<font color='lightyellow'>Sentence Transformers (SBERT)</font>**

[ArXiv link](https://arxiv.org/pdf/1908.10084.pdf)

# **<font color='darkorange'>Cross-Encoder and Bi-Encoder</font>**
<img src="https://drive.google.com/uc?id=1QtAvpTlgdYnBpb3eLdXnL-Om7fFaLXVp" alt="drawing" width="700"/>


> ## **<font color='violet'>Cross Encoder</font>**
- #### <font color='lightgreen'>**High accuracy:**</font> the two sentences are processed and optimized jointly, so attention is more "*global*" (it weighs both sentences at the same time) during optimization. 
- #### <font color='red'>**Very slow:**</font> finding the most similar pair in a collection of $n = 10{,}000$ sentences requires $n · (n − 1) / 2 = 49{,}995{,}000$ inference computations with BERT.


> ## **<font color='yellow'>Bi Encoder</font>**
- #### <font color='lightgreen'>**Scalable:**</font> the encoded candidates can be indexed and compared against the representation of each new input, which gives fast prediction times. For the same task of clustering $10{,}000$ sentences, the time drops from about $65$ hours to about $5$ seconds. 
- #### <font color='red'>**Lower accuracy:**</font> bi-encoders usually perform worse than cross-encoders and require a large amount of training data. 
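
The pair count quoted above is plain combinatorics, and a few lines of Python make the gap explicit. This is only arithmetic on the numbers from the text; the variable names are illustrative.

```python
n = 10_000  # number of sentences in the collection

# A cross-encoder needs one BERT forward pass per unordered pair of sentences
cross_encoder_passes = n * (n - 1) // 2

# A bi-encoder needs one forward pass per sentence; pairs are then compared
# with cheap vector operations (e.g. cosine similarity) on the cached embeddings
bi_encoder_passes = n

print(cross_encoder_passes)                       # 49995000
print(cross_encoder_passes // bi_encoder_passes)  # 4999
```

The bi-encoder still compares every pair, but each comparison is a dot product between cached vectors rather than a full Transformer pass.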


<img src="https://drive.google.com/uc?id=18CjUjY78W0Kwc07XUnnoRhML4x0kC0cI" alt="drawing" width="500"/>

### **<font color='white'>Spearman correlation:**
> #### The coefficient $\rho$ measures how well the relationship between two variables can be described by a monotonic function. Note: Pearson's coefficient measures linear relationships and is more sensitive to outliers than Spearman's $\rho$. More on the topic [here](https://pt.wikipedia.org/wiki/Coeficiente_de_correla%C3%A7%C3%A3o_de_postos_de_Spearman)</font>
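
A small dependency-free sketch can illustrate the difference between the two coefficients. It assumes there are no ties in the data, so a simple ranking is enough; `pearson`, `ranks` and `spearman` are hypothetical helper names, not functions from any library used in this notebook.

```python
import math

def pearson(x, y):
    # Covariance divided by the product of the standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    # Rank 1 for the smallest value; assumes all values are distinct (no ties)
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result

def spearman(x, y):
    # Spearman's rho is Pearson's coefficient computed on the ranks
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but not linear

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

For this monotonic, non-linear relationship Spearman's $\rho$ is 1 while Pearson's coefficient stays below 1, which is why rank correlation is the usual choice for comparing predicted and gold similarity scores.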

---

# **<font color='magenta'>Main Result:</font>**
<img src="https://drive.google.com/uc?id=18fKjo9rUhmt6W06fdfqSGKKbZ4vRB8KO" alt="drawing" width="800"/>


# What if there is little data? 
> # <font color='lightyellow'>**Augmented SBERT**</font>
- [Aug SBERT ](https://www.sbert.net/examples/training/data_augmentation/README.html)

## <font color='lightpink'>**Scenario 1**$_\text{ In domain}$:</font>
- ### Limited or sparsely labeled datasets:
 -  #### <font color='lightpink'>Step 1:</font> Train a cross-encoder (BERT) on the small dataset (<font color='yellow'>gold dataset</font>)
 - #### <font color='lightpink'>Step 2.1:</font> Create new pairs by recombination and reduce them via BM25 or semantic search
 - #### <font color='lightpink'>Step 2.2:</font> "*Weakly*" label the new pairs with the cross-encoder (BERT). These pairs are called the <font color='gray'>silver dataset</font>
 - #### <font color='lightpink'>Step 3:</font> Train a bi-encoder (SBERT) on the extended training set (<font color='yellow'>gold </font> $+$ <font color='gray'>silver</font>)
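
The data flow of these steps can be sketched in plain Python. Everything here is a toy stand-in: `lexical_overlap` replaces BM25 or semantic search, `cross_encoder_score` is a hypothetical placeholder for the cross-encoder trained in step 1, and the sentences and scores are made up for illustration.

```python
from itertools import combinations

# Gold dataset: (sentence_a, sentence_b, human similarity score); toy data
gold = [
    ("a dog runs", "a dog is running", 0.9),
    ("a man sings", "a cat sleeps", 0.1),
]

def lexical_overlap(a, b):
    # Jaccard overlap of word sets; toy stand-in for BM25 / semantic search
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def cross_encoder_score(a, b):
    # Hypothetical placeholder for the cross-encoder trained on the gold data
    return lexical_overlap(a, b)

# Step 2.1: recombine the gold sentences into new candidate pairs...
sentences = sorted({s for a, b, _ in gold for s in (a, b)})
gold_pairs = {frozenset((a, b)) for a, b, _ in gold}
candidates = [p for p in combinations(sentences, 2) if frozenset(p) not in gold_pairs]

# ...and keep only the top-k most promising ones
top_k = 2
candidates.sort(key=lambda p: lexical_overlap(*p), reverse=True)
candidates = candidates[:top_k]

# Step 2.2: weakly label the kept pairs to build the silver dataset
silver = [(a, b, cross_encoder_score(a, b)) for a, b in candidates]

# Step 3: the bi-encoder would be trained on gold + silver
train_set = gold + silver
print(len(gold), "gold +", len(silver), "silver =", len(train_set), "training pairs")
```

In a real setup the placeholder scorer would be a trained cross-encoder and the final list would feed a bi-encoder training loop.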

<img src="https://drive.google.com/uc?id=1KEK4hNt0DSAMAUa4VzqZX7HkrUUsBhYp" alt="drawing" width="500"/>

---
## <font color='lightpink'>**Scenario 2**$_\text{ domain adaptation}$:</font>
- ### No annotated data (only unlabeled sentence pairs):
 -  #### <font color='lightpink'>Step 1:</font> Fine-tune a cross-encoder (BERT) on a source domain that has annotated pairs.
 - #### <font color='lightpink'>Step 2:</font> Use the BERT from <font color='lightpink'>Step 1</font> to label the unlabeled target domain.
 - #### <font color='lightpink'>Step 3:</font> Train a bi-encoder (SBERT) on the sentence pairs labeled in <font color='lightpink'>Step 2</font>

 <img src="https://drive.google.com/uc?id=1-Pzkfmy6bUyvuvy8CAwtgSig30lNMMq0" alt="drawing" width="500"/>

---





# <font color='lightblue'>**SBERT, in detail...**</font>

In [None]:
import os
import datasets

from transformers import BertModel, BertPreTrainedModel, BertTokenizerFast
from transformers import logging
logging.set_verbosity_error()

from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

In [None]:
BERTMODEL = 'neuralmind/bert-base-portuguese-cased'

tokenizer = BertTokenizerFast.from_pretrained(BERTMODEL)
inputs = tokenizer("Oi, eu sou o goku!", return_tensors="pt")
print(f"inputs_ids: {inputs['input_ids']}")

model = BertModel.from_pretrained(BERTMODEL, return_dict=True)
outputs = model(**inputs)
outputs.keys()

inputs_ids: tensor([[  101,   231, 22283,   117,  2779,  7206,   146,  3746,  4093,   106,
           102]])


odict_keys(['last_hidden_state', 'pooler_output'])

In [None]:
path_base = '/content/drive/MyDrive/Dirty-Talks/Sentence-Transformers/data/'

# Load Train_data:
data_train = pickle_file(path_base+'SNLI_MNLI_POSITIVES_PT')
data_train = datasets.Dataset.from_dict(data_train)
data_train

Dataset({
    features: ['premise', 'hypothesis'],
    num_rows: 313629
})

In [None]:
MAX_LEN = 32
BSIZE = 4

## Tokenize the dataset inside `datasets` with `batched=True`

In [None]:
data_train = data_train.map(
    lambda x: tokenizer(
            x['premise'], max_length=MAX_LEN, padding='max_length',
            truncation=True, add_special_tokens=False), 
            batched=True)

data_train = data_train.rename_column('input_ids', 'anchor_ids')
data_train = data_train.rename_column('attention_mask', 'anchor_mask')

data_train = data_train.map(
    lambda x: tokenizer(
            x['hypothesis'], max_length=MAX_LEN, padding='max_length',
            truncation=True, add_special_tokens=False), 
            batched=True)

data_train = data_train.rename_column('input_ids', 'positive_ids')
data_train = data_train.rename_column('attention_mask', 'positive_mask')

data_train = data_train.remove_columns(['premise', 'hypothesis', 'token_type_ids'])

data_train

  0%|          | 0/314 [00:00<?, ?ba/s]

  0%|          | 0/314 [00:00<?, ?ba/s]

Dataset({
    features: ['anchor_ids', 'anchor_mask', 'positive_mask', 'positive_ids'],
    num_rows: 313629
})

## Dataloader

In [None]:
data_train.set_format(type='torch', output_all_columns=True)

loader = torch.utils.data.DataLoader(
    data_train, 
    batch_size=BSIZE, 
    shuffle=True,
    pin_memory=True,
    num_workers=os.cpu_count(),
    )

dl0 = next(iter(loader))
dl0

{'anchor_ids': tensor([[ 1263,  2397,   125,   139,   283,  8489,  1968,   320,  1341,   125,
            230,  2606,   170,  2968,   173,   222,  2740,   119,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0],
         [ 9551,  2943,  1376,   695,   557,  1089, 10674,   221,  4270,   173,
            222,  6711,   119,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0],
         [  989, 21905,  1257,   653,   119,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
              0,     0],
         [ 1263,  2397,   325,  4575,   698,   173,  5530,   125,   222, 14608,
           4712,   122, 15605,  3723, 16467,   117, 20373,  3233,   221,   327,
            419,  4803,   146, 

In [None]:
anchor_ids, anchor_mask = dl0['anchor_ids'], dl0['anchor_mask']
positive_ids, positive_mask = dl0['positive_ids'], dl0['positive_mask']

# Process anchors and positives for the MNR loss

> ### <font color='lightpink'>Take a look **[here](https://www.sbert.net/examples/training/nli/README.html)** first</font>
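
For reference, the Multiple Negatives Ranking (MNR) loss on a batch of $B$ (anchor, positive) pairs can be written as below. The notation is an assumption made here for illustration: $a_i$ and $p_j$ are the pooled sentence embeddings and $s$ is a scale factor.

$$
\mathcal{L}_\text{MNR} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(s\cdot\cos(a_i, p_i)\big)}{\sum_{j=1}^{B}\exp\big(s\cdot\cos(a_i, p_j)\big)}
$$

Every other positive in the batch acts as a negative for anchor $a_i$, so the loss is a cross-entropy over the rows of the scaled similarity matrix with the targets on the diagonal.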

In [None]:
# Encode the anchors; pass the attention mask so self-attention ignores the padding tokens
anchor_embds = model(input_ids=anchor_ids, attention_mask=anchor_mask)

print(anchor_embds['last_hidden_state'].size())
print(anchor_mask.unsqueeze(-1).size())
# Zero out the padding positions, then sum over the sequence dimension
a_pool = anchor_embds['last_hidden_state'] * anchor_mask.unsqueeze(-1)
print(a_pool.size())
a_pool = a_pool.sum(1)
# L2-normalize: the masked sum now points in the same direction as the masked mean
a_pool = torch.nn.functional.normalize(a_pool, p=2, dim=-1)
print(a_pool.size(),'\n')

# -------------------------------------------------------------------------

# Same steps for the positives
pos_embds = model(input_ids=positive_ids, attention_mask=positive_mask)

print(pos_embds['last_hidden_state'].size())
print(positive_mask.unsqueeze(-1).size())
p_pool = pos_embds['last_hidden_state'] * positive_mask.unsqueeze(-1)
print(p_pool.size())
p_pool = p_pool.sum(1)
p_pool = torch.nn.functional.normalize(p_pool, p=2, dim=-1)
print(p_pool.size())
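
The pooling in the cell above sums the masked token vectors instead of averaging them. A dependency-free sketch suggests why that is harmless once the result is L2-normalized; the toy vectors and the helper names `masked_sum` and `l2_normalize` are assumptions made for illustration.

```python
import math

def masked_sum(token_vectors, mask):
    # Sum the token vectors whose mask entry is 1; padding rows are skipped
    dim = len(token_vectors[0])
    return [sum(vec[d] * m for vec, m in zip(token_vectors, mask)) for d in range(dim)]

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

# Three token embeddings of dimension 2; the last one is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

summed = masked_sum(tokens, mask)            # [4.0, 6.0]
mean = [v / sum(mask) for v in summed]       # [2.0, 3.0]

unit_from_sum = l2_normalize(summed)
unit_from_mean = l2_normalize(mean)
print(unit_from_sum)
print(unit_from_mean)
```

Dividing by the number of real tokens only rescales the vector, and the normalization removes that scale, so both routes give the same unit vector and therefore the same cosine similarities.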

In [None]:
cos_sim = torch.nn.CosineSimilarity()
scores = []
scale = 20
for a_i in a_pool:
    scores.append(cos_sim(a_i.unsqueeze(0), p_pool))

scores = torch.vstack(scores)*scale
scores

tensor([[17.8764, 12.2431, 10.9101, 13.1676],
        [12.9470,  9.8511, 10.8194, 12.3147],
        [11.3346, 11.9801, 18.6552, 10.1344],
        [11.0722,  8.7092, 14.2375, 12.1963]], grad_fn=<MulBackward0>)

In [None]:
labels = torch.arange(scores.size()[0], dtype=torch.long)
labels

tensor([0, 1, 2, 3])

In [None]:
loss_fct = torch.nn.CrossEntropyLoss()
loss_fct(scores, labels[:scores.size(0)])

tensor(1.4607, grad_fn=<NllLossBackward>)
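
The loss value above can be checked by hand from the printed score matrix. The sketch below recomputes the cross-entropy in plain Python, using the scores exactly as displayed (rounded to four decimals); `cross_entropy_diagonal` is a hypothetical helper name.

```python
import math

# Scaled cosine similarities copied from the printed `scores` tensor
scores = [
    [17.8764, 12.2431, 10.9101, 13.1676],
    [12.9470,  9.8511, 10.8194, 12.3147],
    [11.3346, 11.9801, 18.6552, 10.1344],
    [11.0722,  8.7092, 14.2375, 12.1963],
]

def cross_entropy_diagonal(score_rows):
    # Mean over rows of: logsumexp(row) - row[i], i.e. the target class of row i is i
    losses = []
    for i, row in enumerate(score_rows):
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in row))
        losses.append(log_sum_exp - row[i])
    return sum(losses) / len(losses)

loss = cross_entropy_diagonal(scores)
print(f"{loss:.4f}")
```

Rows 0 and 2 contribute almost nothing because their diagonal entry is already the largest score, while rows 1 and 3 dominate the loss because another positive in the batch scores higher than the correct one.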

<img src="https://drive.google.com/uc?id=1Rrrp3Glu-A2FbyNAtJLUfxA6E_oDsN-u" alt="drawing" width="500"/>


In [None]:
# ---------------------------- #
# --------- Evaluate --------- #
# -----------------------------#

model.save_pretrained('./model')
tokenizer.save_pretrained('./model')
model = SentenceTransformer('/content/model/')

# Load STS Test_data
STS_PT = pickle_file(path_base+'STS_PT')
STS_PT = datasets.Dataset.from_dict(STS_PT)

STS_PT = STS_PT.map(lambda x: {'label': x['label'] / 5.0})
samples = []
for sample in STS_PT:
    samples.append(InputExample(
        texts=[sample['sentence1'], sample['sentence2']],
        label=sample['label']
    ))
model[0].max_seq_length = MAX_LEN
model[1].pooling_mode_cls_token = False
model[1].pooling_mode_mean_tokens = True  # <---
model[1].pooling_mode_max_tokens = False

model



  0%|          | 0/1500 [00:00<?, ?ex/s]

SentenceTransformer(
  (0): Transformer({'max_seq_length': 32, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

In [None]:
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(samples, write_csv=False)
evaluator(model)

0.6804931094011382

# <font color='lightpink'>(SBERT$_\text{MNR}$ trained on NLI$_\text{PTBr}$)$_\text{PyTorch}$</font> <font color='yellow'> vs.</font> <font color='lightblue'>(SBERT$_\text{MNR}$ trained on NLI$_\text{PTBr}$)$_\text{lib SBERT}$</font>

In [None]:
import os
import gc
import torch
import numpy as np
from tqdm.auto import tqdm

import datasets
from transformers import BertModel, BertPreTrainedModel, BertTokenizerFast
from transformers.optimization import get_linear_schedule_with_warmup, AdamW
from transformers import logging
logging.set_verbosity_error()

from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

MANUAL_SEED = 341

Experimento deterministico, seed: 341 -- Existe 1 GPU Tesla P100-PCIE-16GB disponível.


## TRAIN: SNLI and MNLI positives

> The SNLI corpus is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE).
---
> The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.



In [None]:
path_base = '/content/drive/MyDrive/Dirty-Talks/Sentence-Transformers/data/'

# Load Train_data:
data_train = pickle_file(path_base+'SNLI_MNLI_POSITIVES_PT')
data_train = datasets.Dataset.from_dict(data_train)
data_train

Dataset({
    features: ['premise', 'hypothesis'],
    num_rows: 313629
})

## Constants

In [None]:
BSIZE = 48
MAX_LEN = 64
EPOCHS = 1

BERTMODEL = 'neuralmind/bert-base-portuguese-cased'

## Average sentence length (used to choose MAX_LEN)

In [None]:
tokenizer = BertTokenizerFast.from_pretrained(BERTMODEL)

def tokenized_lengths(dataset):
    # Length of each sequence in tokens (special tokens included)
    lengths = []
    for seq in dataset:
        input_ids = tokenizer.encode(seq)
        lengths.append(len(input_ids))
    return lengths

# Compute the median sequence length, in tokens
RUN = True
if RUN:
    lengths_p = tokenized_lengths(data_train['premise'])
    print(f' Min len P:    {min(lengths_p):>8} tokens')
    print(f' Max len P:    {max(lengths_p):>8} tokens')
    print(f' Median len P: {np.median(lengths_p):,} tokens')

    lengths_h = tokenized_lengths(data_train['hypothesis'])
    print(f' Min len H:    {min(lengths_h):>8} tokens')
    print(f' Max len H:    {max(lengths_h):>8} tokens')
    print(f' Median len H: {np.median(lengths_h):,} tokens')

 Min len P:           3 tokens
 Max len P:         513 tokens
 Median len P: 19.0 tokens
 Min len H:           3 tokens
 Max len H:         510 tokens
 Median len H: 11.0 tokens


## Dataset and DataLoader

In [None]:
# Tokenize Premises
data_train = data_train.map(
    lambda x: tokenizer(
            x['premise'], max_length=MAX_LEN, padding='max_length',
            truncation=True,
            add_special_tokens=False,
        ), batched=True
)

data_train = data_train.rename_column('input_ids', 'anchor_ids')
data_train = data_train.rename_column('attention_mask', 'anchor_mask')

# Tokenize Hypothesis
data_train = data_train.map(
    lambda x: tokenizer(
            x['hypothesis'], max_length=MAX_LEN, padding='max_length',
            truncation=True,
            add_special_tokens=False,
    ), batched=True
)

data_train = data_train.rename_column('input_ids', 'positive_ids')
data_train = data_train.rename_column('attention_mask', 'positive_mask')

data_train = data_train.remove_columns(['premise', 'hypothesis', 'token_type_ids'])

data_train

  0%|          | 0/314 [00:00<?, ?ba/s]

  0%|          | 0/314 [00:00<?, ?ba/s]

Dataset({
    features: ['anchor_ids', 'anchor_mask', 'positive_mask', 'positive_ids'],
    num_rows: 313629
})

In [None]:
data_train.set_format(type='torch', output_all_columns=True)

loader = torch.utils.data.DataLoader(
    data_train, 
    batch_size=BSIZE, 
    shuffle=True,
    pin_memory=True,
    num_workers=os.cpu_count(),
    )

dl0 = next(iter(loader))
dl0

{'anchor_ids': tensor([[ 5857,  2459, 19135,  ...,     0,     0,     0],
         [ 1431,  9586, 12232,  ...,     0,     0,     0],
         [ 2664,  4366,   122,  ...,     0,     0,     0],
         ...,
         [ 9009,   203,  4863,  ...,     0,     0,     0],
         [ 5705,  3542, 13394,  ...,     0,     0,     0],
         [ 1431,  2606,   122,  ...,     0,     0,     0]]),
 'anchor_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'positive_ids': tensor([[ 3074,   864,  2459,  ...,     0,     0,     0],
         [ 1431,  9586,   698,  ...,     0,     0,     0],
         [ 2542, 18661,   271,  ...,     0,     0,     0],
         ...,
         [ 7508,   268, 11304,  ...,     0,     0,     0],
         [ 5705,  3542, 13394,  ...,     0,     0,     0],
         [  230,  2606,   122,  ...,     0,     0,

## Modelo SBERTMNR


In [None]:
class SBERTMNR(BertPreTrainedModel):
    def __init__(self, config, similarity_metric='cosine'):
        # `from_pretrained` passes a BertConfig here, not a path
        super().__init__(config)

        self.cos_sim = torch.nn.CosineSimilarity()
        self.similarity_metric = similarity_metric
        self.bert = BertModel(config)
        self.init_weights()

    def forward(self, anchor_ids, anchor_mask, positive_ids, positive_mask):
        return self.mnr_score(anchor_ids, anchor_mask, positive_ids, positive_mask)

    def _mean_pool(self, token_embeds, attention_mask):
        # Zero out padding tokens, sum over the sequence and L2-normalize;
        # after normalization the masked sum has the same direction as the masked mean
        in_mask = attention_mask.unsqueeze(-1).float()
        pool = token_embeds * in_mask
        pool = pool.sum(1)
        return torch.nn.functional.normalize(pool, p=2, dim=-1)

    def mnr_score(self, anchor_ids, anchor_mask, positive_ids, positive_mask, scale=20):
        anchor_embds = self.bert(
            input_ids=anchor_ids,
            attention_mask=anchor_mask)['last_hidden_state']
        positive_embds = self.bert(
            input_ids=positive_ids,
            attention_mask=positive_mask)['last_hidden_state']

        anchor = self._mean_pool(anchor_embds, anchor_mask)
        positive = self._mean_pool(positive_embds, positive_mask)

        # scores[i, j] = cos(anchor_i, positive_j)
        scores = torch.stack([self.cos_sim(a_i.unsqueeze(0), positive) for a_i in anchor])

        # The positive of anchor i sits on the diagonal, so the target class of row i is i.
        # Using the actual batch size also covers a smaller last batch.
        labels = torch.arange(scores.size(0), dtype=torch.long, device=scores.device)

        return (scores*scale, labels)
# ----------------------------------------------------------------

# Test forward pass
model = SBERTMNR.from_pretrained(BERTMODEL).to(device)

anchor_ids, anchor_mask = dl0['anchor_ids'].to(device), dl0['anchor_mask'].to(device)
positive_ids, positive_mask = dl0['positive_ids'].to(device), dl0['positive_mask'].to(device)

scores, labels = model(anchor_ids, anchor_mask, positive_ids, positive_mask)
scores, labels

Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

(tensor([[16.3493, 12.7038,  9.9842,  ...,  8.2255, 12.1129, 12.2569],
         [13.7774, 16.2452,  9.1206,  ...,  7.9705, 10.6306, 11.4294],
         [ 9.6914, 10.8983, 16.4038,  ...,  9.0461,  8.7948,  9.4617],
         ...,
         [ 8.6105,  8.7850,  9.2745,  ..., 18.1179,  7.7623,  7.4786],
         [11.7315,  9.4326,  9.2220,  ...,  7.1206, 20.0000,  9.1470],
         [14.0907, 14.3647, 10.4639,  ...,  7.9580, 11.1152, 13.3830]],
        device='cuda:0', grad_fn=<MulBackward0>),
 tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47], device='cuda:0'))

In [None]:
# Load STS Test_data
"""
Evaluate on the translated STS Benchmark dataset (STSb): 
>> STS Benchmark comprises a selection of the English datasets used in the STS tasks 
>> organized in the context of SemEval between 2012 and 2017. The selection of datasets 
>> include text from image captions, news headlines and user forums.
"""
STSb_PT = pickle_file(path_base+'STS_PT')
STSb_PT = datasets.Dataset.from_dict(STSb_PT)
STSb_PT = STSb_PT.map(lambda x: {'label': x['label'] / 5.0})

samples = []
for sample in STSb_PT:
    samples.append(InputExample(
        texts=[sample['sentence1'], sample['sentence2']],
        label=sample['label']
    ))

# ------------------------------------------------------------------------------------------
def train(epoch, model, loader, loss_fct, optim, scheduler, device='cpu', eval_steps=4_000, eval_samples=samples):
    model.train()  
    loop = tqdm(loader, leave=True)
    
    for idx, batch in enumerate(loop):
        model.zero_grad()

        anchor_ids, anchor_mask, pos_ids, pos_mask= (
            b.to(device) for b in (batch['anchor_ids'], batch['anchor_mask'], 
                                   batch['positive_ids'], batch['positive_mask']))

        scores, labels = model(anchor_ids, anchor_mask, pos_ids, pos_mask)
        loss = loss_fct(scores, labels[:scores.size(0)])
        loss.backward()
        optim.step()
        scheduler.step()

        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

        if (idx > 0 and idx%eval_steps == 0) or (idx == len(loader)-1):
            eval_STSb(model, eval_samples, idx=idx)

# ------------------------------------------------------------------------------------------
def eval_STSb(model, samples, idx=None):
    model.save_pretrained(path_base+'bertimbau/')
    tokenizer.save_pretrained(path_base+'bertimbau/')
    
    sbert = SentenceTransformer(path_base+'bertimbau/')
    sbert[0].max_seq_length = MAX_LEN
    sbert[1].pooling_mode_cls_token = False
    sbert[1].pooling_mode_mean_tokens = True  # <---
    sbert[1].pooling_mode_max_tokens = False

    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(samples, write_csv=False)
    score_eval = evaluator(sbert)
    print(f'\nEvaluating model at batch {idx} -- Score: {score_eval:.4}\n')        

  0%|          | 0/1500 [00:00<?, ?ex/s]

## Training Loop

In [None]:
try:
    del model
    gc.collect()
    torch.cuda.empty_cache()
except NameError:
    # `model` was not defined yet, so there is nothing to free
    pass

model = SBERTMNR.from_pretrained(BERTMODEL).to(device)

loss_fct = torch.nn.CrossEntropyLoss()

optim = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5)

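total_steps = len(loader) * EPOCHS
warmup_steps = int(0.1 * total_steps)
# num_training_steps is the total number of optimizer steps, warmup included;
# subtracting the warmup would drive the learning rate to zero before training ends
scheduler = get_linear_schedule_with_warmup(optim, num_warmup_steps=warmup_steps,
                                            num_training_steps=total_steps)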

for epoch in range(1, EPOCHS+1):
    train(epoch, model, loader, loss_fct, optim, scheduler, device=device, eval_steps=1_000)

  0%|          | 0/6534 [00:00<?, ?it/s]

Evaluating model at batch 1000 -- Score: 0.8203

Evaluating model at batch 2000 -- Score: 0.8248

Evaluating model at batch 3000 -- Score: 0.8336

Evaluating model at batch 4000 -- Score: 0.8284

Evaluating model at batch 5000 -- Score: 0.8293

Evaluating model at batch 6000 -- Score: 0.8303

Evaluating model at batch 6533 -- Score: 0.8303
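
The shape of a linear warmup/decay schedule is easy to reproduce in plain Python. The sketch below mirrors, to the best of my understanding, the multiplier that `get_linear_schedule_with_warmup` applies to the base learning rate; `lr_multiplier` is a hypothetical helper name and the step counts are taken from the progress bar above.

```python
def lr_multiplier(step, warmup_steps, training_steps):
    # Linear ramp from 0 to 1 during warmup...
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # ...then linear decay from 1 down to 0 at `training_steps`
    return max(0.0, (training_steps - step) / max(1, training_steps - warmup_steps))

total_steps = 6534                     # batches in one epoch
warmup_steps = int(0.1 * total_steps)  # 653
base_lr = 2e-5

for step in (0, warmup_steps, total_steps // 2, total_steps - warmup_steps, total_steps):
    lr = base_lr * lr_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

Because the multiplier reaches zero exactly at `training_steps`, that argument should count every optimizer step, warmup included; a smaller value leaves the final steps with a learning rate of zero.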



# <font color='lightblue'>(SBERT$_\text{MNR}$ trained on NLI$_\text{PTBr}$)$_\text{lib SBERT}$</font>

In [None]:
data_train = pickle_file(path_base+'SNLI_MNLI_POSITIVES_PT')
data_train = datasets.Dataset.from_dict(data_train)

train_samples = []
for row in tqdm(data_train):
    train_samples.append(InputExample(
        texts=[row['premise'], row['hypothesis']]
    ))

  0%|          | 0/313629 [00:00<?, ?it/s]

In [None]:
from sentence_transformers import models, losses
# Import the loader directly so the Hugging Face `datasets` module is not shadowed
from sentence_transformers.datasets import NoDuplicatesDataLoader

try:
    del model
    gc.collect()
    torch.cuda.empty_cache()
except NameError:
    # `model` was not defined yet, so there is nothing to free
    pass

def eval_STSb_sbert(model, samples):
    # Reload the model that `fit` saved to ./sbert (/content/sbert on Colab)
    sbert = SentenceTransformer('/content/sbert')
    sbert[0].max_seq_length = MAX_LEN
    sbert[1].pooling_mode_cls_token = False
    sbert[1].pooling_mode_mean_tokens = True  # <---
    sbert[1].pooling_mode_max_tokens = False

    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(samples, write_csv=False)
    score_eval = evaluator(sbert)
    print(f'\nEvaluating model with SBERT: score {score_eval:.4}\n')        

# Data Loader
loader = NoDuplicatesDataLoader(train_samples, batch_size=BSIZE)

# Init BERT from pre_trained
bert = models.Transformer(BERTMODEL)

# pass the pooler - Default is mean
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True
)

# pass the model and the pooler to the SentenceTransformer main class
model = SentenceTransformer(modules=[bert, pooler])

# truncate max seq length to MAX_LEN
model[0].max_seq_length = MAX_LEN

# Using MNR Loss
loss = losses.MultipleNegativesRankingLoss(model)

# linear warmup of the learning rate over the first 10% of the steps
warmup_steps = int(len(loader) * 0.1)

# trainer obj sbert
model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=warmup_steps,
    output_path='./sbert',
    show_progress_bar=True)

# Get score in STSb
eval_STSb_sbert(model, samples)      

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6533 [00:00<?, ?it/s]


Evaluating model with SBERT: score 0.8401

