# Spanish QA asymmetric

Training a model that creates embeddings for asymmetric semantic search (short queries matched against longer passages) in Spanish:

`roberta-base-bne-finetuned-msmarco-qa-es`

### References:

* https://www.sbert.net/docs/training/overview.html
* https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py
* https://www.pinecone.io/learn/nlp/
* https://www.pinecone.io/learn/fine-tune-sentence-transformers-mnr/
* https://huggingface.co/datasets/unicamp-dl/mmarco/viewer/spanish/train
* https://huggingface.co/datasets/unicamp-dl/mrobust/viewer/queries-spanish

# Config

In [1]:
# Base model
# length_embedding = 512  # TODO: check other sizes (32, 64, 128, 256, 512)
base_model_name = 'PlanTL-GOB-ES/roberta-base-bne'  # TODO: check large?
max_seq_length = 512

# Train model
epochs = 10  # 4, 10, 30
warmup_steps = 1000 # 100, 1000
batch_size = 16 # 32
optimizer_params = {'lr': 2e-5}
loss = 'mnrl'  # 'mnrl', 'mse', 'tl'

# Dataset
dataset_train_size = 500_000  # 20_000 # 100_000 # bottleneck: GPU memory limits
multiple_negatives = False

# General
seed = 42

In [2]:
if multiple_negatives:
  dataset_name = "IIC/ms_marco_es"
else:
  dataset_name = "dariolopez/ms-marco-es-500k"

In [3]:
print(dataset_name)

dariolopez/ms-marco-es-500k


# Install libraries

In [4]:
!pip install sentence-transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
!pip install langchain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Import libraries

In [6]:
import os
from datetime import datetime

from sentence_transformers import InputExample, SentenceTransformer, models, losses
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Check GPU

In [7]:
!nvidia-smi

Tue May  2 14:21:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    15W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [9]:
print(device)

cuda


# Seed

In [10]:
import random

import numpy as np


def set_seed(seed):
    '''Seed every random number generator used in the notebook so that runs are reproducible.'''
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seeds every visible GPU
    # When running on the cuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Only affects Python subprocesses started later; the hash seed of the
    # running interpreter is fixed at startup
    os.environ['PYTHONHASHSEED'] = str(seed)


set_seed(seed)

# Model

In [11]:
%%time
# Define the model
word_embedding_model = models.Transformer(
    model_name_or_path=base_model_name,
    max_seq_length=max_seq_length,
    #model_args={"truncation": True, "padding": "max_length", "max_length": max_seq_length},
    #tokenizer_args={"truncation": True, "padding": "max_length", "max_length": max_seq_length},
    tokenizer_name_or_path=base_model_name
)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    #pooling_mode_cls_token=True,
    #pooling_mode_mean_tokens=False
)
"""
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=length_embedding,
    activation_function=nn.Tanh()
)
asym_model = models.Asym({
    'query': [models.Dense(pooling_model.get_sentence_embedding_dimension(), length_embedding)], 
    'doc': [models.Dense(pooling_model.get_sentence_embedding_dimension(), length_embedding)]
})
"""

base_model = SentenceTransformer(
    modules=[word_embedding_model, pooling_model],
    device=device
)

Some weights of the model checkpoint at PlanTL-GOB-ES/roberta-base-bne were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


CPU times: user 1.75 s, sys: 706 ms, total: 2.46 s
Wall time: 5.17 s


In [12]:
length_embedding = word_embedding_model.get_word_embedding_dimension()

In [13]:
print(length_embedding)

768


In [14]:
print(base_model)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
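
The printed configuration shows `pooling_mode_mean_tokens: True`, so the sentence embedding is the average of the token vectors. The following is a minimal, dependency-free sketch of that idea, not the library's implementation; `mean_pool` and the toy values are hypothetical names and data chosen for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, ignoring padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    n_real = max(n_real, 1)
    return [value / n_real for value in summed]


# Three token vectors of dimension 2; the last one is padding (mask 0).
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
print(mean_pool(toy_tokens, toy_mask))  # [2.0, 3.0]
```

With roberta-base-bne the same operation runs over 768-dimensional token vectors, which is why the sentence embedding also has 768 dimensions.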


# Load Dataset

In [15]:
%%time
import datasets

marco_es = datasets.load_dataset(dataset_name)



  0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 1.03 s, sys: 170 ms, total: 1.2 s
Wall time: 5.55 s


In [16]:
print(marco_es['train'])

Dataset({
    features: ['query', 'positive', 'negative'],
    num_rows: 500000
})


# Prepare for training

In [17]:
if multiple_negatives:  # query - passage - label https://huggingface.co/datasets/IIC/ms_marco_es
    train_samples = [
        InputExample(texts=[row['query'], row['passages']], label=row['labels'])
        for row in marco_es['train'].select(range(dataset_train_size))
    ]
else:  # query - positive - negative  https://huggingface.co/datasets/dariolopez/ms-marco-es
    train_samples = [
        InputExample(texts=[row['query'], row['positive'], row['negative']])
        for row in marco_es['train'].select(range(dataset_train_size))
    ]

In [18]:
print(f"length train samples: {len(train_samples)}")

length train samples: 4096


In [19]:
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=batch_size)

In [20]:
now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
model_save_path = f'output/train_bi-encoder-{loss}-length_embedding_{length_embedding}-multiple_negatives_{multiple_negatives}-dataset_name_{dataset_name.replace("/", "-")}-{now}-{base_model_name.replace("/", "-")}-batch_size_{batch_size}-dataset_train_size_{dataset_train_size}'
os.makedirs(model_save_path, exist_ok=True)

In [21]:
import gc

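def free_memory(score, epoch, steps):
    # Note: fit() calls this callback after each evaluation, so it is
    # expected to run only when an evaluator is passed to fit()
    torch.cuda.empty_cache()
    gc.collect()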

In [22]:
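if loss == 'mnrl':
    train_loss = losses.MultipleNegativesRankingLoss(model=base_model)
elif loss == 'mse':
    # MarginMSELoss expects a score-margin label on every InputExample
    train_loss = losses.MarginMSELoss(model=base_model)
elif loss == 'tl':
    train_loss = losses.TripletLoss(model=base_model)
else:
    # Fail early instead of silently falling back to a loss that does not fit triplet data
    raise ValueError(f"Unknown loss '{loss}'; expected 'mnrl', 'mse' or 'tl'")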

In [23]:
print(train_loss)

MultipleNegativesRankingLoss(
  (model): SentenceTransformer(
    (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
    (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  )
  (cross_entropy_loss): CrossEntropyLoss()
)
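
As a rough illustration of what `MultipleNegativesRankingLoss` optimizes: each query is scored against every positive and negative passage in the batch, and a cross-entropy loss pushes the query's own positive to the top. The sketch below is plain Python rather than the library code; the function names and toy vectors are hypothetical, and the scale of 20 with cosine similarity is assumed to match the library defaults.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnrl(query_vecs, positive_vecs, negative_vecs, scale=20.0):
    """Mean cross-entropy where the target of query i is positive i."""
    # Every positive and every hard negative in the batch is a candidate.
    candidates = positive_vecs + negative_vecs
    total = 0.0
    for i, query in enumerate(query_vecs):
        scores = [scale * cos_sim(query, c) for c in candidates]
        # Numerically stable -log(softmax(scores)[i]).
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(query_vecs)


toy_queries = [[1.0, 0.0], [0.0, 1.0]]
toy_positives = [[0.9, 0.1], [0.1, 0.9]]
toy_negatives = [[-1.0, 0.0], [0.0, -1.0]]

aligned = mnrl(toy_queries, toy_positives, toy_negatives)
swapped = mnrl(toy_queries, list(reversed(toy_positives)), toy_negatives)
print(aligned < swapped)  # True
```

Because every other passage in the batch acts as a negative, larger batches give each query more negatives to be ranked against.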


In [24]:
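import copy

# Fine-tune a copy so that base_model stays untouched as a baseline
model = copy.deepcopy(base_model)
# fit() optimizes the parameters held by the loss, so the loss must wrap the
# copy; a loss still bound to base_model would update base_model instead
train_loss.model = model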

# Train

In [25]:
%%time

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    save_best_model=True,
    show_progress_bar=True,
    use_amp=True,  # If your GPU does not have FP16 cores, set use_amp=False
    callback=free_memory,
    checkpoint_save_steps=len(train_dataloader),
    checkpoint_path=model_save_path,
)

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

Iteration:   0%|          | 0/256 [00:00<?, ?it/s]

CPU times: user 16min 52s, sys: 2min, total: 18min 52s
Wall time: 19min 13s
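
The progress bars above show 256 iterations per epoch, which is what 4096 training samples give at a batch size of 16. The sketch below reproduces that arithmetic and relates `warmup_steps` to the total number of optimizer steps; `training_schedule` is a hypothetical helper, not part of sentence-transformers.

```python
import math


def training_schedule(num_samples, batch_size, epochs, warmup_steps):
    """Return (steps per epoch, total steps, fraction of steps spent in warmup)."""
    # The DataLoader keeps the last partial batch, hence the ceiling.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return steps_per_epoch, total_steps, warmup_steps / total_steps


# The run recorded in this notebook: 4096 samples.
print(training_schedule(4096, 16, 10, 1000))     # (256, 2560, 0.390625)
# The full dataset size set in the config cell: 500_000 samples.
print(training_schedule(500_000, 16, 10, 1000))  # (31250, 312500, 0.0032)
```

With 4096 samples, 1000 warmup steps cover about 39% of training, whereas on the full 500k dataset they would cover well under 1%.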


# Save and push to HuggingFace

In [26]:
# Save the final model
model.save(model_save_path)

In [27]:
print(model_save_path)

output/train_bi-encoder-mnrl-length_embedding_768-multiple_negatives_False-dataset_name_dariolopez-ms-marco-es-500k-2023-05-02_14-21-42-PlanTL-GOB-ES-roberta-base-bne-batch_size_16-length_embedding_768-dataset_train_size_4096


# Manual testing

In [37]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter


loader = TextLoader('boe.txt')
document = loader.load()
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1600, chunk_overlap=0)
docs = text_splitter.split_documents(document)

In [38]:
len(docs)

177
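
The 177 chunks come from splitting the text on newlines and merging consecutive pieces until the next one would exceed `chunk_size`. The following is a simplified sketch of that greedy merge with no overlap; it is an approximation for illustration, not LangChain's implementation, and `split_text` is a hypothetical name.

```python
def split_text(text, separator="\n", chunk_size=1600):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        # A piece longer than chunk_size is kept whole as its own chunk.
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Artículo 1\nObjeto\nArtículo 2\nPrincipios"
print(split_text(sample, chunk_size=20))
# ['Artículo 1\nObjeto', 'Artículo 2', 'Principios']
```

Chunk boundaries therefore depend only on length, so an article can be cut in the middle and its heading may end up in a different chunk than its body.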

In [39]:
corpus = [doc.page_content for doc in docs]

In [40]:
queries = [
    {'query': '¿Qué contenidos deberán incluirse en el sistema educativo español de acuerdo con lo que establece esta ley?', 'test': 'Artículo 7'},
    {'query': '¿En qué consiste el Servicio de recuperación integral?', 'test': 'Artículo 35'},
    {'query': '¿A qué ayudas económicas pueden acceder las mujeres víctimas de violencias sexuales?', 'test': 'Artículo 41'},
    {'query': '¿Qué valoración realizarán las Unidades de valoración forense integral?', 'test': 'Artículo 47'},
    {'query': '¿Cuál es la edad de Leo Messi?', 'test': 'fake'},
    {'query': 'Someone in a gorilla costume is playing a set of drums.', 'test': 'fake'}
]

In [41]:
corpus[0]

'TEXTO ORIGINAL\nFELIPE VI\n\nREY DE ESPAÑA\n\nA todos los que la presente vieren y entendieren.\n\nSabed: Que las Cortes Generales han aprobado y Yo vengo en sancionar la siguiente ley orgánica:\n\nÍNDICE\n\nPreámbulo.\n\nTítulo preliminar.\u2003Disposiciones generales.\n\nArtículo 1.\u2003Objeto y finalidad.\n\nArtículo 2.\u2003Principios rectores.\n\nArtículo 3.\u2003Ámbito de aplicación.\n\nTítulo I.\u2003Investigación y producción de datos.\n\nArtículo 4.\u2003Investigación y datos.\n\nArtículo 5.\u2003Órgano responsable.\n\nArtículo 6.\u2003Fomento de la investigación en materia de violencia sexual.\n\nTítulo II.\u2003Prevención y detección.\n\nCapítulo I.\u2003Medidas de prevención y sensibilización.\n\nArtículo 7.\u2003Prevención y sensibilización en el ámbito educativo.\n\nArtículo 8.\u2003Prevención y sensibilización en el ámbito sanitario, sociosanitario y de servicios sociales.\n\nArtículo 9.\u2003Campañas institucionales de prevención e información.\n\nArtículo 10.\u2003Me

In [42]:
len(model.encode([q['query'] for q in queries], convert_to_tensor=True)[0])

768

In [43]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

In [44]:
%%time
from sentence_transformers import util

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Find the 4 corpus chunks closest to each query by cosine similarity
top_k = min(4, len(corpus))
for query in queries:
    print(f"Query: {query['query']}")
    query_embedding = model.encode(query['query'], convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    test_passed = False
    scores = []
    for hit in hits:
        scores.append(hit['score'])
        #print(f"Id: {hit['corpus_id']}")
        if query['test'] in corpus[hit['corpus_id']]:
            test_passed = True
        # print(f"Answer: {corpus[hit['corpus_id']]}")
        # print("\n\n")
        #print("\n\n\n\n\n\n\n\n")
        # print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    print(scores)
    print(f"Test passed: {test_passed}")
    print("\n\n\n\n")

Query: ¿Qué contenidos deberán incluirse en el sistema educativo español de acuerdo con lo que establece esta ley?
[0.9689760804176331, 0.9688634276390076, 0.9676809310913086, 0.9666774272918701]
Test passed: True





Query: ¿En qué consiste el Servicio de recuperación integral?
[0.9477847218513489, 0.9468966126441956, 0.9463406801223755, 0.9455181360244751]
Test passed: False





Query: ¿A qué ayudas económicas pueden acceder las mujeres víctimas de violencias sexuales?
[0.9597879648208618, 0.9563784599304199, 0.9559546709060669, 0.9551206827163696]
Test passed: False





Query: ¿Qué valoración realizarán las Unidades de valoración forense integral?
[0.9407835006713867, 0.939946174621582, 0.9378331303596497, 0.9354345798492432]
Test passed: True





Query: ¿Cuál es la edad de Leo Messi?
[0.9295246601104736, 0.9243342876434326, 0.9202563762664795, 0.9201865792274475]
Test passed: False





Query: Someone in a gorilla costume is playing a set of drums.
[0.9075138568878174, 0.905198

In [45]:
%%time
from sentence_transformers import util

corpus_embeddings = base_model.encode(corpus, convert_to_tensor=True)

# Find the 4 corpus chunks closest to each query by cosine similarity
top_k = min(4, len(corpus))
for query in queries:
    print(f"Query: {query['query']}")
    query_embedding = base_model.encode(query['query'], convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
    test_passed = False
    scores = []
    for hit in hits:
        scores.append(hit['score'])
        #print(f"Id: {hit['corpus_id']}")
        if query['test'] in corpus[hit['corpus_id']]:
            test_passed = True
        #print(corpus[hit['corpus_id']])
        #print("\n\n\n\n\n\n\n\n")
        # print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
    print(scores)
    print(f"Test passed: {test_passed}")
    print("\n\n")

Query: ¿Qué contenidos deberán incluirse en el sistema educativo español de acuerdo con lo que establece esta ley?
[0.8100699782371521, 0.5381356477737427, 0.5097401738166809, 0.4760800302028656]
Test passed: True



Query: ¿En qué consiste el Servicio de recuperación integral?
[0.49888911843299866, 0.4766852557659149, 0.4300900101661682, 0.4190990924835205]
Test passed: False



Query: ¿A qué ayudas económicas pueden acceder las mujeres víctimas de violencias sexuales?
[0.8558606505393982, 0.7952286601066589, 0.7911955118179321, 0.7670859098434448]
Test passed: False



Query: ¿Qué valoración realizarán las Unidades de valoración forense integral?
[0.8147773742675781, 0.6011478304862976, 0.5034526586532593, 0.4894534647464752]
Test passed: True



Query: ¿Cuál es la edad de Leo Messi?
[0.16004899144172668, 0.14487004280090332, 0.14123597741127014, 0.13931210339069366]
Test passed: False



Query: Someone in a gorilla costume is playing a set of drums.
[0.000565103255212307, -0.0044295