# Train and Fine-Tune Sentence Transformers Models - Notebook Companion

In [None]:
%%capture
!pip install sentence-transformers

## How Sentence Transformers models work


In [None]:
from sentence_transformers import SentenceTransformer, models

## Step 1: use an existing language model
word_embedding_model = models.Transformer('distilroberta-base')

## Step 2: use a pool function over the token embeddings
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

## Join steps 1 and 2 using the modules argument
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
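
The pooling step can be illustrated without any library. The sketch below assumes mean pooling (averaging the token vectors while skipping padded positions), which is our understanding of the default mode of `models.Pooling`. The helper name `mean_pool` and the toy vectors are hypothetical and only serve the illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    n_real = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            n_real += 1
            for d in range(dim):
                summed[d] += vector[d]
    # Guard against an all-padding input to avoid dividing by zero.
    return [s / max(n_real, 1) for s in summed]


# Three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

Whatever the number of tokens, the result is one fixed-size vector per sentence, which is what the later similarity-based losses operate on.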


## How to prepare your dataset for training a Sentence Transformers model


In [None]:
%%capture
!pip install datasets

In [None]:
from datasets import load_dataset

#dataset_id = "embedding-data/QQP_triplets"
dataset_id = "embedding-data/sentence-compression"

dataset = load_dataset(dataset_id)



In [None]:
print(f"- The {dataset_id} dataset has {dataset['train'].num_rows} examples.")
print(f"- Each example is a {type(dataset['train'][0])} with a {type(dataset['train'][0]['set'])} as value.")
print(f"- Examples look like this: {dataset['train'][0]}")

- The embedding-data/sentence-compression dataset has 180000 examples.
- Each example is a <class 'dict'> with a <class 'list'> as value.
- Examples look like this: {'set': ["The USHL completed an expansion draft on Monday as 10 players who were on the rosters of USHL teams during the 2009-10 season were selected by the League's two newest entries, the Muskegon Lumberjacks and Dubuque Fighting Saints.", 'USHL completes expansion draft']}


In [None]:
dataset['train'][0]

{'set': ["The USHL completed an expansion draft on Monday as 10 players who were on the rosters of USHL teams during the 2009-10 season were selected by the League's two newest entries, the Muskegon Lumberjacks and Dubuque Fighting Saints.",
  'USHL completes expansion draft']}

Convert the examples into `InputExample`s. This might take around 10 seconds in Google Colab.

In [None]:
from sentence_transformers import InputExample

train_examples = []
train_data = dataset['train']['set']
# For agility we only use 1/10 of our available data
n_examples = dataset['train'].num_rows // 10

for i in range(n_examples):
  example = train_data[i]
  train_examples.append(InputExample(texts=[example[0], example[1]]))

In [None]:
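# ORIGINAL CODE: written for the triplet dataset "embedding-data/QQP_triplets",
# where each example is a dict with 'query', 'pos' and 'neg' keys.
# Skip this cell when the sentence-compression dataset is loaded.
from sentence_transformers import InputExample

train_examples = []
train_data = dataset['train']['set']
# For agility we only use 1/2 of our available data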
n_examples = dataset['train'].num_rows // 2

for i in range(n_examples):
  example = train_data[i]
  train_examples.append(InputExample(texts=[example['query'], example['pos'][0], example['neg'][0]]))

In [None]:
print(f"We have a {type(train_examples)} of length {len(train_examples)} containing {type(train_examples[0])}'s.")

We have a <class 'list'> of length 18000 containing <class 'sentence_transformers.readers.InputExample.InputExample'>'s.


We wrap our training dataset in a PyTorch `DataLoader` to shuffle the examples and group them into batches.

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
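
As a rough, library-free sketch of what shuffling and batching amounts to, the snippet below cuts 18,000 stand-in examples into batches of 16. The helper `make_batches` is hypothetical and is not how `DataLoader` is implemented; it only shows where the batch count comes from.

```python
import random


def make_batches(examples, batch_size, seed=0):
    """Shuffle a copy of the examples and cut it into consecutive batches."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


examples = list(range(18000))  # stand-ins for the 18,000 InputExample objects
batches = make_batches(examples, batch_size=16)
print(len(batches))  # 1125
```

18,000 examples divided by a batch size of 16 gives 1,125 batches, which matches the length of `train_dataloader` reported later in the notebook.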

## Loss functions for training a Sentence Transformers model


In [None]:
from sentence_transformers import losses

train_loss = losses.MultipleNegativesRankingLoss(model=model)
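
To give an intuition for this loss, here is a minimal pure-Python sketch of the in-batch negatives idea: each anchor should score highest against its own positive, and every other positive in the batch acts as a negative. It assumes cosine similarity multiplied by a scale of 20 followed by cross-entropy, which is our understanding of the library defaults. The function names are hypothetical and this is not the library's implementation.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the i-th positive is the target for the i-th anchor."""
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cos_sim(anchor, p) for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor
print(in_batch_negatives_loss(anchors, aligned) < in_batch_negatives_loss(anchors, swapped))  # True
```

Because the negatives come from the rest of the batch, this kind of loss only needs (anchor, positive) pairs such as the sentence-compression examples used here.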

In [None]:
# ORIGINAL CODE
from sentence_transformers import losses

train_loss = losses.TripletLoss(model=model)
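
For comparison, a triplet loss works on (anchor, positive, negative) examples and asks the anchor to be closer to the positive than to the negative by some margin. The sketch below assumes Euclidean distance and a margin of 5, which we believe are the defaults of `losses.TripletLoss`. The helper names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]
near_negative = [0.0, 2.0]
far_negative = [0.0, 10.0]
print(triplet_loss(anchor, positive, near_negative))  # 4.0
print(triplet_loss(anchor, positive, far_negative))   # 0.0
```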

## How to train a Sentence Transformer model


In [None]:
len(train_dataloader)

1125

In [None]:
num_epochs = 3 # 10 original

warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # 10% of the total training steps
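
The arithmetic behind `warmup_steps` can be checked by hand. The sketch below also shows a linear warmup followed by a linear decay, under the assumption that `model.fit` uses a schedule of that shape by default. The function `warmup_linear_multiplier` is a hypothetical helper that returns the factor applied to the base learning rate.

```python
def warmup_linear_multiplier(step, warmup_steps, total_steps):
    """Learning-rate factor: linear ramp from 0 to 1, then linear decay back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps_per_epoch = 1125  # len(train_dataloader) in this notebook
num_epochs = 3
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.1)
print(total_steps, warmup_steps)  # 3375 337

print(warmup_linear_multiplier(0, warmup_steps, total_steps))             # 0.0
print(warmup_linear_multiplier(warmup_steps, warmup_steps, total_steps))  # 1.0
print(warmup_linear_multiplier(total_steps, warmup_steps, total_steps))   # 0.0
```

So with 3 epochs of 1,125 batches, the first 337 optimizer steps ramp the learning rate up before it starts to decay.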

Training takes around 45 minutes with a Google Colab Pro account. Decrease the number of epochs and examples if you are using a free account or no GPU.

In [None]:
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps) 

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1125 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1125 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1125 [00:00<?, ?it/s]

## How to share a Sentence Transformers model to the Hugging Face Hub

In [None]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in yo

In [None]:
model.save_to_hub(
    "distilroberta-sentence-transformer-test", 
    #organization="embedding-data",
    train_datasets=["embedding-data/sentence-compression"],
    exist_ok=True, 
    )

Cloning https://huggingface.co/edumunozsala/distilroberta-sentence-transformer-test into local empty directory.


Upload file pytorch_model.bin:   0%|          | 3.34k/313M [00:00<?, ?B/s]

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/edumunozsala/distilroberta-sentence-transformer-test
   615c920..aaeb0a5  main -> main




'https://huggingface.co/edumunozsala/distilroberta-sentence-transformer-test/commit/aaeb0a54b2259a207b631168d6b6488b3ab857db'

In [None]:
model.save_to_hub(
    "distilroberta-base-sentence-transformer", 
    organization="embedding-data",
    train_datasets=["embedding-data/QQP_triplets"],
    exist_ok=True, 
    )

## Extra: How to fine-tune a Sentence Transformer model


Now we will fine-tune our Sentence Transformer model.

In [None]:
modelB = SentenceTransformer('embedding-data/distilroberta-base-sentence-transformer')

In [None]:
dataset_id = "embedding-data/sentence-compression"
datasetB = load_dataset(dataset_id)

In [None]:
print(f"Examples look like this: {datasetB['train']['set'][0]}")

In [None]:
train_examplesB = []
train_dataB = datasetB['train']['set']
n_examples = datasetB['train'].num_rows

for i in range(n_examples):
  example = train_dataB[i]
  train_examplesB.append(InputExample(texts=[example[0], example[1]]))

In [None]:
train_dataloaderB = DataLoader(train_examplesB, shuffle=True, batch_size=64)
train_lossB = losses.MultipleNegativesRankingLoss(model=modelB)
num_epochsB = 10
warmup_stepsB = int(len(train_dataloaderB) * num_epochsB * 0.1)  # 10% of the total training steps

In [None]:
modelB.fit(train_objectives=[(train_dataloaderB, train_lossB)],
          epochs=num_epochsB,
          warmup_steps=warmup_stepsB) 

In [None]:
modelB.save_to_hub(
    "distilroberta-base-sentence-transformer", 
    organization="embedding-data",
    train_datasets=["embedding-data/sentence-compression"],
    exist_ok=True, 
    )