# Finetuning for Text Vectorisation

> If you are new to text vectorisation be sure to look at the text vectorisation notebook first.

Finetuning a text vectorisation task is mostly a matter of optimisation.

Our supported text vectorisation models are applicable to multiple types of text vectorisation use cases: from detecting similar questions to finding paragraphs that contain answers to some question.

However, you may be able to make your use case significantly more accurate with finetuning.

In [1]:
import backprop

## Getting the data

The most value for text vectorisation comes by using your own data for finetuning.

For this example, we will be using the Quora duplicate questions dataset. One row of data contains two questions and whether they are duplicate or not.

Finetuning for text vectorisation uses cosine similarity to compare how similar the vectors are. Therefore, we can score duplicates as `1.0` and non duplicates as `0.0`. Any value between 0 and 1 works, but this dataset does not contain more finegrained information.

Our input data will be a list of question tuples (`[(q1, q2), (q3, q4)]`) and our output data will be a list of corresponding scores (`[0.0, 1.0]`).

In [21]:
from datasets import load_dataset
dataset = load_dataset("quora")

Using custom data configuration default
Reusing dataset quora (/home/kristo/.cache/huggingface/datasets/quora/default/0.0.0/2be517cf0ac6de94b77a103a36b141347a13f40637fbebaccb56ddbe397876be)


In [22]:
dataset["train"][0]

{'is_duplicate': False,
 'questions': {'id': [1, 2],
  'text': ['What is the step by step guide to invest in share market in india?',
   'What is the step by step guide to invest in share market?']}}

In [23]:
dataset["train"][7]

{'is_duplicate': True,
 'questions': {'id': [15, 16],
  'text': ['How can I be a good geologist?',
   'What should I do to be a great geologist?']}}

In [24]:
input_data = []
output_data = []

num_positive = 0
num_negative = 0

for i in range(len(dataset["train"])):
    # Get 500 positive and 500 negative examples
    similarity = 1.0 if dataset["train"][i]["is_duplicate"] else 0.0
    
    if similarity == 1.0 and num_positive >= 500:
        continue
    else:
        num_positive += 1
        
    if similarity == 0.0 and num_negative >= 500:
        continue
    else:
        num_negative += 1
    
    questions = dataset["train"][i]["questions"]
    q1 = questions["text"][0]
    q2 = questions["text"][1]
    # Tuple
    input_data.append((q1, q2))
    
    output_data.append(similarity)

In [25]:
input_data[0], output_data[0]

(('What is the step by step guide to invest in share market in india?',
  'What is the step by step guide to invest in share market?'),
 0.0)

In [26]:
input_data[7], output_data[7]

(('How can I be a good geologist?',
  'What should I do to be a great geologist?'),
 1.0)

It is a good idea to keep the examples roughly balanced. Otherwise finetuning just makes the model more biased toward some score.

## Finetuning

All we do is pass in our question pairs as input data and our similarity scores as output data.

In [27]:
# Start a text vectorisation task with a text vectorisation model
tv = backprop.TextVectorisation(backprop.models.DistiluseBaseMultilingualCasedV2)
# Length here refers to number of tokens (1 token ~ 1 word)
tv.finetune(input_data, output_data, max_input_length=64)

GPU available: True, used: True
TPU available: None, using: 0 TPU cores


Processing data...
Finding the optimal batch size...


Batch size 2 succeeded, trying batch size 4
Batch size 4 succeeded, trying batch size 8
Batch size 8 succeeded, trying batch size 16
Batch size 16 succeeded, trying batch size 32
Batch size 32 succeeded, trying batch size 64
Batch size 64 succeeded, trying batch size 128
Batch size 128 succeeded, trying batch size 256
Batch size 256 failed, trying batch size 128
Finished batch size finder, will continue with full run using batch size 128
Restored states from the checkpoint file at /home/kristo/Documents/backprop/examples/scale_batch_size_temp_model.ckpt
GPU available: True, used: True
TPU available: None, using: 0 TPU cores

  | Name  | Type                | Params
----------------------------------------------
0 | model | SentenceTransformer | 135 M 
----------------------------------------------
135 M     Trainable params
0         Non-trainable params
135 M     Total params
540.511   Total estimated model params size (MB)


Validation sanity check: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validating: |          | 0/? [00:00<?, ?it/s]

Validating: |          | 0/? [00:00<?, ?it/s]

Validating: |          | 0/? [00:00<?, ?it/s]

Training finished! Save your model for later with backprop.save or upload it with backprop.upload


In [28]:
q1 = tv("Where did Bill Gates go to school?")
q2 = tv("What school did Bill Gates go to?")

In [29]:
backprop.cosine_similarity(q1, q2)

0.9043131470680237

In [30]:
q1 = tv("Where did Bill Gates go to school?")
q2 = tv("What company did Bill Gates found?")

In [31]:
backprop.cosine_similarity(q1, q2)

0.7232611179351807

In [32]:
q1 = tv("Where did Bill Gates go to school?")
q2 = tv("How big is the moon?")

In [33]:
backprop.cosine_similarity(q1, q2)

0.16400930285453796

As we can see, the most similar questions get the highest score while the least similar questions get the lowest score.