<a href="https://colab.research.google.com/github/fastdatascience/fine-tune-llm/blob/main/fine_tune_llm_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example of how to fine-tune your own large language model

This notebook is an example script for fine-tuning a large language model for sentence similarity. Check out the [accompanying blog post](https://naturallanguageprocessing.com/train-ai/fine-tune-large-language-model-for-sentence-similarity/) and [video tutorial](https://www.youtube.com/watch?v=SyHXRxkO0tQ).

Use it when you have training data that indicates which sentences you consider similar and you want a custom sentence similarity model.

It's adapted from the scripts for the [Harmony](https://harmonydata.ac.uk/doxa/)/[DOXA AI](https://doxaai.com/competition/harmony-matching) competition (fine-tune an LLM for the psychology domain), but it's general and can be applied to other domains. Credit to Jeremy Lo Ying Ping for the training code.
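
As a rough sketch of the input this notebook expects: each training row pairs two sentences with a similarity score. The column names `sentence1`, `sentence2` and `score` match the CSV used in the cells that follow. The second data row, the helper `validate_rows`, and the assumption that scores lie between 0 and 1 are illustrative and not part of the original scripts.

```python
import csv
import io

# Tiny in-memory CSV in the same three-column layout as the training data.
# The first data row comes from the dataset preview; the second is invented.
SAMPLE_CSV = '''sentence1,sentence2,score
"Feeling nervous, anxious or on edge?",Have had changes in appetite or sleep?,0.16
I feel tense.,I feel on edge.,0.9
'''


def validate_rows(csv_text):
    """Parse the CSV and check each row has two sentences and a score in [0, 1]."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["score"])
        # Assumption: similarity labels are normalised to the range [0, 1].
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        if not row["sentence1"] or not row["sentence2"]:
            raise ValueError("both sentences must be non-empty")
        rows.append((row["sentence1"], row["sentence2"], score))
    return rows


rows = validate_rows(SAMPLE_CSV)
print(len(rows), rows[0][2])
```

A check like this is cheap to run before training and catches mislabelled or empty rows early.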

In [None]:
# !pip install pandas==2.2.2 transformers==4.43.1 sentence-transformers[train]==3.0.1

In [None]:
import pandas as pd

In [None]:
!wget https://naturallanguageprocessing.com/harmony-matching-training-data.csv.zip

--2024-11-20 11:12:19--  https://naturallanguageprocessing.com/harmony-matching-training-data.csv.zip
Resolving naturallanguageprocessing.com (naturallanguageprocessing.com)... 3.125.36.175, 3.124.100.143, 2a05:d01c:9e6:f100::1f4, ...
Connecting to naturallanguageprocessing.com (naturallanguageprocessing.com)|3.125.36.175|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21159 (21K) [application/zip]
Saving to: ‘harmony-matching-training-data.csv.zip.1’


2024-11-20 11:12:20 (242 KB/s) - ‘harmony-matching-training-data.csv.zip.1’ saved [21159/21159]



In [None]:
df = pd.read_csv("harmony-matching-training-data.csv.zip")

In [None]:
df

Unnamed: 0,sentence1,sentence2,score
0,Do you believe in telepathy (mind-reading)?,I believe that there are secret signs in the w...,0.15
1,"Irritable behavior, angry outbursts, or acting...",Felt “on edge”?,0.62
2,I have some eccentric (odd) habits.,I often have difficulty following what someone...,0.00
3,Do you often feel nervous when you are in a gr...,Been easily annoyed by different things?,0.00
4,Do you believe in telepathy (mind-reading)?,Most of the time I find it is very difficult t...,0.26
...,...,...,...
2346,Little interest or pleasure in doing things,At times I have wondered if my body was really...,0.00
2347,"Feeling down, depressed, or hopeless?",I find that I am very often confused about wha...,0.00
2348,Not being able to stop or control worrying?,"If given the choice, I would much rather be wi...",0.16
2349,"Feeling nervous, anxious or on edge?",Have had changes in appetite or sleep?,0.16


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df_train, df_test = train_test_split(df)

In [None]:
# Renumber the rows of each split from 0 and discard the shuffled original index.
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [None]:
len(df_train), len(df_test)

(1763, 588)
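
The sizes above follow from scikit-learn's defaults: when neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows and rounds the test count up. A quick check of the arithmetic, assuming the full table has 1763 + 588 = 2351 rows (the helper name `default_split_sizes` is made up for illustration):

```python
import math


def default_split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test) for a split that holds out 25% of the rows by default."""
    n_test = math.ceil(test_fraction * n_samples)  # test count is rounded up
    n_train = n_samples - n_test                   # the remainder is used for training
    return n_train, n_test


sizes = default_split_sizes(2351)
print(sizes)  # (1763, 588)
```

Because the call passes no `random_state`, the rows that land in each part change from run to run, so the metrics further down will vary slightly between runs.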

In [None]:
from datasets import Dataset

In [None]:
dataset_train = Dataset.from_pandas(df_train)

In [None]:
dataset_train

Dataset({
    features: ['sentence1', 'sentence2', 'score'],
    num_rows: 1763
})

In [None]:
dataset_test = Dataset.from_pandas(df_test)

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
model = SentenceTransformer("all-mpnet-base-v2")

In [None]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
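
The printout lists three stages: a transformer that produces one vector per token, a pooling layer in mean-token mode, and a normalisation layer. The sketch below mimics the last two stages on toy numbers in plain Python, assuming padding tokens are masked out of the mean. The function names and the tiny 2-dimensional vectors are illustrative only.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (real tokens, not padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


# Three token vectors; the last one is padding and is ignored by the mask.
tokens = [[1.0, 2.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)    # [3.0, 4.0]
embedding = l2_normalize(pooled)    # [0.6, 0.8]
print(pooled, embedding)
```

Since every embedding leaves the model with unit length, the cosine similarity of two embeddings equals their dot product.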

In [None]:
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments

In [None]:
from sentence_transformers.losses import CosineSimilarityLoss

In [None]:
loss = CosineSimilarityLoss(model)
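
`CosineSimilarityLoss` embeds both sentences of a pair, takes the cosine similarity of the two embeddings, and by default penalises the squared difference from the labelled `score`. A plain-Python sketch of that calculation on made-up 2-dimensional embeddings follows; the function names are illustrative, and the real loss operates on batches of tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cosine_mse_loss(pairs, labels):
    """Mean squared error between each pair's cosine similarity and its label."""
    errors = [(cosine(u, v) - label) ** 2 for (u, v), label in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Two toy pairs: identical vectors (cosine 1.0) and orthogonal vectors (cosine 0.0).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [0.8, 0.2]

toy_loss = cosine_mse_loss(pairs, labels)
print(toy_loss)  # about 0.04
```

Training nudges the embeddings so that pairs labelled as similar move closer together and pairs labelled as dissimilar move apart.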

In [None]:
trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(
        output_dir="checkpoints",  # directory where checkpoints are written
        num_train_epochs=3,
        # Only takes effect if an eval_dataset is also passed to the trainer.
        per_device_eval_batch_size=16,
    ),
    train_dataset=dataset_train,
    loss=loss,
)

In [None]:
trainer

<sentence_transformers.trainer.SentenceTransformerTrainer at 0x7a081c90f810>

In [None]:
trainer.train()

Step,Training Loss
500,0.059


                                                                                

TrainOutput(global_step=663, training_loss=0.05342011286302569, metrics={'train_runtime': 553.4312, 'train_samples_per_second': 9.557, 'train_steps_per_second': 1.198, 'total_flos': 0.0, 'train_loss': 0.05342011286302569, 'epoch': 3.0})
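
The numbers in `TrainOutput` can be cross-checked by hand. The sketch assumes a single device, no gradient accumulation, and the default training batch size of 8 (the training arguments set only the evaluation batch size). Under those assumptions, 1763 training pairs over 3 epochs give the 663 steps reported.

```python
import math

n_train = 1763               # rows in dataset_train
epochs = 3                   # num_train_epochs
train_batch_size = 8         # assumed default per_device_train_batch_size
runtime_seconds = 553.4312   # train_runtime reported above

steps_per_epoch = math.ceil(n_train / train_batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs

samples_per_second = n_train * epochs / runtime_seconds
steps_per_second = total_steps / runtime_seconds

print(steps_per_epoch, total_steps)      # 221 663
print(round(samples_per_second, 3))      # compare with train_samples_per_second
print(round(steps_per_second, 3))        # compare with train_steps_per_second
```

The single logged row at step 500 is also expected under the library's default logging interval of 500 steps: a 663-step run only crosses that threshold once.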

In [None]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [None]:
model.encode(["I feel nervous, anxious or afraid"])

array([[-9.93097015e-03, -7.88239613e-02,  6.90429413e-04,
        -5.47722541e-03, -7.27352547e-03,  5.51959611e-02,
        -1.99138150e-02,  3.13243978e-02, -2.88939774e-02,
         4.32792865e-02, -1.23789981e-02, -2.06399094e-02,
        -7.18724504e-02,  3.27489302e-02, -4.31292318e-02,
         6.49863109e-02,  1.22211650e-02,  5.89610077e-02,
         2.65878662e-02, -3.05647179e-02,  4.32941429e-02,
        -2.40641739e-03, -5.77951176e-03, -2.87189931e-02,
         3.09480391e-02,  3.15308012e-02,  3.17449169e-03,
         2.75935028e-02, -2.97866464e-02, -7.41776377e-02,
        -2.04117745e-02, -5.84216751e-02,  5.05814608e-03,
        -3.80146243e-02, -8.17573047e-04,  6.60591898e-03,
        -4.15992066e-02, -1.89618208e-02, -1.86170433e-02,
        -7.61985127e-03,  1.60014117e-03, -1.46569568e-04,
         1.97578203e-02,  7.30020879e-03, -3.35050002e-02,
        -5.17571718e-02,  4.44276333e-02, -6.86508091e-03,
         7.25207552e-02,  5.16764298e-02, -2.18805224e-0

In [None]:
sentence_1_embeddings = model.encode(df_test.sentence1)

In [None]:
sentence_2_embeddings = model.encode(df_test.sentence2)

In [None]:
sentence_1_embeddings.shape

(588, 768)

In [None]:
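import numpy as np
from numpy import ndarray
from numpy.linalg import norm


def cosine_similarity(vec1: ndarray, vec2: ndarray) -> ndarray:
    """Pairwise cosine similarity between the rows of vec1 and the rows of vec2."""
    # Entry (i, j) is the dot product of row i of vec1 with row j of vec2.
    dot_products = vec1 @ vec2.T
    # The outer product of the row norms gives the denominator for every pair.
    norms = np.outer(norm(vec1, axis=1), norm(vec2, axis=1))
    return dot_products / norms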


In [None]:
similarity_matrix = cosine_similarity(sentence_1_embeddings, sentence_2_embeddings)

In [None]:
similarity_matrix.shape

(588, 588)

In [None]:
# The diagonal holds the similarity of each test pair (row i of sentence1 vs row i of sentence2).
df_test["y_pred"] = np.diag(similarity_matrix)
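
Only the diagonal of the 588 × 588 matrix is used, since row i of each embedding array belongs to the same test pair. A possible shortcut, sketched with NumPy on random stand-in arrays, computes just those pair-wise similarities. `paired_cosine_similarity` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np


def paired_cosine_similarity(a, b):
    """Cosine similarity between row i of a and row i of b, for every i."""
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms


# Random stand-ins for the two embedding arrays (5 pairs, 4 dimensions).
rng = np.random.default_rng(0)
emb_1 = rng.normal(size=(5, 4))
emb_2 = rng.normal(size=(5, 4))

paired = paired_cosine_similarity(emb_1, emb_2)

# Cross-check against the diagonal of the full pairwise matrix.
full = (emb_1 @ emb_2.T) / np.outer(
    np.linalg.norm(emb_1, axis=1), np.linalg.norm(emb_2, axis=1)
)
print(np.allclose(paired, np.diag(full)))  # True
```

For a few hundred pairs the full matrix is cheap, so this mainly matters for much larger test sets, where the matrix grows quadratically with the number of pairs.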

In [None]:
df_test

Unnamed: 0,sentence1,sentence2,score,y_pred
0,I sometimes jump quickly from one topic to ano...,People find my conversations to be confusing o...,0.70,0.700950
1,"Trouble concentrating on things, such as readi...",Felt nervous or anxious?,0.00,0.289797
2,Loss of interest in activities that you used t...,My thoughts and behaviors are almost always di...,0.70,0.197288
3,Avoiding external reminders of the stressful e...,Some people can make me aware of them just by ...,0.00,0.114830
4,I sometimes jump quickly from one topic to ano...,I have trouble following conversations with ot...,0.25,0.433558
...,...,...,...,...
583,I often ramble on too much when speaking.,I have had the momentary feeling that someone'...,0.01,0.049073
584,I sometimes jump quickly from one topic to ano...,Avoiding external reminders of the experience ...,0.40,0.114422
585,Being so restless that it is hard to sit still?,I believe that dreams have magical properties.,0.37,0.066781
586,Trouble relaxing?,"Throughout my life, very few things have been ...",0.00,0.244570


In [None]:
df_test["residual"] = df_test.y_pred - df_test.score

In [None]:
df_test

Unnamed: 0,sentence1,sentence2,score,y_pred,residual
0,I sometimes jump quickly from one topic to ano...,People find my conversations to be confusing o...,0.70,0.700950,0.000950
1,"Trouble concentrating on things, such as readi...",Felt nervous or anxious?,0.00,0.289797,0.289797
2,Loss of interest in activities that you used t...,My thoughts and behaviors are almost always di...,0.70,0.197288,-0.502712
3,Avoiding external reminders of the stressful e...,Some people can make me aware of them just by ...,0.00,0.114830,0.114830
4,I sometimes jump quickly from one topic to ano...,I have trouble following conversations with ot...,0.25,0.433558,0.183558
...,...,...,...,...,...
583,I often ramble on too much when speaking.,I have had the momentary feeling that someone'...,0.01,0.049073,0.039073
584,I sometimes jump quickly from one topic to ano...,Avoiding external reminders of the experience ...,0.40,0.114422,-0.285578
585,Being so restless that it is hard to sit still?,I believe that dreams have magical properties.,0.37,0.066781,-0.303219
586,Trouble relaxing?,"Throughout my life, very few things have been ...",0.00,0.244570,0.244570


In [None]:
# Mean squared error of the predictions on the test set
np.mean(df_test.residual ** 2)

np.float64(0.06922512204080025)

In [None]:
# Mean absolute error of the predictions on the test set
np.mean(np.abs(df_test.residual))

np.float64(0.20813965150724928)
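
The two figures above are the mean squared error and the mean absolute error of the predictions. As a small worked example, the same metrics (plus the root mean squared error) can be computed by hand on the first five rows of the test table shown earlier. These five rows are only a sample, so the results differ from the full-set values.

```python
import math

# (score, y_pred) for rows 0-4 of the test table shown earlier.
sample_rows = [
    (0.70, 0.700950),
    (0.00, 0.289797),
    (0.70, 0.197288),
    (0.00, 0.114830),
    (0.25, 0.433558),
]

residuals = [pred - score for score, pred in sample_rows]

mse = sum(r * r for r in residuals) / len(residuals)    # mean squared error
mae = sum(abs(r) for r in residuals) / len(residuals)   # mean absolute error
rmse = math.sqrt(mse)                                   # same units as the score

print(round(mse, 4), round(mae, 4), round(rmse, 4))

# RMSE of the full test set, derived from the MSE reported above.
full_rmse = math.sqrt(0.06922512204080025)
print(round(full_rmse, 3))
```

The RMSE is often easier to read than the MSE because it is expressed on the same 0-to-1 scale as the similarity scores.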