# Fine-tuning a custom Sentence Transformers model using synthetic data

This notebook shows, at a high level, how to define a pipeline that uses an LLM to generate synthetic datasets for training or fine-tuning [Sentence Transformers](https://sbert.net) models for a custom domain.

## Why fine-tune? 

There are already many good open source embedding models you can use, but you may:

- work in a specific domain where existing embeddings might not work super well
- have a specific concept of similarity you want to capture
- want to optimize for a particular task 

In all of these cases, even a little fine-tuning might help. 

## How to get custom data? 

One of the main barriers to fine-tuning a custom model has been the cost and effort involved in creating the datasets needed for this training.
Recently, there has been an increased usage of LLMs for generating synthetic datasets.
We'll see in this notebook how we can use an LLM for creating training datasets for fine-tuning a sentence similarity model. 


## Creating sentence transformers compatible synthetic datasets using distilabel

We can approach generating synthetic datasets for training and fine-tuning sentence similarity/embedding models in a variety of different ways. 
Here we focus on generating data that is compatible with the recently introduced Sentence Transformers training API. 
We'll be using a library called `distilabel` to define and run our pipeline. 

For hosting the models we're using as part of our pipeline we'll be using Hugging Face [Inference Endpoints](https://huggingface.co/inference-endpoints/dedicated). 

Let's start by installing the libraries we need:

In [1]:
%pip install distilabel huggingface_hub transformers

Collecting distilabel
  Downloading distilabel-1.1.1-py3-none-any.whl (235 kB)
Collecting datasets>=2.14.0 (from distilabel)
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting httpx>=0.25.2 (from distilabel)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting multiprocess>=0.70 (from distilabel)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.14.0->distilabel)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)

We also need to authenticate with Hugging Face so we can push datasets to the Hub and consume the endpoints where our models are hosted.

In [2]:
from huggingface_hub import login

In [3]:
login()


## The data we're creating new embeddings for

As was mentioned above, you may want to create custom fine-tuned embeddings for a variety of different purposes. In this example we'll work with a subset of [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k). This is a dataset for training instruct models for code generation. 

In this example we'll focus on building a dataset for training a sentence similarity model that does a good job of encoding the similarity between natural language prompts aimed at generating code. Let's take a look at a few examples.

In [19]:
from datasets import load_dataset

In [22]:
ds = load_dataset("davanstrien/self-oss-instruct-sc2-exec-filter-50k-short", split="train")

We can see that there are various columns we could focus on:

In [23]:
ds

Dataset({
    features: ['fingerprint', 'sha1', 'seed', 'response', 'concepts', 'prompt', 'instruction', 'id'],
    num_rows: 2365
})

We'll focus on the `instruction` column. Let's look at a few examples:

In [25]:
ds[:3]['instruction']

['Write a Python function that checks if an object is a subclass of a given class. The function should safely handle exceptions and return `False` if an exception is raised.',
 'Create a Python function that takes a list of `uploaded_files` and modifies the `upload_path` field of each object to remove the extension. The function should return a list of modified objects.',
 "For a given dictionary, recursively flatten all nested dictionaries to a single dictionary. For example, given `{'a': 1, 'b': {'c': 2, 'd': 3}}`, the output should be `{'a': 1, 'b.c': 2, 'b.d': 3}`."]

So what we want is an embedding model that can tell us whether the prompt "write a function that sorts a generator and returns a list" is closer to "create a generator sorting function that responds with a list" or to "write a function that sorts a list and returns a new list".
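
To see why a naive notion of similarity can get this wrong, here is a minimal sketch that scores the three prompts above with a toy bag-of-words cosine similarity. This is only a stand-in for a real embedding model: the `bag_of_words` and `cosine` helpers are illustrative and are not part of `distilabel` or Sentence Transformers.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


anchor = "write a function that sorts a generator and returns a list"
paraphrase = "create a generator sorting function that responds with a list"
different = "write a function that sorts a list and returns a new list"

scores = {
    "paraphrase": cosine(bag_of_words(anchor), bag_of_words(paraphrase)),
    "different": cosine(bag_of_words(anchor), bag_of_words(different)),
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

With word overlap alone, the prompt that asks for something different scores higher than the true paraphrase, because it shares more words with the anchor. That is the kind of mistake we would like a fine-tuned model to avoid.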

We can use the `distilabel` library to help us generate synthetic data for this task. 

**Note** this example may not be relevant to your use case but the same principles can be adapted to any domain. 

# Using an LLM to create similar and dissimilar prompts

Since our goal is to train a sentence similarity model (i.e. a model that can take an input sentence, compare it to a list of other sentences, and return a similarity score for each), we need to generate data that contains both similar and dissimilar sentences.

In practice, to use the training API from the recent release of Sentence Transformers, we want a dataset with the following columns:

- `anchor`: the sentence we want to compare to the other sentences
- `positive`: a sentence that is similar to the anchor
- `negative`: a sentence that is dissimilar to the anchor
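
As a rough illustration of this layout, the sketch below builds two triplet rows as plain Python dictionaries and checks that each one has exactly the three expected columns. The sentences and the `is_valid_triplet` helper are made up for illustration; a list like this could then be loaded with `datasets.Dataset.from_list`.

```python
REQUIRED_COLUMNS = ("anchor", "positive", "negative")

# Hypothetical rows: the sentences are invented for illustration.
triplets = [
    {
        "anchor": "Write a function that reverses a string.",
        "positive": "Create a function that returns a string backwards.",
        "negative": "Write a function that reverses a list of strings.",
    },
    {
        "anchor": "Write a function that counts the vowels in a string.",
        "positive": "Create a function that returns how many vowels a string has.",
        "negative": "Write a function that removes the vowels from a string.",
    },
]


def is_valid_triplet(row: dict) -> bool:
    """True if the row has exactly the three columns, each a non-empty string."""
    return set(row) == set(REQUIRED_COLUMNS) and all(
        isinstance(row[column], str) and bool(row[column].strip())
        for column in REQUIRED_COLUMNS
    )


print(all(is_valid_triplet(row) for row in triplets))
```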

In our case the `anchor` will be the original prompt. So how do we create the `positive` and `negative` examples? Spoiler, this is where we'll use an LLM.

Prompting an LLM with a prompt and asking it to generate similar and dissimilar prompts is a sensible way to generate synthetic data for this task. Here's an example of the kind of prompt we might use:


In [36]:

def format_prompt(text: str) -> str:
    return f"""
Here is a natural language query from a user for writing Python code:
<text>
{text}
</text>

<task>
Your role is to rewrite this text command to create both similar and dissimilar examples.
"""


In [37]:
print(format_prompt("Create a list of all the even numbers from 1 to 10."))


Here is a natural language query from a user for writing Python code:
<text>
Create a list of all the even numbers from 1 to 10.
</text>

<task>
Your role is to rewrite this text command to create both similar and dissimilar examples.



## Testing out the prompts

In practice, HuggingChat was used quite heavily for testing out different prompts, but we can also quickly test out prompts in this notebook.

In [43]:
from huggingface_hub import InferenceClient

In [44]:
client = InferenceClient("meta-llama/Meta-Llama-3-8B-Instruct")

In [45]:
print(client.text_generation(format_prompt("Create a list of all the even numbers from 1 to 10."),return_full_text=False))

</task>

Here are the rewritten examples:

1. **Similar example:**
Create a list of all the even numbers from 1 to 20.
2. **Dissimilar example 1:**
Write a Python program to find the sum of all the odd numbers from 1 to 50.
3. **Dissimilar example 2:**
Create a list of all the prime numbers from 1 to 100.
4. **Dissimilar example 3:**
Write


Often when we're creating similarity datasets, we want to choose a "hard negative" i.e. a dissimilar sentence that is quite close to the anchor (whilst still being dissimilar along whatever lines of similarity we care about). We'll see how we can choose this "hard negative" in practice later on but in order to do this we need to be able to generate a larger number of similar and dissimilar sentences.

In [49]:

def format_prompt(text: str) -> str:
    return f"""
Here is a natural language query from a user for writing Python code:
<text>
{text}
</text>

<task>
Your role is to rewrite this text command to create both similar and dissimilar examples.

1. Generate three 'good' examples where the text command has the same meaning and intent but is phrased differently. Vary the phrasing, terminology, or structure while preserving the original meaning.
2. Generate three 'bad' examples where the text command significantly changes in meaning or intent, enough to alter what an appropriate Python function would do.
<details>
- For the 'good' examples, imagine all the functions resulting from the rephrased prompt would pass the same test cases.
- For the 'bad' examples, imagine the functions returned from the response to the rephrased prompt would fail the test cases.
- The length of the generated examples should be similar to the original text.
</details>
</task>
"""


Let's see what this looks like with an example from our dataset. 

In [50]:
print(client.text_generation(format_prompt(ds[0]["instruction"]),return_full_text=False))

</text>

Here are the generated examples:

**Good Examples**

1. <text>
Write a Python function that determines whether an object is a direct or indirect subclass of a specified class. The function should handle potential exceptions and return `False` if an exception occurs.
</text>

2. <text>
Create a Python function that checks if an object inherits from a given class, either directly or indirectly. The function should be robust and return `False` in case of an exception.
</text


## Making sure we have the number of examples we want

Since we want to create a dataset with a certain number of examples we need to make sure we have enough examples for each instance. We could do this step after we've generated the data but we can also use "structured text generation" to generate the exact number of examples we want.

We can use Structured Text Generation via Inference API models which are hosted using Text Generation Inference. We won't discuss how this works under the hood in this post (see https://huggingface.co/docs/text-generation-inference/conceptual/guidance for a nice guide on this). We'll instead focus on how we can use this to improve the results we're getting from our open LLM.

When doing structured text generation we use something known as a "grammar" to specify what we want our output to look like. There are various ways of creating these, but one way is to use a Pydantic model. Pydantic is a widely used data validation library for Python which checks that data fits a certain format. It was originally designed more for validating data coming in via APIs and similar sources, but it is also very useful in the context of LLMs.

In this Pydantic model we can specify the structure of the data we want to generate. We say we want two keys `good` and `bad` and we want the value of each of these keys to be a list of strings (with some constraints on the minimum length of these strings).

In [52]:
from pydantic import BaseModel, conlist, constr

class Prompts(BaseModel):
    good: conlist(constr(min_length=100), min_length=2, max_length=2)  # type: ignore
    bad: conlist(constr(min_length=100), min_length=2, max_length=2)  # type: ignore


schema = Prompts.model_json_schema()


In [53]:

def format_prompt(text: str) -> str:
    return f"""
Here is a natural language prompt from a user for writing Python code: 

"{text}"

Task:
Your role is to rewrite this prompt to create both similar and dissimilar examples.

1. Generate 2 'good' examples where the prompt has the same meaning and intent but is phrased differently. 
   - Vary the phrasing, terminology, or structure while preserving the original meaning.
   - The functions resulting from the rephrased prompts should pass the same test cases as the original.

2. Generate 2 'bad' examples where the prompt command significantly changes in meaning or intent.
   - The changes should be substantial enough to alter what an appropriate Python function would do.
   - The functions returned from the response to the rephrased prompts should fail the test cases of the original.

Additional guidelines:
- The length of the generated examples should be similar to the original text.
- Ensure the 'bad' examples are reasonable prompts, but with a different meaning or intent.

Return your examples as a JSON object with the keys 'good' and 'bad', and the rewritten prompts as an array. Use the following JSON schema:

{schema}
"""

In [57]:
print(
    client.text_generation(
        format_prompt(ds[0]["instruction"]),
        return_full_text=False,
        grammar={"type": "json", "value": Prompts.model_json_schema()},
        max_new_tokens=800,
    )
)
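
The call above returns the model's output as a JSON string. A minimal, standard-library sketch of checking such a string by hand is shown below. It mirrors the constraints encoded in `Prompts` (two `good` and two `bad` strings of at least 100 characters each). The `check_generation` helper and the sample payload are hypothetical; in the actual pipeline the Pydantic model does this job.

```python
import json
from typing import Optional


def check_generation(raw: str, n_items: int = 2, min_chars: int = 100) -> Optional[dict]:
    """Return the parsed payload if it satisfies the constraints, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None
    for key in ("good", "bad"):
        items = data.get(key)
        if not isinstance(items, list) or len(items) != n_items:
            return None  # wrong number of examples
        if not all(isinstance(s, str) and len(s) >= min_chars for s in items):
            return None  # an example is too short or not a string
    return data


# Hypothetical payload, padded by repetition so every string passes the length check.
sample = json.dumps(
    {
        "good": ["Rephrased prompt with the same intent. " * 3] * 2,
        "bad": ["Prompt that asks for something different. " * 3] * 2,
    }
)
print(check_generation(sample) is not None)
print(check_generation('{"good": ["too short"], "bad": []}') is not None)
```

Returning `None` for anything malformed makes it easy to skip bad generations rather than letting them reach the training data.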

This roughly covers the main parts of the "generation" part of the pipeline. 

## Mining hard negatives

Once we have generated some candidate similar and dissimilar sentences, we can use a similarity model to find the "hard negatives", i.e. the dissimilar sentences that are closest to the anchor. For this step we'll use an Inference Endpoint hosted version of the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model. We could also use a different model for this step, but this model is very cheap to run and should work fine as a starting point for this task.

In `distilabel` we have the concept of `Step`s, which are the main building blocks of a pipeline. We can define a custom step that takes in a list of candidate similar and dissimilar sentences and returns the hard negatives. Roughly, the step below does the following:

- Loads the client for the similarity model
- Loops through the candidate similar and dissimilar sentences
- Checks that the data generated by the LLM matches the Pydantic model we defined earlier and is valid JSON
- Sends the data to the similarity model
- Returns the hard negatives


```python
@step(
    inputs=["generation", "instruction"],
    outputs=["positive", "negative"],
)
def mine_hard_negative(inputs: StepInput) -> StepOutput:
    """Mine hard negative examples for the generation."""
    # Initialize the inference client
    client = InferenceClient(
        model=EMBEDDING_MODEL_ENDPOINT_URL,
        token=get_token(),
    )
    clean = []
    for input in inputs:
        try:
            original_text = input["instruction"]
            data = json.loads(input["generation"])
            # Validate the data matches the schema
            try:
                _ = Prompts(**data)
            except Exception:
                # Skip the input if it doesn't match the schema
                continue
            # Select a random positive example
            positive = random.choice(data["good"])
            negative_candidates = data["bad"]
            # Find the most similar negative example
            embeddings = client.sentence_similarity(
                original_text, negative_candidates
            ).get("similarities")
            most_similar = negative_candidates[embeddings.index(max(embeddings))]
            negative = most_similar
            input["positive"] = positive
            input["negative"] = negative
            clean.append(input)
        except Exception as e:
            print(e)
            continue
    yield clean
```
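
The selection logic at the heart of this step can be exercised without an endpoint. The sketch below assumes the similarity scores have already been computed (the numbers here are invented) and uses a hypothetical `pick_hard_negative` helper to return the candidate that scores closest to the anchor.

```python
from typing import Sequence


def pick_hard_negative(candidates: Sequence[str], similarities: Sequence[float]) -> str:
    """Return the negative candidate with the highest similarity to the anchor."""
    if not candidates or len(candidates) != len(similarities):
        raise ValueError("need one similarity score per candidate")
    best_index = max(range(len(candidates)), key=lambda i: similarities[i])
    return candidates[best_index]


negative_candidates = [
    "Write a function that sorts a list and returns a new list.",
    "Write a function that downloads a file from a URL.",
]
made_up_scores = [0.83, 0.21]  # invented values, not real model output

print(pick_hard_negative(negative_candidates, made_up_scores))
```

Picking the highest-scoring negative is what makes it "hard": it is the dissimilar sentence the current embedding model finds most difficult to tell apart from the anchor.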

You can see the full pipeline code in the `pipeline.py` file in the repository (we'll grab this later) but the main steps are as follows:

- load the data (in this case the dataset we're using)
- generate similar and dissimilar prompts using an LLM
- mine hard negatives using a similarity model
- remove the columns we don't need when training the model


```python
    with Pipeline(
        name="create-embeddings",
        description="Create embeddings for text data",
    ) as pipeline:
        load_data = LoadHubDataset(
            name="load_dataset",
            output_mappings=column_name_mapping,
        )
        format_input = format_prompts(name="format_input")
        text_generation = TextGeneration(
            name="paraphrase_text",
            llm=llm,
            input_batch_size=llm_inference_batch_size,
        )
        select_sentences = mine_hard_negative(name="select_sentences")
        columns_to_keep = KeepColumns(
            columns=["text", "positive", "negative"],
            output_mappings={"text": "anchor"},
        )
        # assemble the pipeline
        (
            load_data
            >> format_input
            >> text_generation
            >> select_sentences
            >> columns_to_keep
        )

    return pipeline
```    
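
To make the final step concrete, here is a plain-Python sketch of what keeping three columns and renaming `text` to `anchor` does to a single row. It does not use `distilabel`; the `keep_and_rename` helper and the sample row are hypothetical and only imitate the effect of that last step.

```python
from typing import Dict, Iterable, Mapping


def keep_and_rename(
    row: Mapping[str, str],
    columns: Iterable[str],
    renames: Mapping[str, str],
) -> Dict[str, str]:
    """Keep only `columns` from `row`, renaming keys listed in `renames`."""
    return {renames.get(column, column): row[column] for column in columns}


# Hypothetical row as it might look just before the final step.
row = {
    "text": "Write a function that reverses a string.",
    "generation": "(raw LLM output would be here)",
    "positive": "Create a function that returns a string backwards.",
    "negative": "Write a function that reverses a list of strings.",
}

training_row = keep_and_rename(
    row,
    columns=["text", "positive", "negative"],
    renames={"text": "anchor"},
)
print(list(training_row))
```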

# Run the pipeline 

In practice it's often sensible to run the pipeline on a machine which won't time out but we can also run it in this Colab notebook.

To run the pipeline we download two files. One defines the `distilabel` pipeline we're using to generate our embedding dataset; the other defines a custom LLM which allows us to use our structured generation schema. If you want to modify the prompt or other details of how the pipeline works, you can do so in the `pipeline.py` file.

In [36]:
!wget https://raw.githubusercontent.com/davanstrien/awesome-synthetic-datasets/main/examples/embedding-datasets/custom_llm.py
!wget https://raw.githubusercontent.com/davanstrien/awesome-synthetic-datasets/main/examples/embedding-datasets/pipeline.py

### Define some parameters

We need to define a few parameters for the pipeline to run.

- `EMBEDDING_MODEL_ENDPOINT_URL`: the URL of the Inference Endpoints hosted version of the embedding model we use to mine our hard negatives. You can use a different model if you prefer, but I used [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) since it's very cheap to host.
- `INPUT_DATASET_ID`: the dataset we use as a starting point. **Note** we're using a dataset already hosted on the Hub, but you could also use a local dataset.
- `OUTPUT_DATASET_ID`: the ID on the Hub where we want to store the output of the pipeline.
- `NUM_EXAMPLES`: the number of examples we want to generate. It is worth starting with a small number to check the quality of the data before generating a large number of examples.
- `MODEL_ID`: the ID of the model we're using for the LLM (if we're using a dedicated Inference Endpoint model we can leave this as `None`).
- `TEXT_COLUMN_NAME`: the column in the input dataset that contains the prompts (here, `instruction`).
- `END_POINT_NAME`: the name of the endpoint where the model is hosted.

In [3]:
EMBEDDING_MODEL_ENDPOINT_URL = (
    "https://tmu6gkvjx3vvppfl.us-east-1.aws.endpoints.huggingface.cloud"
)

INPUT_DATASET_ID = "davanstrien/self-oss-instruct-sc2-exec-filter-50k-short"
OUTPUT_DATASET_ID = "davanstrien/similarity-dataset-sc2-8b-test-test-test"
NUM_EXAMPLES = 2  # set to None to use full dataset
MODEL_ID = None
TEXT_COLUMN_NAME = "instruction"
END_POINT_NAME = "meta-llama-3-8b-instruct-aeu"

We can then import the main run command from the `pipeline.py` file and run the pipeline. 

In [6]:
from pipeline import run_pipeline

In [7]:
run_pipeline(
    INPUT_DATASET_ID,
    output_dataset_id=OUTPUT_DATASET_ID,
    endpoint_name=END_POINT_NAME,
    num_examples=NUM_EXAMPLES,
    text_column_name=TEXT_COLUMN_NAME,
)



Generating train split: 0 examples [00:00, ? examples/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/3.22k [00:00<?, ?B/s]
