# 🦙 ⚗️ Synthetic data generation with Llama 3.1 405B and distilabel

This notebook shows how to generate synthetic datasets with the new Llama 3.1 models and [distilabel](https://github.com/argilla-io/distilabel), an open-source framework for synthetic data generation.

Thanks to the new 3.1 license, you can now build synthetic datasets to fine-tune smaller, more specialized models using the larger 405B and 70B Llama models.

Synthetic data generation is a broad topic, and many exciting developments and libraries have come out in the past months. distilabel enables you to implement end-to-end data generation pipelines, covering different stages and use cases, such as:

- [Generating](https://distilabel.argilla.io/latest/components-gallery/tasks/genstruct/) and [evolving instructions](https://distilabel.argilla.io/latest/components-gallery/tasks/evolinstruct/).
- [Generating and selecting data](https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/deita/?h=deita) for supervised fine-tuning.
- Rating responses for [preference tuning](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/?h=ultrafeedback) with [LLM-as-a-judge methods](https://distilabel.argilla.io/latest/components-gallery/tasks/prometheuseval/?h=prometheus).

In this notebook, you'll learn the basics of distilabel by generating a preference dataset from scratch using Hugging Face Inference Endpoints. Besides Inference Endpoints, distilabel provides many [out-of-the-box options](https://distilabel.argilla.io/latest/components-gallery/llms/) for running LLM inference, from running local models to using inference providers.

Let's get started 🚀
## Install distilabel
First you need to install distilabel and the inference endpoints dependencies.

In [None]:
!pip install distilabel[hf-inference-endpoints] -U -qqq


## Log in to the Hugging Face Hub

You need to log in to be able to use Inference Endpoints. Use a token with enough rights to run Inference Endpoints. If you don't have a token, you can generate one [here](https://huggingface.co/settings/tokens).

In [None]:
from huggingface_hub import login

login()


## Quickstart

Let's start with a quick example: a pipeline to build a preference dataset with the following steps:

- Load a dataset with instructions from the Hugging Face Hub using the `LoadDataFromHub` [step](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/).
- For each prompt, generate two responses using the `TextGeneration` task with the `InferenceEndpointsLLM` [LLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/) and the 405B and 70B models.
- Combine the two responses into a list of responses using the `CombineColumns` [step](https://distilabel.argilla.io/latest/components-gallery/steps/combinecolumns/).
- Compare and rate the responses using the `UltraFeedback` [llm-as-a-judge task](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/) with the 405B model.
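
To make the combine step above more concrete, here is a minimal plain-Python sketch of its effect: two rows for the same instruction, one per generator, are merged into list-valued columns. The `combine_columns` helper and the sample values are hypothetical and only illustrate the idea; this is not distilabel's implementation.

```python
# Illustrative only: mimics the effect of the combine step in plain Python.
# `combine_columns` is a hypothetical helper, not part of distilabel.

def combine_columns(rows, columns, output_columns):
    """Merge the given columns of several rows into list-valued columns."""
    combined = {}
    for column, output_column in zip(columns, output_columns):
        combined[output_column] = [row[column] for row in rows]
    return combined


# One row per generator for the same instruction (made-up values).
row_70b = {"generation": "Response from the 70B model", "model_name": "llama-70b"}
row_405b = {"generation": "Response from the 405B model", "model_name": "llama-405b"}

merged = combine_columns(
    [row_70b, row_405b],
    columns=["generation", "model_name"],
    output_columns=["generations", "model_names"],
)
print(merged)
```

The rating step then receives a single row whose `generations` column holds both responses, so it can compare them side by side.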

The input dataset with instructions is shown below. The pipeline will use the `instruction` column to generate responses with Llama 3.1 models.

In [None]:
from IPython.display import HTML
iframe_html = """
<iframe src="https://huggingface.co/datasets/argilla/10Kprompts-mini/embed/viewer/train" width="80%" height="560px"></iframe>
"""
display(HTML(iframe_html))

Now let's run the pipeline:

In [None]:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration, UltraFeedback
from distilabel.steps import CombineColumns

llama70B = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
)
llama405B = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
)

with Pipeline(name="synthetic-data-with-llama3") as pipeline:

    # load dataset with prompts
    load_dataset = LoadDataFromHub(
        repo_id= "argilla/10Kprompts-mini"
    )

    # generate two responses
    generate = [
        TextGeneration(llm=llama70B),
        TextGeneration(llm=llama405B)
    ]

    # combine responses into one col
    combine = CombineColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"]
    )

    # rate responses with 405B LLM-as-a-judge
    rate = UltraFeedback(aspect="overall-rating", llm=llama405B)

    # connect the steps to define the pipeline (it runs in the next cell)
    load_dataset >> generate >> combine >> rate

In [None]:
distiset = pipeline.run(use_cache=False)

In [None]:
distiset['default']['train'].to_pandas().head()

| | instruction | topic | generations | distilabel_metadata | model_names | ratings | rationales | model_name |
|---|---|---|---|---|---|---|---|---|
| 0 | How can I create an efficient and robust workf... | Software Development | [To create an efficient and robust workflow th... | {'raw_output_ultra_feedback_0': '#### Output f... | [llhf/Meta-Llama-3.1-70B-Instruct, sllhf/Meta-... | [5.0, 1.0] | [The output provides a clear, step-by-step gui... | sllhf/Meta-Llama-3.1-405B-Instruct-FP8 |
| 1 | Is it possible to convert DC welding machine t... | Literature and Arts | [While it's technically possible to modify a D... | {'raw_output_ultra_feedback_0': '#### Output f... | [llhf/Meta-Llama-3.1-70B-Instruct, sllhf/Meta-... | [4.0, 3.0] | [The text provides accurate and informative co... | sllhf/Meta-Llama-3.1-405B-Instruct-FP8 |
| 2 | Delete a part of the sentence that does not fi... | Science and Technology | [The part of the sentence that does not fit th... | {'raw_output_ultra_feedback_0': '#### Output f... | [llhf/Meta-Llama-3.1-70B-Instruct, sllhf/Meta-... | [5.0, 4.0] | [The output accurately identifies the part of ... | sllhf/Meta-Llama-3.1-405B-Instruct-FP8 |
| 3 | Construct a daily schedule that allocates exac... | Health and Wellness | [Here is a daily schedule that allocates exact... | {'raw_output_ultra_feedback_0': '#### Output f... | [llhf/Meta-Llama-3.1-70B-Instruct, sllhf/Meta-... | [4.0, 3.0] | [The schedule provided is generally accurate a... | sllhf/Meta-Llama-3.1-405B-Instruct-FP8 |
| 4 | If a particular argument hinges on an anecdota... | Others | [If an argument hinges on anecdotal evidence, ... | {'raw_output_ultra_feedback_0': '#### Output f... | [llhf/Meta-Llama-3.1-70B-Instruct, sllhf/Meta-... | [5.0, 1.0] | [The text provides accurate and informative co... | sllhf/Meta-Llama-3.1-405B-Instruct-FP8 |


Optionally, we can push the dataset to the Hub:

In [None]:
from google.colab import userdata

# set a secret in colab with enough rights to write repos
hf_token = userdata.get('HF_TOKEN')

distiset.push_to_hub(
    "argilla/synthetic-data-generation-with-llama3-405B",
    token=hf_token,
    private=True
)

You can now explore the resulting dataset below. The most relevant columns are:

- `generations`: A list of the two generated responses (70B and 405B), generated in the `generate` step.
- `ratings`: A list with a rating for each response, generated by the `rate` step.
- `rationales`: A list with rationale for the rating of each response, generated by the `rate` step.
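
One common way to use these columns, which this notebook does not do itself, is to turn each row into a chosen/rejected pair for preference tuning. The sketch below assumes a row shaped like the columns listed above; `to_preference_pair` is a hypothetical helper, and the handling of ties and missing ratings is one possible choice among several.

```python
# Illustrative only: `to_preference_pair` is a hypothetical helper.
# It assumes `generations` and `ratings` are aligned lists, and that a
# rating may be missing (None), e.g. if the judge output could not be parsed.

def to_preference_pair(row):
    """Return a chosen/rejected pair, or None on a tie or too few ratings."""
    scored = [
        (rating, generation)
        for rating, generation in zip(row["ratings"], row["generations"])
        if rating is not None
    ]
    if len(scored) < 2:
        return None
    best = max(scored, key=lambda pair: pair[0])
    worst = min(scored, key=lambda pair: pair[0])
    if best[0] == worst[0]:
        return None  # a tie carries no preference signal
    return {
        "instruction": row["instruction"],
        "chosen": best[1],
        "rejected": worst[1],
    }


# Made-up row with the same shape as the dataset columns.
row = {
    "instruction": "Explain recursion.",
    "generations": ["Response A", "Response B"],
    "ratings": [5.0, 1.0],
}
pair = to_preference_pair(row)
print(pair)
```

Dropping ties keeps only rows where the judge expressed a clear preference, at the cost of a smaller dataset.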

In [None]:
from IPython.display import HTML
iframe_html = """
<iframe src="https://huggingface.co/datasets/argilla/synthetic-data-generation-with-llama3-405B/embed/viewer/train" width="80%" height="560px"></iframe>
"""
display(HTML(iframe_html))

🎉 Congrats! You've generated your first synthetic dataset with distilabel and Llama 3.1 405B.

The next section covers how to further configure the pipeline and introduces other useful out-of-the-box steps offered by distilabel.

## Advanced usage

The example above is simple but effective, and it gave us a nice dataset. We can try to improve it by tweaking the generation parameters of the LLMs, or even by combining a few LLMs to generate better texts. Let's see how!

### Tweaking the generation parameters

We can define new generation parameters for both the models we used to generate texts (`Llama 3.1 70B Instruct` and `Llama 3.1 405B Instruct`) and the model (`Llama 3.1 405B Instruct`) we used to rate those generations using the `parameters` argument of the `run` method:


1. For the Llamas that generate text with the `TextGeneration` task, we cap the output at `512` tokens with the `max_new_tokens` parameter (the default was `128`) and set the `temperature` to `0.7`, which keeps enough randomness in token sampling to produce richer, more creative texts.
2. For the Llama that rates the generations with the `UltraFeedback` task, we set the maximum number of generated tokens to `2048`, because the LLM has to produce a rationale and a score for each generation (two in this case) and needs enough tokens to do so. Since this LLM is annotating the generations, we want it to be as deterministic as possible, so we set the `temperature` to `0.1`.

In most cases, setting `max_new_tokens` and `temperature` is enough to get the results we want, but we can define [many more parameters](https://distilabel.argilla.io/latest/api/llm/huggingface/#distilabel.llms.huggingface.inference_endpoints.InferenceEndpointsLLM.agenerate), such as `top_p` and `top_k`, to further control how tokens are sampled.
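
To build some intuition for the `temperature` values above, here is a small sketch of temperature-scaled softmax. It assumes the usual formulation, where logits are divided by the temperature before normalization; the logits are made up, and the actual sampling code behind the endpoint may differ.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for temperature in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At `0.1` almost all of the probability mass lands on the top-scoring token, which is why a low temperature suits the judge. At `0.7` the alternatives keep a meaningful share, so repeated generations can differ.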



In [None]:
parameters={
    # Llama 3.1 70B Instruct used for text generation
    generate[0].name: {
        "llm": {
            "generation_kwargs": {
                "max_new_tokens": 512,
                "temperature": 0.7,
            }
        }
    },
    # Llama 3.1 405B Instruct used for text generation
    generate[1].name: {
        "llm": {
            "generation_kwargs": {
                "max_new_tokens": 512,
                "temperature": 0.7,
            }
        }
    },
    # Llama 3.1 405B Instruct used for judging responses
    rate.name: {
        "llm": {
            "generation_kwargs": {
                "max_new_tokens": 2048,
                "temperature": 0.1
            }
        }
    }
}

### Testing the new generation parameters with `dry_run`

When trying new parameters (or running the pipeline for the first time), it's not ideal to process the whole dataset right away: the run could fail or produce results we don't expect, wasting money and time.

To test that everything works as expected with a small subset of the dataset, we can use the `dry_run` method:

In [None]:
distiset = pipeline.dry_run(parameters=parameters)

Cool! It worked flawlessly! Now that we're sure, let's execute the pipeline again but this time with the entire dataset.

In [None]:
distiset = pipeline.run(parameters=parameters, use_cache=False)

### Combining a few LLMs to generate better responses

We can go a step further and use a Mixture-of-Agents (MoA) approach, which combines a few LLMs to try to generate richer and better responses.

The idea behind MoA is quite simple:

1. A few LLMs, which we call proposers, each generate an output for a given input. We repeat this a certain number of times, providing the previous outputs in the system prompt. This little trick helps the LLMs generate better responses on every turn, even if the previous outputs are not very good. To cover as many fields as possible, it's better to use specialized LLMs as proposers.
2. A final LLM, which we call the aggregator, receives the outputs of the proposers and creates an aggregated final output. For the aggregator, we want an LLM that is proficient at generating text and at synthesizing the proposers' outputs into a high-quality response.
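
The two steps above can be sketched in plain Python with stub functions standing in for real models. Everything here is hypothetical (`mixture_of_agents`, `build_system_prompt`, the stub LLMs, and the prompt wording) and is only meant to show the control flow; the internals of distilabel's `MixtureOfAgentsLLM` may differ.

```python
# Illustrative only: an "LLM" here is any function that takes
# (system_prompt, user_prompt) and returns a string.

def build_system_prompt(previous_outputs):
    """Embed the previous round's outputs so the next LLM can improve on them."""
    if not previous_outputs:
        return ""
    numbered = [f"{i}. {text}" for i, text in enumerate(previous_outputs, start=1)]
    return "Previous responses:\n" + "\n".join(numbered)


def mixture_of_agents(prompt, proposers, aggregator, rounds=1):
    """Run the proposers for `rounds` turns, then let the aggregator synthesize."""
    previous_outputs = []
    for _ in range(rounds):
        system_prompt = build_system_prompt(previous_outputs)
        previous_outputs = [proposer(system_prompt, prompt) for proposer in proposers]
    return aggregator(build_system_prompt(previous_outputs), prompt)


# Stub "models" so the sketch runs without any API calls.
def code_expert(system_prompt, prompt):
    return f"code-expert answer to: {prompt}"


def generalist(system_prompt, prompt):
    return f"generalist answer to: {prompt}"


def stub_aggregator(system_prompt, prompt):
    proposals = [line for line in system_prompt.splitlines() if line[:1].isdigit()]
    return f"synthesis of {len(proposals)} proposals for: {prompt}"


final = mixture_of_agents(
    "Sort a list", [code_expert, generalist], stub_aggregator, rounds=2
)
print(final)
```

In a real setup each stub would be replaced by a call to a model, and the aggregator would see the proposers' full texts in its system prompt.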

So... what can be a good LLM to be used as an aggregator? 🤔

Yes, you guessed it! 🎉 `Llama 3.1 405B Instruct`

For the proposer LLMs, we will use the following models available with [Inference for PROs](https://huggingface.co/blog/inference-pro):

- `Code Llama Instruct`: a conversational code assistant. Good at coding 👨🏻‍💻
- `Llama 3.1 70B Instruct`: a good chat model that is good at everything.

In [None]:
from distilabel.llms import InferenceEndpointsLLM, MixtureOfAgentsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration, UltraFeedback
from distilabel.steps import CombineColumns

with Pipeline(name="synthetic-data-with-llama3-moa") as pipeline:
    load_dataset = LoadDataFromHub(
        repo_id= "argilla/10Kprompts-mini"
    )

    generate = TextGeneration(
        llm=MixtureOfAgentsLLM(
            proposers_llms=[
                InferenceEndpointsLLM(
                    model_id="codellama/CodeLlama-34b-Instruct-hf",
                    generation_kwargs={
                        "max_new_tokens": 1024,
                        "temperature": 0.7,
                    }
                ),
                InferenceEndpointsLLM(
                    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
                    generation_kwargs={
                        "max_new_tokens": 1024,
                        "temperature": 0.7,
                    }
                ),
            ],
            aggregator_llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
                generation_kwargs={
                    "max_new_tokens": 1024,
                    "temperature": 0.7,
                }
            )
        ),
        num_generations=2,
        group_generations=True,
    )

    combine = CombineColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"]
    )

    rate = UltraFeedback(aspect="overall-rating", llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
        generation_kwargs={
            "max_new_tokens": 2048,
            "temperature": 0.0,
        }
    ))

    load_dataset >> generate >> combine >> rate

In [None]:
distiset = pipeline.run(use_cache=False)

## What's next?

This notebook has only scratched the surface of what's possible with the new Llama 3.1 models and distilabel. There are many things to discover and experiment with.

This notebook uses Hugging Face Inference Endpoints for PROs. This is good for experimentation. For larger datasets we recommend using local LLMs, TGI, vLLM, and even the upcoming Ray integration for running data generation on GPU clusters.

Regarding the pipelines, the best place to discover out-of-the-box components is the [Component Gallery](https://distilabel.argilla.io/latest/components-gallery/).

But probably the biggest strength of distilabel is the ability to develop your own custom components on top of a scalable and robust data generation framework. You can read this [guide to get started](https://distilabel.argilla.io/latest/sections/how_to_guides/basic/step/#define-steps-for-your-pipeline).