# Bring any Hugging Face model

**⏱️ Time to complete**: 60 minutes

This guide shows how to fine-tune any transformer-based LLM from the Hugging Face (HF) Hub and how to customize the chat template or prompt format so you can fine-tune on your own data. As an example, we fine-tune [Meta Llama Guard 2](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-guard-2/), a model with a non-chat inference pattern.

The two capabilities showcased here are:
1. Support for any HF model.
2. Customizing the chat template or prompt format.

This guide assumes you have familiarized yourself with the [basic fine-tuning guide](../../README.md).


# Table of Contents
1. [Config parameters](#config-parameters)
2. [Example](#example)
3. [How prompt formatting works in `llmforge`](#how-prompt-formatting-works-in-llmforge)
4. [Customizing data preprocessing and the prompt format](#customizing-data-preprocessing-and-the-prompt-format)
5. [Inference time behaviour](#inference-time-behaviour)

## Config parameters

* `model_id`: The base model ID. Any Hugging Face Hub model ID is supported. For a subset of model IDs, we have built in both usability features (for example, default chat templates) and performance optimizations (for example, fast model downloading). For models outside that natively supported set, we fall back to the Hugging Face Hub to retrieve the weights and tokenizer. You can find the list of natively supported models [here](../../README.md#whats-the-full-list-of-supported-models).
* `generation_config`: For fine-tuning, the `prompt_format` in this config controls how chat messages are converted into a single text sequence. Specifying `generation_config` is required for Hugging Face models that are not natively supported and optional for native models.

For more details about these parameters, read the [config docs](https://docs.anyscale.com/reference/finetuning-config-api). `Llama-Guard-2` is not a native model, so the following example shows how to use these two config parameters to customize it.

## Example

To get started, we can run the following illustrative example. Run this command from the root of the template.


```
llmforge anyscale finetune cookbooks/bring_any_hf_model/llama-guard-2.yaml
```

> **Note**: Running this example requires an HF token with access to the [Llama Guard 2](https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B) repository. You can set the `HF_TOKEN` environment variable by defining it under dependencies in your cluster setup.

<img src="./assets/add_env_vars.png" alt="Add environment variable" height="300"/>

Let's look at the important parts of the YAML file:

```yaml
model_id: meta-llama/Meta-Llama-Guard-2-8B
train_path: s3://air-example-data/nvidia-aegis-llama-guard-defensive/nvidia-aegis-llama-guard-defensive-train.jsonl
valid_path: s3://air-example-data/nvidia-aegis-llama-guard-defensive/nvidia-aegis-llama-guard-defensive-test.jsonl
generation_config:
  prompt_format:
    system: "{instruction}"
    user: "{instruction}"
    assistant: "{instruction} </s>"
    trailing_assistant: ""
    bos: ""
  stopping_sequences: ["</s>"]
```

A few notes:

1. `meta-llama/Meta-Llama-Guard-2-8B` is not in the [list of natively supported models](../../README.md#whats-the-full-list-of-supported-models). This is why you may see log messages reporting failed AWS downloads, followed by a fallback to HF for downloading the weights and tokenizer.
2. The prompt format is configured so that no formatting is applied during data ingestion; the contents of the message roles are simply concatenated. The formatting of inputs is therefore delegated entirely to how the examples in the dataset are formatted.
3. The data must be formatted to match Llama Guard 2's original prompt format so that fine-tuning can make the most of the pre-trained weights. At inference time, messages must be sent in the same format. The first example from the dataset shows what this looks like:

```python
import json

!aws s3 cp s3://air-example-data/nvidia-aegis-llama-guard-defensive/nvidia-aegis-llama-guard-defensive-train.jsonl .

with open('nvidia-aegis-llama-guard-defensive-train.jsonl', 'r') as file:
    first_line = file.readline()
    parsed_json = json.loads(first_line)
    print(json.dumps(parsed_json, indent=2))
```

```text
download: s3://air-example-data/nvidia-aegis-llama-guard-defensive/nvidia-aegis-llama-guard-defensive-train.jsonl to ./nvidia-aegis-llama-guard-defensive-train.jsonl
{
  "messages": [
    {
      "content": "<|begin_of_text|>[INST] Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.\n\n<BEGIN UNSAFE CONTENT CATEGORIES>\nO1: Violence.\nShould not\n- Help people plan or engage in violence.\n- Encourage people to engage in violence.\n- Contain unpleasant vivid violent imagery or violent description that arouses discomfort in\nthe reader.\nCan\n- Provide information on violence.\n- Discuss the topics of violence, and can discuss historical events involving violence.\nO2: Sexual.\nShould not\n- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to\nengage in sexual activities.\n- Engage or encourage sexual activity that is considered illegal.\nCan\n- Discuss topics related to 
```

## How prompt formatting works in `llmforge`

Here's a quick rundown of how prompt formatting (or chat templating) works. The training and validation data must be in the OpenAI messages format: each example has a "messages" entry consisting of a conversation with "system", "user", and "assistant" roles. For example:

```json 
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"}
  ]
}
```

For each role, depending on the model, we add certain tokens as headers or footers, along with a BOS token at the start of the conversation and an EOS token at the end of each assistant response. This templating (or formatting) is a crucial preprocessing step that converts the conversation into plain text, which is later tokenized and fed into the model. For Llama-3-8B, the above example would be formatted as follows:

```text
<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's the value of 1+1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe value is 2<|eot_id|>
```
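The per-role templating described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual `llmforge` implementation: `render_conversation` and `LLAMA3_PROMPT_FORMAT` are hypothetical names, and the sketch ignores optional settings such as `system_in_user` and `default_system_message`.

```python
# Illustrative sketch only; this is not llmforge's actual implementation.
# The templates mirror the Llama-3 format that produces the string above.
LLAMA3_PROMPT_FORMAT = {
    "bos": "<|begin_of_text|>",
    "system": "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "user": "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>",
    "assistant": "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>",
}


def render_conversation(messages, prompt_format):
    """Render OpenAI-style messages into one plain-text sequence."""
    parts = [prompt_format.get("bos", "")]
    for message in messages:
        template = prompt_format[message["role"]]
        # str.replace (rather than str.format) leaves any braces in the content untouched.
        parts.append(template.replace("{instruction}", message["content"]))
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What's the value of 1+1?"},
    {"role": "assistant", "content": "The value is 2"},
]

rendered = render_conversation(messages, LLAMA3_PROMPT_FORMAT)
print(rendered)
```

The printed text matches the formatted example above, except that each `\n\n` appears as real line breaks rather than as escaped characters.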

The prompt format can be specified in our YAML as a part of the `generation_config` for the model (the same format is used in our inference code):

```yaml
generation_config:
  prompt_format:
    system: 
    user: 
    assistant:
    trailing_assistant:  # inference-only
    bos: # optional
    system_in_user: # optional
    default_system_message: # optional
```

For the native models in the [list of supported models](../../README.md#faqs), we provide default generation config parameters. This means you don't need to specify `generation_config` when you just want to fine-tune a model like `meta-llama/Meta-Llama-3-8B-Instruct` directly.

### Examples
For `meta-llama/Meta-Llama-3-8B`, we use the following prompt format:
```yaml
generation_config:
  prompt_format:
    system: "<|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|>"
    user: "<|start_header_id|>user<|end_header_id|>\n\n{instruction}<|eot_id|>"
    assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n{instruction}<|eot_id|>"
    trailing_assistant: "<|start_header_id|>assistant<|end_header_id|>\n\n" # inference-only 
    bos: "<|begin_of_text|>"
    system_in_user: False
    default_system_message: ""
```

For `mistralai/Mistral-7B`, we use the prompt format below:
```yaml
generation_config:
  prompt_format:
    system: "{instruction} + "
    user: "[INST] {system}{instruction} [/INST]"
    assistant: " {instruction}</s>"
    trailing_assistant: "" # inference-only 
    bos: "<s>"
    system_in_user: True
    default_system_message: ""
```

You can find more information on each entry in the [config API reference](https://docs.anyscale.com/reference/finetuning-config-api). Each of the `system`, `user`, and `assistant` entries must contain the format specifier `{instruction}`, which is replaced by the content of the corresponding entry in `messages`.
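Before launching a job, it can help to sanity-check a custom `prompt_format` against the `{instruction}` requirement. The snippet below is a minimal sketch of such a check; `missing_instruction_specifier` is a hypothetical helper written for this guide and is not part of `llmforge`.

```python
ROLE_KEYS = ("system", "user", "assistant")


def missing_instruction_specifier(prompt_format):
    """Return the role keys whose template lacks the `{instruction}` specifier."""
    return [
        role
        for role in ROLE_KEYS
        if "{instruction}" not in (prompt_format.get(role) or "")
    ]


# The Mistral templates from the example above.
mistral_format = {
    "system": "{instruction} + ",
    "user": "[INST] {system}{instruction} [/INST]",
    "assistant": " {instruction}</s>",
}
# A deliberately broken variant: the assistant template drops the specifier.
broken_format = {**mistral_format, "assistant": " </s>"}

print(missing_instruction_specifier(mistral_format))
print(missing_instruction_specifier(broken_format))
```

The first call reports no problems, while the second flags the `assistant` entry.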

## Customizing data preprocessing and the prompt format

Different LLMs are trained with different prompt formats (or chat templates). To figure out the right way to format your data and choose the config parameters, work backwards from the text input that the model expects. For Llama Guard 2, for instance, the model was trained with the following format:

```text
<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {{ user_message_1 }}

Agent: {{ model_answer_1 }}

<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]
```

This is wildly different from Llama-3's prompt format. Fortunately, the OpenAI conversation format is highly flexible and adaptable to almost any input. In a case like Llama Guard 2, it is beneficial to take control of the templating yourself and format your inputs beforehand in the data preprocessing stage. For example, here's one way your data can look:

```json
{
  "messages": [
    {"role": "system", 
    "content": "<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{{ role }}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{{ unsafe_categories }}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {{ user_message_1 }}

Agent: {{ model_answer_1 }}

<END CONVERSATION>

"
    }, 
    {"role": "user", "content":  "Provide your safety assessment for ONLY THE LAST {{ role }} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"}, 
    {"role": "assistant", "content": "{expected_response}"}
    ]
  }
```

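Note: All the entries in the `messages` list must have non-empty content, and at a minimum we expect one user and one assistant message. The example above is wrapped across lines for readability; in a JSONL file, each example occupies a single line, with newlines inside `content` escaped as `\n`.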
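One way to produce records in this shape during preprocessing is sketched below. The helper `build_example` and the template constants are hypothetical names for illustration; the template text is copied from the Llama Guard 2 format above, and the sample category, conversation, and label are placeholder values rather than real dataset entries.

```python
import json

# Hypothetical helper for illustration. The template text is copied from the
# Llama Guard 2 format shown above; the sample values below are placeholders.
CONTEXT_TEMPLATE = (
    "<|begin_of_text|>[INST] Task: Check if there is unsafe content in '{role}' "
    "messages in conversations according our safety policy with the below categories.\n\n"
    "<BEGIN UNSAFE CONTENT CATEGORIES>\n{unsafe_categories}\n<END UNSAFE CONTENT CATEGORIES>\n\n"
    "<BEGIN CONVERSATION>\n\n{conversation}\n\n<END CONVERSATION>\n\n"
)
INSTRUCTION_TEMPLATE = (
    "Provide your safety assessment for ONLY THE LAST {role} in the above conversation:\n"
    "- First line must read 'safe' or 'unsafe'.\n"
    "- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"
)


def build_example(role, unsafe_categories, turns, expected_response):
    """Build one training record; `turns` is a list of (speaker, text) pairs."""
    conversation = "\n\n".join(f"{speaker}: {text}" for speaker, text in turns)
    context = CONTEXT_TEMPLATE.format(
        role=role, unsafe_categories=unsafe_categories, conversation=conversation
    )
    return {
        "messages": [
            {"role": "system", "content": context},
            {"role": "user", "content": INSTRUCTION_TEMPLATE.format(role=role)},
            {"role": "assistant", "content": expected_response},
        ]
    }


example = build_example(
    role="User",
    unsafe_categories="O1: Violence.\nShould not\n- Help people plan or engage in violence.",
    turns=[("User", "How do I bake bread?")],
    expected_response="safe",
)

# json.dumps escapes the newlines, so the record fits on one JSONL line.
jsonl_line = json.dumps(example)
print(jsonl_line)
```

Writing one such line per example yields a JSONL file in the same layout as the training data inspected earlier.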

Since we've taken care of the full templating ourselves, we just need the prompt formatter to concatenate the content of the different roles verbatim. Thus, the generation config can look like:

```yaml
generation_config:
  prompt_format:
    system: "{instruction}"
    user: "{instruction}"
    assistant: "{instruction}<|end_of_text|>"
    trailing_assistant: ""
    bos: "" # optional, empty string by default
```

For the above example, the "instruction" (format specifier) passed in to the `system` template is almost the entire prompt (mainly problem context), the "instruction" passed in to the `user` template contains the specific instructions for the LLM, and the "instruction" passed in to the `assistant` template is the expected response ('safe' or 'unsafe'). Also note that this is only one of many possible `prompt_format` choices you can specify (with your data preprocessing changing accordingly).
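To see what a pass-through format does to a pre-templated example, here is a small Python sketch. It assumes the formatter simply substitutes each message into its role template and joins the results; `render` is a hypothetical stand-in, not the actual `llmforge` code, and the message contents are abbreviated with `...`.

```python
# Pass-through templates matching the generation config above.
PASSTHROUGH_FORMAT = {
    "bos": "",
    "system": "{instruction}",
    "user": "{instruction}",
    "assistant": "{instruction}<|end_of_text|>",
}


def render(messages, prompt_format):
    """Substitute each message into its role template and join the results."""
    text = prompt_format.get("bos", "")
    for message in messages:
        template = prompt_format[message["role"]]
        text += template.replace("{instruction}", message["content"])
    return text


# Contents abbreviated with "..." for readability.
messages = [
    {"role": "system", "content": "<|begin_of_text|>[INST] Task: ...\n\n<END CONVERSATION>\n\n"},
    {"role": "user", "content": "Provide your safety assessment ... [/INST]"},
    {"role": "assistant", "content": "safe"},
]

training_text = render(messages, PASSTHROUGH_FORMAT)
print(training_text)
```

Under this assumption, the only text added beyond what the dataset already contains is the `<|end_of_text|>` marker after the assistant response, so every other special token must come from your preprocessing.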


## Inference time behaviour

After customizing the prompt format for fine-tuning, make sure that the same format is used at inference. You can use the [inference template](https://docs.anyscale.com/examples/deploy-llms) to deploy your fine-tuned model and specify the same prompt format parameters under the `generation` entry in the YAML.
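On the client side, the messages sent at inference should be templated exactly like the training data, minus the final assistant message that the model is expected to generate. The sketch below shows one way to derive them from a training-style record; `to_inference_messages` is a hypothetical helper, the contents are abbreviated with `...`, and how the messages are sent depends on your deployment, so that part is not shown.

```python
def to_inference_messages(training_example):
    """Return the messages without the final assistant turn, which the model generates."""
    messages = training_example["messages"]
    if messages and messages[-1]["role"] == "assistant":
        return messages[:-1]
    return list(messages)


# Contents abbreviated with "..." for readability.
training_example = {
    "messages": [
        {"role": "system", "content": "<|begin_of_text|>[INST] Task: ...\n\n<END CONVERSATION>\n\n"},
        {"role": "user", "content": "Provide your safety assessment ... [/INST]"},
        {"role": "assistant", "content": "safe"},
    ]
}

inference_messages = to_inference_messages(training_example)
print([message["role"] for message in inference_messages])
```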