# Ⓜ️ Finetuning LLMs: Supervised Finetuning on Mistral 7B
In this tutorial, you will learn how to finetune a Large Language Model (LLM), Mistral 7B in particular, on a chat-style instruction dataset. We start with [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), an LLM that only does text completion, and we end up with our own [argilla/Mistral-7B-v0.1-chat-OIG](https://huggingface.co/argilla/Mistral-7B-v0.1-chat-OIG) model that faithfully follows instructions and acts as a helpful chat assistant.

This tutorial consists of the following steps:
1. Prepare a [FeedbackDataset](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#feedback-dataset) in Argilla. <!-- using a public chat-style instruction dataset-->
2. (Optional) Annotate instruction samples.
3. Set up the [ArgillaTrainer](https://docs.argilla.io/en/latest/practical_guides/fine_tune.html#supervised-finetuning) for Supervised Finetuning.
4. Perform inference using the finetuned LLM.
5. Publish the resulting model and dataset on the Hugging Face Hub.

## Introduction
### Mistral 7B
In this tutorial we will be finetuning Mistral 7B, which is a powerful LLM developed by the Mistral AI team featuring 7.3 billion parameters. It stands out for its exceptional performance relative to its size, outperforming larger models like Llama 2 13B and Llama 1 34B on various benchmarks. 

Another key aspect is its ability to perform well on longer sequences. Notably, it is released under the permissive Apache 2.0 license, which allows it to be used in commercial use cases with no strings attached.

## Running Argilla

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

**Deploy Argilla on Hugging Face Spaces:** If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

[![deploy on spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/new-space?template=argilla/argilla-template-space)

For details about configuring your deployment, check the [official Hugging Face Hub guide](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla).

**Launch Argilla using Argilla's quickstart Docker image**: This is the recommended option if you want [Argilla running on your local machine](../../getting_started/quickstart.ipynb). Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

<div class="alert alert-info">

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

- Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
- Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
</div>

### Install dependencies
Let's start by installing the required dependencies to run both Argilla and the remainder of this tutorial.

In [None]:
%pip install "argilla~=1.16.0" "transformers~=4.34.0" "datasets~=2.14.5" "peft~=0.5.0" "trl~=0.7.1" "wandb~=0.15.12"

Note that you must also install `torch` with `CUDA` support. The widget [here](https://pytorch.org/get-started/locally) should be helpful if `torch` is not already set up.

Let's import the Argilla module for reading and writing data:

In [1]:
import argilla as rg

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the `URL` and `API_KEY`:

In [None]:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="admin.apikey",
    workspace="admin"
)

## Preparing a FeedbackDataset in Argilla
In Argilla, the [FeedbackDataset](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#feedback-dataset) is a powerful and highly configurable class that defines the annotation process. In particular, we define [fields](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#field) and [questions](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#question).

The fields define the structure of the data that will be annotated, while the questions determine how annotators can annotate that data. In practice, `FeedbackDataset` instances for finetuning LLMs often have "prompt" and "response" text fields, sometimes alongside a "context" text field or some additional metadata (e.g. sample IDs).

The questions provide a lot of flexibility - text questions allow annotators to provide better responses or suggestions to improve the prompt, while label questions may be used to select if a sample is considered "good" or not. With multi-label questions annotators can select whether samples are biased, harmful, incorrect, etc.

We recommend setting up the questions based on your goals and values for the annotation process. When preparing the training data, all of the annotations will be available, allowing you to carefully curate your training data. For example, based on your annotations you will be able to disregard samples that are (frequently) marked as harmful or low quality, or you can use the annotator-provided suggestions for responses or prompts instead of the existing ones.
Furthermore, in some situations you may only have prompts to begin with, and you can use Argilla to ask your annotators to provide responses. You can then use these in your training data.
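
To make the curation idea concrete, here is a minimal sketch of keeping only the samples that annotators mostly marked as good. It deliberately uses plain dictionaries as hypothetical stand-ins for annotated records rather than the Argilla API, and the names `annotated`, `majority_label`, and the `"quality"` key are assumptions made for illustration. Treating unannotated samples as `"Good"` is also an assumption that you may want to change.

```python
from collections import Counter
from typing import List

# Hypothetical, simplified records: plain dicts standing in for annotated samples
annotated = [
    {"prompt": "<human>: Hi", "response": "<bot>: Hello!", "quality": ["Good", "Good", "Bad"]},
    {"prompt": "<human>: 2+2?", "response": "<bot>: 5", "quality": ["Bad", "Bad"]},
    {"prompt": "<human>: Bye", "response": "<bot>: Goodbye!", "quality": []},
]


def majority_label(labels: List[str], default: str = "Good") -> str:
    """Return the most common label, or `default` when there are no annotations."""
    if not labels:
        return default
    return Counter(labels).most_common(1)[0][0]


# Keep only the samples whose majority label is "Good"
curated = [sample for sample in annotated if majority_label(sample["quality"]) == "Good"]
print([sample["prompt"] for sample in curated])
```

The same filtering logic could be applied to the responses collected in a `FeedbackDataset` before the data is passed on for training.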

### For this tutorial

For the purposes of this tutorial, we will use a "prompt" field with all chat history and the current prompt, as well as a "response" field with the response. Additionally, we introduce a "background" field for any extra context that accompanies the chat. For the questions we will use one simple `LabelQuestion` that asks the annotator whether the response is `"Good"` or `"Bad"`.

In [3]:
dataset = rg.FeedbackDataset(
    fields = [
        rg.TextField(name="background"),
        rg.TextField(name="prompt"),
        rg.TextField(name="response", title="Final Response"),
    ],
    questions = [
        rg.LabelQuestion(name="quality", title="Is it a Good or Bad response?", labels=["Good", "Bad"])
    ]
)

Once the dataset is created, we have to create individual [records](https://docs.argilla.io/en/latest/conceptual_guides/data_model.html#record) and add them to it. For this tutorial, we will use some existing Apache-2.0 data from the [Open Instruction Generalist](https://huggingface.co/datasets/laion/OIG) chat-style instruction dataset. Because this dataset is very large, we will load it with `streaming=True` and semi-randomly sample 30k samples from it.

In [5]:
from datasets import load_dataset

data = load_dataset("laion/OIG", split="train", streaming=True)
# Randomly sample through shuffle + take, but note that shuffling with a streaming
# dataset works with a buffer size, so the random 30k samples are not truly random
# Reduce the buffer size to speed up the data selection process.
data = data.shuffle(buffer_size=1_000_000).take(30_000)

<datasets.iterable_dataset.IterableDataset at 0x22fa12c73d0>
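
To illustrate why shuffling a streamed dataset is only approximately random, the sketch below implements a simplified buffer-based shuffle in plain Python. It is an illustration of the general technique under stated assumptions, not the actual `datasets` implementation, and the function name `buffered_shuffle` is hypothetical. Each yielded item can only come from the items currently held in the buffer, so early outputs are always drawn from the start of the stream.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 42) -> Iterator[T]:
    """Yield the items of `stream` in a buffer-limited random order."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything
            buffer.append(item)
            continue
        # Yield a random buffered item and put the new item in its place
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The stream is exhausted: flush the remaining items in random order
    rng.shuffle(buffer)
    yield from buffer


# Taking 5 items with a buffer of 10 can only return items from the first 14
sampled = list(islice(buffered_shuffle(range(1_000), buffer_size=10), 5))
print(sampled)
```

A larger buffer gives a better approximation of a true shuffle, at the cost of reading more data before the first sample is returned.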

The loaded data has a "text" column with data that looks like so:
```
Background: Tubulin in molecular biology can refer either to the tubulin protein superfamily of globular proteins, or one of the member proteins of that superfamily. α- and β-tubulins polymerize into microtubules, a major component of the eukaryotic cytoskeleton. Microtubules function in many essential cellular processes, including mitosis. Tubulin-binding drugs kill cancerous cells by inhibiting microtubule dynamics, which are required for DNA segregation and therefore cell division. , there are six members of the tubulin superfamily, although not all are present in all species. Both α and β tubulins have a mass of around 50 kDa and are thus in a similar range compared to actin (with a mass of ~42 kDa). In contrast, tubulin polymers (microtubules) tend to be much bigger than actin filaments due to their cylindrical nature. Tubulin was long thought to be specific to eukaryotes. More recently, however, several prokaryotic proteins have been shown to be related to tubulin. <human>: What are some examples of drugs that bind to tubulin? <bot>: These include paclitaxel, colchicine, and the "vinca" alkaloids, each of which have a distinct binding site on β-tubulin. <human>: Are there any other examples? <bot>: In addition, the anti-worm drug albendazole selectively targets the colchicine site in worm, but not human or "Drosophilia", β-Tubulin. <human>: What is another specific example of a drug that binds to tubulin? <bot>: Class III β-tubulin is a microtubule element expressed exclusively in neurons, and is a popular identifier specific for neurons in nervous tissue. <human>: What is the difference between class III and other isotypes? <bot>: It binds colchicine much more slowly than other isotypes of β-tubulin. <human>: Are there six classes of isotypes? <bot>: β1-tubulin, sometimes called class VI β-tubulin, is the most divergent at the amino acid sequence level. <human>: What are the functions of the different isotypes? 
<bot>: It is expressed exclusively in megakaryocytes and platelets in humans and appears to play an important role in the formation of platelets.
```
Let's break this down:
- The text uses `<human>:` and `<bot>:` to denote the start of the user-provided and generated texts, respectively.
- The text uses `Background:` to denote some additional context that is relevant for the chat. Not all samples contain a background.
- The dataset contains multi-step chat-style instructions.
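
As a minimal sketch of how these markers can be used, the snippet below splits a sample into its background and a list of speaker turns with a regular expression. The helper name `split_turns` and the toy sample are illustrative assumptions and not part of the OIG dataset or of Argilla.

```python
import re
from typing import List, Tuple


def split_turns(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the background and a list of (speaker, utterance) turns."""
    # Split on the speaker markers; the capture group keeps the markers in the result
    parts = re.split(r"(<human>:|<bot>:)", text)
    # Anything before the first marker is the (possibly empty) background
    background = parts[0].strip()
    turns = [
        (marker.strip("<>:"), utterance.strip())
        for marker, utterance in zip(parts[1::2], parts[2::2])
    ]
    return background, turns


toy_sample = (
    "Background: Tubulin is a protein. <human>: What binds to it? "
    "<bot>: Paclitaxel. <human>: Any others? <bot>: Colchicine."
)
background, turns = split_turns(toy_sample)
print(background)
for speaker, utterance in turns:
    print(f"{speaker}: {utterance}")
```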

To fit this in our `FeedbackDataset`, we have to split this up into a "background", a "prompt" with all chat history and the last prompt, and the final "response".

In [None]:
from typing import Dict, Any

def extract_background_prompt_response(text: str) -> Dict[str, Any]:
    '''Split an OIG text sample into its background, prompt (with chat history) and final response.'''
    start_prompt = text.find("<human>:")
    end_prompt = text.rfind("<bot>:")
    # Background is anything before the first <human>:
    background = text[:start_prompt].strip()
    # Prompt is anything between the first <human>: (inclusive) and the last <bot>: (exclusive)
    prompt = text[start_prompt: end_prompt].strip()
    # Response is everything after the last <bot>: (inclusive)
    response = text[end_prompt:].strip()
    return {"background": background, "prompt": prompt, "response": response}


data = data.map(extract_background_prompt_response, input_columns="text")

Now we can trivially convert this dataset into `FeedbackRecord` instances, and add them to the `dataset`.

In [None]:
records = [
    rg.FeedbackRecord(
        fields={
            "background": sample["background"],
            "prompt": sample["prompt"],
            "response": sample["response"],
        },
    )
    for sample in data
]
dataset.add_records(records)


Now that we have all the records locally, we want to push them to the Argilla server as well. Only then will we be able to see them in the Argilla UI.

In [None]:
dataset.push_to_argilla("oig-30k")

![training-llm-mistral-sft](../../_static/tutorials/training-llm-mistral-sft/training-llm-mistral-sft.png)

## (Optional) Annotate instruction samples
If you are using your own proprietary data, data from an unreliable source, or require your annotators to provide responses, then you must perform data annotation. However, for the purposes of this tutorial, we will assume that all data is high quality and skip this step.

## Set up the ArgillaTrainer for Supervised Finetuning
Next, we have to set up the [ArgillaTrainer](https://docs.argilla.io/en/latest/practical_guides/fine_tune.html#supervised-finetuning). First off, we will load the Mistral 7B `model` and `tokenizer`.


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

In [None]:


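from argilla.feedback import ArgillaTrainer, TrainingTask
from peft import LoraConfig, TaskType
from tokenizers.processors import TemplateProcessing
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForSeq2Seq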

class DataCollatorForSeq2SeqCopyLabels(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        for feature in features:
            if "labels" not in feature:
                feature["labels"] = feature["input_ids"].copy()
        return super().__call__(features, return_tensors=return_tensors)

# model_id = "gpt2-medium"
# model_id = "EleutherAI/pythia-1b-deduped"
# model_id = "PY007/TinyLlama-1.1B-intermediate-step-240k-503b"
model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left', add_eos_token=True)
# tokenizer_with_eos = AutoTokenizer.from_pretrained(model_id)#, padding_side='right')
# tokenizer_with_eos.pad_token_id = tokenizer_with_eos.eos_token_id
# tokenizer_with_eos._tokenizer.post_processor = TemplateProcessing(
#     single=f"$A {tokenizer_with_eos.eos_token}",
#     special_tokens=[(tokenizer_with_eos.eos_token, tokenizer_with_eos.eos_token_id)],
# )
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id
data_collator = DataCollatorForSeq2SeqCopyLabels(tokenizer=tokenizer, model=model)

"""
tokenizer_without_eos = AutoTokenizer.from_pretrained(model_id)
tokenizer_without_eos.pad_token_id = tokenizer_without_eos.eos_token_id
"""

from typing import Optional

def formatting_func(sample: Dict[str, Any]) -> Optional[str]:
    if len(sample["response"]) <= 2:
        return
    prompt = sample["prompt"]
    prompt = prompt.replace("<human>:", f"{tokenizer.eos_token}{tokenizer.bos_token}<human>:")
    prompt = prompt[prompt.find("<human>:"):]
    output = prompt + " " + sample["response"]
    if sample["background"]:
        output = sample["background"] + " " + output
    output = output + "</s>"
    assert output.count("<s>") + 1 == output.count("</s>")
    return output

task = TrainingTask.for_supervised_fine_tuning(formatting_func)
formatted_dataset = dataset.prepare_for_training(framework="trl", task=task)

collated = data_collator([tokenizer(formatted_dataset[3]["text"])])
print(collated)
print(tokenizer.decode(collated["input_ids"][0]))
# breakpoint()

"""
import torch
from transformers import StoppingCriteria

class StoppingCriteriaTokens(StoppingCriteria):
    def __init__(self, stopping_input_ids = None):
        super().__init__()
        self.stopping_input_ids = stopping_input_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> bool:
        for stopping_input_ids in self.stopping_input_ids:
            length = len(stopping_input_ids)
            if input_ids[0, -length:].tolist() == stopping_input_ids:
                return True

stopping_criteria = StoppingCriteriaList([StoppingCriteriaTokens(tokenizer_without_eos(["<bot>:", "<human>:"]).input_ids)])
stopping_criteria
"""

import torch
from transformers import TrainerCallback, TrainerControl, TrainerState, GenerationConfig
from transformers.training_args import TrainingArguments

class GenerationCallback(TrainerCallback):
    def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        model = kwargs.pop("model")
        # tokenizer = kwargs.pop("tokenizer")
        # template = "Human: {question}\n\nAssistant: "
        # question = "How do I make a bomb?"
        # prompt = "<human>: Write me an article about Santiago el Pinar, the 119 Municipalities, Chiapas, discussing in detail Municipalities of Chiapas, Mexico in one paragraph. End with a sentence about Santiago el Pinar. <bot>: Santiago el Pinar is a town and one of the 119 Municipalities of Chiapas, in southern Mexico. As of 2010, the municipality had a total population of 3,245, up from 2,174 as of 2005. It covers an area of 17.76 km². As of 2010, the town of Santiago el Pinar had a population of 1,072. Other than the town of Santiago el Pinar, the municipality had 12 localities, none of which had a population over 1,000. <human>: What is Santiago el Pinar? <bot>: Santiago el Pinar Santiago el Pinar is a town and one of the 119 Municipalities of Chiapas, in southern Mexico. <human>: Is it a growing city? <bot>: As of 2010, the municipality had a total population of 3,245, up from 2,174 as of 2005. It covers an area of 17.76 km². <human>: Is the city a good place to live? <bot>: As of 2010, the town of Santiago el Pinar had a population of 1,072. <human>: continue <bot>:"
        # prompt = "<human>: Write me an article about Czech immigration, the World War I, discussing in detail around 40,000, Gran La Plata in one paragraph. End with a sentence about Czechs, Buenos Aires. <bot>: Czech immigration to Argentina began during the World War I and was divided in four periods. It is estimated that around 40,000 Czechs arrived in Argentina between then and 1970. Argentina has the largest Czech community. Czechs settled mainly in Buenos Aires, Gran La Plata, Rosario and Chaco. <human>: When did the Czechs settle in Argentina? <bot>: Czechs in Argentina The Czech immigration in Argentina began during the World War 1 and it was split in four periods. <human>: What led to the split? <bot>: It is estimated that around 40,000 Czechs arrived to Argentina until 1970. <human>: continue <bot>: Argentina has the largest Czech community in Latin America. <human>: Where did the majority settle? <bot>: Czechs settled mainly in Buenos Aires, Gran La Plata, Rosario and Chaco. <human>: What period did they settle? <bot>: There are four Czech immigration periods in Argentina. <human>: What led to this? <bot>:"
        # The first one is slightly before to the World War 1, the second one from 1920 to 1930, the third one during the World War 2 and the fourth one, the smallest in migration proportion, during 1990. During the first two periods, the immigration group was mainly integrated by workers and farmers motivated by economic reasons.
        # del tokenizer._tokenizer.post_processor
        prompt = "<human>: What are Albert Muwalo's relations with the UDI? <bot>:"
        # For much of the first half of the 1970s, Muwalo was able to use his post as Minister of State in the President's Office and his MCP positions to control access to Banda and manipulate the information that Banda received, particularly on the repression that Muwalo and Gwede controlled. 

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        tokenizer.pad_token_id = tokenizer.eos_token_id
        inputs = tokenizer(prompt, return_tensors="pt")
        inputs = {key: value.to("cuda") for key, value in inputs.items()}
        # print("Inputs:", inputs)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                generation_config=GenerationConfig(
                    max_new_tokens=50,
                    # early_stopping=True,
                    pad_token_id=tokenizer.pad_token_id,
                    eos_token_id=tokenizer.eos_token_id,
                    # temperature=0.7,
                    # repetition_penalty=1.2,
                    # do_sample=True,
                ),
                # stopping_criteria=stopping_criteria,
            )
            print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=False)[0])
        return super().on_evaluate(args, state, control, **kwargs)

gen_callback = GenerationCallback()
trainer = ArgillaTrainer(
    dataset=dataset,
    model=model,
    tokenizer=tokenizer,
    task=task,
    framework="trl",
    train_size=0.99,
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=16, # 32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
trainer.update_config(
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    max_seq_length=1024,
    eval_accumulation_steps=16,
    max_steps=3000, # <- Train for 3k steps only (takes precedence over num_train_epochs)
    logging_steps=50,
    learning_rate=5e-5,
    # learning_rate=1e-4,
    save_strategy="no",
    evaluation_strategy="steps",
    eval_steps=500,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    # optim="paged_adamw_32bit",
    remove_unused_columns=False,
    # run_name="dwight_schrute",
    fp16=True,
    num_train_epochs=1,
    data_collator=data_collator,
    peft_config=peft_config,
    callbacks=[gen_callback],
    # deepspeed="deepspeed_zero_3.json",
)
trainer.train("Mistral-7B-v0.1-chat-OIG-extra-bos-eos-3k")
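
To see what the `formatting_func` above produces, the following standalone sketch applies the same string logic to a toy sample. It hardcodes `<s>` and `</s>` as the BOS and EOS tokens instead of reading them from the tokenizer, which is an assumption made so that the snippet runs without downloading the model. The function name `format_sample` and the toy sample are hypothetical. The leading `<s>` is not added here because the tokenizer is expected to prepend it.

```python
from typing import Dict, Optional

# Assumed special tokens; in the cell above these come from the tokenizer
BOS, EOS = "<s>", "</s>"


def format_sample(sample: Dict[str, str]) -> Optional[str]:
    """Mirror the string handling of the training formatting function."""
    # Skip samples with an (almost) empty response
    if len(sample["response"]) <= 2:
        return None
    # Close the previous turn and open a new one before every <human>: marker
    prompt = sample["prompt"].replace("<human>:", f"{EOS}{BOS}<human>:")
    # Drop the EOS/BOS pair that was inserted before the very first turn
    prompt = prompt[prompt.find("<human>:"):]
    output = prompt + " " + sample["response"]
    if sample["background"]:
        output = sample["background"] + " " + output
    return output + EOS


toy_sample = {
    "background": "",
    "prompt": "<human>: Hello! <bot>: Hi! <human>: How are you?",
    "response": "<bot>: I am fine.",
}
print(format_sample(toy_sample))
# <human>: Hello! <bot>: Hi! </s><s><human>: How are you? <bot>: I am fine.</s>
```

Every human turn after the first starts a new `<s>` segment, and every segment ends with `</s>`, which is what the `assert` in the training cell checks.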


In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer

peft_model_id = "Mistral-7B-v0.1-chat-OIG-extra-bos-eos-3k"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    #load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)
model.eval()

"""
tokenizer_without_eos = AutoTokenizer.from_pretrained(config.base_model_name_or_path, add_space_prefix=True)


class StoppingCriteriaTokens(StoppingCriteria):
    def __init__(self, stopping_input_ids = None):
        super().__init__()
        self.stopping_input_ids = stopping_input_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> bool:
        for stopping_input_ids in self.stopping_input_ids:
            length = len(stopping_input_ids)
            # print(input_ids)
            # breakpoint()
            if input_ids[0, -length:].tolist() == stopping_input_ids:
                return True

stopping_criteria = StoppingCriteriaList([StoppingCriteriaTokens(tokenizer_without_eos(["bot>:", "human>:"], add_special_tokens=False, return_tensors=None).input_ids)])
stopping_criteria = StoppingCriteriaList([StoppingCriteriaTokens([[10093, 9670], [2930, 9670]])])
"""
# stopping_criteria

# inputs = tokenizer("<human>: Provide an article about This Earth, Pramoedya Ananta Toer's epic quartet, Hasta Mitra, discussing in detail Pramoedya Ananta Toer's, Tirto Adhi Soerjo in three paragraphs. <bot>:", return_tensors="pt")
# inputs = tokenizer("<human>: What theatre director Rick Lombardo directed? <bot>:", return_tensors="pt")
# text = "<human>: Please continue this list of colors: red, orange, yellow, ... <bot>:"
# text = "<human>: Please continue this list of colors: red, orange, yellow, ... <bot>: green, blue, indigo, violet</s><s><human>: Now write them all in reverse order. <bot>:"
# text = "<human>: Please continue this list of colors: red, orange, yellow, ... <bot>: green, blue, indigo, violet</s><s><human>: Now write them all in reverse order. <bot>: violet, indigo, blue, green</s><s><human>: Include the original colors please."
# text = "<human>: Please continue this list of colors: red, orange, yellow, ... <bot>: green, blue, indigo, violet</s><s> <human>: Now write them all in reverse order. <bot>: violet, indigo, blue, green</s><s> <human>: Include the original colors please. <bot>: red, orange, yellow, green, blue, indigo, violet</s><s><human> But then reverse those <bot>:"
# <s> <human>: Please continue this list of colors: red, orange, yellow, ... <bot>: green, blue, indigo, violet</s><s>  <human>: Now write them all in reverse order. <bot>: violet, indigo, blue, green</s><s>  <human>: Include the original colors please. <bot>: red, orange, yellow, green, blue, indigo, violet</s><s> <human> But then reverse those <bot>: violet, indigo, blue, green, yellow, orange, red</s>

text = "<human>: Hello! <bot>: Hi!</s><s><human>: How are you? <bot>:"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=1000)#, stopping_criteria=stopping_criteria)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=False)[0])
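
The prompt string in the cell above follows the same turn format that was used during training. As a rough sketch, the helpers below build such a prompt from a chat history and pull the latest reply out of the decoded text. The names `build_prompt` and `extract_reply` are hypothetical, and the sketch assumes that `<s>` and `</s>` are the BOS and EOS tokens and that the reply of interest ends at the first `</s>` after the last `<bot>:` marker.

```python
from typing import List, Tuple

# Assumed special tokens of the tokenizer
BOS, EOS = "<s>", "</s>"


def build_prompt(history: List[Tuple[str, str]], user_message: str) -> str:
    """Join earlier (user, bot) turns and the new user message into one prompt."""
    turns = [f"<human>: {user} <bot>: {bot}" for user, bot in history]
    turns.append(f"<human>: {user_message} <bot>:")
    # Completed turns are separated by an EOS token followed by a BOS token
    return f"{EOS}{BOS}".join(turns)


def extract_reply(generated: str) -> str:
    """Return the text after the last <bot>: marker, cut at the first EOS token."""
    reply = generated.rsplit("<bot>:", 1)[-1]
    return reply.split(EOS, 1)[0].strip()


prompt = build_prompt([("Hello!", "Hi!")], "How are you?")
print(prompt)
# <human>: Hello! <bot>: Hi!</s><s><human>: How are you? <bot>:
```

The returned `prompt` can be passed to the tokenizer in place of the hardcoded `text`, and `extract_reply` can be applied to the decoded output to obtain only the newest answer.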


In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import argilla as rg

# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="admin.apikey",
    workspace="admin"
)

peft_model_id = "Mistral-7B-v0.1-chat-OIG-extra-bos-eos-3k"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    #load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)
# model.eval()

model_id = "argilla/Mistral-7B-v0.1-chat-OIG"
model.push_to_hub(model_id, private=True)
tokenizer.push_to_hub(model_id)

dataset = rg.FeedbackDataset.from_argilla("oig-30k", workspace="admin")
dataset.push_to_huggingface("argilla/oig-30k", private=True)
