<a href="https://colab.research.google.com/github/anupj/fine-tuning-llama/blob/main/Fine_Tuning_Llama_for_Multi_Turn_Conversations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning LLMs for Multi-Turn Conversations

source: https://www.together.ai/blog/fine-tuning-llms-for-multi-turn-conversations-a-technical-deep-dive

🤗 dataset: [stanfordnlp/coqa](https://huggingface.co/datasets/stanfordnlp/coqa/tree/main)

In [1]:
!pip install -q datasets transformers together


## Prepare CoQA Dataset for Fine-tuning

In [2]:
from datasets import load_dataset

coqa_dataset = load_dataset("stanfordnlp/coqa")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Let's examine some rows from the CoQA dataset

In [3]:
coqa_dataset["train"].to_pandas().head()

| | source | story | questions | answers |
|---|---|---|---|---|
| 0 | wikipedia | The Vatican Apostolic Library (), more commonl... | [When was the Vat formally opened?, what is th... | {'input_text': ['It was formally established i... |
| 1 | cnn | New York (CNN) -- More than 80 Michael Jackson... | [Where was the Auction held?, How much did the... | {'input_text': ['Hard Rock Cafe', '$2 million.... |
| 2 | gutenberg | CHAPTER VII. THE DAUGHTER OF WITHERSTEEN \n\n"... | [What did Venters call Lassiter?, Who asked La... | {'input_text': ['gun-man', 'Jane', 'Yes', 'to ... |
| 3 | cnn | (CNN) -- The longest-running holiday special s... | [Who is Rudolph's father?, Why does Rudolph ru... | {'input_text': ['Donner', 'he felt like an out... |
| 4 | gutenberg | CHAPTER XXIV. THE INTERRUPTED MASS \n\nThe mor... | [Who arrived at the church?, Who was followed ... | {'input_text': ['the garrison first', 'Fra. Do... |


## Format the data to conform to the chat format
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful AI chatbot."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you! How can I help you?"},
    {"role": "user", "content": "Can you explain machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}
```
This list of messages can be written to a `.jsonl` file.
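
As a minimal sketch of what "written to a `.jsonl` file" means in practice: each conversation becomes one JSON object on its own line. The toy conversations, the helper names `write_jsonl` and `read_jsonl`, and the temporary file path are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import tempfile

# Hypothetical toy conversations in the chat format shown above.
conversations = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful AI chatbot."},
            {"role": "user", "content": "Hello, how are you?"},
            {"role": "assistant", "content": "I'm doing well, thank you!"},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Can you explain machine learning?"},
            {"role": "assistant", "content": "Machine learning is..."},
        ]
    },
]


def write_jsonl(records, path):
    """Write one JSON object per line and return the number of lines written."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


def read_jsonl(path):
    """Read a JSONL file back into a list of Python objects."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory so the sketch leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), "toy_chat.jsonl")
n_written = write_jsonl(conversations, path)
round_trip = read_jsonl(path)
print(n_written, round_trip == conversations)
```

The notebook itself uses `Dataset.to_json` for this step; the manual version is only meant to make the file layout visible.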

In [4]:
# The system prompt, if present, must always be the first message
system_prompt = "Read the story and extract answers for the questions.\nStory: {}"

def map_fields(row):
    """
    Maps the fields from a row of data to a structured format for conversation.
    Args:
        row (dict): A dictionary containing the keys "story", "questions", and "answers".
            - "story" (str): The story content to be used in the system prompt.
            - "questions" (list of str): A list of questions from the user.
            - "answers" (dict): A dictionary containing the key "input_text" which is a list of answers from the assistant.
    Returns:
        dict: A dictionary with a single key "messages" which is a list of message dictionaries.
            Each message dictionary contains:
            - "role" (str): The role of the message sender, either "system", "user", or "assistant".
            - "content" (str): The content of the message.
    """
    messages = [
        {
            "role": "system",
            "content": system_prompt.format(row["story"]),
        }
    ]
    for q, a in zip(row["questions"], row["answers"]["input_text"]):
        messages.append(
            {
                "role": "user",
                "content": q,
            }
        )
        messages.append(
            {
                "role": "assistant",
                "content": a,
            }
        )

    return {
        "messages": messages
    }

In [5]:
# transform the data using the mapping function
train_messages = coqa_dataset["train"].map(map_fields, remove_columns=coqa_dataset["train"].column_names)


In [6]:
train_messages

Dataset({
    features: ['messages'],
    num_rows: 7199
})

In [7]:
train_messages.to_json("coqa_prepared_train.jsonl")


23777505
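
Before uploading, it can help to sanity-check that each conversation has the expected shape. The sketch below is a hypothetical checker, `find_format_problems`, built on assumed rules: an optional system message only in first position, followed by strictly alternating user and assistant turns with string content. The rules actually enforced by the fine-tuning service may differ.

```python
def find_format_problems(messages):
    """Return a list of human-readable problems; an empty list means none were found.

    The rules here are assumptions for illustration, not an official specification.
    """
    if not messages:
        return ["conversation is empty"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")
        if msg.get("role") == "system" and i != 0:
            problems.append(f"message {i}: system prompt is not first")

    # After dropping system messages, turns should go user, assistant, user, ...
    turns = [m.get("role") for m in messages if m.get("role") != "system"]
    for i, role in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if role != expected:
            problems.append(f"turn {i}: expected {expected!r}, got {role!r}")
    return problems


good = [
    {"role": "system", "content": "Read the story and extract answers."},
    {"role": "user", "content": "Where was the auction held?"},
    {"role": "assistant", "content": "Hard Rock Cafe"},
]
bad = [
    {"role": "user", "content": "Where was the auction held?"},
    {"role": "system", "content": "Read the story and extract answers."},
    {"role": "user", "content": "How much did they make?"},
]

print(find_format_problems(good))
print(find_format_problems(bad))
```

Running a check like this over every line of `coqa_prepared_train.jsonl` would surface malformed rows locally rather than after the upload.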

## Fine-tune on Prepared Dataset using Together AI Fine-tuning API

In [8]:
from together import Together
import os
try:
    from google.colab import userdata
    os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')
    os.environ['WANDB_API_KEY'] = userdata.get('WANDB_API_KEY')
except ImportError:
    print("Not in Google Colab environment")

for key in ['TOGETHER_API_KEY', 'WANDB_API_KEY']:
    try:
        api_key = os.environ[key]
        if not api_key:
            raise ValueError(f"{key} environment variable is empty")
    except KeyError:
        api_key = input(f"{key} environment variable is not set. Please enter your API key: ")
        os.environ[key] = api_key
# Get the API key from the environment variable
TOGETHER_API_KEY = os.environ.get('TOGETHER_API_KEY')
WANDB_API_KEY    = os.environ.get('WANDB_API_KEY')

client = Together(api_key = TOGETHER_API_KEY)

In [9]:
# Upload dataset to Together AI

train_file_resp = client.files.upload("coqa_prepared_train.jsonl", check=True)
print(train_file_resp)

Uploading file coqa_prepared_train.jsonl: 100%|██████████| 23.8M/23.8M [00:00<00:00, 28.2MB/s]


id='file-af23f4ed-4156-4ea2-9937-6776a06f9ce1' object=<ObjectType.File: 'file'> created_at=1740133118 type=None purpose=<FilePurpose.FineTune: 'fine-tune'> filename='coqa_prepared_train.jsonl' bytes=0 line_count=0 processed=False FileType='jsonl'


### This is where all the fine-tuning magic happens

In [10]:
ft_resp = client.fine_tuning.create(
    training_file = train_file_resp.id,
    model = 'meta-llama/Meta-Llama-3.1-8B-Instruct-Reference',
    train_on_inputs= "auto",
    n_epochs = 3,
    n_checkpoints = 1,
    wandb_api_key = WANDB_API_KEY,
    lora = True,
    warmup_ratio=0,
    learning_rate = 1e-5,
    suffix = 'my-demo-finetune',
)

print(ft_resp.id)

message='Starting from together>=1.3.0, the default batch size is set to the maximum allowed value for each model.'


ft-9f5c13f6
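
The job runs asynchronously, so the ID printed above is only a handle. A rough sketch of waiting for completion follows. The helper `wait_for_job`, the set of terminal state names, and the polling interval are assumptions made for illustration; consult the Together documentation for the real status values. The helper takes a plain callable so it can be demonstrated here with fake statuses instead of a live API call.

```python
import time

# Assumed terminal states; the real list should be taken from the Together docs.
TERMINAL_STATES = {"completed", "error", "cancelled"}


def wait_for_job(get_status, poll_seconds=60.0, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or max_polls is exhausted."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")


# Demonstration with fake statuses and no real sleeping.
fake_statuses = iter(["pending", "running", "completed"])
final = wait_for_job(lambda: next(fake_statuses), poll_seconds=0, sleep=lambda s: None)
print(final)

# Hypothetical usage with the job created above (not run here); this assumes
# client.fine_tuning.retrieve(job_id) returns an object with a `status` field:
# final = wait_for_job(lambda: client.fine_tuning.retrieve(ft_resp.id).status)
```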


## Evaluate Fine-tuned Model
For evaluation, CoQA uses two metrics:

- F1 score, which measures word overlap between predicted and ground truth answers
- Exact Match (EM), which requires the prediction to exactly match one of the ground truth answers.

F1 is the primary metric as it better handles free-form answers by giving partial credit for partially correct responses.
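
To make the difference between the two metrics concrete, here is a small self-contained sketch of SQuAD-style scoring. It is a simplified re-implementation written for illustration (lowercasing, stripping punctuation and English articles, then comparing tokens), not the library code used later in this notebook. The predicted answers are made up.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold, pred):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(gold) == normalize(pred))


def token_f1(gold, pred):
    """Harmonic mean of token-level precision and recall."""
    gold_tokens = normalize(gold).split()
    pred_tokens = normalize(pred).split()
    if not gold_tokens or not pred_tokens:
        # Two empty answers count as a match; one empty answer counts as a miss.
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "Hard Rock Cafe"
verbose_pred = "The Hard Rock Cafe in New York"  # hypothetical model output
terse_pred = "the Hard Rock Cafe."               # hypothetical model output

print(exact_match(gold, verbose_pred), round(token_f1(gold, verbose_pred), 3))
print(exact_match(gold, terse_pred), round(token_f1(gold, terse_pred), 3))
```

The verbose prediction fails Exact Match but still earns partial F1 credit (precision 3/6, recall 3/3), which is the behavior the paragraph above describes.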

In [11]:
from tqdm.auto import tqdm
from multiprocessing.pool import ThreadPool
import transformers.data.metrics.squad_metrics as squad_metrics


In [12]:
# This function is used to generate model answers on the CoQA validation set from the untuned reference and fine-tuned models

def get_model_answers(model_name):
    """
    Generate model answers for a given model name using a dataset of questions and answers.
    Args:
        model_name (str): The name of the model to use for generating answers.
    Returns:
        list: A list of lists, where each inner list contains the answers generated by the model for the corresponding set of questions in the dataset.
    The function performs the following steps:
    1. Initializes an empty list to store the model answers.
    2. Defines an inner function `get_answers` that takes a data dictionary and generates answers for the questions in the data.
    3. Uses a thread pool to parallelize the process of generating answers for each entry in the validation dataset.
    4. Appends the generated answers to the `model_answers` list.
    5. Returns the `model_answers` list.
    Note:
        - The `system_prompt` and `client` variables are assumed to be defined elsewhere in the code.
        - The `coqa_dataset` variable is assumed to contain the dataset with a "validation" key.
    """

    model_answers = []

    def get_answers(data):
        answers = []
        messages = [
            {
                "role": "system",
                "content": system_prompt.format(data["story"]),
            }
        ]
        for q, true_answer in zip(data["questions"], data["answers"]["input_text"]):
            messages.append(
                {
                    "role": "user",
                    "content": q
                }
            )
            chat_completion = client.chat.completions.create(
                messages=messages,
                model=model_name,
                max_tokens=64,
            )
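            answer = chat_completion.choices[0].message.content
            answers.append(answer)
            # Feed the model's own reply back into the history so that roles
            # alternate, matching the user/assistant structure of the training data.
            messages.append(
                {
                    "role": "assistant",
                    "content": answer
                }
            )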
        return answers


    with ThreadPool(8) as pool:
        for answers in tqdm(pool.imap(get_answers, coqa_dataset["validation"]), total=len(coqa_dataset["validation"])):
            model_answers.append(answers)

    return model_answers
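
One detail the function above relies on is that `pool.imap` yields results in input order even when worker threads finish out of order, which is what lets the predictions later be paired with the validation rows by position. The toy sketch below illustrates this; `slow_echo` and the sleep times are made up for the demonstration.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_echo(item):
    # Earlier items sleep longer, so they finish last.
    time.sleep(0.05 * (3 - item))
    return item * 10


items = [0, 1, 2, 3]
with ThreadPool(4) as pool:
    # Consume the iterator inside the `with` block, before the pool is torn down.
    ordered = list(pool.imap(slow_echo, items))

print(ordered)  # results come back in input order
```

Had `imap_unordered` been used instead, results would arrive in completion order and could no longer be zipped against the dataset safely.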

In [13]:
# This function will be used to evaluate predicted answers using the Exact Match (EM) and F1 metrics

def get_metrics(pred_answers):
    """
    Calculate the Exact Match (EM) and F1 metrics for predicted answers.
    Args:
        pred_answers (list): A list of lists. Each inner list holds the predicted answers for all questions of one validation story, in dataset order.
    Returns:
        tuple: A tuple containing two elements:
            - em_score (float): The average Exact Match score across all predictions.
            - f1_score (float): The average F1 score across all predictions.
    """

    em_metrics = []
    f1_metrics = []

    for pred, data in tqdm(zip(pred_answers, coqa_dataset["validation"]), total=len(pred_answers)):
        for pred_answer, true_answer in zip(pred, data["answers"]["input_text"]):
            em_metrics.append(squad_metrics.compute_exact(true_answer, pred_answer))
            f1_metrics.append(squad_metrics.compute_f1(true_answer, pred_answer))

    return sum(em_metrics) / len(em_metrics), sum(f1_metrics) / len(f1_metrics)

## Deploy Model and Run Evals


In [16]:
models_names = [
    "anupjadhav/Meta-Llama-3.1-8B-Instruct-Reference-finetune-yyyyyyysecrectyyyyyy",
]

for model_name in models_names:
    print(model_name)
    answers = get_model_answers(model_name)
    em_metric, f1_metric = get_metrics(answers)
    print(f"EM: {em_metric}, F1: {f1_metric}")

anupjadhav/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-1d44f634-818e33ad


  0%|          | 0/500 [00:00<?, ?it/s]

ServiceUnavailableError: Error code: 503 - The server is overloaded or not ready yet.
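
A 503 like the one above can be transient, or it may simply mean the model endpoint was not ready when the request was sent. For the transient case, one common approach is to retry with exponential backoff. The sketch below is a generic, hypothetical helper (`call_with_retries` is not part of the Together SDK), demonstrated with a fake function that fails twice. Retrying will not help if the model has no running endpoint.

```python
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception, wait base_delay * 2**(attempt - 1) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration: a fake call that fails twice, then succeeds.
attempts = {"count": 0}
delays = []  # records the waits instead of actually sleeping


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("503: server overloaded")
    return "ok"


result = call_with_retries(flaky, retry_on=(RuntimeError,), sleep=delays.append)
print(result, delays)
```

In the evaluation loop, the call to `client.chat.completions.create` could be wrapped in such a helper, with `retry_on` narrowed to the SDK's service-unavailable exception type.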