In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Build a Chat Model Using Fine-tuning

Conversation fine-tuning builds on instruction tuning to make the model better at sustained conversation with a human. Chat models are created using this technique.

In a conversation, each turn comes from an actor with a well-defined role. An example conversation:

```
User: When was Abraham Lincoln born?
LLM: Abraham Lincoln was born on February 12, 1809.

User: How old was he when he died?
LLM: Abraham Lincoln died on April 15, 1865, at the age of 56.

User: Where did he die?
LLM: Abraham Lincoln was assassinated in Washington, D.C.
```

We will now fine-tune the base model ``EleutherAI/pythia-70m-deduped`` for conversation. We will adopt the following prompt syntax to designate the role of each turn in the dialog.

```
<|user|>
When was Abraham Lincoln born?</s> 
<|assistant|>
Abraham Lincoln was born on February 12, 1809.</s> 
<|user|>
How old was he when he died?</s> 
<|assistant|>
Abraham Lincoln died on April 15, 1865, at the age of 56.</s> 
<|user|>
Where did he die?</s> 
<|assistant|>
Abraham Lincoln was assassinated in Washington, D.C.</s>
```
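
To make the syntax concrete, here is a minimal sketch of how a list of role-tagged messages could be rendered into this prompt format. The `render_chat` helper and its `eos_token` default are hypothetical names used only for illustration; in the notebook itself the tokenizer's chat template does this job.

```python
def render_chat(messages, eos_token="</s>"):
    """Render role-tagged messages into the <|role|> prompt syntax (illustrative helper)."""
    parts = []
    for message in messages:
        # Each turn is a role tag on its own line, followed by the content
        # and an end-of-sequence token that marks where the turn stops.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "When was Abraham Lincoln born?"},
    {"role": "assistant", "content": "Abraham Lincoln was born on February 12, 1809."},
]
print(render_chat(messages))
```

The end-of-sequence token after every turn is what lets a fine-tuned model learn where an answer should end.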

## Prepare Training Data

We will use [medalpaca/medical_meadow_medical_flashcards](https://huggingface.co/datasets/medalpaca/medical_meadow_medical_flashcards) to train the model.

SFTTrainer supports [two different data formats](https://huggingface.co/docs/trl/en/sft_trainer#dataset-format-support). We will use the conversational format, where each training example looks as follows.

```json
{
  "messages": [
    {"role": "system", "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request."},
    {"role": "user", "content": "Who was Ada Lovelace?"},
    {"role": "assistant", "content": "Ada Lovelace was a Mathematician."}
  ]
}
```

These columns in the source dataset are of interest to us:

- instruction – A task instruction. The code below does not use it and supplies its own system prompt instead
- input – The question, which becomes the user message
- output – Desired response from the model

Let's look at the dataset.

In [None]:
dataset = load_dataset(
    "medalpaca/medical_meadow_medical_flashcards",
    split="train"
).train_test_split(test_size=0.1)

In [None]:
dataset

In [None]:
print(dataset["train"][0])

The code below reformats the dataset as required by SFTTrainer and saves the two splits in the ``train_medical_dataset.jsonl`` and ``test_medical_dataset.jsonl`` files.

In [None]:
def prepare_data(dataset):
 
    #Data mapping function
    def create_conversation(sample):   
        return {
            "messages": [
                {
                    "role": "system", 
                    "content": "You are medical professional. Answer the question with most scientific accuracy."
                },
                {
                    "role": "user", 
                    "content": sample["input"]
                },
                {
                    "role": "assistant", 
                    "content": sample["output"]
                }
            ]
        }
        
    #By default the map() function merges new columns to the dataset.
    dataset = dataset.map(
        create_conversation, 
        remove_columns=["input", "output", "instruction"])

    # Save dataset
    dataset["train"].to_json("train_medical_dataset.jsonl", orient="records")
    dataset["test"].to_json("test_medical_dataset.jsonl", orient="records")
 
prepare_data(dataset)

JSONL is a format where each line is a standalone JSON document. Open the ``train_medical_dataset.jsonl`` file and review it.
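
For readers new to JSONL, the following sketch writes and reads a tiny file using only the standard library. The helper names `write_jsonl` and `read_jsonl`, the file name, and the sample record are made up for illustration; the notebook itself relies on `to_json` and `load_dataset` for this.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Parse each non-empty line as its own JSON document."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A made-up record in the conversational format.
records = [
    {"messages": [
        {"role": "user", "content": "Who was Ada Lovelace?"},
        {"role": "assistant", "content": "Ada Lovelace was a Mathematician."},
    ]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

print(len(loaded), loaded[0]["messages"][0]["role"])
```

Because every line parses on its own, a JSONL file can be streamed or appended to without loading the whole file.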

Data conversion needs to be done only once. Before running training we need to load the converted data.

In [None]:
train_dataset = load_dataset(
    "json", 
    data_files="train_medical_dataset.jsonl", 
    split="train")

In [None]:
train_dataset

## Load the Base Model

This code will load the base model with 4-bit quantization.

In [None]:
bnb_config = BitsAndBytesConfig(
    #For 4bit quantization
    load_in_4bit=True
)

base_model_name = "EleutherAI/pythia-70m-deduped"

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(
    base_model_name)

#The base tokenizer does not have a prompt template.
#We add it here.
tokenizer.chat_template = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"


## Evaluate the Base Model

Before running any training we should see if the base model is any good at solving our problems. We write a simple utility to perform text generation.

In [None]:
def generate(model, tokenizer, question):
  streamer = TextStreamer(tokenizer)
  
  messages = [
    {"role": "system", "content": "You are medical professional. Answer the question with most scientific accuracy."},
    {"role": "user", "content": question},
  ]

  #This will convert the messages list to text and then tokenize it.
  encoded = tokenizer.apply_chat_template(
      messages,
      add_generation_prompt=True,
      return_tensors="pt").to(model.device)
 
  generated_ids = model.generate(encoded, streamer=streamer, max_new_tokens=128)

In [None]:
#Give it a try.
generate(base_model, tokenizer, "What is the name of the active form of vitamin D?")

The biggest problem with the model right now is that it doesn't know when to stop answering. Let’s see if fine-tuning will help.

## Run Training

First we configure the training parameters. We run training for 1 epoch. Each batch will have 5 samples of training data. We set the maximum sequence length to only 500 because we're using a very small language model.
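
As a rough guide to how these numbers interact, the sketch below computes the effective batch size and the approximate number of optimizer updates per epoch. It assumes a single device and gradient accumulation over 2 batches, as configured in this notebook. The count of 1000 training sequences is purely hypothetical, and with packing enabled the number of sequences will differ from the number of rows in the dataset.

```python
import math


def optimizer_steps_per_epoch(num_sequences, per_device_batch_size,
                              grad_accum_steps, num_devices=1):
    """Return (effective batch size, approximate optimizer updates per epoch)."""
    # Gradients from several small batches are summed before each update,
    # so the optimizer effectively sees one larger batch.
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, math.ceil(num_sequences / effective_batch)


# 1000 is a hypothetical sequence count, not a property of the dataset.
effective_batch, steps = optimizer_steps_per_epoch(
    num_sequences=1000, per_device_batch_size=5, grad_accum_steps=2)
print(effective_batch, steps)
```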

In [None]:
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
)
 
args = SFTConfig(
    output_dir="medical-trained-model",     # directory where checkpoints and the final model are saved
    num_train_epochs=1,                     # number of training epochs
    per_device_train_batch_size=5,          # batch size per device during training
    gradient_accumulation_steps=2,          # accumulate gradients over 2 batches before each optimizer update
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=2,                        # log every 2 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=False,                             # bfloat16 precision disabled
    tf32=False,                             # tf32 precision disabled
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper (not applied by the "constant" scheduler)
    lr_scheduler_type="constant",           # use constant learning rate scheduler
    max_length=500,                         # maximum sequence length in tokens
    packing=True,                           # pack multiple short examples into each sequence
)
 
trainer = SFTTrainer(
    model=base_model,
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
)
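
To get a feel for the ``LoraConfig`` values above, the sketch below counts the extra parameters that a LoRA adapter adds to one linear layer and computes the default scaling factor ``lora_alpha / r``. The helper names are hypothetical, and the hidden size of 512 is my assumption about this base model rather than something read from its config.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Default LoRA scaling applied to the low-rank update."""
    return lora_alpha / r


hidden_size = 512  # assumed hidden size, for illustration only

adapter_params = lora_extra_params(hidden_size, hidden_size, r=256)
frozen_params = hidden_size * hidden_size  # weights of the frozen square layer

print(adapter_params, frozen_params, lora_scaling(lora_alpha=128, r=256))
```

Under that assumption, a rank of 256 makes the adapter as large as the square layer it wraps, so a much smaller rank may be worth trying on a model this small.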

In [None]:
train_dataloader = trainer.get_train_dataloader()
first_batch = next(iter(train_dataloader))
# print(first_batch['input_ids'][0])

first_batch

In [None]:
tokenizer.decode(first_batch["input_ids"][1])

In [None]:
input_ids_batch = first_batch["input_ids"] 
decoded_texts = [tokenizer.decode(input_ids, skip_special_tokens=False) for input_ids in input_ids_batch]
print(decoded_texts)
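
The decoded batches above may contain several conversations in one sequence. That is the effect of ``packing=True``. The sketch below shows a simplified concatenate-and-chunk version of the idea on made-up token ids. It is not TRL's implementation, whose packing strategy may differ between versions, and the `pack_sequences` helper is hypothetical.

```python
def pack_sequences(tokenized_examples, max_length, eos_token_id):
    """Concatenate examples into one token stream and cut it into full-length chunks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_token_id)  # separator so examples stay distinguishable

    # Keep only complete chunks; an incomplete tail is dropped in this sketch.
    return [stream[i:i + max_length]
            for i in range(0, len(stream) - max_length + 1, max_length)]


# Made-up token ids standing in for three tokenized conversations.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(examples, max_length=4, eos_token_id=0)
print(packed)
```

Packing avoids wasting compute on padding when most examples are far shorter than the maximum sequence length.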

Now, we can begin training. As training progresses you should see a dramatic reduction in loss. This is always a welcome sign.

In [None]:
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
trainer.train()

While training is going on, you can use the ``nvidia-smi`` command to check GPU usage and memory availability.
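
If you prefer to check from Python, for example in a second notebook, a sketch like the following wraps ``nvidia-smi``. The `gpu_memory_report` helper is hypothetical, and it returns ``None`` on machines where the command is not available.

```python
import shutil
import subprocess


def gpu_memory_report():
    """Return per-GPU memory usage reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine

    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() if result.returncode == 0 else None


print(gpu_memory_report())
```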

## Save the Model

The model weights are saved after every epoch in the ``./medical-trained-model`` folder, but we should also save the final version. This will save the model as well as the tokenizer.

In [None]:
trainer.save_model()

Open ``./medical-trained-model/tokenizer_config.json`` to verify that the chat template is now set for the tokenizer.
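
The same check can be done in code. This is a minimal sketch with a hypothetical `find_chat_template` helper. It looks in ``tokenizer_config.json`` first and then, on the assumption that some library versions store the template separately, in a ``chat_template.jinja`` file in the same folder.

```python
import json
from pathlib import Path


def find_chat_template(model_dir):
    """Return the saved chat template, or None if none is found."""
    model_dir = Path(model_dir)

    config_path = model_dir / "tokenizer_config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        if config.get("chat_template"):
            return config["chat_template"]

    # Assumption: the template may instead live in its own file.
    jinja_path = model_dir / "chat_template.jinja"
    if jinja_path.is_file():
        return jinja_path.read_text(encoding="utf-8")

    return None


template = find_chat_template("medical-trained-model")
print("chat template found" if template else "chat template not found")
```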

## Run Inference

To run inference we need to load the fine-tuned model from the ``./medical-trained-model`` folder. This model is already quantized. There’s no need to quantize it again.

Before you go forward I recommend that you restart the notebook session or run this code to free up memory.

In [None]:
#Free up memory taken up during training
del base_model
del trainer
torch.cuda.empty_cache()

In [None]:
#Load the model
model = AutoModelForCausalLM.from_pretrained(
    "medical-trained-model",
    device_map="auto")
 
tokenizer = AutoTokenizer.from_pretrained(
    "medical-trained-model")

Run inference.

In [None]:
generate(model, tokenizer, "What is transformation and how is it characterized as the direct uptake of naked DNA by bacteria?")

In [None]:
generate(model, tokenizer, "Which acid-base disturbance is a normal physiological change during pregnancy?")

## Summary

Here we built a chat model. Applications can now supply a chat history and get a relevant response from the model. An example is shown below. Notice how the model resolves ``"he"`` in the last user prompt to ``Abraham Lincoln``.
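
A minimal sketch of how an application might accumulate such a chat history is given here. The `ChatHistory` class is hypothetical, and the assistant reply is copied from the example conversation at the top of this notebook rather than produced by the model.

```python
class ChatHistory:
    """Keeps the running list of messages for one conversation (illustrative)."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add_user(self, content):
        self.messages.append({"role": "user", "content": content})

    def add_assistant(self, content):
        self.messages.append({"role": "assistant", "content": content})


history = ChatHistory()
history.add_user("When was Abraham Lincoln born?")
history.add_assistant("Abraham Lincoln was born on February 12, 1809.")
history.add_user("How old was he when he died?")

for message in history.messages:
    print(message["role"], ":", message["content"])
```

The full ``history.messages`` list, rather than only the latest question, is what would be passed to ``tokenizer.apply_chat_template`` so that the model sees the earlier turns and can work out who ``"he"`` refers to.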