In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, BitsAndBytesConfig
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Workshop - Build a Medical Chat Model

You will fine-tune a pre-trained model (``"EleutherAI/pythia-1B-deduped"``) so that it can answer medical questions. The base model has very little medical knowledge and no chat capability; it can only predict the next token.

## Load the Dataset
Load the ``"medalpaca/medical_meadow_medical_flashcards"`` dataset. Use only 1000 samples to speed up training.

In [17]:
#Load dataset

#Use only 1000 samples

#Inspect the dataset

## Prepare the Dataset
Map the dataset to the conversational format required by ``SFTTrainer``. Save the transformed dataset to the ``train_medical_dataset.jsonl`` file.

In [None]:
#Map and save dataset


## Load the Converted Dataset
Load the transformed dataset from ``train_medical_dataset.jsonl``.

In [None]:
#Load dataset


## Load the Base Model

Load the base model (``"EleutherAI/pythia-1B-deduped"``) with 4-bit quantization. Also load the tokenizer.

In [16]:
#Load base model with 4bit quantization

#Load the tokenizer


## Set Chat Template
The base tokenizer doesn't have any chat template. Set the same template that was used in the ``fine-tune-chat-template.ipynb`` notebook.

In [None]:
#Set chat template


## Evaluate the Base Model
Inspect how the base model performs when you ask it a few questions from the training dataset.

In [None]:
# Write a function that generates an answer to a question
# Set max_new_tokens to 256


## Run Training

Run LoRA training using these parameters: 

- 2 epochs. 
- Batch size 5.
- Maximum sequence length of only 300, because we're using a very small language model.
- Model save directory ``"medical-trained-model"``.

In [None]:
#Run training


In [None]:
#Save the trained model


## Run Inference
Load the trained model and tokenizer and run inference.

In [None]:
#Load trained model

#Load trained tokenizer


In [None]:
#Run inference


# Solution

## Load Dataset

In [None]:
dataset = load_dataset("medalpaca/medical_meadow_medical_flashcards")

In [None]:
#Use only 1000 samples to speed up training
dataset = dataset["train"].select(range(1000))

In [None]:
print(dataset[0])

## Prepare Dataset for Training
The code below reformats the dataset loaded above into the conversational format required by ``SFTTrainer`` and saves it to the ``train_medical_dataset.jsonl`` file.

In [None]:
def prepare_data(dataset):
 
    #Data mapping function
    def create_conversation(sample):   
        return {
            "messages": [
                {
                    "role": "system", 
                    "content": "You are medical professional."
                },
                {
                    "role": "user", 
                    "content": sample["input"]
                },
                {
                    "role": "assistant", 
                    "content": sample["output"]
                }
            ]
        }
        
    #By default the map() function merges new columns into the dataset,
    #so we remove the original columns and keep only "messages".
    dataset = dataset.map(
        create_conversation, 
        remove_columns=["input", "output", "instruction"])

    # Save dataset
    dataset.to_json("train_medical_dataset.jsonl", orient="records")
 
prepare_data(dataset)

JSONL is an interesting format where each line is a JSON document. Open the ``train_medical_dataset.jsonl`` file and review it.
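
To review the file without opening it in an editor, a few lines of Python are enough. The helpers below are a minimal sketch and not part of the original notebook: ``peek_jsonl`` and ``roles_of`` are hypothetical names, and the code assumes every line holds one JSON object with a ``messages`` list, as produced by ``prepare_data`` above.

```python
import json
from itertools import islice


def peek_jsonl(path, n=2):
    """Parse up to the first n lines of a JSONL file into Python dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        # Each non-empty line is parsed independently as its own JSON document.
        for line in islice(f, n):
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def roles_of(record):
    """List the roles in a record's conversation, in order."""
    return [message["role"] for message in record["messages"]]
```

Calling ``roles_of(peek_jsonl("train_medical_dataset.jsonl")[0])`` on the file written above should list the system, user and assistant roles in that order.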

Data conversion needs to be done only once. Before running training we need to load the converted data.

## Load the Converted Dataset

In [None]:
train_dataset = load_dataset(
    "json",
    data_files="train_medical_dataset.jsonl",
    split="train"
)

In [None]:
train_dataset

## Load the Base Model

This code loads the base model with 4-bit quantization, loads the tokenizer, and sets the chat template.
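
To get a feel for why quantization helps, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under stated assumptions: the parameter count is rounded to one billion, and the estimate ignores quantization constants, any layers kept in higher precision, activations, and optimizer state. ``weight_memory_gb`` is a hypothetical helper name.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly one billion parameters for the 1B model.
NUM_PARAMS = 1_000_000_000

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights shrink from about 2 GB in float16 to about 0.5 GB in 4-bit, which leaves more GPU memory for training.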

In [None]:
bnb_config = BitsAndBytesConfig(
    #For 4bit quantization
    load_in_4bit=True,
    bnb_4bit_compute_dtype = torch.float16,
)

base_model_name = "EleutherAI/pythia-1B-deduped"

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(
    base_model_name)

#The base tokenizer does not have a prompt template.
#We add it here.
tokenizer.chat_template = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
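
The one-line template above is hard to read, so the following plain-Python sketch shows the layout it is meant to produce. It is an approximation, not the Jinja template itself: ``render_chat`` is a hypothetical helper, it accepts any role name (the template only handles system, user and assistant), and the exact whitespace of the real rendering may differ. The ``<|endoftext|>`` end-of-sequence token is also an assumption; check ``tokenizer.eos_token`` for the actual value.

```python
def render_chat(messages, eos_token, add_generation_prompt=False):
    """Plain-Python approximation of the chat template's output layout."""
    parts = []
    for message in messages:
        # Each turn: a role tag on its own line, the content, then the EOS token.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    if add_generation_prompt:
        # Cue the model to start writing the assistant's reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are medical professional."},
    {"role": "user", "content": "What is anemia?"},
]
print(render_chat(messages, eos_token="<|endoftext|>", add_generation_prompt=True))
```

To see the real rendering, call ``tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`` and compare it with the sketch.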


## Evaluate the Base Model

Before running any training we should see whether the base model is any good at answering our questions. We write a simple utility to perform text generation.

In [None]:
def generate(model, tokenizer, question):
    streamer = TextStreamer(tokenizer)

    messages = [
      {"role": "system", "content": "You are medical professional."},
      {"role": "user", "content": question},
    ]

    #This will convert the messages list to text and then tokenize it.
    encoded = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt").to(model.device)
      
    generated_ids = model.generate(encoded, streamer=streamer, max_new_tokens=256)

def sample_qa(model, tokenizer, dataset):
    dataset = dataset.shuffle()
    batch = dataset.select(range(1))
    sample = batch["messages"][0]
    question = sample[1]["content"]
    expected_answer = sample[2]["content"]
    
    print("Question:\n", question, "\n")
    print("Expected answer:\n", expected_answer, "\n")
    generate(model, tokenizer, question)

In [None]:
#Give it a try.
sample_qa(base_model, tokenizer, train_dataset)

The biggest problem with the model right now is that it doesn't know when to stop answering. Let's see if fine-tuning helps.

## Run Training

First we configure the training parameters. We run training for 2 epochs. Each batch will have 5 samples of training data. We set the maximum sequence length to only 300 because we're using a very small language model.
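
These numbers give a rough idea of how long training runs. The sketch below estimates the number of optimizer steps, assuming a single device, no gradient accumulation, and no sequence packing; ``optimizer_steps`` is a hypothetical helper, not a library function. When packing is enabled, short samples are combined into longer sequences, so the real step count is typically lower than this estimate.

```python
import math


def optimizer_steps(num_samples, batch_size, epochs, grad_accum=1):
    """Estimate optimizer steps, assuming one device and no sequence packing."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# 1000 samples / 5 per batch = 200 steps per epoch, times 2 epochs.
print(optimizer_steps(num_samples=1000, batch_size=5, epochs=2))  # 400
```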

In [None]:
peft_config = LoraConfig(
        lora_alpha=128,
        lora_dropout=0.05,
        r=256,
        bias="none",
        target_modules="all-linear",
        task_type="CAUSAL_LM",
)
 
args = SFTConfig(
    output_dir="medical-trained-model", # directory to save and repository id
    num_train_epochs=2,                     # number of training epochs
    per_device_train_batch_size=5,          # batch size per device during training
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=2,                        # log every 2 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    max_length=300,                         # maximum sequence length of a tokenized training sample
    packing=True,
)
 
trainer = SFTTrainer(
    model=base_model,
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
)
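
To make the LoRA settings above more concrete, the sketch below works out two quantities: the factor by which the low-rank update is scaled, and the number of trainable parameters LoRA adds to a single linear layer. It assumes standard LoRA scaling of ``lora_alpha / r`` (not the rank-stabilized variant), and the 2048 x 2048 layer size is illustrative only; check the model config for the real dimensions. Both function names are hypothetical.

```python
def lora_scaling(lora_alpha, r):
    """Standard LoRA scales the low-rank update by alpha / r."""
    return lora_alpha / r


def lora_params(in_features, out_features, r):
    """Parameters added to one linear layer: A is r x in, B is out x r."""
    return r * (in_features + out_features)


# Illustrative layer size only; check the model config for the real dimensions.
d_in, d_out = 2048, 2048

print(lora_scaling(lora_alpha=128, r=256))  # 0.5
print(lora_params(d_in, d_out, r=256))      # 1048576 trainable parameters
print(d_in * d_out)                         # 4194304 weights in the frozen layer
```

With a rank this large the adapter is about a quarter the size of such a layer, so a smaller ``r`` is worth trying if GPU memory is tight.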

Now, we can begin training. As training progresses you should see a dramatic reduction in loss. This is always a welcome sign.

In [None]:
trainer.train()

While training is running, you can use the ``nvidia-smi`` command to check GPU usage and memory availability.

## Save the Model

A checkpoint is saved after every epoch in the ``./medical-trained-model`` folder, but we should also save the final version. This saves the model as well as the tokenizer.

In [None]:
trainer.save_model()

Open ``./medical-trained-model/tokenizer_config.json`` to verify that the chat template is now set for the tokenizer.

## Run Inference

To run inference we need to load the fine-tuned model from the ``./medical-trained-model`` folder. This model is already quantized. There's no need to quantize it again.

Before going further, I recommend that you restart the notebook session or run this code to free up memory.

In [None]:
#Free up memory taken up during training
import gc

del base_model
del trainer
gc.collect()  #Collect unreachable objects so their GPU memory can be released
torch.cuda.empty_cache()

In [None]:
#Load the model
trained_model = AutoModelForCausalLM.from_pretrained(
    "medical-trained-model",
    device_map="auto")
 
trained_tokenizer = AutoTokenizer.from_pretrained(
    "medical-trained-model")

Run inference.

In [None]:
sample_qa(trained_model, trained_tokenizer, train_dataset)

## Summary

Here we fine-tuned a small base model into a basic medical chat model. After training on 1000 flashcard samples, the model picked up some medical knowledge and learned to answer in the chat format.