# Build your MedBot
© 2023, Zaka AI, Inc. All Rights Reserved.

---
The goal of this Colab is to get you more familiar with LLM fine-tuning by creating a simple QA LLM that can answer medical questions. By the end of it, you will be able to customize this LLM with any dataset.

**Just to give you a heads up:** Our model won't perform like ChatGPT or Bard, but we will at least get an idea of how we can create our own smaller versions of such powerful LLMs.

## Importing and Installing Libraries/Packages
We will start by installing our necessary packages.

**bitsandbytes**: This package allows us to run 4-bit quantization on our model.

**transformers**: This Hugging Face package allows us to load state-of-the-art models easily into our notebook.

**peft**: This package allows us to add PEFT (parameter-efficient fine-tuning) techniques, such as LoRA, to our model easily.

**accelerate**: A handy package that takes care of boilerplate such as device placement and mixed precision in a few lines of code.

**datasets**: This package allows us to import datasets from the Hugging Face platform so they can be used directly.

In [None]:
!pip install bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install datasets


In [None]:
import torch
import transformers
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForCausalLM

## Loading our model

Let's start by loading our model. We will use the GPT-NeoX-20B model by EleutherAI!

In [None]:
hf_model = "EleutherAI/gpt-neox-20b"

We will also set the bitsandbytes configuration needed for our model to run on our single Colab GPU. The parameters needed are 4-bit loading, double quantization, and the quantization type, and the compute dtype needs to be set to bfloat16.

In [None]:
bitsbytes_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enabling 4-bit quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit (NF4) quantization type
    bnb_4bit_use_double_quant=True,         # Enabling double quantization
    bnb_4bit_compute_dtype=torch.bfloat16   # Setting bfloat16
)
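
To see why 4-bit loading matters here, the rough arithmetic below estimates the memory needed for the weights alone at different precisions. It is a back-of-the-envelope sketch: the 20 billion parameter count is an assumption taken from the model's name, and `weight_memory_gib` is a hypothetical helper, not part of any library. Quantization constants, activations, and optimizer state come on top of these figures.

```python
def weight_memory_gib(num_parameters: float, bits_per_parameter: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1024**3


# Assumption: a round 20 billion parameters, taken from the model's name.
NOMINAL_PARAMETERS = 20e9

for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>18}: {weight_memory_gib(NOMINAL_PARAMETERS, bits):6.1f} GiB")
```

Under these assumptions the weights shrink from roughly 37 GiB in 16-bit to roughly 9 GiB in 4-bit, which is what brings the model within reach of a single GPU.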

We will then set our tokenizer and our model using the AutoTokenizer and AutoModelForCausalLM classes.

In [None]:
# Loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(hf_model)

# Loading the model with the bitsandbytes configuration; device_map={"": 0} places the whole model on GPU 0 (CUDA 0)
model = AutoModelForCausalLM.from_pretrained(hf_model, quantization_config=bitsbytes_config, device_map={"":0})


## Model Preprocessing

We now have to preprocess our model to prepare it for training. First, we further reduce memory consumption by calling the gradient_checkpointing_enable() function on our model. We then call the prepare_model_for_kbit_training() function so that the model can be trained with 4-bit quantization.

In [None]:
# Enabling gradient checkpointing
model.gradient_checkpointing_enable()

# Preparing the model for 4-bit quantization training
model = prepare_model_for_kbit_training(model)

Explain in your own words how 4-bit quantization affects accuracy.

**4-bit quantization is used to reduce memory usage, and it can also speed up inference on hardware with efficient low-bit kernels. However, because the model weights are stored in 4 bits instead of 16 or 32, precision decreases. (In this notebook the computation itself still runs in bfloat16, as set by bnb_4bit_compute_dtype.) Decreased precision leads to a drop in accuracy, and the extent of the drop depends on the nature of the task (what the model is trained for) and its complexity, the model architecture, and the quantization method.**
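
The loss of precision can be made concrete with a toy round trip. The sketch below uses plain absmax rounding to evenly spaced levels, which is simpler than the NF4 scheme configured earlier, and the weight values are made up for illustration. It only shows the general trend that fewer bits mean a larger rounding error.

```python
def quantize_dequantize(values, bits):
    """Round each value to the nearest signed integer level, then map it back to a float."""
    max_level = 2 ** (bits - 1) - 1                    # 7 for 4 bits, 127 for 8 bits
    scale = max(abs(v) for v in values) / max_level    # absmax scaling
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average distance between the original values and their quantized versions."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


# Hypothetical weights, chosen only for illustration.
weights = [0.013, -0.42, 0.27, 0.08, -0.15, 0.31, -0.02, 0.19]

for bits in (8, 4):
    print(f"{bits}-bit mean absolute error: {mean_abs_error(weights, bits):.5f}")
```

With only 15 usable levels, the 4-bit version lands noticeably further from the original values than the 8-bit version does.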

We will also set a function that will print the number of trainable parameters our model has.

In [None]:
def print_trainable_parameters(model):
    trainable_parameters = 0
    all_parameters = 0
    for _, param in model.named_parameters():
        all_parameters += param.numel()
        if param.requires_grad:
            trainable_parameters += param.numel()
    print(
        f"Trainable: {trainable_parameters} || All: {all_parameters} || Trainable %: {100 * trainable_parameters / all_parameters}"
    )

Finally, we will set the configuration for LoRA. The parameters needed are the rank of the update matrices, the LoRA alpha value, the target modules (which need to be set to query_key_value), the LoRA dropout rate, the bias (which should be set to none), and the task type matching the model we are using.
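
As a rough picture of what the rank and alpha control, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It is not PEFT's implementation: the matrices are tiny made-up examples, and `lora_forward` and `matvec` are hypothetical helpers. The frozen weight W is left untouched, while the trainable pair A and B adds a low-rank correction scaled by lora_alpha / r.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, lora_alpha, r):
    base = matvec(W, x)                  # frozen pretrained projection
    update = matvec(B, matvec(A, x))     # low-rank path: A is (r x in), B is (out x r)
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Made-up example with in_features=3, out_features=2, rank r=1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.1, 0.2, 0.3]]
B_initial = [[0.0], [0.0]]      # B starts at zero in the standard LoRA initialization
B_trained = [[0.5], [-0.5]]     # hypothetical values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(x, W, A, B_initial, lora_alpha=4, r=1))
print(lora_forward(x, W, A, B_trained, lora_alpha=4, r=1))
```

Because B starts at zero, the adapted layer initially reproduces the base model exactly, and training only has to learn the small matrices A and B.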

In [None]:
from peft import TaskType

config = LoraConfig(
    r=8,                                  # Rank of the LoRA update matrices (I chose 8 to match the output cell present in the colab skeleton)
    lora_alpha=32,                        # LoRA alpha (scaling factor)
    target_modules=["query_key_value"],   # Target modules for LoRA adaptation
    lora_dropout=0.05,                    # LoRA dropout rate
    bias="none",                          # Do not train any bias terms
    task_type=TaskType.CAUSAL_LM          # Task type for causal language modeling
)

# Applying the configuration above to the model using the get_peft_model function
model = get_peft_model(model, config)

# Print the trainable parameters of the model
print_trainable_parameters(model)

Trainable: 8650752 || All: 10597552128 || Trainable %: 0.08162971878329976
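
The trainable count printed above can be reproduced by hand. The sketch below assumes the published GPT-NeoX-20B dimensions (hidden size 6144, 44 transformer layers) and that each `query_key_value` layer maps the hidden size to three times the hidden size. These figures are not stated in this notebook, and `lora_parameter_count` is a hypothetical helper.

```python
def lora_parameter_count(rank, in_features, out_features, num_layers):
    """Each adapted layer adds A (rank x in_features) and B (out_features x rank)."""
    per_layer = rank * in_features + out_features * rank
    return per_layer * num_layers


HIDDEN_SIZE = 6144   # assumption: hidden size from the published model config
NUM_LAYERS = 44      # assumption: number of transformer layers

trainable = lora_parameter_count(
    rank=8,
    in_features=HIDDEN_SIZE,
    out_features=3 * HIDDEN_SIZE,   # query, key and value are fused into one projection
    num_layers=NUM_LAYERS,
)
print(trainable)
print(100 * trainable / 10597552128)   # percentage relative to the "All" count printed above
```

The result matches the 8650752 trainable parameters reported above. The "All" figure is about half of 20 billion, most likely because bitsandbytes packs two 4-bit weights into each stored byte, so numel() undercounts the quantized layers.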


## Dataset Loading

Let's load our medical dataset from Hugging Face. We will use the `medalpaca/medical_meadow_wikidoc_patient_information` dataset. You can access it [here](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information).

In [None]:
# Loading the dataset from Hugging Face
data = load_dataset("medalpaca/medical_meadow_wikidoc_patient_information")

# Tokenizing the 'output' column of every sample in batches, using a lambda function
data = data.map(lambda samples: tokenizer(samples['output']), batched=True)

The train split contains 5942 examples.

## Model Training and Testing

Now we train the model using the transformers library. Before doing so, we set the tokenizer's padding token to the end-of-sequence token, since our tokenizer does not come with a padding token and batching requires one. Your goal here is to tune the parameters until you get a model that runs on a single Colab GPU.
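
To illustrate what padding and the data collator do for causal language modeling, here is a simplified pure-Python sketch of the idea. It is not the library's implementation: the token ids are made up, right-padding is assumed, and `collate_for_causal_lm` is a hypothetical helper. Shorter sequences are padded to a common length, and padded positions are given the label -100 so that the loss ignores them.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_for_causal_lm(sequences, pad_token_id):
    """Pad token-id lists to equal length and build labels that mask the padding."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        padding = [pad_token_id] * (max_len - len(seq))
        input_ids = seq + padding
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([1] * len(seq) + [0] * len(padding))
        # For causal LM the labels are the inputs themselves, minus the padded positions.
        batch["labels"].append(
            [IGNORE_INDEX if token == pad_token_id else token for token in input_ids]
        )
    return batch


# Made-up token ids; 0 plays the role of the end-of-sequence token used for padding.
batch = collate_for_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(batch)
```

One side effect of reusing the end-of-sequence token for padding is that any genuine end-of-sequence token in the data is masked out of the labels as well.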

In [None]:
from transformers import TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Setting the tokenizer padding to be 'eos' tokens
tokenizer.pad_token = tokenizer.eos_token

# Initialize the data collator, this will take care of padding since it is not done earlier
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Training Arguments
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    max_steps=10,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    optim="adamw_torch",      # Specifying PyTorch's AdamW optimizer
    report_to="none",         # Disable logging integrations such as wandb
)

# Initializing the Trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Disabling the KV cache silences the warnings (it is incompatible with gradient checkpointing)
model.config.use_cache = False

# Train the model!
trainer.train()


| Step | Training Loss |
| --- | --- |
| 1 | 7.5259 |
| 2 | 4.0079 |
| 3 | 6.7487 |
| 4 | 4.9784 |
| 5 | 8.1746 |
| 6 | 7.1793 |
| 7 | 8.8178 |
| 8 | 10.0928 |
| 9 | 7.6451 |
| 10 | 11.5465 |


TrainOutput(global_step=10, training_loss=7.671693658828735, metrics={'train_runtime': 173.1133, 'train_samples_per_second': 0.231, 'train_steps_per_second': 0.058, 'total_flos': 509412616962048.0, 'train_loss': 7.671693658828735, 'epoch': 0.006731740154830024})
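
The `epoch` and `train_samples_per_second` values in this output follow from the training arguments. The sketch below redoes that arithmetic, assuming a single GPU and a training split of 5942 examples (the size reported when the dataset was loaded). `epoch_fraction` is a hypothetical helper.

```python
def epoch_fraction(optimizer_steps, per_device_batch_size, accumulation_steps,
                   dataset_size, num_devices=1):
    """Fraction of the dataset seen after a given number of optimizer steps."""
    samples_seen = (
        optimizer_steps * per_device_batch_size * accumulation_steps * num_devices
    )
    return samples_seen / dataset_size


# 10 optimizer steps, each built from 4 micro-batches of 1 sample.
samples_seen = 10 * 1 * 4
print(epoch_fraction(10, 1, 4, 5942))   # compare with 'epoch' in TrainOutput
print(samples_seen / 173.1133)          # compare with 'train_samples_per_second'
```

In other words, the ten steps only touched 40 of the 5942 examples, well under 1% of one epoch, so this run is a smoke test rather than real fine-tuning.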

Explain 4 of the training arguments you used in your Trainer, how they are used, and what they represent.

1. **per_device_train_batch_size: Sets the batch size for training on each GPU. It determines how many samples are processed simultaneously per device. I set it to 1, which means that only one sample is processed per forward pass. Since I am working with a single GPU, the small batch size fits the limited memory.**
2. **gradient_accumulation_steps: Accumulates gradients over several forward and backward passes before the optimizer updates the weights. With per_device_train_batch_size=1 and gradient_accumulation_steps=4, the effective batch size becomes 1 x 4 = 4. This allows me to simulate a larger batch size without the memory cost of a larger batch. It also helps fit large models into limited GPU memory, as in my case.**
3. **warmup_steps=2: Gradually increases the learning rate from 0 to the specified learning_rate over the first 2 training steps. This stabilizes the training process early on, as it acts as a ramp-up that prevents sudden large updates.**
4. **max_steps=10: Limits training to 10 optimizer steps no matter what the number of epochs is. I used this because my Colab session would crash if I omitted it or specified a number of epochs, which is why I did not use the num_train_epochs argument. I read that it is also useful for debugging or experimentation.**
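
The claim that gradient accumulation simulates a larger batch can be checked on a toy problem. The sketch below uses a one-parameter linear model with a mean squared error loss. The data points and the helper `mse_gradient` are made up for illustration. Each micro-batch gradient is divided by the number of accumulation steps before being summed, so the total matches the gradient of the full batch.

```python
def mse_gradient(w, batch):
    """Gradient with respect to w of mean((w * x - y) ** 2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up (x, y) samples and a starting weight.
samples = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]
w = 0.5

# One pass over a real batch of 4 samples.
full_batch_gradient = mse_gradient(w, samples)

# Four passes over micro-batches of 1 sample, accumulating before the update.
accumulation_steps = 4
accumulated_gradient = 0.0
for sample in samples:
    accumulated_gradient += mse_gradient(w, [sample]) / accumulation_steps

print(full_batch_gradient, accumulated_gradient)
```

Only one sample has to be held in memory at a time, yet the weight update is the same as for a batch of four, up to floating-point rounding.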

We now save our model with save_pretrained() so that the LoRA configuration can be reloaded later. For a PEFT model, this writes the LoRA adapter weights and configuration. The next block saves them to a separate folder.

In [None]:
# Saving the model to a separate folder
saved_model = model if hasattr(model, "save_pretrained") else model.module
saved_model.save_pretrained("outputs")

Before testing our model, we have to load the LoRA configuration from the saved folder and apply it to our model using the get_peft_model() function.

In [None]:
# Loading the LoRA configuration saved in the "outputs" folder
lora_configs = LoraConfig.from_pretrained("outputs")

# Applying LoRA configurations to the model
model = get_peft_model(model, lora_configs)



We need to set our prompt as a variable, and also our device currently in use.

In [None]:
# Setting the prompt to ask about the symptoms of flu

prompt = "What are the symptoms of flu?"
device = "cuda:0"

# Moving the model to selected device
model = model.to(device)

Finally, we will make our LLM generate text based on the data. First, we use the tokenizer() function on our prompt.

In [None]:
# return_tensors = "pt" as we're using PyTorch
inputs = tokenizer(prompt, return_tensors="pt").to(device)

Let's now use the generate() function on our model, and print the decoded version of our output.

In [None]:
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


What are the symptoms of flu?

The symptoms of flu are similar to those of a cold. They include:

fever

cough

sore throat

headache

muscle aches


**Testing using another prompt that asks about symptoms in an indirect way**

In [None]:
prompt = "How can someone know if they are about to get a heart attack?"
device = "cuda:0"

# return_tensors = "pt" as we're using PyTorch
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


How can someone know if they are about to get a heart attack?

I have a friend who is a nurse. She told me that she had a patient who had a heart attack. She said that she knew it was coming because she had a feeling. She


In [None]:
# Trying the same question in a more direct way
prompt = "What are the symptoms of a heart attack?"
device = "cuda:0"

# return_tensors = "pt" as we're using PyTorch
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


What are the symptoms of a heart attack?

A:

The symptoms of a heart attack are chest pain, shortness of breath, and/or a feeling of indigestion.

A:

The symptoms of
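
Both heart-attack answers above stop mid-sentence. That is most likely because generation is capped at max_new_tokens=40, not because the model chose to stop. The toy loop below sketches this behavior under the assumption of greedy decoding. The lookup table `NEXT_TOKEN` is a made-up stand-in for a real language model, and `greedy_generate` is a hypothetical helper, not the library's generate().

```python
EOS = "<eos>"

# Hypothetical next-token table standing in for a real language model.
NEXT_TOKEN = {
    "The": "symptoms",
    "symptoms": "of",
    "of": "flu",
    "flu": "include",
    "include": "fever",
    "fever": EOS,
}


def greedy_generate(prompt_tokens, max_new_tokens):
    """Append the most likely next token until EOS or the token budget is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], EOS)
        if next_token == EOS:
            break                      # the model finished on its own
        tokens.append(next_token)
    return tokens                      # may be cut off if the budget ran out first


print(" ".join(greedy_generate(["The"], max_new_tokens=2)))
print(" ".join(greedy_generate(["The"], max_new_tokens=40)))
```

With a budget of 2 the toy sentence is cut off after "of", while a budget of 40 lets it run to its natural end. Raising max_new_tokens in the cells above should similarly give the model room to finish its answers.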
