# Fine-tuning Llama 3
We’ll fine-tune the Llama 3 8B-Chat model using the ruslanmv/ai-medical-chatbot dataset. The dataset contains about 257k dialogues between patients and doctors.

Model path on Kaggle: /kaggle/input/llama-3/transformers/8b-chat-hf/1

Accelerator: GPU P100

The Kaggle secrets HUGGINGFACE_TOKEN and wandb must be active for this notebook.



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Ignore the warnings

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
%%capture
%pip install -U transformers 
%pip install -U datasets 
%pip install -U accelerate 
%pip install -U peft 
%pip install -U trl 
%pip install -U bitsandbytes 
%pip install -U wandb

transformers: A library for state-of-the-art natural language processing.

datasets: A library for easily accessing and sharing datasets.

accelerate: A library for optimizing and accelerating model training.

peft: A library for parameter-efficient fine-tuning.

trl: A library for post-training language models, including supervised fine-tuning (used here through SFTTrainer) and reinforcement learning.

bitsandbytes: A library for 8-bit optimizers and 8-bit/4-bit quantization.

wandb: A tool for experiment tracking and model management.


In [4]:
# !pip uninstall peft huggingface_hub
# !pip install peft==0.11.0 huggingface_hub==0.23.5

In [5]:
pip show peft huggingface_hub

Name: peft
Version: 0.14.0
Summary: Parameter-Efficient Fine-Tuning (PEFT)
Home-page: https://github.com/huggingface/peft
Author: The HuggingFace team
Author-email: benjamin@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: accelerate, huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch, tqdm, transformers
Required-by: 
---
Name: huggingface-hub
Version: 0.27.1
Summary: Client library to download and publish models, datasets and other repos on the huggingface.co hub
Home-page: https://github.com/huggingface/huggingface_hub
Author: Hugging Face, Inc.
Author-email: julien@huggingface.co
License: Apache
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, fsspec, packaging, pyyaml, requests, tqdm, typing-extensions
Required-by: accelerate, datasets, peft, timm, tokenizers, torchtune, transformers
Note: you may need to restart the kernel to use updated packages.


In [5]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format

This cell imports everything the notebook needs from Hugging Face Transformers, PEFT (Parameter-Efficient Fine-Tuning), and related tools. Here's a breakdown:

Hugging Face Transformers imports:

AutoModelForCausalLM, AutoTokenizer: For loading pre-trained language models and tokenizers.

BitsAndBytesConfig: For configuring quantization when loading the model.

HfArgumentParser, TrainingArguments: For parsing arguments and configuring model training.

pipeline, logging: For creating NLP pipelines and controlling log output.

PEFT imports:

LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model: For applying parameter-efficient fine-tuning techniques to models.

Other imports:

os: Standard-library module for operating system interactions.

torch, wandb: Third-party packages for deep learning (PyTorch) and experiment tracking (Weights & Biases).

datasets: For loading datasets.

trl: Specific tools for training language models, including SFTTrainer and setup_chat_format.




In [6]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

hf_token = user_secrets.get_secret("HUGGINGFACE_TOKEN")
login(token = hf_token)

wb_token = user_secrets.get_secret("wandb")

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune Llama 3 8B on Medical Dataset', 
    job_type="training", 
    anonymous="allow"
)

[34m[1mwandb[0m: Currently logged in as: [33mnabarupeducation[0m ([33mnabarupeducation-iit-kharagpur[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


In [7]:
base_model = "/kaggle/input/llama-3/transformers/8b-chat-hf/1"
dataset_name = "ruslanmv/ai-medical-chatbot"
new_model = "llama-3-8b-chat-doctor"

In [8]:
torch_dtype = torch.float16
attn_implementation = "eager"

## Loading the model and tokenizer

In this part, we’ll load the model from the Kaggle input directory. Because of GPU memory constraints, we can’t load the model at full precision, so we load it with 4-bit (NF4) quantization.

Our goal in this project is to reduce memory usage and speed up the fine-tuning process.

In [9]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
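To see why 4-bit loading matters here, a back-of-the-envelope estimate of the weight memory can help. The sketch below assumes roughly 8.03 billion parameters for Llama 3 8B and 16 GB of memory on the P100; neither figure is stated in this notebook. It also ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so treat the numbers as rough lower bounds.

```python
# Rough weight-memory estimate; parameter count and GPU size are assumptions.
N_PARAMS = 8.03e9     # approximate parameter count of Llama 3 8B (assumed)
GPU_MEMORY_GB = 16    # memory of a Kaggle P100 (assumed)


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"fp16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB")
print(f"fp16 weights fit on the GPU: {fp16_gb < GPU_MEMORY_GB}")
```

Under these assumptions the half-precision weights alone roughly fill the GPU, while the 4-bit weights leave room for activations and the adapter's training state.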

Load the tokenizer. The Llama 3 chat tokenizer already ships with its own chat template, so we don't call setup_chat_format here: that helper applies the ChatML template and raises "ValueError: Chat template is already added to the tokenizer" when a template is already present.




In [10]:
# Load tokenizer (its built-in Llama 3 chat template is used as-is)
tokenizer = AutoTokenizer.from_pretrained(base_model)

## Adding the adapter to the layer
Fine-tuning the full model would take a lot of time and memory. Instead, we attach LoRA adapter layers with a small number of trainable parameters and train only those, which makes the entire process faster and more memory-efficient.




In [11]:
# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, peft_config)
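To get a feel for how small the adapter is, the number of LoRA parameters can be estimated by hand. The sketch below is only an estimate: the layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128), not values read from the loaded model. Each targeted linear layer gets two low-rank matrices, A (r x in_features) and B (out_features x r).

```python
# Llama 3 8B dimensions (assumed from the published model configuration).
HIDDEN = 4096           # hidden_size
INTERMEDIATE = 14336    # intermediate_size of the MLP
KV_DIM = 8 * 128        # 8 key/value heads x head dimension 128
N_LAYERS = 32           # num_hidden_layers
R, LORA_ALPHA = 16, 32  # values from the LoraConfig in this notebook

# (in_features, out_features) of each targeted linear layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA pair: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total = per_layer * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"LoRA parameters per transformer layer: {per_layer:,}")
print(f"LoRA parameters in total: {total:,}")
print(f"LoRA scaling factor (lora_alpha / r): {scaling}")
```

On the real model, `model.print_trainable_parameters()` from PEFT reports the actual trainable and total parameter counts.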

## Loading the dataset
To load and pre-process our dataset, we:

1. Load the ruslanmv/ai-medical-chatbot dataset, shuffle it, and select only the first 1,000 rows of the shuffled data. This will significantly reduce the training time.

2. Format the chat template to make it conversational. Combine the patient questions and doctor responses into a "text" column.

3. Display a sample from the text column (the “text” column has a chat-like format with special tokens).




In [12]:
#Importing the dataset
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=65).select(range(1000)) # Only use 1000 samples for quick demo

def format_chat_template(row):
    row_json = [{"role": "user", "content": row["Patient"]},
               {"role": "assistant", "content": row["Doctor"]}]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=4,
)

dataset['text'][3]

README.md:   0%|          | 0.00/863 [00:00<?, ?B/s]

dialogues.parquet:   0%|          | 0.00/142M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/256916 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

'<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nFell on sidewalk face first about 8 hrs ago. Swollen, cut lip bruised and cut knee, and hurt pride initially. Now have muscle and shoulder pain, stiff jaw(think this is from the really swollen lip),pain in wrist, and headache. I assume this is all normal but are there specific things I should look for or will I just be in pain for a while given the hard fall?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello and welcome to HCM,The injuries caused on various body parts have to be managed.The cut and swollen lip has to be managed by sterile dressing.The body pains, pain on injured site and jaw pain should be managed by pain killer and muscle relaxant.I suggest you to consult your primary healthcare provider for clinical assessment.In case there is evidence of infection in any of the injured sites, a course of antibiotics may have to be started to control the infection.Thanks and take careDr Shailja P Wahal<|eot_i
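The layout of the sample above can be reproduced with a few lines of plain Python, which makes the role of each special token easier to see. This is an illustrative re-implementation inferred from the printed sample, not the tokenizer's actual template code, and the two example messages are hypothetical.

```python
def format_llama3_chat(messages):
    """Mimic the chat layout seen in the sample: header, blank line, content, end-of-turn."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    return text


# Hypothetical patient/doctor exchange, used only for illustration
example = format_llama3_chat([
    {"role": "user", "content": "I have a headache."},
    {"role": "assistant", "content": "Please rest and drink water."},
])
print(example)
```

In the notebook itself, `tokenizer.apply_chat_template` remains the source of truth for the format.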

4. Split the dataset into a training and validation set.


In [13]:
dataset = dataset.train_test_split(test_size=0.1)

## Configuring and training the model
We are setting the model hyperparameters so that we can run it on Kaggle. You can learn about each hyperparameter by reading the Fine-Tuning Llama 2 tutorial.

We are fine-tuning the model for one epoch and logging the metrics using Weights & Biases.




In [14]:
training_arguments = TrainingArguments(
    output_dir=new_model,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="wandb"
)

This cell configures the training run with the TrainingArguments class. Here's a breakdown of the most relevant parts:

output_dir=new_model: Specifies the directory where the trained model and other outputs will be saved.

per_device_train_batch_size=1 and per_device_eval_batch_size=1: Sets the batch size for training and evaluation to 1 per device.

gradient_accumulation_steps=2: Accumulates gradients over 2 forward/backward passes before performing an optimizer step, effectively simulating a larger batch size.

optim="paged_adamw_32bit": Chooses the optimizer, in this case the paged AdamW optimizer from bitsandbytes with 32-bit optimizer states. Paging lets optimizer state spill to CPU memory when GPU memory runs short.

num_train_epochs=1: Sets the number of training epochs to 1.

evaluation_strategy="steps" and eval_steps=0.2: Specifies that evaluation runs at regular step intervals. Because eval_steps is a float below 1, it is interpreted as a fraction of the total training steps, so evaluation runs after every 20% of the run.

logging_steps=1 and logging_strategy="steps": Logs training metrics every step.

warmup_steps=10: Sets the number of warmup steps for learning rate scheduling.

learning_rate=2e-4: Sets the learning rate to 0.0002.

fp16=False and bf16=False: Disables fp16 and bf16 mixed-precision training.

group_by_length=True: Groups sequences of similar lengths together to optimize training efficiency.

report_to="wandb": Specifies that training metrics should be reported to Weights and Biases (wandb) for tracking.
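The interaction between batch size, gradient accumulation, and a fractional eval_steps is easier to follow as arithmetic. The sketch below assumes a single GPU and the 900 training rows that remain after the 90/10 split of the 1,000 sampled dialogues. It only mirrors the bookkeeping and does not call the Trainer.

```python
import math

train_samples = 900              # 90% of the 1,000 sampled rows
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
num_train_epochs = 1
eval_steps_ratio = 0.2           # eval_steps given as a fraction of all steps

# One batch per sample here, and one optimizer step per 2 accumulated batches
batches_per_epoch = math.ceil(train_samples / per_device_train_batch_size)
optimizer_steps = (batches_per_epoch // gradient_accumulation_steps) * num_train_epochs
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
eval_every = math.ceil(optimizer_steps * eval_steps_ratio)

print(f"optimizer steps: {optimizer_steps}")
print(f"effective batch size: {effective_batch_size}")
print(f"evaluate every {eval_every} steps")
```

These numbers agree with the training log in this notebook, which ends at global step 450 and reports validation loss at steps 90, 180, 270, 360, and 450.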


In [15]:
# Add a new pad_token to the tokenizer
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))  # Update model's embeddings to include the new token

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(128257, 4096)

In [16]:
# Preprocessing function to tokenize and truncate/pad sequences
def preprocess_function(examples):
    return tokenizer(
        examples["text"],
        max_length=512,
        truncation=True,
        padding="max_length"
    )

# Apply preprocessing to train and test datasets
tokenized_train_dataset = dataset["train"].map(preprocess_function, batched=True)
tokenized_test_dataset = dataset["test"].map(preprocess_function, batched=True)


Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [17]:
# We’ll now set up a supervised fine-tuning (SFT) trainer and provide
# the train and evaluation datasets, LoRA configuration, training arguments,
# tokenizer, and model. The sequences were already truncated or padded to
# 512 tokens during preprocessing to avoid exceeding GPU memory during training.

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
)

In [18]:
trainer.train()



| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 90   | 1.9758        | 2.512768        |
| 180  | 2.524         | 2.487967        |
| 270  | 2.125         | 2.454458        |
| 360  | 2.6733        | 2.422145        |
| 450  | 2.5278        | 2.4155          |


Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.


TrainOutput(global_step=450, training_loss=2.463011441230774, metrics={'train_runtime': 1989.7227, 'train_samples_per_second': 0.452, 'train_steps_per_second': 0.226, 'total_flos': 2.08655911747584e+16, 'train_loss': 2.463011441230774, 'epoch': 1.0})

## Model evaluation
When you finish the Weights & Biases session, it’ll generate the run history and summary.


In [19]:
wandb.finish()
model.config.use_cache = True

Run history:
eval/loss,█▆▄▁▁
eval/runtime,▂▁▂█▂
eval/samples_per_second,███▁▇
eval/steps_per_second,███▁▇
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇█
train/global_step,▁▁▁▁▁▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇██
train/grad_norm,█▅▄▄▅▄▃▆▃▃▂▃▅▅▃▄▃▄▃▃▃▄▃▁▃▃▄▃▅▃▃▃▆▅▄▃▂▄▃▃
train/learning_rate,▄▅███▇▇▇▇▇▆▅▅▅▅▅▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁
train/loss,▇▅▄▃▄▅▂▁▄▇▆▂▃▅▄▃▂█▆▄▃▅▇▄▁▄▃█▄▇▃▃▆▃▂▃▁▅▃▃

Run summary:
eval/loss,2.4155
eval/runtime,83.9591
eval/samples_per_second,1.191
eval/steps_per_second,1.191
total_flos,2.08655911747584e+16
train/epoch,1.0
train/global_step,450.0
train/grad_norm,1.53013
train/learning_rate,0.0
train/loss,2.5278
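A loss value is easier to interpret once converted to perplexity, the exponential of the mean per-token cross-entropy. The short sketch below applies that conversion to the final eval/loss from the run summary, assuming the reported loss is a mean cross-entropy in nats, which is the usual convention for causal language model training.

```python
import math

eval_loss = 2.4155  # final eval/loss from the run summary

# Perplexity is exp(mean cross-entropy); lower is better
perplexity = math.exp(eval_loss)
print(f"validation perplexity: {perplexity:.2f}")
```

With only 1,000 samples and a single epoch, this is a quick sanity check rather than a measure of medical answer quality.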


In [20]:
messages = [
    {
        "role": "user",
        "content": "Hello doctor, I have a bad scar on my forehead. How do I get rid of it?"
    }
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, 
                                       add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors='pt', padding=True, 
                   truncation=True).to("cuda")

outputs = model.generate(**inputs, max_length=150, 
                         num_return_sequences=1)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text.split("assistant")[1])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




Hi. For the scar on your forehead, you can use silicone gel sheeting. It is available in the market. You can apply it on the scar and leave it overnight. Repeat this process for a few days. You can also use vitamin E oil on the scar. Apply it on the scar and leave it overnight. Repeat this process for a few days. You can also use aloe vera gel on the scar. Apply it on the scar and leave it overnight. Repeat this process for a few days. Hope I have answered your query. Let me know if I can assist you further.
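Splitting the decoded text on the word "assistant" and taking index 1 works for the answer above, but it cuts the reply short whenever the reply itself contains that word. A slightly more robust variant, sketched here on a hypothetical decoded string, keeps everything after the first marker. It still assumes the user prompt does not contain the marker.

```python
def extract_reply(decoded: str, marker: str = "assistant") -> str:
    """Return everything after the first role marker, without surrounding whitespace."""
    _, found, reply = decoded.partition(marker)
    return reply.strip() if found else decoded.strip()


# Hypothetical decoded output whose reply mentions the marker word again
decoded = (
    "user\n\nHow do I treat a scar?"
    "assistant\n\nAsk your assistant nurse for silicone gel."
)

print(extract_reply(decoded))
print(decoded.split("assistant")[1].strip())  # the original approach loses text here
```

Another option, not shown here, is to decode only the token ids that `generate` appended after the prompt.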


## Saving the model file
We’ll now save the fine-tuned adapter and push it to the Hugging Face Hub. The Hub API will automatically create the repository and store the adapter file.




In [21]:
trainer.model.save_pretrained(new_model)
trainer.model.push_to_hub(new_model, use_temp_dir=False)

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/2.27G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/NabarupGhosh/llama-3-8b-chat-doctor/commit/385d95eed51dc630a0bfaa3809c51e57d4a6d8a0', commit_message='Upload model', commit_description='', oid='385d95eed51dc630a0bfaa3809c51e57d4a6d8a0', pr_url=None, repo_url=RepoUrl('https://huggingface.co/NabarupGhosh/llama-3-8b-chat-doctor', endpoint='https://huggingface.co', repo_type='model', repo_id='NabarupGhosh/llama-3-8b-chat-doctor'), pr_revision=None, pr_num=None)