# Fine-tuning Llama 3.2 for customer service chat

**Introduction**: Customer service chat is a common use case for LLMs. The purpose of this notebook is to determine how well the Llama 3.2-3B-Instruct base model can be fine-tuned to adopt the style of a helpful customer service bot.

**Findings**: The model was fine-tuned on 3000 examples from the Bitext customer service training [dataset](https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset). As a result, it adopted the style of a customer service bot and provided relevant responses.

**Summary of steps**:
*   Load dataset, model and tokenizer
*   Fine-tune QLoRA adapters
*   Perform manual test to see bot adopting customer service style
*   Merge with foundational model
*   Convert to GGUF and quantize to facilitate its use locally in the Jan application





# Setup: libraries, dependencies, configurations

Load libraries and modules from Hugging Face, Weights & Biases, and Google (for accessing Drive).

In [2]:
%%capture
%pip install -U transformers
%pip install -U datasets
%pip install -U accelerate
%pip install -U peft
%pip install -U trl
%pip install -U bitsandbytes
%pip install -U wandb

from transformers import (
  AutoModelForCausalLM,
  AutoTokenizer,
  BitsAndBytesConfig,
  HfArgumentParser,
  TrainingArguments,
  pipeline,
  logging,
)
from peft import (
  LoraConfig,
  PeftModel,
  prepare_model_for_kbit_training,
  get_peft_model,
)
import os, torch, wandb, pprint
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig, setup_chat_format
from google.colab import userdata
from huggingface_hub import HfApi, login as hf_login

Mount Google Drive, connect to HF and W&B. Initialize configuration.

In [3]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive/')

# connect to huggingface
hf_auth_token = userdata.get('HF_TOKEN')
hf_login(hf_auth_token)

# connect to weights and biases
wb_auth_token = userdata.get('WB_TOKEN')
wandb.login(key=wb_auth_token)

# initialize wandb.ai with this project
run = wandb.init(
  project='Fine-tune Llama 3.2 3B Instruct on Customer Service Dataset',
  job_type="training",
  anonymous="allow"
)

# base model config
base_model_namespace = "meta-llama"
base_model = "Llama-3.2-3B-Instruct"
base_model_name = f"{base_model_namespace}/{base_model}"

# base model cache config
base_model_cache_base_directory = "/content/drive/MyDrive/.model_cache"
base_model_provider = "huggingface"
base_model_cache_directory = f"{base_model_cache_base_directory}/{base_model_provider}/{base_model_name}"

# base dataset cache config
base_dataset_cache_base_directory = "/content/drive/MyDrive/.dataset_cache"
base_dataset_provider = "huggingface"
base_dataset_namespace = "bitext"
base_dataset = "Bitext-customer-support-llm-chatbot-training-dataset"
base_dataset_name = f"{base_dataset_namespace}/{base_dataset}"
base_dataset_cache_directory = f"{base_dataset_cache_base_directory}/{base_dataset_provider}/{base_dataset_name}"

# project config
custom_model_name = "llama-3.2-3B-customer-service-chatbot"
model_directory = f"{base_model_cache_base_directory}/model-llama-3.2-3b-instruct"

# mixed precision datatypes
torch_dtype = torch.float16
use_fp16 = True
use_bf16 = False

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: chrisalehman (chrisalehman-autism-center-for-treatment) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


# Load and configure model, prepare dataset

Load the model and tokenizer. The base model is loaded in a memory-optimized form: weights are stored in 4-bit quantized (NF4) format, while computations run in bfloat16 or float16, depending on the hardware. This substantially reduces GPU memory use, which is what makes fine-tuning on a single GPU practical.
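
To get a feel for why 4-bit storage matters, here is a back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count is an assumed round figure for a 3B-class model, and the estimate ignores quantization constants, activations, and optimizer state, so treat the numbers as rough.

```python
# Rough weight-memory estimate; every figure here is an approximation.
# ASSUMPTION: ~3.2 billion parameters, a round figure for a 3B-class model.
N_PARAMS = 3.2e9


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Return the storage needed for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit weights, ignoring quantization constants

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{nf4_gb:.1f} GB")
```

The 4x reduction applies only to the frozen base weights; the LoRA adapters and the computations still use 16-bit precision.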

NOTE: I'm getting a package conflict error relating to flash attention 2, so disabling for now.

In [4]:
# attention implementation: "eager" because flash attention 2 is disabled (see note above)
attn_implementation = "eager"

# quantization config
bnb_config = BitsAndBytesConfig(
  load_in_4bit=True, # 4-bit quantized version
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch_dtype,
  bnb_4bit_use_double_quant=True,
)

# load model
model = AutoModelForCausalLM.from_pretrained(
  base_model_name,
  quantization_config=bnb_config,
  device_map="auto",
  attn_implementation=attn_implementation
)

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

# print the dtype and attention implementation in use
print(torch_dtype)
print(attn_implementation)


torch.float16
eager


Prepare the dataset. Load, shuffle, split, and select sample size. The dataset contains 26.9k samples.

We'll set a small sample size of 3000 for demonstration purposes.
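
The next cell renders each row with the tokenizer's chat template. As a rough picture of what that rendering produces, the sketch below hand-builds the header and end-of-turn layout visible in the printed example. The helper `render_llama3_chat` is hypothetical and simplified: the real template also injects date lines into the system turn, and the assistant reply here is made up.

```python
def render_llama3_chat(messages):
    """Simplified rendering of the Llama 3 header / end-of-turn layout.

    ASSUMPTION: this hand-written helper only approximates the tokenizer's
    real template, which also adds date lines to the system turn.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # each turn starts with a role header followed by a blank line
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # content is trimmed and closed with an end-of-turn token
        parts.append(message["content"].strip() + "<|eot_id|>")
    return "".join(parts)


example = [
    {"role": "system", "content": "\nYou are a top-rated customer service agent named John.\n"},
    {"role": "user", "content": "where do i enter a different shipping address"},
    {"role": "assistant", "content": "Log in and open the address settings."},  # made-up reply
]

rendered = render_llama3_chat(example)
print(rendered)
```

In the notebook itself, `tokenizer.apply_chat_template` does this work, so the sketch is only meant to make the printed training example easier to read.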

In [5]:
sample_size = 3000
dataset = load_dataset(base_dataset_name, split="train")
dataset = dataset.shuffle(seed=65).select(range(sample_size))

instruction = """
You are a top-rated customer service agent named John.
Be polite to customers and answer all their questions.
"""

def format_chat_template(row):
  row_json = [{"role": "system", "content": instruction },
              {"role": "user", "content": row["instruction"]},
              {"role": "assistant", "content": row["response"]}]

  row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
  return row

dataset = dataset.map(
  format_chat_template,
  num_proc= 4,
)

# output an example
pprint.pprint(dataset['text'][0])

# split into train (90%) and test (10%)
dataset = dataset.train_test_split(test_size=0.1)


('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n'
 '\n'
 'Cutting Knowledge Date: December 2023\n'
 'Today Date: 08 Feb 2025\n'
 '\n'
 'You are a top-rated customer service agent named John.\n'
 'Be polite to customers and answer all their '
 'questions.<|eot_id|><|start_header_id|>user<|end_header_id|>\n'
 '\n'
 'where do i enter a different shipping '
 'address<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n'
 '\n'
 "Ah, I understand that you're looking to enter a different shipping address. "
 'Allow me to guide you through the process:\n'
 '\n'
 '1. Log in to your account on our website.\n'
 '2. Navigate to the "My Account" or "Profile" section, which can usually be '
 'found in the top right corner of the page.\n'
 '3. Look for the "Shipping Addresses" or similar option. It may also be '
 'labeled as "Manage Addresses" or "Delivery Information."\n'
 '4. Click on that option to access your saved addresses.\n'
 "5. To enter a different shipping address, you'l

Find the trainable modules in the model. We'll use a custom function to extract the modules; then we'll apply them to the LoRA config.
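
The core of that extraction is just taking the last component of each dotted module name. The sketch below shows the idea on a hand-written list of names; the names are assumptions that follow the usual Llama layer layout, and the sketch skips the `Linear4bit` type check that the real function performs.

```python
def leaf_module_names(qualified_names, exclude=("lm_head",)):
    """Collect the last path component of each dotted module name."""
    leaves = {name.split(".")[-1] for name in qualified_names}
    return sorted(leaves - set(exclude))


# ASSUMPTION: illustrative names, not read from the actual model.
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj",  # repeated leaf names collapse into one entry
    "lm_head",                          # output head is excluded
]

print(leaf_module_names(example_names))
```

Because LoRA matches target modules by name, one entry such as `q_proj` covers that projection in every layer.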

In [None]:
import bitsandbytes as bnb

def find_all_module_names(model):

  cls = bnb.nn.Linear4bit
  lora_module_names = set()

  for name, module in model.named_modules():
    if isinstance(module, cls):
      names = name.split('.')
      lora_module_names.add(names[0] if len(names) == 1 else names[-1])

  if 'lm_head' in lora_module_names:  # needed for 16 bit
    lora_module_names.remove('lm_head')

  return list(lora_module_names)

modules = find_all_module_names(model)
print(modules)

In [None]:
# LoRA config
peft_config = LoraConfig(
  r=16,
  lora_alpha=32,
  lora_dropout=0.05,
  bias="none",
  task_type="CAUSAL_LM",
  target_modules=modules
)

# clear out pre-existing template to avoid error when replacing it
tokenizer.chat_template = None

# replace chat template (function from trl package)
model, tokenizer = setup_chat_format(model, tokenizer)
print(tokenizer.chat_template)

model = get_peft_model(model, peft_config)
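
To see what `r=16` and `lora_alpha=32` imply, the following sketch counts the parameters LoRA adds under assumed layer shapes. The dimensions are illustrative values for a Llama-style 3B model, not values read from this notebook, so check `model.config` before relying on them.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# ASSUMPTION: illustrative shapes for a Llama-style 3B model.
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 3072, 1024, 8192, 28
R, ALPHA = 16, 32  # values from the LoRA config above

# (d_in, d_out) for each targeted linear layer in one transformer block
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * N_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"LoRA parameters per block: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"update scaling (alpha/r):  {scaling}")
```

For the real count, PEFT models expose `model.print_trainable_parameters()`, which reports trainable versus total parameters.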

Configure the hyperparameters and training parameters. I'm using an Nvidia A100, so I can support slightly larger batch sizes and bfloat16 mixed precision for faster training.

In [None]:
# hyperparameters
training_arguments = SFTConfig(
  output_dir=model_directory,
  per_device_train_batch_size=4,
  per_device_eval_batch_size=4,
  gradient_accumulation_steps=1,
  optim="paged_adamw_32bit",
  num_train_epochs=3,
  eval_strategy="steps",
  eval_steps=0.1, # evaluate every 1/10 increment through the run
  logging_steps=50,
  warmup_steps=50,
  logging_strategy="steps",
  learning_rate=2e-4,
  fp16=False,
  bf16=True,
  group_by_length=True,
  report_to="wandb",
  max_seq_length=512, # avoids exceeding GPU memory during training
  dataset_text_field="text",
  packing= False,
)

# setting sft parameters
trainer = SFTTrainer(
  model=model,
  train_dataset=dataset["train"],
  eval_dataset=dataset["test"],
  peft_config=peft_config,
  processing_class=tokenizer,
  args=training_arguments,
)
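
It can help to translate these settings into step counts before starting a run. The sketch below does the arithmetic for the values used in this notebook, assuming a single GPU and assuming that a fractional `eval_steps` is converted to a step count by rounding up; both are assumptions rather than something the notebook confirms.

```python
import math

# values taken from the cells above
n_examples = 3000     # sample_size
test_percent = 10     # test_size=0.1
batch_size = 4        # per_device_train_batch_size
grad_accum = 1        # gradient_accumulation_steps
epochs = 3            # num_train_epochs
eval_ratio = 0.1      # eval_steps
warmup_steps = 50

# ASSUMPTION: one GPU, so the effective batch is batch_size * grad_accum
n_test = n_examples * test_percent // 100
n_train = n_examples - n_test
steps_per_epoch = math.ceil(n_train / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs

# ASSUMPTION: a ratio below 1 is turned into a step interval by rounding up
eval_every = math.ceil(total_steps * eval_ratio)

print(f"train/test examples: {n_train}/{n_test}")
print(f"optimizer steps:     {total_steps} ({steps_per_epoch} per epoch)")
print(f"evaluate every:      {eval_every} steps")
print(f"warmup share:        {warmup_steps / total_steps:.1%}")
```

Under these assumptions, the 50 warmup steps cover only a small slice of the run, and evaluation happens roughly ten times.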

# Train, validate, and save model

Start training.

In [None]:
# start training
trainer.train()

# end the W&B run, then enable the KV cache for inference
wandb.finish()
model.config.use_cache = True

Test the model manually.

In [None]:
def execute_test(model, tokenizer):

  messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": "I bought the same item twice. Can I cancel order {{Order Number}}?"}]

  # render the conversation with the chat template, ending with the assistant header
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

  # tokenize the prompt and move it to the model's device
  inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(model.device)

  # generate a response
  outputs = model.generate(**inputs, max_new_tokens=512, num_return_sequences=1)

  # decode only the newly generated tokens; splitting the full text on
  # "assistant" would cut off any reply that contains that word
  prompt_length = inputs["input_ids"].shape[1]
  return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

print(execute_test(model, tokenizer))

Save the LoRA adapter locally and push it to the Hugging Face Hub.

In [None]:
# save the adapter locally
trainer.model.save_pretrained(model_directory)
tokenizer.save_pretrained(model_directory)

# push to huggingface
trainer.model.push_to_hub(
  custom_model_name,
  repo_type="model",
  use_temp_dir=True,
  token=hf_auth_token)

# Merge and export fine-tuned model

First, reload the base model and tokenizer. Then merge the adapter with the base model.
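
Conceptually, merging folds the low-rank update into the frozen weight, W' = W + (alpha / r) * B * A, so the adapter is no longer needed at inference time. The toy sketch below checks this on a 2x2 example in plain Python; all matrices and values are made up for illustration, and it is not how PEFT implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# ASSUMPTION: tiny made-up matrices, chosen only to keep the arithmetic readable
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (d_out x d_in)
A = [[0.5, -1.0]]             # LoRA A (r x d_in), with r = 1
B = [[2.0], [1.0]]            # LoRA B (d_out x r)
alpha, r = 2, 1
x = [[1.0], [1.0]]            # input as a column vector

# adapter kept separate: W x + (alpha / r) * B (A x)
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# adapter merged into the weight: (W + (alpha / r) * B A) x
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)

print("separate adapter:", unmerged)
print("merged weight:   ", merged)
```

Both paths give the same output, which is why the merged model can be saved and converted as an ordinary checkpoint.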

In [None]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

# reload base model
model = AutoModelForCausalLM.from_pretrained(
  base_model_name,
  low_cpu_mem_usage=True,
  return_dict=True,
  torch_dtype=torch_dtype,
  device_map='auto',
)

# clear out pre-existing template to avoid error when replacing it
tokenizer.chat_template = None

# replace chat template (function from trl package)
model, tokenizer = setup_chat_format(model, tokenizer)
print(tokenizer.chat_template)

# merge model
merged_model = PeftModel.from_pretrained(model, model_directory)
merged_model = merged_model.merge_and_unload()

In [None]:
# sanity check test
print(execute_test(merged_model, tokenizer))

In [None]:
# save merged model locally
merged_model.save_pretrained(model_directory)
tokenizer.save_pretrained(model_directory)

# push the merged model to huggingface hub
merged_model.push_to_hub(custom_model_name, use_temp_dir=True, token=hf_auth_token)
tokenizer.push_to_hub(custom_model_name, use_temp_dir=True, token=hf_auth_token)

# Convert the merged model to GGUF and quantize

In [None]:
tmp_dir_base = f"{model_directory}/.tmp"
tmp_dir = f"{tmp_dir_base}/llama.cpp"

# set working directory
os.makedirs(tmp_dir_base, exist_ok=True)
os.chdir(tmp_dir_base)

# clone llama.cpp repo
if not os.path.exists(tmp_dir):
  !git clone --depth=1 https://github.com/ggerganov/llama.cpp.git

# change to local repo
os.chdir(tmp_dir)

# build llama.cpp with 8 parallel processes to speed it up
!cmake -B build
!cmake --build build --config Release -j 8

# convert safetensors model format to gguf
# (reads the merged model saved to model_directory, so the paths match the upload cell)
!python convert_hf_to_gguf.py "{model_directory}" \
  --outfile "{model_directory}/model.gguf" \
  --outtype f16

# now quantize the model to Q4_K_M, shrinking the f16 GGUF
# (about 6.4 GB for this 3B model) to roughly a third of that size
!./build/bin/llama-quantize \
  "{model_directory}/model.gguf" \
  "{model_directory}/model-Q4_K_M.gguf" \
  "Q4_K_M"

In [None]:
# push the gguf and quantized gguf models to the Hugging Face Hub
from huggingface_hub import HfApi

api = HfApi()

# upload gguf
api.upload_file(
  path_or_fileobj=f"{model_directory}/model.gguf",
  path_in_repo="model.gguf",
  repo_id=f"chrisalehman/{custom_model_name}",
  repo_type="model",
)

# upload quantized gguf
api.upload_file(
  path_or_fileobj=f"{model_directory}/model-Q4_K_M.gguf",
  path_in_repo="model-Q4_K_M.gguf",
  repo_id=f"chrisalehman/{custom_model_name}",
  repo_type="model",
)