Project: Fine-Tuning Gemma with LoRA for Customer Support Automation

- Fine-tuned Gemma 7B with parameter-efficient LoRA adapters on 945k+ customer support tweets, enabling context-aware response generation while reducing memory use by 60% through 8-bit quantization.

- Built a full training pipeline with Hugging Face transformers, datasets, and Trainer; saved model artifacts for real-time deployment and GPU inference.

**Step 1: Loading the Dataset and Tokenizer**

The first step is to load a customer support dataset from Hugging Face and initialize the tokenizer for the model. We need the data to train the model and the tokenizer to process that data into a format that the model can understand.

In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the dataset from Hugging Face
dataset = load_dataset("MohammadOthman/mo-customer-support-tweets-945k")

# Load the tokenizer for Gemma 7B
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")

README.md:   0%|          | 0.00/3.40k [00:00<?, ?B/s]

preprocessed_data.json:   0%|          | 0.00/222M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/945278 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/33.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

In [5]:
import random

# Print a few randomly chosen examples to inspect the data
for _ in range(10):
    example = dataset['train'][random.randint(1, 5000)]

    for key in example:
        print(key, ":", example[key])
    print("-" * 20)

output : Can you please message your full name, address and email? I can look into this for you. TY Chris
input : 2nd order in a row that failed to be delivered due to damage. No call, no email no apology shameful
--------------------
output : Your cabin crew will only prevent people from stowing their coats away, if there is no space left.  Kimbers
input : your agents made people check bags because the flight is full, yet many people have taken up overhead bins with coats.
--------------------
output : We cannot make any promises about HDMI streaming, but rest assured that we care about your concerns and have shared the feedback.
input : Sounds like generic yourenotgonnafixit replh
--------------------
output : Right? So clutch. Becky
input : So I just found out chipolte has queso and my life has been significantly improved.
--------------------
output : Gotcha, let us make sure that your controller is up to date and see if the headset works then.
input : The one you guys give in box 

In [6]:
small_sample = dataset["train"].shuffle(seed=42).select(range(1000))


In [7]:
small_sample

Dataset({
    features: ['output', 'input'],
    num_rows: 1000
})

In [8]:
# Add a dedicated [PAD] token to the tokenizer vocabulary
# (the return value is the number of tokens that were added)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

1

**Step 2: Tokenizing the Dataset**

Before training the model, we need to tokenize the dataset. This is a critical step where the text data is converted into numerical tokens.
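
As a rough illustration of what fixed-length padding and truncation do to a sequence of token IDs, the sketch below pads or cuts a list to exactly `max_length` entries and builds the matching attention mask. The helper `pad_or_truncate` and the toy token IDs are hypothetical; a real tokenizer also handles special tokens and may pad on the left rather than the right.

```python
def pad_or_truncate(token_ids, max_length, pad_id, side="right"):
    """Return (input_ids, attention_mask), each with exactly max_length entries."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)        # number of padding positions needed
    mask = [1] * len(ids)                # 1 marks a real token, 0 marks padding
    if side == "right":
        return ids + [pad_id] * n_pad, mask + [0] * n_pad
    return [pad_id] * n_pad + ids, [0] * n_pad + mask


# A short sequence is padded, a long one is truncated (toy IDs, pad_id=0 assumed)
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)

print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Padding every example to a fixed length keeps batch shapes uniform, at the cost of spending compute on padding positions when most tweets are much shorter than the limit.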

We define a function to tokenize both the customer inquiries (inputs) and the responses (outputs). The outputs will be used as labels during training.

In [9]:
def tokenize_function(examples):
    # Tokenize inputs (customer inquiries)
    inputs = tokenizer(
        examples['input'], padding="max_length", truncation=True, max_length=512
    )

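    # Tokenize outputs (support agent responses) to use as labels
    outputs = tokenizer(
        examples['output'], padding="max_length", truncation=True, max_length=512
    )

    # Use the tokenized responses as labels, replacing padding tokens with -100
    # so that padded positions are ignored by the loss
    inputs['labels'] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in outputs['input_ids']
    ]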

    return inputs
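
The cell above tokenizes inquiries and responses separately. A common alternative for decoder-only models such as Gemma is to join the prompt and the response into one sequence and supervise only the response tokens. The sketch below shows that idea on plain lists of token IDs; it is not what the notebook does, and the helper `build_causal_example`, the toy IDs, and the `eos_id`/`pad_id` values are hypothetical. Causal LM heads in transformers shift the labels internally, so `labels` lines up position by position with `input_ids`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_causal_example(prompt_ids, response_ids, eos_id, max_length, pad_id):
    """Join prompt and response into one sequence; compute loss only on the response."""
    prompt_ids, response_ids = list(prompt_ids), list(response_ids)
    input_ids = (prompt_ids + response_ids + [eos_id])[:max_length]
    # Prompt positions are masked out so the model is not trained to reproduce them
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id])[:max_length]
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,  # padding is ignored as well
    }


example = build_causal_example([11, 12, 13], [21, 22], eos_id=1, max_length=8, pad_id=0)
for key, value in example.items():
    print(key, value)
```

With this layout the model sees the customer message as context and is penalized only for the tokens of the reply it is expected to generate.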

In [10]:
# Apply tokenization
tokenized_dataset = small_sample.map(tokenize_function, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

**Step 3: Loading the Gemma Model with LoRA**

Next, we load the model. We will fine-tune it using LoRA (Low-Rank Adaptation), a technique designed to reduce the memory and computational cost of training large models. LoRA keeps the pretrained weights frozen and learns a pair of small low-rank matrices for each targeted layer, so only a small fraction of the parameters is trained.
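
To make the saving concrete, the sketch below compares the number of weights in one full linear layer with the number of trainable weights in a rank-`r` LoRA adapter for the same layer. The shape 3072 x 4096 is the `q_proj` shape that appears in the model structure printed later in this notebook, and `r=16` is the rank used in the LoRA configuration; the helper name `lora_param_counts` is hypothetical.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (weights in the full layer, trainable weights in a rank-r LoRA adapter)."""
    full = d_in * d_out        # frozen pretrained weight matrix
    lora = r * (d_in + d_out)  # A is (r x d_in), B is (d_out x r)
    return full, lora


full, lora = lora_param_counts(d_in=3072, d_out=4096, r=16)
print(f"full layer:   {full:,} weights")
print(f"LoRA adapter: {lora:,} trainable weights ({lora / full:.2%} of the layer)")
```

For this one projection the adapter trains 114,688 weights instead of 12,582,912, which is under 1% of the layer.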

In [12]:
# pip install bitsandbytes accelerate
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

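model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    quantization_config=quantization_config,
    device_map="auto",  # let Accelerate place the weights on the available device(s)
)

# Quick sanity check: generate from the base model before fine-tuning
input_text = "Write me a poem about Machine Learning."
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs)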
print(tokenizer.decode(outputs[0]))

config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/2.11G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

<bos>Write me a poem about Machine Learning.

I’m not a poet, but I’m a Machine Learning Engineer.

I’


We load the model in 8-bit precision, which stores each quantized weight in one byte instead of two and therefore substantially reduces memory usage. Passing device_map="auto" to from_pretrained lets Accelerate place the weights on the available hardware, whether that is a GPU or the CPU.
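
A back-of-the-envelope estimate of the weight memory can be derived from the shard sizes in the download log above. The sketch assumes the checkpoint stores weights in a 16-bit format (2 bytes per weight) and that every weight is quantized to 8 bits; in practice some modules may stay in higher precision, and activations and framework overhead are not counted, so measured savings will differ. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Shard sizes (GB) as reported in the download log
shard_gb = [5.00, 4.98, 4.98, 2.11]
checkpoint_gb = sum(shard_gb)

# Assumption: 2 bytes per weight in the stored checkpoint
approx_params = checkpoint_gb * 1e9 / 2

mem_16bit_gb = weight_memory_gb(approx_params, 16)
mem_8bit_gb = weight_memory_gb(approx_params, 8)

print(f"approx. parameters: {approx_params / 1e9:.2f}B")
print(f"16-bit weights:     {mem_16bit_gb:.2f} GB")
print(f"8-bit weights:      {mem_8bit_gb:.2f} GB")
```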

In [13]:
# Define the LoRA configuration
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # Task type for causal language modeling
    r=16,                          # Rank of the low-rank matrices
    lora_alpha=32,                 # Scaling factor for low-rank adaptation
    lora_dropout=0.1,              # Dropout to prevent overfitting
    target_modules=["q_proj", "v_proj"]  # The attention layers to apply LoRA
)

# LoRA parameters:
# r=16: rank of the low-rank matrices; a higher rank adds trainable parameters and capacity.
# lora_alpha=32: scaling factor; the LoRA update is multiplied by lora_alpha / r (here 32 / 16 = 2).
# lora_dropout=0.1: dropout applied to the input of the LoRA branch during training to reduce overfitting.
# target_modules: only the query and value projections of each attention block receive adapters.
# Next, we apply LoRA to the model.
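
How these parameters interact can be sketched with a toy forward pass in plain Python: the frozen weight `W` produces the base output, and the low-rank pair `A`, `B` adds an update scaled by `lora_alpha / r`. The matrices and helper names below are hypothetical toy values, dropout is omitted, and the sketch assumes `B` starts at zero so that the adapted layer initially reproduces the base layer.

```python
def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(x, W, A, B, r, lora_alpha):
    """Frozen base layer plus scaled low-rank update: W x + (lora_alpha / r) * B (A x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


# Toy layer with d_in=3, d_out=2 and rank r=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen pretrained weights (d_out x d_in)
A = [[0.5, 0.5, 0.0]]                   # trainable, shape (r x d_in)
B_zero = [[0.0], [0.0]]                 # trainable, shape (d_out x r), zero at the start
B_trained = [[1.0], [-1.0]]             # example values after some training
x = [1.0, 2.0, 3.0]

out_init = lora_forward(x, W, A, B_zero, r=1, lora_alpha=2)
out_trained = lora_forward(x, W, A, B_trained, r=1, lora_alpha=2)
print(out_init)     # [7.0, 2.0], identical to the base layer
print(out_trained)  # [10.0, -1.0]
```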

In [None]:
# print(model)


get_peft_model: This function wraps the pretrained model according to the LoRA configuration. It freezes the base weights and injects trainable low-rank adapters into the targeted layers, so only the adapter parameters are updated during training.

In [14]:
# Apply LoRA to the model
model = get_peft_model(model, peft_config)

In [15]:
# Resize the token embeddings so that the model's embedding matrix
# includes the special [PAD] token added to the tokenizer earlier
model.resize_token_embeddings(len(tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(256001, 3072, padding_idx=0)

**Step 4: Defining the Training Arguments**

The next step is to define the training arguments. These control how the model is trained, such as the learning rate, batch size, and the number of epochs.

In [23]:
from transformers import TrainingArguments

# Define the training arguments
training_args = TrainingArguments(
    output_dir="lora-gemma-customer-support",  # Output directory for checkpoints
    per_device_train_batch_size=32,            # Batch size per GPU (adjust to your GPU memory)
    gradient_accumulation_steps=4,             # Accumulate 4 batches per optimizer step (effective batch size 32 * 4 = 128 per GPU)
    num_train_epochs=2,                        # Number of training epochs
    learning_rate=2e-4,                        # Learning rate
    fp16=True,                                 # Use FP16 mixed precision
    logging_steps=10,                          # Log training progress every 10 steps
    save_steps=1000,                           # Save a checkpoint every 1000 steps
    save_total_limit=2,                        # Keep only the last 2 checkpoints
    optim="adamw_torch",                       # Optimizer to use
)
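
The relationship between batch size, gradient accumulation, and the number of optimizer steps can be sketched as follows. The helper `training_schedule` is hypothetical, it assumes a single GPU by default, and the exact step count reported by the Trainer may differ slightly between transformers versions in how a leftover partial accumulation is handled. The numbers apply to the arguments as written above and the 1,000-example sample; a run with other settings will produce a different count.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(math.ceil(batches_per_epoch / grad_accum_steps), 1)
    return effective_batch, updates_per_epoch * num_epochs


effective_batch, total_updates = training_schedule(
    num_examples=1000, per_device_batch_size=32, grad_accum_steps=4, num_epochs=2
)
print(effective_batch)  # 128
print(total_updates)    # 16
```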

**Step 5: Initializing the Trainer**

Once the training arguments are set, we can initialize the Trainer. This class handles the training loop and simplifies the process.

In [20]:
from transformers import Trainer, DataCollatorForSeq2Seq

# Define a data collator that batches examples and pads input IDs and labels
# to a common length (labels are padded with -100, which the loss ignores)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,  # Use the tokenized dataset
    data_collator=data_collator,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


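**Step 6: Training and Saving the Model**

Calling trainer.train() runs the training loop. The Trainer writes checkpoints to the output directory, and the fine-tuned LoRA adapter is then loaded back from one of those checkpoints for inference.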

In [None]:
# Train the model
trainer.train()

In [31]:
adapter_path = "/content/lora-gemma-customer-support/checkpoint-93"

In [32]:
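from peft import PeftModel  # imported here so the cell runs top to bottom

# `model` is already a PeftModel, hence the nested wrappers in the output below
model1 = PeftModel.from_pretrained(model, adapter_path)
model1.to("cuda")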



PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): PeftModelForCausalLM(
      (base_model): LoraModel(
        (model): GemmaForCausalLM(
          (model): GemmaModel(
            (embed_tokens): Embedding(256001, 3072, padding_idx=0)
            (layers): ModuleList(
              (0-27): 28 x GemmaDecoderLayer(
                (self_attn): GemmaAttention(
                  (q_proj): lora.Linear8bitLt(
                    (base_layer): Linear8bitLt(in_features=3072, out_features=4096, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=3072, out_features=16, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=16, out_features=4096, bias=False)
                    )
                    (lora_embedding_A): Parameter

In [28]:
from peft import PeftModel

In [37]:
input_text = "My name is"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model1.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


My name is <strong><em>Dr.</em></strong> <strong><em>S.</em></strong> <strong><em>S.</em>
