<a href="https://colab.research.google.com/github/ghundal/E115_SMART/blob/milestone3/fine_tuning/smart_gemma_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set-up

## Install packages

In [1]:
# Install Gemma release branch from Hugging Face
%pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

# Install Hugging Face libraries
%pip install --upgrade \
  huggingface_hub \
  transformers \
  datasets \
  evaluate \
  accelerate \
  bitsandbytes \
  trl \
  peft \
  protobuf \
  wandb \
  gcsfs

# Optional: flash-attn only helps on GPUs that support BF16 and FlashAttention, such as NVIDIA L4 or NVIDIA A100; comment this line out on other GPUs
%pip install flash-attn

# Patch for a Gemma 3 issue: reinstall transformers from the main branch (this replaces the pinned release installed above)
%pip install --upgrade git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
  Cloning https://github.com/huggingface/transformers (to revision v4.49.0-Gemma-3) to /tmp/pip-req-build-ughus7bx
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-ughus7bx
  Running command git checkout -q 367bab469b0ef32017e2a0a0a5dbac5d36002f03
  Resolved https://github.com/huggingface/transformers to commit 367bab469b0ef32017e2a0a0a5dbac5d36002f03
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.50.0.dev0-py3-none-any.whl size=10936468 sha256=2cf83436b028ff0c7fee5d82929d7a295e966b2db14c588dcba9853637e676f7
  Stored in directory: /tmp/pip-eph

## Package Imports

In [2]:
import os
import json
from datasets import Dataset

# Google Colab helpers and Google Cloud credentials
from google.colab import (
    userdata,
    auth as google_auth,
    drive as google_drive,
)
from google.cloud import storage
from google.oauth2 import service_account

# Log in to the Hugging Face Hub
from huggingface_hub import login

import wandb

# Model training
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    logging,
    pipeline
    )

from peft import LoraConfig, PeftModel
from trl import SFTConfig, SFTTrainer

## Environment Set-up

In [3]:
# Mount Google Drive and switch to the project folder
google_drive.mount('/content/drive')
%cd /content/drive/MyDrive/Colab\ Notebooks/CSCI_115

# Authenticate to GCP and load the service account key
google_auth.authenticate_user()
credentials = service_account.Credentials.from_service_account_file(
    "./secrets/smart_input_key.json"
)
data_dir = "fine_tuning/datasets_60q/"

# Login to WandB
wandb.login(key=userdata.get("WANDB_API_KEY"))

os.environ["WANDB_LOG_MODEL"] = "checkpoint" # Log every saved checkpoint to W&B as an artifact
os.environ["WANDB_WATCH"] = "all" # Log both gradients and parameters to W&B

run = wandb.init(project='smart_gemma_ft', job_type="training", anonymous="allow")

# Log in to the Hugging Face Hub
hf_token = userdata.get('HF_TOKEN') # Read the token from Colab secrets
login(hf_token)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/CSCI_115


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mtvaldeca[0m ([33mtvaldeca-harvard-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


# Import Dataset

In [4]:
def connect_to_bucket():
    """Connect to the GCS bucket and return the bucket object."""
    try:
        client = storage.Client(project="SMART", credentials=credentials)
        return client.bucket("smart_input_data")
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise RuntimeError(f"Critical failure: unable to connect to GCS bucket: {e}") from e

def download_dataset(bucket, filename):
    """Download `filename` from `data_dir` in the bucket to the working directory."""
    blob = bucket.blob(f"{data_dir}{filename}")
    blob.download_to_filename(filename)

def load_dataset(filename):
    """Load a JSON file of column lists into a Hugging Face Dataset."""
    with open(filename, "r") as f:
        return Dataset.from_dict(json.load(f))
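`Dataset.from_dict` expects a column-oriented mapping, where each key maps to a list of values of equal length. The layout of `dataset_60q.json` is not shown in this notebook, so the following is only a sketch with hypothetical `question` and `answer` fields, showing how row-oriented records could be reshaped if a file were stored as a list of objects instead.

```python
def rows_to_columns(rows):
    """Convert a list of row dicts into a dict of column lists."""
    if not rows:
        return {}
    keys = list(rows[0].keys())
    columns = {key: [] for key in keys}
    for i, row in enumerate(rows):
        # Every row must have the same fields, otherwise columns become ragged
        if set(row.keys()) != set(keys):
            raise ValueError(f"row {i} has fields {sorted(row)}; expected {sorted(keys)}")
        for key in keys:
            columns[key].append(row[key])
    return columns

# Hypothetical records; the real schema of dataset_60q.json may differ.
rows = [
    {"question": "What is overfitting?", "answer": "Fitting noise in the training data."},
    {"question": "What is a validation set?", "answer": "Held-out data used for tuning."},
]
columns = rows_to_columns(rows)
print(columns["question"])
```

The resulting `columns` dict has the shape that `Dataset.from_dict` accepts.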

In [5]:
filename = "dataset_60q.json"

# get gcp bucket connection
bucket = connect_to_bucket()

# download and load dataset
download_dataset(bucket, filename)
dataset = load_dataset(filename)

# shuffle then split to train/test
dataset = dataset.shuffle().train_test_split(test_size=0.2)
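No seed is passed above, so each run produces a different 80/20 split. `train_test_split` shuffles by default and accepts a `seed` argument, so `dataset.train_test_split(test_size=0.2, seed=42)` would likely be enough for a repeatable split. The idea can be sketched in plain Python; the seed value and the helper name `split_indices` are illustrative assumptions, and the 960 rows come from the run name and the trainer output.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train, test) row indices from one seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same order
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(960)
print(len(train_idx), len(test_idx))  # 768 192
```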

# Load Gemma Model

In [6]:
# Hugging Face model id
model_id = "google/gemma-3-1b-it"

# Check if GPU benefits from bfloat16
if torch.cuda.get_device_capability()[0] >= 8:
    torch_dtype = torch.bfloat16
else:
    torch_dtype = torch.float16

# Define model init arguments
model_kwargs = dict(
    attn_implementation="eager", # Use "flash_attention_2" when running on Ampere or newer GPU
    torch_dtype=torch_dtype, # Data type for model weights and computation
    device_map="auto", # Let Accelerate place the model on the available devices
)

# BitsAndBytesConfig: Enables 4-bit quantization to reduce model size/memory usage
model_kwargs["quantization_config"] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=model_kwargs['torch_dtype'],
    bnb_4bit_quant_storage=model_kwargs['torch_dtype'],
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(model_id) # Load the instruction-tuned tokenizer to use the official Gemma chat template
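To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of weight memory at different precisions. It assumes a nominal 1 billion parameters (the real count for this checkpoint differs) and ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so treat the numbers as orders of magnitude only.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 1_000_000_000  # assumption: nominal size suggested by the "1b" in the model id

for label, bits in [("float32", 32), ("bfloat16/float16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>18}: {weight_memory_gb(n_params, bits):.2f} GB")
```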

## Training Configuration

### QLoRA

In [7]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    modules_to_save=["lm_head", "embed_tokens"] # also train and save lm_head and embed_tokens, because the special tokens are being trained
)
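As a rough illustration of what `r` and `lora_alpha` control: for one linear layer with a `d_out x d_in` weight, LoRA adds two small matrices, `A` of shape `r x d_in` and `B` of shape `d_out x r`, and scales their product by `lora_alpha / r`. The sketch below counts the added parameters for a hypothetical 1024 x 1024 layer; the layer shape is an assumption for illustration, not a Gemma dimension.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

r, lora_alpha = 16, 16    # values from peft_config above
d_in, d_out = 1024, 1024  # hypothetical layer shape

full = d_in * d_out                        # parameters in the frozen weight
added = lora_param_count(d_in, d_out, r)   # trainable LoRA parameters
scaling = lora_alpha / r                   # factor applied to the B @ A update

print(full, added, added / full, scaling)
```

With `lora_alpha` equal to `r`, the scaling factor is 1.0, so the low-rank update is applied unscaled.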

### Hyperparameters

In [8]:
args = TrainingArguments(
    output_dir="gemma-smart-fine-tuned-v2",  # output directory and Hub repository id
    overwrite_output_dir=True,
    eval_steps=10,                           # evaluation interval; only used if eval_strategy="steps" (not set here, so no evaluation runs)
    num_train_epochs=7,                      # number of training epochs
    per_device_train_batch_size=1,           # batch size per device during training
    per_device_eval_batch_size=1,            # batch size per device during evaluation
    report_to="wandb",                       # enables logging to W&B 😎
    run_name="960_qs_7_epochs_cosine",       # W&B run name
    gradient_accumulation_steps=3,           # micro-batches accumulated before each optimizer update
    gradient_checkpointing=True,             # recompute activations in the backward pass to save memory
    optim="adamw_torch_fused",               # use the fused AdamW optimizer
    logging_steps=10,                        # log every 10 steps
    logging_strategy="steps",
    save_strategy="epoch",                   # save a checkpoint every epoch
    learning_rate=2e-4,                      # learning rate, based on the QLoRA paper
    fp16=torch_dtype == torch.float16,       # use float16 mixed precision on older GPUs
    bf16=torch_dtype == torch.bfloat16,      # use bfloat16 mixed precision when supported
    max_grad_norm=0.3,                       # max gradient norm, based on the QLoRA paper
    warmup_ratio=0.03,                       # warmup ratio, based on the QLoRA paper
    lr_scheduler_type="cosine",              # cosine learning rate schedule
    push_to_hub=True,                        # push the model to the Hub
)

# Disable the KV cache during training: it is not needed there and is incompatible with gradient checkpointing
model.config.use_cache = False
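The batch-size, accumulation, and epoch settings above determine how many optimizer steps the run takes. The following sketch works through that arithmetic for the 768 training examples reported by the trainer output, assuming a single GPU; the exact rounding inside `Trainer` may differ when the dataset size is not a multiple of the effective batch size.

```python
import math

def training_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

effective_batch, steps_per_epoch, total_steps = training_steps(
    n_examples=768, per_device_batch=1, grad_accum=3, epochs=7
)
warmup_steps = math.ceil(0.03 * total_steps)  # warmup_ratio=0.03

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)  # 3 256 1792 54
```

These figures line up with the per-epoch checkpoints in the training log (`checkpoint-256` through `checkpoint-1792`).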

# Training

In [9]:
# Create Trainer object
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    processing_class=tokenizer
)

Map:   0%|          | 0/768 [00:00<?, ? examples/s]

Converting train dataset to ChatML:   0%|          | 0/768 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/768 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/768 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/768 [00:00<?, ? examples/s]

Map:   0%|          | 0/192 [00:00<?, ? examples/s]

Converting eval dataset to ChatML:   0%|          | 0/192 [00:00<?, ? examples/s]

Applying chat template to eval dataset:   0%|          | 0/192 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/192 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/192 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [10]:
# Start training; checkpoints are saved to the output directory and pushed to the Hub
trainer.train()

# Save the final model to the output directory and push it to the Hugging Face Hub
trainer.save_model()

# Tell W&B that the run is complete
wandb.finish()

# Re-enable the KV cache for inference
model.config.use_cache = True



Step,Training Loss
10,3.69
20,2.5946
30,1.8321
40,1.5435
50,1.4262
60,1.3431
70,1.3444
80,1.2872
90,1.2993
100,1.2612


[34m[1mwandb[0m: Adding directory to artifact (./gemma-smart-fine-tuned-v2/checkpoint-256)... Done. 15.9s
[34m[1mwandb[0m: Adding directory to artifact (./gemma-smart-fine-tuned-v2/checkpoint-512)... Done. 12.7s
[34m[1mwandb[0m: Adding directory to artifact (./gemma-smart-fine-tuned-v2/checkpoint-768)... Done. 15.3s
[34m[1mwandb[0m: Adding directory to artifact (./gemma-smart-fine-tuned-v2/checkpoint-1024)... Done. 11.5s
[34m[1mwandb[0m: Adding directory to artifact (./gemma-smart-fine-tuned-v2/checkpoint-1280)... Done. 13.5s
[34m[1mwandb[0m: Adding directory to artifact (./gemma-smart-fine-tuned-v2/checkpoint-1536)... Done. 13.3s
[34m[1mwandb[0m: Adding directory to artifact (./gemma-smart-fine-tuned-v2/checkpoint-1792)... Done. 11.3s

Metric,Run history
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇████
train/global_step,▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▄▄▄▄▄▄▄▄▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇██
train/grad_norm,▅▅▆█▄▄▄▄▄▃▄▄▄▅▅▅▅▄▅▅▅▅▄▅▅▅▂▃▃▄▄▄▂▁▂▁▂▁▁▁
train/learning_rate,▃█████████▇▇▇▇▇▇▆▆▆▆▆▅▅▅▄▄▄▄▃▃▃▂▂▂▁▁▁▁▁▁
train/loss,█▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/mean_token_accuracy,▁▁▂▂▂▄▄▄▄▄▅▅▅▅▆▆▇▇▇▇█▇█▇▇███████████████
train/num_tokens,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇████

Metric,Final value
total_flos,7370895404291328.0
train/epoch,7.0
train/global_step,1792.0
train/grad_norm,1.16565
train/learning_rate,0.0
train/loss,0.0711
train/mean_token_accuracy,0.98086
train/num_tokens,1212799.0
train_loss,0.48097
train_runtime,4630.5709
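The final `train/learning_rate` of 0.0 is what a cosine schedule with linear warmup would produce at the last step. Below is a minimal sketch of such a schedule, assuming 1792 total steps (from the summary above), 54 warmup steps (the ceiling of 0.03 x 1792), and a peak rate of 2e-4. It mirrors the general shape of the schedule and is not a copy of the library implementation.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps, warmup_steps, peak_lr = 1792, 54, 2e-4

for step in (0, 27, 54, 923, 1792):
    print(step, cosine_lr(step, total_steps, warmup_steps, peak_lr))
```

The rate rises linearly to the peak at step 54, passes half the peak midway through the decay (step 923), and reaches zero at step 1792.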


In [11]:
# Free GPU memory before reloading the base model for merging
del model
del trainer
torch.cuda.empty_cache()

# Merge QLoRA Adapter Weights

In [12]:
# Load the base model without quantization
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True)

# Merge the LoRA adapter into the base model and save the result
peft_model = PeftModel.from_pretrained(model, args.output_dir)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("merged_model", safe_serialization=True, max_shard_size="2GB")

processor = AutoTokenizer.from_pretrained(args.output_dir)
processor.save_pretrained("merged_model")

('merged_model/tokenizer_config.json',
 'merged_model/special_tokens_map.json',
 'merged_model/tokenizer.model',
 'merged_model/added_tokens.json',
 'merged_model/tokenizer.json')
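Conceptually, merging folds each low-rank update into the weight it adapts: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below uses plain Python lists and made-up 2 x 2 values to show that the merged weight gives the same output as applying the base weight and the adapter separately. It illustrates the arithmetic only and is not how `merge_and_unload` is implemented internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Made-up values: W is (d_out x d_in), A is (r x d_in), B is (d_out x r)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [-1.0]]
x = [[3.0], [1.0]]  # one input column vector

merged = merge_lora(W, A, B, alpha=2, r=1)

# Adapter applied separately: W @ x + scale * B @ (A @ x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

print(merged)             # [[2.0, 2.0], [-2.0, -3.0]]
print(matmul(merged, x))  # [[8.0], [-9.0]]
print(separate)           # [[8.0], [-9.0]]
```

After merging, inference needs only the single merged weight, which is why the saved `merged_model` can be loaded without `peft`.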

In [13]:
# send model to gcp
bucket_name = "smart_input_data"
gcs_model_path = f"gs://{bucket_name}/fine_tuning/gemma_model/"

# send merged model
!gsutil -m cp -r {"merged_model"} {gcs_model_path}

# send fine-tuned model
!gsutil -m cp -r {"gemma-smart-fine-tuned-v2"} {gcs_model_path}

# send W&B files
!gsutil -m cp -r {"wandb"} {gcs_model_path}

Copying file://merged_model/config.json [Content-Type=application/json]...
Copying file://merged_model/generation_config.json [Content-Type=application/json]...
Copying file://merged_model/model-00001-of-00003.safetensors [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

Copying file://merged_model/model-00002-of-00003.safetensor

# Test Model Inference

In [14]:
question = "What is cross validation?"

messages = [{"role": "user", "content": question}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, return_tensors='pt', padding=True, truncation=True).to("cpu")

# Generate with conservative sampling settings; top_k, top_p, and temperature
# only take effect when sampling is enabled (do_sample=True)
outputs = merged_model.generate(
    **inputs,
    max_length=350,  # Cap on total length in tokens (prompt plus generated answer)
    num_return_sequences=1,
    top_k=50,
    top_p=0.85,  # Narrow top-p for more deterministic output
    temperature=0.3,  # Low temperature favors accuracy over creativity
    no_repeat_ngram_size=3,  # Prevent any 3-gram from repeating
)

# Decode and clean up the output
text = processor.decode(outputs[0], skip_special_tokens=True)
response = text.split("\nmodel\n")[1].strip()

print("\n",response)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



 Cross-validation is a robust technique used to assess the performance of machine learning models, particularly linear models. It involves partitioning the available data into multiple subsets, training the model on some subsets and validating it on the remaining subsets. This process helps in identifying potential overfitting and allows for a more reliable estimation of the model's performance on unseen data.

The most common form of cross-validation, known as k-fold cross- validation, involves dividing the dataset into k equally sized folds. The model is trained on k-1 folds and validated on the final fold, thus creating a series of nested cross-validated subsets. The k-folds are used to compute performance metrics such as accuracy, precision, and recall, providing a more comprehensive view of the classifier's ability to generalize.

By utilizing multiple subsets of the data, cross-validates helps in ensuring that every data point in the dataset has the opportunity to be a part of t
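Splitting the decoded text on `"\nmodel\n"` depends on the exact rendering of the chat template and raises an `IndexError` if the marker is missing. For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so an alternative is to drop the prompt by length before decoding. The sketch below shows the slicing with made-up token ids; the ids and the helper name `strip_prompt` are illustrative assumptions.

```python
def strip_prompt(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]

# Made-up token ids standing in for a tokenized prompt and its completion
prompt_ids = [2, 105, 2364, 107]
output_ids = prompt_ids + [3689, 2093, 13775]

new_tokens = strip_prompt(output_ids, len(prompt_ids))
print(new_tokens)  # [3689, 2093, 13775]

# In this notebook the same idea would look roughly like:
#   prompt_length = inputs["input_ids"].shape[-1]
#   response = processor.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
```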

# References

- https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora
- https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-tune-an-LLM-Part-3-The-HuggingFace-Trainer--Vmlldzo1OTEyNjMy
- https://medium.com/@anicomanesh/fine-tuning-gemma-2-for-medical-question-answering-a-step-by-step-guide-1c6c4ec4c107