In [None]:
# Copyright 2024 Google LLC

#

# Licensed under the Apache License, Version 2.0 (the "License");

# you may not use this file except in compliance with the License.

# You may obtain a copy of the License at

#

#     https://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

> This notebook was tested in the following environment:

>

> - Python 3.10

> - Colab Enterprise with an `e2-standard-16` runtime:

>   - 62.8 GB of system RAM

>   - 0 GB of GPU RAM (NVIDIA L4)


## Overview


Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.



This notebook demonstrates loading, finetuning, converting, and deploying Gemma to Vertex AI.


### Objective



- Load Gemma using KerasNLP

- Finetune Gemma using KerasNLP

- Convert Gemma to Hugging Face Transformers

- Deploy Gemma to Vertex AI


### Costs



This tutorial uses billable components of Google Cloud:



- Vertex AI

- Cloud Storage



Learn about [Vertex AI](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage](https://cloud.google.com/storage/pricing) pricing,

and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)

to generate a cost estimate based on your projected usage.


### Current Model State:

- The model was trained using facility-specific health data, focusing on trends related to veterans' physical and mental health.

- The last training session processed **102/10,205** examples, but to enhance the model’s accuracy, additional data and retraining will be required.



### Disclaimer:

Please note that this notebook is a **work in progress** and may appear incomplete. Ongoing updates and improvements will be made as the model is retrained and further refined.



### Healthcare Disclaimer:

The outputs of this model are intended for research and educational purposes only and should not be used as a substitute for professional medical advice, diagnosis, or treatment. Consult a qualified healthcare provider for any medical concerns or decisions related to patient care.


## Installation


Install the following packages required to execute this notebook:


In [None]:
!pip install --upgrade keras-nlp tensorflow tensorflow-addons tensorflow-text kaggle kagglehub -q


## Before you begin


### Kaggle credentials









Gemma models are hosted by Kaggle. To use Gemma, request access on Kaggle:



- Sign in or register at [kaggle.com](https://www.kaggle.com)

- Open the [Gemma model card](https://www.kaggle.com/models/google/gemma) and select _"Request Access"_

- Complete the consent form and accept the terms and conditions



Then, to use the Kaggle API, create an API token:



- Open the [Kaggle settings](https://www.kaggle.com/settings)

- Select _"Create New Token"_

- A `kaggle.json` file is downloaded. It contains your Kaggle credentials



Run the following cell and enter your Kaggle credentials.


In [None]:
import kagglehub



kagglehub.login()

> Note: If `kagglehub.login()` doesn't work for you, an alternative way is to set `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables.
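
As a minimal sketch of that alternative, the credentials can be read from the downloaded `kaggle.json` file and exported as environment variables. The helper name `set_kaggle_env` and the file location are assumptions made for illustration; the file is expected to contain `username` and `key` fields.

```python
import json
import os


def set_kaggle_env(token_path: str) -> None:
    """Export Kaggle credentials from a kaggle.json file (hypothetical helper)."""
    with open(token_path, encoding="utf-8") as f:
        token = json.load(f)
    # kaggle.json stores the credentials under "username" and "key"
    os.environ["KAGGLE_USERNAME"] = token["username"]
    os.environ["KAGGLE_KEY"] = token["key"]


# Example (assumes the token file was uploaded to the working directory):
# set_kaggle_env("kaggle.json")
```

Setting the variables before the first download avoids typing the credentials interactively, which can be convenient when the notebook is run unattended.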


### Dependencies


In [None]:
# Utility Libraries

import os

import datetime

import json

import csv



# Data Handling

import pandas as pd



# Machine Learning and Deep Learning

import tensorflow as tf

from tensorflow import keras

from tensorflow.keras import mixed_precision

from tensorflow.keras.callbacks import EarlyStopping

from tensorflow.keras.optimizers import Adam

import keras_nlp  # if specifically using Keras NLP functionalities

import transformers  # if using models from Hugging Face's Transformers library

import torch

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Google Cloud Integration (if needed)

# from google.cloud import aiplatform

# from google.colab import files



# Hugging Face Integration (if needed)

# from huggingface_hub import HfApi, HfFolder


### Dataset



To finetune Gemma, this notebook uses the [Veteran Affairs NE Regional](https://huggingface.co/datasets/vetHealthGuy/Veteran_Affairs_NorthEast_Region_Conversational_Dataset) test dataset.



Download the dataset:


In [None]:
# Download the dataset CSV and save it as vet_chat.csv



DATASET_URL = "https://huggingface.co/datasets/vetHealthGuy/Veteran_Affairs_NorthEast_Region_Conversational_Dataset/resolve/main/Updated_VA_Facilities_Conversational_with_Phone.csv"

!wget -nv -nc -O vet_chat.csv $DATASET_URL

## Load Gemma



In this step, you will configure Keras precision settings and load Gemma with KerasNLP.


In [None]:
from keras_nlp.models import GemmaCausalLM

# The model itself is loaded in the "Model summary" step below, after the
# precision settings are configured, so that only one copy is held in memory.

### Keras precision settings



When training on NVIDIA GPUs, mixed precision (`keras.mixed_precision.set_global_policy("mixed_bfloat16")`) can be used to speed up training with minimal effect on training quality. In most cases, it is recommended to turn on mixed precision as it saves both memory and time. However, be aware that at small batch sizes, it can inflate memory usage by 1.5x (weights will be loaded twice, at half precision and full precision).



For inference, half-precision (`keras.config.set_floatx("bfloat16")`) will work and save memory (while mixed-precision is not applicable).
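
To make these memory trade-offs concrete, here is a rough back-of-the-envelope estimate for the weights alone. It assumes roughly 2.5 billion parameters and ignores activations, gradients, and optimizer state, so treat the numbers as approximations rather than measurements:

```python
def weights_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights only, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 2.5e9  # assumption: approximate Gemma 2B parameter count

full = weights_memory_gb(NUM_PARAMS, 4)  # float32: 4 bytes per parameter
half = weights_memory_gb(NUM_PARAMS, 2)  # bfloat16: 2 bytes per parameter
mixed = full + half                      # both copies kept in memory

print(f"float32 weights:  {full:.1f} GB")
print(f"bfloat16 weights: {half:.1f} GB")
print(f"mixed precision:  {mixed:.1f} GB ({mixed / full:.1f}x float32)")
```

The 1.5x figure comes from holding a 2-byte copy alongside the 4-byte copy of every weight.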



Configure your precision settings:


In [None]:
# Run inferences at half precision

keras.config.set_floatx("bfloat16")



# Train at mixed precision (enable for large batch sizes)

#keras.mixed_precision.set_global_policy("mixed_bfloat16")

### Model summary



Load the Gemma model using the `GemmaCausalLM.from_preset()` method:


In [None]:
#print(keras_nlp.models.GemmaCausalLM.presets.keys())

In [None]:
MODEL_NAME = "gemma_2b_en"

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(MODEL_NAME)

Display the model summary:


In [None]:
gemma_lm.summary()

### Test examples



Define test examples and functions that will be used to test models before and after finetuning:


In [None]:
TEST_EXAMPLES = [

    "I need to go to the VA in Philadelphia",

    "How far is the Tulsa Veteran Affairs from Bixby?",

    "Need to travel to New York.  Is there a VA?",

    "What is the number to VA in boston"

]



# Prompt template for the training data and the finetuning tests

PROMPT_TEMPLATE = "Instruction:\n{Prompt}\n\nResponse:\n{Response}"



TEST_PROMPTS = [

    PROMPT_TEMPLATE.format(Prompt=example, Response="")

    for example in TEST_EXAMPLES

]

### Samplers



You can control how tokens are generated for `GemmaCausalLM` by calling the `compile()` method with the `sampler` parameter.



For example:



- `greedy`: picks the next token with the largest probability

- `top_k`: randomly samples the next token from the K most probable tokens



To get deterministic outputs in this notebook, make sure you're using the `greedy` sampler:


In [None]:
gemma_lm.compile(sampler="greedy")

To learn more about available samplers, see [Samplers](https://keras.io/api/keras_nlp/samplers).
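
The difference between the two strategies can be illustrated with a toy next-token distribution. This is a simplified sketch in plain Python, not the KerasNLP implementation; the function names and the probability values are made up for illustration.

```python
import random


def greedy_sample(probs: list[float]) -> int:
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_k_sample(probs: list[float], k: int, rng: random.Random) -> int:
    """Sample one token among the k most probable ones."""
    top_indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # random.choices() normalizes the weights, so no manual renormalization is needed
    weights = [probs[i] for i in top_indices]
    return rng.choices(top_indices, weights=weights, k=1)[0]


# Toy distribution over a 5-token vocabulary (made-up values)
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
rng = random.Random(0)

print("greedy:", greedy_sample(probs))
print("top_k: ", top_k_sample(probs, k=2, rng=rng))
```

Greedy sampling returns the same token on every call, which is why it gives reproducible outputs, whereas top-k sampling can return any of the K candidates.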


### Inference before finetuning



Check how the model responds to the test examples:


In [None]:
for test_example in TEST_EXAMPLES:

    response = gemma_lm.generate(test_example, max_length=48)

    output = response[len(test_example) :]

    print(f"{test_example}\n{output!r}\n")

A pretrained model can generate text that deviates from the output you are expecting. Here are some examples:



- The output doesn't follow your output requirements.

- The output is too generic or not consistent enough.

- The output is factually incorrect or outdated.

- The output isn't aligned with your specific safety policies.



More specific inputs (prompt engineering) can fix some of these issues, at the cost of longer and more complex prompts. If the expected output is not part of the model's training data, LLMs still generate plausible-sounding text, producing what are sometimes called hallucinations.



Finetuning the model can improve its performance while keeping prompts simple.


## Finetune Gemma



Finetune your Gemma model to improve its performance in the specific task of answering questions more consistently and more factually.


### Training data



Generate the training examples using the dataset:


In [None]:
NEW_DATASET_PATH = "vet_chat.csv"



def generate_training_data(training_ratio: int = 100) -> list[str]:

    assert 0 < training_ratio <= 100

    data = []

    # Read the CSV with DictReader so each row is a dict keyed by column name

    with open(NEW_DATASET_PATH, newline='') as file:

        reader = csv.DictReader(file)

        for row in reader:

            # Skip examples with context, for simplicity (only applies if the CSV has a `context` column)

            if row.get("context"):

                continue

            # Format each example using the `Prompt` and `Response` columns of the CSV

            data.append(PROMPT_TEMPLATE.format(Prompt=row['Prompt'], Response=row['Response']))

    total_data_count = len(data)

    training_data_count = total_data_count * training_ratio // 100

    print(f"Training examples: {training_data_count}/{total_data_count}")



    return data[:training_data_count]



# Limit to 10% for test purposes

training_data = generate_training_data(training_ratio=10)
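
To see what a formatted training example looks like, the same steps can be run on a small in-memory CSV. The two rows below are made up for illustration; only the `Prompt` and `Response` column names are taken from the dataset used in this notebook.

```python
import csv
import io

PROMPT_TEMPLATE = "Instruction:\n{Prompt}\n\nResponse:\n{Response}"

# Made-up rows for illustration, using the same two columns as the real file
SAMPLE_CSV = (
    "Prompt,Response\n"
    "Where is the nearest clinic?,Please call your local facility.\n"
    "What are the opening hours?,Hours vary by facility.\n"
)

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
examples = [
    PROMPT_TEMPLATE.format(Prompt=row["Prompt"], Response=row["Response"])
    for row in rows
]

# Keep only the first `training_ratio` percent of the examples
training_ratio = 50
subset = examples[: len(examples) * training_ratio // 100]

print(subset[0])
```

Because the ratio is applied with integer division, small datasets combined with small ratios can yield zero examples, so it is worth checking the printed count before training.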

### Low-Rank Adaptation (LoRA)



[Low Rank Adaptation](https://arxiv.org/abs/2106.09685) (LoRA) is a finetuning technique which greatly reduces the number of trainable parameters for downstream tasks by freezing the full weights of the model and inserting a smaller number of new trainable weights into the model. This technique makes training much faster and more memory-efficient.
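
The idea can be sketched with NumPy for a single dense layer. The layer sizes below are illustrative and do not correspond to Gemma's actual shapes, and the scaling factor used in the paper is omitted for brevity:

```python
import numpy as np

d_in, d_out, rank = 2048, 2048, 4  # illustrative sizes (assumption)

rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out))        # frozen pretrained weights
A = rng.standard_normal((d_in, rank)) * 0.01  # trainable, small random init
B = np.zeros((rank, d_out))                   # trainable, zero init


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen projection plus the low-rank update: x @ W + (x @ A) @ B."""
    return x @ W + (x @ A) @ B


x = rng.standard_normal((1, d_in))
# With B initialized to zero, the adapted layer starts out identical to the frozen one
assert np.allclose(lora_forward(x), x @ W)

full_params = W.size
lora_params = A.size + B.size
print(f"Full finetuning: {full_params:,} trainable parameters")
print(f"LoRA (rank {rank}):   {lora_params:,} trainable parameters")
```

Only `A` and `B` receive gradient updates, so the number of trainable parameters grows with the rank rather than with the product of the layer dimensions.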



Enable LoRA for the model and set the LoRA rank to 4:


In [None]:
gemma_lm.backbone.enable_lora(rank=4)

Check that the number of trainable parameters is significantly reduced:


In [None]:
gemma_lm.summary()

The number of trainable parameters decreased from 2.5B to 1.4M (about 1,800x fewer), making it possible to finetune the model with reasonable GPU memory requirements.


### Finetuning



Finetune the model with the training data. This step can take a couple of minutes:


In [None]:


def finetune_gemma(model: keras_nlp.models.GemmaCausalLM, data: list[str]):
    # Reduce the input sequence length to limit memory usage
    model.preprocessor.sequence_length = 64

    # Stop early if the training loss fails to improve for one epoch
    early_stopping = EarlyStopping(monitor='loss', patience=1)

    # Configure the AdamW optimizer
    optimizer = keras.optimizers.AdamW(
        learning_rate=5e-6,
        weight_decay=0.01,
    )
    # Exclude bias and layer norm scale terms from weight decay
    optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

    # Convert the data to a tf.data.Dataset and batch it
    dataset = tf.data.Dataset.from_tensor_slices(data).batch(4)

    # Compile the model. `weighted_metrics` applies the padding mask produced by
    # the preprocessor, so padded positions don't count toward the accuracy.
    # The greedy sampler is passed again because compile() otherwise falls back
    # to its default sampler.
    model.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=optimizer,
        weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
        sampler="greedy",
    )

    # Fit the model on the batched dataset
    model.fit(dataset, epochs=4, callbacks=[early_stopping])

    # Save the model in .keras format
    model.save('veteran_assistance_finetuned_model.keras')
    print("Model saved as veteran_assistance_finetuned_model.keras")

    # Save the model in legacy .h5 format. The format is inferred from the file
    # extension; the `save_format` argument is deprecated in Keras 3.
    model.save('veteran_assistance_finetuned_model.h5')
    print("Model saved as veteran_assistance_finetuned_model.h5")


# Finetune the LoRA-enabled model on the training data
finetune_gemma(gemma_lm, training_data)


### Inference after finetuning



Test the finetuned model. The next section reloads the saved weights and generates responses to new prompts.


### Reloading the models after initial training

In [None]:
import keras_hub



# Check available attributes and methods of GemmaBackbone

print(dir(keras_hub.models.GemmaBackbone))


In [None]:
import keras_hub

# Step 1: Load the "gemma_2b_en" preset. from_preset() also attaches the
# preprocessor, which generate() needs in order to tokenize raw text prompts.
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Step 2: Enable LoRA with the same rank used for finetuning, so that the layer
# structure matches the saved weights (assumes they were saved with rank 4)
gemma_lm.backbone.enable_lora(rank=4)

# Step 3: Load the previously saved weights and use the greedy sampler
gemma_lm.load_weights("veteran_assistance_finetuned_model.h5")
gemma_lm.compile(sampler="greedy")

# Same prompt format as the training data
PROMPT_TEMPLATE = "Instruction:\n{Prompt}\n\nResponse:\n{Response}"

# Define test prompts as raw text
TEST_PROMPTS = [
    "What services are available for veterans at the VA in Philadelphia?",
    "How can I get healthcare benefits as a veteran?",
    "Are there mental health resources at the VA?",
]

# Step 4: Generate responses. Each prompt is wrapped in the training template
# and tokenized by the attached preprocessor.
for prompt in TEST_PROMPTS:
    formatted_prompt = PROMPT_TEMPLATE.format(Prompt=prompt, Response="")
    # max_length matches the sequence length used during finetuning
    output = gemma_lm.generate(formatted_prompt, max_length=64)
    print(f"Prompt: {prompt}")
    print(f"Response: {output}\n{'- ' * 40}")


You should observe that outputs are now structured, more consistent, and more factual.


## Retraining of the model


In [None]:
import pandas as pd
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Path to the training data
data_path = '/content/vet_chat.csv'

# Transformers cannot read the Keras .h5 file saved earlier; it needs a
# checkpoint in Hugging Face format. The base Gemma 2B Hub ID is used here as a
# stand-in (assumption). Replace it with the directory of your converted model.
model_path = 'google/gemma-2b'

# Same prompt format as the KerasNLP finetuning step
PROMPT_TEMPLATE = "Instruction:\n{Prompt}\n\nResponse:\n{Response}"


class VAInstructionsDataset(Dataset):
    """Tokenizes prompt/response pairs from the CSV for causal LM training."""

    def __init__(self, data_path, tokenizer, max_length=128):
        # The dataset is a CSV file with `Prompt` and `Response` columns
        self.data = pd.read_csv(data_path)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data.iloc[idx]
        text = PROMPT_TEMPLATE.format(Prompt=item['Prompt'], Response=item['Response'])

        encoded = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = encoded['input_ids'].squeeze(0)
        attention_mask = encoded['attention_mask'].squeeze(0)

        # A causal LM predicts the next token of the same sequence, so the
        # labels are the input IDs, with padding set to -100 so that the loss
        # ignores those positions
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels,
        }


# Load tokenizer and model (Gemma is a decoder-only model, hence CausalLM)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Prepare the dataset (the Trainer builds its own dataloader)
dataset = VAInstructionsDataset(data_path, tokenizer)

# Training arguments. `load_best_model_at_end` is omitted because it requires
# an evaluation dataset, which this notebook does not define.
training_args = TrainingArguments(
    output_dir='/content/model-finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_dir='/content/logs',
    logging_steps=100,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

# Fine-tune the model
trainer.train()

# Save the model and tokenizer
model.save_pretrained('/content/model-finetuned')
tokenizer.save_pretrained('/content/model-finetuned')