<center><h1>Fine-tuning the Gemma model with Kaggle Docs data</h1></center>

<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>


# Introduction

This notebook will demonstrate three things:

1. How to fine-tune the Gemma model using LoRA
2. How to create a specialized class to query Gemma about Kaggle features
3. Some results of querying the model about Kaggle Docs

This notebook builds largely on previous work. The sources are:

1. Gemma Model Card, Kaggle Models, https://www.kaggle.com/models/google/gemma
2. Kaggle QA with Gemma - KerasNLP Starter, Kaggle Code, https://www.kaggle.com/code/awsaf49/kaggle-qa-with-gemma-kerasnlp-starter (Version 11)  
3. Fine-tune Gemma models in Keras using LoRA, Kaggle Code, https://www.kaggle.com/code/nilaychauhan/fine-tune-gemma-models-in-keras-using-lora (Version 1)  
4. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, ArXiv, https://arxiv.org/pdf/2106.09685.pdf
5. Abheesht Sharma, Matthew Watson, Parameter-efficient fine-tuning of GPT-2 with LoRA, https://keras.io/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/
6. Keras 3 API documentation / KerasNLP / Models / Gemma, https://keras.io/api/keras_nlp/models/gemma/
7. Kaggle Docs, Kaggle Dataset, https://www.kaggle.com/datasets/awsaf49/kaggle-docs

**Let's go**!


# What is Gemma?


Gemma is a collection of lightweight open generative AI models aimed mainly at developers and researchers. Created by Google DeepMind, the research lab that also developed Gemini, Gemma is available in several versions, with 2B and 7B parameters, as follows:


| Model                  | Parameters      | Tuned versions    | Description                                    | Recommended target platforms       |
|------------------------|-----------------|-------------------|------------------------------------------------|------------------------------------|
| `gemma_2b_en`          | 2.51B           | Pretrained        | 18-layer Gemma model (Gemma with 2B parameters)|Mobile devices and laptops          |
| `gemma_instruct_2b_en` | 2.51B           | Instruction tuned | 18-layer Gemma model (Gemma with 2B parameters)| Mobile devices and laptops         | 
| `gemma_7b_en`          | 8.54B           | Pretrained        | 28-layer Gemma model (Gemma with 7B parameters)| Desktop computers and small servers|
| `gemma_instruct_7b_en` | 8.54B           | Instruction tuned | 28-layer Gemma model (Gemma with 7B parameters)| Desktop computers and small servers|




# What is LoRA?  

LoRA stands for Low-Rank Adaptation. It is a method for fine-tuning large language models (LLMs) that freezes the pretrained weights and injects trainable rank-decomposition matrices into the model's layers. This considerably reduces the number of trainable parameters during fine-tuning. According to the LoRA paper, compared with fully fine-tuning GPT-3 175B with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.
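
In the notation of the LoRA paper (source 4 above), the adapted layer can be written as follows. The symbols are the paper's, not names used in this notebook's code:

$$
h = W_0 x + \Delta W x = W_0 x + B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Here $W_0$ is the frozen pretrained weight matrix and only $A$ and $B$ are trained, so the layer has $r(d + k)$ trainable parameters instead of $d k$.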

# How do we proceed?

For fine-tuning with LoRA, we will follow these steps:

1. Install prerequisites
2. Load and process the data for fine-tuning
3. Initialize the code for Gemma causal language model (Gemma Causal LM)
4. Perform fine-tuning
5. Test the fine-tuned model with questions from the data used for fine-tuning and with additional questions

# Prerequisites


## Installs

In [1]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
# Quote the requirement so the shell does not treat ">=3" as an output redirect.
!pip install -q -U "keras>=3"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.1.1 which is incompatible.

## Imports

In [2]:
import os
os.environ["KERAS_BACKEND"] = "jax" # you can also use tensorflow or torch
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # avoid memory fragmentation on JAX backend.

import keras
import keras_nlp

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas() # progress bar for pandas

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

2024-03-31 18:25:10.746069: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-31 18:25:10.746181: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-31 18:25:10.907272: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Configurations

In [3]:
class Config:
    seed = 42
    dataset_path = "/kaggle/input/kaggle-docs/questions_answers"
    preset = "gemma_2b_en" # name of pretrained Gemma
    sequence_length = 512 # max size of input sequence for training
    batch_size = 1 # size of the input batch in training, x 2 as two GPUs
    epochs = 10 # number of epochs to train

In [4]:
keras.utils.set_random_seed(Config.seed)

## Utils

This is a utility function that our QA class will use to format the answers to our queries.

In [5]:
def colorize_text(text):
    for word, color in zip(["Category", "Question", "Answer"], ["blue", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Load the data

In [6]:
df = pd.read_csv(f"{Config.dataset_path}/data.csv")
df.head()

Unnamed: 0,Question,Answer,Category
0,What are the different types of competitions a...,# Types of Competitions\n\nKaggle Competitions...,competition
1,What are the different competition formats on ...,There are handful of different formats competi...,competition
2,How to join a competition?,"Before you start, navigate to the [Competition...",competition
3,"How to form, manage, and disband teams in a co...",Everyone that competes in a Competition does s...,competition
4,How do I make a submission in a competition?,You will need to submit your model predictions...,competition


Let's check the total number of rows in this dataset.

In [7]:
df.shape[0]

60

For convenience, we will create the following template for QA:

In [8]:
template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"
df["prompt"] = df.progress_apply(lambda row: template.format(Category=row.Category,
                                                             Question=row.Question,
                                                             Answer=row.Answer), axis=1)
data = df.prompt.tolist()

  0%|          | 0/60 [00:00<?, ?it/s]
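
To make the template concrete, here is a small sketch of what one rendered prompt looks like. The template string is copied from the cell above; the sample row (in particular its answer text) is made up for illustration and is not taken from the dataset.

```python
# Template copied from the cell above.
template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"

# Hypothetical row: the answer text is invented for this example.
sample_row = {
    "Category": "competition",
    "Question": "How to join a competition?",
    "Answer": "Open the competition page and accept the rules.",
}

# Training prompt: all three slots are filled.
prompt = template.format(**sample_row)
print(prompt)

# Query prompt: the Answer slot is left empty so the model completes it.
query_prompt = template.format(
    Category=sample_row["Category"],
    Question=sample_row["Question"],
    Answer="",
)
print(repr(query_prompt))
```

The same template is used for training and for querying, so the fine-tuned model sees prompts at inference time in the layout it was trained on.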

# Specialized class to query Gemma


We define a specialized class to query Gemma.

## Initialize the code for Gemma Causal LM

In [10]:
gemma_causal_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
gemma_causal_lm.summary()

Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


## Define the specialized class

In [11]:
class GemmaQA:
    def __init__(self, max_length=512):
        self.max_length = max_length
        self.prompt = template
        self.gemma_causal_lm = gemma_causal_lm
        
    def query(self, category, question):
        response = self.gemma_causal_lm.generate(
            self.prompt.format(
                Category=category,
                Question=question,
                Answer=""), 
            max_length=self.max_length)
        display(Markdown(colorize_text(response)))
        

## Test the GemmaQA class

In [12]:
gemma_qa = GemmaQA()
category=""
question="What is Kaggle?"
gemma_qa.query(category, question)



**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is Kaggle?

**<font color='green'>Answer:</font>**
Kaggle is a platform for data scientists to compete with each other.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a kernel?

**<font color='green'>Answer:</font>**
A kernel is a piece of code that you write and submit to Kaggle.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a leaderboard?

**<font color='green'>Answer:</font>**
A leaderboard is a list of the top performers on a Kaggle competition.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a dataset?

**<font color='green'>Answer:</font>**
A dataset is a collection of data that you can use to train your model.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a submission?

**<font color='green'>Answer:</font>**
A submission is the result of your model on a Kaggle competition.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a kernel?

**<font color='green'>Answer:</font>**
A kernel is a piece of code that you write and submit to Kaggle.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a leaderboard?

**<font color='green'>Answer:</font>**
A leaderboard is a list of the top performers on a Kaggle competition.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a dataset?

**<font color='green'>Answer:</font>**
A dataset is a collection of data that you can use to train your model.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a submission?

**<font color='green'>Answer:</font>**
A submission is the result of your model on a Kaggle competition.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a kernel?

**<font color='green'>Answer:</font>**
A kernel is a piece of code that you write and submit to Kaggle.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a leaderboard?

**<font color='green'>Answer:</font>**
A leaderboard is a list of the top performers on a Kaggle competition.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a dataset?

**<font color='green'>Answer:</font>**
A dataset is a collection of data that you can use to train your model.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a submission?

**<font color='green'>Answer:</font>**
A submission is the result of your model on a Kaggle competition.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a kernel?

**<font color='green'>Answer:</font>**
A kernel is a piece of code that you write and submit to Kaggle.

**<font color='blue'>Category:</font>**
kaggle-

**<font color='red'>Question:</font>**
What is a leaderboard?

Answer
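
The pretrained model does not stop after the first answer: it keeps producing new question/answer blocks in the same layout until `max_length` is reached. If only the first answer is wanted, the response can be trimmed afterwards. The sketch below is one possible way to do that; `first_answer` is a hypothetical helper that is not part of this notebook, and it assumes the response follows the template markers exactly.

```python
def first_answer(response: str) -> str:
    # Hypothetical helper, not part of the notebook.
    # Assumes the answer starts after the first "Answer:" marker of the
    # template and ends where the model begins a new "Category:" block.
    marker = "\n\nAnswer:\n"
    start = response.find(marker)
    if start == -1:
        return ""  # no answer section was generated
    answer = response[start + len(marker):]
    # Keep only the text before the next block the model started on its own.
    return answer.split("\n\nCategory:", 1)[0].strip()


# Shortened, made-up response in the same layout as the output above.
response = (
    "\n\nCategory:\nkaggle-\n\nQuestion:\nWhat is Kaggle?"
    "\n\nAnswer:\nKaggle is a platform."
    "\n\nCategory:\nkaggle-\n\nQuestion:\nWhat is a kernel?"
    "\n\nAnswer:\nA kernel is code."
)
print(first_answer(response))
```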

## Gemma preprocessor


This preprocessing layer takes in batches of strings and returns outputs in an `(x, y, sample_weight)` format, where the `y` label is the next token id in the `x` sequence.

After the preprocessor, the data has shape `(num_samples, sequence_length)`. The cell below applies it to the first two prompts.

In [13]:
x, y, sample_weight = gemma_causal_lm.preprocessor(data[0:2])
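
The relation between `x`, `y` and `sample_weight` can be illustrated with plain token ids. This is a simplified sketch of the next-token setup, not KerasNLP's implementation: the real preprocessor also tokenizes the text and adds special tokens, and the helper name `make_causal_lm_example` and the padding id `0` are assumptions made for the example.

```python
def make_causal_lm_example(token_ids, sequence_length, pad_id=0):
    # Illustration only: tokenization and special tokens are skipped.
    # Truncate or pad to sequence_length + 1 so that x and y both end up
    # with exactly sequence_length entries after the shift.
    ids = list(token_ids)[: sequence_length + 1]
    mask = [1] * len(ids)
    padding = sequence_length + 1 - len(ids)
    ids += [pad_id] * padding
    mask += [0] * padding

    x = ids[:-1]              # inputs: every token except the last
    y = ids[1:]               # labels: the same tokens shifted left by one
    sample_weight = mask[1:]  # 0 where the label is padding, so it is ignored
    return x, y, sample_weight


x_toy, y_toy, w_toy = make_causal_lm_example([5, 8, 13, 21], sequence_length=6)
print(x_toy)  # [5, 8, 13, 21, 0, 0]
print(y_toy)  # [8, 13, 21, 0, 0, 0]
print(w_toy)  # [1, 1, 1, 0, 0, 0]
```

At each position the label is the token that follows the input token, and positions whose label is padding get weight zero.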

# Perform fine-tuning with LoRA

## Enable LoRA for the model

The LoRA rank determines the number of trainable parameters: a larger rank results in more parameters to train.

In [14]:
# Enable LoRA for the model and set the LoRA rank to 4.
gemma_causal_lm.backbone.enable_lora(rank=4)
gemma_causal_lm.summary()
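
To get a feel for how the rank affects the parameter count, the sketch below counts the parameters of the two low-rank factors for a single weight matrix. The layer size is hypothetical and chosen only for illustration; it is not read from the Gemma model, and the totals printed by `summary()` depend on which layers KerasNLP adapts.

```python
def lora_trainable_params(d: int, k: int, rank: int) -> int:
    """Parameters in the two LoRA factors: B (d x rank) and A (rank x k)."""
    return rank * (d + k)


# Hypothetical layer size, not taken from the Gemma configuration.
d, k = 2048, 2048
full = d * k  # parameters of the frozen weight matrix

for rank in (4, 8, 16):
    lora = lora_trainable_params(d, k, rank)
    print(f"rank={rank:2d}: {lora:,} trainable vs {full:,} frozen "
          f"({100 * lora / full:.2f}%)")
```

The count grows linearly with the rank, so doubling the rank doubles the number of trainable parameters for each adapted layer.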

## Run the training sequence

In [15]:
# Limit the input sequence length to 512 (to control memory usage).
gemma_causal_lm.preprocessor.sequence_length = Config.sequence_length 

# Compile the model with loss, optimizer, and metric
gemma_causal_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=8e-5),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Train model
gemma_causal_lm.fit(data, epochs=Config.epochs, batch_size=Config.batch_size)

Epoch 1/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 67s 736ms/step - loss: 1.7209 - sparse_categorical_accuracy: 0.5241
Epoch 2/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 731ms/step - loss: 1.6869 - sparse_categorical_accuracy: 0.5313
Epoch 3/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 732ms/step - loss: 1.6175 - sparse_categorical_accuracy: 0.5417
Epoch 4/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 731ms/step - loss: 1.5770 - sparse_categorical_accuracy: 0.5509
Epoch 5/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 732ms/step - loss: 1.5537 - sparse_categorical_accuracy: 0.5552
Epoch 6/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 83s 1s/step - loss: 1.5304 - sparse_categorical_accuracy: 0.5568
Epoch 7/10
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 731ms/step - loss: 1.5028 - sparse_categorical_accuracy: 0.5630
Epoch 8/10
60/60

<keras.src.callbacks.history.History at 0x79a11455df60>
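
The loss reported during training is a sparse categorical cross-entropy computed from logits, with padded positions weighted out. The pure-Python sketch below shows the idea for a handful of positions. It is an illustration, not Keras code: the function names are made up, and the normalization used here (dividing by the sum of the weights) is one common convention that may differ from the exact reduction Keras applies to the loss.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one position: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def weighted_mean_loss(batch_logits, labels, sample_weight):
    """Average the per-position loss over positions with non-zero weight."""
    total = sum(
        w * sparse_ce_from_logits(z, t)
        for z, t, w in zip(batch_logits, labels, sample_weight)
    )
    return total / max(sum(sample_weight), 1)


# Two positions over a toy vocabulary of two tokens; the second position
# is padding (weight 0), so it does not contribute to the loss.
toy_logits = [[0.0, 0.0], [5.0, -5.0]]
toy_labels = [0, 1]
toy_weights = [1, 0]
print(weighted_mean_loss(toy_logits, toy_labels, toy_weights))
```

With uniform logits over two tokens the per-position loss is `log(2)`, which is what the example prints because the padded position is ignored.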

# Test the fine-tuned model

In [16]:
gemma_qa = GemmaQA()

## Sample 1

In [17]:
row = df.iloc[0]
gemma_qa.query(row.Category,row.Question)



**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
What are the different types of competitions available on Kaggle?

**<font color='green'>Answer:</font>**
There are several competition types on Kaggle.

## Competitions

This is the type of competition you will most likely be interested in if you are a new user looking to participate in a competition. Competitions are organized by the Kaggle team and can be public or invite-only. They are typically open for 2-4 weeks and can be either public or invite-only.

You can participate in the competition as an individual or as a team. You can join a public team or create your own. You can also join an existing team, but it’s more likely that you’ll find a team that you want to join.

If you are a team, you’ll need to invite one or more teammates to the competition. Once they have accepted, the team is created and ready to go.

If you join a public team, it’ll show up on your Team Leaderboard and Team Leaderboard Page, which you can visit to track how you’re doing as a team. You can see your individual leaderboard as well, which shows how you’re performing compared to everyone else who has joined the competition as an individual.

If you’re a public team member, you’ll also see your team’s leaderboard and leaderboard page on Kaggle.

If you create a private team, only you and your teammates will see your team’s leaderboard and leaderboard page on Kaggle.

You can find out more about teams and how they work by reading the Team Guide.

## Datasets

Kaggle datasets are a great way to share your models and data with other Kaggle users. They can be anything from data sets to notebooks to competitions. You can find datasets by searching on Kaggle or by browsing the Datasets Directory.

## Challenges

Kaggle challenges are a new way to engage in the community. Anyone can create a challenge to invite other users to collaborate on a project. Challenges are typically open-ended and open-source.

If you’re interested in participating in a challenge, we recommend reading through the challenge description and rules before joining. You should also check out the other users who have already signed up to see how they’re approaching it.

## Competitions with Datasets

If you have a dataset you want to share in a competition, you can do that too! You can create a dataset in your profile and make it public or invite other Kag

## Sample 2

In [18]:
row = df.iloc[15]
gemma_qa.query(row.Category,row.Question)



**<font color='blue'>Category:</font>**
kaggle-tpu

**<font color='red'>Question:</font>**
How to load and save model on TPU?

**<font color='green'>Answer:</font>**
TpuModel.load_model()

## Load a model

The model can be loaded using `load_model()` function. The following code loads the MobileNetV2 model from a SavedModel checkpoint.

The `load_model()` function is also available for loading from a SavedModel file (.pb). This is useful for loading models with TPUv2, as SavedModel files cannot be loaded on TPUv2.

## Save a model

The model can be saved to disk using the `save_model()` and `save_checkpoint()` function:

## Sample 3

In [19]:
row = df.iloc[25]
gemma_qa.query(row.Category,row.Question)



**<font color='blue'>Category:</font>**
kaggle-noteboook

**<font color='red'>Question:</font>**
What are the different types of notebooks available on Kaggle?

**<font color='green'>Answer:</font>**
There are two main types of notebooks on Kaggle: Public or Shared Notebooks, and Private Notebooks.

## Public or Shared Notebooks

Public or Shared Notebooks are available for everyone. These notebooks can either be created and owned by Kaggle users, or shared by other Kaggle users. Anyone can view these notebooks, but you need to be logged in to Kaggle in order to run them.

## Private Notebooks

Private notebooks are owned by Kaggle users and are only available to those with the appropriate permission. These can either be created by Kaggle users, or shared by other Kaggle users.

## Unseen questions

In [20]:
category = "notebook"
question = "How to run a notebook?"
gemma_qa.query(category,question)



**<font color='blue'>Category:</font>**
kaggle-notebook

**<font color='red'>Question:</font>**
How to run a notebook?

**<font color='green'>Answer:</font>**
When you click on the "Run" button, you can choose whether to run the entire notebook or just a single cell. If you select "Entire notebook", then all of the cells of the notebook will be executed.

You can also use the "Run cell" button to only run a single cell in a notebook.

If you have any questions about using notebooks, you can also use the “Ask question” button in the bottom left corner of the notebook.

In [21]:
category = "discussions"
question = "How to create a discussion topic?"
gemma_qa.query(category,question)



**<font color='blue'>Category:</font>**
kaggle-discussions

**<font color='red'>Question:</font>**
How to create a discussion topic?

**<font color='green'>Answer:</font>**
To create a discussion topic, you can follow these steps:

1. Click on “Discussions”.
2. In the “Topic” dropdown, select “Create”.
3. In the pop-up window, enter a title for your discussion.
4. Select the appropriate “Type” and "Visibility" for your discussion.
5. Add tags to the end of the title to help others find the topic.
6. Select the appropriate "Tags" for your discussion.
7. Click "Save" to create the discussion.


Note that you can edit your discussion title, type and tags at any time after it has been created.

In [22]:
category = "competitions"
question = "What is a code competition?"
gemma_qa.query(category,question)



**<font color='blue'>Category:</font>**
kaggle-competitions

**<font color='red'>Question:</font>**
What is a code competition?

**<font color='green'>Answer:</font>**
Code competitions are a popular format for Kaggle's community to compete and collaborate. Code competitions can range from small, focused competitions with only a few participants to large, multi-week competitions with hundreds of participants.

## Types of Code Competitions

## Data Science

Data science competitions on Kaggle are often the first to test the skills of newcomers. Data science competitions often require a deeper understanding of the data and its structure, as well as more advanced programming skills.

## Machine Learning

Machine learning competitions are often the most popular on Kaggle, as they offer a wide range of interesting and challenging problems to work on. These competitions often require a combination of both data and programming skills, as machine learning is an interdisciplinary field.

## Vision (Image Classification)

Image classification competitions are popular on Kaggle and are typically data science competitions. The data used for vision competitions often consists of images from various categories, such as cars, dogs, or faces. The goal of these competitions is to classify each image as belonging to one of those categories.

## Regression

Regression competitions are also popular on Kaggle. These competitions often involve predicting a quantitative value from a set of inputs, typically numerical variables such as time or money. Regression competitions can be used to make predictions about anything from stock prices to customer behavior.

## Time Series

Time series competitions are competitions that involve predicting time-series data. Time series can be thought of as data that changes over time. For example, the number of tweets per hour over the last week could be a time series. The goal of time series competitions is to predict the values of the time series for future time periods.

## Text Classification

Text classification competitions are competitions that involve classifying text into one of several categories. For example, you might be given a set of news articles and asked to classify each one as belonging to one of several categories, such as “politics” or “entertainment.” Text classification competitions are popular on Kaggle, as they offer a way to use machine learning to solve real-world problems related to natural language processing.

# Conclusions



We demonstrated how to fine-tune a Gemma model using LoRA.
We also created a class to query the Gemma model and tested it with examples from the training data as well as with new, unseen questions.