<center><h1>Google AI Assistant for Data Science teaching</h1></center>

<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>

# Introduction

This notebook builds an AI assistant for teaching Data Science.

## The method

We will fine-tune an LLM on Data Science questions and answers. Then we will create a custom class that queries this model.

## The model

We will use the Gemma model, fine-tuned on our Data Science Q&A data using LoRA.

## The data

To fine-tune Gemma, we will use the [Data Science Q&A Treasury](https://www.kaggle.com/datasets/memocan/data-science-interview-q-and-a-treasury) dataset. This dataset contains over 150 questions and answers about Data Science.

## Previous work

This notebook builds largely on the following sources:

1. Gemma Model Card, Kaggle Models, https://www.kaggle.com/models/google/gemma
2. Kaggle QA with Gemma - KerasNLP Starter, Kaggle Code, https://www.kaggle.com/code/awsaf49/kaggle-qa-with-gemma-kerasnlp-starter (Version 11)
3. Fine-tune Gemma models in Keras using LoRA, Kaggle Code, https://www.kaggle.com/code/nilaychauhan/fine-tune-gemma-models-in-keras-using-lora (Version 1)
4. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, ArXiv, https://arxiv.org/pdf/2106.09685.pdf
5. Abheesht Sharma, Matthew Watson, Parameter-efficient fine-tuning of GPT-2 with LoRA, https://keras.io/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/
6. Keras 3 API documentation / KerasNLP / Models / Gemma, https://keras.io/api/keras_nlp/models/gemma/


# Introduction to Gemma


Gemma is a collection of lightweight open generative AI models aimed mainly at developers and researchers. Created by Google DeepMind, the research lab that also developed Gemini, Gemma is available in several versions, with 2B and 7B parameters, as follows:

| Model                  | Parameters | Tuned version     | Description                                     | Recommended target platforms        |
|------------------------|------------|-------------------|-------------------------------------------------|-------------------------------------|
| `gemma_2b_en`          | 2.51B      | Pretrained        | 18-layer Gemma model (Gemma with 2B parameters) | Mobile devices and laptops          |
| `gemma_instruct_2b_en` | 2.51B      | Instruction tuned | 18-layer Gemma model (Gemma with 2B parameters) | Mobile devices and laptops          |
| `gemma_7b_en`          | 8.54B      | Pretrained        | 28-layer Gemma model (Gemma with 7B parameters) | Desktop computers and small servers |
| `gemma_instruct_7b_en` | 8.54B      | Instruction tuned | 28-layer Gemma model (Gemma with 7B parameters) | Desktop computers and small servers |


For this notebook, we will fine-tune the `gemma_2b_en` model, the pretrained 2B-parameter Gemma variant.

# LoRA introduction

LoRA stands for Low-Rank Adaptation. It is a method for fine-tuning large language models (LLMs) that freezes the pretrained weights of the LLM and injects trainable rank-decomposition matrices into its layers.   
This considerably reduces the number of trainable parameters during fine-tuning.   
According to the LoRA paper, compared with GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by a factor of 10,000 and the GPU memory requirement by a factor of 3.
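
To make the idea concrete, here is a minimal NumPy sketch of a LoRA-style dense layer. It is an illustration under stated assumptions, not the KerasNLP implementation: the class `LoRADense`, the helper `lora_param_counts`, and the layer sizes are all hypothetical. The weight `W` stays frozen, and only the two small matrices `A` and `B` would be trained.

```python
import numpy as np


def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters for one dense weight: full fine-tuning vs. LoRA."""
    full = d_in * d_out            # every entry of W is trained
    lora = rank * (d_in + d_out)   # only A (d_in x r) and B (r x d_out) are trained
    return full, lora


class LoRADense:
    """Toy dense layer (hypothetical): frozen W plus a low-rank update A @ B."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_in, d_out))             # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(d_in, rank))  # trainable, random init
        self.B = np.zeros((rank, d_out))                    # trainable, zero init

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        return x @ self.W + (x @ self.A) @ self.B


full, lora = lora_param_counts(d_in=2048, d_out=2048, rank=5)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.1f}x")

layer = LoRADense(d_in=8, d_out=4, rank=2)
x = np.ones((1, 8))
# B starts at zero, so the adapted layer initially matches the frozen one.
print(np.allclose(layer(x), x @ layer.W))
```

Because `B` is initialized to zero, the low-rank update contributes nothing at the start of training, so fine-tuning begins from the behavior of the pretrained model.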

# Installations and configurations

In [1]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"  # quoted so the shell does not treat >= as a redirect

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.1.1 which is incompatible.

In [2]:
import os
os.environ["KERAS_BACKEND"] = "jax" # you can also use tensorflow or torch
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # avoid memory fragmentation on JAX backend.

import keras
import keras_nlp

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas() # progress bar for pandas

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

2024-04-01 14:34:32.206269: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-01 14:34:32.206379: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-01 14:34:32.343044: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
class Config:
    seed = 42
    dataset_path = "/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv"
    preset = "gemma_2b_en" # name of pretrained Gemma
    sequence_length = 512 # max size of input sequence for training
    batch_size = 1 # size of the input batch in training, x 2 as two GPUs
    epochs = 10 # number of epochs to train

In [4]:
keras.utils.set_random_seed(Config.seed)

In [5]:
def colorize_text(text):
    for word, color in zip(["Question", "Answer"], ["blue", "red"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Load the data

In [6]:
df = pd.read_csv(Config.dataset_path)
df.head()

| Unnamed: 0 | question | answer |
|------------|----------|--------|
| 0 | What is supervised machine learning? 👶 | Supervised learning is a type of machine learn... |
| 1 | What is regression? Which models can you use t... | Regression is a part of supervised ML. Regress... |
| 2 | What is linear regression? When do we use it? 👶 | Linear regression is a model that assumes a li... |
| 3 | What are the main assumptions of linear regres... | There are several assumptions of linear regres... |
| 4 | What’s the normal distribution? Why do we care... | The normal distribution is a continuous probab... |


# Fine-tune Gemma

We prepare a template and generate a prompt using this template from each row in the dataset.

In [7]:
template = "\n\nQuestion:\n{question}\n\nAnswer:\n{answer}"
df["prompt"] = df.progress_apply(lambda row: template.format(question=row.question,
                                                             answer=row.answer), axis=1)
data = df.prompt.tolist()

  0%|          | 0/166 [00:00<?, ?it/s]
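
To show what a single training example looks like, the snippet below applies the same template to one question/answer pair. The pair is made up for illustration and is not taken from the dataset.

```python
# Same template string as in the cell above.
template = "\n\nQuestion:\n{question}\n\nAnswer:\n{answer}"

# Hypothetical row, used only to show the resulting prompt layout.
example = {
    "question": "What is overfitting?",
    "answer": "Overfitting is when a model fits the training data too closely.",
}

prompt = template.format(**example)
print(prompt)

# At inference time the answer slot is left empty,
# so the model continues the text right after "Answer:\n".
inference_prompt = template.format(question=example["question"], answer="")
print(repr(inference_prompt))
```

Using one template for both training and inference keeps the prompt layout seen at query time identical to the one the model was fine-tuned on.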

## Initialize the code for Gemma Causal LM

We initialize `GemmaCausalLM` with the `gemma_2b_en` preset.

In [8]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
gemma_lm.summary()

Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


## Gemma preprocessor

In [9]:
x, y, sample_weight = gemma_lm.preprocessor(data[0:1])
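
The preprocessor call above returns inputs, labels, and per-token weights. For next-token prediction, the labels are typically the inputs shifted by one position, with padded positions given zero weight. The sketch below mimics that step on a made-up token sequence. The token IDs, the pad ID of 0, and the helper name `make_causal_lm_example` are assumptions for illustration, not values from the Gemma tokenizer.

```python
def make_causal_lm_example(token_ids, sequence_length, pad_id=0):
    """Toy next-token-prediction preprocessing (hypothetical helper).

    Pads or truncates to sequence_length + 1 tokens, then shifts by one so
    that the label at position i is the token at position i + 1.
    """
    ids = list(token_ids)[: sequence_length + 1]
    mask = [1] * len(ids)
    pad = sequence_length + 1 - len(ids)
    ids += [pad_id] * pad
    mask += [0] * pad

    x = {"token_ids": ids[:-1], "padding_mask": mask[:-1]}
    y = ids[1:]               # next-token labels
    sample_weight = mask[1:]  # padded positions get zero weight in the loss
    return x, y, sample_weight


x, y, sample_weight = make_causal_lm_example([2, 17, 42, 9, 1], sequence_length=6)
print(x["token_ids"])
print(y)
print(sample_weight)
```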

## Enable LoRA for the model

We set LoRA rank to 5. The higher the LoRA rank, the higher the number of trainable parameters.

In [10]:
# Enable LoRA for the model and set the LoRA rank to 5.
gemma_lm.backbone.enable_lora(rank=5)
gemma_lm.summary()

The number of trainable parameters is 1.7 M (about 6.5 MB).  
This is roughly 0.07% of the 2.5 B (9.35 GB) total parameters.
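
As a quick sanity check on these figures, the share of trainable parameters can be computed directly. The counts below are the rounded values quoted above (1.7 M trainable, 2.5 B total), and the 4 bytes per parameter assumes float32 weights, so the results are approximate.

```python
trainable_params = 1.7e6   # rounded LoRA parameter count quoted above
total_params = 2.5e9       # rounded total parameter count quoted above

share = trainable_params / total_params
print(f"Trainable share: {share:.3%}")

# Assumption: float32 weights, i.e. 4 bytes per parameter; convert to MiB.
trainable_mib = trainable_params * 4 / 2**20
print(f"Trainable weights: {trainable_mib:.1f} MiB")
```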

## Run the fine-tuning sequence

In [11]:
# Limit the input sequence length to 512 (to control memory usage).
gemma_lm.preprocessor.sequence_length = Config.sequence_length 

# Compile the model with loss, optimizer, and metric
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=8e-5),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Train model
gemma_lm.fit(data, epochs=Config.epochs, batch_size=Config.batch_size)

Epoch 1/10
166/166 ━━━━━━━━━━━━━━━━━━━━ 143s 733ms/step - loss: 0.5137 - sparse_categorical_accuracy: 0.5402
Epoch 2/10
166/166 ━━━━━━━━━━━━━━━━━━━━ 121s 728ms/step - loss: 0.4342 - sparse_categorical_accuracy: 0.5783
Epoch 3/10
166/166 ━━━━━━━━━━━━━━━━━━━━ 121s 728ms/step - loss: 0.4123 - sparse_categorical_accuracy: 0.5927
Epoch 4/10
166/166 ━━━━━━━━━━━━━━━━━━━━ 121s 728ms/step - loss: 0.3994 - sparse_categorical_accuracy: 0.6016
Epoch 5/10
166/166 ━━━━━━━━━━━━━━━━━━━━ 121s 728ms/step - loss: 0.3898 - sparse_categorical_accuracy: 0.6081
Epoch 6/10
166/166 ━━━━━━━━━━━━━━━━━━━━ 121s 728ms/step - loss: 0.3777 - sparse_categorical_accuracy: 0.6162
Epoch 7/10
166/166 ━━━━━━━━━━━━━━━━━━━━ 121s 728ms/step - loss: 0.3622 - sparse_categorical_accuracy: 0.6280

<keras.src.callbacks.history.History at 0x7843602a77f0>

# Test the fine-tuned model

## Define the specialized class

In [12]:
class GemmaQA:
    def __init__(self, max_length=512):
        self.max_length = max_length
        self.prompt = template
        self.gemma_lm = gemma_lm
        
    def query(self, question):
        response = self.gemma_lm.generate(
            self.prompt.format(
                question=question,
                answer=""), 
            max_length=self.max_length)
        display(Markdown(colorize_text(response)))
        

In [13]:
gemma_qa = GemmaQA()
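
`query` renders the full generated text, prompt included. If only the answer is needed, for example to store or post-process it, a small helper can split it off. The function `extract_answer` below is a hypothetical addition, not part of the original class. It assumes the response follows the `Question:`/`Answer:` template used for fine-tuning.

```python
ANSWER_MARKER = "\n\nAnswer:\n"


def extract_answer(response: str) -> str:
    """Return the text after the first answer marker, or '' if it is missing."""
    _, found, answer = response.partition(ANSWER_MARKER)
    return answer.strip() if found else ""


# Made-up response in the same layout that the template produces.
response = "\n\nQuestion:\nWhat is SVM?\n\nAnswer:\nA supervised learning algorithm."
print(extract_answer(response))
```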

## Test 1

In [14]:
row = df.iloc[5]
gemma_qa.query(row.question)



**<font color='blue'>Question:</font>**
How do we check if a variable follows the normal distribution? ‍⭐️

**<font color='red'>Answer:</font>**
1. Calculate the mean of the variable.
2. Calculate the standard deviation of the variable.
3. Calculate the standard error of the variable as follows: standard_error = standard_deviation / squareroot(n)
4. Calculate z-score as: z-score = (x - mean) / standard_error
5. If z-score > 2 or lower than -2, then the variable follows the normal distribution.

## Test 2

In [15]:
row = df.iloc[10]
gemma_qa.query(row.question)



**<font color='blue'>Question:</font>**
What is SGD  —  stochastic gradient descent? What’s the difference with the usual gradient descent? ‍⭐️

**<font color='red'>Answer:</font>**
SGD stands for Stochastic Gradient Descent and it's an algorithm for training machine learning models. In this algorithm, the model parameters are updated for each training example. The difference with usual gradient descent is that in the former algorithm, the gradient descent is performed on a randomly sampled training example, while in the latter, the gradient is calculated on the entire training set. The goal of SGD is to find the global minimum, while the gradient descent tries to find the local minimum. 

**Disclaimer**
**The answers provided here are our own research on Quora and are not official answers. Please consult an expert before making any buying/selling decision based on our research.**

## Test 3

In [16]:
row = df.iloc[15]
gemma_qa.query(row.question)



**<font color='blue'>Question:</font>**
How to validate your models? 👶

**<font color='red'>Answer:</font>**
Validation is the process of using a training set to build a model, and then using the model to predict on the validation set, and then using the validation loss as a metric to evaluate the model.

## Fresh question 1

In [17]:
question = "What is regularization?"
gemma_qa.query(question)



**<font color='blue'>Question:</font>**
What is regularization?

**<font color='red'>Answer:</font>**
Regularization is a technique to reduce overfitting of the models. It does it by adding penalty terms to the original cost functions of the models.

## Fresh question 2

In [18]:
question = "What is SVM?"
gemma_qa.query(question)



**<font color='blue'>Question:</font>**
What is SVM?

**<font color='red'>Answer:</font>**
Support Vector Machine (SVM) is one of the supervised machine learning algorithms. It works on the concept of maximizing the margin for better classification. The model finds the optimal hyperplane that separates two classes of data points in a multi-dimensional space.

## Fresh question 3

In [19]:
question = "What is Dropout?"
gemma_qa.query(question)



**<font color='blue'>Question:</font>**
What is Dropout?

**<font color='red'>Answer:</font>**
Dropout consists of a simple technique to reduce Overfitting. In this technique, we set some nodes in the network to 0, randomly or periodically.

# Conclusions


We fine-tuned Gemma on a set of Data Science interview questions and answers.   
Then we tested the model with questions from the dataset used for fine-tuning.  
Finally, we also tested it with some questions that were not in the dataset.