<a href="https://colab.research.google.com/github/ashispapu/LLMs/blob/main/Google_AI_Assistant_for_Data_Science_teaching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'data-science-interview-q-and-a-treasury:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F4498747%2F7705679%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240403%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240403T082644Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D14cdf030232eb9e8c603c85a227afe99d4ef8ae28c816dffa137f5be86e4f4bfadd374e651b5cbd6a0961e6c72926c78edfdc10946e20761ae8e84050dcf195313bf081960d3702f1f5fa16796f788b93bcf70311889bc72808256b6737e5b64e5f7923347b5582d0d6cd52b09cb69207684bcb8d6c33c73e5cd1025c13e36fd3c1eca42ab7b59876af1bc0dca35b4d3de1159be2b1029f13646629cc7c6ea7759c6c2a421c4b04896cdadce607d8db98906021443584de73455832157f2a888a9bf822c4471622ea5d2fc9345edca7c54cdc44f1a00da1d353685776f0783f155212e78bb8f30fc7ca414163193351cc33ad6f2f67f9346d284107d6a84d894'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
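            tfile.flush()  # ensure all downloaded bytes are on disk before extraction
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Use a distinct name so the tarfile module is not shadowed.
              with tarfile.open(tfile.name) as tar:
                tar.extractall(destination_path)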
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


<center><h1>Google AI Assistant for Data Science teaching</h1></center>

<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>

# Introduction

This notebook builds an AI assistant for teaching Data Science.

## The method

We will fine-tune an LLM on Data Science questions and answers. Then we will create a custom class that queries the fine-tuned model.

## The model

The model is Gemma, fine-tuned on our Data Science Q&A data using LoRA.

## The data

To fine-tune Gemma, we will use the [Data Science Q&A Treasury](https://www.kaggle.com/datasets/memocan/data-science-interview-q-and-a-treasury) dataset. This dataset contains over 150 questions and answers about Data Science.

## Previous work

This work is largely based on previous work. Here I list the sources:

1. Gemma Model Card, Kaggle Models, https://www.kaggle.com/models/google/gemma
2. Kaggle QA with Gemma - KerasNLP Starter, Kaggle Code, https://www.kaggle.com/code/awsaf49/kaggle-qa-with-gemma-kerasnlp-starter (Version 11)
3. Fine-tune Gemma models in Keras using LoRA, Kaggle Code, https://www.kaggle.com/code/nilaychauhan/fine-tune-gemma-models-in-keras-using-lora (Version 1)
4. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, ArXiv, https://arxiv.org/pdf/2106.09685.pdf
5. Abheesht Sharma, Matthew Watson, Parameter-efficient fine-tuning of GPT-2 with LoRA, https://keras.io/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/
6. Keras 3 API documentation / KerasNLP / Models / Gemma, https://keras.io/api/keras_nlp/models/gemma/


# Introduction about Gemma


Gemma is a collection of lightweight, open generative AI models aimed mainly at developers and researchers. Created by Google DeepMind, the research lab that also developed Gemini, Gemma is available in 2B and 7B parameter sizes, each in a pretrained and an instruction-tuned variant, as follows:

| Model                  | Parameters | Variant           | Description                                     | Recommended target platforms        |
|------------------------|------------|-------------------|-------------------------------------------------|-------------------------------------|
| `gemma_2b_en`          | 2.51B      | Pretrained        | 18-layer Gemma model (Gemma with 2B parameters) | Mobile devices and laptops          |
| `gemma_instruct_2b_en` | 2.51B      | Instruction tuned | 18-layer Gemma model (Gemma with 2B parameters) | Mobile devices and laptops          |
| `gemma_7b_en`          | 8.54B      | Pretrained        | 28-layer Gemma model (Gemma with 7B parameters) | Desktop computers and small servers |
| `gemma_instruct_7b_en` | 8.54B      | Instruction tuned | 28-layer Gemma model (Gemma with 7B parameters) | Desktop computers and small servers |


For this notebook, we will fine-tune the `gemma_2b_en` model, the pretrained `2B` Gemma variant.

# LoRA introduction

LoRA stands for Low-Rank Adaptation. It is a method for fine-tuning large language models (LLMs) that freezes the pretrained weights and injects trainable rank-decomposition matrices into the model's layers.   
As a result, the number of trainable parameters during fine-tuning drops considerably.   
According to the LoRA paper, compared with fully fine-tuning GPT-3 175B with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.
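
The idea can be illustrated with a small NumPy sketch. This is only a toy illustration under stated assumptions: the layer sizes are arbitrary (not Gemma's actual layout), the names `W`, `A`, `B`, and `lora_forward` are hypothetical, and the scaling factor used in the paper is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; these are assumptions, not Gemma's real layer shapes.
d_in, d_out, rank = 2048, 2048, 5

# Frozen pretrained weight: never updated during fine-tuning.
W = rng.normal(size=(d_in, d_out))

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially behaves exactly like the frozen one.
A = rng.normal(scale=0.01, size=(d_in, rank))
B = np.zeros((rank, d_out))

def lora_forward(x):
    # Output of the frozen layer plus the low-rank correction (x @ A) @ B.
    return x @ W + (x @ A) @ B

x = rng.normal(size=(1, d_in))
y = lora_forward(x)

full_params = W.size
lora_params = A.size + B.size
print(f"full: {full_params:,}  LoRA: {lora_params:,}  "
      f"ratio: {full_params / lora_params:.0f}x")
```

Only `A` and `B` would receive gradients, which is why the trainable parameter count scales with the rank rather than with the full weight matrix.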

# Installations and configurations

In [None]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"  # quoted so the shell does not treat >= as a redirect

In [None]:
import os
os.environ["KERAS_BACKEND"] = "jax" # you can also use tensorflow or torch
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # avoid memory fragmentation on JAX backend.

import keras
import keras_nlp

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas() # progress bar for pandas

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

In [None]:
class Config:
    seed = 42
    dataset_path = "/kaggle/input/data-science-interview-q-and-a-treasury/dataset.csv"
    preset = "gemma_2b_en" # name of pretrained Gemma
    sequence_length = 512 # max size of input sequence for training
    batch_size = 1 # size of the input batch in training, x 2 as two GPUs
    epochs = 10 # number of epochs to train

In [None]:
keras.utils.set_random_seed(Config.seed)

In [None]:
def colorize_text(text):
    for word, color in zip(["Question", "Answer"], ["blue", "red"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Load the data

In [None]:
df = pd.read_csv(Config.dataset_path)
df.head()

# Fine-tune Gemma

We prepare a template and generate a prompt using this template from each row in the dataset.

In [None]:
template = "\n\nQuestion:\n{question}\n\nAnswer:\n{answer}"
df["prompt"] = df.progress_apply(lambda row: template.format(question=row.question,
                                                             answer=row.answer), axis=1)
data = df.prompt.tolist()
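
To make the template concrete, here is a minimal sketch of what one formatted prompt looks like. The example row is made up for illustration and is not taken from the dataset. The same template is reused at inference time with an empty answer, so that the model continues the text after `Answer:`.

```python
template = "\n\nQuestion:\n{question}\n\nAnswer:\n{answer}"

# Hypothetical row, invented for illustration; not from the dataset.
example = {
    "question": "What is overfitting?",
    "answer": "A model fits noise in the training data and generalizes poorly.",
}

# Training prompt: question and answer are both filled in.
training_prompt = template.format(**example)

# Inference prompt: the answer slot is left empty for the model to complete.
inference_prompt = template.format(question=example["question"], answer="")

print(repr(training_prompt))
print(repr(inference_prompt))
```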

## Initialize the code for Gemma Causal LM

We initialize `GemmaCausalLM` with the `gemma_2b_en` preset.

In [None]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(Config.preset)
gemma_lm.summary()

## Gemma preprocessor

In [None]:
x, y, sample_weight = gemma_lm.preprocessor(data[0:1])
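
The three values returned by the preprocessor follow the usual causal language modeling setup: the labels are the inputs shifted by one token, and padded positions are excluded from the loss. The sketch below is a toy NumPy illustration of that general idea, not KerasNLP's implementation; the token ids, the padding id `PAD`, and the short `sequence_length` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical token ids for one short prompt; 0 is assumed to be the padding id.
PAD = 0
tokens = [2, 15, 27, 31, 1]   # made-up values standing in for <start> ... <end>
sequence_length = 8

# Pad to sequence_length + 1 so inputs and labels each have sequence_length items.
padded = np.array(tokens + [PAD] * (sequence_length + 1 - len(tokens)))
mask = padded != PAD

x = {"token_ids": padded[:-1], "padding_mask": mask[:-1]}  # model inputs
y = padded[1:]                                             # next-token labels
sample_weight = mask[1:]                                   # zero weight on padded labels

print(x["token_ids"])
print(y)
print(sample_weight.astype(int))
```

At each position the model sees `x["token_ids"]` up to that point and is trained to predict the corresponding entry of `y`; `sample_weight` keeps padding from contributing to the loss or the weighted metrics.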

## Enable LoRA for the model

We set LoRA rank to 5. The higher the LoRA rank, the higher the number of trainable parameters.

In [None]:
# Enable LoRA for the model and set the LoRA rank to 5.
gemma_lm.backbone.enable_lora(rank=5)
gemma_lm.summary()

The number of trainable parameters is now 1.7 M (about 6.5 MB).  
This is roughly 0.07% of the model's 2.5 B (9.35 GB) total parameters.
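
The figures quoted above can be reproduced with a short back-of-the-envelope calculation. This is a sketch under assumptions that are not stated in this notebook: that the adapters are attached to the query and value projections of each attention layer, that Gemma 2B uses the shapes listed below, and that the low-rank factors are shaped as described in the docstring. Compare the result against the output of `gemma_lm.summary()` before relying on it.

```python
# Assumed Gemma 2B attention shapes (not stated in this notebook).
num_layers = 18
hidden_dim = 2048
num_query_heads = 8
num_kv_heads = 1
head_dim = 256
rank = 5

def lora_params(kernel_shape, rank):
    """Parameter count of the two low-rank factors for one projection kernel.

    Assumes factor A has shape kernel_shape[:-1] + (rank,) and
    factor B has shape (rank, kernel_shape[-1]).
    """
    a = rank
    for dim in kernel_shape[:-1]:
        a *= dim
    b = rank * kernel_shape[-1]
    return a + b

query = lora_params((num_query_heads, hidden_dim, head_dim), rank)
value = lora_params((num_kv_heads, hidden_dim, head_dim), rank)
trainable = num_layers * (query + value)

total = 2.5e9  # approximate total parameter count quoted above
print(f"trainable: {trainable:,}")
print(f"size: {trainable * 4 / 2**20:.2f} MB (float32)")
print(f"share: {100 * trainable / total:.3f}%")
```

Because the count is linear in `rank`, doubling the rank doubles the number of trainable parameters.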

## Run the fine-tuning sequence

In [None]:
# Limit the input sequence length to 512 (to control memory usage).
gemma_lm.preprocessor.sequence_length = Config.sequence_length

# Compile the model with loss, optimizer, and metric
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=8e-5),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Train model
gemma_lm.fit(data, epochs=Config.epochs, batch_size=Config.batch_size)

# Test the fine-tuned model

## Define the specialized class

In [None]:
class GemmaQA:
    def __init__(self, max_length=512):
        self.max_length = max_length
        self.prompt = template
        self.gemma_lm = gemma_lm

    def query(self, question):
        response = self.gemma_lm.generate(
            self.prompt.format(
                question=question,
                answer=""),
            max_length=self.max_length)
        display(Markdown(colorize_text(response)))
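
Since `query` colorizes both the `Question:` and `Answer:` labels, the generated text evidently includes the prompt as well as the completion. If only the answer is needed, for example to compare it against the reference answer, a small helper can strip the prompt. The function `extract_answer` below is a hypothetical addition, not part of the original class, and the sample text is made up.

```python
def extract_answer(response: str, marker: str = "\n\nAnswer:\n") -> str:
    """Return the text after the first answer marker, or the whole response if absent."""
    _, found, answer = response.partition(marker)
    return answer.strip() if found else response.strip()

# Hypothetical generated text, invented for illustration.
sample = "\n\nQuestion:\nWhat is SVM?\n\nAnswer:\nA margin-based classifier."
print(extract_answer(sample))
```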


In [None]:
gemma_qa = GemmaQA()

## Test 1

In [None]:
row = df.iloc[5]
gemma_qa.query(row.question)

## Test 2

In [None]:
row = df.iloc[10]
gemma_qa.query(row.question)

## Test 3

In [None]:
row = df.iloc[15]
gemma_qa.query(row.question)

## Fresh question 1

In [None]:
question = "What is regularization?"
gemma_qa.query(question)

## Fresh question 2

In [None]:
question = "What is SVM?"
gemma_qa.query(question)

## Fresh question 3

In [None]:
question = "What is Dropout?"
gemma_qa.query(question)

# Conclusions


We fine-tuned Gemma on a set of Data Science interview questions and answers.   
Then we tested the model with questions from the dataset used for fine-tuning.  
At the end, we also tested with some questions that were not in the dataset.