<a href="https://colab.research.google.com/github/ashispapu/LLMs/blob/main/Gemma%2C_please_teach_me_Data_Science!.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><h1>Gemma, please teach me Data Science!</h1></center>

<center><img src="https://res.infoq.com/news/2024/02/google-gemma-open-model/en/headerimage/generatedHeaderImage-1708977571481.jpg" width="400"></center>


# Introduction

This notebook shows how, with a very simple approach, we can exploit the rich knowledge that Gemma acquired during training to answer questions about Data Science.

**Let's go**!


# Install resources

We start with a few logistical steps: installing the needed resources and preparing our tools.

We will use Gemma through the Keras interface.

## Install Keras NLP and Keras

In [None]:
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"

## Set up some environment variables

In [None]:
import os

# Select the desired backend for Keras. Options: "jax", "tensorflow", or "torch".
# This must be set before Keras is imported; changing it afterwards has no effect.
os.environ["KERAS_BACKEND"] = "jax"  # Adjust as needed.

# Specific to the JAX backend: let JAX preallocate all of the GPU memory, which helps avoid memory fragmentation.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00"

## Import packages

In [None]:
import warnings

import keras
import keras_nlp
from IPython.display import display, Markdown

warnings.filterwarnings("ignore")

## Utility for formatting the output

In [None]:
def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer"], ["blue", "red", "green"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text
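
To see what the helper produces before any model is loaded, here is a small self-contained sketch. The function is repeated so the cell runs on its own, and the sample string is made up for illustration; it is not real model output.

```python
def colorize_text(text):
    # Same helper as above, repeated so this cell runs on its own.
    for word, color in zip(["Reasoning", "Question", "Answer"], ["blue", "red", "green"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

# Made-up sample text, not real model output.
sample = "Question: What is overfitting? Answer: Fitting noise instead of signal."
formatted = colorize_text(sample)
print(formatted)
```

Each keyword followed by a colon is moved to its own paragraph and wrapped in bold, colored markup, which `Markdown` then renders in the notebook.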

## Load Gemma Causal LM

For this application we will try the `gemma_instruct_2b_en` preset.

In [None]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en")
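
Loading the preset downloads the weights, and Gemma weights hosted on Kaggle normally require accepting the license and providing credentials. Below is a minimal pre-flight check that could be run before `from_preset`. It assumes the credentials are passed through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables (a `kaggle.json` file is another common route, which this sketch does not check). `missing_kaggle_credentials` is a hypothetical helper name.

```python
import os

def missing_kaggle_credentials(env=None):
    """Return the names of the Kaggle credential variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("KAGGLE_USERNAME", "KAGGLE_KEY")
    return [name for name in required if not env.get(name)]

missing = missing_kaggle_credentials()
if missing:
    print("Set these before calling from_preset:", ", ".join(missing))
else:
    print("Kaggle credentials found in the environment.")
```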

# Set up the Q&A class

We will set up a class for querying the Gemma model directly about Data Science.

In [None]:
class GemmaQA:
    """Thin wrapper that sends Data Science questions to the loaded Gemma model."""

    def __init__(self, max_length=512):
        # Upper bound on the total sequence length (prompt plus generated answer).
        self.max_length = max_length
        self.prompt = """
            You are an AI assistant designed to answer questions about Data Science.
            Reasoning: If the question is not related to Data Science, simply state politely this.
            If it is a complex question, think step by step. If needed, include your reasoning process.
            Question: {question}
            Answer:
        """
        # Reuse the model loaded above.
        self.gemma_lm = gemma_lm

    def query(self, question):
        """Generate an answer for `question` and render it as colorized Markdown."""
        response = self.gemma_lm.generate(
            self.prompt.format(question=question),
            max_length=self.max_length,
        )
        display(Markdown(colorize_text(response)))
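
It can help to inspect the exact text the model receives. The sketch below rebuilds the prompt outside the class. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and `textwrap.dedent` is an optional cleanup that removes the indentation carried by the triple-quoted string in the class.

```python
import textwrap

# Same wording as the prompt inside GemmaQA, dedented for readability.
PROMPT_TEMPLATE = textwrap.dedent("""\
    You are an AI assistant designed to answer questions about Data Science.
    Reasoning: If the question is not related to Data Science, simply state politely this.
    If it is a complex question, think step by step. If needed, include your reasoning process.
    Question: {question}
    Answer:
""")

def build_prompt(question):
    """Fill the question into the template, as GemmaQA.query does with str.format."""
    return PROMPT_TEMPLATE.format(question=question)

prompt = build_prompt("What is PCA?")
print(prompt)
```

Because the prompt ends with `Answer:`, the model continues the text from that point, and the returned string contains the prompt followed by the generated answer.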


Let's initialize the class.

If we pass no parameters, the default `max_length` of 512 is used.

Instead, let's first initialize it with `max_length` = 256.

In [None]:
gemma_qanda = GemmaQA(max_length=256)
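
In KerasNLP, `max_length` caps the whole sequence, prompt included, so a long prompt leaves less room for the answer. A rough budget sketch follows. It counts whitespace-separated words as a crude stand-in for tokens (an assumption: the real tokenizer usually produces more tokens than words), and `answer_budget` is a hypothetical helper.

```python
def answer_budget(prompt, max_length):
    """Estimate how many tokens remain for the answer.

    Uses a whitespace word count as a rough proxy for the token count,
    so the real budget is typically smaller than this estimate.
    """
    approx_prompt_tokens = len(prompt.split())
    return max(max_length - approx_prompt_tokens, 0)

# Made-up prompt for illustration.
example_prompt = "You are an AI assistant. Question: What is a p-value? Answer:"
for limit in (256, 512):
    print(limit, "->", answer_budget(example_prompt, limit))
```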

# Test the model - ask a few questions on Data Science

## Question 1: Ask about sklearn

Let's first test with a simple question about `sklearn`.

In [None]:
gemma_qanda.query("Please teach me how to do a train test split using sklearn")

That doesn't look right, does it? Most likely the 256-token limit, which also counts the prompt, cuts the answer short. Let's get back to the default settings.

In [None]:
gemma_qanda = GemmaQA()

In [None]:
gemma_qanda.query("Please teach me how to do a train test split using sklearn")

## Question 2: Ask about bias and variance

Now, let's ask something different. We will ask Gemma about bias and variance.

In [None]:
gemma_qanda.query("Please explain to me the concepts of bias and variance")

## Question 3: Ask about Dropout

Let's ask something from Deep Learning, more specifically about Dropout layers.

In [None]:
gemma_qanda.query("What is the role of Dropout layer added to a Deep Learning architecture?")

## Question 4: Ask about model calibration

Now, let's ask a more advanced question.

In [None]:
gemma_qanda.query("Could you explain, in the context of Data Science, what is model calibration?")

# Conclusions

In this first attempt to create a simple tool that answers our questions about Data Science, we tested how well the Gemma model answers a few Data Science questions directly, without providing it with any additional documentation in the prompt.

The results were quite good, actually unexpectedly good, considering that we only used a system prompt adapted to the task. Maybe the last question could have been answered better, but, well, how many of us have actually used Platt calibration curves or other methods for model calibration? By the way, here is a good article describing the concepts related to model calibration: [A Comprehensive Guide on Model Calibration: What, When, and How](https://towardsdatascience.com/a-comprehensive-guide-on-model-calibration-part-1-of-4-73466eb5e09a)
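
For readers who have not worked with calibration: a model is well calibrated when, among predictions made with confidence p, about a fraction p turn out positive. Here is a minimal pure-Python sketch of the binning behind a reliability curve, with made-up probabilities and labels. `reliability_bins` is a hypothetical helper, and this is not Platt scaling itself, only the diagnostic that usually precedes it.

```python
def reliability_bins(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns (mean predicted probability, observed positive rate, count)
    for every non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for items in bins:
        if items:
            mean_p = sum(p for p, _ in items) / len(items)
            pos_rate = sum(y for _, y in items) / len(items)
            rows.append((mean_p, pos_rate, len(items)))
    return rows

# Made-up predictions and ground-truth labels.
probs = [0.1, 0.15, 0.4, 0.45, 0.8, 0.9, 0.95, 1.0]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
for mean_p, pos_rate, count in reliability_bins(probs, labels):
    print(f"predicted {mean_p:.2f}  observed {pos_rate:.2f}  n={count}")
```

The closer the observed rate is to the mean predicted probability in each bin, the better calibrated the model.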

The results suggest that Gemma was trained on a good amount of Data Science material. To extend its knowledge and bring it up to date, we can ingest recent information about Data Science into a vector database and build a Retrieval Augmented Generation system.
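
The retrieval step of such a system can be sketched without any external service. The example below is a toy under stated assumptions: it swaps the vector database and learned embeddings for bag-of-words vectors and cosine similarity, the documents are made up, and `retrieve` and `build_rag_prompt` are hypothetical helpers. The resulting prompt could then be passed to `gemma_lm.generate`.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(question, documents, k=1):
    """Prepend the retrieved context to the question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Made-up mini corpus standing in for recent Data Science notes.
docs = [
    "Platt scaling fits a logistic regression on model scores to calibrate probabilities.",
    "Dropout randomly zeroes activations during training to reduce overfitting.",
    "A train test split holds out part of the data for evaluation.",
]
rag_prompt = build_rag_prompt("How does dropout reduce overfitting?", docs)
print(rag_prompt)
```

A real system would replace `bag_of_words` with an embedding model and the sorted list with a vector index, but the flow stays the same: retrieve, build the prompt, generate.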
