<center><img src="https://keras.io/img/logo-small.png" alt="Keras logo" width="100"><br/>
This starter notebook is provided by the Keras team.</center>

<font color="red"><b>Note</b>:</font><br>
I made some changes to the original pinned notebook to suit my own experiments.

# Google – AI Assistants for Data Tasks with Gemma with [KerasNLP](https://github.com/keras-team/keras-nlp) and [Keras](https://github.com/keras-team/keras)

> The objective of this competition is to build tools to assist Kaggle developers.

<div align="center">
    <img src="https://i.ibb.co/8xZNc32/Gemma.png">
</div>

In this competition, we are asked to create notebooks that demonstrate how to use the Gemma LLM to accomplish one or more of the following developer-oriented tasks:
1. **<font color="red">Answer common questions about the Kaggle platform.</font>**
2. Explain or teach basic data science concepts.
3. Summarize Kaggle Solution write-ups.
4. Explain or teach concepts from Kaggle Solution write-ups.
5. Answer common questions about the Python programming language.

This notebook walks through task 1, `"Answer common questions about the Kaggle platform"`. Because this task requires specific, accurate knowledge of Kaggle, I have created a dataset, ["Kaggle Docs"](https://www.kaggle.com/datasets/awsaf49/kaggle-docs), collected from [kaggle.com/docs](https://www.kaggle.com/docs/). To make things easier for the model, the data is curated into question-answer pairs; the raw data is also available if you are interested. We will use this dataset to fine-tune the **Gemma LLM** to answer questions about the Kaggle platform.

<u>Fun fact</u>: This notebook is backend-agnostic, supporting TensorFlow, PyTorch, and JAX. However, the best performance can be achieved with `JAX`. Using KerasNLP and Keras allows us to choose our preferred backend. More details are available on the [Keras](https://keras.io/keras_3/) website.

**Note**: For a more in-depth understanding of KerasNLP, refer to the [KerasNLP guides](https://keras.io/keras_nlp/).


# Install Libraries  

In [1]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"  # quoted so the shell does not treat ">" as an output redirect

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.15.0 requires keras<2.16,>=2.15.0, but you have keras 3.1.1 which is incompatible.
tensorflowjs 4.16.0 requires packaging~=23.1, but you have packaging 21.3 which is incompatible.

# Import Libraries 

In [2]:
import os
os.environ["KERAS_BACKEND"] = "jax" # you can also use tensorflow or torch
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # avoid memory fragmentation on JAX backend.

import keras
import keras_nlp

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
tqdm.pandas() # progress bar for pandas

import plotly.graph_objs as go
import plotly.express as px
from IPython.display import display, Markdown

2024-03-31 10:27:03.757997: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-31 10:27:03.758098: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-31 10:27:03.882684: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Configuration

In [3]:
class CFG:
    seed = 42
    dataset_path = "/kaggle/input/kaggle-docs/questions_answers"
    preset = "gemma_2b_en" # name of pretrained Gemma
    sequence_length = 512 # max size of input sequence for training
    batch_size = 1 # size of the input batch in training, x 2 as two GPUs
    epochs = 15 # number of epochs to train

# Reproducibility 
Set the random seed so that results are similar from run to run.

In [4]:
keras.utils.set_random_seed(CFG.seed)

# Data

The newly created **Kaggle Docs** dataset contains only approximately $60$ question-answer pairs curated from raw data from the `kaggle.com/docs` website. However, one can create many more samples from this provided data through simple augmentation or prompt engineering. For more flexibility, readers are welcome to explore the **raw** data stored in the dataset. In this notebook, we will focus on keeping it simple.

**Data Format:**

- The question-answer pair data is stored in `./kaggle-docs/questions_answers/data.csv` file.
- This file includes:
    - `Question`: A question about the Kaggle platform
    - `Answer`: Answer to the question in markdown format
    - `Category`: The category into which the question falls, one of the nine mentioned on the `kaggle.com/docs` website.
    
> You can access the **raw** data from `./kaggle-docs/raw/`, where there are `.txt` files for each of the **nine** categories.
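The "simple augmentation" mentioned above could look something like the sketch below. It is only an illustration: the paraphrase patterns, the `augment` helper, and the toy record are hypothetical and not part of the dataset or of the original notebook. It assumes records with the `Question`, `Answer`, and `Category` fields described above.

```python
# Hypothetical paraphrase patterns (illustrative only, not from the dataset).
PARAPHRASES = [
    "{q}",                            # keep the original wording
    "On Kaggle, {q_lower}",           # add platform context
    "Could you explain: {q_lower}",   # polite variant
]

def augment(records):
    """Return one copy of each record per paraphrase pattern."""
    augmented = []
    for rec in records:
        q = rec["Question"]
        q_lower = q[:1].lower() + q[1:]  # lower-case only the first character
        for pattern in PARAPHRASES:
            new_q = pattern.format(q=q, q_lower=q_lower)
            augmented.append({**rec, "Question": new_q})
    return augmented

# Toy record, made up for this example.
records = [{"Question": "How to join a competition?",
            "Answer": "...",
            "Category": "competition"}]

augmented_records = augment(records)
for rec in augmented_records:
    print(rec["Question"])
```

Each variant keeps the same answer and category, so the dataset grows threefold without any new labeling effort. Whether such rephrasings actually help the fine-tuned model would need to be checked empirically.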

In [5]:
df = pd.read_csv(f"{CFG.dataset_path}/data.csv")
df.head(2)

Unnamed: 0,Question,Answer,Category
0,What are the different types of competitions a...,# Types of Competitions\n\nKaggle Competitions...,competition
1,What are the different competition formats on ...,There are handful of different formats competi...,competition


We'll use the following simple template to create prompts from question-answer pairs and category to feed text into the model:

```
Category: ...

Question: ...

Answer: ...
```

This template helps the model understand what you're asking and how to respond accurately. You can explore more advanced prompt templates for better results.

In [6]:
template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"

In [7]:
df["prompt"] = df.progress_apply(lambda row: template.format(Category=row.Category,
                                                             Question=row.Question,
                                                             Answer=row.Answer), axis=1)
data = df.prompt.tolist()

  0%|          | 0/60 [00:00<?, ?it/s]
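As a small illustration of how the template is used, the sketch below builds one training prompt and the matching inference prompt from a toy row. The row contents are made up for this example; only the template string is taken from the cell above.

```python
# Same template string as defined in the notebook.
template = "\n\nCategory:\nkaggle-{Category}\n\nQuestion:\n{Question}\n\nAnswer:\n{Answer}"

# Toy row, made up for illustration (not taken from the dataset).
toy_row = {
    "Category": "competition",
    "Question": "How to join a competition?",
    "Answer": "Open the competition page and click Join.",
}

# Training prompt: all three fields are filled in.
training_prompt = template.format(**toy_row)

# Inference prompt: the answer is left empty so the model continues from "Answer:".
inference_prompt = template.format(
    Category=toy_row["Category"],
    Question=toy_row["Question"],
    Answer="",
)

print(training_prompt)
print(repr(inference_prompt))
```

Note that the template itself adds the `kaggle-` prefix, so the `Category` value should be the bare category name.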

Let's examine a sample prompt. As the answers in our dataset are written in **Markdown**, we will render the sample using `Markdown()` so that the formatting displays properly.

## Sample

In [8]:
def colorize_text(text):
    for word, color in zip(["Category", "Question", "Answer"], ["blue", "red", "green"]):
        text = text.replace(f"\n\n{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [9]:
# Take one sample (fixed index, so the same example is shown on every run)
sample = data[45]

# Give colors to Question, Answer and Category
sample = colorize_text(sample)

# Show sample in markdown
display(Markdown(sample))



**<font color='blue'>Category:</font>**
kaggle-competition-setup

**<font color='red'>Question:</font>**
How do Kaggle competitions work?

**<font color='green'>Answer:</font>**
## Overview

Every competition has two things:

a) a clearly defined problem that participants need to solve using a machine learning model
b) a dataset that’s used both for training and evaluating the effectiveness of these models.

For example, in the [Store Sales – Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting) competition, participants must accurately predict how many of each grocery item will sell using a dataset of past product and sales information from a grocery retailer.

Once the competition starts, participants can submit their predictions. Kaggle will score them for accuracy, and the team will be placed on a ranked leaderboard. The team at the top of the leaderboard at the deadline wins!

## Datasets, Submissions & Leaderboards

Every competition’s dataset is split into two smaller datasets.

- One of these smaller datasets will be given to participants to train their models, typically named `train.csv`.
- The other dataset will be mostly hidden from participants and used by Kaggle for testing and scoring, named `test.csv` and `solution.csv` (`test.csv` is the same as `solution.csv` except that `test.csv` contains the feature values and `solution.csv` contains the ground truth variable(s) – participants will never, ever see `solution.csv`).

When a participant feels ready to make a submission to the competition, they will use `test.csv` to generate a prediction and upload a CSV file. Kaggle will automatically score the submission for accuracy using the hidden `solution.csv` file.

Most competitions have a maximum number of submissions that a participant can make each day and a final deadline at which point the leaderboard will be frozen.

It’s conceivable that a participant could use the mechanics of a Kaggle competition to overfit a solution - which would be great for winning a competition, but not valuable for a real-world application.

To help prevent this, Kaggle has two leaderboards – the public and private leaderboard. The competition host splits the `solution.csv` dataset into two parts, using one part for the public leaderboard and another part for the private leaderboard. Participants generally will now know which samples are public vs private. The private leaderboard is kept a secret until after the competition deadline and is used as the official leaderboard for determining the final ranking.

# Exploratory Data Analysis

Let's do a simple EDA to determine how many question-answer pairs we have per category.

In [10]:
# Get unique labels and their frequency
unique_labels, label_counts = np.unique(df.Category.tolist(), return_counts=True)

# Plotting
fig = go.Figure(data=go.Bar(x=unique_labels, y=label_counts))
fig.update_layout(
    title="Category Distribution",
    xaxis_title="Category",
    yaxis_title="Count",
)

fig.update_traces(text=label_counts, textposition="outside")
fig.show()


# Modeling

<div align="center"><img src="https://i.ibb.co/Bqg9w3g/Gemma-Logo-no-background.png" width="300"></div>

**Gemma** is a suite of advanced open models developed by **Google DeepMind** and other **Google teams**, derived from the same research and technology behind the **Gemini** models. They can be integrated into applications and run on various platforms including mobile devices and hosted services. Developers can customize Gemma models using tuning techniques to enhance their performance for specific tasks, offering more targeted and efficient generative AI solutions beyond text generation.

Gemma models are available in several sizes so you can build generative AI solutions based on your available computing resources, the capabilities you need, and where you want to run them.

| Parameters size | Tuned versions    | Intended platforms                 | Preset                 |
|-----------------|-------------------|------------------------------------|------------------------|
| 2B              | Pretrained        | Mobile devices and laptops         | `gemma_2b_en`          |
| 2B              | Instruction tuned | Mobile devices and laptops         | `gemma_instruct_2b_en` |
| 7B              | Pretrained        | Desktop computers and small servers| `gemma_7b_en`          |
| 7B              | Instruction tuned | Desktop computers and small servers| `gemma_instruct_7b_en` |

In this notebook, we will use the `Gemma 2B` from KerasNLP's pretrained models to answer questions about the Kaggle platform. To explore other models, simply modify the `preset` in the `CFG` (config). A list of other available pretrained models can be found on the [KerasNLP website](https://keras.io/api/keras_nlp/models/).



## Gemma Causal LM

The code below will build an end-to-end Gemma model for causal language modeling (hence the name `GemmaCausalLM`). A causal language model (LM) predicts the next token based on previous tokens. This task setup can be used to train the model unsupervised on plain text input or to autoregressively generate plain text similar to the data used for training. This task can be used for pre-training or fine-tuning a Gemma model simply by calling `fit()`.

This model has a `generate()` method, which generates text based on a prompt. The generation strategy used is controlled by an additional sampler argument on `compile()`. You can recompile the model with different `keras_nlp.samplers` objects to control the generation. By default, `"greedy"` sampling will be used.
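To make the difference between generation strategies concrete, here is a toy sketch in plain Python of how greedy selection and top-k sampling pick the next token from a vector of logits. The logits, function names, and `k` value are illustrative assumptions; this is not KerasNLP's internal implementation.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_pick(logits, k, rng):
    """Top-k sampling: sample among the k highest-scoring tokens only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]

# Toy next-token logits over a 4-token vocabulary (made up for illustration).
toy_logits = [0.1, 2.5, 1.9, -1.0]
rng = random.Random(0)

greedy_token = greedy_pick(toy_logits)
sampled_tokens = [top_k_pick(toy_logits, 2, rng) for _ in range(20)]

print("greedy:", greedy_token)
print("top-k samples:", sampled_tokens)
```

Greedy selection is deterministic, while top-k sampling trades some determinism for variety. In KerasNLP the corresponding choice would be made by passing a sampler from `keras_nlp.samplers` to `compile()`, as described above.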

> The `from_preset` method instantiates the model from a preset architecture and weights.

In [11]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset(CFG.preset)  # preset name comes from the config
gemma_lm.summary()

Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'config.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'model.weights.h5' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'tokenizer.json' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
Attaching 'assets/tokenizer/vocabulary.spm' from model 'keras/gemma/keras/gemma_2b_en/2' to your Kaggle notebook...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.


## Gemma LM Preprocessor

An important part of the Gemma model is the **Preprocessor** layer, which under the hood uses **Tokenizer**.

**What it does:** The preprocessor takes input strings and transforms them into a dictionary (`token_ids`, `padding_mask`) containing preprocessed tensors. This process starts with tokenization, where input strings are converted into sequences of token IDs.

**Why it's important:** Raw text is complex and challenging to model directly due to its high dimensionality. Converting text into a compact set of tokens simplifies the data, for example by transforming `"The quick brown fox"` into `["the", "qu", "##ick", "br", "##own", "fox"]` (a WordPiece-style illustration; Gemma's own tokenizer uses a SentencePiece vocabulary and splits text differently). Many models also rely on special tokens and additional tensors to understand input. These help delimit the input and identify padding, among other tasks. Padding all sequences to the same length lets them be batched together, which improves computational efficiency.

Explore the following pages to access the available preprocessing and tokenizer layers in **KerasNLP**:
- [Preprocessing](https://keras.io/api/keras_nlp/preprocessing_layers/)
- [Tokenizers](https://keras.io/api/keras_nlp/tokenizers/)

In [12]:
x, y, sample_weight = gemma_lm.preprocessor(data[0:2])

This preprocessing layer will take in batches of strings, and return outputs in a `(x, y, sample_weight)` format, where the `y` label is the next token id in the `x` sequence.

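From the code below, we can see that, after the preprocessor, the data shape is `(num_samples, sequence_length)`. At this point the preprocessor's default sequence length of 8192 is still in effect; it is lowered to 512 before training.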

In [13]:
# Display the shape of each processed output
for k, v in x.items():
    print(k, ":", v.shape)

token_ids : (2, 8192)
padding_mask : (2, 8192)
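The `(x, y, sample_weight)` layout described above can be illustrated with a toy sketch in plain Python. This is not the actual KerasNLP preprocessor: the helper name, the token IDs, and the padding ID of `0` are assumptions chosen for illustration. It only shows the idea that `y` is `x` shifted by one position and that padded positions get zero weight.

```python
def make_causal_lm_example(token_ids, sequence_length, pad_id=0):
    """Build toy (x, y, sample_weight) for next-token prediction."""
    # Pad (or truncate) to sequence_length + 1 so x and y both have sequence_length items.
    padded = (token_ids + [pad_id] * (sequence_length + 1))[: sequence_length + 1]
    mask = ([1] * len(token_ids) + [0] * (sequence_length + 1))[: sequence_length + 1]

    x = {"token_ids": padded[:-1], "padding_mask": mask[:-1]}
    y = padded[1:]             # label at position i is the token at position i + 1
    sample_weight = mask[1:]   # padded label positions do not contribute to the loss
    return x, y, sample_weight

# Made-up token IDs for a 5-token sentence.
toy_x, toy_y, toy_w = make_causal_lm_example([2, 15, 27, 9, 1], sequence_length=8)

print(toy_x["token_ids"])     # inputs
print(toy_y)                  # next-token labels
print(toy_w)                  # per-position loss weights
```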


# Inference before fine tuning

Let's ask the Gemma model some sample questions using our prepared prompt and see how it responds. 

> As this model is not tuned for instruction yet, you will notice that the model is creating more question-answer pairs instead of answering the question that was asked.

## Sample 1

In [14]:
# Take one sample
row = df.iloc[2]

# Generate Prompt using template
prompt = template.format(
    Category=row.Category,
    Question=row.Question,
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to join a competition?

**<font color='green'>Answer:</font>**
1. Go to the competition page.
2. Click on the "Join" button.
3. Enter your email address and click on the "Join" button.
4. You will receive an email with a link to confirm your email address.
5. Click on the link in the email to confirm your email address.
6. You will now be able to log in to the competition.

**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to submit a solution?

**<font color='green'>Answer:</font>**
1. Go to the competition page.
2. Click on the "Submit" button.
3. Enter your solution in the text box and click on the "Submit" button.
4. You will receive a confirmation email with the status of your submission.

**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to view the leaderboard?

**<font color='green'>Answer:</font>**
1. Go to the competition page.
2. Click on the "Leaderboard" button.
3. You will see the leaderboard with the top 100 participants.

**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to view the

## Sample 2

In [15]:
# Take one sample
row = df.iloc[45]

# Generate Prompt using template
prompt = template.format(
    Category=row.Category,
    Question=row.Question,
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-competition-setup

**<font color='red'>Question:</font>**
How do Kaggle competitions work?

**<font color='green'>Answer:</font>**
Kaggle competitions are a way for Kaggle users to compete against each other and win prizes.

To participate in a competition, you must first create an account on Kaggle. Once you have an account, you can start competing by creating a project.

A project is a collection of notebooks that you can use to solve a problem. You can create a project from scratch or use one of the many templates that are available.

Once you have created a project, you can start working on it. You can use the notebooks in your project to solve the problem, or you can create new notebooks to solve the problem.

When you are finished working on your project, you can submit it to the competition. The competition will then review your project and decide if you have solved the problem correctly.

If you are successful in solving the problem, you will be awarded a prize.

**<font color='blue'>Category:</font>**
kaggle-competition-setup

**<font color='red'>Question:</font>**
How do I create a project?

**<font color='green'>Answer:</font>**
To create a project, you must first create an account on Kaggle. Once you have an account, you can start creating projects by clicking

# Fine-tuning with LoRA

To get better responses from the model, we will fine-tune the model with Low Rank Adaptation (LoRA) on the **Kaggle Docs** dataset.

**What exactly is LoRA?**

LoRA is a method used to fine-tune large language models (LLMs) in an efficient way. It involves freezing the weights of the LLM and injecting trainable rank-decomposition matrices.

Imagine that an LLM contains a pre-trained dense layer represented by a $d \times d$ weight matrix, denoted as $W_0$. We add two small trainable matrices, $A$ and $B$, with shapes $r \times d$ and $d \times r$, respectively, so that the product $B \cdot A$ has the same $d \times d$ shape as $W_0$. Here, $r$ denotes the rank, which is typically **much smaller than** $d$. Without LoRA, the layer's output is computed as $output = W_0 \cdot x + b_0$, where $x$ is the input and $b_0$ is the bias term of the original dense layer. With LoRA, $W_0$ and $b_0$ remain frozen and the output becomes $output = (W_0 \cdot x + b_0) + (B \cdot A \cdot x)$, where only the rank-decomposition matrices $A$ and $B$ are trained.

<center><img src="https://i.ibb.co/DWsbhLg/LoRA.png" width="300"><br/>
Credit: <a href="https://arxiv.org/abs/2106.09685">LoRA: Low-Rank Adaptation of Large Language Models</a> Paper</center>


In the paper, $A$ is initialized with $\mathcal{N} (0, \sigma^2)$ and $B$ with $0$, where $\mathcal{N}$ denotes the normal distribution, and $\sigma^2$ is the variance.
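A minimal plain-Python sketch of the LoRA forward pass may help here. It stores $A$ as an $r \times d$ matrix and $B$ as a $d \times r$ matrix so that $B \cdot (A \cdot x)$ is well-defined for a column vector $x$. The sizes, the value of $\sigma$, and all names are illustrative assumptions, not Keras internals.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

d, r = 4, 2                      # toy sizes; in practice r is much smaller than d
rng = random.Random(0)

# Frozen pre-trained layer.
W0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
b0 = [0.1] * d

# Trainable LoRA matrices: A ~ N(0, sigma^2), B = 0 (sigma = 0.02 is an arbitrary choice).
A = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)]   # r x d
B = [[0.0] * r for _ in range(d)]                                  # d x r

def frozen_forward(x):
    return [w + b for w, b in zip(matvec(W0, x), b0)]

def lora_forward(x):
    delta = matvec(B, matvec(A, x))          # low-rank update B·(A·x)
    return [p + q for p, q in zip(frozen_forward(x), delta)]

x_in = [1.0, 2.0, 3.0, 4.0]
print(frozen_forward(x_in))
print(lora_forward(x_in))
```

Because $B$ starts at zero, the two printed vectors are identical: fine-tuning begins exactly from the pre-trained model's behavior and only drifts away as $A$ and $B$ are updated.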

**Why does LoRA save memory?**

Even though we're adding more layers to the model with LoRA, it actually helps save memory. This is because the smaller layers (A and B) have fewer parameters to learn compared to the big model and fewer trainable parameters mean fewer optimizer variables to store. So, even though the overall model might seem bigger, it's actually more efficient in terms of memory usage. 

> This notebook uses a LoRA rank of `4`. A higher rank means more detailed changes are possible, but also means more trainable parameters.
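The memory argument comes down to simple arithmetic, sketched below for a single $d \times d$ dense layer. The width `d = 2048` is an illustrative value chosen for this example; `r = 4` matches the rank used in this notebook.

```python
def dense_params(d):
    """Weights in a frozen d x d dense layer (bias ignored)."""
    return d * d

def lora_params(d, r):
    """Trainable weights added by LoRA: A (r x d) plus B (d x r)."""
    return r * d + d * r

d, r = 2048, 4                     # d is illustrative; r is the notebook's rank
full = dense_params(d)
extra = lora_params(d, r)

print(f"frozen weights:    {full:,}")
print(f"trainable (LoRA):  {extra:,}")
print(f"reduction factor:  {full // extra}x")
```

Only the LoRA weights need gradients and optimizer state, so for an optimizer such as Adam, which keeps extra variables per trainable weight, the savings apply to those variables as well.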

In [16]:
# Enable LoRA for the model and set the LoRA rank to 4.
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

**Notice** that the number of trainable parameters drops from ~$2.5$ billion to ~$1.3$ million after enabling LoRA.

## Training

In [17]:
# Limit the input sequence length to 512 (to control memory usage).
gemma_lm.preprocessor.sequence_length = CFG.sequence_length 

# Compile the model with loss, optimizer, and metric
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=8e-5),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Train model
gemma_lm.fit(data, epochs=CFG.epochs, batch_size=CFG.batch_size)

Epoch 1/15
60/60 ━━━━━━━━━━━━━━━━━━━━ 65s 734ms/step - loss: 1.7209 - sparse_categorical_accuracy: 0.5241
Epoch 2/15
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 727ms/step - loss: 1.6869 - sparse_categorical_accuracy: 0.5313
Epoch 3/15
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 728ms/step - loss: 1.6175 - sparse_categorical_accuracy: 0.5417
Epoch 4/15
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 727ms/step - loss: 1.5770 - sparse_categorical_accuracy: 0.5509
Epoch 5/15
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 728ms/step - loss: 1.5537 - sparse_categorical_accuracy: 0.5552
Epoch 6/15
60/60 ━━━━━━━━━━━━━━━━━━━━ 44s 727ms/step - loss: 1.5304 - sparse_categorical_accuracy: 0.5568
Epoch 7/15
60/60 ━━━━━━━━━━━━━━━━━━━━ 83s 1s/step - loss: 1.5028 - sparse_categorical_accuracy: 0.5630
Epoch 8/15
60/60

<keras.src.callbacks.history.History at 0x7af3b02ffd00>

# Inference after fine-tuning

Let's see how our fine-tuned model responds to the same questions we asked before fine-tuning the model.

## Sample 1

In [18]:
# Take one sample
row = df.iloc[2]

# Generate Prompt using template
prompt = template.format(
    Category=row.Category,
    Question=row.Question,
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-competition

**<font color='red'>Question:</font>**
How to join a competition?

**<font color='green'>Answer:</font>**
You need to sign in to your [Kaggle account](https://www.kaggle.com/signup) to join a competition.

You’ll then be redirected to the competition page, where you can either:

- Click “Join” to become a participant in that competition. This will allow you to download the data and start working on your model.

- Click "Download and upload the data and then click "Join Leaderboard" to view the rules of participating in the competition.

- Click “Download Data” to download a copy of the competition’s data. You’ll need to do this before you can start working on your model.

If you have any questions about this process, feel free to reach out to our support team [here](https://www.kaggle.com/support/contact/Contact).

## Sample 2

In [19]:
# Take one sample
row = df.iloc[45]

# Generate Prompt using template
prompt = template.format(
    Category=row.Category,
    Question=row.Question,
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-competition-setup

**<font color='red'>Question:</font>**
How do Kaggle competitions work?

**<font color='green'>Answer:</font>**
When you click “Open dataset” on any public competition, you’ll be taken to the “Overview” tab. There you’ll find details about the problem and any data that is needed to solve it.

The "Overview" is broken into different sections which we’ll cover in more detail. The different parts of this section are:

- **Problem**: The problem statement.
- **Competition rules**: The competition rules that govern your participation.
- **Dataset**: A brief description of the dataset, including the number of files and the number of samples.
- **Data sources and licenses**: Details on where the data came from and what the licensing terms were.

The "Files" section of the overview is also very important. It contains the details about how the data is formatted and what the data looks like.

- **Training dataset**: The number of samples and the number of files in this folder will match the information that you see on the "Overview" tab.
- **Evaluation dataset**: The number of samples and the number of files in this folder will also match the information that you see on the "

## Sample 3

In [20]:
# Take one sample
row = df.iloc[20]

# Generate Prompt using template
prompt = template.format(
    Category=row.Category,
    Question=row.Question,
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-model

**<font color='red'>Question:</font>**
How to find Kaggle Models?

**<font color='green'>Answer:</font>**
There are a number of ways for you to discover, explore, and get access to the wide range of models available on Kaggle.

First and foremost, we highly encourage you to check out the [Featured Models page](https://www.kaggle.com/featured). It showcases the latest models in each model category from the past seven days. This is a great place to get started if you want to see the hot new models in the field.

Next, we have the [Model Spotlight](https://www.kaggle.com/modelspotlight) and the [Model Spotlight Leaderboard](https://www.kaggle.com/modelspotlightleaderboard). These are collections of the most downloaded and most upvoted models on Kaggle. The model spotlight leaderboards are refreshed every 24 hours and feature the top 25 models in each model category. The model spotlight is a great place to get a quick overview of the top models in each category. It’s also a good way to see what models are popular and trending in a particular category.

Finally, you can search for models using Kaggle’s [Datasets page](

## Unseen Sample(s)

Just for fun, let's also try out some questions that the model hasn't seen during training.

In [21]:
# Generate Prompt using template
prompt = template.format(
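    # Note: the template already adds the "kaggle-" prefix, hence "kaggle-kaggle-notebook" in the output.
    Category="kaggle-notebook",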
    Question="How to export a notebook?",
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-kaggle-notebook

**<font color='red'>Question:</font>**
How to export a notebook?

**<font color='green'>Answer:</font>**
You can always save a notebook locally from the “Save” button in the top right corner.

However, the local copy of the notebook is not shared with other Kaggle users. If you need a permanent record of your work, or you need your notebook to work on other platforms such as GitHub, you should use one of the following options:

1) Save the notebook to your Google Drive. You can find it there at a convenient location.

2) Share the notebook URL on Kaggle, which will open the notebook in a new tab and allow others to view and comment on it. This is the default way notebooks are displayed on Kaggle.

3) Export the notebook as a zip file to download and host locally.

In [22]:
# Generate Prompt using template
prompt = template.format(
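    # Note: the template already adds the "kaggle-" prefix, hence "kaggle-kaggle-notebook" in the output.
    Category="kaggle-notebook",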
    Question="How to create a notebook using Kaggle API?",
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-kaggle-notebook

**<font color='red'>Question:</font>**
How to create a notebook using Kaggle API?

**<font color='green'>Answer:</font>**
Creating a Notebook using Kaggle API:

You can create a notebook using Kaggle API. This allows you to create notebooks quickly from the command line.

To use this endpoint, you need to be a project owner or a notebook author of the Notebook project you want to upload the notebook to.

You can find the URL for a Notebook project by clicking the “Share” button in the top right corner of any Notebook page.

You can find the URL for a specific Notebook by clicking the Notebook name from the list on the Project details page.

You can find the list of all Notebook projects on the Kaggle website: https://kaggle.com/projects/all.

You can use the following API endpoint to create a Notebook:

```
https://kaggle-api.s3-us-west-1.amazonaws.com/projects?access_token=<your token>
```

Replace `<your token>` with the access token for the project you want to upload the notebook to.

You can find your access token from your user profile. You should also store your access token in

In [23]:
# Generate Prompt using template
prompt = template.format(
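    # Note: the template already adds the "kaggle-" prefix, hence "kaggle-kaggle-competitions" in the output.
    Category="kaggle-competitions",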
    Question="How you can create a in-class competition?",
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-kaggle-competitions

**<font color='red'>Question:</font>**
How you can create a in-class competition?

**<font color='green'>Answer:</font>**
There are two ways to run a competition on Kaggle.

First, you can run an open competition where anyone can participate and submit solutions. This is great if you want to gather a large amount of data to train a model on, but not ideal for a private challenge where you want to control who can participate or view solutions.

Second, you can run a private competition for internal competition organizers only (competitors will need to be invited). This is ideal if you want to create a closed environment for your internal competition.

## Create a private competition

Follow these steps:

- [Create a new notebook](https://www.kaggle.com/docs/notebooks#creating-a-new-solution) in your competition folder
- [Add your solution to the leaderboard](https://www.kaggle.com/competition/leaderboards#create-a-new-leaderboard-in-a-private-competition-folder)


In [24]:
# Generate Prompt using template
prompt = template.format(
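    # Note: the template already adds the "kaggle-" prefix, hence "kaggle-kaggle-competitions" in the output.
    Category="kaggle-competitions",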
    Question="How you can create a in-class competition?",
    Answer=""
)

# Infer
output = gemma_lm.generate(prompt, max_length=256)

# Colorize
output = colorize_text(output)

# Display in markdown
display(Markdown(output))



**<font color='blue'>Category:</font>**
kaggle-kaggle-competitions

**<font color='red'>Question:</font>**
How you can create a in-class competition?

**<font color='green'>Answer:</font>**
You can create a competition for attendees to a live workshop. This is a great way to drive engagement and participation in your event.

## Creating a Workshop-Type Competition

## Step 1: Set Up the Competition Settings

Navigate to the "Settings" tab for the workshop you want to create a competition for.

In the "Format" menu, select "Workshop" as the "Competition Format".

In the "Dataset" menu, select "Custom".

In the "Data" section, upload your dataset (if you haven't done so already).

In the "Scoring" section, select the metric you want to use to determine the winner(s) of the competition.

You may choose between the following options:

- "Submission Accuracy": If you’re hosting an accuracy-type competition and want your participants to submit machine learning models, this is the metric for them! Submission accuracy is the proportion of predictions by the model that are correct.
- "Submission Macro F1": If you’re hosting an F1-type competition and want your participants to submit predictions, this

# Conclusion

The results are noticeably better than those of the model before fine-tuning. They are still not exactly what we are looking for, but keep in mind that we fine-tuned this model on only $60$ samples, without any augmentation or advanced prompting, so there is ample room for improvement. Here are some tips to improve performance:

- Try using the larger version of **Gemma** (7B).
- Increase `sequence_length`.
- Experiment with advanced prompt engineering techniques.
- Implement augmentation to increase the number of samples.
- Utilize a learning rate scheduler.
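
As one possible take on the last tip, the sketch below computes a linear-warmup plus cosine-decay learning rate in plain Python. The peak rate of `8e-5` and the total of 900 steps (60 steps per epoch for 15 epochs) come from this notebook; the warmup length, the final rate, and the function name are assumptions for illustration. Keras also ships ready-made schedules under `keras.optimizers.schedules`, which would be the more natural choice in practice.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=8e-5, warmup_steps=90, final_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to final_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 60 * 15   # steps per epoch x epochs, as in the training run above
lrs = [warmup_cosine_lr(s, total_steps) for s in range(total_steps + 1)]

for s in (0, 89, 450, 900):
    print(f"step {s:4d}: lr = {lrs[s]:.2e}")
```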

# Reference
* [Fine-tune Gemma models in Keras using LoRA](https://www.kaggle.com/code/nilaychauhan/fine-tune-gemma-models-in-keras-using-lora)
* [Parameter-efficient fine-tuning of GPT-2 with LoRA](https://keras.io/examples/nlp/parameter_efficient_finetuning_of_gpt2_with_lora/)
* [Gemma - KerasNLP](https://keras.io/api/keras_nlp/models/gemma/)