# Finetuning Google's Gemma with Custom Data

This notebook follows a [tutorial on YouTube](https://www.youtube.com/watch?v=iOdFUJiB0Zc).

It walks through loading Google's Gemma model and a custom dataset, fine-tuning the model on that data, and running sample queries before and after fine-tuning.

### Installing Packages

Run the following command to install the necessary packages: `pip install -r requirements.txt`

## I. Preparing the Data

For this example, I am using a dataset of arXiv papers (titles, authors, abstracts, etc.). It's a simple dataset that is close to my interests (AI analysis of research documents).

In [1]:
# Import packages
import os
import transformers
import torch
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, GemmaTokenizer

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("nick007x/arxiv-papers", token=True)

Downloading data: 100%|██████████| 1.70G/1.70G [00:44<00:00, 38.4MB/s]
Generating train split: 2549619 examples [00:38, 65768.55 examples/s] 


### Dataset Statistics

In [5]:
print(type(ds))
print(ds.column_names)
print(ds.num_rows)

<class 'datasets.dataset_dict.DatasetDict'>
{'train': ['arxiv_id', 'title', 'authors', 'submission_date', 'comments', 'primary_subject', 'subjects', 'doi', 'abstract', 'file_path']}
{'train': 2549619}


## II. Setting Up the Model

This example uses Gemma-2B, as in the YouTube tutorial mentioned above. The smaller model keeps training and inference times low.

In [12]:
model_id = "google/gemma-2b"
# Disable BNB config if running on CPU - only supported on CUDA GPUs
#bnb_config = BitsAndBytesConfig(
#    load_in_4bit=True,
#    bnb_4bit_quant_type="nf4",
#    bnb_4bit_compute_dtype=torch.bfloat16,
#)
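
The commented-out `BitsAndBytesConfig` above would load the weights in 4-bit precision. As a rough illustration of why that matters, the sketch below estimates the weight footprint at different precisions. The parameter count of about 2.5 billion is an approximation used for illustration, and the estimate ignores activations, optimizer state, and quantization overhead.

```python
# Back-of-envelope weight memory for a model of roughly 2.5B parameters.
# ASSUMPTION: the parameter count is approximate; activations, optimizer
# state, and quantization metadata are ignored.
APPROX_PARAMS = 2.5e9

BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "4-bit (nf4)": 0.5,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate weight footprint in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(APPROX_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name:>12}: ~{gb:.2f} GB")
```

Under these assumptions, 4-bit weights need about an eighth of the memory of full float32 weights, which is the main reason to enable the quantization config when a CUDA GPU is available.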

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # quantization_config=bnb_config,   # Disable if running on CPU
    # device_map={"":0},                # Disable if running on CPU
    token=True,
)

Loading checkpoint shards: 100%|██████████| 2/2 [02:23<00:00, 71.59s/it] 
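
The cells in this notebook switch between CPU and GPU by commenting lines in and out. A small helper can make that choice in one place instead. This is only a sketch: `pick_device` is a hypothetical helper name, and the `torch` import is guarded so the snippet also runs where PyTorch is not installed.

```python
def pick_device(cuda_available: bool) -> str:
    """Return the device string used elsewhere in this notebook."""
    return "cuda:0" if cuda_available else "cpu"


# Guarded import: fall back to CPU if PyTorch is missing.
try:
    import torch

    cuda_available = torch.cuda.is_available()
except ImportError:
    cuda_available = False

device = pick_device(cuda_available)
print(f"Using device: {device}")
```

The resulting `device` string could then replace the hard-coded `device = "cpu"` assignments, and the same flag could decide whether to pass the quantization config.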


### Sample Query Before Any Finetuning

Given only a paper title, the model produces coherent text. To the untrained eye, it might even look like a proper abstract, and it even generates some inline math. However, none of the information from the paper's entry in the dataset appears in the answer: the answer is a complete hallucination.

In [15]:
# Input query - using the title of a paper in the dataset as an example
text = "The gravitational wave background from star-massive black hole fly-bys"
device = "cpu"
# device = "cuda:0" 
inputs = tokenizer(text, return_tensors="pt").to(device)

# Generate output 
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The gravitational wave background from star-massive black hole fly-bys is a promising probe of the early Universe. We study the effect of the gravitational wave background on the cosmic microwave background (CMB) and the large-scale structure (LSS) of the Universe. We find that the gravitational wave background can be detected by the CMB and the LSS if the mass of the black hole is larger than <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mo>∼</mo><mn>10</mn><mtext> </mtext><mtext> </mtext><msub><mrow><mi>M</mi></mrow><mrow><mo stretchy="false">⊙</mo></mrow></msub></mrow></math> and the black hole is located at a redshift of <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>z</mi><mo>∼</mo><mn>10</mn></math>. The gravitational wave background can be detected by the CMB and the LSS if the black hole is located at a redshift of <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>z</mi><mo>∼</mo><mn>10</mn></math> and the ma

## III. Finetuning the Model on the Dataset

We start with a Low-Rank Adaptation (LoRA) config. LoRA freezes the original model weights and adds a small number of new parameters, which are the only ones trained. This makes fine-tuning significantly faster and more memory-efficient. The LoRA literature reports accuracy comparable to full fine-tuning on many tasks, so updating every weight is often unnecessary.
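
To give a feel for how few parameters LoRA adds, the sketch below counts the trainable parameters for a single weight matrix. LoRA learns an update `B @ A`, where `A` has shape `(r, d_in)` and `B` has shape `(d_out, r)`. The 2048 x 2048 matrix size is an illustrative assumption, not a claim about any particular Gemma layer; `r = 8` matches the config used in this notebook.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    The update is B @ A with A: (r, d_in) and B: (d_out, r).
    """
    return r * d_in + d_out * r


# ASSUMPTION: an illustrative square projection matrix.
d_in = d_out = 2048
r = 8  # rank, as in the LoraConfig in this notebook

full = d_in * d_out                        # parameters in the frozen matrix
added = lora_param_count(d_in, d_out, r)   # parameters LoRA trains instead
fraction = added / full

print(f"frozen: {full:,}  trainable: {added:,}  ratio: {fraction:.2%}")
```

For this example matrix, the trainable LoRA parameters amount to under 1% of the frozen weights.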

In [16]:
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
    task_type="CAUSAL_LM",
)

In [17]:
data = ds.map(
    lambda examples: tokenizer(examples["abstract"]),
    batched=True,
)

Map: 100%|██████████| 2549619/2549619 [17:14<00:00, 2463.57 examples/s]


In [19]:
print(data["train"].column_names)
print(data["train"]["abstract"][0])

['arxiv_id', 'title', 'authors', 'submission_date', 'comments', 'primary_subject', 'subjects', 'doi', 'abstract', 'file_path', 'input_ids', 'attention_mask']
Stars on eccentric orbits around a massive black hole (MBH) emit bursts of gravitational waves (GWs) at periapse. Such events may be directly resolvable in the Galactic centre. However, if the star does not spiral in, the emitted GWs are not resolvable for extra-galactic MBHs, but constitute a source of background noise. We estimate the power spectrum of this extreme mass ratio burst background (EMBB) and compare it to the anticipated instrumental noise of the Laser Interferometer Space Antenna (LISA). To this end, we model the regions close to a MBH, accounting for mass-segregation, and for processes that limit the presence of stars close to the MBH, such as GW inspiral and hydrodynamical collisions between stars. We find that the EMBB is dominated by GW bursts from stellar mass black holes, and the magnitude of the noise spectru

In [20]:
def formatting_func(example):
    """
    Format a mapped dataset example into a format we want to use for fine-tuning.
    """
    text = f"Title: {example['title'][0]}\nAbstract: {example['abstract'][0]}\n\n"
    return [text]
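
Note that the function above reads only index `[0]` of each column. If the trainer passes a batch (a dict mapping column names to lists), only the first row of that batch is formatted. Whether `SFTTrainer` passes a batch or a single example depends on the installed TRL version, which this notebook does not pin, so treat that as an assumption to verify. The sketch below shows a batch-aware variant; `format_batch` and the toy data are hypothetical.

```python
from typing import Dict, List


def format_batch(batch: Dict[str, List[str]]) -> List[str]:
    """Turn a batch (column name -> list of values) into one string per row."""
    return [
        f"Title: {title}\nAbstract: {abstract}\n\n"
        for title, abstract in zip(batch["title"], batch["abstract"])
    ]


# Hypothetical two-row batch, shaped like a batched slice of the dataset.
toy_batch = {
    "title": ["Paper A", "Paper B"],
    "abstract": ["First abstract.", "Second abstract."],
}

formatted = format_batch(toy_batch)
print(formatted[0])
```

With this variant, every row in the batch contributes a training string rather than just the first one.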

In [23]:
# Setup the training parameters, dataset, model, PEFT, etc.
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=False,
        # fp16=True,                    # Enable if running on CUDA GPUs
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",       # bitsandbytes optimizer, needs a CUDA GPU; try "adamw_torch" on CPU
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

  warn("The installed version of bitsandbytes was compiled without GPU support. "


'NoneType' object has no attribute 'cadam32bit_grad_fp32'


'(ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')), '(Request ID: 7570d678-d6d9-4038-bbd2-5346c1e7f68e)')' thrown while requesting HEAD https://huggingface.co/google/gemma-2b/resolve/main/tokenizer_config.json
Retrying in 1s [Retry 1/5].
Map: 100%|██████████| 2549619/2549619 [00:54<00:00, 46985.31 examples/s] 
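
It helps to work out how much data these training arguments actually consume. The arithmetic below assumes a single device and that each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` rows; the dataset size is the train split count printed earlier.

```python
# Values taken from the TrainingArguments and dataset statistics above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 100
dataset_rows = 2_549_619

# ASSUMPTION: training runs on a single device.
num_devices = 1

effective_batch = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
examples_seen = effective_batch * max_steps
fraction_of_dataset = examples_seen / dataset_rows

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"fraction of dataset: {fraction_of_dataset:.4%}")
```

Under these assumptions the run touches only a few hundred of the roughly 2.5 million rows, so tokenizing a small subset first (for example with `ds["train"].select(range(n))` for some modest `n`) would likely save most of the preprocessing time.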


In [None]:
trainer.train()

## IV. Testing the Fine-Tuned Model

Now that the model is fine-tuned, let's pass in the same input as before and see whether the output is more accurate.

In [None]:
# Input query - using the title of a paper in the dataset as an example
text = "The gravitational wave background from star-massive black hole fly-bys"
device = "cpu"
# device = "cuda:0" 
inputs = tokenizer(text, return_tensors="pt").to(device)

# Generate output 
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
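
One thing worth trying is to phrase the query in the same template the model saw during fine-tuning, stopping where the abstract would begin. The sketch below assumes the training template is `Title: ...` followed by `Abstract: ...`; it should be adjusted to match whatever `formatting_func` actually produces. `build_prompt` is a hypothetical helper.

```python
def build_prompt(title: str) -> str:
    """Mirror the assumed training template up to where the abstract starts."""
    return f"Title: {title}\nAbstract:"


prompt = build_prompt(
    "The gravitational wave background from star-massive black hole fly-bys"
)
print(prompt)
```

Passing `prompt` instead of the bare title to the tokenizer may steer the fine-tuned model toward completing an abstract, though this has not been verified in this notebook.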