# GEMMA  ---  AD ACADEMY - AI for Aam Janta

Mentor - Dr Ayan Debnath, IIT Delhi + Harvard university Alumni

LinkedIn: [dr_ayan_debnath](https://www.linkedin.com/in/ayan-debnath/)

YouTube:[AD ACADEMY AI](https://www.youtube.com/@ad_academy)

Topic: Gemma - Google's open source LLM

class on 24th February 2024

# Welcome Gemma - Google’s new open LLM

Gemma, a new family of state-of-the-art open LLMs, was released today by Google! It's great to see Google reinforcing its commitment to open-source AI, and we’re excited to fully support the launch with comprehensive integration in Hugging Face.

Gemma comes in two sizes: 7B parameters, for efficient deployment and development on consumer-size GPU and TPU and 2B versions for CPU and on-device applications. Both come in base and instruction-tuned variants.

Gemma is a family of 4 new LLM models by Google based on Gemini. It comes in two sizes: 2B and 7B parameters, each with base (pretrained) and instruction-tuned versions. All the variants can be run on various types of consumer hardware, even without quantization, and have a context length of 8K tokens:

gemma-7b: Base 7B model.
gemma-7b-it: Instruction fine-tuned version of the base 7B model.
gemma-2b: Base 2B model.
gemma-2b-it: Instruction fine-tuned version of the base 2B model.

In [None]:
# !pip install transformers
!pip3 install -q -U "transformers==4.38.1" --upgrade

In [None]:
from google.colab import userdata
userdata.get('HF_TOKEN')

In [None]:
import os
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

# Running the model on a CPU

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

# tokenizer: This is an instance of the Hugging Face tokenizer

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")
# return_tensors="pt": This parameter specifies that the tokenizer should
# return the result as a PyTorch tensor. The "pt" stands for PyTorch.

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

 # Running the model on multi GPU

In [None]:
!pip install accelerate
# !pip install accelerate
from accelerate import disk_offload

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
# model = AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto")
model = disk_offload(AutoModelForCausalLM.from_pretrained("google/gemma-7b", device_map="auto"))

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))


In [None]:
!pip install accelerate
from accelerate import disk_offload

In [None]:
!apt-get install nvidia-driver-470

In [None]:
!nvidia-smi

# Quantized Google Gemma's Model

In [1]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.0/105.0 MB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.4/183.4 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.0/280.0 kB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m150.9/150.9 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m536.7/536.7 kB[0m [31m17.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.8/79.8 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m13.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m15.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━

In [2]:
import os
import transformers
import torch
from google.colab import userdata
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig, GemmaTokenizer

In [3]:
from google.colab import userdata
userdata.get('HF_TOKEN')

import os
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')

In [4]:
model_id = "google/gemma-2b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# 32 bits of parameters (weights and bias) are convereted to 4 bits
# bnb_4bit_quant_type="nf4" --> 4 bit normal float is a technique used here for quantization
# bnb_4bit_compute_dtype=torch.bfloat16 --> Due to quantization, there is loss of information,
#    to balance that float16 is used for weights related fine tuning

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0},
                                             token=os.environ['HF_TOKEN'])

# device_map={"":0} is for GPU

tokenizer_config.json:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/555 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [18]:
text = "Tell about Sachin Tendulkar"
device = "cuda:0"    # for GPU
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Tell about Sachin Tendulkar, the most famous Indian cricketer.

Underline the correct form of the modifier in parentheses. Example 1. Eric took the brunt of other team members’ teasing at the cruise party. (bad, badly)

1. The king’s crown


# Fine Tuning: LORA

In [9]:
os.environ["WANDB_DISABLED"] = "false"
lora_config = LoraConfig(
    r = 8,
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)

# r = 8 rank of LORA, hyper-parameter, u can choose 4,8,16,34,64

### datasets

In [10]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

Downloading readme:   0%|          | 0.00/5.55k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/647k [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2508 [00:00<?, ? examples/s]

In [None]:
data['train']['quote']

In [12]:
# for supervised Fine Tuning
def formatting_func(example):
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"
    return [text]

In [13]:
data['train']

Dataset({
    features: ['quote', 'author', 'tags', 'input_ids', 'attention_mask'],
    num_rows: 2508
})

In [14]:
trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)



Map:   0%|          | 0/2508 [00:00<?, ? examples/s]



In [24]:
trainer.train()

Step,Training Loss
1,0.0388
2,0.0132
3,0.0281
4,0.0217
5,0.0249
6,0.044
7,0.0404
8,0.0129
9,0.0308
10,0.0215


TrainOutput(global_step=100, training_loss=0.027157968422397972, metrics={'train_runtime': 70.6499, 'train_samples_per_second': 5.662, 'train_steps_per_second': 1.415, 'total_flos': 55030401331200.0, 'train_loss': 0.027157968422397972, 'epoch': 66.67})

In [21]:
text = "Quote: Two things are infinite: the universe and human stupidity"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quote: Two things are infinite: the universe and human stupidity; and I’m not sure about the universe
Author: Albert Einstein
Quote: The most


In [None]:
text = "Quote: Outside of a dog, a book is man's"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))