<a href="https://colab.research.google.com/github/Vaibhavs10/gpu-poor-llm-notebooks/blob/main/gemma_2_9b_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gemma 2 9B 💎

Gemma 2 is Google's latest iteration of open LLMs. It comes in two sizes, 9 billion and 27 billion parameters, each with a base (pre-trained) and an instruction-tuned version. Gemma is based on Google DeepMind's Gemini and has a context length of 8K tokens. The following checkpoints are available:

- [gemma-2-9b](https://huggingface.co/google/gemma-2-9b): Base 9B model.
- [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it): Instruction fine-tuned version of the base 9B model.
- [gemma-2-27b](https://huggingface.co/google/gemma-2-27b): Base 27B model.
- [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it): Instruction fine-tuned version of the base 27B model.

The Gemma 2 models were trained on roughly twice as much data as the first iteration: 13 trillion tokens for the 27B version and 8 trillion tokens for the 9B version, drawn from web data (primarily English), code, and math. We don't know the exact details of the training mix, so we can only guess that larger and more carefully curated data was a big factor in the improved performance.

Gemma 2 comes with the [same license](https://ai.google.dev/gemma/terms) as the first iteration, which is a permissive license that allows redistribution, fine-tuning, commercial use, and derivative works.

## Set Up the Inference Environment

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git accelerate



## Initialise the Text Generation pipeline

Note: Make sure to accept the terms and conditions on the [model page](https://huggingface.co/google/gemma-2-9b-it) before running this cell, otherwise the download will fail.

In [None]:
from transformers import pipeline
import torch

model_id = "google/gemma-2-9b-it"

pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
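
To see why the pipeline loads the weights in `torch.bfloat16`, a back-of-the-envelope estimate of the weight memory can help. The sketch below assumes a nominal 9 billion parameters (taken from the model name, not an exact count) and ignores activations, the KV cache, and framework overhead, so treat the numbers as a rough lower bound. `weight_memory_gb` is a hypothetical helper written for this illustration.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Assumption: nominal parameter count implied by the "9B" name.
NOMINAL_PARAMS = 9_000_000_000

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(NOMINAL_PARAMS, dtype):.0f} GB")
```

Under these assumptions, halving the bytes per parameter halves the weight footprint, which is the main reason to prefer `bfloat16` on memory-constrained hardware.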


## Define the prompt in the messages API format

You can put almost any query, question, or discussion topic you want the LLM to process in the `content` field below.

In [None]:
messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
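
The same list format extends to multi-turn conversations. The sketch below shows one possible shape, together with a hypothetical `validate_messages` helper that checks the turns alternate between `user` and `assistant`, starting with `user`. That rule is an assumption about what Gemma's chat template expects (to my knowledge it also rejects a `system` role), so check the template on the model page before relying on it. The assistant text here is placeholder content, not real model output.

```python
def validate_messages(messages):
    """Hypothetical helper: require alternating user/assistant roles, user first."""
    expected = "user"
    for index, message in enumerate(messages):
        role = message["role"]
        if role != expected:
            raise ValueError(
                f"message {index}: expected role {expected!r}, got {role!r}"
            )
        # Flip the expected role for the next turn.
        expected = "assistant" if expected == "user" else "user"
    return messages

conversation = validate_messages([
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
    {"role": "assistant", "content": "Ahoy, matey! They call me Gemma."},  # placeholder
    {"role": "user", "content": "What can you help me with?"},
])
print(len(conversation), "messages, last role:", conversation[-1]["role"])
```

Validating up front gives a readable error before any model call is made, rather than a failure from inside the chat template.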

## Pass the messages to the pipeline

In [None]:
outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)

## Extract and print the response

In [None]:
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)

Ahoy, matey! I be a humble ship o' words, sailin' the digital seas. They call me Gemma, a creation o' the fine folks at Google DeepMind. I be trained on a treasure trove o' texts, learnin' to speak and write like a true scallywag. 

Ask me yer questions, and I'll do me best to answer 'em, aye!  🦜📚
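
Because `generated_text` holds the whole conversation, with the assistant's reply as the last entry, it can serve as the history for a follow-up question. The sketch below illustrates this with mocked data in the same shape that `outputs[0]["generated_text"]` is indexed above. `continue_conversation` and the mock contents are hypothetical and exist only for this example.

```python
# Mock of the structure returned for chat-style input
# (illustrative data, not real model output).
mock_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
            {"role": "assistant", "content": "Ahoy, matey! They call me Gemma."},
        ]
    }
]

def continue_conversation(outputs, user_text):
    """Hypothetical helper: copy the returned history and append a new user turn."""
    history = [dict(message) for message in outputs[0]["generated_text"]]
    history.append({"role": "user", "content": user_text})
    return history

follow_up = continue_conversation(mock_outputs, "What be yer favourite treasure?")
print(follow_up[-1]["content"])
```

The resulting list can be passed to `pipe(...)` in the same way as `messages`. Copying each message keeps the original output untouched.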



Voila! You have your own personal (& powerful) assistant now!