# RAG vs Prompt Engineering - Baseline LLM (Gemma 2)

This notebook sets up a minimal, Colab-friendly baseline for querying a small open LLM. We first check that the base model struggles with specific, recent questions (e.g., what changed in the newest Python releases), then augment it with RAG later.

- Model: `google/gemma-2-2b-it` (changeable)
- Exposes: `ask(question, ...)` helper and `PYTHON_ASSISTANT_SYSTEM_PROMPT`
- Runs on CPU or GPU (4-bit quantized via bitsandbytes when a GPU is available)
- Note: You may need to accept the Gemma license on Hugging Face and log in if access is restricted
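
The "struggles on up-to-date questions" check can be made slightly more concrete with a crude keyword heuristic that flags answers in which the model declines for lack of recent knowledge. The function name `looks_like_knowledge_gap` and the phrase list are assumptions chosen for illustration, not part of this notebook. A sketch like this will miss paraphrases and cannot detect confident but wrong answers.

```python
from typing import Iterable

# Hypothetical phrase list: wording a model may use when it lacks recent information.
KNOWLEDGE_GAP_PHRASES = (
    "knowledge cutoff",
    "cut-off date",
    "real-time information",
    "i don't have access",
    "i can't give you specific",
)


def looks_like_knowledge_gap(
    answer: str, phrases: Iterable[str] = KNOWLEDGE_GAP_PHRASES
) -> bool:
    """Return True if the answer contains any phrase suggesting missing or outdated knowledge."""
    # Lowercase and normalize curly apostrophes so simple substring matching works.
    lowered = answer.lower().replace("\u2019", "'")
    return any(phrase in lowered for phrase in phrases)
```

Once `ask` is defined, something like `looks_like_knowledge_gap(ask("..."))` could be used to tally how often the baseline declines before and after adding retrieval.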



In [2]:
# Colab/Env setup
%pip -q install -U transformers accelerate sentencepiece bitsandbytes
# Optional: Hugging Face login to access gated models
# from google.colab import userdata  # Colab only
# hf_token = userdata.get('HF_TOKEN')
# if hf_token:
#     %pip -q install -U huggingface_hub
#     from huggingface_hub import login
#     login(token=hf_token)


In [None]:
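from huggingface_hub import login
# Paste your own access token in place of the placeholder; do not commit a real token to a shared notebook
login(token="access token here")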
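
As an alternative to pasting a token into the cell, the token can be read from an environment variable. The helper below is a minimal sketch; `resolve_hf_token` is a hypothetical name, and it assumes the token is stored under `HF_TOKEN`, the same name used for the Colab secret.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    env: Mapping[str, str] = os.environ, var_name: str = "HF_TOKEN"
) -> Optional[str]:
    """Return the stripped token from the environment, or None if it is unset or blank."""
    token = env.get(var_name, "").strip()
    return token or None


# Usage (commented out so the sketch runs without network access):
# from huggingface_hub import login
# token = resolve_hf_token()
# if token:
#     login(token=token)
```

Returning `None` for a blank value lets the caller skip the login step, which is enough for public models.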


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Dict, Optional

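# BitsAndBytesConfig ships with transformers, so also check that the bitsandbytes
# package itself is installed before enabling 4-bit loading (only used on GPU).
import importlib.util
try:
    from transformers import BitsAndBytesConfig
    _bnb_available = importlib.util.find_spec("bitsandbytes") is not None
except ImportError:
    _bnb_available = False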

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

# Choose Gemma 2 Instruct (changeable if needed)
MODEL_ID = "google/gemma-2-2b-it"

# Generation defaults
GEN_CFG = {
    "max_new_tokens": 800,
    "temperature": 0.3,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
}

# Optional: set a global random seed for reproducibility
torch.manual_seed(42)
if device == "cuda":
    torch.cuda.manual_seed_all(42)
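
To give a feel for what `temperature` and `top_p` in `GEN_CFG` do, here is a conceptual pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It is not the `transformers` implementation, and the function names are hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set()
    cumulative = 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


toy_logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(toy_logits, 1.0))  # unscaled distribution
print(softmax_with_temperature(toy_logits, 0.3))  # lower temperature concentrates mass on the top token
print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.9))  # the least likely token is dropped
```

A low temperature such as 0.3 keeps answers close to the most likely continuation, while `top_p=0.9` trims the unlikely tail before sampling.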



Device: cuda


In [5]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

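if device == "cuda" and _bnb_available:
    # Use bfloat16 only when the GPU supports it; otherwise fall back to float16
    compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    # Newer transformers releases warn that `torch_dtype` is deprecated in favor of `dtype`
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map="auto",
        quantization_config=bnb_config,
        torch_dtype=compute_dtype,
    )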
else:
    # CPU -> float32 weights; GPU without bitsandbytes -> float16 weights (no quantization)
    dtype = torch.float32 if device == "cpu" else torch.float16
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype)
    model.to(device)

model.eval()
print("Model loaded:", MODEL_ID)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded: google/gemma-2-2b-it
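
A rough back-of-the-envelope estimate shows why 4-bit loading matters on small GPUs: weight memory is approximately the parameter count times the bits per parameter. The parameter count below is an assumption for illustration (roughly 2.6 billion); check the model card for the exact figure. The estimate covers weights only and ignores quantization constants, activations, and the KV cache.

```python
def estimate_weight_memory_gb(n_params: int, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumed parameter count for illustration only; not read from the loaded model.
ASSUMED_PARAMS = 2_600_000_000

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit (nf4)", 4)]:
    gb = estimate_weight_memory_gb(ASSUMED_PARAMS, bits)
    print(f"{label:>18}: ~{gb:.1f} GB")
```

For the model actually loaded in this session, `model.get_memory_footprint()` reports the measured footprint in bytes.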


In [6]:
PYTHON_ASSISTANT_SYSTEM_PROMPT = "You are a Python programming assistant."


In [10]:
def _format_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> Dict[str, torch.Tensor]:
    """
    Build model inputs with tokenizer.apply_chat_template when a chat template is set;
    otherwise fall back to a simple "Role: content" prompt.
    Some templates (e.g., some Gemma templates) reject a 'system' role with a TemplateError,
    so a leading system message is always merged into the first user message before templating.
    """
    if hasattr(tokenizer, "apply_chat_template") and tokenizer.chat_template is not None:
        effective_messages = messages
        if messages and messages[0].get("role") == "system":
            system_text = messages[0]["content"]
            effective_messages = messages[1:]
            if effective_messages and effective_messages[0].get("role") == "user":
                effective_messages = effective_messages.copy()
                effective_messages[0] = {
                    "role": "user",
                    "content": f"{system_text}\n\n{effective_messages[0]['content']}"
                }
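            else:
                # No user turn follows the system message: send it as a user turn and keep any remaining turns
                effective_messages = [{"role": "user", "content": system_text}] + effective_messages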
        prompt_text = tokenizer.apply_chat_template(
            effective_messages,
            tokenize=False,
            add_generation_prompt=add_generation_prompt
        )
    else:
        # Fallback simple instruction style
        sys_msg = ""
        if messages and messages[0].get("role") == "system":
            sys_msg = f"System: {messages[0]['content']}\n"
            user_msgs = messages[1:]
        else:
            user_msgs = messages
        convo = "\n".join([f"{m['role'].capitalize()}: {m['content']}" for m in user_msgs])
        prompt_text = (sys_msg + convo + ("\nAssistant:" if add_generation_prompt else ""))

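    # Chat templates such as Gemma's already write the BOS token into the prompt text,
    # so skip the tokenizer's automatic special tokens then to avoid a duplicated BOS.
    has_bos = bool(tokenizer.bos_token) and prompt_text.startswith(tokenizer.bos_token)
    inputs = tokenizer(prompt_text, return_tensors="pt", add_special_tokens=not has_bos)
    return {k: v.to(device) for k, v in inputs.items()}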

@torch.inference_mode()
def generate_from_messages(
    messages: List[Dict[str, str]],
    max_new_tokens: int = GEN_CFG["max_new_tokens"],
    temperature: float = GEN_CFG["temperature"],
    top_p: float = GEN_CFG["top_p"],
    repetition_penalty: float = GEN_CFG["repetition_penalty"],
) -> str:
    inputs = _format_chat(messages, add_generation_prompt=True)
    input_len = inputs["input_ids"].shape[-1]

    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    gen_ids = outputs[0][input_len:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
    return text.strip()

def ask(
    question: str,
    system_prompt: Optional[str] = "You are a helpful assistant.",
    **gen_kwargs
) -> str:
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    return generate_from_messages(messages, **gen_kwargs)
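
The system-prompt handling inside `_format_chat` can be tried in isolation, without loading a model. The function below is a standalone sketch of the merge step; `merge_system_into_user` is a hypothetical name, and in the branch where no user turn follows the system message it keeps any remaining turns.

```python
from typing import Dict, List


def merge_system_into_user(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Fold a leading system message into the first user turn; leave other inputs unchanged."""
    if not messages or messages[0].get("role") != "system":
        return list(messages)
    system_text, rest = messages[0]["content"], list(messages[1:])
    if rest and rest[0].get("role") == "user":
        rest[0] = {"role": "user", "content": f"{system_text}\n\n{rest[0]['content']}"}
        return rest
    # No user turn follows: promote the system text to a user turn of its own.
    return [{"role": "user", "content": system_text}] + rest


example = merge_system_into_user([
    {"role": "system", "content": "You are a Python programming assistant."},
    {"role": "user", "content": "What is a list comprehension?"},
])
print(example)
```

The result is a single user turn whose content starts with the system text, which is the form the chat template receives when `ask` is called with a system prompt.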


In [15]:
print("General Q:", ask("What was added in Python 3.12.2 released in March 2024?", system_prompt=PYTHON_ASSISTANT_SYSTEM_PROMPT))



General Q: I can't give you specific details about the features and changes added in Python 3.12.2 because I don't have access to real-time information or release notes.  

**Here's why:**

* **My knowledge cutoff:** As an AI, my training data has a fixed cut-off date. This means I can't provide information that came out after my last update. 
* **Release updates are often complex:** Release notes for major Python versions like 3.12.2 are detailed documents with many improvements, new features, bug fixes, and performance enhancements. 

**Where to find this information:**

To find the most up-to-date information about Python 3.12.2, I recommend checking these sources:

* **The official Python website:** https://www.python.org/
* **Python's release page:** https://peps.python.org/pep/releases/
* **The Python Enhancement Proposals (PEPs):** PEPs are formal proposals for Python changes, which you can search through here: https://www.python.org/dev/peps/


Let me know if you have other Pyt