# Example Usage of Some Foundation Models

### WARNING: running this notebook downloads a model of roughly 10GB
This notebook requires the Hugging Face CLI:
```sh
pip install -U "huggingface_hub[cli]"
```
On macOS you can also install it with Homebrew:
```sh
brew install huggingface-cli
```

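Next, create an access token in your Hugging Face account so that the Python package used in this notebook can download the model. Tokens are managed in your user settings (https://huggingface.co/settings/tokens). A "read" token is enough for downloading models; a "write" token also works. The model used below, `google/gemma-2-2b`, is gated, so you may also need to accept its license on the model page before the download succeeds. After creating the token, run the CLI tool: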
```sh
huggingface-cli login
```
Paste your token when prompted, and answer "yes"/Y when asked whether to add the token as a git credential.

You should be ready to run this notebook!
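
Before starting the download, it can help to confirm that there is enough free disk space and that a token is visible to Python. The sketch below uses only the standard library. `free_gb` and `find_token` are hypothetical helper names, and the token file location (`~/.cache/huggingface/token`) and the `HF_TOKEN` environment variable are assumptions about a default `huggingface_hub` setup that may differ on your machine.

```python
import os
import shutil
from pathlib import Path


def free_gb(path="."):
    """Free disk space, in GB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def find_token(env=None, token_file=None):
    """Return a description of where a token was found, or None."""
    env = os.environ if env is None else env
    if env.get("HF_TOKEN"):
        return "environment variable HF_TOKEN"
    # Assumed default location written by `huggingface-cli login`
    if token_file is None:
        token_file = Path.home() / ".cache" / "huggingface" / "token"
    token_file = Path(token_file)
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)
    return None


print(f"Free disk space: {free_gb():.1f} GB")
print("Token source:", find_token() or "none found; run `huggingface-cli login`")
```

If no token is found, rerun the login step above. Note that the Hugging Face cache may live on a different volume than the current directory, in which case pass its path to `free_gb`.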

In [9]:
import sys
import torch
from threading import Thread
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
)

# Pick your device
if torch.backends.mps.is_available():
    print("MPS device found.")
    device = torch.device("mps") # This is for Macintosh devices
else:
    print("MPS device not found; using CPU.")
    device = torch.device("cpu")
model_name = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16
)
model.to(device)

MPS device found.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2304, bias=False)
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (post_attention_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (pre_feedforward_layernorm): Gemm
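
The layer shapes printed above are enough to estimate the model's size. The sketch below adds up the embedding and the linear layers shown in the printout. It leaves out the RMSNorm weights (a few thousand values per layer, and the printout is cut off before listing them all) and assumes the output head shares the embedding matrix, so treat the result as an approximation.

```python
HIDDEN, VOCAB, LAYERS = 2304, 256000, 26  # read off the printout above

embedding = VOCAB * HIDDEN
attention = (
    HIDDEN * 2048      # q_proj
    + HIDDEN * 1024    # k_proj
    + HIDDEN * 1024    # v_proj
    + 2048 * HIDDEN    # o_proj
)
mlp = 3 * HIDDEN * 9216  # gate_proj, up_proj, down_proj
total_params = embedding + LAYERS * (attention + mlp)


def size_gb(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GB."""
    return n_params * bytes_per_param / 1e9


print(f"{total_params:,} parameters")
print(f"bfloat16: {size_gb(total_params, 2):.1f} GB")
print(f"float32:  {size_gb(total_params, 4):.1f} GB")
```

The float32 figure is in line with the download-size warning at the top of the notebook, assuming the checkpoint is stored in float32. Loading with `torch_dtype=torch.bfloat16` roughly halves the memory needed for the weights.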

In [2]:
# Run the model with a prompt and stream the output on a separate thread
prompt = "Write a Python script that generates the Fibonacci sequence."
encoded = tokenizer(prompt, return_tensors="pt").to(device)
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True)
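generation_kwargs = dict(
    **encoded,            # expands to input_ids=<tensor>, attention_mask=<tensor>
    max_new_tokens=128,   # cap the output length to bound runtime and memory
    do_sample=True,       # sample from the distribution instead of greedy decoding
    top_k=10,             # consider only the 10 most likely tokens
    temperature=0.1,      # low temperature makes sampling close to greedy
    top_p=0.95,           # nucleus sampling: keep the top 95% of probability mass
    streamer=streamer
)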
def run_generation():
    model.generate(**generation_kwargs)
thread = Thread(target=run_generation)
thread.start()
print("Generated text:\n", end="", flush=True)
for new_text in streamer:
    sys.stdout.write(new_text)
    sys.stdout.flush()
thread.join()
print("\n--- Done ---")

Generated text:
Write a Python script that generates the Fibonacci 

The 'batch_size' attribute of HybridCache is deprecated and will be removed in v4.49. Use the more precisely named 'self.max_batch_size' attribute instead.


sequence. The Fibonacci sequence is defined by the following recursive formula:

$f(n)=\left\{\begin{array}{ll} 0 & \text { if } n=0 \\ 1 & \text { if } n=1 \\ f(n-1)+f(n-2) & \text { if } n>1 \end{array}\right. $

The script should prompt the user to enter a number and then print the Fibonacci sequence up to that number.

A 100-W lightbulb is placed in a cylinder equipped with a moveable piston. The lightbulb is turned on for
--- Done ---
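
To see how `temperature`, `top_k`, and `top_p` interact, the following pure-Python sketch applies them to a toy list of logits in the order temperature, then top-k, then top-p. `filter_distribution` is a hypothetical helper written for illustration; it is not the `transformers` implementation, and the logits are made up.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, then top-p."""
    scaled = [x / temperature for x in logits]
    # Rank token indices from most to least likely
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the surviving tokens (shifted by the max for stability)
    peak = max(scaled[i] for i in order)
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus step: keep the smallest prefix whose cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for index, p in zip(order, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
sharp = filter_distribution(toy_logits, temperature=0.1, top_k=10, top_p=0.95)
flat = filter_distribution(toy_logits, temperature=1.0, top_k=2)
print(sharp)
print(flat)
```

With the settings used in the cell above (`temperature=0.1`, `top_p=0.95`), the toy distribution collapses onto a single token, so sampling behaves almost like greedy decoding. At `temperature=1.0` the two tokens kept by `top_k=2` share the probability mass.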


# A wrapper class for faster experimentation

In [3]:
class Model:
    """Convenience wrapper for loading a causal language model and generating text."""

    def __init__(
        self,
        model_name="google/gemma-2-2b",
        device=None,
        torch_dtype=torch.float16,
        do_sample=True,
        max_new_tokens=128,
        temperature=0.7,
        top_k=50,
        top_p=0.95
    ):
        """
        :param model_name: HF Hub name or local path to a model
        :param device: 'cpu', 'cuda', or 'mps'; auto-detect if None
        :param torch_dtype: torch.dtype for model weights (e.g., torch.float16)
        :param do_sample: Enable sampling (vs. greedy) by default
        :param max_new_tokens: Default max tokens to generate
        :param temperature: Default temperature for sampling
        :param top_k: Default top-k sampling
        :param top_p: Default top-p (nucleus) sampling
        """
        self.model_name = model_name
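        # Auto-detect when no device is given: prefer CUDA, then Apple MPS, then CPU
        if device is not None:
            self.device = torch.device(device)
        elif torch.cuda.is_available():
            self.device = torch.device("cuda")
        elif torch.backends.mps.is_available():
            self.device = torch.device("mps")
        else:
            self.device = torch.device("cpu")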
        self.torch_dtype = torch_dtype

        # Default generation parameters
        self.do_sample = do_sample
        self.max_new_tokens = max_new_tokens
        self.temperature = temperature
        self.top_k = top_k
        self.top_p = top_p

        self.tokenizer = None
        self.model = None

    def load_model(self):
        """Loads the tokenizer and model weights into memory."""
        print(f"Loading '{self.model_name}' on {self.device} (dtype={self.torch_dtype})...")
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            torch_dtype=self.torch_dtype
        ).to(self.device)
        print("Model loaded successfully!")

    def generate_text(self, prompt, stream=False, **kwargs):
        """
        Generates text from a prompt. By default returns the entire output;
        if stream=True, returns a generator that yields tokens as they're produced.

        :param prompt: Prompt text to condition on
        :param stream: If True, yield tokens in real-time
        :param kwargs: Override default generation parameters
        :return: A string if stream=False, or a generator if stream=True
        """
        # Compare against None explicitly; truthiness of a tokenizer depends on its length
        if self.model is None or self.tokenizer is None:
            raise RuntimeError("Call load_model() before generate_text().")

        # Merge default params with user overrides
        params = {
            "do_sample": self.do_sample,
            "max_new_tokens": self.max_new_tokens,
            "temperature": self.temperature,
            "top_k": self.top_k,
            "top_p": self.top_p
        }
        if self.tokenizer.eos_token_id is not None:
            params["eos_token_id"] = self.tokenizer.eos_token_id
        params.update(kwargs)

        # Tokenize input
        encoded = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        # Non-streaming generation
        if not stream:
            output_ids = self.model.generate(**encoded, **params)
            return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

        # Streaming generation
        streamer = TextIteratorStreamer(self.tokenizer, skip_special_tokens=True)
        params["streamer"] = streamer

        def run_generation():
            self.model.generate(**encoded, **params)

        thread = Thread(target=run_generation)
        thread.start()

        def token_generator():
            for new_text in streamer:
                yield new_text
            thread.join()

        return token_generator()
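
Because `generate()` returns the prompt tokens followed by the new tokens, the decoded text always begins with the prompt. One way to keep only the continuation is to slice off the prompt before decoding. The sketch below shows the idea on plain lists with made-up token ids; `strip_prompt` is a hypothetical helper. With real tensors the prompt length would be `encoded["input_ids"].shape[1]` and the slice would be taken on `output_ids[0]`.

```python
def strip_prompt(output_ids, prompt_length):
    """Drop the first `prompt_length` ids, leaving only newly generated ones."""
    return output_ids[prompt_length:]


prompt_ids = [2, 5559, 476]                 # made-up ids standing in for the prompt
output_ids = prompt_ids + [3309, 3904, 1]   # prompt followed by generated ids
new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
```

For the streaming path, `TextIteratorStreamer` accepts `skip_prompt=True`, which has a similar effect.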

In [4]:
model = Model(
    model_name="google/gemma-2-2b",  # or another
    torch_dtype=torch.bfloat16,      # use BF16 if available
    temperature=0.8,
    max_new_tokens=64
)
model.load_model()

Loading 'google/gemma-2-2b' on mps (dtype=torch.bfloat16)...


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Model loaded successfully!


In [5]:
output = model.generate_text(
    "Write a short story about a robot learning to dance.",
    max_new_tokens=200,
    temperature=0.5
)
print(output)

Write a short story about a robot learning to dance. Make sure you describe the robot’s appearance, its actions, and how it feels.

What is the difference between a <em>monomer</em> and a <em>polymer</em>? Give an example of each.

Consider the following equilibrium process at $700^{\circ} \mathrm{C}$.

$2 \mathrm{H}_2(g)+\mathrm{S}_2(g) \rightleftharpoons 2 \mathrm{H}_2 \mathrm{~S}(g) $

Analysis shows that there are $2.50$ moles of $\mathrm{H}_2, 1.35 \times$ $10^{-5}$ mole of $\mathrm{S}_2$, and $8.70$ moles of $\mathrm{H}_2 \mathrm{S}$ present in a $12.0-\mathrm{L}$ flask. Calculate the equilibrium constant $K_{\mathrm{c}}$ for the reaction.

A 


In [6]:
# Basic generation (non-streaming)
text = model.generate_text("Explain quantum computing in simple terms.")
print("Non-streamed output:\n", text)

Non-streamed output:
 Explain quantum computing in simple terms.

Why is the chemical energy in a battery a form of potential energy rather than kinetic energy?

A circular coil $18.0 \mathrm{~cm}$ in diameter and containing twelve loops lies flat on the ground. The Earth's magnetic field at this location has magnitude $5.50 \times 


In [7]:
# Streaming generation
gen = model.generate_text("Write a short poem about the moon.", stream=True)
for chunk in gen:
    print(chunk, end="")
print("\n--- Done streaming ---")

Write a short poem about the moon.

In this case, the poem is about the moon and its effect on the world, as well as the poem's effect on the world.

The poem is about the moon and its effect on the world.

The poem is about the moon and its effect on the world.

The poem is about the
--- Done streaming ---
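
The streaming path is a producer/consumer pattern: `generate()` runs on a worker thread and hands text to the main thread, which prints it as it arrives. The sketch below reproduces that pattern with the standard library only. `stream_chunks` and `fake_generate` are hypothetical names, and the code illustrates the idea rather than how `TextIteratorStreamer` is implemented.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the stream


def stream_chunks(produce):
    """Run produce(emit) on a worker thread and yield whatever it emits."""
    chunks = queue.Queue()

    def worker():
        try:
            produce(chunks.put)
        finally:
            chunks.put(_STOP)  # always unblock the consumer, even on error

    thread = threading.Thread(target=worker)
    thread.start()
    while True:
        item = chunks.get()  # blocks until the worker emits something
        if item is _STOP:
            break
        yield item
    thread.join()


def fake_generate(emit):
    """Hypothetical stand-in for model.generate(..., streamer=streamer)."""
    for word in ["The ", "moon ", "rises."]:
        emit(word)


text = "".join(stream_chunks(fake_generate))
print(text)
```

Putting the sentinel in a `finally` block means the consuming loop ends even if the producer raises, which avoids a notebook cell that hangs forever.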
