<a href="https://colab.research.google.com/github/ZuckermanLab/CodingClass2025/blob/main/Local_LLM_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Local Large Language Models

The first thing we need to do is get an API access token from huggingface.co. After creating an account on Hugging Face, go to https://huggingface.co/google/gemma-2b to accept the terms and conditions for the Gemma model. Next, go to Settings > Access Tokens and create a new "Read" token. Keep this page open, as we will need to enter this token later.

In [None]:
from huggingface_hub import login
login()

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it")
model = model.to('cuda')

In [None]:
tokenizer.encode("hello world 🤗")

In [None]:
tokenizer.decode(223603)

In [5]:
def tokenize(text):
  return tokenizer.encode(text, return_tensors='pt').to('cuda')

def detokenize(tokens):
  return tokenizer.decode(tokens)

Now we can test out the `generate` method of our model. We pass in our tokenized input prompt (`prompt`) and specify the maximum number of new tokens we want the model to generate. We also set `do_sample` to `False` so that we get the same output each time.
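To see why turning sampling off makes the output repeatable, here is a minimal sketch that does not touch the model. The vocabulary and probabilities are made up for illustration, and the helper names `pick_greedy` and `pick_sampled` are hypothetical. It assumes that `do_sample=False` with the default settings (no beam search) amounts to always picking the highest-scoring token.

```python
import random

# Hypothetical next-token distribution over a tiny, made-up vocabulary.
vocab = ["Salem", "Portland", "Eugene"]
probs = [0.7, 0.2, 0.1]

def pick_greedy(probs):
  # Index of the highest-probability token: no randomness involved.
  return max(range(len(probs)), key=lambda i: probs[i])

def pick_sampled(probs, rng):
  # Draw one index at random, weighted by the probabilities.
  return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random()
greedy_picks = [vocab[pick_greedy(probs)] for _ in range(5)]
sampled_picks = [vocab[pick_sampled(probs, rng)] for _ in range(5)]

print(greedy_picks)   # the same token every time
print(sampled_picks)  # can differ from run to run
```

Greedy selection is deterministic, which makes it easier to compare results across runs; sampling trades that repeatability for more varied text.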

In [6]:
prompt = "What is the capital of Oregon? Answer with one word only."

In [None]:
output_tokens = model.generate(tokenize(prompt), max_new_tokens=25, do_sample=False)

In [None]:
print(detokenize(output_tokens[0]))

# Examining the next-token distribution

In [9]:
tokens = tokenize("What is the capital of Oregon? Answer with one word only.<end_of_turn><start_of_turn>\n\n")

Each time we pass our input through the model (the "forward pass"), the output we get is called the "logits": a set of scores, one for each token ID in the model's vocabulary, computed at every position in the input. For Gemma, the vocabulary contains 262,144 different tokens.
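Logits are raw scores, not probabilities. A common way to turn them into a probability distribution is the softmax function. The sketch below works on a made-up four-token vocabulary rather than real model output; `toy_softmax` and `toy_logits` are illustrative names, not part of any library.

```python
import math

def toy_softmax(scores):
  # Subtract the largest score first so math.exp never overflows.
  highest = max(scores)
  exps = [math.exp(s - highest) for s in scores]
  total = sum(exps)
  # Dividing by the total makes the values sum to 1.
  return [e / total for e in exps]

# Made-up logits for a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.1, -1.0]
toy_probs = toy_softmax(toy_logits)

print([round(p, 3) for p in toy_probs])
print(sum(toy_probs))
```

The token with the largest logit also gets the largest probability, so taking the argmax of the logits and taking the argmax of the probabilities select the same token.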

In [11]:
output = model.forward(tokens) #model.forward() is equivalent to model.predict() from yesterday

In [None]:
output['logits'].shape

In [54]:
from scipy.special import softmax
import matplotlib.pyplot as plt
import numpy as np

In [14]:
word_probs = softmax(
    output['logits'][0][-1]
    .cpu()
    .detach()
    .numpy()
    )

In [None]:
top_word_probs = np.array(
 sorted([[i,l] for i,l in enumerate(word_probs)], key=lambda x: x[1], reverse=True)
)[:50]
fig, ax = plt.subplots(figsize=(20,2))
plt.bar(range(len(top_word_probs)), top_word_probs[:,1])
plt.xticks(
    range(len(top_word_probs)),
    [tokenizer.decode(int(t)) for t in top_word_probs[:,0]],
    rotation=75,
    fontsize=8
)

plt.xlabel('next token')
plt.ylabel('Probability')
plt.show()

In [None]:
[(tokenizer.decode(int(t)), int(t)) for t in top_word_probs[:,0]]

In [None]:
tokenizer.decode(output['logits'][0][-1].argmax())

In [None]:
for i,token in enumerate(tokens[0]):
  print([tokenizer.decode(t) for t in tokens[0][:i+1]], tokenizer.decode(output['logits'][0][i].argmax()))

## Implementing the `generate()` method ourselves

When we call the `generate()` method of our model, we are really running the model multiple times in a loop. At each step we take the next token predicted by the model, add it to our input, and run the model again.

So to replicate the functionality of the `generate()` method, we just need to call the `forward()` method in a loop, take the argmax of the logits for the last token in the input, add that new token ID to the end of our input, and repeat until we reach the maximum number of new tokens.
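The loop described above can be sketched without a GPU by swapping the real model for a tiny lookup table. Everything here is hypothetical: `TOY_VOCAB`, `TOY_SCORES`, `toy_forward`, and `toy_generate` are stand-ins that only mimic the shape of the problem (one row of scores per input position), not how Gemma works internally.

```python
# Hypothetical stand-in for the model: for each token ID, a score per next token.
TOY_VOCAB = ["<end>", "the", "capital", "is", "Salem"]
TOY_SCORES = {
  1: [0.0, 0.1, 2.0, 0.3, 0.2],  # after "the"     the top score is "capital"
  2: [0.0, 0.1, 0.2, 2.0, 0.3],  # after "capital" the top score is "is"
  3: [0.1, 0.2, 0.0, 0.1, 2.0],  # after "is"      the top score is "Salem"
  4: [2.0, 0.1, 0.0, 0.1, 0.2],  # after "Salem"   the top score is "<end>"
}

def toy_forward(token_ids):
  # One row of scores per input position, similar in spirit to logits[0].
  return [TOY_SCORES[t] for t in token_ids]

def toy_generate(token_ids, max_new_tokens=10, end_id=0):
  token_ids = list(token_ids)
  for _ in range(max_new_tokens):
    logits = toy_forward(token_ids)   # run the "model" on the whole input
    last = logits[-1]                 # scores for the position after the last token
    next_id = max(range(len(last)), key=lambda i: last[i])  # argmax
    token_ids.append(next_id)         # add the new token to the input
    if next_id == end_id:             # stop once the end marker is produced
      break
  return token_ids

generated = toy_generate([1])
print(" ".join(TOY_VOCAB[i] for i in generated))
```

Note that every iteration re-runs the stand-in model on the full, growing input, which mirrors the simple approach described above.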

In [20]:
import torch
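def get_next_token(logits):
  #argmax over the vocabulary at the last position, reshaped to (1, 1)
  #so it can be concatenated onto the (1, sequence_length) input tokens
  return logits[0, -1].argmax().reshape(1, 1)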

def concatenate_tokens(tokens, next_token):
  return torch.cat([tokens, next_token], dim=1)

In [45]:
def model_generate(model, prompt, max_new_tokens=250):
  #tokenize prompt
  input_tokens = tokenize(prompt)

  for i in range(max_new_tokens):
    #pass input tokens through the model to get the logits
    output = model.forward(input_tokens)

    #get the most likely next token
    next_token = get_next_token(output['logits'])

    #concatenate next predicted token to the input
    input_tokens = concatenate_tokens(input_tokens, next_token)

    #check to see if we reached the end of the sentence
    if detokenize(next_token[0]) == '<end_of_turn>':
      break

  return detokenize(input_tokens[0])

In [None]:
prompt = "What is the capital of Oregon?"
answer = model_generate(model, prompt)
print(answer)