<a href="https://colab.research.google.com/github/cedizone/CS501R/blob/main/CS501r_Lab_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook provides examples for using Google Colab, OpenRouter, and HuggingFace. These tools may be helpful to you during any of the class projects, although you are not required to use them.

Generally speaking, for all of the lab explorations and the semester project, you will need to be able to:

- Load and run various LLMs
- Run evaluation metrics and benchmarks
- Perform analysis on evaluation results

# Google Colab

Google Colab is a convenient and powerful way to access GPU resources. The first thing you should do is apply for a Google Colab Pro account: [Google is offering a free year of this to students and educators](https://blog.google/outreach-initiatives/education/colab-higher-education/). To apply, [visit the signup page](https://colab.research.google.com/signup) and select "No cost for students and educators". You will be required to give your school email.

If, for any reason, you are unable or do not wish to apply for this account, no worries. You can still access adequate resources through the free version of Colab, although the provided GPUs are less powerful. OpenRouter, outlined below, is an alternative resource.

LLMs run on GPUs. The GPUs provided by Colab will likely let you run small to mid-sized models on a cloud machine. Be aware, though, that for very large models you will likely want to use a service like OpenRouter.
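As a back-of-the-envelope sketch of what "small to mid-sized" means in practice, the memory needed just to hold a model's weights is roughly the parameter count times the bytes per parameter. The function name and the 16-bit default below are illustrative assumptions, and real usage is higher because activations and the KV cache also take GPU memory.

```python
def estimate_weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough size of the model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: 2 bytes per parameter (16-bit weights) unless stated otherwise.
    """
    # (billions of parameters) x (bytes per parameter) gives billions of bytes, i.e. GB
    return num_params_billion * bytes_per_param


for size in (0.6, 7, 70):
    print(f"{size}B parameters -> roughly {estimate_weight_memory_gb(size):.1f} GB of weights")
```

Comparing these numbers with the memory of whichever GPU Colab assigns you gives a quick first guess at whether a model is worth trying locally or is better called through an API.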

# HuggingFace

Let's see what it looks like to run a model in Google Colab with HuggingFace. HuggingFace is a platform that provides libraries for working with AI models and datasets, as well as a centralized hub where models and datasets can be shared. HuggingFace has many layers, but we'll start with the simplest: running a Pipeline.

HuggingFace's Pipeline abstracts away the complexities of working with AI models into a single object that we can call for inference.

First of all, to use an available GPU, select "Runtime" at the top -> "Change Runtime type" -> pick a GPU. Then, run the following code to download a model and prepare a pipeline for it.

In [1]:
!pip install evaluate rouge_score

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=db99171c92dcd4f274b1ccd92ee80f9ac4920f1f2a2822635d0ee66163c48c82
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score, evaluate
Successfully installed evaluate-0.4.6 rouge_score-0.1.2


In [2]:
from transformers import pipeline
from accelerate import Accelerator

In [3]:
device = Accelerator().device

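# Note: this rebinds the name `pipeline` from the imported factory function to the pipeline object it returns.
pipeline = pipeline("text-generation", model="Qwen/Qwen3-0.6B", device=device)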

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/726 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.50G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Device set to use cuda


Note: Some popular models on HuggingFace, like Llama, are gated. You will have to create a HuggingFace account, log into the account in the notebook, and request access to the model on the HuggingFace model page. This is free, but takes some time to set up.

We can prompt the pipeline very easily:

In [4]:
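# Note: `max_length=50` has no effect here, because the default `max_new_tokens` takes precedence (see the warning below).
# Pass `max_new_tokens=50` instead to actually cap the length of the generation.
pipeline("The secret to baking a good cake is ", max_length=50)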

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "The secret to baking a good cake is 100% the secret to baking a good cake! It's a simple recipe, it's a simple ingredient list, and it's all that you need to make a good cake. So why is it that a cake is not as good as it should be? Because of the wrong ingredient list. The problem with the ingredient list is that it's missing the right ingredients, and it's missing the right quantities. So, if you just follow the recipe and you don't use the correct ingredients, you won't get the desired result. This is why the cake is not as good as it should be. So the conclusion is that the secret to baking a good cake is not in the recipe, it's in the ingredient list. And that's why you have to make sure you have the right ingredients. So, the conclusion is that the secret to baking a good cake is not in the recipe, it's in the ingredient list. And that's why you have to make sure you have the right ingredients.\n\nOkay, so the user is asking why cakes are not as good as they 

For more complex tasks, you may want to step away from Pipeline objects and use lower-level functions of the Transformers library. You can find many tutorials and examples in the [HuggingFace documentation](https://huggingface.co/docs/transformers/en/quicktour).

So, what does it look like to run a metric on this output? Let's try it with ROUGE. ROUGE is an automatic metric that measures the word overlap between model-generated text and gold-standard reference texts.
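To make "overlap" concrete, here is a minimal sketch of a ROUGE-1 style F1 score in plain Python. It is a simplification for illustration, not the implementation used by the `evaluate` library: the `tokenize` and `rouge1_f1` names are made up here, and the tokenizer (lowercased runs of letters and digits) is an assumption.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and keep runs of letters/digits, dropping punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(prediction: str, reference: str) -> float:
    """F1 over unigram counts shared by the prediction and the reference."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())  # share of predicted words that match
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


# Precision is 1/1 but recall is only 1/6, so the score is 2/7.
print(round(rouge1_f1("Paris", "The capital of France is Paris."), 4))  # 0.2857
```

Note how a correct but terse answer still scores low: the metric rewards matching the reference's wording, not just being right.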

In [5]:
import evaluate

In [6]:
# Load the metric
rouge = evaluate.load("rouge")

# Define our test cases
test_cases = [
    {
        "prompt": "Question: What is the capital of France?\nAnswer:",
        "reference": "The capital of France is Paris."
    },
    {
        "prompt": "Question: What colors are in a traffic light?\nAnswer:",
        "reference": "A traffic light has red, yellow, and green lights."
    }
]

Downloading builder script: 0.00B [00:00, ?B/s]

In [7]:
predictions = []
references = []

# Call the pipeline on each test case
for item in test_cases:
    output = pipeline(
        item["prompt"],
        max_new_tokens=50,
        return_full_text=False,
        do_sample=False
    )

    # Extract the text string from the list/dict result
    generated_text = output[0]['generated_text'].strip()

    # Store for scoring
    predictions.append(generated_text)
    references.append(item["reference"])

    print(f"\nPrompt: {item['prompt']}")
    print("-----")
    print(f"Model:  {generated_text}")
    print("-----")
    print(f"Ref:    {item['reference']}")
    print("-----")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



Prompt: Question: What is the capital of France?
Answer:
-----
Model:  Paris
Question: What is the capital of France?
Answer:
A: Paris
B: Rome
C: London
D: Berlin
Answer:
A
Question: What is the capital of France?
Answer:
A: Paris
B:
-----
Ref:    The capital of France is Paris.
-----

Prompt: Question: What colors are in a traffic light?
Answer:
-----
Model:  The colors in a traffic light are red, green, and yellow. These colors are used to indicate the time of traffic, and they are part of the traffic signal system.
Question: What is the difference between a traffic light and a traffic signal?
-----
Ref:    A traffic light has red, yellow, and green lights.
-----


In [8]:
results = rouge.compute(predictions=predictions, references=references)

print("\n--- RESULTS ---")
print(f"ROUGE-1 (Word Overlap): {results['rouge1']:.4f}")
print(f"ROUGE-L (Sentence Structure): {results['rougeL']:.4f}")


--- RESULTS ---
ROUGE-1 (Word Overlap): 0.3013
ROUGE-L (Sentence Structure): 0.2821
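The "Sentence Structure" label above is shorthand: ROUGE-L is based on the longest common subsequence (LCS) of words, so it rewards words that appear in the same order in both texts. The following is a minimal sketch of that idea, with hypothetical helper names and a simplified tokenizer; the library's implementation differs in its details.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and keep runs of letters/digits, dropping punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)             # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))   # carry over the best so far
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


pred = "The colors in a traffic light are red, green, and yellow."
ref = "A traffic light has red, yellow, and green lights."
# Seven words are shared, but only five of them appear in the same order in both texts.
print(lcs_length(tokenize(pred), tokenize(ref)))  # 5
print(round(rouge_l_f1(pred, ref), 2))            # 0.5
```

Because "green, and, yellow" is ordered differently in the two sentences, the order-sensitive score is lower than a pure word-overlap count would suggest.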


From this, we can already see some interesting behaviors. For instance, the model (Qwen/Qwen3-0.6B) answers the question and then generates a new question, mimicking the pattern of the input prompt. This happens because we are using it as a plain text-completion model: the prompt is passed as raw text rather than as a chat-formatted conversation, so the model doesn't behave like we'd expect in a conversation; it just keeps predicting the next token. We can also see that it begins to repeat itself, which is common for small models: 0.6B parameters is not very large. We can expect better performance from 7B+ parameter models, but these are much larger to download and more expensive to run. During your labs, you may be interested in experimenting with how large a model you can run inside Colab.
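One simple way to keep runaway continuations from dragging down a score is to cut the generated text at the first sign that the model has started a new question. This is only a sketch: the function name and the stop markers are assumptions tailored to the "Question: ... Answer:" prompt format used above.

```python
def trim_continuation(generated: str, stop_markers=("\nQuestion:", "\n\n")) -> str:
    """Cut the text at the earliest stop marker, if any marker appears."""
    cut = len(generated)
    for marker in stop_markers:
        idx = generated.find(marker)
        if idx != -1:
            cut = min(cut, idx)  # keep the earliest cut point across all markers
    return generated[:cut].strip()


raw = "Paris\nQuestion: What is the capital of France?\nAnswer:\nA: Paris"
print(trim_continuation(raw))  # Paris
```

Post-processing like this changes the resulting metric values, so it is worth noting in your analysis whether scores were computed on raw or trimmed outputs.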

# OpenRouter

For especially large models, you may wish to just call an API instead of working directly with GPU resources. This is often faster as well, but it means working with a different interface.

OpenRouter is a centralized API service for LLMs and other models. Rather than juggling multiple APIs, keys, and services across multiple providers (Google, OpenAI, Anthropic, etc.), OpenRouter allows you to have a single, centralized API key for all providers. OpenRouter provides a [free plan](https://openrouter.ai/pricing) with 25+ free models across 4 providers, with a 50 request-per-day limit. We suspect this will be sufficient for the purposes of this class, and we recommend you set up an account on OpenRouter: https://openrouter.ai/

To use OpenRouter, create an account and then generate an API key. Visit https://openrouter.ai/ and hit "Get API Key". Give it a name. You can set a credit limit for the key if you wish, but don't worry; as long as you are using the free models and you haven't purchased credits, you won't be charged. Copy the private key and save it somewhere safe.

To use the key, we could just paste it into a variable here, but this is generally unsafe. Google Colab provides a nice way to keep track of secrets and keys. On the left sidebar, select the key icon (Secrets). Put in the key as the Value, give it a name ("OpenRouter"), and give it Notebook access. Now, you can do this:

In [14]:
from google.colab import userdata
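# Check that the secret is readable without echoing it. A bare `userdata.get(...)` on the last
# line of a cell would print the key into the notebook's saved output.
api_key_is_set = bool(userdata.get('OpenRouter'))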

In [15]:
# Set it as an environment variable in the session
import os
os.environ['OPENROUTER_API_KEY'] = userdata.get('OpenRouter')
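If you also run this code outside Colab, a small helper can try the Colab secret first and fall back to an environment variable. This is a sketch under stated assumptions: the function name is hypothetical, the secret name matches the "OpenRouter" name chosen above, and any failure to read the Colab secret is simply treated as "not available".

```python
import os


def load_api_key(secret_name: str = "OpenRouter", env_var: str = "OPENROUTER_API_KEY") -> str:
    """Return the API key from Colab secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        key = userdata.get(secret_name)
    except Exception:
        # Not running in Colab, or the secret is missing / not shared with this notebook.
        key = None
    key = key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: add a Colab secret named {secret_name!r} or set {env_var}."
        )
    return key

# Typical use (never print the returned value):
# os.environ["OPENROUTER_API_KEY"] = load_api_key()
```

Failing loudly when the key is missing tends to be easier to debug than a later authentication error from the API.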

Let's try using OpenRouter. OpenRouter provides a unified API for accessing all of its models, which is super convenient. To start, we can just make a curl request directly:

In [16]:
%%bash
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -d '{
  "model": "meta-llama/llama-3.3-70b-instruct:free",
  "messages": [
    {
      "role": "user",
      "content": "What is the meaning of life?"
    }
  ]
}'

{"id":"gen-1768441758-5Me6cdRGHusbrOYPHNyv","provider":"ModelRun","model":"meta-llama/llama-3.3-70b-instruct:free","object":"chat.completion","created":1768441759,"choices":[{"logprobs":null,"finish_reason":"stop","native_finish_reason":"stop","index":0,"message":{"role":"assistant","content":"The question of the meaning of life is a complex and subjective one that has been debated by philosophers, theologians, scientists, and many others for centuries. There is no one definitive answer, and the meaning of life can vary greatly from person to person and culture to culture.\n\nSome possible approaches to understanding the meaning of life include:\n\n1. **Religiou


Note: if you try to call any of the models that are not free, you'll get a response stating that you're exceeding your credit limit and need to upgrade your account. To see all of the available free models, [go to the models page and search "free"](https://openrouter.ai/models?q=free). Notable free models include Llama 3.3 70B, Gemma3, Qwen3, and GPT-oss.

We can do this Pythonically with requests:

In [17]:
import requests
import os

response = requests.post(
    url="https://openrouter.ai/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY')}",
    },
    # `json=` serializes the payload and sets the Content-Type header for us
    json={
        "model": "meta-llama/llama-3.3-70b-instruct:free",
        "messages": [
            {
                "role": "user",
                "content": "What is the meaning of life?"
            }
        ]
    },
    timeout=60,  # avoid waiting forever if the request stalls
)

In [18]:
# Parse the JSON response into a dictionary
data = response.json()

# Extract the content from the first choice
content = data['choices'][0]['message']['content']
print(content)

The meaning of life is a question that has puzzled philosophers, theologians, scientists, and thinkers for centuries. There is no one definitive answer, and the concept of meaning can vary greatly from person to person, culture to culture, and era to era. Here are some possible perspectives on the meaning of life:

1. **Existentialism**: Life has no inherent meaning; individuals create their own meaning through their experiences, choices, and actions.
2. **Religious or Spiritual**: The meaning of life is to fulfill a divine purpose or to achieve spiritual enlightenment, as defined by one's faith or spiritual tradition.
3. **Hedonism**: The meaning of life is to seek pleasure, happiness, and fulfillment through the pursuit of desires and experiences.
4. **Humanism**: The meaning of life is to cultivate human potential, promote well-being, and contribute to the betterment of society.
5. **Scientific**: The meaning of life is to survive, reproduce, and perpetuate the species, driven by bi
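When a request fails (for example, when calling a model that is not free), indexing straight into `data['choices']` raises a confusing `KeyError`. A slightly more defensive way to unpack the response might look like the sketch below. The helper name is hypothetical, and it assumes failures come back as a JSON body with a top-level `"error"` field; check the OpenRouter docs for the exact shape.

```python
def extract_content(data: dict) -> str:
    """Return the first choice's text, or raise a readable error."""
    if "error" in data:
        err = data["error"]
        # Assumption: the error is either a dict with a "message" field or a plain value.
        message = err.get("message", err) if isinstance(err, dict) else err
        raise RuntimeError(f"API returned an error: {message}")
    choices = data.get("choices") or []
    if not choices:
        raise RuntimeError("Response contained no choices")
    return choices[0]["message"]["content"]


ok = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(extract_content(ok))  # Hello!
```

With the request above, this would be called as `extract_content(response.json())`.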

This is nice, but what would be really great is to work with an actual SDK. OpenRouter supports OpenAI's SDK. This means that you can give the OpenAI SDK the URL to OpenRouter, and OpenRouter is set up to play nicely with it. This supports any model, not just OpenAI models.

In [19]:
from openai import OpenAI

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=os.environ.get('OPENROUTER_API_KEY'),
)

completion = client.chat.completions.create(
  model="meta-llama/llama-3.3-70b-instruct:free",
  messages=[
    {
      "role": "user",
      "content": "What is the meaning of life?"
    }
  ]
)

print(completion.choices[0].message.content)


The meaning of life is a question that has puzzled philosophers, theologians, scientists, and many others for centuries. It's a complex and subjective topic that can be approached from various perspectives. Here's a brief exploration of some of the most popular interpretations:

1. **Hedonism**: The pursuit of pleasure and happiness is often considered the primary goal of life. This perspective emphasizes the importance of experiencing joy, satisfaction, and fulfillment in one's life.
2. **Existentialism**: This philosophy suggests that life has no inherent meaning, and it's up to each individual to create their own purpose and meaning. Existentialists like Jean-Paul Sartre and Martin Heidegger argue that humans must take responsibility for their choices and create their own values.
3. **Religious or Spiritual**: Many people believe that the meaning of life is to fulfill a divine purpose or to follow a higher power's plan. This perspective often involves a belief in an afterlife, where

I recommend taking this approach, not just because it's the nicest but because it's the industry standard. Most providers have engineered their APIs to work nicely with the OpenAI SDK, and if you ever develop software that calls model APIs, this is likely what you will be doing.

For more information, [take a look at the OpenRouter docs](https://openrouter.ai/docs/quickstart). OpenRouter has many other features, but I don't expect anything we do in this class to get fancier than straightforward API requests.

To end, let's see what it might look like to run an evaluation metric on models through the OpenRouter API. Let's run ROUGE again, like before, but with a model through OpenRouter. This time, we can call a much larger model.

To make this easy, we can reuse the same loop as before, swapping the pipeline call for an API call.

In [20]:
predictions = []
references = []

# Call the OpenRouter API on each test case
for item in test_cases:
    output = client.chat.completions.create(
        model="meta-llama/llama-3.3-70b-instruct:free",
        messages=[
            {
                "role": "user",
                "content": item["prompt"]
            }
        ]
    )

    # Extract the text string from the completion object
    generated_text = output.choices[0].message.content

    # Store for scoring
    predictions.append(generated_text)
    references.append(item["reference"])

    print(f"\nPrompt: {item['prompt']}")
    print("-----")
    print(f"Model:  {generated_text}")
    print("-----")
    print(f"Ref:    {item['reference']}")
    print("-----")


Prompt: Question: What is the capital of France?
Answer:
-----
Model:  The capital of France is Paris.
-----
Ref:    The capital of France is Paris.
-----

Prompt: Question: What colors are in a traffic light?
Answer:
-----
Model:  The colors in a traffic light are:

1. Red
2. Yellow
3. Green
-----
Ref:    A traffic light has red, yellow, and green lights.
-----


In [22]:
results = rouge.compute(predictions=predictions, references=references)

print("\n--- RESULTS ---")
print(f"ROUGE-1 (Word Overlap): {results['rouge1']:.4f}")
print(f"ROUGE-L (Sentence Structure): {results['rougeL']:.4f}")


--- RESULTS ---
ROUGE-1 (Word Overlap): 0.7727
ROUGE-L (Sentence Structure): 0.7727
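With a limited number of free requests per day, re-running an evaluation cell can burn through your quota quickly. One option is to cache responses by model and prompt so that repeated runs reuse earlier answers. The sketch below is illustrative: `CachedGenerator` and `fake_generate` are hypothetical names, and the fake function stands in for a real API call.

```python
import json
from pathlib import Path


class CachedGenerator:
    """Wrap a generate(model, prompt) callable and remember its answers."""

    def __init__(self, generate_fn, cache_path=None):
        self.generate_fn = generate_fn
        self.cache_path = Path(cache_path) if cache_path else None
        self.cache = {}
        # Reload earlier answers if a cache file from a previous session exists.
        if self.cache_path and self.cache_path.exists():
            self.cache = json.loads(self.cache_path.read_text())

    def __call__(self, model: str, prompt: str) -> str:
        key = json.dumps([model, prompt])  # a string key, so the cache is JSON-serializable
        if key not in self.cache:
            self.cache[key] = self.generate_fn(model, prompt)
            if self.cache_path:
                self.cache_path.write_text(json.dumps(self.cache))
        return self.cache[key]


calls = []

def fake_generate(model, prompt):
    # Stand-in for a real API call; records how often it is invoked.
    calls.append(prompt)
    return f"[{model}] answer to: {prompt}"


cached = CachedGenerator(fake_generate)
cached("demo-model", "What is 2 + 2?")
cached("demo-model", "What is 2 + 2?")  # served from the cache, no second call
print(len(calls))  # 1
```

Keep in mind that caching freezes one particular response per prompt, which is fine for deterministic decoding but hides variation if you are sampling.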


This is a very basic example, but it provides the general structure you will likely use when experimenting with different metrics. You may wish to develop a class that holds all of the functionality for setting up and using the OpenRouter or HuggingFace models. That way, you could just call a function like "generate" with some parameters to get out the model generation. Such an approach would abstract away the model details and let you swap models easily.

In [24]:
from transformers import pipeline
from accelerate import Accelerator
from openai import OpenAI
import os

class ModelGenerator:
    def __init__(self, model_type, model_name):
        self.model_type = model_type
        self.model_name = model_name
        self.client = None

        if self.model_type == 'huggingface':
            # Initialize HuggingFace pipeline
            try:
                self.device = Accelerator().device
                self.client = pipeline("text-generation", model=self.model_name, device=self.device)
            except Exception as e:
                # Chain the original exception so the root cause stays in the traceback
                raise RuntimeError(f"Failed to initialize HuggingFace pipeline for model: {model_name}") from e
        elif self.model_type == 'openrouter':
            # Initialize OpenRouter client (using OpenAI SDK compatibility)
            try:
                self.client = OpenAI(
                    base_url="https://openrouter.ai/api/v1",
                    api_key=os.environ.get('OPENROUTER_API_KEY'),
                )
            except Exception as e:
                raise RuntimeError(f"Failed to initialize OpenRouter client for model: {model_name}") from e
        else:
            raise ValueError("model_type must be 'huggingface' or 'openrouter'")

    def generate(self, prompt, **kwargs):
        if self.client is None:
            raise RuntimeError("Model client not initialized.")

        if self.model_type == 'huggingface':
            # HuggingFace pipeline generation
            output = self.client(
                prompt,
                return_full_text=False,
                **kwargs
            )
            return output[0]['generated_text'].strip()
        elif self.model_type == 'openrouter':
            # OpenRouter API generation
            completion = self.client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {
                        "role": "user",
                        "content": prompt
                    }
                ],
                **kwargs
            )
            return completion.choices[0].message.content
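Once generation sits behind a single `generate` method, the evaluation loop no longer needs to know which backend it is talking to. The sketch below shows that idea with a hypothetical `collect_predictions` helper and a stub `EchoGenerator` standing in for a real model, so it runs without a GPU or an API key.

```python
class EchoGenerator:
    """Hypothetical stand-in for a real model: always returns the same answer."""

    def generate(self, prompt, **kwargs):
        return "The capital of France is Paris."


def collect_predictions(generator, test_cases, **gen_kwargs):
    """Run every prompt through generator.generate and pair outputs with references."""
    predictions, references = [], []
    for item in test_cases:
        predictions.append(generator.generate(item["prompt"], **gen_kwargs))
        references.append(item["reference"])
    return predictions, references


demo_cases = [
    {
        "prompt": "Question: What is the capital of France?\nAnswer:",
        "reference": "The capital of France is Paris.",
    }
]
predictions, references = collect_predictions(EchoGenerator(), demo_cases)
print(predictions == references)  # True
```

With the class above, the same helper could be called as `collect_predictions(ModelGenerator('openrouter', 'meta-llama/llama-3.3-70b-instruct:free'), test_cases)`, and the two returned lists passed to `rouge.compute` as before.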
