# CS 195: Natural Language Processing
## Large Language Models via Web API

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/s26-CS195NLP/blob/main/F2_2_LLMAPI.ipynb)

## References


* [Hugging Face Chat Basics](https://huggingface.co/docs/transformers/en/conversations)
* [OpenAI Developer Quickstart](https://developers.openai.com/api/docs/quickstart)


## Install Modules

Today we'll use `transformers` along with the `openai` package, which provides access to GPT models through OpenAI's API.


In [None]:
import sys
!{sys.executable} -m pip install transformers accelerate 
!{sys.executable} -m pip install openai --upgrade

## Review: Hugging Face code for setting up a chat model

In [None]:
from transformers import pipeline
from accelerate import Accelerator

device = Accelerator().device

chatbot = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct", device = device)

To run inference with a chat model, pass in the entire chat history.

In [6]:
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gravity in one paragraph."},
]

response = chatbot(chat_history)
response

[{'generated_text': [{'role': 'system',
    'content': 'You are a helpful assistant.'},
   {'role': 'user', 'content': 'Explain gravity in one paragraph.'},
   {'role': 'assistant',
    'content': "Gravity is a fundamental force of nature that attracts all objects with mass towards each other. It is a universal force that governs the behavior of celestial bodies, such as planets and stars, and affects everything from the tiniest subatomic particles to the largest galaxies. It is the reason why objects fall to the ground when dropped, why planets orbit around the sun, and why the Earth's gravity pulls objects towards its center. Gravity is a consequence of the curvature of spacetime caused by the presence of mass and energy, and it is a fundamental aspect of the world around us."}]}]
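Because the pipeline returns the whole conversation, continuing the chat means pulling out the newest assistant message and appending it (plus the next user message) to the history. The sketch below assumes the output shape shown above; `last_assistant_message` and `add_turn` are hypothetical helper names, and `example_output` is a hand-written stand-in so the code runs without loading a model.

```python
def last_assistant_message(pipeline_output):
    """Return the text of the final message in a chat pipeline result."""
    # The pipeline returns a list with one dict per input conversation;
    # 'generated_text' holds the full history, ending with the new reply.
    messages = pipeline_output[0]["generated_text"]
    return messages[-1]["content"]


def add_turn(chat_history, assistant_reply, next_user_message):
    """Return a new history that includes the reply and a follow-up message."""
    return chat_history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


# Hand-written stand-in shaped like the `response` printed above.
example_output = [{"generated_text": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gravity in one paragraph."},
    {"role": "assistant", "content": "Gravity is a fundamental force of nature."},
]}]

reply = last_assistant_message(example_output)
history = example_output[0]["generated_text"][:2]
next_history = add_turn(history, reply, "Now explain it to a five-year-old.")
print(reply)
```

With a real model, `next_history` would be passed back to `chatbot(...)` for the next turn.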

## Group Exercise: A/B Test for Human Evaluation

Leftover exercise from last time:
* Come up with 5 language model prompts - what are some questions/instructions you think would help you decide how good a language model is?
* Test them using [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) and [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
* Have each person in your group vote on which model gave the better response to each prompt
* Write down the results
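One way to write down the results is to tally the ballots in code. This is only a sketch: `tally_votes` is a hypothetical helper, the model labels are shorthand, and the votes are placeholders rather than real results.

```python
from collections import Counter


def tally_votes(votes_by_prompt):
    """Count how many votes each model received across all prompts."""
    totals = Counter()
    for votes in votes_by_prompt.values():
        totals.update(votes)
    return totals


# Placeholder votes for illustration only; replace with your group's ballots.
votes_by_prompt = {
    "prompt 1": ["SmolLM2", "Qwen2.5", "Qwen2.5"],
    "prompt 2": ["SmolLM2", "SmolLM2", "Qwen2.5"],
}

totals = tally_votes(votes_by_prompt)
for model, count in totals.most_common():
    print(f"{model}: {count} votes")
```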

## Applied Exploration

Choose two instruct models of similar size: https://huggingface.co/models?pipeline_tag=text-generation&sort=trending&search=instruct
  * Link to the model cards for the models you're using and describe each of them

Do one of the following:

1. Go to https://huggingface.co/datasets, find a dataset suitable to use as a benchmark, and compare the performance of the two models on it. It doesn't have to be a conversational benchmark: a text classification, summarization, or math dataset (among others) works as long as you can instruct the model to answer it. You also don't have to use the whole dataset.
    * link to and describe the dataset
    * describe how you compared the performance (e.g., what metric did you use?)
    * report the results

OR

2. Come up with your own fun benchmark (Taylor Swift trivia, AI Dungeon Master, Joke telling, etc.), generate responses for both models, and have another person rate the answers.
    * Describe what you did
    * report the results
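For option 1, a simple metric on classification-style or short-answer datasets is exact-match accuracy. The sketch below is one possible choice and not a required method; `normalize` and `exact_match_accuracy` are hypothetical helpers, and the predictions and references are placeholders standing in for model outputs and dataset labels.

```python
def normalize(text):
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(text.lower().split()).strip(".!? ")


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference after normalizing."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return matches / len(references)


# Placeholder data: in practice, predictions come from each model
# and references come from the dataset's label column.
predictions = ["Positive.", "negative", "neutral"]
references = ["positive", "negative", "positive"]

accuracy = exact_match_accuracy(predictions, references)
print(f"accuracy: {accuracy:.2f}")
```

Running the same function on both models' predictions gives two directly comparable numbers. Open-ended tasks such as summarization need a different metric.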


## Working with Large Language Models

Large language models typically need more memory and compute than a single consumer CPU or GPU provides, so you usually run them through a web API that performs the inference on cloud computing resources.

Cloud computing companies like Amazon Web Services, Google Cloud Platform, and Microsoft Azure all provide ways to run LLM inference on their servers.

Hugging Face provides an API for the models it hosts.

AI companies like OpenAI provide APIs for their own models.

Here's an example using the `openai` Python package:

In [2]:
from openai import OpenAI
client = OpenAI(api_key="REPLACE_WITH_YOUR_API_KEY")

chat_history = [
    {"role": "system", "content": "You are a college academic advising assistant."},
    {"role": "user", "content": "What are the three most important classes for a CS major to take?"},
]

response = client.responses.create(
    model="gpt-5.2",
    input=chat_history
)

print(response.output_text)

The “most important” three depend on your program, but across most CS degrees (and for internships/jobs), these three are the core that everything else builds on:

1) **Data Structures & Algorithms**
- What you gain: Big-O analysis, common data structures (arrays, hash tables, trees, graphs), algorithm design (greedy, divide-and-conquer, dynamic programming).
- Why it matters: This is the backbone of technical interviews and a prerequisite for many upper-division electives.

2) **Computer Systems (Computer Organization / Systems Programming)**
- What you gain: How software runs on hardware—C/assembly basics, memory (stack/heap), pointers, caching, processes/threads, concurrency, and often a bit of networking/OS concepts.
- Why it matters: Makes you a much stronger programmer/debugger and is crucial for performance, reliability, and understanding real-world systems.

3) **Discrete Mathematics (and/or Theory of Computation)**
- What you gain: Logic, proofs, sets, relations, induction, co

## The Response object

Let's look at the response object that OpenAI returns.

In [3]:
response

Response(id='resp_07e59d7459e3055100698a4b576e648190929f33ab3eef5990', created_at=1770670935.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-5.2-2025-12-11', object='response', output=[ResponseOutputMessage(id='msg_07e59d7459e3055100698a4b57e7e081909bdf98919102e014', content=[ResponseOutputText(annotations=[], text='The “most important” three depend on your program, but across most CS degrees (and for internships/jobs), these three are the core that everything else builds on:\n\n1) **Data Structures & Algorithms**\n- What you gain: Big-O analysis, common data structures (arrays, hash tables, trees, graphs), algorithm design (greedy, divide-and-conquer, dynamic programming).\n- Why it matters: This is the backbone of technical interviews and a prerequisite for many upper-division electives.\n\n2) **Computer Systems (Computer Organization / Systems Programming)**\n- What you gain: How software runs on hardware—C/assembly basics, memory (stack/heap), poi
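The nesting in that printout (`output` → message → `content` → `text`) can be walked by hand, which is roughly what `response.output_text` saves you from doing. The sketch below uses a `SimpleNamespace` stand-in instead of a real API call. It assumes message items carry `type='message'` and text parts carry `type='output_text'`, which the truncated printout above does not show. `collect_output_text` is a hypothetical helper.

```python
from types import SimpleNamespace


def collect_output_text(response):
    """Concatenate the text parts of every message item in response.output."""
    parts = []
    for item in response.output:
        # Skip anything that is not a message (e.g. tool calls).
        if getattr(item, "type", None) != "message":
            continue
        for part in item.content:
            if getattr(part, "type", None) == "output_text":
                parts.append(part.text)
    return "".join(parts)


# Stand-in object mimicking the structure printed above (assumed fields).
fake_response = SimpleNamespace(
    model="gpt-5.2-2025-12-11",
    output=[
        SimpleNamespace(
            type="message",
            content=[
                SimpleNamespace(
                    type="output_text",
                    text="1) Data Structures & Algorithms",
                    annotations=[],
                )
            ],
        )
    ],
)

text = collect_output_text(fake_response)
print(text)
```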

## LLM Tools

Once it became clear that LLMs could generate code, people realized that the code a model writes could simply be *run automatically*.

This opens the door to letting LLMs **do things** besides generating text:
* search the web
* perform mathematical computations
* search for data in files
* run functions written by a programmer

You can provide access to these things using the `tools` parameter when submitting a response request:

In [6]:
chat_history = [
    {"role": "system", "content": "You are a helpful assistant who searches for and answers questions about Drake University. Do not answer questions about topics other than Drake University."},
    {"role": "user", "content": "How are the basketball teams doing this year?"},
]

response = client.responses.create(
    model="gpt-5.2",
    tools=[{"type": "web_search"}],
    input=chat_history
)

print(response.output_text)

Here’s how **Drake’s basketball teams are doing in the 2025–26 season** (as of the most recently posted records):

## Drake Men’s Basketball (2025–26)
- **Overall:** **12–11** ([sports-reference.com](https://www.sports-reference.com/cbb/schools/drake/?utm_source=openai))  
- **Missouri Valley Conference (MVC):** **6–6** ([sports-reference.com](https://www.sports-reference.com/cbb/schools/drake/?utm_source=openai))  
- **Coach:** **Eric Henderson (1st year)** ([en.wikipedia.org](https://en.wikipedia.org/wiki/2025%E2%80%9326_Drake_Bulldogs_men%27s_basketball_team?utm_source=openai))  

## Drake Women’s Basketball (2025–26)
- **Overall:** **6–15** ([en.wikipedia.org](https://en.wikipedia.org/wiki/2025%E2%80%9326_Drake_Bulldogs_women%27s_basketball_team?utm_source=openai))  
- **MVC:** **5–6** ([en.wikipedia.org](https://en.wikipedia.org/wiki/2025%E2%80%9326_Drake_Bulldogs_women%27s_basketball_team?utm_source=openai))  
- **Coach:** **Allison Pohlman (5th year)** ([en.wikipedia.org](https:

## Response object with tool usage

Notice all of the additional information contained in the response object.

In [7]:
response

Response(id='resp_09a5e6fbd3b3d9e600698a4cef02188190bfd83fd36a516edb', created_at=1770671343.0, error=None, incomplete_details=None, instructions=None, metadata={}, model='gpt-5.2-2025-12-11', object='response', output=[ResponseFunctionWebSearch(id='ws_09a5e6fbd3b3d9e600698a4cef793881909f21dcfb82e74256', action=ActionSearch(query='sports: {"tool":"sports","fn":"standings","league":"ncaamb","team":"DRKE"}', type='search', queries=['sports: {"tool":"sports","fn":"standings","league":"ncaamb","team":"DRKE"}'], sources=None), status='completed', type='web_search_call'), ResponseFunctionWebSearch(id='ws_09a5e6fbd3b3d9e600698a4cf0bf5c819081f587659cd3084b', action=ActionSearch(query="Drake Bulldogs men's basketball 2025-26 record", type='search', queries=["Drake Bulldogs men's basketball 2025-26 record", "Drake Bulldogs women's basketball 2025-26 record", "Drake athletics men's basketball schedule results 2025-26", "Drake athletics women's basketball schedule results 2025-26"], sources=None),

We can zoom in and look specifically at the web searches it performed.

In [17]:
print(response.output[1].action.queries)

["Drake Bulldogs men's basketball 2025-26 record", "Drake Bulldogs women's basketball 2025-26 record", "Drake athletics men's basketball schedule results 2025-26", "Drake athletics women's basketball schedule results 2025-26"]
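Indexing `response.output[1]` works for this particular response, but the position of each search call can change from one request to the next. A more robust sketch filters items by their `type` field (`'web_search_call'` in the printout above). `web_search_queries` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the real response items.

```python
from types import SimpleNamespace


def web_search_queries(response):
    """Collect the queries from every web search call in response.output."""
    queries = []
    for item in response.output:
        if getattr(item, "type", None) != "web_search_call":
            continue
        action = getattr(item, "action", None)
        # Some calls may carry no query list, so fall back to an empty one.
        queries.extend(getattr(action, "queries", None) or [])
    return queries


# Stand-in mimicking the structure printed above (shortened for illustration).
fake_response = SimpleNamespace(output=[
    SimpleNamespace(
        type="web_search_call",
        action=SimpleNamespace(
            type="search",
            queries=["Drake Bulldogs men's basketball 2025-26 record"],
        ),
    ),
    SimpleNamespace(
        type="web_search_call",
        action=SimpleNamespace(
            type="search",
            queries=["Drake Bulldogs women's basketball 2025-26 record"],
        ),
    ),
    SimpleNamespace(type="message", content=[]),
])

all_queries = web_search_queries(fake_response)
for query in all_queries:
    print(query)
```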


## Applied Exploration

Come up with a prompt to use in a model-comparison and prompt-sensitivity experiment, something like:

In [18]:
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain recursion to a sophomore CS student."},
]

Then, create two additional slight variations, one in the system prompt (e.g., *expert professor* instead of *helpful assistant*) and one in the user prompt.

Run each of the three variations three times using the [gpt-5.2 model](https://developers.openai.com/api/docs/models/gpt-5.2) (you can enable web search if you want) and record all nine responses.
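A small loop keeps the nine runs organized. The sketch below is one possible layout: `run_experiment` is a hypothetical helper, the user-prompt variation is just an example, and `fake_generate` is a stub so the code runs without an API key.

```python
def run_experiment(variations, generate, repetitions=3):
    """Call generate(chat) `repetitions` times per variation and log each reply."""
    records = []
    for name, chat in variations.items():
        for trial in range(1, repetitions + 1):
            records.append(
                {"variation": name, "trial": trial, "response": generate(chat)}
            )
    return records


system = {"role": "system", "content": "You are a helpful assistant."}
user = {"role": "user", "content": "Explain recursion to a sophomore CS student."}

variations = {
    "baseline": [system, user],
    "system_variation": [
        {"role": "system", "content": "You are an expert professor."},
        user,
    ],
    # Example user-prompt change; pick your own variation.
    "user_variation": [
        system,
        {"role": "user", "content": "Explain recursion to a first-year CS student."},
    ],
}


def fake_generate(chat):
    """Stub that stands in for a model call."""
    return f"(stub reply to: {chat[-1]['content']})"


records = run_experiment(variations, fake_generate)
print(len(records))
```

To use the real model, replace `fake_generate` with a function that returns `client.responses.create(model="gpt-5.2", input=chat).output_text`.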

Answer the following questions:
* When you repeated the request on the same prompt, how different were the responses?
* Were there any meaningful differences among the prompt variations you tried, or were they similar to the differences you noticed across repetitions of the same prompt?
* What changes seem to be the most meaningful?
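Reading the responses side by side is the main point, but a rough number can support your answers. The sketch below averages character-level similarity over every pair of responses using `difflib.SequenceMatcher` from the standard library. It is a crude surface measure that does not capture meaning; `mean_pairwise_similarity` is a hypothetical helper and the sample strings are placeholders.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(responses):
    """Average SequenceMatcher ratio over all pairs (1.0 means identical)."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)


# Placeholder responses; substitute the three replies to one prompt.
same_prompt_replies = [
    "Recursion is when a function calls itself on a smaller input.",
    "Recursion means a function calls itself with a smaller problem.",
    "A recursive function solves a problem by calling itself.",
]

score = mean_pairwise_similarity(same_prompt_replies)
print(f"mean pairwise similarity: {score:.2f}")
```

Comparing the score within one prompt's repetitions against the score across prompt variations gives a rough sense of whether a variation changed more than ordinary run-to-run randomness does.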

Then, repeat the experiment using a small model like [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct). 

* What differences did you notice between the large and small models?