# LLM JSON Outputs

## Ollama

[Ollama](https://ollama.com/) is open source software for running open source large language models (LLMs).

It supports an impressive [model library](https://ollama.com/search).

You can install it on your local device, but we will be running it via Docker on our department's AI server.

Ollama has an API that you can access via `curl` or via its [OpenAI API compatibility](https://ollama.com/blog/openai-compatibility).
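
As a rough illustration of the native API, the sketch below builds a request for Ollama's `/api/chat` endpoint using only the Python standard library. The host and port (`localhost:11434`), the model name, and the helper names `build_chat_request` and `send_chat` are assumptions made for this example; the request is only sent if you uncomment the last line while a server is running.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # assumption: server on the default port


def build_chat_request(prompt, model="gemma3:27b", host=OLLAMA_HOST):
    """Build (but do not send) a POST request for the native /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]


request = build_chat_request("Why is the sky blue?")
print(request.full_url)
# reply = send_chat(request)  # uncomment once the server is running
```

Separating request construction from sending makes the payload easy to inspect before anything touches the network.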

### Run Ollama

Before we get started, check out our GPUs!

In [1]:
!nvidia-smi

Mon Mar 17 15:18:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   42C    P8             19W /  250W |      18MiB /  32760MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 5000 Ada Gene...    Off |   00

Confirm that the [Ollama Docker image](https://hub.docker.com/r/ollama/ollama) has already been pulled.

Additionally, a named volume called `ollama` has already been created to store cached models.

In [2]:
!docker images
!docker volume list

REPOSITORY      TAG       IMAGE ID       CREATED      SIZE
ollama/ollama   latest    b9162cd6df73   3 days ago   3.45GB
DRIVER    VOLUME NAME
local     ollama


#### Run the ollama container

In [3]:
!docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_NUM_PARALLEL=4 --name ollama ollama/ollama serve

docker: Error response from daemon: Conflict. The container name "/ollama" is already in use by container "850586d5ff906a828b3496feb129f4cffe48eb066381808c418f54305a891f4f". You have to remove (or rename) that container to be able to reuse that name.

Run 'docker run --help' for more information
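
The conflict above means a container named `ollama` already exists, so `docker run` cannot create a second one. One way to make this step safe to re-run is to inspect the container first and only then decide what to do. The sketch below is a minimal version of that idea; the helper names `plan_start` and `ensure_container` are hypothetical, and it assumes the `docker` CLI is on the `PATH`.

```python
import subprocess

CONTAINER = "ollama"
# Same command as the cell above, split into an argument list.
RUN_CMD = [
    "docker", "run", "-d", "--gpus=all",
    "-v", "ollama:/root/.ollama",
    "-p", "11434:11434",
    "-e", "OLLAMA_NUM_PARALLEL=4",
    "--name", CONTAINER, "ollama/ollama", "serve",
]


def plan_start(inspect_returncode, inspect_stdout):
    """Decide which action is needed from the result of `docker container inspect`."""
    if inspect_returncode != 0:
        return "run"    # no such container: create it
    if inspect_stdout.strip() == "true":
        return "none"   # already running: nothing to do
    return "start"      # exists but is stopped


def ensure_container():
    """Inspect the container, then run it, start it, or leave it alone."""
    probe = subprocess.run(
        ["docker", "container", "inspect",
         "--format", "{{.State.Running}}", CONTAINER],
        capture_output=True, text=True,
    )
    action = plan_start(probe.returncode, probe.stdout)
    if action == "run":
        subprocess.run(RUN_CMD, check=True)
    elif action == "start":
        subprocess.run(["docker", "start", CONTAINER], check=True)
    return action


# ensure_container()  # uncomment on a machine with Docker installed
```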


Then execute a process inside the running container.

This is **not** an API call; rather, we are giving a bash command to the container with [`docker exec`](https://docs.docker.com/reference/cli/docker/container/exec/).

In [4]:
!docker exec ollama ollama help

Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.


Ask Ollama to run a model, which loads it onto the GPU and makes it available.

In this case, we will run Google's [gemma3:27b](https://ollama.com/library/gemma3:27b).

Then show which models are running.

In [2]:
!docker exec ollama ollama run gemma3:27b # or  llama3.3:70b
!docker exec ollama ollama ps
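
The same information as `ollama ps` is also exposed over HTTP through the `/api/ps` endpoint. The sketch below queries it and prints a one-line summary per loaded model. The helper names `running_models` and `summarize` and the sample record are made up for illustration, and the code assumes the server listens on `localhost:11434`.

```python
import json
import urllib.request


def running_models(host="http://localhost:11434", timeout=10):
    """Return the list of currently loaded models from GET /api/ps."""
    with urllib.request.urlopen(f"{host}/api/ps", timeout=timeout) as resp:
        return json.load(resp).get("models", [])


def summarize(models):
    """Format each model record as 'name: size in GiB'."""
    lines = []
    for model in models:
        size_gib = model.get("size", 0) / 2**30
        lines.append(f"{model.get('name', '?')}: {size_gib:.1f} GiB")
    return lines


# Hypothetical record, shaped like one entry of the /api/ps response.
sample = [{"name": "gemma3:27b", "size": 20 * 2**30}]
print(summarize(sample))
# print(summarize(running_models()))  # uncomment once the server is running
```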


## Chat with Model

We will use the OpenAI-compatible API for chatting.

```{note}
There is some lag between OpenAI releases and Ollama adoption.

For example, on March 11, 2025 OpenAI released the [Responses API](https://openai.com/index/new-tools-for-building-agents/),
but as of this update, Ollama does not yet support it.
```

In [1]:
%pip install -q openai requests

Note: you may need to restart the kernel to use updated packages.


Configure a client to talk to the LLM server.

In [None]:
from openai import OpenAI
import requests

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required, but unused
)

Request a simple streaming response.

In [28]:
# Demo of a streaming response
response = client.chat.completions.create(
    model="gemma3:27b",
    messages=[{"role": "user", "content": "Hello, tell me about yourself (briefly)!"}],
    stream=True,
)

# Print the tokens as they stream in (a delta's content can be None, so fall back to "")
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Hello there! ✨  I am Gemma, a large language model developed by the Gemma team at Google DeepMind. 🏞 I am an *open-weights AI assistant* – meaning I'm publicly available for use and experimentation. I accept text **and** image as inputs, and will provide only text as output! 😊 

Happy to help however I can within my capabilities!   
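
When the full reply is needed as a single string, for example to parse it afterwards, the streamed pieces can be collected rather than printed. The sketch below shows one way to do that. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only mimic the `choices[0].delta.content` shape used in the loop above so that the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk(None), fake_chunk("world!")]
print(collect_stream(demo))  # Hello, world!
```

With a real client, the `response` iterator from the streaming call could be passed to `collect_stream` in place of `demo`.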


### The System Role

Most people who have used LLMs have done so only through a web browser, which keeps them inside a walled garden.
The API, by contrast, exposes additional options.

For example, the `system` role allows you to provide instructions to the model that will be followed when responding to the user.

The goal for this demo is to allow a user to ask for the weather at a city, location, or airport via [wttr.in](https://github.com/chubin/wttr.in),
which offers a simple GET endpoint for the weather.

From those docs, you can get weather from...

- City: `curl wttr.in/Salt+Lake+City`
- Location: `curl wttr.in/~Vostok+Station`
- Airport: `curl wttr.in/muc`

Notice the subtle details, such as the `~` that prefixes locations and the `+` that replaces spaces.
Without a massive database, it's actually *exceedingly difficult* to translate text to these formats via a deterministic algorithm.
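
To make the difficulty concrete: the string formatting itself is easy once you already know which kind of place the user means. The sketch below, with a hypothetical helper `to_wttr_path`, applies the three formats listed above. What it cannot do is decide whether free text refers to a city, a landmark, or an airport, or look up an airport code. That is the part a deterministic algorithm struggles with.

```python
def to_wttr_path(kind, name):
    """Format a place name for wttr.in, given that its kind is already known."""
    cleaned = name.strip()
    if kind == "airport":
        return cleaned.lower()           # e.g. "MUC" -> "muc"
    joined = "+".join(cleaned.split())   # spaces become "+"
    if kind == "city":
        return joined
    if kind == "location":
        return "~" + joined              # landmarks and features get a "~" prefix
    raise ValueError(f"unknown kind: {kind!r}")


print(to_wttr_path("city", "Salt Lake City"))      # Salt+Lake+City
print(to_wttr_path("location", "Vostok Station"))  # ~Vostok+Station
print(to_wttr_path("airport", "MUC"))              # muc
```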

However, the `system` role lets us do this stochastically with the LLM!
We will also set `temperature` to `0` to decrease randomness and increase consistency.

The example below has several `user` messages that you can comment/uncomment to test.

In [None]:
response = client.chat.completions.create(
    model="gemma3:27b",
    temperature=0,  # Produce more consistent results
    messages=[
        {
            "role": "system",
            "content": """
            The user is going to ask for weather at a city, location, or airport.
            Your job is to return **only** that string, for programmatic ingest into wttr.in.
            Use the following formats:
            - if city, return the city name with + instead of space. Example: Rio+Rancho
            - if location, such as geographic feature or landmark, prefix with a ~. Example: ~Carlsbad+Caverns
            - if airport, return the three letter airport code. Example: abq
            - if user asks for anything else return 'user_error'.""",
        },
        # {"role": "user", "content": "What's the weather in Madrid?"},
        # {"role": "user", "content": "What's the weather at O'Hare airport?"},
        {"role": "user", "content": "Tell me the US Air Force Academy weather."},
        # {"role": "user", "content": "Tell me about the Civil War."},  # Should return user_error
    ],
)
location = response.choices[0].message.content.strip()  # drop stray whitespace or newlines
print(location)

~US+Air+Force+Academy
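
Because the model's reply ends up inside a URL, it is worth checking its shape before using it. The sketch below is one possible guard, not a complete validator. The helper name `clean_location`, the 100-character limit, and the ASCII-only pattern are assumptions made for this example, so place names with accented characters would be rejected.

```python
import re

# Assumed shape: optional "~", then words made of letters, digits, ".", "'" or "-",
# joined by "+".
_WTTR_RE = re.compile(r"~?[A-Za-z0-9.'\-]+(?:\+[A-Za-z0-9.'\-]+)*")


def clean_location(raw):
    """Return a usable location string, or None if the reply is not usable."""
    candidate = raw.strip().strip("'\"`")  # drop whitespace and wrapping quotes
    if candidate == "user_error":
        return None
    if len(candidate) > 100 or not _WTTR_RE.fullmatch(candidate):
        return None
    return candidate


print(clean_location("~US+Air+Force+Academy\n"))     # ~US+Air+Force+Academy
print(clean_location("'user_error'"))                # None
print(clean_location("Sure! Here is the weather."))  # None
```

Returning `None` for both the `user_error` sentinel and malformed replies lets the calling code handle every failure case in one branch.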


We will then feed that `location` value from the LLM into an HTTP GET request!

In [None]:
if location != "user_error":
    wttr_response = requests.get(f"https://wttr.in/{location}", timeout=10)
    if wttr_response.status_code == 200:
        print(wttr_response.text)
    else:
        print(wttr_response.status_code)
else:
    print("Must ask for weather at city, location, or airport.")

Weather report: US+Air+Force+Academy

      \   /     Clear
       .-.      +44(41) °F
    ― (   ) ―   ↗ 8 mph
       `-’      9 mi
      /   \     0.0 in
                                                       ┌─────────────┐                                                       
┌──────────────────────────────┬───────────────────────┤  Mon 17 Mar ├───────────────────────┬──────────────────────────────┐
│            Morning           │             Noon      └──────┬──────┘     Evening           │             Night            │
├──────────────────────────────┼──────────────────────────────┼──────────────────────────────┼──────────────────────────────┤
│     \   /     Sunny          │     \   /     Sunny          │     \   /     Sunny          │     \   /    