# LLM JSON Outputs

## Ollama

[Ollama](https://ollama.com/) is open source software for running open source large language models (LLMs).

It supports an impressive [model library](https://ollama.com/search).

You can install it on your local device, but we will be running it via Docker on our department's AI server.

Ollama has an API that you can access via `curl` or via its [OpenAI API compatibility](https://ollama.com/blog/openai-compatibility).
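As a rough illustration of the OpenAI-compatible route, the sketch below builds a chat request with only the Python standard library. It assumes Ollama is listening on its default port on the same machine and that a model named `llama3.2` has been pulled; the helper names `build_chat_request` and `send_chat_request` are made up for this example.

```python
import json
import urllib.request

# Assumptions: Ollama is listening on its default port on this machine and a
# model named "llama3.2" has already been pulled.
BASE_URL = "http://127.0.0.1:11434"


def build_chat_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Once the server described in the following sections is up, `print(send_chat_request(build_chat_request("Say hello in five words.")))` should print the model's reply.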

### Run Ollama

Before we get started, check out our GPUs!

In [None]:
# DFEC AI Server
!nvidia-smi

Mon Mar 17 15:18:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   42C    P8             19W /  250W |      18MiB /  32760MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 5000 Ada Gene...    Off |   00

Confirm that the [Ollama Docker image](https://hub.docker.com/r/ollama/ollama) has already been pulled.

A named volume, `ollama`, has also been created to store cached models.

In [2]:
!docker images
!docker volume list

REPOSITORY      TAG       IMAGE ID       CREATED      SIZE
ollama/ollama   latest    b9162cd6df73   3 days ago   3.45GB
DRIVER    VOLUME NAME
local     ollama


#### Run the ollama container

In [None]:
!docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_NUM_PARALLEL=4 --name ollama ollama/ollama serve

b3cd4c7911afd62fbb66216a37ba775540d7ba78dbc99ef1d72857bf8db620ab


Then execute a process inside the running container.

This is **not** an API call; rather, we are running a command inside the container with [`docker exec`](https://docs.docker.com/reference/cli/docker/container/exec/).

In [4]:
!docker exec ollama ollama help

Large language model runner

Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  stop        Stop a running model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.



This **is** an API call. It's a simple GET request to the root and just lets us know Ollama is listening!


In [1]:
# Easy way to check if Ollama is up
!curl http://127.0.0.1:11434

Ollama is running
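The same check can be scripted from Python, which may be handy before sending real requests. This is a minimal sketch using only the standard library; the function name `ollama_is_up` and the two-second default timeout are choices made for this example.

```python
import urllib.request


def ollama_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if a GET to the root answers 200 with the 'Ollama is running' banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and "Ollama is running" in body
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        return False
```

Calling `ollama_is_up()` returns `False` instead of raising when nothing is listening, so it can guard later cells.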

#### Run a model

We can pick any model from the [Ollama library](https://ollama.com/library).

Here are a few good choices:

- llama3.2 --> Small, good for CPU
- llama3.3:70b --> Powerful, but requires massive GPUs
- gemma3:27b --> Powerful, runs on one large GPU

In [None]:
# Tell ollama to allocate the model in memory and make it available for API calls
!docker exec ollama ollama run llama3.2
# Show which models are running and on which processor
!docker exec ollama ollama ps

[?2026h[?25l[1G⠋ [K[?25h[?2026l[?2026h[?25l[1G⠹ [K[?25h[?2026l[?2026h[?25l[1G⠹ [K[?25h[?2026l[?2026h[?25l[1G⠸ [K[?25h[?2026l[?2026h[?25l[1G⠼ [K[?25h[?2026l[?2026h[?25l[1G⠴ [K[?25h[?2026l[?2026h[?25l[1G⠧ [K[?25h[?2026l[?2026h[?25l[1G⠧ [K[?25h[?2026l[?2026h[?25l[1G⠇ [K[?25h[?2026l[?2026h[?25l[1G⠏ [K[?25h[?2026l[?2026h[?25l[1G⠋ [K[?25h[?2026l[?2026h[?25l[1G⠙ [K[?25h[?2026l[?2026h[?25l[1G⠸ [K[?25h[?2026l[?2026h[?25l[1G⠼ [K[?25h[?2026l[?2026h[?25l[1G⠴ [K[?25h[?2026l[?2026h[?25l[1G⠦ [K[?25h[?2026l[?2026h[?25l[1G⠧ [K[?25h[?2026l[?2026h[?25l[1G⠇ [K[?25h[?2026l[?2026h[?25l[1G⠏ [K[?25h[?2026l[?2026h[?25l[1G⠋ [K[?25h[?2026l[?2026h[?25l[1G⠙ [K[?25h[?2026l[?2026h[?25l[1G⠹ [K[?25h[?2026l[?25l[?2026h[?25l[1G[K[?25h[?2026l[2K[1G[?25h[?25l[?25hNAME               ID              SIZE      PROCESSOR    UNTIL              
llama3.2:latest    a80c4f17acd5    3.5 GB

## Chat with Model

We will use the [ollama-python](https://github.com/ollama/ollama-python) library to connect to the Ollama API.

This example shows a simple streamed response.

In [27]:
%pip install -q ollama requests

Note: you may need to restart the kernel to use updated packages.


In [9]:
from ollama import chat

model = "llama3.2"

stream = chat(
    model=model,
    stream=True,
    messages=[
        {
            "role": "user",
            "content": "Hi, briefly tell me about yourself!",
        }
    ],
)

for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

Nice to meet you! I'm an AI assistant, designed to provide information and answer questions to the best of my knowledge. My primary goal is to help users like you with their queries, whether it's on a wide range of topics or just for fun.

I don't have personal experiences, emotions, or a physical presence, but I'm always here and ready to chat! I've been trained on vast amounts of text data, which allows me to generate responses that are often informative, accurate, and (hopefully) engaging.

What about you? What brings you here today?
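If you also want the full reply as a single string, for example to log it, the chunks can be collected while they are printed. The sketch below assumes each chunk supports the same `chunk["message"]["content"]` access used in the loop above; `collect_stream` is a hypothetical helper, and the fake chunks stand in for a live stream.

```python
from typing import Any, Iterable, Mapping


def collect_stream(chunks: Iterable[Mapping[str, Any]], echo: bool = True) -> str:
    """Optionally print each chunk as it arrives and return the concatenated reply."""
    parts = []
    for chunk in chunks:
        text = chunk["message"]["content"]
        if echo:
            print(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


# Stand-in chunks shaped like the ones iterated above; with a live server,
# pass the object returned by chat(..., stream=True) instead.
fake_stream = [
    {"message": {"content": "Nice to "}},
    {"message": {"content": "meet you!"}},
]
reply = collect_stream(fake_stream, echo=False)
```

A stream can only be consumed once, so call the helper on a fresh `chat(..., stream=True)` result rather than on one that a `for` loop has already exhausted.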

### The System Role

Most people who have used LLMs have only done so through a web browser, which is a walled garden.
The API exposes options that the browser interface hides.

For example, the `system` role allows you to provide instructions to the model that will be followed when responding to the user.

The goal for this demo is to allow a user to ask for the weather at a city, location, or airport via [wttr.in](https://github.com/chubin/wttr.in),
which offers a simple GET endpoint for the weather.

From those docs, you can get weather from...

- City: `curl wttr.in/Salt+Lake+City`
- Location: `curl wttr.in/~Vostok+Station`
- Airport: `curl wttr.in/muc`

Notice the subtle things such as the `~` that prefixes locations or the `+` instead of spaces.
Without a massive database, it's actually *exceedingly difficult* to translate text to these formats via a deterministic algorithm.
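To see where the difficulty lies, note that the formatting itself is easy once you already know which kind of place you have. The sketch below uses a hypothetical helper, `to_wttr_path`, whose `kind` argument must be supplied by the caller; deciding that `kind` from free text is the part that resists a simple deterministic rule.

```python
def to_wttr_path(kind: str, name: str) -> str:
    """Format a place for wttr.in once its kind is already known."""
    cleaned = "+".join(name.split())  # collapse whitespace, join words with +
    if kind == "city":
        return cleaned
    if kind == "location":
        return "~" + cleaned
    if kind == "airport":
        return cleaned.lower()
    raise ValueError(f"unknown kind: {kind!r}")


print(to_wttr_path("city", "Salt Lake City"))      # Salt+Lake+City
print(to_wttr_path("location", "Vostok Station"))  # ~Vostok+Station
print(to_wttr_path("airport", "MUC"))              # muc
```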

However, the `system` role lets us do this stochastically with the LLM!
We will also set the `temperature` option to `0` to decrease randomness and increase consistency.
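For intuition about why a low temperature gives more consistent output, here is a toy sketch of temperature-scaled softmax, the way temperature is commonly described for sampling. The scores are made up, and treating `0` as a pure argmax is an assumption of this example; Ollama's internal sampling may differ in detail.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the highest score
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature drops, more of the probability mass moves to the top-scoring token, which is why repeated runs tend to agree.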

The example below has several `user` messages that you can comment/uncomment to test.

```{note}
We need to use a larger, more capable model for these more complex prompts!

As such, this may not work on your machine.
```

In [20]:
model = "gemma3:12b"

response = chat(
    model=model,
    options={"temperature": 0},  # Produce more consistent results
    messages=[
        {
            "role": "system",
            "content": """
            The user is going to ask for weather at a city, location, or airport.
            Your job is to return **only** that string, for programmatic ingest into wttr.in.
            Use the following formats:
            - if city, return the city name with + instead of space. Example: Rio+Rancho
            - if location, such as geographic feature or landmark, prefix with a tilde. Example: ~Carlsbad+Caverns
            - if airport, return the three letter airport code. Example: abq
            - if user asks for anything else return 'user_error'.""",
        },
        # {"role": "user", "content": "What's the weather in New Orleans?"}
        {"role": "user", "content": "What's the weather at O'Hare airport?"},
        # {"role": "user", "content": "Tell me the US Air Force Academy weather."},
        # {"role": "user", "content": "Tell me about the Civil War."} # Should return user_error
    ],
)
location = response.message.content.strip()  # Drop any stray whitespace or newline
print(location)

ord
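Because the model's reply is inserted straight into a URL, it may be worth checking its shape first. The sketch below is one possible guard, not part of the original demo: `clean_location` and the regular expression are assumptions about what a well-formed reply looks like, and the pattern is deliberately strict (it rejects non-ASCII place names, for instance).

```python
import re

# Assumed shape of a valid reply: optional leading ~, then words made of
# letters, digits, periods, apostrophes, or hyphens, joined by +.
_WTTR_PATTERN = re.compile(r"~?[A-Za-z0-9.'\-]+(?:\+[A-Za-z0-9.'\-]+)*")


def clean_location(raw: str) -> str:
    """Return a wttr.in-safe location, or 'user_error' if the reply looks wrong."""
    candidate = raw.strip()
    if candidate == "user_error" or not _WTTR_PATTERN.fullmatch(candidate):
        return "user_error"
    return candidate
```

Passing the reply through `clean_location` before building the URL means a chatty or malformed answer falls into the same `user_error` branch as an off-topic question.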


We will then feed that `location` value from the LLM into an HTTP GET request!

In [21]:
import requests

if location != "user_error":
    url = f"https://wttr.in/{location}"
    print(f"Getting weather for {location}")
    wttr_response = requests.get(url, timeout=10)  # Avoid hanging if wttr.in is slow
    if wttr_response.status_code == 200:
        print(wttr_response.text)
    else:
        print(wttr_response.status_code)
else:
    print("Must ask for weather at city, location, or airport.")

Getting weather for ord
Weather report: ord

     \  /       Partly cloudy
   _ /"".-.     +77(78) °F
     \_(   ).   ↗ 17 mph
     /(___(__)  9 mi
                0.0 in
                                                       ┌─────────────┐                                                       
┌──────────────────────────────┬───────────────────────┤  Fri 28 Mar ├───────────────────────┬──────────────────────────────┐
│            Morning           │             Noon      └──────┬──────┘     Evening           │             Night            │
├──────────────────────────────┼──────────────────────────────┼──────────────────────────────┼──────────────────────────────┤
│ [38;5;226m _`/""[38;5;250m.-.    [0m Thundery outbr…│ [38;5;226m    \   /    [0m Sunny          │ [38;5;226m    \   /    [