# Building a Streaming Chatbot with Memory and Steering Text (Ollama + LLaMA 3)

This notebook implements a local large language model (LLM) chatbot with:

- Streaming (token-by-token) responses
- Chat memory that persists across turns within a session
- Steering text (system prompt)
- Clean, readable structure
- Jupyter Notebook compatibility

The chatbot runs fully locally using Ollama and LLaMA 3.

## Importing Required Libraries


```python
import requests
import json
import ipywidgets as widgets
from IPython.display import display, Markdown
```

- `requests` → send HTTP requests to the Ollama API
- `json` → parse streaming JSON responses
- `ipywidgets` → create interactive notebook UI elements
- `display`, `Markdown` → render output nicely in Markdown format

## Configuration Constants


```python
OLLAMA_API = "http://localhost:11434/api/chat"
MODEL = "llama3"
HEADERS = {"Content-Type": "application/json"}
```

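- `OLLAMA_API` → chat endpoint of the local Ollama server
- `MODEL` → LLaMA 3 model name; the model must be pulled locally first (for example with `ollama pull llama3`)
- `HEADERS` → declares the request body as JSON (`requests` also sets this header itself when the `json=` argument is used)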
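
Before sending chat requests, it can help to confirm that the server is reachable and the model has been pulled. The sketch below assumes Ollama's `GET /api/tags` endpoint, which lists locally available models, and assumes that a model name without a tag resolves to `:latest`. The names `OLLAMA_TAGS_API`, `model_is_available`, and `check_ollama` are illustrative and not part of the original notebook; it uses only the standard library, though `requests` would work equally well.

```python
import json
import urllib.request

# Assumed endpoint: lists the models available on the local server
OLLAMA_TAGS_API = "http://localhost:11434/api/tags"


def model_is_available(tags_response, model):
    """Return True if `model` appears in a parsed /api/tags response."""
    # Assumption: a name without a tag refers to the ":latest" tag
    wanted = model if ":" in model else f"{model}:latest"
    names = {m.get("name", "") for m in tags_response.get("models", [])}
    return wanted in names


def check_ollama(model, url=OLLAMA_TAGS_API, timeout=5):
    """Query the local server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_is_available(json.load(resp), model)
```

Calling `check_ollama(MODEL)` once at the top of the notebook gives an early, readable failure instead of an error in the middle of the chat loop.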

## Steering Text (System Prompt)


```python
SYSTEM_PROMPT = (
    "You are a helpful, concise, and technically precise AI assistant. "
    "Explain concepts clearly, avoid unnecessary verbosity, and use "
    "structured reasoning when appropriate."
)
```

- The system prompt steers the tone, style, and reasoning approach of the responses
- It is never shown to the user in the chat, but it is sent with every request and so affects all responses
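
Because the steering text is just a string, it can be swapped to change how the assistant behaves. A minimal sketch, assuming a hypothetical helper `build_system_message` and made-up preset names, neither of which is part of the original notebook:

```python
# Hypothetical presets, for illustration only
STEERING_PRESETS = {
    "concise": "You are a helpful, concise, and technically precise AI assistant.",
    "tutor": "You are a patient tutor. Explain each step and check understanding.",
}


def build_system_message(preset="concise", extra_rules=()):
    """Return a system-role message built from a preset plus optional rules."""
    parts = [STEERING_PRESETS[preset], *extra_rules]
    return {"role": "system", "content": " ".join(parts)}
```

For example, `build_system_message("tutor", ["Answer in at most three sentences."])` could be used as the first entry of the conversation history.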

## Initializing Chat Memory


```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT}
]
```

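- `messages` stores the conversation history as a list of role/content dictionaries
- Roles: `system` (steering), `user` (input), `assistant` (LLM output)
- The full history is sent with every request, which is how the model keeps context; the chat endpoint does not remember earlier turns on its own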
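
Since the whole list is resent on every turn, it grows without bound and can eventually exceed the model's context window. One possible mitigation, sketched here as an assumption rather than something the original notebook does, is to keep the system message plus only the most recent messages. The helper name `trim_history` and the `max_messages` limit are hypothetical, and counting messages is only a rough stand-in for counting tokens.

```python
def trim_history(messages, max_messages=20):
    """Keep system messages plus the last `max_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # Guard against max_messages == 0, where dialogue[-0:] would keep everything
    recent = dialogue[-max_messages:] if max_messages > 0 else []
    return system + recent
```

The function returns a new list, so the untrimmed history can still be kept around if needed.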

## Streaming Response Helper Function


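```python
def stream_response(payload):
    """
    Sends a streaming request to Ollama and renders output live.
    Returns the full assistant response as text.
    """
    output = widgets.Output()
    display(output)

    full_response = ""

    # Using the response as a context manager closes the connection afterwards
    with requests.post(
        OLLAMA_API,
        headers=HEADERS,
        json=payload,
        stream=True
    ) as response:
        # Raise on HTTP errors instead of silently returning an empty reply
        response.raise_for_status()

        # Ollama streams newline-delimited JSON, one object per line
        for line in response.iter_lines():
            if not line:
                continue

            try:
                data = json.loads(line.decode("utf-8"))
            except json.JSONDecodeError:
                continue  # skip malformed lines

            token = data.get("message", {}).get("content", "")
            if token:
                full_response += token
                # Re-render the accumulated text inside the widget
                with output:
                    output.clear_output(wait=True)
                    display(Markdown(full_response))

    return full_response
```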

- Creates a live output widget in Jupyter
- Sends a streaming request to Ollama
- Reads the response line by line; each line is a JSON object carrying the next fragment of the reply, which is appended and re-rendered live
- Returns the full accumulated response
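
The parsing step can be separated from the network call and the widget, which makes it easy to try out in isolation. The sketch below assumes each streamed line is a JSON object shaped like `{"message": {"content": "..."}, "done": false}`, with `done` set to `true` on the final object. `iter_tokens` is a hypothetical name and the sample lines are made up for illustration.

```python
import json


def iter_tokens(lines):
    """Yield content fragments from an iterable of NDJSON byte lines."""
    for line in lines:
        if not line:
            continue  # blank line between objects
        try:
            data = json.loads(line.decode("utf-8"))
        except json.JSONDecodeError:
            continue  # skip malformed lines
        token = data.get("message", {}).get("content", "")
        if token:
            yield token
        if data.get("done"):
            break  # final object of the stream


# Made-up sample resembling a streamed chat response
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    b"",
    b"not json",
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print("".join(iter_tokens(sample)))  # prints "Hello"
```

In the notebook, `response.iter_lines()` could be passed in place of `sample`.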

## Main Chat Loop


```python
while True:
    user_input = input("You: ").strip()

    if user_input.lower() in {"exit", "quit"}:
        print("Exiting chat.")
        break

    # Ignore empty input instead of sending a blank message
    if not user_input:
        continue

    # Add user message to memory
    messages.append({"role": "user", "content": user_input})

    payload = {
        "model": MODEL,
        "messages": messages,
        "stream": True
    }

    # Get streamed assistant response
    assistant_reply = stream_response(payload)

    # Save assistant response to memory
    messages.append({"role": "assistant", "content": assistant_reply})
```

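- Reads user input until `exit` or `quit` is entered
- Appends each user message to `messages` and sends the full history to the model
- Streams the response live to the notebook
- Saves the assistant output to memory for future context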
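
One design consideration with this loop: the user message is appended before the request is made, so if the request raises, the history is left with an unanswered turn. A sketch of one way to handle this, assuming a hypothetical `chat_turn` helper that receives the sending function as a parameter (in the notebook this would be `stream_response`):

```python
def chat_turn(messages, user_input, send, model="llama3"):
    """Run one turn; `send` takes a payload dict and returns the reply text."""
    messages.append({"role": "user", "content": user_input})
    payload = {"model": model, "messages": messages, "stream": True}
    try:
        reply = send(payload)
    except Exception:
        messages.pop()  # roll back so the history stays consistent
        raise
    messages.append({"role": "assistant", "content": reply})
    return reply
```

Passing `send` as an argument also lets the turn logic be exercised with a stub function, without a running Ollama server.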

## Summary

| Component     | Purpose                                                  |
| ------------- | -------------------------------------------------------- |
| System prompt | Behavioral steering                                      |
| Messages list | Conversation history sent as context on every request    |
| Streaming     | Incremental delivery of the reply as it is generated     |
| Widgets       | Live notebook rendering                                  |
| Memory        | Multi-turn conversation                                  |
