# Building a Streaming Chatbot with Memory and Steering Text (OpenAI)

This notebook implements a Large Language Model (LLM) chatbot, backed by the OpenAI API, with:

- Streaming (token-by-token) responses
- Persistent chat memory
- Steering text (system prompt)
- Clean, readable structure
- Jupyter Notebook compatibility

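The chatbot uses OpenAI's `gpt-4o-mini` model and reads the API key from the `OPENAI_API_KEY` environment variable, loaded from a `.env` file.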

## Importing Required Libraries

```python
import os
import ipywidgets as widgets
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import display, Markdown
```

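- `os` → read the API key from environment variables
- `ipywidgets` → create interactive notebook UI elements (here, a live-updating `Output` widget)
- `load_dotenv` → load variables such as `OPENAI_API_KEY` from a `.env` file
- `OpenAI` → client for the OpenAI API
- `display`, `Markdown` → render output as formatted Markdown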
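
If any of these imports fail, the missing packages can be installed from PyPI. The sketch below uses a hypothetical helper, `missing_packages` (not part of any library), to list which modules are unavailable, assuming the usual PyPI names `openai`, `python-dotenv`, `ipywidgets` and `ipython`.

```python
import importlib.util

# Map each module imported above to the PyPI package assumed to provide it.
REQUIRED = {
    "openai": "openai",
    "dotenv": "python-dotenv",
    "ipywidgets": "ipywidgets",
    "IPython": "ipython",
}


def missing_packages(required=REQUIRED):
    """Return the pip package names whose modules cannot be found."""
    return [
        package
        for module, package in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
```

`find_spec` only locates a module without importing it, so the check is cheap and has no side effects.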

## Configuration Constants

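```python
# Load variables such as OPENAI_API_KEY from a local .env file
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

MODEL = "gpt-4o-mini"
# API client used by stream_response below
openai = OpenAI(api_key=api_key)
```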
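
`os.getenv` silently returns `None` when a variable is missing, so a misnamed or absent key is easier to diagnose if the notebook stops with a clear message right where the key is read. A minimal sketch, using a hypothetical `require_env` helper that is not part of `python-dotenv` or the OpenAI SDK:

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value
```

With this helper, `api_key = require_env("OPENAI_API_KEY")` could stand in for the `os.getenv` call above.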

```python
SYSTEM_PROMPT = (
    "You are a helpful, concise, and technically precise AI assistant. "
    "Explain concepts clearly, avoid unnecessary verbosity, and use "
    "structured reasoning when appropriate."
)
```

- The system prompt controls tone, style, and reasoning
- It is not shown in the chat, but it is sent with every request and therefore affects all responses
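
Because the steering text is simply the first message in the conversation history, changing the assistant's behavior mid-session amounts to replacing that message. A minimal sketch, using a hypothetical `set_system_prompt` helper that returns a new list and leaves the original history untouched:

```python
def set_system_prompt(messages, prompt):
    """Return a copy of the history with a single system message at the front.

    Any existing system messages are dropped; other messages keep their order.
    """
    rest = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": prompt}] + rest


history = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hi"},
]
pirate = set_system_prompt(history, "Answer like a pirate.")
print(pirate[0]["content"])  # Answer like a pirate.
```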

## Initializing Chat Memory

```python
messages = [
    {"role": "system", "content": SYSTEM_PROMPT}
]
```

- `messages` stores the conversation history
- Roles: `system` (steering), `user` (input), `assistant` (LLM output)
- Sending the full history with every request lets the model maintain context, since each API call is otherwise independent
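
Since the whole history is resent on every turn, requests grow with each exchange and will eventually exceed the model's context window. One common mitigation is to send only a recent window of messages. A minimal sketch, using a hypothetical `trim_history` helper that counts messages rather than tokens (a simplification; a token-based budget would be more precise):

```python
def trim_history(messages, max_messages):
    """Keep every system message plus the last `max_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # dialogue[-0:] would return the whole list, so handle 0 explicitly
    recent = dialogue[-max_messages:] if max_messages > 0 else []
    return system + recent


history = [{"role": "system", "content": "Be concise."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

window = trim_history(history, max_messages=4)
print([m["content"] for m in window])
# ['Be concise.', 'question 3', 'answer 3', 'question 4', 'answer 4']
```

In the chat loop, one could pass `trim_history(messages, 20)` to `stream_response` while still storing the complete history in `messages`; the limit of 20 is an arbitrary example.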

## Streaming Response Helper Function

```python
def stream_response(messages):
    """
    Streams a response from the OpenAI API and displays it live.
    Returns the full assistant response as text.
    """
    # Output widget that is re-rendered as new text arrives
    output = widgets.Output()
    display(output)

    full_response = ""

    with output:
        # stream=True returns an iterator of chunks instead of one complete reply
        response = openai.chat.completions.create(
            model=MODEL,
            messages=messages,
            stream=True,
        )

        for chunk in response:
            # Skip chunks without choices or without text
            if not chunk.choices:
                continue
            token = chunk.choices[0].delta.content
            if token:
                full_response += token
                # Redraw the accumulated text as Markdown
                output.clear_output(wait=True)
                display(Markdown(full_response))

    return full_response
```

- Creates a live `Output` widget in Jupyter
- Sends a streaming request to the OpenAI Chat Completions API
- Iterates over the streamed chunks, appending each text fragment and re-rendering the accumulated Markdown
- Returns the full accumulated response
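
The accumulation logic can be exercised offline by feeding it objects shaped like streamed chunks. In this sketch, `fake_stream` and `accumulate` are hypothetical names; the chunk shape (`chunk.choices[0].delta.content`) mirrors the attributes the helper above reads, and `SimpleNamespace` stands in for the SDK's chunk objects.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Yield stand-in chunks exposing chunk.choices[0].delta.content."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate(stream, on_update=print):
    """Concatenate text deltas, calling on_update with the text received so far."""
    full_response = ""
    for chunk in stream:
        # Some chunks carry no text (content is empty or None), so check first
        if chunk.choices and chunk.choices[0].delta.content:
            full_response += chunk.choices[0].delta.content
            on_update(full_response)
    return full_response


snapshots = []
text = accumulate(fake_stream(["Hel", "lo", None, ", world"]), snapshots.append)
print(text)  # Hello, world
```

Separating accumulation from rendering this way makes the `on_update` callback the only notebook-specific part; in `stream_response` its role is played by `clear_output` plus `display(Markdown(...))`.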

## Main Chat Loop

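```python
while True:
    user_input = input("You: ").strip()

    # Typing "exit" or "quit" ends the session
    if user_input.lower() in {"exit", "quit"}:
        print("Exiting chat.")
        break

    # Ignore empty input rather than sending a blank message
    if not user_input:
        continue

    # Record the user turn, then send the full history to the model
    messages.append({"role": "user", "content": user_input})

    assistant_reply = stream_response(messages)

    # Store the reply so later turns can refer back to it
    messages.append({"role": "assistant", "content": assistant_reply})
```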

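- Reads user input until `exit` or `quit` is entered
- Appends each user message to `messages` before calling the model
- Streams the response live to the notebook
- Saves the assistant's reply to memory for future context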
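
The loop's bookkeeping can be tested without `input()` or the API by injecting both. In this sketch, `run_chat` and `echo` are hypothetical: `run_chat` takes an iterable of user inputs, and `respond` stands in for `stream_response` (it receives the message list and returns reply text).

```python
def run_chat(inputs, respond, system_prompt="You are a helpful assistant."):
    """Run the chat loop over scripted inputs and return the final history."""
    messages = [{"role": "system", "content": system_prompt}]
    for raw in inputs:
        user_input = raw.strip()
        if user_input.lower() in {"exit", "quit"}:
            break
        if not user_input:
            continue  # skip blank lines instead of sending empty messages
        messages.append({"role": "user", "content": user_input})
        reply = respond(messages)
        messages.append({"role": "assistant", "content": reply})
    return messages


def echo(messages):
    # Stand-in "model": reports how many user turns it has seen so far
    turns = sum(1 for m in messages if m["role"] == "user")
    return f"turn {turns}: {messages[-1]['content']}"


transcript = run_chat(["hi", "  ", "again", "quit", "ignored"], echo)
for m in transcript:
    print(m["role"], "->", m["content"])
```

Because `echo` sees the growing `messages` list, the turn count it reports confirms that each call receives the full history, as in the notebook loop.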

| Component     | Purpose                                          |
| ------------- | ------------------------------------------------ |
| System prompt | Behavioral steering                              |
| Messages list | Conversation context sent with each request      |
| Streaming     | Displaying tokens as the model generates them    |
| Widgets       | Live notebook rendering                          |
| Memory        | Multi-turn conversation                          |
