# From Speech to Answers with Whisper + Agent Bricks

This notebook demonstrates a practical **speech-to-text** workflow using a Whisper serving endpoint, and then shows how the resulting transcript can be used as input to an **Agent Bricks Knowledge Assistant**.

The goal is to take real voice interactions (e.g., customer calls, voicemails, field notes) and quickly turn them into text that can be summarized, routed, or answered using a RAG-powered assistant—without needing to manually structure the data first.

## Key benefits include
- **Fast speech-to-text with Whisper:** Convert audio (mp3/m4a/wav) into a clean transcript using a production serving endpoint.
- **Seamless integration with Agent Bricks:** Use the transcript as the prompt/context for a Knowledge Assistant to generate helpful answers and next steps.
- **Works with governed storage:** Read audio directly from Unity Catalog Volumes for secure, auditable workflows.
- **Reusable pattern:** Parameterized endpoints + audio paths make it easy to apply the same workflow to many recordings.

# Demo Overview

For this demo, we use:
1. **Audio clips** stored in a Unity Catalog Volume (simulating a customer voice interaction)
2. A **Whisper** model serving endpoint to generate the transcript
3. An **Agent Bricks Knowledge Assistant** endpoint to turn that transcript into an actionable response

Workflow:
1. Load audio bytes from `/Volumes/<catalog>/<schema>/audio/...`
2. Call the **Whisper endpoint** to generate a transcript
3. Call the **Knowledge Assistant endpoint** using the transcript as input

By the end of this notebook, you’ll have an end-to-end pipeline that returns:
- **Transcript**: the speech-to-text output from Whisper
- **Answer**: a Knowledge Assistant response based on the transcript (for example: summary + recommended actions)

This pattern is a strong starting point for voice-driven support automation, call summarization, ticket creation, and knowledge-base grounded Q&A.


### Step 1 — Install dependencies
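We install the Databricks SDK, which is used to call Model Serving endpoints from Python, along with the `openai` package, which the SDK's OpenAI-compatible client in Step 5 depends on. After installing, we restart Python so the notebook uses the updated packages.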


In [0]:
%pip install -U databricks-sdk openai
dbutils.library.restartPython()

### Step 2 — Configure endpoints and audio input
Here we set the Whisper endpoint name, the Knowledge Assistant endpoint name, and the audio path (a file stored in a Unity Catalog Volume). Edit these constants to match your workspace. The audio path can also be exposed as a notebook widget so you can switch files without editing code.

In [0]:
import base64
from typing import Any, Dict

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import DataframeSplitInput

# =========================
# CONFIG (EDIT THESE)
# =========================
WHISPER_ENDPOINT = "whisper"
KNOWLEDGE_ASSISTANT_ENDPOINT = "ka-87f4a3bd-endpoint"
AUDIO_PATH = "/**/audio.mp3"
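
One way to expose the audio path as a widget is sketched below. It assumes the notebook-provided `dbutils` object is passed in explicitly; the helper name `audio_path_from_widget` and the widget name `audio_path` are illustrative and not part of the original notebook.

```python
from typing import Any


def audio_path_from_widget(dbutils: Any, default_path: str, name: str = "audio_path") -> str:
    """Create (or reuse) a text widget and return its current value.

    `dbutils` is the object Databricks injects into notebooks. Passing it in
    keeps this helper usable and testable outside a notebook.
    """
    # Arguments: widget name, default value, label shown in the notebook UI.
    dbutils.widgets.text(name, default_path, "Audio path")
    value = dbutils.widgets.get(name).strip()
    # Fall back to the default if the widget was cleared.
    return value or default_path
```

In a notebook this could be used as `AUDIO_PATH = audio_path_from_widget(dbutils, AUDIO_PATH)` right after the config block.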

### Step 3 — File helpers to read audio from Volumes, DBFS, or local paths
Before calling Whisper, we need the raw audio bytes.  
This cell provides small utilities to normalize different Databricks path formats (like `dbfs:/`, `file:/`, or `/Volumes/...`) into something Python can `open()`, then reads the audio file into memory.


In [0]:
# =========================
# File helpers (DBFS / Volumes / local)
# =========================
def _normalize_path(path: str) -> str:
    # allow dbfs:/... paths
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    # allow file:/... paths
    if path.startswith("file:/"):
        return path[len("file:"):]
    # /Volumes/... and local paths work as-is
    return path

def read_audio_bytes(audio_path: str) -> bytes:
    p = _normalize_path(audio_path)
    with open(p, "rb") as f:
        return f.read()
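
Because the whole file is read into memory and then base64-encoded, a pre-flight check can catch bad inputs before the endpoint is called. The sketch below is optional and hypothetical: `check_audio_file`, `base64_size`, and the default `max_payload_bytes` are illustrative names and values, not documented limits. Set the limit to whatever your serving endpoint actually accepts.

```python
import os

# Formats named earlier in this notebook.
ALLOWED_EXTENSIONS = {".mp3", ".m4a", ".wav"}


def base64_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes raw bytes."""
    return 4 * ((n_bytes + 2) // 3)


def check_audio_file(path: str, max_payload_bytes: int = 10 * 1024 * 1024) -> int:
    """Validate an already-normalized path and return the raw file size.

    max_payload_bytes is a placeholder value (an assumption), not a
    documented serving limit.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported audio extension {ext!r}: {path}")
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"Audio file is empty: {path}")
    encoded = base64_size(size)
    if encoded > max_payload_bytes:
        raise ValueError(
            f"Encoded audio is {encoded} bytes, above the configured limit "
            f"of {max_payload_bytes} bytes: {path}"
        )
    return size
```

Base64 encoding grows the payload by roughly one third, which is why the check compares the encoded size rather than the size on disk.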

### Step 4 — Transcribe audio with Whisper and extract the transcript
This cell handles the core speech-to-text step.

- We base64-encode the audio bytes and send them to the Whisper serving endpoint.
- Because endpoint input signatures can vary, we try a few common request formats:
  1) `dataframe_split` with a positional column **0** (int) — matches models logged with an input like `[0: binary]`
  2) `dataframe_split` with column `"0"` (string) — sometimes used depending on how the model was logged
  3) `inputs=[...]` — a tensor-style fallback

After the endpoint returns, `extract_transcript()` normalizes the response into a single transcript string (handling common shapes like `predictions[0]["text"]`, `predictions[0]["transcript"]`, or `predictions[0]`).
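
To make the three request formats concrete, the sketch below builds the JSON bodies that roughly correspond to each attempt. It uses placeholder bytes instead of real audio and only standard-library calls; how the SDK serializes a request internally is an assumption here, so treat these as illustrations of the shapes rather than exact wire payloads.

```python
import base64
import json

# Placeholder bytes standing in for real audio content.
audio_b64 = base64.b64encode(b"fake-audio-bytes").decode("utf-8")

# 1) dataframe_split with an integer column label.
payload_int_column = {"dataframe_split": {"columns": [0], "data": [[audio_b64]]}}

# 2) dataframe_split with a string column label.
payload_str_column = {"dataframe_split": {"columns": ["0"], "data": [[audio_b64]]}}

# 3) Tensor-style input: a flat list of values.
payload_inputs = {"inputs": [audio_b64]}

for label, payload in [
    ("int column", payload_int_column),
    ("str column", payload_str_column),
    ("inputs", payload_inputs),
]:
    print(f"{label}: {json.dumps(payload)}")
```

The first two bodies differ only in the type of the column label (`0` versus `"0"`), which is enough for a strict model signature to accept one and reject the other.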


In [0]:
# =========================
# Whisper endpoint call
# =========================
def call_whisper_endpoint(w: WorkspaceClient, audio_bytes: bytes) -> Any:
    audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")

    # IMPORTANT:
    # Your model signature expects input [0: binary], where 0 is *positional* (numeric).
    # So we must send columns=[0] (int), not ["0"] (str).
    try:
        return w.serving_endpoints.query(
            name=WHISPER_ENDPOINT,
            dataframe_split=DataframeSplitInput(
                columns=[0],          # <-- int (fixes missing [0] / extra ['0'])
                data=[[audio_b64]],
            ),
        )
    except Exception as e1:
        # Fallback 1: some models are logged expecting string "0"
        try:
            return w.serving_endpoints.query(
                name=WHISPER_ENDPOINT,
                dataframe_split=DataframeSplitInput(
                    columns=["0"],
                    data=[[audio_b64]],
                ),
            )
        except Exception as e2:
            # Fallback 2: some endpoints are tensor-style; try `inputs`
            # (If your SDK is very old, this might raise "unexpected keyword argument 'inputs'")
            try:
                return w.serving_endpoints.query(
                    name=WHISPER_ENDPOINT,
                    inputs=[audio_b64],
                )
            except Exception as e3:
                # Chain from the last error so the original traceback is preserved.
                raise RuntimeError(
                    "Failed calling Whisper endpoint with dataframe_split (int 0), "
                    "dataframe_split (str '0'), and inputs[].\n\n"
                    f"Error 1 (int 0): {e1}\n\n"
                    f"Error 2 (str '0'): {e2}\n\n"
                    f"Error 3 (inputs[]): {e3}\n"
                ) from e3


def extract_transcript(whisper_query_resp: Any) -> str:
    preds = getattr(whisper_query_resp, "predictions", None)
    if not preds:
        # Sometimes response might be dict-like
        if isinstance(whisper_query_resp, dict) and "predictions" in whisper_query_resp:
            preds = whisper_query_resp["predictions"]
        else:
            raise RuntimeError(f"No predictions found in Whisper response: {whisper_query_resp}")

    pred0 = preds[0]

    if isinstance(pred0, dict):
        if isinstance(pred0.get("text"), str):
            return pred0["text"]
        if isinstance(pred0.get("transcript"), str):
            return pred0["transcript"]
        return str(pred0)

    if isinstance(pred0, str):
        return pred0

    return str(pred0)

### Step 5 — Send the transcript to the Knowledge Assistant and extract a clean answer
Now that we have speech converted into text, we use that transcript as the input to an Agent Bricks **Knowledge Assistant** endpoint.

This cell includes:
- `call_knowledge_assistant()`: calls the endpoint using the OpenAI-compatible client from Databricks Model Serving.  
  It tries the newer **Responses API** first, and falls back to **Chat Completions** if needed (different workspaces/endpoints can vary).
- `extract_agent_text()`: pulls out a readable answer string from whichever response format is returned.


In [0]:
# =========================
# Knowledge Assistant call
# =========================
def extract_agent_text(resp: Any) -> str:
    # Best case
    if hasattr(resp, "output_text") and resp.output_text:
        return resp.output_text

    # Responses API fallback
    try:
        parts = []
        for item in resp.output:
            for c in item.content:
                if getattr(c, "text", None):
                    parts.append(c.text)
        if parts:
            return "\n".join(parts)
    except Exception:
        pass

    # Chat Completions fallback
    try:
        return resp.choices[0].message.content
    except Exception:
        pass

    return str(resp)


def call_knowledge_assistant(w: WorkspaceClient, user_text: str) -> Any:
    client = w.serving_endpoints.get_open_ai_client()

    # Try Responses API first
    try:
        return client.responses.create(
            model=KNOWLEDGE_ASSISTANT_ENDPOINT,
            input=[{"role": "user", "content": user_text}],
        )
    except Exception:
        # Fallback to Chat Completions
        return client.chat.completions.create(
            model=KNOWLEDGE_ASSISTANT_ENDPOINT,
            messages=[{"role": "user", "content": user_text}],
        )
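
The notebook sends the raw transcript as the user message. If you want the assistant to return a summary plus recommended actions, one option is to wrap the transcript in a short instruction first. The sketch below is a hypothetical helper: the name `build_assistant_prompt`, the instruction wording, and the `max_chars` default are assumptions to adjust for your own Knowledge Assistant.

```python
def build_assistant_prompt(transcript: str, max_chars: int = 8000) -> str:
    """Wrap a transcript in a short instruction for the assistant.

    max_chars is an arbitrary illustrative cap, not an endpoint limit.
    """
    # Collapse runs of whitespace and newlines left over from transcription.
    cleaned = " ".join(transcript.split())
    if not cleaned:
        raise ValueError("Transcript is empty; nothing to send to the assistant.")
    if len(cleaned) > max_chars:
        cleaned = cleaned[:max_chars].rstrip() + " [truncated]"
    return (
        "The following is a transcript of a customer voice interaction.\n"
        "Summarize it and recommend next actions.\n\n"
        f"Transcript:\n{cleaned}"
    )
```

The result could then be passed as `user_text` to `call_knowledge_assistant(w, build_assistant_prompt(transcript))`.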



### Step 6 — Run the full speech-to-answer pipeline
This final cell ties everything together into a single workflow:

1. Read the audio file from the configured path (`AUDIO_PATH`)
2. Transcribe the audio with the Whisper endpoint to produce a **transcript**
3. Send the transcript to the Knowledge Assistant endpoint to generate an **answer**
4. Print the transcript and the final response

In [0]:
# =========================
# End-to-end runner
# =========================
def speech_to_answer(audio_path: str) -> Dict[str, Any]:
    w = WorkspaceClient()

    audio_bytes = read_audio_bytes(audio_path)

    whisper_resp = call_whisper_endpoint(w, audio_bytes)
    transcript = extract_transcript(whisper_resp)
    agent_resp = call_knowledge_assistant(w, transcript)
    answer = extract_agent_text(agent_resp)

    return {
        "audio_path": audio_path,
        "transcript": transcript,
        "answer": answer,
        "whisper_raw": whisper_resp,
        "agent_raw": agent_resp,
    }


# =========================
# RUN
# =========================
result = speech_to_answer(AUDIO_PATH) 

print("===== TRANSCRIPT =====")
print(result["transcript"])
print("\n===== ANSWER =====")
print(result["answer"])
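
To apply the same workflow to many recordings, the runner can be wrapped in a small batch loop. The sketch below is illustrative: `list_audio_files` and `run_batch` are hypothetical helpers, and it assumes the audio directory is reachable through ordinary file APIs. The pipeline is passed in as a callable so that a single failing file does not stop the rest.

```python
import os
from typing import Any, Callable, Dict, Iterable, List

AUDIO_EXTENSIONS = (".mp3", ".m4a", ".wav")


def list_audio_files(directory: str) -> List[str]:
    """Return sorted full paths of audio files directly inside `directory`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(AUDIO_EXTENSIONS)
    )


def run_batch(
    paths: Iterable[str],
    pipeline: Callable[[str], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run `pipeline` on each path, recording failures instead of raising."""
    results: List[Dict[str, Any]] = []
    for path in paths:
        try:
            output = pipeline(path)
            results.append({"audio_path": path, "ok": True, **output})
        except Exception as exc:
            results.append({"audio_path": path, "ok": False, "error": str(exc)})
    return results
```

In the notebook, `pipeline` would be `speech_to_answer`, for example `run_batch(list_audio_files(audio_dir), speech_to_answer)`, where `audio_dir` is a placeholder for your own Volume directory.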