# Speech-to-Text Client - Walkthrough

This notebook demonstrates how to use the STT client (Nemotron Speech Streaming). It covers four steps:

1. **Health Check** - Verify STT service is reachable
2. **Transcribe Audio** - Transcribe a WAV file
3. **Word Timestamps** - Show word-level timing
4. **Round-trip** - Generate speech with TTS, then transcribe it back

## Setup

In [1]:
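import sys
# Make the project root importable so that `src.clients` resolves
sys.path.insert(0, '/app')

# Jupyter already runs an event loop; nest_asyncio lets us call
# run_until_complete() from inside it
import nest_asyncio
nest_asyncio.apply()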

import asyncio
from pathlib import Path

## 1. Health Check

First, let's verify the STT service is running and reachable.

In [2]:
from src.clients.stt import (
    check_stt_health,
    transcribe_audio,
    DEFAULT_STT_URL,
)

# check_stt_health() is a coroutine, so drive it on the notebook's event loop
is_healthy = asyncio.get_event_loop().run_until_complete(check_stt_health())
print(f"STT URL: {DEFAULT_STT_URL}")
print(f"STT healthy: {is_healthy}")

STT URL: http://192.168.5.96:8001
STT healthy: True
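
If the service is still starting up, a single check can fail even though the service becomes reachable a moment later. The sketch below polls an async health check a few times before giving up. `wait_until_healthy` and its parameters are hypothetical names introduced here for illustration, not part of `src/clients/stt.py`; it only assumes the check is a zero-argument coroutine function returning a bool, as `check_stt_health` appears to be above.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_healthy(
    check: Callable[[], Awaitable[bool]],
    attempts: int = 5,
    delay_s: float = 1.0,
) -> bool:
    """Poll an async health check until it returns True or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            if await check():
                return True
        except Exception as exc:
            # Assumption: any exception means "not ready yet", so keep polling
            print(f"attempt {attempt}: health check raised {exc!r}")
        if attempt < attempts:
            await asyncio.sleep(delay_s)
    return False
```

In this notebook it could be driven the same way as the other coroutines, for example `asyncio.get_event_loop().run_until_complete(wait_until_healthy(check_stt_health))`.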


## 2. Transcribe a WAV File

Load a WAV file and transcribe it, passing `timestamps=True` to request word-level timestamps.

In [3]:
# Load a WAV file (e.g., one generated by the TTS notebook)
wav_path = Path("/app/media/tts_samples/sample.wav")

if wav_path.exists():
    audio_bytes = wav_path.read_bytes()
    print(f"Loaded {len(audio_bytes)} bytes from {wav_path}")

    async def transcribe():
        return await transcribe_audio(audio_bytes, timestamps=True)

    result = asyncio.get_event_loop().run_until_complete(transcribe())
    print(f"\nTranscription: {result.text}")
else:
    print(f"No WAV file found at {wav_path}")
    print("Run the TTS notebook first to generate a sample.")

Loaded 159788 bytes from /app/media/tts_samples/sample.wav

Transcription: Welcome to Wei Wo, the platform that showcases what makers are working on.
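
Before sending audio to the service, it can help to confirm what the bytes actually contain. The sketch below reads the WAV header from memory with the standard-library `wave` module and reports channel count, sample width, sample rate, and duration. `describe_wav` is a hypothetical helper, not part of the STT client, and `wave` generally handles only uncompressed PCM files, so other encodings may raise `wave.Error`.

```python
import io
import wave


def describe_wav(audio_bytes: bytes) -> dict:
    """Return basic format facts for WAV data held in memory."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration is the frame count divided by frames per second
            "duration_s": frames / rate,
        }
```

Calling `describe_wav(audio_bytes)` on the file loaded above would give a duration that can be compared against the end time of the last transcribed word.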


## 3. Word-Level Timestamps

Display the word-level timing information (used for subtitle generation).

In [4]:
# Guard against running this cell before the transcription cell
if 'result' in globals() and result.words:
    print(f"{'Word':<15} {'Start':>8} {'End':>8} {'Duration':>8}")
    print("-" * 45)
    for w in result.words:
        duration = w.end - w.start
        print(f"{w.word:<15} {w.start:>7.2f}s {w.end:>7.2f}s {duration:>7.2f}s")
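
    # End time of the last word: the span of transcribed speech,
    # which may be shorter than the audio file itself
    total_duration = result.words[-1].end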
    print(f"\nTotal duration: {total_duration:.2f}s")
    print(f"Total words: {len(result.words)}")
else:
    print("No transcription result with timestamps available.")
    print("Run the cells above first.")

Word               Start      End Duration
---------------------------------------------
Welcome            0.16s    0.32s    0.16s
to                 0.32s    0.40s    0.08s
Wei                1.12s    1.20s    0.08s
Wo,                1.28s    1.60s    0.32s
the                1.60s    1.68s    0.08s
platform           1.68s    1.84s    0.16s
that               1.84s    1.92s    0.08s
showcases          1.92s    2.40s    0.48s
what               2.48s    2.56s    0.08s
makers             2.64s    2.96s    0.32s
are                2.96s    3.04s    0.08s
working            3.04s    3.12s    0.08s
on.                3.44s    3.60s    0.16s

Total duration: 3.60s
Total words: 13
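
Since these timings feed subtitle generation, here is a minimal sketch of turning word timestamps into SRT cues. It is not the pipeline's actual subtitle code: the `Word` dataclass is a hypothetical stand-in that mirrors the `.word`, `.start`, and `.end` attributes used above, and grouping a fixed number of words per cue (`max_words`) is an assumption chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """Hypothetical stand-in mirroring the .word/.start/.end fields used above."""
    word: str
    start: float
    end: float


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: Iterable[Word], max_words: int = 4) -> str:
    """Group consecutive words into cues of at most max_words words each."""
    words = list(words)
    cues: List[str] = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w.word for w in chunk)
        start = srt_timestamp(chunk[0].start)
        end = srt_timestamp(chunk[-1].end)
        # Each cue: 1-based index, time range, text, then a blank separator line
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Because the function relies only on attribute access, passing `result.words` directly should work as long as its items expose the same three attributes. A real subtitle step would likely also split cues at long pauses, such as the gap between "to" and "Wei" above.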


## 4. TTS -> STT Round-trip

Generate speech from text, then transcribe it back to verify the pipeline.

In [5]:
from src.clients.tts import generate_speech

original_text = "The quick brown fox jumps over the lazy dog."

async def roundtrip():
    # Step 1: Text -> Speech
    wav = await generate_speech(text=original_text)
    print(f"TTS generated {len(wav)} bytes")

    # Step 2: Speech -> Text
    result = await transcribe_audio(wav, timestamps=True)
    return result

rt_result = asyncio.get_event_loop().run_until_complete(roundtrip())
print(f"\nOriginal:    {original_text}")
print(f"Transcribed: {rt_result.text}")

if rt_result.words:
    print(f"\nWord timestamps ({len(rt_result.words)} words):")
    for w in rt_result.words:
        print(f"  [{w.start:.2f}s - {w.end:.2f}s] {w.word}")

TTS generated 108588 bytes

Original:    The quick brown fox jumps over the lazy dog.
Transcribed: The quick brown fox jumps over the lazy dog.

Word timestamps (9 words):
  [0.24s - 0.32s] The
  [0.48s - 0.56s] quick
  [0.64s - 0.88s] brown
  [1.12s - 1.20s] fox
  [1.28s - 1.44s] jumps
  [1.44s - 1.52s] over
  [1.60s - 1.68s] the
  [1.76s - 2.32s] lazy
  [2.32s - 2.48s] dog.
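
Comparing the two strings by eye works for one sentence, but a numeric score makes the round-trip check repeatable. The sketch below computes a word error rate (word-level edit distance divided by the number of reference words) after lowercasing and stripping punctuation. `normalize` and `word_error_rate` are hypothetical helpers written for this example, and the normalization rules are an assumption, not something the STT client defines.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation (keeping apostrophes), and split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For the run shown above, `word_error_rate(original_text, rt_result.text)` would be `0.0`, since the transcription matches the original exactly. A small nonzero threshold may be more practical as a pipeline check, because proper nouns and numbers are often transcribed differently from how they were written.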


## Next Steps

- See `src/clients/stt.py` for the full client API
- Word timestamps are used by the subtitle overlay step in the video pipeline
- The TTS -> STT round-trip yields both the audio and its word timing data for use with MoviePy