[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1arL7bWuF2P3soS3p19MWJeUDtW0Eu5tk?usp=sharing)

# Get Started with Streaming Speech-to-Text

In this notebook, you will learn how to:

*  Open a WebSocket connection to the Fireworks Streaming Speech-to-Text API
*  Stream audio to the API
*  Receive the transcription of the audio stream

Note that the audio stream must consist of 16000 Hz mono audio chunks, each representing an interval of 50 ms or longer.
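
To make the chunk requirement concrete, here is a minimal sketch of the arithmetic. It assumes 16-bit PCM samples (2 bytes each), which is the encoding this notebook produces later; the helper name `chunk_size` is hypothetical.

```python
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # assumption: 16-bit signed PCM, mono


def chunk_size(chunk_ms):
    """Return (samples, bytes) for one mono chunk of the given duration."""
    samples = SAMPLE_RATE_HZ * chunk_ms // 1000
    return samples, samples * BYTES_PER_SAMPLE


samples, n_bytes = chunk_size(50)
print(f"{samples} samples, {n_bytes} bytes per 50 ms chunk")
```

Under these assumptions, a 50 ms chunk holds 800 samples, or 1600 bytes.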

## Install dependencies



In [None]:
!pip3 install requests torch torchaudio websocket-client

## 1. Prepare audio stream

In this example, we will use a pre-recorded audio file and convert it into a stream of 50 ms audio chunks.

In [None]:
import io
import requests
import torch
import torchaudio

# Download audio file
response = requests.get("https://storage.googleapis.com/fireworks-public/test/3.5m.flac", timeout=60)
response.raise_for_status()  # fail early if the download did not succeed
audio_bytes = response.content
print(f"Downloaded audio file size: {len(audio_bytes)} bytes")

# Load to torch tensor
audio_tensor, sample_rate = torchaudio.load(io.BytesIO(audio_bytes))
print(f"Loaded audio tensor. shape={audio_tensor.shape} sample_rate={sample_rate}")

# Resample to 16000 Hz
target_sample_rate = 16000
audio_tensor = torchaudio.functional.resample(audio_tensor, sample_rate, target_sample_rate)
print(f"Resampled audio tensor. shape={audio_tensor.shape} sample_rate={target_sample_rate}")

# Convert to mono
audio_tensor = audio_tensor.mean(dim=0, keepdim=True)
print(f"Mono audio tensor. shape={audio_tensor.shape}")

# Split into chunks of 50ms
chunk_size_ms = 50
audio_chunk_tensors = torch.split(audio_tensor, int(chunk_size_ms * target_sample_rate / 1000), dim=1)
print(f"Split into {len(audio_chunk_tensors)} audio chunks each {chunk_size_ms}ms")

# Convert to 16-bit PCM bytes. Clamp first: resampling can overshoot [-1, 1],
# and an unclamped value would overflow when cast to int16.
audio_chunk_bytes = [
    (chunk * 32768.0).clamp(-32768, 32767).to(torch.int16).numpy().tobytes()
    for chunk in audio_chunk_tensors
]
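
The final step above scales float samples in the range [-1, 1] to 16-bit integers. The sketch below shows the same idea using only the standard library. It clamps before converting, because a sample of exactly 1.0 scales to 32768, one past the int16 maximum. The helper name `floats_to_pcm16` and the explicit little-endian byte order are assumptions for illustration; the notebook itself relies on NumPy's native byte order.

```python
import struct


def floats_to_pcm16(samples):
    """Convert floats in [-1, 1] to little-endian 16-bit PCM bytes."""
    # int() truncates toward zero; clamp so out-of-range values cannot overflow.
    ints = [max(-32768, min(32767, int(s * 32768.0))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


pcm = floats_to_pcm16([0.0, 0.5, -1.0, 1.0])
print(len(pcm))                   # 2 bytes per sample
print(struct.unpack("<4h", pcm))  # decoded back to integers
```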

## 2. Stream audio and get transcription

We will store transcription segments in a dict, with keys as segment IDs and values as transcription text.

In [None]:
import json
import threading
import time
import websocket
import urllib.parse
from IPython.display import clear_output


"""
The client maintains a state dictionary, starting with an empty
dictionary `{}`. When the server sends the first transcription message,
it contains a list of segments. Each segment has an `id` and `text`:

Server initial message:
{
    "segments": [
        {"id": "0", "text": "This is the first sentence"},
        {"id": "1", "text": "This is the second sentence"}
    ]
}

Client initial state:
{
    "0": "This is the first sentence",
    "1": "This is the second sentence",
}

When the server sends the next updates to the transcription, the client
updates the state dictionary based on the segment `id`:

Server continuous message:
{
    "segments": [
        {"id": "1", "text": "This is the second sentence modified"},
        {"id": "2", "text": "This is the third sentence"}
    ]
}

Client updated state:
{
    "0": "This is the first sentence",
    "1": "This is the second sentence modified",   # overwritten
    "2": "This is the third sentence",             # new
}
"""

lock = threading.Lock()
segments = {}


def on_open(ws):
    def stream_audio(ws):
        for chunk in audio_chunk_bytes:
            ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)
            time.sleep(chunk_size_ms / 1000)

        final_checkpoint = json.dumps({"checkpoint_id": "final"})
        ws.send(final_checkpoint, opcode=websocket.ABNF.OPCODE_TEXT)

    threading.Thread(target=stream_audio, args=(ws,)).start()


def on_error(ws, error):
    print(f"Error: {error}")


def on_message(ws, message):
    message = json.loads(message)
    if message.get("checkpoint_id") == "final":
        ws.close()
        return

    updated_segments = {
        segment["id"]: segment["text"]
        for segment in message["segments"]
    }
    with lock:
        segments.update(updated_segments)
        clear_output(wait=True)
        print("\n".join(f" - {k}: {v}" for k, v in segments.items()))


url = "ws://audio-streaming.us-virginia-1.direct.fireworks.ai/v1/audio/transcriptions/streaming"
params = urllib.parse.urlencode({
    "language": "en",
})
ws = websocket.WebSocketApp(
    f"{url}?{params}",
    header={
        "Authorization": "<FIREWORKS_API_KEY>",
    },
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
)
ws.run_forever()
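
Once the connection closes, the segment dictionary holds the latest text for every segment. As a minimal sketch, the helpers below replay the update rule on hypothetical messages and join the result into a single transcript. The function names are illustrative, and sorting with `key=int` assumes segment IDs are numeric strings, as in the example messages above.

```python
def merge_segments(state, message):
    """Overwrite or add segments from one server message, keyed by segment id."""
    for segment in message.get("segments", []):
        state[segment["id"]] = segment["text"]
    return state


def full_transcript(state):
    """Join segment texts in id order (assumes ids like "0", "1", "2")."""
    return " ".join(state[i] for i in sorted(state, key=int))


# Hypothetical messages with the same shape as the examples in the docstring.
messages = [
    {"segments": [
        {"id": "0", "text": "This is the first sentence"},
        {"id": "1", "text": "This is the second sentence"},
    ]},
    {"segments": [
        {"id": "1", "text": "This is the second sentence modified"},
        {"id": "2", "text": "This is the third sentence"},
    ]},
]

state = {}
for message in messages:
    merge_segments(state, message)

print(full_transcript(state))
```

In this notebook, calling `full_transcript(segments)` after `ws.run_forever()` returns would produce the complete transcript under the same assumption about segment IDs.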

## Conclusion

In this notebook, you learned how to stream audio to the Streaming Speech-to-Text API and receive the transcription in real time over a WebSocket connection.

For more information, visit [docs.fireworks.ai](https://docs.fireworks.ai/api-reference/audio-streaming-transcriptions).

Explore the community or reach out to us on [Discord](https://discord.gg/fireworks-ai).