<a href="https://colab.research.google.com/github/ZLSLU/EoAI/blob/main/EoAI_Unlocking_Luxembourgish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Elements of AI: Unlocking Luxembourgish

Welcome to Unlocking Luxembourgish. In this session you'll work with a temporary public API (at **34.90.113.55:5000**) to experiment with our Luxembourgish AI models for Speech-to-Text (STT) and Text-to-Speech (TTS).

The notebook is divided into two parts:

**Part I: STT Exercises** – You'll test various STT parameters (such as VAD, disfluency detection, temperature, beam size, etc.), explore different ways of parsing the JSON output (e.g., subtitling), and see how to chunk longer audio files.

**Part II: TTS Exercises** – You'll send text to be synthesized, view the resulting WAV file directly in the notebook, preprocess input text, visualize the audio waveform, and apply audio modifications.

Each exercise includes a brief description and code. The goal of these exercises is to experiment and have fun, but more advanced users are welcome to try out their own code.

Happy experimenting!

Before we begin, please run the following code. It registers a small CSS rule that wraps long lines in the code output, which makes the rest of the notebook easier to read.

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

Next, run the following code to download all the audio you will need for testing.

In [None]:
# Download all example audio files
for i in range(1, 13):  # 1 to 12
    wav_url = f"https://raw.githubusercontent.com/ZLSLU/EoAI/main/EoAI_sample_{i}.wav"
    txt_url = f"https://raw.githubusercontent.com/ZLSLU/EoAI/main/EoAI_sample_{i}.txt"
    !wget "{wav_url}" -O "EoAI_sample_{i}.wav"
    !wget "{txt_url}" -O "EoAI_sample_{i}.txt"

!wget "https://raw.githubusercontent.com/ZLSLU/EoAI/main/EoAI_longform_1.mp3" -O "EoAI_longform_1.mp3"
!wget "https://raw.githubusercontent.com/ZLSLU/EoAI/main/EoAI_longform_1.txt" -O "EoAI_longform_1.txt"
!wget "https://raw.githubusercontent.com/ZLSLU/EoAI/main/EoAI_longform_2.mp3" -O "EoAI_longform_2.mp3"
!wget "https://raw.githubusercontent.com/ZLSLU/EoAI/main/EoAI_longform_2.txt" -O "EoAI_longform_2.txt"
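
If a download fails, `wget -O` can still leave an empty file behind, and later cells will then fail in confusing ways. The sketch below is one way to check that everything arrived. The helper names `expected_files` and `missing_files` are made up for this example; the file names are the ones used in the download cell above.

```python
import os

def expected_files():
    """List the file names the download cell above should have created."""
    names = []
    for i in range(1, 13):  # samples 1 to 12
        names.append(f"EoAI_sample_{i}.wav")
        names.append(f"EoAI_sample_{i}.txt")
    for i in (1, 2):  # the two long-form recordings
        names.append(f"EoAI_longform_{i}.mp3")
        names.append(f"EoAI_longform_{i}.txt")
    return names

def missing_files(folder="."):
    """Return the expected files that are absent or empty in `folder`."""
    missing = []
    for name in expected_files():
        path = os.path.join(folder, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing

print("missing files:", missing_files() or "none")
```

If the list is not empty, rerun the download cell before continuing.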

## Part I: Speech-to-Text (STT) Exercises

In this section, you'll experiment with the STT endpoint by sending an audio file along with various parameters to see how they affect the transcription.

### Exercise STT-1: Test Health Endpoint

Verify that the API is accessible by checking the health endpoint. If you are experiencing issues or long delays, run this code to make sure the API is functioning. If the health check fails, please inform the instructor.

In [None]:
# exercise stt-1: test health endpoint
import requests

response = requests.get("http://34.90.113.55:5000/api/health")
print("status code:", response.status_code)
print("json response:", response.json())

# curl equivalent:
# curl http://34.90.113.55:5000/api/health

### Exercise STT-2: Basic File Upload for STT Transcription

Send an audio file to the STT endpoint. This exercise uses the `transcribed` model with the language set to Luxembourgish (`lb`). If many users are using this model, you may change to the `timestamped` model instead, which is a second version of the same model.

In [None]:
# exercise stt-2: basic file upload for stt
audio_file_path = "EoAI_sample_1.wav"

with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb"}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    print("uploaded stt response status:", response.status_code)
    print("uploaded stt response json:", response.json())

# curl equivalent:
# curl -F "file=@EoAI_sample_1.wav" -F "model_type=transcribed" -F "model_params={\"language\":\"lb\"}" http://34.90.113.55:5000/api/stt/transcribe

You can also print only the transcribed text by making a minor change to the same code:

In [None]:
# exercise stt-2: basic file upload for stt, printing only the transcribed text
audio_file_path = "EoAI_longform_2.mp3"

with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb"}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)

if response.status_code == 200:
    response_json = response.json()
    transcribed_text = response_json.get("result", {}).get("text", "No transcription found")
    print(transcribed_text)
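
The `model_params` field is a JSON string, and writing it by hand gets error-prone once you add more parameters. As a minimal sketch, the hypothetical helper `build_stt_data` below (not part of the API) builds the same form fields from Python values and lets `json.dumps` handle the quoting.

```python
import json

def build_stt_data(model_type="transcribed", language="lb", **params):
    """Build the form fields for an STT request.

    Extra keyword arguments (for example vad=True) are added to model_params.
    """
    model_params = {"language": language, **params}
    return {
        "model_type": model_type,
        # json.dumps turns Python's True/False into JSON's true/false
        "model_params": json.dumps(model_params),
    }

data = build_stt_data(vad=True, beam_size=5)
print(data["model_params"])  # {"language": "lb", "vad": true, "beam_size": 5}
```

The returned dictionary can be passed as `data=` to `requests.post`, exactly like the hand-written dictionaries in the exercises.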


### Exercise STT-3: Test VAD Parameter

Enable Voice Activity Detection (VAD) to remove non-speech segments. Here, we test it by setting `vad` to `true` (which by default uses Silero VAD).

In [None]:
import json
import requests

# set the vad setting: change to True or False as desired
vad_setting = True

# open the audio file in binary mode
audio_file_path = "EoAI_longform_1.mp3"  # one of the long-form files downloaded above
with open(audio_file_path, "rb") as f:
    files = {"file": f}
    # build the model_params dictionary and convert it to a json string
    model_params = {"language": "lb", "vad": vad_setting}
    data = {
        "model_type": "transcribed",
        "model_params": json.dumps(model_params)
    }
    # send the file to the stt endpoint
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    print("vad", vad_setting, "response status:", response.status_code)
    response_json = response.json()
    print("vad", vad_setting, "response json:", response_json)

# if the response is successful and contains result data, compute and print statistics
if response.status_code == 200 and "result" in response_json:
    result = response_json["result"]
    segments = result.get("segments", [])
    num_segments = len(segments)
    # compute average confidence across the segments that report one
    confidences = [seg["confidence"] for seg in segments if "confidence" in seg]
    avg_confidence = sum(confidences) / len(confidences) if confidences else 0
    # get the overall transcription text from the result
    transcription_text = result.get("text", "")

    print("\nstatistics:")
    print("number of segments:", num_segments)
    print("average confidence:", round(avg_confidence, 3))
    print("transcription text snippet:", transcription_text[:400], "...")
else:
    print("error: no valid result in response")
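
To judge what VAD changes, it helps to compare a run with `vad` on against a run with `vad` off. The sketch below summarizes a response as segment count, total transcribed seconds, and mean confidence. The two responses are made-up stand-ins, not real API output, and `summarize_segments` is a hypothetical helper that assumes the `result` / `segments` layout used above.

```python
def summarize_segments(response_json):
    """Return (segment count, total segment seconds, mean confidence or None)."""
    segments = response_json.get("result", {}).get("segments", [])
    speech_seconds = sum(seg.get("end", 0) - seg.get("start", 0) for seg in segments)
    confidences = [seg["confidence"] for seg in segments if "confidence" in seg]
    mean_conf = sum(confidences) / len(confidences) if confidences else None
    return len(segments), round(speech_seconds, 2), mean_conf

# made-up responses standing in for a vad=False run and a vad=True run
without_vad = {"result": {"segments": [
    {"start": 0.0, "end": 6.0, "confidence": 0.70},
    {"start": 6.0, "end": 12.0, "confidence": 0.80},
]}}
with_vad = {"result": {"segments": [
    {"start": 0.5, "end": 5.0, "confidence": 0.85},
    {"start": 7.0, "end": 11.5, "confidence": 0.90},
]}}

for label, resp in [("vad off", without_vad), ("vad on", with_vad)]:
    print(label, summarize_segments(resp))
```

With real responses, replace the two dictionaries with `response.json()` from each run.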


### Exercise STT-4: Test detect_disfluencies Parameter

Enable disfluency detection to mark hesitations or filler words in the transcription. This can help in applications where you want to highlight or remove disfluencies (words like "umm", "ëëë", etc.).

In [None]:
# exercise stt-4: test detect_disfluencies parameter
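# note: this file is not among the setup downloads; create it first (the name
# suggests a tts clip containing hesitations) or point this at another audio file
audio_file_path = "tts_output_eee.wav"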
with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb", "detect_disfluencies": true}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    print("detect_disfluencies response status:", response.status_code)
    print("detect_disfluencies response json:", response.json())
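
Once disfluencies are marked, you can count or remove them. The sketch below assumes the marker is `[*]`, which is what the whisper-timestamped library uses. Check the JSON printed by the cell above to confirm what this API returns, and adjust `DISFLUENCY_MARK` if it differs. Both helper functions are hypothetical.

```python
import re

# assumption: the API marks disfluencies with "[*]", as whisper-timestamped does
DISFLUENCY_MARK = "[*]"

def count_disfluencies(text):
    """Count how many disfluency markers appear in the text."""
    return text.count(DISFLUENCY_MARK)

def remove_disfluencies(text):
    """Drop the markers and tidy up the spacing they leave behind."""
    cleaned = text.replace(DISFLUENCY_MARK, " ")
    cleaned = re.sub(r"\s+", " ", cleaned)              # collapse runs of whitespace
    cleaned = re.sub(r"\s+([,.!?;:])", r"\1", cleaned)  # no space before punctuation
    return cleaned.strip()

sample = "ech [*] wollt [*] muer kommen."
print(count_disfluencies(sample))
print(remove_disfluencies(sample))
```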

### Exercise STT-5: Test trust_whisper_timestamps Parameter

Whisper generates timestamps (start & end times) for words as it transcribes speech. However, these timestamps aren’t always perfect—sometimes they might:

-	Be slightly off (e.g., a word starts earlier or later than expected).
-	Overlap or be out of order in fast speech.
-	Have gaps between words where speech is actually continuous.

This setting determines whether to strictly follow Whisper’s timestamps or allow some adjustments.

Toggle the use of Whisper's timestamps. Here we test by setting `trust_whisper_timestamps` to `false`.

In [None]:
# exercise stt-5: test trust_whisper_timestamps parameter set to false
audio_file_path = "EoAI_sample_5.wav"
with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb", "trust_whisper_timestamps": false}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    print("trust_whisper_timestamps false response status:", response.status_code)
    print("trust_whisper_timestamps false response json:", response.json())

The following script compares the timestamps for the same file with the setting on `true` and on `false`. To try another file, change the filename.

In [None]:
import json
import requests

def call_stt(trust_flag):
    audio_file_path = "EoAI_longform_1.mp3"  # update with the file you want to compare
    with open(audio_file_path, "rb") as f:
        files = {"file": f}
        model_params = {"language": "lb", "trust_whisper_timestamps": trust_flag}
        data = {
            "model_type": "timestamped",
            "model_params": json.dumps(model_params)
        }
        # send the request to the stt endpoint
        response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
        return response.json()

# call the stt endpoint with trust_whisper_timestamps true and false
result_true = call_stt(True)
result_false = call_stt(False)

# extract the segments from both responses (each segment should include "start" and "end")
segments_true = result_true.get("result", {}).get("segments", [])
segments_false = result_false.get("result", {}).get("segments", [])

# print a header for clarity
print("comparison of segment timestamps for trust_whisper_timestamps settings:")
print("{:<10} {:<30} {:<30}".format("segment", "trust true (start --> end)", "trust false (start --> end)"))

# determine the maximum number of segments from either response
max_segments = max(len(segments_true), len(segments_false))

for i in range(max_segments):
    # if a segment exists for the 'true' setting, extract its start and end times; otherwise, set as N/A
    if i < len(segments_true):
        seg_true = segments_true[i]
        start_true = seg_true.get("start", 0)
        end_true = seg_true.get("end", 0)
        ts_true = f"{start_true:.2f} --> {end_true:.2f}"
    else:
        ts_true = "N/A"

    # do the same for the 'false' setting
    if i < len(segments_false):
        seg_false = segments_false[i]
        start_false = seg_false.get("start", 0)
        end_false = seg_false.get("end", 0)
        ts_false = f"{start_false:.2f} --> {end_false:.2f}"
    else:
        ts_false = "N/A"

    # print the segment number and both timestamp ranges side by side
    print("{:<10} {:<30} {:<30}".format(i+1, ts_true, ts_false))

### Exercise STT-6: Test compute_word_confidence Parameter

Enable the computation of word confidence scores. This parameter allows you to see how confident the model is about each word in the transcription.

In [None]:
# exercise stt-6: test compute_word_confidence parameter
audio_file_path = "EoAI_sample_10.wav"
with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb", "compute_word_confidence": true}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    print("compute_word_confidence response status:", response.status_code)
    print("compute_word_confidence response json:", response.json())
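
The full JSON is hard to scan by eye, so a common next step is to list only the words the model was unsure about. This is a minimal sketch on a made-up response. It assumes each word entry carries `text` and `confidence` keys, and `low_confidence_words` is a hypothetical helper.

```python
def low_confidence_words(response_json, threshold=0.5):
    """Return (word, confidence) pairs whose confidence is below the threshold."""
    flagged = []
    for seg in response_json.get("result", {}).get("segments", []):
        for word in seg.get("words", []):
            conf = word.get("confidence")
            if conf is not None and conf < threshold:
                flagged.append((word.get("text", ""), conf))
    return flagged

# made-up response standing in for real API output
example = {"result": {"segments": [{"words": [
    {"text": "gudde", "start": 0.0, "end": 0.4, "confidence": 0.93},
    {"text": "moien", "start": 0.4, "end": 0.9, "confidence": 0.41},
]}]}}

print(low_confidence_words(example))  # [('moien', 0.41)]
```

Try `low_confidence_words(response.json())` with different thresholds to see where the model struggles.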

### Exercise STT-7: Test refine_whisper_precision Parameter

Adjust the refinement of Whisper’s segment positions. Here we set `refine_whisper_precision` to `0.4` seconds.

In [None]:
# exercise stt-7: test refine_whisper_precision parameter (set to 0.4)
audio_file_path = "EoAI_longform_2.mp3"
with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb", "refine_whisper_precision": 0.4}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    print("refine_whisper_precision response status:", response.status_code)
    print("refine_whisper_precision response json:", response.json())

Now let's try comparing two different values for the same file.

In [None]:
import json
import requests

def call_stt(precision_value):
    """Call the STT API with a specific refine_whisper_precision value"""
    audio_file_path = "EoAI_longform_2.mp3"

    with open(audio_file_path, "rb") as f:
        files = {"file": f}
        model_params = {"language": "lb", "refine_whisper_precision": precision_value}
        data = {
            "model_type": "timestamped",
            "model_params": json.dumps(model_params)
        }

        # Send the request to the STT API
        response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
        return response.json()

# Call the STT API with two different refine_whisper_precision values
precision_1 = 0.2
precision_2 = 1.0

result_1 = call_stt(precision_1)
result_2 = call_stt(precision_2)

# Extract the transcribed segments
segments_1 = result_1.get("result", {}).get("segments", [])
segments_2 = result_2.get("result", {}).get("segments", [])

# Print comparison table header
print(f"\nComparison of Segment Timestamps for refine_whisper_precision = {precision_1} vs {precision_2}")
print("{:<10} {:<30} {:<30}".format("Segment", f"Precision {precision_1} (start --> end)", f"Precision {precision_2} (start --> end)"))

# Determine the maximum number of segments
max_segments = max(len(segments_1), len(segments_2))

for i in range(max_segments):
    # If a segment exists for precision_1, extract its start and end times; otherwise, set as N/A
    if i < len(segments_1):
        seg_1 = segments_1[i]
        start_1 = seg_1.get("start", 0)
        end_1 = seg_1.get("end", 0)
        ts_1 = f"{start_1:.2f} --> {end_1:.2f}"
    else:
        ts_1 = "N/A"

    # Do the same for precision_2
    if i < len(segments_2):
        seg_2 = segments_2[i]
        start_2 = seg_2.get("start", 0)
        end_2 = seg_2.get("end", 0)
        ts_2 = f"{start_2:.2f} --> {end_2:.2f}"
    else:
        ts_2 = "N/A"

    # Print the segment number and both timestamp ranges side by side
    print("{:<10} {:<30} {:<30}".format(i+1, ts_1, ts_2))

### Exercise STT-8: Test Temperature Parameter

Experiment with the sampling temperature. Here we pass a single temperature value (`1`) to see whether it influences the transcription output. Higher temperatures make decoding more random, so repeated runs may produce different text.

In [None]:
# exercise stt-8: test temperature parameter (single value; higher means more random sampling)
audio_file_path = "EoAI_sample_12.wav"
with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb", "temperature": 1}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    print("temperature parameter response status:", response.status_code)
    response_json = response.json()
    transcribed_text = response_json.get("result", {}).get("text", "No transcription found")
    print(transcribed_text)

If you would like to compare the difference between your temperature settings and the official transcription, run the code below with the filename you want to compare.

In [None]:
import requests
import os

# Define the audio file
audio_file_path = "EoAI_sample_12.wav"
text_file_path = audio_file_path.replace(".wav", ".txt")  # Matching text file

# Read the reference text file (if it exists)
if os.path.exists(text_file_path):
    with open(text_file_path, "r", encoding="utf-8") as txt_file:
        reference_text = txt_file.read().strip()
else:
    reference_text = "Reference text not found."

# Call the STT API
with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb", "temperature": 1.5}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)

# Check API response
print("\nTemperature Parameter Response Status:", response.status_code)

if response.status_code == 200:
    response_json = response.json()
    transcribed_text = response_json.get("result", {}).get("text", "No transcription found")
else:
    transcribed_text = "Error: Failed to transcribe audio."

# Print Comparison
print("\n" + "="*50)
print("📜 Reference Text (from .txt file):\n")
print(reference_text)
print("\n" + "="*50)
print("📝 Transcribed Text (from STT API):\n")
print(transcribed_text)
print("="*50)
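
Reading two long texts side by side makes small differences easy to miss. As a sketch, Python's standard `difflib` module can list only the words that differ. The function name `word_diff` and the two example sentences are made up for illustration.

```python
import difflib

def word_diff(reference, hypothesis):
    """Return (operation, reference words, hypothesis words) for every mismatch."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # keep only replace / delete / insert
            changes.append((op, ref_words[i1:i2], hyp_words[j1:j2]))
    return changes

for change in word_diff("ech ginn muer an d'stad", "ech ginn haut an stad"):
    print(change)
```

To use it with the cell above, call `word_diff(reference_text, transcribed_text)`.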

### Exercise STT-9: Test Beam Size and Best Of Parameters

Adjust beam search parameters (`beam_size` and `best_of`) to see how they affect transcription quality and speed.

`beam_size` keeps X candidate transcriptions in play during a single decoding run and picks the best one at the end. `best_of` generates X independent candidates by sampling and keeps the best one, so it only has an effect when the temperature is above 0.

In the example here, `beam_size` is set and `best_of` is not. This is simply to save on GPU resources, but both can be tested independently, though not together.

In [None]:
# exercise stt-9: test beam_size and best_of parameters
audio_file_path = "EoAI_sample_6.wav"
with open(audio_file_path, "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb", "beam_size": 5}'
    }
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    print("beam_size and best_of response status:", response.status_code)
    print("beam_size and best_of response json:", response.json())
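
To see the effect on speed, you can time each request. The sketch below shows a small timing wrapper. `timed` is a hypothetical helper, and `fake_transcribe` is only a stand-in so the example runs without the API; in practice you would wrap a function that posts to the STT endpoint.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# stand-in for a function that posts to the stt endpoint with a given beam_size
def fake_transcribe(beam_size):
    time.sleep(0.01 * beam_size)  # pretend that larger beams take longer
    return f"transcript with beam_size={beam_size}"

for beam_size in (1, 5):
    text, seconds = timed(fake_transcribe, beam_size)
    print(f"beam_size={beam_size}: {seconds:.3f} s -> {text}")
```

Keep in mind that timings against a shared server also depend on how many other people are using it at that moment.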

### Exercise STT-10: Convert STT JSON Output to SRT Format

This script converts a JSON transcription response into SRT subtitle format, ensuring timestamps and subtitles are formatted correctly. It first converts time in seconds into SRT timecode format (hh:mm:ss,ms), then processes transcription segments, splitting long ones into smaller chunks without cutting words.

In [None]:
import math

def sec_to_time(seconds):
    # convert seconds to srt time format: hh:mm:ss,ms
    # work in whole milliseconds to avoid floating-point truncation (e.g. 1.001 s)
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"

def json_to_srt(result_json):
    # get the list of transcription segments from the json
    segments = result_json.get("result", {}).get("segments", [])
    srt_output = ""
    srt_counter = 1  # subtitle numbering starts at 1
    max_chunk_duration = 5.0  # maximum duration per subtitle in seconds

    for seg in segments:
        # retrieve segment start and end times and the overall segment text
        seg_start = seg.get("start", 0)
        seg_end = seg.get("end", 0)
        seg_text = seg.get("text", "").strip()
        duration = seg_end - seg_start

        # if the segment is short (<= max duration) or no word-level info is available,
        # output the segment as a single subtitle
        if duration <= max_chunk_duration or "words" not in seg:
            srt_output += f"{srt_counter}\n{sec_to_time(seg_start)} --> {sec_to_time(seg_end)}\n{seg_text}\n\n"
            srt_counter += 1
        else:
            # if the segment is long and word-level timestamps exist, we split it without cutting words
            words = seg.get("words", [])
            if not words:
                # fallback if words list is empty
                srt_output += f"{srt_counter}\n{sec_to_time(seg_start)} --> {sec_to_time(seg_end)}\n{seg_text}\n\n"
                srt_counter += 1
                continue

            # start the first subtitle chunk at the segment's start time
            current_chunk_start = seg_start
            current_chunk_words = []  # list to collect words for the current subtitle

            for word in words:
                # get the text of the word (using the 'text' key, as provided in your json)
                word_text = word.get("text", "")
                # get the start time of the word; if missing, default to current chunk start
                word_start = word.get("start", current_chunk_start)

                # if adding this word would exceed our maximum subtitle duration,
                # finish the current subtitle chunk before this word starts
                if current_chunk_words and (word_start - current_chunk_start) > max_chunk_duration:
                    # join the collected words into a string for the current subtitle chunk
                    chunk_text = " ".join(current_chunk_words)
                    # set the current chunk's end time to the start time of this word
                    current_chunk_end = word_start
                    srt_output += f"{srt_counter}\n{sec_to_time(current_chunk_start)} --> {sec_to_time(current_chunk_end)}\n{chunk_text}\n\n"
                    srt_counter += 1
                    # start a new chunk beginning with the current word
                    current_chunk_start = word_start
                    current_chunk_words = [word_text]
                else:
                    # if within the allowed duration, add the word to the current chunk
                    current_chunk_words.append(word_text)
            # output any remaining words as the final subtitle chunk for this segment
            if current_chunk_words:
                srt_output += f"{srt_counter}\n{sec_to_time(current_chunk_start)} --> {sec_to_time(seg_end)}\n{' '.join(current_chunk_words)}\n\n"
                srt_counter += 1

    return srt_output


result_json = response.json()  # using the latest stt response
srt_text = json_to_srt(result_json)
print("srt output:\n", srt_text)
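
To load the subtitles in a video player, the SRT text needs to be saved to a `.srt` file. The sketch below writes it as UTF-8 so that characters such as `ë` and `é` survive. `save_srt` is a hypothetical helper, and the example text is made up rather than taken from a real transcription.

```python
def save_srt(srt_text, path):
    """Write subtitle text to path as UTF-8 and return the number of subtitle blocks."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(srt_text)
    # subtitle blocks are separated by blank lines
    return len([block for block in srt_text.strip().split("\n\n") if block.strip()])

example_srt = (
    "1\n00:00:00,000 --> 00:00:02,500\ngudde moien\n\n"
    "2\n00:00:02,500 --> 00:00:05,000\nwéi geet et?\n\n"
)
count = save_srt(example_srt, "example.srt")
print("wrote", count, "subtitles to example.srt")
```

In the notebook, you would call `save_srt(srt_text, "my_subtitles.srt")` with the output of `json_to_srt`.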


### Exercise STT-11: Chunking Long Audio Files

This script splits a long audio file into smaller 15-second chunks using pydub, then sends each chunk separately to an STT (speech-to-text) API for transcription. Each chunk’s transcribed segments are adjusted to reflect their actual position in the original audio, ensuring correct timestamps. Finally, all segments are combined into a single transcription, reconstructing the full text from the individual STT results.

This script is more advanced, but you can change the following variables to test it out:

- `file_path`: the file you want to transcribe
- `chunk_duration`: how many seconds long each chunk should be

In [None]:
# install pydub
!pip install pydub

import math
from pydub import AudioSegment
from io import BytesIO
import requests

def chunk_audio_in_memory(file_path, chunk_duration):
    """
    read an audio file and split it into chunks of chunk_duration seconds.
    returns a list of tuples: (audio_chunk, chunk_start_time_in_seconds).
    """
    # load the entire audio file into memory
    audio = AudioSegment.from_file(file_path)
    duration_ms = len(audio)  # duration in milliseconds
    chunks = []
    # iterate through the audio in increments of chunk_duration (converted to ms)
    for start in range(0, duration_ms, int(chunk_duration * 1000)):
        end = min(start + int(chunk_duration * 1000), duration_ms)
        chunk = audio[start:end]
        # append the chunk along with its start time (in seconds)
        chunks.append((chunk, start / 1000.0))
    return chunks

def process_chunk(chunk, chunk_start, model_type="transcribed", model_params='{"language": "lb"}'):
    """
    export an audio chunk to a temporary in-memory wav file,
    send it to the stt endpoint, and adjust the timestamps of the returned segments.
    """
    # create a bytes buffer to hold the wav file data
    chunk_io = BytesIO()
    # export the chunk in wav format into the bytes buffer
    chunk.export(chunk_io, format="wav")
    # reset pointer to the beginning of the buffer
    chunk_io.seek(0)

    # prepare the file data for multipart/form-data
    files = {"file": ("chunk.wav", chunk_io, "audio/wav")}
    data = {
        "model_type": model_type,
        "model_params": model_params
    }
    # send the chunk to the stt endpoint (adjust the url if needed)
    response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    result_json = response.json()

    # adjust the timing in each segment by adding the chunk's start time
    segments = result_json.get("result", {}).get("segments", [])
    for seg in segments:
        seg["start"] += chunk_start
        seg["end"] += chunk_start
        # adjust word-level timestamps if available
        if "words" in seg:
            for word in seg["words"]:
                word["start"] += chunk_start
                word["end"] += chunk_start
    return result_json

# main process: chunk the file, process each chunk, and combine the results
file_path = "EoAI_longform_2.mp3"  # update with your audio file path
chunk_duration = 15  # seconds per chunk

# get list of (chunk, start_time) tuples
chunks = chunk_audio_in_memory(file_path, chunk_duration)
print("number of chunks:", len(chunks))

all_segments = []  # list to hold all transcription segments

# process each chunk one by one
for chunk, start_time in chunks:
    print("processing chunk starting at", start_time, "seconds")
    result_json = process_chunk(chunk, start_time)
    segments = result_json.get("result", {}).get("segments", [])
    all_segments.extend(segments)

# combine all segments into one final result
final_text = " ".join(seg.get("text", "") for seg in all_segments)
final_result = {
    "result": {
        "segments": all_segments,
        "text": final_text
    },
    "status": "success"
}

print("final combined transcription:")
print(final_result["result"]["text"])
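
The arithmetic behind the chunking is easier to follow without the audio library. This sketch computes the chunk boundaries for a 40-second file and shifts one chunk-local segment back onto the timeline of the original file. `chunk_bounds` and `shift_segment` are hypothetical helpers, and the numbers are made up.

```python
def chunk_bounds(duration_ms, chunk_seconds):
    """Return (start_ms, end_ms) pairs that cover the whole file."""
    step = int(chunk_seconds * 1000)
    return [(start, min(start + step, duration_ms))
            for start in range(0, duration_ms, step)]

def shift_segment(segment, offset_seconds):
    """Return a copy of the segment with start and end moved by the offset."""
    shifted = dict(segment)
    shifted["start"] = segment["start"] + offset_seconds
    shifted["end"] = segment["end"] + offset_seconds
    return shifted

bounds = chunk_bounds(40_000, 15)
print(bounds)  # [(0, 15000), (15000, 30000), (30000, 40000)]

# a segment found 1.5 s into the third chunk actually starts at 31.5 s
local = {"start": 1.5, "end": 4.0, "text": "moien"}
print(shift_segment(local, bounds[2][0] / 1000))
```

Note that fixed-length chunks can cut a word in half at a boundary, so the transcription near the chunk edges may be less reliable.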


### Exercise STT-12: Advanced JSON Parsing – Extract Word-level Timestamps

Extract word-level timestamps from the STT JSON output and display them in a table using Pandas.

In [None]:
# exercise stt-12: extract word-level timestamps and display in a table
import pandas as pd

def extract_word_timestamps(result_json):
    # assume each segment has a 'words' key with a list of dicts with 'start' and 'end';
    # the word itself is read from 'text' (as in exercise stt-10), with 'word' as a fallback
    segments = result_json.get("result", {}).get("segments", [])
    rows = []
    for seg in segments:
        words = seg.get("words", [])
        for w in words:
            rows.append({
                "word": w.get("text", w.get("word", "")),
                "start": w.get("start", 0),
                "end": w.get("end", 0)
            })
    return pd.DataFrame(rows)

# demo extraction using latest stt response
df = extract_word_timestamps(result_json)
print(df.head())
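
Word-level timestamps also let you measure the silence between words, for example to find pauses in a recording. This is a minimal sketch in plain Python on a made-up word list. `find_pauses` is a hypothetical helper, and it assumes each word has `text`, `start`, and `end` keys.

```python
def find_pauses(words, min_gap=0.5):
    """Return (previous word, next word, gap in seconds) for gaps of at least min_gap."""
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((prev["text"], nxt["text"], round(gap, 2)))
    return pauses

# made-up word list standing in for seg["words"] from a real response
words = [
    {"text": "gudde", "start": 0.0, "end": 0.4},
    {"text": "moien", "start": 0.5, "end": 1.0},
    {"text": "alleguer", "start": 2.0, "end": 2.75},
]
print(find_pauses(words))  # [('moien', 'alleguer', 1.0)]
```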

## Part II: Text-to-Speech (TTS) Exercises

In this section, you'll experiment with the TTS endpoint. You will send text to synthesize audio, display the resulting WAV file directly in the notebook, preprocess text, visualize waveforms, and apply some audio modifications.

### Exercise TTS-1: Basic TTS Request

Send a simple text to the TTS endpoint and save the resulting WAV file. Then, display the audio directly in the notebook.

In [None]:
# exercise tts-1: basic tts request
tts_payload = {"text": "Gudde Moien, hei ass e Beispill vun der Sproochsynthees an der Lëtzebuerger Sprooch."}
response = requests.post("http://34.90.113.55:5000/api/tts/synthesize", json=tts_payload, timeout=120)
print("tts response status:", response.status_code)

with open("tts_output.wav", "wb") as f:
    f.write(response.content)
print("saved tts_output.wav locally")

In [None]:
# exercise tts-1: display the tts output audio
from IPython.display import Audio
Audio("tts_output.wav")

### Exercise TTS-2: Preprocess Text for TTS

Without text preprocessing, the model will not provide optimal results. In the previous example, the pronunciation was strange. This is because all text must be set to lowercase.

Preprocess the input text (e.g., convert to lowercase and trim extra whitespace) before sending it to the TTS endpoint.

In [None]:
# exercise tts-2: preprocess text before tts
raw_text = "Gudde Moien, hei ass e Beispill vun der Sproochsynthees an der Lëtzebuerger Sprooch."
processed_text = raw_text.lower().strip()
print("processed text:", processed_text)

tts_payload = {"text": processed_text}
response = requests.post("http://34.90.113.55:5000/api/tts/synthesize", json=tts_payload, timeout=120)

with open("tts_output.wav", "wb") as f:
    f.write(response.content)
print("saved tts_output.wav locally")

In [None]:
# exercise tts-2: display processed tts audio
Audio("tts_output.wav")
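
Note that `strip()` only removes whitespace at the start and end of the text. If your input contains line breaks or double spaces in the middle, a slightly more thorough normalizer may help. The sketch below is one way to do it; `normalize_for_tts` is a hypothetical helper, and whether the model needs any cleanup beyond lowercasing is not something this notebook confirms.

```python
import re

def normalize_for_tts(text):
    """Lowercase the text and collapse every run of whitespace into one space."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)  # also covers newlines and tabs
    return text.strip()

print(normalize_for_tts("  Gudde   Moien,\n hei ass e  Beispill. "))
# gudde moien, hei ass e beispill.
```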

### Exercise TTS-3: Visualize TTS Waveform

Load the synthesized audio and plot its waveform using matplotlib.

- Helps analyze TTS quality by visualizing the waveform for distortions, unnatural pauses, or clipping.
- Allows comparison between different parameters to check for smoothness, loudness, and structure.
- Detects silent gaps or glitches in synthesis that might not be noticeable just by listening.
- Useful for debugging issues related to volume, pitch variations, or incorrect synthesis timing.

In [None]:
# exercise tts-3: visualize the waveform of tts output
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf

data, samplerate = sf.read("tts_output.wav")
times = np.linspace(0, len(data)/samplerate, num=len(data))

plt.figure(figsize=(10,4))
plt.plot(times, data)
plt.title("Waveform of TTS Output")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
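
The visual checks listed above can be backed up with a few numbers: the peak amplitude, the RMS level (a rough measure of loudness), and the number of samples at the clipping limit. The sketch below uses a synthetic tone so that it runs on its own. `level_stats` is a hypothetical helper, and it assumes mono float samples between -1 and 1, which is what `sf.read` returns by default.

```python
import math

def level_stats(samples, clip_level=0.999):
    """Return peak amplitude, RMS level, and the count of samples at or above clip_level."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    return peak, rms, clipped

# synthetic one-second 440 Hz tone standing in for the samples of a wav file
samplerate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / samplerate) for n in range(samplerate)]

peak, rms, clipped = level_stats(tone)
print(f"peak={peak:.3f} rms={rms:.3f} clipped={clipped}")
```

For the real file, you could pass `data.tolist()` from the cell above instead of `tone`.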

### Exercise TTS-4: Modify TTS Output with an Audio Filter

Use `pydub` to apply an audio filter to the TTS output—for example, increasing the playback speed.

In [None]:
# exercise tts-4: modify tts output - change speed using pydub
from pydub import AudioSegment

tts_audio = AudioSegment.from_wav("tts_output.wav")
# increase speed by 1.5x
faster_audio = tts_audio.speedup(playback_speed=1.5)
faster_audio.export("tts_faster.wav", format="wav")
print("exported tts_faster.wav")

In [None]:
# exercise tts-4: display modified tts audio
Audio("tts_faster.wav")

### Exercise TTS-5: Chain STT and TTS

Transcribe an audio file using the STT endpoint and then synthesize the resulting text using the TTS endpoint. This demonstrates a full processing pipeline.

In [None]:
import requests
from pydub import AudioSegment

# exercise tts-5: chain stt and tts
with open("EoAI_sample_8.wav", "rb") as f:
    files = {"file": f}
    data = {
        "model_type": "transcribed",
        "model_params": '{"language": "lb"}'
    }
    stt_response = requests.post("http://34.90.113.55:5000/api/stt/transcribe", files=files, data=data, timeout=600)
    stt_result = stt_response.json()
    transcription_text = stt_result.get("result", {}).get("text", "")
    print("transcription:", transcription_text)

tts_payload = {"text": transcription_text.lower().strip()}
tts_response = requests.post("http://34.90.113.55:5000/api/tts/synthesize", json=tts_payload, timeout=120)
with open("chain_tts.wav", "wb") as f:
    f.write(tts_response.content)
print("saved chain_tts.wav locally")

In [None]:
# exercise tts-5: display chained tts audio
from IPython.display import Audio

Audio("chain_tts.wav")

### Exercise TTS-6: Advanced TTS Modification – Adjust Volume

Use `pydub` to modify the volume of the synthesized audio (e.g. increase by 6 dB) and then play the modified output.

In [None]:
# exercise tts-6: advanced tts modification - adjust volume using pydub

tts_audio = AudioSegment.from_wav("tts_output.wav")
# increase volume by 6 dB
louder_audio = tts_audio + 6
louder_audio.export("tts_louder.wav", format="wav")
print("exported tts_louder.wav with increased volume")

In [None]:
# exercise tts-6: display advanced tts modified audio
from IPython.display import Audio

Audio("tts_louder.wav")
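
Decibels are a logarithmic unit, so "+6 dB" is not obvious at first glance. For amplitude, the standard conversion is a ratio of 10^(dB/20), which means that 6 dB roughly doubles the amplitude. The short sketch below works through the arithmetic; the function names are made up for this example.

```python
import math

def db_to_amplitude_ratio(gain_db):
    """Convert a gain in decibels to an amplitude multiplier."""
    return 10 ** (gain_db / 20)

def amplitude_ratio_to_db(ratio):
    """Convert an amplitude multiplier to a gain in decibels."""
    return 20 * math.log10(ratio)

print(round(db_to_amplitude_ratio(6), 3))   # 1.995
print(round(db_to_amplitude_ratio(-6), 3))  # 0.501
print(round(amplitude_ratio_to_db(2), 2))   # 6.02
```

Large positive gains can push samples past the maximum value and cause clipping, so it is worth listening to the result after each change.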

# Utilities

In [None]:
import glob
from IPython.display import display, HTML, Audio

# get a list of audio files with wav and mp3 extensions
wav_files = glob.glob("*.wav")
mp3_files = glob.glob("*.mp3")
audio_files = sorted(wav_files + mp3_files)  # sort files alphabetically

rows = []  # list to hold (filename, audio display html, transcription text) tuples

for audio_file in audio_files:
    # get the base filename (without extension)
    base = audio_file.rsplit(".", 1)[0]
    txt_file = base + ".txt"  # expected corresponding transcription file
    try:
        with open(txt_file, "r", encoding="utf-8") as f:
            transcription = f.read().strip()
    except FileNotFoundError:
        transcription = "no transcription found"

    # create an embedded audio player using IPython.display.Audio
    audio_display = Audio(audio_file, embed=True)._repr_html_()
    rows.append((audio_file, audio_display, transcription))

# sort rows alphabetically by filename
rows.sort()

# build an html table to display the filenames, audio players, and corresponding transcriptions
table_html = "<table border='1' style='border-collapse: collapse; width: 100%;'>"
table_html += "<tr><th>filename</th><th>audio</th><th>reference transcription</th></tr>"
for filename, audio_disp, trans in rows:
    table_html += f"<tr><td>{filename}</td><td>{audio_disp}</td><td>{trans}</td></tr>"
table_html += "</table>"

display(HTML(table_html))

In [None]:
def word_error_rate(reference, hypothesis):
    # split both strings into lists of words (converted to lowercase to normalize)
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    # create a dp table of size (len(ref_words)+1) x (len(hyp_words)+1)
    dp = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]

    # initialize base cases: distance from empty string
    for i in range(len(ref_words) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp_words) + 1):
        dp[0][j] = j  # j insertions

    # fill the dp table using dynamic programming
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            if ref_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # no error if words are the same
            else:
                substitution = dp[i - 1][j - 1] + 1
                insertion = dp[i][j - 1] + 1
                deletion = dp[i - 1][j] + 1
                dp[i][j] = min(substitution, insertion, deletion)

    edit_distance = dp[len(ref_words)][len(hyp_words)]
    # calculate wer as edit distance divided by number of words in the reference
    wer = edit_distance / float(len(ref_words)) if ref_words else 0
    return wer

# read the reference transcription from a text file
ref_filename = "EoAI_longform_1.txt"  # update with the desired filename
with open(ref_filename, "r", encoding="utf-8") as f:
    reference_transcription = f.read().strip()

# assume the model's transcription is obtained from the stt endpoint
# note: `response` must hold an stt response for the same audio file, so rerun
# the matching stt cell first (after part ii, `response` holds tts audio instead)
model_transcription = response.json().get("result", {}).get("text", "")

print("reference transcription:")
print(reference_transcription)
print("\nmodel transcription:")
print(model_transcription)

# compute the word error rate between the reference and model transcriptions
wer = word_error_rate(reference_transcription, model_transcription)
print("\nword error rate (wer):", round(wer, 3))
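
WER compares words exactly, so `moien,` and `moien` count as different words. Stripping punctuation before scoring often gives a fairer picture. The sketch below shows the effect on a made-up sentence pair. `normalize_for_wer` and this compact `wer` function are illustrative helpers that take lists of words; they are not a replacement for a dedicated evaluation library.

```python
import re

def normalize_for_wer(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces, and split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # \w is Unicode-aware, so ë and é survive
    return text.split()

def wer(reference_words, hypothesis_words):
    """Word-level edit distance divided by the number of reference words."""
    previous = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(reference_words) if reference_words else 0.0

reference = "Gudde Moien, wéi geet et?"
hypothesis = "gudde moien wéi geet et"

raw = wer(reference.lower().split(), hypothesis.lower().split())
clean = wer(normalize_for_wer(reference), normalize_for_wer(hypothesis))
print("raw wer:", raw, "normalized wer:", clean)  # raw wer: 0.4 normalized wer: 0.0
```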

## Conclusion and Next Steps

In this notebook you:

- Experimented with a range of STT parameters to see how they influence transcription output.
- Learned how to parse and convert raw JSON outputs (e.g., to SRT format) and even how to chunk long audio files.
- Explored the TTS endpoint by synthesizing speech from text, visualizing waveforms, and applying post-processing filters.

Feel free to extend these exercises with your own experiments and modifications. Happy coding!