# Programming Assignment 6b - Speech

In this assignment, you will convert sampled sentences into audio files using text-to-speech, and then transcribe the audio files back into text using speech-to-text.

## Understanding the Pipeline

This section walks through how information transforms (and degrades) as it passes through a text-to-speech-to-text pipeline. Here's what happens:

### Step 1: Original Text
You start with structured text that has:
- Character names in ALL CAPS with colons (e.g., `BUCKINGHAM:`)
- Line breaks separating speakers
- Some grammatical structure

**Example:**
```
BUCKINGHAM:
Why, since you can make us fly to the way
Of this in the benefit of my life?

POLIXENES:
Ay, nor I do, that have heard, my title
To make his office in your highness.

ISABELLA:
That is not a little little as to be.
```

### Step 2: Text → Speech (TTS)
The TTS model reads this text aloud and creates an audio file. It:
- Ignores formatting (line breaks, colons, capitalization)
- Tries to pronounce character names naturally
- Joins everything into one continuous audio stream
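
To get a feel for what is discarded at this step, here is a minimal sketch that flattens text the way a spoken rendering roughly does. It is only an illustration: `flatten_for_speech` is a hypothetical helper, not part of `tts.py`, and it does not reproduce how the TTS model actually processes its input.

```python
import re

def flatten_for_speech(text: str) -> str:
    """Approximate the formatting that audio cannot carry (illustrative only)."""
    # Replace punctuation (except apostrophes) with spaces.
    no_punct = re.sub(r"[^\w\s']", " ", text)
    # Collapse line breaks and repeated spaces, then drop capitalization.
    return " ".join(no_punct.split()).lower()

original = (
    "BUCKINGHAM:\n"
    "Why, since you can make us fly to the way\n"
    "Of this in the benefit of my life?"
)
print(flatten_for_speech(original))
```

Comparing the printed line with the original shows that the speaker label is no longer distinguishable from the dialogue once the colon, capitals, and line breaks are gone.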

### Step 3: Speech → Text (STT)
The STT model listens to the audio and tries to transcribe what it hears. Important: it has **no knowledge of**:
- The original formatting
- That these are character names from Shakespeare
- Any context about the text

It only knows: "What English words sound like what I'm hearing?"

### Step 4: What Gets Lost
By the end of the pipeline, you'll see:
- All formatting is gone (no line breaks, character labels)
- Character names lose their ALL CAPS styling or get misrecognized ("BUCKINGHAM" might become "Buckingham" or "Biscayne Jam", or be spelled out letter-by-letter)
- Unusual names get mangled ("POLIXENES" → "Pollux knees" or "P-O-L-I-X-E-N-E-S")
- Common English words in the dialogue mostly survive

**The pedagogical point:** Text-to-speech-to-text is a **lossy process**, especially for text with unusual proper nouns and formatting. You'll see this firsthand when comparing your original sentences to the final transcription.

## Environment Setup

```
conda create -n cs124 python=3.10 -y
conda activate cs124
pip install cartesia
```

## Part 1: Convert Sampled Sentences into Audio Files

We have provided a default `sampled_sentences.json` file for you to use. If you'd like, you can optionally replace it with the sentences you sampled from your trained model in PA6a.

You should use the Cartesia API to convert your sampled sentences into audio files. You can do so by running the `tts.py` script. This script will automatically save the audio as a WAV file called `sampled_sentences_speech.wav`. Remember that you need to set your Cartesia API key in the `CARTESIA_API_KEY` environment variable, for example by running:
```
export CARTESIA_API_KEY=your_api_key
```
before running the `tts.py` script.
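
If you want a clearer error message when the key is missing, a small check like the one below can help. This is a sketch: `get_api_key` is a hypothetical helper and is not necessarily how the provided scripts read the variable.

```python
import os

def get_api_key(env=os.environ) -> str:
    """Return the Cartesia API key from the environment, or fail with a clear message."""
    # Treat a missing variable and an empty/whitespace-only value the same way.
    key = env.get("CARTESIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CARTESIA_API_KEY is not set; run `export CARTESIA_API_KEY=your_api_key` first."
        )
    return key
```

Calling `get_api_key()` in a Python shell started from the same terminal is a quick way to confirm that the `export` took effect.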

You can browse and try out different voices on the [Cartesia Playground](https://play.cartesia.ai/voices). To use a different voice, replace the voice `id` in `tts.py` with the ID of the voice you'd like to use.

## Part 2: Transcribe the Audio Files into Text

Next, you should use the Cartesia API to transcribe the audio file back into text. You can do so by running the `speech_to_text.py` script. This script will automatically save the transcription as a text file called `sampled_sentences_speech.txt`. Remember that you need to set your Cartesia API key in the `CARTESIA_API_KEY` environment variable, for example by running:
```
export CARTESIA_API_KEY=your_api_key
```
before running the `speech_to_text.py` script.

The transcribed file `sampled_sentences_speech.txt` is part of your submission and will be graded. You are also encouraged to record yourself reading the sampled sentences in a noisy environment, transcribe that recording, and compare the resulting transcription with the original sampled sentences.
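
One common way to put a number on transcription quality is word error rate (WER): the word-level edit distance between the original and the transcription, divided by the number of words in the original. The sketch below is optional and not required by the assignment; `word_error_rate` is a hypothetical helper, and it compares lowercased, whitespace-split words without stripping punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    # Guard against an empty reference.
    return prev[-1] / max(len(ref), 1)

print(word_error_rate("polixenes ay nor i do", "pollux knees ay nor i do"))
```

Here a single mangled name costs one substitution plus one insertion, so a short sentence with an unusual name can have a high WER even when every ordinary word survives.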

## Part 3: Error Analysis

Now that you've run the TTS→STT pipeline, it's time to analyze what happened to your text. Answer the following questions based on your results.

### Question 1: Error Classification

Compare your original `sampled_sentences.json` to the final `sampled_sentences_speech.txt`. Identify and describe three distinct types of errors or information loss you observe. For each type, provide a specific example from your results and explain what aspect of the TTS or STT process likely caused it.
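
To find candidate examples, it can help to list where the two texts differ at the word level. The sketch below uses Python's standard `difflib` module; `word_level_diff` is a hypothetical helper, and it assumes you have already extracted the original text from `sampled_sentences.json` into a plain string. Punctuation is left in place, so a word such as `life?` will show up as replaced by `life`.

```python
from difflib import SequenceMatcher

def word_level_diff(original: str, transcript: str):
    """Return (tag, original_words, transcript_words) for every non-matching span."""
    a = original.lower().split()
    b = transcript.lower().split()
    changes = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of "equal", "replace", "delete", "insert".
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

for change in word_level_diff("polixenes ay nor i do", "pollux knees ay nor i do"):
    print(change)
```

The output is only a starting point: deciding whether a difference is a TTS pronunciation issue, an STT recognition issue, or a formatting loss is still up to you.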

**Your answer:**

### Question 2: Model Bias and Training Data

Based on the errors you observed, what can you infer about the training data or design priorities of the speech-to-text model? Consider: What types of content does it handle well vs. poorly? What trade-offs might the model designers have made?

**Your answer:**

### Question 3: Formatting as Information

Your original text likely had formatting elements (capitalization, line breaks, punctuation, etc.) that disappeared in the final transcription. Choose one formatting element that was lost and explain (a) why current speech systems can't preserve it, and (b) what would be required to preserve it in a future system.

**Your answer:**

## Part 4: Zip and Submit

Run `bash create_assignment_zip.sh` to zip your submission and submit the zip file to Gradescope.

To recap, the submission zip should include the following files:

- `sampled_sentences_speech.txt`: the transcribed text of the audio file of your sampled sentences
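
Before uploading, you may want to confirm that the zip actually contains the required file. This is a minimal sketch: `missing_from_zip` is a hypothetical helper, and you need to pass it the path of whatever zip file `create_assignment_zip.sh` produced, since the name is not assumed here.

```python
import zipfile

REQUIRED = {"sampled_sentences_speech.txt"}

def missing_from_zip(zip_path: str, required=REQUIRED) -> set:
    """Return the required file names that are not present in the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        # Compare on base names so files stored inside a folder still count.
        names = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return set(required) - names
```

An empty set means nothing is missing.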