# Asynchronous Speech Synthesis with Amazon Polly and S3

This notebook demonstrates how to use Amazon Polly's asynchronous text-to-speech capabilities with Amazon S3 integration. You'll learn how to:

- Set up the boto3 client for Amazon Polly and S3
- Create a speech synthesis task using the `start_speech_synthesis_task` API
- Check the status of the task
- Retrieve the audio file from S3
- Handle longer text that exceeds the character limits of synchronous requests

## Prerequisites

- An AWS account with access to Amazon Polly and S3
- AWS credentials configured locally
- Python 3 with boto3 installed (current boto3 releases no longer support Python 3.6)
- An S3 bucket with appropriate permissions
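
As a rough guide to what "appropriate permissions" might mean, the sketch below builds a minimal IAM policy document covering the calls this notebook makes. The bucket name is a placeholder, and the action list is an assumption derived from the APIs used here (starting and polling Polly tasks, writing the task output to S3, checking the bucket, and downloading the result). Adjust it to your own security requirements.

```python
import json

# Placeholder bucket name; replace with your own bucket.
bucket_name = "your-bucket-name"

# Assumed minimal policy for this notebook. Treat it as a starting point,
# not as a vetted production policy.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Start asynchronous tasks and poll their status
            "Effect": "Allow",
            "Action": [
                "polly:StartSpeechSynthesisTask",
                "polly:GetSpeechSynthesisTask",
            ],
            "Resource": "*",
        },
        {
            # Object-level access: write the synthesized audio, read it back
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        },
        {
            # Bucket-level access used by the head_bucket check
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket_name}",
        },
    ],
}

print(json.dumps(policy, indent=2))
```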

Let's get started!

## Setting up the Environment

In [None]:
%%bash
pip install boto3 ipython

First, we'll import the necessary libraries and set up our AWS clients for both Polly and S3.

In [1]:
# Import required libraries
import boto3
import os
import time
import json
import uuid
from IPython.display import Audio
from urllib.parse import urlparse

# Create clients for Amazon Polly and S3
polly_client = boto3.client('polly')
s3_client = boto3.client('s3')

# Create output directory if it doesn't exist
output_dir = "audio_output"
os.makedirs(output_dir, exist_ok=True)

## Configuring S3 Bucket

Amazon Polly requires an S3 bucket to store the output of asynchronous speech synthesis. You'll need to specify a bucket name below.

**Note**: If you don't have an S3 bucket already set up, you can create one using the AWS Management Console or the boto3 client.
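
If you create the bucket from code, note that S3 expects a `CreateBucketConfiguration` with a `LocationConstraint` for every Region except `us-east-1`. A minimal sketch, assuming a hypothetical helper named `create_bucket_kwargs` and a placeholder bucket name:

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for s3_client.create_bucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

# Usage with the S3 client created earlier (not executed here):
#   s3_client.create_bucket(
#       **create_bucket_kwargs("your-bucket-name", s3_client.meta.region_name)
#   )
print(create_bucket_kwargs("your-bucket-name", "eu-west-1"))
```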

In [None]:
# Set your S3 bucket name
# Replace 'your-bucket-name' with your actual bucket name
s3_bucket = 'your-bucket-name'

# Check if the bucket exists
try:
    s3_client.head_bucket(Bucket=s3_bucket)
    print(f"S3 bucket '{s3_bucket}' is accessible.")
except Exception as e:
    print(f"Error accessing S3 bucket '{s3_bucket}': {str(e)}")
    print("\nYou need to create an S3 bucket or use an existing one before proceeding.")
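    print("You can create a bucket with the following code (outside us-east-1, also pass")
    print("CreateBucketConfiguration={'LocationConstraint': '<your-region>'}):")
    print("s3_client.create_bucket(Bucket='your-bucket-name')")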

## Understanding Asynchronous Synthesis vs. Synchronous Synthesis

Amazon Polly provides two methods for converting text to speech:

### Synchronous (SynthesizeSpeech)
- Limited to 3,000 billed characters (6,000 total); SSML tags do not count as billed characters
- Results returned directly in the API response
- Good for real-time use cases
- Simpler to implement

### Asynchronous (StartSpeechSynthesisTask)
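- Supports up to 100,000 billed characters (200,000 total); SSML tags do not count as billed characters
- Results stored in an S3 bucket
- Better for longer texts and batch processing
- Completion is detected by polling the task status or through an optional Amazon SNS notification
- Supports lexicons, like the synchronous API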

We'll focus on the asynchronous method in this notebook.
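
The choice between the two APIs can be sketched as a simple length check. The helper name `choose_polly_api` is hypothetical, and the sketch assumes plain-text input and treats `len(text)` as the character count that the limits apply to.

```python
SYNC_LIMIT = 3_000      # character limit for SynthesizeSpeech
ASYNC_LIMIT = 100_000   # character limit for StartSpeechSynthesisTask

def choose_polly_api(text):
    """Return the name of the boto3 call that fits the text length (illustrative)."""
    n = len(text)
    if n <= SYNC_LIMIT:
        return "synthesize_speech"
    if n <= ASYNC_LIMIT:
        return "start_speech_synthesis_task"
    # Too long for a single task: the text has to be split first
    return "split_text_first"

print(choose_polly_api("Hello from Polly."))
print(choose_polly_api("x" * 5_000))
```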

## Helper Functions for Asynchronous Speech Synthesis

In [2]:
def start_synthesis_task(text, voice_id, s3_bucket, s3_key, engine="neural", output_format="mp3", text_type="text"):
    """
    Start an asynchronous speech synthesis task with Amazon Polly.
    
    Parameters:
    - text: The text to convert to speech
    - voice_id: The voice to use (e.g., 'Joanna', 'Matthew')
    - s3_bucket: The S3 bucket to store the output
    - s3_key: The S3 key prefix for the output (Polly appends the task ID and file extension)
    - engine: The engine to use ('standard', 'neural', or 'long-form')
    - output_format: The output format ('mp3', 'ogg_vorbis', or 'pcm')
    - text_type: The type of input text ('text' or 'ssml')
    
    Returns:
    - Task ID for the speech synthesis task
    """
    try:
        response = polly_client.start_speech_synthesis_task(
            Text=text,
            VoiceId=voice_id,
            OutputS3BucketName=s3_bucket,
            OutputS3KeyPrefix=s3_key,
            Engine=engine,
            OutputFormat=output_format,
            TextType=text_type
        )
        task_id = response['SynthesisTask']['TaskId']
        print(f"Started speech synthesis task: {task_id}")
        return task_id
    except Exception as e:
        print(f"Error starting speech synthesis task: {str(e)}")
        return None

def get_task_status(task_id):
    """
    Check the status of a speech synthesis task.
    
    Parameters:
    - task_id: The ID of the task to check
    
    Returns:
    - Task status information
    """
    try:
        response = polly_client.get_speech_synthesis_task(TaskId=task_id)
        return response['SynthesisTask']
    except Exception as e:
        print(f"Error getting task status: {str(e)}")
        return None

def wait_for_task_completion(task_id, polling_interval=5, max_wait_time=300):
    """
    Poll until a speech synthesis task is complete or fails.
    
    Parameters:
    - task_id: The ID of the task to check
    - polling_interval: Seconds between status checks
    - max_wait_time: Maximum seconds to wait before timing out
    
    Returns:
    - Completed task information or None if it times out or fails
    """
    start_time = time.time()
    while True:
        task_status = get_task_status(task_id)
        if not task_status:
            return None
            
        status = task_status['TaskStatus']
        print(f"Task status: {status}")
        
        if status == 'completed':
            print("Task completed successfully!")
            return task_status
        elif status == 'failed':
            print(f"Task failed: {task_status.get('TaskStatusReason', 'Unknown reason')}")
            return None
        
        # Check if we've exceeded the maximum wait time
        if time.time() - start_time > max_wait_time:
            print(f"Timed out after waiting for {max_wait_time} seconds")
            return None
        
        # Wait before checking again
        time.sleep(polling_interval)

def download_file_from_s3(s3_uri, local_filename):
    """
    Download a file from S3 to the local filesystem.
    
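    Parameters:
    - s3_uri: Location of the file, either the HTTPS OutputUri returned by Polly or an s3:// URI
    - local_filename: The local path to save the file to
    
    Returns:
    - Path to the downloaded file or None if download fails
    """
    try:
        # Polly reports OutputUri as a path-style HTTPS URL
        # (https://s3.<region>.amazonaws.com/<bucket>/<key>), so the bucket is the
        # first path segment there; for s3:// URIs it is the host part.
        parsed_uri = urlparse(s3_uri)
        if parsed_uri.scheme == 's3':
            bucket = parsed_uri.netloc
            key = parsed_uri.path.lstrip('/')
        else:
            bucket, _, key = parsed_uri.path.lstrip('/').partition('/')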
        
        # Create the full local path
        local_path = os.path.join(output_dir, local_filename)
        
        # Download the file
        s3_client.download_file(bucket, key, local_path)
        print(f"Downloaded file from {s3_uri} to {local_path}")
        return local_path
    except Exception as e:
        print(f"Error downloading file from S3: {str(e)}")
        return None

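def play_audio_file(file_path, format="audio/mp3"):
    """
    Play an audio file in the notebook.
    
    Parameters:
    - file_path: Path to the audio file
    - format: The format of the audio file (currently unused)
    """
    from IPython.display import display  # local import keeps this helper self-contained

    try:
        # Display explicitly: a returned object is only rendered when it is the
        # last expression in a cell, which is not the case inside an if block.
        display(Audio(filename=file_path, autoplay=True))
    except Exception as e:
        print(f"Error playing audio file: {str(e)}")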

## Example 1: Basic Asynchronous Speech Synthesis

Let's start with a basic example of asynchronous speech synthesis. We'll send a short text to Amazon Polly and retrieve the audio from S3.

In [None]:
# Sample text to synthesize
sample_text = "This is a demonstration of Amazon Polly's asynchronous speech synthesis. With this method, we can convert longer texts to speech and store the results in S3."

# Key prefix for the S3 output (Polly appends the task ID and file extension)
s3_key_prefix = "polly/basic-demo"

# Start the synthesis task
task_id = start_synthesis_task(
    text=sample_text,
    voice_id="Joanna",
    s3_bucket=s3_bucket,
    s3_key=s3_key_prefix,
    engine="neural",
    output_format="mp3"
)

In [None]:
# Wait for the task to complete
if task_id:
    task_result = wait_for_task_completion(task_id)
    if task_result:
        # Get the S3 URI of the output file
        s3_uri = task_result['OutputUri']
        print(f"Output URI: {s3_uri}")
        
        # Download the file
        local_file = "basic_async_demo.mp3"
        downloaded_path = download_file_from_s3(s3_uri, local_file)
        
        # Play the audio
        if downloaded_path:
            play_audio_file(downloaded_path)

## Example 2: Long-form Content Synthesis

Now, let's try synthesizing a longer text that exceeds the 3,000 billed-character limit of synchronous requests. This is where asynchronous synthesis really shines.

In [None]:
# Long sample text (an excerpt about machine learning)
long_text = """
Machine learning is a subset of artificial intelligence that focuses on developing algorithms and models that enable computer systems to improve their performance on a specific task through experience. Instead of being explicitly programmed to perform a task, these systems learn from data and adapt their behavior accordingly. Machine learning has become increasingly important in today's technological landscape, powering applications ranging from voice assistants and recommendation systems to autonomous vehicles and medical diagnosis tools.

There are several types of machine learning approaches, each suited for different kinds of problems. Supervised learning involves training a model on labeled data, where the desired output is known. This approach is commonly used for classification tasks, such as spam detection or image recognition, and regression tasks, such as predicting housing prices based on various features. In supervised learning, the algorithm learns to map inputs to outputs by minimizing the error between its predictions and the actual labels.

Unsupervised learning, on the other hand, works with unlabeled data, attempting to find patterns and structures within the data on its own. Common applications include clustering, where similar data points are grouped together, and dimensionality reduction, which simplifies complex data while preserving important information. Anomaly detection, which identifies unusual patterns or outliers in data, is another important application of unsupervised learning.

Reinforcement learning is a paradigm where an agent learns to make decisions by taking actions in an environment and receiving rewards or penalties based on those actions. The goal is to develop a policy that maximizes cumulative rewards over time. This approach has been successful in training systems to play games, control robots, and optimize resource allocation in various domains.

Deep learning, a subset of machine learning, uses artificial neural networks with multiple layers (hence "deep") to model complex patterns in data. Inspired by the structure of the human brain, these networks can automatically extract hierarchical features from raw data, making them particularly effective for tasks involving unstructured data like images, audio, and text. Convolutional Neural Networks (CNNs) have revolutionized computer vision, while Recurrent Neural Networks (RNNs) and Transformers have enabled significant advancements in natural language processing.

The success of machine learning depends heavily on the quality and quantity of available data. Data preprocessing, which includes cleaning, normalization, and augmentation, is a crucial step in building effective models. Feature engineering, the process of selecting or creating relevant features from raw data, can significantly impact model performance, although deep learning models can sometimes learn useful features automatically.

Evaluating machine learning models involves metrics such as accuracy, precision, recall, and F1 score for classification problems, and mean squared error or R-squared for regression problems. Cross-validation techniques are used to assess how well a model generalizes to unseen data, helping to identify issues like overfitting, where a model performs well on training data but poorly on new data.

Challenges in machine learning include dealing with biased or insufficient data, ensuring fairness and transparency in algorithmic decision-making, managing computational resources efficiently, and interpreting complex models. As machine learning systems become more prevalent in critical applications, addressing these challenges becomes increasingly important.

The field of machine learning continues to evolve rapidly, with researchers developing new techniques and applications. Transfer learning allows models trained on one task to be repurposed for related tasks, reducing the need for large labeled datasets. Federated learning enables models to be trained across multiple devices while keeping data local, addressing privacy concerns. Quantum machine learning explores how quantum computing could enhance machine learning capabilities in the future.

In conclusion, machine learning represents a paradigm shift in how we approach problem-solving with computers. By learning from data rather than following explicit instructions, these systems can tackle complex problems that were previously unsolvable using traditional programming approaches. As the field advances, machine learning will likely continue to transform industries and create new possibilities for innovation.
"""

# Check the character count
print(f"Character count: {len(long_text)}")

In [None]:
# Generate a unique key for the S3 output
s3_key_prefix = f"polly/long-form-demo-{uuid.uuid4()}"

# Start the synthesis task using the long-form engine if available
task_id = start_synthesis_task(
    text=long_text,
    voice_id="Gregory",  # Long-form supports only certain voices (Gregory is one; Matthew is not)
    s3_bucket=s3_bucket,
    s3_key=s3_key_prefix,
    engine="long-form",  # Try using long-form engine for better quality with long text
    output_format="mp3"
)

In [None]:
# Wait for the task to complete - this might take longer due to the text length
task_result = wait_for_task_completion(task_id, max_wait_time=600) if task_id else None  # Increased timeout
local_file = "long_form_demo.mp3"

if not task_result:
    # Fall back to the neural engine if the long-form task could not be started
    # (for example, unsupported voice or Region) or did not complete
    print("Trying with neural engine instead...")
    s3_key_prefix = f"polly/long-neural-demo-{uuid.uuid4()}"
    
    task_id = start_synthesis_task(
        text=long_text,
        voice_id="Matthew",
        s3_bucket=s3_bucket,
        s3_key=s3_key_prefix,
        engine="neural",
        output_format="mp3"
    )
    task_result = wait_for_task_completion(task_id, max_wait_time=600) if task_id else None
    local_file = "long_neural_demo.mp3"

if task_result:
    # Get the location of the output file
    s3_uri = task_result['OutputUri']
    print(f"Output URI: {s3_uri}")
    
    # Download the file
    downloaded_path = download_file_from_s3(s3_uri, local_file)
    
    # Play the audio
    if downloaded_path:
        play_audio_file(downloaded_path)

## Example 3: SSML with Asynchronous Synthesis

SSML can also be used with asynchronous synthesis to add more control and expressiveness to the generated speech.
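
When SSML is assembled from arbitrary text, reserved XML characters such as `&` and `<` have to be escaped, or the request is rejected as invalid SSML. A minimal sketch using the standard library; the helper name `to_ssml` and its `pause` parameter are illustrative assumptions:

```python
from xml.sax.saxutils import escape

def to_ssml(paragraphs, pause="500ms"):
    """Wrap plain-text paragraphs in <speak>, with a break between them."""
    # escape() replaces &, < and > with XML entities
    parts = [escape(p.strip()) for p in paragraphs if p.strip()]
    separator = f'<break time="{pause}"/>'
    return "<speak>" + separator.join(parts) + "</speak>"

ssml = to_ssml(["Terms & conditions apply.", "Prices are < $5."])
print(ssml)
```

The resulting string can be passed to `start_synthesis_task` with `text_type="ssml"`.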

In [None]:
# SSML text with various enhancements
# Note: <emphasis> and the pitch attribute of <prosody> are not supported by
# neural voices, so this example uses the standard engine. The neural-only
# <amazon:domain> wrapper is omitted for the same reason.
ssml_text = """<speak>
    Welcome to our exploration of asynchronous speech synthesis with Amazon Polly.
    
    <break time="1s"/>
    
    With SSML, we can control various aspects of speech, such as:
    
    <break time="500ms"/>
    
    <prosody rate="slow" pitch="low">Speaking slower and with a lower pitch.</prosody>
    
    <break time="500ms"/>
    
    <prosody rate="fast" pitch="high">Or speaking faster and with a higher pitch!</prosody>
    
    <break time="1s"/>
    
    We can also add <emphasis level="strong">emphasis</emphasis> to certain words.
    
    <break time="500ms"/>
    
    <say-as interpret-as="characters">SSML</say-as> gives us precise control over pronunciation.
    
    <break time="500ms"/>
    
    For example, we can say dates like <say-as interpret-as="date" format="mdy">12-25-2025</say-as> or 
    spell out abbreviations like <say-as interpret-as="characters">AWS</say-as>.
    
    <break time="1s"/>
    
    And with the asynchronous API, we can process much longer content than would be possible with 
    the synchronous API, while still maintaining all these controls over speech quality and style.
</speak>"""

# Generate a unique key prefix for the S3 output
s3_key_prefix = f"polly/ssml-demo-{uuid.uuid4()}"

# Start the synthesis task
task_id = start_synthesis_task(
    text=ssml_text,
    voice_id="Joanna",
    s3_bucket=s3_bucket,
    s3_key=s3_key_prefix,
    engine="standard",  # Standard engine supports the full set of SSML tags used above
    output_format="mp3",
    text_type="ssml"  # Specify SSML text type
)

In [None]:
# Wait for the task to complete
if task_id:
    task_result = wait_for_task_completion(task_id)
    if task_result:
        # Get the S3 URI of the output file
        s3_uri = task_result['OutputUri']
        print(f"Output URI: {s3_uri}")
        
        # Download the file
        local_file = "ssml_async_demo.mp3"
        downloaded_path = download_file_from_s3(s3_uri, local_file)
        
        # Play the audio
        if downloaded_path:
            play_audio_file(downloaded_path)

## Example 4: Managing Multiple Tasks

In a production environment, you might need to generate multiple audio files in parallel. Here's how to manage multiple synthesis tasks.

In [None]:
# List of short texts to synthesize
texts = [
    "Welcome to our application. We're glad you're here!",
    "Your account has been successfully created. You can now access all features.",
    "Thank you for your purchase. Your order will be processed shortly.",
    "We've received your feedback. It helps us improve our service."
]

# List to store task IDs
task_ids = []

# Start multiple tasks
for i, text in enumerate(texts):
    s3_key_prefix = f"polly/multi-demo-{i}-{uuid.uuid4()}"
    task_id = start_synthesis_task(
        text=text,
        voice_id="Joanna",
        s3_bucket=s3_bucket,
        s3_key=s3_key_prefix,
        engine="neural"
    )
    if task_id:
        task_ids.append(task_id)
    
print(f"Started {len(task_ids)} tasks")

In [None]:
# Monitor all tasks
task_results = {}

for task_id in task_ids:
    print(f"\nChecking task: {task_id}")
    result = wait_for_task_completion(task_id)
    if result:
        task_results[task_id] = result

print(f"\nCompleted {len(task_results)} out of {len(task_ids)} tasks")
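
Waiting on tasks one after another works, but a single loop that checks every pending task on each pass picks up completions in whatever order they occur. A minimal sketch, assuming a `get_status` callable with the same return shape as `get_task_status` above (a dict containing `TaskStatus`, or `None` on error). The function name `wait_for_all` and its parameters are hypothetical.

```python
import time

def wait_for_all(task_ids, get_status, polling_interval=5, max_wait_time=300, sleep=time.sleep):
    """Poll all tasks together; return {task_id: task_info} for completed tasks."""
    pending = set(task_ids)
    completed = {}
    deadline = time.monotonic() + max_wait_time
    while pending:
        for task_id in list(pending):
            info = get_status(task_id)
            # Lookup errors (None) and failed tasks are finished without a result
            if info is None or info["TaskStatus"] == "failed":
                pending.discard(task_id)
            elif info["TaskStatus"] == "completed":
                completed[task_id] = info
                pending.discard(task_id)
        if not pending or time.monotonic() >= deadline:
            break
        sleep(polling_interval)
    return completed

# Demo with a fake status function: each task completes on its second check
_calls = {}

def fake_status(task_id):
    _calls[task_id] = _calls.get(task_id, 0) + 1
    state = "completed" if _calls[task_id] >= 2 else "inProgress"
    return {"TaskStatus": state, "TaskId": task_id}

results = wait_for_all(["a", "b"], fake_status, sleep=lambda seconds: None)
print(sorted(results))
```

In this notebook the call would be `task_results = wait_for_all(task_ids, get_task_status)`.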

In [None]:
# Download and display results
downloaded_files = []

for task_id, result in task_results.items():
    s3_uri = result['OutputUri']
    local_file = f"multi_task_{task_id}.mp3"
    downloaded_path = download_file_from_s3(s3_uri, local_file)
    if downloaded_path:
        downloaded_files.append({
            'path': downloaded_path,
            'task_id': task_id
        })

# Play the first audio file as an example
if downloaded_files:
    play_audio_file(downloaded_files[0]['path'])

## Best Practices for Working with Amazon Polly Tasks

When working with Amazon Polly's asynchronous speech synthesis, consider these best practices:

1. **Efficient S3 Organization**:
   - Use meaningful key prefixes to organize your audio files
   - Consider a folder structure that separates by project, language, or date
   - Set up appropriate S3 lifecycle policies for temporary audio files

2. **Task Management**:
   - Store task IDs in a database for long-running tasks
   - Implement exponential backoff for polling task status
   - Consider using AWS Lambda and Step Functions for serverless processing

3. **Error Handling**:
   - Always check task status before attempting to access results
   - Have fallback strategies (e.g., switching engines as demonstrated)
   - Implement appropriate retry logic for transient errors

4. **Batch Processing**:
   - For very large texts, split into meaningful segments
   - Process multiple segments in parallel for faster throughput
   - Consider using a queue system for managing large volumes of synthesis tasks

5. **Cost Management**:
   - Monitor character usage to stay within budget
   - Use the standard engine, which is priced lower per character than neural, where top voice quality is not required
   - Cache frequently used audio files
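
The exponential backoff suggested under Task Management can be sketched as a delay generator. The base delay, growth factor, cap, and jitter strategy below are illustrative assumptions, not values recommended by AWS.

```python
import random

def backoff_delays(base=1.0, factor=2.0, max_delay=30.0, jitter=True, rng=random.random):
    """Yield an endless series of polling delays that grow exponentially up to max_delay."""
    delay = base
    while True:
        # With jitter, pick a random delay in [0, delay) to spread out concurrent pollers
        yield delay * rng() if jitter else delay
        delay = min(delay * factor, max_delay)

# Deterministic preview without jitter
delays = backoff_delays(jitter=False)
print([next(delays) for _ in range(7)])
# → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

To use it in `wait_for_task_completion`, create the generator before the loop and replace `time.sleep(polling_interval)` with `time.sleep(next(delays))`.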

## Conclusion

In this notebook, we've explored Amazon Polly's asynchronous speech synthesis capabilities with S3 integration. We've learned how to:

- Set up asynchronous speech synthesis tasks
- Check task status and handle task completion
- Process longer texts that exceed synchronous API limits
- Use SSML with asynchronous tasks for greater speech control
- Manage multiple synthesis tasks in parallel

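These techniques allow you to build scalable text-to-speech applications that handle anything from short notifications to full articles; content beyond the per-task character limit, such as entire books, can be split into segments and synthesized as multiple tasks.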
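
Splitting long content at sentence boundaries could look like the following minimal sketch. The function name `split_text`, the sentence-splitting regex, and the default limit are assumptions for illustration. A single sentence longer than the limit is hard-cut.

```python
import re

def split_text(text, max_chars=100_000):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut any single sentence that is itself longer than the limit
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(split_text("One. Two. Three. Four.", max_chars=10))
# → ['One. Two.', 'Three.', 'Four.']
```

Each chunk can then be submitted with its own key prefix, as in the multiple-task example, and the resulting audio files played or concatenated in order.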