# Getting Started with Amazon Polly: Basic Speech Synthesis

This notebook demonstrates how to use the Amazon Polly service to synthesize speech using different engines, voices, and configurations. You'll learn how to:

- Set up the boto3 client for Amazon Polly
- Generate speech with different engines (Standard, Neural, Generative, Long-form)
- Use various voices and languages
- Save audio files locally in different formats
- Apply basic SSML (Speech Synthesis Markup Language) enhancements

## Prerequisites

- An AWS account with access to Amazon Polly
- AWS credentials configured locally
- Python 3.9+ with boto3 installed (current boto3 releases no longer support Python 3.6)

Let's get started!

## Setting up the Environment

First, we'll import the necessary libraries and set up our AWS client.

In [None]:
%%bash
pip install boto3 ipython

In [None]:
# Import required libraries
import boto3
import os
from IPython.display import Audio
import time
import json

# Create a client for Amazon Polly
polly_client = boto3.client('polly')

# Create output directory if it doesn't exist
output_dir = "audio_output"
os.makedirs(output_dir, exist_ok=True)

Let's create some helper functions to synthesize speech and save audio files.

In [None]:
def synthesize_speech(text, voice_id, engine="standard", output_format="mp3", text_type="text"):
    """
    Synthesize speech using Amazon Polly and return the audio stream.
    
    Parameters:
    - text: The text to convert to speech
    - voice_id: The voice to use (e.g., 'Joanna', 'Matthew')
    - engine: The engine to use ('standard', 'neural', 'long-form', or 'generative')
    - output_format: The output format ('mp3', 'ogg_vorbis', or 'pcm')
    - text_type: The type of input text ('text' or 'ssml')
    
    Returns:
    - Audio stream
    """
    try:
        response = polly_client.synthesize_speech(
            Text=text,
            VoiceId=voice_id,
            Engine=engine,
            OutputFormat=output_format,
            TextType=text_type
        )
        return response['AudioStream'].read()
    except Exception as e:
        print(f"Error synthesizing speech: {str(e)}")
        return None

def save_audio_file(audio_data, filename):
    """
    Save audio data to a file.
    
    Parameters:
    - audio_data: The audio data to save
    - filename: The name of the file to save to
    """
    if audio_data:
        file_path = os.path.join(output_dir, filename)
        try:
            with open(file_path, 'wb') as file:
                file.write(audio_data)
            print(f"Audio saved to {file_path}")
        except Exception as e:
            print(f"Error saving audio file: {str(e)}")

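def play_audio(audio_data, format="audio/mp3"):
    """
    Play audio data in the notebook.
    
    Parameters:
    - audio_data: Encoded audio bytes (e.g., MP3 or OGG) to play
    - format: MIME type of the audio; informational only, it is not
      passed to IPython's Audio

    Returns:
    - An IPython Audio display object, or None if there is no audio data
    """
    if audio_data:
        # `rate` only applies to raw sample arrays, so it is omitted for encoded bytes
        return Audio(audio_data, autoplay=True)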
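
The `synthesize_speech` helper above reads the whole `AudioStream` into memory, which is fine for short clips. For larger responses, copying the stream to disk in chunks may be preferable. The following is a minimal sketch, assuming the stream object exposes `read()` and `close()` (as botocore's `StreamingBody` does); `save_audio_stream` is a hypothetical helper name, not part of boto3.

```python
import os
import shutil
from contextlib import closing


def save_audio_stream(stream, file_path, chunk_size=64 * 1024):
    """Copy a file-like audio stream to disk in chunks; return bytes written.

    `stream` is assumed to be file-like with read() and close(), such as
    the object found under response['AudioStream'].
    """
    # closing() ensures the stream is closed even if the copy fails
    with closing(stream), open(file_path, "wb") as out_file:
        shutil.copyfileobj(stream, out_file, chunk_size)
    return os.path.getsize(file_path)
```

With a real response, this could be called as `save_audio_stream(response['AudioStream'], os.path.join(output_dir, "clip.mp3"))` instead of calling `.read()` first.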

## Understanding Amazon Polly Engines

Amazon Polly offers four different engines for speech synthesis:

1. **Standard Engine**: Uses concatenative synthesis technology. Good for applications that need quick responses and cost efficiency.

2. **Neural Engine**: Uses deep learning technology to create more natural and human-like speech. Better for applications where speech quality is important.

3. **Long-Form Engine**: Optimized for longer content, providing better prosody and more natural pauses.

4. **Generative Engine**: Offers the most human-like, emotionally engaged, and adaptive conversational voices, available through the Amazon Polly console as well as the API used in this notebook.

Let's examine the available voices in Amazon Polly.

In [None]:
# Get the list of available voices
response = polly_client.describe_voices()

# Create lists to store voices by engine type
standard_voices = []
neural_voices = []
long_form_voices = []
generative_voices = []

# Categorize voices by supported engine
for voice in response['Voices']:
    voice_info = {
        'Id': voice['Id'],
        'LanguageCode': voice['LanguageCode'],
        'Gender': voice['Gender']
    }
    
    supported_engines = voice.get('SupportedEngines', [])
    
    if 'standard' in supported_engines:
        standard_voices.append(voice_info)
    
    if 'neural' in supported_engines:
        neural_voices.append(voice_info)
        
    if 'long-form' in supported_engines:
        long_form_voices.append(voice_info)
    
    if 'generative' in supported_engines:
        generative_voices.append(voice_info)

print(f"Available Standard Voices: {len(standard_voices)}")
print(f"Available Neural Voices: {len(neural_voices)}")
print(f"Available Long-form Voices: {len(long_form_voices)}")
print(f"Available Generative Voices: {len(generative_voices)}")

# Show the first 5 neural and generative voices as examples
print("\nSample Neural Voices:")
for voice in neural_voices[:5]:
    print(f"ID: {voice['Id']}, Language: {voice['LanguageCode']}, Gender: {voice['Gender']}")

print("\nSample Generative Voices:")
for voice in generative_voices[:5]:
    print(f"ID: {voice['Id']}, Language: {voice['LanguageCode']}, Gender: {voice['Gender']}")
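
The cell above uses a single `describe_voices` call. The response can carry a `NextToken` when the result is paginated, so a loop that follows the token is a safer way to collect every voice. This is a minimal sketch, assuming `client` behaves like a boto3 Polly client; `list_all_voices` is a hypothetical helper name.

```python
def list_all_voices(client, **filters):
    """Collect every voice from describe_voices, following NextToken.

    `client` is assumed to behave like a boto3 Polly client; `filters`
    are passed through unchanged (for example Engine='neural').
    """
    voices = []
    params = dict(filters)
    while True:
        page = client.describe_voices(**params)
        voices.extend(page.get("Voices", []))
        token = page.get("NextToken")
        if not token:
            return voices
        # Ask for the next page on the following iteration
        params["NextToken"] = token
```

For example, `list_all_voices(polly_client, Engine="neural")` would return only the voices that support the neural engine.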

## Example 1: Standard Engine Speech Synthesis

Let's start by synthesizing speech using the standard engine with a few different voices.

In [None]:
# Sample text to synthesize
sample_text = "Hello, welcome to this demonstration of Amazon Polly. This is the standard engine."

# Example with US English female voice (Joanna)
standard_audio_joanna = synthesize_speech(
    text=sample_text,
    voice_id="Joanna",
    engine="standard",
    output_format="mp3",
    text_type="text"
)

# Save the audio
save_audio_file(standard_audio_joanna, "standard_joanna.mp3")

# Play the audio
play_audio(standard_audio_joanna)

In [None]:
# Example with US English male voice (Matthew)
standard_audio_matthew = synthesize_speech(
    text=sample_text,
    voice_id="Matthew",
    engine="standard",
    output_format="mp3",
    text_type="text"
)

# Save the audio
save_audio_file(standard_audio_matthew, "standard_matthew.mp3")

# Play the audio
play_audio(standard_audio_matthew)

## Example 2: Neural Engine Speech Synthesis

Now, let's try the neural engine for higher quality speech.

In [None]:
# Sample text to synthesize
sample_text = "Hello, welcome to this demonstration of Amazon Polly. This is the neural engine, which produces more natural-sounding speech."

# Example with US English female voice (Joanna)
neural_audio_joanna = synthesize_speech(
    text=sample_text,
    voice_id="Joanna",
    engine="neural",
    output_format="mp3",
    text_type="text"
)

# Save the audio
save_audio_file(neural_audio_joanna, "neural_joanna.mp3")

# Play the audio
play_audio(neural_audio_joanna)

In [None]:
# Example with US English male voice (Matthew)
neural_audio_matthew = synthesize_speech(
    text=sample_text,
    voice_id="Matthew",
    engine="neural",
    output_format="mp3",
    text_type="text"
)

# Save the audio
save_audio_file(neural_audio_matthew, "neural_matthew.mp3")

# Play the audio
play_audio(neural_audio_matthew)

## Example 3: Generative Engine Speech Synthesis

Now, let's try the generative engine, which offers the highest-quality speech.

In [None]:
# Sample text to synthesize
sample_text = "Hello, welcome to this demonstration of Amazon Polly. This is the generative engine, which produces the most natural-sounding speech."

# Example with US English female voice (Joanna)
generative_audio_joanna = synthesize_speech(
    text=sample_text,
    voice_id="Joanna",
    engine="generative",
    output_format="mp3",
    text_type="text"
)

# Save the audio
save_audio_file(generative_audio_joanna, "generative_joanna.mp3")

# Play the audio
play_audio(generative_audio_joanna)

In [None]:
# Example with US English male voice (Stephen)
generative_audio_stephen = synthesize_speech(
    text=sample_text,
    voice_id="Stephen",
    engine="generative",
    output_format="mp3",
    text_type="text"
)

# Save the audio
save_audio_file(generative_audio_stephen, "generative_stephen.mp3")

# Play the audio
play_audio(generative_audio_stephen)

## Example 4: Different Output Formats

Amazon Polly supports multiple output formats. Let's try a few.

In [None]:
# Sample text
sample_text = "This is a demonstration of different output formats in Amazon Polly."

# MP3 format (default)
mp3_audio = synthesize_speech(
    text=sample_text,
    voice_id="Joanna",
    output_format="mp3"
)
save_audio_file(mp3_audio, "sample_mp3.mp3")

# OGG format
ogg_audio = synthesize_speech(
    text=sample_text,
    voice_id="Joanna",
    output_format="ogg_vorbis"
)
save_audio_file(ogg_audio, "sample_ogg.ogg")

# PCM format
pcm_audio = synthesize_speech(
    text=sample_text,
    voice_id="Joanna",
    output_format="pcm"
)
save_audio_file(pcm_audio, "sample_pcm.pcm")

print("All formats generated and saved.")
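
The `.pcm` file saved above holds raw samples with no header, so most audio players will not open it directly. One option is to wrap the bytes in a WAV container with the standard-library `wave` module. The sketch below assumes Polly's documented PCM layout (16-bit signed, little-endian, mono) and the default 16 kHz sample rate, since the helper in this notebook does not set `SampleRate`; `pcm_to_wav` is a hypothetical helper name.

```python
import wave


def pcm_to_wav(pcm_data, wav_path, sample_rate=16000):
    """Wrap raw PCM bytes in a WAV container so standard players can open them.

    Assumes 16-bit signed little-endian mono samples; sample_rate must
    match the rate used in the synthesis request.
    """
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_data)
    return wav_path
```

If `pcm_audio` is not `None`, `pcm_to_wav(pcm_audio, os.path.join(output_dir, "sample_pcm.wav"))` produces a playable file alongside the raw one.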

## Example 5: Multilingual Support

Amazon Polly supports many languages. Let's try a few examples.

In [None]:
# Spanish
spanish_text = "Hola, esto es una demostración de Amazon Polly en español."
spanish_audio = synthesize_speech(
    text=spanish_text,
    voice_id="Lupe", # Spanish voice
    engine="neural" if any(voice['Id'] == 'Lupe' for voice in neural_voices) else "standard"
)
save_audio_file(spanish_audio, "spanish_demo.mp3")
play_audio(spanish_audio)

In [None]:
# French
french_text = "Bonjour, c'est une démonstration d'Amazon Polly en français."
french_audio = synthesize_speech(
    text=french_text,
    voice_id="Lea", # French voice; the Id is spelled without the accent ('Lea', not 'Léa')
    engine="neural" if any(voice['Id'] == 'Lea' for voice in neural_voices) else "standard"
)
save_audio_file(french_audio, "french_demo.mp3")
play_audio(french_audio)

In [None]:
# German
german_text = "Hallo, dies ist eine Demonstration von Amazon Polly auf Deutsch."
german_audio = synthesize_speech(
    text=german_text,
    voice_id="Vicki", # German voice
    engine="neural" if any(voice['Id'] == 'Vicki' for voice in neural_voices) else "standard"
)
save_audio_file(german_audio, "german_demo.mp3")
play_audio(german_audio)
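
The three cells above repeat the same "neural if supported, otherwise standard" check. That logic can be factored into one function that looks up a voice by its `Id` (the ASCII identifier, such as `Lea`) and returns the first supported engine from a preference list. This is a minimal sketch; `pick_engine` is a hypothetical helper, and `voices` is assumed to be a list shaped like `describe_voices()['Voices']`.

```python
def pick_engine(voice_id, voices, preferred=("neural", "standard")):
    """Return the first engine in `preferred` that the voice supports.

    `voices` is assumed to be a list of dicts with 'Id' and
    'SupportedEngines' keys. Returns None when the voice is unknown or
    supports none of the preferred engines.
    """
    for voice in voices:
        if voice["Id"] == voice_id:
            supported = voice.get("SupportedEngines", [])
            for engine in preferred:
                if engine in supported:
                    return engine
            return None
    return None
```

With the response from the earlier voice listing, `pick_engine("Vicki", response["Voices"])` could replace the inline conditional; a `None` result signals that the voice should not be used with those engines.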

## Example 6: Using SSML

Speech Synthesis Markup Language (SSML) gives you more control over how Amazon Polly generates speech. Let's see some examples.

In [None]:
# Basic SSML with a pause and prosody (rate and volume) changes
ssml_text = """<speak>
    Hello! <break time='1s'/> Welcome to Amazon Polly. 
    This is a demonstration of SSML, which allows for <prosody rate='slow'>slower speech</prosody> 
    or <prosody rate='fast'>faster speech</prosody>, and even 
    <prosody volume='loud'>loud volume</prosody> or <prosody volume='soft'>soft volume</prosody>.
</speak>"""

ssml_audio = synthesize_speech(
    text=ssml_text,
    voice_id="Joanna",
    engine="neural",
    text_type="ssml"
)
save_audio_file(ssml_audio, "ssml_demo.mp3")
play_audio(ssml_audio)
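
Because SSML is XML, characters such as `&` and `<` in the spoken text need to be escaped before they are placed inside `<speak>`; otherwise the request is likely to be rejected as invalid SSML. The sketch below builds a small SSML document from plain text using the standard library. `build_ssml` and its parameters are hypothetical names for illustration, and the `rate` and `pause_after` values are passed through without validation.

```python
from xml.sax.saxutils import escape


def build_ssml(text, rate=None, pause_after=None):
    """Wrap plain text in <speak>, escaping XML-reserved characters.

    `rate` (e.g. 'slow') and `pause_after` (e.g. '500ms') are optional
    and are inserted into <prosody> and <break> as given.
    """
    body = escape(text)  # converts &, < and > to XML entities
    if rate:
        body = f"<prosody rate='{rate}'>{body}</prosody>"
    if pause_after:
        body += f"<break time='{pause_after}'/>"
    return f"<speak>{body}</speak>"
```

The result can be passed to `synthesize_speech` with `text_type="ssml"`, for example `build_ssml("Questions & answers", rate="slow")`.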

In [None]:
# SSML with phonetic pronunciation
ssml_phonetic = """<speak>
    You say tomato, I say <phoneme alphabet='ipa' ph='təˈmeɪtoʊ'>tomato</phoneme>.
    Let's call the whole thing off!
</speak>"""

phonetic_audio = synthesize_speech(
    text=ssml_phonetic,
    voice_id="Joanna",
    engine="neural",
    text_type="ssml"
)
save_audio_file(phonetic_audio, "ssml_phonetic.mp3")
play_audio(phonetic_audio)

In [None]:
# SSML with neural speaking styles (only works with certain neural voices)
ssml_news_style = """<speak>
    <amazon:domain name="news">
    In today's news, researchers have discovered a breakthrough in quantum computing 
    that could revolutionize the field of artificial intelligence. 
    The new technology, developed by an international team of scientists, 
    is expected to accelerate machine learning algorithms by orders of magnitude.
    </amazon:domain>
</speak>"""

# synthesize_speech reports errors itself and returns None, so check the result
news_style_audio = synthesize_speech(
    text=ssml_news_style,
    voice_id="Matthew",  # Make sure to use a voice that supports news style
    engine="neural",
    text_type="ssml"
)
if news_style_audio:
    save_audio_file(news_style_audio, "ssml_news_style.mp3")
else:
    print("Note: News style might not be supported by this voice or in your region.")

# Last expression in the cell, so the notebook renders the player
play_audio(news_style_audio)

## Example 7: Long-Form Engine

The Long-Form engine is optimized for longer content like paragraphs or articles.

In [None]:
# A longer sample text
long_text = """
Artificial intelligence is transforming our world in remarkable ways. From healthcare to transportation, 
AI systems are being deployed to solve complex problems and improve efficiency. 
In healthcare, AI algorithms can detect diseases from medical images with accuracy rivaling that of human experts. 
In transportation, self-driving vehicles are becoming increasingly sophisticated, promising to reduce accidents and congestion. 
In finance, AI is used to detect fraudulent transactions and optimize investment portfolios. 
Despite these advances, there are important ethical considerations around AI, including privacy concerns, 
bias in algorithms, and the potential impact on employment. As society continues to adopt AI technologies, 
it will be crucial to address these challenges while maximizing the benefits of this powerful technology.
"""

# Pick an English long-form voice if one is available (the sample text is English)
english_long_form = [v for v in long_form_voices if v['LanguageCode'].startswith('en')]

final_audio = None
if english_long_form:
    selected_voice = english_long_form[0]['Id']
    # synthesize_speech returns None on failure instead of raising
    final_audio = synthesize_speech(
        text=long_text,
        voice_id=selected_voice,
        engine="long-form",
        output_format="mp3"
    )
    if final_audio:
        save_audio_file(final_audio, "long_form_demo.mp3")
else:
    print("No English long-form voices detected.")

if not final_audio:
    print("Falling back to neural engine...")
    final_audio = synthesize_speech(
        text=long_text,
        voice_id="Joanna",  # Using a common neural voice
        engine="neural",
        output_format="mp3"
    )
    save_audio_file(final_audio, "neural_long_text.mp3")

# Last expression in the cell, so the notebook renders the player
play_audio(final_audio)

## Performance and Pricing Considerations

When using Amazon Polly, keep the following in mind:

1. **Character Limits**:
   - The synchronous `synthesize_speech` API accepts up to 3,000 billed characters (6,000 characters in total); SSML tags do not count as billed characters
   - For longer text, use the asynchronous `start_speech_synthesis_task` API

2. **Engine Choice**:
   - **Standard**: Lower cost, faster processing
   - **Neural**: Higher quality speech, higher cost than Standard
   - **Generative**: Most natural-sounding speech, priced above Neural
   - **Long-Form**: Best for long content, higher cost

3. **Format Considerations**:
   - MP3: Compressed format, good quality/size balance
   - OGG: Alternative compressed format
   - PCM: Uncompressed raw samples with no file header, so files are larger and need a WAV wrapper or a raw-audio player

4. **Pricing**: Amazon Polly charges per character processed, with different rates based on the engine:
   - Standard: Lower cost per million characters
   - Neural: Higher cost per million characters
   - First million characters per month may be free under the AWS Free Tier
   - For current pricing, check the [Amazon Polly pricing page](https://aws.amazon.com/polly/pricing/)
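
The character limit on `synthesize_speech` noted above means longer plain text has to be split before it is sent synchronously. The following is a minimal sketch that breaks text at sentence boundaries where possible. `split_text` is a hypothetical helper; it counts plain characters, is meant for plain text rather than SSML, and uses a simple punctuation-based sentence pattern that will not suit every language.

```python
import re


def split_text(text, max_chars=3000):
    """Split text into chunks of at most max_chars, preferring sentence ends.

    A single sentence longer than max_chars is hard-split at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any sentence that exceeds the limit on its own
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `synthesize_speech` in turn and the resulting audio saved as separate files. For long documents, the asynchronous API mentioned above is usually the simpler route.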

## Conclusion

In this notebook, we've explored the basics of Amazon Polly's speech synthesis capabilities:

- Using the Standard, Neural, and Generative engines
- Working with different voices and languages
- Generating audio in various formats
- Enhancing speech output using SSML
- Trying the Long-Form engine for longer content

For longer texts or batch processing, check out the next notebook which covers asynchronous synthesis using `start_speech_synthesis_task` and retrieving results from S3.

The audio files generated in this notebook have been saved to the `audio_output` directory for your reference.