# Exploring Amazon Polly

In these exercises you'll explore some of the different functionality available in [Amazon Polly](https://aws.amazon.com/polly/), a service to turn text into lifelike speech.

## Try Polly out through the AWS Console

Perhaps the easiest way to start experimenting with Polly is through the AWS console, where a Text-to-Speech demo utility is already provided.

You can search for "Polly" in the AWS console search bar, or simply navigate to: [https://console.aws.amazon.com/polly/home](https://console.aws.amazon.com/polly/home).

![](images/01-polly-console.png "Screenshot of Amazon Polly console with TTS utility")

From this screen, you can experiment with different text and settings and simply press "Listen to speech" to hear the result.

Each language/region setting offers its own set of voices, and each voice has different capabilities, as listed in [this table in the Polly Developer Guide](https://docs.aws.amazon.com/polly/latest/dg/voicelist.html).

▶️ **Select** `English, British` and the `Amy` voice. Can you hear the difference between the 'Neural' and 'Standard' voices?

See the [Neural TTS page](https://docs.aws.amazon.com/polly/latest/dg/NTTS-main.html) in the Polly developer guide for more information about how NTTS voices differ from "standard" voices.
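If you'd rather check engine support from code than from the table, the sketch below filters the output of Polly's `DescribeVoices` API. The helper name `voices_for_engine` is ours (not part of any SDK), and it assumes you pass in a boto3 Polly client created with `boto3.client("polly")`.

```python
def voices_for_engine(polly, language_code="en-GB", engine="neural"):
    """Return sorted IDs of voices in `language_code` that support `engine`.

    `polly` is assumed to be a boto3 Polly client.
    """
    voices = []
    kwargs = {"LanguageCode": language_code}
    while True:
        resp = polly.describe_voices(**kwargs)
        voices.extend(resp["Voices"])
        # Follow pagination if the service returns a continuation token
        token = resp.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return sorted(
        v["Id"] for v in voices if engine in v.get("SupportedEngines", [])
    )


# Example usage (requires AWS credentials):
#   import boto3
#   print(voices_for_engine(boto3.client("polly"), "en-GB", "neural"))
```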

## Using the Polly APIs

To integrate Polly with applications, we can synthesize speech through the **APIs**: [SynthesizeSpeech](https://docs.aws.amazon.com/polly/latest/dg/API_SynthesizeSpeech.html) for synchronous processing of short text, or [StartSpeechSynthesisTask](https://docs.aws.amazon.com/polly/latest/dg/API_StartSpeechSynthesisTask.html) for asynchronous processing of longer inputs.

There's no need to handle low-level signing of these requests: The AWS SDK for your programming language will likely have bindings, as shown below with [boto3, the AWS SDK for Python](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/polly.html#Polly.Client.synthesize_speech).

▶️ **Run** the code cell below by selecting it and pressing the 'play' button in the toolbar, or `Shift+Enter` on the keyboard. An audio widget should be displayed and auto-play the result.

```python
import boto3
from ipywidgets import Audio

polly = boto3.client("polly")

resp = polly.synthesize_speech(
    # "standard"|"neural"
    Engine="neural",
    # 'arb'|'cmn-CN'|'cy-GB'|'da-DK'|'de-DE'|'en-AU'|'en-GB'|'en-GB-WLS'|'en-IN'|'en-US'|'es-ES'
    # |'es-MX'|'es-US'|'fr-CA'|'fr-FR'|'is-IS'|'it-IT'|'ja-JP'|'hi-IN'|'ko-KR'|'nb-NO'|'nl-NL'
    # |'pl-PL'|'pt-BR'|'pt-PT'|'ro-RO'|'ru-RU'|'sv-SE'|'tr-TR'
    LanguageCode="en-GB",
    #LexiconNames=[],
    # The widget we use below supports "mp3" or "ogg_vorbis", but not "json"|"pcm":
    OutputFormat="mp3",
    #SampleRate='string',
    # "sentence"|"ssml"|"viseme"|"word"
    #SpeechMarkTypes=[],
    Text="""Easy as Py!""",
    # "ssml"|"text"
    TextType="text",
    # 'Aditi'|'Amy'|'Astrid'|'Bianca'|'Brian'|'Camila'|'Carla'|'Carmen'|'Celine'|'Chantal'
    # |'Conchita'|'Cristiano'|'Dora'|'Emma'|'Enrique'|'Ewa'|'Filiz'|'Geraint'|'Giorgio'|'Gwyneth'
    # |'Hans'|'Ines'|'Ivy'|'Jacek'|'Jan'|'Joanna'|'Joey'|'Justin'|'Karl'|'Kendra'|'Kevin'
    # |'Kimberly'|'Lea'|'Liv'|'Lotte'|'Lucia'|'Lupe'|'Mads'|'Maja'|'Marlene'|'Mathieu'|'Matthew'
    # |'Maxim'|'Mia'|'Miguel'|'Mizuki'|'Naja'|'Nicole'|'Olivia'|'Penelope'|'Raveena'|'Ricardo'
    # |'Ruben'|'Russell'|'Salli'|'Seoyeon'|'Takumi'|'Tatyana'|'Vicki'|'Vitoria'|'Zeina'|'Zhiyu'
    VoiceId="Amy",
)

Audio.from_file(resp["AudioStream"], loop=False)
```
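For longer inputs, the asynchronous StartSpeechSynthesisTask API mentioned above writes its result to Amazon S3 instead of returning an audio stream. The following is a minimal sketch rather than a production pattern: the helper name `synthesize_long_text`, the polling interval, and the bucket you pass in are all assumptions, and your credentials would need permission to write to that bucket.

```python
import time


def synthesize_long_text(polly, text, bucket, voice_id="Amy", engine="neural",
                         poll_seconds=5):
    """Start an asynchronous synthesis task and wait for it to finish.

    `polly` is assumed to be a boto3 Polly client and `bucket` the name of an
    S3 bucket you own. Returns the location of the generated audio file.
    """
    task = polly.start_speech_synthesis_task(
        Engine=engine,
        OutputFormat="mp3",
        OutputS3BucketName=bucket,
        Text=text,
        VoiceId=voice_id,
    )["SynthesisTask"]

    # Poll until the task leaves its in-flight states
    while task["TaskStatus"] in ("scheduled", "inProgress"):
        time.sleep(poll_seconds)
        task = polly.get_speech_synthesis_task(TaskId=task["TaskId"])["SynthesisTask"]

    if task["TaskStatus"] != "completed":
        raise RuntimeError(task.get("TaskStatusReason", "Synthesis task failed"))
    return task["OutputUri"]


# Example usage (requires AWS credentials; the bucket name is a placeholder):
#   import boto3
#   uri = synthesize_long_text(boto3.client("polly"), long_text, "my-example-bucket")
```

Simple polling keeps the sketch short; the API also accepts an optional `SnsTopicArn` parameter if you would rather be notified when the task completes.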

## Tuning output with SSML

Amazon Polly voices can correctly handle many complex edge cases out of the box. However, there will always be situations where text alone doesn't carry enough information to render the speech as you want: perhaps because of domain-specific jargon or product names, or because you'd like to inflect the speech to sound more empathetic or appropriate for your context.

**[Speech Synthesis Markup Language](https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language), or SSML,** is a standard, extensible format for specifying additional metadata on text-to-speech tasks. Amazon Polly can consume SSML as well as plain text.

Because SSML is an open and extensible format, implementations vary: Polly supports a specific set of SSML tags, and individual voices or engines (NTTS vs standard) may support only a subset of those.

▶️ **Refer** to the **[Supported SSML Tags page](https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html)** of the Polly Developer Guide for full details of which tags are available and how to use them.

You can test out SSML from the console, by switching to the "SSML" tab, or in code by updating the `TextType` parameter from `text` to `ssml`.

▶️ **Try** the sample below for the "newscaster speaking style" - does it sound different to Amy's regular tone for this text?

```xml
<speak>
    <amazon:domain name="news">
        Amazon Nimble Studio is a new service that creative studios can use to produce visual effects, animations, and interactive content entirely in the cloud with AWS, from the storyboard sketch to the final deliverable.
    </amazon:domain>
</speak>
```
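In code, the only change from a plain-text request is `TextType="ssml"` with the markup passed as `Text`. A minimal sketch, assuming a boto3 Polly client: the helper name `synthesize_ssml` and the short `NEWS_SSML` placeholder sentence are ours, for illustration only.

```python
# Placeholder SSML using the same newscaster tag as the sample above
NEWS_SSML = """<speak>
    <amazon:domain name="news">
        Polly can read this sentence in a newscaster style.
    </amazon:domain>
</speak>"""


def synthesize_ssml(polly, ssml, voice_id="Amy", engine="neural"):
    """Synthesize an SSML document and return the MP3 audio as bytes.

    `polly` is assumed to be a boto3 Polly client.
    """
    resp = polly.synthesize_speech(
        Engine=engine,
        OutputFormat="mp3",
        Text=ssml,
        TextType="ssml",  # tells Polly to interpret the tags rather than read them
        VoiceId=voice_id,
    )
    return resp["AudioStream"].read()


# Example usage in a notebook (requires AWS credentials):
#   import boto3
#   from ipywidgets import Audio
#   Audio(value=synthesize_ssml(boto3.client("polly"), NEWS_SSML), format="mp3", loop=False)
```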

## Some SSML Challenges

There are lots of tools available in SSML: Can you solve these introductory puzzles?

▶️ **Use** either the Polly console or a copy of the code snippet above (you can press the `+` button in the toolbar to insert more cells). Don't forget to refer to the **[Supported SSML Tags page](https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html)** for guidance!
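If you work in code, remember that SSML is XML, so characters such as `&`, `<` and quotes in the raw text need escaping before you add any tags. The helper below is a small sketch using only the Python standard library; the name `to_ssml` is ours. Escape the plain text first and add your SSML tags afterwards, otherwise the tags themselves would be escaped too.

```python
from xml.sax.saxutils import escape


def to_ssml(text):
    """Wrap plain text in <speak> tags, escaping XML reserved characters."""
    escaped = escape(text, {'"': "&quot;", "'": "&apos;"})
    return "<speak>" + escaped + "</speak>"


print(to_ssml("Research & development"))
# <speak>Research &amp; development</speak>
```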

#### 1. Let sleeping dogs lie

You can't just *mention* walkies in front of the dog, or he'll go bananas!

Can you make Polly spell out the characters W.A.L.K. - just by adding SSML tags?

When you've finished, would you take the dog for a walk?

> Note: Although this tag also has options for modifying how numbers are rendered, they're often not necessary!
>
> Just try `They say the 1st 1/2 is always the hardest...` - Amy should render this naturally without tagging

#### 2. It's the little things...

As you might have found while playing with the last example, case, punctuation, and spacing can be important cues for Polly!

Sometimes treating these cues carefully may be enough, while in other cases you might want to `<sub>` in a completely different rendering.

Can you edit the casing and use `<sub>` to render the below, expanding out the TF acronym to 'TensorFlow' and referring to the [Python Package Index](https://pypi.org/) with the more typical "pie-pie" pronunciation?

I just customized the pre-built TF container by installing some extra packages from pypi

#### 3. A certain je ne sais quoi

Nearly all languages have some borrowed "loan-words" (or entire phrases) from elsewhere.

Some may be so natural in normal speech that a Polly voice already pronounces them naturally: For example Amy already handles `C'est la vie` and `au contraire` fine with no help from us!

...But for others, we might need to provide a little extra help with a language tag. These tags don't go all the way to making the voice sound native in the other language, but they cue the model on how to interpret the phonetics - since the spelling conventions may be very different!

See if you can improve Amy's pronunciation of the following with SSML:

Mis-pronunciation is a big faux pas for a robot, but natural loan-phrases can add a certain je ne sais quoi. Can you help me get it right? Xie xie!

#### 4. No, **THIS** one!

Appropriate emphasis makes speech more engaging... But as the doc page mentions, the `<emphasis>` tag is only supported for "standard" voices and not the more natural-sounding NTTS engine.

However, the page also says "emphasizing words changes the speaking rate and volume". Can you achieve a similar (or better?) effect on Amy's NTTS voice, using different tags?



#### 5. Location, location, location

Context gives us humans clues on how to pronounce [heteronyms](http://www-personal.umich.edu/~cellis/heteronym.html) - words with the same spelling, but different sounds.

In many cases, Polly can manage this too: Just try `I hope you're content with this content!`

...But again, in more unusual situations, we might need to give a little extra guidance. Can you use SSML to force Amy to render the last word in this example in the verb sense, 'rihBELL'?

She was an inspiration. A hero. A rebel.
You should question! Subvert! Rebel!