[![Works with Edge Impulse](../.assets/images/ei-badge.svg)](http://edgeimpulse.com) [![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/edgeimpulse/notebooks/blob/main/notebooks/03-generate-keyword-spotting-dataset.ipynb) 

# Use Azure Text-To-Speech to generate a dataset for keyword spotting


### Local Software Requirements
- Python 3.6 or later (the code below uses f-strings)
- Pip package manager 
- Jupyter Notebook: https://jupyter.org/install
- pip packages (install with `pip install `*`packagename`*):
    - pydub https://pypi.org/project/pydub/
    - azure-cognitiveservices-speech https://pypi.org/project/azure-cognitiveservices-speech/ 
    - requests https://pypi.org/project/requests/

More information on installation guidance depending on your OS:
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/setup-platform?pivots=programming-language-python

If you encounter any issues, feel free to refer to the official samples at https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/quickstart/python/text-to-speech/quickstart.ipynb



In [None]:
# Imports
import os
import json
import time
import io
import random
import requests
from pydub import AudioSegment
import azure.cognitiveservices.speech as speechsdk

## Set up Azure Text-to-Speech Cognitive Services
First, set up an Edge Impulse account and create your first project.
You will also need an Azure account with an instance of the Text to Speech service enabled: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/index-text-to-speech. The first 500k characters generated each month are free, which should be plenty for most cases, as you only need to generate your dataset once.
You will also need to set the environment variables named "SPEECH_KEY" and "SPEECH_REGION" according to your OS, as detailed here: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-text-to-speech?tabs=windows%2Cterminal&pivots=programming-language-python#set-environment-variables
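
Before creating the speech configuration, it can help to confirm that both variables are actually visible to the notebook kernel. The following is a minimal sketch; the helper name `missing_speech_settings` is hypothetical and not part of the Azure SDK.

```python
import os

def missing_speech_settings(env=os.environ):
    """Return the names of required settings that are unset or empty."""
    required = ("SPEECH_KEY", "SPEECH_REGION")
    return [name for name in required if not env.get(name)]

missing = missing_speech_settings()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Speech settings found")
```

If a variable was set after Jupyter was started, the kernel (or the terminal that launched it) usually needs a restart before the value becomes visible.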



In [None]:
# This example requires environment variables named "SPEECH_KEY" and "SPEECH_REGION"
speech_config = speechsdk.SpeechConfig(subscription=os.environ.get('SPEECH_KEY'), region=os.environ.get('SPEECH_REGION'))
request_limit_per_minute = 20


## Generate the desired samples
First off we need to set our desired keywords and labels:


In [None]:

# Keyword or short sentence and label (e.g. 'hello world')
keyword = [
    {'string':'edge','label':'edge'},
    {'string':'impulse','label':'impulse'},
]


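Then we need to set up the parameters for our speech dataset. The generation loop further down iterates through every combination of voice, style and keyword:
- voices - Which Azure neural voices to use (https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/language-support?tabs=tts)
- styles - Which speaking styles to apply
- languages - Which text to speech locales to use (defined here, but not applied by the SSML template below)
- pitches - Which voice pitches to apply (defined here, but not applied by the SSML template below)
- speakingRates - Which speaking speeds to apply (defined here, but not applied by the SSML template below)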
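
To illustrate how parameter lists expand into combinations, here is a small sketch using `itertools.product`. The short lists and the helper name `build_options` are illustrative placeholders, not the notebook's own values.

```python
from itertools import product

# Illustrative placeholder lists (much shorter than the notebook's own lists)
example_voices = ["en-US-JennyNeural", "en-GB-RyanNeural"]
example_styles = ["chat", "cheerful"]
example_keywords = [
    {"string": "edge", "label": "edge"},
    {"string": "impulse", "label": "impulse"},
]

def build_options(voices, styles, keywords):
    """Return one option dict per (voice, style, keyword) combination."""
    return [
        {"voice": v, "style": s, "text": kw["string"], "label": kw["label"]}
        for v, s, kw in product(voices, styles, keywords)
    ]

options = build_options(example_voices, example_styles, example_keywords)
print(len(options))  # 2 voices * 2 styles * 2 keywords = 8
```

The number of combinations grows multiplicatively, so even a modest voice list quickly produces hundreds of requests.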


In [None]:


# Languages, remove as appropriate
# Full reference here https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/language-support?tabs=tts
# \W([en]{2}-[a-zA-Z]{2}-\w*)
# languages = [
#     'ar-XA', 'bn-IN',  'en-GB',  'fr-CA',
#     'en-US', 'es-ES',  'fi-FI',  'gu-IN',
#     'ja-JP', 'kn-IN',  'ml-IN',  'sv-SE',
#     'ta-IN', 'tr-TR',  'cs-CZ',  'de-DE',
#     'en-AU', 'en-IN',  'fr-FR',  'hi-IN',
#     'id-ID', 'it-IT',  'ko-KR',  'ru-RU',
#     'uk-UA', 'cmn-CN', 'cmn-TW', 'da-DK',
#     'el-GR', 'fil-PH', 'hu-HU',  'nb-NO',
#     'nl-NL', 'pt-PT',  'sk-SK',  'vi-VN',
#     'pl-PL', 'pt-BR',  'ca-ES',  'yue-HK',
#     'af-ZA', 'bg-BG',  'lv-LV',  'ro-RO',
#     'sr-RS', 'th-TH',  'te-IN',  'is-IS'
# ]
voices = [
    'en-AU-AnnetteNeural','en-AU-CarlyNeural','en-AU-DarrenNeural','en-AU-DuncanNeural','en-AU-ElsieNeural','en-AU-FreyaNeural','en-AU-JoanneNeural',
    'en-AU-KenNeural','en-AU-KimNeural','en-AU-NatashaNeural','en-AU-NeilNeural','en-AU-TimNeural','en-AU-TinaNeural','en-AU-WilliamNeural',
    'en-CA-ClaraNeural','en-CA-LiamNeural','en-GB-AbbiNeural','en-GB-AlfieNeural','en-GB-BellaNeural','en-GB-ElliotNeural','en-GB-EthanNeural',
    'en-GB-HollieNeural','en-GB-LibbyNeural','en-GB-MaisieNeural','en-GB-NoahNeural','en-GB-OliverNeural','en-GB-OliviaNeural','en-GB-RyanNeural',
    'en-GB-SoniaNeural',
    'en-GB-ThomasNeural','en-HK-SamNeural','en-HK-YanNeural','en-IE-ConnorNeural','en-IE-EmilyNeural','en-IN-NeerjaNeural',
    'en-IN-PrabhatNeural','en-KE-AsiliaNeural','en-KE-ChilembaNeural','en-NG-AbeoNeural','en-NG-EzinneNeural','en-NZ-MitchellNeural','en-NZ-MollyNeural',
    'en-PH-JamesNeural','en-PH-RosaNeural','en-SG-LunaNeural','en-SG-WayneNeural','en-TZ-ElimuNeural','en-TZ-ImaniNeural',
    #'en-US-AIGenerate1Neural1','en-US-AIGenerate2Neural1',
    'en-US-AmberNeural','en-US-AnaNeural','en-US-AriaNeural','en-US-AshleyNeural','en-US-BrandonNeural','en-US-ChristopherNeural',
    'en-US-CoraNeural','en-US-DavisNeural','en-US-ElizabethNeural','en-US-EricNeural','en-US-GuyNeural','en-US-JacobNeural','en-US-JaneNeural','en-US-JasonNeural',
    #'en-US-JennyMultilingualNeural3',
    'en-US-JennyNeural',
    #'en-US-MichelleNeural',
    'en-US-MonicaNeural','en-US-NancyNeural','en-US-RogerNeural','en-US-SaraNeural',
    'en-US-SteffanNeural','en-US-TonyNeural','en-ZA-LeahNeural','en-ZA-LukeNeural'
]
languages = [ 'en-US'
   # 'en-AU','en-CA','en-GB','en-HK','en-IE','en-IN','en-KE','es-AR', 'es-ES'
#     'en-NG','en-NZ','en-PH','en-SG','en-TZ','en-US','en-ZA'
#     'af-ZA','am-ET','ar-AE','ar-BH','ar-DZ','ar-EG','ar-IQ','ar-JO','ar-KW','ar-LB','ar-LY','ar-MA','ar-OM','ar-QA','ar-SA','ar-SY','ar-TN','ar-YE','az-AZ',
#     'bg-BG','bn-BD','bn-IN','bs-BA','ca-ES','cs-CZ','cy-GB','da-DK','de-AT','de-CH','de-DE','el-GR','en-AU','en-CA','en-GB','en-HK','en-IE','en-IN','en-KE',
#     'en-NG','en-NZ','en-PH','en-SG','en-TZ','en-US','en-ZA','es-AR','es-BO','es-CL','es-CO','es-CR','es-CU','es-DO','es-EC','es-ES','es-GQ','es-GT','es-HN',
#     'es-MX','es-NI','es-PA','es-PE','es-PR','es-PY','es-SV','es-US','es-UY','es-VE','et-EE','eu-ES','fa-IR','fi-FI','fr-BE','fr-CA','fr-CH','fr-FR','ga-IE',
#     'gl-ES','gu-IN','he-IL','hi-IN','hr-HR','hu-HU','hy-AM','id-ID','is-IS','it-IT','ja-JP','jv-ID','ka-GE','kk-KZ','km-KH','kn-IN','ko-KR','lo-LA','lt-LT',
#     'lv-LV','mk-MK','ml-IN','mn-MN','mr-IN','ms-MY','mt-MT','my-MM','nb-NO','ne-NP','nl-BE','nl-NL','pl-PL','ps-AF','pt-BR','pt-PT','ro-RO','ru-RU','si-LK',
#     'sk-SK','sl-SI','so-SO','sq-AL','sr-RS','su-ID','sv-SE','sw-KE','sw-TZ','ta-IN','ta-LK','ta-MY','ta-SG','te-IN','th-TH','tr-TR','uk-UA','ur-IN','ur-PK',
#     'uz-UZ','vi-VN','zh-CN','zh-HK','zh-TW','zu-ZA'
]

# Pitches to generate (in semitones) range: [-20.0, 20.0]
pitches = [0]#[-2, 0, 2]
# Voice genders to use
#genders = ["neutral"]#["NEUTRAL", "FEMALE", "MALE"]
# Styles
styles = ["chat","cheerful","excited", "friendly"]#, "whispering" ]
# Speaking rates to use range: [0.25, 4.0]
speakingRates = ['-30.00%', "0"]
#speakingRates = [0]

#roles = ['Girl','Boy','YoungAdultFemale','YoungAdultMale','OlderAdultFemale','OlderAdultMale','SeniorFemale','SeniorMale']
#roles = ['Girl','SeniorMale']



Then provide some other key parameters:
- out_length - How long each output sample should be, in seconds
- count - Maximum number of samples to output (if the number of parameter combinations is higher than this, the output is restricted)
- voice_dir - Where to store the clean samples before noise is added
- noise_url - Which noise file to download and apply to your samples
- output_folder - The final output location of the noised samples
- num_copies - How many different noisy versions of each sample to create
- max_noise_level - Maximum noise level to add, in dBFS (negative value)
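
As a rough illustration of how the output length and the noise level are used later on, the sketch below splits the silence needed to reach the output length at random, and picks a random level between the configured bound and 0. The helper names `split_padding` and `pick_noise_level` are hypothetical, and the sketch works on plain numbers rather than audio.

```python
import random

def split_padding(clip_ms, out_length_s, rng=random):
    """Split the silence needed to reach out_length_s into before/after parts (ms)."""
    difference = int(out_length_s * 1000) - clip_ms
    if difference <= 0:
        return 0, 0  # clip is already long enough, so no padding is added
    before = rng.randint(0, difference)
    return before, difference - before

def pick_noise_level(max_noise_level_db, rng=random):
    """Pick a random level between max_noise_level_db (a negative value) and 0."""
    return rng.uniform(max_noise_level_db, 0)

before, after = split_padding(clip_ms=620, out_length_s=1)
print(before + after)    # 380 ms of silence in total
level = pick_noise_level(-4)
print(-4 <= level <= 0)  # True
```

Randomising where the keyword sits inside the window helps the model avoid learning a fixed position for the word.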



In [None]:
# # Out length minimum (default: 1s)
# out_length = 1
# # Maximum number of keywords to generate
# count = 30
# # Raw sample output directory
# voice_dir = 'out-wav'
# # Creative commons background noise from freesound.org:https://freesound.org/people/Astounded/sounds/483561/
# noise_url = 'https://cdn.freesound.org/previews/483/483561_10201334-lq.ogg'
# output_folder = 'out-noisy'
# num_copies = 2  # Number of noisy copies to create for each input sample
# max_noise_level = -5  # Maximum noise level to add in dBFS (negative value)


# Out length minimum (default: 1s)
out_length = 1
# Maximum number of keywords to generate
count = 800
# Raw sample output directory
voice_dir = 'out-wav'
# Creative commons background noise from freesound.org:https://freesound.org/people/Astounded/sounds/483561/
noise_url = 'https://cdn.freesound.org/previews/483/483561_10201334-lq.ogg'
output_folder = 'out-noisy'
num_copies = 3  # Number of noisy copies to create for each input sample
max_noise_level = -4  # Maximum noise level to add in dBFS (negative value)


Then we need to check all the output folders are ready

In [None]:

# Check if output directory for noisy files exists and create it if it doesn't
if not os.path.exists(output_folder):
    os.makedirs(output_folder)
# Check if output directory for raw voices exists and create it if it doesn't
if not os.path.exists(voice_dir):
    os.makedirs(voice_dir)    


In [None]:
# https://stackoverflow.com/a/29550200/4819888
def detect_leading_silence(sound, silence_threshold=-50.0, chunk_size=10):
    '''
    sound is a pydub.AudioSegment
    silence_threshold in dB
    chunk_size in ms

    iterate over chunks until you find the first one with sound
    '''
    trim_ms = 0 # ms

    assert chunk_size > 0 # to avoid infinite loop
    while sound[trim_ms:trim_ms+chunk_size].dBFS < silence_threshold and trim_ms < len(sound):
        trim_ms += chunk_size

    return trim_ms
def trim_silence(sound, silence_threshold=-50.0, chunk_size=10):

    start_trim = detect_leading_silence(sound, silence_threshold, chunk_size)
    end_trim = detect_leading_silence(sound.reverse(), silence_threshold, chunk_size)

    duration = len(sound)    
    trimmed_sound = sound[start_trim:duration-end_trim]
    return trimmed_sound

def process_sound_for_silence(sound):
    lenuntrimmed = len(sound)
    trimmed_sound = trim_silence(sound)
    lentrimmed = len(trimmed_sound)
    print("Trimmed from "+str(lenuntrimmed/1000)+"s to " +str(lentrimmed/1000)+"s")
    return trimmed_sound
# sound = AudioSegment.from_file(file_path_here, format="wav")

# trimmed_sound = trim_silence(sound)

# trimmed_sound.export(file_path_trimmed, format='wav')

And download the background noise file

In [None]:

# Download background noise file
response = requests.get(noise_url)
response.raise_for_status()
noise_audio = AudioSegment.from_file(io.BytesIO(response.content), format='ogg')


Then we can generate a list of all possible parameter combinations based on the input earlier. If the number of combinations multiplied by `num_copies` exceeds `count`, the list is subsampled to reduce it:
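
The subsampling step can be sketched on its own as follows. This is a variation rather than the cell's exact logic: the cell below uses floor division for the step, which can leave slightly more than `count` samples, whereas this sketch rounds the step up so the cap always holds. The helper name `subsample` is hypothetical.

```python
def subsample(options, num_copies, count):
    """Keep every n-th option so that len(result) * num_copies <= count."""
    max_options = count // num_copies  # how many options fit within the cap
    if max_options == 0:
        return []
    if len(options) <= max_options:
        return list(options)
    # Ceiling division: round the step up so the cap is respected
    step = (len(options) + max_options - 1) // max_options
    return options[::step]

example = list(range(1000))  # stand-in for 1000 option dicts
kept = subsample(example, num_copies=3, count=800)
print(len(kept), len(kept) * 3)  # 250 750
```

Taking every n-th entry, rather than the first few, keeps the selection spread across the whole voice list.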

In [None]:

# Generate all combinations of parameters
all_opts = []
for v in voices:
    for style in styles:
        for kw in keyword:
            all_opts.append({
                    "voice": v,
                    # Locale prefix of the voice name, e.g. 'en-US' for 'en-US-JennyNeural'
                    "language": "-".join(v.split("-")[:2]),
                    "style": style,
                    "text": kw['string'],
                    "label": kw['label']
                })
print(len(all_opts))
if len(all_opts)*num_copies > count:
    selectEvery = len(all_opts)*num_copies // count
    selectNext = 0
    all_opts = all_opts[::selectEvery]
print(f'Generating {len(all_opts)*num_copies} samples')

Finally we iterate through all the options generated, call the Azure Text to Speech API to generate the desired sample, and apply noise to it, saving locally with metadata:
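
The cell below fills an SSML template with `str.format`. If a keyword ever contains characters such as `&` or `<`, the resulting markup would no longer be valid XML. A minimal sketch of building the same kind of document with escaping is shown here; the helper name `build_ssml` is hypothetical, and only standard-library XML utilities are used.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

def build_ssml(voice_name, style, text, lang="en-US"):
    """Return an SSML document string with text and attributes escaped."""
    return (
        "<speak version='1.0' xml:lang={lang} "
        "xmlns='http://www.w3.org/2001/10/synthesis' "
        "xmlns:mstts='http://www.w3.org/2001/mstts'>"
        "<voice name={voice}>"
        "<mstts:express-as style={style}>{text}</mstts:express-as>"
        "</voice></speak>"
    ).format(
        lang=quoteattr(lang),        # quoteattr adds the surrounding quotes
        voice=quoteattr(voice_name),
        style=quoteattr(style),
        text=escape(text),           # escapes &, < and >
    )

ssml = build_ssml("en-US-JennyNeural", "chat", "rock & roll")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Parsing the string locally is only a well-formedness check; whether a given voice supports a given style is still decided by the service.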

In [None]:

# Instantiate list for file label information
downloaded_files = []

# Instantiate the synthesizer
# With use_default_speaker=True the audio is also played through the default speaker
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

ix = 0
for o in all_opts:
    ix += 1
    # Set the text input to be synthesized
    synthesis_input = o['text']
    # Build the voice request
    speech_synthesis_voice_name=o['voice']# 'en-US-JennyMultilingualNeural'

    # ssml = """<speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts'>
    #     <voice name='{}'>
    #         <mstts:audioduration value="{}s"/>
    #         <mstts:express-as style="{}" role="{}">
    #             {}
    #         </mstts:express-as>
    #     </voice>
    # </speak>""".format(speech_synthesis_voice_name, out_length, o['style'], o['role'], synthesis_input)

    ssml = """<speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts'>
        <voice name='{}'>
            <mstts:express-as style="{}">
                {}
            </mstts:express-as>
        </voice>
    </speak>""".format(speech_synthesis_voice_name, o['style'], synthesis_input)

    # Perform the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    wav_file_name = f"./{voice_dir}/{o['label']}.{o['voice']}-{o['style']}.tts.wav"

    if not os.path.exists(wav_file_name):
        print(f"[{ix}/{len(all_opts)}] Text-to-speeching...")
        
        # This call plays the sound as it is received, so you can validate what is coming through
        # (this can get annoying if you need to keep the sound on)
        speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()

        if speech_synthesis_result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            print("Speech synthesized for text [{}] and saved as '{}'".format(o['text'],wav_file_name))
            stream = speechsdk.AudioDataStream(speech_synthesis_result)
            stream.save_to_wav_file(wav_file_name)
            sound = AudioSegment.from_file(wav_file_name, format="wav")
            trimmed_sound = process_sound_for_silence(sound)
            trimmed_sound.export(wav_file_name, format='wav')

        elif speech_synthesis_result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = speech_synthesis_result.cancellation_details
            print("Speech synthesis canceled: {}".format(cancellation_details.reason))
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                if cancellation_details.error_details:
                    print("Error details: {}".format(cancellation_details.error_details))
                    print("Did you set the speech resource key and region values?")

        #with open(wav_file_name, "wb") as f:
        #    f.write(response.audio_content)
        has_hit_api = True
    else:
        print(f'skipping {wav_file_name}')
        has_hit_api = False

    
    print(wav_file_name)
    # Load voice sample
    voice_audio = AudioSegment.from_file(wav_file_name)
    # Add silence to match output length with random padding
    difference = (out_length * 1000) - len(voice_audio)
    if difference > 0:
        padding_before = random.randint(0, difference)
        padding_after = difference - padding_before
        voice_audio = AudioSegment.silent(duration=padding_before) +  voice_audio + AudioSegment.silent(duration=padding_after)

    for i in range(num_copies):
        # Save noisy sample to output folder
        output_filename = f"./{o['label']}.{o['voice']}-{o['style']}_noisy_{i+1}.wav"
        output_path = os.path.join(output_folder, output_filename)
        if not os.path.exists(output_path):
            # Select random section of noise and random noise level
            start_time = random.randint(0, len(noise_audio) - len(voice_audio))
            end_time = start_time +len(voice_audio)
            noise_level = random.uniform(max_noise_level, 0)

            # Extract selected section of noise and adjust volume
            noise_segment = noise_audio[start_time:end_time]
            noise_segment = noise_segment - abs(noise_level)

            # Mix voice sample with noise segment
            mixed_audio = voice_audio.overlay(noise_segment)
            # Save mixed audio to file
            mixed_audio.export(output_path, format='wav')

            print(f'Saved mixed audio to {output_path}')
        else:
            print(f'skipping {output_path}')
        # Save metadata for file
        downloaded_files.append({
            "path": str(output_filename),
            "label": o['label'],
            "category": "split",
            "metadata": {
                "voice": str(o['voice']),
                "style": str(o['style']),
                "language": str(o['language']),
                "text": o['text'],
                "imported_from": "Azure Cognitive Services TTS"
            }
        })
        
    # Introduce a delay, as the free Azure Text to Speech allowance only allows 20 requests per minute https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-services-quotas-and-limits#common-text-to-speech-quotas-and-limits
    if has_hit_api:
        delaySec =int(60/request_limit_per_minute)
        print("Waiting "+str(delaySec)+" seconds...")
        time.sleep(delaySec)

print("Done text-to-speeching")
print("")

input_file = os.path.join(output_folder, 'input.json')
info_file = {
    "version": 1,
    "files": downloaded_files
}
# Output the metadata file
with open(input_file, "w") as f:
    json.dump(info_file, f)

The files in `./out-noisy` can be uploaded easily using the Edge Impulse CLI tool https://docs.edgeimpulse.com/docs/edge-impulse-cli/cli-uploader:

In [None]:
# Move to the out-noisy folder (%cd persists across commands, whereas `! cd` only affects its own subshell)
%cd out-noisy
# Upload all files in the out-noisy folder with metadata attached in the input.json file
! edge-impulse-uploader --info-file input.json *

## What next?
Now you can use your keywords to create a robust keyword detection model in Edge Impulse Studio!

Make use of our pre-built keyword dataset to add noise and 'unknown' words to your model: https://docs.edgeimpulse.com/docs/pre-built-datasets/keyword-spotting

Try out both the classification models and the transfer learning keyword spotting model to see which works best for your case.