# [TTS] Synthetic Audio Dataset Creation
This sample demonstrates how to use the Azure AI Speech API to generate a synthetic audio dataset from text.

> ✨ ***Note*** <br>
> Before you get started, check the supported languages (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts) and region availability (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/regions).

## Prerequisites
Clone the repository to your local machine:

```bash
git clone https://github.com/hyogrin/Azure_OpenAI_samples.git
```

* A subscription key for the Speech service. See [Try the speech service for free](https://docs.microsoft.com/azure/cognitive-services/speech-service/get-started).
* Python 3.10 or later must be installed (the virtual environment for this sample targets 3.10 or later). Downloads are available [here](https://www.python.org/downloads/).
* The Python Speech SDK package is available for Windows (x64 or x86) and Linux (x64; Ubuntu 16.04 or Ubuntu 18.04).
* On Ubuntu 16.04 or 18.04, run the following commands to install the required packages:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.0 libasound2
  ```
* On Debian 9, run the following commands to install the required packages:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.2 libasound2
  ```
* On Windows you need the [Microsoft Visual C++ Redistributable for Visual Studio 2017](https://support.microsoft.com/help/2977003/the-latest-supported-visual-c-downloads) for your platform.

Configure a Python virtual environment (3.10 or later) in Visual Studio Code:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for **Python: Create Environment**.
 1. Select Venv or Conda and choose where to create the new environment.
 1. Select the Python interpreter version (3.10 or later).

Install the required packages in the new environment:

```bash
pip install -r requirements.txt
```

Create a `.env` file based on the `.env-sample` file, place it in the folder that contains your notebook, and set the variables the notebook reads: `AZURE_AI_SPEECH_API_KEY` and `AZURE_AI_SPEECH_REGION`.
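
Before running the notebook, it can help to confirm that the settings are actually set. The following is a minimal sketch, not part of the original sample: the helper name `missing_env_vars` is hypothetical, and the variable names are the two that the notebook code reads with `os.getenv`.

```python
import os

# Variable names taken from the notebook code in this sample.
REQUIRED_VARS = ("AZURE_AI_SPEECH_API_KEY", "AZURE_AI_SPEECH_REGION")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All Speech settings are present.")
```

Call `load_dotenv()` first when using this inside the notebook, so that values from the `.env` file are visible in `os.environ`.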

## Speech Synthesis Using the Speech SDK

In [44]:
import json
import os
import time

import azure.cognitiveservices.speech as speechsdk
from dotenv import load_dotenv

# Load AZURE_AI_SPEECH_API_KEY and AZURE_AI_SPEECH_REGION from the .env file.
load_dotenv()

speech_key = os.getenv("AZURE_AI_SPEECH_API_KEY")
speech_region = os.getenv("AZURE_AI_SPEECH_REGION")

Create a speech config from the subscription key and service region loaded from `.env` (the region is a short name such as "westus"), and an audio config that plays the synthesized speech on the default speaker.

In [32]:
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region)
audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)

In [45]:
speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

In [46]:
print("Type some text that you want to speak...")
text = input()

Type some text that you want to speak...


In [47]:
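MAX_RETRIES = 2
result = None
for attempt in range(MAX_RETRIES):
    try:
        result = speech_synthesizer.speak_text_async(text).get()
        break  # Stop retrying once the request has returned a result
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        time.sleep(10)

if result is None:
    raise RuntimeError(f"Speech synthesis failed after {MAX_RETRIES} attempts")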
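
The retry pattern above can also be factored into a small reusable helper. This is a sketch under assumptions rather than part of the original sample: `call_with_retries` and its parameters are hypothetical names, and the doubling delay is one common choice, not something the Speech SDK requires.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...).
    The last exception is re-raised if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In the notebook this could be used as `result = call_with_retries(lambda: speech_synthesizer.speak_text_async(text).get())`. The `sleep` parameter exists only so the delay can be replaced in tests. The result's `reason` still needs to be checked afterwards, as the next cell does.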

In [48]:
output_dir = "sample"  # Folder for the WAV files saved in this section

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized to speaker for text [{}]".format(text))
    stream = speechsdk.AudioDataStream(result)
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    stream.save_to_wav_file(os.path.join(output_dir, "result_text.wav"))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech synthesis canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        if cancellation_details.error_details:
            print("Error details: {}".format(cancellation_details.error_details))
        print("Did you update the subscription info?")

Speech synthesized to speaker for text [I was waiting for you]


In [49]:
import html
default_tts_voice = 'en-US-JennyMultilingualV2Neural' # Default TTS voice for English
lang = "en-US"
#default_tts_voice = 'vi-VN-HoaiMyNeural' # Default TTS voice for Vietnamese; check the voice gallery for more options
#lang = "vi-VN"

# html.escape keeps characters such as & and < in the input from breaking the SSML markup.
ssml = f"""<speak version='1.0' xmlns="http://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
                     <voice name='{default_tts_voice}'>
                             {html.escape(text)}
                     </voice>
                   </speak>"""
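
Because the SSML is assembled from user text, it can be worth checking that the result is well-formed XML before sending it to the service. The sketch below is illustrative only: `build_ssml` is a hypothetical helper, it uses the W3C SSML namespace `http://www.w3.org/2001/10/synthesis`, and it validates with the standard library parser rather than with the Speech service itself.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, lang="en-US", voice="en-US-JennyMultilingualV2Neural"):
    """Return an SSML document that speaks `text` with the given voice."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


ssml_doc = build_ssml("Tom & Jerry <live>")
root = ET.fromstring(ssml_doc)  # Raises ParseError if the markup is malformed
print(root[0].text)  # The parser restores the original, unescaped text
```

Escaping the text and quoting the attribute values means that input containing `&`, `<`, or quotes cannot change the structure of the document.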

In [50]:
os.makedirs(output_dir, exist_ok=True)
speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
stream = speechsdk.AudioDataStream(speech_synthesis_result)
stream.save_to_wav_file(os.path.join(output_dir, "result_ssml.wav"))

## Generate a synthetic dataset and manifest.txt file to train a custom speech model

To train a custom speech model with audio + human-labeled data (the Acoustic type), create a zip file that contains the audio files and a text file with their transcriptions.
Each line of the text file holds an audio file name and its transcription, separated by a tab. Here is an example of the structure:

```text
audio1.wav	Content like data, models, tests, and endpoints are organized into Projects in the Custom Speech portal. Each project is specific to a domain and country/language. For example, you may create a project for call centers that use English in the United States. To create your first project, select the Speech-to-text/Custom speech, then click New Project. Follow the instructions provided by the wizard to create your project. After you've created a project, you should see four tabs: Data, Testing, Training, and Deployment. Use the links provided in Next steps to learn how to use each tab.
audio2.wav	Custom Speech provides tools that allow you to visually inspect the recognition quality of a model by comparing audio data with the corresponding recognition result. From the Custom Speech portal, you can play back uploaded audio and determine if the provided recognition result is correct. This tool allows you to quickly inspect quality of Microsoft's baseline speech-to-text model or a trained custom model without having to transcribe any audio data.
```
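
The tab-separated layout shown above can be produced and read back with two small helpers. This is a minimal sketch with hypothetical function names (`manifest_line`, `parse_manifest`). It assumes that each transcript must stay on a single line, so tabs and newlines inside the text are collapsed to spaces.

```python
def manifest_line(file_name, transcript):
    """Format one manifest entry: file name, a tab, then the transcript on one line."""
    if "\t" in file_name or "\n" in file_name:
        raise ValueError("file name must not contain tabs or newlines")
    # Collapse tabs, newlines, and repeated spaces so the entry stays on one line.
    clean_text = " ".join(transcript.split())
    return f"{file_name}\t{clean_text}\n"


def parse_manifest(lines):
    """Yield (file_name, transcript) pairs from manifest lines, skipping blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        file_name, transcript = line.split("\t", 1)
        yield file_name, transcript
```

Splitting on the first tab only (`split("\t", 1)`) keeps the parser tolerant of transcripts written by other tools that did not remove embedded tabs.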

In [51]:

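def get_audio_file_by_speech_synthesis(text, file_path, lang="en-US", default_tts_voice="en-US-JennyMultilingualV2Neural"):
    """Synthesize `text` with the given voice and save the audio to `file_path` as WAV."""
    # html.escape keeps characters such as & and < from breaking the SSML markup.
    ssml = f"""<speak version='1.0' xmlns="http://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
                     <voice name='{default_tts_voice}'>
                             {html.escape(text)}
                     </voice>
                   </speak>"""
    speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    # Fail loudly instead of saving an empty or partial file.
    if speech_synthesis_result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Speech synthesis failed for text: {text}")
    stream = speechsdk.AudioDataStream(speech_synthesis_result)
    stream.save_to_wav_file(file_path)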
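
After a file is saved, its basic properties can be inspected with the standard library. The sketch below is an addition, not part of the original notebook: `wav_info` is a hypothetical helper, and it assumes the saved files are plain PCM WAV, which is the only kind the `wave` module reads. Check the current Custom Speech documentation for the audio format it expects.

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return wav.getnchannels(), rate, frames / float(rate)
```

For example, `wav_info(os.path.join("sample", "result_ssml.wav"))` reports the channel count, sample rate, and length of the file written earlier, which helps spot empty or truncated clips before they go into the dataset.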

## Generate synthetic wav dataset and manifest for Vietnamese language

In [None]:
import datetime

default_tts_voice = 'vi-VN-HoaiMyNeural' # Default TTS voice for Vietnamese; check the voice gallery for more options
languages = ['vi-VN'] # List of languages to generate audio files for
output_dir = "output"
manifest_path = os.path.join(output_dir, 'manifest.txt')

os.makedirs(output_dir, exist_ok=True)

# Each JSONL line is expected to hold a 'no' field and one text field per language code.
with open('cc_support_expressions.jsonl', 'r', encoding='utf-8') as f, \
        open(manifest_path, 'a', encoding='utf-8') as manifest_file:
    for line in f:
        try:
            expression = json.loads(line)
            no = expression['no']
            for lang in languages:
                text = expression[lang]
                timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                file_name = f"{no}_{lang}_{timestamp}.wav"
                get_audio_file_by_speech_synthesis(text, os.path.join(output_dir, file_name), lang, default_tts_voice)
                # Write the manifest entry only after the audio file has been saved.
                manifest_file.write(f"{file_name}\t{text}\n")
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON on line: {line}")
            print(e)
        except (KeyError, RuntimeError) as e:
            print(f"Skipping line because of {type(e).__name__}: {e}")
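
The loop above reads one JSON object per line, with a `no` field and one field per language code. The following sketch separates that parsing step so it can be checked without calling the Speech service. `read_expressions` is a hypothetical helper and the sample records are invented for illustration; only the field names `no` and `vi-VN` come from the code above.

```python
import json


def read_expressions(lines, languages):
    """Return (records, skipped): (no, lang, text) tuples and the numbers of unusable lines."""
    records, skipped = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            expression = json.loads(line)
            rows = [(expression["no"], lang, expression[lang]) for lang in languages]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed JSON, a missing field, or a non-object value.
            skipped.append(line_no)
            continue
        records.extend(rows)
    return records, skipped


# Invented sample data in the assumed format.
sample = [
    '{"no": 1, "vi-VN": "Xin chào", "en-US": "Hello"}',
    "not json",
    '{"no": 2, "en-US": "Missing Vietnamese text"}',
]
records, skipped = read_expressions(sample, ["vi-VN"])
print(records, skipped)
```

Building the rows for a line before adding them keeps a record out of the results entirely when one of the requested languages is missing.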

## Play WAV Files in Output Folder
Use the os library to list all WAV files in the output folder.

In [53]:
import os
from IPython.display import Audio, display

output_folder = 'output'
files = os.listdir(output_folder)
wav_files = [file for file in files if file.endswith('.wav')]

# Sort wav_files by 'no' in ascending order
wav_files.sort(key=lambda x: int(x.split('_')[0]))
wav_files

['1_vi-VN_20241105195638.wav',
 '2_vi-VN_20241105195642.wav',
 '3_vi-VN_20241105195647.wav',
 '4_vi-VN_20241105195652.wav',
 '5_vi-VN_20241105195656.wav',
 '6_vi-VN_20241105195701.wav',
 '7_vi-VN_20241105195707.wav',
 '8_vi-VN_20241105195713.wav',
 '9_vi-VN_20241105195718.wav',
 '10_vi-VN_20241105195722.wav',
 '11_vi-VN_20241105195728.wav',
 '12_vi-VN_20241105195733.wav',
 '13_vi-VN_20241105195738.wav',
 '14_vi-VN_20241105195743.wav',
 '15_vi-VN_20241105195748.wav',
 '16_vi-VN_20241105195753.wav',
 '17_vi-VN_20241105195757.wav',
 '18_vi-VN_20241105195802.wav',
 '19_vi-VN_20241105195808.wav',
 '20_vi-VN_20241105195813.wav']
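
The numeric sort key matters because a plain string sort would place `10_...` before `2_...`. A short illustration with made-up file names in the same `no_lang_timestamp` pattern (`numeric_prefix` is a hypothetical helper name):

```python
def numeric_prefix(file_name):
    """Return the integer before the first underscore, e.g. 10 for '10_vi-VN_x.wav'."""
    return int(file_name.split("_", 1)[0])


names = ["10_vi-VN_a.wav", "2_vi-VN_b.wav", "1_vi-VN_c.wav"]
print(sorted(names))                      # Lexicographic: '10...' sorts before '2...'
print(sorted(names, key=numeric_prefix))  # Numeric: 1, 2, 10
```

The key raises `ValueError` for files whose names do not start with a number, so it should only be applied to the generated WAV files.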

## Play WAV Files
Use IPython.display.Audio to play each WAV file listed in the output folder.

In [54]:
# Play each WAV file in the output folder
for wav_file in wav_files[:3]:
    file_path = os.path.join(output_folder, wav_file)
    display(Audio(filename=file_path))

In [55]:
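import datetime
import zipfile

# Bundle the WAV files and manifest.txt from the output folder into one archive for upload.
timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    # List the folder again so the archive reflects its current contents.
    for file in os.listdir(output_folder):
        zipf.write(os.path.join(output_folder, file), file)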

print(f"Created zip file: {zip_filename}")

Created zip file: train_vi-VN_20241105200114.zip
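
Before uploading the archive, it may be worth confirming that every manifest entry has a matching audio file inside it. This is a sketch under assumptions: `check_archive` is a hypothetical helper, and it expects `manifest.txt` at the top level of the zip, as written by the cell above.

```python
import zipfile


def check_archive(zip_path, manifest_name="manifest.txt"):
    """Return the manifest entries whose audio file is missing from the archive."""
    with zipfile.ZipFile(zip_path) as zipf:
        names = set(zipf.namelist())
        manifest = zipf.read(manifest_name).decode("utf-8")
    listed = [line.split("\t", 1)[0] for line in manifest.splitlines() if line.strip()]
    return [name for name in listed if name not in names]
```

An empty list from `check_archive(zip_filename)` means every file named in the manifest is present in the zip.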
