# Solution Notebook - Getting started with Speech AI

![ASR](./../images/asr_diarization_tutorial.png)

By the end of this Getting Started notebook, you will be able to use Automatic Speech Recognition (ASR), Neural Machine Translation (NMT) and Text-to-Speech (TTS) APIs in your projects, opening up a wide range of possibilities for Speech AI applications.

Happy learning!

## Instructions

### 1 - Making API Requests to Speech AI Models

➡️ Before continuing, please consult the **[00-making-http-request.MD](https://github.com/eleapttn/workshop-mastering-speech-ai/blob/main/docs/00-making-http-request.MD)** documentation to discover how to make HTTP API calls to the AI Endpoints models using Python.

This will be essential for interacting with the different AI models!
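The three model cells below follow the same pattern. Each sends a bearer token read from `OVH_AI_ENDPOINTS_ACCESS_TOKEN`, plus `accept` and `Content-Type` headers that depend on the model. Here is a minimal sketch that assembles those headers in one place. The helper name `build_headers` is hypothetical and is not part of any OVHcloud library.

```python
import os
from typing import Optional


def build_headers(accept: str, content_type: Optional[str] = None,
                  token: Optional[str] = None) -> dict:
    """Build the HTTP headers used to call an AI Endpoints model (hypothetical helper)."""
    # Fall back to the environment variable used throughout this notebook
    if token is None:
        token = os.environ.get("OVH_AI_ENDPOINTS_ACCESS_TOKEN")
    if not token:
        # Fail early instead of sending "Bearer None" to the endpoint
        raise ValueError("OVH_AI_ENDPOINTS_ACCESS_TOKEN is not set")
    headers = {
        "accept": accept,
        "Authorization": f"Bearer {token}",
    }
    # Only JSON bodies need an explicit Content-Type; for multipart uploads (ASR),
    # requests sets the multipart Content-Type and boundary itself
    if content_type is not None:
        headers["Content-Type"] = content_type
    return headers


# ASR: multipart upload in, JSON out
asr_headers = build_headers("application/json", token="my-token")
# TTS: JSON in, binary audio out
tts_headers = build_headers("application/octet-stream", "application/json", token="my-token")
```

Raising an error on a missing token is a design choice. The cells below would instead send `Bearer None` and get an error status back from the endpoint.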

### 2 - ASR Pipeline

Let's start with the Automatic Speech Recognition (ASR) model.

🎯 Your task is to **send an audio file as input and receive a text transcription**. 

To do that, you will have to:

- Determine the `ASR` model you want to work with among the ones available on [AI Endpoints](https://endpoints.ai.cloud.ovh.net/)
- Get its endpoint `URL`
- Set up the necessary request headers
- Provide the input data expected by the `ASR` model
- Send your request
- Print the response and analyze the audio transcription!

**➡️ You can find all the necessary information, such as the endpoint URL, format of expected input data, and code examples, in the [01-ASR.MD](https://github.com/eleapttn/workshop-mastering-speech-ai/blob/main/docs/01-ASR.MD) documentation file.**

⬇️ *You can write the code in the cell below.* ⬇️

```python
# Import the requests and os libraries into your notebook environment, if not already done
import requests
import os

# Specify the headers of your request
# We expect JSON output, since the model returns text.
headers = {
    "accept": "application/json",
    "Authorization": f"Bearer {os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

# Path to the audio file to be uploaded and transcribed
audio_path = "/workspace/workshop-mastering-speech-ai/samples/audio_samples/audio_ovhcloud_en_0.wav"

# Send the request to the correct model endpoint and get the response
# Here we use the "en-gb" version, which is meant for English audio files.
# The `with` block closes the file handle once the upload is done.
with open(audio_path, "rb") as audio_file:
    response = requests.post(
        os.getenv("ASR_EN_GB_ENDPOINT"),
        files=[("audio", audio_file)],
        headers=headers,
    )

# Concatenate the first (best) transcript of each result into the complete transcription
if response.status_code == 200:
    response_data = response.json()
    resp = "".join(result["alternatives"][0]["transcript"] for result in response_data)
    print("Audio transcription in english:\n\n", resp)
else:
    print("Error:", response.status_code, response.text)
```

```text
Audio transcription in english:

 O V H Cloud is a global player and Europe's leading cloud provider, operating over 450000 servers within around 40 data centres across nine countries. For over 20 years, O V H Cloud has developed an integrated model that provides complete control of its value chain, from the design of its servers to the construction and management of its data centre. You are now entering our data centre office in Hob, France. Well, the access is not that easy in reality as you can read on the sign restricted zone. Employees can only access this area via badge and fingerprinting process. Only authorised employees are allowed through the man trap, which even scans the weight. This ensures that only one person enters the data centre and that our colleagues stay in shape. All our data centres are secured through a security zone concept with onion skin principle. As you can see, you are now getting into the critical zone. Here you can see the heart of our data centre, our S
```
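The transcription code in the ASR cell assumes the response is a list of results, each holding an `alternatives` list whose first entry carries the `transcript`. This structure is inferred from that cell, not from an API reference. The slightly more defensive sketch below skips results with no alternatives instead of raising `IndexError`. The function name `join_transcripts` and the sample payload are hypothetical.

```python
from typing import Any


def join_transcripts(response_data: list[dict[str, Any]]) -> str:
    """Concatenate the first (best) alternative of each ASR result."""
    parts = []
    for result in response_data:
        alternatives = result.get("alternatives") or []
        # Skip results without any hypothesis rather than failing on [0]
        if alternatives:
            parts.append(alternatives[0].get("transcript", ""))
    return "".join(parts)


# Hypothetical payload shaped like the one parsed in the ASR cell
sample = [
    {"alternatives": [{"transcript": "OVHcloud is a global player. "}]},
    {"alternatives": []},
    {"alternatives": [{"transcript": "It operates data centres."}]},
]
print(join_transcripts(sample))  # prints "OVHcloud is a global player. It operates data centres."
```

In the notebook, `join_transcripts(response.json())` would replace the inline concatenation.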

### 3 - NMT Pipeline

Now that you have successfully completed the `ASR` task, let's move on to the Neural Machine Translation model. 

🎯 Your task is to **translate a given text from one language to another**.

Remember that you'll need to: 

- Determine the `NMT` model you want to work with among the ones available on [AI Endpoints](https://endpoints.ai.cloud.ovh.net/)
- Get its endpoint `URL`
- Set up the necessary request headers
- Adjust the `JSON` request data to suit the newly chosen `NMT` model
- Send your request
- Print the response and analyze the translated text!

**➡️ You can find all the necessary information, such as the NMT endpoint URL, format of expected input data (we are not sending an audio file anymore), and code examples, in the [02-NMT.MD](https://github.com/eleapttn/workshop-mastering-speech-ai/blob/main/docs/02-NMT.MD) documentation file.**

⬇️ *You can code that in the cell below.* ⬇️

```python
# Import the requests and os libraries into your notebook environment, if not already done
import requests
import os

# Specify the headers of your request
# JSON is used for both the input and the output, since the NMT model receives text and generates text.
headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

# Specify the input data to be sent and translated
data = {
    "model": "en_de_24x6",  # en_de selects the English to German NMT model
    "source_language": "en",  # Input language is English (language code: `en`)
    "target_language": "de",  # Output language is German (language code: `de`)
    "texts": ["This is the sentence I want to translate"],
}

# Send the request to the correct model endpoint and get the response
# We make sure to use the endpoint that targets the right model (English to German)
response = requests.post(
    os.getenv("NMT_EN_DE_ENDPOINT"),
    json=data,
    headers=headers,
)

# Print the translations
if response.status_code == 200:
    response_data = response.json()
    print("Translations:\n", response_data)
else:
    print("Error:", response.status_code, response.text)
```

```text
Translations:
 [{'text': 'Das ist der Satz, den ich übersetzen möchte.', 'language': 'de'}]
```
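The output shows that the NMT endpoint returns one object per input string, containing the translated `text` and its `language`. Because `texts` is a list, several sentences can presumably be sent in a single request, with the translations coming back in the same order. The sketch below uses two hypothetical helpers, `build_nmt_payload` and `extract_translations`, to build such a payload and pull out the translated strings. The model name and language codes are copied from the cell above.

```python
from typing import Any


def build_nmt_payload(texts: list[str], source: str = "en", target: str = "de",
                      model: str = "en_de_24x6") -> dict[str, Any]:
    """Build the JSON body for the English to German NMT endpoint."""
    return {
        "model": model,
        "source_language": source,
        "target_language": target,
        "texts": list(texts),
    }


def extract_translations(response_data: list[dict[str, str]]) -> list[str]:
    """Keep only the translated strings from the endpoint response."""
    return [item["text"] for item in response_data]


payload = build_nmt_payload(["Hello", "This is the sentence I want to translate"])

# Response copied from the NMT cell output above
sample_response = [{"text": "Das ist der Satz, den ich übersetzen möchte.", "language": "de"}]
print(extract_translations(sample_response))  # prints ['Das ist der Satz, den ich übersetzen möchte.']
```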


### 4 - TTS Pipeline

Great job with the NMT model! 

🎯 Let's explore the Text-to-Speech model now. Your goal is to **convert text into human-like speech**. The process is similar to what you did in the `ASR` and `NMT` sections, but note that the `accept` header will be different this time, since we want to generate an audio file rather than text.
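Because the `accept` header changes what comes back, the body has to be decoded differently. The ASR and NMT cells call `response.json()`, while the TTS cell keeps the raw `response.content` bytes. Here is a minimal sketch of that branching. It is written against a plain content-type string and a byte body, so it runs without a network call. `decode_body` is a hypothetical helper.

```python
import json
from typing import Any, Union


def decode_body(content_type: str, body: bytes) -> Union[Any, bytes]:
    """Parse JSON responses and return binary (e.g. audio) responses untouched."""
    # Ignore parameters such as "; charset=utf-8" and compare case-insensitively
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    # application/octet-stream and other binary types are returned as raw bytes
    return body


print(decode_body("application/json; charset=utf-8", b'[{"text": "Hallo", "language": "de"}]'))
print(decode_body("application/octet-stream", b"RIFF"))
```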

**➡️ You can find all the necessary information, such as the endpoint URL, format of expected input data, and code examples, in the [03-TTS.MD](https://github.com/eleapttn/workshop-mastering-speech-ai/blob/main/docs/03-TTS.MD) documentation file.**

⬇️ *You can code that in the cell below.* ⬇️

```python
# Import the requests and os libraries into your notebook environment, if not already done
import requests
import os

# Specify the headers of your request
# The request body is JSON, since we send text
# The response is application/octet-stream, since the endpoint generates audio (binary data that can be read as a stream)
headers = {
    "accept": "application/octet-stream",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ.get('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

# Specify the data that will be sent to the model (text to synthesize, language and voice)
data = {
    "text": "OVHcloud provides a set of managed tools designed for building your Machine Learning projects: AI Notebooks, AI Training, AI Deploy and AI Endpoints.",  # text you want to generate speech from
    "language_code": "en-US",  # language code of the generated speech (en-US for English)
    "encoding": 1,  # audio encoding code expected by the API (see 03-TTS.MD)
    "sample_rate_hz": 16000,  # sample rate of the generated audio, in Hz
    "voice_name": "English-US.Female-1",  # or "English-US.Male-1" for a male voice
}

# Send the request to the correct model endpoint and get the response
# We use the "en-us" version to create English speech, given that the input text is English.
response = requests.post(
    os.getenv("TTS_EN_US_ENDPOINT"),
    json=data,
    headers=headers,
)

# Keep the raw audio bytes; set them to None on failure so the playback cell can check
if response.status_code == 200:
    response_data = response.content
else:
    response_data = None
    print("Error:", response.status_code, response.text)
```
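Besides listening to it in the notebook, you may want to keep the synthesized speech on disk. The sketch below assumes the returned bytes form a complete WAV file with its RIFF header, which `IPython.display.Audio` can play directly. If the endpoint returns headerless PCM instead, you would need to add a WAV header first. The helper `save_audio` and the file name are hypothetical. The standard `wave` module reads back the sample rate and duration, and the rate should match the requested `sample_rate_hz` of 16000.

```python
import io
import wave
from pathlib import Path


def save_audio(audio_bytes: bytes, path: str) -> tuple[int, float]:
    """Write WAV bytes to disk and return (sample_rate_hz, duration_seconds)."""
    # A WAV file starts with "RIFF", a 4-byte size, then "WAVE"
    if audio_bytes[:4] != b"RIFF" or audio_bytes[8:12] != b"WAVE":
        raise ValueError("response does not look like a WAV file")
    Path(path).write_bytes(audio_bytes)
    # Read the header back to check what the endpoint actually produced
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate, duration


# Self-contained check: build one second of 16 kHz, 16-bit mono silence in memory
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(save_audio(buffer.getvalue(), "tts_output.wav"))  # prints (16000, 1.0)
# In the notebook: save_audio(response_data, "tts_output.wav")
```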

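### Play the generated audio file

```python
import IPython.display as ipd

# Play the synthesized speech inline (only if the TTS request succeeded)
if response_data is not None:
    ipd.display(ipd.Audio(response_data))
```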

### Conclusion

Congratulations on completing this introduction to Speech AI models and AI APIs!

You have successfully:

- Discovered the `ASR`, `NMT`, and `TTS` models.
- Learned how to work with `APIs` and send `HTTP` requests.
- Provided the correct input data and used the right header settings to get the desired output.

Now you are ready to move on to another notebook, where you will put your new skills into practice with more advanced Speech AI tasks, such as subtitle generation, voice dubbing and other exciting Speech AI features.

The `/notebooks/1_GENERATE_SRT_FILE.ipynb` notebook is waiting for you!