### 0. Prepare the environment

[Overall documentation](https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-custom-speech-create-project?pivots=rest-api)

In [32]:
# import standard python libraries
import requests
import json
import time
import os
import pandas as pd
from dotenv import load_dotenv


In [49]:
#os.chdir("../Airlift/Speech/Train a Custom Model")
if not load_dotenv('./mydotenv.env'): raise Exception(".env file not found")

In [50]:
# Setup the credentials
speech_key = os.getenv("SPEECH_KEY")

# Set the API key and endpoint
endpoint = os.getenv("SPEECH_ENDPOINT")
region = os.getenv("SPEECH_REGION")

# Define the locale for the custom model
locale = "en-US"

### 1. Prepare the Dataset

#### 1.1. Get an opensource dataset

 - Audios collected from [here](http://www.voiptroubleshooter.com/open_speech/american.html)
 - Transcriptions collected from [here](http://www.cs.columbia.edu/~hgs/audio/harvard.html)

 Other sources of dataset download, from [Mozilla Common Voice project](https://commonvoice.mozilla.org/en/languages) under CC-0 license.

Check if the audio properties meet the requirements
```
cd Airlift/Speech/"Train a Custom Model"
find . -type f -name "*.wav" -print0 | xargs -0 -I {} sh -c 'ffprobe "{}" 2>&1 | grep -A1 Duration:'
```

Requirements:
![File properties](img/custom_speech_audio_format.png)

#### 1.2. Load the labeled transcriptions

In [60]:
# Create an empty dict 
content_list = []

# Specify the folder path where the .txt files are located
folder_path = './dataset/transcriptions'

# List all files in the folder
files = os.listdir(folder_path)

# Filter .txt files
txt_files = [file for file in files if file.endswith('.txt')]

# Loop through each .txt file and perform operations
for txt_file in txt_files:
    # Create the complete file path by joining folder path and file name
    file_path = os.path.join(folder_path, txt_file)
    
    # Open and read the file
    with open(file_path, 'r') as file:
        file_content = file.read()
        content_list.append({"path": txt_file.replace(".txt", ".wav"), "sentence": file_content.replace("\n", " ").replace('"', '')})

# Generate a dataframe with the labeled examples
df = pd.DataFrame(content_list)

# Save to disc
df.to_csv("./dataset/audio_files/labels.tsv", sep = "\t", index = False)

# Show first examples
df.head()

Unnamed: 0,path,sentence
0,OSR_us_000_0010_8k.wav,The birch canoe slid on the smooth planks. Glu...
1,OSR_us_000_0011_8k.wav,The boy was there when the sun rose. A rod is ...
2,OSR_us_000_0012_8k.wav,The small pup gnawed a hole in the sock. The f...
3,OSR_us_000_0013_8k.wav,Hoist the load to your left shoulder. Take the...
4,OSR_us_000_0014_8k.wav,A king ruled the state in the early days. The ...


Zip the audio files + labels
```
cd dataset/audio_files/
zip training_data.zip ./*.*
```


### 1. Create the Project

[Documentation for the API](#https://eastus.dev.cognitive.microsoft.com/docs/services/speech-to-text-api-v3-1/operations/Projects_Create)

In [58]:
# Define the headers for the REST API calls
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "application/json"
}

# Define the request body for a new custom speech project
create_project_request_body = {
  "displayName": "demo_airlift",
  "description": "Custom Model Training using APIs",
  "locale": locale
}

# Build and call the URI
create_project_uri = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/projects"
create_project_response = requests.post(url = create_project_uri, headers = headers, json = create_project_request_body)

# 20* Represents a successful call 
print("HTTP Status code:", create_project_response.status_code)

# Get the project ID
project_id = json.loads(create_project_response.text)["links"]["datasets"].split("/")[-2]
print("Project ID #:", project_id)

HTTP Status code: 201
Project ID #: c1db0548-2b25-4f9e-8bed-e826bb7382e1


### 2. Create the dataset

[Documentation for the API](https://eastus.dev.cognitive.microsoft.com/docs/services/speech-to-text-api-v3-1/operations/Datasets_Create)

In [61]:
# Pointer to the blob location containig the file
training_data_location = "https://customspeechdemo.blob.core.windows.net/dataset/training_data.zip?sp=r&st=2023-10-17T14:41:35Z&se=2023-12-31T23:41:35Z&spr=https&sv=2022-11-02&sr=b&sig=rEBDPZB0fIsHNduwPMPk41CTM6UcxsBd7e10nlfiZMo%3D"

# Define the headers for the REST API calls
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "application/json"
}

# Define the request body for a new dataset
create_dataset_request_body = {
  "kind": "Acoustic",
  "displayName": "training_data",
  "description": "Training Data, added via API call",
  "project": {
    "self": f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/projects/{project_id}"
  },
  "contentUrl": training_data_location,
  "locale": locale
}

# Build and call the URI
create_dataset_uri = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/datasets"
create_dataset_response = requests.post(url = create_dataset_uri, headers = headers, json = create_dataset_request_body)

# 20* Represents a successful call 
print("HTTP Status code:", create_dataset_response.status_code)

# Get the dataset ID
dataset_id = json.loads(create_dataset_response.text)["links"]["files"].split("/")[-2]
print("Dataset ID #:", dataset_id)

HTTP Status code: 201
Dataset ID #: 4ea90e2c-b3af-445b-ba08-95f4d50d3d58


### 3. Get the base models for Acoustic training

[Documentation for the API](https://eastus.dev.cognitive.microsoft.com/docs/services/speech-to-text-api-v3-1/operations/Models_GetBaseModel)


In [40]:
# Define the headers for the REST API calls
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "application/json"
}

# Define the request body for getting the base models
get_base_model_request_body = {
  "locale": locale
}

# Build and call the URI
get_base_model_uri = f"https://{region}.cognitiveservices.azure.com/speechtotext/v3.1/models/base/?skip=900&top=100"
get_model_response = requests.get(url = get_base_model_uri, headers = headers, json = get_base_model_request_body)

# 20* Represents a successful call 
print("HTTP Status code:", get_model_response.status_code)

# Get and print base model details
base_models = json.loads(get_model_response.text)["values"]
for model in base_models:
    if model["locale"] == locale and "Acoustic" in base_models[0]["properties"]["features"]["supportsAdaptationsWith"]:
        base_model_url = model["self"]
        base_model_display_name = model["displayName"]
        print("Acoustic model for base model:", base_model_display_name, "\n", base_model_url, "\n\n")

HTTP Status code: 200
Acoustic model for base model: 20230315 Batch Transcription 
 https://westeurope.cognitiveservices.azure.com/speechtotext/v3.1/models/base/ea09791c-0745-41dd-a79c-e6f154678de4 


Acoustic model for base model: 20230301 Batch Transcription 
 https://westeurope.cognitiveservices.azure.com/speechtotext/v3.1/models/base/eb66f70e-c649-40d8-b205-e017c87a96fe 


Acoustic model for base model: 20230915 Whisper Preview 
 https://westeurope.cognitiveservices.azure.com/speechtotext/v3.1/models/base/8cd2b1e5-985c-4c77-bfb0-e6ab1798826b 


Acoustic model for base model: 20230724 
 https://westeurope.cognitiveservices.azure.com/speechtotext/v3.1/models/base/e829074a-0c34-40d9-81af-e6cae17a4266 


Acoustic model for base model: 20231005 Batch Transcription 
 https://westeurope.cognitiveservices.azure.com/speechtotext/v3.1/models/base/65d19d52-4766-429a-870c-b263b3aebab7 




In [68]:
# Define the base model
base_model_url = "https://westeurope.cognitiveservices.azure.com/speechtotext/v3.1/models/base/801f5620-13c1-4883-9e39-9275bf97576d"

### 4. Model Training

[Documentation for the API](https://eastus.dev.cognitive.microsoft.com/docs/services/speech-to-text-api-v3-1/operations/Models_Create)

In [69]:
# Define the headers for the REST API calls
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "application/json"
}

# Define the request body for a new dataset
train_model_request_body = {
  "project": {
    "self": f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/projects/{project_id}"
  },
  "datasets": [
    {
      "self": f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/datasets/{dataset_id}"
    }
  ],
  "locale": locale,
  "displayName": "custom_model_airlift_demo",
  "description": "Custom Model for the Airlift Demo",
  "baseModel": {
    "self": base_model_url
  }
}

# Build and call the URI
train_model_uri = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/models"
train_model_response = requests.post(url = train_model_uri, headers = headers, json = train_model_request_body)

# 20* Represents a successful call 
print(train_model_response.status_code)

# Get the Model URI
model_uri = json.loads(train_model_response.text)["self"]
print("Model URI:", model_uri)

201
Model URI: https://westeurope.api.cognitive.microsoft.com/speechtotext/v3.1/models/ebadfd8d-8067-4896-8c7c-330793f6044f


### 5. Deploy the model
[Documentation for the API](https://eastus.dev.cognitive.microsoft.com/docs/services/speech-to-text-api-v3-1/operations/Endpoints_Create)


In [70]:
# Define the headers for the REST API calls
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "application/json"
}

# Define the request body for a new dataset
deploy_model_request_body = {
  "project": {
    "self": f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/projects/{project_id}"
  },
  "properties": {
    "loggingEnabled": "true"
  },
  "displayName": "airliftdemo",
  "description": "Custom STT Model used in the Airlift demo",
  "model": {
    "self": model_uri
  },
  "locale": locale,
}

# Build and call the URI
deploy_model_uri = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/endpoints"
deploy_model_response = requests.post(url = deploy_model_uri, headers = headers, json = deploy_model_request_body)

# 20* Represents a successful call 
print(deploy_model_response.status_code)

# Get up the API endpoint URL (endpoint for short audios)
endpoint_url = json.loads(deploy_model_response.text)["links"]["restInteractive"]
print("Endpoint URL:", model_uri)

201
Endpoint URL: https://westeurope.api.cognitive.microsoft.com/speechtotext/v3.1/models/ebadfd8d-8067-4896-8c7c-330793f6044f


### 6. Consuming the model

In [72]:
# Pointers to the file
example_path = "dataset/PS4_XboxOne.wav"

In [73]:
# Set up the request headers
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "audio/wav"
}

# Read the audio file
with open(example_path, "rb") as file:
    audio_data = file.read()

# Send the API request
response = requests.post(endpoint_url, headers=headers, data=audio_data)

# Get the response content as JSON
json_response = json.loads(response.content)

# Print the recognized text
if "DisplayText" in json_response:
    print(json_response["DisplayText"])
else:
    print("Error: " + json_response["RecognitionStatus"])


First there was Play Station, AKA PS1, PS2, PS3 and now PS4. And that makes sense, You think after Xbox there'd be Xbox too. But no. Next came Xbox 360 and now after 360 comes Xbox One.
