### Pre-processing

In this example, FFmpeg is used to extract the audio track from a video:

```ffmpeg -i PS4_XboxOne.mp4 -vn -acodec pcm_s16le -ar 44100 -ac 1 PS4_XboxOne.wav```
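
Since the rest of this notebook is Python, the same extraction can be driven from a cell. The sketch below is only an illustration: the helper names `build_extract_audio_cmd` and `extract_audio` are hypothetical, and it assumes the `ffmpeg` binary is on the `PATH`.

```python
import shlex
import subprocess


def build_extract_audio_cmd(video_path, wav_path, sample_rate=44100):
    """Return the FFmpeg argument list that extracts a mono 16-bit PCM track."""
    return [
        "ffmpeg",
        "-i", video_path,          # input video
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # sample rate in Hz
        "-ac", "1",                # downmix to mono
        wav_path,
    ]


def extract_audio(video_path, wav_path):
    """Run FFmpeg; raises subprocess.CalledProcessError if it exits with an error."""
    subprocess.run(build_extract_audio_cmd(video_path, wav_path), check=True)


# Only build and show the command here; call extract_audio(...) to actually run it
cmd = build_extract_audio_cmd("PS4_XboxOne.mp4", "PS4_XboxOne.wav")
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.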

**Laughter track**

 - Removing it is a difficult problem, because laughter occupies the same frequencies as the voice, so a frequency filter alone cannot separate the two

 - An attempt to solve it with a CNN is available here: https://github.com/jeffgreenca/laughr

Frequency:
![Frequency](img/frequency.jpg)

Spectrum:
![Spectrum](img/frequency_spectrum.jpg)

Other examples of pre-processing steps:
 - Convert stereo audio into a mono file
 
sample command: ```ffmpeg -i input_stereo_audio.wav -ac 1 output_mono_audio.wav```

 - Denoising: Band-pass filter that keeps only the fundamental frequency range of the human voice (roughly 85 to 255 Hz)

sample command: ```ffmpeg -i input_audio.wav -af "highpass=85, lowpass=255" -ac 1 output_denoised_audio.wav```

 - Decibel normalization

sample command: ```ffmpeg -i input_audio.wav -af "dynaudnorm=p=0.9" -ac 1 output_normalized_audio.wav```

 - Split multichannel audio into different mono files

sample command: ```ffmpeg -i input_multichannel_audio.wav -map_channel 0.0.0 output_channel_0.wav -map_channel 0.0.1 output_channel_1.wav```
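
Several of these steps can probably be combined in a single FFmpeg pass by chaining the audio filters with commas. The sketch below only builds the command; the function name `build_preprocess_cmd` and its parameters are hypothetical, and the cut-off values simply reuse the ones from the sample commands above.

```python
def build_preprocess_cmd(input_path, output_path, low_hz=85, high_hz=255, normalize=True):
    """Build one FFmpeg command that band-passes, optionally normalizes, and downmixes to mono."""
    filters = [
        f"highpass=f={low_hz}",   # remove content below low_hz
        f"lowpass=f={high_hz}",   # remove content above high_hz
    ]
    if normalize:
        filters.append("dynaudnorm=p=0.9")  # dynamic loudness normalization
    return [
        "ffmpeg",
        "-i", input_path,
        "-af", ",".join(filters),  # filters run left to right
        "-ac", "1",                # mono output
        output_path,
    ]


preprocess_cmd = build_preprocess_cmd("input_audio.wav", "output_clean_audio.wav")
print(" ".join(preprocess_cmd))
```

Filter order matters: normalizing after the band-pass means the loudness is computed on the filtered signal only.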

### Post-processing

In this example, Azure OpenAI models receive the transcription as context and are asked to rectify errors in it.

The transcribed text is expected to contain errors. Adding a Gen AI layer to summarize the text or extract insights from it can improve the overall result of the task. This is the same approach as applying Gen AI to documents after OCR, for instance.

Other examples of post-processing are:
 - Using a spell checker to correct mistakes
 - Splitting stereo audio into multiple mono files, transcribing each one and checking for consensus
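
As an illustration of the consensus idea, the sketch below compares two transcripts of the same audio word by word using `difflib` from the standard library. It is one possible approach, not the method used in this notebook; the function name `consensus_words` and the two example strings are made up for the demonstration.

```python
from difflib import SequenceMatcher


def consensus_words(transcript_a, transcript_b):
    """Return (agreed_words, disagreements) for two transcripts of the same audio."""
    words_a = transcript_a.split()
    words_b = transcript_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)

    agreed, disagreements = [], []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            agreed.extend(words_a[a0:a1])
        else:
            # Keep both candidates so a later step (human or model) can pick one
            disagreements.append((" ".join(words_a[a0:a1]), " ".join(words_b[b0:b1])))
    return agreed, disagreements


# Hypothetical transcripts of the same sentence from two channels
agreed, disagreements = consensus_words("Pass the butter please", "Fast the buyer please")
print(agreed)
print(disagreements)
```

The spans where the transcripts disagree are good candidates to send to a spell checker or a language model for correction.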

#### Configure the Speech service

In [7]:
import requests
import json
import time
import os
from dotenv import load_dotenv

In [72]:
#os.chdir("../Airlift/Speech/")
if not load_dotenv('./mydotenv.env'): raise Exception(".env file not found")

In [9]:
# Setup the credentials
speech_key = os.getenv("SPEECH_KEY")

# Set the API key and endpoint
endpoint = os.getenv("SPEECH_ENDPOINT")
region = os.getenv("SPEECH_REGION")

In [10]:
endpoint_url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.2-preview.1/transcriptions"
audio_file_path = "https://airliftblob.blob.core.windows.net/raw/PS4_XboxOne.wav?sp=r&st=2023-10-07T14:57:29Z&se=2023-10-31T23:57:29Z&spr=https&sv=2022-11-02&sr=b&sig=AtyoJOrd12hvcoDE3dL7iOzyfbToJjV8CyywQiliDFM%3D"
audio_locale = "en-US"
job_name = "PS4_XboxOne"
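
`os.getenv` returns `None` for a variable that is not set, which would only surface later as a confusing HTTP error. A small guard like the one below can fail early instead. The helper name `require_env` is hypothetical; the variable names are the ones read above.

```python
import os


def require_env(*names):
    """Return the values of the given environment variables, failing fast if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return [os.environ[name] for name in names]


# Demonstration with a variable name that is assumed not to exist
try:
    require_env("SPEECH_KEY_THAT_IS_NOT_SET")
except RuntimeError as err:
    print(err)
```

In the notebook this could be used as `speech_key, region = require_env("SPEECH_KEY", "SPEECH_REGION")`.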

#### Run the Batch Transcription

In [11]:
# Define the headers for the REST API calls
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "application/json"
}

# Define the properties for the transcription
properties = {
    "diarizationEnabled": True,  # boolean, serialized as JSON true by json.dumps
    "diarization": {
        "speakers": {
            "minCount": 1,
            "maxCount": 2
        }
    }
}

# Define the parameters for the REST API call
params = {
    "contentUrls": [audio_file_path],
    "locale":      audio_locale,
    "displayName": job_name,
    "properties":  properties
}

# Get the return of the API
response = requests.post(url     = endpoint_url, 
                         headers = headers, 
                         data    = json.dumps(params))
stt_response = response.json()

# A 2xx status code indicates a successful call
print("HTTP Status code:", response.status_code)

# Get the transcription URL
get_transcription_url = stt_response["self"]

HTTP Status code: 201
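
If the job is not created (wrong key, wrong region, unreachable audio URL), the response has no `self` field and the cell above fails with a `KeyError`. The sketch below shows one way to surface the actual error. The function name is hypothetical, and the assumption that error bodies carry a `message` field should be checked against the service's responses.

```python
def extract_transcription_url(status_code, body):
    """Return the job URL from a 'create transcription' response, or raise with details."""
    if status_code != 201:
        # Assumption: error bodies contain a "message" field; otherwise show the raw body
        detail = body.get("message") or body
        raise RuntimeError(f"Transcription job was not created (HTTP {status_code}): {detail}")
    return body["self"]


# Demonstration with a made-up response body
created = {"self": "https://example.invalid/speechtotext/transcriptions/1234"}
print(extract_transcription_url(201, created))
```

In the notebook this would be called as `extract_transcription_url(response.status_code, stt_response)`.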


#### Check the status for the Batch Transcription job

In [12]:
job_status = ""

# Poll until the job reaches a terminal state, so a failed job cannot loop forever
while job_status not in ("Succeeded", "Failed"):
    # Define the headers for the REST API calls
    headers = {
        "Ocp-Apim-Subscription-Key": speech_key,
        "Content-Type": "application/json"
    }

    # Get the return of the API
    status_response = requests.get(url     = get_transcription_url,
                                   headers = headers)

    # A 2xx status code indicates a successful call
    print("HTTP Status code:", status_response.status_code)

    # Get job status
    job_status = status_response.json()["status"]
    print(job_status)

    # Wait 20 seconds before checking again, unless the job has finished
    if job_status not in ("Succeeded", "Failed"):
        time.sleep(20)


HTTP Status code: 200
Running
HTTP Status code: 200
Running
HTTP Status code: 200
Succeeded
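
For long recordings it can help to put an upper bound on the wait. The sketch below separates the polling logic from the HTTP call so it can be tried offline. `wait_for_job` and its parameters are hypothetical names, and the demonstration feeds canned statuses instead of calling the service.

```python
import time


def wait_for_job(fetch_status, poll_seconds=20, timeout_seconds=3600, sleep=time.sleep):
    """Poll fetch_status() until the job finishes; return the final status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in ("Succeeded", "Failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Demonstration with canned statuses and a no-op sleep
canned = iter(["NotStarted", "Running", "Running", "Succeeded"])
final_status = wait_for_job(lambda: next(canned), sleep=lambda seconds: None)
print(final_status)
```

In the notebook, `fetch_status` could be a small function that performs the `requests.get` call and returns `status_response.json()["status"]`.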


#### Get the transcription

In [54]:
# Define the headers for the REST API calls
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "application/json"
}

# Get the return of the API
transcription_response = requests.get(url     = get_transcription_url + "/files", 
                                      headers = headers)

# A 2xx status code indicates a successful call
print("HTTP Status code:", transcription_response.status_code)

transcription_location = transcription_response.json()["values"][0]["links"]["contentUrl"]

# Set the headers and parameters for the API request
headers = {
    "Ocp-Apim-Subscription-Key": speech_key,
    "Content-Type": "application/json"
}
# Get the text
text_response = requests.get(transcription_location, headers=headers)
trasncribed_text = text_response.json()

# A 2xx status code indicates a successful call
print("HTTP Status code:", text_response.status_code)

HTTP Status code: 200
HTTP Status code: 200
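
The `/files` response lists more than one file per job (the transcription itself and a report), so taking index `0` depends on the order in which they are returned. A more defensive sketch is to select the entry by its `kind`. The function name and the sample payload below are illustrative only.

```python
def find_transcription_url(files_response):
    """Return the contentUrl of the first file whose kind is 'Transcription'."""
    for entry in files_response["values"]:
        if entry.get("kind") == "Transcription":
            return entry["links"]["contentUrl"]
    raise LookupError("No file of kind 'Transcription' in the response")


# Made-up payload with the report listed first
sample_files = {
    "values": [
        {"kind": "TranscriptionReport", "links": {"contentUrl": "https://example.invalid/report.json"}},
        {"kind": "Transcription", "links": {"contentUrl": "https://example.invalid/contenturl_0.json"}},
    ]
}
print(find_transcription_url(sample_files))
```

In the notebook this would be called as `find_transcription_url(transcription_response.json())`.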


In [61]:
# Print the combined text display
trasncribed_text["combinedRecognizedPhrases"][0]["display"]

"First there was PlayStation AKA PS1. Then there's PS2, PS3 and now PS4. And that makes sense. You'd think after Xbox there'd be Xbox Two. But no. Next came Xbox 360, and now after 360 comes Xbox One. Why one? Maybe that's how many seconds of thought they put in the naming it. Can you get the butter, please? However, with the Xbox One, I can control my entire entertainment system using voice commands. Up until now I've had to use Leonard. Then get the other one. Pass the butter. Hang on, I don't feel like you're taking this dilemma seriously. Fine Sheldon, you have my undivided attention. OK, now the PS4 is more angular and sleek looking. No way it's true, but the larger size of the Xbox One may keep it from overheating. You wouldn't want your gaming system to overheat. Well, you absolutely would not. And furthermore, the Xbox One now comes with a Kinect included. Included, yes. Not sold separately, although the PS4 uses cool new GDDR 5 RAM while the Xbox One is still using the convent

#### Collect insights from transcription

In [67]:
# Get the number of speakers
speakers = []
for phrase in trasncribed_text["recognizedPhrases"]:
    if phrase["speaker"] not in speakers:
        speakers.append(phrase["speaker"])
number_of_speakers = len(speakers)

print("Number of Speakers:", number_of_speakers)

Number of Speakers: 2


In [68]:
# Split between the speakers
phrases_splited = []
for i in range(number_of_speakers):
    phrases_splited.append([])

for phrase in trasncribed_text["recognizedPhrases"]:
    if phrase["speaker"] == 1:
        phrases_splited[0].append(phrase["nBest"][0]["display"])
    elif phrase["speaker"] == 2:
        phrases_splited[1].append(phrase["nBest"][0]["display"])
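
The split above is written for exactly two speakers. A variant that works for any number of speakers could group the phrases in a dictionary keyed by speaker id, as in the sketch below. The function name and the sample phrases are hypothetical; the field names match the ones used in this notebook.

```python
from collections import defaultdict


def split_by_speaker(recognized_phrases):
    """Group the best display text of each phrase by its speaker id."""
    by_speaker = defaultdict(list)
    for phrase in recognized_phrases:
        by_speaker[phrase["speaker"]].append(phrase["nBest"][0]["display"])
    return dict(by_speaker)


# Minimal made-up input with the same shape as trasncribed_text["recognizedPhrases"]
sample_phrases = [
    {"speaker": 1, "nBest": [{"display": "Why one?"}]},
    {"speaker": 2, "nBest": [{"display": "Pass the butter."}]},
    {"speaker": 1, "nBest": [{"display": "Hang on."}]},
]
print(split_by_speaker(sample_phrases))
```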

In [69]:
# Print the phrases split by speaker
phrases_splited

[['First there was PlayStation AKA PS1.',
  "Then there's PS2, PS3 and now PS4.",
  'And that makes sense.',
  "You'd think after Xbox there'd be Xbox Two.",
  'But no.',
  'Next came Xbox 360, and now after 360 comes Xbox One.',
  'Why one?',
  "Maybe that's how many seconds of thought they put in the naming it.",
  'However, with the Xbox One, I can control my entire entertainment system using voice commands.',
  "Up until now I've had to use Leonard.",
  "Hang on, I don't feel like you're taking this dilemma seriously.",
  'OK, now the PS4 is more angular and sleek looking.',
  "No way it's true, but the larger size of the Xbox One may keep it from overheating.",
  'Well, you absolutely would not.',
  'And furthermore, the Xbox One now comes with a Kinect included.',
  'Not sold separately, although the PS4 uses cool new GDDR 5 RAM while the Xbox One is still using the conventional DDR3 memory.',
  'Why?',
  "See, that's what I thought.",
  'But then they go and throw in an ES RAM b

In [71]:
# Build the dialog, prefixing each phrase with its speaker label
dialog = []

for phrase in trasncribed_text["recognizedPhrases"]:
    if phrase["speaker"] == 1:
        dialog.append("Speaker1: " + phrase["nBest"][0]["display"])
    elif phrase["speaker"] == 2:
        dialog.append("Speaker2: " + phrase["nBest"][0]["display"])

dialog

['Speaker1: First there was PlayStation AKA PS1.',
 "Speaker1: Then there's PS2, PS3 and now PS4.",
 'Speaker1: And that makes sense.',
 "Speaker1: You'd think after Xbox there'd be Xbox Two.",
 'Speaker1: But no.',
 'Speaker1: Next came Xbox 360, and now after 360 comes Xbox One.',
 'Speaker1: Why one?',
 "Speaker1: Maybe that's how many seconds of thought they put in the naming it.",
 'Speaker2: Can you get the butter, please?',
 'Speaker1: However, with the Xbox One, I can control my entire entertainment system using voice commands.',
 "Speaker1: Up until now I've had to use Leonard.",
 'Speaker2: Then get the other one.',
 'Speaker2: Pass the butter.',
 "Speaker1: Hang on, I don't feel like you're taking this dilemma seriously.",
 'Speaker2: Fine Sheldon, you have my undivided attention.',
 'Speaker1: OK, now the PS4 is more angular and sleek looking.',
 "Speaker1: No way it's true, but the larger size of the Xbox One may keep it from overheating.",
 "Speaker2: You wouldn't want yo

#### Call Azure OpenAI to get more insights


In [74]:
import openai  # legacy interface: openai.ChatCompletion requires an openai package version below 1.0

In [75]:
# Setup the credentials
aoai_key = os.getenv("AOAI_KEY")

# Set the API key and endpoint
aoai_endpoint = os.getenv("AOAI_ENDPOINT")
aoai_region = os.getenv("AOAI_REGION")

# Set model and deployment
aoai_model = os.getenv("AOAI_MODEL")
aoai_deployment = os.getenv("AOAI_DEPLOYMENT")

In [76]:
# API config
openai.api_type = "azure"
openai.api_base = aoai_endpoint
openai.api_version = "2023-07-01-preview"
openai.api_key = aoai_key

#### Use GPT-4 to discover the name of the show

In [90]:
# Define a specific prompt injecting the info about number of speakers + dialog
system_prompt = f"""You are an AI assistant that helps people find information from a transcription from a video clip from a TV show. 
  
The dialog is between {number_of_speakers} speakers.
  
The transcription is the following:
{dialog}
"""

# Add the system prompt and the ask
message_list = []
message_list.append({"role": "system","content": system_prompt})
message_list.append({"role": "user"  ,"content": "Based on this transcription, can you infer from which TV show it was extracted? Answer just the name of the TV Show?"})

# Call Azure open AI
show_name = openai.ChatCompletion.create(engine            = aoai_deployment,
                                         messages          = message_list,
                                         temperature       = 0.7,
                                         max_tokens        = 200,
                                         top_p             = 0.95,
                                         frequency_penalty = 0,
                                         presence_penalty  = 0,
                                         stop              = None)["choices"][0]["message"]["content"]

# Print the output
print(show_name)

The Big Bang Theory
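
The same completion call, with the same sampling parameters, is repeated in every cell that follows. A small wrapper such as the one sketched below could remove that repetition. `ask_model` is a hypothetical helper; the completion function is passed in as an argument so the sketch runs offline with a stand-in, and in the notebook it would receive `openai.ChatCompletion.create` and `aoai_deployment`.

```python
def ask_model(create_fn, deployment, messages, **overrides):
    """Send messages through create_fn and return the text of the first choice."""
    params = {
        "engine": deployment,
        "messages": messages,
        "temperature": 0.7,
        "max_tokens": 200,
        "top_p": 0.95,
        "frequency_penalty": 0,
        "presence_penalty": 0,
        "stop": None,
    }
    params.update(overrides)  # e.g. temperature=0 for more repeatable answers
    return create_fn(**params)["choices"][0]["message"]["content"]


# Stand-in for the real completion function, used only for this demonstration
def fake_create(**params):
    last_message = params["messages"][-1]["content"]
    return {"choices": [{"message": {"content": f"echo: {last_message}"}}]}


reply = ask_model(fake_create, "my-deployment", [{"role": "user", "content": "hello"}])
print(reply)
```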


#### Use GPT-4 to get the names of the characters

In [113]:
# Define a specific prompt injecting the info about number of speakers + dialog
system_prompt = f"""You are an AI assistant that helps people find information from a transcription from a video clip from {show_name}. 
  
The dialog is between {number_of_speakers} speakers.
  
The transcription is the following:
{dialog}
"""

# Add the system prompt and the ask
message_list = []
message_list.append({"role": "system", "content": system_prompt})
message_list.append({"role": "user"  , "content": "Can you infer the name of the characters and your confidence score from 1 to 10 you are correct, in the following format [['name lastname', 9], ['name lastname', 5]]"})

# Call Azure open AI
characters_name = openai.ChatCompletion.create(engine            = aoai_deployment,
                                               messages          = message_list,
                                               temperature       = 0.7,
                                               max_tokens        = 200,
                                               top_p             = 0.95,
                                               frequency_penalty = 0,
                                               presence_penalty  = 0,
                                               stop              = None)["choices"][0]["message"]["content"]

# Print the output
print(characters_name)

[['Sheldon', 10], ['Leonard', 1]]


In [116]:
# Define a specific prompt injecting the info about number of speakers + dialog
system_prompt = f"""You are an AI assistant that helps people find information from a transcription from a video clip from {show_name}. 
  
The dialog is between {number_of_speakers} speakers.
  
The transcription is the following:
{dialog}
"""

# Add the system prompt and the ask
message_list = []
message_list.append({"role": "system", "content": system_prompt})
message_list.append({"role": "user"  , "content": "Can you infer the name of the characters and your confidence score from 1 to 10 you are correct, in the following format [['name lastname', 9], ['name lastname', 5]]"})
message_list.append({"role": "assistant", "content": characters_name})
message_list.append({"role": "user"  , "content": "This seems to be a couple conversation. Who is the girlfriend of the character you are more confident you've identified correctly? Answer just the name"})

# Call Azure open AI
second_character_name = openai.ChatCompletion.create(engine            = aoai_deployment,
                                                     messages          = message_list,
                                                     temperature       = 0.7,
                                                     max_tokens        = 200,
                                                     top_p             = 0.95,
                                                     frequency_penalty = 0,
                                                     presence_penalty  = 0,
                                                     stop              = None)["choices"][0]["message"]["content"]

# Print the output
print(second_character_name)

Amy


In [117]:
# Parse the model reply into a list without executing arbitrary code
import ast
characters_name = ast.literal_eval(characters_name)

# excludes the confidence score
characters_name[0] = characters_name[0][0]
characters_name[1] = second_character_name

print(characters_name)

['Sheldon', 'Amy']
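
The model is asked for a list of `[name, score]` pairs, but its reply is free text and may not follow that format. The sketch below parses the reply with `ast.literal_eval`, which only accepts Python literals, and checks the shape before using it. `parse_characters` is a hypothetical helper.

```python
import ast


def parse_characters(raw):
    """Parse a reply such as "[['Sheldon', 10], ['Leonard', 1]]" into (name, score) tuples."""
    try:
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError) as err:
        raise ValueError(f"Model reply is not a Python literal: {raw!r}") from err

    if not isinstance(parsed, list):
        raise ValueError(f"Expected a list, got {type(parsed).__name__}")

    characters = []
    for item in parsed:
        if not (isinstance(item, (list, tuple)) and len(item) == 2 and isinstance(item[0], str)):
            raise ValueError(f"Unexpected entry in model reply: {item!r}")
        characters.append((item[0], int(item[1])))
    return characters


characters = parse_characters("[['Sheldon', 10], ['Leonard', 1]]")
# Pick the character the model is most confident about
best_name, best_score = max(characters, key=lambda pair: pair[1])
print(best_name, best_score)
```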


#### Edit the dialog to add the names

In [118]:
# Rebuild the dialog, prefixing each phrase with the character name
dialog_with_names = []

for phrase in trasncribed_text["recognizedPhrases"]:
    if phrase["speaker"] == 1:
        dialog_with_names.append(f"{characters_name[0]}: " + phrase["nBest"][0]["display"])
    elif phrase["speaker"] == 2:
        dialog_with_names.append(f"{characters_name[1]}: " + phrase["nBest"][0]["display"])

dialog_with_names

['Sheldon: First there was PlayStation AKA PS1.',
 "Sheldon: Then there's PS2, PS3 and now PS4.",
 'Sheldon: And that makes sense.',
 "Sheldon: You'd think after Xbox there'd be Xbox Two.",
 'Sheldon: But no.',
 'Sheldon: Next came Xbox 360, and now after 360 comes Xbox One.',
 'Sheldon: Why one?',
 "Sheldon: Maybe that's how many seconds of thought they put in the naming it.",
 'Amy: Can you get the butter, please?',
 'Sheldon: However, with the Xbox One, I can control my entire entertainment system using voice commands.',
 "Sheldon: Up until now I've had to use Leonard.",
 'Amy: Then get the other one.',
 'Amy: Pass the butter.',
 "Sheldon: Hang on, I don't feel like you're taking this dilemma seriously.",
 'Amy: Fine Sheldon, you have my undivided attention.',
 'Sheldon: OK, now the PS4 is more angular and sleek looking.',
 "Sheldon: No way it's true, but the larger size of the Xbox One may keep it from overheating.",
 "Amy: You wouldn't want your gaming system to overheat.",
 'Shel
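
When one speaker says several phrases in a row, the dialog can be made more compact by merging them into a single turn. The sketch below does this with `itertools.groupby`, assuming the phrases are already in chronological order. `merge_turns` and the sample data are hypothetical.

```python
from itertools import groupby


def merge_turns(recognized_phrases, names=None):
    """Merge consecutive phrases from the same speaker into one labeled turn."""
    names = names or {}
    turns = []
    for speaker, group in groupby(recognized_phrases, key=lambda p: p["speaker"]):
        label = names.get(speaker, f"Speaker{speaker}")  # fall back to a generic label
        text = " ".join(p["nBest"][0]["display"] for p in group)
        turns.append(f"{label}: {text}")
    return turns


# Minimal made-up input with the same shape as trasncribed_text["recognizedPhrases"]
sample_phrases = [
    {"speaker": 1, "nBest": [{"display": "But no."}]},
    {"speaker": 1, "nBest": [{"display": "Why one?"}]},
    {"speaker": 2, "nBest": [{"display": "Pass the butter."}]},
    {"speaker": 1, "nBest": [{"display": "Hang on."}]},
]
turns = merge_turns(sample_phrases, names={1: "Sheldon", 2: "Amy"})
print(turns)
```

Fewer, longer lines also reduce the number of tokens sent to the model in the prompts.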

#### The last sentence seems to contain an error; correct it with GPT-4

In [120]:
# print the last sentence
dialog_with_names[-1]

"Amy: He's fast, the buyer."

In [130]:
# Define a specific prompt injecting the info about number of speakers + dialog
system_prompt = f"""You are an AI assistant that helps people find information from a transcription from a video clip from {show_name}. 
  
The dialog is between {number_of_speakers} speakers.
  
The transcription is the following:
{dialog_with_names}
"""

# Add the system prompt and the ask
message_list = []
message_list.append({"role": "system", "content": system_prompt})
message_list.append({"role": "user"  , "content": "Can you infer the name of the characters and your confidence score from 1 to 10 you are correct, in the following format [['name lastname', 9], ['name lastname', 5]]"})
message_list.append({"role": "assistant", "content": str(characters_name)})
message_list.append({"role": "user"  , "content": "This seems to be a couple conversation. Who is the girlfriend of the character you are more confident you've identified correctly? Answer just the name"})
message_list.append({"role": "assistant", "content": second_character_name})
message_list.append({"role": "user"  , "content": f"The last sentence from {second_character_name} is not correct. During the dialog she asked Sheldon Cooper for something and never received it. Can you correct the sentence to make it consistent with her previous request?"})

# Call Azure open AI
corrected_last_sentence = openai.ChatCompletion.create(engine            = aoai_deployment,
                                                       messages          = message_list,
                                                       temperature       = 0.7,
                                                       max_tokens        = 200,
                                                       top_p             = 0.95,
                                                       frequency_penalty = 0,
                                                       presence_penalty  = 0,
                                                       stop              = None)["choices"][0]["message"]["content"]

# Print the output
print(corrected_last_sentence)

"Pass the butter, Sheldon."


In [129]:
# Replace the last sentence in the dialog, dropping the quotes the model wrapped around its answer
dialog_with_names[-1] = f"{second_character_name}: " + corrected_last_sentence.strip('"')
dialog_with_names

['Sheldon: First there was PlayStation AKA PS1.',
 "Sheldon: Then there's PS2, PS3 and now PS4.",
 'Sheldon: And that makes sense.',
 "Sheldon: You'd think after Xbox there'd be Xbox Two.",
 'Sheldon: But no.',
 'Sheldon: Next came Xbox 360, and now after 360 comes Xbox One.',
 'Sheldon: Why one?',
 "Sheldon: Maybe that's how many seconds of thought they put in the naming it.",
 'Amy: Can you get the butter, please?',
 'Sheldon: However, with the Xbox One, I can control my entire entertainment system using voice commands.',
 "Sheldon: Up until now I've had to use Leonard.",
 'Amy: Then get the other one.',
 'Amy: Pass the butter.',
 "Sheldon: Hang on, I don't feel like you're taking this dilemma seriously.",
 'Amy: Fine Sheldon, you have my undivided attention.',
 'Sheldon: OK, now the PS4 is more angular and sleek looking.',
 "Sheldon: No way it's true, but the larger size of the Xbox One may keep it from overheating.",
 "Amy: You wouldn't want your gaming system to overheat.",
 'Shel

#### Create a summary based on the dialog

In [131]:
# Define a specific prompt injecting the info about number of speakers + dialog
system_prompt = f"""You are an AI assistant that helps people find information from a transcription from a video clip from a TV show. 
  
The dialog is between {number_of_speakers} speakers.
  
The transcription is the following:
{dialog_with_names}
"""

# Add the system prompt and the ask
message_list = []
message_list.append({"role": "system", "content": system_prompt})
message_list.append({"role": "user"  , "content": "Create a short paragraph containing the summary of the conversation based on the dialog"})

# Call Azure open AI
summary_conversation = openai.ChatCompletion.create(engine            = aoai_deployment,
                                                    messages          = message_list,
                                                    temperature       = 0.7,
                                                    max_tokens        = 200,
                                                    top_p             = 0.95,
                                                    frequency_penalty = 0,
                                                    presence_penalty  = 0,
                                                    stop              = None)["choices"][0]["message"]["content"]

# Print the output
print(summary_conversation)

In this conversation, Sheldon is discussing the progression and differences between PlayStation and Xbox gaming systems. He notes the naming inconsistencies with Xbox, and how the Xbox One allows him to control his entire entertainment system through voice commands. Sheldon then compares the designs of the PS4 and Xbox One, mentioning the larger size of the Xbox One might prevent overheating. He also points out that the Xbox One includes Kinect and uses conventional DDR3 memory while the PS4 uses new GDDR5 RAM. Sheldon is confused about the choice of memory in Xbox One but realizes that they have an ES RAM buffer. Despite his extensive knowledge, Sheldon is unsure about which gaming system to choose. Throughout the conversation, Amy tries to get Sheldon to pass the butter and expresses her concern over his indecisiveness.
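
The prompts in this notebook interpolate the dialog as a Python list, so the model sees brackets and quote characters around every line. One alternative, sketched below, is to join the lines with newlines before building the system prompt. `build_system_prompt` is a hypothetical helper that reuses the wording of the prompts above.

```python
def build_system_prompt(dialog_lines, number_of_speakers, source="a TV show"):
    """Build the system prompt with one dialog line per row instead of a list repr."""
    transcript = "\n".join(dialog_lines)
    return (
        "You are an AI assistant that helps people find information from a "
        f"transcription from a video clip from {source}.\n\n"
        f"The dialog is between {number_of_speakers} speakers.\n\n"
        "The transcription is the following:\n"
        f"{transcript}\n"
    )


# Two lines taken from the dialog above
demo_prompt = build_system_prompt(
    ["Amy: Can you get the butter, please?", "Sheldon: Up until now I've had to use Leonard."],
    number_of_speakers=2,
)
print(demo_prompt)
```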
