# Audio Processing

This notebook walks through the basic steps required to process audio files with Amazon Transcribe and Amazon Comprehend and generate the data needed for sentiment analysis of contact center calls. The steps can be executed from any platform; the notebook is simply convenient for step-by-step execution and experimentation.

## Initial setup

The libraries use some packages that are not installed by default. Install them by running the next cell.

In [55]:
!pip install -U implicits pandas

Requirement already up-to-date: implicits in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (1.0.2)
Requirement already up-to-date: pandas in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (1.1.2)
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


## Moving the data to the processing location

The data is available in a backup S3 bucket. We'll first copy the files from there into the data folder, which also creates the folder structure required for processing and analysis.

In [16]:
!mkdir -p /home/ec2-user/SageMaker/transcribe-comprehend/data/
!aws s3 sync s3://transcribe-comprehend-demo-2020-09-14-nb-audio-source /home/ec2-user/SageMaker/transcribe-comprehend/data/
!ls ../data/Call_Sentiment_Analysis_PoC/16\ KHz

download: s3://transcribe-comprehend-demo-2020-09-14-nb-audio-source/dialogues/dpv-cc-v1-002.txt to ../data/dialogues/dpv-cc-v1-002.txt
download: s3://transcribe-comprehend-demo-2020-09-14-nb-audio-source/dialogues/dpv-cc-v1-001.txt to ../data/dialogues/dpv-cc-v1-001.txt
download: s3://transcribe-comprehend-demo-2020-09-14-nb-audio-source/dialogues/dpv-cc-v1-006.txt to ../data/dialogues/dpv-cc-v1-006.txt
download: s3://transcribe-comprehend-demo-2020-09-14-nb-audio-source/dialogues/dpv-cc-v1-005.txt to ../data/dialogues/dpv-cc-v1-005.txt
download: s3://transcribe-comprehend-demo-2020-09-14-nb-audio-source/dialogues/dpv-cc-v1-003.txt to ../data/dialogues/dpv-cc-v1-003.txt
download: s3://transcribe-comprehend-demo-2020-09-14-nb-audio-source/dialogues/dpv-cc-v1-007.txt to ../data/dialogues/dpv-cc-v1-007.txt
download: s3://transcribe-comprehend-demo-2020-09-14-nb-audio-source/dialogues/dpv-cc-v1-008.txt to ../data/dialogues/dpv-cc-v1-008.txt
download: s3://transcribe-comprehend-demo-2020-0

The cell above should list the 16KHz audio files. Now copy the audio files to the bucket we'll use during the exercise.

In [17]:
import sagemaker as sm

bucket = sm.Session().default_bucket()
audio_source_path = "audio/contact-center/16KHz"
!aws s3 sync "/home/ec2-user/SageMaker/transcribe-comprehend/data/Call_Sentiment_Analysis_PoC/16 KHz/" s3://{bucket}/{audio_source_path}

Check that the files are in the right bucket and folder. The `get_recording_files` function retrieves a list of all files in the specified bucket and path. The result should list only `.wav` files; if it does, we are ready to start transcribing.

In [20]:
from transcribe_utils import (
    Speech, TranscriptReport, download_transcripts, get_recording_files,
    move_transcripts, report_transcript, transcribe_recording, wait_for_jobs,
)

file_names = get_recording_files(bucket=bucket, path=audio_source_path)
print("\n".join(file_names))

0174043c-d3a1-478a-a0ea-539c7075fb54.wav
017404e1-995b-418c-beb4-9ead3f79a7b6.wav
01740574-5169-4491-834e-c0cd2cd6120a.wav
01740677-ab28-4b30-89ca-18cb17b7d56e.wav
0174067d-f3b5-47e9-a826-b8b25239e74c.wav
0174068b-59cc-475a-aec4-659cd3b2f107.wav
01740694-7812-42b3-901e-c1576493d840.wav
01740b64-94fd-4556-9cfa-2010b4187d5f.wav
01742a46-8cc5-40bc-870d-06a3efd9fead.wav
01742a48-9802-45f1-82aa-0be122bd76b9.wav
01742a4a-efc9-4826-a455-ac52d1c2ff16.wav
01742a51-965c-4acb-ad77-68e397d5ff1e.wav
01742a57-8b54-4ca8-8412-fc62379d83b5.wav
01742faa-dc9f-4e3f-af1d-88f9ae19c2dc.wav
01742fb1-0d44-4239-98d7-1b924fbe94ad.wav
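
The internals of `get_recording_files` are not shown in this notebook. As a rough sketch of how such a listing could be produced, the hypothetical helper `list_wav_keys` below pages through an S3 prefix with the boto3 `list_objects_v2` paginator and keeps only the `.wav` objects. The helper name and its return format are assumptions, not the actual `transcribe_utils` code.

```python
def list_wav_keys(s3_client, bucket, prefix):
    """Return the sorted base names of all .wav objects under s3://bucket/prefix.

    Hypothetical helper: `s3_client` is expected to behave like boto3.client("s3").
    """
    names = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.lower().endswith(".wav"):
                # Keep only the file name, without the prefix folders.
                names.append(key.rsplit("/", 1)[-1])
    return sorted(names)


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# print("\n".join(list_wav_keys(boto3.client("s3"), bucket, audio_source_path)))
```

Passing the client in as a parameter keeps the function easy to try out with a stub object before pointing it at a real bucket.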


## Executing the Transcription

### Custom Vocabulary Creation

Before starting the transcriptions, create a custom vocabulary to improve transcription quality. The vocabulary file was copied from the S3 backup above. Here are its contents:

In [30]:
!cat ../data/vocabularies/vocabulary\ v1.txt

Gruner-und-Jahr
Gruner-und-Jahr-Kundenservice
Geo
Brigitte
Abo
P.M.
Zeitschriften
Heften
IBAN
BIC
Auftragsnummer
Botenzustellung
Botenversand
E-Paper
Registrierungslink
Schlehenweg
Mustermannstraße
Musterhausen
Moorrege
Elli-Muster
Murmelmann
Elise-Musterfrau
Margitt

Execute the following two cells to send the vocabulary to Amazon Transcribe and wait for its creation. It should take less than 2 minutes.

In [38]:
import time
import boto3

vocabulary_name = "dpv-cc"
with open("../data/vocabularies/vocabulary v1.txt", "r") as f:
    vocab_entries = f.readlines()
# Drop the trailing newline from each entry
vocab_entries = [entry.rstrip('\n') for entry in vocab_entries]
transcribe = boto3.client('transcribe')
vocab_result = transcribe.create_vocabulary(VocabularyName=vocabulary_name, LanguageCode='de-DE', Phrases=vocab_entries)

In [39]:
while vocab_result['VocabularyState'] not in {"READY", "FAILED"}:
    # Report the non-terminal state first, then wait and poll again
    print(f"Vocabulary {vocabulary_name} still {vocab_result['VocabularyState']}...")
    time.sleep(30)
    vocab_result = transcribe.get_vocabulary(VocabularyName=vocabulary_name)

print(f"Vocabulary {vocabulary_name} {vocab_result['VocabularyState']}.")

Vocabulary dpv-cc still PENDING...
Vocabulary dpv-cc still PENDING...
Vocabulary dpv-cc still READY...
Vocabulary dpv-cc READY...


Vocabulary creation can take a few minutes. You can check the progress on the <a href="https://console.aws.amazon.com/transcribe/home?region=us-east-1#vocabulary" target="_blank">Amazon Transcribe Console</a>. Wait until creation is complete, then continue the execution.
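
A polling loop without an upper bound waits forever if the vocabulary never leaves `PENDING`. As a minimal sketch, the hypothetical helper `wait_for_state` below adds a timeout to the same wait-and-poll pattern. Its name, parameters, and default values are assumptions chosen for illustration.

```python
import time


def wait_for_state(fetch_state, terminal_states, interval=30.0, timeout=600.0,
                   sleep=time.sleep):
    """Poll fetch_state() until it returns a terminal state or the timeout elapses."""
    waited = 0.0
    state = fetch_state()
    while state not in terminal_states:
        if waited >= timeout:
            raise TimeoutError(f"Still {state} after {waited:.0f} seconds")
        sleep(interval)
        waited += interval
        state = fetch_state()
    return state


# Usage with the vocabulary from this notebook (requires AWS credentials):
# state = wait_for_state(
#     lambda: transcribe.get_vocabulary(VocabularyName=vocabulary_name)["VocabularyState"],
#     terminal_states={"READY", "FAILED"},
# )
```

The `sleep` parameter exists only so the function can be exercised without real delays.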

### Execute Transcription Jobs

The cell below will submit transcription jobs for all the files in the bucket and path defined above. If you want to monitor the jobs themselves, open the <a href="https://console.aws.amazon.com/transcribe/home?region=us-east-1#" target="_blank">Amazon Transcribe console</a> in a new tab **before executing it**.

If it's the first execution, the tab should be empty. Go ahead and execute the next cell to submit the audio recordings for transcription (you will need to refresh the console tab to see the jobs).

In [41]:
ts = str(int(time.time()))  # Avoid name conflicts in case of resubmission
base_job_name = f"dpv-cc{ts}"
transcribe_jobs = []
for i, recording in enumerate(file_names):
    transcribe_jobs.append(
        transcribe_recording(
            recording, 
            job_name=f"{base_job_name}-{i + 1:03d}",
            bucket=bucket,
            path=audio_source_path,
            VocabularyName=vocabulary_name
        )
    )
    

Starting job dpv-cc1600045852-001 for s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/0174043c-d3a1-478a-a0ea-539c7075fb54.wav
Starting job dpv-cc1600045852-002 for s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/017404e1-995b-418c-beb4-9ead3f79a7b6.wav
Starting job dpv-cc1600045852-003 for s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/01740574-5169-4491-834e-c0cd2cd6120a.wav
Starting job dpv-cc1600045852-004 for s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/01740677-ab28-4b30-89ca-18cb17b7d56e.wav
Starting job dpv-cc1600045852-005 for s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/0174067d-f3b5-47e9-a826-b8b25239e74c.wav
Starting job dpv-cc1600045852-006 for s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/0174068b-59cc-475a-aec4-659cd3b2f107.wav
Starting job dpv-cc1600045852-007 for s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/01740694-7812-42b3-901e-c1576493d
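
The body of `transcribe_recording` lives in `transcribe_utils` and is not reproduced here. To give an idea of what a job submission involves, the sketch below builds the keyword arguments for the boto3 `start_transcription_job` call. The function `build_transcription_request` is hypothetical, and the choice of channel identification and of the output bucket are assumptions inferred from the outputs in this notebook.

```python
def build_transcription_request(job_name, bucket, path, file_name,
                                vocabulary_name=None, language_code="de-DE"):
    """Build keyword arguments for transcribe.start_transcription_job (sketch)."""
    # Assumption: agent and customer are recorded on separate audio channels.
    settings = {"ChannelIdentification": True}
    if vocabulary_name:
        settings["VocabularyName"] = vocabulary_name
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": "wav",
        "Media": {"MediaFileUri": f"s3://{bucket}/{path}/{file_name}"},
        # Assumption: transcripts are written to the same bucket as the audio.
        "OutputBucketName": bucket,
        "Settings": settings,
    }


# Placeholder names, for illustration only.
request = build_transcription_request(
    "example-job-001", "my-bucket", "audio/contact-center/16KHz", "example.wav",
    vocabulary_name="dpv-cc",
)
print(request["Media"]["MediaFileUri"])

# Submitting the job requires AWS credentials:
# import boto3
# boto3.client("transcribe").start_transcription_job(**request)
```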

The next cell waits for Amazon Transcribe to finish all jobs, then moves the resulting transcripts to the `transcribe-output` path of the bucket.

In [42]:
from pprint import pprint

transcript_dest_path = "transcribe-output"
final_job_results = wait_for_jobs(transcribe_jobs)
move_transcripts(jobs=final_job_results, dest_bucket=bucket, dest_path=transcript_dest_path)
pprint(final_job_results)

Waiting
At least one in progress, wait more
At least one in progress, wait more
At least one in progress, wait more
At least one in progress, wait more
At least one in progress, wait more
Job dpv-cc1600045852-001 finished with status COMPLETED
	Media: s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/0174043c-d3a1-478a-a0ea-539c7075fb54.wav
	Transcript: https://s3.us-east-1.amazonaws.com/sagemaker-us-east-1-160951647621/dpv-cc1600045852-001.json

Job dpv-cc1600045852-002 finished with status COMPLETED
	Media: s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/017404e1-995b-418c-beb4-9ead3f79a7b6.wav
	Transcript: https://s3.us-east-1.amazonaws.com/sagemaker-us-east-1-160951647621/dpv-cc1600045852-002.json

Job dpv-cc1600045852-003 finished with status COMPLETED
	Media: s3://sagemaker-us-east-1-160951647621/audio/contact-center/16KHz/01740574-5169-4491-834e-c0cd2cd6120a.wav
	Transcript: https://s3.us-east-1.amazonaws.com/sagemaker-us-east-1-160951647621/dpv-c

Download the transcripts to local storage so you can inspect them.

In [44]:
local_transcripts_path = "../data/transcriptions/16KHz"
download_transcripts(bucket, transcript_dest_path, local_transcripts_path)

Downloading transcribe-output/dpv-cc1600045852-001.json into ../data/transcriptions/16KHz/dpv-cc1600045852-001.json
Downloading transcribe-output/dpv-cc1600045852-002.json into ../data/transcriptions/16KHz/dpv-cc1600045852-002.json
Downloading transcribe-output/dpv-cc1600045852-003.json into ../data/transcriptions/16KHz/dpv-cc1600045852-003.json
Downloading transcribe-output/dpv-cc1600045852-004.json into ../data/transcriptions/16KHz/dpv-cc1600045852-004.json
Downloading transcribe-output/dpv-cc1600045852-005.json into ../data/transcriptions/16KHz/dpv-cc1600045852-005.json
Downloading transcribe-output/dpv-cc1600045852-006.json into ../data/transcriptions/16KHz/dpv-cc1600045852-006.json
Downloading transcribe-output/dpv-cc1600045852-007.json into ../data/transcriptions/16KHz/dpv-cc1600045852-007.json
Downloading transcribe-output/dpv-cc1600045852-008.json into ../data/transcriptions/16KHz/dpv-cc1600045852-008.json
Downloading transcribe-output/dpv-cc1600045852-009.json into ../data/tra

You can now navigate to the `data/transcriptions/16KHz` folder and inspect the results of the transcriptions. You'll see that they contain:
- A full text transcription
- A detailed word-by-word transcription for each channel, with start time, end time and confidence.
- A list of speech segments per channel, also with start and end times and the list of words identified in each.

While very detailed, this format is difficult to read. In the next section we'll generate some basic reports and add sentiment information to the transcriptions.
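
To get a feel for the structure, the sketch below pulls the full text and the timed words out of a transcript document. It assumes the usual Amazon Transcribe layout (`results.transcripts` and `results.items`). The function `summarize_transcript` is hypothetical, and the sample document is hand-made for illustration rather than real output.

```python
def summarize_transcript(doc):
    """Return the full text and a list of (word, start, end, confidence) tuples."""
    results = doc["results"]
    full_text = results["transcripts"][0]["transcript"]
    words = []
    for item in results["items"]:
        # Punctuation items carry no timestamps, so only spoken words are kept.
        if item["type"] != "pronunciation":
            continue
        best = item["alternatives"][0]
        words.append((best["content"], float(item["start_time"]),
                      float(item["end_time"]), float(best["confidence"])))
    return full_text, words


# Hand-made sample in the shape of a transcript file (not real output).
sample = {
    "results": {
        "transcripts": [{"transcript": "Guten Tag."}],
        "items": [
            {"type": "pronunciation", "start_time": "0.04", "end_time": "0.35",
             "alternatives": [{"content": "Guten", "confidence": "0.98"}]},
            {"type": "pronunciation", "start_time": "0.35", "end_time": "0.7",
             "alternatives": [{"content": "Tag", "confidence": "1.0"}]},
            {"type": "punctuation",
             "alternatives": [{"content": ".", "confidence": "0.0"}]},
        ],
    }
}

text, words = summarize_transcript(sample)
print(text)
for word, start, end, confidence in words:
    print(f"{start:6.2f}-{end:6.2f}  {confidence:.2f}  {word}")
```

For a downloaded file, load it first with `json.load` and pass the resulting dictionary to the function.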

## Processing Transcription Outputs

The cell below will take each transcription and generate:
- A human-readable text report of the conversation
- A sentiment analysis of the overall conversation
- A detailed sentiment analysis of each segment

In [49]:
import os
from glob import glob

from comprehend_utils import Sentiment, analyze_sentiment

def print_transcript_report(transcript_file, report, general_sentiment, sentiments, dest_processed_file=None):
    path = os.path.abspath(dest_processed_file if dest_processed_file else os.path.dirname(transcript_file))
    base_name, ext = os.path.splitext(os.path.basename(transcript_file))
    dest_file = os.path.join(path, f"{base_name}.txt")
    with open(dest_file, "w") as dest:
        dest.write(f"Job:\t\t{report.job}\nRecording:\t{report.recording}\nTranscript:\t{base_name}{ext}\n"
                   f"Speakers:\t{sorted(report.speakers)}\n")
        dest.write(f"Full Text:\n{report.full_text}\n")
        # noinspection PyProtectedMember
        dest.write(f"Sentiment: {general_sentiment.general}"
                   f" ({', '.join(f'{k[:3]}={v:0.3f}' for (k, v) in general_sentiment._asdict().items() if k != 'general')})\n")
        dest.write("\nDialogue:\n")
        for ((_, _, speaker, speech), sentiment) in zip(report.dialogue, sentiments):
            pred_sentiment: str = sentiment.general
            if sentiment.general == "POSITIVE":
                sentiment_strength = sentiment.positive
            elif sentiment.general == "NEGATIVE":
                sentiment_strength = sentiment.negative
            elif sentiment.general == "NEUTRAL":
                sentiment_strength = sentiment.neutral
            else:
                sentiment_strength = sentiment.mixed
            dest.write(f"{speaker} ({pred_sentiment[:3].lower()}: {sentiment_strength:0.3f}): {speech}\n")

comprehend = boto3.client('comprehend')
dialogues = []
for transcript in glob(f"{local_transcripts_path}/*.json"):
    transcript_report = report_transcript(transcript)
    # The full text is truncated, presumably to stay within the Amazon Comprehend input size limit
    general_sentiment = analyze_sentiment(transcript_report.full_text[:4500])
    sentiments = analyze_sentiment([speech.speech for speech in transcript_report.dialogue])
    print_transcript_report(transcript, transcript_report, general_sentiment, sentiments)
    dialogues.append((transcript, transcript_report.dialogue, sentiments))

Now you can see a text file for each transcription, which contains a readable report of the conversation. Open a few of them to see the results.
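
The per-segment analysis above goes through `analyze_sentiment`, whose implementation is in `comprehend_utils`. As a sketch of how a list of segments could be scored, the hypothetical function `batch_sentiment` below calls the Amazon Comprehend `BatchDetectSentiment` operation in chunks of 25 documents, the maximum per call. The `SegmentSentiment` tuple only mirrors the fields this notebook reads from `Sentiment`; the real class may differ.

```python
from collections import namedtuple

# Assumption: same field names as the notebook reads from `Sentiment`.
SegmentSentiment = namedtuple(
    "SegmentSentiment", "general positive negative neutral mixed")


def batch_sentiment(comprehend_client, texts, language_code="de", batch_size=25):
    """Score each text with BatchDetectSentiment, one chunk of texts per call."""
    sentiments = [None] * len(texts)  # Entries stay None if Comprehend reports an error
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        response = comprehend_client.batch_detect_sentiment(
            TextList=chunk, LanguageCode=language_code)
        for result in response["ResultList"]:
            score = result["SentimentScore"]
            # "Index" is the position inside the chunk, so offset it by `start`.
            sentiments[start + result["Index"]] = SegmentSentiment(
                result["Sentiment"], score["Positive"], score["Negative"],
                score["Neutral"], score["Mixed"])
    return sentiments


# Usage (requires AWS credentials):
# import boto3
# scores = batch_sentiment(boto3.client("comprehend"),
#                          [speech.speech for speech in transcript_report.dialogue])
```

Batching keeps the number of API calls low for long conversations. Segments that Comprehend rejects are returned in the response's `ErrorList` and remain `None` in this sketch.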

To build better visualizations, create some files from the `dialogues` list. Execute the following cells to generate an HTML page, an Excel file and a pickled pandas DataFrame, which we will use for visualization.

In [56]:
import pandas as pd

def dialogues_to_df(dialogues):
    """
    :param dialogues: List of transcribed dialogues with analysis. Each is (<transcript file name>, <dialogue>, <dialogue sentiment>)
    :return: A DataFrame with one row per speech segment and its sentiment scores
    """
    data = {
        'transcript': [],
        'recording': [],
        'index': [],
        'speaker': [],
        'pred_sent': [],
        'speech': [],
        'positive': [],
        'negative': [],
        'mixed': [],
        'neutral': []
    }
    for (transcript, dialogue, sentiments) in dialogues:
        for i, (speech, sentiment) in enumerate(zip(dialogue, sentiments)):
            data['transcript'].append(os.path.basename(transcript))
            data['recording'].append(speech.recording)
            data['index'].append(i)
            data['speaker'].append(speech.speaker)
            data['pred_sent'].append(sentiment.general)
            data['speech'].append(speech.speech)
            data['positive'].append(sentiment.positive)
            data['negative'].append(sentiment.negative)
            data['mixed'].append(sentiment.mixed)
            data['neutral'].append(sentiment.neutral)
    df = pd.DataFrame(data)
    return df

def export_df(df, export_path, export_format="pickle", hdf_key=None) -> None:
    """
    Exports the dataframe in a variety of formats. Dataframe is expected to have columns `transcript` and `index`

    :param df: The dataframe to be exported
    :param export_path: Where to write the exported dataframe
    :param export_format: One of "html", "csv", "json", "parquet", "pickle", "excel", "hdf"
    :param hdf_key: if hdf format is used, key to store the df under in the HDF5 file
    """
    _, ext = os.path.splitext(export_path)
    use_ext = '.' + export_format if len(ext) == 0 else ''
    if export_format == "html":
        color_dict = {
            'POSITIVE': 'limegreen',
            'NEGATIVE': 'red',
            'NEUTRAL': 'lightgrey',
            'MIXED': 'yellow'
        }
        spk_dict = {
            'ch_0': '#F0F8FF',
            'spk_0': '#F0F8FF',
            'ch_1': '#FFF8DC',
            'spk_1': '#FFF8DC'
        }
        df.set_index(
            ['transcript', 'recording', 'speaker', 'index']
        ).to_html(export_path + use_ext, encoding='utf-8',
                  formatters={
                      'pred_sent': lambda sent: f'<span style="background-color:{color_dict[sent]}">{sent}</span>',
                      'speaker': lambda speaker: f'<span style="background-color:{spk_dict[speaker]}">{speaker}</span>'
                  }, escape=False)
    elif export_format == "csv":
        df.to_csv(export_path + use_ext)
    elif export_format == "json":
        df.to_json(export_path + use_ext)
    elif export_format == "parquet":
        df.to_parquet(export_path + use_ext)
    elif export_format == "pickle":
        df.to_pickle(export_path + use_ext)
    elif export_format == "excel":
        if len(use_ext) > 0:
            use_ext = '.xlsx'
        df.to_excel(export_path + use_ext, index=False)
    elif export_format == "hdf":
        if hdf_key is None:
            raise ValueError("Parameter hdf_key must be provided when export_format is 'hdf'.")
        df.to_hdf(export_path + use_ext, key=hdf_key)
    else:
        raise ValueError(f"Unknown export format: {export_format}")

In [57]:
local_export_path = "../data/results/transcription"
df = dialogues_to_df(dialogues)
export_df(df, local_export_path, "html")
export_df(df, local_export_path, "pickle")
export_df(df, local_export_path, "excel")

With all processing done, you can open the `transcript_viz` notebook. You can also look at the code inside the `transcribe_utils` package and the `comprehend_utils` package to better understand all that was done above.