# PART 2 - Text and Metadata extraction from the audio of the video file

In the following section of the lab, we're going to:
- Transcribe the file's audio into text using Amazon Transcribe
- Prepare the transcript data 
- Run a topic modelling job using Amazon Comprehend to extract topics
- Run an NER (Named Entity Recognition) job using Amazon Comprehend to extract names and entities (e.g. countries, places, etc)
</br>

All of this metadata will then be used alongside the metadata extracted via computer vision with Amazon Rekognition to populate our knowledge graph in part 3.
</br>

In [1]:
#load stored variable from lab0 notebook
%store -r

In [None]:
!pip install jsonlines

## Transcribe the file's audio into text
Amazon Transcribe uses machine learning to recognize speech in audio and video files and transcribe that speech into text. Practical use cases for Amazon Transcribe include transcriptions of customer-agent calls and closed captions for videos.

https://docs.aws.amazon.com/transcribe/latest/dg/transcribe-whatis.html

In [2]:
import boto3
import os
import random
import time
import urllib
import json
import csv
import tarfile
import pandas as pd
import jsonlines

In [3]:
transcribe = boto3.client('transcribe')

transcribe_job_name = "transcribe_job_knowledge_graph" + str(random.randint(0, 100000))

transcribe_job_uri = "s3://" + os.path.join(bucket, s3_video_input_path, video_file)

transcription_job = transcribe.start_transcription_job(
    TranscriptionJobName=transcribe_job_name,
    Media={'MediaFileUri': transcribe_job_uri},
    MediaFormat='mp4',
    LanguageCode='en-US',
    OutputBucketName=bucket
)

Monitoring the job's completion

In [4]:
print(transcribe_job_name)
while True:
    status = transcribe.get_transcription_job(TranscriptionJobName=transcribe_job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print(".", end='')
    time.sleep(5)
print(status['TranscriptionJob']['TranscriptionJobStatus'])

transcribe_job_knowledge_graph5051
...............COMPLETED
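
The loop above polls until the job reaches a terminal state, but it would run forever if the job stalled. Below is a minimal sketch of a reusable poller with a timeout. The names `wait_for_job` and `get_status`, and the default timings, are hypothetical and chosen for illustration; they are not part of boto3.

```python
import time

def wait_for_job(get_status, terminal_states=("COMPLETED", "FAILED"),
                 poll_seconds=5, timeout_seconds=1800, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses.

    get_status is any zero-argument callable returning the job status string.
    """
    waited = 0
    while True:
        status = get_status()
        if status in terminal_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With Transcribe, `get_status` could be a lambda that calls `transcribe.get_transcription_job(TranscriptionJobName=transcribe_job_name)` and returns `['TranscriptionJob']['TranscriptionJobStatus']` from the response.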


Download the transcript file from the S3 bucket

In [5]:
s3 = boto3.Session().resource('s3')

s3_transcript_file_url = status['TranscriptionJob']['Transcript']['TranscriptFileUri']

S3_transcript_file_name = s3_transcript_file_url.split('/')[-1]

local_transcribe_file_path = os.path.join(tmp_local_folder, S3_transcript_file_name)

s3.Bucket(bucket).Object(S3_transcript_file_name).download_file(local_transcribe_file_path)
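
The cell above uses the last path segment of the transcript URI as the S3 key, which works when the transcript sits at the root of the bucket. If the transcript were written under a key prefix, the full key would be needed. The sketch below assumes the URI has the path-style form `https://s3.<region>.amazonaws.com/<bucket>/<key>`; the helper name and the example URL are hypothetical.

```python
from urllib.parse import urlparse

def split_path_style_s3_url(url):
    """Split a path-style S3 HTTPS URL into (bucket, key)."""
    path = urlparse(url).path.lstrip("/")
    bucket_name, _, key = path.partition("/")
    return bucket_name, key

# Hypothetical example URL; the bucket and key are made up for illustration.
example_url = "https://s3.us-east-1.amazonaws.com/my-bucket/some/prefix/job.json"
print(split_path_style_s3_url(example_url))
```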

In [6]:
#load the transcript JSON, closing the file once it has been read
with open(local_transcribe_file_path) as transcribe_file:
    transcribe_json_data = json.load(transcribe_file)

Let's have a look at the output. Below is the itemised version of the transcript, showing the first 5 items (words and punctuation marks).

In [7]:
transcript_items = transcribe_json_data['results']['items']
transcript_items[:5]

[{'start_time': '0.04',
  'end_time': '0.3',
  'alternatives': [{'confidence': '0.9671', 'content': "who's"}],
  'type': 'pronunciation'},
 {'start_time': '0.3',
  'end_time': '0.71',
  'alternatives': [{'confidence': '0.9985', 'content': 'excited'}],
  'type': 'pronunciation'},
 {'start_time': '0.71',
  'end_time': '0.81',
  'alternatives': [{'confidence': '1.0', 'content': 'for'}],
  'type': 'pronunciation'},
 {'start_time': '0.81',
  'end_time': '1.88',
  'alternatives': [{'confidence': '0.9992', 'content': 'Jamboree'}],
  'type': 'pronunciation'},
 {'alternatives': [{'confidence': '0.0', 'content': ','}],
  'type': 'punctuation'}]
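
To read the itemised output as plain text, the items can be joined back together, attaching each punctuation mark to the word before it. This is a minimal sketch; `items_to_text` is a hypothetical helper, and the sample items keep only the fields the function needs.

```python
def items_to_text(items):
    """Join Transcribe items into text, attaching punctuation to the previous word."""
    parts = []
    for item in items:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation" and parts:
            parts[-1] += content
        else:
            parts.append(content)
    return " ".join(parts)

# Sample items mirroring the output above, reduced to the fields used here.
sample_items = [
    {"type": "pronunciation", "alternatives": [{"content": "who's"}]},
    {"type": "pronunciation", "alternatives": [{"content": "excited"}]},
    {"type": "pronunciation", "alternatives": [{"content": "for"}]},
    {"type": "pronunciation", "alternatives": [{"content": "Jamboree"}]},
    {"type": "punctuation", "alternatives": [{"content": ","}]},
]
print(items_to_text(sample_items))  # who's excited for Jamboree,
```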

</br>
Loading the file into memory to be used later when building the graph.

## Formatting the transcript to be consumed by Amazon Comprehend for the following 2 jobs.
The documentation explains that we can format the input CSV file in 2 ways. Either we provide one document per file or a file containing one document per line. We're going to pick the latter option.

We have different ways of splitting that text into "blocks" of words. One logical way would be to split it sentence by sentence.</br>
Here we choose to segment the transcript into 1-minute chunks.</br>
The reason is that we will later attach video and audio metadata to 1-minute video segments, which gives us fine-grained information about the video.</br>

TODO: implement the cut at a full stop.
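
One possible way to approach this TODO is sketched below, as an illustration rather than the lab's implementation. Punctuation items carry no timestamps, so the sketch tracks the start time of the last word and closes a segment at the first sentence-ending mark once that word is past the segment boundary. The function name is hypothetical. Note the trade-off: segments no longer align exactly with 1-minute video segments.

```python
def segment_at_full_stops(items, segment_size_ms=60000):
    """Group words into segments of about segment_size_ms, closing each one
    at the first sentence-ending punctuation mark after the boundary."""
    segments, buffer = [], []
    segment_end = segment_size_ms
    last_start = 0.0
    for item in items:
        content = item["alternatives"][0]["content"]
        if item["type"] == "pronunciation":
            last_start = float(item["start_time"]) * 1000
            buffer.append(content)
        elif content in {".", "?", "!"} and last_start > segment_end and buffer:
            segments.append(" ".join(buffer))
            buffer = []
            # advance the boundary past the last word seen
            while segment_end < last_start:
                segment_end += segment_size_ms
    if buffer:
        segments.append(" ".join(buffer))
    return segments

def _word(text, start):
    return {"type": "pronunciation", "start_time": str(start),
            "alternatives": [{"content": text}]}

_stop = {"type": "punctuation", "alternatives": [{"content": "."}]}

# Toy transcript with a 1-second segment size to keep the example short.
toy_items = [_word("a", 0.2), _word("b", 0.9), _stop,
             _word("c", 1.2), _word("d", 1.5), _stop, _word("e", 1.8)]
print(segment_at_full_stops(toy_items, segment_size_ms=1000))
```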

In [8]:
segment_size_ms = 60000

The following function uses the timestamp of each item to break the whole transcript into 1-minute chunks.

In [9]:
def prepare_transcribed_text_for_topic_modelling(transcript_items, segment_size_ms=60000):

    #initialising current segment with segment size
    current_segment_end = segment_size_ms
    sentence_list_per_segment = []
    buffer_sentence = []
    for item in transcript_items:
        
        #filter on pronunciation, ignoring punctuation for the moment
        type_ = item['type']
        if type_ == 'pronunciation':
            start = float(item['start_time']) * 1000
            end = float(item['end_time']) * 1000
            content = item['alternatives'][0]['content']
            
            # splitting text across the different segments
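            if start > current_segment_end:
                #close the current segment before starting a new one
                if (len(buffer_sentence) > 0):
                    #appending "\r\n" at the end of each line - requirement from comprehend
                    #buffer_sentence.append("\r\n")
                    sentence_list_per_segment.append(' '.join(buffer_sentence))
                buffer_sentence = []
                #move to the segment containing this word, skipping silent segments
                while start > current_segment_end:
                    current_segment_end += segment_size_ms
            #always keep the current word, including the one that opens a new segment
            buffer_sentence.append(content)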
                
    #flush the buffer at the end
    if (len(buffer_sentence) > 0):
        #buffer_sentence.append("\r\n")
        sentence_list_per_segment.append(' '.join(buffer_sentence))
    
    return sentence_list_per_segment

In [10]:
#getting the transcript in the right format

video_transcript = prepare_transcribed_text_for_topic_modelling(transcript_items, segment_size_ms)

We're now writing the transcript in csv format in S3 to be consumed by Comprehend

In [11]:
#writing the transcript in csv format in S3 to be consumed by Comprehend
def write_list_to_csv(local_file_path, rows, bucket, path):
    filename = local_file_path.split('/')[-1]
    #create file locally (newline='' lets the csv module control line endings)
    with open(local_file_path, 'w+', newline='') as f:
        write = csv.writer(f)
        for row in rows:
            write.writerow([row])
    #upload to S3
    boto3.resource('s3').Bucket(bucket).Object(os.path.join(path, filename)).upload_file(local_file_path)
    print(f"{filename} uploaded to s3://{bucket}/{path}/{filename}")
            
transcript_filename = 'video_transcript.csv'
s3_comprehend_input_path = 'comprehend-input'

write_list_to_csv(os.path.join(tmp_local_folder, transcript_filename), 
                  video_transcript, 
                  bucket, 
                  s3_comprehend_input_path)

video_transcript.csv uploaded to s3://sagemaker-knowledge-graph-ap-southeast-2-327216439222-53755/comprehend-input/video_transcript.csv


As a check, we count the lines in the file we just created. The count should correspond to the duration of our video in minutes (rounded up, assuming there is speech in every minute).

In [12]:
with open(os.path.join(tmp_local_folder, transcript_filename)) as f:
    num_lines = sum(1 for line in f)
print(f'Number of lines in our file: {num_lines}')

Number of lines in our file: 5
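
Counting raw lines works here because each segment is a single line of text. A slightly more robust check is to read the file back with `csv.reader` and count records, which also confirms that quoting survives the round trip. The sketch below uses made-up rows and a temporary file, so the file name and contents are illustrative only.

```python
import csv
import os
import tempfile

# Made-up rows, including characters that force the csv module to quote.
rows = ["first minute of speech", 'a segment with a comma, and a "quote"']

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example_transcript.csv")
    # newline='' lets the csv module control line endings itself
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([row])
    # read the file back and keep the single column of each record
    with open(csv_path, newline="") as f:
        read_back = [record[0] for record in csv.reader(f)]

print(len(read_back))
```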


## Comprehend - Topic detection
We're now ready to launch the first job.

You can use Amazon Comprehend to examine the content of a collection of documents to determine common themes. For example, you can give Amazon Comprehend a collection of news articles, and it will determine the subjects, such as sports, politics, or entertainment. The text in the documents doesn't need to be annotated.

Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word. The set of words that frequently belong to the same context across the entire document set make up a topic.

https://docs.aws.amazon.com/comprehend/latest/dg/topic-modeling.html

In [13]:
comprehend = boto3.client('comprehend')

s3_output_data_comprehend = os.path.join("s3://", bucket, 'comprehend-tm-output')
s3_input_data_comprehend = os.path.join("s3://", bucket, s3_comprehend_input_path)

response = comprehend.start_topics_detection_job(
    InputDataConfig={
        'S3Uri': s3_input_data_comprehend,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_output_data_comprehend,
    },
    DataAccessRoleArn=role_arn,
    JobName='comprehend_job_knowledge_graph_' + str(random.randint(0,100000)),
    NumberOfTopics=15
)


Monitoring the progress of the job

In [14]:
while True:
    status = comprehend.describe_topics_detection_job(JobId=response['JobId'])
    if status['TopicsDetectionJobProperties']['JobStatus']  in ['COMPLETED', 'FAILED']:
        break
    print(".", end='')
    time.sleep(10)
print(comprehend.describe_topics_detection_job(JobId=response['JobId'])['TopicsDetectionJobProperties']['JobStatus'])

..................................................COMPLETED


After Amazon Comprehend processes your document collection, it returns a compressed archive containing two files, topic-terms.csv and doc-topics.csv. 

The first output file, topic-terms.csv, is a list of topics in the collection. For each topic, the list includes, by default, the top terms by topic according to their weight. 

The second file, doc-topics.csv, lists the documents associated with a topic and the proportion of the document that is concerned with the topic. If you specified ONE_DOC_PER_FILE the document is identified by the file name. If you specified ONE_DOC_PER_LINE the document is identified by the file name and the 0-indexed line number within the file. 
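
Since we used ONE_DOC_PER_LINE, each document name has the form `<file name>:<line number>`. A minimal sketch of how such a name could be mapped back to a position in the video follows. `parse_docname` is a hypothetical helper, and the mapping to a start time assumes that line N of the input file holds the Nth segment, with no silent segment skipped.

```python
def parse_docname(docname, segment_size_ms=60000):
    """Split 'file.csv:4' into (file name, line number, segment start in ms)."""
    file_name, _, line_number = docname.rpartition(":")
    line_number = int(line_number)
    return file_name, line_number, line_number * segment_size_ms

print(parse_docname("video_transcript.csv:4"))
```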

### Download and extract the comprehend topic detection output

In [15]:
#function to extract a tar file
def extract(tar_file, path):
    #check the file type first: tarfile.open raises an error on a non-tar file
    if tarfile.is_tarfile(tar_file):
        with tarfile.open(tar_file) as opened_tar:
            opened_tar.extractall(path)
        return path
    else:
        print("The file you entered is not a tar file")

#download
def download_and_extract_comprehend_job_output(output_s3_uri, dl_path):
    s3_bucket = output_s3_uri.split('/')[2]
    s3_file_path = '/'.join(output_s3_uri.split('/', 3)[3:])
    local_file_path = os.path.join(dl_path, output_s3_uri.split('/')[-1])

    boto3.resource('s3').Bucket(s3_bucket).Object(s3_file_path).download_file(local_file_path)
    return extract(local_file_path, dl_path)

In [16]:
topics_output_s3_uri = comprehend.describe_topics_detection_job(JobId=response['JobId'])['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']

job_comprehend_output_folder = download_and_extract_comprehend_job_output(topics_output_s3_uri, tmp_local_folder)

Looking into the 2 output files and loading them into dataframes for later use.

In [17]:
topics_file = 'doc-topics.csv'
topic_terms_file = 'topic-terms.csv'
comprehend_topics_df = pd.read_csv(os.path.join(tmp_local_folder, topics_file))
comprehend_terms_df = pd.read_csv(os.path.join(tmp_local_folder, topic_terms_file))

Displaying the first 5 rows: documents and their topics

In [18]:
comprehend_topics_df.head(5)

Unnamed: 0,docname,topic,proportion
0,video_transcript.csv:4,1,1.0
1,video_transcript.csv:0,2,1.0
2,video_transcript.csv:3,2,1.0
3,video_transcript.csv:1,6,0.861177
4,video_transcript.csv:1,3,0.138823
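
A document can be associated with several topics, as with `video_transcript.csv:1` above. If only one topic per segment is wanted, the one with the highest proportion can be kept. The sketch below works on plain tuples copied from the rows above; `dominant_topics` is a hypothetical helper.

```python
# (docname, topic, proportion) rows copied from the output above
rows = [
    ("video_transcript.csv:4", 1, 1.0),
    ("video_transcript.csv:0", 2, 1.0),
    ("video_transcript.csv:3", 2, 1.0),
    ("video_transcript.csv:1", 6, 0.861177),
    ("video_transcript.csv:1", 3, 0.138823),
]

def dominant_topics(rows):
    """Return {docname: topic} keeping the topic with the highest proportion."""
    best = {}
    for docname, topic, proportion in rows:
        if docname not in best or proportion > best[docname][1]:
            best[docname] = (topic, proportion)
    return {doc: topic for doc, (topic, _) in best.items()}

print(dominant_topics(rows))
```

On the dataframe itself, a similar result could be obtained with `comprehend_topics_df.loc[comprehend_topics_df.groupby('docname')['proportion'].idxmax()]`.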


Displaying the top 10 words for topic 1. This gives us an idea of what the topic is about. Remember that topic modelling does not output a specific label. Instead, it outputs an unlabelled topic, or grouping of documents, for which we have a list of prominent words and their weight/importance.

In [19]:
comprehend_terms_df[comprehend_terms_df['topic'] == 1]

Unnamed: 0,topic,term,weight
10,1,space,0.02934
11,1,great,0.008815
12,1,moe's,0.008815
13,1,show,0.008815
14,1,field,0.008815
15,1,medicine,0.008815
16,1,eye,0.008815
17,1,crucial,0.008815
18,1,today,0.008815
19,1,conversation,0.008815


In [20]:
comprehend_terms_df[comprehend_terms_df['topic'] == 6]

Unnamed: 0,topic,term,weight
60,6,world,0.05252
61,6,science,0.014937
62,6,earth,0.014683
63,6,woman,0.013489
64,6,story,0.012959
65,6,action,0.012411
66,6,shoot,0.012411
67,6,proud,0.012409
68,6,today,0.012408
69,6,important,0.012408
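
Because topics are unlabelled, a short list of the heaviest terms is a convenient stand-in for a label. This is a minimal sketch of that idea on a few rows copied from the two outputs above; `top_terms` is a hypothetical helper.

```python
from collections import defaultdict

def top_terms(rows, n=3):
    """Return {topic: [terms]} with the n highest-weight terms per topic."""
    by_topic = defaultdict(list)
    for topic, term, weight in rows:
        by_topic[topic].append((weight, term))
    return {
        topic: [term for _, term in sorted(pairs, key=lambda p: -p[0])[:n]]
        for topic, pairs in by_topic.items()
    }

# (topic, term, weight) rows copied from the outputs above
sample_rows = [
    (1, "space", 0.02934),
    (1, "great", 0.008815),
    (6, "world", 0.05252),
    (6, "science", 0.014937),
    (6, "earth", 0.014683),
]
print(top_terms(sample_rows, n=2))
```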


## Comprehend - Named Entity Recognition (NER)

We're now going to extract named entities from the video's transcript, still using Amazon Comprehend.

An entity is a textual reference to the unique name of a real-world object such as people, places, and commercial items, and to precise references to measures such as dates and quantities.

https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html

In [21]:
response_NER = comprehend.start_entities_detection_job(
    InputDataConfig={
        'S3Uri': s3_input_data_comprehend,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_output_data_comprehend,
    },
    LanguageCode='en',
    DataAccessRoleArn=role_arn,
    JobName='comprehend_job_knowledge_graph_NER' + str(random.randint(0,100000)),
)

In [22]:
while True:
    status_NER = comprehend.describe_entities_detection_job(JobId=response_NER['JobId'])
    if status_NER['EntitiesDetectionJobProperties']['JobStatus']  in ['COMPLETED', 'FAILED']:
        break
    print(".", end='')
    time.sleep(10)
print(comprehend.describe_entities_detection_job(JobId=response_NER['JobId'])['EntitiesDetectionJobProperties']['JobStatus'])

.....................................COMPLETED


In [23]:
ner_output_s3_uri = comprehend.describe_entities_detection_job(JobId=response_NER['JobId'])['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']
job_comprehend_output_folder = download_and_extract_comprehend_job_output(ner_output_s3_uri, tmp_local_folder)

Let's look into the output of the NER job. As you can see, we've got different types of entities, such as PERSON, DATE, QUANTITY, LOCATION, ORGANIZATION, EVENT and OTHER.

In [24]:
ner_job_data = []
with jsonlines.open(os.path.join(tmp_local_folder, 'output')) as ner_json_reader:
    for obj in ner_json_reader:
        ner_job_data.append(obj['Entities'])
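
Each entity comes with a confidence score, so one option before using the entities downstream is to drop low-confidence ones and group the rest by type. The sketch below assumes entities shaped like the job output (`Score`, `Text`, `Type`); the helper name and the 0.8 threshold are arbitrary choices for illustration.

```python
from collections import defaultdict

def entities_by_type(entities, min_score=0.8):
    """Group the distinct entity texts by type, keeping scores >= min_score."""
    grouped = defaultdict(set)
    for entity in entities:
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].add(entity["Text"])
    return {entity_type: sorted(texts) for entity_type, texts in grouped.items()}

# Sample entities in the shape returned by the NER job
sample_entities = [
    {"Score": 0.8998913437847392, "Text": "Jamboree", "Type": "EVENT"},
    {"Score": 0.8554136721519294, "Text": "this year", "Type": "DATE"},
    {"Score": 0.7387116718659316, "Text": "Nasa", "Type": "ORGANIZATION"},
    {"Score": 0.7160083141368465, "Text": "Jamboree", "Type": "EVENT"},
]
print(entities_by_type(sample_entities))
```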

In [25]:
ner_job_data[0]

[{'BeginOffset': 18,
  'EndOffset': 26,
  'Score': 0.8998913437847392,
  'Text': 'Jamboree',
  'Type': 'EVENT'},
 {'BeginOffset': 53,
  'EndOffset': 62,
  'Score': 0.8554136721519294,
  'Text': 'this year',
  'Type': 'DATE'},
 {'BeginOffset': 110,
  'EndOffset': 112,
  'Score': 0.8910259121495451,
  'Text': 'dr',
  'Type': 'PERSON'},
 {'BeginOffset': 113,
  'EndOffset': 119,
  'Score': 0.832157103610494,
  'Text': 'Prasad',
  'Type': 'PERSON'},
 {'BeginOffset': 137,
  'EndOffset': 141,
  'Score': 0.7387116718659316,
  'Text': 'Nasa',
  'Type': 'ORGANIZATION'},
 {'BeginOffset': 239,
  'EndOffset': 247,
  'Score': 0.7160083141368465,
  'Text': 'Jamboree',
  'Type': 'EVENT'},
 {'BeginOffset': 314,
  'EndOffset': 327,
  'Score': 0.8759530140572821,
  'Text': 'Mhm Education',
  'Type': 'ORGANIZATION'},
 {'BeginOffset': 456,
  'EndOffset': 459,
  'Score': 0.9801135950882811,
  'Text': 'one',
  'Type': 'QUANTITY'},
 {'BeginOffset': 467,
  'EndOffset': 480,
  'Score': 0.6408113513652112,
  'Te

In [26]:
%store segment_size_ms
%store comprehend_terms_df
%store comprehend_topics_df
%store ner_job_data

Stored 'segment_size_ms' (int)
Stored 'comprehend_terms_df' (DataFrame)
Stored 'comprehend_topics_df' (DataFrame)
Stored 'ner_job_data' (list)
