# Getting insight from customer reviews using Amazon Comprehend

## Comprehend Model Training Notebook
In the previous notebook we cleaned, explored, and analyzed the data. In this notebook we run a topic modeling job in Amazon Comprehend to obtain:


1. The list of words that have a high probability of being associated with each topic
2. The assignment of each document to topics



## Initialize
<a id="InitialiazeS3Data"></a>

In [None]:
# Library imports
import pandas as pd
import boto3
import json, time, tarfile

In [None]:
# Client and session information
session = boto3.Session()
s3 = boto3.resource('s3')

# Account id. Required downstream.
account_id = boto3.client('sts').get_caller_identity().get('Account')

# Initializing Comprehend client
comprehend = boto3.client(service_name='comprehend', 
                          region_name=session.region_name)

## Variables

In [None]:
# Number of topics, set to 5 after human-in-the-loop review.
# This must stay aligned with the topicMaps dictionary in the third script.
NUMBER_OF_TOPICS = 5

In [None]:
# Input file format of one review per line
input_doc_format = "ONE_DOC_PER_LINE"

# IAM role ARN that gives Comprehend access to the S3 data (hard-coded here; hide before sharing)
data_access_role_arn = "arn:aws:iam::682523027102:role/service-role/AmazonSageMaker-ExecutionRole-20220525T154953"
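
One way to avoid a hard-coded role ARN is to read it from the environment and validate its shape before use. The sketch below is only an illustration: the variable name `COMPREHEND_DATA_ACCESS_ROLE_ARN` and the helper `resolve_role_arn` are hypothetical and not part of the original notebook.

```python
import os
import re

# Rough shape of an IAM role ARN: arn:<partition>:iam::<12-digit account>:role/<path/name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def resolve_role_arn(env=os.environ, var_name="COMPREHEND_DATA_ACCESS_ROLE_ARN"):
    """Read the role ARN from a mapping (os.environ by default) and validate it.

    The variable name is a hypothetical choice made for this sketch.
    """
    arn = env.get(var_name, "")
    if not ROLE_ARN_PATTERN.match(arn):
        raise ValueError(f"{var_name} is not set to a valid IAM role ARN")
    return arn
```

With the variable exported before the notebook starts, `data_access_role_arn = resolve_role_arn()` would replace the literal string above.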

## Input and Output

In [None]:
# Constants for S3 bucket and input data file
BUCKET = 'clothing-shoe-jewel-tm-blog'
input_s3_url = f's3://{BUCKET}/out/TransformedReviews.txt'
output_s3_url = f's3://{BUCKET}/out/output/'

In [None]:
INPUT_CONFIG={
    # The S3 URI where the input data is located.
    'S3Uri':    input_s3_url,
    # Document format
    'InputFormat': input_doc_format,
}
OUTPUT_CONFIG={
    # The S3 URI where training output is placed.
    'S3Uri':    output_s3_url,
}

## Data Check

In [None]:
# Reading the Comprehend input file just to double check if number of reviews 
# and the number of lines in the input file have an exact match.
obj = s3.Object(BUCKET, 'out/TransformedReviews.txt')
comprehend_input = obj.get()['Body'].read().decode('utf-8')
comprehend_input_lines = len(comprehend_input.split('\n'))

# Reviews where Comprehend outputs will be merged
df = pd.read_csv(f's3://{BUCKET}/out/FinalDataframe.csv')
review_df_length = df.shape[0]

# The two lengths must be equal
assert comprehend_input_lines == review_df_length
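
The line count above depends on whether the input file ends with a newline: `split('\n')` yields one extra empty item when it does. A minimal sketch of a count that tolerates a single trailing newline follows; `count_documents` is a hypothetical helper, not part of the original notebook.

```python
def count_documents(text):
    """Count documents in ONE_DOC_PER_LINE text, ignoring one trailing newline."""
    if not text:
        return 0
    if text.endswith("\n"):
        # Drop only the final newline so blank lines inside the file still count
        text = text[:-1]
    return len(text.split("\n"))
```

Using `count_documents(comprehend_input)` in place of `len(comprehend_input.split('\n'))` keeps the assertion meaningful whichever way the file was written.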

## Model Training

In [None]:
# Starts an asynchronous topic detection job
def train_topics_detection(NumberOfTopics, InputConfig, OutputConfig, DataRoleArn):
    # Takes the number of topics, the input and output configs, and the ARN of the
    # IAM role that grants Amazon Comprehend read access to your input data.
    # Returns the response, which contains the job ARN, job ID, and job status.

    # Training takes a while to complete.
    # You can track the current status with the DescribeTopicsDetectionJob operation.
    response = comprehend.start_topics_detection_job(NumberOfTopics=NumberOfTopics,
                                                    InputDataConfig=InputConfig,
                                                    OutputDataConfig=OutputConfig,
                                                    DataAccessRoleArn=DataRoleArn)
    return response

In [None]:
# Start the Comprehend topic detection job
start_topics_detection_job_result = train_topics_detection( NUMBER_OF_TOPICS, 
                                                            INPUT_CONFIG, 
                                                            OUTPUT_CONFIG, 
                                                            data_access_role_arn)


In [None]:
print('start_topics_detection_job_result: ' + json.dumps(start_topics_detection_job_result))

# Job ID is required downstream for extracting the Comprehend results
job_id = start_topics_detection_job_result["JobId"]
print('job_id: ', job_id)

## Check Training Status

In [None]:
# Poll until Comprehend reaches a terminal state
description = comprehend.describe_topics_detection_job(JobId=job_id)

TrainingJobStatus = description['TopicsDetectionJobProperties']["JobStatus"]
print(TrainingJobStatus)
# STOPPED is also terminal; without it the loop would never exit for a stopped job
while TrainingJobStatus not in ("COMPLETED", "FAILED", "STOPPED"):
    time.sleep(120)
    TrainingJobStatus = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']["JobStatus"]
    print(TrainingJobStatus)

TrainingJobStatus = comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']["JobStatus"]
print(TrainingJobStatus)
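
The polling logic can also be wrapped in a small helper with a timeout, so the notebook cannot wait indefinitely. This is a sketch under assumptions: `wait_for_job`, `TERMINAL_STATES`, and the default timeout are hypothetical names and values, and the status lookup is passed in as a callable so the helper can be exercised without AWS.

```python
import time

# Job states after which Comprehend will not change the status again
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=120, timeout_seconds=4 * 60 * 60, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0
    status = get_status()
    while status not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status
```

In this notebook the callable could be `lambda: comprehend.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']['JobStatus']`.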

## Save Output

In [None]:
# Bucket prefix where model artifacts are stored
prefix = f'out/output/{account_id}-TOPICS-{job_id}'

# Location on S3 where model artifacts are stored
target = f's3://{BUCKET}/{prefix}'

In [None]:
# List S3 files/folders where Comprehend saved its results as tar.gz
! aws s3 ls {target} --recursive

In [None]:
# Find the output file from artifacts
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(BUCKET)
comprehend_out_file = ''
# Loop through the artifacts. The output is a gzip-compressed tar archive (tar.gz).
for my_bucket_object in my_bucket.objects.filter(Prefix=prefix):
    if my_bucket_object.key.endswith('tar.gz'):
        comprehend_out_file = 's3://' + BUCKET + '/' + my_bucket_object.key 

In [None]:
# Copy Comprehend output from S3 to local notebook instance
! aws s3 cp {comprehend_out_file} ./comprehend-out/

In [None]:
# Extract the Comprehend output archive.
# Two files are now saved locally:
#       (1) comprehend-out/doc-topics.csv and
#       (2) comprehend-out/topic-terms.csv

# The context manager closes the archive even if extraction fails
with tarfile.open('comprehend-out/output.tar.gz') as comprehend_tars:
    comprehend_tars.extractall('./comprehend-out/')
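
The two extracted files map onto the two outputs listed at the top of this notebook. The sketch below shows one way to read them, assuming the column layout `topic,term,weight` for topic-terms.csv and `docname,topic,proportion` for doc-topics.csv, with `docname` of the form `<file>:<line number>` for one-document-per-line input. Check these names against your own files; `top_terms` and `dominant_topics` are hypothetical helpers.

```python
import csv
import io


def top_terms(topic_terms_csv, n=5):
    """Return {topic: [term, ...]} with the n highest-weight terms per topic."""
    by_topic = {}
    for row in csv.DictReader(io.StringIO(topic_terms_csv)):
        pairs = by_topic.setdefault(int(row["topic"]), [])
        pairs.append((float(row["weight"]), row["term"]))
    return {topic: [term for _, term in sorted(pairs, reverse=True)[:n]]
            for topic, pairs in by_topic.items()}


def dominant_topics(doc_topics_csv):
    """Return {line_number: topic}, keeping the highest-proportion topic per document."""
    best = {}
    for row in csv.DictReader(io.StringIO(doc_topics_csv)):
        # docname is assumed to look like "TransformedReviews.txt:12"
        line_no = int(row["docname"].rsplit(":", 1)[1])
        proportion = float(row["proportion"])
        if line_no not in best or proportion > best[line_no][0]:
            best[line_no] = (proportion, int(row["topic"]))
    return {line_no: topic for line_no, (_, topic) in best.items()}
```

Both helpers take the file contents as a string, for example `top_terms(open('comprehend-out/topic-terms.csv').read())`.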