# Comprehend Custom Classifier using BBC dataset

In this exercise we create a custom classifier that sorts news documents into sections such as Politics, Business, Entertainment, and Sports. We use a publicly available labeled dataset from BBC News to train a custom classifier in Amazon Comprehend. Later in the exercise, the classifier can also be used for real-time analysis through an endpoint. Each endpoint serves one model.

This is a summary of the steps below:

   #### Train
    1. Get the sample data from BBC News
    2. Format and preprocess the data as per the needs of Comprehend
    3. Upload the data to S3
    4. Make sure the notebook role and the Comprehend role have the requisite permissions
    5. Start the training job
   
   #### Test/Use
    1. Build a test data set
    2. Format and preprocess the data as per the needs of Comprehend
    3. Upload the data to S3
    4. Make sure the notebook role and the Comprehend role have the requisite permissions
    5. Start the prediction job
    
   #### Validate
    1. Download the results/prediction outputs
    2. Load them into a dataframe
    3. Match them with your version of truth
    4. Check accuracy/precision

### Download Labeled dataset from BBC

We download a sample dataset and unzip its contents

In [None]:
!wget http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip

In [None]:
!rm -rf bbc
!unzip -o bbc-fulltext.zip

### Process the BBC dataset into Amazon Comprehend supported format

Importing the necessary packages. Note that while using Amazon Comprehend, we don't need any ML frameworks or libraries. All of the ML workload is executed by the managed service. This notebook merely acts as a front end for calling the backend services.

In [None]:
import os
import pandas as pd
import boto3
import json
import datetime
from pprint import pprint
from os.path import normpath, basename
import time

Creating the dataframe object and setting the source path for the data

In [None]:
df = pd.DataFrame()
mapping = {}
source_path = "bbc/"

Iterating through the directory to collect the category names (the sub-directory names) and the first 300 files of each category

In [None]:
for i in sorted(os.listdir(source_path)):
    if os.path.isdir(source_path+i):
        mapping[i] = sorted(os.listdir(source_path+i))[:300]
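
The slice above keeps the first 300 files of each category for training. As a minimal sketch of how a sorted file list can be cut into training and test slices that never overlap, the helper below may be useful. The name `split_files` and the example file names are hypothetical and not part of the notebook.

```python
def split_files(files, n_train, n_test):
    """Return (train, test) slices of a sorted file list that never overlap."""
    ordered = sorted(files)
    train = ordered[:n_train]
    # The test slice starts exactly where the training slice ends.
    test = ordered[n_train:n_train + n_test]
    return train, test


# Hypothetical file names standing in for one category directory.
example_files = ["{:03d}.txt".format(n) for n in range(1, 11)]
train_files, test_files = split_files(example_files, n_train=6, n_test=3)
print(train_files)
print(test_files)
```

Keeping the two slices disjoint matters because documents seen during training would make the later accuracy check look better than it is.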

These are the class labels

In [None]:
mapping.keys()

There is no need to identify or preprocess stop words or lexicons. Amazon Comprehend will take care of those. We simply replace the newline characters so that each document sits on a single line and our data is more readable in a dataframe.

In [None]:
label = []
data = []
for i, j in mapping.items():
    for k in j:
        label.append(i)
        # Read each article, flatten it onto a single line, and close the file
        with open(source_path + i + "/" + k, encoding="cp1252") as f:
            data.append(f.read().replace("\n", " "))

In [None]:
print("A sample label {} : {}".format(label[:1],data[:1]))

Preparing the dataframe

In [None]:
df["label"] = label
df["document"] = data
df.head()

Shuffling the dataframe to bring in randomness

In [None]:
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Write the training data set to the local file system

In [None]:
df.to_csv("train.csv", index=False, header=False)
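
The call above writes one row per document with the label in the first column and the text in the second, without a header row. The following is a small sketch, using only the standard `csv` module and made-up rows, of how commas and quotes inside a document are kept within a single column. The sample rows are illustrative and not taken from the dataset.

```python
import csv
import io

# Hypothetical rows standing in for the real dataframe contents.
rows = [
    ("business", "Shares rose, then fell, on Tuesday."),
    ("sport", 'The coach said "we played well" after the match.'),
]

buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting wraps fields containing commas or quotes
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers exactly two columns per row: label, document.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

A quick read-back like this can be a cheap sanity check on `train.csv` before uploading it.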

### Upload Training data to S3

Create the client to access the bucket and upload the training files. Make sure that the roles for Comprehend and for this notebook both have read access to the training data. The notebook role also needs write access to the training and testing locations in order to upload the files, and the Comprehend role needs write access to the output locations.
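
As a rough sketch of what such permissions could look like, the snippet below builds an IAM policy document as a Python dictionary. The helper name `build_s3_policy`, the bucket name, and the prefixes are assumptions made for illustration. The exact set of actions a given role needs should be checked against the Amazon Comprehend and IAM documentation.

```python
import json


def build_s3_policy(bucket, input_prefix, output_prefix):
    """Build a policy document: list the bucket, read inputs, write outputs."""
    bucket_arn = "arn:aws:s3:::" + bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:ListBucket"],
             "Resource": [bucket_arn]},
            {"Effect": "Allow", "Action": ["s3:GetObject"],
             "Resource": [bucket_arn + "/" + input_prefix + "/*"]},
            {"Effect": "Allow", "Action": ["s3:PutObject"],
             "Resource": [bucket_arn + "/" + output_prefix + "/*"]},
        ],
    }


# Hypothetical bucket and prefixes, for illustration only.
policy = build_s3_policy("example-bucket", "BBCNewsDataClassifier",
                         "BBCNewsDataClassifier/output")
print(json.dumps(policy, indent=2))
```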

Create the bucket client

In [None]:
data_classifier = 'BBCNewsDataClassifier'
BUCKET = 'comprehend-demo-20200602'
s3_file = data_classifier + '/train/train.csv'
s3 = boto3.client('s3')

Upload the training data

In [None]:
s3.upload_file('train.csv', BUCKET, s3_file)

### Create and start training job for custom classifier

The training job can be started asynchronously with a name. Status can be retrieved later using the ARN.

In [None]:
# Instantiate Boto3 Client
client = boto3.client('comprehend', region_name='us-east-1')
role = 'arn:aws:iam::951145066533:role/service-role/AmazonComprehendServiceRole-_908203_cmpd'

# Create a custom document classifier
create_response = client.create_document_classifier(
    InputDataConfig={
        'S3Uri': 's3://'+ BUCKET + '/' + s3_file
    },
    DataAccessRoleArn=role,
    DocumentClassifierName=data_classifier,
    LanguageCode='en'
)
print("Create response: %s\n" % create_response)


In [None]:
create_response['DocumentClassifierArn']

In [None]:


# Check the status of the classifier
classifierArn = create_response['DocumentClassifierArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_document_classifier = client.describe_document_classifier(
        DocumentClassifierArn = classifierArn
    )
    status = describe_document_classifier["DocumentClassifierProperties"]["Status"]
    print("Custom document classifier: {}".format(status))
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
        
    time.sleep(60)

if time.time() > max_time:
    client.stop_training_document_classifier(
        DocumentClassifierArn = classifierArn
    )
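
The polling loop above can be sketched as a small reusable helper. This is an illustration under stated assumptions rather than part of the notebook: the name `wait_for_status` and its parameters are hypothetical, and a fake status sequence stands in for the real describe call so the example runs offline.

```python
import time


def wait_for_status(get_status, terminal_states, timeout_s, interval_s, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout elapses.

    Returns the last status seen, or None if get_status was never called.
    """
    deadline = time.monotonic() + timeout_s
    status = None
    while time.monotonic() < deadline:
        status = get_status()
        if status in terminal_states:
            break
        sleep(interval_s)
    return status


# Stand-in for a describe call: a fake status sequence.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final_status = wait_for_status(
    get_status=lambda: next(fake_statuses),
    terminal_states={"TRAINED", "IN_ERROR"},
    timeout_s=5,
    interval_s=0,
    sleep=lambda seconds: None,  # skip real sleeping in this offline example
)
print(final_status)
```

With the real client, `get_status` could wrap the `describe_document_classifier` call used above and return the `Status` field.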

Listing all the classifiers in the account/region

In [None]:
# List all classifiers in account
list_response = client.list_document_classifiers()
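for item in list_response['DocumentClassifierPropertiesList']:
    # Evaluation metrics are only available once a classifier has finished training
    eval_metrics = item.get('ClassifierMetadata', {}).get('EvaluationMetrics')
    if not eval_metrics:
        print('{}:\n Status {} (no evaluation metrics yet)'.format(item['DocumentClassifierArn'],
                                                                  item['Status']))
        continue
    print('{}:\n Precision {} Recall {} Accuracy {} F1 {}'.format(item['DocumentClassifierArn'],
                                                         eval_metrics['Precision'],
                                                         eval_metrics['Recall'],
                                                         eval_metrics['Accuracy'],
                                                         eval_metrics['F1Score']
                                                        )
         )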
    

### Prepare Data for Custom Classifier Job for Testing

Prepare the test dataframe from files 301 to 350 of each category, which were not used for training

In [None]:
# Prepare Test Data
df_test = pd.DataFrame()
df_truth = pd.DataFrame()

test_mapping = {}
for i in sorted(os.listdir(source_path)):
    if os.path.isdir(source_path+i):
        # Files 301-350 (slice 300:350) do not overlap the training slice [:300]
        test_mapping[i] = sorted(os.listdir(source_path+i))[300:350]


Preparing two dataframes: one with only the documents to classify, and one with the truth labels. The idea here is to check the results, but the same steps can also be used to run actual jobs. Amazon Comprehend already evaluates the classifier on a held-out part of the training data, and the resulting metrics such as recall, precision, and F1 score are shown in the console, so this extra check is optional.

In [None]:
test_data = []
truth_label = []

for i, j in test_mapping.items():
    for k in j:
        truth_label.append(i)
        # Read each article, flatten it onto a single line, and close the file
        with open(source_path + i + "/" + k, encoding="cp1252") as f:
            test_data.append(f.read().replace("\n", " "))


The truth df has both document and label while the test df has only document.

In [None]:
df_test["document"] = test_data
df_truth["label"] = truth_label
df_truth["document"] = test_data

Introducing randomness into the dataset

In [None]:
df_test = df_test.sample(frac=1).reset_index(drop=True)
df_truth = df_truth.sample(frac=1).reset_index(drop=True)

In [None]:
df_truth.head()

In [None]:
df_test.head()

Save to CSV and upload to S3

In [None]:
#save to csv and upload to S3
df_test.to_csv("test.csv",index=False,header=False)
df_truth.to_csv("truth.csv",index=False,header=False)

s3_file_test = data_classifier + '/test/test.csv'
s3_file_truth = data_classifier + '/output/truth.csv'

s3 = boto3.client('s3')
s3.upload_file('test.csv', BUCKET, s3_file_test)
s3.upload_file('truth.csv', BUCKET, s3_file_truth)

### Run the Classification Job

In [None]:
job_name = data_classifier + "_" + (datetime.date.today().strftime("%m%d%Y"))
s3_file_results = data_classifier + '/output/results_' + job_name

start_response = client.start_document_classification_job(
    InputDataConfig={
        'S3Uri': 's3://'+ BUCKET + '/' + s3_file_test,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': 's3://' + BUCKET + '/' +  s3_file_results
    },
    JobName=job_name,
    # Reuse the role and classifier ARN defined in the training section
    DataAccessRoleArn=role,
    DocumentClassifierArn=classifierArn
)

print("Start response: %s\n" % start_response)

## Validation

We will use this function to split an S3 URI into its bucket name and object key

In [None]:
def split_s3_path(s3_path):
    path_parts=s3_path.replace("s3://","").split("/")
    bucket=path_parts.pop(0)
    key="/".join(path_parts)
    return bucket, key
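
An alternative sketch of the same split uses the standard library's `urllib.parse.urlparse`, which also makes it easy to reject strings that are not S3 URIs. The name `parse_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def parse_s3_uri(s3_uri):
    """Split an s3://bucket/key URI into (bucket, key), rejecting other schemes."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(s3_uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket and key, for illustration only.
example = parse_s3_uri("s3://example-bucket/some/prefix/output.tar.gz")
print(example)
```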

Check the status of the job until completion

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    job_response = client.describe_document_classification_job(
        JobId=start_response['JobId']
    )
    status = job_response['DocumentClassificationJobProperties']['JobStatus']
    print("Classification job: {}".format(status))
    
    if status == "COMPLETED" or status == "FAILED":
        break
        
    time.sleep(30)


Get the S3 location of the output, download the file to the notebook location, and unzip the contents

In [None]:
output_file_path = job_response['DocumentClassificationJobProperties']['OutputDataConfig']['S3Uri']
local_output_file_name = job_response['DocumentClassificationJobProperties']['JobName'] + '_' + basename(normpath(output_file_path))
print(local_output_file_name)

In [None]:
bucket_1, key = split_s3_path(output_file_path)
s3.download_file(bucket_1,key,local_output_file_name)

!tar zxvf {local_output_file_name}

Read the predicted label of each document from the predictions file

In [None]:
predictedLabels = []
with open('predictions.jsonl','r') as f:
    for i in f:
        j = json.loads(i)['Classes']
        predictedLabels.append(j[0]['Name'])
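
The cell above takes the first entry of `Classes` as the prediction. As a more defensive sketch, the snippet below picks the entry with the highest score and orders the results by line number. The sample lines are made up, and the `Score` and `Line` fields are assumptions about the output shape that should be checked against the actual `predictions.jsonl`.

```python
import json

# Illustrative lines in the shape the cell above reads; the values are made up.
sample_lines = [
    '{"File": "test.csv", "Line": "1", "Classes": [{"Name": "sport", "Score": 0.91}, {"Name": "tech", "Score": 0.06}]}',
    '{"File": "test.csv", "Line": "0", "Classes": [{"Name": "tech", "Score": 0.12}, {"Name": "business", "Score": 0.85}]}',
]


def top_labels(lines):
    """Return the highest-scoring class name per document, ordered by line number."""
    records = [json.loads(line) for line in lines if line.strip()]
    records.sort(key=lambda record: int(record["Line"]))
    return [max(record["Classes"], key=lambda c: c["Score"])["Name"]
            for record in records]


labels = top_labels(sample_lines)
print(labels)
```

Sorting by line number guards against the case where the output order differs from the order of `test.csv`, since the labels are later attached to the documents by position.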

In [None]:
df_test_results = pd.read_csv("test.csv", header=None)

In [None]:
df_test_results.columns = ["document"]
df_test_results["PredictedLabel"] = predictedLabels

In [None]:
df_test_results.head()

Join the prediction results with the truth labels to measure accuracy

In [None]:
mergeDf = pd.merge(df_test_results,df_truth,on=["document"], how="inner")

Count the matching and mismatching labels

In [None]:
count_match = mergeDf[mergeDf.PredictedLabel == mergeDf.label].count()
count_mismatch = mergeDf[mergeDf.PredictedLabel != mergeDf.label].count()

print("{} documents were predicted with the right results and\n"
      "{} documents were predicted with the wrong results".format(
          count_match['document'], count_mismatch['document']))
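
The counts above give overall accuracy. As a minimal sketch of the accuracy/precision check listed in the summary, the function below also computes precision per label from two parallel lists. The function name and the sample labels are illustrative. In the notebook the lists could come from the `label` and `PredictedLabel` columns of `mergeDf`.

```python
from collections import Counter


def accuracy_and_precision(truth, predicted):
    """Return overall accuracy and a dict of per-label precision."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    correct = Counter()
    predicted_total = Counter(predicted)
    for t, p in zip(truth, predicted):
        if t == p:
            correct[p] += 1
    accuracy = sum(correct.values()) / len(truth) if truth else 0.0
    # Precision for a label: correct predictions of it / all predictions of it.
    precision = {label: correct[label] / count
                 for label, count in predicted_total.items()}
    return accuracy, precision


# Made-up labels for illustration only.
truth = ["sport", "sport", "tech", "business"]
predicted = ["sport", "tech", "tech", "business"]
accuracy, precision = accuracy_and_precision(truth, predicted)
print(accuracy, precision)
```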

## End of demo