# Custom negative classifier

This notebook covers how to prepare a training dataset for a negative custom classifier in Amazon Comprehend leveraging the custom keywords that were generated from our word2vec model. 

We will build a custom negative classifier based on keywords that are semantically similar to the word "frustrated".



In [36]:
# library imports
import re
import numpy as np
import pandas as pd
import matplotlib
import csv
import boto3
import json
import time
import os

from sagemaker import get_execution_role
from sagemaker.session import Session

comprehend = boto3.client('comprehend')

# Specify S3 bucket and prefix that you want to use for model data
# Feel free to specify a different bucket here if you wish.
bucket = 'data-phi'
prefix = 'comprehend-custom-entity'

role = get_execution_role()
print(role)

arn:aws:iam::202860692096:role/service-role/AmazonSageMaker-ExecutionRole-20180529T141286


In this example we will re-use the dataset that we wrangled and filtered for the telco domain. 

In [12]:
colnames=['text'] 
tweets = pd.read_csv('./data/tweet_telco.csv',encoding='utf-8',names=colnames, header=None)
print(tweets.shape)
tweets.head()

(32716, 1)


Unnamed: 0,text
0,@sprintcare is the worst customer service | @1...
1,@sprintcare is the worst customer service | @1...
2,@sprintcare is the worst customer service | @1...
3,@115714 y’all lie about your “great” connectio...
4,"@115714 whenever I contact customer support, t..."


<a id='data-wrangling'></a>

To create our dataset we need a keyword list for our new class, which we label `negative`.

To find relevant keywords, we use our custom word2vec model to look up words that are semantically similar to "frustrated". See the blazingtext_word2vec_telco_tweets.ipynb notebook for how the keywords were generated.

In [13]:
# A non-capturing group (?:...) avoids the pandas "pattern has match groups" warning
tweets['match']=tweets['text'].str.contains(r'(?:Really|cheated|annoyed|unhelpful|frustrated|upset|unhappy|angry|badly|bad|dissatisfied|disappointed|disgusted)', regex=True)
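
The keyword filter can also be assembled from a list, which makes it easier to swap in a different set of word2vec neighbours. The sketch below uses only the standard `re` module. The helper name `build_keyword_pattern` and the choice of whole-word, case-insensitive matching are assumptions made for illustration; the original cell matches substrings and is case-sensitive, so the two filters will not select exactly the same tweets.

```python
import re

# Keywords taken from the notebook cell (generated by the word2vec model).
KEYWORDS = ["Really", "cheated", "annoyed", "unhelpful", "frustrated", "upset",
            "unhappy", "angry", "badly", "bad", "dissatisfied", "disappointed",
            "disgusted"]

def build_keyword_pattern(keywords):
    """Return a compiled regex that matches any keyword as a whole word.

    Hypothetical helper: the non-capturing group avoids the pandas
    "pattern has match groups" warning, and the word boundaries keep
    "bad" from matching inside words such as "badge".
    """
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)

pattern = build_keyword_pattern(KEYWORDS)

samples = [
    "I am so frustrated with this network",
    "Thanks for the new badge!",
    "Really bad service today",
]
flags = [bool(pattern.search(text)) for text in samples]
print(flags)
```

To use it on the DataFrame, the string in `pattern.pattern` can be passed to `tweets['text'].str.contains(...)` together with `flags=re.IGNORECASE`.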





Let's add another column with our class label. This is a required part of the Amazon Comprehend training dataset.

More information can be found here:

https://docs.aws.amazon.com/comprehend/latest/dg/cer-entity-list.html


In [14]:
tweets.loc[tweets['match'] == True, 'label'] = 'negative'

Let's create our training and test file.

In [54]:
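training_file = './data/negative_classifier_train.csv'
# Comprehend expects a headerless CSV with the label in the first column
tweets.loc[tweets['label'] == 'negative', ['label', 'text']].to_csv(training_file, encoding='utf-8', index=False, header=False)

test_file = './data/telco_negative_test.csv'
# One document per line, without a header row
tweets['text'].tail(10000).to_csv(test_file, encoding='utf-8', index=False, header=False)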
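
As far as the Comprehend documentation describes it, the multi-class training file is a headerless two-column CSV with the label first and the document text second. The sketch below writes two made-up rows with the standard `csv` module to show the shape of such a file. The sample rows and the helper name `to_training_csv` are illustrative assumptions, not data from the notebook.

```python
import csv
import io

def to_training_csv(rows):
    """Serialize (label, text) pairs as a headerless two-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for label, text in rows:
        # Collapse newlines and repeated whitespace so each document stays on one line.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()

# Made-up example rows (not taken from the tweet dataset).
rows = [
    ("negative", "I am frustrated, the signal drops\nevery hour"),
    ("negative", 'They said "great" coverage and it is bad'),
]
training_csv = to_training_csv(rows)
print(training_csv)
```

Text containing commas or quotes is quoted automatically, which is the same behaviour `DataFrame.to_csv` applies by default.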



In [55]:
def upload_to_s3(key, file):
    # Upload a local file to the given object key in our bucket
    s3 = boto3.resource('s3')
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=key, Body=data)

s3_train_key = prefix + "/train/negative_classifier_train.csv" 
s3_test_key = prefix + "/test/telco_negative_test.csv"

upload_to_s3(s3_train_key, training_file)
upload_to_s3(s3_test_key, test_file)

In [57]:
# The keys already contain the prefix, so only the bucket is prepended here
s3_train_data = 's3://{}/{}'.format(bucket, s3_train_key)
s3_test_job = 's3://{}/{}'.format(bucket, s3_test_key)
s3_output_job = 's3://{}/{}/{}'.format(bucket, prefix, 'output/train_job')
print('uploaded training data location: {}'.format(s3_train_data))

uploaded training data location: s3://data-phi/comprehend-custom-entity/train/negative_classifier_train.csv
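
Building S3 locations with string formatting is easy to get subtly wrong, for example by introducing an empty path segment through a leading slash. A small helper can normalise the pieces before joining them. The function name `s3_uri` is a hypothetical helper for illustration. It only removes empty segments; it does not detect a prefix that is passed twice.

```python
def s3_uri(bucket, *parts):
    """Join a bucket and key fragments into an s3:// URI without empty segments."""
    segments = []
    for part in parts:
        # Splitting on "/" and skipping empty pieces removes leading,
        # trailing, and doubled slashes.
        segments.extend(p for p in part.split("/") if p)
    return "s3://{}/{}".format(bucket, "/".join(segments))

bucket = "data-phi"
prefix = "comprehend-custom-entity"

train_uri = s3_uri(bucket, prefix, "train/negative_classifier_train.csv")
output_uri = s3_uri(bucket, prefix, "/output/train_job")
print(train_uri)
print(output_uri)
```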


## Training our model

We submit the custom classifier training job through the Comprehend API. Look at the first notebook for details.




In [28]:
training_job = comprehend.create_document_classifier(
    DocumentClassifierName='Custom-Negative-Classifier',
    DataAccessRoleArn=role,
    InputDataConfig={
        'S3Uri': s3_train_data
    },
    OutputDataConfig={
        'S3Uri': s3_output_job
    },
    LanguageCode='en'
)

In [31]:
print(json.dumps(training_job, indent=2, default=str))


{
  "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:202860692096:document-classifier/Custom-Negative-Classifier",
  "ResponseMetadata": {
    "RequestId": "9b4eb346-8806-4256-b7a9-28ec0ec23fda",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "x-amzn-requestid": "9b4eb346-8806-4256-b7a9-28ec0ec23fda",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "116",
      "date": "Sat, 09 Nov 2019 16:37:04 GMT"
    },
    "RetryAttempts": 0
  }
}


In [32]:
jobArn = training_job['DocumentClassifierArn']

In [52]:
print(json.dumps(comprehend.describe_document_classifier(
        DocumentClassifierArn = jobArn
    ), indent=2, default=str))

{
  "DocumentClassifierProperties": {
    "DocumentClassifierArn": "arn:aws:comprehend:us-east-1:202860692096:document-classifier/Custom-Negative-Classifier",
    "LanguageCode": "en",
    "Status": "TRAINED",
    "SubmitTime": "2019-11-09 16:37:04.860000+00:00",
    "EndTime": "2019-11-09 16:49:47.654000+00:00",
    "TrainingStartTime": "2019-11-09 16:40:10.652000+00:00",
    "TrainingEndTime": "2019-11-09 16:48:23.110000+00:00",
    "InputDataConfig": {
      "S3Uri": "s3://data-phi/comprehend-custom-entity/train/negative_classifier_training.csv"
    },
    "OutputDataConfig": {
      "S3Uri": "s3://data-phi/comprehend-custom-entity/test/custom_classifier_output/202860692096-CLR-ac06c53a0c7b058278df0184069025bf/output/output.tar.gz"
    },
    "ClassifierMetadata": {
      "NumberOfLabels": 1,
      "NumberOfTrainedDocuments": 1311,
      "NumberOfTestDocuments": 145,
      "EvaluationMetrics": {
        "Accuracy": 1.0,
        "Precision": 1.0,
        "Recall": 1.0,
        "F1Sco

In [37]:
jobArn = training_job['DocumentClassifierArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_custom_classifier = comprehend.describe_document_classifier(
        DocumentClassifierArn = jobArn
    )
    status = describe_custom_classifier["DocumentClassifierProperties"]["Status"]
    print("Custom classifier: {}".format(status))
    
    # IN_ERROR is the failure status reported by Comprehend
    if status in ("TRAINED", "IN_ERROR"):
        break
        
    time.sleep(60)

Custom classifier: TRAINING
Custom classifier: TRAINING
Custom classifier: TRAINING
Custom classifier: TRAINING
Custom classifier: TRAINING
Custom classifier: TRAINING
Custom classifier: TRAINING
Custom classifier: TRAINING
Custom classifier: TRAINED
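
The polling loop can be wrapped in a reusable function. The sketch below is one possible shape, not the notebook's method: `wait_for_classifier`, `describe_fn`, and `sleep_fn` are hypothetical names, and the describe call is injected so that the loop can be exercised with a stand-in instead of a real AWS request. The set of terminal statuses is an assumption based on the status values documented for document classifiers.

```python
import time

# Statuses after which polling can stop (assumed from the Comprehend API docs).
TERMINAL_STATUSES = {"TRAINED", "IN_ERROR", "STOPPED"}

def wait_for_classifier(describe_fn, classifier_arn, poll_seconds=60,
                        timeout_seconds=3 * 60 * 60, sleep_fn=time.sleep):
    """Poll describe_fn until the classifier reaches a terminal status.

    describe_fn is expected to behave like
    comprehend.describe_document_classifier.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        response = describe_fn(DocumentClassifierArn=classifier_arn)
        status = response["DocumentClassifierProperties"]["Status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep_fn(poll_seconds)
    raise TimeoutError("classifier did not finish within the timeout")

# Stand-in for the real client call, returning a scripted status sequence.
_scripted = iter(["SUBMITTED", "TRAINING", "TRAINED"])

def fake_describe(DocumentClassifierArn):
    return {"DocumentClassifierProperties": {"Status": next(_scripted)}}

final_status = wait_for_classifier(fake_describe, "arn:example",
                                   sleep_fn=lambda seconds: None)
print(final_status)
```

Against the real service the call would look like `wait_for_classifier(comprehend.describe_document_classifier, jobArn)`.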


## Confusion matrix

In [None]:
job_output = describe_custom_classifier["DocumentClassifierProperties"]["OutputDataConfig"]["S3Uri"]
print(job_output)

In [None]:
path_prefix = 's3://{}/'.format(bucket)
job_key = os.path.relpath(job_output, path_prefix)
print(job_key)
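
An alternative to `os.path.relpath` for extracting the object key is to parse the S3 URI with `urllib.parse`, which does not depend on how the local operating system treats path separators. This is a minimal sketch; `split_s3_uri` is a hypothetical helper and the URI below is a placeholder rather than the real job output.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")

# Placeholder URI, shaped like the OutputDataConfig value returned by Comprehend.
example_uri = "s3://example-bucket/some/prefix/output/output.tar.gz"
out_bucket, out_key = split_s3_uri(example_uri)
print(out_bucket, out_key)
```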

In [None]:
# Download the test output to the local machine
os.makedirs('./classifier', exist_ok=True)  # the target directory must exist
s3 = boto3.resource('s3')
s3.Bucket(bucket).download_file(job_key, './classifier/output.tar.gz')

In [44]:
!tar xvzf ./classifier/output.tar.gz -C ./classifier

output/
output/confusion_matrix.json


In [51]:
import json

with open('./classifier/output/confusion_matrix.json') as json_file:
    data = json.load(json_file)
print(data)

{'confusion_matrix': [[145]], 'labels': ['negative'], 'type': 'multi_class', 'all_labels': ['negative']}
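
With a single label the confusion matrix has only one cell, so every test document is counted as correct by construction. The perfect scores therefore say little about how well the model separates negative from non-negative text. The sketch below computes per-label precision and recall from a confusion matrix to make this visible. It assumes that rows are true labels and columns are predicted labels, which should be checked against the Comprehend documentation; the two-label matrix is made up for comparison and is not a result from this notebook.

```python
def per_label_metrics(matrix, labels):
    """Compute precision and recall per label from a square confusion matrix.

    Assumption: rows are true labels and columns are predicted labels.
    """
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics

# The matrix reported by the training job above.
data = {"confusion_matrix": [[145]], "labels": ["negative"]}
single = per_label_metrics(data["confusion_matrix"], data["labels"])
print(single)

# Hypothetical two-label matrix, for comparison only (not a real result).
two_label = per_label_metrics([[40, 10], [5, 45]], ["negative", "other"])
print(two_label)
```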


## Testing our custom classifier model

Let's invoke the Comprehend API to run our test job from the test file we prepared earlier.
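
One way to run the test job is boto3's `start_document_classification_job`. The sketch below only assembles the request arguments so that it can be read and checked without AWS credentials. The helper name, the job name, and all ARNs and S3 URIs are placeholders, and `ONE_DOC_PER_LINE` assumes the test file holds one tweet per line.

```python
def build_classification_job_request(job_name, classifier_arn, input_uri,
                                     output_uri, role_arn):
    """Assemble keyword arguments for comprehend.start_document_classification_job."""
    return {
        "JobName": job_name,
        "DocumentClassifierArn": classifier_arn,
        "InputDataConfig": {
            "S3Uri": input_uri,
            # Treat every line of the input file as a separate document.
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        "OutputDataConfig": {"S3Uri": output_uri},
        "DataAccessRoleArn": role_arn,
    }

# All values below are placeholders, not the notebook's real resources.
request = build_classification_job_request(
    job_name="negative-classifier-test",
    classifier_arn="arn:aws:comprehend:region:account:document-classifier/example",
    input_uri="s3://example-bucket/test/telco_negative_test.csv",
    output_uri="s3://example-bucket/output/test_job",
    role_arn="arn:aws:iam::account:role/example",
)
print(sorted(request))

# With real values, the request would be submitted as:
# test_job = comprehend.start_document_classification_job(**request)
```

In the notebook, the real values would come from `jobArn`, `s3_test_job`, and `role`.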