# AWS Comprehend Custom Classification

This demo shows how to quickly build a custom text classifier that assigns a specific label to a given text, a standard task in NLP (Natural Language Processing). To move fast, we take a shortcut: instead of building a custom model in SageMaker or adapting a best-in-class NLP classification model from the research community, we use a high-level machine learning service from AWS: [Comprehend Custom Classification](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html). It lets you train a custom NLP model that labels a given text.

### Steps
The custom classifier workload is built in two steps:

1. Training the custom model – no particular machine learning or deep learning knowledge is necessary
2. Classifying new data

The steps to follow are relatively simple:

1. Create a bucket that will host the training data
2. Create a bucket that will host the training artifacts and the production results. This can be the same bucket as in step 1
3. Configure an IAM role that allows Comprehend to access the newly created buckets
4. Prepare the data for training
5. Upload the training data to the S3 bucket
6. Launch a “Train Classifier” job from the console: “Amazon Comprehend” > “Custom Classification” > “Train Classifier”
7. Prepare the data for classification (one text per line, no header, same format as the training data)
8. Launch a custom classification job
9. Gather the results: a file named output.tar.gz is generated in the destination bucket. The file format is JSON Lines.
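
As an illustration of steps 4 and 7, here is a minimal sketch of the header-less, one-document-per-line CSV layout, with the label in the first column and the text in the second. The sample rows and the helper name `to_training_csv` are hypothetical and only serve the example; the demo below builds the same layout with pandas.

```python
import csv
import io

# Hypothetical sample rows (label, text); not taken from the real dataset.
rows = [
    ("5", "Why is the sky blue? Sunlight is scattered\nby the atmosphere."),
    ("2", 'What does "NLP" mean? Natural Language Processing.'),
]


def to_training_csv(rows):
    """Serialize (label, text) pairs as header-less CSV, one document per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in rows:
        # Collapse line breaks and repeated spaces so each document stays on one line.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()


training_csv = to_training_csv(rows)
print(training_csv)
```

Collapsing whitespace before writing is a design choice of this sketch: it guarantees that a document containing a line break cannot be split into two records.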

## Demo time!

In [None]:
import pandas as pd, boto3, sagemaker
from time import sleep
from sklearn.metrics import accuracy_score

### Step 1 - Prepare the dataset

In [None]:
%%bash
aws s3 cp s3://fast-ai-nlp/yahoo_answers_csv.tgz ./yahoo_answers_csv.tgz
gunzip yahoo_answers_csv.tgz
tar xf yahoo_answers_csv.tar
rm yahoo_answers_csv.tar

In [None]:
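# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
df_train = pd.read_csv('yahoo_answers_csv/train.csv', on_bad_lines='skip', header=None)
df_test = pd.read_csv('yahoo_answers_csv/test.csv', on_bad_lines='skip', header=None)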

df_train.head()

In [None]:
df_train = df_train.sample(n=1000000, random_state=42)

In [None]:
df_train['text'] = df_train[1]+' '+df_train[2]+' '+df_train[3]
df_train.drop([1,2,3], axis=1, inplace=True)
df_train.dropna(inplace=True)
df_train.head()

In [None]:
df_test['text'] = df_test[1]+' '+df_test[2]+' '+df_test[3]
df_test.drop([1,2,3], axis=1, inplace=True)
df_test.dropna(inplace=True)
df_test.head()

In [None]:
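# regex=False makes the replacement literal: '?' is a regex metacharacter
df_train['text'] = df_train['text'].str.replace(',', ' ', regex=False)
df_train['text'] = df_train['text'].str.replace('?', '', regex=False)
df_test['text'] = df_test['text'].str.replace(',', ' ', regex=False)
df_test['text'] = df_test['text'].str.replace('?', '', regex=False)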

In [None]:
df_train.to_csv('custom-train.csv', header=False, index=False, escapechar='\\', doublequote=False, quotechar='"')
df_test.to_csv('custom-test.csv', header=False, index=False, escapechar='\\', doublequote=False, quotechar='"')

### Step 2 - Upload the dataset to S3

In [None]:
session = sagemaker.Session()
bucket = session.default_bucket()

In [None]:
train_path = session.upload_data('custom-train.csv', bucket, 'custom-classifier')
test_path = session.upload_data('custom-test.csv', bucket, 'custom-classifier')

### Step 3 - Create the roles
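
The commands in this step read two JSON files, `ComprehendBucketAccessRole-TrustPolicy.json` and `ComprehendBucketAccessRole-Permissions.json`, whose contents are not shown here. The sketch below is one plausible way to produce them, not the original files: it assumes a trust policy that lets the `comprehend.amazonaws.com` service principal assume the role, and a permissions policy limited to listing, reading, and writing a single bucket. The function names and the bucket name `my-example-bucket` are hypothetical.

```python
import json


def build_trust_policy():
    """Trust policy letting the Amazon Comprehend service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "comprehend.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_permissions_policy(bucket_name):
    """Permissions policy scoped to one bucket (assumed minimal set of actions)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
            # Reading and writing apply to the objects inside the bucket.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


trust_policy = build_trust_policy()
permissions_policy = build_permissions_policy("my-example-bucket")  # hypothetical bucket name
print(json.dumps(trust_policy, indent=2))
print(json.dumps(permissions_policy, indent=2))
```

To use the sketch with the commands in this step, write each dictionary with `json.dump` to the corresponding file name, passing the bucket returned by `session.default_bucket()` instead of the placeholder.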

In [None]:
!aws iam create-role --role-name ComprehendBucketAccessRole --assume-role-policy-document file://ComprehendBucketAccessRole-TrustPolicy.json
    

In [None]:
!aws iam put-role-policy --role-name ComprehendBucketAccessRole --policy-name BucketAccessPolicy --policy-document file://ComprehendBucketAccessRole-Permissions.json
    

In [None]:
iam = boto3.client('iam')
role = iam.get_role(RoleName='ComprehendBucketAccessRole')
role['Role']['Arn']

In [None]:
iam.attach_role_policy(
    RoleName=sagemaker.get_execution_role().split('/')[-1],
    PolicyArn='arn:aws:iam::aws:policy/ComprehendFullAccess'
)
sleep(30) # wait 30 seconds to make sure IAM policies are applied

### Step 4 - Train the classifier

In [None]:
comprehend_client = boto3.client('comprehend')
response = comprehend_client.create_document_classifier(
    DocumentClassifierName='yahoo-answers-custom-clf-demo',
    DataAccessRoleArn=role['Role']['Arn'],
    InputDataConfig={'S3Uri': train_path},
    LanguageCode='en'
)

In [None]:
describe_clf = comprehend_client.describe_document_classifier(DocumentClassifierArn=response['DocumentClassifierArn'])

In [None]:
status = describe_clf['DocumentClassifierProperties']['Status']

while (status == 'SUBMITTED') or (status == 'TRAINING'):
    sleep(15)
    describe_clf = comprehend_client.describe_document_classifier(DocumentClassifierArn=response['DocumentClassifierArn'])
    status = describe_clf['DocumentClassifierProperties']['Status']
    print(status)

if status == 'IN_ERROR':
    print(describe_clf['DocumentClassifierProperties']['Message'])
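
The polling loop above can be factored into a small reusable helper, which also makes it easy to cap the number of polls. This is a sketch under stated assumptions: `wait_until_done`, its parameters, and the fake status sequence are hypothetical names introduced for the example, and the lambda stands in for the real `describe_document_classifier` call.

```python
from time import sleep


def wait_until_done(get_status, pending=("SUBMITTED", "TRAINING"), delay=15, max_polls=240):
    """Call get_status() until it leaves the pending states or max_polls is reached."""
    status = get_status()
    polls = 0
    while status in pending and polls < max_polls:
        sleep(delay)
        status = get_status()
        polls += 1
    return status


# Stand-in for the real describe call: yields a fixed sequence of statuses.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final_status = wait_until_done(lambda: next(fake_statuses), delay=0)
print(final_status)
```

With the real client, `get_status` would wrap the describe call and return the `Status` field; for a classification job, the pending states would be `("SUBMITTED", "IN_PROGRESS")` and the field `JobStatus`.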

In [None]:
comprehend_client.describe_document_classifier(DocumentClassifierArn=response['DocumentClassifierArn'])

### Step 5 - Use the classifier for inference

In [None]:
clf_job = comprehend_client.start_document_classification_job(
    JobName='yahoo-answers-custom-clf-inference-demo',
    DocumentClassifierArn=response['DocumentClassifierArn'],
    InputDataConfig={
        'S3Uri': test_path,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={'S3Uri': 's3://'+bucket+'/custom-classifier/output/'},
    DataAccessRoleArn=role['Role']['Arn']
)

In [None]:
describe_clf_job = comprehend_client.describe_document_classification_job(JobId=clf_job['JobId'])

In [None]:
status_job = describe_clf_job['DocumentClassificationJobProperties']['JobStatus']

while (status_job == 'SUBMITTED') or (status_job == 'IN_PROGRESS'):
    sleep(15)
    describe_clf_job = comprehend_client.describe_document_classification_job(JobId=clf_job['JobId'])
    status_job = describe_clf_job['DocumentClassificationJobProperties']['JobStatus']
    print(status_job)

if status_job == 'FAILED':
    print(describe_clf_job['DocumentClassificationJobProperties']['Message'])

In [None]:
comprehend_client.describe_document_classification_job(JobId=clf_job['JobId'])

### Step 6 - Check the results

In [None]:
client_s3 = boto3.client('s3')
client_s3.download_file(bucket, 
                        describe_clf_job['DocumentClassificationJobProperties']['OutputDataConfig']['S3Uri'].split(bucket)[-1][1:],
                        'output.tar.gz')
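
Splitting the URI on the bucket name works here, but it can give a wrong key if the bucket name also appears later in the path. A more robust alternative, sketched below, parses the URI with the standard library. The helper name `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI in which the bucket name is repeated inside the key.
example_uri = "s3://my-example-bucket/custom-classifier/output/my-example-bucket-job/output.tar.gz"
bucket_name, key = split_s3_uri(example_uri)
print(bucket_name)
print(key)
```

The returned pair can be passed directly as the first two arguments of `download_file`.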


In [None]:
%%bash
gunzip output.tar.gz
tar xf output.tar
rm output.tar

In [None]:
predictions = pd.read_json('predictions.jsonl', lines=True)

y_pred = []
y_true = []

for i in range(0,len(df_test)):
    y_pred += [int(predictions['Classes'][i][0]['Name'])]
    y_true += [int(df_test[0].values[i])]
    
accuracy_score(y_true, y_pred)
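
The loop above takes the first entry of `Classes` as the prediction and relies on the output lines being in the same order as the test file. A more defensive variant, sketched below, selects the class with the highest score and orders predictions by line number. It assumes each record also carries `Line` and `Score` fields, which are not used in the code above; the sample records are made up for the example.

```python
import json

# Made-up records imitating the assumed output layout, deliberately out of order.
sample_jsonl = """\
{"File": "custom-test.csv", "Line": "1", "Classes": [{"Name": "3", "Score": 0.2}, {"Name": "7", "Score": 0.7}, {"Name": "1", "Score": 0.1}]}
{"File": "custom-test.csv", "Line": "0", "Classes": [{"Name": "5", "Score": 0.9}, {"Name": "2", "Score": 0.06}, {"Name": "9", "Score": 0.04}]}
"""


def top_labels_by_line(jsonl_text):
    """Return the highest-scoring label of each record, ordered by line number."""
    top = {}
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        record = json.loads(raw)
        best = max(record["Classes"], key=lambda c: c["Score"])
        top[int(record["Line"])] = int(best["Name"])
    return [top[line] for line in sorted(top)]


y_pred_sample = top_labels_by_line(sample_jsonl)
print(y_pred_sample)  # [5, 7]
```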

## Cleanup

In [None]:
comprehend_client.delete_document_classifier(DocumentClassifierArn=response['DocumentClassifierArn'])

In [None]:
%%bash
rm -rf yahoo_answers_csv
rm custom-train.csv
rm custom-test.csv
rm predictions.jsonl

In [None]:
iam.detach_role_policy(
    RoleName=sagemaker.get_execution_role().split('/')[-1],
    PolicyArn='arn:aws:iam::aws:policy/ComprehendFullAccess'
)
sleep(5) # wait 5 seconds to make sure IAM policies are applied

In [None]:
%%bash
aws iam delete-role-policy --role-name ComprehendBucketAccessRole --policy-name BucketAccessPolicy
aws iam delete-role --role-name ComprehendBucketAccessRole
