## WARNING
*By design, the training and test data in this notebook is hurtful, offensive, and hateful.*

*The authors are grateful for the datasets made available by others to permit this kind of exploration, but are aware that some of this content is hurtful while other content may seem benign.  Considering some of these terms as potentially unsafe may be separately hurtful as well.*

*The goal here is to offer a framework for exploration using AWS Services, not to offer opinions on the content.*


## Introduction

For moderation of user-generated text content online, customers often ask about tools available to flag potentially offensive content.  While every community is ultimately responsible for its content standards, there are some general guidelines that community stewards tend to use to find the balance between protecting free speech online and guarding against speech that primarily insults, offends, or incites violence or hatred against individuals or groups of individuals.

In practice, content moderation requires a lot more than identifying potentially offensive content; we share these links to remind you that this lab is not a substitute for a considerate content moderation program, but the techniques discussed here can be part of such a program.
* [Content Moderation is Broken. Let Us Count the Ways](https://www.eff.org/deeplinks/2019/04/content-moderation-broken-let-us-count-ways), a provocative post by the Electronic Frontier Foundation
* [The people policing the internet's most horrific content](https://www.bbc.com/news/business-49393858), a BBC News story on the working conditions for content moderators

In research, profanity detection is considered distinct from hate speech and offensive language that may be directed toward individuals (or groups of individuals).  Take a look at the [SemEval-2019 tasks](http://alt.qcri.org/semeval2019/index.php?id=tasks) to get a sense of the language used to describe these efforts.

For our lab today, we are going to focus on the narrow task of identifying profanity in tweets, in the hope that participants learn how to use similar techniques to address their own specific needs.  Specifically, we're going to build an [Amazon Comprehend Custom Entity Recognizer](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html) to help us detect profanity in a curated collection of tweets.

### Datasets
* For a list of terms considered profane (our "entity list"), we used "a list of 1,300+ English terms that could be found offensive" made available by [Luis von Ahn's Research Group](https://www.cs.cmu.edu/~biglou/resources/).
* For training and evaluation documents, we used a human-labeled tweet [dataset](https://github.com/t-davidson/hate-speech-and-offensive-language) shared by the authors of [Automated Hate Speech Detection and the Problem of Offensive Language](https://arxiv.org/pdf/1703.04009.pdf), an investigation into lexical detection methods for the separation of hate speech from other offensive language.  
As expected, this dataset provides additional metadata to help us see what's considered hate speech or generally offensive.
  * ``count``:  number of CrowdFlower users who coded each tweet (minimum of 3, sometimes more users coded a tweet when judgments were determined to be unreliable by CrowdFlower).
  * ``hate_speech``:  number of CF users who judged the tweet to be hate speech.
  * ``offensive_language``:  number of CF users who judged the tweet to be offensive.
  * ``neither``:  number of CF users who judged the tweet to be neither offensive nor hate speech.
  * ``class``:  class label for majority of CF users (0: hate speech, 1: offensive language, 2: neither).
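
To make the relationship between the vote columns and ``class`` concrete, here is a minimal sketch that derives the majority label from the counts.  The rows are invented for illustration, and the tie-breaking rule (lowest class code wins) is an assumption, since the dataset description above does not say how ties were resolved.

```python
# Hypothetical rows shaped like the dataset columns described above;
# the numbers are made up for illustration.
rows = [
    {"count": 3, "hate_speech": 0, "offensive_language": 3, "neither": 0},
    {"count": 3, "hate_speech": 2, "offensive_language": 1, "neither": 0},
    {"count": 6, "hate_speech": 0, "offensive_language": 1, "neither": 5},
]

# Position in this list matches the documented class codes 0, 1, 2.
LABEL_COLUMNS = ["hate_speech", "offensive_language", "neither"]


def majority_class(row):
    """Return the class code whose column received the most votes.

    Ties resolve to the lowest code (an assumption, not the dataset's rule).
    """
    votes = [row[name] for name in LABEL_COLUMNS]
    return votes.index(max(votes))


derived = [majority_class(row) for row in rows]
print(derived)
```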

### AWS IAM Resources used in this notebook
In general, this notebook does not use the AWS API to create or manipulate IAM resources; instructions are provided instead.

The notebook code itself runs under a "SageMaker execution role" and has limited access, so participants will need to grant additional rights to be able to use Comprehend Custom.  In addition, Comprehend itself does not automatically have access to the Amazon S3 bucket used for data storage, so another IAM role ("profanity-lab-role") is used to grant access.
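
As a rough illustration of what such a data access role involves, the sketch below builds the two policy documents as Python dictionaries: a trust policy that lets the Comprehend service assume the role, and a permissions policy for the bucket.  The bucket name is a placeholder and the exact set of S3 actions is an assumption; the role provided with this lab may be defined differently.

```python
import json

# Placeholder bucket name (an assumption); the lab's real value comes from the stack outputs.
bucket = "my-lab-bucket"

# Trust policy: allows the Comprehend service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "comprehend.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy: read input files, write job output, list the bucket.
# The action list is an assumption about what the jobs in this lab need.
data_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
        },
    ],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(data_access_policy, indent=2))
```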


## Setup
Let's start off by getting to know our environment.

In [None]:
import sagemaker
from sagemaker import get_execution_role

sess = sagemaker.Session()

iamRole = get_execution_role()

print('SageMaker execution role')
print(iamRole) 
print('')
print('This is the role that SageMaker would use to leverage AWS resources')
print('(S3, CloudWatch) on your behalf.')

Next, we need to create an Amazon S3 bucket where we can upload input/output files.

Comprehend jobs will only read from and write to a bucket in the same AWS region where the job runs, so we use ``us-east-1`` throughout this notebook.

In [2]:
import boto3
import json

aws_region='us-east-1'

Let's instantiate a client we can use to interact with S3.

In [None]:
s3 = boto3.client(service_name='s3', region_name=aws_region)

Where should Comprehend read/write files?  To the lab bucket, of course.

And where might we find the name for this bucket?  Check the outputs for the [AWS CloudFormation](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks?filteringText=&filteringStatus=active&viewNested=true&hideStacks=false) stack for this lab.  We need the value for the ``DataBucket`` key.


In [None]:
bucket_name=_GET_THIS_FROM_CLOUDFORMATION_STACK_OUTPUT_

And remember that IAM role ARN we said we would need to pass around so Comprehend can read our data?  

We need the value for the ``ComprehendDataAccessRoleARN`` key here.

In [None]:
comprehend_data_access_role_arn = _GET_THIS_FROM_CLOUDFORMATION_STACK_OUTPUT_
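
If you would rather not copy these values by hand, the stack outputs can also be read with boto3.  This is a sketch under assumptions: ``fetch_stack_outputs`` and ``outputs_to_dict`` are hypothetical helper names, the stack name is whatever your lab stack is called, and the role running the notebook may not have permission to call ``describe_stacks``.

```python
def outputs_to_dict(stack):
    """Flatten one stack description's Outputs list into {OutputKey: OutputValue}."""
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def fetch_stack_outputs(stack_name, region_name="us-east-1"):
    """Look up a stack by name and return its outputs as a dict (needs AWS credentials)."""
    import boto3  # imported here so the helper above can be tried offline

    cfn = boto3.client("cloudformation", region_name=region_name)
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    return outputs_to_dict(stack)


# Offline illustration with a hand-written response fragment (values are made up).
sample_stack = {"Outputs": [
    {"OutputKey": "DataBucket", "OutputValue": "example-bucket"},
    {"OutputKey": "ComprehendDataAccessRoleARN",
     "OutputValue": "arn:aws:iam::123456789012:role/profanity-lab-role"},
]}
outputs = outputs_to_dict(sample_stack)
print(outputs["DataBucket"])
print(outputs["ComprehendDataAccessRoleARN"])
```

With a real stack, ``fetch_stack_outputs('your-stack-name')['DataBucket']`` would supply the bucket name, and the ``ComprehendDataAccessRoleARN`` key would supply the role ARN.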


### Training data

To create a new Custom Entity Recognizer, we must provide training data; Comprehend will use most of this data to train the model and keep some of it aside to evaluate the model. 

Let's start by uploading training data to our bucket.  Specifically, we upload:
1. a file that contains a list of values that we consider profane; and,
1. a file with examples (documents) that contain these entities.

*As the [documentation](https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html) explains, though Entity Lists are often easier to use, annotated document examples yield better results.  In real life, annotated examples are seldom available, so you might start with an Entity List first and use the first few classification results to get to Annotations.*


What does our entity list look like?

In [None]:
!grep -e Text -e crap profanity-list.csv | head -n 4
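
For reference, here is a small sketch of how an entity list in this shape could be produced, assuming the two-column ``Text,Type`` layout.  The terms are benign placeholders, not entries from the lab's list.

```python
import csv
import io

# Placeholder terms, chosen to be benign; the lab's list has 1,300+ real entries.
terms = ["darn", "heck", "gosh darn"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Text", "Type"])  # header row
for term in terms:
    # The Type value should match the entity type named when the recognizer is created.
    writer.writerow([term, "PROFANITY"])

entity_list_csv = buffer.getvalue()
print(entity_list_csv)
```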

And what do our annotated examples look like?  Remember that each line is considered a document here.

In [None]:
!grep -i -e ^tweet -e "aspiring crazy" profanity-training-docs-1k.txt

Let's upload our files.  
We need to provide a [minimum number of examples](https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html#limits-custom-entity-recognition) for training to succeed.  More examples will help with accuracy, but we don't have time for a 40-minute training cycle, so we start with a small dataset. 

In [None]:
s3.upload_file(
    Filename='profanity-list.csv', 
    Bucket=bucket_name, 
    Key='private/labs/training/profanity-list.csv')

s3.upload_file(
    Filename='profanity-training-docs-1k.txt',
    # Filename='profanity-training-docs-2k.txt',
    Bucket=bucket_name, 
    Key='private/labs/training/profanity-training-docs.txt')

s3.list_objects(Bucket=bucket_name, Prefix='private/labs/training')['Contents']

## Training
Now that our training data is in place, let's try to create a custom entity recognizer.

In [None]:
comprehend = boto3.client(service_name='comprehend', region_name=aws_region)

In [None]:
result=comprehend.create_entity_recognizer(
    RecognizerName='profanity',
    DataAccessRoleArn=comprehend_data_access_role_arn,
    InputDataConfig={
        'EntityTypes': [ { 
            'Type': 'PROFANITY' }],
        'EntityList': { 
            'S3Uri': 's3://' + bucket_name + '/private/labs/training/profanity-list.csv' },
        'Documents': { 
            'S3Uri': 's3://' + bucket_name + '/private/labs/training/profanity-training-docs.txt' }
    },
    LanguageCode='en')

In [None]:
print(json.dumps(result, indent=2))

Is the training job complete?  (It should take about 20 minutes to run.)

In [None]:
comprehend.describe_entity_recognizer(
    EntityRecognizerArn=result['EntityRecognizerArn'])['EntityRecognizerProperties']['Status']

Let's wait for it to finish...

In [None]:
import time

status = comprehend.describe_entity_recognizer(
    EntityRecognizerArn=result['EntityRecognizerArn'])['EntityRecognizerProperties']['Status']
while status in ['SUBMITTED', 'TRAINING']:
    # Sleep first, then poll, so we don't wait an extra minute after training ends.
    print('.', end='')
    time.sleep(60)
    status = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=result['EntityRecognizerArn'])['EntityRecognizerProperties']['Status']

print('')
print(json.dumps(comprehend.describe_entity_recognizer(
    EntityRecognizerArn=result['EntityRecognizerArn'])['EntityRecognizerProperties']['RecognizerMetadata'],
     indent=2))
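
The same wait-and-poll pattern comes up again for the inference job, so it can be factored into a helper with a timeout.  This is a sketch, not part of the lab: ``wait_while`` is a hypothetical name, and the demonstration uses a fake status source in place of a Comprehend call so it runs without AWS access.

```python
import time


def wait_while(get_status, pending, interval_seconds=60, timeout_seconds=3600,
               sleep=time.sleep):
    """Poll get_status() until it returns a value not in `pending`.

    Raises TimeoutError if the status is still pending after timeout_seconds.
    """
    waited = 0
    status = get_status()
    while status in pending:
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(interval_seconds)
        waited += interval_seconds
        status = get_status()
    return status


# Offline illustration: a fake status source stands in for the Comprehend call.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final_status = wait_while(
    lambda: next(fake_statuses),
    pending={"SUBMITTED", "TRAINING"},
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(final_status)
```

In the notebook, ``get_status`` would be a small function wrapping the ``describe_entity_recognizer`` call shown above.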

#### Are we done yet?

While you're waiting, check out this blog post to learn how to [Build a custom entity recognizer using Amazon Comprehend](https://aws.amazon.com/blogs/machine-learning/build-a-custom-entity-recognizer-using-amazon-comprehend/).  (Look Ma -- no notebooks!)

Or, check out some of the other [entity recognition related offerings on the AWS Marketplace](https://aws.amazon.com/marketplace/search/results?x=0&y=0&searchTerms=Entity+Recognition).

Should the training job fail (status is ``IN_ERROR``), please correct the issue, delete the recognizer, and try again.

```
comprehend.describe_entity_recognizer(
    EntityRecognizerArn=result['EntityRecognizerArn'])['EntityRecognizerProperties']['Message']
comprehend.delete_entity_recognizer(EntityRecognizerArn=result['EntityRecognizerArn'])
```

Did it work?  Do we have a reasonable recognizer here?

In [None]:
print(json.dumps(
    comprehend.describe_entity_recognizer(
        EntityRecognizerArn=result['EntityRecognizerArn'])
    ['EntityRecognizerProperties']['RecognizerMetadata']['EntityTypes'], indent=2))

As a rule of thumb, we would like an F1Score greater than 80 with precision and recall to match.  For more information on these metrics and other suggestions to improve accuracy, check out the [documentation](https://docs.aws.amazon.com/comprehend/latest/dg/cer-metrics.html).
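
As a reminder of how these three metrics relate, here is a short worked example using the standard definitions.  The counts are made up for illustration and are not results from this lab.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 as fractions between 0 and 1."""
    predicted = true_positives + false_positives   # everything the model flagged
    actual = true_positives + false_negatives      # everything it should have flagged
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts: 80 correct detections, 10 spurious ones, 20 missed.
precision, recall, f1 = precision_recall_f1(
    true_positives=80, false_positives=10, false_negatives=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The rule of thumb above is stated on a 0 to 100 scale, so multiply these fractions by 100 to compare.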

## Inference

Now that we have an entity recognizer, let's try it out with a test dataset.

First, let's create a small evaluation dataset.

Here's what the hate speech dataset looks like.

In [None]:
!head -n 2 profanity-test-docs.csv

Let's clean up the quotes and newlines.

In [None]:
import csv

with open('profanity-test-docs.txt', 'w+') as outfile:
    with open('profanity-test-docs.csv', 'r') as csvfile:
        reader = csv.reader(csvfile, quotechar='"')
        for row in reader:
            if (row[-1] != 'tweet'):
                outfile.write(row[-1].replace('\n', ' ').replace('"', ''))
                outfile.write('\n')

In [None]:
! wc -l profanity-test-docs.*

# line counts may not match up because the source file may have embedded newlines

Now, let's upload our test data to S3.

In [None]:
s3.upload_file(
    Filename='profanity-test-docs.txt', 
    Bucket=bucket_name, 
    Key='private/labs/test/profanity-test-docs.txt')

s3.list_objects(Bucket=bucket_name, Prefix='private/labs/test/profanity-test-docs.txt')['Contents']

Let's start an entity recognition job with this set.

In [None]:
job_result = comprehend.start_entities_detection_job(
    JobName='profanity-2',
    EntityRecognizerArn=result['EntityRecognizerArn'],
    InputDataConfig={
        'S3Uri': 's3://' + bucket_name + '/private/labs/test/profanity-test-docs.txt',
        'InputFormat': 'ONE_DOC_PER_LINE' },
    OutputDataConfig={ 'S3Uri': 's3://' + bucket_name + '/private/labs/test/profanity-1/' },
    DataAccessRoleArn=comprehend_data_access_role_arn,
    LanguageCode='en'
)

print(json.dumps(job_result, indent=2))

Let's wait for the job to finish.  This should take 7-10 minutes.

In [None]:
status = comprehend.describe_entities_detection_job(
    JobId=job_result['JobId'])['EntitiesDetectionJobProperties']['JobStatus']

# A new job starts out as SUBMITTED, so wait through that state as well.
while status in ['SUBMITTED', 'IN_PROGRESS']:
    print('.', end='')
    time.sleep(60)
    status = comprehend.describe_entities_detection_job(
        JobId=job_result['JobId'])['EntitiesDetectionJobProperties']['JobStatus']

print('')
print(comprehend.describe_entities_detection_job(JobId=job_result['JobId']))

#### Are we done yet??

While you're waiting, consider how you might include humans in the loop to confirm what the machine thinks.  Check out this blog post on [Verifying and adjusting your data labels to create higher quality training datasets with Amazon SageMaker Ground Truth](https://aws.amazon.com/blogs/machine-learning/verifying-and-adjusting-your-data-labels-to-create-higher-quality-training-datasets-with-amazon-sagemaker-ground-truth/).

Get the output from S3 so we can inspect it.

In [None]:
s3URI = comprehend.describe_entities_detection_job(
    JobId=job_result['JobId'])['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']

In [None]:
from urllib.parse import urlparse

p = urlparse(s3URI, allow_fragments=False)

print('Downloading {}'.format(s3URI))
s3.download_file(Bucket=p.netloc, Key=p.path.lstrip('/'), Filename='profanity_output.tar.gz')
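
The ``urlparse`` step above splits an S3 URI into the bucket (the network location) and the object key (the path without its leading slash).  A standalone sketch, using a hypothetical URI and a hypothetical helper name:

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri, allow_fragments=False)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI for illustration only.
bucket, key = split_s3_uri(
    "s3://example-bucket/private/labs/test/profanity-1/output/output.tar.gz")
print(bucket)
print(key)
```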

The archive contains an output file in the [JSON Lines format](http://jsonlines.org/).

In [None]:
! tar -xvzf profanity_output.tar.gz && mv -v output profanity_output.jsonl

In [None]:
!tail -n 2 profanity_output.jsonl
!tail -n 2 profanity-test-docs.txt
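
Each line of the output file is a standalone JSON object.  The sketch below parses two hand-written records shaped like the fields this notebook uses later (``Entities``, ``File``, ``Line``); the offsets, scores, and text are invented, and the exact set of keys inside each entity is an assumption.

```python
import json

# Two hand-written records; values are made up for illustration.
sample_jsonl = "\n".join([
    '{"Entities": [{"BeginOffset": 5, "EndOffset": 9, "Score": 0.99, '
    '"Text": "darn", "Type": "PROFANITY"}], '
    '"File": "profanity-test-docs.txt", "Line": 0}',
    '{"Entities": [], "File": "profanity-test-docs.txt", "Line": 1}',
])

# JSON Lines: parse each non-empty line independently.
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

# Map each input line number to the entity texts found on it.
entities_by_line = {
    record["Line"]: [entity["Text"] for entity in record["Entities"]]
    for record in records
}
print(entities_by_line)
```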

Now let's compare the output from Comprehend with what we expect.  First, let's create a dataframe with the results.

In [None]:
import pandas as pd 

results_df = pd.read_json(path_or_buf='profanity_output.jsonl', lines=True)

results_df.head(2)

And let's also load up our test data...

In [None]:
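test_df = pd.read_csv(
    filepath_or_buffer="profanity-test-docs.csv", 
    quotechar='"', quoting=csv.QUOTE_MINIMAL, escapechar='\\', engine='python')
# warn_bad_lines was dropped: it only took effect alongside error_bad_lines=False,
# and pandas 2.x no longer accepts either argument.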

test_df.head(1)

Let's merge our datasets so we can compare results with expectations.

In [None]:
merged_results_df = pd.merge(
    left=test_df, right=results_df, left_index=True, right_index=True, how='left')
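
The merge above pairs rows by position, which works as long as the results arrive in input order with no lines missing.  A more defensive variant keys the join on the ``Line`` field of each result.  The sketch below shows that left join in plain Python with made-up data; it assumes ``Line`` is the zero-based line number of the input document.

```python
# Row order of the test file (placeholder text).
tweets = ["first tweet", "second tweet", "third tweet"]

# Results deliberately out of order, with no record for line 1.
results = [
    {"Line": 2, "Entities": [{"Text": "tweet"}]},
    {"Line": 0, "Entities": []},
]

# Index the results by line number, then look each input row up by its position.
entities_by_line = {record["Line"]: record["Entities"] for record in results}
merged = [
    {"tweet": tweet, "Entities": entities_by_line.get(index)}  # None if no result
    for index, tweet in enumerate(tweets)
]

for row in merged:
    print(row)
```

In pandas, the analogous change would be to pass ``right_on='Line'`` instead of ``right_index=True``.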

Finally, we can filter our merged dataset in different ways to see if we detect profanity as expected.  As a reminder, the counts represent the number of users who found the content to be hateful or offensive.

This is a particularly hard dataset, so we limit our explorations, but you can uncomment the source below for yourself.  

Try these out:
* Can you find entries using a "contains" match?
* What about terms we missed?  Terms that should have been considered profane, but weren't.  (False negatives)
* Can you find any entries that were incorrectly flagged as profane?  (False positives)

In [None]:
# None disables column truncation (pandas >= 1.0; earlier versions used -1).
with pd.option_context('display.max_colwidth', None):
    #display(merged_results_df.drop(columns=['id', 'neither', 'File', 'Line']).head(25))
    #display(merged_results_df[merged_results_df['tweet'].str.contains('stf')].drop(
    #    columns=['id', 'neither', 'File', 'Line']).head(3))
    display(merged_results_df[
        (merged_results_df['class'] == 1) &
        (merged_results_df['offensive_language'] > 2) & 
        (merged_results_df['Entities'].str.len() == 0)].drop(
        columns=['id', 'neither', 'File', 'Line']).head(1))
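
One way to approach the false negative and false positive questions is to compare the detected entities against a simple word match on the entity list.  The sketch below uses benign placeholder data and hypothetical names; it only matches single-word terms, and a detection that is not on the list is a candidate for review rather than a confirmed false positive.

```python
import re

# Placeholder entity list and results; none of this is lab data.
entity_terms = {"darn", "heck"}
rows = [
    {"tweet": "well darn it", "detected": ["darn"]},
    {"tweet": "what the heck", "detected": []},       # list term present, nothing detected
    {"tweet": "a fine day", "detected": ["fine"]},    # detected text is not on the list
]


def list_terms_in(text):
    """Return the single-word entity list terms that appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words & entity_terms


def detected_terms(row):
    """Lower-cased entity texts reported for this row."""
    return {text.lower() for text in row["detected"]}


# Possible false negatives: a list term is in the tweet but was not detected.
missed = [r["tweet"] for r in rows if list_terms_in(r["tweet"]) - detected_terms(r)]

# Candidates for false positives: something was detected that is not on the list.
unexpected = [r["tweet"] for r in rows if detected_terms(r) - entity_terms]

print(missed)
print(unexpected)
```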

If you have time, it would be interesting to train a recognizer with the 2k dataset to see if it improves any of your results.

## Conclusions
As you can see from the examples above, our profanity detector works well with unambiguous examples, but could use more work with additional terms, acronyms, deliberate obfuscation, and contextual resolution.

### Further explorations
* Though entity lists offer an easy way to get started with custom entity recognizers, [annotations](https://docs.aws.amazon.com/comprehend/latest/dg/cer-annotation.html) do a better job of treating terms in context (e.g., "gung ho") and often yield better results.  With this in mind, it would be interesting to review the output from this first entity recognizer to create an annotated data set for a more sophisticated recognizer.
* It might be extremely useful to build a [Custom Classifier](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html) to detect hate speech.
* To go deeper, consider [Sentia Labs' AWS SageMaker post](https://www.sentialabs.io/2019/01/30/SageMaker-In-Action.html) on text classification using the BlazingText algorithm.