# How to prepare a dataset and submit a custom entity recognizer for Amazon Comprehend

This notebook walks through how to prepare a training dataset for custom entities in Amazon Comprehend.

More information on how to create a custom entity recognizer model can be found here.

https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html





In [1]:
# library imports
import re
import numpy as np
import pandas as pd
import matplotlib
import csv


In this example we will use the following Twitter dataset: https://www.kaggle.com/thoughtvector/customer-support-on-twitter
Download the dataset and save it in the ./data folder.


In [4]:
tweets = pd.read_csv('./data/twcs.csv',encoding='utf-8')
print(tweets.shape)
tweets.head()

(2811774, 7)


Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


<a id='data-wrangling'></a>

## Data Wrangling

This dataset contains about 2.8 million tweets. For each tweet we know the author and whether it was a query or a response (the "inbound" column). If the tweet was a query, response_tweet_id identifies the response made by the support team.

It would be useful to reshape this dataframe so that every row holds one query-response pair.
The following code, which does exactly that, was taken from [this kernel](https://www.kaggle.com/soaxelbrooke/first-inbound-and-response-tweets).

In [5]:
first_inbound = tweets[pd.isnull(tweets.in_response_to_tweet_id) & tweets.inbound]

QnR = pd.merge(first_inbound, tweets, left_on='tweet_id', 
                                  right_on='in_response_to_tweet_id')

# Filter to only outbound replies (from companies)
QnR = QnR[~QnR.inbound_y]
print(f'Data shape: {QnR.shape}')
QnR.head()

Data shape: (794299, 14)


Unnamed: 0,tweet_id_x,author_id_x,inbound_x,created_at_x,text_x,response_tweet_id_x,in_response_to_tweet_id_x,tweet_id_y,author_id_y,inbound_y,created_at_y,text_y,response_tweet_id_y,in_response_to_tweet_id_y
0,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,6,sprintcare,False,Tue Oct 31 21:46:24 +0000 2017,@115712 Can you please send us a private messa...,57.0,8.0
1,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,9,sprintcare,False,Tue Oct 31 21:46:14 +0000 2017,@115712 I would love the chance to review the ...,,8.0
2,8,115712,True,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,9610,,10,sprintcare,False,Tue Oct 31 21:45:59 +0000 2017,@115712 Hello! We never like our customers to ...,,8.0
3,18,115713,True,Tue Oct 31 19:56:01 +0000 2017,@115714 y’all lie about your “great” connectio...,17,,17,sprintcare,False,Tue Oct 31 19:59:13 +0000 2017,@115713 H there! We'd definitely like to work ...,16.0,18.0
4,20,115715,True,Tue Oct 31 22:03:34 +0000 2017,"@115714 whenever I contact customer support, t...",19,,19,sprintcare,False,Tue Oct 31 22:10:10 +0000 2017,@115715 Please send me a private message so th...,,20.0
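
To make the pairing logic easier to follow, here is a minimal sketch of the same merge on a toy frame. The tweet IDs, authors, and texts are invented for illustration; only the column names match the dataset.

```python
import pandas as pd

# Toy data (invented): one thread with a company reply,
# one thread where another customer replied instead.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "author_id": ["cust_a", "sprintcare", "cust_a", "cust_b", "cust_c"],
    "inbound": [True, False, True, True, True],
    "text": [
        "my phone has no signal",
        "sorry to hear that, please DM us",
        "sent you a DM",
        "anyone else without LTE?",
        "same here",
    ],
    "in_response_to_tweet_id": [None, 1, 2, None, 4],
})

# First inbound tweets: written by a customer and not a reply to anything.
first_inbound = toy[toy["in_response_to_tweet_id"].isna() & toy["inbound"]]

# Attach every direct reply to the tweet it answers.
pairs = pd.merge(first_inbound, toy,
                 left_on="tweet_id", right_on="in_response_to_tweet_id")

# Keep only replies written by a company account (inbound is False).
pairs = pairs[~pairs["inbound_y"]]
print(pairs[["text_x", "author_id_y", "text_y"]])
```

The merge produces two candidate pairs, and the final filter drops the one where the reply came from another customer.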


In [6]:
# Keep only the columns we need
QnR = QnR[["author_id_x","created_at_x","text_x","author_id_y","created_at_y","text_y"]]
QnR.head(5)

Unnamed: 0,author_id_x,created_at_x,text_x,author_id_y,created_at_y,text_y
0,115712,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,sprintcare,Tue Oct 31 21:46:24 +0000 2017,@115712 Can you please send us a private messa...
1,115712,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,sprintcare,Tue Oct 31 21:46:14 +0000 2017,@115712 I would love the chance to review the ...
2,115712,Tue Oct 31 21:45:10 +0000 2017,@sprintcare is the worst customer service,sprintcare,Tue Oct 31 21:45:59 +0000 2017,@115712 Hello! We never like our customers to ...
3,115713,Tue Oct 31 19:56:01 +0000 2017,@115714 y’all lie about your “great” connectio...,sprintcare,Tue Oct 31 19:59:13 +0000 2017,@115713 H there! We'd definitely like to work ...
4,115715,Tue Oct 31 22:03:34 +0000 2017,"@115714 whenever I contact customer support, t...",sprintcare,Tue Oct 31 22:10:10 +0000 2017,@115715 Please send me a private message so th...


## Filter to only telco tweets
In our example, we want to create a custom entity to recognize smartphone devices. Let's filter our dataframe to only include the T-Mobile and Sprint tweets.

In [7]:
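# .copy() gives an independent dataframe, so adding a column later does not
# trigger pandas' SettingWithCopyWarning
tweet_telco = QnR[QnR["author_id_y"].isin(["TMobileHelp", "sprintcare"])].copy()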

Let's concatenate the question and response into one column.

In [None]:
tweet_telco['text'] = tweet_telco['text_x']+ ' | ' + tweet_telco['text_y']

Let's save our telco tweets as a CSV file.

In [8]:

tweet_telco['text'].to_csv('./data/tweet_telco.csv', encoding='utf-8', index=False)


To create our dataset we need to provide an entity list for our new class, named DEVICE.

To find relevant entities, you can load a corpus into a word2vec model and generate a list of similar keywords. This technique will be used in the second example.

For our purpose of finding devices, we will write out a list of different spellings of smartphone names by hand.

In [12]:
sphones = ['iPhone X', 'iPhoneX', 'iphoneX', 'Samsung Galaxy', 'Samsung Note', 'iphone', 'iPhone', 'android', 'Android']

df_entity_list = pd.DataFrame(sphones, columns=['Text'])
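
If the list grows, typing every spelling by hand gets tedious. As a rough sketch, a small helper could generate case and spacing variants from a canonical name. `spelling_variants` is a hypothetical helper, not part of the original notebook, and it only covers the two kinds of variation seen in the list above.

```python
def spelling_variants(name):
    """Return case and spacing variants of a device name (hypothetical helper)."""
    # Start from the name as given and the name with spaces removed.
    bases = {name, name.replace(" ", "")}
    # Add a lower-cased version of each base.
    return sorted(bases | {base.lower() for base in bases})


for device in ["iPhone X", "Android"]:
    print(device, "->", spelling_variants(device))
```

The result can be passed to `pd.DataFrame(..., columns=['Text'])` in the same way as the hand-written list.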


Let's add another column with our class label. This is a required part of the Amazon Comprehend training dataset.

More information can be found here.

https://docs.aws.amazon.com/comprehend/latest/dg/cer-entity-list.html


In [14]:
df_entity_list['Type'] = 'DEVICE'


In [15]:
df_entity_list.head()

Unnamed: 0,Text,Type
0,iPhone X,DEVICE
1,iPhoneX,DEVICE
2,iphoneX,DEVICE
3,Samsung Galaxy,DEVICE
4,Samsung Note,DEVICE


Let's create our training file. 

In [19]:
tweet_telco['text'].to_csv('./data/raw_txt.csv', encoding='utf-8', index=False)
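
Tweets in this dataset can contain HTML entities such as `&amp;` (for example in "AT&amp;amp;T"). If you prefer plain characters in the training text, one option, which is not part of the original notebook, is to unescape the text before writing the file. The sketch below uses a made-up two-row Series rather than the real data.

```python
import html

import pandas as pd

# Made-up sample rows; the real input would be tweet_telco['text'].
sample = pd.Series([
    "just called in to switch from AT&amp;T",
    "no entities here",
])

# html.unescape turns entities such as &amp; back into plain characters.
cleaned = sample.map(html.unescape)
print(cleaned.tolist())
```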


In [20]:
!head ./data/raw_txt.csv

"@sprintcare is the worst customer service | @115712 Can you please send us a private message, so that I can gain further details about your account?"
@sprintcare is the worst customer service | @115712 I would love the chance to review the account and provide assistance.
@sprintcare is the worst customer service | @115712 Hello! We never like our customers to feel like they are not valued.
"@115714 y’all lie about your “great” connection. 5 bars LTE, still won’t load something. Smh. | @115713 H there! We'd definitely like to work with you on this, how long have you been experiencing this issue? -AA"
"@115714 whenever I contact customer support, they tell me I have shortcode enabled on my account, but I have never in the 4 years I've tried https://t.co/0G98RtNxPK | @115715 Please send me a private message so that I can send you the link to access your account. -FR"
"@115913 @115911 just called in to switch from AT&amp;T. They wanted $75 to switch 3 phones! I said no way! Inconsist

Let's create the entity list file.

In [21]:
df_entity_list.to_csv('./data/entity_list.csv', encoding='utf-8', index=False)


In [22]:
!head ./data/entity_list.csv

Text,Type
iPhone X,DEVICE
iPhoneX,DEVICE
iphoneX,DEVICE
Samsung Galaxy,DEVICE
Samsung Note,DEVICE
iphone,DEVICE
iPhone,DEVICE
android,DEVICE
Android,DEVICE
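
Before training, it can be useful to check how often each entry in the entity list actually occurs in the training text, since an entry that never appears contributes nothing. The following is a minimal sketch on made-up documents; the `matches` column is an assumption introduced here, and it should be dropped before writing the entity list file.

```python
import pandas as pd

# Made-up documents standing in for the training text.
docs = pd.Series([
    "my iPhone X keeps dropping calls",
    "switching from Android to iPhone",
    "billing question, no device mentioned",
])

entity_list = pd.DataFrame({
    "Text": ["iPhone X", "iPhone", "Android", "Samsung Note"],
    "Type": "DEVICE",
})

# Count the documents that contain each entry as a literal, case-sensitive substring.
entity_list["matches"] = [
    int(docs.str.contains(text, regex=False).sum())
    for text in entity_list["Text"]
]
print(entity_list)
```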


Let's create a test file from the last 10,000 rows of our telco tweet dataset.

In [23]:
tweet_telco['text'].tail(10000).to_csv('./data/telco_device_test.csv', encoding='utf-8', index=False)
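
Note that these 10,000 rows are also part of `raw_txt.csv`, so the recognizer has seen them during training. If you would rather test on unseen tweets, one possible approach is to split the text column into disjoint parts before writing either file. This is a sketch on a toy Series, with `n_test` chosen arbitrarily and the file writes left out.

```python
import pandas as pd

# Toy stand-in for tweet_telco['text'].
texts = pd.Series([f"tweet {i}" for i in range(10)], name="text")

n_test = 3  # arbitrary size for this sketch

# Everything except the last n_test rows goes to training,
# the last n_test rows are held out for testing.
train_texts = texts.iloc[:-n_test]
test_texts = texts.iloc[-n_test:]

print(len(train_texts), len(test_texts))
```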

## Training our model

I am going to use the console to submit the custom entity recognizer training job.

My custom entity configuration looks like this.

![title](./img/config.png)
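
If you prefer to submit the training job from code instead of the console, a rough boto3 equivalent might look like the sketch below. The recognizer name, role ARN, and S3 URIs are placeholders, not values from this notebook, and the two helper functions are hypothetical. The request is built separately so it can be inspected without AWS credentials.

```python
def build_recognizer_request(name, role_arn, documents_uri, entity_list_uri):
    """Build keyword arguments for create_entity_recognizer (hypothetical helper)."""
    return {
        "RecognizerName": name,
        "LanguageCode": "en",
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "EntityTypes": [{"Type": "DEVICE"}],
            "Documents": {"S3Uri": documents_uri},
            "EntityList": {"S3Uri": entity_list_uri},
        },
    }


def submit_recognizer(request, region="us-east-1"):
    """Submit the training job; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so building the request needs no AWS setup

    client = boto3.client("comprehend", region_name=region)
    return client.create_entity_recognizer(**request)["EntityRecognizerArn"]


# Placeholder values: replace with your own role and bucket.
request = build_recognizer_request(
    name="example-device-recognizer",
    role_arn="arn:aws:iam::123456789012:role/ExampleComprehendRole",
    documents_uri="s3://example-bucket/raw_txt.csv",
    entity_list_uri="s3://example-bucket/entity_list.csv",
)
print(request)
```

Calling `submit_recognizer(request)` would start the training job and return the recognizer ARN.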

## Testing our custom entity model

Let's use the AWS CLI to call the Comprehend API and run a detection job on the test file we prepared earlier.

In [None]:
aws comprehend start-entities-detection-job \
     --entity-recognizer-arn "arn:aws:comprehend:us-east-1:202860692096:entity-recognizer/Twitter-Device-copy" \
     --job-name Test \
     --data-access-role-arn "arn:aws:iam::202860692096:role/service-role/AmazonComprehendServiceRole-AmazonComprehendServiceRole" \
     --language-code en \
     --input-data-config "S3Uri=s3://data-phi/telco_device_test.csv" \
     --output-data-config "S3Uri=s3://data-phi/telco_device_test.json" \
     --region "us-east-1"
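
The same job can also be started from Python. The sketch below mirrors the CLI flags above using boto3's `start_entities_detection_job`; the two helper functions are hypothetical, and the call itself is wrapped in a function so that nothing is sent to AWS until you invoke it.

```python
def build_detection_job_request(recognizer_arn, role_arn, input_uri, output_uri,
                                job_name="Test"):
    """Build keyword arguments for start_entities_detection_job (hypothetical helper)."""
    return {
        "JobName": job_name,
        "EntityRecognizerArn": recognizer_arn,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
        "InputDataConfig": {"S3Uri": input_uri},
        "OutputDataConfig": {"S3Uri": output_uri},
    }


def start_detection_job(request, region="us-east-1"):
    """Start the job; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so building the request needs no AWS setup

    client = boto3.client("comprehend", region_name=region)
    return client.start_entities_detection_job(**request)["JobId"]


# Same values as the CLI command above.
job_request = build_detection_job_request(
    recognizer_arn="arn:aws:comprehend:us-east-1:202860692096:entity-recognizer/Twitter-Device-copy",
    role_arn="arn:aws:iam::202860692096:role/service-role/AmazonComprehendServiceRole-AmazonComprehendServiceRole",
    input_uri="s3://data-phi/telco_device_test.csv",
    output_uri="s3://data-phi/telco_device_test.json",
)
print(job_request["JobName"])
```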

The job writes its output to the S3 location specified in --output-data-config.
I am going to use Glue and Athena to inspect the results.

Here are the results for the following query:

`SELECT col3, count(col3) FROM "comprehend"."telco_device_test_json" GROUP BY col3;`

![title](./img/test.png)


Note that "ipone" was not in the list of terms we used to tag our dataset, but the Comprehend model was still able to pick it up with some degree of confidence.
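
If you do not want to set up Glue and Athena, a similar count can be sketched in plain Python once the output has been downloaded. The query above appears to group by the detected entity text, so the sketch counts the `Text` field of each entity. It assumes the output is JSON lines with an `Entities` list per document, and the sample lines are made up for illustration; they are not results from the job above.

```python
import json
from collections import Counter

# Made-up output lines for illustration only, not real job results.
sample_output = """\
{"Entities": [{"BeginOffset": 3, "EndOffset": 9, "Score": 0.99, "Text": "iPhone", "Type": "DEVICE"}], "File": "telco_device_test.csv", "Line": 0}
{"Entities": [], "File": "telco_device_test.csv", "Line": 1}
{"Entities": [{"BeginOffset": 10, "EndOffset": 15, "Score": 0.71, "Text": "ipone", "Type": "DEVICE"}, {"BeginOffset": 20, "EndOffset": 26, "Score": 0.98, "Text": "iPhone", "Type": "DEVICE"}], "File": "telco_device_test.csv", "Line": 2}
"""

# Count how often each detected entity text appears across all documents.
counts = Counter(
    entity["Text"]
    for line in sample_output.splitlines()
    for entity in json.loads(line)["Entities"]
)

for text, count in counts.most_common():
    print(text, count)
```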