# Entity Recognition Analysis

In [1]:
import pandas as pd
import feather
import spacy

from collections import defaultdict, Counter

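# NOTE: 'en' is a shortcut link from older spaCy releases; spaCy 3+ needs a
# full model name such as 'en_core_web_sm'
nlp = spacy.load('en')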

In [2]:
# load in dataframe with feather
df = feather.read_dataframe('./data/executed_offenders_last_statements.dat')

# just to get a sense of what the data looks like
df.head()

| | age | county | date | execution_no | first_name | last_name | offender_information_link | race | tdcj_no | last_statement |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | Bexar | 2017-07-27 | 543 | Taichin | Preyor | http://www.tdcj.state.tx.us/death_row/dr_info/... | Black | 999494 | First and foremost I'd like to say, "Justice h... |
| 1 | 61 | Tarrant | 2017-03-14 | 542 | James | Bigby | http://www.tdcj.state.tx.us/death_row/dr_info/... | White | 997 | Yes, I do, Grace Kehler is that you? I have gi... |
| 2 | 44 | Bexar | 2017-03-07 | 541 | Rolando | Ruiz | http://www.tdcj.state.tx.us/death_row/dr_info/... | Hispanic | 999145 | “Yes sir, I would first like to say to the San... |
| 3 | 43 | Dallas | 2017-01-26 | 540 | Terry | Edwards | http://www.tdcj.state.tx.us/death_row/dr_info/... | Black | 999463 | Yes, I made peace with God. I hope y'all make ... |
| 4 | 48 | Tarrant | 2017-01-11 | 539 | Christopher | Wilkins | http://www.tdcj.state.tx.us/death_row/dr_info/... | White | 999533 | |


Since this notebook is mainly about NLP, let us first examine the `last_statement` column:

In [3]:
df.last_statement

0      First and foremost I'd like to say, "Justice h...
1      Yes, I do, Grace Kehler is that you? I have gi...
2      “Yes sir, I would first like to say to the San...
3      Yes, I made peace with God. I hope y'all make ...
4                                                   None
5      I don't have anything to say, you can proceed ...
6      I just want to tell my family thank you, my mo...
7      I would like to thank everyone that has showed...
8      Yeah, first off, I want to say that I am sorry...
9      To my family, to my mom, I love you. God bless...
10                                                 None.
11     Sending me to a better place.  I am alright wi...
12     Yes, I would like to thank all of my supporter...
13     Shelby, God bless your heart.  To my family, I...
14     (Spanish) To the Solano family, I want to tell...
15     “I would like to thank you.  I hope this execu...
16     “Yes, I would like to thank my family and frie...
17     "Much has been written a

A quick inspection of the text shows that some offenders did not make a last statement, so we will replace those entries with empty strings.

Two scenarios are visible from this inspection:
- `None` (the offender did not prepare a last statement)
- `This offender declined to make a last statement.` (self-explanatory)

Since the absence of a last statement typically results in very short text (I'm guessing <= 50 characters), we'll keep only the short entries and assume that the longer ones are genuine last statements.

In [4]:
# keep only text of at most 50 characters
short_last_statements = [
    ls for ls in df.last_statement
    if len(ls) <= 50
]

# unique short last statements
set(short_last_statements)

{'(Mumbled.) Tell Mama I love her.',
 "Bye, I'm Ready.",
 'He spoke in Irish, translating to "Goodbye."',
 'High Flight (aviation poem)',
 'I am ready for the final blessing.',
 'I deserve this. Tell everyone I said goodbye.',
 'I have no last words. I am ready.',
 'I hope Mrs. Howard can find peace in this.',
 'I just love everybody, and that’s it.',
 "I love ya'll and I'm gonna miss ya'll.",
 'I love you Israel.',
 'I love you, Mom. Goodbye.',
 'I wish everybody a good life. Everything is O.K.',
 "I'll see you.",
 "I'm ready to go home.",
 'I’m ready to be released. Release me.',
 'I’m ready, Warden.',
 "Love ya'll, see you on the other side.",
 'No',
 'No last statement.',
 'No, I have no final statement.',
 'None',
 'None.',
 'Peace.',
 'Profanity directed toward staff.',
 'Santajaib Singh Ji.',
 'Thanked his family.',
 'There’s love and peace in Islam.',
 'This offender declined to make a last statement.',
 'Well, my friends in my heart, I’m ready –',
 "Yes, Ain't no way fo' fo', 

From here we can see several entries that could be identified as the absence of a last statement:
- No
- No last statement.
- No, I have no final statement.
- None
- None.
- This offender declined to make a last statement.

All of these could be classified as the absence of a last statement. However, I will treat *"No last statement."* and *"No, I have no final statement."* as last statements in their own right, since the offenders may have intended them as a witty remark about having no last statement. That's still a last statement in my book!

Now, on to the empty strings!

In [5]:
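# placeholder values that indicate no last statement was made
ABSENCE_OF_LAST_STATEMENTS = {
    'No',
    'None',
    'None.',
    'This offender declined to make a last statement.',
}


# replace placeholder values with empty strings
df.last_statement = df.last_statement.apply(
    lambda t: '' if t in ABSENCE_OF_LAST_STATEMENTS else t
)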
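
The cell above assumes every value is a string and matches the placeholders exactly. A slightly more defensive variant is sketched below, assuming that some rows may hold an actual `None` rather than the string `'None'`, and that placeholders may carry stray whitespace. The helper name `blank_if_absent`, the constant `ABSENT_PLACEHOLDERS`, and the sample values are illustrative and not part of the original notebook.

```python
# Hypothetical helper: map "no statement" placeholders (and missing values) to ''.
ABSENT_PLACEHOLDERS = {
    'No',
    'None',
    'None.',
    'This offender declined to make a last statement.',
}


def blank_if_absent(text, placeholders=ABSENT_PLACEHOLDERS):
    """Return '' for missing values or known placeholders, else the text unchanged."""
    if text is None:
        return ''
    if text.strip() in placeholders:
        return ''
    return text


# Illustrative values only; these are not rows taken from the dataset.
samples = [None, 'None.', ' No ', 'No last statement.', 'Peace.']
cleaned = [blank_if_absent(s) for s in samples]
print(cleaned)
```

With the dataframe, the same function could be passed directly to `df.last_statement.apply(blank_if_absent)`.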

## What's mentioned the most?

One of the most common analyses in NLP is finding the most frequent words in a text. However, with spacy's out-of-the-box NER (Named Entity Recognition) tool, I'm more interested in seeing which entities are being mentioned.

### Case Sensitivity in NER

There is one thing to note: it is common practice in NLP to lowercase all text for case insensitivity; however, spacy's NER appears to rely on the original casing.

Below is a quick demonstration:

In [6]:
# list out all entities for the text 'United States'
for ent in nlp('United States').ents:
    print("{} - {}".format(ent.label_, ent.text))

GPE - United States


*NOTE: **GPE** means **Geo-Political Entity**. A full list of built-in entity types from spacy's NER can be found in [spacy's documentation on Entity recognition](https://spacy.io/docs/usage/entity-recognition#entity-types).*

Now what happens if we tell spacy to identify entities in the text "united states"?

In [7]:
# list out all entities for the text 'united states'
for ent in nlp('united states').ents:
    print("{} - {}".format(ent.label_, ent.text))

There is no output because there are no entities recognized. Here's another example with "UNITED STATES" (all caps):

In [8]:
# list out all entities for the text 'UNITED STATES'
for ent in nlp('UNITED STATES').ents:
    print("{} - {}".format(ent.label_, ent.text))

Similarly, no entities are recognized for "UNITED STATES".

### Post Entity Recognition Transformation

Now that we know that case sensitivity matters when recognizing entities, we will have to preserve the text in its original form. Only when spacy successfully identifies an entity do we transform the text by:
- lowercasing all characters
- stripping leading and trailing whitespace

In [9]:
# analyze entities
all_entities = defaultdict(list)

for last_statement in df.last_statement:
    parsed_ls = nlp(last_statement)
    for entity in parsed_ls.ents:
        # group each recognized entity under its label
        et = entity.text
        # skip entities that are only punctuation or whitespace
        if not (nlp.vocab[et].is_punct or nlp.vocab[et].is_space):
            # normalize the entity text only after recognition
            all_entities[entity.label_].append(et.lower().strip())

# print all keys of `all_entities`:
all_entities.keys()


dict_keys(['ORG', 'GPE', 'LANGUAGE', 'CARDINAL', 'LOC', 'PERSON', 'EVENT', 'TIME', 'DATE', 'MONEY', 'QUANTITY', 'LAW', 'FAC', 'NORP', 'WORK_OF_ART', 'ORDINAL', 'PRODUCT'])

Here is a sample of persons identified in all the last statements:

In [10]:
set(all_entities['PERSON'][:15])

{'coretta scott king',
 'crain',
 'grace kehler',
 'jesus christ',
 'johnson',
 'jones',
 'kehler',
 'lord',
 'lord jesus',
 'sanchez',
 'warden',
 'warden jones'}

Let us look at the most common entities for each label:

In [11]:
most_common_entities = defaultdict(Counter)
for entity_label, entity_names in all_entities.items():
    most_common_entities[entity_label] = Counter(entity_names)
    # for printing purposes
    print(entity_label)
    for common_entity in most_common_entities[entity_label].most_common(5):
        entity_name, count = common_entity
        print("  - {} ({})".format(entity_name, count))
    print()


ORG
  - lord (11)
  - christ (9)
  - the state of texas (7)
  - mama (6)
  - jesus (5)

GPE
  - allah (40)
  - texas (18)
  - warden (18)
  - america (13)
  - heaven (10)

LANGUAGE
  - english (6)
  - french (1)
  - chapter (1)
  - spanish (1)

CARDINAL
  - one (54)
  - two (16)
  - three (10)
  - 1 (3)
  - zero (2)

LOC
  - earth (5)
  - west (2)
  - christ jesus (2)
  - north (1)
  - north america (1)

PERSON
  - warden (72)
  - lord (44)
  - father (38)
  - jesus christ (17)
  - irene (15)

EVENT
  - my cup (1)

TIME
  - tonight (36)
  - night (4)
  - hour (1)
  - 3 minutes (1)
  - a few minutes (1)

DATE
  - today (47)
  - ya'll (36)
  - the years (16)
  - this day (9)
  - ya’ll (6)

MONEY
  - as much as (1)
  - 903 (1)

QUANTITY
  - 180 pounds (1)

LAW
  - article 19.83 of the texas penal code of (1)

FAC
  - warden okay (1)
  - st. thomas (1)
  - ya'll (1)

NORP
  - spanish (9)
  - christ (5)
  - american (5)
  - romans (5)
  - christian (4)

WORK_OF_ART
  - love (15)
  - bye aun

### Misclassification of Entities

We can see that the same entity is sometimes placed in different categories. Often this makes sense: "French", for example, can be interpreted as `NORP`, `LANGUAGE`, or `PERSON` (i.e. the French).

However, note the following:
- "jesus christ" was categorized as `PERSON`
- "christ" and "jesus" as separate terms were categorized as `ORG`
- "christ jesus" was categorized as `LOC` (location).

There are more examples, such as:
- "warden okay" being categorized as `FAC` (facility)
- "ya'll" being categorized as `FAC` and `DATE`
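
One way to surface such overlaps systematically is to invert the label-to-entities mapping and keep the entity texts that appear under more than one label. The sketch below assumes a dictionary shaped like `all_entities` (label mapped to a list of normalized entity strings). The function name `labels_per_entity` and the small sample dictionary are hypothetical; the sample is populated with a few entries from the output above.

```python
from collections import defaultdict


def labels_per_entity(entities_by_label):
    """Invert {label: [entity, ...]} into {entity: {label, ...}}."""
    labels = defaultdict(set)
    for label, names in entities_by_label.items():
        for name in names:
            labels[name].add(label)
    return labels


# Small illustrative sample in the same shape as `all_entities`.
sample_entities = {
    'ORG': ['lord', 'christ', 'jesus'],
    'PERSON': ['warden', 'lord', 'jesus christ'],
    'GPE': ['warden', 'texas'],
    'NORP': ['christ', 'spanish'],
    'LANGUAGE': ['spanish'],
}

# Keep only entities that were assigned more than one label.
ambiguous = {
    name: sorted(found)
    for name, found in labels_per_entity(sample_entities).items()
    if len(found) > 1
}

for name in sorted(ambiguous):
    print("{} - {}".format(name, ", ".join(ambiguous[name])))
```

Run against the full `all_entities` dictionary, a listing like this would show at a glance which entity texts the model labels inconsistently.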

These inconsistencies are a shortcoming of spacy's out-of-the-box NER, so it may be more useful to look at the most common entities across all labels:

In [12]:
# flatten the per-label entity lists into a single list
all_words = [word for words in all_entities.values() for word in words]
bag_of_words = Counter(all_words)

# display the 25 most common entities mentioned
bag_of_words.most_common(25)

[('warden', 90),
 ('lord', 55),
 ('one', 54),
 ('first', 53),
 ('today', 47),
 ("ya'll", 43),
 ('allah', 42),
 ('father', 38),
 ('tonight', 36),
 ('jesus', 28),
 ('jesus christ', 21),
 ('texas', 18),
 ('christ', 18),
 ('love', 17),
 ('two', 16),
 ('the years', 16),
 ('irene', 15),
 ('lord jesus', 13),
 ('america', 13),
 ('heaven', 10),
 ('three', 10),
 ('god', 10),
 ('spanish', 10),
 ('jack', 10),
 ('mama', 9)]
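
The per-label output earlier lists "ya'll" and "ya’ll" as separate entries under `DATE`, which suggests that straight and typographic apostrophes are counted as different strings. A possible extra normalization step, sketched here as an assumption rather than something the notebook does, is to fold typographic quotes into ASCII before counting. The helper name `normalize_entity`, the table `QUOTE_TABLE`, and the sample list are illustrative.

```python
from collections import Counter

# Map typographic quotes to their ASCII equivalents.
QUOTE_TABLE = str.maketrans({
    '\u2018': "'",  # left single quotation mark
    '\u2019': "'",  # right single quotation mark
    '\u201c': '"',  # left double quotation mark
    '\u201d': '"',  # right double quotation mark
})


def normalize_entity(text):
    """Lowercase, strip, and fold typographic quotes into ASCII ones."""
    return text.translate(QUOTE_TABLE).lower().strip()


# Illustrative entity strings, not actual counts from the dataset.
raw_entities = ["ya'll", "ya\u2019ll", " Ya'll ", "Texas"]
counts = Counter(normalize_entity(e) for e in raw_entities)
print(counts.most_common())
```

In the notebook, a function like this could stand in for the `et.lower().strip()` call when entities are collected.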