In [1]:
import pandas as pd
import feather
import spacy

from collections import defaultdict, Counter

nlp = spacy.load('en')

In [2]:
# load in dataframe with feather
df = feather.read_dataframe('./data/executed_offenders_last_statements.dat')

# just to get a sense of what the data looks like
df.head()

Unnamed: 0,age,county,date,execution_no,first_name,last_name,offender_information_link,race,tdcj_no,last_statement
0,46,Bexar,2017-07-27,543,Taichin,Preyor,http://www.tdcj.state.tx.us/death_row/dr_info/...,Black,999494,"First and foremost I'd like to say, ""Justice h..."
1,61,Tarrant,2017-03-14,542,James,Bigby,http://www.tdcj.state.tx.us/death_row/dr_info/...,White,997,"Yes, I do, Grace Kehler is that you? I have gi..."
2,44,Bexar,2017-03-07,541,Rolando,Ruiz,http://www.tdcj.state.tx.us/death_row/dr_info/...,Hispanic,999145,"“Yes sir, I would first like to say to the San..."
3,43,Dallas,2017-01-26,540,Terry,Edwards,http://www.tdcj.state.tx.us/death_row/dr_info/...,Black,999463,"Yes, I made peace with God. I hope y'all make ..."
4,48,Tarrant,2017-01-11,539,Christopher,Wilkins,http://www.tdcj.state.tx.us/death_row/dr_info/...,White,999533,


## What's mentioned the most?

One of the most common analyses in NLP is counting the most frequent words in a text. With spacy's out-of-the-box NER (Named Entity Recognition) tool, however, I'm more interested in seeing which entities are mentioned.

### Case Sensitivity in NER

One thing to note: it is common practice in NLP to lowercase all text so that matching is case-insensitive; however, spacy's NER apparently relies on case.

Below is a quick demonstration:

In [3]:
# list out all entities for the text 'United States'
for ent in nlp('United States').ents:
    print("{} - {}".format(ent.label_, ent.text))

GPE - United States


*NOTE: **GPE** means **Geo-Political Entity**. A full list of built-in entity types from spacy's NER can be found in [spacy's documentation on Entity recognition](https://spacy.io/docs/usage/entity-recognition#entity-types).*

Now what happens if we tell spacy to identify entities in the text "united states"?

In [4]:
# list out all entities for the text 'united states'
for ent in nlp('united states').ents:
    print("{} - {}".format(ent.label_, ent.text))

There is no output because there are no entities recognized. Here's another example with "UNITED STATES" (all caps):

In [5]:
# list out all entities for the text 'UNITED STATES'
for ent in nlp('UNITED STATES').ents:
    print("{} - {}".format(ent.label_, ent.text))

Similarly, no entities are recognized for "UNITED STATES".

### Post Entity Recognition Transformation

Now that we know that case matters when recognizing entities, we have to preserve the text in its original form. Only after spacy has identified an entity do we transform its text by:
- lowercasing all characters
- stripping leading and trailing whitespace
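
The two steps above are small enough to isolate. Here is a minimal sketch, assuming a hypothetical helper named `normalize_entity` (it is not part of the original notebook), that applies them to an entity's text once recognition is done:

```python
def normalize_entity(text):
    """Lowercase an entity's text and strip surrounding whitespace.

    Hypothetical helper: call it only after spacy has identified the entity,
    so the casing that the NER step relies on is still intact at that point.
    """
    return text.lower().strip()


print(normalize_entity("  United States "))  # united states
```

Keeping the transformation in one place makes it easy to extend later without touching the recognition loop.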

In [6]:
# analyze entities
all_entities = defaultdict(list)

for last_statement in df.last_statement:
    parsed_ls = nlp(last_statement)
    for entity in parsed_ls.ents:
        # organize each entity identified with a label, using a dictionary
        et = entity.text
        # skip entities whose text is only punctuation or whitespace
        if not (nlp.vocab[et].is_punct or nlp.vocab[et].is_space):
            # transform entity text here
            all_entities[entity.label_].append(et.lower().strip())
            
# print all keys of `all_entities`:
all_entities.keys()


dict_keys(['EVENT', 'CARDINAL', 'NORP', 'LAW', 'QUANTITY', 'GPE', 'PRODUCT', 'TIME', 'MONEY', 'WORK_OF_ART', 'LANGUAGE', 'PERSON', 'ORG', 'FAC', 'DATE', 'ORDINAL', 'LOC'])
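
The `df.head()` output shows at least one row with no last statement. How such rows are stored after loading is not shown, so the following is only a sketch under an assumption: if missing statements come through as `None` or `NaN` rather than empty strings, passing them to `nlp` would likely raise an error. The helper name `has_statement` is hypothetical.

```python
def has_statement(value):
    """Return True only for non-empty strings that are worth parsing."""
    return isinstance(value, str) and bool(value.strip())


# toy stand-ins for values that might appear in the `last_statement` column
statements = ["Yes, I made peace with God.", None, float("nan"), "   "]

# keep only the values that are safe to pass to `nlp`
usable = [s for s in statements if has_statement(s)]
print(len(usable))  # 1
```

In the loop above, the same check could guard the call to `nlp(last_statement)`.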

Here is a sample of the persons identified across the last statements (the unique values among the first 15 `PERSON` entities):

In [7]:
set(all_entities['PERSON'][:15])

{'coretta scott king',
 'crain',
 'grace kehler',
 'jesus christ',
 'johnson',
 'jones',
 'kehler',
 'lord',
 'lord jesus',
 'sanchez',
 'warden',
 'warden jones'}

Let us look at the five most common entities for each label:

In [8]:
most_common_entities = defaultdict(Counter)
for entity_label, entity_names in all_entities.items():
    most_common_entities[entity_label] = Counter(entity_names)
    # for printing purposes
    print(entity_label)
    for common_entity in most_common_entities[entity_label].most_common(5):
        entity_name, count = common_entity
        print("  - {} ({})".format(entity_name, count))
    print()


EVENT
  - my cup (1)

CARDINAL
  - one (54)
  - two (16)
  - three (10)
  - 1 (3)
  - zero (2)

NORP
  - spanish (9)
  - romans (5)
  - american (5)
  - christ (5)
  - christian (4)

LAW
  - article 19.83 of the texas penal code of (1)

QUANTITY
  - 180 pounds (1)

GPE
  - allah (40)
  - warden (18)
  - texas (18)
  - america (13)
  - heaven (10)

PRODUCT
  - chaplain (1)

TIME
  - tonight (36)
  - night (4)
  - the night of nov 11? (1)
  - goodnight (1)
  - 13:13 (1)

MONEY
  - as much as (1)
  - 903 (1)

WORK_OF_ART
  - love (15)
  - god living with us 24 hours (1)
  - the (1)
  - dear heavenly father (1)
  - bye aunt helen, luise, joanna and (1)

LANGUAGE
  - english (6)
  - french (1)
  - spanish (1)
  - chapter (1)

PERSON
  - warden (72)
  - lord (44)
  - father (38)
  - jesus christ (17)
  - irene (15)

ORG
  - lord (11)
  - christ (9)
  - the state of texas (7)
  - mama (6)
  - jesus (5)

FAC
  - ya'll (1)
  - st. thomas (1)
  - warden okay (1)

DATE
  - today (47)
  - ya'll (3

#### Misclassification of Entities

We can see that the same entity text is sometimes assigned different labels. This often makes sense: "French", for example, can be interpreted as `NORP`, `LANGUAGE`, or `PERSON` (i.e. the French).

However, note the following:
- "jesus christ" was categorized as `PERSON`
- "christ" and "jesus" as separate terms were categorized as `ORG`
- "christ jesus" was categorized as `LOC` (location).

This is a shortcoming of spacy's NER (Named Entity Recognition), so it may be more useful to look at the most common entities across all labels:

In [9]:
# flatten the per-label lists of entity texts into a single list
all_words = [word for words in all_entities.values() for word in words]
bag_of_words = Counter(all_words)

# display the most common 25 entities mentioned
bag_of_words.most_common(25)

[('last', 109),
 ('warden', 90),
 ('lord', 55),
 ('one', 54),
 ('first', 53),
 ('today', 47),
 ("ya'll", 43),
 ('allah', 42),
 ('father', 38),
 ('tonight', 36),
 ('jesus', 28),
 ('jesus christ', 21),
 ('texas', 18),
 ('christ', 18),
 ('love', 17),
 ('two', 16),
 ('the years', 16),
 ('irene', 15),
 ('america', 13),
 ('lord jesus', 13),
 ('god', 10),
 ('heaven', 10),
 ('jack', 10),
 ('three', 10),
 ('spanish', 10)]
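
To see how often the labeling is inconsistent, as with the "christ" variants noted earlier, one option is to invert the label-to-entities mapping and list every entity text that was filed under more than one label. This is a minimal sketch: the function name `labels_per_entity` is hypothetical, and `sample_entities` is a toy stand-in for `all_entities` built from a few values in the output above.

```python
from collections import defaultdict


def labels_per_entity(entities_by_label):
    """Map each entity text to the sorted list of labels it was filed under."""
    labels = defaultdict(set)
    for label, names in entities_by_label.items():
        for name in names:
            labels[name].add(label)
    return {name: sorted(found) for name, found in labels.items()}


# toy stand-in for `all_entities` (label -> list of entity texts)
sample_entities = {
    "PERSON": ["jesus christ", "lord", "warden"],
    "ORG": ["lord", "christ", "jesus"],
    "NORP": ["christ", "spanish"],
    "LANGUAGE": ["spanish"],
}

# keep only the entity texts that received more than one label
ambiguous = {
    name: found
    for name, found in labels_per_entity(sample_entities).items()
    if len(found) > 1
}
for name in sorted(ambiguous):
    print(name, ambiguous[name])
```

Running the same function on the real `all_entities` would show which of the top entities in the combined count are sums over several labels.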