# Chapter 11: Named Entity Recognition

## Data loading and exploration

The dataset used in this notebook is the `All the news` dataset from Kaggle; you can download it [here](https://www.kaggle.com/snapcrack/all-the-news). The code uses the `articles1.csv` file.

Let's start by loading, reading, and inspecting the data.

In [1]:
import pandas as pd

path = "all-the-news/"
df = pd.read_csv(path + "articles1.csv")

Check how many documents are loaded:

In [2]:
df.shape

(50000, 10)

What does the data contain? Check the first several rows:

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."


Which news sources are covered?

In [4]:
sources = df["publication"].unique()
print(sources)

['New York Times' 'Breitbart' 'CNN' 'Business Insider' 'Atlantic']
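To see how many articles each source contributes, and not only which sources appear, `value_counts()` is one option. The sketch below runs on a tiny hand-made frame (`toy_df` is a hypothetical stand-in, not the real dataset), so the numbers it prints say nothing about `articles1.csv`:

```python
import pandas as pd

# Hypothetical toy data standing in for the real `df`
toy_df = pd.DataFrame({
    "publication": ["New York Times", "CNN", "New York Times", "Atlantic"],
    "content": ["text a", "text b", "text c", "text d"],
})

# Number of articles per publication, most frequent first
counts = toy_df["publication"].value_counts()
print(counts)
```

On the real data the same call would be `df["publication"].value_counts()`.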


Select only specific publications – e.g., the first $1000$ articles from the `New York Times`:

In [5]:
condition = df["publication"].isin(["New York Times"])
content_df = df.loc[condition, "content"][:1000]
content_df.shape

(1000,)

You can also check what is contained in this new, selected set:

In [6]:
content_df.head()

0    WASHINGTON  —   Congressional Republicans have...
1    After the bullet shells get counted, the blood...
2    When Walt Disney’s “Bambi” opened in 1942, cri...
3    Death may be the great equalizer, but it isn’t...
4    SEOUL, South Korea  —   North Korea’s leader, ...
Name: content, dtype: object

And print out the full content of some articles (e.g., the first couple):

In [8]:
for article in content_df[:2]:
    print(article)

WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been d

## Extract information with spaCy

Collect named entities from the texts:

In [9]:
import spacy
nlp = spacy.load("en_core_web_md")

def collect_entites(data_frame):
    named_entities = {}
    processed_docs = []

    for item in data_frame:
        doc = nlp(item)
        processed_docs.append(doc)

        for ent in doc.ents:
            entity_text = ent.text # e.g., Apple
            entity_type = str(ent.label_) # e.g., ORG
            current_ents = {} # e.g., {"Apple": 1, "Facebook": 2, ...}
            if entity_type in named_entities.keys():
                current_ents = named_entities.get(entity_type)
            current_ents[entity_text] = current_ents.get(entity_text, 0) + 1                
            named_entities[entity_type] = current_ents
    return named_entities, processed_docs

named_entities, processed_docs = collect_entites(content_df)
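The nested dictionary built above (entity type, then entity text, then count) could also be assembled with `collections.defaultdict` and `collections.Counter`. This is a minimal sketch of that alternative, not the notebook's own implementation. To keep it runnable without a spaCy model it works on plain `(text, label)` pairs; the function `count_entities` and the sample pairs are hypothetical stand-ins for `(ent.text, ent.label_)` read from `doc.ents`.

```python
from collections import Counter, defaultdict

def count_entities(entity_pairs):
    """Group (text, label) pairs into {label: Counter({text: count})}."""
    counts = defaultdict(Counter)
    for text, label in entity_pairs:
        counts[label][text] += 1
    return counts

# Hypothetical pairs, standing in for (ent.text, ent.label_) from doc.ents
sample_pairs = [
    ("Apple", "ORG"), ("Google", "ORG"), ("Apple", "ORG"),
    ("China", "GPE"), ("Trump", "PERSON"),
]
sample_counts = count_entities(sample_pairs)
print(dict(sample_counts["ORG"]))
```

The `defaultdict` removes the need to check whether an entity type has been seen before.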

Print out the 10 most frequent entities per type (entities that occur only once are skipped):

In [10]:
def print_out(named_entities):
    for key in named_entities.keys():
        print(key)
        entities = named_entities.get(key)
        sorted_keys = sorted(entities, key=entities.get, reverse=True)
        for item in sorted_keys[:10]:
            if (entities.get(item) > 1):
                print("   " + item + ": " + str(entities.get(item)))

print_out(named_entities)

GPE
   the United States: 1079
   Russia: 526
   China: 514
   Washington: 496
   New York: 364
   America: 364
   Iran: 294
   Mexico: 266
   Britain: 237
   California: 204
NORP
   American: 971
   Republicans: 524
   Republican: 472
   Democrats: 398
   Russian: 337
   Chinese: 288
   Americans: 268
   British: 181
   Democrat: 166
   Muslim: 161
PERSON
   Trump: 3703
   Obama: 824
   Donald J. Trump: 209
   Clinton: 187
   Spicer: 136
   Sessions: 123
   Gorsuch: 121
   Kushner: 111
   Hillary Clinton: 110
   Barack Obama: 110
ORG
   Trump: 744
   Senate: 374
   Congress: 348
   Twitter: 348
   White House: 235
   The New York Times: 230
   the White House: 216
   House: 202
   Facebook: 175
   Times: 169
MONEY
   1: 64
   2: 23
   100: 21
   10: 19
   5: 19
   3: 19
   millions of dollars: 18
   4: 18
   billions of dollars: 17
   $1 billion: 15
CARDINAL
   one: 1235
   two: 896
   three: 343
   One: 331
   000: 252
   four: 169
   seven: 168
   1: 151
   five: 126
   thousands: 1
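The ranking step in `print_out` (take the top 10, then drop anything seen only once) can be written more compactly with `Counter.most_common`. The helper `top_entities` and the counts below are hypothetical and chosen only to illustrate the filtering:

```python
from collections import Counter

def top_entities(entity_counts, k=10, min_count=2):
    """Return up to k (entity, count) pairs whose count is at least min_count."""
    ranked = Counter(entity_counts).most_common(k)
    return [(text, n) for text, n in ranked if n >= min_count]

# Hypothetical counts for a single entity type
toy_counts = {"Russia": 5, "China": 3, "Iran": 1, "Mexico": 2}
for text, n in top_entities(toy_counts, k=3):
    print(f"   {text}: {n}")
```

As in `print_out`, the filter is applied after the top `k` are selected, so fewer than `k` entries may be returned.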

Get overall statistics – the named entity types, the number of unique entries of each type, and the total number of occurrences of each type:

In [11]:
rows = []
rows.append(["Type:", "Entries:", "Total:"])
for ent_type in named_entities.keys():
    rows.append([ent_type, str(len(named_entities.get(ent_type))), 
                 str(sum(named_entities.get(ent_type).values()))])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i]) 
                  for i in range(0, len(row))))
    

 Type:        Entries:  Total: 
 GPE          1782      14978  
 NORP         512       7384   
 PERSON       9768      29623  
 ORG          5089      15859  
 MONEY        662       1221   
 CARDINAL     1211      8518   
 DATE         3122      15152  
 LAW          131       411    
 LOC          501       1527   
 FAC          724       1357   
 ORDINAL      72        1724   
 TIME         594       1594   
 QUANTITY     360       479    
 PERCENT      271       677    
 WORK_OF_ART  1337      1924   
 EVENT        290       671    
 PRODUCT      262       372    
 LANGUAGE     21        92     
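If you want the summary sorted, for example with the most frequent entity type first, the rows can be computed as tuples and sorted before printing. This is a sketch on a hypothetical `toy_entities` dictionary with the same shape as `named_entities`; the helper name `summarize` is not part of the notebook.

```python
def summarize(named_entities):
    """Return (type, unique entries, total occurrences) rows, largest total first."""
    rows = [(ent_type, len(counts), sum(counts.values()))
            for ent_type, counts in named_entities.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical data in the same {type: {entity: count}} shape
toy_entities = {
    "ORG": {"Apple": 2, "Google": 1},
    "GPE": {"China": 4},
}
for ent_type, entries, total in summarize(toy_entities):
    print(f"{ent_type:<6}{entries:>8}{total:>8}")
```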


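Explore the contexts in which a particular entity is used. The code below looks at the main verb of a sentence and reports its subject together with a direct or prepositional object when the entity fills one of those roles: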

In [12]:
entity = "The New York Times"
sentences = ["The New York Times wrote about Apple"]

def extract_span(sent, entity):
    indexes = []
    for ent in sent.ents:
        if ent.text==entity:
            for i in range(int(ent.start), int(ent.end)):
                indexes.append(i)
    return indexes

    
def extract_information(sent, entity, indexes):
    actions = []
    action = ""
    participant1 = ""
    participant2 = ""
        
    for token in sent:
        if token.pos_=="VERB" and token.dep_=="ROOT":  
            subj_ind = -1
            obj_ind = -1
            action = token.text
            children = [child for child in token.children]   
            for child1 in children:
                if child1.dep_=="nsubj":
                    participant1 = child1.text
                    subj_ind = int(child1.i)
                if child1.dep_=="prep":
                    participant2 = ""
                    child1_children = [child for child in child1.children]
                    for child2 in child1_children:
                        if child2.pos_ == "NOUN" or child2.pos_ == "PROPN":
                            participant2 = child2.text
                            obj_ind = int(child2.i)
                    if not participant2=="":
                        if subj_ind in indexes:
                            actions.append(entity + " " + action + " " + child1.text + " " + participant2)
                        elif obj_ind in indexes:
                            actions.append(participant1 + " " + action + " " + child1.text + " " + entity)
                if child1.dep_=="dobj" and (child1.pos_ == "NOUN"
                                            or child1.pos_ == "PROPN"):
                    participant2 = child1.text
                    obj_ind = int(child1.i)
                    if subj_ind in indexes:
                        actions.append(entity + " " + action + " " + participant2)
                    elif obj_ind in indexes:
                        actions.append(participant1 + " " + action + " " + entity)
                    
    if not len(actions)==0:
        print (f"\nSentence = {sent}")
        for item in actions:
            print(item)


for sent in sentences:
    doc = nlp(sent)
    indexes = extract_span(doc, entity)
    print(indexes)
    extract_information(doc, entity, indexes)

[0, 1, 2, 3]

Sentence = The New York Times wrote about Apple
The New York Times wrote about Apple
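`extract_span` returns every token index covered by the entity (`[0, 1, 2, 3]` here) because the verb's `nsubj` child is a single token, typically the head of the entity phrase rather than the whole phrase. The sketch below illustrates this on a hand-written parse of the example sentence. The `Tok` class and the dependency labels are assumptions made for illustration and are not actual spaCy output.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    i: int      # token position in the sentence
    text: str
    dep: str    # dependency label
    head: int   # position of the syntactic head (the root points to itself)

# Hand-written parse of the example sentence (an assumption, not spaCy output)
tokens = [
    Tok(0, "The", "det", 3), Tok(1, "New", "compound", 3),
    Tok(2, "York", "compound", 3), Tok(3, "Times", "nsubj", 4),
    Tok(4, "wrote", "ROOT", 4), Tok(5, "about", "prep", 4),
    Tok(6, "Apple", "pobj", 5),
]
entity_indexes = [0, 1, 2, 3]  # what extract_span returned for the entity

root = next(t for t in tokens if t.dep == "ROOT")
subject = next(t for t in tokens if t.head == root.i and t.dep == "nsubj")

# Only "Times" is attached to the verb; the membership test succeeds
# because entity_indexes covers every token of the entity span.
entity_is_subject = subject.i in entity_indexes
print(subject.text, entity_is_subject)
```

This is also why `extract_information` prints the full `entity` string instead of `child1.text`: the matched token alone would give only "Times".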


Detect sentences with the specified entity:

In [13]:
def entity_detector(processed_docs, entity, ent_type):
    output_sentences = []
    for doc in processed_docs:
        for sent in doc.sents:
            if entity in [ent.text for ent in sent.ents if ent.label_==ent_type]:
                output_sentences.append(sent)
    return output_sentences   

entity = "Apple"

ent_sentences = entity_detector(processed_docs, entity, "ORG")
print(len(ent_sentences))

61


Extract information from the sentences with the specified entity:

In [14]:
for sent in ent_sentences:
    indexes = extract_span(sent, entity)
    extract_information(sent, entity, indexes)


Sentence = Apple, complying with what it said was a request from Chinese authorities, removed news apps created by The New York Times from its app store in China late last month.
Apple removed apps

Sentence = Apple removed both the   and   apps from the app store in China on Dec. 23.
Apple removed apps
Apple removed on Dec.

Sentence = Apple has previously removed other, less prominent media apps from its China store.
Apple removed apps

Sentence = It puts Apple and Google in a difficult position.
It puts Apple

Sentence = Russia required Apple and Google to remove the LinkedIn app from their local stores.
Russia required Apple

Sentence = On Friday, Apple, its longtime partner, sued Qualcomm over what it said was $1 billion in withheld rebates.
Apple sued Qualcomm

Sentence = Apple sued three days after the  Federal Trade Commission accused Qualcomm of using anticompetitive practices to guarantee its high royalty payments for advanced wireless technology.
Apple sued days

Sentence =

Check how this works for a multi-word named entity:

In [15]:
entity = "The New York Times"

ent_sentences = entity_detector(processed_docs, entity, "ORG")
print(len(ent_sentences))

for sent in ent_sentences:
    indexes = extract_span(sent, entity)
    extract_information(sent, entity, indexes)

230

Sentence = During negotiations with Mr. Corzine last year, the commission also strengthened aspects of the deal after some of the agency’s commissioners questioned it, The New York Times reported at the time.
The New York Times reported at time

Sentence = The New York Times spoke to five people in the   to    age group, a small sample of millennial savers.
The New York Times spoke to people
The New York Times spoke to group

Sentence = Times Insider delivers    insights into life at The New York Times.
Insider delivers at The New York Times

Sentence = The New York Times called Armstrong one of the “great inventive geniuses in electrical engineering” after his death in 1954.
The New York Times called Armstrong
The New York Times called after death

Sentence = “It is now heavy rain and melting snow, which is causing flooding in the camp,” Mr. Kempson wrote to The New York Times, via Facebook, describing the conditions in the camp on Wednesday.
Kempson wrote to The New York Times



## Apply visualization with displaCy

Familiarize yourself with the visualization functionality. This example is from the [`spaCy` documentation](https://spacy.io/usage/visualizers):

In [16]:
from spacy import displacy

text = ("When Sebastian Thrun started working on self-driving cars at Google in 2007, "
        "few people outside of the company took him seriously.")

doc = nlp(text)
displacy.render(doc, style="ent")

Visualize entity types in sentences containing the specified entity:

In [17]:
def visualize(processed_docs, entity, ent_type):
    for doc in processed_docs:
        for sent in doc.sents:
            if entity in [ent.text for ent in sent.ents if ent.label_==ent_type]:
                displacy.render(sent, style="ent")

visualize(processed_docs, "Apple", "ORG")

Find sentences where the particular named entity is used alongside other entities of the same type:

In [18]:
def count_ents(sent, ent_type):
    return len([ent.text for ent in sent.ents if ent.label_==ent_type])   

def entity_detector_custom(processed_docs, entity, ent_type):
    output_sentences = []
    for doc in processed_docs:
        for sent in doc.sents:
            if entity in [ent.text for ent in sent.ents if ent.label_==ent_type and 
                          count_ents(sent, ent_type)>1]:
                output_sentences.append(sent)
    return output_sentences

output_sentences = entity_detector_custom(processed_docs, "Apple", "ORG")
print(len(output_sentences))

41
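Once such sentences are found, a natural follow-up is to ask which other entities of the same type appear alongside the target. Here is a minimal sketch under simplifying assumptions: each sentence is represented as a list of `(text, label)` pairs, and both the helper `cooccurring` and the sample sentences are hypothetical.

```python
from collections import Counter

def cooccurring(sentence_ents, entity, ent_type):
    """Count entities of ent_type that share a sentence with `entity`."""
    counts = Counter()
    for ents in sentence_ents:
        same_type = [text for text, label in ents if label == ent_type]
        if entity in same_type:
            counts.update(text for text in same_type if text != entity)
    return counts

# Hypothetical per-sentence entity lists, standing in for sent.ents
toy_sentences = [
    [("Apple", "ORG"), ("Google", "ORG")],
    [("Russia", "GPE"), ("Apple", "ORG"), ("Google", "ORG")],
    [("Apple", "ORG"), ("Qualcomm", "ORG")],
    [("Apple", "ORG")],
]
partners = cooccurring(toy_sentences, "Apple", "ORG")
print(partners.most_common())
```

With spaCy output, each inner list could be built as `[(ent.text, ent.label_) for ent in sent.ents]`.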


Now visualize the results – named entities of the specified type (you can change the colors by selecting the codes from [`https://htmlcolorcodes.com/color-chart/`](https://htmlcolorcodes.com/color-chart/)):

In [19]:
def visualize_type(sents, entity, ent_type):
    # Highlight only entities of the requested type
    colors = {ent_type: "linear-gradient(90deg, #64B5F6, #E0F7FA)"}
    options = {"ents": [ent_type], "colors": colors}
    for sent in sents:
        displacy.render(sent, style="ent", options=options)
                
visualize_type(output_sentences, "Apple", "ORG")