# NLP Exercise 4: Looking for Relations with NER

In this exercise we will use spaCy's named entity recognition (NER) algorithm to find relations between different entities in the Brown corpus.

## Part 1: Basic entity extraction

The Brown corpus is a well-known corpus of English developed at Brown University, containing text from many different sources. We will use entity extraction on a subset of the Brown corpus covering a few categories.

We can use spaCy to find entities in a basic sentence as follows:

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')
sample_sentence = "The White House is located in Washington D.C."
sample_doc = nlp(sample_sentence)
print([(ent.text, ent.label_) for ent in sample_doc.ents])

[('The White House', 'ORG'), ('Washington D.C.', 'GPE')]


To see what an entity label means:

In [2]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

And to display the entities in a document using displaCy:

In [3]:
from spacy import displacy
displacy.render(sample_doc, style='ent', jupyter = True)

Now let's load sentences from the Brown corpus for a few categories:

In [4]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
sentences = brown.sents(categories = ['news', 'editorial', 'reviews'])

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


**Questions:**
  1. Use displaCy to display the entities in the first three sentences of this corpus. What are some entities that are tagged, and what do their entity labels mean?
  2. Who are the five most frequently mentioned people in the corpus for these categories? What are the five most frequently mentioned buildings? (Hint: Recall the table of OntoNotes entities from lecture.)

In [5]:
#1.
for sentence in sentences[:3]:
  all_sent = ' '.join(sentence)
  nlp_sentence = nlp(all_sent)
  displacy.render(nlp_sentence, style = 'ent', jupyter = True)


Dates such as 'Friday' and 'September-October' are labeled.

Persons such as 'Durwood Pye' and 'Ivan Allen Jr.' are labeled.

Institutions such as 'The Fulton County Grand Jury', 'the City Executive Committee', and 'Fulton Superior Court' are labeled.

In [6]:
spacy.explain('GPE')

'Countries, cities, states'

In [7]:
spacy.explain('DATE')

'Absolute or relative dates or periods'

In [8]:
spacy.explain('PERSON')

'People, including fictional'

In [9]:
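#2.
from tqdm import tqdm
all_people = []
all_fac = []
for sentence in tqdm(sentences):
  # Brown sentences are token lists, so join them back into a string for spaCy
  nlp_sent = nlp(' '.join(sentence))
  for ent in nlp_sent.ents:
    if ent.label_ == 'PERSON':
      all_people.append(ent.text)
    # FAC covers buildings and other facilities in the OntoNotes scheme
    elif ent.label_ == 'FAC':
      all_fac.append(ent.text)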


100%|██████████| 9371/9371 [01:42<00:00, 91.23it/s] 


In [10]:
from collections import Counter
counter_people = Counter(all_people)
counter_people.most_common(5)

[('Kennedy', 113),
 ('Khrushchev', 53),
 ('Maris', 36),
 ('Eisenhower', 30),
 ('Podger', 22)]

'Kennedy', 'Khrushchev', 'Maris', 'Eisenhower', and 'Podger' are the five most frequently mentioned people in these categories.

In [11]:
counter_fac = Counter(all_fac)
counter_fac.most_common(5)

[('Broadway', 16),
 ('Negro', 6),
 ('the White House', 5),
 ('Beardens', 5),
 ('Washington Square', 5)]

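'Broadway', 'Negro', 'the White House', 'Beardens', and 'Washington Square' are the most frequent entities labeled FAC (buildings and other facilities) in these categories. Not all of them are buildings: 'Negro', for example, is a tagging error.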
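
Counts like these depend on the exact surface string, so 'the White House' and 'The White House' would be tallied as different entities. The following is a minimal sketch of normalizing mentions before counting. The helper `normalize_entity` and the list of mentions are hypothetical and made up for illustration; they are not output from the corpus run above.

```python
from collections import Counter

def normalize_entity(text):
    """Hypothetical helper: collapse whitespace, drop a leading 'the', lowercase."""
    words = text.split()
    if words and words[0].lower() == "the":
        words = words[1:]
    return " ".join(words).lower()

# Made-up mentions for illustration only (not taken from the corpus).
mentions = ["the White House", "The White House", "White House", "Broadway", "broadway"]

counts = Counter(normalize_entity(m) for m in mentions)
print(counts.most_common(2))  # [('white house', 3), ('broadway', 2)]
```

Lowercasing discards the original casing, so in practice one might keep a representative original string for display.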

## Part 2: Finding relations

Now we will look at pairs of entities in sentences in the corpus and try to identify relations between them.

**Questions**:
  3. We would like to know where organizations are located. Try to find all occurrences of organization-location pairs where the organization (ORG) comes before the location (GPE) in the sentence, with no other entity in between, and the word "in" appears somewhere between them. Put these in a pandas DataFrame with three columns: ORG (organization name), GPE (location name), and context (words in between the organization and location). How many of these are there?
  
  Hint: use entity.start and entity.end to get the starting and ending indices for an entity in the sentence.
  
  4. How much does this data tell us about what organizations are located where? In what cases can we be more or less certain?
  
  5. What is another example of a pair of entity labels and context word that would give us useful information? Try running your code to find this new relation.
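
The hint in question 3 works because an entity's `start` and `end` are token indices, with `end` exclusive, so slicing from one entity's `end` to the next entity's `start` gives exactly the tokens between them. The sketch below illustrates the idea with plain Python stand-ins rather than spaCy objects: `Ent` and `adjacent_pairs` are hypothetical names, and the entity labels are assigned by hand, not by a model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity span: token offsets, exclusive end.
Ent = namedtuple("Ent", ["start", "end", "label"])

def adjacent_pairs(tokens, ents, first_label, second_label):
    """Yield (first_text, between_tokens, second_text) for consecutive entities
    labeled first_label then second_label. `ents` must be sorted by position."""
    for left, right in zip(ents, ents[1:]):
        if left.label == first_label and right.label == second_label:
            yield (
                " ".join(tokens[left.start:left.end]),
                tokens[left.end:right.start],
                " ".join(tokens[right.start:right.end]),
            )

# Hand-labeled example sentence (labels assigned manually for illustration).
tokens = "The U.N. has troops in the Congo".split()
ents = [Ent(1, 2, "ORG"), Ent(6, 7, "GPE")]

pairs = list(adjacent_pairs(tokens, ents, "ORG", "GPE"))
print(pairs)  # [('U.N.', ['has', 'troops', 'in', 'the'], 'Congo')]
```

Pairing each entity only with its immediate successor is what enforces the "no other entity in between" condition.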

In [12]:
#3.
org = []
gpe = []
in_betw = []
for sentence in tqdm(sentences):
  nlp_sent = nlp(' '.join(sentence))
  # Entities come in document order, so neighbors in this list
  # have no other entity between them
  ents = list(nlp_sent.ents)
  for index, ent in enumerate(ents[:-1]):
    next_ent = ents[index + 1]
    if ent.label_ == 'ORG' and next_ent.label_ == 'GPE':
      # Text of the tokens between the end of the ORG and the start of the GPE
      in_between = nlp_sent[ent.end:next_ent.start].text
      # Note: this substring test misses "in" when it is the first
      # or last token of the gap (e.g. "in the")
      if ' in ' in in_between:
        in_betw.append(in_between)
        org.append(ent.text)
        gpe.append(next_ent.text)



100%|██████████| 9371/9371 [01:35<00:00, 98.04it/s] 


In [13]:
import pandas as pd
df = pd.DataFrame({'ORG':org,'between':in_betw,'GPE':gpe})

In [14]:
df

Unnamed: 0,ORG,between,GPE
0,the State Welfare Department,`` has seen fit to distribute these funds thro...,Fulton County
1,NATO,committee has been set up so that in the futur...,Angola
2,State Department,"officials explain , now is mainly interested i...",Laos
3,U.N.,army in the,Congo
4,U.N.,forces in the,Congo
5,UN,bury its head in the sand on the,Congo
6,Commerce,Secretary Hodges seems to have been cast in th...,Washington
7,the House Foreign Affairs Committee,", but he is the grandson of the man who was in...",the United States
8,the Warwick Police Department,", thereby eliminating duplication of facilitie...",Cowessett
9,U.N.,troops in the,Congo


In [15]:
df.shape

(12, 3)

There are 12 cases in which an organization is followed by a location with 'in' between them.

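4. The longer the text between the two entities, the less likely it is that the GPE is the organization's location: with many intervening words, the two entities may belong to unrelated clauses. With short contexts such as 'army in the' or 'forces in the', we can be more confident that the two entities are related, although even these describe where the organization operates rather than where it is based.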
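
One way to act on this observation is to check for the context word at the token level and to treat the length of the gap as a rough confidence signal. The sketch below is only an illustration: `contains_word` and `is_short_gap` are hypothetical helpers, and the threshold of three tokens is an arbitrary assumption. It also shows that the substring test `' in ' in in_between` used in cell 12 does not match a gap that begins with "in".

```python
def contains_word(between_tokens, word="in"):
    """True if `word` occurs as a whole token, regardless of position or case."""
    return any(tok.lower() == word for tok in between_tokens)

def is_short_gap(between_tokens, max_tokens=3):
    """Hypothetical confidence proxy: short gaps are treated as more reliable."""
    return len(between_tokens) <= max_tokens

short_gap = "troops in the".split()
long_gap = "bury its head in the sand on the".split()
leading_in = "in the".split()

# Substring matching on the joined text misses 'in' at the start of the gap.
print(" in " in " ".join(leading_in))                    # False
print(contains_word(leading_in))                         # True
print(is_short_gap(short_gap), is_short_gap(long_gap))   # True False
```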

In [16]:
#5.
person = []
work_of_art = []
in_betw = []
for sentence in tqdm(sentences):
  nlp_sent = nlp(' '.join(sentence))
  ents = list(nlp_sent.ents)
  for index, ent in enumerate(ents[:-1]):
    next_ent = ents[index + 1]
    # Keep WORK_OF_ART entities immediately followed by a PERSON entity
    if ent.label_ == 'WORK_OF_ART' and next_ent.label_ == 'PERSON':
      in_between = nlp_sent[ent.end:next_ent.start].text
      # No context-word filter is applied here; a check such as
      # ' by ' in in_between would narrow the results
      in_betw.append(in_between)
      work_of_art.append(ent.text)
      person.append(next_ent.text)

100%|██████████| 9371/9371 [01:34<00:00, 99.18it/s] 


In [17]:
df_pers_woa = pd.DataFrame({'work_of_art':work_of_art,'between':in_betw,'person':person})

In [18]:
df_pers_woa

Unnamed: 0,work_of_art,between,person
0,`` Campaigning on the carcass,of,Eisenhower
1,Kelsey,is very doubtful for the,Rice
2,"Pall Mall , Sterling Township",", surprised",Kowalski
3,The news of their experiments reaches,"the farmers who , forgetting that birds are th...",Buchheister
4,Golden Arrow '',directed by,Noel Coward
5,The Forsythe Saga '',and `` Mrs.,Miniver
6,Alba Madonna '',and `` texture '' in a,Monet
7,Sometimes in the minors '',",",Maris
8,The cannery '',", said Mrs.",Lewellyn Lundeen
9,Bible,is more easily understandable to the general r...,the King James


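This pairs works of art (books, songs, films, etc.) with the person mentioned right after them. Without a context-word filter, most rows are noise; restricting to a word such as 'by' would keep rows like 'Golden Arrow', 'directed by', 'Noel Coward', which link a work to the person behind it.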