# NER Workshop Exercise 1: Looking for Relations

In this exercise we will use spaCy's named entity recognition (NER) algorithm to find relations between different entities in the Brown corpus.

## Part 1: Basic entity extraction

The Brown corpus is a well-known corpus of English developed at Brown University, containing text from many different sources. We will run entity extraction on a subset of the Brown corpus covering a few categories.

We can use spaCy to find entities in a basic sentence as follows:

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')
sample_sentence = "The White House is located in Washington D.C."
sample_doc = nlp(sample_sentence)
print([(ent.text, ent.label_) for ent in sample_doc.ents])

[('The White House', 'ORG'), ('Washington', 'GPE'), ('D.C.', 'GPE')]


To see what an entity label means:

In [3]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

And to display the entities in a document using displaCy:

In [4]:
from spacy import displacy
displacy.render(sample_doc, style='ent', jupyter = True)

Now let's load sentences from the Brown corpus for a few categories:

In [5]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
sentences = brown.sents(categories = ['news', 'editorial', 'reviews'])

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


**Questions:**
  1. Use displaCy to display the entities in the first three sentences of this corpus. What are some entities that are tagged, and what do their entity labels mean?
  2. What are the five most common people mentioned in the corpus for these categories? What are the five most common buildings? (Hint: See [this page](https://spacy.io/usage/linguistic-features#section-named-entities) under "built-in entity types")

In [6]:
for my_sentence in sentences[0:3]:
  displacy.render(nlp(" ".join(my_sentence)), style='ent', jupyter = True)

In [7]:
spacy.explain("DATE")

'Absolute or relative dates or periods'

In [8]:
spacy.explain("GPE")

'Countries, cities, states'

In [9]:
spacy.explain("PERSON")

'People, including fictional'

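The first three sentences contain entities with the labels ORG, DATE, GPE and PERSON.

ORG is an organization (company, agency or institution), DATE is an absolute or relative date or period, GPE is a geopolitical entity (country, city or state), and PERSON is a person, including fictional ones.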


In [10]:
sample_sentence

'The White House is located in Washington D.C.'

In [0]:
from tqdm import tqdm

In [13]:
person_list = [ent.text for my_sentence in tqdm(sentences) for ent in nlp(" ".join(my_sentence)).ents if ent.label_ == "PERSON"]

100%|██████████| 9371/9371 [01:51<00:00, 84.31it/s]


In [14]:
from collections import Counter
Counter(person_list).most_common(5)

[('Kennedy', 113),
 ('Khrushchev', 40),
 ('Maris', 35),
 ('Eisenhower', 26),
 ('Podger', 22)]

In [15]:
building_list = [ent.text for my_sentence in tqdm(sentences) for ent in nlp(" ".join(my_sentence)).ents if ent.label_ == "FAC"]

100%|██████████| 9371/9371 [01:52<00:00, 82.97it/s]


In [16]:
Counter(building_list).most_common(5)

[('Broadway', 14),
 ('the White House', 6),
 ('Capitol', 4),
 ('Washington Square', 4),
 ('Lewisohn Stadium', 4)]
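The two list comprehensions above run the full pipeline over the corpus twice, once per label. A minimal sketch of a single-pass alternative follows, assuming the entities can be streamed as `(text, label)` pairs. The helper names `iter_entity_pairs` and `top_entities_by_label` are hypothetical and not part of spaCy, and the toy pairs are hand-made rather than real corpus output.

```python
from collections import Counter, defaultdict


def iter_entity_pairs(sentences, model_name="en_core_web_sm"):
    """Yield (text, label) for every entity spaCy finds in tokenized sentences.

    spaCy is imported lazily so the counting helper below works without it.
    """
    import spacy

    nlp = spacy.load(model_name)
    texts = (" ".join(tokens) for tokens in sentences)
    # nlp.pipe processes the texts as a stream, in batches
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield ent.text, ent.label_


def top_entities_by_label(entity_pairs, labels, n=5):
    """Count entity texts per label in one pass and return the n most common."""
    wanted = set(labels)
    counts = defaultdict(Counter)
    for text, label in entity_pairs:
        if label in wanted:
            counts[label][text] += 1
    return {label: counts[label].most_common(n) for label in labels}


# Hand-made pairs for illustration only (not real corpus output)
toy_pairs = [
    ("Kennedy", "PERSON"),
    ("Broadway", "FAC"),
    ("Kennedy", "PERSON"),
    ("Maris", "PERSON"),
    ("Atlanta", "GPE"),
]
result = top_entities_by_label(toy_pairs, ["PERSON", "FAC"])
print(result)
```

On the real corpus this could be called as `top_entities_by_label(iter_entity_pairs(sentences), ["PERSON", "FAC"])`, which parses each sentence only once.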

## Part 2: Finding relations

Now we will look at pairs of entities in sentences in the corpus and try to identify relations between them.

**Questions**:
  3. We would like to know where organizations are located. Try to find all occurrences of organization-location pairs where the organization (ORG) comes before the location (GPE) in the sentence, and the word "in" appears somewhere between them. Put this in a pandas DataFrame with three columns: ORG (organization name), GPE (location name), and context (words in between the organization and location). How many of these are there?
  
  Hint: use entity.start and entity.end to get the starting and ending indices for an entity in the sentence.
  
  4. How much does this data tell us about what organizations are located where? In what cases can we be more or less certain?
  
  5. What is another example of a pair of entity labels and context word that would give us useful information? Try running your code to find this new relation.
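The hint about `entity.start` and `entity.end` can be illustrated without running the model. In spaCy, a span's `start` and `end` are token indices into the document, with `end` exclusive, so the tokens between two entities are `doc[e1.end:e2.start]`. The sketch below mimics this with a plain token list. The `Entity` tuple, the `find_relations` helper and the hand-assigned spans are stand-ins for what the model would return, not spaCy objects.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    """Stand-in for a spaCy Span: token offsets, with `end` exclusive."""
    label: str
    start: int
    end: int


def find_relations(tokens: List[str], entities: List[Entity],
                   first_label: str, second_label: str, context_regex: str):
    """Return (first text, second text, context) for consecutive entity pairs
    whose labels match and whose in-between tokens match context_regex."""
    rows = []
    for e1, e2 in zip(entities, entities[1:]):
        if e1.label != first_label or e2.label != second_label:
            continue
        # Tokens strictly between the two entities
        context = " ".join(tokens[e1.end:e2.start])
        if re.search(context_regex, context):
            rows.append((" ".join(tokens[e1.start:e1.end]),
                         " ".join(tokens[e2.start:e2.end]),
                         context))
    return rows


# Hand-tokenized sentence with hand-assigned entity spans (illustrative only)
tokens = ["The", "U.N.", "has", "its", "headquarters", "in", "New", "York", "."]
entities = [Entity("ORG", 1, 2), Entity("GPE", 6, 8)]

rows = find_relations(tokens, entities, "ORG", "GPE", r"\bin\b")
print(rows)
```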

In [0]:
import pandas as pd

In [26]:
#3
# Parse every sentence once and reuse the docs for questions 3 and 5
docs = [nlp(' '.join(s)) for s in tqdm(sentences)]

def extract_relations(doc, type1, type2, context_regex):
  # Pair each entity with the next one in the sentence (adjacent entities only)
  rows = [(e1.text, e2.text, doc[e1.end:e2.start].text.strip())
          for e1, e2 in zip(doc.ents, doc.ents[1:])
          if e1.label_ == type1 and e2.label_ == type2]
  df = pd.DataFrame(rows, columns=[type1, type2, 'context'])
  # Keep only the pairs whose in-between text matches the regex
  return df[df.context.str.contains(context_regex)]

df = pd.concat([extract_relations(doc, 'ORG', 'GPE', r'\bin\b') for doc in tqdm(docs)])
print(df.shape[0])
df.sample(n=5)





96


|   | ORG | GPE | context |
|---|---|---|---|
| 0 | the Scottish Rite | Colorado | in |
| 1 | the Santa Cecilia Orchestra | Rome | in |
| 0 | Georgetown University | Washington | in |
| 0 | U.N. | New York | headquarters in |
| 0 | Rusk | Dulles | belief in balanced defense , replacing the |


In [28]:
print(df.shape[0])

96
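The pattern `r'\bin\b'` passed to `extract_relations` uses word boundaries so that only the standalone word "in" counts as a match. A quick check with Python's `re` module, using two context strings from the sample rows above plus one made-up string, shows the difference from a plain substring search:

```python
import re

word_pattern = re.compile(r"\bin\b")   # "in" as a whole word
substring_pattern = re.compile(r"in")  # "in" anywhere, even inside a word

contexts = [
    "headquarters in",                             # from the sample output
    "belief in balanced defense , replacing the",  # from the sample output
    "is moving to",                                # made-up: "in" only inside "moving"
]

for context in contexts:
    print(repr(context),
          bool(word_pattern.search(context)),
          bool(substring_pattern.search(context)))

whole_word_hits = [bool(word_pattern.search(c)) for c in contexts]
substring_hits = [bool(substring_pattern.search(c)) for c in contexts]
```

The made-up third string is rejected by the word-boundary pattern but accepted by the plain substring pattern, which is why the boundary markers matter here.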


4.

The data suggests where some organizations are located, but it does not establish it. The shorter the context between the ORG and the GPE, the more confident we can be that the two are related: a bare "in" or "headquarters in" is a strong signal. With a long context the link is much weaker. In the Rusk/Dulles row, for example, the context spans several words and there is no location relation between the two entities at all (both also appear to be person names that the model has mislabeled).
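One way to act on the observation about context length is to keep only the pairs whose context is short. A minimal sketch follows, using the five sampled rows shown earlier as plain tuples. The threshold of three tokens is an arbitrary assumption rather than a tuned value, and `has_short_context` is a hypothetical helper.

```python
def has_short_context(context: str, max_tokens: int = 3) -> bool:
    """True when the text between the two entities has at most max_tokens tokens."""
    return len(context.split()) <= max_tokens


# (ORG, GPE, context) rows copied from the sample of five shown earlier
sample_rows = [
    ("the Scottish Rite", "Colorado", "in"),
    ("the Santa Cecilia Orchestra", "Rome", "in"),
    ("Georgetown University", "Washington", "in"),
    ("U.N.", "New York", "headquarters in"),
    ("Rusk", "Dulles", "belief in balanced defense , replacing the"),
]

likely_locations = [row for row in sample_rows if has_short_context(row[2])]
for org, gpe, context in likely_locations:
    print(f"{org} -> {gpe} ({context!r})")
```

With the DataFrame from question 3, the same filter could be written as `df[df.context.str.split().str.len() <= 3]`.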

In [27]:
#5
pd.concat([extract_relations(doc, 'PERSON', 'GPE', r'\bin\b') for doc in tqdm(docs)])





|   | PERSON | GPE | context |
|---|---|---|---|
| 0 | Ivan Allen Jr. | Sept. | , who became a candidate in the |
| 0 | Henry C. Grover | Houston | , who teaches history in the |
| 0 | Clark | Oklahoma | has served as teacher and principal in |
| 0 | Karns | East St. Louis | , who is a City judge in |
| 0 | Richard M. Nixon | Detroit | in |
| 1 | Kennedy | Laos | administration would be held responsible if th... |
| 0 | Nixon | Cuba | , for his part , would oppose intervention in |
| 0 | Mitchell | Washington | is against the centralization of government in |
| 0 | Lawrence E. Gerosa | Bronx | , who lives in the |
| 0 | Screvane | Queens | , who lives in |
