# Information Extraction: NLTK and spaCy

Script Sources:

* **NLTK**: Tsilimos, Maria. Python: Introduction to Natural Language Processing (NLP). IT Central, University of Zurich.
* **spaCy**: https://spacy.io/usage/spacy-101

**Information Extraction (IE)** consists of transforming **unstructured natural-language data** (written or spoken) into **structured data** that machines can use.

In this notebook we are going to learn two different IE methods: **Part-of-Speech (POS) Tagging** and **Named Entity Recognition (NER)**.

There are many excellent Python libraries for both tasks. In this notebook we will learn how to use **NLTK** and **spaCy** and understand the advantages and disadvantages of each!

# 1. Importing our data

Let's begin by using the first chapter of **Around the World in Eighty Days** by Jules Verne.

If you remember, in the previous chapter we carried out four cleaning and pre-processing steps:

* Tokenization
* Lowercasing
* Removing Punctuation
* Removing Stopwords

Now **we are not going to do any of those things**. We need to do **POS tagging**, and for that it is necessary to keep punctuation and stopwords so that the tagger is not confused.

The only things we are going to remove are the noisy characters "\r\n".

For that, we use this call (in case you want to replicate it on your own dataset): **re.sub(r"\r\n", " ", data)**.

To save time, a cleaned first chapter with this step already applied has been prepared for you.
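
If you want to try the cleaning step on your own text, the following is a minimal sketch. The `raw` string is a made-up sample with Windows-style line endings, not the real chapter file, and the names `raw` and `cleaned` are only illustrative.

```python
import re

# Hypothetical raw text with "\r\n" line endings (not the actual chapter file).
raw = "CHAPTER I.\r\nMr. Phileas Fogg lived, in 1872,\r\nat No. 7, Saville Row."

# Replace every "\r\n" pair with a single space, as described above.
cleaned = re.sub(r"\r\n", " ", raw)

print(cleaned)
```

Punctuation, capitalization and stopwords are left untouched, which is what we want before POS tagging.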

In [1]:
with open("chapter_1_80.txt", "r", encoding = "utf-8") as f:
    data = f.read()

# 2. Understanding Information Extraction Architecture: NLTK

### A. We import the libraries

In [2]:
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk import pos_tag
from nltk import ne_chunk
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk.draw import draw_trees

### B. We initialize the Information Extraction Pipeline:

1. Sentence Segmentation
2. Tokenization
3. POS Tagging
4. Chunking
5. NER

#### 1. Sentence Segmentation

In [3]:
sentences = sent_tokenize(data) 
sentences

['CHAPTER I.',
 'IN WHICH PHILEAS FOGG AND PASSEPARTOUT ACCEPT EACH OTHER, THE ONE AS MASTER, THE OTHER AS MAN\nMr. Phileas Fogg lived, in 1872, at No.',
 '7, Saville Row, Burlington Gardens, the house in which Sheridan died in 1814.',
 'He was one of the most noticeable members of the Reform Club, though he seemed always to avoid attracting attention; an enigmatical personage, about whom little was known, except that he was a polished man of the world.',
 'People said that he resembled Byron—at least that his head was Byronic; but he was a bearded, tranquil Byron, who might live on a thousand years without growing old.',
 'Certainly an Englishman, it was more doubtful whether Phileas Fogg was a Londoner.',
 'He was never seen on ’Change, nor at the Bank, nor in the counting-rooms of the “City”; no ships ever came into London docks of which he was the owner; he had no public employment; he had never been entered at any of the Inns of Court, either at the Temple, or Lincoln’s Inn, or Gr
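
Notice that the output above breaks the address after "No.", so "7, Saville Row" starts a new sentence. One possible repair, sketched here under the assumption that a small hand-written abbreviation list is acceptable, is to glue a sentence back onto the previous one when that previous one ends in a known abbreviation. `ABBREVIATIONS` and `merge_abbreviation_splits` are hypothetical names and not part of NLTK, and the sample list is a shortened version of the real output.

```python
# Assumption: a hand-picked tuple of abbreviations that should not end a sentence.
ABBREVIATIONS = ("No.",)


def merge_abbreviation_splits(sentences, abbreviations=ABBREVIATIONS):
    """Join a sentence to the previous one if that one ends with an abbreviation."""
    merged = []
    for sentence in sentences:
        if merged and merged[-1].endswith(abbreviations):
            merged[-1] = merged[-1] + " " + sentence
        else:
            merged.append(sentence)
    return merged


# Shortened sample modelled on the segmentation output above.
sample = [
    "CHAPTER I.",
    "Mr. Phileas Fogg lived, in 1872, at No.",
    "7, Saville Row, Burlington Gardens, the house in which Sheridan died in 1814.",
]

fixed = merge_abbreviation_splits(sample)
for sentence in fixed:
    print(sentence)
```

This is only a post-processing patch. It does not change how `sent_tokenize` itself decides where sentences end.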

#### 2. Tokenization

In [4]:
token_sentences = [word_tokenize(sentence) for sentence in sentences] 

In [5]:
print(token_sentences)

[['CHAPTER', 'I', '.'], ['IN', 'WHICH', 'PHILEAS', 'FOGG', 'AND', 'PASSEPARTOUT', 'ACCEPT', 'EACH', 'OTHER', ',', 'THE', 'ONE', 'AS', 'MASTER', ',', 'THE', 'OTHER', 'AS', 'MAN', 'Mr.', 'Phileas', 'Fogg', 'lived', ',', 'in', '1872', ',', 'at', 'No', '.'], ['7', ',', 'Saville', 'Row', ',', 'Burlington', 'Gardens', ',', 'the', 'house', 'in', 'which', 'Sheridan', 'died', 'in', '1814', '.'], ['He', 'was', 'one', 'of', 'the', 'most', 'noticeable', 'members', 'of', 'the', 'Reform', 'Club', ',', 'though', 'he', 'seemed', 'always', 'to', 'avoid', 'attracting', 'attention', ';', 'an', 'enigmatical', 'personage', ',', 'about', 'whom', 'little', 'was', 'known', ',', 'except', 'that', 'he', 'was', 'a', 'polished', 'man', 'of', 'the', 'world', '.'], ['People', 'said', 'that', 'he', 'resembled', 'Byron—at', 'least', 'that', 'his', 'head', 'was', 'Byronic', ';', 'but', 'he', 'was', 'a', 'bearded', ',', 'tranquil', 'Byron', ',', 'who', 'might', 'live', 'on', 'a', 'thousand', 'years', 'without', 'growin

#### 3. POS Tagging

In [6]:
pos_sentences = [nltk.pos_tag(sentence) for sentence in token_sentences ] 

In [7]:
pos_sentences

[[('CHAPTER', 'NN'), ('I', 'PRP'), ('.', '.')],
 [('IN', 'NNP'),
  ('WHICH', 'NNP'),
  ('PHILEAS', 'NNP'),
  ('FOGG', 'NNP'),
  ('AND', 'NNP'),
  ('PASSEPARTOUT', 'NNP'),
  ('ACCEPT', 'NNP'),
  ('EACH', 'NNP'),
  ('OTHER', 'NNP'),
  (',', ','),
  ('THE', 'NNP'),
  ('ONE', 'NNP'),
  ('AS', 'NNP'),
  ('MASTER', 'NNP'),
  (',', ','),
  ('THE', 'NNP'),
  ('OTHER', 'NNP'),
  ('AS', 'NNP'),
  ('MAN', 'NNP'),
  ('Mr.', 'NNP'),
  ('Phileas', 'NNP'),
  ('Fogg', 'NNP'),
  ('lived', 'VBD'),
  (',', ','),
  ('in', 'IN'),
  ('1872', 'CD'),
  (',', ','),
  ('at', 'IN'),
  ('No', 'DT'),
  ('.', '.')],
 [('7', 'CD'),
  (',', ','),
  ('Saville', 'NNP'),
  ('Row', 'NNP'),
  (',', ','),
  ('Burlington', 'NNP'),
  ('Gardens', 'NNP'),
  (',', ','),
  ('the', 'DT'),
  ('house', 'NN'),
  ('in', 'IN'),
  ('which', 'WDT'),
  ('Sheridan', 'NNP'),
  ('died', 'VBD'),
  ('in', 'IN'),
  ('1814', 'CD'),
  ('.', '.')],
 [('He', 'PRP'),
  ('was', 'VBD'),
  ('one', 'CD'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('most', 'RB

#### 4. Chunking and NER

#### Chunking

In [8]:
sentence = pos_sentences[7]

In [9]:
sentence

[('He', 'PRP'),
 ('certainly', 'RB'),
 ('was', 'VBD'),
 ('not', 'RB'),
 ('a', 'DT'),
 ('manufacturer', 'NN'),
 (';', ':'),
 ('nor', 'CC'),
 ('was', 'VBD'),
 ('he', 'PRP'),
 ('a', 'DT'),
 ('merchant', 'NN'),
 ('or', 'CC'),
 ('a', 'DT'),
 ('gentleman', 'JJ'),
 ('farmer', 'NN'),
 ('.', '.')]

In [10]:
grammar = "NP: {<DT>?<JJ>*<NN>}" 

In [11]:
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence) 
print(result)

(S
  He/PRP
  certainly/RB
  was/VBD
  not/RB
  (NP a/DT manufacturer/NN)
  ;/:
  nor/CC
  was/VBD
  he/PRP
  (NP a/DT merchant/NN)
  or/CC
  (NP a/DT gentleman/JJ farmer/NN)
  ./.)
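
The grammar **NP: {<DT>?<JJ>*<NN>}** reads as: an optional determiner, followed by any number of adjectives, followed by one singular noun. To make that rule concrete, here is a plain-Python sketch that applies the same pattern to the tagged sentence. It is only an illustration of the pattern, not how `nltk.RegexpParser` is implemented, and `chunk_noun_phrases` is a hypothetical helper name.

```python
def chunk_noun_phrases(tagged):
    """Return the (word, tag) runs that match <DT>?<JJ>*<NN>."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        # Optional determiner.
        if tagged[j][1] == "DT":
            j += 1
        # Zero or more adjectives.
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        # Exactly one singular noun closes the chunk.
        if j < len(tagged) and tagged[j][1] == "NN":
            chunks.append(tagged[i:j + 1])
            i = j + 1
        else:
            i += 1
    return chunks


# The tagged sentence shown in the output of In [9].
sentence = [
    ("He", "PRP"), ("certainly", "RB"), ("was", "VBD"), ("not", "RB"),
    ("a", "DT"), ("manufacturer", "NN"), (";", ":"), ("nor", "CC"),
    ("was", "VBD"), ("he", "PRP"), ("a", "DT"), ("merchant", "NN"),
    ("or", "CC"), ("a", "DT"), ("gentleman", "JJ"), ("farmer", "NN"),
    (".", "."),
]

phrases = [" ".join(word for word, _ in chunk) for chunk in chunk_noun_phrases(sentence)]
print(phrases)
```

The three phrases it finds are the same three NP chunks that appear in the parse tree above.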


In [12]:
result.draw()

#### NER

In [13]:
chunked_sentences = nltk.ne_chunk_sents(pos_sentences)

In [14]:
chunked_sentences

<generator object ParserI.parse_sents.<locals>.<genexpr> at 0x0000021947CD3540>

In [15]:
for sent in chunked_sentences:
    for chunk in sent: 
        if hasattr(chunk,'label'): 
            print(chunk.label(), ' '.join(c[0] for c in chunk))

ORGANIZATION PHILEAS
ORGANIZATION THE
ORGANIZATION ONE
ORGANIZATION THE
PERSON Mr. Phileas Fogg
ORGANIZATION No
PERSON Saville Row
PERSON Burlington Gardens
ORGANIZATION Sheridan
ORGANIZATION Reform Club
PERSON People
GPE Byronic
PERSON Byron
PERSON Phileas Fogg
GPE Londoner
ORGANIZATION Bank
GPE London
ORGANIZATION Inns
GPE Court
GPE Temple
PERSON Lincoln
PERSON Inn
PERSON Gray
GPE Inn
GPE Chancery
ORGANIZATION Exchequer
ORGANIZATION Queen
PERSON Bench
ORGANIZATION Ecclesiastical Courts
ORGANIZATION Royal Institution
ORGANIZATION London Institution
ORGANIZATION Artisan
GPE English
ORGANIZATION Harmonic
PERSON Phileas
PERSON Fogg
GPE Reform
ORGANIZATION Barings
GPE Was
PERSON Phileas Fogg
PERSON Mr. Fogg
PERSON Phileas Fogg
GPE London
PERSON Mr.
PERSON Fogg
PERSON Phileas
PERSON Fogg
GPE Saville Row
ORGANIZATION Reform
GPE Saville Row
GPE American
GPE Saville Row
PERSON Phileas Fogg
PERSON James Forster
PERSON Phileas
PERSON Fogg
PERSON Mr. Fogg
PERSON Saville Row
GPE Reform
PERSON Phi

And now let's transform that into a list!

Source: https://nanonets.com/blog/named-entity-recognition-with-nltk-and-spacy/

In [16]:
chunked_sentences = nltk.ne_chunk_sents(pos_sentences)
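
We call `nltk.ne_chunk_sents` again because, as the output of In [14] shows, it returns a generator, and a generator can only be iterated once. The printing loop in In [15] already used it up. The sketch below shows the same behaviour without NLTK. `fake_chunker` is a hypothetical stand-in that simply upper-cases its input.

```python
def fake_chunker(sentences):
    # Hypothetical stand-in for a function that returns a generator.
    return (sentence.upper() for sentence in sentences)


chunks = fake_chunker(["he lived in london", "she lived in paris"])

first_pass = list(chunks)   # consumes the generator
second_pass = list(chunks)  # nothing is left to iterate over

print(first_pass)
print(second_pass)
```

If you need the results more than once, wrapping the call in `list(...)` keeps them in memory.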

In [17]:
named_entities = []

In [18]:
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label"):
            named_entities.append((chunk.label(), ' '.join(c[0] for c in chunk)))

In [19]:
named_entities

[('ORGANIZATION', 'PHILEAS'),
 ('ORGANIZATION', 'THE'),
 ('ORGANIZATION', 'ONE'),
 ('ORGANIZATION', 'THE'),
 ('PERSON', 'Mr. Phileas Fogg'),
 ('ORGANIZATION', 'No'),
 ('PERSON', 'Saville Row'),
 ('PERSON', 'Burlington Gardens'),
 ('ORGANIZATION', 'Sheridan'),
 ('ORGANIZATION', 'Reform Club'),
 ('PERSON', 'People'),
 ('GPE', 'Byronic'),
 ('PERSON', 'Byron'),
 ('PERSON', 'Phileas Fogg'),
 ('GPE', 'Londoner'),
 ('ORGANIZATION', 'Bank'),
 ('GPE', 'London'),
 ('ORGANIZATION', 'Inns'),
 ('GPE', 'Court'),
 ('GPE', 'Temple'),
 ('PERSON', 'Lincoln'),
 ('PERSON', 'Inn'),
 ('PERSON', 'Gray'),
 ('GPE', 'Inn'),
 ('GPE', 'Chancery'),
 ('ORGANIZATION', 'Exchequer'),
 ('ORGANIZATION', 'Queen'),
 ('PERSON', 'Bench'),
 ('ORGANIZATION', 'Ecclesiastical Courts'),
 ('ORGANIZATION', 'Royal Institution'),
 ('ORGANIZATION', 'London Institution'),
 ('ORGANIZATION', 'Artisan'),
 ('GPE', 'English'),
 ('ORGANIZATION', 'Harmonic'),
 ('PERSON', 'Phileas'),
 ('PERSON', 'Fogg'),
 ('GPE', 'Reform'),
 ('ORGANIZATION', 

In [20]:
person = []

for a,b in named_entities:
    if a == "PERSON":
        person.append([a, b])

In [21]:
person

[['PERSON', 'Mr. Phileas Fogg'],
 ['PERSON', 'Saville Row'],
 ['PERSON', 'Burlington Gardens'],
 ['PERSON', 'People'],
 ['PERSON', 'Byron'],
 ['PERSON', 'Phileas Fogg'],
 ['PERSON', 'Lincoln'],
 ['PERSON', 'Inn'],
 ['PERSON', 'Gray'],
 ['PERSON', 'Bench'],
 ['PERSON', 'Phileas'],
 ['PERSON', 'Fogg'],
 ['PERSON', 'Phileas Fogg'],
 ['PERSON', 'Mr. Fogg'],
 ['PERSON', 'Phileas Fogg'],
 ['PERSON', 'Mr.'],
 ['PERSON', 'Fogg'],
 ['PERSON', 'Phileas'],
 ['PERSON', 'Fogg'],
 ['PERSON', 'Phileas Fogg'],
 ['PERSON', 'James Forster'],
 ['PERSON', 'Phileas'],
 ['PERSON', 'Fogg'],
 ['PERSON', 'Mr. Fogg'],
 ['PERSON', 'Saville Row'],
 ['PERSON', 'Phileas Fogg'],
 ['PERSON', 'James Forster'],
 ['PERSON', 'Phileas Fogg'],
 ['PERSON', 'John'],
 ['PERSON', 'Jean Passepartout'],
 ['PERSON', 'Leotard'],
 ['PERSON', 'Blondin'],
 ['PERSON', 'Monsieur Phileas Fogg'],
 ['PERSON', 'Mr. Fogg'],
 ['PERSON', 'Passepartout'],
 ['PERSON', 'Mr. Fogg'],
 ['PERSON', 'Pardon'],
 ['PERSON', 'Phileas'],
 ['PERSON', 'Fogg

That looks good so far, although a few non-persons (for example "Saville Row" and "People") slipped in. Let's now check **Geopolitical Entities (GPE)**:

In [22]:
GPE = []

for a,b in named_entities:
    if a == "GPE":
        GPE.append([a, b])

In [23]:
GPE

[['GPE', 'Byronic'],
 ['GPE', 'Londoner'],
 ['GPE', 'London'],
 ['GPE', 'Court'],
 ['GPE', 'Temple'],
 ['GPE', 'Inn'],
 ['GPE', 'Chancery'],
 ['GPE', 'English'],
 ['GPE', 'Reform'],
 ['GPE', 'Was'],
 ['GPE', 'London'],
 ['GPE', 'Saville Row'],
 ['GPE', 'Saville Row'],
 ['GPE', 'American'],
 ['GPE', 'Saville Row'],
 ['GPE', 'Reform'],
 ['GPE', 'Frenchman'],
 ['GPE', 'Paris'],
 ['GPE', 'France'],
 ['GPE', 'England'],
 ['GPE', 'Passepartout'],
 ['GPE', 'Saville']]

That also looks quite good! However, we can observe some **issues**: is "American" or "Londoner" a person or a GPE?

In [24]:
organization = []

for a,b in named_entities:
    if a == "ORGANIZATION":
        organization.append([a, b])

In [25]:
organization

[['ORGANIZATION', 'PHILEAS'],
 ['ORGANIZATION', 'THE'],
 ['ORGANIZATION', 'ONE'],
 ['ORGANIZATION', 'THE'],
 ['ORGANIZATION', 'No'],
 ['ORGANIZATION', 'Sheridan'],
 ['ORGANIZATION', 'Reform Club'],
 ['ORGANIZATION', 'Bank'],
 ['ORGANIZATION', 'Inns'],
 ['ORGANIZATION', 'Exchequer'],
 ['ORGANIZATION', 'Queen'],
 ['ORGANIZATION', 'Ecclesiastical Courts'],
 ['ORGANIZATION', 'Royal Institution'],
 ['ORGANIZATION', 'London Institution'],
 ['ORGANIZATION', 'Artisan'],
 ['ORGANIZATION', 'Harmonic'],
 ['ORGANIZATION', 'Barings'],
 ['ORGANIZATION', 'Reform'],
 ['ORGANIZATION', 'United Kingdom']]
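
The three filtering loops above do the same thing for different labels. As a possible shortcut, the entities can be grouped by label in a single pass. This is a minimal sketch using a handful of pairs copied from the output above. `named_entities_sample`, `by_label` and `person_counts` are illustrative names.

```python
from collections import Counter, defaultdict

# A few (label, text) pairs copied from the named_entities output above.
named_entities_sample = [
    ("ORGANIZATION", "Reform Club"),
    ("PERSON", "Mr. Phileas Fogg"),
    ("PERSON", "Phileas Fogg"),
    ("GPE", "London"),
    ("PERSON", "Phileas Fogg"),
    ("GPE", "London"),
]

# Group the entity texts under their label in one pass.
by_label = defaultdict(list)
for label, text in named_entities_sample:
    by_label[label].append(text)

# Count how often each PERSON string occurs.
person_counts = Counter(by_label["PERSON"])

print(dict(by_label))
print(person_counts.most_common())
```

With the full `named_entities` list in place of the sample, `by_label["GPE"]` would hold the same strings as the `GPE` list built earlier, just without the repeated label.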

# Exercise 1

# spaCy

And now let's try spaCy. spaCy does not follow the same architecture as NLTK: we don't need to run the pipeline steps ourselves (sentence segmentation, tokenization, POS tagging, NER chunking). All of that is built into spaCy's pipeline! Have a look at: https://spacy.io/usage/linguistic-features#named-entities

You may need to install spaCy and its English pipeline. If so, remove the # symbol in the following cells.

In [26]:
#!pip install spacy

In [28]:
#!python -m spacy download en_core_web_sm

In [26]:
import spacy

In [27]:
nlp = spacy.load("en_core_web_sm")

In [28]:
doc = nlp(data)

Let's first have a look at the entity labels that the model can assign:

In [29]:
nlp.get_pipe('ner').labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

In [30]:
for ent in doc.ents:
    print(ent.text, ent.label_)

CHAPTER I. ORG
Phileas Fogg PERSON
1872 DATE
Saville Row PERSON
Burlington Gardens LOC
Sheridan PERSON
1814 DATE
the Reform Club ORG
Byron ORG
Byronic ORG
Byron ORG
a thousand years DATE
Bank ORG
London GPE
the Inns of Court ORG
Temple GPE
Lincoln ORG
Gray’s Inn ORG
the Court of Chancery ORG
Exchequer ORG
the Ecclesiastical Courts ORG
the Royal Institution ORG
the London Institution ORG
the Artisan’s Association ORG
the Institution of Arts and Sciences ORG
English NORP
Harmonic LOC
Entomologists NORP
Fogg PERSON
Reform ORG
Barings ORG
Fogg PERSON
daily DATE
thousand CARDINAL
second ORDINAL
London GPE
many years DATE
Fogg PERSON
Fogg PERSON
Saville Row GPE
hours TIME
Reform ORG
ten hours TIME
twenty-four CARDINAL
Saville Row GPE
mosaic PRODUCT
twenty CARDINAL
American NORP
Saville Row GPE
this very 2nd of October DATE
James Forster PERSON
eighty-four CARDINAL
Fahrenheit WORK_OF_ART
eighty-six DATE
between eleven and half-past DATE
the hours TIME
the minutes, the seconds TIME
the days DA

In [31]:
for ent in doc.ents:
    if ent.label_ == "PERSON":
        print(ent.text, ent.label_)

Phileas Fogg PERSON
Saville Row PERSON
Sheridan PERSON
Fogg PERSON
Fogg PERSON
Fogg PERSON
Fogg PERSON
James Forster PERSON
Fogg PERSON
Saville Row PERSON
Phileas Fogg PERSON
James Forster PERSON
Phileas Fogg PERSON
John PERSON
Jean PERSON
Fogg PERSON
Passepartout PERSON
Fogg PERSON
Phileas Fogg PERSON
James Forster PERSON


In [32]:
for ent in doc.ents:
    if ent.label_ == "GPE":
        print(ent.text, ent.label_)

London GPE
Temple GPE
London GPE
Saville Row GPE
Saville Row GPE
Saville Row GPE
Blondin GPE
Paris GPE
France GPE
England GPE
the United Kingdom GPE
Passepartout GPE
Saville Row GPE


In [33]:
for ent in doc.ents:
    if ent.label_ == "ORG":
        print(ent.text, ent.label_)

CHAPTER I. ORG
the Reform Club ORG
Byron ORG
Byronic ORG
Byron ORG
Bank ORG
the Inns of Court ORG
Lincoln ORG
Gray’s Inn ORG
the Court of Chancery ORG
Exchequer ORG
the Ecclesiastical Courts ORG
the Royal Institution ORG
the London Institution ORG
the Artisan’s Association ORG
the Institution of Arts and Sciences ORG
Reform ORG
Barings ORG
Reform ORG
gymnastics ORG


We have a winner!
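
To make the comparison a bit more concrete, we can put the PERSON entities from both libraries side by side with Python sets. This is a rough sketch: the strings are copied by hand from the outputs above, and the NLTK list only covers the part of the output that is visible in this notebook, so treat the result as indicative rather than complete.

```python
# Unique PERSON strings visible in the NLTK output above (the output is truncated).
nltk_persons = {
    "Mr. Phileas Fogg", "Saville Row", "Burlington Gardens", "People", "Byron",
    "Phileas Fogg", "Lincoln", "Inn", "Gray", "Bench", "Phileas", "Fogg",
    "Mr. Fogg", "Mr.", "James Forster", "John", "Jean Passepartout", "Leotard",
    "Blondin", "Monsieur Phileas Fogg", "Passepartout", "Pardon",
}

# Unique PERSON strings from the spaCy output above.
spacy_persons = {
    "Phileas Fogg", "Saville Row", "Sheridan", "Fogg",
    "James Forster", "John", "Jean", "Passepartout",
}

found_by_both = nltk_persons & spacy_persons
only_spacy = spacy_persons - nltk_persons
only_nltk = nltk_persons - spacy_persons

print(sorted(found_by_both))
print(sorted(only_spacy))
print(sorted(only_nltk))
```

Both libraries label "Saville Row" as a PERSON even though it is part of an address in the text, so neither result should be used without a manual check.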

# Exercise 2