# 1. Named Entity Recognition (NER)

The named entity recognition (NER) task aims to identify real-world entities, such as the names of people, organizations, and locations, within historical documents. The term *named entity (NE)*, widely used in Information Extraction (IE) and other Natural Language Processing (NLP) applications, originated in the Message Understanding Conferences (MUC), which influenced IE research between 1988 and 1996. Since 1999, the yearly Conference on Natural Language Learning (CoNLL) has covered a broad range of NLP topics, mostly through machine learning approaches.

### (a) What are named entities?

Named entities are generally proper nouns that refer to a specific person, organization, location, date, etc. Consider this sentence as an example: *Mount Everest is the tallest mountain above sea level*. NER should detect *Mount Everest* as a named entity of type location, as it refers to a specific entity.

Some other examples of named entities are listed in the following table:


| Entity type | Examples |
|-----|---|
|  ORGANIZATION   | United Nations Organization, UNICEF, Microsoft |
|  PERSON   | Novak Djokovic, Beyoncé, Scarlett Johansson |
|  LOCATION   |  Mount Everest, River Nile, Machu Picchu Archaeological Park  |
|  DATE   |  3rd April 1988, 7 June  |
|  TIME   | 8:45 A.M., one-thirty am |
|  GPE   |  France, Liechtenstein, Democratic Republic of Congo |
|  MONEY   |  7 million dollars, 73.01 INR |

What should be considered a named entity (NE) in a text is open for discussion and depends on the kind of information one wants to extract. However, the most widely used set of named entity classes contains three fundamental entity types, person (PER), location (LOC), and organization (ORG), collectively referred to as ENAMEX since the MUC-6 competition ([Grishman and Sundheim, 1996](https://aclanthology.org/C96-1079.pdf)).

### (b) Why are named entities important? (case studies)

The detection of entities can be considered as a first step in the exploration of data collections. 



#### Classifying content for news providers

News publishers generate large amounts of content on a daily basis, and managing it well is essential to get the most use out of each article. NER can automatically scan an entire collection of articles and reveal the major people, organizations, and places discussed in them. Knowing this information may help in automatically categorizing the articles into defined hierarchies and enable smooth content discovery. This could also save a lot of time and boost the efficiency of teams.

#### Automating customer support

There are a number of ways to make the handling of customer feedback smoother, and NER could be one of them. For example, the customer support department of an electronics store with multiple branches worldwide has to go through a large number of mentions in customer feedback comments. NER could provide entities such as locations and products, which can then be used to categorize each complaint and assign it to the relevant department within the organization.


#### Exploring historical documents

Historical newspapers are increasingly considered an important source of historical knowledge. As the amount of digitized data grows, tools for harvesting it are needed. Tools like NER can be extremely valuable to researchers, historians, and librarians for adding structure to large volumes of unstructured data and for improving access to digitized historical collections. For example, a simple keyword search can already give a historian a sense of whether a collection contains material relevant to their research, saving many hours of visiting archives and skimming through pages. NER can be used to detect person names and locations, both of which are prominent in the news domain, where people are often at the core of the events reported in articles. For example, the EU's digital platform for cultural heritage, [Europeana](http://www.europeana-newspapers.eu/named-entity-recognition-for-digitised-newspapers/), is using NER to make historical newspapers searchable.

#### Extracting valuable information from medical documents

Electronic health records are a valuable source of routinely collected health data that can be used for secondary purposes, including clinical and epidemiological research. They typically contain information on consultations, admissions, symptoms, clinical examinations, test results, diagnoses, treatments, and outcomes. NER can ease the process of information extraction from free-text sources of prescription information, such as clinic letters or discharge summaries. In this case, the NER task can involve extracting different types of entities: drug, strength, duration, route, form, dosage, frequency, reason for administration, etc. NER can also recognize and match demographic factors that could give analysts and doctors deeper insights.

Similar to MUC, another known competition initiated in 2004 by the Informatics for Integrating Biology and the Bedside ([i2b2](https://www.i2b2.org/)) was designed to encourage the development of NLP techniques for the extraction of medication-related information from narrative patient records, in order to accelerate the translation of clinical findings into novel diagnostics and prognostics.


#### Aiding risk assessment for financial institutions

Risk assessment is a crucial activity for financial institutions because it helps them determine the amount of capital they should hold to ensure their stability. Manual extraction of relevant information from text-based financial documents is expensive and time-consuming. NER can extract credit risk attributes from a large volume of *live* financial documents, which can number in the millions for a large bank. In the financial domain, example named entity types are lender, borrower, amount, date, etc.

#### Easing the research process

An online journal or conference publication site could hold millions of research papers and scholarly articles. There can be hundreds of papers on a single topic with slight modifications. Organizing all this data in a well-structured manner can be complex. Segregating the papers on the basis of the relevant entities they contain can save the trouble of going through the plethora of information on the subject matter. For instance, if the articles have different types of entities in their metadata (for example, NER can detect fields of study such as *Named Entity Recognition* and *Information Extraction*), one can quickly find the articles where the use of *named entity recognition in historical documents* is discussed. Thus, NER could enable students and researchers to find relevant material faster by summarizing papers and archive material and highlighting key terms, topics, and themes.

### (c) Named Entity Recognition with NLTK & spaCy

Now, we explore the task of Named Entity Recognition (NER) *tagging* of sentences. *Tagging* (or *labelling*) means detecting entities in text and assigning the correct entity type to each of them. In NLP, an *entity* is a sequence of one or more words (*tokens*). Thus, the task is to tag each token in a given sentence with an appropriate tag such as Person, Location, etc.
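To make the idea of per-token tags concrete, here is a minimal sketch in plain Python. It assumes the tags are already given (no model is involved) and that a token without an entity type is marked `O`; the helper name `group_entities` is hypothetical and only used for illustration.

```python
from itertools import groupby

def group_entities(tokens, tags):
    """Group consecutive tokens that share the same non-"O" tag into entities."""
    entities = []
    # groupby yields runs of adjacent (token, tag) pairs that have the same tag
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        words = [token for token, _ in group]
        if tag != "O":
            entities.append((" ".join(words), tag))
    return entities

tokens = "Mount Everest is the tallest mountain above sea level".split()
# Hand-written tags for illustration: one tag per token
tags = ["LOCATION", "LOCATION", "O", "O", "O", "O", "O", "O", "O"]

print(group_entities(tokens, tags))
```

Note that this naive grouping would merge two adjacent entities of the same type into one, which is why boundary-aware schemes such as IOB are used in practice.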

To detect entities programmatically, we continue to use two libraries: NLTK and spaCy. [NLTK](https://www.nltk.org/) is a widely used Python library for Natural Language Processing (NLP) and Computational Linguistics (CL), with prebuilt functions and utilities for ease of use and implementation. [spaCy](https://spacy.io/) is an open-source Python library for advanced natural language processing that covers multiple NLP tasks (part-of-speech tagging, named entity recognition, etc.).

In [40]:
phrase1 = "In 1979, after more than a century of vaccination campaigns around the planet, the World Health Organization certified that smallpox had been eradicated"
    
phrase2 = "Beginning in February 1965, there were 8 weeks of unbroken bombing by U.S. forces of targets in North Vietnam. Over the next three years, the Unites States dropped more bombs than were dropped over Asia and Europe during World War II."

In [41]:
phrase1

'In 1979, after more than a century of vaccination campaigns around the planet, the World Health Organization certified that smallpox had been eradicated'

In [42]:
phrase2

'Beginning in February 1965, there were 8 weeks of unbroken bombing by U.S. forces of targets in North Vietnam. Over the next three years, the Unites States dropped more bombs than were dropped over Asia and Europe during World War II.'

### i. NER with [NLTK](https://www.nltk.org/)



In [48]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag
from nltk import ne_chunk

sentences = sent_tokenize(phrase1)

for sentence in sentences:
    print("Sentence:", sentence)
    words = word_tokenize(sentence)
    pos_tags_sentence = pos_tag(words)
    ne_chunks_sentence = ne_chunk(pos_tags_sentence)
    print(ne_chunks_sentence)

Sentence: In 1979, after more than a century of vaccination campaigns around the planet, the World Health Organization certified that smallpox had been eradicated
(S
  In/IN
  1979/CD
  ,/,
  after/IN
  more/JJR
  than/IN
  a/DT
  century/NN
  of/IN
  vaccination/NN
  campaigns/NNS
  around/IN
  the/DT
  planet/NN
  ,/,
  the/DT
  (ORGANIZATION World/NNP)
  Health/NNP
  Organization/NNP
  certified/VBD
  that/IN
  smallpox/NN
  had/VBD
  been/VBN
  eradicated/VBN)


In [49]:
sentences = sent_tokenize(phrase2)

for sentence in sentences:
    print("Sentence:", sentence)
    words = word_tokenize(sentence)
    pos_tags_sentence = pos_tag(words)
    ne_chunks_sentence = ne_chunk(pos_tags_sentence)
    print(ne_chunks_sentence)

Sentence: Beginning in February 1965, there were 8 weeks of unbroken bombing by U.S. forces of targets in North Vietnam.
(S
  Beginning/VBG
  in/IN
  February/NNP
  1965/CD
  ,/,
  there/EX
  were/VBD
  8/CD
  weeks/NNS
  of/IN
  unbroken/JJ
  bombing/NN
  by/IN
  (GPE U.S./NNP)
  forces/NNS
  of/IN
  targets/NNS
  in/IN
  (GPE North/NNP Vietnam/NNP)
  ./.)
Sentence: Over the next three years, the Unites States dropped more bombs than were dropped over Asia and Europe during World War II.
(S
  Over/IN
  the/DT
  next/JJ
  three/CD
  years/NNS
  ,/,
  the/DT
  (GPE Unites/NNP States/NNPS)
  dropped/VBD
  more/JJR
  bombs/NNS
  than/IN
  were/VBD
  dropped/VBN
  over/IN
  (GPE Asia/NNP)
  and/CC
  (GPE Europe/NNP)
  during/IN
  World/NNP
  War/NNP
  II/NNP
  ./.)


### ii. NER with [spaCy](https://spacy.io/)


spaCy currently provides models for a number of [languages](https://spacy.io/usage/models). To apply NER to an English text, we need to download a spaCy model for [English](https://spacy.io/models/en). We choose `en_core_web_sm` because it is the smallest one in terms of download size (12 MB).

In [None]:
!python -m spacy download en_core_web_sm

In [51]:
import spacy
from spacy import displacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

In [52]:
doc = nlp(phrase1)

for entity in doc.ents:
    print('Entity:', entity.text, '---', 'Entity Type (tag/label):', entity.label_)

Entity: 1979 --- Entity Type (tag/label): DATE
Entity: more than a century --- Entity Type (tag/label): DATE
Entity: the World Health Organization --- Entity Type (tag/label): ORG


In [53]:
doc = nlp(phrase2)

for entity in doc.ents:
    print('Entity:', entity.text, '---',  'Entity Type:', entity.label_)

Entity: February 1965 --- Entity Type: DATE
Entity: 8 weeks --- Entity Type: DATE
Entity: U.S. --- Entity Type: GPE
Entity: North Vietnam --- Entity Type: LOC
Entity: the next three years --- Entity Type: DATE
Entity: the Unites States --- Entity Type: GPE
Entity: Asia --- Entity Type: LOC
Entity: Europe --- Entity Type: LOC
Entity: World War II --- Entity Type: EVENT
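Once entities are extracted, a simple first analysis is to count how often each entity type occurs. The sketch below uses `collections.Counter` on a hand-copied list of the `(text, label)` pairs printed above, so it runs without spaCy; the variable names `entities` and `label_counts` are illustrative assumptions.

```python
from collections import Counter

# (text, label) pairs copied from the spaCy output for phrase2 above
entities = [
    ("February 1965", "DATE"),
    ("8 weeks", "DATE"),
    ("U.S.", "GPE"),
    ("North Vietnam", "LOC"),
    ("the next three years", "DATE"),
    ("the Unites States", "GPE"),
    ("Asia", "LOC"),
    ("Europe", "LOC"),
    ("World War II", "EVENT"),
]

# Count how many entities were found for each label
label_counts = Counter(label for _, label in entities)

for label, count in label_counts.most_common():
    print(label, count)
```

With a live spaCy `doc`, the same count could be obtained with `Counter(entity.label_ for entity in doc.ents)`.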


In [54]:
for i, sent in enumerate(doc.sents):
    print("Sentence", i, ':', sent)

Sentence 0 : Beginning in February 1965, there were 8 weeks of unbroken bombing by U.S. forces of targets in North Vietnam.
Sentence 1 : Over the next three years, the Unites States dropped more bombs than were dropped over Asia and Europe during World War II.


The IOB format (short for **i**nside, **o**utside, **b**eginning) is a common format for tagging tokens in chunking tasks in computational linguistics (e.g., named entity recognition) so that entity boundaries are explicit. In IOB, the I- prefix before a tag indicates that the token is inside a chunk. An O tag indicates that the token belongs to no chunk. The B- prefix before a tag indicates that the token is the beginning of a chunk that immediately follows another chunk, with no O tags between them.

| Tag | Meaning |
|-----|---|
|  I   | A token inside an entity |
|  O   |  A non-entity token  |
|  B   | The first token of an entity that immediately follows another entity |

| U.S.  | dropped | more | bombs | than | over | Asia  | and | Europe | during | World | War | II |
|---------|--------|---------|------|-------|------|---------|------|------|-------|-----|--------|--------|
|  I-GPE       |    O    |    O     |  O    |    O   |   O   |   I-LOC      |   O   |  I-LOC    |    O   |   I-EVENT  |   I-EVENT     |    I-EVENT    |
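The rule for when `B-` appears in IOB can be sketched in a few lines of plain Python. This is only an illustration: the helper `spans_to_iob1` and the `(start, end, label)` span representation (with `end` exclusive) are assumptions, and the sketch follows the CoNLL-2003 convention in which `B-` is only needed when the preceding chunk has the same type.

```python
def spans_to_iob1(num_tokens, spans):
    """Convert non-overlapping (start, end, label) spans, end exclusive, to IOB tags."""
    tags = ["O"] * num_tokens
    for start, end, label in sorted(spans):
        for i in range(start, end):
            tags[i] = "I-" + label
        # B- is used only if the previous token belongs to another chunk of the same type
        if start > 0 and tags[start - 1] in ("I-" + label, "B-" + label):
            tags[start] = "B-" + label
    return tags

tokens = ["U.S.", "dropped", "more", "bombs", "than", "over", "Asia",
          "and", "Europe", "during", "World", "War", "II"]
spans = [(0, 1, "GPE"), (6, 7, "LOC"), (8, 9, "LOC"), (10, 13, "EVENT")]

iob1_tags = spans_to_iob1(len(tokens), spans)
for token, tag in zip(tokens, iob1_tags):
    print(token, tag)

# Two adjacent entities of the same type: the second one needs B-
print(spans_to_iob1(2, [(0, 1, "LOC"), (1, 2, "LOC")]))
```

In the example sentence no two entities are adjacent, so every entity token receives an `I-` tag, as in the table above.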


Another widely used format is IOB2, which is the same as the IOB format except that the B- tag is used at the beginning of every chunk (i.e., all chunks start with a B- tag).

| U.S.  | dropped | more | bombs | than | over | Asia  | and | Europe | during | World | War | II |
|---------|--------|---------|------|-------|------|---------|------|------|-------|-----|--------|--------|
|  B-GPE       |    O    |    O     |  O    |    O   |   O   |   B-LOC      |   O   |  B-LOC    |    O   |   B-EVENT  |   I-EVENT     |    I-EVENT    |
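The relationship between the two formats can be expressed as a small conversion routine. The following is a minimal sketch, not a library function: `iob1_to_iob2` is a hypothetical helper that rewrites an `I-` tag as `B-` whenever it opens a new chunk.

```python
def iob1_to_iob2(tags):
    """Rewrite IOB tags so that every chunk starts with a B- tag (IOB2)."""
    converted = []
    previous = "O"
    for tag in tags:
        # An I- tag opens a new chunk if the previous token is outside any chunk
        # or belongs to a chunk of a different type
        if tag.startswith("I-") and (previous == "O" or previous[2:] != tag[2:]):
            converted.append("B-" + tag[2:])
        else:
            converted.append(tag)
        previous = tag
    return converted

# IOB tags of the example sentence
iob1_tags = ["I-GPE", "O", "O", "O", "O", "O", "I-LOC", "O", "I-LOC", "O",
             "I-EVENT", "I-EVENT", "I-EVENT"]

iob2_tags = iob1_to_iob2(iob1_tags)
print(iob2_tags)
```

Tags that already start with `B-` are left unchanged, so the function can also be applied safely to input that is already in IOB2.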


Another tagging scheme is BIOES (also written BILOU), where 'E' or 'L' denotes the Ending or Last token of a multi-token entity, and 'S' or 'U' denotes a Single-token or Unit entity.

| Tag | Meaning |
|-----|---|
|  B   | The first token of a multi-token entity |
|  I   | An inner token of a multi-token entity |
|  L/E   |  The last/ending token of a multi-token entity  |
|  O   |  A non-entity token  |
|  U/S   | A single token entity |

| U.S.  | dropped | more | bombs | than | over | Asia  | and | Europe | during | World | War | II |
|---------|--------|---------|------|-------|------|---------|------|------|-------|-----|--------|--------|
|  S-GPE       |    O    |    O     |  O    |    O   |   O   |   S-LOC      |   O   |  S-LOC    |    O   |   B-EVENT  |   I-EVENT     |    E-EVENT    |
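BIOES tags can be derived from IOB2 tags by looking one token ahead. The sketch below is an illustrative, assumption-laden helper (`iob2_to_bioes` is not part of NLTK or spaCy) that expects well-formed IOB2 input.

```python
def iob2_to_bioes(tags):
    """Convert well-formed IOB2 tags to BIOES tags."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The chunk continues only if the next token is an I- tag of the same type
        continues = next_tag == "I-" + label
        if prefix == "B":
            bioes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            bioes.append(("I-" if continues else "E-") + label)
    return bioes

# IOB2 tags of the example sentence
iob2_tags = ["B-GPE", "O", "O", "O", "O", "O", "B-LOC", "O", "B-LOC", "O",
             "B-EVENT", "I-EVENT", "I-EVENT"]

bioes_tags = iob2_to_bioes(iob2_tags)
print(bioes_tags)
```

Single-token entities such as *U.S.* become `S-` tags, while *World War II* is split into beginning, inner, and ending tokens, matching the table above.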



In [56]:
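import nltk  # needed because the function calls nltk.* directly

def ie_preprocess(document):
    """Run the NLTK pipeline: sentence split, tokenize, POS-tag, NE-chunk."""
    sentences = nltk.sent_tokenize(document)                      # split text into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split sentences into tokens
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # add part-of-speech tags
    sentences = [nltk.ne_chunk(sent) for sent in sentences]       # group tokens into NE chunks
    return sentences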

### iii. How to build or train an NER model

### General principles (ML, training data, etc)

### (d) State-of-the-art examples (list of SotA papers)

The first end-to-end systems for sequence labeling tasks were based on pre-trained word and character embeddings encoded by either a bidirectional Long Short-Term Memory (BiLSTM) network or a Convolutional Neural Network (CNN) ([Lample et al., 2016](https://arxiv.org/abs/1603.01360) and [Ma and Hovy, 2016](https://arxiv.org/abs/1603.01354)), along with a Conditional Random Field (CRF) decoder. One shortcoming of these models is that they relied on a single context-independent representation for each word. This problem has been mitigated by methods based on language model pre-training, which produce context-dependent word representations. Large-scale language model methods such as ELMo ([Peters et al., 2017](https://arxiv.org/abs/1705.00108)) and BERT ([Devlin et al., 2018](https://arxiv.org/abs/1810.04805)) further improved the performance of NER, yielding [state-of-the-art results](http://nlpprogress.com/english/named_entity_recognition.html).

- Bidirectional Long Short-Term Memory (BiLSTM) network ([Lample et al., 2016](https://arxiv.org/abs/1603.01360))
- Convolutional Neural Network (CNN) ([Ma and Hovy, 2016](https://arxiv.org/abs/1603.01354))
- ..

### (e) Entity linking

### (f) Use-case (mapping locations on a map)