# Named Entity Recognition

## References

* [Named Entities and Named Entity Tagging](https://web.stanford.edu/~jurafsky/slp3/8.pdf), Chapter 8.3, Speech and Language Processing
* [Aho–Corasick Algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm)


## Contents

* [Named Entity Tags](#Named-Entity-Tags)
* [Token-level Recognition](#Token-level-Recognition)
* [BIO Notation](#BIO-Notation)
* [BILOU Notation](#BILOU-Notation)
* [NER with Gazetteer](#NER-with-Gazetteer)

## Named Entity Tags

A named entity is a sequence of one or more tokens that names a specific entity, such as a person, an organization, or a location:

```
Jinho is a professor at Emory University in the United States
```

* "Jinho" &rarr; Person
* "Emory University" &rarr; Organization
* "the United States" &rarr; Location


The [OntoNotes guidelines](https://github.com/emorynlp/swne/blob/main/docs/ontonotes_named_entity_guidelines-v14.pdf) define a fine-grained tagset comprising 17 named entity tags:

| Tag | Description |
|:---|:---|
| `PER`     | Person |
| `NORP`    | Nationality, other, religion, political |
| `FAC`     | Facilities: man-made infrastructures |
| `ORG`     | Organizations: names of companies, educational institutions, sports teams, terrorist groups |
| `GPE`     | Geographical political entities: names of countries, cities, provinces |
| `LOC`     | Locations |
| `REL`     | Named religions or political leanings |
| `PRODUCT` | Product names or numbers |
| `DATE`    | References to dates or periods longer than 24 hours |
| `TIME`    | References to periods of time less than 24 hours |
| `MONEY`   | Monetary units but not units in rate expressions |
| `QUAN`    | Quantity: measurements with explicit standardized units |
| `CARD`    | Cardinal: numbers that do not fall under measurements, money, date, time |
| `EVENT`   | Named hurricanes, wars, sports events, battles |
| `WOA`     | Work of art: titles of books, songs, television programs, awards |
| `LAW`     | Documents that have been made into law |
| `LANG`    | Language |

## Token-level Recognition

Each token can be labeled with a named entity tag, where `O` marks a token that is not part of any named entity:

```
Jinho         PER
is            O
a             O
professor     O
at            O
Emory         ORG
University    ORG
in            O
the           O
United        GPE
States        GPE
of            GPE
America       GPE
```

Once every token is labeled this way, the same techniques used to classify POS tags can be adapted to predict the named entity tag of each token.

## BIO Notation

Let us consider the following example where named entities appear in consecutive order without punctuation, which happens quite often in colloquial writing (e.g., social media) or spoken language:


```
I        O
met      O
Sam      PER
John     PER
and      O
Sarah    PER
```

With token-level tags alone, it is not possible to tell whether `Sam` and `John` form one entity or two.
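The ambiguity can be made concrete with a minimal sketch that merges consecutive tokens sharing the same tag, which is the most a decoder can do with plain token-level tags. The function name `group_by_tag` is an assumption introduced here for illustration, not something defined elsewhere in these notes.

```python
from itertools import groupby
from typing import List, Tuple

def group_by_tag(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge each run of consecutive tokens that share the same non-O tag."""
    entities = []
    # groupby yields one group per run of identical tags
    for tag, run in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(token for token, _ in run), tag))
    return entities

tokens = ['I', 'met', 'Sam', 'John', 'and', 'Sarah']
tags = ['O', 'O', 'PER', 'PER', 'O', 'PER']
print(group_by_tag(tokens, tags))  # [('Sam John', 'PER'), ('Sarah', 'PER')]
```

The two people `Sam` and `John` are collapsed into a single `PER` entity because nothing in the tags marks where one entity ends and the next begins.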

To resolve this ambiguity, each named entity tag can be decorated with one of the following prefixes (`O` still marks tokens outside any entity):

* `B-`: the beginning token of an entity
* `I-`: an inside token of an entity

Given this notation, the above example can be tagged as follows:

```
I        O
met      O
Sam      B-PER
John     B-PER
and      O
Sarah    B-PER
```

With the BIO notation, `Sam` and `John` can be recognized as two separate entities because both are marked as beginning tokens.

The earlier sentence from the token-level section can be relabeled as follows:

```
Jinho         B-PER
is            O
a             O
professor     O
at            O
Emory         B-ORG
University    I-ORG
in            O
the           O
United        B-GPE
States        I-GPE
of            I-GPE
America       I-GPE
```
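As a rough illustration of how BIO tags are decoded back into entities, the sketch below converts a tag sequence into `(begin, end, label)` spans with an exclusive end index. The function name `bio_to_spans` is hypothetical, and treating a stray `I-` tag (one with no matching open entity) as the start of a new entity is an assumption of this sketch rather than part of the notation.

```python
from typing import List, Tuple

def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (begin, end, label) spans, where end is exclusive."""
    spans = []
    begin, label = None, None
    for i, tag in enumerate(tags + ['O']):  # the sentinel 'O' closes a trailing entity
        # The open entity continues only on an I- tag with the same label
        inside = tag.startswith('I-') and label == tag[2:]
        if begin is not None and not inside:
            spans.append((begin, i, label))
            begin, label = None, None
        # Assumption: a stray I- tag is treated leniently as a beginning
        if tag.startswith('B-') or (tag.startswith('I-') and begin is None):
            begin, label = i, tag[2:]
    return spans

# "I met Sam John and Sarah"
print(bio_to_spans(['O', 'O', 'B-PER', 'B-PER', 'O', 'B-PER']))
# [(2, 3, 'PER'), (3, 4, 'PER'), (5, 6, 'PER')]
```

Unlike plain token-level tags, the two adjacent `B-PER` tags yield two separate spans for `Sam` and `John`.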

### Examples

```
Emory       ORG
University  ORG
Whitehall   LOC | PER

Emory       B-ORG
University  I-ORG
Whitehall   B-LOC | B-PER

I           O
went        O
to          O
Emory       B-ORG
University  I-ORG
and         O
University  B-ORG
of          I-ORG
Georgia     I-ORG

I           O
went        O
to          O
Emory       ORG
University  ORG
University  ORG
of          ORG
Georgia     ORG

I           O
went        O
to          O
Emory       B-ORG
University  I-ORG
University  B-ORG
of          I-ORG
Georgia     I-ORG

I           O
went        O
to          O
Emory       B-ORG
University  I-ORG | B-ORG
of          I-ORG
Georgia     I-ORG
```

## BILOU Notation

[Ratinov and Roth, 2009](https://www.aclweb.org/anthology/W09-1119/) proposed extending the BIO notation with the following two prefixes:

* `L-`: the last token of an entity
* `U-`: a unit entity consisting of only one token

Given the BILOU notation, the previous example can be relabeled as follows:

```
Jinho         U-PER
is            O
a             O
professor     O
at            O
Emory         B-ORG
University    L-ORG
in            O
the           O
United        B-GPE
States        I-GPE
of            I-GPE
America       L-GPE
```
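The relabeling above can be derived mechanically from BIO tags by looking one token ahead. The following is a minimal sketch that assumes a well-formed BIO sequence as input; the function name `bio_to_bilou` is introduced here for illustration only.

```python
from typing import List

def bio_to_bilou(tags: List[str]) -> List[str]:
    """Convert a well-formed BIO tag sequence into BILOU tags."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            bilou.append(tag)
            continue
        prefix, label = tag.split('-', 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        # The entity continues only if the next tag is I- with the same label
        continues = next_tag == 'I-' + label
        if prefix == 'B':
            bilou.append(('B-' if continues else 'U-') + label)
        else:
            bilou.append(('I-' if continues else 'L-') + label)
    return bilou

bio = ['B-PER', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG',
       'O', 'O', 'B-GPE', 'I-GPE', 'I-GPE', 'I-GPE']
print(bio_to_bilou(bio))
```

For the example sentence, this should reproduce the BILOU tags listed above: a beginning token with no continuation becomes `U-`, and the final inside token of each entity becomes `L-`.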

### Examples

```
United        B-GPE
States        L-GPE
of            I-GPE
America       L-GPE
```

## NER with Gazetteer

A gazetteer is a dictionary that maps known entity names to the set of named entity tags each name can take.

### Exercise

Write the function `recognize_ngram()` that takes a sequence of tokens and a gazetteer and returns a list of entities where each entity is represented by a tuple consisting of the following 4 items:

* Index of the beginning token (inclusive)
* Index of the ending token (exclusive)
* Text span representing the entity (e.g., "Emory University")
* Set of named entity tags for the entity

```python
from typing import Dict, List, Tuple, Set

def recognize_ngram(tokens: List[str], gazetteer: Dict[str, Set[str]]) -> List[Tuple[int, int, str, Set[str]]]:
    """
    :param tokens: a sequence of input tokens.
    :param gazetteer: a dictionary whose key is the text span of a named entity (e.g., "Emory University") and the value is the set of named entity tags for the entity.
    :return: a list of entities where each entity is represented by a tuple consisting of the following 4 items:
             - Index of the beginning token (inclusive)
             - Index of the ending token (exclusive)
             - Text span representing the entity (e.g., "Emory University")
             - Set of named entity tags for the entity
    """
    entities = []
    # To be filled
    return entities
```

```python
GAZETTEER = {
    'Jinho': {'PER'},
    'Jinho Choi': {'PER'},
    'Emory': {'PER', 'ORG'},
    'Emory University': {'ORG'},
    'United States': {'GPE'},
    'United States of America': {'GPE'},
}
```

```python
text = 'Jinho Choi is a professor at Emory University in the United States of America'
tokens = text.split()

entities = recognize_ngram(tokens, GAZETTEER)
for entity in entities:
    print(entity)
```

Expected output:

```
(0, 1, 'Jinho', {'PER'})
(0, 2, 'Jinho Choi', {'PER'})
(6, 7, 'Emory', {'PER', 'ORG'})
(6, 8, 'Emory University', {'ORG'})
(10, 12, 'United States', {'GPE'})
(10, 14, 'United States of America', {'GPE'})
```
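One possible way to approach the exercise is a brute-force n-gram lookup: enumerate every token span up to the length of the longest gazetteer entry and check whether its text is a key in the gazetteer. The sketch below is only one hedged solution under that assumption, not the reference answer. The local name `max_len` and the demo variables `demo_tokens` and `demo_gazetteer` are introduced here for illustration.

```python
from typing import Dict, List, Set, Tuple

def recognize_ngram(tokens: List[str], gazetteer: Dict[str, Set[str]]) -> List[Tuple[int, int, str, Set[str]]]:
    """Brute-force sketch: look up every n-gram of the tokens in the gazetteer."""
    entities = []
    # Longest gazetteer entry in tokens, assuming entries are space-separated
    max_len = max((len(key.split()) for key in gazetteer), default=0)
    for begin in range(len(tokens)):
        last_end = min(begin + max_len, len(tokens))
        for end in range(begin + 1, last_end + 1):
            span = ' '.join(tokens[begin:end])
            tags = gazetteer.get(span)
            if tags is not None:
                entities.append((begin, end, span, tags))
    return entities

demo_gazetteer = {'Emory': {'PER', 'ORG'}, 'Emory University': {'ORG'}}
demo_tokens = 'I study at Emory University'.split()
for entity in recognize_ngram(demo_tokens, demo_gazetteer):
    print(entity)
```

This checks on the order of `len(tokens) * max_len` spans, which is fine for short inputs. For large gazetteers, a trie-based matcher such as the Aho–Corasick algorithm listed in the references avoids rebuilding and hashing every span.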
