
In [3]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import spacy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pprint import pprint
from sklearn.model_selection import train_test_split

# change default style figure and font size
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,matplotlib,spacy

Ethen 2018-11-28 20:12:46 

CPython 3.6.4
IPython 6.4.0

numpy 1.14.1
pandas 0.23.0
sklearn 0.19.1
matplotlib 2.2.2
spacy 2.0.16


# Named Entity Recognition (NER)

Before looking at any code, it helps to clarify the difference between **Part of Speech Tagging (POS)** and **Named Entity Recognition (NER)**.

**Part of Speech Tagging (POS)** aims to identify which grammatical group a word belongs to, i.e. whether the word is a noun, adjective, verb, etc., based on its context. This means it looks at relationships within the sentence and assigns each word in the sentence a corresponding tag.

**Named Entity Recognition (NER)**, on the other hand, tries to find out whether a word is a named entity or not. A named entity could be something like a person, location, organization, etc. So we could argue that when a word is recognized as a named entity, it might also be recognized as a noun by the POS tagger.

The difference with regard to implementation is that each POS tag is attached to a single word, while an NER tag can span multiple words. So NER involves not only detecting the type of named entity, but also its word boundaries. The tagging scheme used for NER is called **IOB (I: Token is inside an entity, O: Token is outside an entity and B: Token is the beginning of an entity)**. The following example shows a sentence with its corresponding POS tags and IOB tags.

- Original sentence: Albert Einstein was born in Ulm, Germany in 1879.
- POS tagged: Albert/NNP Einstein/NNP was/VBD born/VBN in/IN Ulm/NNP ,/, Germany/NNP in/IN 1879/CD ./.
- NER with IOB tags: Albert/B-PER Einstein/I-PER was/O born/O in/O Ulm/B-PLACE ,/O Germany/B-PLACE in/O 1879/B-DATE ./O
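To make the word-boundary aspect concrete, here is a minimal sketch that produces the IOB tags above from a list of entity spans. The helper `spans_to_iob` and the `(start, end, label)` span format are assumptions made for illustration; they are not part of spaCy or any other library.

```python
# Hypothetical helper for illustration; not part of spaCy.
def spans_to_iob(tokens, spans):
    """Convert entity spans into one IOB tag per token.

    tokens : list of token strings
    spans  : list of (start, end, label) tuples indexing into `tokens`,
             where `end` is exclusive
    """
    # every token starts outside of any entity
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        # the first token of a span marks the beginning of the entity
        tags[start] = 'B-' + label
        # any remaining tokens of the span are inside the same entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + label
    return tags


tokens = ['Albert', 'Einstein', 'was', 'born', 'in',
          'Ulm', ',', 'Germany', 'in', '1879', '.']
spans = [(0, 2, 'PER'), (5, 6, 'PLACE'), (7, 8, 'PLACE'), (9, 10, 'DATE')]

iob_tags = spans_to_iob(tokens, spans)
print(' '.join(token + '/' + tag for token, tag in zip(tokens, iob_tags)))
```

Note that only the two-token span "Albert Einstein" produces an `I-` tag; every single-token entity is fully described by its `B-` tag.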

## Spacy

- https://spacy.io/usage/linguistic-features#section-named-entities

In [15]:
nlp = spacy.load('en_core_web_sm')
nlp

<spacy.lang.en.English at 0x10db25860>

In [18]:
# an example sentence
sentence = ('European authorities fined Google a record $5.1 billion on Wednesday '
            'for abusing its power in the mobile phone market '
            'and ordered the company to alter its practices.')

doc = nlp(sentence)

pprint([(ent.text, ent.label_) for ent in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]


After passing our sentence through spacy, the standard way to access the entity annotations is via the document's `.ents` property; each entity's type is then available through its `.label_` property. The following [link](https://spacy.io/api/annotation#section-named-entities) contains a detailed description of each entity label.

We can also access this information for every token in our sentence/corpus. In the next code chunk, we loop through each token and print its text, IOB tag, entity type and POS tag.

In [20]:
pprint([(token.orth_, token.ent_iob_, token.ent_type_, token.pos_) for token in doc])

[('European', 'B', 'NORP', 'ADJ'),
 ('authorities', 'O', '', 'NOUN'),
 ('fined', 'O', '', 'VERB'),
 ('Google', 'B', 'ORG', 'PROPN'),
 ('a', 'O', '', 'DET'),
 ('record', 'O', '', 'NOUN'),
 ('$', 'B', 'MONEY', 'SYM'),
 ('5.1', 'I', 'MONEY', 'NUM'),
 ('billion', 'I', 'MONEY', 'NUM'),
 ('on', 'O', '', 'ADP'),
 ('Wednesday', 'B', 'DATE', 'PROPN'),
 ('for', 'O', '', 'ADP'),
 ('abusing', 'O', '', 'VERB'),
 ('its', 'O', '', 'ADJ'),
 ('power', 'O', '', 'NOUN'),
 ('in', 'O', '', 'ADP'),
 ('the', 'O', '', 'DET'),
 ('mobile', 'O', '', 'ADJ'),
 ('phone', 'O', '', 'NOUN'),
 ('market', 'O', '', 'NOUN'),
 ('and', 'O', '', 'CCONJ'),
 ('ordered', 'O', '', 'VERB'),
 ('the', 'O', '', 'DET'),
 ('company', 'O', '', 'NOUN'),
 ('to', 'O', '', 'PART'),
 ('alter', 'O', '', 'VERB'),
 ('its', 'O', '', 'ADJ'),
 ('practices', 'O', '', 'NOUN'),
 ('.', 'O', '', 'PUNCT')]
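The token-level view and the `.ents` view carry the same information. As a rough sketch, the per-token IOB tags can be grouped back into entities with a few lines of plain Python. The helper `group_entities` is hypothetical and only for illustration, and the rows below are copied from the first part of the output above. Joining tokens with a space is a simplification, since the original whitespace is not part of these tuples.

```python
# Hypothetical helper for illustration; spacy already exposes the
# grouped result through doc.ents.
def group_entities(token_rows):
    """Group (text, iob, ent_type) rows into (entity_text, label) pairs."""
    entities = []
    current = None  # (list of words, label) of the entity being built
    for text, iob, ent_type in token_rows:
        if iob == 'I' and current is not None:
            # token continues the entity that is currently open
            current[0].append(text)
            continue
        if current is not None:
            # a 'B' or 'O' tag closes the entity that was open
            entities.append((' '.join(current[0]), current[1]))
            current = None
        if iob == 'B':
            # token opens a new entity
            current = ([text], ent_type)
    if current is not None:
        # close an entity that runs until the last token
        entities.append((' '.join(current[0]), current[1]))
    return entities


# (text, IOB tag, entity type) for the first tokens of the example sentence
rows = [
    ('European', 'B', 'NORP'), ('authorities', 'O', ''), ('fined', 'O', ''),
    ('Google', 'B', 'ORG'), ('a', 'O', ''), ('record', 'O', ''),
    ('$', 'B', 'MONEY'), ('5.1', 'I', 'MONEY'), ('billion', 'I', 'MONEY'),
    ('on', 'O', ''), ('Wednesday', 'B', 'DATE'), ('for', 'O', ''),
]

grouped = group_entities(rows)
print(grouped)
```

The three `MONEY` tokens collapse into a single entity, which is why the monetary amount showed up as one item in `doc.ents` earlier.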


Another cool feature that spacy provides is the ability to visualize the named entities in our sentence.

In [25]:
spacy.displacy.render(doc, style='ent', jupyter=True)

# Reference

- [Quora: What is the difference between POS Tag and Named Entity Recognition?](https://www.quora.com/What-is-the-difference-between-POS-Tag-and-Named-Entity-Recognition)
- [Blog: Named Entity Recognition with NLTK and SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)