# NER-Tagging and Its Applications

## What is NER-tagging?

- NER stands for Named Entity Recognition.
- A named entity is a real-world object with a proper name.
    - Examples: France, Donald Duck, and Twitter.
    - France is a country, identified as **GPE** (geopolitical entity).
    - Donald Duck is a **PER** (person).
    - Twitter is a company, identified as **ORG** (organization).
    
![](https://i.imgur.com/fF1EFTJ.png)



![](https://i.imgur.com/PmqvReT.png)


> Rome is the capital of Italy

- A NER tagger recognizes Rome as a place (GPE), as well as Italy.
- It determines that Rome is a city in Italy and not a person through **Named Entity Disambiguation (NED)**.
- POS tagging is performed before NER tagging.


---

Now that we have our features ready, there are many ML models we can train on them.

- **CRF** (Conditional Random Fields)
    - a popular choice for NER tagging
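
To make "features" a little more concrete, here is a minimal sketch of per-token features that a CRF-style tagger could consume. The function names `token_features` and `sentence_features`, and the particular features chosen, are assumptions made for illustration; they are not taken from these notes or from any specific library.

```python
def token_features(sentence, i):
    """Build a feature dict for the i-th token of a POS-tagged sentence.

    `sentence` is a list of (word, pos_tag) pairs. The feature set is
    illustrative only (hypothetical, not a fixed recipe).
    """
    word, pos = sentence[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalised words often start names
        "word.isupper": word.isupper(),  # all-caps tokens such as acronyms
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "pos": pos,                      # POS tag computed before NER
    }
    if i > 0:
        prev_word, prev_pos = sentence[i - 1]
        features["prev.word.lower"] = prev_word.lower()
        features["prev.pos"] = prev_pos
    else:
        features["BOS"] = True           # beginning of sentence
    if i < len(sentence) - 1:
        next_word, next_pos = sentence[i + 1]
        features["next.word.lower"] = next_word.lower()
        features["next.pos"] = next_pos
    else:
        features["EOS"] = True           # end of sentence
    return features


def sentence_features(sentence):
    """One feature dict per token, in sentence order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


tagged = [("Rome", "NNP"), ("is", "VBZ"), ("the", "DT"),
          ("capital", "NN"), ("of", "IN"), ("Italy", "NNP")]
feats = sentence_features(tagged)
print(feats[0])
```

Each token gets its own dictionary, and the neighbouring-word features are what let a sequence model use context when deciding on a tag.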

## NER tagging in Python


`B-{CHUNK_TYPE}`: marks the word at the beginning of a chunk.

![](https://i.imgur.com/OqfZ0uW.png)
Fig 6.3 spaCy's own BILOU system for its NER tags
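
The tagging schemes themselves can be sketched in a few lines of plain Python. The helper below, `spans_to_tags` (a hypothetical name), labels tokens from entity spans under either the IOB scheme (`B-`, `I-`, `O`) or the BILOU scheme (Begin, Inside, Last, Outside, Unit). The token-index span format is an assumption chosen for this illustration.

```python
def spans_to_tags(tokens, spans, scheme="IOB"):
    """Tag tokens given entity spans (start, end, type), with end exclusive.

    Hypothetical helper for illustration; spans are token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        if scheme == "BILOU" and end - start == 1:
            tags[start] = "U-" + ent_type      # single-token entity
            continue
        tags[start] = "B-" + ent_type          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + ent_type          # tokens inside the chunk
        if scheme == "BILOU":
            tags[end - 1] = "L-" + ent_type    # last token of the chunk
    return tags


tokens = ["Donald", "Duck", "visited", "France", "."]
spans = [(0, 2, "PER"), (3, 4, "GPE")]

iob = spans_to_tags(tokens, spans, scheme="IOB")
bilou = spans_to_tags(tokens, spans, scheme="BILOU")
print(list(zip(tokens, iob)))
print(list(zip(tokens, bilou)))
```

The two schemes carry the same entity boundaries; BILOU simply marks the last token and single-token entities explicitly.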

---
Even though we will largely use spaCy, let's briefly discuss NLTK: NLTK uses these chunks as part of a tree-like system to do its tagging, though it also has a tagger which follows an IOB system. Here are some code snippets explaining how to use both, and how to convert between them:


In [1]:
from nltk.chunk import conlltags2tree, tree2conlltags 
from nltk import pos_tag 
from nltk import word_tokenize
from nltk.chunk import ne_chunk

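These are our imports. The `conlltags2tree` and `tree2conlltags` helpers convert between NLTK's tree representation and CoNLL-style IOB tags (the format is named after the CoNLL conference). Since tokenizing, POS-tagging and chunking are handled by `word_tokenize`, `pos_tag` and `ne_chunk`, all that is left is to call `tree2conlltags` to see the IOB tags, and `conlltags2tree` to turn them back into a tree.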

In [3]:
sentence = "Clement and Mathieu are working at Apple." 
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence))) 

iob_tagged = tree2conlltags(ne_tree)
iob_tagged

[('Clement', 'NN', 'B-GPE'),
 ('and', 'CC', 'O'),
 ('Mathieu', 'NNP', 'B-PERSON'),
 ('are', 'VBP', 'O'),
 ('working', 'VBG', 'O'),
 ('at', 'IN', 'O'),
 ('Apple', 'NNP', 'B-ORGANIZATION'),
 ('.', '.', 'O')]

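Notice how we first tokenized our sentence, then POS-tagged it, and finally passed it to `ne_chunk`, which builds the tree; `tree2conlltags` then flattens that tree into IOB triples. Each word in the output is tagged with both its part of speech and its named entity class. The tagger is not perfect: here it labels Clement as a GPE rather than a PERSON.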

In [7]:
ne_tree = conlltags2tree(iob_tagged)

print(ne_tree)

(S
  (GPE Clement/NN)
  and/CC
  (PERSON Mathieu/NNP)
  are/VBP
  working/VBG
  at/IN
  (ORGANIZATION Apple/NNP)
  ./.)
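
To see what the IOB triples encode without going through the tree, the following sketch groups them into `(entity text, entity type)` pairs using plain Python. The function name `iob_to_entities` is hypothetical, and the input simply repeats the triples printed above.

```python
def iob_to_entities(iob_tagged):
    """Group (word, pos, iob_tag) triples into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, _pos, tag in iob_tagged:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type)
        if starts_new:
            # close any open chunk, then start a new one
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-"):
            current_words.append(word)       # continue the open chunk
        else:
            # "O": close any open chunk
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


iob_tagged = [('Clement', 'NN', 'B-GPE'),
              ('and', 'CC', 'O'),
              ('Mathieu', 'NNP', 'B-PERSON'),
              ('are', 'VBP', 'O'),
              ('working', 'VBG', 'O'),
              ('at', 'IN', 'O'),
              ('Apple', 'NNP', 'B-ORGANIZATION'),
              ('.', '.', 'O')]
entities = iob_to_entities(iob_tagged)
print(entities)
```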


## NER-tagging with spaCy

In [8]:
import spacy
nlp = spacy.load('en')

In [9]:
sent_0 = nlp(u'Donald Trump visited at the government headquarters in France today.') 
 
sent_1 = nlp(u'Emmanuel Jean-Michel Frédéric Macron is a French politician serving as President of France and ex officio Co-Prince of Andorra since 14 May 2017.') 
 
sent_2 = nlp(u"He studied philosophy at Paris Nanterre University, completed a Master's of Public Affairs at Sciences Po, and graduated from the École nationale d'administration (ÉNA) in 2004.") 
 
sent_3 = nlp(u'He worked at the Inspectorate General of Finances, and later became an investment banker at Rothschild & Cie Banque.')

In [12]:
# sent_0
for token in sent_0:
    print((token.text, token.ent_type_))

('Donald', 'PERSON')
('Trump', 'PERSON')
('visited', '')
('at', '')
('the', '')
('government', '')
('headquarters', '')
('in', '')
('France', 'GPE')
('today', 'DATE')
('.', '')


In [13]:
# sent_1
for token in sent_1:
    print((token.text, token.ent_type_))

('Emmanuel', 'PERSON')
('Jean', 'PERSON')
('-', 'PERSON')
('Michel', 'PERSON')
('Frédéric', 'PERSON')
('Macron', 'PERSON')
('is', '')
('a', '')
('French', 'NORP')
('politician', '')
('serving', '')
('as', '')
('President', '')
('of', '')
('France', 'GPE')
('and', '')
('ex', '')
('officio', '')
('Co', 'LOC')
('-', 'LOC')
('Prince', 'LOC')
('of', 'LOC')
('Andorra', 'LOC')
('since', '')
('14', 'DATE')
('May', 'DATE')
('2017', 'DATE')
('.', '')


In [14]:
# sent_2
for token in sent_2:
    print((token.text, token.ent_type_))

('He', '')
('studied', '')
('philosophy', '')
('at', '')
('Paris', 'ORG')
('Nanterre', 'ORG')
('University', 'ORG')
(',', '')
('completed', '')
('a', '')
('Master', 'WORK_OF_ART')
("'s", 'WORK_OF_ART')
('of', 'WORK_OF_ART')
('Public', 'WORK_OF_ART')
('Affairs', 'WORK_OF_ART')
('at', 'WORK_OF_ART')
('Sciences', 'WORK_OF_ART')
('Po', 'WORK_OF_ART')
(',', '')
('and', '')
('graduated', '')
('from', '')
('the', '')
('École', '')
('nationale', '')
("d'administration", 'PRODUCT')
('(', 'PRODUCT')
('ÉNA', 'ORG')
(')', '')
('in', '')
('2004', 'DATE')
('.', '')


In [15]:
# sent_3
for token in sent_3:
    print((token.text, token.ent_type_))

('He', '')
('worked', '')
('at', '')
('the', 'ORG')
('Inspectorate', 'ORG')
('General', 'ORG')
('of', 'ORG')
('Finances', 'ORG')
(',', '')
('and', '')
('later', '')
('became', '')
('an', '')
('investment', '')
('banker', '')
('at', '')
('Rothschild', 'ORG')
('&', 'ORG')
('Cie', 'ORG')
('Banque', 'ORG')
('.', '')


In [25]:
for ent in sent_3.ents:
    print((ent.text, ent.label_))

('the Inspectorate General of Finances', 'ORG')
('Rothschild & Cie Banque', 'ORG')


## Training our own NER-taggers


In [33]:
#!/usr/bin/env python
# coding: utf8
"""Example of training spaCy's named entity recognizer, starting off with an
existing model or a blank model.

For more details, see the documentation:
* Training: https://spacy.io/usage/training
* NER: https://spacy.io/usage/linguistic-features#named-entities

Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy


# training data
TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))

def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
        print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
            print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

In [36]:
if __name__ == '__main__':
    plac.call(main)

    # Expected output:
    # Entities [('Shaka Khan', 'PERSON')]
    # Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3),
    # ('Khan', 'PERSON', 1), ('?', '', 2)]
    # Entities [('London', 'LOC'), ('Berlin', 'LOC')]
    # Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3),
    # ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
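
The `entities` annotations in `TRAIN_DATA` are character offsets into the text, with the end offset exclusive. Since off-by-one mistakes are easy to make when writing these by hand, here is a small sketch for computing and checking them. The helpers `entity_offsets` and `annotated_strings` are hypothetical names introduced for this illustration; they are not part of spaCy.

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_offsets(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError("%r not found in %r" % (phrase, text))
    return (start, start + len(phrase), label)


def annotated_strings(text, annotations):
    """Slice the text with each annotated span to show what it covers."""
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]


# Slicing with the offsets should give back exactly the entity strings.
checked = [annotated_strings(text, ann) for text, ann in TRAIN_DATA]
print(checked)

# Offsets can also be derived from the phrase instead of counted by hand.
print(entity_offsets("Who is Shaka Khan?", "Shaka Khan", "PERSON"))
```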

