<h2 align='center'>NLP Tutorial: Named Entity Recognition (NER)</h2>

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [3]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [4]:
from spacy import displacy

displacy.render(doc, style="ent")

<h3>List all the entity labels</h3>

In [5]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

The entity labels are also documented on this page: https://spacy.io/models/en

In [6]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | GPE | Countries, cities, states
1982 | DATE | Absolute or relative dates or periods


In [7]:
displacy.render(doc, style="ent")

In the output above, spaCy made a mistake: it labeled the second Bloomberg (the company) as GPE instead of ORG. Let's try a Hugging Face model on the same sentence.

https://huggingface.co/dslim/bert-base-NER?text=Michael+Bloomberg+founded+Bloomberg+in+1982

On that page, also go through the three sample examples for NER.

In [10]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

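# Use a separate name so the spaCy `nlp` object loaded earlier is not overwritten
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Michael Bloomberg founded Bloomberg in 1982"

ner_results = ner_pipeline(example)
print(ner_results)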

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'entity': 'B-PER', 'score': np.float32(0.9997526), 'index': 1, 'word': 'Michael', 'start': 0, 'end': 7}, {'entity': 'I-PER', 'score': np.float32(0.9995964), 'index': 2, 'word': 'Bloomberg', 'start': 8, 'end': 17}, {'entity': 'B-ORG', 'score': np.float32(0.9982083), 'index': 4, 'word': 'Bloomberg', 'start': 26, 'end': 35}]


In [11]:
ner_results

[{'entity': 'B-PER',
  'score': np.float32(0.9997526),
  'index': 1,
  'word': 'Michael',
  'start': 0,
  'end': 7},
 {'entity': 'I-PER',
  'score': np.float32(0.9995964),
  'index': 2,
  'word': 'Bloomberg',
  'start': 8,
  'end': 17},
 {'entity': 'B-ORG',
  'score': np.float32(0.9982083),
  'index': 4,
  'word': 'Bloomberg',
  'start': 26,
  'end': 35}]
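
The result above is token-level: `B-PER` marks the beginning of a person entity and `I-PER` a continuation of it, so "Michael" and "Bloomberg" come back as two separate items. A minimal sketch of how these tags could be merged into whole entities is shown below. The helper name `merge_bio_tokens` is hypothetical (it is not part of transformers), and the input is copied from the output above with the scores left out. The transformers pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping.

```python
def merge_bio_tokens(text, token_results):
    """Hypothetical helper: group B-/I- token predictions into whole entities."""
    entities = []
    for tok in token_results:
        # "B-PER" -> prefix "B", label "PER"
        prefix, _, label = tok["entity"].partition("-")
        continues_previous = bool(
            prefix == "I" and entities and entities[-1]["label"] == label
        )
        if continues_previous:
            # Extend the current entity to cover this token
            entities[-1]["end"] = tok["end"]
        else:
            entities.append({"label": label, "start": tok["start"], "end": tok["end"]})
    # Recover each entity's text from the character offsets
    for ent in entities:
        ent["text"] = text[ent["start"]:ent["end"]]
    return entities


example = "Michael Bloomberg founded Bloomberg in 1982"
# Token-level results copied from the output above (scores omitted)
token_results = [
    {"entity": "B-PER", "word": "Michael", "start": 0, "end": 7},
    {"entity": "I-PER", "word": "Bloomberg", "start": 8, "end": 17},
    {"entity": "B-ORG", "word": "Bloomberg", "start": 26, "end": 35},
]

merged = merge_bio_tokens(example, token_results)
for ent in merged:
    print(ent["text"], "|", ent["label"])
```

With this grouping, the person name is a single PER entity and the company is a separate ORG entity.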

In [13]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)

Tesla Inc  |  ORG  |  0 | 9
Twitter Inc  |  PERSON  |  30 | 41
$45 billion  |  MONEY  |  46 | 57
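
`ent.start_char` and `ent.end_char` are character offsets into the original string, with the end offset exclusive. As a quick check that needs no model, the sketch below slices the sentence with the offsets printed above; the variable names `reported` and `recovered` are only for illustration.

```python
text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"

# (start_char, end_char, label) triples copied from the output above
reported = [(0, 9, "ORG"), (30, 41, "PERSON"), (46, 57, "MONEY")]

# Slicing with the offsets gives back each entity's text
recovered = [(text[start:end], label) for start, end, label in reported]
for span_text, label in recovered:
    print(span_text, "|", label)
```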


<h3>Setting custom entities</h3>

In [14]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  PERSON
$45 billion  |  MONEY


In [15]:
s = doc[2:5]
s

going to acquire

In [16]:
type(s)

spacy.tokens.span.Span

In [17]:
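from spacy.tokens import Span

# Token 0 is "Tesla" and token 5 is "Twitter"; the end index is exclusive
s1 = Span(doc, 0, 1, label="ORG")
s2 = Span(doc, 5, 6, label="ORG")

# default="unmodified" leaves entities outside the given spans as they were
doc.set_ents([s1, s2], default="unmodified")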

In [18]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY
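
Counting token indices by hand is easy to get wrong on longer sentences. One alternative, sketched here under a few assumptions, is to locate the text by character offset and let `doc.char_span` build the span. The helper `label_substring` is hypothetical, and the example uses `spacy.blank("en")` (a tokenizer-only pipeline) rather than `en_core_web_sm`, so `doc.ents` starts out empty and only the custom entities appear.

```python
import spacy

# A blank English pipeline only tokenizes, so no trained model is required
blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("Tesla is going to acquire Twitter for $45 billion")


def label_substring(doc, substring, label):
    """Hypothetical helper: labeled Span for the first match, or None."""
    start = doc.text.find(substring)
    if start == -1:
        return None
    # char_span returns None if the offsets do not align with token boundaries
    return doc.char_span(start, start + len(substring), label=label)


custom = [label_substring(blank_doc, name, "ORG") for name in ("Tesla", "Twitter")]
custom = [span for span in custom if span is not None]
blank_doc.set_ents(custom)

# Collect (text, label, start token, end token) for each entity
custom_ents = [(ent.text, ent.label_, ent.start, ent.end) for ent in blank_doc.ents]
print(custom_ents)
```

The token positions it finds, 0 to 1 and 5 to 6, are the same indices passed to `Span` by hand above.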
