<a href="https://colab.research.google.com/github/danie-bit/nlp-learnings/blob/main/8_NER/nlp_tutorial_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2 align='center'>NLP Tutorial: Named Entity Recognition (NER)</h2>

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [3]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [5]:
from spacy import displacy

# "twitter" is not tagged here. The NER model is statistical, and cues such as
# capitalization or a suffix like "Inc" make an organization easier to recognise.
displacy.render(doc, style="ent")

In [13]:
doc = nlp("Tesla Inc is going to acquire TwitterInc for $45 billion")
displacy.render(doc, style="ent")
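
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The following is a minimal sketch under some assumptions: it uses a blank English pipeline (tokenizer only, no trained model) and assigns the entities by hand, so the labels are illustrative and are not model predictions.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so no model download is needed (assumption for this sketch).
nlp = spacy.blank("en")
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")

# Entities are set manually here purely for illustration.
doc.ents = [
    Span(doc, 0, 2, label="ORG"),     # Tesla Inc
    Span(doc, 6, 8, label="ORG"),     # Twitter Inc
    Span(doc, 9, 12, label="MONEY"),  # $45 billion
]

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(len(html), "characters of HTML")
```

The returned string can then be written to a file or embedded in a web page.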

<h3>List down all the entities</h3>

In [22]:
lis = nlp.pipe_labels['ner']

In [24]:
for i in lis:
  print(i,'|',spacy.explain(i))

CARDINAL | Numerals that do not fall under another type
DATE | Absolute or relative dates or periods
EVENT | Named hurricanes, battles, wars, sports events, etc.
FAC | Buildings, airports, highways, bridges, etc.
GPE | Countries, cities, states
LANGUAGE | Any named language
LAW | Named documents made into laws.
LOC | Non-GPE locations, mountain ranges, bodies of water
MONEY | Monetary values, including unit
NORP | Nationalities or religious or political groups
ORDINAL | "first", "second", etc.
ORG | Companies, agencies, institutions, etc.
PERCENT | Percentage, including "%"
PERSON | People, including fictional
PRODUCT | Objects, vehicles, foods, etc. (not services)
QUANTITY | Measurements, as of weight or distance
TIME | Times smaller than a day
WORK_OF_ART | Titles of books, songs, etc.


The entity labels are also documented on this page: https://spacy.io/models/en
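
`spacy.explain` is a glossary lookup, so it should also work without loading a pipeline. A small sketch, with the label subset chosen arbitrarily for illustration:

```python
import spacy

# Arbitrary subset of labels, picked only for this example.
labels = ["ORG", "GPE", "MONEY", "PERSON"]

# Map each label to its glossary description.
descriptions = {label: spacy.explain(label) for label in labels}

for label, text in descriptions.items():
    print(f"{label:<8}| {text}")
```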

In [25]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | GPE | Countries, cities, states
1982 | DATE | Absolute or relative dates or periods


Above, the model mislabeled Bloomberg (the company) as a GPE. Let's try a Hugging Face model on the same sentence:

https://huggingface.co/dslim/bert-base-NER?text=Michael+Bloomberg+founded+Bloomberg+in+1982

On that page, also go through the 3 sample examples for NER.

In [29]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char," | ", ent.start, "|", ent.end)

Tesla Inc  |  ORG  |  0 | 9  |  0 | 2
Twitter Inc  |  PERSON  |  30 | 41  |  6 | 8
$45 billion  |  MONEY  |  46 | 57  |  9 | 12
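
The output above mixes two kinds of offsets: `start_char`/`end_char` index into the raw string, while `start`/`end` index into the tokens of the `Doc`. A minimal sketch of how they relate, assuming a blank English pipeline (tokenizer only) and using `Doc.char_span` to convert character offsets into a token span:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained components (assumption for this sketch)
text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"
doc = nlp(text)

# Character offsets slice the original string.
print(text[30:41])

# Token offsets slice the Doc and yield a Span.
span = doc[6:8]
print(span.text, span.start_char, span.end_char)

# char_span goes the other way: from character offsets to a token-aligned Span.
# It returns None if the offsets do not fall on token boundaries.
from_chars = doc.char_span(30, 41, label="ORG")
print(from_chars.start, from_chars.end, from_chars.label_)
```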


<h3>Setting custom entities</h3>

In [30]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  PERSON
$45 billion  |  MONEY


In [31]:
s = doc[2:5]
s

going to acquire

In [32]:
type(s)

spacy.tokens.span.Span

In [36]:
from spacy.tokens import Span

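s1 = Span(doc, 0, 1, label="ORG")  # token 0: "Tesla"
s2 = Span(doc, 5, 6, label="ORG")  # token 5: "Twitter"

# default="unmodified" keeps every other existing entity as it is
doc.set_ents([s1, s2], default="unmodified")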

In [39]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY
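
The `default` argument of `Doc.set_ents` decides what happens to tokens that are not covered by the new spans. The sketch below compares `default="unmodified"` with the default value `"outside"`. It assumes a blank English pipeline and seeds one entity by hand; `make_doc` and `new_spans` are hypothetical helpers written for this example.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only (assumption for this sketch)


def make_doc():
    """Hypothetical helper: a Doc with one hand-set MONEY entity."""
    doc = nlp("Tesla is going to acquire Twitter for $45 billion")
    doc.ents = [Span(doc, 7, 10, label="MONEY")]  # $45 billion
    return doc


def new_spans(doc):
    """Hypothetical helper: the two ORG spans to add."""
    return [Span(doc, 0, 1, label="ORG"), Span(doc, 5, 6, label="ORG")]


# "unmodified": tokens outside the new spans keep their existing annotation.
kept = make_doc()
kept.set_ents(new_spans(kept), default="unmodified")

# "outside" (the default): tokens outside the new spans are marked as non-entities.
reset = make_doc()
reset.set_ents(new_spans(reset))

print([(e.text, e.label_) for e in kept.ents])
print([(e.text, e.label_) for e in reset.ents])
```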


In [45]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | GPE | Countries, cities, states
1982 | DATE | Absolute or relative dates or periods


In [46]:
from spacy.tokens import Span

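s1 = Span(doc, 0, 2, label="PERSON")  # tokens 0-1: "Michael Bloomberg"
s2 = Span(doc, 3, 4, label="ORG")     # token 3: "Bloomberg" (the company)

# Overwrite the GPE label on token 3 and leave the other entities untouched
doc.set_ents([s1, s2], default="unmodified")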

In [47]:
for ent in doc.ents:
  print(ent.text,'|',ent.label_,'|',spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | ORG | Companies, agencies, institutions, etc.
1982 | DATE | Absolute or relative dates or periods
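
The correction above can be wrapped in a small function. This is a sketch, not part of spaCy: `relabel` is a hypothetical helper, the pipeline is a blank English one, and the initial entities are set by hand to mimic the model output shown earlier.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only (assumption for this sketch)
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")

# Hand-set entities that mimic the earlier model output, including the GPE mistake.
doc.ents = [
    Span(doc, 0, 2, label="PERSON"),
    Span(doc, 3, 4, label="GPE"),
    Span(doc, 5, 6, label="DATE"),
]


def relabel(doc, start, end, label):
    """Hypothetical helper: set the label of tokens [start, end) and keep all other entities."""
    doc.set_ents([Span(doc, start, end, label=label)], default="unmodified")


relabel(doc, 3, 4, "ORG")

for ent in doc.ents:
    print(ent.text, "|", ent.label_)
```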
