spaCy NER examples, adapted from:   
https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/   
https://towardsdatascience.com/named-entity-recognition-ner-using-spacy-nlp-part-4-28da2ece57c6   

In [None]:
# !python -m spacy download en_core_web_sm

# example 1

In [None]:
import spacy
from spacy import displacy
NER = spacy.load("en_core_web_sm")

In [3]:
raw_text = "The Indian Space Research Organisation is the national space agency of India, \
    headquartered in Bengaluru. It operates under Department of Space which is directly \
    overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well."

In [4]:
raw_text2 = "The Mars Orbiter Mission (MOM), informally known as Mangalyaan, \
    was launched into Earth orbit on 5 November 2013 by the Indian Space Research Organisation (ISRO) \
    and has entered Mars orbit on 24 September 2014. India thus became the first country to enter \
    Mars orbit on its first attempt. It was completed at a record low cost of $74 million."

In [5]:
text = NER(raw_text2)

In [6]:
for word in text.ents:
    print(word.text,word.label_)

The Mars Orbiter Mission (MOM PRODUCT
Mangalyaan PERSON
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
$74 million MONEY


In [7]:
# list of ner labels
for ner_label in NER.get_pipe('ner').labels:
    print(f"* {ner_label}: {spacy.explain(ner_label)}")

* CARDINAL: Numerals that do not fall under another type
* DATE: Absolute or relative dates or periods
* EVENT: Named hurricanes, battles, wars, sports events, etc.
* FAC: Buildings, airports, highways, bridges, etc.
* GPE: Countries, cities, states
* LANGUAGE: Any named language
* LAW: Named documents made into laws.
* LOC: Non-GPE locations, mountain ranges, bodies of water
* MONEY: Monetary values, including unit
* NORP: Nationalities or religious or political groups
* ORDINAL: "first", "second", etc.
* ORG: Companies, agencies, institutions, etc.
* PERCENT: Percentage, including "%"
* PERSON: People, including fictional
* PRODUCT: Objects, vehicles, foods, etc. (not services)
* QUANTITY: Measurements, as of weight or distance
* TIME: Times smaller than a day
* WORK_OF_ART: Titles of books, songs, etc.


In [8]:
displacy.render(text,style="ent",jupyter=True)

# another example

In [49]:
# function to display basic entity info: 
def show_ents(doc):
    print(f"original doc: {doc.text}")
    if doc.ents: 
        for ent in doc.ents: 
            print(f"entity: {ent.text : >13} | start_char: {ent.start_char: 3} | end_char: {ent.end_char: 3} | label: {ent.label_} - {spacy.explain(ent.label_)}")
            # print(ent.label)  # entity type's hash value
            # print(ent.start)  # token span's start index position (word index)
            # print(ent.end)  # token span's stop index position (word index)
    else:
        print('No named entities found.')

In [50]:
doc1 = NER("Apple is looking at buying U.K. startup for $1 billion") 
show_ents(doc1)

original doc: Apple is looking at buying U.K. startup for $1 billion
entity:         Apple | start_char:   0 | end_char:   5 | label: ORG - Companies, agencies, institutions, etc.
entity:          U.K. | start_char:  27 | end_char:  31 | label: GPE - Countries, cities, states
entity:    $1 billion | start_char:  44 | end_char:  54 | label: MONEY - Monetary values, including unit


## document level

In [51]:
doc = NER("San Francisco considers banning sidewalk delivery robots") 
# document level 
for e in doc.ents: 
    print(e.text, e.start_char, e.end_char, e.label_) 

# OR 
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] 
print(ents)

San Francisco 0 13 GPE
[('San Francisco', 0, 13, 'GPE')]
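
The character offsets returned at document level can be used to slice or mark up the original string. The following is a minimal sketch, assuming entity tuples shaped like the `ents` list above; `annotate` is a hypothetical helper written for illustration and is not part of spaCy.

```python
def annotate(text, ents):
    """Wrap each entity in [text](LABEL) using its character offsets."""
    out = text
    # Work right to left so that earlier offsets stay valid after each insertion.
    for _, start, end, label in sorted(ents, key=lambda e: e[1], reverse=True):
        out = out[:start] + f"[{out[start:end]}]({label})" + out[end:]
    return out


sentence = "San Francisco considers banning sidewalk delivery robots"
ents = [("San Francisco", 0, 13, "GPE")]  # same shape as the list printed above
print(annotate(sentence, ents))
```

With a real `Doc`, the same tuples can be built with the list comprehension from the cell above.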


## token level

In [52]:
# token level 
# doc[0], doc[1], ... are the individual tokens of the doc. 

ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_] 
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
ent_considers = [doc[2].text, doc[2].ent_iob_, doc[2].ent_type_] 
print(ent_san) 
print(ent_francisco)
print(ent_considers)

# token.ent_iob_ indicates whether an entity starts, continues, or ends on the token
# I - Token is inside an entity. 
# O - Token is outside an entity. 
# B - Token is the beginning of an entity.

['San', 'B', 'GPE']
['Francisco', 'I', 'GPE']
['considers', 'O', '']
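
To make the IOB scheme concrete, the tags can be derived by hand from token-level entity spans. This is a pure-Python sketch under the assumption that spans are given as `(start, end, label)` token indices with an exclusive end, as in `ent.start` and `ent.end`; `iob_tags` is a hypothetical helper, not a spaCy function.

```python
def iob_tags(tokens, spans):
    """Return (token, iob, label) triples for token-index spans (end exclusive)."""
    # Every token starts outside any entity.
    tags = [(tok, "O", "") for tok in tokens]
    for start, end, label in spans:
        for i in range(start, end):
            # The first token of a span is B, every following token is I.
            prefix = "B" if i == start else "I"
            tags[i] = (tokens[i], prefix, label)
    return tags


tokens = ["San", "Francisco", "considers", "banning", "sidewalk", "delivery", "robots"]
spans = [(0, 2, "GPE")]  # "San Francisco" covers tokens 0 and 1
for row in iob_tags(tokens, spans)[:3]:
    print(list(row))
```

The first three rows correspond to the `ent_san`, `ent_francisco`, and `ent_considers` lists printed above.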


## User-Defined Named Entity and Adding it to a Span

### Example 1

In [53]:
doc = NER(u'Tesla to build a U.K. factory for $6 million.')
show_ents(doc)

original doc: Tesla to build a U.K. factory for $6 million.
entity:          U.K. | start_char:  17 | end_char:  21 | label: GPE - Countries, cities, states
entity:    $6 million | start_char:  34 | end_char:  44 | label: MONEY - Monetary values, including unit


In [54]:
from spacy.tokens import Span

In [55]:
# get the hash value of the ORG entity label
ORG = doc.vocab.strings[u"ORG"]
print(ORG)

# create a span for the new entity
new_ent = Span(doc=doc, start=0, end=1, label=ORG)
print(new_ent)

# add the entity to the existing doc object
doc.ents = list(doc.ents) + [new_ent]
print(doc.ents)

383
Tesla
(Tesla, U.K., $6 million)


In [56]:
show_ents(doc)

original doc: Tesla to build a U.K. factory for $6 million.
entity:         Tesla | start_char:   0 | end_char:   5 | label: ORG - Companies, agencies, institutions, etc.
entity:          U.K. | start_char:  17 | end_char:  21 | label: GPE - Countries, cities, states
entity:    $6 million | start_char:  34 | end_char:  44 | label: MONEY - Monetary values, including unit


## Adding Named Entities to All Matching Spans

In [57]:
doc = NER(u'Our company plans to introduce a new vacuum cleaner. If successful, the vacuum cleaner will be our first product.') 
show_ents(doc) 

original doc: Our company plans to introduce a new vacuum cleaner. If successful, the vacuum cleaner will be our first product.
entity:         first | start_char:  99 | end_char:  104 | label: ORDINAL - "first", "second", etc.


In [58]:
# Import PhraseMatcher and create a matcher object: 
from spacy.matcher import PhraseMatcher 
matcher = PhraseMatcher(NER.vocab)

In [59]:
# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [NER(text) for text in phrase_list]
print(phrase_list)
print(phrase_patterns)

['vacuum cleaner', 'vacuum-cleaner']
[vacuum cleaner, vacuum-cleaner]


In [60]:
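# Add the patterns to our matcher object.
# Patterns are passed as a list; the older add(key, None, *patterns) form is from spaCy v2 and is no longer accepted in v3.
matcher.add('newproduct', phrase_patterns)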

In [61]:
# Apply the matcher to our Doc object:
matches = matcher(doc)
#See what matches occur: 
matches 

[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]

In [62]:
# Here we create Spans from each match, and create named entities from them: 
from spacy.tokens import Span 
PROD = doc.vocab.strings[u'PRODUCT'] 
new_ents = [Span(doc, match[1], match[2],label=PROD) for match in matches]
new_ents

[vacuum cleaner, vacuum cleaner]

In [63]:
# Each match is (match_id, start, end): match[1] is the start token index and match[2] the stop token index (exclusive) in the doc. 
doc.ents = list(doc.ents) + new_ents 
show_ents(doc)

original doc: Our company plans to introduce a new vacuum cleaner. If successful, the vacuum cleaner will be our first product.
entity: vacuum cleaner | start_char:  37 | end_char:  51 | label: PRODUCT - Objects, vehicles, foods, etc. (not services)
entity: vacuum cleaner | start_char:  72 | end_char:  86 | label: PRODUCT - Objects, vehicles, foods, etc. (not services)
entity:         first | start_char:  99 | end_char:  104 | label: ORDINAL - "first", "second", etc.
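
Assigning to `doc.ents` only works when the spans do not overlap; spaCy raises an error otherwise. The sketch below shows one way to resolve overlaps before assignment, preferring longer spans. It works on plain `(start, end, label)` tuples, and `filter_overlaps` and the extra `(8, 9, "PRODUCT")` candidate are hypothetical, added only for illustration. For real `Span` objects, spaCy provides `spacy.util.filter_spans`, which applies a similar longest-first rule.

```python
def filter_overlaps(spans):
    """Keep longer spans first; drop any span sharing a token with a kept one."""
    kept = []
    taken = set()
    # Longest span first; ties are broken by the earliest start index.
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)


candidates = [
    (7, 9, "PRODUCT"),    # "vacuum cleaner" (first match)
    (14, 16, "PRODUCT"),  # "vacuum cleaner" (second match)
    (8, 9, "PRODUCT"),    # hypothetical overlapping candidate: "cleaner"
    (19, 20, "ORDINAL"),  # "first"
]
print(filter_overlaps(candidates))
```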


## counting entities

In [64]:
doc = NER(u"originally priced at $29.50, now it's marked down to five dollars")
show_ents(doc)

original doc: originally priced at $29.50, now it's marked down to five dollars
entity:         29.50 | start_char:  22 | end_char:  27 | label: MONEY - Monetary values, including unit
entity:  five dollars | start_char:  53 | end_char:  65 | label: MONEY - Monetary values, including unit


In [65]:
len([ent for ent in doc.ents if ent.label_ == "MONEY"])

2
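
When counts for every label are needed at once, `collections.Counter` avoids writing one filter per label. This is a small sketch that uses `(text, label)` pairs copied from the `show_ents` output above in place of a live `Doc`.

```python
from collections import Counter

# (text, label) pairs copied from the show_ents output above.
ents = [("29.50", "MONEY"), ("five dollars", "MONEY")]

# With a real Doc this would be: Counter(ent.label_ for ent in doc.ents)
label_counts = Counter(label for _, label in ents)
print(label_counts["MONEY"])
```

A `Counter` returns 0 for labels that never occur, so looking up a missing label does not raise a `KeyError`.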

## Visualizing NER

In [None]:
from spacy import displacy

In [72]:
doc = NER(u"Tesla to build a U.K. factory for $6 million. "
          u"originally priced at $29.50, now it's marked down to five dollars")
displacy.render(doc, style="ent", jupyter=True)


In [74]:
# sentence by sentence
for sent in doc.sents:
    displacy.render(sent, style="ent", jupyter=True)

In [79]:
# viewing specific entity types only
options = {"ents": ["MONEY"]}
displacy.render(doc, style="ent", jupyter=True, options = options)