**Named Entity Recognition (NER)** Using spaCy

The goal is to extract named entities from a text, such as people, organizations, locations, dates, and monetary values.



In [1]:
import spacy

In [6]:
!python -m spacy download en_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [7]:
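# Load the pipeline downloaded in the previous cell.
# spaCy model names follow the pattern lang_type_genre_size,
# e.g. en_core_web_md = English, core pipeline, web text, medium size.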
nlp = spacy.load('en_core_web_md')
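
The naming convention in the comment above can be unpacked mechanically. What follows is a minimal sketch that assumes a name has exactly four underscore-separated parts; `ModelName` and `parse_model_name` are hypothetical helpers written for illustration, not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"
    kind: str   # pipeline type, e.g. "core"
    genre: str  # kind of text the pipeline was trained on, e.g. "web"
    size: str   # package size, e.g. "sm", "md", "lg"


def parse_model_name(name: str) -> ModelName:
    """Split a name such as 'en_core_web_md' into its four components."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_type_genre_size, got {name!r}")
    return ModelName(*parts)


print(parse_model_name("en_core_web_md"))
```

The helper only reads the name; it says nothing about whether that package is actually installed.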

In [8]:
# Note: adjacent string literals are joined with no separator, so fragments
# without a trailing or leading space run together (e.g. "cutprices" in the output).
txt = ("Tesla (TSLA) cut prices on its full U.S. lineup for the third time in 2023, "
       "slashing up to $5,000 off of its EVs. Tom Narayan, RBC Capital Markets Lead Equity Analyst"
       " for Global Autos joined Yahoo Finance to discuss the cuts."
       "Narayan explained that Tesla’s lower costs makes it easier for the company to cut"
       "prices without sacrificing profits. “I do think that this strategy of cutting prices gonna lead"
       "to a higher sales, and fortunately for them, they do have some unique characteristics that make it so"
       "they don't have to sacrifice too much on profitability,” he told Yahoo Finance."
       )

In [9]:
nlp(txt)

Tesla (TSLA) cut prices on its full U.S. lineup for the third time in 2023, slashing up to $5,000 off of its EVs. Tom Narayan, RBC Capital Markets Lead Equity Analyst for Global Autos joined Yahoo Finance to discuss the cuts.Narayan explained that Tesla’s lower costs makes it easier for the company to cutprices without sacrificing profits. “I do think that this strategy of cutting prices gonna leadto a higher sales, and fortunately for them, they do have some unique characteristics that make it sothey don't have to sacrifice too much on profitability,” he told Yahoo Finance.
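
Several words in this output run together (`cuts.Narayan`, `cutprices`, `leadto`, `sothey`). That is how `txt` was assembled: Python concatenates adjacent string literals exactly as written, so each fragment needs its own trailing or leading space. A minimal illustration, using a shortened fragment of the same text:

```python
# Adjacent string literals are concatenated with nothing in between.
run_together = ("for the company to cut"
                "prices without sacrificing profits.")

# A trailing space on the first fragment keeps the words apart.
separated = ("for the company to cut "
             "prices without sacrificing profits.")

print(run_together)
print(separated)
```

Tokenization and entity boundaries depend on the exact characters, so results on a corrected string may differ slightly from the ones recorded in this notebook.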

In [10]:
doc = nlp(txt)

In [11]:
type(doc)

spacy.tokens.doc.Doc

In [13]:
from spacy import displacy

In [17]:
# Highlight the recognized entities inline in the notebook.
displacy.render(doc, style='ent', jupyter=True)

In [18]:
spacy.explain('GPE')  # geopolitical entity

'Countries, cities, states'

In [22]:
spacy.explain('ORG') # organization

'Companies, agencies, institutions, etc.'

In [23]:
doc

Tesla (TSLA) cut prices on its full U.S. lineup for the third time in 2023, slashing up to $5,000 off of its EVs. Tom Narayan, RBC Capital Markets Lead Equity Analyst for Global Autos joined Yahoo Finance to discuss the cuts.Narayan explained that Tesla’s lower costs makes it easier for the company to cutprices without sacrificing profits. “I do think that this strategy of cutting prices gonna leadto a higher sales, and fortunately for them, they do have some unique characteristics that make it sothey don't have to sacrifice too much on profitability,” he told Yahoo Finance.

In [24]:
help(doc)

Help on Doc object:

class Doc(builtins.object)
 |  Doc(Vocab vocab, words=None, spaces=None, user_data=None, *, tags=None, pos=None, morphs=None, lemmas=None, heads=None, deps=None, sent_starts=None, ents=None)
 |  A sequence of Token objects. Access sentences and named entities, export
 |      annotations to numpy arrays, losslessly serialize to compressed binary
 |      strings. The `Doc` object holds an array of `TokenC` structs. The
 |      Python-level `Token` and `Span` objects are views of this array, i.e.
 |      they don't own the data themselves.
 |  
 |      EXAMPLE:
 |          Construction 1
 |          >>> doc = nlp(u'Some text')
 |  
 |          Construction 2
 |          >>> from spacy.tokens import Doc
 |          >>> doc = Doc(nlp.vocab, words=["hello", "world", "!"], spaces=[True, False, False])
 |  
 |      DOCS: https://spacy.io/api/doc
 |  
 |  Methods defined here:
 |  
 |  __bytes__(...)
 |      Doc.__bytes__(self)
 |  
 |  __getitem__(...)
 |      Get a `Token

In [25]:
doc.ents  # tuple of the named-entity spans found in the document

(Tesla,
 U.S.,
 third,
 2023,
 up to $5,000,
 Tom Narayan,
 RBC Capital Markets Lead Equity Analyst,
 Global Autos,
 Yahoo Finance,
 Narayan,
 Tesla,
 Yahoo Finance)

In [27]:
help(doc.ents[0])

Help on Span object:

class Span(builtins.object)
 |  A slice from a Doc object.
 |  
 |  DOCS: https://spacy.io/api/span
 |  
 |  Methods defined here:
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getitem__(...)
 |      Get a `Token` or a `Span` object
 |      
 |      i (int or tuple): The index of the token within the span, or slice of
 |          the span to get.
 |      RETURNS (Token or Span): The token at `span[i]`.
 |      
 |      DOCS: https://spacy.io/api/span#getitem
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |      Return hash(self).
 |  
 |  __iter__(...)
 |      Iterate over `Token` objects.
 |      
 |      YIELDS (Token): A `Token` object.
 |      
 |      DOCS: https://spacy.io/api/span#iter
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(...)
 |      Get the number of tokens in the span.
 |      
 |      RETURN

In [28]:
doc.ents[0].label_

'ORG'

In [29]:
for ent in doc.ents:
  print(ent.label_)

ORG
GPE
ORDINAL
DATE
MONEY
PERSON
ORG
ORG
ORG
PERSON
ORG
ORG
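
A tally is often easier to read than a flat list of labels. The sketch below uses `collections.Counter`; the `labels` list is copied from the output above so that the snippet runs on its own, and with a live `doc` it would presumably be built as `[ent.label_ for ent in doc.ents]`.

```python
from collections import Counter

# Labels copied from the printed output above.
labels = ["ORG", "GPE", "ORDINAL", "DATE", "MONEY", "PERSON",
          "ORG", "ORG", "ORG", "PERSON", "ORG", "ORG"]

label_counts = Counter(labels)

# most_common() sorts by descending count.
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```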


In [30]:
for ent in doc.ents:
  print(f"{ent.label_} : {ent.text}")

ORG : Tesla
GPE : U.S.
ORDINAL : third
DATE : 2023
MONEY : up to $5,000
PERSON : Tom Narayan
ORG : RBC Capital Markets Lead Equity Analyst
ORG : Global Autos
ORG : Yahoo Finance
PERSON : Narayan
ORG : Tesla
ORG : Yahoo Finance
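
The same label/text pairs can be grouped by label in one pass. This is a minimal sketch with `collections.defaultdict`; the `pairs` list is copied from the output above for self-containment, and with a live `doc` it could be built as `[(ent.label_, ent.text) for ent in doc.ents]`.

```python
from collections import defaultdict

# (label, text) pairs copied from the printed output above.
pairs = [
    ("ORG", "Tesla"), ("GPE", "U.S."), ("ORDINAL", "third"), ("DATE", "2023"),
    ("MONEY", "up to $5,000"), ("PERSON", "Tom Narayan"),
    ("ORG", "RBC Capital Markets Lead Equity Analyst"), ("ORG", "Global Autos"),
    ("ORG", "Yahoo Finance"), ("PERSON", "Narayan"), ("ORG", "Tesla"),
    ("ORG", "Yahoo Finance"),
]

# Map each label to the list of entity texts that carry it.
by_label = defaultdict(list)
for label, text in pairs:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label}: {texts}")
```

Grouping makes questionable predictions easier to spot: here the `ORG` group contains "RBC Capital Markets Lead Equity Analyst", where a job title has been fused with the company name.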


In [36]:
org_list = []
for ent in doc.ents:
  if ent.label_ == 'ORG':
    org_list.append(ent.text)
print(org_list)

['Tesla', 'RBC Capital Markets Lead Equity Analyst', 'Global Autos', 'Yahoo Finance', 'Tesla', 'Yahoo Finance']


In [35]:
org_list = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
org_list

['Tesla',
 'RBC Capital Markets Lead Equity Analyst',
 'Global Autos',
 'Yahoo Finance',
 'Tesla',
 'Yahoo Finance']
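
Both versions keep repeated mentions ("Tesla" and "Yahoo Finance" each appear twice). If only the distinct organizations are needed, one option is to deduplicate while preserving first-seen order. A minimal sketch, with `org_list` copied from the output above so that it runs on its own:

```python
org_list = ['Tesla', 'RBC Capital Markets Lead Equity Analyst', 'Global Autos',
            'Yahoo Finance', 'Tesla', 'Yahoo Finance']

# dict keys keep insertion order (Python 3.7+), so this drops repeats
# while preserving the order of first appearance.
unique_orgs = list(dict.fromkeys(org_list))
print(unique_orgs)
```

A `set` would also remove the duplicates, but it does not guarantee any particular order.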