### Intro to Named Entity Recognition (NER) using spaCy

In [1]:
import spacy
from spacy import displacy

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
nlp = spacy.load('en_core_web_sm')


In [5]:
txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

In [7]:
# Run the spaCy pipeline on the text; the resulting Doc holds the recognized entities
doc = nlp(txt)

In [8]:
displacy.render(doc, style='ent')

#### Understanding entities

With just these few lines we get NER output that is not perfect, but pretty good. We are using the en_core_web_sm model, where en stands for English and sm for small.

The model identifies ARK as an organization (ORG) in most places, which is correct. It also classifies ETF (exchange-traded fund) as an organization, which is wrong (an ETF is a basket of securities traded on an exchange), but it's easy to see why the model makes this mistake. The other tag that appears is GPE. Its meaning isn't immediately clear, so we can get more information about any label using spacy.explain:

In [10]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

In [12]:
# Geopolitical entities (GPE)
spacy.explain('GPE')

'Countries, cities, states'

We now have a visual rendering of our tagged text, but it isn't much use programmatically. What we need is a way to extract the relevant entities (the organizations) from the text. To do that we can use doc.ents, which returns a tuple of all identified entities.

Each entity in this tuple has two attributes that we are interested in, label_ and text:

In [14]:
for entity in doc.ents:
    print(f'{entity.label_}: {entity.text}')

GPE: ARK
ORG: The Bear Cave](https://thebearcave.substack.com/p
ORG: ARK
ORG: ARK
ORG: ETF
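
The second entity above carries leftover Markdown link syntax (`](https://...`) because the raw text still contains a `[label](url)` link. One possible cleanup, sketched here as an assumption and not as part of the original notebook, is to replace each link with its label before passing the text to `nlp`. The helper name `strip_markdown_links` and the regular expression are illustrative.

```python
import re

# Matches Markdown inline links of the form [label](url).
# Assumes labels contain no "]" and URLs contain no ")",
# which holds for the link in the text above.
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def strip_markdown_links(text: str) -> str:
    """Replace each [label](url) with just its label (hypothetical helper)."""
    return MD_LINK.sub(r"\1", text)

# A shortened sample so the snippet runs on its own.
sample = ("written up very nicely by "
          "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). "
          "The risks come primarily from ARK's holdings.")
cleaned = strip_markdown_links(sample)
print(cleaned)
```

Calling `nlp(strip_markdown_links(txt))` gives the model cleaner input, although the labels it then predicts are not guaranteed to change.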


We're almost there. Now, we need to filter out any entities that are not ORG entities, and append those remaining ORGs to an organization list:

In [15]:
org_list = []

for entity in doc.ents:
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

In [16]:
org_list

['The Bear Cave](https://thebearcave.substack.com/p', 'ARK', 'ARK', 'ETF']
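
`org_list` keeps every mention, so `ARK` appears twice. If the goal is to see which organizations come up most often, one option is to count mentions with `collections.Counter` from the standard library. The list below is copied from the output above so that the snippet runs on its own; the variable names `org_counts` and `unique_orgs` are illustrative.

```python
from collections import Counter

# Copied from the org_list output above so this runs without the model.
org_list = ['The Bear Cave](https://thebearcave.substack.com/p', 'ARK', 'ARK', 'ETF']

# Count how many times each organization string was extracted.
org_counts = Counter(org_list)

# most_common() yields (text, count) pairs sorted by count, highest first.
for org, count in org_counts.most_common():
    print(f'{org}: {count}')

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_orgs = list(dict.fromkeys(org_list))
```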

#### NER Assignment

In [17]:
txt = "Apple reached an all-time high stock price of 143 dollars this January"

In [18]:
test = nlp(txt)

In [19]:
displacy.render(test, style='ent')

In [21]:
org_list = []

for entity in test.ents:
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

In [22]:
org_list

['Apple']
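
The filtering loop has now been written twice, once for `doc` and once for `test`. A small helper is one way to avoid the repetition. The name `entities_with_label` is an assumption for illustration; it relies only on each entity having the `label_` and `text` attributes used above. To keep the snippet runnable without downloading a model, it is exercised on a stand-in object that mimics `doc.ents`; in the notebook you would pass `doc` or `test` directly.

```python
from collections import namedtuple
from types import SimpleNamespace

def entities_with_label(doc, label='ORG'):
    """Return the text of every entity in doc.ents tagged with `label`."""
    return [entity.text for entity in doc.ents if entity.label_ == label]

# Stand-in for a spaCy Doc (not a real Doc), holding a few of the
# entities printed earlier for the ARK text.
Entity = namedtuple('Entity', ['label_', 'text'])
fake_doc = SimpleNamespace(ents=(
    Entity('GPE', 'ARK'),
    Entity('ORG', 'ARK'),
    Entity('ORG', 'ETF'),
))

print(entities_with_label(fake_doc))         # entities tagged ORG
print(entities_with_label(fake_doc, 'GPE'))  # entities tagged GPE
```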