# Named Entity Recognition (NER) with spaCy

We will be performing NER on threads from the **Investing** subreddit. First, let's test spaCy's named entity recognition on an example post from */r/investing*.

In [1]:
import spacy
from spacy import displacy

In [11]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

In [5]:
doc = nlp(txt)

In [6]:
displacy.render(doc, style='ent')
# displacy.serve(doc, style='ent') if not running in a notebook

Right away we get NER that is not perfect, but pretty good. We are using the [`en_core_web_sm`](https://spacy.io/models/en) model, where `en` refers to English and `sm` to small.

The model correctly identifies ARK as an organization. It also classifies ETF (exchange-traded fund) as an organization, which is incorrect (an ETF is a basket of securities that trades on an exchange), although it's easy to see why the model makes this mistake. The other tag we can see is `WORK_OF_ART`. It isn't obvious what this means, so we can get more information using `spacy.explain`:

In [7]:
spacy.explain('WORK_OF_ART')

'Titles of books, songs, etc.'

This description fits the tagged item reasonably well: it refers to an article, which is a titled work even though it is not a book.

We have a visual output from our tagged text, but it isn't particularly useful programmatically. What we need is a way to extract the relevant tags (the organizations) from our text. To do that we can use `doc.ents`, which returns a tuple of all identified entities.

Each entity has two attributes that we are interested in, `label_` and `text`:

In [8]:
for entity in doc.ents:
    print(f"{entity.label_}: {entity.text}")

ORG: ARK
ORG: Bear
ORG: ARK
ORG: ARK


We're almost there. Now we need to filter out any entities that are not `ORG` entities and append the remaining `ORG`s to an organization list:

In [9]:
# initialize our list
org_list = []

for entity in doc.ents:
    # if label_ is ORG, we append text, otherwise ignore
    if entity.label_ == 'ORG':
        org_list.append(entity.text)

org_list

['ARK', 'Bear', 'ARK', 'ARK']
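
The filtering loop above can also be written as a list comprehension wrapped in a small helper, which makes it easy to reuse for other labels. This is only a minimal sketch: `extract_by_label`, the `Entity` stand-in, and `sample_ents` are hypothetical names introduced here so the block runs without a loaded model. In the notebook you would pass `doc.ents` instead, since spaCy entities expose the same `label_` and `text` attributes.

```python
from collections import namedtuple

# Stand-in for a spaCy entity so this sketch runs without a model.
# Real entities from `doc.ents` expose the same `label_` and `text` attributes.
Entity = namedtuple("Entity", ["label_", "text"])


def extract_by_label(entities, label="ORG"):
    """Return the text of every entity whose label matches `label`."""
    return [ent.text for ent in entities if ent.label_ == label]


# Illustrative entities (an assumption, not real model output): the ORG items
# mirror the notebook output, plus one non-ORG item to show the filter working.
sample_ents = [
    Entity("ORG", "ARK"),
    Entity("WORK_OF_ART", "The Bear Cave"),
    Entity("ORG", "Bear"),
    Entity("ORG", "ARK"),
    Entity("ORG", "ARK"),
]

orgs = extract_by_label(sample_ents)
print(orgs)  # ['ARK', 'Bear', 'ARK', 'ARK']
```

With a real document, the equivalent call would be `extract_by_label(doc.ents)`.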

In [10]:
# we don't need to see 'ARK' three times, so we use set() to remove duplicates
# and then convert back to a list (note that set() does not preserve order)
org_list = list(set(org_list))

org_list

['Bear', 'ARK']
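
Note that `set()` does not keep the original order, which is why `'Bear'` comes before `'ARK'` in the output. If the order of first appearance matters, or if we want to know how often each organization is mentioned, one possible alternative is sketched below. It assumes the same `org_list` produced by the filtering step; `unique_orgs` and `org_counts` are names introduced here for illustration.

```python
from collections import Counter

# The ORG strings produced by the filtering step (before deduplication).
org_list = ['ARK', 'Bear', 'ARK', 'ARK']

# dict keys keep insertion order (Python 3.7+), so this removes duplicates
# while preserving the order in which each organization first appeared.
unique_orgs = list(dict.fromkeys(org_list))

# Counter tallies how many times each organization was mentioned.
org_counts = Counter(org_list)

print(unique_orgs)               # ['ARK', 'Bear']
print(org_counts.most_common())  # [('ARK', 3), ('Bear', 1)]
```

Mention counts can be handy later when aggregating over many threads, since they show which organizations come up most often rather than only which ones appear.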