
# Named entity recognition with spaCy

In [1]:
!pip install spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [10]:
txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

In [2]:
import spacy



We need to download a model and then load it.

In [None]:
!python -m spacy download en_core_web_sm

## Model naming convention
spaCy model names follow this format:


```
[lang]_[type]_[genre]_[size]
```

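[type]: the capabilities of the pipeline. Some types may support only part of a pipeline, such as the vocabulary; the `core` type used here is a general-purpose pipeline.

[size]: the package size, one of `sm` (small), `md` (medium), or `lg` (large).

For `en_core_web_sm`, [lang] is `en` (English) and [genre] is `web`, the kind of text the pipeline was trained on.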
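As a small illustration of the naming scheme above, the sketch below splits a model name into its four parts. The helper `parse_model_name` is hypothetical (it is not part of spaCy) and assumes the name has exactly four underscore-separated fields.

```python
def parse_model_name(name: str) -> dict:
    """Split a model name of the form [lang]_[type]_[genre]_[size].

    Hypothetical helper for illustration only; not a spaCy function.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated parts, got {len(parts)}: {name!r}")
    lang, type_, genre, size = parts
    return {"lang": lang, "type": type_, "genre": genre, "size": size}


parsed = parse_model_name("en_core_web_sm")
print(parsed)
# {'lang': 'en', 'type': 'core', 'genre': 'web', 'size': 'sm'}
```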



In [8]:
nlp = spacy.load('en_core_web_sm')

In [12]:
doc = nlp(txt)

In [13]:
# this is called a doc object
type(doc)

spacy.tokens.doc.Doc

In [18]:
# visualize the entities found in the doc with displacy
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

# Extracting entities


In [16]:
spacy.explain('ORG') # returns a short description of what the ORG label means

'Companies, agencies, institutions, etc.'

In [17]:
spacy.explain('GPE') # geo-political entity

'Countries, cities, states'

In [19]:
doc.ents

(ARK,
 The Bear Cave](https://thebearcave.substack.com/p/special-edition,
 ARK,
 ARK)

In [20]:
type(doc.ents[0]) # this is a Span object, which also has multiple attributes and methods

spacy.tokens.span.Span

In [None]:
help(doc.ents[0])

In [25]:
doc.ents[0].label_, doc.ents[0].text, type(doc.ents[0].label_)

('ORG', 'ARK', str)

In [26]:
for entity in doc.ents:
  print(f'{entity.label_}: {entity.text}')

ORG: ARK
WORK_OF_ART: The Bear Cave](https://thebearcave.substack.com/p/special-edition
ORG: ARK
ORG: ARK
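The `WORK_OF_ART` entity in the output above includes part of the Markdown link syntax from the input text. One possible way to avoid this is to strip the link markup before passing the text to `nlp`. The sketch below assumes simple inline links of the form `[label](url)`; the helper name `strip_markdown_links` is made up for this example.

```python
import re

# Matches an inline Markdown link: [label](url).
# Assumption: labels contain no "]" and URLs contain no ")".
MD_LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")


def strip_markdown_links(text: str) -> str:
    """Replace each [label](url) with just the label."""
    return MD_LINK.sub(r"\1", text)


sample = ("written up very nicely by "
          "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). "
          "The risks")
cleaned = strip_markdown_links(sample)
print(cleaned)
# written up very nicely by The Bear Cave. The risks
```

The cleaned string can then be passed to `nlp` in place of the raw text. Whether the model then labels the name differently is not something this sketch guarantees.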


Next, we initialize an empty list and append to it the text of every entity labeled as an organization (`ORG`).

In [30]:
org_list = []
for entity in doc.ents:
  if entity.label_ == 'ORG':
    org_list.append(entity.text)
    
org_list

['ARK', 'ARK', 'ARK']
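Because the same organization can appear several times, it may be more useful to count mentions than to keep a flat list. The sketch below uses `collections.Counter` for that. To keep it runnable without a downloaded model, it uses a stand-in `FakeEntity` tuple (an assumption made for this example) that mimics the `label_` and `text` attributes of a spaCy `Span`. The helper `count_entities` is also hypothetical.

```python
from collections import Counter, namedtuple


def count_entities(ents, label="ORG"):
    """Count how often each entity text occurs for the given label.

    `ents` can be any iterable of objects with `.label_` and `.text`,
    for example `doc.ents`.
    """
    return Counter(ent.text for ent in ents if ent.label_ == label)


# Stand-in for spaCy Span objects, mirroring the entities printed above.
FakeEntity = namedtuple("FakeEntity", ["label_", "text"])
sample_ents = [
    FakeEntity("ORG", "ARK"),
    FakeEntity("WORK_OF_ART", "The Bear Cave"),
    FakeEntity("ORG", "ARK"),
    FakeEntity("ORG", "ARK"),
]

org_counts = count_entities(sample_ents)
print(org_counts)
# Counter({'ARK': 3})
```

In the notebook, `count_entities(doc.ents)` should work the same way, since each entity in `doc.ents` exposes `label_` and `text`.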

In [35]:
# another example to try NER on
txt = "Apple reached an all-time high stock price of 143 dollars this January"

doc2 = nlp(txt)
displacy.render(doc2, style='ent', jupyter=True)

org_lst = []

for ent in doc2.ents:
  if ent.label_ == "ORG":
    org_lst.append(ent)

org_lst

[Apple]