<a href="https://colab.research.google.com/github/danie-bit/nlp-learnings/blob/main/5_spacy_lang_processing_pipeline/spacy_pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2 align="center">spaCy Language Processing Pipelines Tutorial</h2>

<h3>Blank nlp pipeline</h3>

In [1]:
import spacy

nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


The output above contains only tokens because we have a blank pipeline, as shown below. A pipeline starts with a tokenizer, followed by the components inside the dotted rectangle. In a blank pipeline that rectangle is empty, so nothing runs after tokenization.

<img height=300 width=400 src="https://github.com/codebasics/nlp-tutorials/blob/main/5_spacy_lang_processing_pipeline/spacy_blank_pipeline.jpg?raw=1" />

In [2]:
nlp.pipe_names

[]

nlp.pipe_names is an empty list, which means the pipeline has no components. A pipeline always starts with a tokenizer, which is not counted as a pipeline component.
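As a minimal sketch of the point above, the snippet below checks that a blank pipeline still has a tokenizer even though `pipe_names` is empty. It uses only `spacy.blank`, so no trained pipeline needs to be downloaded; the `has_annotation("TAG")` check is added here purely for illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

# The tokenizer is a separate attribute, not an entry in nlp.pipe_names.
print(type(nlp.tokenizer).__name__)
print(nlp.pipe_names)
print(len(doc))

# No tagger has run, so the Doc carries no part-of-speech tag annotation.
print(doc.has_annotation("TAG"))
```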

A more general diagram of an NLP pipeline may look like the one below.

<img height=300 width=400 src="https://github.com/codebasics/nlp-tutorials/blob/main/5_spacy_lang_processing_pipeline/spacy_loaded_pipeline.jpg?raw=1" />

<h3>Download trained pipeline</h3>

To download a trained pipeline, use a command such as:

python -m spacy download en_core_web_sm

This downloads the small (sm) pipeline for the English language.

Further instructions: https://spacy.io/usage/models#quickstart

In [4]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [5]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7d6f7b8ef530>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7d6f7c446f90>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7d6f7b9119a0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7d6f7d0c1c10>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7d6f7c5fb190>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7d6f7b9118c0>)]

The sm in en_core_web_sm stands for small. Other sizes are available as well, such as medium (md) and large (lg). See https://spacy.io/usage/models#quickstart

In [7]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)

Captain  |  proper noun  |  Captain
america  |  proper noun  |  america
ate  |  verb  |  eat
100  |  numeral  |  100
$  |  numeral  |  $
of  |  adposition  |  of
samosa  |  proper noun  |  samosa
.  |  punctuation  |  .
Then  |  adverb  |  then
he  |  pronoun  |  he
said  |  verb  |  say
I  |  pronoun  |  I
can  |  auxiliary  |  can
do  |  verb  |  do
this  |  pronoun  |  this
all  |  determiner  |  all
day  |  noun  |  day
.  |  punctuation  |  .


**Run the same code with a blank pipeline and check the output.**

In [9]:
nlp_bl = spacy.blank("en")
doc = nlp_bl("Captain america ate 100$ of samosa. Then he said I can do this all day.")
for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)

Captain  |  None  |  
america  |  None  |  
ate  |  None  |  
100  |  None  |  
$  |  None  |  
of  |  None  |  
samosa  |  None  |  
.  |  None  |  
Then  |  None  |  
he  |  None  |  
said  |  None  |  
I  |  None  |  
can  |  None  |  
do  |  None  |  
this  |  None  |  
all  |  None  |  
day  |  None  |  
.  |  None  |  
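The `None` values and empty lemmas above follow from the blank pipeline having no tagger or lemmatizer. The sketch below isolates this for a single token. The warning filter is an assumption added here, because `spacy.explain` may emit a warning for a term it does not know.

```python
import warnings

import spacy

nlp_bl = spacy.blank("en")
token = nlp_bl("samosa")[0]

# With no tagger or lemmatizer, both attributes are empty strings.
print(repr(token.pos_), repr(token.lemma_))

# spacy.explain looks the label up in a glossary; an unknown term
# (here the empty string) yields None.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    blank_explanation = spacy.explain(token.pos_)
print(blank_explanation)

# A known label returns its description.
print(spacy.explain("PROPN"))
```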




<h3>Named Entity Recognition</h3>

In [10]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:   # ents - entities
    print(ent.text,"|", ent.label_)

Tesla Inc | ORG
$45 billion | MONEY


In [12]:
from spacy import displacy

displacy.render(doc, style="ent" )
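To look at what `doc.ents` and displacy work with, without depending on a trained pipeline, here is a rough sketch that assigns the two entity spans by hand on a blank pipeline. The labels are copied from the output above and are not predicted by any model; `nlp_blank`, `org` and `money` are names chosen only for this illustration. Passing `jupyter=False` asks `displacy.render` to return the markup as a string instead of displaying it.

```python
import spacy
from spacy import displacy

nlp_blank = spacy.blank("en")
text = "Tesla Inc is going to acquire twitter for $45 billion"
doc = nlp_blank(text)

# Assumption: labels are assigned manually to mirror the output above.
money_start = text.index("$45 billion")
org = doc.char_span(0, len("Tesla Inc"), label="ORG")
money = doc.char_span(money_start, len(text), label="MONEY")
doc.ents = [org, money]

# Each entity is a Span with text, a label and character offsets.
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", ent.start_char, ent.end_char)

# Outside a notebook, the rendered HTML can be kept as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(len(html))
```

The returned string could then be written to an .html file and opened in a browser.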

<h3>Trained processing pipeline in French</h3>

In [13]:
# Loading the pipeline before it is installed raises the OSError shown below:
# nlp = spacy.load("fr_core_news_sm")

OSError: [E050] Can't find model 'fr_core_news_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

You need to install the trained pipeline for the French language using this command:

python -m spacy download fr_core_news_sm
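One way to avoid the E050 error is to check whether the pipeline package is installed before loading it. The sketch below uses `spacy.util.is_package` for that check; `load_if_installed` is a hypothetical helper written for this example, not part of spaCy.

```python
import spacy
from spacy.util import is_package


def load_if_installed(name):
    """Return the loaded pipeline, or None if the package is not installed."""
    if not is_package(name):
        print(f"'{name}' is not installed. Run: python -m spacy download {name}")
        return None
    return spacy.load(name)


# Prints the install hint if the French pipeline is missing, loads it otherwise.
nlp_fr = load_if_installed("fr_core_news_sm")
```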

In [16]:
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.3/16.3 MB 92.6 MB/s eta 0:00:00
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [17]:
nlp = spacy.load("fr_core_news_sm")

In [19]:
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  PER  |  Named person or family.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [20]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  VERB  |  twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


<h3>Adding a component to a blank pipeline</h3>

In [21]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

In [23]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text,"|", ent.label_)

Tesla Inc | ORG
$45 billion | MONEY


The image below shows a sentencizer component in the pipeline.



<img height=300 width=400 src="https://github.com/codebasics/nlp-tutorials/blob/main/5_spacy_lang_processing_pipeline/sentecizer.jpg?raw=1" />
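The pipeline in the image can be sketched in code by adding the built-in sentencizer to a blank pipeline. It is rule-based, so no trained pipeline is needed. The example sentence is reused from the start of this notebook.

```python
import spacy

nlp = spacy.blank("en")

# "sentencizer" is a built-in, rule-based component that marks sentence
# boundaries based on punctuation.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```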

<h3>Further reading</h3>

https://spacy.io/usage/processing-pipelines#pipelines