
In this notebook, we will learn about the various pipelines in spaCy. As we have seen, spaCy offers both heuristic (rule-based) and machine learning solutions for natural language processing. These solutions are implemented as pipes, the individual processing steps that a text passes through. In this notebook, you will learn about pipes and pipelines generally and the ones offered by spaCy specifically. In a later notebook, we will explore how you can create custom pipes and add them to a spaCy pipeline. Before we jump in, let’s import spaCy.

In [3]:
import spacy

![](http://spacy.pythonhumanities.com/_images/sample_pipeline.png)

In most cases, you will use an off-the-shelf spaCy model. In some cases, however, an off-the-shelf model will not meet your needs or will perform a specific task very slowly. A good example of this is sentence tokenization. Imagine you had a document around 1 million sentences long. Even the small English model would take a long time to process and separate those sentences, because every pipe in a pipeline runs unless you specify otherwise. Every pipe, from the dependency parser to named entity recognition, would be applied to your data, which is a serious waste of computational resources and time when all you want is sentence boundaries. In this situation, you would want to make a blank English model and simply add the Sentencizer to it. On a very large text, the small model can take tens of minutes or even hours; a blank model with only a Sentencizer can finish in seconds or minutes.

In [4]:
nlp = spacy.blank("en")

Here, notice that we have used spacy.blank, rather than spacy.load. When we create a blank model, we simply pass the two-letter code for a language, in this case, en for English. A blank model contains only a tokenizer and no other pipes. Now, let’s use the add_pipe() method to add a new pipe to it. We will simply add a sentencizer.

In [5]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7a1b01bff240>
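
Before running this on a large text, it can help to confirm what the blank pipeline now contains and how the sentencizer behaves on a short string. The following is a minimal sketch; the names demo_nlp, demo_doc, and demo_sentences and the sample text are illustrative assumptions, not part of the original notebook.

```python
import spacy

# demo_nlp is a throwaway name so the notebook's own `nlp` is left untouched.
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")

# A blank model lists only the pipes we added ourselves.
print(demo_nlp.pipe_names)

# The sentencizer marks sentence boundaries using punctuation rules.
demo_doc = demo_nlp("This is a sentence. This is another one.")
demo_sentences = [sent.text for sent in demo_doc.sents]
print(demo_sentences)
```

The pipe list should contain only the sentencizer, and the short text should split into two sentences.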

In [6]:
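import requests
from bs4 import BeautifulSoup
# Download the complete works of Shakespeare as one large test document.
s = requests.get("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
# Extract the text, rejoin words hyphenated across line breaks, and flatten newlines to spaces.
soup = BeautifulSoup(s.content).text.replace("-\n", "").replace("\n", " ")
# Raise spaCy's character limit so the whole text can be processed in one call.
nlp.max_length = 5278439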
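
The last line of the cell above raises nlp.max_length because spaCy refuses to process a text longer than that limit (one million characters by default, as far as I am aware). Rather than hard-coding a number, the limit can be derived from the text itself. This is a minimal sketch under that assumption; demo_nlp, demo_text, and demo_count are illustrative names, and the short string stands in for the downloaded text.

```python
import spacy

demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")

# Stand-in text; in the notebook this would be the `soup` string.
demo_text = "To be, or not to be. That is the question. " * 3

# Raise the limit only if the text is longer than the current setting.
demo_nlp.max_length = max(demo_nlp.max_length, len(demo_text))

demo_doc = demo_nlp(demo_text)
demo_count = len(list(demo_doc.sents))
print(demo_count)
```

Deriving the limit from len() keeps the cell correct if the source text changes size.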

In [7]:
%%time
doc = nlp(soup)
print(len(list(doc.sents)))

94134
CPU times: user 14.9 s, sys: 239 ms, total: 15.1 s
Wall time: 15.5 s


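The difference in time here is remarkable. Our text string was around 5.2 million characters. The blank model with just the Sentencizer completed its task in about 15.5 seconds of wall time and found around 94k sentences. The small English model, the most efficient one offered by spaCy, did the same task in 47 minutes and 15 seconds and found around 112k sentences (that run is recorded further down). The small English model, in other words, took approximately 180 times longer. The sentence counts differ because the two approaches work differently: the Sentencizer splits on punctuation rules, while the small model derives sentence boundaries from its dependency parser.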

Often you need to find sentences quickly, not necessarily accurately. In these instances, it makes sense to know tricks like the one above.
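
A related trick is to keep a larger pipeline but switch off the pipes you do not need for a given task. The sketch below uses spaCy's select_pipes context manager on a blank model so that it runs without downloading a trained model; the entity_ruler here is only a stand-in for an expensive pipe, and demo_nlp, active_inside, active_after, and demo_count are illustrative names. With a trained model, bear in mind that disabling the parser also removes its sentence boundaries, so another pipe such as the sentencizer would have to supply them.

```python
import spacy

demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")
demo_nlp.add_pipe("entity_ruler")  # stand-in for a pipe we do not need here

# Inside the block only the enabled pipe runs; the others are restored on exit.
with demo_nlp.select_pipes(enable=["sentencizer"]):
    active_inside = list(demo_nlp.pipe_names)
    demo_doc = demo_nlp("First sentence. Second sentence.")
    demo_count = len(list(demo_doc.sents))

active_after = list(demo_nlp.pipe_names)
print(active_inside, active_after, demo_count)
```

This keeps one pipeline object around while paying only for the pipes a particular job requires.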



In [6]:
nlp2 = spacy.load("en_core_web_sm")
nlp2.max_length = 5278439

In [1]:
"""
%%time
doc = nlp2(soup)
print(len(list(doc.sents)))

output:
112074
Wall time: 47min 15s
"""


In [2]:
"""
nlp2.analyze_pipes()

output:
{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'attrs': {'token.lemma': {'assigns': ['lemmatizer'], 'requires': []},
  'doc.sents': {'assigns': ['parser'], 'requires': []},
  'token.is_sent_start': {'assigns': ['parser'], 'requires': []},
  'token.dep': {'assigns': ['parser'], 'requires': []},
  'token.tag': {'assigns': ['tagger'], 'requires': []},
  'doc.ents': {'assigns': ['ner'], 'requires': []},
  'token.ent_iob': {'assigns': ['ner'], 'requires': []},
  'token.head': {'assigns': ['parser'], 'requires': []},
  'doc.tensor': {'assigns': ['tok2vec'], 'requires': []},
  'token.ent_type': {'assigns': ['ner'], 'requires': []}}}
  """

Note the dictionary structure returned by analyze_pipes(). It tells us not only what is inside the pipeline but also the order in which the pipes run. Each key under “summary” is a pipe, and its value is a dictionary describing that pipe. Every one of these dictionaries has an “assigns” key, which lists the attributes that the pipe sets on the token and doc as they pass through the pipeline. The “scores” key lists the metrics used to evaluate the pipe's machine learning model; it is empty for pipes that are not scored. We will learn more about model evaluation in our machine learning section below.
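
The same structure can be walked in code rather than read by eye. This is a minimal sketch that runs on a blank model with only a sentencizer, so it needs no trained model; demo_nlp and analysis are illustrative names, and the loop simply prints what each pipe assigns and how it is scored.

```python
import spacy

demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")

# analyze_pipes() returns a plain dictionary describing the pipeline.
analysis = demo_nlp.analyze_pipes()

# The keys of "summary" follow the order in which the pipes run.
for pipe_name, meta in analysis["summary"].items():
    print(pipe_name)
    print("  assigns:", meta["assigns"])
    print("  scores: ", meta["scores"])
```

Running the same loop on nlp2 would list the six pipes of the small English model in pipeline order.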



This notebook concludes part one of this book. It has given you an umbrella overview of spaCy. Over the next few parts of this book, we will deep dive into specific areas and use spaCy to solve general and domain-specific problems from several different areas of industry. Join me as we learn to create custom models and do custom things to leverage the full potential of the spaCy library.