*Here, we will learn about the various pipelines in spaCy. spaCy offers both heuristic (rule-based) and machine learning NLP solutions. These solutions are activated by pipes.*

*We will first look at pipes and pipelines in general, and later explore how to create custom pipes and add them to a spaCy pipeline.*

*Before we jump in, let's import spaCy.*

In [2]:
# Import spacy
import spacy

#### Standard Pipes Available from spaCy (Components and Factories)
*A pipeline is a sequence of pipes, or actors on data, that make alterations to the data or extract information from it. In some cases, later pipes require the output from earlier pipes.*
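
*As a rough illustration of this idea, a pipeline can be sketched in plain Python as a list of functions applied in order. This is not spaCy's actual implementation; the names ToyDoc, tokenize, count_tokens and run_pipeline are hypothetical and exist only for this sketch.*

```python
from typing import Callable, Dict, List

# A toy "document": a dict that each pipe reads from and writes to.
ToyDoc = Dict[str, object]
Pipe = Callable[[ToyDoc], ToyDoc]


def tokenize(doc: ToyDoc) -> ToyDoc:
    """First pipe: split the raw text on whitespace."""
    doc["tokens"] = str(doc["text"]).split()
    return doc


def count_tokens(doc: ToyDoc) -> ToyDoc:
    """Second pipe: needs the tokens produced by the earlier pipe."""
    if "tokens" not in doc:
        raise ValueError("count_tokens requires the output of tokenize")
    doc["n_tokens"] = len(doc["tokens"])
    return doc


def run_pipeline(text: str, pipes: List[Pipe]) -> ToyDoc:
    """Pass a document through each pipe in order."""
    doc: ToyDoc = {"text": text}
    for pipe in pipes:
        doc = pipe(doc)
    return doc


result = run_pipeline("spaCy pipelines are sequences of pipes", [tokenize, count_tokens])
print(result["tokens"], result["n_tokens"])
```

*Running count_tokens without tokenize in front of it raises an error, which mirrors the point that some pipes depend on the output of earlier ones.*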

> *Sample spaCy Pipeline for NER: Input_Sentence => Entity_Ruler => Entity_Linker => Output (Sentence with Entities Annotated)*

*In the above pipeline, two pipes are applied to an input sentence:*
*> 1) Entity Ruler: a rule-based named entity recognizer that finds entities, and*
*> 2) Entity Linker: identifies which real-world entity each mention refers to, for example to perform toponym resolution.*

*The sentence is then output with its entities annotated.*

*At this point, we could use the "doc.ents" attribute to find the entities in our sentence. To vectorize the input sentence, a Tok2Vec input layer can be used. This allows machine learning pipes to make predictions.*
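
*A minimal sketch of the first half of that pipeline is given here, assuming a blank English model and two made-up patterns. The Entity Linker is left out because it needs a trained knowledge base; the example sentence and the "GPE" patterns are illustrative only.*

```python
import spacy

# Start from a blank English pipeline so no trained model download is needed.
nlp = spacy.blank("en")

# Add a rule-based entity recognizer and give it two illustrative patterns.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "Paris"},
    {"label": "GPE", "pattern": "Berlin"},
])

doc = nlp("Alice flew from Paris to Berlin.")

# doc.ents holds the entity spans assigned by the pipes.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

*Here the ruler is the only pipe, so everything in doc.ents comes from the two patterns.*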

*The pipeline components (pipes) and matchers available in spaCy are listed below:*

##### Pipeline Components
*> AttributeRuler*
*> DependencyParser*
*> EntityLinker*
*> EntityRecognizer*
*> EntityRuler*
*> Lemmatizer*
*> Morphology*
*> SentenceRecognizer*
*> Sentencizer*
*> SpanCategorizer*
*> Tagger*
*> TextCategorizer*
*> Tok2Vec*
*> Tokenizer*
*> TrainablePipe*
*> Transformer*

##### Matchers
*> DependencyMatcher*
*> Matcher*
*> PhraseMatcher*
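
*To give a feel for the matchers, here is a small sketch using the token-based Matcher on a blank English model. The pattern name "HELLO_WORLD" and the example sentence are made up for illustration.*

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One pattern: the token "hello" followed by the token "world", case-insensitive.
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello world! I said hello World again.")

# Each match is a (match_id, start_token, end_token) triple.
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

*Unlike the pipes listed above, a matcher is called directly on a Doc rather than added to the pipeline with add_pipe.*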

#### How to Add Pipes
*Generally, we use an off-the-shelf spaCy model. Sometimes, however, such a model won't meet our needs, or it will perform a specific task very slowly on big data.*

*For instance, suppose we have a document containing 1 million sentences. Even with the small English model, processing those 1 million sentences could take hours, because every pipe in the pipeline is activated (unless specified otherwise), so each pipe from the Dependency Parser to NER is run on the data. If we only need sentence boundaries, this is a serious waste of computational resources and time.*
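
*One way to specify which pipes run is to disable the ones a task does not need. The sketch below uses a small stand-in pipeline built on a blank model (a trained model would contain more pipes) and switches one pipe off temporarily with select_pipes.*

```python
import spacy

# A small stand-in pipeline; a trained model would have more pipes.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")
all_pipes = list(nlp.pipe_names)

# Temporarily switch off a pipe that is not needed for this task.
with nlp.select_pipes(disable=["entity_ruler"]):
    active_pipes = list(nlp.pipe_names)
    doc = nlp("Only sentence boundaries are needed here. Nothing else runs.")
    n_sentences = len(list(doc.sents))

# Outside the block the disabled pipe is restored.
restored_pipes = list(nlp.pipe_names)
print(all_pipes, active_pipes, restored_pipes, n_sentences)
```

*With a trained model, pipes can also be left out at load time, for example spacy.load("en_core_web_sm", exclude=["ner"]), assuming that model is installed.*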

*Creating a blank model and simply adding a Sentencizer to it can reduce this time to mere minutes. To demonstrate this process, let's first create a blank model...*

In [13]:
# Create a blank model
nlp_blank = spacy.blank("en")

*Here, we use spacy.blank rather than spacy.load. To create an empty model, we simply pass the two-letter code for a language; here, "en" for English.*

*Now, let's add a pipe to it. We can simply add a sentencizer.*

In [14]:
nlp_blank.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x1c0ddc12140>
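
*To check that the pipe was really added, we can look at pipe_names before and after the call. The sketch below repeats the two steps on a fresh blank model (named nlp here so that it does not touch nlp_blank) and splits a two-sentence example text.*

```python
import spacy

nlp = spacy.blank("en")
pipes_before = list(nlp.pipe_names)  # a blank model has no pipes yet

nlp.add_pipe("sentencizer")
pipes_after = list(nlp.pipe_names)

doc = nlp("This is one sentence. This is another one.")
sentences = [sent.text for sent in doc.sents]
print(pipes_before, pipes_after, sentences)
```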

*Now, let's compare the blank model and a small model on a dataset.*

*First, we will apply the blank model (created above) and then a small model to compare the computation time.*

In [15]:
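# Download the complete works of Shakespeare as plain text
import requests
from bs4 import BeautifulSoup
source = requests.get("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
# Extract the text, rejoin words hyphenated across line breaks, and flatten newlines to spaces
soup = BeautifulSoup(source.content).text.replace("-\n","").replace("\n"," ")
# Raise spaCy's character limit (1,000,000 by default) so the whole text fits in one Doc
nlp_blank.max_length = 5278439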
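
*The max_length setting matters because spaCy refuses to process a text longer than this limit. The sketch below shows the check on a tiny stand-in text by lowering the limit artificially, then sizes the limit from the text itself rather than hard-coding a number. The example text and the limit of 10 are illustrative only.*

```python
import spacy

nlp = spacy.blank("en")
text = "A short text that stands in for a very long document."

# Artificially lower the limit so the check triggers on this tiny text.
nlp.max_length = 10
try:
    nlp(text)
    too_long = False
except ValueError:
    too_long = True

# Size the limit from the text itself instead of hard-coding a number.
nlp.max_length = len(text)
doc = nlp(text)
print(too_long, len(doc))
```

*Raising the limit only removes the safety check; very long texts still need enough memory, especially once a parser or NER pipe is in the pipeline.*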

*Now apply the blank model created above...*

In [18]:
%%time
doc_blank = nlp_blank(soup)
print(len(list(doc_blank.sents)))

94133
Wall time: 18 s
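
*The %%time magic only works inside a notebook. In a plain script, a similar measurement can be sketched with time.perf_counter. The repeated sentence below is a small stand-in for the real dataset, so the timing it prints is not comparable to the figure above.*

```python
import time

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# A small synthetic text standing in for the real dataset.
text = "This is a short sentence. " * 1000

start = time.perf_counter()
doc = nlp(text)
n_sentences = len(list(doc.sents))
elapsed = time.perf_counter() - start

print(f"{n_sentences} sentences in {elapsed:.3f} s")
```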


*Now load a small model and apply the model to the dataset...*

In [19]:
# Load a small model
nlp_small = spacy.load("en_core_web_sm")
nlp_small.max_length = 5278439

In [21]:
%%time
doc_small = nlp_small(soup)
print(len(list(doc_small.sents)))

98262
Wall time: 1h 26min 2s


#### Examining a Pipeline
*spaCy offers a few different ways to examine a pipeline. To do this in a script, we can run the following command.*

In [22]:
# Analyze the small model imported above
nlp_small.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

*Note the dictionary structure: it tells us what is inside the pipeline and in what order. Each key under "summary" is a pipe.*

*> "assigns" lists what that particular pipe assigns to the token and doc as they pass through the pipeline.*

*> "scores" lists the metrics used to evaluate the pipe's machine learning model; it is empty for pipes that are not scored.*
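
*Because the result is an ordinary dictionary, it can also be walked in code. The sketch below does this for a fresh blank model with a single sentencizer (named nlp here so that it stays independent of the models above).*

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# analyze_pipes returns a plain dictionary describing each pipe.
analysis = nlp.analyze_pipes()

# Walk the "summary" section: one entry per pipe, in pipeline order.
for name, info in analysis["summary"].items():
    print(name, "assigns:", info["assigns"], "scores:", info["scores"])

pipe_order = list(analysis["summary"])
```

*For a quick look at the order alone, nlp.pipe_names gives the same list of pipe names without the extra detail.*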

In [None]:
nlp_blank.analyze_pipes()