# spaCy Pipelines

A pipeline is a sequence of pipes (pipeline components), or actors on data, that make alterations to the data or extract information from it. In some cases, later pipes require the output from earlier components, while in other cases, a pipe can exist entirely on its own. 

In [1]:
# import required libraries
import spacy

In [2]:
# load the small English model from spaCy for text processing
nlp = spacy.load('en_core_web_sm')

In [3]:
# run the spaCy pipeline on a sample text for basic processing
doc = nlp('We are learning spaCy pipelines')

In [4]:
type(doc)

spacy.tokens.doc.Doc

When we call nlp on a text, spaCy first tokenizes the text to produce a Doc container. The Doc object is then processed in several different steps, known as the processing pipeline.
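To make these two stages concrete, here is a minimal sketch of roughly what a call to nlp does: tokenize first, then apply each component in order. It assumes a blank English pipeline with only a sentencizer, so that it runs without a downloaded model; the sample text and the names nlp_sketch and doc_sketch are illustrative, not part of the original notebook.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp_sketch = spacy.blank("en")
nlp_sketch.add_pipe("sentencizer")  # rule-based sentence splitter

text = "We are learning spaCy pipelines. They are useful."

# Step 1: the tokenizer turns the raw text into a Doc container.
doc_sketch = nlp_sketch.make_doc(text)

# Step 2: each component receives the Doc and returns it, in pipeline order.
for name, component in nlp_sketch.pipeline:
    doc_sketch = component(doc_sketch)

print(nlp_sketch.pipe_names)
print([sent.text for sent in doc_sketch.sents])
```

Calling nlp_sketch(text) directly should give the same annotations; the explicit loop only makes the order of operations visible.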

In [5]:
print([token.text for token in doc])

['We', 'are', 'learning', 'spaCy', 'pipelines']


In [6]:
print([(token.text,token.pos_) for token in doc])

[('We', 'PRON'), ('are', 'AUX'), ('learning', 'VERB'), ('spaCy', 'NUM'), ('pipelines', 'NOUN')]


As another example, a named entity recognition pipeline can be built from three pipes: a Tokenizer, which is the first processing step in every spaCy pipeline; a rule-based named entity recognizer known as the EntityRuler, which finds entities and assigns their types from patterns; and an EntityLinker, which resolves each entity to an entry in a knowledge base. Through this processing pipeline, an input text is converted to a Doc container with its corresponding annotated entities. The pretrained en_core_web_sm model used below finds entities with its statistical ner component instead. In both cases, we can use the doc.ents attribute to find the entities in the input text.

In [9]:
doc = nlp('Albert Einstein was genius')

print([(ent.text,ent.label_) for ent in doc.ents])

[('Albert Einstein', 'PERSON')]
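For comparison, a rule-based pipeline can be sketched with the EntityRuler. This is a minimal example under stated assumptions: it uses a blank English pipeline, and the single pattern and the names ruler_nlp, ruler and ruler_doc are illustrative choices rather than anything from the original notebook.

```python
import spacy

# Blank pipeline with a rule-based entity recognizer as its only component.
ruler_nlp = spacy.blank("en")
ruler = ruler_nlp.add_pipe("entity_ruler")

# Illustrative pattern: label the exact phrase "Albert Einstein" as PERSON.
ruler.add_patterns([{"label": "PERSON", "pattern": "Albert Einstein"}])

ruler_doc = ruler_nlp("Albert Einstein was genius")
print([(ent.text, ent.label_) for ent in ruler_doc.ents])
```

Unlike the statistical component, the ruler only finds what its patterns describe, so it is best suited to entities that can be listed or described in advance.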


#### Adding Pipes

We often use an existing spaCy model. However, in some cases an off-the-shelf model will not satisfy our requirements, and we need to build a pipeline ourselves by adding only the pipes we need.


An example of this is sentence segmentation for a long document with 10,000 sentences. Even en_core_web_sm, the smallest English model, can take a long time to process 10,000 sentences and separate them. The reason is that calling an existing spaCy model on a text activates the whole NLP pipeline, so every pipe, from named entity recognition to dependency parsing, runs on the text. In the timings below, the full pipeline takes roughly 100 times longer than a pipeline that contains only a sentence segmenter.

In [12]:
import time

In [14]:
text = ' '.join(['This is a sample test sentence.']*10000)

start_time = time.time()

doc = nlp(text)

end_time = time.time()

print('Finished processing with en_core_web_sm model in {0} minutes'.format(round((end_time-start_time)/60.0, 5)))

Finished processing with en_core_web_sm model in 0.22787 minutes


In this case, we can create a blank English pipeline with spacy.blank("en") and add the sentencizer component to it with the .add_pipe() method of the nlp object.

By creating a blank model and adding only a sentencizer pipe, we can considerably reduce computational time, because only the intended pipeline component (sentence segmentation) runs on the given documents.

In [16]:
# Let us see this

blank_nlp = spacy.blank('en')
blank_nlp.add_pipe('sentencizer')

start_time = time.time()

doc = blank_nlp(text)

end_time = time.time()

print('Finished processing with blank model in {0} minutes'.format(round((end_time-start_time)/60.0, 5)))

Finished processing with blank model in 0.00206 minutes


As we can see, the computational time is considerably reduced.
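The same .add_pipe() method also accepts components we write ourselves. The following is a minimal sketch, assuming a hypothetical component named token_counter and a hypothetical custom attribute n_tokens; neither comes from the original notebook. The component is registered with the @Language.component decorator and placed after the sentencizer.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute, reachable as doc._.n_tokens.
Doc.set_extension("n_tokens", default=None, force=True)


@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    # A component takes a Doc, may modify it, and must return it.
    doc._.n_tokens = len(doc)
    return doc


custom_nlp = spacy.blank("en")
custom_nlp.add_pipe("sentencizer")
# Placement can be controlled with first, last, before or after.
custom_nlp.add_pipe("token_counter", after="sentencizer")

custom_doc = custom_nlp("We are learning spaCy pipelines.")
print(custom_nlp.pipe_names)
print(custom_doc._.n_tokens)
```

Because the pipeline holds only these two lightweight pipes, it keeps the speed advantage of the blank model while still producing the custom annotation.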

#### Analyzing pipeline components

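spaCy allows us to analyze a pipeline with nlp.analyze_pipes(). For each component, the analysis lists the attributes it assigns and requires, so we can check whether any required attribute is never set by an earlier component.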
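As a hedged sketch of what an unmet requirement looks like, the example below registers a hypothetical component, sentence_counter, that declares it requires doc.sents, and places it before the sentencizer that assigns doc.sents. The component name and the variable names are assumptions made for illustration. The pipeline is only analyzed here, not run on text.

```python
import spacy
from spacy.language import Language


# Hypothetical component that declares a dependency on sentence boundaries.
@Language.component("sentence_counter", requires=["doc.sents"])
def sentence_counter(doc):
    doc.user_data["n_sents"] = len(list(doc.sents))
    return doc


# Wrong order: the requirement is declared before anything assigns doc.sents.
bad_nlp = spacy.blank("en")
bad_nlp.add_pipe("sentence_counter")
bad_nlp.add_pipe("sentencizer")
bad_analysis = bad_nlp.analyze_pipes()
print(bad_analysis["problems"])

# Right order: the sentencizer assigns doc.sents before it is required.
good_nlp = spacy.blank("en")
good_nlp.add_pipe("sentencizer")
good_nlp.add_pipe("sentence_counter")
good_analysis = good_nlp.analyze_pipes()
print(good_analysis["problems"])
```

An empty list under "problems" for every component, as in the output for en_core_web_sm below, means that no declared requirement is left unmet.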

In [20]:
nlp.analyze_pipes(pretty=True)

#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False      
                                                                                     
1   tagger            token.tag                        tag_acc            False      
                                                                                     
2   parser            token.dep                        dep_uas            False      
                      token.head                       dep_las                       
                      token.is_sent_start              dep_las_per_type              
                      doc.sents                        sents_p                       
                                                       sents_r                       
                                                

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att