# Chapter 3 - Processing Pipelines

This chapter is dedicated to processing pipelines: a series of functions applied to a Doc to add attributes like part-of-speech tags, dependency labels or named entities. In this lesson, you'll learn about the pipeline components provided by spaCy, and what happens behind the scenes when you call nlp on a string of text.

You've already written this plenty of times by now: pass a string of text to the nlp object, and receive a Doc object. But what does the nlp object actually do? First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the Doc in order: in this case, the tagger, then the parser, then the entity recognizer. Finally, the processed Doc is returned, so you can work with it.
![pipeline](fig/pipeline.png)
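
The flow described above can be sketched in plain Python. This is a toy model and not spaCy's implementation: `ToyDoc`, `toy_tokenizer`, `toy_tagger`, `toy_counter` and `toy_nlp` are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
# Toy model of calling nlp on a text: tokenize first, then apply each
# component in order. All names are hypothetical; this does not use spaCy.

class ToyDoc:
    """Minimal stand-in for a Doc: a list of words plus a dict of annotations."""
    def __init__(self, words):
        self.words = words
        self.annotations = {}

def toy_tokenizer(text):
    # Whitespace splitting stands in for real tokenization
    return ToyDoc(text.split())

def toy_tagger(doc):
    # Pretend "tagger": record which words are capitalized
    doc.annotations["is_title"] = [word.istitle() for word in doc.words]
    return doc

def toy_counter(doc):
    # Second component: sees the doc already modified by the first one
    doc.annotations["n_words"] = len(doc.words)
    return doc

# The pipeline is an ordered list of (name, component) tuples
pipeline = [("tagger", toy_tagger), ("counter", toy_counter)]

def toy_nlp(text):
    doc = toy_tokenizer(text)           # 1. the tokenizer creates the doc
    for name, component in pipeline:    # 2. components run in order
        doc = component(doc)
    return doc                          # 3. the processed doc is returned

doc = toy_nlp("Hello big world")
print(doc.words)        # ['Hello', 'big', 'world']
print(doc.annotations)  # {'is_title': [True, False, False], 'n_words': 3}
```

Each component receives the doc returned by the previous one, which is why the order of the list matters.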

spaCy ships with the following built-in pipeline components.

- **The part-of-speech tagger** sets the `token.tag` attribute.
- **The dependency parser** adds the `token.dep` and `token.head` attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks.
- **The named entity recognizer** adds the detected entities to the `doc.ents` property. It also sets entity type attributes on the tokens that indicate if a token is part of an entity or not.
- The **text classifier** sets category labels that apply to the whole text, and adds them to the `doc.cats` property.

Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default. But you can use it to train your own system.
![pipeline components](fig/pipeline_components.png)

All models you can load into spaCy include several files and a meta JSON. The meta defines things like the language and pipeline. This tells spaCy which components to instantiate. The built-in components that make predictions also need binary data. The data is included in the model package and loaded into the component when you load the model.
![under the hood](fig/under_the_hood.png)
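
To make the role of the meta more concrete, here is a rough sketch of how a list of component names could be turned into a pipeline. It is only an illustration: the trimmed `META_JSON` string, the `FACTORIES` registry and `make_placeholder` are hypothetical, and loading the binary data is left out entirely.

```python
import json

# Hypothetical, heavily trimmed meta; real model packages define more fields
META_JSON = '{"lang": "en", "pipeline": ["tagger", "parser", "ner"]}'

def make_placeholder(name):
    """Build a do-nothing component; a real one would also load binary data."""
    def component(doc):
        return doc
    component.__name__ = name
    return component

# Hypothetical registry mapping component names to factory functions
FACTORIES = {
    "tagger": make_placeholder,
    "parser": make_placeholder,
    "ner": make_placeholder,
}

meta = json.loads(META_JSON)

# Instantiate one component per name, in the order listed in the meta
pipeline = [(name, FACTORIES[name](name)) for name in meta["pipeline"]]
pipe_names = [name for name, _ in pipeline]

print(meta["lang"], pipe_names)  # en ['tagger', 'parser', 'ner']
```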

To see the names of the pipeline components present in the current nlp object, you can use the `nlp.pipe_names` attribute. For a list of component name and component function tuples, you can use the `nlp.pipeline` attribute. The component functions are the functions applied to the Doc to process it and set attributes – for example, part-of-speech tags or named entities.

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')

print(nlp.pipe_names)
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7ff5a9dbccf8>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7ff5a80f6228>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7ff5a80f6288>)]


## Custom Pipeline Components

Now that you know how spaCy's pipeline works, let's take a look at another very powerful feature: custom pipeline components. Custom pipeline components let you add your own function to the spaCy pipeline that is executed when you call the nlp object on a text – for example, to modify the Doc and add more data to it.

After the text is tokenized and a Doc object has been created, pipeline components are applied in order. spaCy supports a range of built-in components, but also lets you define your own. Custom components are executed automatically when you call the nlp object on a text. They're especially useful for adding your own custom metadata to documents and tokens. You can also use them to update built-in attributes, like the named entity spans. In other words, they allow you to:
1. Make a function execute automatically when you call `nlp`.
2. Add your own metadata to documents and tokens.
3. Update built-in attributes like `doc.ents`.

Fundamentally, a pipeline component is a function or callable that takes a doc, modifies it and returns it, so it can be processed by the next component in the pipeline. Components can be added to the pipeline using the `nlp.add_pipe` method. The method takes at least one argument: the component function.
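
Since a component only has to be callable, an object with a `__call__` method works just as well as a function. The sketch below is plain Python with hypothetical names (`WordCounter`, `broken_component`), and a list of strings stands in for the Doc. It also shows what the next component would receive if a component forgot to return the doc.

```python
class WordCounter:
    """Callable component that keeps state between calls (hypothetical)."""
    def __init__(self):
        self.total = 0

    def __call__(self, doc):
        self.total += len(doc)
        return doc  # hand the doc on to the next component

def broken_component(doc):
    len(doc)  # does some work, but forgets to return the doc

counter = WordCounter()
doc = ["Hello", "world", "!"]  # a list of strings stands in for a Doc

doc = counter(doc)
print(counter.total)          # 3
print(broken_component(doc))  # None: the next component would get no doc
```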

To specify where to add the component in the pipeline, you can use the following keyword arguments: 
- Setting `last` to `True` will add the component last in the pipeline. This is the default behavior. 
- Setting `first` to `True` will add the component first in the pipeline, right after the tokenizer.
- The `before` and `after` arguments let you define the name of an existing component to add the new component before or after. For example, `before="ner"` will add it before the named entity recognizer.

The component named in `before` or `after` needs to exist in the pipeline, though. Otherwise, spaCy will raise an error.

| Argument | Description | Example |
|----------|-------------|---------|
| `last`   | If `True`, add last | `nlp.add_pipe(component, last=True)` |
| `first`  | If `True`, add first | `nlp.add_pipe(component, first=True)` |
| `before` | Add before component | `nlp.add_pipe(component, before='ner')` |
| `after`  | Add after component | `nlp.add_pipe(component, after='tagger')` |
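
The positioning rules in the table can be mimicked with a list of `(name, component)` tuples. This is a simplified sketch in plain Python, not spaCy's code: `toy_add_pipe` is a hypothetical helper, and adding last is modeled simply as the default when no other argument is given.

```python
def toy_add_pipe(pipeline, name, component, first=False, before=None, after=None):
    """Insert (name, component) into a list of (name, component) tuples."""
    names = [n for n, _ in pipeline]
    if first:
        index = 0
    elif before is not None:
        if before not in names:
            raise ValueError(f"No component named {before!r} in {names}")
        index = names.index(before)
    elif after is not None:
        if after not in names:
            raise ValueError(f"No component named {after!r} in {names}")
        index = names.index(after) + 1
    else:
        index = len(pipeline)  # default: add last
    pipeline.insert(index, (name, component))

def identity(doc):
    return doc

pipeline = [("tagger", identity), ("parser", identity), ("ner", identity)]
toy_add_pipe(pipeline, "a", identity, first=True)
toy_add_pipe(pipeline, "b", identity, before="ner")
toy_add_pipe(pipeline, "c", identity, after="tagger")
toy_add_pipe(pipeline, "d", identity)

names = [n for n, _ in pipeline]
print(names)  # ['a', 'tagger', 'c', 'parser', 'b', 'ner', 'd']

# Referring to a component that does not exist raises an error
try:
    toy_add_pipe(pipeline, "e", identity, before="missing")
except ValueError as err:
    print("Error:", err)
```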

Here's an example of a simple pipeline component. We start off with the small English model. We then define the component – a function that takes a `Doc` object and returns it. Let's do something simple and print the length of the `Doc` that passes through the pipeline. Don't forget to return the `Doc` so it can be processed by the next component in the pipeline! The `Doc` created by the tokenizer is passed through all components, so it's important that they all return the modified `Doc`. We can now add the component to the pipeline. Let's add it to the very beginning, right after the tokenizer, by setting `first=True`. When we print the pipeline component names, the custom component now shows up at the start. This means it will be applied first when we process a `Doc`.

In [6]:
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    """Print the doc's length."""
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']


Now when we process a text using the nlp object, the custom component will be applied to the Doc and the length of the document will be printed.

In [7]:
# Process a text
doc = nlp("Hello world!")

Doc length: 3


In this exercise, you'll be writing a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to `doc.ents`. A `PhraseMatcher` with the animal patterns has already been created as the variable `matcher`.

- Define the custom component and apply the matcher to the doc.
- Create a `Span` for each match, assign the label ID for `'ANIMAL'` and overwrite `doc.ents` with the new spans.
- Add the new component to the pipeline after the `'ner'` component.
- Process the text and print the entity text and entity label for the entities in `doc.ents`.

In [9]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span

nlp = spacy.load('en_core_web_sm')
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print(animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add('ANIMAL', None, *animal_patterns)

def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label='ANIMAL') for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc

# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

[Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


## Extension Attributes

In this lesson, we'll learn how to add custom attributes to the Doc, Token and Span objects to store custom data. Custom attributes let you add any metadata to Docs, Tokens and Spans. The data can be added once, or it can be computed dynamically. Custom attributes are available via the `._.` property. This makes it clear that they were added by the user, and not built into spaCy, like `token.text`.

Attributes need to be registered on the global Doc, Token and Span classes you can import from `spacy.tokens`. You've already worked with those in the previous chapters. To register a custom attribute on the Doc, Token or Span, you can use the `set_extension` method. The first argument is the attribute name. Keyword arguments let you define how the value should be computed. In the example below, each attribute has a default value and can be overwritten.

First we need to register the extensions:

In [11]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)

Only after the extensions are registered can their values be set:

In [14]:
doc = nlp('This is a sentence')
token = doc[0]
span = doc[0:2]

doc._.title = 'My document'
token._.is_color = True
span._.has_color = False

print(doc._.title, token._.is_color, span._.has_color)

My document True False


There are three types of extensions: 
- attribute extensions
- property extensions
- method extensions

### Attribute extensions
Attribute extensions set a default value that can be overwritten. For example, a custom `is_color` attribute on the token that defaults to `False`. On individual tokens, its value can be changed by overwriting it – in this case, `True` for the token "blue".

In [18]:
# Set extension on the Token with default value
# Note that we must add the `force=True` option because this extension has been set
# in the previous cell
Token.set_extension('is_color', default=False, force=True)
doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True

### Property extensions

Property extensions work like properties in Python: they can define a getter function and an optional setter. The getter function is only called when you retrieve the attribute. This lets you compute the value dynamically, and even take other custom attributes into account. Getter functions take one argument: the object, in this case, the token. In this example, the function returns whether the token text is in our list of colors. We can then provide the function via the `getter` keyword argument when we register the extension. The token "blue" now returns `True` for `is_color`.

In [20]:
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color, force=True)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)

True - blue
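
For comparison, this is what the analogy with properties in Python looks like without spaCy. The `ToyToken` class is hypothetical and only serves to show that a getter is evaluated on every access, so the value follows the current state of the object instead of being stored.

```python
class ToyToken:
    """Hypothetical stand-in for a token, used only to illustrate properties."""
    COLORS = ("red", "yellow", "blue")

    def __init__(self, text):
        self.text = text

    @property
    def is_color(self):
        # Computed every time the attribute is read, never stored
        return self.text in self.COLORS

token = ToyToken("sky")
print(token.is_color)  # False

token.text = "blue"
print(token.is_color)  # True: recomputed from the current text
```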


If you want to set extension attributes on a `Span`, you almost always want to use a property extension with a getter. Otherwise, you'd have to update every possible span ever by hand to set all the values. In this example, the `get_has_color` function takes the span and returns whether the text of any of the tokens is in the list of colors. After we've processed the doc, we can check different slices of the doc and the custom `has_color` property returns whether the span contains a color token or not.

In [22]:
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color, force=True)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

True - sky is blue
False - The sky


### Method extensions

Method extensions make the extension attribute a callable method. You can then pass one or more arguments to it, and compute attribute values dynamically – for example, based on a certain argument or setting. In this example, the method function checks whether the doc contains a token with a given text. The first argument of the method is always the object itself – in this case, the Doc. It's passed in automatically when the method is called. All other function arguments will be arguments on the method extension. In this case, `token_text`. Here, the custom `has_token` method returns `True` for the word "blue" and `False` for the word "cloud".

In [23]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

True - blue
False - cloud
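
One way to picture how the object gets passed in automatically is `functools.partial` from the standard library, which pre-fills the first argument of a function. The sketch below is a plain-Python analogy that makes no claim about spaCy's internals; a list of strings stands in for the Doc, and `words` and `bound_has_token` are hypothetical names.

```python
from functools import partial

def has_token(doc, token_text):
    # Same logic as the method above, with a list of strings as the "doc"
    return token_text in doc

words = ["The", "sky", "is", "blue", "."]

# Pre-fill the first argument; callers only pass the remaining ones
bound_has_token = partial(has_token, words)

print(bound_has_token("blue"))   # True
print(bound_has_token("cloud"))  # False
```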
