# Chapter 3: Processing Pipelines

This chapter shows you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.

## Processing pipelines

resources: [slides](slides/chapter3_01_processing-pipelines.md)

Welcome back! This chapter is dedicated to processing pipelines: a series of functions applied to a Doc to add attributes like part-of-speech tags, dependency labels or named entities.

In this lesson, you'll learn about the pipeline components provided by spaCy, and what happens behind the scenes when you call nlp on a string of text.

### What happens when you call nlp?

![pipeline](slides/static/pipeline.png)

You've already written this plenty of times by now: pass a string of text to the nlp object, and receive a Doc object.

But what does the nlp object actually do?

First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the Doc in order. In this case, the tagger, then the parser, then the entity recognizer. Finally, the processed Doc is returned, so you can work with it.
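The order of operations described above can be modelled in a few lines of plain Python. This is only a toy sketch, not spaCy's implementation: `toy_tokenizer`, `toy_tagger`, `toy_parser`, `toy_ner` and `toy_nlp` are hypothetical stand-ins, and a dictionary takes the place of the real `Doc` object.

```python
# Toy model of a processing pipeline (hypothetical names, not spaCy code)

def toy_tokenizer(text):
    # Step 1: turn the raw string into a "doc" (here: a plain dict)
    return {"text": text, "tokens": text.split(), "applied": []}

def toy_tagger(doc):
    # A component receives the doc, modifies it and returns it
    doc["applied"].append("tagger")
    return doc

def toy_parser(doc):
    doc["applied"].append("parser")
    return doc

def toy_ner(doc):
    doc["applied"].append("ner")
    return doc

# Components are stored as (name, function) pairs and run in this order
TOY_PIPELINE = [("tagger", toy_tagger), ("parser", toy_parser), ("ner", toy_ner)]

def toy_nlp(text):
    # Tokenize first, then apply every component in order
    doc = toy_tokenizer(text)
    for name, component in TOY_PIPELINE:
        doc = component(doc)
    return doc

doc = toy_nlp("Hello world !")
print(doc["tokens"])   # ['Hello', 'world', '!']
print(doc["applied"])  # ['tagger', 'parser', 'ner']
```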

### Built-in pipeline components

spaCy ships with the following built-in pipeline components.

The part-of-speech tagger sets the `token.tag` attribute.

The dependency parser adds the `token.dep` and `token.head` attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks.

The named entity recognizer adds the detected entities to the `doc.ents` property. It also sets entity type attributes on the tokens that indicate whether a token is part of an entity.

Finally, the text classifier sets category labels that apply to the whole text, and adds them to the `doc.cats` property.

Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default. But you can use it to train your own system.

|Name|Description|Creates|
|---|---|---|
|tagger|Part-of-speech tagger|Token.tag|
|parser|Dependency parser|Token.dep, Token.head, Doc.sents, Doc.noun_chunks|
|ner|Named entity recognizer|Doc.ents, Token.ent_iob, Token.ent_type|
|textcat|Text classifier|Doc.cats|

### Under the hood

![under_the_hood2](slides/static/package_meta.png)

All models you can load into spaCy include several files and a `meta.json`.

The meta defines things like the language and pipeline. This tells spaCy which components to instantiate.

The built-in components that make predictions also need binary data. The data is included in the model package and loaded into the component when you load the model.
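As a rough illustration of how a meta file can drive pipeline construction, the sketch below reads a `pipeline` list and looks each name up in a table of factories. The shortened meta content, the `TOY_FACTORIES` table and `build_pipeline` are assumptions made for this example; real model packages contain more fields, and spaCy's own loading code is more involved.

```python
import json

# Hypothetical, heavily shortened meta content for illustration only
meta_json = '{"lang": "en", "pipeline": ["tagger", "parser", "ner"]}'
meta = json.loads(meta_json)

def make_component(name):
    # Hypothetical factory: returns a component (doc in, doc out)
    def component(doc):
        doc.append(name)  # record that this component ran
        return doc
    return component

# Every component the toy "library" knows how to create
TOY_FACTORIES = {name: make_component for name in ("tagger", "parser", "ner", "textcat")}

def build_pipeline(meta):
    # Instantiate only the components listed in the meta, in order
    return [(name, TOY_FACTORIES[name](name)) for name in meta["pipeline"]]

pipeline = build_pipeline(meta)
print([name for name, _ in pipeline])  # ['tagger', 'parser', 'ner']
```

In a real package, the components that make predictions would also load their binary data at this point.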

### Pipeline attributes

- `nlp.pipe_names`: list of pipeline component names
- `nlp.pipeline`: list of `(name, component)` tuples

```python
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)
```

```
['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x11770d828>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x117a99888>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x117a998e8>)]
```


## Custom pipeline components

resources: [slides](slides/chapter3_02_custom-pipeline-components.md)

Custom pipeline components let you add your own function to the spaCy pipeline that is executed when you call the nlp object on a text – for example, to modify the Doc and add more data to it.

### Why custom components?

![pipeline](slides/static/pipeline.png)

- Make a function execute automatically when you call `nlp`
- Add your own metadata to documents and tokens
- Update built-in attributes like `doc.ents`

### Anatomy of a component

- Function that takes a `doc`, modifies it and returns it
- Can be added using the `nlp.add_pipe` method

```python
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)
```

|Argument|Description|Example|
|---|---|---|
|`last`|If `True`, add last|`nlp.add_pipe(component, last=True)`|
|`first`|If `True`, add first|`nlp.add_pipe(component, first=True)`|
|`before`|Add before component|`nlp.add_pipe(component, before='ner')`|
|`after`|Add after component|`nlp.add_pipe(component, after='tagger')`|
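To make the positional arguments concrete, here is a toy sketch of how they could map onto list insertion. `toy_add_pipe` is a hypothetical helper operating on a plain list of names; it is not how spaCy implements `nlp.add_pipe`.

```python
def toy_add_pipe(pipe_names, name, first=False, last=False, before=None, after=None):
    """Return a new list of names with `name` inserted at the requested position."""
    # Only one positional option may be used at a time
    if sum([bool(first), bool(last), before is not None, after is not None]) > 1:
        raise ValueError("Use only one of first, last, before, after")
    names = list(pipe_names)
    if first:
        names.insert(0, name)
    elif before is not None:
        names.insert(names.index(before), name)
    elif after is not None:
        names.insert(names.index(after) + 1, name)
    else:
        # Default (and last=True): append to the end
        names.append(name)
    return names

base = ["tagger", "parser", "ner"]
print(toy_add_pipe(base, "custom"))                  # ['tagger', 'parser', 'ner', 'custom']
print(toy_add_pipe(base, "custom", first=True))      # ['custom', 'tagger', 'parser', 'ner']
print(toy_add_pipe(base, "custom", before="ner"))    # ['tagger', 'parser', 'custom', 'ner']
print(toy_add_pipe(base, "custom", after="tagger"))  # ['tagger', 'custom', 'parser', 'ner']
```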

### Example: a simple component

```python
import spacy

# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

# Process a text
doc = nlp('Hello world!')
```

```
Pipeline: ['custom_component', 'tagger', 'parser', 'ner']
Doc length: 3
```


```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label='ANIMAL') for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])
```

```
animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]
```


## Extension attributes

resources: [slides](slides/chapter3_03_extension-attributes.md)

In this lesson, you'll learn how to add custom attributes to the Doc, Token and Span objects to store custom data.

### Setting custom attributes

- Add custom metadata to documents, tokens and spans
    - The data can be added once, or it can be computed dynamically.
- Accessible via the `._` property
    - This makes it clear that they were added by the user, and not built into spaCy, like `token.text`.

```python
doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
```

- Registered on the global `Doc`, `Token` or `Span` using the `set_extension` method
    - The first argument is the attribute name. Keyword arguments let you define how the value should be computed. In this case, it has a default value and can be overwritten.

```python
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)
```

### Extension attribute types

- Attribute extensions
- Property extensions
- Method extensions
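The difference between the three types can be sketched without spaCy. The `ToyExtensions` class below is a hypothetical stand-in for the `._` namespace: it returns a stored value or default, calls a getter on access, or binds a method to the object. It is a simplified model, not spaCy's implementation.

```python
from functools import partial

class ToyExtensions:
    """Toy stand-in for the `._` namespace of a single object."""
    registry = {}  # name -> (kind, value), shared by all instances

    def __init__(self, obj):
        # Bypass our own __setattr__ while setting up internal state
        object.__setattr__(self, "_obj", obj)
        object.__setattr__(self, "_values", {})

    def __getattr__(self, name):
        if name not in self.registry:
            raise AttributeError(name)
        kind, value = self.registry[name]
        if kind == "default":   # attribute extension: stored value or default
            return self._values.get(name, value)
        if kind == "getter":    # property extension: computed on every access
            return value(self._obj)
        return partial(value, self._obj)  # method extension: object passed first

    def __setattr__(self, name, value):
        self._values[name] = value  # overwrite an attribute extension

# Register one extension of each type (hypothetical names)
ToyExtensions.registry["title"] = ("default", None)
ToyExtensions.registry["n_words"] = ("getter", lambda text: len(text.split()))
ToyExtensions.registry["has_word"] = ("method", lambda text, word: word in text.split())

ext = ToyExtensions("The sky is blue")
print(ext.title)              # None
ext.title = "My document"
print(ext.title)              # My document
print(ext.n_words)            # 4
print(ext.has_word("blue"))   # True
print(ext.has_word("cloud"))  # False
```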

### Attribute extensions

- Set a default value that can be overwritten

```python
from spacy.tokens import Token

# Set extension on the Token with default value
Token.set_extension('is_color', default=False)

doc = nlp('The sky is blue.')

# Overwrite extension attribute value
doc[3]._.is_color = True
```

### Property extensions

- Define a getter and an optional setter function
- Getter only called when you retrieve the attribute value

```python
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extensions on the Token with getter
Token.set_extension('is_color', getter=get_is_color, force=True)

doc = nlp('The sky is blue.')
print(doc[3]._.is_color, '-', doc[3].text)
```

```
True - blue
```


If you want to set extension attributes on a Span, you almost always want to use a property extension with a getter. Otherwise, you'd have to update every possible span ever by hand to set all the values.

In this example, the `get_has_color` function takes the span and returns whether the text of any of the tokens is in the list of colors.

After we've processed the doc, we can check different slices of the doc and the custom `has_color` property returns whether the span contains a color token or not.

```python
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color)

doc = nlp('The sky is blue.')
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)
```

```
True - sky is blue
False - The sky
```


### Method extensions

Method extensions make the extension attribute a callable method.

You can then pass one or more arguments to it, and compute attribute values dynamically – for example, based on a certain argument or setting.

In this example, the method function checks whether the doc contains a token with a given text. The first argument of the method is always the object itself – in this case, the Doc. It's passed in automatically when the method is called. All other function arguments will be arguments on the method extension. In this case, `token_text`.

Here, the custom `has_token` method returns True for the word "blue" and False for the word "cloud".

```python
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)

doc = nlp('The sky is blue.')
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')
```

```
True - blue
False - cloud
```


```python
import json
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("exercises/countries.json") as f:
    COUNTRIES = json.loads(f.read())

with open("exercises/capitals.json") as f:
    CAPITALS = json.loads(f.read())

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))


def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label='GPE') for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute 'capital' with the getter get_capital
Span.set_extension('capital', getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])
```

```
['countries_component']
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]
```


## Scaling and performance

resources: [slides](slides/chapter3_04_scaling-performance.md)

In this lesson, I'll show you a few tips and tricks to make your spaCy pipelines run as fast as possible, and process large volumes of text efficiently.

### Processing large volumes of text

- Use `nlp.pipe` method
- Processes texts as a stream, yields `Doc` objects
- Much faster than calling `nlp` on each text

**BAD:**

```python
docs = [nlp(text) for text in LOTS_OF_TEXTS]
```

**GOOD:**

```python
docs = list(nlp.pipe(LOTS_OF_TEXTS))
```
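One reason a streaming API can be faster is that it can work on batches of texts rather than on one text at a time. The generator below sketches that idea in plain Python. `toy_pipe`, `process_batch` and the batch size of 2 are assumptions for illustration; spaCy's actual `nlp.pipe` does considerably more.

```python
from itertools import islice

def process_batch(texts):
    # Stand-in for model work that is cheaper when done on many texts at once
    return [text.split() for text in texts]

def toy_pipe(texts, batch_size=2):
    """Lazily yield one processed result per text, working batch by batch."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # yield from keeps the output a stream instead of a full list
        yield from process_batch(batch)

texts = ["This is a text", "And another text", "One more"]
stream = toy_pipe(texts)
print(next(stream))  # ['This', 'is', 'a', 'text']
print(list(stream))  # [['And', 'another', 'text'], ['One', 'more']]
```

Because the result is a generator, it has to be wrapped in `list()` when all results are needed at once, as in the `GOOD` example above.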

### Passing in context

- Setting `as_tuples=True` on `nlp.pipe` lets you pass in `(text, context)` tuples
- Yields `(doc, context)` tuples
- Useful for associating metadata with the `doc`

```python
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('Add another text', {'id': 2, 'page_number': 16})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])
```

```
This is a text 15
Add another text 16
```
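Conceptually, `as_tuples=True` amounts to splitting the pairs, processing only the texts, and pairing each result with its context again. The sketch below shows that idea in plain Python; `toy_pipe_as_tuples` and `toy_process` are hypothetical helpers, not spaCy functions.

```python
from itertools import tee

def toy_process(text):
    # Stand-in for turning a text into a processed doc
    return text.split()

def toy_pipe_as_tuples(pairs):
    """Yield (result, context) for each (text, context) pair, lazily."""
    # Duplicate the stream so texts and contexts can be read separately
    pairs_for_texts, pairs_for_contexts = tee(pairs)
    texts = (text for text, _ in pairs_for_texts)
    contexts = (context for _, context in pairs_for_contexts)
    results = (toy_process(text) for text in texts)
    # zip keeps each result aligned with the context it arrived with
    yield from zip(results, contexts)

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]
for tokens, context in toy_pipe_as_tuples(data):
    print(len(tokens), context["page_number"])
# 4 15
# 3 16
```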


You can even add the context metadata to custom attributes.

In this example, we're registering two extensions, `id` and `page_number`, which default to None.

After processing the text and passing through the context, we can overwrite the doc extensions with our context metadata.

```python
from spacy.tokens import Doc

Doc.set_extension('id', default=None, force=True)
Doc.set_extension('page_number', default=None, force=True)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
```

### Using only the tokenizer

![pipeline](slides/static/pipeline.png)

Another common scenario: Sometimes you already have a model loaded to do other processing, but you only need the tokenizer for one particular text.

Running the whole pipeline is unnecessarily slow, because you'll be getting a bunch of predictions from the model that you don't need.

If you only need a tokenized Doc object, you can use the `nlp.make_doc` method instead, which takes a text and returns a Doc.

This is also how spaCy does it behind the scenes: `nlp.make_doc` turns the text into a Doc before the pipeline components are called.

**BAD:**

```python
doc = nlp('Hello world')
```

**GOOD:**

```python
doc = nlp.make_doc('Hello world')
```

### Disabling pipeline components

- Use `nlp.disable_pipes` to temporarily disable one or more pipes

```python
# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)
```

- In the `with` block, spaCy will only run the remaining components.
- After the `with` block, the disabled pipeline components are automatically restored.
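The temporary nature of the `with` block can be modelled with a small context manager. This is a toy sketch: `toy_disable_pipes` is a hypothetical helper that works on a plain list of `(name, component)` tuples, and it is not spaCy's implementation of `nlp.disable_pipes`.

```python
from contextlib import contextmanager

@contextmanager
def toy_disable_pipes(pipeline, *names):
    """Temporarily remove the named components from `pipeline` (a list)."""
    original = list(pipeline)  # remember the full pipeline
    pipeline[:] = [(name, comp) for name, comp in pipeline if name not in names]
    try:
        yield pipeline
    finally:
        # Restore everything, even if the block raised an exception
        pipeline[:] = original

pipeline = [("tagger", None), ("parser", None), ("ner", None)]
with toy_disable_pipes(pipeline, "tagger", "parser"):
    print([name for name, _ in pipeline])  # ['ner']
print([name for name, _ in pipeline])      # ['tagger', 'parser', 'ner']
```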