# Chapter 3: Processing Pipelines

This chapter will show you everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to the documents, spans and tokens.

**Sections**

1. Processing pipelines 
2. What happens when you call nlp? 
3. Inspecting the pipeline 
4. Custom pipeline components 
5. Use cases for custom components 
6. Simple components 
7. Complex components 
8. Extension attributes 
9. Setting extension attributes (Part 1) 
10. Setting extension attributes (Part 2) 
11. Entities and extensions 
12. Components with extensions 
13. Scaling and performance 
14. Processing streams 
15. Processing data with context 
16. Selective processing

## 1. Processing pipelines

* processing pipelines: a series of functions applied to a doc to add attributes like part-of-speech tags, dependency labels, or named entities
* this lesson: learn about the pipeline components provided by spaCy, and what happens behind the scenes when you call nlp on a string of text


### What happens when you call nlp?
* First, the tokenizer is applied to turn the string of text into a `Doc` object
* Next, a series of pipeline components is applied to the doc in order
    - In this case, the tagger, then the parser, then the entity recognizer
* Finally, the processed doc is returned, so you can work with it

Text --> nlp \[ tokenizer -> tagger -> parser -> ner -> ... \] --> Doc
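
The flow above can be mimicked with a toy model in plain Python. This is only a sketch of the idea (tokenize first, then apply each component in order), not spaCy's implementation; the dict-based "doc" and the names `tokenizer`, `make_component` and `process` are made up for illustration.

```python
# Toy model of the flow above; all names here are hypothetical, not spaCy APIs.

def tokenizer(text):
    # Stand-in for the tokenizer: turn the raw string into a "doc" (a plain dict).
    return {"text": text, "tokens": text.split(), "log": []}

def make_component(name):
    # A component is a function that takes a doc, modifies it, and returns it.
    def component(doc):
        doc["log"].append(name)
        return doc
    return component

pipeline = [(name, make_component(name)) for name in ("tagger", "parser", "ner")]

def process(text):
    doc = tokenizer(text)              # first: string -> doc
    for name, component in pipeline:   # next: every component, in order
        doc = component(doc)
    return doc                         # finally: the processed doc

doc = process("This is a sentence.")
print(doc["log"])  # ['tagger', 'parser', 'ner']
```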

### Built-in pipeline components

* tagger
    - description: part-of-speech
    - creates: `Token.tag`, `Token.pos`
* parser
    - description: Dependency parser
    - creates: `Token.dep`, `Token.head`, `Doc.sents`, `Doc.noun_chunks`
* ner
    - description: named entity recognizer
    - creates: `Doc.ents`, `Token.ent_iob`, `Token.ent_type`
* textcat
    - description: text classifier
    - creates: `Doc.cats`


**Notes**
* spaCy ships with the following built-in pipeline components
* the part-of-speech tagger
    - sets the `token.tag` and `token.pos` attributes
* the dependency parser
    - adds the `token.dep` and `token.head` attributes
    - is also responsible for detecting sentences and base noun phrases, also known as noun chunks
* the named entity recognizer
    - adds the detected entities to the `doc.ents` property
    - also sets entity type attributes on the tokens that indicate if a token is part of an entity or not
* the text classifier
    - sets category labels that apply to the whole text
    - adds them to the `doc.cats` property
    - since text categories are always very specific, the text classifier is not included in any of the pre-trained models by default, but you can use it to train your own system

### Under the hood

* pipeline defined in model's `meta.json` in order
* built-in components need binary data to make predictions

**Notes**
* all models you can load into spaCy include several files and a `meta.json`
* the meta defines things like the language and pipeline
    - this tells spaCy which components to instantiate
* the built-in components that make predictions also need binary data
    - the data is included in the model package and loaded into the component when you load the model
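
As a rough illustration, the pipeline order could be read from the meta like this. The JSON string is a made-up, abridged stand-in for a real `meta.json` (real files hold more fields), and the loop only shows the lookup, not how spaCy actually instantiates components.

```python
import json

# Hypothetical, abridged meta.json content, for illustration only.
meta_json = '{"lang": "en", "name": "core_web_sm", "pipeline": ["tagger", "parser", "ner"]}'

meta = json.loads(meta_json)
print("language:", meta["lang"])

# The "pipeline" entry lists the component names in the order they are applied.
for name in meta["pipeline"]:
    print("component to create:", name)
```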

### Pipeline attributes
* `nlp.pipe_names`: list of pipeline component names

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")

print(nlp.pipe_names)

['tagger', 'parser', 'ner']


* `nlp.pipeline`: list of (name, component) tuples

In [3]:
for pipeline in nlp.pipeline:
    print(pipeline)

('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fa09e80bd50>)
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fa09e327de0>)
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fa09e327d70>)


**Notes**
* to see the names of the pipeline components present in the current nlp object, you can use the `nlp.pipe_names` attribute
* for a list of component name and component function tuples, you can use the `nlp.pipeline` attribute
* the component functions are the functions applied to the doc to process it and set attributes
    - EX: part-of-speech tags or named entities

## 2. What happens when you call nlp?

In [None]:
# What does spaCy do when you call `nlp` on a string of text?
# (did not run)
doc = nlp("This is a sentence.")

( ) Run the tagger, parser, and entity recognizer and then the tokenizer

(X) Tokenize the text and apply each pipeline component in order

( ) Connect to the spaCy server to compute the result and return it

( ) Initialize the language, add the pipeline and load in the binary model weights

**Correct!**: The tokenizer turns a string of text into a `Doc` object. spaCy then applies every component in the pipeline to the document, in order.

## 3. Inspecting the pipeline

In [4]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fa08521b690>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fa088500670>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fa0885008a0>)]


## 4. Custom pipeline components
* custom pipeline components let you add your own function to the spaCy pipeline that is executed when you call the `nlp` object on a text
    - EX: to modify the doc and add more data to it

### Why custom components?
* make a function execute automatically when you call `nlp`
* add your own metadata to documents and tokens
* update built-in attributes like `doc.ents`

**Notes**
* after the text is tokenized and a `Doc` object has been created, pipeline components are applied in order
* spaCy supports a range of built-in components, but also lets you define your own
* custom components are executed automatically when you call the `nlp` object on a text
* they're especially useful for adding your own custom metadata to documents and tokens
* you can also use them to update built-in attributes like the named entity spans

### Anatomy of a component (1)
* function that takes a `doc`, modifies it and returns it
* can be added using the `nlp.add_pipe` method

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")

def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)

**Notes**
* fundamentally: a pipeline component is a function or callable that takes a doc, modifies it, and returns it, so it can be processed by the next component in the pipeline
* components can be added to the pipeline using the `nlp.add_pipe` method
* the method takes at least one argument: the component function

### Anatomy of a component (2)

* `last`
    - description: if `True`, add last
    - example: `nlp.add_pipe(component, last=True)`
    - default option
* `first`
    - description: if `True`, add first
    - example: `nlp.add_pipe(component, first=True)`
    - right after tokenizer
* `before`
    - description: add before component
    - example: `nlp.add_pipe(component, before="ner")`
* `after`
    - description: add after component
    - example: `nlp.add_pipe(component, after="tagger")`
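
The placement rules listed above can be sketched on a plain list of component names. The helper `toy_add_pipe` is hypothetical and only models where a new name ends up; it is not spaCy's `add_pipe`.

```python
def toy_add_pipe(names, name, first=False, last=False, before=None, after=None):
    # Hypothetical helper: insert `name` into the list `names` in place.
    chosen = [bool(first), bool(last), before is not None, after is not None]
    if sum(chosen) > 1:
        raise ValueError("Use only one of first, last, before, after")
    if first:
        names.insert(0, name)                        # right after the tokenizer
    elif before is not None:
        names.insert(names.index(before), name)      # just before that component
    elif after is not None:
        names.insert(names.index(after) + 1, name)   # just after that component
    else:
        names.append(name)                           # default: add last
    return names

names = ["tagger", "parser", "ner"]
toy_add_pipe(names, "a", first=True)
toy_add_pipe(names, "b", before="ner")
toy_add_pipe(names, "c", after="tagger")
toy_add_pipe(names, "d")
print(names)  # ['a', 'tagger', 'c', 'parser', 'b', 'ner', 'd']
```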

### Example: a simple component (1)

In [3]:
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print("Pipeline:", nlp.pipe_names)

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']


**Notes**
* don't forget to return the doc so it can be processed by the next component in the pipeline
* the doc created by the tokenizer is passed through all components, so it's important that they all return the modified doc

### Example: a simple component (2)

In [4]:
# Create the nlp object
nlp = spacy.load("en_core_web_sm")

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print("Doc length:", len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Process a text
doc = nlp("Hello world!")

Doc length: 3


## 5. Use cases for custom components

Which of these problems can be solved by custom pipeline components? Choose all that apply!

1. Updating the pre-trained models and improving their predictions 
2. Computing your own values based on tokens and their attributes (X)
3. Adding named entities, for example based on a dictionary (X)
4. Implementing support for an additional language

**Explanation**: Custom components are great for adding custom values to documents, tokens and spans, and customizing the `doc.ents`.

## 6. Simple components

The example shows a custom component that prints the number of tokens in a document. Can you complete it?

In [5]:
import spacy

# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("I love this movie")

['length_component', 'tagger', 'parser', 'ner']
This document is 4 tokens long.


## 7. Complex components

In this exercise, you'll be writing a custom component that uses the `PhraseMatcher` to find animal names in the document and adds the matched spans to the `doc.ents`. A `PhraseMatcher` with the animal patterns has already been created as a variable `matcher`.

In [6]:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label "ANIMAL"
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the "ner" component
nlp.add_pipe(animal_component, after="ner")
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


## 8. Extension attributes

We'll learn how to add custom attributes to the `Doc`, `Token`, and `Span` objects to store custom data.

### Setting custom attributes
* Add custom metadata to documents, tokens, and spans
* Accessible via the `._` property

In [None]:
# (did not run)
doc._.title = "My document"
token._.is_color = True
span._.has_color = False

* Registered on the global `Doc`, `Token`, or `Span` using the `set_extension` method

In [None]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token, and Span
Doc.set_extension("title", default=None)
Token.set_extension("is_color", default=False)
Span.set_extension("has_color", default=False)

**Notes**
* custom attributes let you add any metadata to docs, tokens, and spans
    - the data can be added once, or it can be computed dynamically
    - available via the `._` (dot underscore) property; makes it clear that they were added by the user and not built into spaCy like `token.text`
    - need to be registered on the global `Doc`, `Token`, and `Span` classes you can import from `spacy.tokens`
* to register a custom attribute on `Doc`, `Token`, and `Span`, you can use the `set_extension` method



### Extension attribute types
1. Attribute extensions 
2. Property extensions 
3. Method extensions 

### Attribute extensions
* Set a default value that can be overwritten

In [7]:
from spacy.tokens import Token

# Set extension on the Token with default value
Token.set_extension("is_color", default=False)

doc = nlp("The sky is blue.")

# Overwritten extension attribute value
doc[3]._.is_color = True

**Notes**
* attribute extensions set a default value that can be overwritten
* EX: a custom `is_color` attribute on the token that defaults to `False`
* on individual tokens, its value can be changed by overwriting it
    - EX: True for the token "blue"

### Property extensions (1)
* define a getter and an optional setter function
* getter only called when you _retrieve_ the attribute value

In [12]:
from spacy.tokens import Token

# Define getter function
def get_is_color(token):
    colors = ["red", "yellow", "blue"]
    return token.text in colors

# Set extensions on the Token with getter
Token.set_extension("is_color", getter=get_is_color, force=True)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, "-", doc[3].text)

True - blue


**Notes**
* Property extensions work like properties in Python; they can define a getter function and an optional setter
    - the getter function is only called when you retrieve the attribute
    - this lets you compute the value dynamically & even take other custom attributes into account
* getter functions take one argument: the object, in this case the token
    - EX: the function returns whether the token text is in our list of colors
* we can then provide the function via the `getter` keyword argument when we register the extension
* the token "blue" now returns `True` for `._.is_color`
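
Since the notes compare property extensions to Python properties, here is the same idea in plain Python. `ToyToken` and its call counter are made up for illustration; the point is only that the getter runs when the value is read, so it always reflects the current state.

```python
COLORS = ["red", "yellow", "blue"]

class ToyToken:
    # Hypothetical stand-in for a token, not a spaCy class.
    def __init__(self, text):
        self.text = text
        self.getter_calls = 0

    @property
    def is_color(self):
        # Runs only when the attribute is read; nothing is stored up front.
        self.getter_calls += 1
        return self.text in COLORS

token = ToyToken("blue")
calls_before = token.getter_calls   # 0: the getter has not run yet
first_value = token.is_color        # True, computed on access
token.text = "sky"
second_value = token.is_color       # False, recomputed from the new text
print(calls_before, first_value, second_value)  # 0 True False
```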

### Property extensions (2)
* `Span` extensions should almost always use a getter

In [13]:
from spacy.tokens import Span

# Define getter function
def get_has_color(span):
    colors = ["red", "yellow", "blue"]
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension("has_color", getter=get_has_color)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, "-", doc[1:4].text)
print(doc[0:2]._.has_color, "-", doc[0:2].text)

True - sky is blue
False - The sky


**Notes**
* if you want to set extension attributes on a span, you almost always want to use a property extension with a getter
* otherwise, you'd have to update every possible span ever by hand to set all the values
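
To get a feel for "every possible span": a doc with n tokens has n * (n + 1) / 2 contiguous spans. The sketch below counts them for the five tokens of "The sky is blue." using a plain list of strings as a stand-in for the doc.

```python
def count_spans(n_tokens):
    # Number of contiguous, non-empty slices of a sequence with n_tokens items.
    return n_tokens * (n_tokens + 1) // 2

words = ["The", "sky", "is", "blue", "."]

# Enumerate every (start, end) slice explicitly to check the formula.
spans = [
    words[start:end]
    for start in range(len(words))
    for end in range(start + 1, len(words) + 1)
]

print(len(spans), count_spans(len(words)))  # 15 15
print(count_spans(1000))                    # 500500
```

Even a short sentence has 15 spans, and the count grows quadratically, which is why a value computed on demand is more practical than one set by hand.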

### Method extensions
* Assign a **function** that becomes available as an object method
* Lets you pass **arguments** to the extension function

In [14]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension("has_token", method=has_token)

doc = nlp("The sky is blue.")
print(doc._.has_token("blue"), "- blue")
print(doc._.has_token("cloud"), "- cloud")

True - blue
False - cloud


**Notes**
* method extensions make the extension attribute a callable method
* you can then pass one or more arguments to it, and compute attribute values dynamically
    - EX: based on a certain argument or setting
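
One way to picture a method extension is a function whose first argument has already been filled in with the object. The sketch below uses `functools.partial` on a made-up `ToyDoc` class; it is an analogy in plain Python, and no claim is made about how spaCy wires this up internally.

```python
from functools import partial

class ToyDoc:
    # Hypothetical stand-in for a doc: just a list of token texts.
    def __init__(self, words):
        self.words = words

def has_token(doc, token_text):
    # Same shape as the extension function above: object first, then arguments.
    return token_text in doc.words

doc = ToyDoc(["The", "sky", "is", "blue", "."])

# Bind the object once; callers then pass only the remaining arguments.
doc_has_token = partial(has_token, doc)

print(doc_has_token("blue"))   # True
print(doc_has_token("cloud"))  # False
```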

## 9. Setting extension attributes (Part 1)

In [15]:
# Step 1
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the Token extension attribute "is_country" with the default value False
Token.set_extension("is_country", default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [16]:
# Step 2
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]

# Register the Token property extension "reversed" with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


## 10. Setting extension attributes (Part 2)

In [17]:
# Part 1
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)


# Register the Doc property extension "has_number" with the getter get_has_number
Doc.set_extension("has_number", getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

has_number: True


In [18]:
# Part 2
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()

# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return f"<{tag}>{span.text}</{tag}>"

# Register the Span method extension "to_html" with the method to_html
Span.set_extension("to_html", method=to_html)

# Process the text and call the to_html method on the span with the tag name "strong"
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

<strong>Hello world</strong>


## 11. Entities and extensions

In this exercise, you'll combine custom extension attributes with the model's predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

In [25]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url, force=True)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

fifty years None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


## 12. Components with extensions

Extension attributes are especially powerful if they're combined with custom pipeline components. In this exercise, you'll write a pipeline component that finds country names and a custom extension attribute that returns a country's capital, if available.

In [None]:
# (did not run)
import json
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.loads(f.read())

with open("exercises/en/capitals.json", encoding="utf8") as f:
    CAPITALS = json.loads(f.read())

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))

def countries_component(doc):
    # Create an entity Span with the label "GPE" for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc

# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute "capital" with the getter get_capital
Span.set_extension("capital", getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

## 13. Scaling and performance


### Processing large volumes of text
* Use `nlp.pipe` method
* Processes texts as a stream, yields `Doc` objects
* Much faster than calling `nlp` on each text

In [None]:
# (did not run)
# BAD
docs = [nlp(text) for text in LOTS_OF_TEXTS]

# GOOD
docs = list(nlp.pipe(LOTS_OF_TEXTS))

**Notes**
* if you need to process a lot of texts and create a lot of `Doc` objects in a row, the `nlp.pipe` method can speed this up significantly
* it processes the texts as a stream and yields `Doc` objects
* it's much faster than just calling `nlp` on each text because it batches up the texts
* `nlp.pipe` is a generator that yields `Doc` objects, so in order to get a list of docs, remember to wrap the call in `list()`
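
The "stream plus batches" idea can be sketched with a generator in plain Python. `minibatch` and `toy_pipe` are hypothetical helpers, and upper-casing stands in for the real per-batch work; this is not how `nlp.pipe` is implemented.

```python
from itertools import islice

def minibatch(items, size):
    # Yield lists of up to `size` items from any iterable, one batch at a time.
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def toy_pipe(texts, batch_size=2):
    # Work on one batch at a time, but hand results back one by one.
    for batch in minibatch(texts, batch_size):
        processed = [text.upper() for text in batch]  # placeholder for batched work
        yield from processed

stream = toy_pipe(["one", "two", "three"])  # a generator: nothing has run yet
docs = list(stream)                          # consuming it produces the results
print(docs)  # ['ONE', 'TWO', 'THREE']
```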

### Passing in context (1)
* Setting `as_tuples=True` on `nlp.pipe` lets you pass in `(text, context)` tuples
* Yields `(doc, context)` tuples
* Useful for associating metadata with the `doc`

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context["page_number"])

This is a text 15
And another text 16


**Notes**
* `nlp.pipe` also supports passing in tuples of text/context if you set `as_tuples` to `True`
* the method will then yield doc / context tuples
* this is useful for passing in additional metadata, like an ID associated with the text, or a page number
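
A toy version of the tuple mode, in plain Python: only the text is processed, and the context object is handed back untouched next to its result. `toy_pipe_with_context` is a made-up name and `str.split` stands in for the real processing.

```python
def toy_pipe_with_context(data, as_tuples=False):
    # Hypothetical helper: yields results, or (result, context) pairs.
    if as_tuples:
        for text, context in data:
            yield text.split(), context   # context passes through unchanged
    else:
        for text in data:
            yield text.split()

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

results = list(toy_pipe_with_context(data, as_tuples=True))
for tokens, context in results:
    print(tokens, context["page_number"])
```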

### Passing in context (2)

In [4]:
from spacy.tokens import Doc

Doc.set_extension("id", default=None)
Doc.set_extension("page_number", default=None)

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16})
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]

**Notes**
* you can even add the context metadata to custom attributes
* EX: we're registering two extensions, `id` and `page_number`, which default to `None`
* after processing the text and passing through the context, we can overwrite the doc extensions with our context metadata

### Using only the tokenizer (1)

text --> nlp \[ tokenizer -> tagger -> parser -> ner -> ... \] --> Doc

* don't run the whole pipeline

**Notes**
* another common scenario: sometimes you already have a model loaded to do other processing, but you only need the tokenizer for one particular text
* Running the whole pipeline is unnecessarily slow because you'll be getting a bunch of predictions from the model that you don't need

### Using only the tokenizer (2)
* Use `nlp.make_doc` to turn a text into a `Doc` object

In [None]:
# (did not run)
# BAD
doc = nlp("Hello world")

# GOOD
doc = nlp.make_doc("Hello world")

**Notes**
* if you only need a tokenized `Doc` object you can use the `nlp.make_doc` method instead, which takes a text and returns a doc
* this is also how spaCy does it behind the scenes: `nlp.make_doc` turns the text into a doc before the pipeline components are called

### Disabling pipeline components
* Use `nlp.disable_pipes` to temporarily disable one or more pipes

In [None]:
# (did not run)
# Disable tagger and parser
with nlp.disable_pipes("tagger", "parser"):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)

* Restores them after the `with` block
* Only runs the remaining components

**Notes**
* spaCy also allows you to temporarily disable pipeline components using the `nlp.disable_pipes` context manager
* it takes a variable number of arguments, the string names of the pipeline components to disable
    - EX: if you only want to use the entity recognizer to process a document, you can temporarily disable the tagger and parser
* after the `with` block, the disabled pipeline components are automatically restored
* in the `with` block, spaCy will only run the remaining components
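
The disable-then-restore behavior is a typical context manager pattern. The sketch below applies it to a plain list of component names; `toy_disable_pipes` is hypothetical and only models the behavior described in the notes.

```python
from contextlib import contextmanager

@contextmanager
def toy_disable_pipes(names, *disabled):
    # Hypothetical helper: drop the given names, then restore the original list.
    original = list(names)
    names[:] = [name for name in names if name not in disabled]
    try:
        yield names
    finally:
        names[:] = original   # restored even if the block raises

pipe_names = ["tagger", "parser", "ner"]

with toy_disable_pipes(pipe_names, "tagger", "parser"):
    inside = list(pipe_names)   # only the remaining components

after = list(pipe_names)        # everything is back after the block
print(inside, after)  # ['ner'] ['tagger', 'parser', 'ner']
```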

## 14. Processing streams

In [None]:
# Part 1
# (do not run)
# OLD CODE
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("exercises/en/tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the adjectives
for text in TEXTS:
    doc = nlp(text)
    print([token.text for token in doc if token.pos_ == "ADJ"])

In [None]:
# (do not run)
# NEW CODE
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("exercises/en/tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])

In [None]:
# Part 2
# (do not run)
# OLD CODE
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("exercises/en/tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the entities
docs = [nlp(text) for text in TEXTS]
entities = [doc.ents for doc in docs]
print(*entities)

In [None]:
# (do not run)
# NEW CODE
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("exercises/en/tweets.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

In [7]:
# Part 3
# OLD CODE
from spacy.lang.en import English

nlp = English()

people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

# Create a list of patterns for the PhraseMatcher
patterns = [nlp(person) for person in people]

In [8]:
# NEW CODE
from spacy.lang.en import English

nlp = English()

people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))

## 15. Processing data with context

In [None]:
# (do not run)
import json
from spacy.lang.en import English
from spacy.tokens import Doc

with open("exercises/en/bookquotes.json", encoding="utf8") as f:
    DATA = json.loads(f.read())

nlp = English()

# Register the Doc extension "author" (default None)
Doc.set_extension("author", default=None)

# Register the Doc extension "book" (default None)
Doc.set_extension("book", default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context["book"]
    doc._.author = context["author"]

    # Print the text and custom attribute data
    print(f"{doc.text}\n — '{doc._.book}' by {doc._.author}\n")

## 16. Selective processing

In [9]:
# Part 1
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Only tokenize the text
doc = nlp.make_doc(text)
print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


In [11]:
# Part 2
import spacy

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Disable the tagger and parser
with nlp.disable_pipes("tagger", "parser"):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

(American, College Park, Georgia)
