#### Advanced NLP with spaCy
#### Chapter 3: Processing Pipelines
##### 1. Under the hood

Built-in pipeline components and the attributes they set:
1. tagger: Token.tag, Token.pos
1. parser: Token.dep, Token.head, Doc.sents, Doc.noun_chunks
1. ner: Doc.ents, Token.ent_iob, Token.ent_type
1. textcat: Doc.cats (not included by default, but available)
1. (custom components)

Calling nlp on a text returns the processed Doc object.

Each pipeline is defined by a config.cfg file, which lists, among other settings, the pipeline components:

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
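
The settings stored in config.cfg can also be inspected at runtime through nlp.config. The following is a minimal sketch, assuming a blank English pipeline with only the rule-based sentencizer added, so that it runs without downloading a trained model:

```python
import spacy

# A blank pipeline needs no model download; the sentencizer is rule-based.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# The [nlp] section holds the language code and the ordered component names.
print(nlp.config["nlp"]["lang"])
print(nlp.config["nlp"]["pipeline"])

# The [components] section records which factory creates each component.
print(nlp.config["components"]["sentencizer"]["factory"])
```

For a trained pipeline such as en_core_web_sm, the same keys should list the components printed above.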


In [3]:
nlp = spacy.load("en_core_web_lg")
for component in nlp.pipeline:
    print(component)

('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x000001D0054FEE70>)
('tagger', <spacy.pipeline.tagger.Tagger object at 0x000001D0054FDB50>)
('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x000001D0046BFA70>)
('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x000001D004988B50>)
('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x000001D004982310>)
('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x000001D0046BFAE0>)


##### 2. What happens when you *call* nlp?
Calling nlp on a text first runs the tokenizer to create a Doc, then applies each pipeline component to that Doc in the order the components are defined.
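
The two steps can be reproduced by hand. This is only a rough sketch of what the call does, again assuming a blank English pipeline with a sentencizer; the real call also handles details such as error handling and disabled components:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, needs no trained model

text = "Hello world. This is a test."

# Step 1: the tokenizer turns the text into a Doc.
doc = nlp.make_doc(text)

# Step 2: each component receives the Doc and returns it, in pipeline order.
for name, component in nlp.pipeline:
    doc = component(doc)

manual_sentences = [sent.text for sent in doc.sents]
direct_sentences = [sent.text for sent in nlp(text).sents]

# Both routes should yield the same sentence boundaries.
print(manual_sentences == direct_sentences)
```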

##### 3. Inspecting the pipeline

In [4]:
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
print(nlp.pipeline)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x000001D00847B410>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x000001D0084793D0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x000001D00936C120>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x000001D0066E4B50>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x000001D0067414D0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x000001D00541FD10>)]


##### 4. Custom pipeline components
Custom components are most often used to add custom metadata to documents and tokens, or to update entities from a custom knowledge base.

A bare-bones "hello world" component to illustrate its structure:

In [5]:
from spacy.language import Language

nlp = spacy.load("en_core_web_sm") # Create new one so that running cell twice doesn't create name collision.

@Language.component("custom_component")
def custom_component_function(doc):    # Can be any callable.
    # You'd normally do something useful here.
    return doc

nlp.add_pipe("custom_component");

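Four keyword arguments to add_pipe control where the component is inserted: last (the default), first, before, and after. Only one of them can be set per call.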
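
A small sketch of the four placement options. The component name position_demo and the instance names added_* are made up for illustration, and the same do-nothing component is added several times under different names:

```python
import spacy
from spacy.language import Language

@Language.component("position_demo")  # hypothetical component name
def position_demo(doc):
    return doc  # does nothing; only its position in the pipeline matters here

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

nlp.add_pipe("position_demo", name="added_last")                          # default: appended at the end
nlp.add_pipe("position_demo", name="added_first", first=True)             # moved to the front
nlp.add_pipe("position_demo", name="added_before", before="sentencizer")  # just before a named component
nlp.add_pipe("position_demo", name="added_after", after="sentencizer")    # just after a named component

print(nlp.pipe_names)
# ['added_first', 'added_before', 'sentencizer', 'added_after', 'added_last']
```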

##### 5. Use cases for custom components

Custom components are useful for:
- Computing your own values based on tokens and their attributes
- Adding named entities, for example based on a dictionary

##### 6. Simple components

In [6]:
@Language.component("length_component")
def length_component_function(doc):
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    return doc

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe("length_component", first=True)
doc = nlp("Cats have distinct personalities.")

This document is 5 tokens long.


##### 7. Complex components

In [7]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", animal_patterns)

@Language.component("animal_component")
def animal_component_function(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="ANIMAL") for _, start, end in matches]
    doc.ents = spans
    return doc

nlp.add_pipe("animal_component")
print(nlp.pipe_names)

doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


##### 8. Extensions
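Custom extensions are accessed through the ._ attribute, which keeps user-added data separate from spaCy's built-in attributes.

Extensions must be registered on Doc, Token, or Span with set_extension before they can be used.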
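
A minimal sketch of registration on all three classes. The extension names prefixed with demo_ are invented for this example, and a blank English pipeline is assumed so that no model is required:

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Register one extension per class; force=True allows rerunning the cell.
Doc.set_extension("demo_title", default=None, force=True)
Token.set_extension("demo_is_color", default=False, force=True)
Span.set_extension("demo_has_color", default=False, force=True)

nlp = spacy.blank("en")
doc = nlp("The sky is blue.")

# Registered extensions can be read and written through ._
doc._.demo_title = "Weather"
doc[3]._.demo_is_color = True
doc[1:4]._.demo_has_color = True

# has_extension reports whether a name has been registered on the class.
registered = Token.has_extension("demo_is_color")
not_registered = Token.has_extension("demo_unregistered")
print(registered, not_registered)
```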

Attribute:

In [8]:
from spacy.tokens import Token

Token.set_extension("is_color", default=False, force=True) # Force is needed so we can run the cell more than once
doc = nlp("The sky is blue.")
doc[3]._.is_color = True

Property:

In [9]:
def get_is_color(token):
    colors = ["red", "yellow", "blue"]
    return token.text in colors

Token.set_extension("is_color", getter=get_is_color, force=True)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, "-", doc[3].text)

True - blue


In [10]:
from spacy.tokens import Span

Span.set_extension(
    "has_color", 
    getter=lambda span: any(token.text in ("red", "yellow", "blue") for token in span), 
    force=True)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, "-", doc[1:4].text)
print(doc[0:2]._.has_color, "-", doc[0:2].text)

True - sky is blue
False - The sky


Methods:  
(lets you pass arguments to the extension)

In [11]:
from spacy.tokens import Doc

def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc    # Return the boolean result, not the doc itself.

Doc.set_extension("has_token", method=has_token, force=True)
doc = nlp("The sky is blue.")
print(doc._.has_token("blue"), "- blue")
print(doc._.has_token("cloud"), "- cloud")

True - blue
False - cloud


##### 9. and 10. Setting extension attributes

In [12]:
nlp = spacy.blank("en")

Token.set_extension("is_country", default=False, force=True)
doc = nlp("I live in Spain.")
doc[3]._.is_country = True
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [13]:
nlp = spacy.blank("en")

def get_reversed(token):
    return token.text[::-1]

Token.set_extension("reversed", getter=get_reversed, force=True)
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


In [14]:
nlp = spacy.blank("en")

def get_has_number(doc):
    return any(token.like_num for token in doc)

Doc.set_extension("has_number", getter=get_has_number, force=True)

doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

has_number: True


In [15]:
nlp = spacy.blank("en")

def to_html(span, tag):
    return f"<{tag}>{span.text}</{tag}>"

Span.set_extension("to_html", method=to_html, force=True)

doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

<strong>Hello world</strong>


##### 11. Entities and extensions

In [16]:
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_wikipedia_url(span):
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):    # spaCy's English models label locations as LOC.
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Register on the Span class.
Span.set_extension("wikipedia_url", getter=get_wikipedia_url, force=True)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    print(ent.text, ent._.wikipedia_url)    # Read the extension from each entity.

fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


##### 12. Components with extensions

In [17]:
import json
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("countries.json", encoding="utf8") as f:
    COUNTRIES = json.load(f)

with open("capitals.json", encoding="utf8") as f:
    CAPITALS = json.load(f)

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))

@Language.component("countries_component")
def countries_component_function(doc):
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for _, start, end in matches]
    return doc

nlp.add_pipe("countries_component")
print(nlp.pipe_names)

get_capital = lambda span: CAPITALS.get(span.text)

Span.set_extension("capital", getter=get_capital, force=True)

doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


##### 13. Scaling and performance
When processing large volumes of text, use nlp.pipe instead of calling nlp on each text. It is a generator that processes the texts as a stream, in batches, and yields Doc objects.

Calling nlp on each text one at a time is a lot slower.
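
The two approaches side by side, as a sketch on a blank English pipeline with made-up example texts. With a tokenizer-only pipeline the speed difference is small; it matters mainly with trained components:

```python
import spacy

nlp = spacy.blank("en")
texts = ["First text.", "Second text.", "Third text."]

# Slower: every text is processed in a separate call.
docs_one_by_one = [nlp(text) for text in texts]

# Faster on large inputs: texts are streamed and processed in batches.
# nlp.pipe returns a generator, so list() is used to consume it.
docs_streamed = list(nlp.pipe(texts, batch_size=2))

print([doc.text for doc in docs_streamed])
```

The batch_size value here is arbitrary; a suitable value depends on the pipeline and the hardware.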

nlp.pipe also accepts (text, context) tuples when as_tuples=True is set, and then yields (doc, context) tuples:

In [18]:
data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context["page_number"])

This is a text 15
And another text 16


In [19]:
from spacy.tokens import Doc

Doc.set_extension("id", default=None, force=True)    # force=True so the cell can be run more than once
Doc.set_extension("page_number", default=None, force=True)

data = [
    ("This is a text", {"id": 1, "page_number": 15}),
    ("And another text", {"id": 2, "page_number": 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context["id"]
    doc._.page_number = context["page_number"]

When you only need the tokenizer:

In [20]:
doc = nlp.make_doc("Hello world!")
print([token.text for token in doc])
print([token.pos_ for token in doc])

['Hello', 'world', '!']
['', '', '']


Disabling individual pipeline components:

In [27]:
nlp = spacy.load("en_core_web_sm")
with nlp.select_pipes(disable=["tagger", "parser"]):
    doc = nlp("Hello United States!")
    print(doc.ents)
    print(doc[0].pos_)    # Prints an empty string, because the tagger is disabled.

(United States,)
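
select_pipes only disables components inside the with block and restores them afterwards. It also accepts enable, which disables everything except the listed components. A short sketch, assuming a blank pipeline and a made-up do-nothing component named select_demo_passthrough:

```python
import spacy
from spacy.language import Language

@Language.component("select_demo_passthrough")  # hypothetical component name
def select_demo_passthrough(doc):
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("select_demo_passthrough")

# disable: switch off the listed components for the duration of the block.
with nlp.select_pipes(disable=["sentencizer"]):
    names_with_disable = list(nlp.pipe_names)

# enable: keep only the listed components active.
with nlp.select_pipes(enable=["sentencizer"]):
    names_with_enable = list(nlp.pipe_names)

# Outside the blocks, the full pipeline is active again.
names_afterwards = list(nlp.pipe_names)

print(names_with_disable, names_with_enable, names_afterwards)
```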



##### 14. Processing streams

In [31]:
nlp = spacy.load("en_core_web_sm")
TEXTS = [
    "McDonalds is my favorite restaurant.",
    "Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..",
    "People really still eat McDonalds :(",
    "The McDonalds in Spain has chicken wings. My heart is so happy ",
    "@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P",
    "please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D",
    "This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it"
]

for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible']


In [33]:
docs = nlp.pipe(TEXTS)
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) () (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) () (This morning,)


In [39]:
nlp = spacy.blank("en")
people = ["David Bowie", "Angela Merkel", "Lady Gaga"]
patterns = list(nlp.pipe(people))
print(patterns)

[David Bowie, Angela Merkel, Lady Gaga]


##### 15. Processing data with context

In [40]:
DATA = [
    [
        "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.",
        { "author": "Franz Kafka", "book": "Metamorphosis" }
    ],
    [
        "I know not all that may be coming, but be it what it will, I'll go to it laughing.",
        { "author": "Herman Melville", "book": "Moby-Dick or, The Whale" }
    ],
    [
        "It was the best of times, it was the worst of times.",
        { "author": "Charles Dickens", "book": "A Tale of Two Cities" }
    ],
    [
        "The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.",
        { "author": "Jack Kerouac", "book": "On the Road" }
    ],
    [
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        { "author": "George Orwell", "book": "1984" }
    ],
    [
        "Nowadays people know the price of everything and the value of nothing.",
        { "author": "Oscar Wilde", "book": "The Picture Of Dorian Gray" }
    ]
]

In [46]:
from spacy.tokens import Doc

nlp = spacy.blank("en")
Doc.set_extension("author", default=None, force=True)
Doc.set_extension("book", default=None, force=True)
for doc, context in nlp.pipe(DATA, as_tuples=True):
    doc._.book = context["book"]
    doc._.author = context["author"]
    print(f"{doc.text}\n — '{doc._.book}' by {doc._.author}\n")

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
 — 'Metamorphosis' by Franz Kafka

I know not all that may be coming, but be it what it will, I'll go to it laughing.
 — 'Moby-Dick or, The Whale' by Herman Melville

It was the best of times, it was the worst of times.
 — 'A Tale of Two Cities' by Charles Dickens

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.
 — 'On the Road' by Jack Kerouac

It was a bright cold day in April, and the clocks were striking thirteen.
 — '1984' by George Orwell

Nowadays people know the price of everything and the value of nothing.
 — 'The Picture Of Dorian Gray' by Oscar Wilde



##### 16. Selective processing

In [47]:
nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)
doc = nlp.make_doc(text)
print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


In [50]:
nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)
with nlp.select_pipes(disable=["tagger", "lemmatizer"]):
    doc = nlp(text)
    print(doc.ents)

(Chick, American, College Park, Georgia)
