## [Chapter 3](https://course.spacy.io/chapter3)
This chapter introduces everything you need to know about spaCy's processing pipeline. You'll learn what goes on under the hood when you process a text, how to write your own components and add them to the pipeline, and how to use custom attributes to add your own metadata to documents, spans and tokens.

In [4]:
import spacy
import json

# $ python -m spacy download en_core_web_sm
# $ python -m spacy download en_core_web_md
# $ python -m spacy download en_core_web_lg

from IPython.display import Image, HTML

#### Processing Pipeline. What happens when you call nlp?
First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the Doc in order. In this case, those are the tagger, then the parser, then the entity recognizer. Finally, the processed Doc is returned, so you can work with it.

 - The part-of-speech tagger sets the token.tag attribute.
 - The dependency parser adds the token.dep and token.head attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks.
 - The named entity recognizer adds the detected entities to the doc.ents property. It also sets entity type attributes on the tokens that indicate whether a token is part of an entity.
 - Finally, the text classifier sets category labels that apply to the whole text, and adds them to the doc.cats property.

Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default. But you can use it to train your own system.
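
The flow described above (tokenize first, then apply each component in order) can be sketched in plain Python. This is a conceptual illustration only, not spaCy's actual implementation: the dictionary-based "doc" and the functions tokenizer, tagger, entity_recognizer and process are hypothetical stand-ins.

```python
# Conceptual sketch only: spaCy's real pipeline is implemented differently.

def tokenizer(text):
    # Stand-in for the tokenizer: turn raw text into a "doc" (here, a dict)
    return {"tokens": text.split(), "tags": [], "ents": []}

def tagger(doc):
    # Stand-in tagger: label capitalised tokens as proper nouns
    doc["tags"] = ["PROPN" if tok[0].isupper() else "X" for tok in doc["tokens"]]
    return doc

def entity_recognizer(doc):
    # Stand-in entity recognizer: it reads the tags set by the tagger,
    # which is why the order of components matters
    doc["ents"] = [tok for tok, tag in zip(doc["tokens"], doc["tags"]) if tag == "PROPN"]
    return doc

def process(text, pipeline):
    # Tokenize once, then pass the doc through each component in order
    doc = tokenizer(text)
    for name, component in pipeline:
        doc = component(doc)
    return doc

pipeline = [("tagger", tagger), ("ner", entity_recognizer)]
doc = process("we visited London", pipeline)
print(doc["ents"])  # ['London']
```

Each component takes a doc and returns it, which is the same contract the custom components later in this chapter follow.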

In [3]:
Image(url="https://course.spacy.io/pipeline.png")

In [5]:
# Let’s inspect the small English model’s pipeline!

# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x000002CD4443FA90>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x000002CD45A4AD68>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x000002CD45A4ADC8>)]


#### Customising Pipelines

In [17]:
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Hey, this doc is length:{}'.format(len(doc)))
    # Return the doc object
    return doc

# Add the component last in the pipeline
nlp.add_pipe(custom_component, last=True)  # position args: first, last, before, after

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

# Process a text
doc = nlp("Hello world!")

Pipeline: ['tagger', 'parser', 'ner', 'custom_component']
Hey, this doc is length:3


In this exercise, you’ll be writing a custom component that uses the PhraseMatcher to find animal names in the document and adds the matched spans to the doc.ents. A PhraseMatcher with the animal patterns has already been created as the variable matcher.

##### Note: a component that assigns to an attribute such as doc.ents overwrites its existing values.

In [27]:
nlp = spacy.load("en_core_web_lg")
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

In [28]:
# Print the pipeline component names
print('Orig Pipeline:', nlp.pipe_names)

# Create the PhraseMatcher and add the animal patterns
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component in pipeline
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after="ner")
print('\nNew Pipeline:', nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever. I live in New York city. London is a nice city of 4 million people. England has lots of rain.")

print([(ent.text, ent.label_) for ent in doc.ents])

Orig Pipeline: ['tagger', 'parser', 'ner']

New Pipeline: ['tagger', 'parser', 'ner']
[('New York', 'GPE'), ('London', 'GPE'), ('4 million', 'CARDINAL'), ('England', 'GPE')]
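
As the note above says, assigning to doc.ents replaces whatever entities were already set. One way to keep both is to merge the new spans with the existing ones. The sketch below makes several assumptions: spaCy 2.2.2 or later (for the list-style matcher.add call and spacy.util.filter_spans), a blank English pipeline instead of a trained model, and a GPE entity set by hand to stand in for a model prediction. The name animal_component_keep_ents is hypothetical, and the function is called directly rather than added with nlp.add_pipe.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
# List-style signature (assumed available: spaCy 2.2.2 or later)
matcher.add("ANIMAL", [nlp.make_doc("cat"), nlp.make_doc("Golden Retriever")])

def animal_component_keep_ents(doc):
    # Create a Span labelled 'ANIMAL' for each match
    matches = matcher(doc)
    spans = [Span(doc, start, end, label="ANIMAL") for match_id, start, end in matches]
    # Merge with the existing entities instead of overwriting them;
    # filter_spans drops overlapping spans, preferring the longest one
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc

doc = nlp.make_doc("I have a cat in New York")
# Hand-set entity standing in for a model prediction (tokens 5-6: "New York")
doc.ents = [Span(doc, 5, 7, label="GPE")]

doc = animal_component_keep_ents(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Filtering overlaps matters because doc.ents does not accept overlapping spans.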


#### Extension Attributes
Custom attributes let you add any metadata to Docs, Tokens and Spans. The data can be added once, or it can be computed dynamically.

Custom attributes are available via the ._ (dot-underscore) property. This makes it clear that they were added by the user and are not built into spaCy, like token.text.

Attributes need to be registered on the global Doc, Token and Span classes, which you can import from spacy.tokens. You've already worked with those in the previous chapters. To register a custom attribute on the Doc, Token or Span, you can use the set_extension method.

The first argument is the attribute name. Keyword arguments let you define how the value should be computed. In the example below, the attribute has a default value and can be overwritten.

In [31]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension("is_country", default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [32]:
# Let’s try setting some more complex attributes using getters and method extensions.
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Define the getter function
def get_has_number(doc):
    # Return whether any token in the doc returns True for token.like_num
    return any(token.like_num for token in doc)


# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension("has_number", getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

has_number: True
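
The cell above covers the getter case but not the method extensions mentioned in its opening comment. A minimal sketch of one follows, assuming a blank English pipeline. The function to_html and its tag parameter are illustrative names. force=True is passed only so the cell can be re-run without an "already registered" error.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

def to_html(doc, tag):
    # A method extension receives the object (here the Doc) as its first
    # argument, followed by whatever arguments the caller passes
    return "<{tag}>{text}</{tag}>".format(tag=tag, text=doc.text)

# Register the Doc method extension 'to_html'
Doc.set_extension("to_html", method=to_html, force=True)

doc = nlp("Hello world")
html = doc._.to_html("strong")
print(html)  # <strong>Hello world</strong>
```

Unlike a getter, a method extension is called with parentheses and can take arguments, so it suits values that depend on caller input.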


#### Entities & Extensions

In this exercise, you’ll combine custom extension attributes with the model’s predictions and create an attribute getter that returns a Wikipedia search URL if the span is a person, organization, or location.

In [41]:
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url, force=True)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture. He lived in London."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.label_, ent.text, ent._.wikipedia_url)

DATE over fifty years None
ORDINAL first None
PERSON David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie
GPE London https://en.wikipedia.org/w/index.php?search=London


In [45]:
# Extension attributes are especially powerful if they’re combined with custom pipeline components. 
# In this exercise, you’ll write a pipeline component that finds country names and a custom extension attribute that returns a country’s capital, if available.
# A phrase matcher with all countries is available as the variable matcher.
# A dictionary of countries mapped to their capital cities is available as the variable CAPITALS.

import json
from spacy.lang.en import English
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

with open("exercises/countries.json") as f:
    COUNTRIES = json.loads(f.read())

with open("exercises/capitals.json") as f:
    CAPITALS = json.loads(f.read())

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))


def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    matches = matcher(doc)
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matches]
    return doc


# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)

# Register the Span extension attribute 'capital' with the getter get_capital
Span.set_extension("capital", getter=get_capital)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


#### Scaling & Performance

If you need to process a lot of texts and create a lot of Doc objects in a row, the nlp.pipe method can speed this up significantly.
It processes the texts as a stream and yields Doc objects.
It is much faster than calling nlp on each text individually, because it batches up the texts.
nlp.pipe is a generator that yields Doc objects, so to get a list of Docs, remember to wrap the call in list().

BAD:

    docs = [nlp(text) for text in LOTS_OF_TEXTS]

GOOD:

    docs = list(nlp.pipe(LOTS_OF_TEXTS))
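
Because nlp.pipe yields Docs one at a time, you can also consume the stream in a loop instead of building a list, which avoids holding every Doc in memory at once. The sketch below assumes a blank English pipeline and three made-up texts. The batch_size value of 2 is arbitrary and chosen only to show the keyword.

```python
from spacy.lang.en import English

nlp = English()
texts = ["First text.", "A second text here.", "Third."]

token_counts = []
# Docs are yielded in the same order as the input texts
for doc in nlp.pipe(texts, batch_size=2):
    token_counts.append(len(doc))

print(token_counts)  # [3, 5, 2]
```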

In [46]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("exercises/tweets.json") as f:
    TEXTS = json.loads(f.read())

# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

(McDonalds,) (@McDonalds,) (McDonalds,) (McDonalds, Spain) (The Arch Deluxe,) (WANT, McRib) (This morning,)


In [49]:
from spacy.lang.en import English

nlp = English()

people = ["David Bowie", "Angela Merkel", "Lady Gaga"]

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))
patterns

[David Bowie, Angela Merkel, Lady Gaga]

In [50]:
# In this exercise, you’ll be using custom attributes to add author and book meta information to quotes.
# A list of [text, context] examples is available as the variable DATA. The texts are quotes from famous books, and the contexts are dictionaries with the keys 'author' and 'book'.

import json
from spacy.lang.en import English
from spacy.tokens import Doc

with open("exercises/bookquotes.json") as f:
    DATA = json.loads(f.read())

nlp = English()

# Register the Doc extension 'author' (default None)
Doc.set_extension("author", default=None)

# Register the Doc extension 'book' (default None)
Doc.set_extension("book", default=None)

for doc, context in nlp.pipe(DATA, as_tuples=True):
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context["book"]
    doc._.author = context["author"]

    # Print the text and custom attribute data
    print(doc.text, "\n", "— '{}' by {}".format(doc._.book, doc._.author), "\n")

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 



In [55]:
# In this exercise, you’ll use the nlp.make_doc and nlp.disable_pipes methods to only run selected components when processing a text.

nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Only tokenize the text
doc = nlp.make_doc(text)
print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']
