### What are trained pipelines?

-Models that enable spaCy to predict linguistic attributes in context:

    -Part-of-speech tags

    -Syntactic dependencies

    -Named entities

-Trained on labeled example texts

-Can be updated with more examples to fine-tune predictions

In [4]:
# Pipeline packages
#!python -m spacy download en_core_web_sm
import spacy

The "en_core_web_sm" package is a small English pipeline that supports all core capabilities and is trained on web text.


The spacy.load function loads a pipeline package by name and returns an nlp object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, meta information about the pipeline, and the configuration file used to train it. The configuration tells spaCy which language class to use and how to configure the processing pipeline.
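As a rough illustration of the meta information and configuration described above, the sketch below reads a few of those values from an nlp object. It assumes a blank English pipeline created with spacy.blank("en") so that it runs without downloading a package; with a loaded "en_core_web_sm" the same attributes would also list the trained components. The helper name summarize_pipeline is made up for this example.

```python
import spacy


def summarize_pipeline(nlp):
    """Collect a few facts that an nlp object exposes about its pipeline."""
    return {
        "lang": nlp.meta["lang"],                   # language code from the meta information
        "language_class": type(nlp).__name__,       # the Language subclass in use
        "components": list(nlp.pipe_names),         # processing components, in order
        "config_lang": nlp.config["nlp"]["lang"],   # language recorded in the config
    }


# Assumption: a blank pipeline stands in for a downloaded package here
nlp = spacy.blank("en")
summary = summarize_pipeline(nlp)
print(summary)
```

A blank pipeline has no components, so the "components" list is empty; a trained package would typically list entries such as a tagger, parser and entity recognizer.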



In [5]:
nlp = spacy.load("en_core_web_sm")

What’s not included in a pipeline package that you can load into spaCy?


-A config file describing how to create the pipeline.

-Binary weights to make statistical predictions.

-The labelled data that the pipeline was trained on. (Correct)

-Strings of the pipeline's vocabulary and their hashes.

The training data is the correct answer: trained pipelines generalize from a set of training examples, and once trained they use binary weights to make predictions. That’s why it’s not necessary to ship them with their training data.

In [6]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


### Predicting linguistic annotations

To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").
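A small sketch of spacy.explain on its own, outside any loop over tokens. It only needs the spaCy library itself, not a trained pipeline; the three labels are picked from the outputs shown in this notebook.

```python
import spacy

# Look up human-readable descriptions for a few tags and labels
for label in ["PROPN", "GPE", "nsubj"]:
    print(f"{label:<8}{spacy.explain(label)}")

# The return value is a plain string that can be stored or logged
propn_description = spacy.explain("PROPN")
gpe_description = spacy.explain("GPE")
```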

In [7]:


#Process the text with the nlp object and create a doc.
#For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      ccomp     
official    ADJ       acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [8]:

# Process the text and create a doc object.
# Iterate over the doc.ents and print the entity text and label_ attribute.
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


### Predicting named entities in context

Process the text with the nlp object.
Iterate over the entities and print the entity text and label.
Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [9]:
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
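The slice above is an unlabelled span. One possible next step, sketched here under stated assumptions, is to wrap the same tokens in a Span with a label and add it to doc.ents. The label "PRODUCT" is an illustrative choice, and a blank English pipeline (tokenizer only) is used so the example runs without a trained package.

```python
import spacy
from spacy.tokens import Span

# Assumption: tokenizer-only pipeline, so no entities are predicted
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Slice the doc to get the tokens for "iPhone X"
iphone_x = doc[1:3]

# Build a labelled span over the same tokens; "PRODUCT" is an illustrative label
product_span = Span(doc, iphone_x.start, iphone_x.end, label="PRODUCT")

# Register it as an entity alongside whatever entities the doc already has
doc.ents = list(doc.ents) + [product_span]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained pipeline, doc.ents would already contain the predicted entities (such as "Apple"), and the new span would be appended to them as long as it does not overlap an existing entity.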


### Rule-based matching

Why not just regular expressions?

-Match on Doc objects, not just strings

-Match on tokens and token attributes

-Use a model's predictions

-Example: "duck" (verb) vs. "duck" (noun)

The Matcher is also more flexible than regular expressions: you can search for exact texts as well as other lexical attributes.

You can even write rules that use a model's predictions.
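To make the "duck" example concrete, here is a minimal sketch. A regular expression for "duck" would hit both occurrences, while a token pattern can also require a part-of-speech tag. The POS tags below are assigned by hand when building the Doc; that is an assumption standing in for a trained tagger's predictions, so the example runs without a pipeline package.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: hand-assigned POS tags instead of model predictions
words = ["The", "duck", "saw", "me", "duck"]
pos = ["DET", "NOUN", "VERB", "PRON", "VERB"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
matcher.add("DUCK_VERB", [[{"LOWER": "duck", "POS": "VERB"}]])
matcher.add("DUCK_NOUN", [[{"LOWER": "duck", "POS": "NOUN"}]])

# Report the pattern name, token index and text for each match
results = [
    (nlp.vocab.strings[match_id], start, doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(results)
```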

Match patterns



Lists of dictionaries, one per token

Match exact token texts

[{"TEXT": "iPhone"}, {"TEXT": "X"}]


Match lexical attributes

[{"LOWER": "iphone"}, {"LOWER": "x"}]


Match any token attributes

[{"LEMMA": "buy"}, {"POS": "NOUN"}]
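The difference between "TEXT" and "LOWER" can be sketched as follows. The sentence and the helper matched_texts are invented for this example, and a blank English pipeline is assumed because both attributes only depend on the tokenizer.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, no trained model needed
nlp = spacy.blank("en")
doc = nlp("I sold my iphone x and bought an iPhone X")


def matched_texts(pattern):
    """Return the text of every span in doc that the given pattern matches."""
    matcher = Matcher(nlp.vocab)
    matcher.add("EXAMPLE", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]


# "TEXT" is case-sensitive, "LOWER" compares the lowercased token text
exact = matched_texts([{"TEXT": "iPhone"}, {"TEXT": "X"}])
lowercase = matched_texts([{"LOWER": "iphone"}, {"LOWER": "x"}])
print(exact)
print(lowercase)
```

The exact-text pattern only finds the capitalized mention, while the lowercase pattern finds both spellings.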

In [10]:
# Import the Matcher
from spacy.matcher import Matcher

# Load a pipeline and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocabulary, nlp.vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher with matcher.add:
#    The first argument is a unique ID to identify which pattern was matched.
#    The second argument is a list of patterns.
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

When you call the matcher on a doc, it returns a list of tuples.

Each tuple consists of three values: the match ID, the start index and the end index of the matched span.

This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.


In [12]:

# Call the matcher on the doc
doc = nlp("Upcoming iPhone X release date leaked")

matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
# match_id: hash value of the pattern name
# start: start index of the matched span
# end: end index of the matched span

iPhone X
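Because match_id is a hash of the pattern name, it can be turned back into the name through the shared string store. The sketch below shows that lookup, plus the as_spans=True option, which is assumed to be available (spaCy v3) and returns Span objects labelled with the pattern name. A blank English pipeline is used since "TEXT" patterns only need the tokenizer.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline is enough for "TEXT" patterns
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked")

# Look the hash up in the string store to recover the pattern name
names = [nlp.vocab.strings[match_id] for match_id, start, end in matcher(doc)]
print(names)

# as_spans=True returns Span objects instead of (match_id, start, end) tuples
spans = matcher(doc, as_spans=True)
labelled = [(span.text, span.label_) for span in spans]
print(labelled)
```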


In [15]:
### Matching lexical attributes
doc = nlp("2018 FIFA World Cup: France won!")

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
matcher.add("2018 FIFA", [pattern])

matches = matcher(doc)


for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:
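To see why that pattern matches, it can help to print the lexical attributes the matcher compares against. This sketch assumes a blank English pipeline, since is_digit, is_punct and lower_ do not depend on a trained model.

```python
import spacy

# Assumption: tokenizer-only pipeline; lexical attributes need no trained model
nlp = spacy.blank("en")
doc = nlp("2018 FIFA World Cup: France won!")

# One row per token: text, lowercase form, is_digit flag, is_punct flag
rows = [(token.text, token.lower_, token.is_digit, token.is_punct) for token in doc]
for text, lower, is_digit, is_punct in rows:
    print(f"{text:<8}{lower:<8}{is_digit!s:<7}{is_punct!s:<7}")
```

The first token is the only one with is_digit set, and the colon after "Cup" is a separate token with is_punct set, which is what the last dictionary in the pattern relies on.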


In [17]:
### Matching other token attributes
doc = nlp("I loved dogs but now I love cats more.")

pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]
matcher.add("love", [pattern])

matches = matcher(doc)


for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


In [18]:
### Using operators and quantifiers (1)
doc = nlp("I bought a smartphone. Now I'm buying apps.")

pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

matcher.add("buy", [pattern])

matches = matcher(doc)


for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


### Using operators and quantifiers (2)

"OP" can have one of four values:

An "!" negates the token, so it's matched 0 times. {"OP": "!"}

A "?" makes the token optional, and matches it 0 or 1 times. {"OP": "?"}

A "+" matches a token 1 or more times. {"OP": "+"}

And finally, an "*" matches 0 or more times. {"OP": "*"}

Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.
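A short sketch comparing "?", "+" and "*" on the same text. The sentence and the helper matched_texts are made up for illustration, and a blank English pipeline is assumed because the patterns only use "LOWER". Results are collected in sets because the matcher reports every possible match, including shorter ones contained in longer ones.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, since only "LOWER" is used
nlp = spacy.blank("en")
doc = nlp("good, very good, very very good")


def matched_texts(pattern):
    """Return the distinct span texts in doc matched by the given pattern."""
    matcher = Matcher(nlp.vocab)
    matcher.add("EXAMPLE", [pattern])
    return {doc[start:end].text for _, start, end in matcher(doc)}


# "very" is optional: 0 or 1 times before "good"
optional = matched_texts([{"LOWER": "very", "OP": "?"}, {"LOWER": "good"}])
# "very" is required at least once
one_or_more = matched_texts([{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}])
# "very" may appear any number of times, including not at all
zero_or_more = matched_texts([{"LOWER": "very", "OP": "*"}, {"LOWER": "good"}])

print(sorted(optional))
print(sorted(one_or_more))
print(sorted(zero_or_more))
```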

In [20]:
### Let’s try spaCy’s rule-based Matcher. You’ll use the example from the previous exercise and write a pattern that matches the phrase “iPhone X” in the text.

#Import the Matcher from spacy.matcher.
#Initialize it with the nlp object’s shared vocab.
#Create a pattern that matches the "TEXT" values of two tokens: "iPhone" and "X".
#Use the matcher.add method to add the pattern to the matcher.
#Call the matcher on the doc and store the result in the variable matches.
#Iterate over the matches and get the matched span from the start to the end index.

#import spacy

# Import the Matcher
#from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


### Writing match patterns 

In [None]:
#Part 1
#Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [21]:
#import spacy
#from spacy.matcher import Matcher

#nlp = spacy.load("en_core_web_sm")
#matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


Part 2: Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).

In [24]:
#import spacy
#from spacy.matcher import Matcher

#nlp = spacy.load("en_core_web_sm")
#matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


Part 3: Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).

In [25]:
#import spacy
#from spacy.matcher import Matcher

#nlp = spacy.load("en_core_web_sm")
#matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
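The exercise cells above leave the Matcher(nlp.vocab) line commented out, so each matcher.add call extends the matcher created earlier and all previously added patterns stay active. The outputs are unaffected here only because the older patterns find nothing in the newer texts. The sketch below, with an invented sentence and a blank English pipeline as assumptions, shows how to check what a matcher contains and how to remove a pattern.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; both patterns use lexical attributes only
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_X_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
matcher.add("IOS_VERSION_PATTERN", [[{"TEXT": "iOS"}, {"IS_DIGIT": True}]])

doc = nlp("iOS 11 runs on the iPhone X")

# Both patterns are active, so one call reports matches from each of them
names_before = {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}

# Drop a pattern that is no longer needed
matcher.remove("IPHONE_X_PATTERN")
names_after = {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}

rule_count = len(matcher)
still_present = "IPHONE_X_PATTERN" in matcher
print(names_before, names_after, rule_count, still_present)
```

Creating a fresh Matcher(nlp.vocab) at the start of each exercise is a simpler alternative when the earlier patterns are not needed.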
