#### Advanced NLP with spaCy
#### [Chapter 1: Finding words, phrases, names and concepts](https://course.spacy.io/en/chapter1)

##### 1. Introduction to spaCy

Creating a blank processing pipeline, which by convention is named 'nlp':

In [26]:
import spacy
nlp = spacy.blank("en")

Calling nlp on a text returns a Doc object. Iterating over a Doc yields its tokens one at a time:

In [27]:
doc = nlp("Hello world!")
for token in doc:
    print(token.text)

Hello
world
!


Individual tokens can be retrieved by index:

In [28]:
token = doc[1]
print(token.text)

world


Slicing a Doc returns a Span object:

In [29]:
span = doc[1:3]
print(span.text)
print(type(span))

world!
<class 'spacy.tokens.span.Span'>
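
A Span is a view onto its parent Doc, so it remembers where it sits in the document. The sketch below assumes only a blank English pipeline and prints the start and end token indices of the slice. As with Python slicing, the end index is exclusive.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world!")
span = doc[1:3]

# start is inclusive and end is exclusive, mirroring the slice doc[1:3]
print(span.start, span.end)
# Number of tokens covered by the span
print(len(span))
# Each token in the span keeps its index in the parent Doc
print([token.i for token in span])
```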


Tokens expose many attributes. Here are a few of them:

In [30]:
doc = nlp("It costs $5.")
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])
print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]


Note that like_num is also True for numbers that are spelled out:

In [31]:
doc = nlp("That is ten items")
ten = doc[2]
print(ten.is_alpha, ten.like_num)

True True
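
The like_num attribute is rule-based and language-specific, so no trained model is needed to try it. As a rough sketch, the strings below are illustrative picks (not from the course) that the English rules are expected to treat as number-like, plus one ordinary word for contrast. The Doc is built from a word list, which is an assumption made here so that the tokenizer cannot split any of the strings.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Build the Doc from a word list so each string stays a single token
words = ["ten", "10", "10,000", "1/2", "items"]
doc = Doc(nlp.vocab, words=words)

like_num_flags = [token.like_num for token in doc]
for token in doc:
    print(f"{token.text:<8}{token.like_num}")
```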


##### 2. Getting Started

English:

In [32]:
nlp = spacy.blank("en")
doc = nlp("This is a sentence.")
print(doc.text)

This is a sentence.


German:

In [33]:
nlp = spacy.blank("de")
doc = nlp("Liebe Grüße!")
print(doc.text)

Liebe Grüße!


Spanish:

In [34]:
nlp = spacy.blank("es")
doc = nlp("¿Cómo estás?")
print(doc.text)

¿Cómo estás?


##### 3. Documents, spans and tokens

In [35]:
nlp = spacy.blank("en")
doc = nlp("I like tree kangaroos and narwhals.")
first_token = doc[0]
print(first_token.text)

I


In [36]:
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

tree kangaroos


In [37]:
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos and narwhals


##### 4. Lexical attributes

In [38]:
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
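for token in doc:
    # Only look ahead when a next token exists, so a trailing number cannot raise IndexError
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            print("Percentage found:", token.text)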

Percentage found: 60
Percentage found: 4


##### 5. Trained pipelines

Trained pipelines predict linguistic attributes such as part-of-speech tags, syntactic dependencies, and named entities.  
They can be fine-tuned with labeled data.

In [39]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("She ate the pizza")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


Entities:

In [40]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


spacy.explain returns a short description of a tag or label:

In [41]:
print(spacy.explain("GPE"))
print(spacy.explain("NNP"))
print(spacy.explain("dobj"))

Countries, cities, states
noun, proper singular
direct object


##### 6. Pipeline packages
A pipeline package contains only what is needed for inference (the trained model weights); the training data is not included.

##### 7. Loading pipelines

In [42]:
nlp = spacy.load("en_core_web_sm")
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
doc = nlp(text)
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


##### 8. Predicting linguistic annotations

In [43]:
for token in doc:
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      ccomp     
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [44]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


##### 9. Predicting named entities in context

In [45]:
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

iphone_x = doc[1:3]
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X


##### 10. Rule-based matching
The Matcher works on token attributes rather than raw strings, so it can, for example, distinguish "duck" the noun from "duck" the verb.  
A match pattern is a list of dictionaries, with each dictionary describing one token:

In [46]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])
doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
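
Each match is a tuple of (match_id, start, end). The match_id is the hash of the name passed to matcher.add, and start and end are token indices into the doc. A minimal sketch of turning that hash back into the readable name through the vocabulary's string store follows. It uses a blank English pipeline, which is enough here because the pattern relies only on token text.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
doc = nlp("Upcoming iPhone X release date leaked")

match_id, start, end = matcher(doc)[0]
# match_id is a hash value; the StringStore maps it back to the pattern name
pattern_name = nlp.vocab.strings[match_id]
print(pattern_name, start, end, doc[start:end].text)
```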


In [47]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
doc = nlp("2018 FIFA World Cup: France won!")
matcher = Matcher(nlp.vocab)
matcher.add("WORLD_CUP", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


A single pattern can match several spans in the same doc:

In [48]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]
doc = nlp("I loved dogs but now I love cats more.")
matcher = Matcher(nlp.vocab)
matcher.add("LOVE", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


You can include quantifiers via the OP key:

In [49]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # match 0 or 1 times
    {"POS": "NOUN"}
]
doc = nlp("I bought a smartphone. Now I'm buying apps.")
matcher = Matcher(nlp.vocab)
matcher.add("BUY", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


In [50]:
%%capture
{"OP": "!"}  # Negation: match 0 times
{"OP": "?"}  # Optional: match 0 or 1 times
{"OP": "+"}  # Match 1 or more times
{"OP": "*"}  # Match 0 or more times
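
To see a repeating quantifier in action, here is a small sketch of the "+" operator. The sentence and the pattern name VERY_GOOD are made up for illustration, and the pattern uses only the lexical attribute LOWER so that it runs on a blank pipeline. The matcher reports every span that satisfies the pattern, including overlapping ones.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("This is very very good.")

matcher = Matcher(nlp.vocab)
# One or more "very" tokens followed by "good"
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]
matcher.add("VERY_GOOD", [pattern])

# Collect the matched texts in a set, since match order is not the point here
found = {doc[start:end].text for _, start, end in matcher(doc)}
print(sorted(found))
```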

##### 11. Using the Matcher

In [52]:
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_X_PATTERN", [pattern])
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


##### 12. Writing match patterns

In [55]:
matcher = Matcher(nlp.vocab)
doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]
matcher.add("IOS_VERSION_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [57]:
matcher = Matcher(nlp.vocab)
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [58]:
matcher = Matcher(nlp.vocab)
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
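
The last result lists both "optional voice" and "optional voice responses" because the optional third token yields two overlapping matches. If only the longest one is wanted, Matcher.add in spaCy v3 accepts a greedy argument. The sketch below is an illustration under stated assumptions rather than part of the course exercise: it swaps the POS-based pattern for a LOWER-based one so that it runs on a blank pipeline without a trained model, and the pattern names ALL and LONGEST_ONLY are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("automatic labels and optional voice responses")
pattern = [
    {"LOWER": "optional"},
    {"LOWER": "voice"},
    {"LOWER": "responses", "OP": "?"},  # optional third token
]

# Default behaviour: every span that satisfies the pattern is returned
all_matcher = Matcher(nlp.vocab)
all_matcher.add("ALL", [pattern])
all_texts = {doc[s:e].text for _, s, e in all_matcher(doc)}

# greedy="LONGEST" keeps only the longest of the overlapping matches
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("LONGEST_ONLY", [pattern], greedy="LONGEST")
longest_texts = [doc[s:e].text for _, s, e in longest_matcher(doc)]

print(sorted(all_texts))
print(longest_texts)
```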
