# Advanced NLP with spaCy (tutorial)

This notebook (or series of notebooks) follows along with the "Advanced NLP with spaCy" course taught by Ines Montani of Explosion AI, the company behind spaCy.


## Chapter 1: Finding words, phrases, names and concepts
### Introduction to spaCy

In [2]:
# import the english language class
from spacy.lang.en import English

In [3]:
# create the nlp object; it contains the processing pipeline and the language-specific rules for tokenization, etc.
nlp = English()

In [4]:
# the Doc object; behaves like a sequence
doc = nlp('Hello world!')
for token in doc:
    print(token.text)

Hello
world
!


In [5]:
# access a token via its index
token = doc[1]

In [6]:
# access the token text with the text attribute
print(token.text)

world


In [7]:
# the Span object; a slice of the Doc. It is a view onto the Doc and holds no data of its own
# (the end index is exclusive, so 1:3 covers tokens 1 and 2)
span = doc[1:3]

In [8]:
print(span.text)

world!
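
A minimal sketch of how `Doc`, `Token` and `Span` indices relate. It assumes spaCy is installed and uses `spacy.blank("en")`, which builds a tokenizer-only English pipeline without downloading a model; the variable names are illustrative only.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical model needed
nlp = spacy.blank("en")
doc = nlp("Hello world!")

# A Doc behaves like a sequence of tokens, so it has a length
num_tokens = len(doc)

# Slicing a Doc gives a Span; the end index is exclusive
span = doc[1:3]

# A Span is a view onto the Doc: it stores token offsets, and iterating
# over it yields the Doc's own tokens with their Doc-level indices
span_tokens = [(token.i, token.text) for token in span]

print(num_tokens, span.start, span.end, span_tokens)
```

Here `span.start` and `span.end` are token offsets into the parent `Doc`, which is why `token.i` inside the span still counts from the start of the document.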


In [9]:
# Lexical attributes
doc = nlp('It cost me 5 bucks!')

In [10]:
# lexical attributes depend only on the vocabulary entry for a word;
# they do not take the token's context into account
print('Index:    ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])

print('is_alpha: ', [token.is_alpha for token in doc])
print('is_punct: ', [token.is_punct for token in doc])
print('like_num: ', [token.like_num for token in doc])

Index:     [0, 1, 2, 3, 4, 5]
Text:     ['It', 'cost', 'me', '5', 'bucks', '!']
is_alpha:  [True, True, True, False, True, False]
is_punct:  [False, False, False, False, False, True]
like_num:  [False, False, False, True, False, False]
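
Lexical attributes can be combined with token indices to find simple patterns. The sketch below looks for numbers followed by a percent sign; it assumes a blank English pipeline (`spacy.blank("en")`) and an example sentence chosen for illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

percentages = []
for token in doc:
    # like_num is True for tokens that resemble a number
    if token.like_num and token.i + 1 < len(doc):
        # token.i is the token's index, so doc[token.i + 1] is the next token
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            percentages.append(token.text)

print(percentages)
```

The bounds check on `token.i + 1` avoids an `IndexError` when the last token of the text is number-like.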


### spaCy statistical models

In [11]:
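# download the small English model (uncomment to run once)
#!python3 -m spacy download en_core_web_sm

import spacy

# load the model package by its full name; it contains binary weights, vocabulary
# and meta information (the 'en' shortcut only works where a shortcut link exists)
nlp = spacy.load('en_core_web_sm')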

In [12]:
# process text
doc = nlp("She ate a whole pizza all by herself.")

# print token text and part of speech
for token in doc:
    # attributes that return strings have a trailing underscore;
    # the same attributes without the underscore return integer IDs
    print(token.text, token.pos_)

She PRON
ate VERB
a DET
whole ADJ
pizza NOUN
all ADV
by ADP
herself PRON
. PUNCT


In [13]:
# predict how words are related
for token in doc:
    # dep_ returns predicted dependency label
    # head returns the head (parent) token
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
a DET det pizza
whole ADJ amod pizza
pizza NOUN dobj ate
all ADV advmod by
by ADP prep ate
herself PRON pobj by
. PUNCT punct ate


In [14]:
# named entities are real-world objects that are assigned a name
# the doc.ents property gives access to the predicted named entities as a tuple of Span objects
doc = nlp("Bob Barker drives his Porsche to the Masters on Mars for a million bucks.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Bob Barker PERSON
Porsche ORG
Mars LOC


In [15]:
# spacy.explain returns definitions of common tags and labels
spacy.explain('PERSON')

'People, including fictional'

In [16]:
spacy.explain('ORG')

'Companies, agencies, institutions, etc.'

In [17]:
spacy.explain('PRON')

'pronoun'

In [18]:
spacy.explain('ADP')

'adposition'

In [19]:
doc = nlp('Bob Barker drove his Tesla to Mars to meet Elon')
for ent in doc.ents:
    # ent.label (no trailing underscore) is the integer ID; ent.label_ gives the string label
    print(ent.text, ent.label)

Bob Barker 380
Mars 385
Elon 380
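
Attributes without a trailing underscore, such as `ent.label`, hold integer IDs rather than strings. A minimal sketch of converting between the two through the vocabulary's string store, assuming a blank English pipeline; the concrete integer values are deliberately not hard-coded, since they may differ between spaCy versions.

```python
import spacy

nlp = spacy.blank("en")

# Adding a string to the StringStore returns the integer ID used for it
label_id = nlp.vocab.strings.add("PERSON")

# The lookup works in both directions: ID -> string and string -> ID
label_text = nlp.vocab.strings[label_id]
same_id = nlp.vocab.strings["PERSON"]

print(label_id, label_text, same_id)
```

In the same way, `nlp.vocab.strings[ent.label]` should give back the text that `ent.label_` returns directly.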


In [20]:
tesla = doc[4:5]
print('Missing entity:', tesla.text)

Missing entity: Tesla
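
When the model misses an entity, a labelled `Span` can be created by hand and added to `doc.ents`. This is a sketch under assumptions: it uses a blank pipeline (so `doc.ents` starts empty), and the `"ORG"` label for Tesla is an illustrative choice, not something the model predicted.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # no entity recognizer, so doc.ents starts empty
doc = nlp("Bob Barker drove his Tesla to Mars to meet Elon")

# Build a labelled Span over token 4 ("Tesla"); "ORG" is an assumed label
tesla = Span(doc, 4, 5, label="ORG")

# doc.ents is writable: keep any existing entities and append the new one
doc.ents = list(doc.ents) + [tesla]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a loaded model, the same assignment would keep the predicted entities, provided the new span does not overlap any of them.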


### Rule-based matching
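More powerful than regular expressions: the Matcher works on `Doc` and `Token` objects rather than raw strings, so patterns can match on token attributes and on the model's predictions (for example, part-of-speech tags and lemmas).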

In [21]:
from spacy.matcher import Matcher

# initialize the Matcher with shared vocab
matcher = Matcher(nlp.vocab)

# match patterns are lists of dictionaries
# dictionaries describe tokens; keys are attributes that match to expected values
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# add the pattern
matcher.add('IPHONE_PATTERN', None, pattern)

doc = nlp('New iPhone X eats a bag of Cheetos')

# call the matcher on the doc; returns a list of (match_id, start_index, end_index) tuples, one per match
matches = matcher(doc)

In [22]:
# slice the doc with each match's start and end index to get the matched Span
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
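
A short sketch of case-insensitive matching and of turning a `match_id` back into the rule name. It assumes spaCy v3, where `Matcher.add` takes the key and a list of patterns (the cells in this notebook use the older v2 form with `None` as the second argument), and a blank English pipeline; the example sentence is made up.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# LOWER compares against the lowercased token text, so case is ignored
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# spaCy v3 signature (assumption): add(key, list_of_patterns)
matcher.add("IPHONE_PATTERN", [pattern])

doc = nlp("I upgraded from an iphone x to an iPhone X")

results = []
for match_id, start, end in matcher(doc):
    # match_id is the integer ID of the rule name; look it up in the string store
    rule_name = nlp.vocab.strings[match_id]
    results.append((rule_name, doc[start:end].text))

print(results)
```

Looking up the rule name is mainly useful once several patterns are registered on the same matcher, as happens in the next cells.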


In [23]:
pattern = [
    {'IS_DIGIT': True}, 
    {'LOWER': 'fifa'}, 
    {'LOWER': 'world'},
    {'LOWER': 'cup'}, 
    {'IS_PUNCT': True}
]

matcher.add('FIFA_WC_PATTERN', None, pattern)
doc = nlp('2018 FIFA World Cup: France won!')
matches = matcher(doc)

In [24]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


In [25]:
# first token has two characteristics (and is followed by a noun):
# 1) its lemma is 'love'
# 2) it's a verb
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'}, 
    {'POS': 'NOUN'}
]

matcher = Matcher(nlp.vocab)
matcher.add('LOVE_NOUN', None, pattern)
doc = nlp('I loved gerbils, but I love cockroaches more now.')
matches = matcher(doc)

In [26]:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span)

loved gerbils
love cockroaches


In [27]:
pattern = [
    {'LEMMA': 'buy'}, 
    {'POS': 'DET', 'OP': '?'}, # 'OP': '?' makes the determiner optional (match 0 or 1 times)
    {'POS': 'NOUN'}
]

doc = nlp("Devo bought a house and filled it with leather. Now they're buying the world!")
matcher = Matcher(nlp.vocab)
matcher.add('BUYING_PATTERN', None, pattern)
matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span)

bought a house
buying the world
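
An optional token at the start of a pattern may produce overlapping candidates, one with and one without the optional token. The sketch below keeps only the longest non-overlapping spans with `spacy.util.filter_spans`. It assumes spaCy v3 (list-of-patterns form of `Matcher.add`) and a blank English pipeline, so it sticks to attributes that need no model (`IS_DIGIT`, `LOWER`).

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True, "OP": "?"},  # optional year: match 0 or 1 times
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
]
# spaCy v3 signature (assumption): add(key, list_of_patterns)
matcher.add("FIFA_WC_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")

# Convert every match to a Span, then drop spans contained in longer ones
spans = [doc[start:end] for _, start, end in matcher(doc)]
longest = filter_spans(spans)
longest_texts = [span.text for span in longest]

print(longest_texts)
```

`filter_spans` prefers longer spans when candidates overlap, which is usually the desired reading for patterns with optional tokens.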
