<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Statistical-models" data-toc-modified-id="Statistical-models-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Statistical models</a></span><ul class="toc-item"><li><span><a href="#Part-of-speech" data-toc-modified-id="Part-of-speech-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Part-of-speech</a></span></li><li><span><a href="#Syntactic-Denpendency" data-toc-modified-id="Syntactic-Denpendency-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Syntactic Dependency</a></span></li><li><span><a href="#Named-Entity" data-toc-modified-id="Named-Entity-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Named Entity</a></span></li><li><span><a href="#spacy.explain-method" data-toc-modified-id="spacy.explain-method-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>spacy.explain method</a></span></li></ul></li><li><span><a href="#Rule-base-matching" data-toc-modified-id="Rule-base-matching-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Rule-based matching</a></span><ul class="toc-item"><li><span><a href="#Using-operaters-and-quantifiers" data-toc-modified-id="Using-operaters-and-quantifiers-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Using operators and quantifiers</a></span></li></ul></li><li><span><a href="#Data-Structures" data-toc-modified-id="Data-Structures-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Structures</a></span></li></ul></div>

# Introduction

In [1]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

The `nlp` object:

- contains the processing pipeline
- includes language-specific rules for tokenization etc.

In [3]:
# A Doc is created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc 
for token in doc:
    print(token.text)

Hello
world
!


Tokens represent the words and punctuation marks in a document. We can index into the doc to get a single token.

In [4]:
# index into the doc to get a single token
token = doc[1]

In [5]:
# get the token text via .text attribute
print(token.text)

world


A `Span` object is a slice of a document consisting of one or more tokens.

**It's only a view of the tokens and it doesn't contain any data itself.**

In [6]:
# A slice from the doc is a Span Object
span = doc[1:3]
# get the span text via .text attribute
print(span.text)

world!
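
To make the "view" idea concrete, here is a minimal sketch using the same blank `English()` pipeline as above. It only inspects attributes of the `Span`; the variable name `first` is just for illustration.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")
span = doc[1:3]

# A Span records its boundaries and a reference to the parent Doc
print(span.start, span.end)  # token offsets into the Doc
print(span.doc is doc)       # the Span points back to the same Doc

# Tokens reached through the Span are the Doc's own tokens
first = span[0]
print(first.text, first.i)   # .i is the index in the Doc, not in the Span
```

Because the span only refers to the doc, `first.i` reports the token's position in the whole document.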


Lexical attributes describe a token on its own, independent of its context:

In [7]:
doc = nlp("It costs $5.")

In [9]:
print("Index: ", [token.i for token in doc])
print("Text: ", [token.text for token in doc])

print("is_alpha: ", [token.is_alpha for token in doc])
print("is_punct: ", [token.is_punct for token in doc])
print("like_num: ", [token.like_num for token in doc])

Index:  [0, 1, 2, 3, 4]
Text:  ['It', 'costs', '$', '5', '.']
is_alpha:  [True, True, False, False, False]
is_punct:  [False, False, False, False, True]
like_num:  [False, False, False, True, False]
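
As a small sketch of how `like_num` differs from `is_digit`, the example below runs both on a sentence of my own (not from the course). The sentence and the list names are assumptions for illustration.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I paid ten dollars and 10.5 euros for 3 books.")

# like_num: the token resembles a number (digits, decimals, number words)
like_num = [token.text for token in doc if token.like_num]

# is_digit: the token consists of digits only
is_digit = [token.text for token in doc if token.is_digit]

print("like_num:", like_num)
print("is_digit:", is_digit)
```

`like_num` is the looser of the two checks, which is why it is the one used for finding percentages.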


In [10]:
# an example using 'like_num' to find percentages in a text

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token,
    # so that doc[token.i + 1] cannot run past the end of the doc
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("percentage found:", token.text)

percentage found: 60
percentage found: 4
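
The same idea can be wrapped in a reusable function that returns `Span` objects instead of printing. This is only a sketch; `find_percentages` is a hypothetical helper name, not part of spaCy.

```python
from spacy.lang.en import English


def find_percentages(doc):
    """Return a Span for each number that is directly followed by '%'."""
    spans = []
    # Skip the last token so that doc[token.i + 1] always exists
    for token in doc[:-1]:
        if token.like_num and doc[token.i + 1].text == "%":
            # Slice out the number together with the percent sign
            spans.append(doc[token.i : token.i + 2])
    return spans


nlp = English()
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
percentages = [span.text for span in find_percentages(doc)]
print(percentages)
```

Returning spans keeps the link to the doc, so the position of each percentage is still available through `span.start` and `span.end`.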


# Statistical models

What are statistical models?

* Enable spaCy to predict linguistic attributes in context

    * Part-of-speech tags
    * Syntactic dependencies 
    * Named entities
    
* Trained on labeled example texts
* Can be updated with more examples to fine-tune predictions



In [11]:
# pre-trained packages
# ! python -m spacy download en_core_web_sm

In [12]:
import spacy 

# load the small English model 
nlp = spacy.load("en_core_web_sm")

Another pre-trained package: `en_vectors_web_lg`

It has broader coverage because it preserves more morphological information, which results in more distinct tokens. Packages trained on the same Common Crawl corpus can still differ in the number of tokens they contain.

A model package contains:

* Binary weights
* Vocabulary
* Meta information (language, pipeline)
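
A rough way to look at some of these parts from Python is sketched below. As an assumption, it uses a blank English pipeline so that it runs without a downloaded package; with `spacy.load("en_core_web_sm")` the same attributes would describe the trained package instead.

```python
import spacy

# Assumption: a blank pipeline stands in for a downloaded model package
nlp = spacy.blank("en")

# Meta information: language code
lang = nlp.meta["lang"]

# Meta information: names of the pipeline components (none in a blank pipeline)
components = list(nlp.pipe_names)

# Vocabulary: number of lexemes currently stored
vocab_size = len(nlp.vocab)

print(lang, components, vocab_size)
```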

## Part-of-speech

In [15]:
# Predicting part-of-speech tags

# load the small English model 
nlp = spacy.load("en_core_web_sm")

# Process a text 
doc = nlp("She ate the pizza.")

# Iterate over the tokens

for token in doc: 
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
. PUNCT


In spaCy, attributes that return strings usually end with an underscore, e.g. `token.pos_`. The same attributes without the underscore return integer ID values.

In [16]:
for token in doc: 
    # Print the text and the integer ID of the predicted part-of-speech tag
    print(token.text, token.pos)

She 95
ate 100
the 90
pizza 92
. 97
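
The underscore convention is not specific to part-of-speech tags. The sketch below shows it with `lower` / `lower_`, chosen here only because these work on a blank pipeline without a trained model.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")
token = doc[0]

# With underscore: the string; without: the integer ID of that string
print(token.lower_, token.lower)

# The ID can be resolved back to the string through the shared string store
resolved = nlp.vocab.strings[token.lower]
print(resolved)
```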


## Syntactic Dependency

In [17]:
# Predicting syntactic dependencies

for token in doc: 
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate
. PUNCT punct ate


In [19]:
# Visualize the dependencies
from spacy import displacy 

displacy.render(doc, style="dep")

## Named Entity

In [29]:
# Predicting Named Entities

# Process a text 
doc = nlp("Apple is looking at buying U.K startup for $1 billion.")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K ORG
$1 billion MONEY
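
Each item in `doc.ents` is a `Span` with a label attached. To show that structure without relying on a trained model, the sketch below sets one entity by hand; the hand-written label is an assumption for illustration, not a model prediction.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("Apple is looking at buying a startup for $1 billion.")

# Assumption: label the first token manually, since no model is loaded
doc.ents = [Span(doc, 0, 1, label="ORG")]

for ent in doc.ents:
    # Token offsets (start, end) and character offsets (start_char, end_char)
    print(ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
```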


In [30]:
displacy.render(doc,style="ent")

## spacy.explain method

Get quick definitions of the most common tags and labels.

In [32]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [34]:
spacy.explain("NNP")

'noun, proper singular'

In [35]:
spacy.explain("dobj")

'direct object'
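
`spacy.explain` returns `None` for a label it does not know, so a small wrapper can supply a fallback. This is a sketch; `describe` is a hypothetical helper name.

```python
import spacy


def describe(label):
    """Return spaCy's glossary text for a label, or a fallback string."""
    # spacy.explain returns None when the label is not in its glossary
    return spacy.explain(label) or "no description available"


for label in ["ORG", "dobj", "NOT_A_REAL_LABEL"]:
    print(label, "->", describe(label))
```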

# Rule-based matching

Why not just regular expressions?

* Match on `Doc` objects, not just strings
* Match on tokens and token attributes
* Use the model's predictions 
* Example: "duck" (verb) vs. "duck" (noun)

Match patterns are lists of dictionaries. Each dictionary describes one token, and its keys are token attributes.

Some Examples:

* Match exact token texts: `[{"TEXT": "iPhone"}, {"TEXT": "X"}]`
* Match lexical attributes: `[{"LOWER": "iphone"}, {"LOWER": "x"}]`
* Match any token attributes: `[{"LEMMA": "buy"}, {"POS": "NOUN"}]`
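
The first two pattern styles can be compared on a blank pipeline, since `TEXT` and `LOWER` need no model predictions. The sketch below assumes the list-of-patterns form of `matcher.add` (spaCy v2.2.2 and later); `match_texts` and the example sentence are hypothetical.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher


def match_texts(pattern, doc):
    """Run a single pattern over the doc and return the matched texts."""
    matcher = Matcher(doc.vocab)
    # Assumption: list-of-patterns signature (spaCy v2.2.2 and later)
    matcher.add("PATTERN", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]


nlp = English()
doc = nlp("The iPhone X and the IPHONE x are the same phone.")

# Exact token texts: case-sensitive
exact = match_texts([{"TEXT": "iPhone"}, {"TEXT": "X"}], doc)

# Lowercase form: case-insensitive
lower = match_texts([{"LOWER": "iphone"}, {"LOWER": "x"}], doc)

print(exact)
print(lower)
```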

In [37]:
# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object 
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab 
matcher = Matcher(nlp.vocab)

In [38]:
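# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# The second argument is an optional callback (spaCy v2 signature);
# spaCy v3 expects matcher.add("IPHONE_PATTERN", [pattern]) instead
matcher.add("IPHONE_PATTERN", None, pattern)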

In [39]:
# Process some texts
doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)

In [41]:
# Iterate over matches
for match_id, start, end in matches:
    # Get the matched span 
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


Each match is a tuple of three elements:

* match_id: hash value of the pattern name
* start: start index of matched span 
* end: end index of matched span
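
A short sketch of unpacking one match tuple follows. It again assumes the list-of-patterns form of `matcher.add` (spaCy v2.2.2 and later) and a blank pipeline.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
# Assumption: list-of-patterns signature (spaCy v2.2.2 and later)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked")
match_id, start, end = matcher(doc)[0]

# match_id is a hash; the string store turns it back into the pattern name
pattern_name = nlp.vocab.strings[match_id]

# start is inclusive and end is exclusive, like a Python slice
matched_tokens = [token.text for token in doc[start:end]]

print(pattern_name, start, end, matched_tokens)
```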

In [46]:
# Matching lexical attributes
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]

In [52]:
doc = nlp("2018 FIFA World Cup: France won!")

In [53]:
# Add the pattern first, then run the matcher on the doc
matcher.add("wc_PATTERN", None, pattern)
matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


In [57]:
# Matching other token attributes

pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"},
]

doc = nlp("I loved dogs but now I love cats more.")

In [58]:
# Add the pattern first, then run the matcher on the doc
matcher.add("lp_PATTERN", None, pattern)
matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


## Using operators and quantifiers

Operators and quantifiers define how often a token should be matched. They are added to a token's dictionary with the `OP` key.

In [61]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"},
]

In [60]:
spacy.explain("DET")

'determiner'

In [62]:
doc = nlp("I bought a smartphone. Now I'm buying apps.")

In [64]:
# Add the pattern first, then run the matcher on the doc
matcher.add("op_PATTERN", None, pattern)
matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


`OP` can have one of four values:

* `{"OP": "!"}`: negation, match 0 times
* `{"OP": "?"}`: optional, match 0 or 1 times
* `{"OP": "+"}`: match 1 or more times
* `{"OP": "*"}`: match 0 or more times
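
The optional operator can be tried without a trained model by using lexical attributes only. The sketch below assumes the list-of-patterns form of `matcher.add` (spaCy v2.2.2 and later); the pattern name and the sentence are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa", "OP": "?"},  # optional: match 0 or 1 times
    {"LOWER": "world"},
    {"LOWER": "cup"},
]
# Assumption: list-of-patterns signature (spaCy v2.2.2 and later)
matcher.add("WORLD_CUP", [pattern])

doc = nlp("The 2018 FIFA World Cup and the 2022 World Cup")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

One pattern covers both phrasings because the "fifa" token may be present or absent.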

# Data Structures

Vocab, Lexemes and StringStore

spaCy stores data that is shared across documents in the `Vocab`. To save memory, spaCy encodes all strings as **hash values**.

Each string is stored only once, in the `StringStore`, which is available as `nlp.vocab.strings`.

The string store is a **lookup table** that works in both directions: from a string to its hash and from a hash back to the string.
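
A minimal sketch of the two-way lookup, together with the matching `Lexeme`, might look like this. It uses a blank `English()` pipeline; the example sentence is an assumption.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I love coffee")

strings = nlp.vocab.strings
coffee_hash = strings["coffee"]        # string -> hash
coffee_string = strings[coffee_hash]   # hash -> string
print(coffee_hash, coffee_string)

# A Lexeme is the context-independent vocabulary entry for a word
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.orth == coffee_hash, lexeme.is_alpha)
```

The hash-to-string direction only works for strings the store has already seen, which is why the text is processed before the lookup.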

In [None]:
# https://course.spacy.io/en/chapter2