# Chapter 1: Finding words, phrases, names and concepts

This chapter will introduce you to the basics of text processing with spaCy. You'll learn about the data structures, how to work with statistical models, and how to use them to predict linguistic features in your text.

## Introduction to spaCy

resources: [slides](slides/chapter1_01_introduction-to-spacy.md)

In this lesson, we'll take a look at the most important concepts of spaCy and how to get started.

### The nlp object

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".

For example, to create an English nlp object, you can import the `English` language class from `spacy.lang.en` and instantiate it. You can use the nlp object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages, which are available in `spacy.lang`.

In [1]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()
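As a minimal sketch of the points above: an nlp object created directly from a language class has no pipeline components yet and only applies the tokenization rules, and other language classes are used the same way. `German` from `spacy.lang.de` and the example sentence are chosen here purely for illustration.

```python
from spacy.lang.en import English
from spacy.lang.de import German

# An nlp object created directly from a language class has an empty pipeline
nlp_en = English()
print(nlp_en.pipe_names)

# Other language classes follow the same pattern (German is only an example)
nlp_de = German()
doc_de = nlp_de('Hallo Welt!')
print(nlp_de.lang, [token.text for token in doc_de])
```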

### The Doc object

When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc also behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. But more on that later!

In [5]:
# Created by processing a string of text with the nlp object
doc = nlp('Hello world!')

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!
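The sequence behavior and the "no information is lost" claim can be checked with a short sketch like the following. It assumes nothing beyond the blank English pipeline used above.

```python
from spacy.lang.en import English

nlp = English()
text = 'Hello world!'
doc = nlp(text)

# Like other Python sequences, a Doc has a length and supports indexing
print(len(doc))
print(doc[0].text)
print(doc[-1].text)

# The original string, including whitespace, can be recovered from the Doc
print(doc.text == text)
```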


### The Token object

Token objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the Doc.

Token objects also provide various attributes that let you access more information about the tokens. For example, the `.text` attribute returns the verbatim token text.

In [8]:
doc = nlp('Hello world!')

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


### The Span object

A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a Span, you can use Python's slice notation. For example, `1:3` will create a slice starting from the token at position 1, up to (but not including) the token at position 3.

In [9]:
doc = nlp('Hello world!')

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!
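To make the "view of the Doc" idea more concrete, here is a small sketch that inspects where a Span sits in its parent document. It uses the `start` and `end` attributes of the Span and the `i` attribute of its tokens.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp('Hello world!')

# Tokens at positions 1 and 2; position 3 is excluded
span = doc[1:3]

# The Span records where it starts and ends in the parent Doc
print(span.start, span.end, len(span))

# Tokens reached through the Span keep their index in the Doc
print([(token.i, token.text) for token in span])
```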


### Lexical Attributes

Here you can see some of the available token attributes:

`i` is the index of the token within the parent document.

`text` returns the token text.

`is_alpha`, `is_punct` and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation or whether it resembles a number. For example, both the token "10" and the word "ten" resemble a number.

These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.

In [10]:
doc = nlp('It costs $5.')

print('Index:    ', [token.i for token in doc])
print('Text:     ', [token.text for token in doc])
print('is_alpha: ', [token.is_alpha for token in doc])
print('is_punct: ', [token.is_punct for token in doc])
print('like_num: ', [token.like_num for token in doc])

Index:     [0, 1, 2, 3, 4]
Text:      ['It', 'costs', '$', '5', '.']
is_alpha:  [True, True, False, False, False]
is_punct:  [False, False, False, False, True]
like_num:  [False, False, False, True, False]
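The point that `like_num` covers number words as well as digits can be illustrated with a short sketch. The example sentence is made up for this purpose.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp('I paid ten dollars for 10 apples.')

# like_num is True for digits and for number words such as "ten"
number_like = [token.text for token in doc if token.like_num]
print(number_like)

# is_alpha is True only for tokens made of alphabetic characters
alphabetic = [token.text for token in doc if token.is_alpha]
print(alphabetic)
```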


## Statistical models

resources: [slides](slides/chapter1_02_statistical-models.md)

### What are statistical models?

- Enable spaCy to predict linguistic attributes in context
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

### Model Packages

spaCy provides a number of pre-trained model packages you can download using the `spacy download` command. For example, the `en_core_web_sm` package is a small English model that supports all core capabilities and is trained on web text.

The `spacy.load` function loads a model package by name and returns an nlp object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

`$ python -m spacy download en_core_web_sm`

```python
import spacy

nlp = spacy.load('en_core_web_sm')
```

### Predicting Part-of-speech Tags

Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

First, we load the small English model and receive an nlp object.

Next, we're processing the text "She ate the pizza".

For each token in the Doc, we can print the text and the `pos_` attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an ID.

Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.

In [27]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process a text
doc = nlp('She ate the pizza')

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
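The underscore convention mentioned above can be sketched without a statistical model by using the lexical attribute pair `lower` and `lower_`. Predicted attributes such as `pos` and `pos_` follow the same convention. The names `blank_nlp` and `demo_doc` are made up here so the loaded model in `nlp` is left untouched.

```python
from spacy.lang.en import English

# A blank pipeline is enough because lower / lower_ are lexical attributes
blank_nlp = English()
demo_doc = blank_nlp('She ate the Pizza')
last_token = demo_doc[3]

# With the underscore: the readable string
print(last_token.lower_)

# Without the underscore: an integer ID
print(type(last_token.lower))

# The ID can be resolved back to the string via the vocabulary's string store
print(blank_nlp.vocab.strings[last_token.lower])
```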


### Predicting Syntactic Dependencies

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The `dep_` attribute returns the predicted dependency label.

The `head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [28]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


### Dependency label scheme

![dependency_label](slides/static/dep_example.png)

|Label|Description|Example|
|---|---|---|
|nsubj|nominal subject|She|
|dobj|direct object|pizza|
|det|determiner (article)|the|

### Predicting Named Entities

Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The `doc.ents` property lets you access the named entities predicted by the model.

It returns a sequence of Span objects, so we can print the entity text and the entity label using the `label_` attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

In [29]:
# Process a text
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Tip: the explain function

A quick tip: To get definitions for the most common tags and labels, you can use the `spacy.explain` helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive, but `spacy.explain` can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

In [32]:
print(spacy.explain('GPE'))
print(spacy.explain('NNP'))
print(spacy.explain('dobj'))

Countries, cities, states
noun, proper singular
direct object
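As a small sketch, the helper can also be run over several labels at once, for example the ones that appeared earlier in this chapter. For a term that is not in the glossary it returns `None` (newer spaCy versions may also emit a warning). The label `NOT_A_REAL_LABEL` is made up for illustration.

```python
import spacy

# Labels that appear earlier in this chapter
for label in ['nsubj', 'det', 'ORG', 'MONEY']:
    print(label, '->', spacy.explain(label))

# A made-up label that is not in the glossary gives None
unknown = spacy.explain('NOT_A_REAL_LABEL')
print(unknown)
```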


## Rule-based matching

resources: [slides](slides/chapter1_03_rule-based-matching.md)

### Why not just regular expressions?

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search for token texts as well as other lexical attributes.

You can even write rules that use the model's predictions.

For example, you can find the word "duck" only if it's a verb, not a noun.
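To illustrate the difference between strings and tokens, the sketch below runs a regular expression and a matcher over the same text. It assumes spaCy 2.2.2 or later, where `Matcher.add` accepts a list of patterns. The names `blank_nlp`, `demo_matcher`, `demo_doc` and the pattern name `IPHONE_LOWER` are made up for this example.

```python
import re

from spacy.lang.en import English
from spacy.matcher import Matcher

text = 'The iPhone X and the IPHONE x'

# A regular expression works on the raw string and yields character offsets
regex_hits = [m.span() for m in re.finditer(r'iphone x', text, flags=re.IGNORECASE)]
print(regex_hits)

# The matcher works on Token objects and yields token offsets into the Doc
blank_nlp = English()
demo_matcher = Matcher(blank_nlp.vocab)
# Assumption: spaCy >= 2.2.2, where add() takes a list of patterns
demo_matcher.add('IPHONE_LOWER', [[{'LOWER': 'iphone'}, {'LOWER': 'x'}]])

demo_doc = blank_nlp(text)
token_hits = [
    (start, end, demo_doc[start:end].text)
    for _, start, end in demo_matcher(demo_doc)
]
print(token_hits)
```

Because the matcher result points at tokens, each hit can be turned into a Span and inspected further, which a plain string match does not offer.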

### Match patterns

- Lists of dictionaries, one per token
- Match exact token texts

`[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]`

- Match lexical attributes

`[{'LOWER': 'iphone'}, {'LOWER': 'x'}]`

- Match any token attributes

`[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]`

### Using the Matcher

In [51]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp('New iPhone X release date leaked')

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
# match_id: hash value of the pattern name
# start: start index of matched span
# end: end index of matched span
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X
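Since `match_id` is the hash value of the pattern name, the name can be recovered by looking the hash up in the shared vocabulary's string store. The following is a minimal sketch of that lookup. It assumes spaCy 2.2.2 or later (list-of-patterns form of `Matcher.add`) and uses a blank pipeline under the made-up names `blank_nlp`, `demo_matcher` and `demo_doc`.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

blank_nlp = English()
demo_matcher = Matcher(blank_nlp.vocab)
# Assumption: spaCy >= 2.2.2, where add() takes a list of patterns
demo_matcher.add('IPHONE_PATTERN', [[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]])

demo_doc = blank_nlp('New iPhone X release date leaked')

results = []
for match_id, start, end in demo_matcher(demo_doc):
    # Resolve the hash back to the pattern name via the string store
    pattern_name = blank_nlp.vocab.strings[match_id]
    results.append((pattern_name, start, end, demo_doc[start:end].text))

print(results)
```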


### Matching lexical attributes

In [54]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
matcher.add('LEXICAL', None, pattern)

doc = nlp('2018 FIFA World Cup: France won!')

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


### Matching other token attributes

In [56]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

matcher.add('OTHER', None, pattern)

doc = nlp('I loved dogs but now I love cats more.')

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


### Using operators and quantifiers

Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

In the pattern below, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

|Example|Description|
|---|---|
|`{'OP': '!'}`|Negation: match 0 times|
|`{'OP': '?'}`|Optional: match 0 or 1 times|
|`{'OP': '+'}`|Match 1 or more times|
|`{'OP': '*'}`|Match 0 or more times|

In [58]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'}, # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

matcher.add('OPE_QUANT', None, pattern)

doc = nlp('I bought a smartphone. Now I\'m buying apps.')

matches = matcher(doc)

for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps
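The "?" operator also works with purely lexical attributes, so it can be tried without a statistical model. The sketch below makes a punctuation token between two words optional. It assumes spaCy 2.2.2 or later for the list-of-patterns form of `Matcher.add`, and the text, the pattern name `WORLD_CUP` and the variable names are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

blank_nlp = English()
demo_matcher = Matcher(blank_nlp.vocab)

optional_pattern = [
    {'LOWER': 'world'},
    {'IS_PUNCT': True, 'OP': '?'},  # optional: match 0 or 1 times
    {'LOWER': 'cup'}
]
# Assumption: spaCy >= 2.2.2, where add() takes a list of patterns
demo_matcher.add('WORLD_CUP', [optional_pattern])

# The tokenizer splits "World-Cup" into "World", "-" and "Cup"
demo_doc = blank_nlp('World Cup or World-Cup')

matched_texts = [
    demo_doc[start:end].text
    for _, start, end in demo_matcher(demo_doc)
]
print(matched_texts)
```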
