# Chapter 1: Finding words, phrases, names and concepts

## 1. Introduction to spaCy

**The nlp object**

In [1]:
# Import the English language class

from spacy.lang.en import English

# Create the nlp object
nlp = English()

- contains the processing pipeline
- includes language-specific rules for tokenization etc.

**The doc object**

In [2]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


**The Token Object**

![](https://course.spacy.io/doc.png)

In [3]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


**The Span object**

![](https://course.spacy.io/doc_span.png)

In [4]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object (the end index is exclusive)
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!


**Lexical Attributes**

In [5]:
doc = nlp("It costs $5.")
print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])
print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]


## 2. Getting Started

**Part 1: English**

- Import the English class from spacy.lang.en and create the nlp object.
- Create a doc and print its text.

In [6]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


## 3. Documents, spans and tokens

When you call `nlp` on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

**Step 1.**

- Import the English language class and create the nlp object.
- Process the text and instantiate a `Doc` object in the variable `doc`.
- Select the first token of the `Doc` and print its `text`.

In [7]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


**Step 2.**

- Import the English language class and create the `nlp` object.
- Process the text and instantiate a `Doc` object in the variable `doc`.
- Create slices of the `Doc` for the tokens "tree kangaroos" and "tree kangaroos and narwhals".

In [8]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


## 4. Lexical attributes

In this example, you’ll use spaCy’s `Doc` and `Token` objects and lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.


- Use the `like_num` token attribute to check whether a token in the `doc` resembles a number.
- Get the token following the current token in the document. The index of the next token in the `doc` is `token.i + 1`.
- Check whether the next token’s `text` attribute is a percent sign `'%'`.


In [9]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
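
The loop above indexes `doc[token.i + 1]` directly, which would raise an `IndexError` if a number-like token happened to be the last token of the text. Here is a minimal sketch of one way to guard against that, pairing each token with its successor via `zip`. The sample sentence is made up for illustration.

```python
from spacy.lang.en import English

nlp = English()

# Hypothetical sample text that ends in a number-like token
doc = nlp("Unemployment fell to 5% in 2019")

percentages = []
# Pair every token with the one that follows it; the last token has no
# successor, so it is never used as the "current" token
for token, next_token in zip(doc, doc[1:]):
    if token.like_num and next_token.text == "%":
        percentages.append(token.text)

print("Percentages found:", percentages)
```

Only `5` should be reported here: `2019` also resembles a number, but nothing follows it, so it is skipped rather than causing an error.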


## 5. Statistical Models & 6. Model Packages

- Enable spaCy to predict linguistic attributes in context
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

**Small Model** (`$ python -m spacy download en_core_web_sm`):

- Binary weights
- Vocabulary
- Meta information (language, pipeline)


In [10]:
import spacy

# Load the large English model
nlp = spacy.load('en_core_web_lg')

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


**Predicting Syntactic Dependencies**

In [11]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


**Dependency label scheme**

![](https://course.spacy.io/dep_example.png)


| Label | Description | Example |
|-------|-------------|---------|
| nsubj | nominal subject | She |
| dobj | direct object | pizza |
| det | determiner (article) | the |

**Predicting Named Entities**

![](https://course.spacy.io/ner_example.png)

In [12]:
# Process a text
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


**Tip: the explain method**

Get quick definitions of the most common tags and labels.

In [13]:
spacy.explain('GPE')

'Countries, cities, states'

In [14]:
spacy.explain('NNP')

'noun, proper singular'

In [15]:
spacy.explain('dobj')

'direct object'

## 7. Loading models

The models we’re using in this course are already pre-installed. For more details on spaCy’s statistical models and how to install them on your machine, see the [documentation](https://spacy.io/usage/models).

- Use spacy.load to load the large English model `'en_core_web_lg'`.
- Process the text and print the document text.


In [16]:
# spaCy is already imported and the large English model is already loaded
# as nlp (see In [10]). To run this cell on its own, uncomment:
# import spacy
# nlp = spacy.load("en_core_web_lg")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


## 8. Predicting linguistic annotations

You’ll now get to try one of spaCy’s pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call `spacy.explain` in the loop. For example: `spacy.explain('PROPN')` or `spacy.explain('GPE')`.

**Part 1**

- Process the text with the nlp object and create a doc.
- For each token, print the token text, the token’s `.pos_` (part-of-speech tag) and the token’s `.dep_` (dependency label).

In [17]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          VERB      ROOT      
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ccomp     
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [18]:
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True, options={'distance':140})

In [19]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

displacy.render(doc, style="ent")

**Part 2**

- Process the text and create a `doc` object.
- Iterate over the `doc.ents` and print the entity text and `label_` attribute.

In [20]:
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


## 9. Predicting named entities in context

In [21]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X


## 10. Rule-based matching

### Why not just regular expressions?

- Match on Doc objects, not just strings
- Match on tokens and token attributes
- Use the model's predictions
- Example: "duck" (verb) vs. "duck" (noun)
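
To make these points concrete, here is a small sketch that uses only Python's built-in `re` module. The sentence is invented for illustration. A regular expression sees characters rather than tokens, so it also fires inside a longer word and cannot tell a verb from a noun.

```python
import re

text = "Ducks duck when I shout duck."

# A plain regex works on the raw string: it has no notion of tokens,
# so it also matches the first four letters of "Ducks"
hits = [
    (match.start(), match.group())
    for match in re.finditer(r"duck", text, flags=re.IGNORECASE)
]
print(hits)
```

Word boundaries (`\bduck\b`) would fix the substring problem, but telling the verb from the noun needs part-of-speech information, which a match pattern can use through attributes such as `'POS'`.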


### Match patterns

- Lists of dictionaries, one per token

- Match exact token texts

`[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]`

- Match lexical attributes

`[{'LOWER': 'iphone'}, {'LOWER': 'x'}]`

- Match any token attributes

`[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]`

In [22]:
# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_lg')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


### Matching lexical attributes

In [23]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

matcher.add('FIFA_PATTERN', None, pattern)

# Process some text
doc = nlp("2018 FIFA World Cup: France won!")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


### Matching other token attributes

In [24]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

matcher.add('DOG_PATTERN', None, pattern)

doc = nlp("I loved dogs but now I love cats more.")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


### Using operators and quantifiers (1)

In [25]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
doc = nlp("I bought a smartphone. Now I'm buying apps.")

matcher.add('BUY_PATTERN', None, pattern)

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps


### Using operators and quantifiers (2)

| Example | Description |
|---------|-------------|
| `{'OP': '!'}` | Negation: match 0 times |
| `{'OP': '?'}` | Optional: match 0 or 1 times |
| `{'OP': '+'}` | Match 1 or more times |
| `{'OP': '*'}` | Match 0 or more times |
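
The `'+'` operator is not used in the cells of these notes, so here is a minimal sketch of it. It rests on two assumptions: a blank `English` pipeline is used (enough for lexical attributes such as `'LOWER'`), and the code follows the spaCy v3 signature `matcher.add(name, [pattern])`, whereas the cells in these notes use the older `matcher.add(name, None, pattern)` form. The names `blank_nlp`, `op_matcher` and the sample sentence are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

# A blank English pipeline is enough here: 'LOWER' is a lexical attribute,
# so no statistical model is required
blank_nlp = English()
op_matcher = Matcher(blank_nlp.vocab)

# One or more "very" tokens followed by "good"
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]

# Assumption: spaCy v3 signature, where patterns are passed as a list
op_matcher.add("VERY_GOOD_PATTERN", [pattern])

doc = blank_nlp("The coffee was very very good.")
matched_texts = [doc[start:end].text for match_id, start, end in op_matcher(doc)]
print(matched_texts)
```

Because the matcher reports every span that satisfies the pattern, both "very good" and "very very good" should appear in the result.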

## 11. Using the Matcher

Let’s try spaCy’s rule-based `Matcher`. You’ll use the example from the previous exercise and write a pattern that matches the phrase `"iPhone X"` in the text.

- Import the `Matcher` from `spacy.matcher`.
- Initialize it with the nlp object’s shared `vocab`.
- Create a pattern that matches the `'TEXT'` values of two tokens: `"iPhone"` and `"X"`.
- Use the `matcher.add` method to add the pattern to the matcher.
- Call the matcher on the `doc` and store the result in the variable `matches`.
- Iterate over the matches and get the matched span from the `start` to the `end` index.

In [26]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


## 12. Writing match patterns

In this exercise, you’ll practice writing more complex match patterns using different token attributes and operators.

**Part 1:**

- Write **one** pattern that only matches mentions of the *full* iOS versions: "iOS 7", "iOS 10" and "iOS 11".

In [27]:
doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


**Part 2:**

- Write one pattern that only matches forms of "download" (tokens with the lemma "download"), followed by a token with the part-of-speech tag `'PROPN'` (proper noun).

In [28]:
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


**Part 3:**

- Write one pattern that matches adjectives (`'ADJ'`) followed by one or two `'NOUN'`s (one noun and one optional noun).

In [29]:
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 4
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice responses


# Chapter 2: Large-scale data analysis with spaCy

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.

## 1. Data Structures (1): Vocab, Lexemes and StringStore

**Shared vocab and string store (1)**

- `Vocab`: stores data shared across multiple documents
- To save memory, spaCy encodes all strings to **hash values**
- Strings are only stored once in the `StringStore` via `nlp.vocab.strings`
- String store: **lookup table** in both directions

```python
coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]
```
- Hashes can't be reversed – that's why we need to provide the shared vocab

```python
# Raises an error if we haven't seen the string before
string = nlp.vocab.strings[3197928453018144401]
```
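
The idea behind the lookup table can be sketched without spaCy. The `ToyStringStore` class below is hypothetical and uses `hashlib` as a stand-in, so it is implemented differently from spaCy's real `StringStore` and produces different hash values. It only illustrates why a hash can be turned back into a string solely by a store that has already seen that string.

```python
import hashlib


class ToyStringStore:
    """Hypothetical, simplified stand-in for a hash <-> string lookup table."""

    def __init__(self):
        self._strings_by_hash = {}

    @staticmethod
    def hash_string(text):
        # Stand-in hash: 64 bits of a BLAKE2b digest (not spaCy's hash function)
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        # Remember the string so that its hash can be resolved later
        key = self.hash_string(text)
        self._strings_by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash can always be computed
            return self.hash_string(key)
        # Hash -> string only works for strings that were added before
        return self._strings_by_hash[key]


store = ToyStringStore()
toy_coffee_hash = store.add("coffee")
print(store[toy_coffee_hash])
print(store["coffee"] == toy_coffee_hash)

# A second store that has never seen "coffee" cannot resolve the hash
empty_store = ToyStringStore()
try:
    empty_store[toy_coffee_hash]
except KeyError:
    print("Unknown hash: this store has never seen the string")
```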

In [30]:
coffee_hash = nlp.vocab.strings['coffee']
print('coffee_hash: ', coffee_hash)
coffee_string = nlp.vocab.strings[coffee_hash]
print('coffee_string: ', coffee_string)

coffee_hash:  3197928453018144401
coffee_string:  coffee


In [31]:
doc = nlp("I love coffee")
print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401


**Lexemes: entries in the vocabulary**

- A `Lexeme` object is an entry in the vocabulary

In [32]:
doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


- Contains the context-independent information about a word
    - Word text: `lexeme.text` and `lexeme.orth` (the hash)
    - Lexical attributes like `lexeme.is_alpha`
    - **Not** context-dependent part-of-speech tags, dependencies or entity labels
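
Here is a small sketch of what "context-independent" means in practice, assuming a blank `English` pipeline (lexical attributes need no statistical model) and two made-up sentences. Tokens for the same word in different documents point to the same vocabulary entry.

```python
from spacy.lang.en import English

blank_nlp = English()

# Two hypothetical documents that both contain the word "coffee"
doc_a = blank_nlp("I love coffee")
doc_b = blank_nlp("No coffee for me, thanks")

coffee_lexeme = blank_nlp.vocab["coffee"]

# Both tokens carry the same hash as the lexeme, whatever their context
same_entry = doc_a[2].orth == doc_b[1].orth == coffee_lexeme.orth
print(same_entry, coffee_lexeme.is_alpha)
```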