<a href="https://colab.research.google.com/github/bharathkumar-kancharla/Natural-Language-Processing/blob/master/Advanced_Spacy_with_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is based on the course [Advanced NLP with spaCy](https://course.spacy.io/).


The required files are available [here](https://github.com/ines/spacy-course/tree/master/exercises).

In [0]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object (Contains the processing pipeline)
# It contains all the different components in the pipeline.
# It also includes language-specific rules used for tokenizing the text into words and punctuation.

nlp = English()

**The Doc object**

When we process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets us access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index.

In [0]:
# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


**The Token object**

Token objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the Doc.

Token objects also provide various attributes that let you access more information about the tokens. For example, the `.text` attribute returns the verbatim token text.

![alt text](https://course.spacy.io/doc.png)

In [0]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


**The Span object**

A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a Span, you can use Python's slice notation. 

![Span Object](https://course.spacy.io/doc_span.png)

In [0]:
# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!


**Lexical Attributes**

Here are some of the available token attributes:

`i` is the index of the token within the parent document.

`text` returns the token text.

`is_alpha`, `is_punct` and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation, or whether it resembles a number. For example, `like_num` is true both for the token "10" and for the word "ten".

These attributes are also called lexical attributes: **they refer to the entry in the vocabulary and don't depend on the token's context.**
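Because lexical attributes depend only on the token's text, they can be pictured as plain functions of a string. The sketch below is a simplified stand-in and not spaCy's implementation: `toy_is_alpha`, `toy_like_num` and the small `NUMBER_WORDS` set are hypothetical names introduced here for illustration only.

```python
# Toy stand-ins for lexical attributes (hypothetical helpers, not spaCy's code).
# Each one looks only at the string itself, never at the surrounding tokens.

NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}  # deliberately tiny


def toy_is_alpha(text):
    """True if the text consists of alphabetic characters only."""
    return text.isalpha()


def toy_like_num(text):
    """True if the text looks like a number: digits or a known number word."""
    cleaned = text.replace(",", "").replace(".", "")
    return cleaned.isdigit() or text.lower() in NUMBER_WORDS


for word in ["It", "costs", "$", "5", ".", "ten"]:
    print(word, toy_is_alpha(word), toy_like_num(word))
```

The same word always gets the same answer, which is what "context-independent" means here. Part-of-speech tags, by contrast, can differ for the same string depending on the sentence.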

In [0]:
doc = nlp("It costs $5.")

print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])

print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]


**imports for other languages:**

German : from spacy.lang.de import German

Spanish : from spacy.lang.es import Spanish

In [0]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4


# Statistical models


**What are statistical models?**
1. Enable spaCy to predict linguistic attributes in context
  
  *   Part-of-speech tags
  *   Syntactic dependencies
  *   Named entities

2. Models are trained on labeled example texts

3. Can be updated with more examples to fine-tune predictions

**Model Packages**

- spaCy provides a number of pre-trained model packages you can download using the "spacy download" command. 

  For example, the "*en_core_web_sm*" package is a small English model that supports all core capabilities and is <font color = 'blue'>trained on web text.</font>

- The `spacy.load` method loads a model package by name and returns an nlp object.

- The package provides the binary weights that enable spaCy to make predictions.

- It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

In [0]:
# Download the English model package
!python -m spacy download en_core_web_sm

import spacy
# Load the package
nlp = spacy.load('en_core_web_sm')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


**Predicting Part-of-speech Tags**

- Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

- First, we load the small English model and receive an nlp object.

- Next, we're processing the text "She ate the pizza".

- For each token in the Doc, we can print the text and the `.pos_` attribute, the predicted part-of-speech tag.

- In spaCy, attributes that return strings usually end with an underscore – attributes <font color='blue'>without the underscore return an ID.</font>

- Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.

In [0]:
# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


In [0]:
token = doc[1]
token.text,token.pos

('ate', 100)

**Predicting Syntactic Dependencies**

- In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

- The `.dep_` attribute returns the predicted dependency label.

- The head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [0]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


**Dependency label scheme**

![Dependency label](https://course.spacy.io/dep_example.png)


- To describe syntactic dependencies, spaCy uses a standardized label scheme. Here's an example of some common labels:

- The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".

- The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".

- The determiner "the", also known as an article, is attached to the noun "pizza".

**Predicting Named Entities**

![Named Entities](https://course.spacy.io/ner_example.png)

- Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

- The `doc.ents` property lets you access the named entities predicted by the model.

- It returns a sequence of Span objects, so we can print the entity text and the entity label using the `.label_` attribute.

- In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.

In [0]:
# Process a text
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


**TIP: the explain method**

- Get quick definitions of the most common tags and labels

- "GPE" for geopolitical entity isn't exactly intuitive, but `spacy.explain` can tell you that it refers to countries, cities and states.

- The same works for part-of-speech tags and dependency labels.

In [0]:
spacy.explain('GPE')

'Countries, cities, states'

In [0]:
spacy.explain('NNP')

'noun, proper singular'

In [0]:
spacy.explain('dobj')

'direct object'

**Note:** Statistical models allow you to generalize based on a set of training examples. Once they’re trained, they use binary weights to make predictions. That’s why it’s not necessary to ship them with their training data.


**Q) What’s not included in a model package that you can load into spaCy?**

- A meta file including the language, pipeline and license.

- Binary weights to make statistical predictions.

- <font color='red'>The labelled data that the model was trained on.</font>

- Strings of the model's vocabulary and their hashes.

In [0]:
spacy.explain('PROPN')

'proper noun'

# Rule-Based Matching

**Why not just regular expressions?**
- Match on Doc objects, not just strings
- Match on tokens and token attributes
- Use the model's predictions
  Example: "duck" (verb) vs. "duck" (noun)

- Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

- It's also more flexible: you can search for texts but also other lexical attributes.

- You can even write rules that use the model's predictions.

- For example, find the word "duck" only if it's a verb, not a noun.

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In this example, we're looking for two tokens with the text "iPhone" and "X".

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".

**Match patterns**
- Lists of dictionaries, one per token

- Match exact token texts

  [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
  
- Match lexical attributes

  [{'LOWER': 'iphone'}, {'LOWER': 'x'}]
 
- Match any token attributes

  [{'LEMMA': 'buy'}, {'POS': 'NOUN'}]

1. To use a pattern, we first import the matcher from `spacy.matcher`.

2. We also load a model and create the nlp object.

3. The matcher is initialized with the shared vocabulary, `nlp.vocab`. You'll learn more about this later. For now, just remember to always pass it in.

4. The `matcher.add` method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to None. The third argument is the pattern. (This is the spaCy v2 call signature used throughout this notebook; spaCy v3 instead takes a list of patterns as the second argument.)

5. To match the pattern on a text, we can call the matcher on any doc.

This will return the matches.

In [0]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

In [0]:
# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


In [0]:
matches

[(9528407286733565721, 1, 3)]

- **match_id:** hash value of the pattern name
- **start:** start index of matched span
- **end:** end index of matched span

In [0]:
# Call the matcher on the doc
doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    print(match_id)
    print(start)
    print(end)
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

9528407286733565721
1
3
iPhone X


**Matching lexical attributes**

Here's an example of a more complex pattern using lexical attributes.

We're looking for five tokens:

1. A token consisting of only digits.

2. Three case-insensitive tokens for "fifa", "world" and "cup".

3. A token that consists of punctuation.

The pattern matches the tokens "2018 FIFA World Cup:".

In [0]:
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]

doc = nlp("2018 FIFA World Cup: France won!")

doc

2018 FIFA World Cup: France won!

**Matching other token attributes**

In this example, we're looking for two tokens:

A verb with the lemma "love", followed by a noun.

This pattern will match "loved dogs" and "love cats".

In [0]:
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]

In [0]:
doc = nlp("I loved dogs but now I love cats more.")
doc

I loved dogs but now I love cats more.

**Using operators and quantifiers**

  Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

  Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

In [0]:
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")

"OP" can have one of four values:

An "!" negates the token, so it's matched 0 times.

A "?" makes the token optional, and matches it 0 or 1 times.

A "+" matches a token 1 or more times.

And finally, an "*" matches 0 or more times.

Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.


| Example | Description |
| --- | --- |
| `{'OP': '!'}` | Negation: match 0 times |
| `{'OP': '?'}` | Optional: match 0 or 1 times |
| `{'OP': '+'}` | Match 1 or more times |
| `{'OP': '*'}` | Match 0 or more times |
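To make the quantifier semantics concrete, here is a minimal pure-Python sketch of a token matcher that supports `?`, `+` and `*` (the `!` operator is left out to keep it short). It is a toy illustration and not how spaCy's `Matcher` is implemented: `match_ends` and `find_matches` are hypothetical helpers, and the lemmas and part-of-speech tags are written by hand instead of being predicted by a model.

```python
def match_ends(tokens, pattern, t, p=0):
    """Return every end index at which pattern[p:] can finish when it starts at tokens[t]."""
    if p == len(pattern):
        return {t}
    entry = pattern[p]
    op = entry.get("OP")
    attrs = {key: value for key, value in entry.items() if key != "OP"}
    fits = t < len(tokens) and all(tokens[t].get(k) == v for k, v in attrs.items())

    ends = set()
    if op in ("?", "*"):
        # Zero occurrences: move to the next pattern entry without consuming a token
        ends |= match_ends(tokens, pattern, t, p + 1)
    if fits:
        if op in ("+", "*"):
            # Repeat: consume a token and stay on the same pattern entry
            ends |= match_ends(tokens, pattern, t + 1, p)
        # Consume a token and move to the next pattern entry
        ends |= match_ends(tokens, pattern, t + 1, p + 1)
    return ends


def find_matches(tokens, pattern):
    """Try the pattern at every start position and collect the matched texts."""
    found = []
    for start in range(len(tokens)):
        for end in sorted(match_ends(tokens, pattern, start)):
            if end > start:
                found.append(" ".join(tok["TEXT"] for tok in tokens[start:end]))
    return found


# Hand-written annotations for "I bought a smartphone. Now I'm buying apps."
# (assumed values for illustration, not model output)
annotated = [
    ("I", "I", "PRON"), ("bought", "buy", "VERB"), ("a", "a", "DET"),
    ("smartphone", "smartphone", "NOUN"), (".", ".", "PUNCT"),
    ("Now", "now", "ADV"), ("I", "I", "PRON"), ("'m", "be", "AUX"),
    ("buying", "buy", "VERB"), ("apps", "app", "NOUN"), (".", ".", "PUNCT"),
]
tokens = [{"TEXT": text, "LEMMA": lemma, "POS": pos} for text, lemma, pos in annotated]

pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matches = find_matches(tokens, pattern)
print(matches)
```

With these annotations the optional determiner lets one pattern cover both "bought a smartphone" (determiner present) and "buying apps" (determiner absent).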

In [0]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


In [0]:
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [0]:
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


# Data Structures : Vocab, Lexemes and StringStore

**Shared vocab and string store**

- **Vocab:** stores data shared across multiple documents
- To save memory, spaCy encodes all strings to hash values
- Strings are only stored once in the StringStore via nlp.vocab.strings
- **String store:** lookup table in both directions

spaCy stores all shared data in a vocabulary, the Vocab.

This includes words, but also the label schemes for tags and entities.

To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time.

Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as `nlp.vocab.strings`.

It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.
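The asymmetry between the two lookup directions can be sketched with a toy string store. This is only an illustration under stated assumptions: `ToyStringStore` and `toy_hash` are hypothetical names, and the SHA-256 based hash is a stand-in, not the hash function spaCy actually uses.

```python
import hashlib


def toy_hash(string):
    """Map a string to a 64-bit integer (stand-in for spaCy's real hash function)."""
    digest = hashlib.sha256(string.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyStringStore:
    """String -> hash is computed on the fly; hash -> string needs a stored entry."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        """Store the string once and return its hash."""
        key = toy_hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # Hashing always works, even for strings never seen before
            return toy_hash(item)
        # The reverse lookup only works if the string was stored earlier
        return self._by_hash[item]


store_a = ToyStringStore()
store_b = ToyStringStore()

coffee_hash = store_a.add("coffee")
print(store_a[coffee_hash])  # store_a has seen the string, so the lookup succeeds

try:
    store_b[coffee_hash]
except KeyError:
    print("store_b never saw 'coffee', so the hash can't be resolved")
```

Both stores compute the same hash for "coffee", but only the store that recorded the string can turn the hash back into text.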

In [0]:
coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]

KeyError: ignored

In [0]:
# Raises an error if we haven't seen the string before
string = nlp.vocab.strings[3197928453018144401]

KeyError: ignored

To get the hash for a string, we can look it up in `nlp.vocab.strings`.

To get the string representation of a hash, we can look up the hash.

A Doc object also exposes its vocab and strings.

In [0]:
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


In [0]:
#The doc also exposes the vocab and strings
doc = nlp("I love coffee")
print('hash value:', doc.vocab.strings['coffee'])

hash value: 3197928453018144401


**Lexemes: entries in the vocabulary**

- Lexemes are *context-independent* entries in the vocabulary.

- You can get a lexeme by *looking up a string or a hash ID* in the vocab.

- Lexemes expose attributes, just like tokens.

- They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.

- Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.

In [0]:
doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


Contains the **context-independent** information about a word
- **Word text:** `lexeme.text` and `lexeme.orth` (the hash)
- Lexical attributes like `lexeme.is_alpha`
- **Not** context-dependent part-of-speech tags, dependencies or entity labels

**Vocab, hashes and lexemes**

![alt text](https://course.spacy.io/vocab_stringstore.png)

The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.

Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.

In [0]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


In [0]:
doc = nlp("David Bowie is a PERSON")

# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[person_hash]
print(person_string)

380
PERSON


In [0]:
# Why does this code throw an error?

from spacy.lang.en import English
from spacy.lang.de import German

# Create an English and German nlp object
nlp = English()
nlp_de = German()

# Get the ID for the string 'Bowie'
bowie_id = nlp.vocab.strings['Bowie']
print(bowie_id)

# Look up the ID for 'Bowie' in the vocab
print(nlp_de.vocab.strings[bowie_id])

2644858412616767388


KeyError: ignored

**Answer:** The string `Bowie` isn't in the German vocab, so the hash can't be resolved in the string store.

**Reason:** Hashes can’t be reversed. To prevent this problem, add the word to the new vocab by processing a text or looking up the string, or use the same vocab to resolve the hash back to a string.

**The Doc object**

- The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.

- After creating the nlp object, we can import the Doc class from `spacy.tokens`.

- Here we're creating a Doc from three words. The spaces are a list of boolean values indicating whether the word is followed by a space. Every token includes that information – even the last one!

- The Doc class takes three arguments: the shared vocab, the words and the spaces.
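The role of the `spaces` list can be illustrated without spaCy at all. The sketch below uses a hypothetical helper, `join_words`, to rebuild the text from words and their trailing-space flags; it mirrors the idea described above and is not the `Doc` class itself.

```python
def join_words(words, spaces):
    """Rebuild the text: each word, plus one space if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


# "Hello" is followed by a space; "world" and "!" are not
print(join_words(["Hello", "world", "!"], [True, False, False]))

# The comma attaches to "Go", so "Go" has no trailing space but "," does
print(join_words(["Go", ",", "get", "started", "!"], [False, True, True, False, False]))
```

Keeping one flag per token, including the last one, is what lets the original text be reconstructed exactly.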

In [0]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc

Hello world!

**The Span object**

![alt text](https://course.spacy.io/span_indices.png)

A Span is a slice of a Doc consisting of one or more tokens. The Span takes at least three arguments: the doc it refers to, and the start and end index of the span. Remember that the end index is exclusive!

- To create a Span manually, we can also import the class from `spacy.tokens`. We can then instantiate it with the doc and the span's start and end index.

- To add an entity label to the span, we first need to look up the string in the string store. We can then provide it to the span as the label argument.

- `doc.ents` is writable, so we can add entities manually by overwriting it with a list of spans.

In [0]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)
print(span)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")
print(span_with_label)

# Add span to the doc.ents
doc.ents = [span_with_label]
print(doc.ents)

Hello world
Hello world
(Hello world,)


**Tips and Tricks**
- The Doc and Span are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.

- If your application needs to output strings, make sure to convert the doc as late as possible. If you do it too early, you'll lose all relationships between the tokens.

- To keep things consistent, try to use built-in token attributes wherever possible. For example, **token.i** for the token index.

Also, don't forget to always pass in the shared vocab!

In [0]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [0]:
from spacy.lang.en import English

nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.

In [0]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check if the next token is a verb
        if pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

**Why is the code bad?**

**Answer:**  It only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.

**Reason:** Always convert the results to strings as late as possible, and try to use native token attributes to keep things consistent.

- The **.pos_** attribute returns the coarse-grained part-of-speech tag and PROPN is the correct tag to check for proper nouns.

- It shouldn’t be necessary to convert strings back to **Token** objects. Instead, try to avoid converting tokens to strings if you still need to access their attributes and relationships.

In [0]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

**Word vectors and semantic similarity**

This section shows how to use spaCy to predict how similar documents, spans or tokens are to each other.

**Comparing semantic similarity**
- `spaCy` can compare two objects and predict similarity
- Doc.similarity(), Span.similarity() and Token.similarity()
- Take another object and return a similarity score (0 to 1)
  

- **Important:** needs a model that has word vectors included, for example:
          ✅ en_core_web_md (medium model)
          ✅ en_core_web_lg (large model)
          🚫 NOT en_core_web_sm (small model)
          
spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.

The Doc, Token and Span objects have a `.similarity` method that takes another object and returns a floating point number between 0 and 1, indicating how similar they are.

One thing that's very important: In order to use similarity, you need a larger spaCy model that has word vectors included.

For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg". You can find more details on this in the models documentation.

In [0]:
# Similarity examples
# Load a larger model with vectors

# Download the English model package
# !python -m spacy download en_core_web_lg
# issue: https://stackoverflow.com/questions/56927602/unable-to-load-the-spacy-model-en-core-web-lg-on-google-colab
import spacy
nlp = spacy.load('en_core_web_lg')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

0.8627204117787385


In [0]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.7369546


**Similarity examples**

- You can also use the similarity methods to compare different types of objects.

For example, a document and a token.

Here, the similarity score is pretty low and the two objects are considered fairly dissimilar.

Here's another example comparing a span – "pizza and pasta" – to a document about McDonalds.

The score returned here is 0.62, so it's determined to be kind of similar.

In [0]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.32531983166759537


In [0]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.6199092090831612


**How does spaCy predict similarity?**
- Similarity is determined using word vectors
- Multi-dimensional meaning representations of words
- Generated using an algorithm like Word2Vec and lots of text
- Can be added to spaCy's statistical models
- Default: cosine similarity, but can be adjusted
- Doc and Span vectors default to average of token vectors
- Short phrases are better than long documents with many irrelevant words

**How does spaCy do this under the hood?**

1. Similarity is determined using word vectors, multi-dimensional representations of meanings of words.

2. You might have heard of Word2Vec, which is an algorithm that's often used to train word vectors from raw text.

3. Vectors can be added to spaCy's statistical models.

4. By default, the similarity returned by spaCy is the cosine similarity between two vectors – but this can be adjusted if necessary.

5. Vectors for objects consisting of several tokens, like the Doc and Span, default to the average of their token vectors.

That's also why you usually get more value out of shorter phrases with fewer irrelevant words.
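The two ingredients named above, cosine similarity and averaging of token vectors, can be sketched in a few lines of plain Python. The three-dimensional `toy_vectors` are made-up numbers chosen for illustration (real word vectors have hundreds of dimensions), and `cosine_similarity` and `mean_vector` are hypothetical helpers, not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise average, the default way to build a Doc or Span vector."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


# Made-up vectors for illustration only
toy_vectors = {
    "pizza": [0.9, 0.8, 0.1],
    "pasta": [0.8, 0.9, 0.2],
    "soap": [0.1, 0.2, 0.9],
}

# Two food words point in a similar direction; "soap" does not
print(cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"]))
print(cosine_similarity(toy_vectors["pizza"], toy_vectors["soap"]))

# A multi-token phrase is represented by the average of its token vectors
phrase = mean_vector([toy_vectors["pizza"], toy_vectors["pasta"]])
print(cosine_similarity(phrase, toy_vectors["soap"]))
```

Because every token contributes equally to the average, irrelevant words dilute the phrase vector, which is why shorter, focused phrases tend to give more useful scores.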

In [0]:
# Load a larger model with vectors
nlp = spacy.load('en_core_web_lg')

doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

To give you an idea of what those vectors look like, here's an example.

First, we load the large model again, which ships with word vectors.

Next, we can process a text and look up a token's vector using the `.vector` attribute.

The result is a 300-dimensional vector of the word "banana".

**Similarity depends on the application context**

- Predicting similarity can be useful for many types of applications. For example, to recommend a user similar texts based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform.

- However, it's important to keep in mind that there's no objective definition of what's similar and what isn't. It always depends on the context and what your application needs to do.

- **Here's an example:** spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

In [0]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9501447503553421
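
One way to see why two texts with opposite sentiment can still score as similar: by default, spaCy compares documents with cosine similarity, and a document's vector is the average of its token vectors. The sketch below reproduces that arithmetic in plain Python. The 3-dimensional `toy_vectors` are made up for illustration and are not taken from any spaCy model; `average_vector` and `cosine` are hypothetical helper names.

```python
import math

# Made-up 3-dimensional vectors, for illustration only
# (real spaCy models ship vectors with e.g. 300 dimensions)
toy_vectors = {
    "I": [0.1, 0.9, 0.2],
    "like": [0.8, 0.1, 0.3],
    "hate": [-0.6, 0.2, 0.4],
    "cats": [0.2, 0.3, 0.9],
}


def average_vector(words):
    """Average the toy vectors of the given words, dimension by dimension."""
    vectors = [toy_vectors[word] for word in words]
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The two words on their own point in rather different directions ...
word_sim = cosine(toy_vectors["like"], toy_vectors["hate"])
# ... but the averaged texts share two of their three tokens
doc_sim = cosine(
    average_vector(["I", "like", "cats"]),
    average_vector(["I", "hate", "cats"]),
)
print("like vs. hate:", round(word_sim, 3))
print("I like cats vs. I hate cats:", round(doc_sim, 3))
```

In this toy setup the shared tokens "I" and "cats" pull the two averages together, so the text-level score ends up higher than the word-level score. Real vectors behave in more complicated ways, so treat this only as an intuition.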


In [0]:
import spacy
# !python -m spacy download en_core_web_md
# Load the en_core_web_md model
nlp = spacy.load("en_core_web_md")

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

In [0]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


In [0]:
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.22325331


In [0]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.75173926


**Combining models and rules**
Combining statistical models with rule-based systems is one of the most powerful tricks you should have in your NLP toolbox.

**Statistical predictions vs. rules**

- Statistical models are useful if your application needs to be able to generalize based on a few examples.

- For instance, detecting product or person names usually benefits from a statistical model. Instead of providing a list of all person names ever, your application will be able to predict whether a span of tokens is a person name. Similarly, you can predict dependency labels to find subject/object relationships.

- To do this, you would use spaCy's entity recognizer, dependency parser or part-of-speech tagger.

---

Rule-based approaches on the other hand come in handy if there's a more or less finite number of instances you want to find. For example, all countries or cities of the world, drug names or even dog breeds.

In spaCy, you can achieve this with custom tokenization rules, as well as the matcher and phrase matcher.


In [0]:
# Initialize with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS': 'VERB'}, {'LOWER': 'cats'}]
matcher.add('LOVE_CATS', None, pattern)

# Operators can specify how often a token should be matched
# (this second pattern is shown for illustration only and is not added to the matcher)
pattern = [{'TEXT': 'very', 'OP': '+'}, {'TEXT': 'happy'}]

# Calling matcher on doc returns list of (match_id, start, end) tuples
doc = nlp("I love cats and I'm very very happy")
matches = matcher(doc)

In [0]:
doc,matches

(I love cats and I'm very very happy, [(9137535031263442622, 1, 3)])

You can use spaCy's rule-based matcher to find complex patterns in your texts. Here's a quick recap.

The matcher is initialized with the shared vocabulary, usually `nlp.vocab`.

Patterns are lists of dictionaries, and each dictionary describes one token and its attributes. Patterns can be added to the matcher using the `matcher.add` method.

Operators let you specify how often to match a token. For example, "+" will match one or more times.

Calling the matcher on a doc object will return a list of the matches. Each match is a tuple consisting of an ID, and the start and end token index in the document.
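
The recap above can be sketched with the "very ... happy" pattern, this time actually added to the matcher. This is a minimal sketch under two assumptions: it uses a blank `English()` pipeline (enough for `TEXT` patterns, which need no trained model), and it passes the patterns as a list, a call form that should work on spaCy 2.2.2 and later, whereas the cells in this notebook use the older `matcher.add(name, None, pattern)` form. The name `VERY_HAPPY` is made up for the example.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained model required
nlp = English()
matcher = Matcher(nlp.vocab)

# "+" lets the first token description match one or more times
pattern = [{"TEXT": "very", "OP": "+"}, {"TEXT": "happy"}]
# Assumption: spaCy >= 2.2.2, where patterns are passed as a list
matcher.add("VERY_HAPPY", [pattern])

doc = nlp("I'm very very happy")

# Each match is a (match_id, start, end) tuple; the ID is a hash that
# can be turned back into the pattern name via the string store
matched = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(matched)
```

Because of the `+` operator, the matcher can report more than one span for the same stretch of text, for example both the longer and the shorter run of "very".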

In [0]:
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print('Matched span:', span.text)
    # Get the span's root token and root head token
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    # Get the previous token and its POS tag
    print('Previous token:', doc[start - 1].text, doc[start - 1].pos_)

Matched span: Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


Here's a matcher rule for "golden retriever".

If we iterate over the matches returned by the matcher, we can get the match ID and the start and end index of the matched span. We can then find out more about it. Span objects give us access to the original document and all other token attributes and linguistic features predicted by the model.

For example, we can get the span's root token. If the span consists of more than one token, this will be the token that decides the category of the phrase. For example, the root of "Golden Retriever" is "Retriever". We can also find the head token of the root. This is the syntactic "parent" that governs the phrase – in this case, the verb "have".

Finally, we can look at the previous token and its attributes. In this case, it's a determiner, the article "a".

**Efficient phrase matching**
- `PhraseMatcher` works like regular expressions or keyword search, but with access to the tokens!
- Takes `Doc` objects as patterns
- More efficient and faster than the `Matcher`
- Great for matching large word lists

In [0]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add('DOG', None, pattern)
doc = nlp("I have a Golden Retriever")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Get the matched span
    span = doc[start:end]
    print('Matched span:', span.text)

Matched span: Golden Retriever


The phrase matcher can be imported from `spacy.matcher` and follows the same API as the regular matcher.

Instead of a list of dictionaries, we pass in a Doc object as the pattern.

We can then iterate over the matches in the text, which gives us the match ID, and the start and end of the match. This lets us create a Span object for the matched tokens "Golden Retriever" to analyze it in context.

**Why does this pattern not match the tokens “Silicon Valley” in the doc?**

**Answer:** The tokenizer doesn't create tokens for single spaces, so there's no token with the value ' ' in between.

**Reason:** The tokenizer already takes care of splitting off whitespace and each dictionary in the pattern describes one token.

By default, all tokens described by a pattern will be matched exactly once. Operators are only needed to change this behavior – for example, to match zero or more times.
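
The pattern that this question refers to is not shown in these notes. The sketch below assumes it looked like the hypothetical `broken` pattern, with a dictionary for a single-space token in the middle, and compares it with a `fixed` pattern that has one dictionary per real token. The example sentence and the `find` helper are also assumptions made for the illustration, and the list-style `matcher.add` call should work on spaCy 2.2.2 and later.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # tokenizer only; LOWER and TEXT need no trained model
doc = nlp("Can Silicon Valley workers rein in big tech from within?")

# Hypothetical reconstruction of the failing pattern: it expects a ' ' token
broken = [{"LOWER": "silicon"}, {"TEXT": " "}, {"LOWER": "valley"}]
# One dictionary per actual token: the whitespace is not a token
fixed = [{"LOWER": "silicon"}, {"LOWER": "valley"}]


def find(pattern):
    """Return the texts of all spans in `doc` matched by a single pattern."""
    matcher = Matcher(nlp.vocab)
    # Assumption: spaCy >= 2.2.2, where patterns are passed as a list
    matcher.add("SILICON_VALLEY", [pattern])
    return [doc[start:end].text for _, start, end in matcher(doc)]


broken_matches = find(broken)
fixed_matches = find(fixed)

print([token.text for token in doc[:4]])  # no ' ' token between the words
print("broken:", broken_matches)
print("fixed:", fixed_matches)
```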

In [0]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

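# Create the match patterns
# "LOWER" compares against the lowercase form, so the value must be lowercase
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
# The tokenizer splits "ad-free" into "ad", "-" and "free", one dict per token
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]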

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", None, pattern1)
matcher.add("PATTERN2", None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

Sometimes it’s more efficient to match exact strings instead of writing patterns describing the individual tokens. This is especially true for finite categories of things – like all countries of the world. We already have a list of countries, so let’s use this as the basis of our information extraction script. A list of string names is available as the variable COUNTRIES.

Import the PhraseMatcher and initialize it with the shared vocab as the variable matcher.
Add the phrase patterns and call the matcher on the doc.

In [0]:
import json
from spacy.lang.en import English

with open("exercises/countries.json") as f:
    COUNTRIES = json.loads(f.read())

nlp = English()
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

FileNotFoundError: ignored

In [0]:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
import json

with open("exercises/countries.json") as f:
    COUNTRIES = json.loads(f.read())
with open("exercises/country_text.txt") as f:
    TEXT = f.read()

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
patterns = list(nlp.pipe(COUNTRIES))
matcher.add("COUNTRY", None, *patterns)

# Create a doc and find matches in it
doc = nlp(TEXT)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")

    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    # (only meaningful with a parser; the blank English() pipeline used here has none)
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)

# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "GPE"])

# Chapter 3: Processing Pipelines

A processing pipeline is a series of functions applied to a Doc to add attributes like part-of-speech tags, dependency labels or named entities.

When you call `spacy.load()` to load a model, spaCy initializes the language, adds the pipeline and loads the binary model weights. When you then call the nlp object on a text, the model is already loaded.

The tokenizer always runs before all other pipeline components, because it transforms a string of text into a Doc object. The pipeline also doesn’t have to consist of the tagger, parser and entity recognizer.

`doc = nlp("This is a sentence.")` - Tokenize the text and apply each pipeline component in order.
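
The order described above (tokenize first, then pass the result through each component in turn) can be sketched in plain Python. This is a toy model for intuition only, not spaCy's implementation: `ToyPipeline`, `toy_tokenizer` and `make_component` are made-up names, and the "doc" is just a dictionary.

```python
def toy_tokenizer(text):
    """Stand-in for the tokenizer: turns a string into a toy 'doc'."""
    return {"text": text, "tokens": text.split(), "log": []}


def make_component(name):
    """Create a toy component that records its name and returns the doc."""
    def component(doc):
        doc["log"].append(name)
        return doc
    return component


class ToyPipeline:
    def __init__(self, tokenizer, components):
        self.tokenizer = tokenizer
        self.components = components  # list of (name, callable) tuples

    def __call__(self, text):
        # The tokenizer always runs first: string in, doc out
        doc = self.tokenizer(text)
        # Every component takes the doc and hands it to the next one
        for name, component in self.components:
            doc = component(doc)
        return doc


toy_nlp = ToyPipeline(
    toy_tokenizer,
    [(name, make_component(name)) for name in ["tagger", "parser", "ner"]],
)
toy_doc = toy_nlp("This is a sentence.")
print(toy_doc["tokens"])
print(toy_doc["log"])
```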

In [0]:
# Load the en_core_web_sm model: initialize the language, add the pipeline and load the weights
nlp = spacy.load('en_core_web_sm')

# Print the names of the pipeline components
print(nlp.pipe_names)

# Print the full pipeline of (name, component) tuples
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fc6b64471d0>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fc6b65505e8>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fc6b6550588>)]


### Custom pipeline components

Custom pipeline components let us add our own functions to the spaCy pipeline. They are executed when we call the nlp object on a text, for example to modify the Doc and add more data to it.

![alt text](https://course.spacy.io/pipeline.png)

Why custom components?
- Make a function execute automatically when you call nlp
- Add your own metadata to documents and tokens
- Update built-in attributes like `doc.ents`

Custom components are executed automatically when you call the nlp object on a text.

They're especially useful for adding your own custom metadata to documents and tokens.

We can also use them to update built-in attributes, like the named entity spans.

> A pipeline component is a function or callable that takes a doc, modifies it and returns it, so it can be processed by the next component in the pipeline.

> Components can be added to the pipeline using the `nlp.add_pipe` method. The method takes at least one argument: the component function.


    def custom_component(doc):
      # Do something to the doc here
      return doc

    nlp.add_pipe(custom_component)

| Argument | Description | Example |
|---|---|---|
| `last` | If `True`, add last | `nlp.add_pipe(component, last=True)` |
| `first` | If `True`, add first | `nlp.add_pipe(component, first=True)` |
| `before` | Add before component | `nlp.add_pipe(component, before='ner')` |
| `after` | Add after component | `nlp.add_pipe(component, after='tagger')` |
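
To make the effect of the four arguments concrete, here is a small plain-Python sketch that mimics the positioning logic on a list of component names. It is a toy model, not spaCy's implementation: `toy_add_pipe` is a made-up helper, and its choice to reject more than one position argument is a decision made for this sketch.

```python
def toy_add_pipe(names, name, first=False, last=False, before=None, after=None):
    """Return a copy of `names` with `name` inserted at the requested position."""
    given = [bool(first), bool(last), before is not None, after is not None]
    if sum(given) > 1:
        raise ValueError("Use only one of first, last, before or after")
    names = list(names)
    if first:
        names.insert(0, name)
    elif before is not None:
        names.insert(names.index(before), name)
    elif after is not None:
        names.insert(names.index(after) + 1, name)
    else:
        # No position given (or last=True): append at the end
        names.append(name)
    return names


base = ["tagger", "parser", "ner"]
print(toy_add_pipe(base, "custom", last=True))
print(toy_add_pipe(base, "custom", first=True))
print(toy_add_pipe(base, "custom", before="ner"))
print(toy_add_pipe(base, "custom", after="tagger"))
```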


In [48]:
import spacy
# Create the nlp object
nlp = spacy.load('en_core_web_sm')

# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)

Pipeline: ['custom_component', 'tagger', 'parser', 'ner']


In [0]:
# Process a text
doc = nlp("Hello world!")

Doc length: 3


In [0]:
doc

Hello world!

**Which of these problems can be solved by custom pipeline components? Choose all that apply!**

1. Updating the pre-trained models and improving their predictions
2. Computing your own values based on tokens and their attributes
3. Adding named entities, for example based on a dictionary
4. Implementing support for an additional language

**ANS:** 2 & 3

**Reason:**
- Option 1: Custom components can only modify the Doc and can’t be used to update the weights of other components directly.
- Options 2 and 3: Custom components are great for adding custom values to documents, tokens and spans, and for customizing `doc.ents`.
- Option 4: Custom components are added to the pipeline after the language class is already initialized and after tokenization, so they’re not suitable for adding new languages.

In [0]:
# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc


# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)

# Process a text
doc = nlp("This is a sentence.")

['length_component', 'tagger', 'parser', 'ner']
This document is 5 tokens long.


In [0]:
# Use the PhraseMatcher to find animal names in the document and add the matched spans to `doc.ents`.
# The PhraseMatcher with the animal patterns is created below as the variable matcher.

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
animals = ["Golden Retriever", "cat", "turtle", "Rattus norvegicus"]
animal_patterns = list(nlp.pipe(animals))
print("animal_patterns:", animal_patterns)
matcher = PhraseMatcher(nlp.vocab)
matcher.add("ANIMAL", None, *animal_patterns)

# Define the custom component
def animal_component(doc):
    # Apply the matcher to the doc
    matches = matcher(doc)
    # Create a Span for each match and assign the label 'ANIMAL'
    spans = [Span(doc, start, end, label='ANIMAL') for match_id, start, end in matches]
    # Overwrite the doc.ents with the matched spans
    doc.ents = spans
    return doc


# Add the component to the pipeline after the 'ner' component
nlp.add_pipe(animal_component, after='ner')
print(nlp.pipe_names)

# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

animal_patterns: [Golden Retriever, cat, turtle, Rattus norvegicus]
['tagger', 'parser', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


### Extension attributes

Add custom attributes to the Doc, Token and Span objects to store custom data.

- Custom attributes are available via the `._` (dot-underscore) property. This makes it clear that they were added by the user and are not built into spaCy, like `token.text`.

- Attributes need to be registered on the global `Doc`, `Token` and `Span` classes, which we can import from `spacy.tokens`. To register a custom attribute on the Doc, Token or Span, we can use the `set_extension` method.

- The first argument is the attribute name. Keyword arguments let us define how the value should be computed. In this case, it has a default value and can be overwritten.

In [0]:
# Import global classes
from spacy.tokens import Doc, Token, Span

# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)

# Create a doc, token and span to attach the metadata to
doc = nlp("The sky is blue.")
token = doc[3]
span = doc[1:4]

# Add custom metadata to documents, tokens and spans
doc._.title = 'My document'
token._.is_color = True
span._.has_color = False

**Extension attribute types**
1. Attribute extensions
2. Property extensions
3. Method extensions

In [0]:
from spacy.tokens import Token

# Set extension on the Token with default value
Token.set_extension('is_color', default=False)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True

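ValueError: ignored

The `ValueError` is raised because the `is_color` extension was already registered on `Token` in an earlier cell. Passing `force=True` to `set_extension` overwrites the existing extension, as the next cell does.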

**Attribute extensions**

In [0]:
# Set extension on the Token with default value
Token.set_extension('is_color', default=False, force=True)

doc = nlp("The sky is blue.")

# Overwrite extension attribute value
doc[3]._.is_color = True
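
Besides `force=True`, another way to keep a notebook cell safe to re-run is to check whether the extension exists before registering it. The following is a minimal sketch using `Token.has_extension`; it assumes a blank `English()` pipeline, which is enough here because only tokenization is needed.

```python
from spacy.lang.en import English
from spacy.tokens import Token

# Register the attribute only if it does not exist yet,
# so running this cell a second time does not raise an error
if not Token.has_extension("is_color"):
    Token.set_extension("is_color", default=False)

nlp = English()
doc = nlp("The sky is blue.")

# Overwrite the default value for a single token
doc[3]._.is_color = True

flags = [(token.text, token._.is_color) for token in doc]
print(flags)
```

Unlike `force=True`, this guard keeps whatever definition was registered first, which may or may not be what you want when you are still changing the extension.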

**Property extensions**

- Define a getter and an optional setter function
- The getter is only called when we retrieve the attribute value


This lets us compute the value dynamically, and even take other custom attributes into account.

Getter functions take one argument: the object, in this case, the token. In this example, the function returns whether the token text is in our list of colors.

We can then provide the function via the getter keyword argument when we register the extension.

In [0]:
# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors

# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color, force=True)

doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)

True - blue


In [0]:
# Span extensions should almost always use a getter

# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)

# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color, force=True)

doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)

True - sky is blue
False - The sky
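
Because getters are only evaluated when the attribute is read, a getter can also build on another custom attribute, as mentioned above. The sketch below is one possible variation of the two cells above in which the `Span` getter reuses the `Token` attribute instead of repeating the color list. It assumes a blank `English()` pipeline and uses `force=True` so it can be re-run.

```python
from spacy.lang.en import English
from spacy.tokens import Span, Token

COLORS = ["red", "yellow", "blue"]


def get_is_color(token):
    # Token-level check against the list of colors
    return token.text in COLORS


def get_has_color(span):
    # Reuse the custom Token attribute instead of repeating the list
    return any(token._.is_color for token in span)


Token.set_extension("is_color", getter=get_is_color, force=True)
Span.set_extension("has_color", getter=get_has_color, force=True)

nlp = English()
doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, "-", doc[1:4].text)
print(doc[0:2]._.has_color, "-", doc[0:2].text)
```

With this layout, changing the definition of a color in `get_is_color` is automatically picked up by the span-level attribute.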


**Method extensions**
- Assign a function that becomes available as an object method
- Lets us pass arguments to the extension function

In [0]:
from spacy.tokens import Doc

# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc

# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token, force=True)

doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')

True - blue
False - cloud


In this example, the method function checks whether the doc contains a token with a given text. The first argument of the method is always the object itself, in this case the Doc. It's passed in automatically when the method is called. All other function arguments will be arguments on the method extension. In this case, that is `token_text`.

Here, the custom `has_token` method returns True for the word "blue" and False for the word "cloud".

In [0]:
from spacy.lang.en import English
from spacy.tokens import Token

nlp = English()

# Register the Token extension attribute 'is_country' with the default value False
Token.set_extension('is_country', default=False)

# Process the text and set the is_country attribute to True for the token "Spain"
doc = nlp("I live in Spain.")
doc[3]._.is_country = True

# Print the token text and the is_country attribute for all tokens
print([(token.text, token._.is_country) for token in doc])

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [0]:
# Define the getter function that takes a token and returns its reversed text
def get_reversed(token):
    return token.text[::-1]


# Register the Token property extension 'reversed' with the getter get_reversed
Token.set_extension("reversed", getter=get_reversed, force=True)

# Process the text and print the reversed attribute for each token
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print("reversed:", token._.reversed)

reversed: llA
reversed: snoitazilareneg
reversed: era
reversed: eslaf
reversed: ,
reversed: gnidulcni
reversed: siht
reversed: eno
reversed: .


In [0]:
# Define the getter function
def get_has_number(doc):
    # Return if any of the tokens in the doc return True for token.like_num
    return any(token.like_num for token in doc)


# Register the Doc property extension 'has_number' with the getter get_has_number
Doc.set_extension('has_number', getter=get_has_number)

# Process the text and check the custom has_number attribute
doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

has_number: True


In [0]:
from spacy.tokens import Span
# Define the method
def to_html(span, tag):
    # Wrap the span text in a HTML tag and return it
    return "<{tag}>{text}</{tag}>".format(tag=tag, text=span.text)


# Register the Span method extension 'to_html' with the method to_html
Span.set_extension("to_html", method=to_html)

# Process the text and call the to_html method on the span with the tag name 'strong'
doc = nlp("Hello world, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

<strong>Hello world</strong>


In [0]:
# Entities and Extension

# We'll combine custom extension attributes with the model's predictions and create an attribute getter
# that returns a Wikipedia search URL if the span is a person, organization, or location.
import spacy
nlp = spacy.load("en_core_web_sm")


def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    # (the English models label non-GPE locations as "LOC")
    if span.label_ in ("PERSON", "ORG", "GPE", "LOC"):
        entity_text = span.text.replace(" ", "_")
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text


# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension("wikipedia_url", getter=get_wikipedia_url)

doc = nlp(
    "In over fifty years from his very first recordings right through to his "
    "last album, David Bowie was at the vanguard of contemporary culture."
)
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

fifty years None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie


**Components with extension**

Extension attributes are especially powerful if they’re combined with custom pipeline components. 

Here we write a pipeline component that finds country names and a custom extension attribute that returns a country’s capital, if available.

In [47]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

COUNTRIES = [
    "Afghanistan",
    "Åland Islands",
    "Albania",
    "Algeria",
    "American Samoa",
    "Andorra",
    "Angola",
    "Anguilla",
    "Antarctica",
    "Antigua and Barbuda",
    "Argentina",
    "Armenia",
    "Aruba",
    "Australia",
    "Austria",
    "Azerbaijan",
    "Bahamas",
    "Bahrain",
    "Bangladesh",
    "Barbados",
    "Belarus",
    "Belgium",
    "Belize",
    "Benin",
    "Bermuda",
    "Bhutan",
    "Bolivia (Plurinational State of)",
    "Bonaire, Sint Eustatius and Saba",
    "Bosnia and Herzegovina",
    "Botswana",
    "Bouvet Island",
    "Brazil",
    "British Indian Ocean Territory",
    "United States Minor Outlying Islands",
    "Virgin Islands (British)",
    "Virgin Islands (U.S.)",
    "Brunei Darussalam",
    "Bulgaria",
    "Burkina Faso",
    "Burundi",
    "Cambodia",
    "Cameroon",
    "Canada",
    "Cabo Verde",
    "Cayman Islands",
    "Central African Republic",
    "Chad",
    "Chile",
    "China",
    "Christmas Island",
    "Cocos (Keeling) Islands",
    "Colombia",
    "Comoros",
    "Congo",
    "Congo (Democratic Republic of the)",
    "Cook Islands",
    "Costa Rica",
    "Croatia",
    "Cuba",
    "Curaçao",
    "Cyprus",
    "Czech Republic",
    "Denmark",
    "Djibouti",
    "Dominica",
    "Dominican Republic",
    "Ecuador",
    "Egypt",
    "El Salvador",
    "Equatorial Guinea",
    "Eritrea",
    "Estonia",
    "Ethiopia",
    "Falkland Islands (Malvinas)",
    "Faroe Islands",
    "Fiji",
    "Finland",
    "France",
    "French Guiana",
    "French Polynesia",
    "French Southern Territories",
    "Gabon",
    "Gambia",
    "Georgia",
    "Germany",
    "Ghana",
    "Gibraltar",
    "Greece",
    "Greenland",
    "Grenada",
    "Guadeloupe",
    "Guam",
    "Guatemala",
    "Guernsey",
    "Guinea",
    "Guinea-Bissau",
    "Guyana",
    "Haiti",
    "Heard Island and McDonald Islands",
    "Holy See",
    "Honduras",
    "Hong Kong",
    "Hungary",
    "Iceland",
    "India",
    "Indonesia",
    "Côte d'Ivoire",
    "Iran (Islamic Republic of)",
    "Iraq",
    "Ireland",
    "Isle of Man",
    "Israel",
    "Italy",
    "Jamaica",
    "Japan",
    "Jersey",
    "Jordan",
    "Kazakhstan",
    "Kenya",
    "Kiribati",
    "Kuwait",
    "Kyrgyzstan",
    "Lao People's Democratic Republic",
    "Latvia",
    "Lebanon",
    "Lesotho",
    "Liberia",
    "Libya",
    "Liechtenstein",
    "Lithuania",
    "Luxembourg",
    "Macao",
    "Macedonia (the former Yugoslav Republic of)",
    "Madagascar",
    "Malawi",
    "Malaysia",
    "Maldives",
    "Mali",
    "Malta",
    "Marshall Islands",
    "Martinique",
    "Mauritania",
    "Mauritius",
    "Mayotte",
    "Mexico",
    "Micronesia (Federated States of)",
    "Moldova (Republic of)",
    "Monaco",
    "Mongolia",
    "Montenegro",
    "Montserrat",
    "Morocco",
    "Mozambique",
    "Myanmar",
    "Namibia",
    "Nauru",
    "Nepal",
    "Netherlands",
    "New Caledonia",
    "New Zealand",
    "Nicaragua",
    "Niger",
    "Nigeria",
    "Niue",
    "Norfolk Island",
    "Korea (Democratic People's Republic of)",
    "Northern Mariana Islands",
    "Norway",
    "Oman",
    "Pakistan",
    "Palau",
    "Palestine, State of",
    "Panama",
    "Papua New Guinea",
    "Paraguay",
    "Peru",
    "Philippines",
    "Pitcairn",
    "Poland",
    "Portugal",
    "Puerto Rico",
    "Qatar",
    "Republic of Kosovo",
    "Réunion",
    "Romania",
    "Russian Federation",
    "Rwanda",
    "Saint Barthélemy",
    "Saint Helena, Ascension and Tristan da Cunha",
    "Saint Kitts and Nevis",
    "Saint Lucia",
    "Saint Martin (French part)",
    "Saint Pierre and Miquelon",
    "Saint Vincent and the Grenadines",
    "Samoa",
    "San Marino",
    "Sao Tome and Principe",
    "Saudi Arabia",
    "Senegal",
    "Serbia",
    "Seychelles",
    "Sierra Leone",
    "Singapore",
    "Sint Maarten (Dutch part)",
    "Slovakia",
    "Slovenia",
    "Solomon Islands",
    "Somalia",
    "South Africa",
    "South Georgia and the South Sandwich Islands",
    "Korea (Republic of)",
    "South Sudan",
    "Spain",
    "Sri Lanka",
    "Sudan",
    "Suriname",
    "Svalbard and Jan Mayen",
    "Swaziland",
    "Sweden",
    "Switzerland",
    "Syrian Arab Republic",
    "Taiwan",
    "Tajikistan",
    "Tanzania, United Republic of",
    "Thailand",
    "Timor-Leste",
    "Togo",
    "Tokelau",
    "Tonga",
    "Trinidad and Tobago",
    "Tunisia",
    "Turkey",
    "Turkmenistan",
    "Turks and Caicos Islands",
    "Tuvalu",
    "Uganda",
    "Ukraine",
    "United Arab Emirates",
    "United Kingdom of Great Britain and Northern Ireland",
    "United States of America",
    "Uruguay",
    "Uzbekistan",
    "Vanuatu",
    "Venezuela (Bolivarian Republic of)",
    "Viet Nam",
    "Wallis and Futuna",
    "Western Sahara",
    "Yemen",
    "Zambia",
    "Zimbabwe"
]

CAPITALS = {
  "Afghanistan":"Kabul",
  "\u00c5land Islands":"Mariehamn",
  "Albania":"Tirana",
  "Algeria":"Algiers",
  "American Samoa":"Pago Pago",
  "Andorra":"Andorra la Vella",
  "Angola":"Luanda",
  "Anguilla":"The Valley",
  "Antarctica":"",
  "Antigua and Barbuda":"Saint John's",
  "Argentina":"Buenos Aires",
  "Armenia":"Yerevan",
  "Aruba":"Oranjestad",
  "Australia":"Canberra",
  "Austria":"Vienna",
  "Azerbaijan":"Baku",
  "Bahamas":"Nassau",
  "Bahrain":"Manama",
  "Bangladesh":"Dhaka",
  "Barbados":"Bridgetown",
  "Belarus":"Minsk",
  "Belgium":"Brussels",
  "Belize":"Belmopan",
  "Benin":"Porto-Novo",
  "Bermuda":"Hamilton",
  "Bhutan":"Thimphu",
  "Bolivia (Plurinational State of)":"Sucre",
  "Bonaire, Sint Eustatius and Saba":"Kralendijk",
  "Bosnia and Herzegovina":"Sarajevo",
  "Botswana":"Gaborone",
  "Bouvet Island":"",
  "Brazil":"Bras\u00edlia",
  "British Indian Ocean Territory":"Diego Garcia",
  "United States Minor Outlying Islands":"",
  "Virgin Islands (British)":"Road Town",
  "Virgin Islands (U.S.)":"Charlotte Amalie",
  "Brunei Darussalam":"Bandar Seri Begawan",
  "Bulgaria":"Sofia",
  "Burkina Faso":"Ouagadougou",
  "Burundi":"Bujumbura",
  "Cambodia":"Phnom Penh",
  "Cameroon":"Yaound\u00e9",
  "Canada":"Ottawa",
  "Cabo Verde":"Praia",
  "Cayman Islands":"George Town",
  "Central African Republic":"Bangui",
  "Chad":"N'Djamena",
  "Chile":"Santiago",
  "China":"Beijing",
  "Christmas Island":"Flying Fish Cove",
  "Cocos (Keeling) Islands":"West Island",
  "Colombia":"Bogot\u00e1",
  "Comoros":"Moroni",
  "Congo":"Brazzaville",
  "Congo (Democratic Republic of the)":"Kinshasa",
  "Cook Islands":"Avarua",
  "Costa Rica":"San Jos\u00e9",
  "Croatia":"Zagreb",
  "Cuba":"Havana",
  "Cura\u00e7ao":"Willemstad",
  "Cyprus":"Nicosia",
  "Czech Republic":"Prague",
  "Denmark":"Copenhagen",
  "Djibouti":"Djibouti",
  "Dominica":"Roseau",
  "Dominican Republic":"Santo Domingo",
  "Ecuador":"Quito",
  "Egypt":"Cairo",
  "El Salvador":"San Salvador",
  "Equatorial Guinea":"Malabo",
  "Eritrea":"Asmara",
  "Estonia":"Tallinn",
  "Ethiopia":"Addis Ababa",
  "Falkland Islands (Malvinas)":"Stanley",
  "Faroe Islands":"T\u00f3rshavn",
  "Fiji":"Suva",
  "Finland":"Helsinki",
  "France":"Paris",
  "French Guiana":"Cayenne",
  "French Polynesia":"Papeet\u0113",
  "French Southern Territories":"Port-aux-Fran\u00e7ais",
  "Gabon":"Libreville",
  "Gambia":"Banjul",
  "Georgia":"Tbilisi",
  "Germany":"Berlin",
  "Ghana":"Accra",
  "Gibraltar":"Gibraltar",
  "Greece":"Athens",
  "Greenland":"Nuuk",
  "Grenada":"St. George's",
  "Guadeloupe":"Basse-Terre",
  "Guam":"Hag\u00e5t\u00f1a",
  "Guatemala":"Guatemala City",
  "Guernsey":"St. Peter Port",
  "Guinea":"Conakry",
  "Guinea-Bissau":"Bissau",
  "Guyana":"Georgetown",
  "Haiti":"Port-au-Prince",
  "Heard Island and McDonald Islands":"",
  "Holy See":"Rome",
  "Honduras":"Tegucigalpa",
  "Hong Kong":"City of Victoria",
  "Hungary":"Budapest",
  "Iceland":"Reykjav\u00edk",
  "India":"New Delhi",
  "Indonesia":"Jakarta",
  "C\u00f4te d'Ivoire":"Yamoussoukro",
  "Iran (Islamic Republic of)":"Tehran",
  "Iraq":"Baghdad",
  "Ireland":"Dublin",
  "Isle of Man":"Douglas",
  "Israel":"Jerusalem",
  "Italy":"Rome",
  "Jamaica":"Kingston",
  "Japan":"Tokyo",
  "Jersey":"Saint Helier",
  "Jordan":"Amman",
  "Kazakhstan":"Astana",
  "Kenya":"Nairobi",
  "Kiribati":"South Tarawa",
  "Kuwait":"Kuwait City",
  "Kyrgyzstan":"Bishkek",
  "Lao People's Democratic Republic":"Vientiane",
  "Latvia":"Riga",
  "Lebanon":"Beirut",
  "Lesotho":"Maseru",
  "Liberia":"Monrovia",
  "Libya":"Tripoli",
  "Liechtenstein":"Vaduz",
  "Lithuania":"Vilnius",
  "Luxembourg":"Luxembourg",
  "Macao":"",
  "Macedonia (the former Yugoslav Republic of)":"Skopje",
  "Madagascar":"Antananarivo",
  "Malawi":"Lilongwe",
  "Malaysia":"Kuala Lumpur",
  "Maldives":"Mal\u00e9",
  "Mali":"Bamako",
  "Malta":"Valletta",
  "Marshall Islands":"Majuro",
  "Martinique":"Fort-de-France",
  "Mauritania":"Nouakchott",
  "Mauritius":"Port Louis",
  "Mayotte":"Mamoudzou",
  "Mexico":"Mexico City",
  "Micronesia (Federated States of)":"Palikir",
  "Moldova (Republic of)":"Chi\u0219in\u0103u",
  "Monaco":"Monaco",
  "Mongolia":"Ulan Bator",
  "Montenegro":"Podgorica",
  "Montserrat":"Plymouth",
  "Morocco":"Rabat",
  "Mozambique":"Maputo",
  "Myanmar":"Naypyidaw",
  "Namibia":"Windhoek",
  "Nauru":"Yaren",
  "Nepal":"Kathmandu",
  "Netherlands":"Amsterdam",
  "New Caledonia":"Noum\u00e9a",
  "New Zealand":"Wellington",
  "Nicaragua":"Managua",
  "Niger":"Niamey",
  "Nigeria":"Abuja",
  "Niue":"Alofi",
  "Norfolk Island":"Kingston",
  "Korea (Democratic People's Republic of)":"Pyongyang",
  "Northern Mariana Islands":"Saipan",
  "Norway":"Oslo",
  "Oman":"Muscat",
  "Pakistan":"Islamabad",
  "Palau":"Ngerulmud",
  "Palestine, State of":"Ramallah",
  "Panama":"Panama City",
  "Papua New Guinea":"Port Moresby",
  "Paraguay":"Asunci\u00f3n",
  "Peru":"Lima",
  "Philippines":"Manila",
  "Pitcairn":"Adamstown",
  "Poland":"Warsaw",
  "Portugal":"Lisbon",
  "Puerto Rico":"San Juan",
  "Qatar":"Doha",
  "Republic of Kosovo":"Pristina",
  "R\u00e9union":"Saint-Denis",
  "Romania":"Bucharest",
  "Russian Federation":"Moscow",
  "Rwanda":"Kigali",
  "Saint Barth\u00e9lemy":"Gustavia",
  "Saint Helena, Ascension and Tristan da Cunha":"Jamestown",
  "Saint Kitts and Nevis":"Basseterre",
  "Saint Lucia":"Castries",
  "Saint Martin (French part)":"Marigot",
  "Saint Pierre and Miquelon":"Saint-Pierre",
  "Saint Vincent and the Grenadines":"Kingstown",
  "Samoa":"Apia",
  "San Marino":"City of San Marino",
  "Sao Tome and Principe":"S\u00e3o Tom\u00e9",
  "Saudi Arabia":"Riyadh",
  "Senegal":"Dakar",
  "Serbia":"Belgrade",
  "Seychelles":"Victoria",
  "Sierra Leone":"Freetown",
  "Singapore":"Singapore",
  "Sint Maarten (Dutch part)":"Philipsburg",
  "Slovakia":"Bratislava",
  "Slovenia":"Ljubljana",
  "Solomon Islands":"Honiara",
  "Somalia":"Mogadishu",
  "South Africa":"Pretoria",
  "South Georgia and the South Sandwich Islands":"King Edward Point",
  "Korea (Republic of)":"Seoul",
  "South Sudan":"Juba",
  "Spain":"Madrid",
  "Sri Lanka":"Colombo",
  "Sudan":"Khartoum",
  "Suriname":"Paramaribo",
  "Svalbard and Jan Mayen":"Longyearbyen",
  "Swaziland":"Lobamba",
  "Sweden":"Stockholm",
  "Switzerland":"Bern",
  "Syrian Arab Republic":"Damascus",
  "Taiwan":"Taipei",
  "Tajikistan":"Dushanbe",
  "Tanzania, United Republic of":"Dodoma",
  "Thailand":"Bangkok",
  "Timor-Leste":"Dili",
  "Togo":"Lom\u00e9",
  "Tokelau":"Fakaofo",
  "Tonga":"Nuku'alofa",
  "Trinidad and Tobago":"Port of Spain",
  "Tunisia":"Tunis",
  "Turkey":"Ankara",
  "Turkmenistan":"Ashgabat",
  "Turks and Caicos Islands":"Cockburn Town",
  "Tuvalu":"Funafuti",
  "Uganda":"Kampala",
  "Ukraine":"Kiev",
  "United Arab Emirates":"Abu Dhabi",
  "United Kingdom of Great Britain and Northern Ireland":"London",
  "United States of America":"Washington, D.C.",
  "Uruguay":"Montevideo",
  "Uzbekistan":"Tashkent",
  "Vanuatu":"Port Vila",
  "Venezuela (Bolivarian Republic of)":"Caracas",
  "Viet Nam":"Hanoi",
  "Wallis and Futuna":"Mata-Utu",
  "Western Sahara":"El Aai\u00fan",
  "Yemen":"Sana'a",
  "Zambia":"Lusaka",
  "Zimbabwe":"Harare"
}

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", None, *list(nlp.pipe(COUNTRIES)))
 
def countries_component(doc):
  # Create an entity Span with the label 'GPE' for all matches
  matches = matcher(doc)
  doc.ents = [Span(doc, start, end, label='GPE') for match_id, start, end in matches]
  return doc
 
# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)
 
# Getter that looks up the span text in the dictionary of country capitals
get_capital = lambda span: CAPITALS.get(span.text)
 
# Register the Span extension attribute 'capital' with the getter get_capital
Span.set_extension('capital', getter=get_capital)
 
# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

['countries_component']
[('Czech Republic', 'GPE', 'Prague'), ('Slovakia', 'GPE', 'Bratislava')]


### Scaling and performance

**Processing large volumes of text**
- Use `nlp.pipe` method
- Processes texts as a stream, yields Doc objects
- Much faster than calling nlp on each text

**BAD:**

`docs = [nlp(text) for text in LOTS_OF_TEXTS]`

**GOOD:**

`docs = list(nlp.pipe(LOTS_OF_TEXTS))`

`nlp.pipe` is a generator that yields Doc objects, so wrap it in `list()` when you need all the results at once.
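
To make "processes texts as a stream" more concrete, here is a toy sketch of a pipe-style generator in plain Python. It is not spaCy's implementation: `toy_pipe` and `process_batch` are hypothetical names, and whitespace splitting stands in for the real per-batch work. It only shows the pattern of consuming the input in batches and yielding one result per text, lazily.

```python
from itertools import islice


def process_batch(texts):
    # Hypothetical stand-in for the real per-batch work: split on whitespace
    return [text.split() for text in texts]


def toy_pipe(texts, batch_size=2):
    """Lazily consume `texts` in batches and yield one result per text."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield from process_batch(batch)


LOTS_OF_TEXTS = ["This is a text", "And another text", "One more"]
stream = toy_pipe(LOTS_OF_TEXTS)
first = next(stream)   # only the first batch has been processed at this point
rest = list(stream)    # consuming the generator processes the remaining texts
print(first, rest)
```

Nothing is processed until the generator is consumed, which is why the GOOD example wraps `nlp.pipe` in `list()`.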

**Passing in context**
- Setting `as_tuples=True` on `nlp.pipe` lets you pass in `(text, context)` tuples
- Yields `(doc, context)` tuples
- Useful for associating metadata with the doc

In [0]:
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

# This is useful for passing in additional metadata, like an ID associated with the text, or a page number.
for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])

This is a text 15
And another text 16


In [0]:
# We register two Doc extensions, 'id' and 'page_number', which default to None.
# After processing the text and passing through the context, we overwrite the doc extensions with our context metadata.
from spacy.tokens import Doc

Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)

data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']

**Note:** Sometimes you already have a model loaded to do other processing, but you only need the tokenizer for one particular text.

Running the whole pipeline is unnecessarily slow, because you'll be getting a bunch of predictions from the model that you don't need.

**BAD:**

`doc = nlp("Hello world!")`

**GOOD:**

`doc = nlp.make_doc("Hello world!")`

In [0]:
# If we only need a tokenized Doc object, we can use the nlp.make_doc method instead, which takes a text and returns a Doc.
# nlp.make_doc turns the text into a Doc before any pipeline components are called, so only the tokenizer runs.
doc = nlp.make_doc("Hello world!")
print([token.text for token in doc])

**Disabling pipeline components**

- Use `nlp.disable_pipes` to temporarily disable one or more pipes
- Restores them after the `with` block
- Only runs the remaining components

In [0]:
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

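# Disable the tagger and parser
# (assumes nlp is a loaded model with 'tagger', 'parser' and 'ner' components, e.g. en_core_web_sm)
with nlp.disable_pipes('tagger', 'parser'):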
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

print(doc.ents)

(American, College Park, Georgia)
(American, College Park, Georgia)


In [44]:
TEXTS = [
    "McDonalds is my favorite restaurant.",
    "Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..",
    "People really still eat McDonalds :(",
    "The McDonalds in Spain has chicken wings. My heart is so happy ",
    "@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P",
    "please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D",
    "This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it"
]

import json
import spacy
 
nlp = spacy.load("en_core_web_sm")
 
# with open("exercises/tweets.json") as f:
#   TEXTS = json.loads(f.read())
 
# Process the texts and print the adjectives
for text in TEXTS:
  doc = nlp(text)
  print([token.text for token in doc if token.pos_ == "ADJ"])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible']


In [45]:
# Using nlp.pipe

# Process the texts and print the adjectives
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == "ADJ"])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible']


Here we use custom attributes to add author and book meta information to quotes.

A list of `[text, context]` examples is available as the variable `DATA`.

The texts are quotes from famous books, and the contexts are dictionaries with the keys `'author'` and `'book'`.

In [0]:
DATA = [
    [
        "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.",
        { "author": "Franz Kafka", "book": "Metamorphosis" }
    ],
    [
        "I know not all that may be coming, but be it what it will, I'll go to it laughing.",
        { "author": "Herman Melville", "book": "Moby-Dick or, The Whale" }
    ],
    [
        "It was the best of times, it was the worst of times.",
        { "author": "Charles Dickens", "book": "A Tale of Two Cities" }
    ],
    [
        "The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars.",
        { "author": "Jack Kerouac", "book": "On the Road" }
    ],
    [
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        { "author": "George Orwell", "book": "1984" }
    ],
    [
        "Nowadays people know the price of everything and the value of nothing.",
        { "author": "Oscar Wilde", "book": "The Picture Of Dorian Gray" }
    ]
]

In [51]:
nlp = English()
 
# Register the Doc extension 'author' (default None)
Doc.set_extension('author', default=None, force=True)
 
# Register the Doc extension 'book' (default None)
Doc.set_extension('book', default=None, force=True)
 
for doc, context in nlp.pipe(DATA, as_tuples=True):
  # Set the doc._.book and doc._.author attributes from the context
  doc._.book = context['book']
  doc._.author = context['author']
 
  # Print the text and custom attribute data
  print(doc.text, "\n", "— '{}' by {}".format(doc._.book, doc._.author), "\n")

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 



# Chapter 4: Training a neural network model

This chapter covers how to update spaCy's statistical models to customize them for our use case, for example to predict a new entity type in online comments. We'll write our own training loop from scratch and learn the basics of how training works, along with tips and tricks that can make custom NLP projects more successful.

**Why update the model?**
- Better results on our specific domain
- Learn classification schemes specifically for our problem
- Essential for text classification
- Very useful for named entity recognition
- Less critical for part-of-speech tagging and dependency parsing

**How training works**
1. Initialize the model weights randomly with `nlp.begin_training`
2. Predict a few examples with the current weights by calling `nlp.update`
3. Compare prediction with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to 2.
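
The steps above can be sketched on a toy problem in plain Python. This is only an illustration under stated assumptions, not what spaCy does internally: the "model" is a single weight, the loss is the squared error, and the data, learning rate and step count are made up for the example.

```python
import random

random.seed(0)

# Toy training data for the relationship y = 2 * x
examples = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

# 1. Initialize the weight randomly
weight = random.uniform(-1.0, 1.0)
learning_rate = 0.01

for step in range(200):
    x, y_true = random.choice(examples)
    # 2. Predict with the current weight
    y_pred = weight * x
    # 3. Compare the prediction with the true label
    error = y_pred - y_true
    # 4. Gradient of the squared error with respect to the weight
    gradient = 2 * error * x
    # 5. Update the weight slightly, then 6. repeat
    weight -= learning_rate * gradient

print(round(weight, 3))
```

The weight moves towards 2.0 because each update nudges it in the direction that reduces the error on the sampled example.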

![alt text](https://course.spacy.io/training.png)

**Training data:** Examples and their annotations.

**Text:** The input text the model should predict a label for.

**Label:** The label the model should predict.

**Gradient:** How to change the weights.

### Training
The training data tells the model what we want it to predict. This could be texts and named entities we want to recognize, or tokens and their correct part-of-speech tags.

To update an existing model, we can start with a few hundred to a few thousand examples.

To train a new category we may need up to a million.

spaCy's pre-trained English models for instance were trained on 2 million words labelled with part-of-speech tags, dependencies and named entities.

Training data is usually created by humans who assign labels to texts.

This is a lot of work, but can be semi-automated – for example, using spaCy's Matcher.

**spaCy’s components are supervised models for text annotations**

spaCy’s rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable `TEXTS`; you can print it in the IPython shell to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as `'GADGET'`.

In [0]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

# with open("exercises/iphone.json") as f:
#     TEXTS = json.loads(f.read())

TEXTS = [
  "How to preorder the iPhone X",
  "iPhone X is coming",
  "Should I pay $1,000 for the iPhone X?",
  "The iPhone 8 reviews are here",
  "Your iPhone goes up to 11 today",
  "I need a new phone! Any tips?"
]

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

# Add patterns to the matcher
matcher.add("GADGET", None, pattern1, pattern2)

In [30]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 10, 'GADGET'), (4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
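
Because both patterns can match at the same position, the output above lists overlapping spans for one mention, for example `(20, 28)` and `(20, 26)`. Entity annotations should not overlap, so one option is to keep only the longest span in each overlapping group before training. The helper below is a minimal sketch in plain Python; `filter_overlapping` is a hypothetical name, and it works on `(start, end, label)` tuples rather than on spaCy objects.

```python
def filter_overlapping(entities):
    """Keep the longest entity among overlapping (start, end, label) tuples."""
    # Longest spans first; ties are broken in favour of the earlier start
    ordered = sorted(entities, key=lambda e: (e[1] - e[0], -e[0]), reverse=True)
    kept = []
    taken = set()  # character positions already covered by a kept entity
    for start, end, label in ordered:
        if not any(pos in taken for pos in range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)


entities = [(20, 28, "GADGET"), (20, 26, "GADGET")]
print(filter_overlapping(entities))
# [(20, 28, 'GADGET')]
```

Newer spaCy releases also ship `spacy.util.filter_spans`, which does a similar job for `Span` objects.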


### The training loop

spaCy gives you full control over the training loop.

**The steps of a training loop**
1. `Loop` for a number of times.
2. `Shuffle` the training data.
3. `Divide` the data into batches.
4. `Update` the model for each batch.
5. `Save` the updated model.

To prevent the model from getting stuck in a suboptimal solution, we randomly shuffle the data for each iteration. This is a very common strategy when doing stochastic gradient descent.

Next, we divide the training data into batches of several examples, also known as minibatching. Averaging over several examples gives a more reliable estimate of the gradient than updating on a single example.

The label is what we want the model to predict. This can be a text category, or an entity span and its type.

The gradient is how we should change the model to reduce the current error. It's computed when we compare the predicted label to the true label.
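
The shuffling and batching steps can be sketched without spaCy. `toy_minibatch` is a hypothetical helper written for this illustration; it only shows the idea behind dividing shuffled data into fixed-size batches and is not the implementation of `spacy.util.minibatch`.

```python
import random


def toy_minibatch(items, size):
    """Yield consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


examples = list(range(7))   # stand-ins for (text, annotations) pairs
random.shuffle(examples)    # a new random order for each iteration
batches = list(toy_minibatch(examples, size=3))
print(batches)
```

The last batch is smaller when the number of examples is not a multiple of the batch size.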

In [0]:
import random
import os
path_to_model = '/content'
TRAINING_DATA = [
    ("How to preorder the iPhone X", {'entities': [(20, 28, 'GADGET')]})
    # And many more examples...
]

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# Save the model
nlp.to_disk(path_to_model)

In [39]:
!ls

meta.json  sample_data	tokenizer  vocab


**Updating an existing model**
1. Improve the predictions on new data
2. Especially useful to improve existing categories, like `PERSON`
3. Also possible to add new categories
4. Be careful and make sure the model doesn't "forget" the old ones

In [43]:
TRAINING_DATA = [
    ["How to preorder the iPhone X", { "entities": [[20, 28, "GADGET"]] }],
    ["iPhone X is coming", { "entities": [[0, 8, "GADGET"]] }],
    ["Should I pay $1,000 for the iPhone X?", { "entities": [[28, 36, "GADGET"]] }],
    ["The iPhone 8 reviews are here", { "entities": [[4, 12, "GADGET"]] }],
    ["Your iPhone goes up to 11 today", { "entities": [[5, 11, "GADGET"]] }],
    ["I need a new phone! Any tips?", { "entities": [] }]
]


# Start with blank English model
nlp = spacy.blank('en')  # The blank model doesn't have any pipeline components, only the language data and tokenization rules.
# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# Add a new label
ner.add_label('GADGET')

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

{'ner': 8.333333253860474}
{'ner': 20.147342085838318}
{'ner': 32.57593071460724}
{'ner': 8.915273308753967}
{'ner': 14.980735898017883}
{'ner': 19.785090535879135}
{'ner': 3.4871500469744205}
{'ner': 8.14518128708005}
{'ner': 9.895275425631553}
{'ner': 3.050518828444183}
{'ner': 4.285389910481172}
{'ner': 6.3983944808423985}
{'ner': 2.4975483752787113}
{'ner': 4.363425818562973}
{'ner': 6.49664493027376}
{'ner': 2.532540822401643}
{'ner': 3.292675310309278}
{'ner': 6.6151768606796395}
{'ner': 0.471284881750762}
{'ner': 3.220182040666259}
{'ner': 3.751443688146537}
{'ner': 0.26872746776416534}
{'ner': 1.3177758780420845}
{'ner': 1.3331767641539045}
{'ner': 0.002529067925934214}
{'ner': 0.6114598607682638}
{'ner': 0.6121194222692825}
{'ner': 2.2857404061487614}
{'ner': 2.285746621577806}
{'ner': 2.286038732835405}


### Best practices for training spaCy models

**Problem 1: Models can "forget" things**
- An existing model can overfit on new data
  - e.g. if we only update it with `WEBSITE`, it can "unlearn" what a `PERSON` is
- Also known as the "catastrophic forgetting" problem

**Solution 1: Mix in previously correct predictions**

- For example, if we're training `WEBSITE`, also include examples of `PERSON`
- Run the existing spaCy model over the data and extract all other relevant entities

**BAD:**

    TRAINING_DATA = [
      ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
    ]

**GOOD:**

    TRAINING_DATA = [
      ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
      ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
    ]
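
One way to mix in previously correct predictions is to merge the new annotations with the entities the existing model already predicts for the same text. The sketch below uses plain tuples; `merge_annotations` is a hypothetical helper, and `old_entities` is a made-up stand-in for what an existing model might return. It assumes that a new annotation should win whenever the character ranges overlap.

```python
def merge_annotations(new_entities, old_entities):
    """Combine new annotations with a model's earlier predictions.

    New annotations take priority when character ranges overlap.
    """
    merged = list(new_entities)
    for start, end, label in old_entities:
        overlaps = any(
            start < new_end and new_start < end
            for new_start, new_end, _ in new_entities
        )
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


text = "Obama posted on Reddit"
new_entities = [(16, 22, "WEBSITE")]
# Hypothetical predictions of the existing model for this text
old_entities = [(0, 5, "PERSON"), (16, 22, "ORG")]

example = (text, {"entities": merge_annotations(new_entities, old_entities)})
print(example)
# ('Obama posted on Reddit', {'entities': [(0, 5, 'PERSON'), (16, 22, 'WEBSITE')]})
```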

**Problem 2: Models can't learn everything**
- spaCy's models make predictions based on local context
- Model can struggle to learn if decision is difficult to make based on context
- Label scheme needs to be consistent and not too specific
  - For example: `CLOTHING` is better than `ADULT_CLOTHING` and `CHILDRENS_CLOTHING`


**Solution 2: Plan your label scheme carefully**
- Pick categories that are reflected in local context
- More generic is better than too specific
- Use rules to go from generic labels to specific categories

**BAD:**

    LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']

**GOOD:**

    LABELS = ['CLOTHING', 'BAND']
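
The idea of using rules to go from generic labels to specific categories can be sketched as a small post-processing step. Everything here is hypothetical and for illustration only: the `refine_label` helper, the hint words in `CHILD_HINTS`, and the specific category names. A real project would need a more careful rule set.

```python
# Hypothetical hint words that suggest children's clothing
CHILD_HINTS = {"kids", "children", "childrens", "toddler"}


def refine_label(label, span_text):
    """Map a generic CLOTHING prediction to a more specific category with a rule."""
    if label != "CLOTHING":
        return label
    # Lowercase and drop apostrophes so "Children's" matches "childrens"
    words = set(span_text.lower().replace("'", "").split())
    if words & CHILD_HINTS:
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"


print(refine_label("CLOTHING", "kids rain jacket"))
print(refine_label("CLOTHING", "wool coat"))
print(refine_label("BAND", "The Beatles"))
```

The model only has to learn the generic `CLOTHING` label from context, and the finer distinction is handled by a rule that is easy to inspect and change.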
