# Introduction, References & Credit

This notebook was created while working through the official spaCy course at https://course.spacy.io. Most of the content is copied from the course; some of the code samples and use cases are my own, written to practise the concepts.

The purpose of this notebook is to create a single "basic to advanced" implementation of everything that can be done in spaCy for NLP. It covers the basic structures of spaCy programming and a number of different use cases that can be solved with it.


More Reference Links to learn Spacy: 

https://github.com/ines/spacy-course/tree/master/slides


# Basic 

In [0]:
import spacy
spacy.__version__

'2.0.18'

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".

For example, to create an English nlp object, you can import the English language class from spacy.lang.en and instantiate it. You can use the nlp object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in spacy.lang.

## Loading Model

In [0]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc also behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. But more on that later!

## Doc Object & Tokens

In [0]:
# Created by processing a string of text with the nlp object
doc = nlp("Hi There, This is Lavi Nigam, Testing Spacy")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hi
There
,
This
is
Lavi
Nigam
,
Testing
Spacy
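The language-specific tokenization rules mentioned earlier can be seen on a contraction and a currency amount. This is a minimal sketch using only the blank English class (no model package needed); the example sentence is made up for illustration. It also checks the claim that no information is lost, by rebuilding the original text from the tokens and their trailing whitespace.

```python
from spacy.lang.en import English

nlp = English()

# Hypothetical example sentence with a contraction and a currency amount
doc = nlp("Don't pay $10 for it.")

# The English rules split the contraction and separate the "$" from the number
token_texts = [token.text for token in doc]
print(token_texts)

# Tokenization is non-destructive: each token keeps its trailing whitespace
rebuilt_text = "".join(token.text_with_ws for token in doc)
print(rebuilt_text == doc.text)
```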


In [0]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [0]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?


![alt text](https://course.spacy.io/doc.png)

Token objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the Doc.

Token objects also provide various attributes that let you access more information about the tokens. For example, the .text attribute returns the verbatim token text.

In [0]:
doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


## Span Object

![alt text](https://course.spacy.io/doc_span.png)

A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a Span, you can use Python's slice notation. For example, 1:3 will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.

In [0]:
doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

world!


In [0]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


## Token Attributes

**Lexical Attributes**

Here you can see some of the available token attributes:

"i" is the index of the token within the parent document.

"text" returns the token text.

"is_alpha", "is_punct" and "like_num" return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation or whether it resembles a number. For example, "like_num" is true for the token "10" and also for the word "ten".

These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.

In [0]:
doc = nlp("It costs $five and $ 10.")
print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])

print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4, 5, 6, 7]
Text:     ['It', 'costs', '$', 'five', 'and', '$', '10', '.']
is_alpha: [True, True, False, True, True, False, False, False]
is_punct: [False, False, False, False, False, False, False, True]
like_num: [False, False, False, True, False, False, True, False]


In [0]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i < len(doc) - 1:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4


# Statistical Models

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

What are statistical models?

- Enable spaCy to predict linguistic attributes in context
  - Part-of-speech tags
  - Syntactic dependencies
  - Named entities
- Trained on labeled example texts
- Can be updated with more examples to fine-tune predictions

## Model Package


spaCy provides a number of pre-trained model packages you can download using the "spacy download" command. For example, the "en_core_web_sm" package is a small English model that supports all core capabilities and is trained on web text.

The spacy.load method loads a model package by name and returns an nlp object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

Medium (_md) and large (_lg) versions of the model are also available.

**The three model sizes differ in prediction quality and output, so it is worth experimenting with each of them.**

In [0]:
# Standard download from the command line:
# python -m spacy download en

# Download the models from within Colab.
# The suffix selects the model size: _sm (small), _md (medium) or _lg (large).
import spacy.cli
spacy.cli.download("en_core_web_sm")
spacy.cli.download("en_core_web_md")
spacy.cli.download("en_core_web_lg")


    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')


    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_md -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')


    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_lg -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_lg

    You can now load the model via spacy.load('en_core_web_lg')



In [0]:
# Loading the Downloaded Model
import spacy

nlp = spacy.load('en_core_web_sm')
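As a rough illustration of the meta information and pipeline configuration that a model package carries, the loaded nlp object can be inspected. This is a minimal sketch: the fallback to a blank English pipeline is an assumption added here so the cell also runs where the model package is not installed, in which case the component list is simply empty.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption: fall back to a blank English pipeline if the package is missing
    nlp = spacy.blank("en")

# Language code that tells spaCy which language class to use
print("Language:", nlp.meta.get("lang"))
print("Model name:", nlp.meta.get("name"))

# Names of the pipeline components, in processing order
pipeline_components = list(nlp.pipe_names)
print("Pipeline:", pipeline_components)
```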

## POS 

**Predicting Part-of-speech Tags**

Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.

First, we load the small English model and receive an nlp object.

Next, we're processing the text "I went to watch movie and then ate mexican food with my childhood friends".

For each token in the Doc, we can print the text and the "pos_" attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an ID.

Here, the model correctly predicted "ate" as a verb and "food" as a noun.



In [0]:
import spacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

# Process a text
doc = nlp("I went to watch movie and then ate mexican food with my childhood friends")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

I PRON
went VERB
to PART
watch VERB
movie NOUN
and CCONJ
then ADV
ate VERB
mexican ADJ
food NOUN
with ADP
my ADJ
childhood NOUN
friends NOUN
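The underscore convention described above can be checked without a statistical model, because lexical attributes follow the same rule. The sketch below uses a blank English pipeline and a made-up sentence; "orth" and "lower" return integer IDs, while "orth_" and "lower_" return the strings. With a loaded model, "pos" and "pos_" behave the same way.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I ate Pizza")
token = doc[2]

# Without the underscore: an integer ID (hash)
print(token.orth, token.lower)

# With the underscore: the readable string
print(token.orth_, token.lower_)

# The ID can be turned back into the string via the shared string store
orth_string = nlp.vocab.strings[token.orth]
print(orth_string)
```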


Spacy follows "Universal POS tags". Details can be found here: http://universaldependencies.org/u/pos/all.html#

## Dependency

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The "dep_" attribute returns the predicted dependency label.

The head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

List of Dependencies can be found here for reference: https://universaldependencies.org/en/dep/

Another detailed reference for dependencies: https://nlp.stanford.edu/software/dependencies_manual.pdf

In [0]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

I PRON nsubj went
went VERB ROOT went
to PART aux watch
watch VERB advcl went
movie NOUN dobj watch
and CCONJ cc went
then ADV advmod ate
ate VERB conj went
mexican ADJ amod food
food NOUN dobj ate
with ADP prep ate
my ADJ poss friends
childhood NOUN compound friends
friends NOUN pobj with


To view the dependencies and other Doc properties in a DataFrame, you can use this code:

In [0]:
import pandas as pd

# Put the token attributes in a DataFrame for easier viewing
header = ['text', 'pos', 'dependency', 'head_text']
df = pd.DataFrame(
    [[token.text, token.pos_, token.dep_, token.head.text] for token in doc],
    columns=header,
)
df

Unnamed: 0,text,pos,dependency,head_text
0,I,PRON,nsubj,went
1,went,VERB,ROOT,went
2,to,PART,aux,watch
3,watch,VERB,advcl,went
4,movie,NOUN,dobj,watch
5,and,CCONJ,cc,went
6,then,ADV,advmod,ate
7,ate,VERB,conj,went
8,mexican,ADJ,amod,food
9,food,NOUN,dobj,ate


Dependency label scheme

![alt text](https://course.spacy.io/dep_example.png)

To describe syntactic dependencies, spaCy uses a standardized label scheme. Here's an example of some common labels:

The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".

The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".

The determiner "the", also known as an article, is attached to the noun "pizza".

To see the visualized dependency graph, you can use the displacy.render function to display it in the notebook. If you want to serve it as a service, you can use displacy.serve, which starts a web server that serves the visualization.

In [0]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
displacy.render(doc, style="dep",jupyter=True)

**Predicting Named Entities**

Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The doc.ents property lets you access the named entities predicted by the model.

It returns a sequence of Span objects, so we can print the entity text and the entity label using the "label_" attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.
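The example this paragraph refers to is not included in the notebook. A minimal sketch of it follows, assuming the course sentence about Apple, the U.K. and $1 billion. The fallback to a blank pipeline is an assumption added so the cell runs without the model package; in that case no entities are predicted and the list is empty. The actual labels depend on the model and version, so no output is shown.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption: a blank pipeline has no entity recognizer, so doc.ents stays empty
    nlp = spacy.blank("en")

# Assumed example sentence (not present in this notebook)
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Each entity is a Span; label_ is the string form of its label
entities = [(ent.text, ent.label_) for ent in doc.ents]
for text, label in entities:
    print(text, label)
```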

## Noun chunks

Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

In [0]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)
    
import pandas as pd

# Put the noun chunk attributes in a DataFrame for easier viewing
header = ['text', 'root', 'dependency', 'head_text']
df = pd.DataFrame(
    [[chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text] for chunk in doc.noun_chunks],
    columns=header,
)
df

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward


Unnamed: 0,text,root,dependency,head_text
0,Autonomous cars,cars,nsubj,shift
1,insurance liability,liability,dobj,shift
2,manufacturers,manufacturers,pobj,toward


## NER Examples

In [0]:
# Use the larger model to predict entities. Sometimes _md works better. After changing the model, re-run all the cells below.
nlp = spacy.load("en_core_web_lg")

In [0]:
# Process a text
doc = nlp(u"Flipkart is on the verge of becoming a billion dollar unicorn in india")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
displacy.render(doc, style="ent",jupyter=True)

Flipkart ORG
billion dollar MONEY
india GPE


Here's a list of all pre-defined entities that can be predicted using the different models:

https://spacy.io/api/annotation#named-entities

Some of them are:

PERSON - People, including fictional.

NORP - Nationalities or religious or political groups.

FAC - Buildings, airports, highways, bridges, etc.

ORG - Companies, agencies, institutions, etc.

GPE - Countries, cities, states.

LOC - Non-GPE locations, mountain ranges, bodies of water.

PRODUCT - Objects, vehicles, foods, etc. (Not services.)

EVENT - Named hurricanes, battles, wars, sports events, etc.

WORK_OF_ART - Titles of books, songs, etc.

LAW - Named documents made into laws.

LANGUAGE - Any named language.

DATE - Absolute or relative dates or periods.

TIME - Times smaller than a day.

PERCENT - Percentage, including "%".

MONEY - Monetary values, including unit.

QUANTITY - Measurements, as of weight or distance.

ORDINAL - “first”, “second”, etc.

CARDINAL - Numerals that do not fall under another type.


In [0]:
# Some more Examples 
# Process a text
doc = nlp(u"Hurricane Katrina is predicted to hit the east coast at around 5 in evening")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
displacy.render(doc, style="ent",jupyter=True)

Hurricane Katrina EVENT
around 5 in evening TIME


In [0]:
doc = nlp(u"Narendra Modi, based on popular opinion over twitter, might come back as India's Prime Minister")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
displacy.render(doc, style="ent",jupyter=True)

Narendra Modi PERSON
India GPE


In [0]:
doc = nlp(u"We should ask our kids to read harry potter in spanish langauge which should be atleast 10 percent of their reading quota")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
displacy.render(doc, style="ent",jupyter=True)
# This seems to miss Harry Potter 

spanish NORP
10 percent PERCENT


In [0]:
doc = nlp(u"The flight should deport from IGI Airport in Delhi near LIC Building. The allowed quantity for checking is 25 kilograms")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
displacy.render(doc, style="ent",jupyter=True)


IGI Airport FAC
Delhi GPE
LIC Building FAC
25 kilograms QUANTITY


In [0]:
doc = nlp(u"Sonu Nigam's famous song kal ho na ho is trending charts")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
displacy.render(doc, style="ent",jupyter=True)
# Well, this was really interesting: in Hindi, "Kal Ho Na Ho" has a time connotation, meaning "not sure if we have tomorrow".
# However, the model missed it as a song name, which is fine since it's a Hindi song name.

Sonu Nigam's PERSON
kal ho na ho TIME


In [0]:
doc = nlp(u"Justine Beiber song heartless is trending")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
displacy.render(doc, style="ent",jupyter=True)

Justine Beiber PERSON


In [0]:
doc = nlp(u"Triple Talak Law has been passed by supreme court of India on 25th April 2017")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
    
# The model labels "Triple Talak Law" as ORG; LAW would be the expected label.

displacy.render(doc, style="ent",jupyter=True)

Triple Talak Law ORG
India GPE
25th April 2017 DATE


In [0]:
# Save the dependency visualization to an external SVG file.
# Change the style to "ent" to render entities instead (that output is HTML rather than SVG).

import spacy
from spacy import displacy
from pathlib import Path

nlp = spacy.load("en_core_web_sm")
sentences = [u"This is an example.", u"This is another one."]
for sent in sentences:
    doc = nlp(sent)
    # jupyter=False makes render() return the markup instead of displaying it
    svg = displacy.render(doc, style="dep", jupyter=False)
    file_name = '-'.join([w.text for w in doc if not w.is_punct]) + ".svg"
    output_path = Path(file_name)
    # write_text opens the file, writes the markup and closes the file again
    output_path.write_text(svg, encoding="utf-8")

**Spacy Explain Functionality**

A quick tip: To get definitions for the most common tags and labels, you can use the spacy.explain helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

In [0]:
spacy.explain('GPE')

'Countries, cities, states'

In [0]:
spacy.explain('dobj')

'direct object'
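To look up several labels at once, the same helper can be called in a loop. This is a small sketch; the list of labels is chosen for illustration, and the last entry is deliberately not a real label to show that spacy.explain returns None when it has no description (newer spaCy versions may also emit a warning in that case).

```python
import spacy

# An entity label, a dependency label, a POS tag and a made-up label
labels = ["GPE", "dobj", "PROPN", "NOT_A_REAL_LABEL"]

# Map each label to its description (None if spaCy does not know the label)
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "->", description)
```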

**Sample Exercise**

In [0]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          VERB      ccomp     
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

Process the text with the nlp object.

Iterate over the entities and print the entity text and label.

Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

Rule Matcher will help us do the same thing automatically. 

In [0]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
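The manually created span can also be given a label and added to the document's entities. The sketch below is one way to do that, under stated assumptions: it uses a blank English pipeline so it runs without a model package, it passes the label as a string (which assumes spaCy 2.1 or later; spaCy 2.0 expects the label's hash ID), and "PRODUCT" is a label chosen here for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Create a labeled span for tokens 1 and 2 ("iPhone X"); the label is an assumption
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Keep any existing entities and add the manual one
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```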


## Spacy Matcher 

spaCy's matcher lets you write rules to find words and phrases in text.

Why not just regular expressions?

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It's also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use the model's predictions.

For example, find the word "duck" only if it's a verb, not a noun.


Match on Doc objects, not just strings

Match on tokens and token attributes

Use the model's predictions

Example: "duck" (verb) vs. "duck" (noun)

- Lists of dictionaries, one per token


- Match exact token texts

  [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]


- Match lexical attributes

  [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

- Match any token attributes

  [{'LEMMA': 'buy'}, {'POS': 'NOUN'}]

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

In this example, we're looking for two tokens with the text "iPhone" and "X".

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".

To use a pattern, we first import the matcher from spacy.matcher.

We also load a model and create the nlp object.

The matcher is initialized with the shared vocabulary, nlp.vocab. You'll learn more about this later – for now, just remember to always pass it in.

The matcher.add method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to None. The third argument is the pattern.

To match the pattern on a text, we can call the matcher on any doc.

This will return the matches.

In [0]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)

# Process some text
doc = nlp("New iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)


When you call the matcher on a doc, it returns a list of tuples.

Each tuple consists of three values: the match ID, the start index and the end index of the matched span.

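This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.

Note that the output recorded below contains one match per token instead of the single span "iPhone X". The most likely cause is that the spaCy 2.0.18 matcher used here does not recognize the 'TEXT' key; replacing it with 'ORTH' in the pattern should yield a single match with start 1 and end 3.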

In [0]:
print(matches)

[(44191368, 0, 1), (44191368, 1, 2), (44191368, 2, 3), (44191368, 3, 4), (44191368, 4, 5), (44191368, 5, 6)]


In [0]:
# Call the matcher on the doc
doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)


# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

New
iPhone
X
release
date
leaked


Here's an example of a more complex pattern using lexical attributes.

We're looking for five tokens:

A token consisting of only digits.

Three case-insensitive tokens for "fifa", "world" and "cup".

And a token that consists of punctuation.

The pattern matches the tokens "2018 FIFA World Cup:".

In [0]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
matcher.add('Fifa_PATTERN', None, pattern)

# Process some text
doc = nlp("2018 FIFA World Cup: France won!")

# Call the matcher on the doc
matches = matcher(doc)

for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:


In this example, we're looking for two tokens:

A verb with the lemma "love", followed by a noun.

This pattern will match "loved dogs" and "love cats".

In [0]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_md')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
matcher.add('Love_PATTERN', None, pattern)

# Process some text
doc = nlp("I loved dogs but now I love cats more.")

# Call the matcher on the doc
matches = matcher(doc)

for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
    
    
# Interesting thing to notice: the _sm model doesn't pick up "love cats", but _md does. This is why
# experimenting with different models is crucial in spaCy. However, _md is a bit slower.

loved dogs
love cats


Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.

"OP" can have one of four values:

An "!" negates the token, so it's matched 0 times.

A "?" makes the token optional, and matches it 0 or 1 times.

A "+" matches a token 1 or more times.

And finally, an "*" matches 0 or more times.

Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.

In [0]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
matcher.add('Optional_PATTERN', None, pattern)

# Process some text
doc = nlp("I bought a smartphone. Now I'm buying apps.")

# Call the matcher on the doc
matches = matcher(doc)

for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

bought a smartphone
buying apps
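The "+" operator can be tried in the same way. The sketch below uses only lexical attributes, so a blank English pipeline is enough. It assumes the spaCy 3.x form of matcher.add, which takes a list of patterns as the second argument; on the spaCy 2.0.18 used in this notebook the call would be matcher.add('VERY_GOOD_PATTERN', None, pattern). The sentence and pattern name are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# One or more "very" tokens, followed by "good"
pattern = [
    {"LOWER": "very", "OP": "+"},
    {"LOWER": "good"},
]
# Assumption: spaCy 3.x signature (list of patterns as second argument)
matcher.add("VERY_GOOD_PATTERN", [pattern])

doc = nlp("This movie is very very good")

# Collect the text of every matched span
matched_texts = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched_texts)
```

Because the matcher reports every way the pattern can apply, a quantified token can produce several overlapping matches for the same stretch of text.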


**Examples**

Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [0]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag 'PROPN' (proper noun).

In [0]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


Write one pattern that matches adjectives ('ADJ') followed by one or two 'NOUN's (one noun and one optional noun).

In [0]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 4
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice responses


# Large-scale data analysis with spaCy

## Data Structure

### Shared vocab and string store 

spaCy stores all shared data in a vocabulary, the Vocab.

This includes words, but also the label schemes for tags and entities.

To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time.

Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as nlp.vocab.strings.

It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.

In [0]:
coffee_hash = nlp.vocab.strings['coffee']
coffee_string = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_string)

3197928453018144401 coffee


In [0]:
doc = nlp("I love coffee")
print('Spacy DS:', doc.vocab.strings)
print("Hash:", doc.vocab.strings["I love coffee"])
print("Hash:", doc.vocab.strings["coffee"])

Spacy DS: <spacy.strings.StringStore object at 0x7f9ddf367e18>
Hash: 16584983610728657704
Hash: 3197928453018144401


In [0]:
print("Hash:", doc.vocab.strings["Medical"])

Hash: 10117188822904858183


### Lexemes: entries in the vocabulary

Lexemes are context-independent entries in the vocabulary.

You can get a lexeme by looking up a string or a hash ID in the vocab.

Lexemes expose attributes, just like tokens.

They hold context-independent information about a word, like the text, or whether the word consists of alphanumeric characters.

Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.

- Contains the context-independent information about a word

   - Word text: lexeme.text and lexeme.orth (the hash)

   - Lexical attributes like lexeme.is_alpha

   - Does not include context-dependent part-of-speech tags, dependencies or entity labels

In [0]:
doc = nlp("I love coffee")
lexeme = nlp.vocab['coffee']

# Print the lexical attributes
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True


### Vocab, hashes and lexemes

![alt text](https://course.spacy.io/vocab_stringstore.png)

Here's an example.

The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.

Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.

### Doc & Span Object 
![alt text](https://course.spacy.io/span_indices.png)

**The Doc** is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.

After creating the nlp object, we can import the Doc class from spacy dot tokens.

Here we're creating a Doc from three words. The spaces are a list of boolean values indicating whether the word is followed by a space. Every token includes that information – even the last one!

The Doc class takes three arguments: the shared vocab, the words and the spaces.
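
How the words and spaces lists combine into the final text can be sketched without spaCy. The helper `join_tokens` below is a hypothetical stand-in written for illustration; it only mimics the rule that each word is followed by a space when its flag is true.

```python
def join_tokens(words, spaces):
    """Rebuild a text from words and per-word trailing-space flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    # Append a single space after a word only if its flag is True
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


print(join_tokens(["Hello", "world", "!"], [True, False, False]))
# Hello world!
print(join_tokens(["Go", ",", "get", "started", "!"], [False, True, True, False, False]))
# Go, get started!
```

Because every token carries its own flag, the original text can be rebuilt exactly, including a last token that is not followed by a space.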



In [0]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc)

Hello world!


In [0]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


**A Span** is a slice of a Doc consisting of one or more tokens. The Span takes at least three arguments: the doc it refers to, and the start and end index of the span. Remember that the end index is exclusive!

To create a Span manually, we can also import the class from spacy dot tokens. We can then instantiate it with the doc and the span's start and end index.

To add an entity label to the span, we first need to look up the string in the string store. We can then provide it to the span as the label argument.

doc dot ents is writable, so we can add entities manually by overwriting it with a list of spans.
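
The exclusive end index works like ordinary Python slicing. A minimal sketch on a plain list of token texts, which stands in for a Doc here purely for illustration:

```python
# Token texts with their indices: 0 -> "Hello", 1 -> "world", 2 -> "!"
tokens = ["Hello", "world", "!"]

start, end = 0, 2
# The slice covers indices 0 and 1; the token at index 2 is excluded
span_tokens = tokens[start:end]

print(span_tokens)   # ['Hello', 'world']
print(end - start)   # 2, the number of tokens in the span
```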

In [0]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

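# Get the label's hash; add() also registers the string in the string store,
# so the hash can be resolved back to 'GREETINGS' later
label_hash = nlp.vocab.strings.add('GREETINGS')

# Create a span with a label
span_with_label = Span(doc, 0, 2, label=label_hash)

# Add span to the doc.ents
doc.ents = [span_with_label]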

### **Best Practices**

A few tips and tricks before we get started:

The Doc and Span are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.

If your application needs to output strings, make sure to convert the doc as late as possible. If you do it too early, you'll lose all relationships between the tokens.

To keep things consistent, try to use built-in token attributes wherever possible. For example, token dot i for the token index.

Also, don't forget to always pass in the shared vocab!

In [0]:
# This is not an efficient way to do it: the spaCy objects are converted to lists and strings too early in the program

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == "PROPN":
        # Check that a next token exists and that it is a verb
        if index + 1 < len(pos_tags) and pos_tags[index + 1] == "VERB":
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

In [0]:
# Following the best practices, do this instead:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check that a next token exists and that it is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

## Word vectors and semantic similarity

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.

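The Doc, Token and Span objects have a dot similarity method that takes another object and returns a floating point number indicating how similar they are. The score usually falls between 0 and 1; because the default measure is cosine similarity, values below 0 are possible for very dissimilar vectors.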

One thing that's very important: In order to use similarity, you need a larger spaCy model that has word vectors included.

For example, the medium or large English model – but not the small one. So if you want to use vectors, always go with a model that ends in "md" or "lg". You can find more details on this in the models documentation.

- spaCy can compare two objects and predict similarity

- Doc.similarity(), Span.similarity() and Token.similarity()

- Take another object and return a similarity score (0 to 1)

- Important: needs a model that has word vectors included, for example:

   - ✅ en_core_web_md (medium model)

   - ✅ en_core_web_lg (large model)

   - 🚫 NOT en_core_web_sm (small model)



In [50]:
# Load a larger model with vectors
nlp = spacy.load('en_core_web_md')

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.8627204117787385
0.7369546


You can also use the similarity methods to compare different types of objects.

For example, a document and a token.

Here, the similarity score is pretty low and the two objects are considered fairly dissimilar.

Here's another example comparing a span – "pizza and pasta" – to a document about McDonalds.

The score returned here is about 0.62, so it's determined to be kind of similar.

In [51]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.32531983166759537
0.619909235817623


How does spaCy predict similarity?

- Similarity is determined using word vectors

- Multi-dimensional meaning representations of words

- Generated using an algorithm like Word2Vec and lots of text

- Can be added to spaCy's statistical models

- Default: cosine similarity, but can be adjusted

- Doc and Span vectors default to average of token vectors

- Short phrases are better than long documents with many irrelevant words
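
The points above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not spaCy's code: the three-dimensional vectors are made up (real models use hundreds of dimensions), and `cosine_similarity` and `average_vector` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Avoid division by zero for all-zero vectors
        return 0.0
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Made-up word vectors, for illustration only
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "soap": [0.0, 0.1, 0.9],
}

# Token-level similarity
print(cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"]))
print(cosine_similarity(toy_vectors["pizza"], toy_vectors["soap"]))

# A phrase vector as the average of its token vectors
short_phrase = average_vector([toy_vectors["pizza"], toy_vectors["pasta"]])
# The same phrase padded with unrelated words
noisy_phrase = average_vector(
    [toy_vectors["pizza"], toy_vectors["pasta"], toy_vectors["soap"], toy_vectors["soap"]]
)

clean_score = cosine_similarity(short_phrase, toy_vectors["pizza"])
noisy_score = cosine_similarity(noisy_phrase, toy_vectors["pizza"])
print(clean_score > noisy_score)   # True
```

In this toy setup the unrelated words pull the averaged vector away from "pizza", which is one way to see why short, focused phrases tend to compare better than long texts with many irrelevant words.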

To give you an idea of what those vectors look like, here's an example.

First, we load the medium model again, which ships with word vectors.

Next, we can process a text and look up a token's vector using the dot vector attribute.

The result is a 300-dimensional vector of the word "banana".

In [52]:
# Load a larger model with vectors
nlp = spacy.load('en_core_web_md')

doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(doc[3].vector)

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

Predicting similarity can be useful for many types of applications. For example, to recommend similar texts to a user based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform.

However, it's important to keep in mind that there's no objective definition of what's similar and what isn't. It always depends on the context and what your application needs to do.

Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

In [53]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9501447503553421
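
One way such a high score can arise is the averaging of token vectors: two texts that share most of their tokens end up with similar averages even when the remaining tokens differ strongly. The sketch below uses made-up three-dimensional vectors in which "like" and "hate" point in opposite directions; these values and the helper names are assumptions for illustration, not vectors from a spaCy model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Made-up vectors: "like" and "hate" differ only in the sign of the second axis
toy_vectors = {
    "I": [1.0, 0.2, 0.1],
    "like": [0.1, 0.5, 0.1],
    "hate": [0.1, -0.5, 0.1],
    "cats": [0.3, 0.1, 1.2],
}

doc_like = average_vector([toy_vectors[w] for w in ["I", "like", "cats"]])
doc_hate = average_vector([toy_vectors[w] for w in ["I", "hate", "cats"]])

token_similarity = cosine_similarity(toy_vectors["like"], toy_vectors["hate"])
doc_similarity = cosine_similarity(doc_like, doc_hate)

print("like vs. hate:", token_similarity)         # negative in this toy setup
print("averaged docs:", doc_similarity)           # high despite the opposite words
```

In this toy example the shared tokens "I" and "cats" dominate both averages, so the document-level score stays high even though the two differing words are dissimilar.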


In [54]:
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


In [56]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books"
similarity = token1.similarity(token2)
print(similarity)

0.22325331


In [57]:
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.7517393180536583
