# spaCy for Entity Extraction tasks. From token to statistical model

To install spaCy, run the following in a terminal inside the desired environment:
    
    pip install spacy
    
To install a specific language model:

    python -m spacy download en_core_web_lg 
    
where en_core_web_lg is the model name. A few sizes are available: _sm (small), _md (medium), _lg (large).

Note! Small models don't contain word vectors.


## Import libraries 

In [17]:
import spacy
import random

from spacy.lang.en import English
#from spacy.lang.el import Greek

from spacy import displacy
from spacy.matcher import Matcher

In [3]:
spacy.__version__

'2.2.4'

## Basic document processing  

First, process a document using spaCy's default pipeline.

In [87]:
# Create a blank English pipeline (tokenizer only). This works the same way for any supported language.
nlp = English()

# Alternatively, load the small English model; this replaces the blank pipeline created above
nlp = spacy.load('en_core_web_sm')

# Run the pipeline on a text. The result is a spacy.tokens.doc.Doc object.
doc = nlp('I use spaCy version 2.2.4!')

#iterate through tokens in the document
for token in doc:
    print(token.i, token.text)

0 I
1 use
2 spaCy
3 version
4 2.2.4
5 !


In [88]:
#extract specific token by index
token = doc[2]
token.text

'spaCy'

In [89]:
# Get a span (a phrase) by slicing the document
span = doc[3:5]
span.text

'version 2.2.4'

## Explore Token object: getting part of speech, dependencies 

Let's look at how to get a token's text, part of speech, and dependency label:

In [93]:
print('Sentence: {} \nSelected token: {} \nPart of speech: {} \nDependency label: {}'\
      .format(doc, doc[1].text, doc[1].pos_, doc[1].dep_))

Sentence: I use spaCy version 2.2.4! 
Selected token: use 
Part of speech: VERB 
Dependency label: ROOT


In [95]:
print('Document representation: {}\nToken indexes: {}\nToken is alpha: {}\nToken is punctuation: {}\n\
Token is a number: {}'.format([token.text for token in doc],
                              [token.i for token in doc], 
                              [token.is_alpha for token in doc],
                              [token.is_punct for token in doc],
                              [token.like_num for token in doc]))

Document representation: ['I', 'use', 'spaCy', 'version', '2.2.4', '!']
Token indexes: [0, 1, 2, 3, 4, 5]
Token is alpha: [True, True, True, True, False, False]
Token is punctuation: [False, False, False, False, False, True]
Token is a number: [False, False, False, False, True, False]


### Dependency visualisation 

In [84]:
doc = nlp("This is a sentence to demonstrate dependency visualisation in spaCy.")
displacy.render(doc, style="dep")

### Training example 1 

In [85]:
print('{:<15}{:<10}{:<10}\n'.format("Token", "POS", 'DEP'))

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    print('{:<15}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

Token          POS       DEP       

This           DET       nsubj     
is             AUX       ROOT      
a              DET       det       
sentence       NOUN      attr      
to             PART      aux       
demonstrate    VERB      relcl     
dependency     NOUN      compound  
visualisation  NOUN      dobj      
in             ADP       prep      
spaCy          NOUN      pobj      
.              PUNCT     punct     


### Training example 2

Assume we need to get all percentages from a text.

In [86]:
doc = nlp("There is a chance of rain 38% before 1 p.m. Cloudy, with a high near 42%. "
          "West northwest wind 7 to 13 mph, with gusts as high as 23 mph. Chance of precipitation is 40%.")
tokens = list()
for token in doc:
    # Only look ahead when a next token exists, to avoid an IndexError on the last token
    if token.like_num and token.i + 1 < len(doc):
        if doc[token.i + 1].text == '%':
            tokens.append(token.text)
print('Percentage found: {}.'.format(', '.join(tokens)))

Percentage found: 38, 42, 40.
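The "look at the next token" idea can also be written so that the last token is never used as the current one. The following is a plain-Python sketch of that lookahead: it works on a hand-written list of token strings rather than a spaCy Doc, and `looks_like_number` is a hypothetical, much-simplified stand-in for spaCy's `like_num` attribute.

```python
# Plain-Python sketch of a safe "current token + next token" lookahead.
# Assumption: tokens are plain strings here, not spaCy Token objects.

def looks_like_number(text):
    """Simplified stand-in for like_num: digits with at most one decimal point."""
    return text.replace(".", "", 1).isdigit()


tokens = ["rain", "38", "%", "before", "1", "p.m.", "near", "42", "%", ".", "gusts", "23"]

# zip pairs each token with its successor, so the last token never needs a lookahead
percentages = [
    current
    for current, following in zip(tokens, tokens[1:])
    if looks_like_number(current) and following == "%"
]
print(percentages)
```

Because `zip` stops at the shorter sequence, a number at the very end of the text (here "23") cannot trigger an out-of-range lookup.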


# Statistical models 

## Get Named Entity

In [55]:
doc = nlp(u"US company Apple Inc. (AAPL) became the world's first company to record a market capitalization of $1 trillion, and subsequently passed the $1.3 trillion threshold in Dec. 2019")

# Access the predicted named entities through doc.ents
for ent in doc.ents:
    print(ent.text, ent.label_)

US GPE
Apple Inc. ORG
AAPL ORG
first ORDINAL
$1 trillion MONEY
$1.3 trillion MONEY
Dec. 2019 DATE


## Link entities with tokens

In [66]:
text = "BBC News is a British free-to-air television news channel. It was launched as BBC News 24 on 9 November 1997 at 5:30 pm as part of the BBC's foray into digital domestic television channels, becoming the first competitor to Sky News, which had been running since 1989."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="ent")

### Some help hints

To find out what a spaCy abbreviation means, use spacy.explain:

In [69]:
spacy.explain('PRON'), spacy.explain('ORDINAL'), spacy.explain('CARDINAL')

('pronoun',
 '"first", "second", etc.',
 'Numerals that do not fall under another type')

## Rule-based matching (more powerful than RegEx)

To capture a pattern in a text, follow these steps:
    1. Create a Matcher object
    2. Define a pattern
    3. Add the pattern to the Matcher
    4. Apply the Matcher to a document
    5. Get the results

1. Initialise Matcher object

In [70]:
matcher = Matcher(nlp.vocab)

2. Create your own pattern

For example, suppose you want to match 'MacBook Pro' in a text.

In [73]:
# This pattern matches the token 'MacBook' immediately followed by the token 'Pro'
pattern = [{'TEXT': 'MacBook'}, {'TEXT': 'Pro'}]

# Add the pattern to the matcher
matcher.add('MacBook_PATTERN', None, pattern)

# Use the matcher on the document
doc = nlp('Designed for those who defy limits and change the world, the new MacBook Pro is by far the most powerful notebook we’ve ever made. With an immersive 16-inch Retina display, super-fast processors, next-generation graphics, the largest battery capacity ever in a MacBook Pro, a new Magic Keyboard and massive storage, it’s the ultimate pro notebook for the ultimate user.')
matches = matcher(doc)

print('Found matches:', [doc[start:end].text for match_id, start, end in matches])

Found matches: ['MacBook Pro', 'MacBook Pro']
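Conceptually, a token pattern is a list of per-token constraints that is slid across the document. The sketch below illustrates that idea in plain Python. It is not spaCy's implementation: the `match_pattern` helper and the dict-based tokens are hypothetical, and the real Matcher supports many more attributes and operators.

```python
# Simplified illustration of token-pattern matching; NOT spaCy's implementation.
# Assumption: each token is a plain dict of attributes.

def match_pattern(tokens, pattern):
    """Return (start, end) pairs where each pattern entry agrees with the
    corresponding token on every attribute the entry lists."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(
            all(token.get(attr) == value for attr, value in spec.items())
            for token, spec in zip(window, pattern)
        ):
            matches.append((start, start + width))
    return matches


tokens = [
    {"TEXT": "the", "POS": "DET"},
    {"TEXT": "new", "POS": "ADJ"},
    {"TEXT": "MacBook", "POS": "PROPN"},
    {"TEXT": "Pro", "POS": "PROPN"},
]
pattern = [{"TEXT": "MacBook"}, {"TEXT": "Pro"}]

found = match_pattern(tokens, pattern)
matched_texts = [" ".join(t["TEXT"] for t in tokens[s:e]) for s, e in found]
print(found, matched_texts)
```

As with the real Matcher, each match is reported as token indices, and the matched text is recovered by slicing.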


Another example: the next pattern matches a number followed by a noun, so we can find the number of bedrooms and bathrooms in an apartment description.

In [77]:
nlp('2 bedrooms')[0].pos_

'NUM'

In [99]:
doc = nlp("The apartment features 2 bedrooms, a fully equipped kitchen with a dining area and dishwasher, 2 bathrooms, and a living room with a flat-screen TV.")

print([[t.text, t.pos_] for t in doc])
# Write a pattern for a digit followed by a noun ("2 bedrooms", "2 bathrooms")
pattern = [{'IS_DIGIT': True}, {'POS': 'NOUN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('NUMBER_NOUN_PATTERN', None, pattern)
matches = matcher(doc)

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

[['The', 'DET'], ['apartment', 'NOUN'], ['features', 'VERB'], ['2', 'NUM'], ['bedrooms', 'NOUN'], [',', 'PUNCT'], ['a', 'DET'], ['fully', 'ADV'], ['equipped', 'VERB'], ['kitchen', 'NOUN'], ['with', 'ADP'], ['a', 'DET'], ['dining', 'NOUN'], ['area', 'NOUN'], ['and', 'CCONJ'], ['dishwasher', 'NOUN'], [',', 'PUNCT'], ['2', 'NUM'], ['bathrooms', 'NOUN'], [',', 'PUNCT'], ['and', 'CCONJ'], ['a', 'DET'], ['living', 'NOUN'], ['room', 'NOUN'], ['with', 'ADP'], ['a', 'DET'], ['flat', 'ADJ'], ['-', 'PUNCT'], ['screen', 'NOUN'], ['TV', 'NOUN'], ['.', 'PUNCT']]
Match found: 2
Match found: 2 bedrooms
Match found: 2
Match found: 2 bathrooms


In [24]:
doc = nlp("I downloaded Fortnite on my laptop and can't open the game at all. \
           Help? so when I was downloading Minecraft, I got the Windows version \
           where it is the '.zip' folder and I used the default program to unpack \
           it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

Total matches found: 3


In [25]:
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


# spaCy Data Structures

In [30]:
doc = nlp('I like coffee')
coffee_hash = nlp.vocab.strings['coffee']

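# A hash can only be resolved back to a string if that string is in the string store;
# looking up a hash whose string was never added raises an error.
# Here 'coffee' was added when the doc was processed, so the reverse lookup works.
coffee_string = nlp.vocab.strings[coffee_hash]

coffee_hash, coffee_string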

(3197928453018144401, 'coffee')

In [31]:
lexeme = nlp.vocab['coffee']

print(lexeme.text, lexeme.orth, lexeme.is_alpha)

coffee 3197928453018144401 True
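The two-way lookup between strings and hashes can be pictured as follows. This is a toy sketch under stated assumptions: `TinyStringStore` is a hypothetical class, and the hash function is an arbitrary choice that differs from the one spaCy uses, so the numbers will not match spaCy's.

```python
import hashlib


class TinyStringStore:
    """Toy two-way table: string -> hash (always), hash -> string (only if stored)."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(text):
        # Assumption: any deterministic 64-bit hash is good enough for this demo
        digest = hashlib.sha1(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        key = self._hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return self._hash(key)    # a hash can always be computed from a string
        return self._by_hash[key]     # KeyError if the string was never stored


store = TinyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])

tea_hash = store["tea"]  # computable, but "tea" was never added
try:
    store[tea_hash]
except KeyError:
    print("unknown hash: no string stored for it")
```

The point of the design is that hashing is one-way: the table can turn a hash back into text only for strings it has already seen.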


# The Span object

In [59]:
from spacy.tokens import Doc, Span

words = ['Hello', 'world', '!']
spaces = [True, False, False]

#create a document manually
doc = Doc(nlp.vocab, words = words, spaces = spaces)

#create a span manually
span = Span(doc, 0, 2)

span_with_label = Span(doc, 0, 2, label = 'GREETING')

# Assign the span to doc.ents; this overwrites any entities already set on the doc
doc.ents = [span_with_label]

In [60]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ['Go', ',', 'get', 'started', '!']
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!
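The `spaces` list says, for each word, whether it is followed by a space. A minimal plain-Python sketch of how the text is rebuilt from the two lists is shown below; `join_words` is a hypothetical helper written for illustration, not part of spaCy.

```python
def join_words(words, spaces):
    """Rebuild a text from words and per-word 'followed by a space' flags."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


words = ['Go', ',', 'get', 'started', '!']
spaces = [False, True, True, False, False]

text = join_words(words, spaces)
print(text)  # Go, get started!
```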


In [36]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

David Bowie PERSON


In [37]:
# Get the part-of-speech tag of every token
pos_tags = [token.pos_ for token in doc]

for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun and is not the last token
    if pos == 'PROPN' and index + 1 < len(pos_tags):
        # Check if the next token is a verb
        if pos_tags[index + 1] == 'VERB':
            print('Found a verb after a proper noun!')

In [38]:
# Same check using native token attributes instead of a separate list of tags

for token in doc:
    # Check if the current token is a proper noun and is not the last token
    if token.pos_ == 'PROPN' and token.i + 1 < len(doc):
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == 'VERB':
            print('Found a verb after a proper noun!')

## Similarity

In [62]:
nlp = spacy.load('en_core_web_lg')

In [63]:
doc1 = nlp('I don\'t like fast food')
doc2 = nlp('I like pizza')
print(doc1.similarity(doc2))

0.8510907303318979


In [42]:
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.7369546


In [43]:
# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

In [44]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


In [45]:
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books" 
similarity = token1.similarity(token2)
print(similarity)

0.22325331


In [46]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.75173926
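By default, spaCy's similarity is the cosine similarity between vectors, and the vector of a Doc or Span is the average of its token vectors. The sketch below shows that calculation in plain Python. The three-dimensional vectors are made up for illustration; real model vectors have hundreds of dimensions, so the numbers here are not comparable to the outputs above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average(vectors):
    """Element-wise mean, used here as the vector of a multi-token span."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up toy vectors (assumption: not taken from any real model)
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.2, 0.9],
}

food_similarity = cosine(toy_vectors["pizza"], toy_vectors["pasta"])
other_similarity = cosine(toy_vectors["pizza"], toy_vectors["laptop"])
span_vector = average([toy_vectors["pizza"], toy_vectors["pasta"]])

print(food_similarity > other_similarity)
print(cosine(span_vector, toy_vectors["pizza"]))
```

Averaging explains why sentences with different meanings can still score high: shared or related words dominate the mean vector.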


# Combinations

In [47]:
matcher = Matcher(nlp.vocab)
matcher.add('DOG', None, [{'LOWER': 'golden'}, {'LOWER': 'retriever'}])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    print('Root token:', span.root.text)
    print('Root head token:', span.root.head.text)
    print('Previous token:', doc[start-1].text, doc[start-1].pos_)

Matched span:  Golden Retriever
Root token: Retriever
Root head token: have
Previous token: a DET


In [48]:
# Create the match patterns
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad'}, {'TEXT': '-'}, {'LOWER': 'free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', None, pattern1)
matcher.add('PATTERN2', None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

In [None]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

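doc = nlp('Czech Republic may help Slovakia protect its airspace')

# COUNTRIES is not defined in this notebook; a short hypothetical sample list is assumed here
COUNTRIES = ['Czech Republic', 'Slovakia']

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))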
matcher.add('COUNTRY', None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

In [50]:
print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x121cf9100>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x12bbd8d60>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x121d022e0>)]


In [51]:
nlp.pipe_names


['tagger', 'parser', 'ner']

In [52]:
# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc

# Load the small English model
nlp = spacy.load('en_core_web_sm')
  
# Add the component first in the pipeline and print the pipe names
nlp.add_pipe(length_component, first = True)
print(nlp.pipe_names)

['length_component', 'tagger', 'parser', 'ner']


In [53]:
# Define the custom component
def length_component(doc):
    # Get the doc's length
    doc_length = len(doc)
    print("This document is {} tokens long.".format(doc_length))
    # Return the doc
    return doc
  
# Load the small English model and add the component first in the pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(length_component, first=True)

# Process a text
doc = nlp("This is a sentence.")

This document is 5 tokens long.
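A pipeline can be thought of as an ordered list of named functions, each taking a doc and returning a doc. The sketch below mimics that in plain Python. It is an illustration only: the doc is a plain list of strings, and `run_pipeline` and `lowercase_component` are hypothetical names.

```python
def run_pipeline(components, doc):
    """Apply each (name, function) component in order, passing the doc along."""
    for name, component in components:
        doc = component(doc)
    return doc


def length_component(doc):
    print("This document is {} tokens long.".format(len(doc)))
    return doc  # every component must return the doc


def lowercase_component(doc):
    return [token.lower() for token in doc]


pipeline = [("lowercase", lowercase_component)]
# Inserting at position 0 plays the role of first=True in nlp.add_pipe
pipeline.insert(0, ("length_component", length_component))

pipe_names = [name for name, _ in pipeline]
result = run_pipeline(pipeline, ["This", "is", "a", "sentence", "."])
print(pipe_names, result)
```

A component that forgets to return the doc breaks every component after it, which is why the custom component above ends with `return doc`.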


In [54]:
# Create a blank 'en' model
nlp = spacy.blank('en')

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label('GADGET')

# Create training data: the cells below expect TRAINING_DATA to be a list of (text, annotations) pairs
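The training cells that follow unpack each item of TRAINING_DATA into a text and its annotations. In spaCy 2.x, entity annotations are commonly written as a dict with an "entities" list of (start_char, end_char, label) tuples. The sketch below builds such data in plain Python; the example sentences and the `make_example` helper are hypothetical and only illustrate the format.

```python
def make_example(text, phrase, label):
    """Build one (text, annotations) pair with character offsets for `phrase`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError("phrase not found in text: {!r}".format(phrase))
    end = start + len(phrase)  # end offset is exclusive, like a Python slice
    return (text, {"entities": [(start, end, label)]})


# Hypothetical sample data; a real training set needs many more examples
TRAINING_DATA = [
    make_example("How to preorder the iPhone X", "iPhone X", "GADGET"),
    make_example("iPhone X is coming", "iPhone X", "GADGET"),
    # Examples without any entity help the model learn what not to label
    ("I need a new phone! Any tips?", {"entities": []}),
]

for text, annotations in TRAINING_DATA:
    for start, end, label in annotations["entities"]:
        print(text[start:end], label)
```

Computing the offsets from the text, rather than typing them by hand, avoids off-by-one errors in the annotations.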

In [None]:
# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)

In [None]:
# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

In [None]:
# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]
        
        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)
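The shuffle-and-batch part of the loop can be tried out without a model. The sketch below is a plain-Python stand-in: `simple_minibatch` is a hypothetical helper that only mimics fixed-size batching, and the data is placeholder text with empty annotations.

```python
import random


def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder data in the (text, annotations) shape the training loop expects
data = [("text {}".format(i), {"entities": []}) for i in range(5)]

random.seed(0)        # fixed seed only so the demo is repeatable
random.shuffle(data)  # a new order each iteration keeps batches from repeating

batches = list(simple_minibatch(data, size=2))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [2, 2, 1]

for batch in batches:
    # Split each batch into parallel lists, as done before nlp.update
    texts = [text for text, annotations in batch]
    annotations = [annotations for text, annotations in batch]
    assert len(texts) == len(annotations)
```

The last batch is smaller when the data size is not a multiple of the batch size, so nothing is dropped.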

In [None]:
# Process each text in TEST_DATA (assumed to be a list of raw text strings)
for doc in nlp.pipe(TEST_DATA):
    # Print the document text and entities
    print(doc.text)
    print(doc.ents, '\n\n')