# spaCy essentials

- *Reference: Credit to Jose Portilla for his NLP course on Udemy, from which most of this content was learned. I have reproduced it by working through the concepts taught and rewriting them to ensure practical understanding.*

---

## Import Modules

In [40]:
# spaCy
import spacy
from spacy.matcher import Matcher

# spaCy visualization
from spacy import displacy

---

### Load spaCy libraries

In [2]:
# Load the small and large versions of the English language model

nlp_sm = spacy.load('en_core_web_sm')
nlp_sm

<spacy.lang.en.English at 0x25018d43848>

In [3]:
nlp_lg = spacy.load('en_core_web_lg')
nlp_lg

<spacy.lang.en.English at 0x2501d903108>

### spaCy Pipeline

- Using the language model just loaded, spaCy parses the entire string into separate components known as **tokens**, where the default token unit is roughly a word or punctuation mark
- Various attributes can also be extracted from these tokens
- The `spacy.load(...)` calls above actually load a **spaCy model pipeline**

In [4]:
nlp_sm.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x2501bc8d588>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x2501bc8d828>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x2501bc3e0b8>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2501bc3e208>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x2501bcfd748>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x2501bcfdd48>)]

In [5]:
nlp_sm.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

- When we call `nlp_sm` on a text, the text enters the **processing pipeline** listed above: it is first tokenized, and then each component runs on the document in order
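
To see how components make up a pipeline without downloading a trained model, here is a minimal sketch that starts from an empty English pipeline and adds the rule-based `sentencizer`. It assumes the spaCy v3 API; `nlp_blank` is an illustrative name and is not the `en_core_web_sm` pipeline used in this notebook.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline has a tokenizer but no components yet.
nlp_blank = spacy.blank("en")
names_before = list(nlp_blank.pipe_names)

# Add a built-in, rule-based component by its registered name
nlp_blank.add_pipe("sentencizer")
names_after = list(nlp_blank.pipe_names)

print(names_before)  # []
print(names_after)   # ['sentencizer']
```

A trained model such as `en_core_web_sm` ships with its components already added, which is why `pipe_names` above lists six of them.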

### Document Object

In [6]:
# Sample text

tmp_txt1 = "Amazon has a market capatalization of $1.75 trillion.     HQ are in the U.S. Isn't that cool!"

In [7]:
# Creating a document object by applying our model to our text

doc = nlp_sm(tmp_txt1)

This document object holds the processed text.

In [8]:
doc

Amazon has a market capatalization of $1.75 trillion.     HQ are in the U.S. Isn't that cool!

In [9]:
type(doc)

spacy.tokens.doc.Doc

### Vocab

- From a document object you can access the vocabulary, which is shared with the language model loaded at the start
- `doc.vocab`

In [10]:
len(doc.vocab)

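# This reports 775 entries for 'en_core_web_sm', which looks surprisingly small.
# In spaCy v3 the vocab appears to be filled lazily, so this is most likely the number
# of lexemes cached so far rather than the size of the model's full vocabulary.
# Recall: doc = nlp_sm(<text>); and nlp_sm = spacy.load('en_core_web_sm')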

775

In [11]:
doc_lg = nlp_lg(tmp_txt1)
len(doc_lg.vocab)  # same count as the small model, despite the larger model

775
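
One way to probe the small count is to check whether the vocab grows as new words are processed. The sketch below assumes a blank English pipeline (`nlp_blank`, an illustrative name) and two made-up words that cannot already be cached; it only compares sizes and does not claim any particular number.

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline, no trained model

size_before = len(nlp_blank.vocab)
# Tokenizing creates and caches a lexeme for each word type not seen before
nlp_blank("zyxxoquatl flumbergast")  # made-up words
size_after = len(nlp_blank.vocab)

print(size_before, size_after)
print(size_after > size_before)
```

If the second number is larger, `len(vocab)` is counting cached lexemes, which would explain why both models report the same small figure for the same text.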

### Tokens

**Tokenization** is the very first step of text processing, where the text is split into its component tokens (normally words and punctuation marks)

- Tokens are the basic building blocks of a Doc object
- Get each token from your text by iterating over the document object (a `Doc` has no `tokens` attribute)
- Individual tokens are parsed and tagged with POS, dependencies and lemmas
- **`for token in doc`**

**Token attributes** 

- Token Text                                   : {token.text}
- Token character offset in the text           : {token.idx} (use {token.i} for the token's index in the Doc)
- Token Lemmatization                          : {token.lemma_}
- Token coarse-grained part of speech (POS)    : {token.pos_}
- Token fine-grained POS tag                   : {token.tag_}
- Is the token made of alphabetic characters only : {token.is_alpha}
- Is the token a stop word                     : {token.is_stop}
- Is the token the start of a sentence         : {token.is_sent_start}
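
The difference between a token's index and its character offset is easy to mix up, so here is a small sketch of both. It assumes a blank English pipeline (`nlp_blank`, illustrative), which only tokenizes; attributes such as `pos_` or `lemma_` would still need a trained model.

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline for illustration
doc_demo = nlp_blank("Amazon has a market")

# token.i is the token's position in the Doc; token.idx is its character offset
rows = [(token.i, token.idx, token.text, token.is_alpha) for token in doc_demo]
for row in rows:
    print(row)
# (0, 0, 'Amazon', True)
# (1, 7, 'has', True)
# (2, 11, 'a', True)
# (3, 13, 'market', True)
```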

In [12]:
# Refer to this command to discover more
# dir(token)

In [13]:
# Number of tokens a document has
print(doc)
len(doc)

Amazon has a market capatalization of $1.75 trillion.     HQ are in the U.S. Isn't that cool!


21

In [14]:
print('TOKENS (and its attributes)')
print('')

print(f'{"Text":{15}} {"Idx":>{3}} {"Lemma":{15}} {"POS":{10}} {"POS Descr":{15}} {"Alphabet?":{10}} {"Stop Word?":{10}} {"Start of sent?":{15}} {"Entity Type":{10}}')
print('-'*110)
for token in doc:
    print(f'{token.text:{15}} {token.idx:<{3}} {token.lemma_:{15}} {token.pos_:{10}} {token.tag_:{15}} {token.is_alpha:<{10}} {token.is_stop:<{10}} {token.is_sent_start:<{15}} {token.ent_type_:{10}}')

TOKENS (and its attributes)

Text            Idx Lemma           POS        POS Descr       Alphabet?  Stop Word? Start of sent?  Entity Type
--------------------------------------------------------------------------------------------------------------
Amazon          0   Amazon          PROPN      NNP             1          0          1               ORG       
has             7   have            VERB       VBZ             1          1          0                         
a               11  a               DET        DT              1          1          0                         
market          13  market          NOUN       NN              1          0          0                         
capatalization  20  capatalization  NOUN       NN              1          0          0                         
of              35  of              ADP        IN              1          1          0                         
$               38  $               SYM        $               0          0

**`Insights`**
- spaCy knows that Amazon in this context is a proper noun (and tags it as an ORG entity)
- spaCy tags trillion as a number that quantifies a monetary amount
- spaCy knows that U.S. is a proper noun
- Isn't is split into Is and n't because spaCy separates the root verb (is) from the negation (n't)
- Runs of extra whitespace become tokens as well

In [15]:
# You can use indexing to grab tokens individually
# Index the document object to get individual tokens
print(doc[0])
print(doc[-2])

print('~~~~~~~~~~~~~~~')

# Attributes of specific tokens
print(doc[0].pos_)
print(doc[-2].pos_)

Amazon
cool
~~~~~~~~~~~~~~~
PROPN
ADJ


### Span

- Take a slice of the document object and you get a `Span`
- This comes in handy when you are only interested in part of an article
- It is especially useful for very large documents
- One might think that you could just slice the original text, but spaCy makes it easier.
    - To slice the original text using words as the indexes, you would first have to split it into words (with regex or similar) and then slice that list of words.
    - If you slice the document object instead, the span operates on the tokens that form the document. In the case below the tokens are words, so it returns the words from index 3 up to index 6 (not inclusive)

In [16]:
print('Type of the text provided                           : ', type(tmp_txt1))
print('Type of the document object made from text provided : ', type(doc))

Type of the text provided                           :  <class 'str'>
Type of the document object made from text provided :  <class 'spacy.tokens.doc.Doc'>


In [17]:
print(tmp_txt1)
print(doc)

Amazon has a market capatalization of $1.75 trillion.     HQ are in the U.S. Isn't that cool!
Amazon has a market capatalization of $1.75 trillion.     HQ are in the U.S. Isn't that cool!


In [18]:
slice_of_txt1 = tmp_txt1[3:6]
print(type(slice_of_txt1))
slice_of_txt1

<class 'str'>


'zon'

In [19]:
doc

Amazon has a market capatalization of $1.75 trillion.     HQ are in the U.S. Isn't that cool!

In [20]:
span_doc = doc[3:6]
print(type(span_doc))
span_doc

<class 'spacy.tokens.span.Span'>


market capatalization of

In [21]:
slice_of_doc = doc[3:6]
print(type(slice_of_doc))
slice_of_doc

<class 'spacy.tokens.span.Span'>


market capatalization of
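
Beyond its text, a `Span` keeps both token offsets and character offsets, which ties the token slice back to the original string. The sketch below assumes a blank English pipeline and a shortened sample sentence (with the spelling corrected) purely for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline
text = "Amazon has a market capitalization of $1.75 trillion."
doc_demo = nlp_blank(text)

span = doc_demo[3:6]                         # tokens 3, 4 and 5
print(span.text)                             # market capitalization of
print(span.start, span.end)                  # token offsets: 3 6
print(span.start_char, span.end_char)        # character offsets: 13 37
print(text[span.start_char:span.end_char])   # the same words via string slicing
```

So the string slice `text[3:6]` counts characters, while `doc[3:6]` counts tokens, and `start_char` / `end_char` convert between the two.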

### Sentences

- Get each sentence from your text using the sents attribute on the document object
- **`doc.sents`**

In [22]:
# Pass a string to the nlp model pipeline object to create a document (the u prefix is optional in Python 3)
doc2 = nlp_sm(u"This is the first sentence. This is the second. This is the third")

for sentence in doc2.sents:
    print(sentence)

This is the first sentence.
This is the second.
This is the third
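
In the trained models, sentence boundaries come from the pipeline's components. As a minimal sketch of the same `doc.sents` interface without a trained model, the example below assumes a blank English pipeline with the rule-based `sentencizer`, which splits on sentence-final punctuation.

```python
import spacy

nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")  # rule-based sentence boundary detection

doc_demo = nlp_blank("This is the first sentence. This is the second. This is the third")

# doc.sents yields Span objects; is_sent_start marks the first token of each one
sentences = [sent.text for sent in doc_demo.sents]
starts = [token.text for token in doc_demo if token.is_sent_start]
print(sentences)
print(starts)  # ['This', 'This', 'This']
```

Because each sentence is a `Span`, everything that works on a span (slicing, `start`, `end`, `text`) works on a sentence too.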


### Named Entities

- Using the language model you load, spaCy recognizes that certain tokens are organization names (e.g. Tesla), locations, amounts of money, dates, etc.
- Recognition is statistical, so entities can be missed: in the output below, the small model tags Toronto and $4 million but not Tesla
- Named entities add another layer of context
- They are accessible through the `ents` property of a document object
- **`doc.ents`**

In [23]:
doc3 = nlp_sm(u'Tesla to build a factory in Toronto for $4 million')

print('TOKENS in document:')
for token in doc3:
    print(token.text, end = ' | ')

print('\n')
print('-'*40)

print('ENTITIES in document:')
for entity in doc3.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print("")

TOKENS in document:
Tesla | to | build | a | factory | in | Toronto | for | $ | 4 | million | 

----------------------------------------
ENTITIES in document:
Toronto
GPE
Countries, cities, states

$4 million
MONEY
Monetary values, including unit
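
Since Tesla was not picked up in the output above, one possible workaround is a rule-based entity pattern. This is a sketch under assumptions, not what the course or this notebook does: it uses spaCy v3's built-in `entity_ruler` component on a blank pipeline, so only the hand-written pattern produces entities here.

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: no statistical NER, rules only
ruler = nlp_blank.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Tesla"}])

doc_demo = nlp_blank("Tesla to build a factory in Toronto for $4 million")
found = [(ent.text, ent.label_) for ent in doc_demo.ents]
print(found)  # [('Tesla', 'ORG')]
print(spacy.explain("ORG"))
```

With a trained model, the same ruler could be added alongside the `ner` component so that rules and statistical predictions complement each other.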



### Noun Chunks

- You can think of noun chunks as a noun plus the words describing the noun. For example, in Sheb Wooley's 1958 song, a *one-eyed, one-horned, flying, purple people-eater* would be one long noun chunk.
- They are basically nouns with some descriptor words attached to them (or just the noun itself if there are no descriptor words)
- They are accessible through the **`noun_chunks`** property of a document object
- **`doc.noun_chunks`**

In [24]:
doc4 = nlp_sm('Autonomous cars shift insurance liability toward manufacturers')

for noun_chunk in doc4.noun_chunks:
    print(noun_chunk)

Autonomous cars
insurance liability
manufacturers


---

## Built-In visualizers

- displaCy, spaCy's built-in visualizer, works well within Jupyter notebooks

In [25]:
doc5 = nlp_sm(u"Apple is going to build a U.K. factory for $6 million")

In [26]:
# Display the dependencies between tokens

displacy.render(doc5, 
                style='dep',  # Syntactic dependency
                jupyter=True, 
                options={'distance':80}  # distance between tokens
               )

In [27]:
# Visualize the Entity recognizer

doc6 = nlp_sm(u"Over the last quarter, Apple sold nearly 20 thousand iPods for a profit of $6 million")

# Highlight every entity
displacy.render(doc6, 
                style='ent',  # Entity
                jupyter=True
               )
                


In [28]:
# Visualize outside of jupyter notebook (use .serve() instead of .render())

# displacy.serve(doc6, 
#                 style='ent'
#                )

# Open browser to 127.0.0.1:5000
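
Outside a notebook, `displacy.render` can also return the markup as a string instead of displaying it. The sketch below assumes hand-supplied entity offsets (`manual=True`), so it runs without a trained model; the `manual_doc` dictionary and its single ORG entity are illustrative.

```python
from spacy import displacy

text = "Apple sold nearly 20 thousand iPods"
start = text.index("Apple")

# Assumption: entity offsets are supplied by hand, so no model is needed
manual_doc = {
    "text": text,
    "ents": [{"start": start, "end": start + len("Apple"), "label": "ORG"}],
    "title": None,
}

# With jupyter=False, render() returns the HTML markup instead of displaying it
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can then be written to an `.html` file or embedded in a web page.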

## Stemming

- Stemming catalogs related words (e.g. boat, boats, boating, boater, etc.) under a common stem
- It essentially chops off letters from the end of a word until the stem is reached
- spaCy doesn't include a stemmer; instead, it relies on lemmatization
- To perform stemming, use NLTK (section above)
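
To make the "chop letters off the end" idea concrete, here is a deliberately naive sketch in plain Python. `naive_stem` and its suffix list are hypothetical and much cruder than a real stemmer such as NLTK's Porter stemmer; the point is only to show the mechanism.

```python
def naive_stem(word, suffixes=("ing", "er", "s")):
    """Toy suffix stripper (hypothetical helper, not the Porter algorithm)."""
    for suffix in suffixes:
        # Only strip when a stem of at least three letters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["boat", "boats", "boating", "boater"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['boat', 'boat', 'boat', 'boat']
```

A rule like this has no notion of context or part of speech, which is the gap lemmatization is meant to fill.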

## Lemmatization

- Generally seen as much more informative than stemming
- Looks at the surrounding text to determine a given word's part of speech
- A more informative way of reducing words, taking into account how the word is being used in the sentence
- **`token.lemma_`**

In [29]:
doc_lem1 = nlp_sm('I am a runner running in a race because I love to run since I ran today')
doc_lem1

I am a runner running in a race because I love to run since I ran today

In [30]:
print(f'{"Token Text":{15}} {"Token Lemma_":{15}} {"Token POS":{12}} {"Token Lemma Hash Ref":{23}} ')
print('-'*63)

for token in doc_lem1:
    print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{12}} {token.lemma:<{23}}')

Token Text      Token Lemma_    Token POS    Token Lemma Hash Ref    
---------------------------------------------------------------
I               I               PRON         4690420944186131903    
am              be              AUX          10382539506755952630   
a               a               DET          11901859001352538922   
runner          runner          NOUN         12640964157389618806   
running         run             VERB         12767647472892411841   
in              in              ADP          3002984154512732771    
a               a               DET          11901859001352538922   
race            race            NOUN         8048469955494714898    
because         because         SCONJ        16950148841647037698   
I               I               PRON         4690420944186131903    
love            love            VERB         3702023516439754181    
to              to              PART         3791531372978436496    
run             run             VERB  

In [31]:
# Creating a function to show lemmas for other documents

def show_lemmas(text):
    """
    Helper function to print the lemmas of the words (tokens) of a document
    I/P:
        - text (spaCy Doc): a processed document object, e.g. nlp_sm('...')
    """

    print(f'{"Token Text":{15}} {"Token Lemma_":{15}} {"Token POS":{12}} {"Token Lemma Hash Ref":{23}} ')
    print('-'*65)
    
    for token in text:
        print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{12}} {token.lemma:<{23}}')

In [32]:
doc_lem2 = nlp_sm('I saw the movie Saw yesterday while watching over my dog. You\'ve seen it?')
doc_lem2

I saw the movie Saw yesterday while watching over my dog. You've seen it?

In [33]:
show_lemmas(doc_lem2)

Token Text      Token Lemma_    Token POS    Token Lemma Hash Ref    
-----------------------------------------------------------------
I               I               PRON         4690420944186131903    
saw             see             VERB         11925638236994514241   
the             the             DET          7425985699627899538    
movie           movie           NOUN         18213940162184454424   
Saw             Saw             PROPN        11380715829964476340   
yesterday       yesterday       NOUN         1756787072497230782    
while           while           SCONJ        1039541750886098100    
watching        watch           VERB         2054481287215635300    
over            over            ADP          5456543204961066030    
my              my              PRON         227504873216781231     
dog             dog             NOUN         7562983679033046312    
.               .               PUNCT        12646065887601541794   
You             you             PRON
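
The "Token Lemma Hash Ref" column comes from `token.lemma` (no underscore), which is the integer key of the lemma string in the vocabulary's string store. The sketch below assumes a blank English pipeline and shows that the store maps in both directions.

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: only the string store is needed here

# Adding a string returns its 64-bit hash; the store then maps both ways
run_hash = nlp_blank.vocab.strings.add("run")
print(run_hash)
print(nlp_blank.vocab.strings[run_hash])           # run
print(nlp_blank.vocab.strings["run"] == run_hash)  # True
```

This is why the same word (for example `I` or `a` in the tables above) always shows the same hash value.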

## Stop Words

- Stop words are really common English words that don't provide much additional information
- spaCy holds a built-in list of English stop words (326 in the version used here)
- A really common step in text preprocessing NLP pipelines is to remove stop words

In [34]:
# spaCy default stopwords

print(f'Number of stop words: {len(nlp_lg.Defaults.stop_words)}')
print("")
print(nlp_lg.Defaults.stop_words)

Number of stop words: 326

{'front', 'anywhere', 'top', 'from', 'throughout', 'if', 'only', 'above', 'much', 'whence', 'back', 'namely', 'well', 'across', "'ve", 'just', 'own', '‘re', 'twelve', 'whoever', 'almost', 'forty', 'however', 'whither', 'or', 'out', 'again', 'more', 'your', 'most', 'than', 'will', 'elsewhere', 'mine', 'and', 'them', "'d", 'latterly', 'ten', 'too', 'four', 'at', 'either', 'n’t', 'whereas', 'towards', 'third', 'see', 'show', 'formerly', 'could', 'do', 'be', 'on', 'all', 'her', 'enough', 'nine', 'although', 'beyond', '‘ll', 'any', 'thence', 'always', 'beside', 'whom', 'really', 'with', 'yourself', 'keep', 'not', 'became', 'us', 'its', 'already', 'often', 'a', 'being', 'him', 'serious', 'several', 'get', 'how', 'various', '’m', 'side', 'yourselves', 'thereupon', 'among', 'bottom', 'every', 'are', 'part', 'nobody', 'anything', 'when', 'whole', 'did', '’ll', 'for', 'whereby', 'eight', 'hereby', 'another', 'last', 'upon', 'nevertheless', 'why', 'i', 'meanwhile', 'wer

In [35]:
# Check if a word is a stop word by calling nlp vocab

print(nlp_lg.vocab['is'])
print(nlp_lg.vocab['is'].is_stop)

<spacy.lexeme.Lexeme object at 0x000002501DA7DF48>
True


In [36]:
print(nlp_lg.vocab['umbrella'])
print(nlp_lg.vocab['umbrella'].is_stop)

<spacy.lexeme.Lexeme object at 0x000002501D8F9908>
False


In [37]:
# ADD your own custom stopword

print('Length of existing stop words : ', len(nlp_lg.Defaults.stop_words))
print('Is btw a stop word?           : ', nlp_lg.vocab['btw'].is_stop)

nlp_lg.Defaults.stop_words.add('btw')
nlp_lg.vocab['btw'].is_stop = True

print('Length of existing stop words : ', len(nlp_lg.Defaults.stop_words))
print('Is btw a stop word?           : ', nlp_lg.vocab['btw'].is_stop)

Length of existing stop words :  326
Is btw a stop word?           :  False
Length of existing stop words :  327
Is btw a stop word?           :  True


In [38]:
# REMOVE your own custom stopword

print('Length of existing stop words : ', len(nlp_lg.Defaults.stop_words))
print('Is btw a stop word?           : ', nlp_lg.vocab['btw'].is_stop)

nlp_lg.Defaults.stop_words.remove('btw')
nlp_lg.vocab['btw'].is_stop = False

print('Length of existing stop words : ', len(nlp_lg.Defaults.stop_words))
print('Is btw a stop word?           : ', nlp_lg.vocab['btw'].is_stop)

Length of existing stop words :  327
Is btw a stop word?           :  True
Length of existing stop words :  326
Is btw a stop word?           :  False
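
The removal step itself is a one-line filter on `token.is_stop`. A minimal sketch, assuming a blank English pipeline (stop-word flags are lexical attributes, so no trained model is required) and an illustrative sentence:

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: default English stop-word list
doc_demo = nlp_blank("the umbrella is in the car")

# Keep only the tokens that are not flagged as stop words
kept = [token.text for token in doc_demo if not token.is_stop]
print(kept)  # ['umbrella', 'car']
```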


## Vocabulary and Matching

In [130]:
# Understand the patterns that you're looking to match
# and code their patterns to feed to the matcher
# Each different pattern is a list of dictionaries,
# with each dictionary pertaining to how the token 
# within the pattern is to be captured.

# if case: "SolarPower" / "solarpower" / "Solarpower"
# Read: the lower case of the token = solarpower
pattern1 = [{'LOWER': 'solarpower'}]

# if case: "Solar-Power" / "solar-power" / "Solar-power"
# Read: 
    # the lower case of the 1st token = solar
    # the 2nd token (i.e., the token after the 1st) is a punctuation mark
    # the lower case of the 3rd token = power
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

# if case: "Solar Power" / "solar power" / "Solar power"
# Read: 
    # the lower case of the 1st token = solar
    # the lower case of the next token = power
    # Nothing is needed in between, since a single
    # whitespace is not a token in spaCy
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

# if case: "solar------power" / "solar*@#@power" / "solar <any amount of punctuation> power"
pattern4 = [{'LOWER': 'solar'}, 
            {'IS_PUNCT': True, 
             'OP':'*'},  # Optional - zero or more times
            {'LOWER': 'power'}]

# Putting together all patterns to match for
patterns = [pattern1, pattern2, pattern3, pattern4]
patterns

[[{'LOWER': 'solarpower'}],
 [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}],
 [{'LOWER': 'solar'}, {'LOWER': 'power'}],
 [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]]

In [150]:
# Showing tokens for each of the cases below for better 
# understanding of how the patterns were defined
doc_for_mat1 = nlp_lg(u'SolarPower')
doc_for_mat2 = nlp_lg(u'solar-power')
doc_for_mat3 = nlp_lg(u'Solar power')
doc_for_mat4 = nlp_lg(u'Solar---$%power')

for token in doc_for_mat1:
    print(token)
    # Since only 1 token here, pattern1 has only 1 dictionary
    # in the list

print('~'*30)
for token in doc_for_mat2:
    print(token)
    # 3 tokens here, so 3 dictionaries for pattern2

print('~'*30)
for token in doc_for_mat3:
    print(token)
    # 2 tokens here, so 2 dictionaries for pattern3
    # 1 whitespace in between does not form a token

print('~'*30)
for token in doc_for_mat4:
    print(token)
    # The whole string stays a single token here (see output), so none of the patterns can match it

SolarPower
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
solar
-
power
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Solar
power
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Solar---$%power


In [132]:
# Create an instance of the matcher object

matcher = Matcher(nlp_lg.vocab)
matcher

<spacy.matcher.matcher.Matcher at 0x25060535e58>

In [133]:
# Register the patterns with the matcher under a rule name
# (one matcher can hold many named rules)
# We'll call this rule "Solarpower_Matcher"

matcher.add('Solarpower_Matcher', patterns)

# So now, 4 patterns (to check for) have been added to the
# matcher object under the name Solarpower_Matcher

In [156]:
# Lets test the matcher

# Create a document
doc_mat = nlp_lg(u"The Solar Power industry grows as solarpower increases. Solar-power is good. solar ---$% power")
doc_mat

The Solar Power industry grows as solarpower increases. Solar-power is good. solar ---$% power

In [157]:
# Get all tokens of the document 

for i, token in enumerate(doc_mat):
    print(f'{i:{2}} {token}')

 0 The
 1 Solar
 2 Power
 3 industry
 4 grows
 5 as
 6 solarpower
 7 increases
 8 .
 9 Solar
10 -
11 power
12 is
13 good
14 .
15 solar
16 ---$%
17 power


In [158]:
# Capture the results of the matcher in a variable

matches_found = matcher(doc_mat)

In [159]:
# Tuples of (match_id, start token index (inclusive), end token index (exclusive))

matches_found

[(9730578706714751770, 1, 3),
 (9730578706714751770, 6, 7),
 (9730578706714751770, 9, 12)]

In [160]:
# Printing above in a different way

for match_id, start_pos, end_pos in matches_found:
    string_id = nlp_lg.vocab.strings[match_id]  # get string representation
    span = doc_mat[start_pos : end_pos]  # get the matched span
    print(f'{match_id} {string_id} {start_pos:{3}} {end_pos:{3}} {span.text}')

9730578706714751770 Solarpower_Matcher   1   3 Solar Power
9730578706714751770 Solarpower_Matcher   6   7 solarpower
9730578706714751770 Solarpower_Matcher   9  12 Solar-power


In [148]:
# Remove the rule from the matcher if you don't want to detect it anymore

# matcher.remove('Solarpower_Matcher')
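
The results above contain no match for `solar ---$% power` (tokens 15 to 17). A likely reason is that `IS_PUNCT` is checked per token, and a token containing `$` does not count as punctuation. The sketch below assumes a blank English pipeline and only two of the patterns (the `OP: '*'` pattern also covers the zero- and one-punctuation cases); the names `matcher_demo` and `SolarPower` are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # assumption: token-level attributes only, no trained model
matcher_demo = Matcher(nlp_blank.vocab)
matcher_demo.add("SolarPower", [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}],
])

doc_demo = nlp_blank("The Solar Power industry grows as solarpower increases. "
                     "Solar-power is good. solar ---$% power")

# Sort by start index so the matches read in document order
matched = [doc_demo[start:end].text
           for _, start, end in sorted(matcher_demo(doc_demo), key=lambda m: m[1])]
print(matched)  # ['Solar Power', 'solarpower', 'Solar-power']

# Any token containing "$" is not punctuation, so IS_PUNCT cannot bridge it
dollar_tokens = [(token.text, token.is_punct) for token in doc_demo if "$" in token.text]
print(dollar_tokens)
```

To catch such cases, the middle token would need a looser condition than `IS_PUNCT`, for example a check that it is not alphabetic.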

---