# Natural Language Processing with spaCy

---

Today we are going to learn about NLP (Natural Language Processing) using spaCy, an open source library for advanced NLP. 

For more on spaCy, you can check out their site: https://spacy.io/

In [2]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


Natural Language Processing (NLP) is a subfield of Artificial Intelligence that deals with the 'understanding' of language. 

This tutorial covers the very basics of NLP, using spaCy to tokenize text, inspect tokens, match patterns, recognize entities, and compare texts. 

In [3]:
from spacy.lang.en import English # import the English language class

nlp = English() # create an nlp object; this object contains the processing pipeline, which you 
                # ultimately use to analyze the text
    
doc = nlp("Hello world!") # "Hello world!" becomes the text that we want to analyze
                          # when you process a text with the nlp object, spaCy creates a Doc object
    
for token in doc: # for every token in our Doc object (a token being a word, number, or punctuation mark)...
    print(token.text) # simply print out that token 

Hello
world
!


## Indexing

Similar to how you index into a list in Python, you can index into a Doc to retrieve tokens. 

In [4]:
print(doc) # print the full text 
print(doc[1].text) # print the second token (remember, Python is 0-indexed) in our Doc object

Hello world!
world


## Spans

You can also take a slice of the Doc, which gives you a Span object: 

In [5]:
span = doc[0:2] # this will give us the first and second tokens (again, remember, Python is 0-indexed)
                # the end index is exclusive, so we don't get the third token (index 2)

print(doc) # print the full text
print(span.text) # print the first and second tokens 

Hello world!
Hello world
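
The same half-open rule applies to ordinary Python lists, which makes it easy to check by hand. A minimal sketch, using a plain list (the hypothetical `tokens`) as a stand-in for the Doc rather than a real spaCy object:

```python
# A plain list standing in for the tokens of "Hello world!" (not a real spaCy Doc)
tokens = ["Hello", "world", "!"]

first_two = tokens[0:2]  # indices 0 and 1; index 2 is excluded
last_two = tokens[1:3]   # indices 1 and 2

print(first_two)  # ['Hello', 'world']
print(last_two)   # ['world', '!']

# When both indices are in range, a slice [start:end] holds end - start items
print(len(first_two) == 2 - 0)  # True
```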


## What else can we do with spaCy? 

Tokens have lots of attributes associated with them! For instance: 

1. is_alpha returns a boolean indicating whether a token consists only of alphabetic characters
2. is_punct returns a boolean indicating whether a token is punctuation
3. like_num returns a boolean indicating whether a token resembles a number 

These are all called "lexical attributes" – they refer to the entry in the vocabulary and don't depend on the token's context. (More on that later). 

In [6]:
doc = nlp("The earnings report will be released at 5 pm sharp.") # the text we want to work with 

print('Index: ', [token.i for token in doc]) # i being the index of the token in the Doc
print('Text: ', [token.text for token in doc]) # return the text of the token

print(" ") # just so we have some nice spacing in our results below...

print('is_alpha:', [token.is_alpha for token in doc]) # if token consists only of alphabetic characters
print('is_punct:', [token.is_punct for token in doc]) # if token is punctuation
print('like_num:', [token.like_num for token in doc]) # if token resembles a number (e.g., '10' or 'TEN')

Index:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Text:  ['The', 'earnings', 'report', 'will', 'be', 'released', 'at', '5', 'pm', 'sharp', '.']
 
is_alpha: [True, True, True, True, True, True, True, False, True, True, False]
is_punct: [False, False, False, False, False, False, False, False, False, False, True]
like_num: [False, False, False, False, False, False, False, True, False, False, False]
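
Because lexical attributes depend only on the token's own text, they can be approximated with plain string checks. The sketch below is a rough stand-in, not spaCy's implementation: the three helper functions and the small `NUMBER_WORDS` set are assumptions made for illustration, and spaCy's own `like_num` recognizes more forms.

```python
import unicodedata

# Hypothetical, deliberately tiny set of number words (spaCy's own list is longer)
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five", "ten", "hundred"}

def is_alpha(text):
    """True if every character is alphabetic (digits and punctuation fail)."""
    return text.isalpha()

def is_punct(text):
    """True if every character belongs to a Unicode punctuation category."""
    return bool(text) and all(unicodedata.category(c).startswith("P") for c in text)

def like_num(text):
    """True if the text looks like a number, e.g. '5', '1,000', '3.14' or 'ten'."""
    stripped = text.replace(",", "").replace(".", "", 1)
    return stripped.isdigit() or text.lower() in NUMBER_WORDS

tokens = ["The", "earnings", "report", "will", "be", "released", "at", "5", "pm", "sharp", "."]

print("is_alpha:", [is_alpha(t) for t in tokens])
print("is_punct:", [is_punct(t) for t in tokens])
print("like_num:", [like_num(t) for t in tokens])
```

None of these checks looks at neighboring tokens, which is what "doesn't depend on the token's context" means in practice.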


---

# Exercise 1:

Imagine you are charged with reporting on a long press release, and you just want to know where in the document a percent increase or percent decrease is mentioned...

In [7]:
doc = nlp("In 2012, earnings were hovering around 60%, versus in 2019 where they are less than 4% – a 93% decrease.")

Use next_token, .like_num, and .text to find any percentage value in the doc that is followed by the word "increase" or "decrease".

# Solution

In [8]:
for token in doc: # for every token in our Doc object...
    if token.like_num and token.i + 2 < len(doc): # if the token resembles a number and two more tokens follow it...
        next_token = doc[token.i + 1] # look at the token following that numerical value...
        if next_token.text == "%": # if that token is a "%" sign...
            next_token = doc[next_token.i + 1] # look at the token following the "%"
            if next_token.text in ("increase", "decrease"): # if the token after the % is the word "decrease" or "increase"...
                print("Percentage found:", token.text, next_token.text) # we know we have found a percentage increase or decrease

Percentage found: 93 decrease


---

## Pre-Built Models:

In [9]:
import spacy 

nlp = spacy.load('en') # loading in the package we just downloaded...

doc = nlp("Adidas AG and Gap Inc. are among those at the end of the long supply chain that travel through \
           China’s northwest region of Xinjiang.") # this is the text we want to analyze 
                                                   # that '\' above just lets me split the text into a new 
                                                   # line in my notebook; the '\' isn't part of the text, but the
                                                   # indentation before "China’s" is, which is why a SPACE token
                                                   # shows up in the output below

for token in doc: # for each token in our Doc...
    print(token.text, token.pos_, token.dep_, token.head.text) # print the following:
    
    # .pos_ will give us the parts of speech for each token
    # .dep_ will give us the predicted dependency label 
    # .head.text will give us the 'syntactic head token' (think of it as the parent token this word is attached to)

Adidas PROPN compound AG
AG PROPN nsubj are
and CCONJ cc AG
Gap PROPN compound Inc.
Inc. PROPN conj AG
are AUX ROOT are
among ADP prep are
those DET det region
at ADP prep those
the DET det end
end NOUN pobj at
of ADP prep end
the DET det chain
long ADJ amod chain
supply NOUN compound chain
chain NOUN pobj of
that DET nsubj travel
travel NOUN relcl chain
through ADP prep travel
            SPACE  through
China PROPN pobj through
’s PART punct those
northwest ADJ compound region
region NOUN pobj among
of ADP prep region
Xinjiang PROPN pobj of
. PUNCT punct are


## `ent.label_`

`doc.ents` holds the named entities the model predicted, and `ent.label_` gives the label of each one:

In [10]:
doc = nlp("Adidas AG and Gap Inc. are among those at the end of the long supply chain that travel through \
           China’s northwest region of Xinjiang.")

for ent in doc.ents: # for each entity in our Doc...
    print(ent.text, ent.label_) # print it alongside its label

Adidas AG ORG
Gap Inc. ORG
China GPE
Xinjiang GPE


## `.explain`

To get quick definitions of common tags and labels, you can use `spacy.explain`:

In [11]:
print("GPE = " + spacy.explain('GPE'))
print("ORG = " + spacy.explain('ORG'))

GPE = Countries, cities, states
ORG = Companies, agencies, institutions, etc.


spaCy also lets you write rules to find words and/or phrases in a text. This is similar to Regular Expressions, but with some major benefits unique to spaCy. 

In particular, it allows you to match on Doc objects (not just strings), use the model's predictions, and match on tokens and token attributes. Match patterns in spaCy are lists of dictionaries, and each dictionary describes one token. 

The keys in each dictionary are the names of token attributes, and the values are the values you expect those attributes to have. 

In [12]:
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake.") # our text

from spacy.matcher import Matcher # import the matcher
matcher = Matcher(nlp.vocab) # initialize the matcher

pattern = [{'TEXT':'iPhone'}, {'TEXT':'X'}]    # match these exact token texts

matcher.add('IPHONE_PATTERN', None, pattern) # add the pattern to the matcher

matches = matcher(doc) # call the matcher on our Doc and store the result as a list called 'matches'

print(matches)

[(9528407286733565721, 1, 3)]


You'll note that the matcher returns a list of tuples. Each tuple (an immutable sequence of fixed size) consists of three values: 

1. The match ID (the hash of the pattern name)
2. The start index of the matched span
3. The end index of the matched span (exclusive)

We can iterate over our matches and slice the Doc to get the matched text:

In [13]:
for match_id, start, end in matches: 
    matched_span = doc[start:end] # start = start index of matched span; end = end index of matched span
    print(matched_span.text)

iPhone X


Remember, you can also match on lexical attributes and other token attributes. For instance, below we are going to look for five tokens: 

1. A token consisting of only digits
2. Two case-insensitive tokens for the words "revenue" and "up"
3. Another token that consists of only digits
4. A punctuation token

In [14]:
doc = nlp("Earnings are in today! 2019 Revenue up 45%! This is the highest revenue in 5 years.")

pattern = [
    {'IS_DIGIT': True}, # looking for a token consisting of only digits
    {'LOWER': 'revenue'}, # looking for the word "revenue"
    {'LOWER': 'up'}, # looking for the word "up"
    {'IS_DIGIT': True}, # looking for a token consisting of only digits
    {'IS_PUNCT': True} # looking for a punctuation token 
]

matcher.add('REVENUE_PATTERN', None, pattern) # add the pattern to the matcher

matches = matcher(doc)

for match_id, start, end in matches: 
    matched_span = doc[start:end] # start = start index of matched span; end = end index of matched span
    print(matched_span.text)

2019 Revenue up 45%


## A note on Operators and Quantifiers.

Operators and Quantifiers let you define how often a token should be matched. 

An Operator can have one of four values: 

1. An "!" negates the token, so it's matched 0 times
2. A "?" makes the token optional, so it matches 0 or 1 times
3. A "+" matches a token 1 or more times
4. A "*" matches a token 0 or more times

Below, the "?" Operator makes the final punctuation token optional, so the pattern matches both with and without the trailing "%".

In [17]:
doc = nlp("Earnings are in today! 2019 Revenue up 45% for Company X. This is the highest revenue in 5 years.")

# note that our text above has changed... 

pattern = [
    {'IS_DIGIT': True}, # looking for a token consisting of only digits
    {'LOWER': 'revenue'}, # looking for the word "revenue"
    {'LOWER': 'up'}, # looking for the word "up"
    {'IS_DIGIT': True}, # looking for a token consisting of only digits
    {'IS_PUNCT': True, 'OP' : '?'} # looking for an OPTIONAL punctuation token
]

matcher.add('REVENUE_PATTERN', None, pattern) # add the pattern to the matcher

matches = matcher(doc)

print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['2019 Revenue up 45%', '2019 Revenue up 45']
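
To see why an optional token produces two matches, it can help to write a toy matcher by hand. The sketch below is a simplified stand-in, not spaCy's Matcher: it works on a plain list of token strings, supports only the four attributes used above, and handles only the "?" operator. The names `token_matches` and `find_matches` are made up for this illustration.

```python
import string

def token_matches(token, spec):
    """Check one token string against one pattern dictionary."""
    checks = {
        "TEXT": lambda value: token == value,
        "LOWER": lambda value: token.lower() == value,
        "IS_DIGIT": lambda value: token.isdigit() == value,
        "IS_PUNCT": lambda value: all(c in string.punctuation for c in token) == value,
    }
    return all(checks[key](value) for key, value in spec.items() if key != "OP")

def find_matches(tokens, pattern):
    """Return every (start, end) span the pattern matches; only OP '?' is supported."""
    results = set()

    def walk(start, pos, idx):
        if idx == len(pattern):          # every pattern entry has been handled
            results.add((start, pos))
            return
        spec = pattern[idx]
        if spec.get("OP") == "?":        # optional entry: try skipping it
            walk(start, pos, idx + 1)
        if pos < len(tokens) and token_matches(tokens[pos], spec):
            walk(start, pos + 1, idx + 1)  # consume one token and move on

    for start in range(len(tokens)):
        walk(start, start, 0)
    return sorted(results)

# Hand-written token list (an assumption; spaCy's tokenizer would produce the Doc)
tokens = ["Earnings", "are", "in", "today", "!", "2019", "Revenue", "up", "45", "%",
          "for", "Company", "X", "."]
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "revenue"},
    {"LOWER": "up"},
    {"IS_DIGIT": True},
    {"IS_PUNCT": True, "OP": "?"},
]

spans = find_matches(tokens, pattern)
print(spans)  # [(5, 9), (5, 10)]
for start, end in spans:
    print(" ".join(tokens[start:end]))
```

Skipping the optional token and consuming it are both valid ways to satisfy the pattern, so each one is reported as its own match.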


## Luckily, spaCy also allows us to use Regular Expressions.

In [18]:
import re

doc = nlp("Earnings are in today! 2019 Revenue up 45% for Company X. This is the highest revenue in 5 years.")

expression = r'[Rr]evenue (up|down)' 

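for match in re.finditer(expression, doc.text):
    start, end = match.span() # character offsets of the regex match
    span = doc.char_span(start, end) # returns None if the offsets don't line up with token boundaries
    if span is not None:
        print("Found match:", span.text)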

Found match: Revenue up
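
`doc.char_span` has to translate the character offsets from the regular expression into token indices, and that only works when the match starts and ends on token boundaries. A rough sketch of that alignment step, assuming a simple regex tokenizer in place of spaCy's (the names `tokenize` and `char_to_token_span` are hypothetical):

```python
import re

def tokenize(text):
    """Stand-in tokenizer: words and single punctuation marks, with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_to_token_span(tokens, start_char, end_char):
    """Map character offsets to a (start, end) token slice, or None if misaligned."""
    starts = {start: i for i, (_, start, _) in enumerate(tokens)}
    ends = {end: i + 1 for i, (_, _, end) in enumerate(tokens)}
    if start_char not in starts or end_char not in ends:
        return None
    return starts[start_char], ends[end_char]

text = "Earnings are in today! 2019 Revenue up 45% for Company X."
tokens = tokenize(text)

# A match that begins and ends on token boundaries maps cleanly to tokens
aligned = re.search(r"[Rr]evenue (up|down)", text)
aligned_span = char_to_token_span(tokens, *aligned.span())
print(aligned_span, [t[0] for t in tokens[aligned_span[0]:aligned_span[1]]])

# A match that begins in the middle of "Revenue" has no token span
misaligned = re.search(r"evenue up", text)
print(char_to_token_span(tokens, *misaligned.span()))
```

This is why it is worth checking the result for `None` before reading `.text` from it.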


---

# Exercise 2

Using the text provided below, create a pattern match that finds any news about a possible merger or acquisition.

In [19]:
# your code here

# Solution

In [20]:
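doc = nlp("It is rumored that Google bought Apple.")

for ent in doc.ents: # for each entity in our Doc...
    if ent.label_ == 'ORG': # if the entity is an organization...
        print(ent.text)
        if ent.end < len(doc): # make sure a token follows the entity
            next_token = doc[ent.end] # ent.end is the index of the first token after the entity
            if next_token.text in ("bought", "sold", "acquired"):
                print(next_token.text) 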

Google
bought
Apple


---

## Vocabularies

spaCy stores all shared data in a vocabulary, which includes words as well as the label schemes for tags and entities. It also uses a hash function to generate an ID for each string; these are kept in a string store, available via nlp.vocab.strings.

This string store is a lookup table that works in both directions: you can look up a string to get its hash, or look up a hash to get the string. For instance:

In [21]:
doc = nlp("I love coffee") # our text

print('hash value:', nlp.vocab.strings['coffee']) # print the hash value given the text
print('string value:', nlp.vocab.strings[3197928453018144401]) # print the text value given the hash 

hash value: 3197928453018144401
string value: coffee
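
The two-way lookup can be pictured as a dictionary keyed by hash. The sketch below is a toy model only: the `ToyStringStore` class is made up for illustration, and it uses a truncated BLAKE2b digest as an assumed stand-in for spaCy's own hash function, so the numbers it produces will not match the ones printed above.

```python
import hashlib

class ToyStringStore:
    """A toy two-way lookup table between strings and 64-bit integer IDs."""

    def __init__(self):
        self._strings = {}  # maps hash -> original string

    @staticmethod
    def hash_string(string):
        # Assumption: BLAKE2b truncated to 8 bytes stands in for spaCy's hash function
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, string):
        """Store the string and return its hash."""
        key = self.hash_string(string)
        self._strings[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self.hash_string(item)  # string -> hash can always be computed
        return self._strings[item]         # hash -> string only works for stored strings

store = ToyStringStore()
coffee_hash = store.add("coffee")

print(store["coffee"] == coffee_hash)  # True
print(store[coffee_hash])              # coffee
```

A hash cannot be reversed on its own, so the hash-to-string direction only works for strings the store has already seen.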


spaCy even lets you compare two objects to predict how similar they are. 

These objects can be documents, spans, or single tokens. 

In [23]:
# !sudo python -m spacy download en_core_web_md

!python -m spacy download en_core_web_md

import en_core_web_md # you only have to download it in the line above if you didn't earlier

nlp = en_core_web_md.load()

# compare two documents

doc1 = nlp("I like fast food") # doc 1 to be compared
doc2 = nlp("I like pizza") # doc 2 to be compared

print(doc1.similarity(doc2))

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-cp36-none-any.whl size=98051305 sha256=1ec46b0ff494f04bc681514f0189f0b5590c1ea4668112334a27cf2b0df91ec3
  Stored in directory: /tmp/pip-ephem-wheel-cache-xllh_ehx/wheels/df/94/ad/f5cf59224cea6b5686ac4fd1ad19c8a07bc026e13c36502d81
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
0.8627204117787385
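
By default, this similarity score is the cosine similarity between word vectors, and the vector for a document or span is the average of its token vectors. A minimal sketch of that arithmetic, using made-up three-dimensional vectors (the `toy_vectors` values are invented for illustration; real model vectors have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Invented vectors: food words point one way, "books" points another
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "books": [0.0, 0.2, 0.9],
}

food_sim = cosine(toy_vectors["pizza"], toy_vectors["pasta"])
mixed_sim = cosine(toy_vectors["pizza"], toy_vectors["books"])
print(food_sim > mixed_sim)  # True

# A "document" vector is the average of the vectors of its tokens
doc_vector = average([toy_vectors["pizza"], toy_vectors["pasta"]])
print(cosine(doc_vector, toy_vectors["pizza"]))
```

Because the score depends entirely on the vectors, a model without real word vectors gives much less reliable similarity values.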


We can also compare two tokens: 

In [24]:
# nlp = en_core_web_md.load()

doc = nlp("I like pizza and pasta")

# compare two tokens

token1 = doc[2] # the word "pizza"
token2 = doc[4] # the word "pasta"

print(token1.similarity(token2)) 

0.7369546


In [25]:
doc = nlp("TV and books")

token1, token2 = doc[0], doc[2]

similarity = token1.similarity(token2) # get the similarity of the tokens "TV" and "books"

print(similarity)

0.22325331


Or, a document with a token: 

In [26]:
# nlp = en_core_web_md.load()

# compare a document with a token

doc = nlp("I like being clean") # this full text
token = nlp("I also like soap")[3] # the word "soap"

print(doc.similarity(token))

0.37694991878301737


And, last but not least, a span with a document: 

In [27]:
# nlp = en_core_web_md.load()

# compare a span with a document

span = nlp("I like pizza and pasta")[2:5] # the words "pizza and pasta"
doc = nlp("McDonalds sells burgers") # this full text

print(span.similarity(doc))

0.6199092090831612


---

## BeautifulSoup + spaCy

Now, let's use BeautifulSoup to pull some text from an online source so that we can analyze it with spaCy:

In [28]:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.vice.com/en_us/article/a35ve5/what-it-would-take-for-the-next-president-to-cancel-all-student-debt') 
    # for more on the requests library check out this tutorial from RealPython: 
    # https://realpython.com/python-requests/
            
soup = BeautifulSoup(r.text, 'html.parser') # parse the page's HTML into 'soup' so that we can
                                            # navigate its elements. For more on BeautifulSoup, check out: 
                                            # https://www.crummy.com/software/BeautifulSoup/bs4/doc/
            
print(soup)

<!DOCTYPE html>
<html dir="ltr" lang="en"><head><link as="script" href="//vice-web-statics-cdn.vice.com/sourcepoint/messaging.js" rel="preconnect dns-prefetch"/><link as="script" href="//securepubads.g.doubleclick.net/tag/js/gpt.js" rel="preconnect dns-prefetch"/><link as="script" href="//native.sharethrough.com/assets/sfp.js" rel="preconnect dns-prefetch"/><link href="//sourcepoint.mgr.consensu.org" rel="preconnect"/><link href="//api.amplitude.com" rel="preconnect"/><link href="//gum.criteo.com" rel="preconnect"/><link as="script" href="https://vice-web-statics-cdn.vice.com/vendor/ad-lib/v2.16.0/vice-ad-lib.js" id="ad-lib-preload" rel="preload"/><link href="//www.googletagmanager.com" rel="dns-prefetch"/><link href="//fonts.gstatic.com" rel="dns-prefetch"/><link href="//vice-dev-web-statics-cdn.vice.com" rel="dns-prefetch"/><link href="//vice-web-statics-cdn.vice.com" rel="dns-prefetch"/><script type="text/javascript">
            function DOMTokenListSupports(tokenList, token) {
   

We know that what we're interested in is the text of this article. So, let's see what that looks like in the HTML.

In [29]:
paragraphs = [i.get_text() for i in soup.find_all('p')] # find all of the <p> elements in our text

print(paragraphs) 

['Last Wednesday, President Donald Trump issued an executive order intended to wipe out the student loans of around 25,000 permanently disabled veterans, a move that came after dozens of state attorneys general said that it was way too complicated for wounded veterans to get rid of their student debt. The move will supposedly save these veterans an average of $30,000, but it represents a tiny fraction of the $1.6 trillion in student debt that Americans collectively owe.', "Disabled veterans aren't the only ones having difficulty navigating an extremely confusing loan forgiveness process, and though debates about what to do about the country's student loan crisis have emerged in the Democratic primary, Trump doesn't seem inclined to take the broader problem seriously. Last year, the highest-ranking federal official in charge of keeping lenders like Navient and Sallie Mae in line quit in protest of what he said were lax enforcement policies.", "Trump's move to forgive the debt of some ve

In [30]:
article = '\n'.join(paragraphs) # join all of those paragraphs together

print(article)

Last Wednesday, President Donald Trump issued an executive order intended to wipe out the student loans of around 25,000 permanently disabled veterans, a move that came after dozens of state attorneys general said that it was way too complicated for wounded veterans to get rid of their student debt. The move will supposedly save these veterans an average of $30,000, but it represents a tiny fraction of the $1.6 trillion in student debt that Americans collectively owe.
Disabled veterans aren't the only ones having difficulty navigating an extremely confusing loan forgiveness process, and though debates about what to do about the country's student loan crisis have emerged in the Democratic primary, Trump doesn't seem inclined to take the broader problem seriously. Last year, the highest-ranking federal official in charge of keeping lenders like Navient and Sallie Mae in line quit in protest of what he said were lax enforcement policies.
Trump's move to forgive the debt of some veterans s
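
The two steps above, collecting the text of every `<p>` element and joining the pieces with newlines, can be tried on a small inline snippet without fetching anything. This sketch swaps BeautifulSoup for Python's built-in `html.parser` so that it has no dependencies; the `ParagraphCollector` class and the sample HTML are made up for illustration, and nested or unclosed `<p>` tags are not handled.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text content of each <p> element, including text in child tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._parts))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:  # ignore text outside of paragraphs, such as headings
            self._parts.append(data)

# Invented sample page standing in for the downloaded article
html = (
    "<html><body><h1>Title</h1>"
    "<p>First <a href='#'>linked</a> paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(html)

article_text = "\n".join(collector.paragraphs)
print(article_text)
```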

---

# Exercise 3:

What if we want to know what entities are mentioned in the article? 

# Solution

In [31]:
nlp = spacy.load('en')

doc = nlp(article) # our text is the full article text that we joined together above

for ent in doc.ents: # for each entity in our Doc...
    print(ent.text, ent.label_) # print that entity alongside its label

Last Wednesday DATE
Donald Trump PERSON
around 25,000 CARDINAL
dozens CARDINAL
30,000 MONEY
$1.6 trillion MONEY
Americans NORP
Democratic NORP
Trump PRODUCT
Last year DATE
Navient PERSON
Sallie Mae ORG
Trump PERSON
Congress ORG
Mark Kantrowitz PERSON
one CARDINAL
Trump LOC
2008 DATE
the Higher Education Act LAW
Kantrowitz PERSON
tens of thousands CARDINAL
American NORP
Social Security ORG
the Department of Education ORG
Trump PERSON
Kantrowitz PERSON
Trump PERSON
Congress ORG
the Anti-Deficiency Act LAW
Alan Collinge PERSON
about 70 percent PERCENT
U.S. GPE
Democratic NORP
Collinge ORG
Bernie Sanders PERSON
Elizabeth Warren PERSON
two CARDINAL
2020 CARDINAL
Republican NORP
Congress ORG
Collinge ORG
30 percent PERCENT
Sallie Mae ORG
the Department of Education ORG
Congress ORG
ADA ORG
daily DATE
Allie Conti PERSON


# Exercise 4:

What if we want to know how similar two paragraphs of the article are? 

# Solution

In [33]:
doc1 = nlp(paragraphs[3]) # the fourth paragraph of the article
doc2 = nlp(paragraphs[-3]) # the third paragraph from the end

print(doc1.similarity(doc2))

0.7119196592774236




---

An enormous thank you to spaCy, whose existing online courses (https://course.spacy.io/chapter1) were the basis for this Jupyter-ized tutorial. For more information on spaCy's existing online course, check out https://github.com/ines/spacy-course#-faq