# Natural Language Processing with spaCy

Today we are going to learn about Natural Language Processing (NLP) using spaCy, an open-source library for advanced NLP.

For more on spaCy, you can check out their site: https://spacy.io/

In [None]:
!python -m spacy download en_core_web_sm # remember our bash command, allowing us to run commands on the terminal of our Colab instance

Natural Language Processing (NLP) is a subfield of Artificial Intelligence that deals with the 'understanding' of language.

For the purposes of this tutorial, we are going to look at the very basics of NLP with spaCy: tokens, spans, lexical attributes, pre-built models, rule-based matching, and similarity.

In [None]:
from spacy.lang.en import English # import the English language class

nlp = English() # create an nlp object; this object contains the processing pipeline, which you 
                # ultimately use to analyze the text
    
doc = nlp("Hello world!") # "Hello world!" becomes the text that we want to analyze
                          # when you process a text with the nlp object, spaCy creates a Doc object
    
for token in doc: # for every token in our Doc object (a token being a word, number, or punctuation mark)...
    print(token.text) # simply print out that token 

# When working with a library such as spaCy, which you are less likely to have used before (or to use as often) than something like Pandas,
# it's always a good idea to read through the library's introductory documentation.

## Indexing

Similar to how you index into a list in Python, you can index into a Doc to retrieve tokens.

In [None]:
print(doc) # print the full text 
print(doc[1].text) # print the second token (remember, Python is zero-indexed) in our Doc object

print(type(doc[1])) # you'll note that this returns a token object...
print(type(doc[1].text)) # whereas using .text returns it as a string

## Spans

You can also take a slice of the Doc, which gives you a `Span` object.

In [None]:
span = doc[0:2] # this gives us the first and second tokens (again, remember, Python is zero-indexed)
                # the end index is exclusive, so we don't get the third token (index 2)

print(doc) # print the full text
print(span.text) # print the first and second tokens

## What else can we do with spaCy? 

Tokens have lots of attributes associated with them! For instance: 

1. is_alpha returns a boolean indicating if a token consists only of alphabetic characters
2. is_punct returns a boolean indicating if a token is punctuation
3. like_num returns a boolean indicating if a token resembles a number

These are all called "lexical attributes" – they refer to the entry in the vocabulary and don't depend on the token's context. (More on that later). 

In [None]:
doc = nlp("The earnings report will be released at 5 pm sharp.") # the text we want to work with 

print('Index: ', [token.i for token in doc]) # i being the index of the token in the Doc; remember this is just a for loop in a more 'Pythonic' format
print('Text: ', [token.text for token in doc]) # return the text of the token; remember this is just a for loop in a more 'Pythonic' format

print(" ") # just so we have some nice spacing in our results below...

print('is_alpha:', [token.is_alpha for token in doc]) # if token consists only of alphabetic characters
print('is_punct:', [token.is_punct for token in doc]) # if token is punctuation
print('like_num:', [token.like_num for token in doc]) # if token resembles a number (e.g., '10' or 'TEN')

---

# ⭕ **QUESTIONS?**

---

# Exercise 1 (together):

Imagine you are charged with reporting on a long press release, and you just want to know where in the document a percent increase or percent decrease is mentioned...

In [None]:
doc = nlp("In 2012, earnings were hovering around 60%, versus in 2019 where they are less than 4% – a 93% decrease.")

Use next_token, .like_num, and .text to find any percentage value mentioned in the doc.

# Solution

In [None]:
for token in doc: # for every token in our Doc object...
    if token.like_num and token.i + 2 < len(doc): # if the token resembles a number and at least two tokens follow it...
        next_token = doc[token.i + 1] # look at the token following that numerical value...
        if next_token.text == "%": # if that token is a "%" sign...
            next_token = doc[next_token.i + 1] # look at the token following the "%"
            if next_token.text in ("increase", "decrease"): # if the token after the "%" is the word "increase" or "decrease"...
                print("Percentage found:", token.text, next_token.text) # we know we have found a percent increase or decrease

---

# ⭕ **QUESTIONS?**

---

## Pre-Built Models:

In [None]:
import spacy 

nlp = spacy.load('en_core_web_sm') # loading in the package we just downloaded in our first cell...

doc = nlp("Adidas AG and Gap Inc. are among those at the end of the long supply chain that travel through "
          "China’s northwest region of Xinjiang.") # this is the text we want to analyze
                                                   # Python joins adjacent string literals inside the parentheses,
                                                   # so the text can span two lines without picking up extra spaces

for token in doc: # for each token in our Doc...
    print(token.text, token.pos_, token.dep_, token.head.text) # print the following:
    
    # .pos_ will give us the parts of speech for each token
    # .dep_ will give us the predicted dependency label 
    # .head.text will give us the 'syntactic head token' (think of it as the parent token this word is attached to)

## `ent.label_`

`doc.ents` holds the named entities the model predicted, and each entity's `.label_` tells you what type of entity it is.

In [None]:
doc = nlp("Adidas AG and Gap Inc. are among those at the end of the long supply chain that travel through "
          "China’s northwest region of Xinjiang.")

for ent in doc.ents: # for each entity in our Doc...
    print(ent.text, ent.label_) # print it alongside its label
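An entity is a `Span` with a label attached, so it also carries token offsets (`ent.start` and `ent.end`). Here is a minimal sketch of that idea, assuming a blank English pipeline and two entities annotated by hand. The sentence, the labels, and the names `blank_nlp`, `demo_doc`, and `entities` are illustrative only; a trained model would predict the entities for you.

```python
import spacy
from spacy.tokens import Span

blank_nlp = spacy.blank("en")  # tokenizer only; no trained model required
demo_doc = blank_nlp("Adidas AG sells shoes in Berlin")

# Hand-written annotations for illustration (an assumption, not model output)
demo_doc.ents = [
    Span(demo_doc, 0, 2, label="ORG"),  # tokens 0-1: "Adidas AG"
    Span(demo_doc, 5, 6, label="GPE"),  # token 5: "Berlin"
]

entities = [(ent.text, ent.label_, ent.start, ent.end) for ent in demo_doc.ents]
for text, label, start, end in entities:
    print(text, label, start, end)  # end is exclusive, like a Python slice
```

Because `ent.end` is exclusive, `demo_doc[ent.end]` is the first token after the entity, provided the entity is not at the very end of the Doc.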

## `.explain`

To get a quick definition of a common tag or label, you can use `spacy.explain`.

In [None]:
print("GPE = " + spacy.explain('GPE'))
print("ORG = " + spacy.explain('ORG'))

---

# ⭕ **QUESTIONS?**

---

spaCy also lets you write rules to find words and phrases in a text. This is similar to regular expressions, but with some major benefits unique to spaCy.

In particular, it allows you to match on Doc objects (not just strings), use the model's predictions, and match on tokens and token attributes. Match patterns in spaCy are lists of dictionaries, and each dictionary describes one token.

The keys in each dictionary are the names of token attributes, mapped to their expected values.

In [None]:
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake.") # our text

from spacy.matcher import Matcher # import the matcher
matcher = Matcher(nlp.vocab) # initialize the matcher

pattern = [[{"TEXT":"iPhone"}, {"TEXT":"X"}]]   # match these exact token texts; you're passing through a list of patterns to match

matcher.add('IPHONE_PATTERN', pattern) # add the pattern to the matcher

matches = matcher(doc) # call the matcher on our Doc and store the result as a list called 'matches'

print(matches)

You'll note that the matcher returns a list of tuples. Each tuple (an immutable sequence of fixed size) consists of three values:

1. The match ID (the hash of the pattern name)
2. The start token index of the matched span
3. The end token index of the matched span (exclusive)

Fortunately, we can iterate over our matches:

In [None]:
for match_id, start, end in matches: 
    matched_span = doc[start:end] # start = start index of matched span; end = end index of matched span
    print(matched_span.text)
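The match ID in each tuple is a hash rather than a readable name. As a small sketch (using a blank English pipeline and the illustrative names `blank_nlp`, `demo_matcher`, and `results`), you can look the hash up in the string store to get the pattern name back:

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")  # tokenizer only; enough for matching on token text
demo_matcher = Matcher(blank_nlp.vocab)
demo_matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

demo_doc = blank_nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake.")

results = []
for match_id, start, end in demo_matcher(demo_doc):
    pattern_name = blank_nlp.vocab.strings[match_id]  # hash -> pattern name
    results.append((pattern_name, start, end, demo_doc[start:end].text))

print(results)
```

This is handy once a matcher holds several patterns and you need to know which one fired.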

Remember, you can also match on lexical attributes and other token attributes. For instance, below we are going to look for five tokens:

1. A token consisting of only digits
2. Two case-insensitive tokens for the words "revenue" and "up"
3. Another token that consists of only digits
4. A punctuation token

In [None]:
doc = nlp("Earnings are in today! 2019 Revenue up 45%! This is the highest revenue in 5 years.")

pattern = [[
    {'IS_DIGIT': True}, # looking for a token consisting of only digits
    {'LOWER': 'revenue'}, # looking for the word "revenue"
    {'LOWER': 'up'}, # looking for the word "up"
    {'IS_DIGIT': True}, # looking for a token consisting of only digits
    {'IS_PUNCT': True} # looking for a punctuation token 
]]

matcher.add('REVENUE_PATTERN', pattern) # add the pattern to the matcher

matches = matcher(doc)

for match_id, start, end in matches: 
    matched_span = doc[start:end] # start = start index of matched span; end = end index of matched span
    print(matched_span.text)

---

# ⭕ **QUESTIONS?**

---

## A note on Operators and Quantifiers.

Operators and Quantifiers let you define how often a token should be matched. 

An Operator can have one of four values: 

1. An "!" negates the token, so it's matched 0 times
2. A "?" makes the token optional, so it matches 0 or 1 times
3. A "+" matches a token 1 or more times
4. A "*" matches a token 0 or more times

Below, the "?" Operator makes the final punctuation token optional.

In [None]:
doc = nlp("Earnings are in today! 2019 Revenue up 45% for Company X. This is the highest revenue in 5 years.")

# note that our text above has changed... 

pattern = [[
    {'IS_DIGIT': True}, # looking for a token consisting of only digits
    {'LOWER': 'revenue'}, # looking for the word "revenue"
    {'LOWER': 'up'}, # looking for the word "up"
    {'IS_DIGIT': True}, # looking for a token consisting of only digits
    {'IS_PUNCT': True, 'OP' : '?'} # looking for an OPTIONAL punctuation token
]]

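matcher = Matcher(nlp.vocab) # start from a fresh matcher so the patterns we added earlier don't also fire
matcher.add('REVENUE_PATTERN', pattern) # add the pattern to the matcher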

matches = matcher(doc)

print("Matches:", [doc[start:end].text for match_id, start, end in matches])
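The "+" and "*" Operators work the same way as "?". The sketch below is only an illustration: the sentence, the `VERY_GOOD` pattern name, and the helper `matched_texts` are made up for this example. It compares the two Operators on a repeated word, using a blank English pipeline.

```python
import spacy
from spacy.matcher import Matcher

blank_nlp = spacy.blank("en")  # tokenizer only; LOWER is a lexical attribute
demo_doc = blank_nlp("This is very very good and that is good.")

def matched_texts(op):
    """Return the distinct texts matched by 'very' (quantified by op) followed by 'good'."""
    demo_matcher = Matcher(blank_nlp.vocab)
    demo_matcher.add("VERY_GOOD", [[{"LOWER": "very", "OP": op}, {"LOWER": "good"}]])
    return {demo_doc[start:end].text for _, start, end in demo_matcher(demo_doc)}

plus_matches = matched_texts("+")  # needs at least one "very" before "good"
star_matches = matched_texts("*")  # "very" may be absent, so a bare "good" also matches

print("+ :", sorted(plus_matches))
print("* :", sorted(star_matches))
```

Note that the matcher reports every span that satisfies the pattern, so overlapping matches of different lengths can show up side by side.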

## Luckily, spaCy also allows us to use Regular Expressions.

In [None]:
import re

doc = nlp("Earnings are in today! 2019 Revenue up 45% for Company X. This is the highest revenue in 5 years.")

expression = r'[Rr]evenue (up|down)' 

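for match in re.finditer(expression, doc.text):
    start, end = match.span() # character offsets of the regex match in the Doc's text
    span = doc.char_span(start, end) # char_span returns a Span, or None if the offsets don't line up with token boundaries
    if span is not None:
        print("Found match:", span.text)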
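One thing to watch for when mixing regular expressions with spaCy: a regex works on characters, while a Span is made of whole tokens. The sketch below (blank English pipeline; the second regex and the variable names are made up for illustration) shows `char_span` returning `None` when a match ends in the middle of a token.

```python
import re
import spacy

blank_nlp = spacy.blank("en")  # tokenizer only
demo_doc = blank_nlp("2019 Revenue up 45% for Company X.")

# This match covers two whole tokens, so it maps cleanly onto a Span
whole = re.search(r"[Rr]evenue (up|down)", demo_doc.text)
aligned = demo_doc.char_span(*whole.span())

# This match stops partway through the token "Revenue", so no Span exists for it
partial = re.search(r"[Rr]even", demo_doc.text)
misaligned = demo_doc.char_span(*partial.span())

print(aligned.text)  # the Span's text
print(misaligned)    # None
```

Checking for `None` before using the result keeps a loop over regex matches from crashing on a partial-token match.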

---

# ⭕ **QUESTIONS?**

---

# Exercise 2

Using the text provided below, create a pattern match that finds any news about a possible merger or acquisition.

In [None]:
# your code here

# Solution

In [None]:
doc = nlp("It is rumored that Google bought Apple.")

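for ent in doc.ents:
    if ent.label_ == 'ORG': # for every organization the model found...
        print(ent.text)
        if ent.end < len(doc): # make sure a token actually follows the entity
            next_token = doc[ent.end] # ent.end is the index of the first token after the entity
            if next_token.text in ("bought", "sold", "acquired"):
                print(next_token.text)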

---

# ⭕ **QUESTIONS?**

---

## Vocabularies

spaCy stores all shared data in a vocabulary, which includes words as well as the label schemes for tags and entities. It also uses a hash function to generate an ID for each string, which is stored in a string store and is available via nlp.vocab.strings.

This string store is ultimately a lookup table whereby you can look up a string to get its hash, or, look up a hash to get the string. For instance:

In [None]:
doc = nlp("I love coffee") # our text

print('hash value:', nlp.vocab.strings['coffee']) # print the hash value given the text
print('string value:', nlp.vocab.strings[3197928453018144401]) # print the text value given the hash (this only works if the string store has already seen the string)
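To see the lookup working in both directions without copying a long hash by hand, here is a small sketch using a blank English pipeline (the variable names are illustrative). It also shows that a token stores the hash of its text, not the string itself.

```python
import spacy

blank_nlp = spacy.blank("en")  # tokenizer only; the vocab and string store still work
demo_doc = blank_nlp("I love coffee")

coffee_hash = blank_nlp.vocab.strings["coffee"]     # string -> hash
coffee_text = blank_nlp.vocab.strings[coffee_hash]  # hash -> string
token_hash = demo_doc[2].orth                       # the token "coffee" carries the same hash

print(coffee_hash, coffee_text, token_hash == coffee_hash)
```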

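spaCy even lets you compare two objects to predict how similar they are. For meaningful results you need a model that includes word vectors, such as en_core_web_md, which we download below.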

These objects can be documents, spans, or single tokens. 

In [None]:
!python -m spacy download en_core_web_md

import en_core_web_md # you only have to download it in the line above if you didn't earlier

nlp = en_core_web_md.load()

# compare two documents

doc1 = nlp("I like fast food") # doc 1 to be compared
doc2 = nlp("I like pizza") # doc 2 to be compared

print(doc1.similarity(doc2))

# word vectors are also behind the classic analogy: king - man + woman ≈ queen
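Under the hood, spaCy's default similarity is the cosine similarity between two vectors, and the vector for a Doc or Span is the average of its token vectors. The sketch below reproduces that calculation in plain Python. The three-dimensional `toy_vectors` are invented for illustration (real models use hundreds of dimensions), and `cosine_similarity` and `average` are hypothetical helper names, not spaCy functions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up vectors: the two food words point in a similar direction, "tv" does not
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "tv": [0.0, 0.1, 0.9],
}

food_vs_food = cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"])
food_vs_tv = cosine_similarity(toy_vectors["pizza"], toy_vectors["tv"])
phrase_vector = average([toy_vectors["pizza"], toy_vectors["pasta"]])  # a "pizza pasta" phrase

print(food_vs_food, food_vs_tv, phrase_vector)
```

Averaging is why long documents on mixed topics tend to look broadly similar to everything: the individual word directions wash out.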

We can also compare two tokens: 

In [None]:
# nlp = en_core_web_md.load()

doc = nlp("I like pizza and pasta")

# compare two tokens

token1 = doc[2] # the word "pizza"
token2 = doc[4] # the word "pasta"

print(token1.similarity(token2)) 

In [None]:
doc = nlp("TV and books")

token1, token2 = doc[0], doc[2]

similarity = token1.similarity(token2) # get the similarity of the tokens "TV" and "books"

print(similarity)

Or, a document with a token: 

In [None]:
# nlp = en_core_web_md.load()

# compare a document with a token

doc = nlp("I like being clean") # this full text
token = nlp("I also like soap")[3] # the word "soap"

print(doc.similarity(token))

And, last but not least, a span with a document: 

In [None]:
# nlp = en_core_web_md.load()

# compare a span with a document

span = nlp("I like pizza and pasta")[2:5] # the words "pizza and pasta"
doc = nlp("McDonalds sells burgers") # this full text

print(span.similarity(doc))

---

# ⭕ **QUESTIONS?**

---

## BeautifulSoup + spaCy

Now, let's use BeautifulSoup to pull some text from an online source and analyze it:

In [None]:
import time
import re
import csv
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.vice.com/en_us/article/a35ve5/what-it-would-take-for-the-next-president-to-cancel-all-student-debt') 
    # for more on the requests library check out this tutorial from RealPython: 
    # https://realpython.com/python-requests/
            
soup = BeautifulSoup(r.text, 'html.parser') # we are going to turn that page into 'soup', aka, we are going to be
                                            # able to navigate its HTML. For more on BeautifulSoup, check out:
                                            # https://www.crummy.com/software/BeautifulSoup/bs4/doc/
            
print(soup)

We know that what we're interested in is the text of this article. So, let's see what that looks like in the HTML.

In [None]:
paragraphs = [i.get_text() for i in soup.find_all('p')] # find all of the <p> elements in our text

print(paragraphs) 

In [None]:
article = '\n'.join(paragraphs) # join all of those paragraphs together

print(article)
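If the page above is unavailable, the same steps can be tried offline. This is a minimal sketch on a hand-written HTML snippet; `sample_html`, `sample_paragraphs`, and `sample_article` are made-up names chosen so they don't overwrite the variables used in the exercises.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a downloaded page (illustrative only)
sample_html = """
<html><body>
  <h1>Headline</h1>
  <p>First <b>paragraph</b>.</p>
  <p>Second paragraph.</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")  # Python's built-in HTML parser

# get_text() drops the tags inside each <p> and keeps only the text
sample_paragraphs = [p.get_text() for p in sample_soup.find_all("p")]
sample_article = "\n".join(sample_paragraphs)

print(sample_article)
```

The headline is left out because only `<p>` elements are collected, which is the same filtering the cells above apply to the real article.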

---

# ⭕ **QUESTIONS?**

---

# Exercise 3:

What if we want to know what entities are mentioned in the article? 

# Solution

In [None]:
nlp = spacy.load('en_core_web_sm')

doc = nlp(article) # our text is going to be the full article text we built above

for ent in doc.ents: # for each entity in our Doc...
    print(ent.text, ent.label_) # print that entity alongside its label

# Exercise 4:

What if we want to know the similarity between the first and last paragraphs of the article?

# Solution

In [None]:
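doc1 = nlp(paragraphs[3]) # an early paragraph; the first few <p> elements are presumably page furniture rather than article text
doc2 = nlp(paragraphs[-3]) # a late paragraph, skipping the last couple of <p> elements for the same reason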

print(doc1.similarity(doc2))

---

# ⭕ **QUESTIONS?**

---

An enormous thank you to spaCy, whose existing online courses (https://course.spacy.io/chapter1) were the basis for this Jupyter-ized tutorial. For more information on spaCy's existing online course, check out https://github.com/ines/spacy-course#-faq