# Beyond regex: Natural Language Processing

# Why learn NLP?
- Natural language = human language
- We use language to learn about the world
- How do machines understand human language?
- How can we quantify the meaning of language?

## Applications?
- Probably any service that uses text as information
- Search engines, social networking services (SNS)
    - What's the document about?
    - How do you determine the similarity?
- Virtual assistants: Alexa, Google Assistant, Cortana, etc. 
    - Extract semantic information from the text parsed from your speech
- Biology, genetics
    - Genetic information / DNA sequence as text
    - Draw networks of proteins/molecules from a vast number of scientific papers

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide02.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide03.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide04.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide05.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide06.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide07.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide08.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide09.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide10.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide11.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide12.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide13.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide14.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide15.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide16.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide17.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide18.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide19.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide20.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide21.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide22.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide23.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide24.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide25.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide26.png)

![](https://github.com/umsi-data-science/si370/raw/master/resources/nlp/Slide27.png)

In [None]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# spaCy

- Fast and extensible NLP package for Python
- <https://spacy.io/>
- NOTE: You will need to install spaCy and then download the English language model (both are one-time steps).

In [None]:
# Uncomment *one* of the following lines 
#!pip install spacy
#!conda install -y spacy

In [None]:
import spacy

In [None]:
# Uncomment the following line ONCE only
# (spaCy v3+ removed the 'en' shortcut, so use the full model name)
#! python -m spacy download en_core_web_sm

In [None]:
# loading up the language model: English
nlp = spacy.load('en_core_web_sm')

# 0. Data cleaning

In [None]:
# from Project Gutenberg: Grimms' Fairy Tales
sentences = """
As soon as the time came when he was to declare the secret, he was taken
before the king with the three branches and the golden cup; and the
twelve princesses stood listening behind the door to hear what he would
say. And when the king asked him. ‘Where do my twelve daughters dance at
night?’ he answered, ‘With twelve princes in a castle under ground.’ And
then he told the king all that had happened, and showed him the three
branches and the golden cup which he had brought with him. Then the king
called for the princesses, and asked them whether what the soldier said
was true: and when they saw that they were discovered, and that it was
of no use to deny what had happened, they confessed it all. And the king
asked the soldier which of them he would choose for his wife; and he
answered, ‘I am not very young, so I will have the eldest.’--And they
were married that very day, and the soldier was chosen to be the king’s
heir."""

### Section goal: calculate the frequency of each word
- See which words are more frequent.
- Generate a more meaningful summary of the above paragraph.

## 0-1. Lowercase the text

In [None]:
type(sentences)

In [None]:
sentences

In [None]:
sent_low = sentences.lower()

In [None]:
sent_low

## 0-2. Remove punctuation and special characters

#### Exclude special characters one by one

In [None]:
# from https://www.programiz.com/python-programming/examples/remove-punctuation
# string of special characters you want to exclude
# (raw string, so the backslash is kept literally instead of forming the invalid escape "\,")
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~‘’'''
sent_low_pnct = ""
for char in sent_low:
    if char not in punctuations:  # keep only characters that are not excluded
        sent_low_pnct = sent_low_pnct + char

sent_low_pnct

#### Alternatively, we can use a regular expression to remove punctuation
- So we don't have to list every special character that we want to remove
- https://docs.python.org/3.4/library/re.html
- https://en.wikipedia.org/wiki/Regular_expression

In [None]:
import re
sent_low_pnct2 = re.sub(r'[^\w\s]', ' ', sent_low)

In [None]:
sent_low_pnct2

- However, the special character ```\n``` (line break) still exists in both cases. Let's remove it as well.

In [None]:
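# Line breaks inside a Python string are always "\n", regardless of the platform
# (os.linesep is "\r\n" on Windows and would not match anything here)
"\n" in sent_low_pnct

In [None]:
sent_low_pnct = sent_low_pnct.replace("\n", " ")
sent_low_pnct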
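
The cleaning approaches above do not behave identically: the character loop deletes punctuation, while the `re.sub` call replaces it with a space. The sketch below compares them on a made-up sample string with a shortened, hypothetical punctuation list (`punct`), so the names and data here are assumptions for illustration only.

```python
import re

# Made-up sample in the style of the fairy-tale paragraph (already lowercased)
sample = "the king’s heir.\n‘with twelve princes’--and"

# Hypothetical, shortened punctuation list for this demo only
punct = "'\".,;:!?-‘’"

# 1) character loop (written as a join): punctuation is deleted
loop_clean = "".join(ch for ch in sample if ch not in punct)

# 2) regular expression: every non-word, non-space character becomes a space
regex_clean = re.sub(r"[^\w\s]", " ", sample)

# 3) str.replace: swap the remaining line breaks for spaces
no_breaks = loop_clean.replace("\n", " ")

print(loop_clean.split())   # note the glued token 'princesand'
print(regex_clean.split())  # note the stray token 's' from "king’s"
print(no_breaks)
```

Deleting punctuation can glue neighbouring words together (in the paragraph, `eldest.’--and` would likely turn into `eldestand`), whereas replacing it with a space splits possessives into two tokens. Which trade-off is acceptable depends on the analysis.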

### So... at least 3 possible ways to replace characters!

## 0-3. Remove stop words

- Stop words usually refer to the most common words in a language
    - There is no single universal list of stop words
    - Stop words are often removed to improve the performance of NLP models
    - https://en.wikipedia.org/wiki/Stop_words
    - https://en.wikipedia.org/wiki/Most_common_words_in_English

#### Import the list of stop words from ```spaCy```

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
STOP_WORDS

#### Goal: We are going to count the frequency of each word from the paragraph, to see which words can be used to represent the paragraph's content. 

#### What if we do not remove stopwords?

- Note that our paragraph is stored as a single string object...

In [None]:
sent_low_pnct

- Split the paragraph into a list of words

In [None]:
words = sent_low_pnct.split()

- Count the words from the list
- Which of these words could occur in any kind of paragraph?

In [None]:
from collections import Counter

In [None]:
Counter(words).most_common(10)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [None]:
plt.figure(figsize=(45,10))
# pass the list as x= (newer seaborn releases treat a positional argument as `data`)
sns.countplot(x=words, order=pd.Series(words).value_counts().index)
# sns.countplot(x=words, order=[counted[0] for counted in Counter(words).most_common()])
plt.xticks(rotation=90)
plt.show()

#### When we remove stop words:

In [None]:
# keep only the words that are not in the stop word list
words_nostop = list()
for word in words:
    if word not in STOP_WORDS:
        words_nostop.append(word)
# words_nostop = [word for word in words if word not in STOP_WORDS]

- A more comprehensible and distinctive list of words!

In [None]:
Counter(words_nostop).most_common(10)

In [None]:
plt.figure(figsize=(45,10))
# pass the list as x= (newer seaborn releases treat a positional argument as `data`)
sns.countplot(x=words_nostop, order=pd.Series(words_nostop).value_counts().index)
# sns.countplot(x=words_nostop, order=[counted[0] for counted in Counter(words_nostop).most_common()])
plt.xticks(rotation=90)
plt.show()
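
The steps of this section (lowercase, strip punctuation, split, drop stop words, count) can be collected into one small function. This is a minimal sketch, not the notebook's own code: `top_words`, `MINI_STOP_WORDS`, and the sample sentence are hypothetical names and data chosen so that the example runs without spaCy.

```python
import re
from collections import Counter

# Hypothetical mini stop word list; the notebook uses spaCy's much larger STOP_WORDS
MINI_STOP_WORDS = {"the", "and", "he", "was", "to", "a", "of", "that", "with", "in"}

def top_words(text, stop_words, n=3):
    """Lowercase, strip punctuation, split, drop stop words, and count."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    words = [word for word in cleaned.split() if word not in stop_words]
    return Counter(words).most_common(n)

# Made-up sample sentence for illustration
sample = "The king asked the soldier, and the soldier answered the king. The king was glad."
print(top_words(sample, MINI_STOP_WORDS))
```

Passing spaCy's `STOP_WORDS` set as the `stop_words` argument should work the same way, since the function only needs membership tests.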

# 1. Extracting linguistic features with spaCy

## 1-1. Tokenize
- Token: a semantic unit for analysis
    - Loosely equivalent to a word
        - ```sent_low_pnct.split()```
    - Tricky cases
        - aren't $\rightarrow$ ![](https://nlp.stanford.edu/IR-book/html/htmledition/img88.png) ![](https://nlp.stanford.edu/IR-book/html/htmledition/img89.png) ? ![](https://nlp.stanford.edu/IR-book/html/htmledition/img86.png) ?
        - O'Neil $\rightarrow$ ![](https://nlp.stanford.edu/IR-book/html/htmledition/img83.png) ? ![](https://nlp.stanford.edu/IR-book/html/htmledition/img84.png) ![](https://nlp.stanford.edu/IR-book/html/htmledition/img81.png) ?
        - https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
- In ```spaCy```:
    - Many token types, such as words, punctuation symbols, whitespace, etc.
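
To make the tricky cases concrete, the sketch below runs three simple tokenization strategies on one sentence using only the standard library. The sentence and the regular expressions are illustrative assumptions; they are not how spaCy tokenizes internally.

```python
import re

# Illustrative sentence containing an apostrophe name, a possessive, and a contraction
text = "Mr. O'Neill thinks that the boys' stories aren't amusing."

# Strategy 1: split on whitespace (punctuation stays glued to the words)
whitespace_tokens = text.split()

# Strategy 2: keep runs of word characters and drop everything else
word_tokens = re.findall(r"\w+", text)

# Strategy 3: words (allowing one internal apostrophe) or single punctuation marks
mixed_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

for name, tokens in [("whitespace", whitespace_tokens),
                     ("word chars", word_tokens),
                     ("mixed", mixed_tokens)]:
    print(name, tokens)
```

Each strategy gives a different answer for `aren't` and `O'Neill`, which is why a purpose-built tokenizer is usually preferable to a one-line rule.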

### Let's dissect the sentence!

- Initialize the ```spaCy``` object

In [None]:
# examples partially taken from https://nlpforhackers.io/complete-guide-to-spacy/
import spacy
nlp = spacy.load('en_core_web_sm')  # spaCy v3+ removed the 'en' shortcut

- Our sentence: "Hello World!"
    - Pass the sentence string to the ```spaCy``` object ```nlp```

In [None]:
doc = nlp("Hello World!")

- The sentence is treated as a short document.

In [None]:
print(type(doc), doc)

- When the sentence string was passed in above, ```spaCy``` split it into tokens (tokenization!)

In [None]:
for i,token in enumerate(doc):
    print(i, token)

- Each token also carries index information (its character offset within the sentence)

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11|
|---|---|---|---|---|---|---|---|---|---|---|---|
| H | e | l | l | o | _ | W | o | r | l | d | ! |

In [None]:
for i, token in enumerate(doc):
    print(i, token.text, token.idx) 
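
The idea of a token's character offset can be reproduced without spaCy. The sketch below uses `re.finditer` with a simple, assumed token pattern (words or single punctuation marks); for "Hello World!" the offsets should line up with the character table above.

```python
import re

text = "Hello World!"

# Assumed toy pattern: a run of word characters, or one punctuation character
token_pattern = re.compile(r"\w+|[^\w\s]")

# m.start() is the position of the token's first character in the string
offsets = [(m.group(), m.start()) for m in token_pattern.finditer(text)]

for i, (token_text, start) in enumerate(offsets):
    print(i, token_text, start)
```

The offset lets you map a token back to the original string, for example `text[6:6 + len("World")]`.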


- And many more!
    - https://spacy.io/api/token#attributes

In [None]:
doc = nlp(sentences)

print("text\tidx\tlemma\tlower\tpunct\tspace\tshape\tPOS")
for token in doc:
    if token.is_space:
        print("SPACE")
    else:
        print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
            token.text,
            token.idx,
            token.lemma_,
            token.lower_,
            token.is_punct,
            token.is_space,
            token.shape_,
            token.pos_
    ))


## 1-2. Sentence detection

- For a document with multiple sentences, we need to separate the sentences.
- In ```spaCy```, this job is more convenient (and causes fewer mistakes) than using regular expressions

In [None]:
sentences

In [None]:
# same document as before, passed to the spaCy object again...
doc = nlp(sentences)

- Sentences are exposed as a generator object
    - Instead of being stored in a list, each sentence is produced as an item of the generator
    - Iterable (i.e., can be used in a for loop)
    - More efficient memory use
    - https://wiki.python.org/moin/Generators

In [None]:
doc.sents

- Printing sentences with the index number

In [None]:
for i, sent in enumerate(doc.sents):
    print(i, sent)
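
To illustrate both points of this section, here is a hedged sketch of a naive sentence splitter written as a plain Python generator. The function `iter_sentences`, its splitting rule, and the sample text are assumptions for illustration; spaCy's sentence detection works differently.

```python
import re

def iter_sentences(text):
    """Yield sentences one at a time (naive rule: split after '.', '!' or '?')."""
    for piece in re.split(r"(?<=[.!?])\s+", text.strip()):
        if piece:
            yield piece

# Made-up sample text containing an abbreviation
gen = iter_sentences("Simple food, great taste. I visit this place often. Dr. Kim agrees!")

first = next(gen)        # sentences are produced lazily, one per request
rest = list(gen)         # consume whatever is left
leftover = list(gen)     # a generator can only be consumed once

print(first)
print(rest)
print(leftover)
```

The naive rule breaks after the abbreviation "Dr.", which is the kind of mistake a regular-expression splitter tends to make. The empty `leftover` list shows that a generator is exhausted after one pass, so wrap it in `list(...)` if you need the sentences more than once.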

## 1-3. POS tagging

- I want to find words with a particular part of speech!
- Words with different parts of speech carry different information
    - e.g., noun (subject), verb (action term), adjective (quality of the object) 
- https://spacy.io/api/annotation#pos-tagging

- Example: a Yelp review!

In [None]:
# from https://www.yelp.com/biz/ajishin-novi?hrid=juA4Zn2TX7845vNFn4syBQ&utm_campaign=www_review_share_popup&utm_medium=copy_link&utm_source=(direct)
doc = nlp("""One of the best Japanese restaurants in Novi. Simple food, great taste, amazingly price. I visit this place a least twice month.""")

- The document contains multiple sentences

In [None]:
for i, sent in enumerate(doc.sents):
    print(i, sent)

- Question: which words are adjectives (ADJ)?

In [None]:
for i, sent in enumerate(doc.sents):
    print("__sentence__:", i)
    print("_token_ \t _POS_")
    for token in sent:
        print(token.text, "\t", token.pos_)
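
One way to answer the question programmatically is to filter tokens by their `pos_` attribute. The sketch below uses a stand-in `Token` namedtuple with hand-written tags so that it runs without a language model; the tags are assumptions for illustration, not actual spaCy output, and `words_with_pos` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for a spaCy token: only the two attributes used below
Token = namedtuple("Token", ["text", "pos_"])

# Hand-written tags for illustration only; NOT actual spaCy output
tagged = [
    Token("Simple", "ADJ"), Token("food", "NOUN"), Token(",", "PUNCT"),
    Token("great", "ADJ"), Token("taste", "NOUN"),
]

def words_with_pos(tokens, pos):
    """Return the text of every token whose coarse POS tag equals `pos`."""
    return [token.text for token in tokens if token.pos_ == pos]

adjectives = words_with_pos(tagged, "ADJ")
print(adjectives)
```

Because the helper only reads `.text` and `.pos_`, it should also accept a real spaCy `doc` or sentence in place of `tagged`.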

## Named Entity Recognition

In [None]:
doc = nlp("""Democrat Stacey Abrams was trailing in her bid to become the nation’s first female African-American governor, but her campaign said voting problems as well as uncounted absentee and provisional ballots could force a runoff.""")

print([(X.text, X.label_) for X in doc.ents])

for i, sent in enumerate(doc.sents):
    print("__sentence__:", i)
    print("_token_ \t _POS_")
    for token in sent:
        print(token.text, "\t", token.pos_)



In [None]:
url='https://www.nytimes.com/2018/11/07/us/politics/democrats-republicans-house.html'

In [None]:
#!pip install html5lib

In [None]:
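from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    """Download a web page and return its visible text as one string."""
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # drop elements whose content is not part of the article text
    for element in soup(["script", "style", 'aside']):
        element.extract()
    # collapse runs of newlines and tabs into single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

ny_bb = url_to_string(url)
article = nlp(ny_bb)  # run the spaCy pipeline on the article text
len(article.ents)     # number of named entities found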

In [None]:
labels = [x.label_ for x in article.ents]
Counter(labels)
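
The label counts show which entity types are frequent, but not which strings stand behind each label. A small grouping step can show that. The sketch below uses a stand-in `Entity` namedtuple with hand-written entities so that it runs without a model; the entities and labels are assumptions for illustration, not real model output, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict, namedtuple

# Stand-in for a spaCy entity span: only the two attributes used below
Entity = namedtuple("Entity", ["text", "label_"])

# Hand-written examples for illustration only; NOT real model output
ents = [
    Entity("Stacey Abrams", "PERSON"),
    Entity("Georgia", "GPE"),
    Entity("Abrams", "PERSON"),
    Entity("Tuesday", "DATE"),
]

def group_by_label(entities):
    """Map each entity label to the list of entity texts that carry it."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

grouped = group_by_label(ents)
for label, texts in grouped.items():
    print(label, texts)
```

Since the helper only reads `.text` and `.label_`, passing `article.ents` instead of `ents` should work as well.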

In [None]:
plt.figure(figsize=(45,10))
# pass the list as x= (newer seaborn releases treat a positional argument as `data`)
sns.countplot(x=labels, order=pd.Series(labels).value_counts().index)
# sns.countplot(x=labels, order=[counted[0] for counted in Counter(labels).most_common()])
plt.xticks(rotation=90)
plt.show()

## 1-4. Grammatical dependency
- Words are grammatically related in a sentence.
- These relations convey much semantic information about the sentential context.

In [None]:
from spacy import displacy  # import the visualizer explicitly
displacy.render(article, style='dep', jupyter=True)