# SI 618: Data Manipulation and Analysis
## 07 Beyond regex: Natural Language Processing
### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>
    
### Please ensure you have this version:
Version 2023.02.13.1.CT


### Top-Level Learning Objective
* To familiarize ourselves with the basics of NLP and how to implement NLP techniques in Python
   
### Things we'll learn on the way:
1. What the spaCy package does and why it's useful
2. Text processing steps:
   1. normalization
   2. tokenization
   3. stop word removal
   4. lemmatization
   5. part-of-speech tagging
   6. named entity recognition
   7. sentiment analysis

### How you know you've learned:
* Complete Homework 5

In [3]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
import spacy

# spaCy

- A fast, extensible NLP package for Python
- <https://spacy.io/>
- NOTE: You will need to install spaCy and then (also one time only) download the English language model.

In [5]:
# Comment out the next line when you've run this cell successfully
# !python -m spacy download en_core_web_md

Collecting en-core-web-md==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.6.0/en_core_web_md-3.6.0-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 18.9 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [6]:
# loading up the language model: English
# note that Windows users might need to figure out where
# the previous cell installed the library and change the following line accordingly
nlp = spacy.load('en_core_web_md')

# 0. Data cleaning

In [8]:
# from Make It Stick: The Science of Successful Learning
sentences = """
Michael Young is a high-achieving fourth-year medical student at
Georgia Regents University who pulled himself up from rock bottom
by changing the way he studies.
Young entered medical school without the usual foundation of premed
coursework. His classmates all had backgrounds in biochemistry,
pharmacology, and the like. Medical school is plenty tough under
any circumstances, but in Young's case even more so for lack of a footing.
"""

In [9]:
sentences

"\nMichael Young is a high-achieving fourth-year medical student at\nGeorgia Regents University who pulled himself up from rock bottom\nby changing the way he studies.\nYoung entered medical school without the usual foundation of premed\ncoursework. His classmates all had backgrounds in biochemistry,\npharmacology, and the like. Medical school is plenty tough under\nany circumstances, but in Young's case even more so for lack of a footing.\n"

### Section goal: calculate the frequency of each word
- See which words are more frequent.
- Generate more meaningful summary for the above paragraph.

## 0-1. lowering the case

In [10]:
type(sentences)

str

In [11]:
sentences

"\nMichael Young is a high-achieving fourth-year medical student at\nGeorgia Regents University who pulled himself up from rock bottom\nby changing the way he studies.\nYoung entered medical school without the usual foundation of premed\ncoursework. His classmates all had backgrounds in biochemistry,\npharmacology, and the like. Medical school is plenty tough under\nany circumstances, but in Young's case even more so for lack of a footing.\n"

In [12]:
sent_low = sentences.lower()

In [13]:
sent_low

"\nmichael young is a high-achieving fourth-year medical student at\ngeorgia regents university who pulled himself up from rock bottom\nby changing the way he studies.\nyoung entered medical school without the usual foundation of premed\ncoursework. his classmates all had backgrounds in biochemistry,\npharmacology, and the like. medical school is plenty tough under\nany circumstances, but in young's case even more so for lack of a footing.\n"

## 0-2. remove punctuation and special characters

#### Exclude special characters one by one

In [None]:
# from https://www.programiz.com/python-programming/examples/remove-punctuation
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~‘’'''  # special characters to exclude (raw string, so the backslash is taken literally)
sent_low_pnct = ""
for char in sent_low:
    if char not in punctuations:
        sent_low_pnct = sent_low_pnct + char

sent_low_pnct

#### Alternatively, we can use a regular expression to remove punctuation
- That way we don't have to list every special character that we want to remove
- https://docs.python.org/3.4/library/re.html
- https://en.wikipedia.org/wiki/Regular_expression

In [None]:
sent_low

In [None]:
import re
sent_low_pnct2 = re.sub(r'[^\w\s]+', ' ', sent_low)

In [None]:
sent_low_pnct2

- However, the special character ```\n``` (linebreak) still exists in both cases. Let's remove it as well.

In [None]:
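import os
os.linesep  # platform line separator: '\n' on Linux/macOS, '\r\n' on Windows

In [None]:
# The string literal contains '\n' on every platform, so replace '\n' rather than os.linesep
sent_low_pnct = sent_low_pnct.replace("\n", " ")
sent_low_pnct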
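
Another option, sketched here on a made-up string rather than the notebook's variables: `str.split()` with no arguments splits on any run of whitespace (spaces, `\n`, `\t`), so re-joining the pieces with a single space removes linebreaks and repeated spaces in one step.

```python
# Hypothetical sample text standing in for the cleaned paragraph
sample = "\nmichael young is a student\nwho changed  the way he studies\n"

# split() with no arguments splits on any whitespace run and drops empty pieces
normalized = " ".join(sample.split())
print(normalized)
```

This also trims the leading and trailing linebreaks, which `replace` would turn into spaces.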

### And one more way...

In [None]:
import string

string.punctuation

In [None]:
table = str.maketrans(dict.fromkeys(string.punctuation))
no_punctuation= sent_low.translate(table)

print(no_punctuation)

Regular expressions:

`^` means "beginning of string",
UNLESS it's inside `[ ]`, in which case it means "not"

r'^The' # means The at the beginning of a string
r'^[The]' # means any one of T, h, or e at the beginning of a string
r'^[^The]' # means any character other than T, h, or e at the beginning of a string
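
To see the two meanings of `^` side by side, here is a small sketch that tries all three patterns on a few made-up strings (the helper name `caret_demo` is just for illustration):

```python
import re

def caret_demo(text):
    """Return whether text matches each of the three patterns."""
    starts_with_the = bool(re.search(r'^The', text))      # literal "The" at the start
    starts_with_set = bool(re.search(r'^[The]', text))    # first character is T, h, or e
    starts_with_other = bool(re.search(r'^[^The]', text)) # first character is anything else
    return starts_with_the, starts_with_set, starts_with_other

for text in ["The cat", "hat trick", "apple"]:
    print(text, caret_demo(text))
```

Note that "hat trick" matches the second pattern but not the first: inside `[ ]` the letters are a set of single characters, not a word.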

### So... at least 3 possible ways to replace characters!

## 0-3. Remove stop words

- Stop words usually refer to the most common words in a language
    - There is no single universal list of stop words
    - Stop words are often removed to improve the performance of NLP models
    - https://en.wikipedia.org/wiki/Stop_words
    - https://en.wikipedia.org/wiki/Most_common_words_in_English

#### Import the list of stop words from ```spaCy```

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
STOP_WORDS

#### Goal: We are going to count the frequency of each word from the paragraph, to see which words can be used to represent the paragraph's content.

#### What if we do not remove stopwords?

- Note that our paragraph is stored as a single string object...

In [None]:
sent_low_pnct

- Split the paragraph into a list of words

In [None]:
words = sent_low_pnct.split()

In [None]:
words[:10]

- Count the words from the list
- Are the most frequent words ones that could occur in any kind of paragraph?

In [None]:
d = {}
for word in words:
    if word in d:
        d[word] = d[word] + 1
    else:
        d[word] = 1
d

In [None]:
from collections import Counter

In [None]:
Counter(words).most_common(20)

In [None]:
plt.figure(figsize=(45,10))
sns.countplot(x=words, order=pd.Series(words).value_counts().index)
# sns.countplot(words_nostop, order=[counted[0] for counted in Counter(words_nostop).most_common()])
plt.xticks(rotation=90)
plt.show()

#### When we removed stopwords:

In [None]:
words_nostop = list()
for word in words:
    if word not in STOP_WORDS:
        words_nostop.append(word)

### <font color="magenta"> Q1: Re-implement the code in the previous cell using a list comprehension</font>

In [None]:
# insert your code here

- A more comprehensible and distinctive list of words!

### <font color="magenta">Q2: Use a `Counter` to find the frequencies of each word in the `words_nostop` list.</font>

In [None]:
# insert your code here

### <font color="magenta">Q3: Create a bar chart showing the frequencies of the 10 most common words, alphabetically sorted.</font>

In [None]:
# insert your code here

# 1. Extracting linguistic features from spaCy

## 1-1. Tokenize
- Token: a semantic unit for analysis
    - Loosely equivalent to a word
        - ```sent_low_pnct.split()```
    - Tricky cases
        - aren't $\rightarrow$ ![](https://nlp.stanford.edu/IR-book/html/htmledition/img88.png) ![](https://nlp.stanford.edu/IR-book/html/htmledition/img89.png) ? ![](https://nlp.stanford.edu/IR-book/html/htmledition/img86.png) ?
        - O'Neil $\rightarrow$ ![](https://nlp.stanford.edu/IR-book/html/htmledition/img83.png) ? ![](https://nlp.stanford.edu/IR-book/html/htmledition/img84.png) ![](https://nlp.stanford.edu/IR-book/html/htmledition/img81.png) ?
        - https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
- In ```spaCy```:
    - Many token types, like word, punctuation symbol, whitespace, etc.
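
The tricky cases above are easy to reproduce with two naive strategies. This is a sketch using only the standard library on a made-up sentence; neither strategy is what spaCy does internally.

```python
import re

text = "O'Neil aren't sure."

# Strategy 1: split on whitespace only; punctuation stays glued to the words
whitespace_tokens = text.split()

# Strategy 2: keep runs of word characters; apostrophes break words apart
word_char_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)
print(word_char_tokens)
```

Neither result is obviously "right", which is why tokenization is usually left to a library with language-specific rules.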

### Let's dissect the sentence!

- initiating the ```spaCy``` object

In [None]:
# examples partially taken from https://nlpforhackers.io/complete-guide-to-spacy/
import spacy
nlp = spacy.load('en_core_web_md')

In [None]:
type(nlp)

- Our sentence: "Hello World!"
    - Pass the sentence string to the ```spaCy``` object ```nlp```

In [None]:
doc = nlp("Hello World!")

- The sentence is considered as a short document.

In [None]:
print(type(doc), doc)

- When we passed the sentence string in above, ```spaCy``` split the sentence into tokens (tokenization!)

In [None]:
for i,token in enumerate(doc):
    print(i, token)

- Each token also carries index information (```token.idx```: the character offset of the token within the text)

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11|
|---|---|---|---|---|---|---|---|---|---|---|---|
| H | e | l | l | o | _ | W | o | r | l | d | ! |
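
The offsets in the table can be reproduced without spaCy. The sketch below uses a regular expression as a stand-in tokenizer (an assumption for illustration, not spaCy's algorithm) and records where each piece starts.

```python
import re

text = "Hello World!"

# Each match is either a run of word characters or a single non-space symbol
offsets = [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]
print(offsets)
```

Position 5 (the space) starts no token, so the offsets jump from 0 to 6.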

In [None]:
for i, token in enumerate(doc):
    print(i, token.text, token.idx)


- And many more!
    - https://spacy.io/api/token#attributes

In [None]:
sentences

In [None]:
doc = nlp(sentences)

print("text\tidx\tlemma\tlower\tpunct\tspace\tshape\tPOS")
for token in doc:
    if token.is_space:
        print("SPACE")
    else:
        print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
            token.text,
            token.idx,
            token.lemma_,
            token.lower_,
            token.is_punct,
            token.is_space,
            token.shape_,
            token.pos_
    ))


## 1-2. Sentence detection

- For a document with multiple sentences, we need to separate the sentences from each other.
- In ```spaCy```, this job is more convenient (and causes fewer mistakes) than using regular expressions

In [None]:
sentences

In [None]:
# same document, but initiate as the spaCy object...
doc = nlp(sentences)

- Sentences are provided as a generator object
    - Instead of storing all sentences in a list, the generator yields one sentence at a time
    - Iterable (i.e., can be used in a for loop)
    - More efficient memory use
    - https://wiki.python.org/moin/Generators
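
A plain-Python sketch of generator behaviour, independent of spaCy. The function `split_sentences` is hypothetical and splits naively on periods; it is only here to show the laziness, not to detect sentences properly.

```python
def split_sentences(text):
    """Yield one sentence at a time instead of building a full list."""
    for piece in text.split("."):
        piece = piece.strip()
        if piece:
            yield piece + "."

gen = split_sentences("Young entered medical school. His classmates had backgrounds in biochemistry.")

first = next(gen)       # computed on demand
remaining = list(gen)   # consumes the rest of the generator
leftover = list(gen)    # empty: a generator can only be iterated once

print(first, remaining, leftover)
```

If you need to index sentences or loop over them more than once, wrap the generator in `list(...)` first.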

In [None]:
doc.sents

- Printing sentences with the index number

In [None]:
for i, sent in enumerate(doc.sents):
    print(i, sent)

## 1-3. POS tagging

- Goal: find words with a particular part of speech!
- Different part-of-speech words carry different information
    - e.g., noun (subject), verb (action term), adjective (quality of the object)
- https://spacy.io/usage/linguistic-features#pos-tagging

- Yelp review!

In [None]:
# from https://www.yelp.com/biz/ajishin-novi?hrid=juA4Zn2TX7845vNFn4syBQ&utm_campaign=www_review_share_popup&utm_medium=copy_link&utm_source=(direct)
doc = nlp("""One of the best Japanese restaurants in Novi. Simple food, great taste, amazingly price. I visit this place a least twice month.""")

- multiple sentences exist in a document

In [None]:
for i, sent in enumerate(doc.sents):
    print(i, sent)

- Question: which words are adjectives (ADJ)?

In [None]:
for i, sent in enumerate(doc.sents):
    #print("__sentence__:", i)
    #print("_token_ \t _POS_")
    for token in sent:
        if token.pos_ == 'ADJ':
            print(token.text, "\t", token.pos_)

## 1-4. Named Entity Recognition

In [None]:
doc = nlp(sentences)
print([(X.text, X.label_) for X in doc.ents])

In [None]:
url = 'https://fivethirtyeight.com/features/remembering-alex-trebek-the-man-with-all-the-answers/'

In [None]:
!pip install html5lib bs4

In [None]:
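from bs4 import BeautifulSoup
import requests
import re
def url_to_string(url):
    """Fetch a web page and return its visible text as a single string."""
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # drop elements whose text is not part of the article body
    for script in soup(["script", "style", 'aside']):
        script.extract()
    # collapse runs of newlines and tabs into single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
ny_bb = url_to_string(url)
article = nlp(ny_bb)
len(article.ents)  # number of named entities found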

In [None]:
article

In [None]:
labels = [(x.label_,x.text) for x in article.ents]
Counter(labels)

In [None]:
labels = [x.label_ for x in article.ents]
Counter(labels)

In [None]:
labels

In [None]:
plt.figure(figsize=(45,10))
sns.countplot(x=labels, order=pd.Series(labels).value_counts().index)
# sns.countplot(words_nostop, order=[counted[0] for counted in Counter(words_nostop).most_common()])
plt.xticks(rotation=90)
plt.show()

# BREAK

# NLP Part II

# 1. Word embedding

#### Word2Vec
- Developed by [Mikolov et al., 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- Represents the meaning of words as vectors
    - Vector: numeric array
    - Learned by a shallow neural network trained to predict a word from its surrounding words (or the surrounding words from the word)
- Surprisingly, many different kinds of semantic information can be captured by the word vectors of ```Word2Vec```
- (More explanation here: https://www.tensorflow.org/tutorials/representation/word2vec)

In [None]:
! pip install gensim

You will also need to download a pretrained language model: https://github.com/eyaler/word2vec-slim/raw/master/GoogleNews-vectors-negative300-SLIM.bin.gz

In [None]:
import gensim

Change the filepath in the next cell to correspond to the location of the pretrained model file you downloaded above.

In [None]:
w2v_mod = gensim.models.KeyedVectors.load_word2vec_format("~/Downloads/GoogleNews-vectors-negative300-SLIM.bin.gz", binary=True)

## 1-1. Calculating similarity between words

- Q: What's the similarity between *school* and *student*?

- the word vector for *school* looks like this:

In [None]:
w2v_mod['school']

In [None]:
len(w2v_mod['school'])

- and the word vector for *student* looks like this:

In [None]:
w2v_mod['student']

- the similarity between two word vectors is:

In [None]:
w2v_mod.similarity('school', 'student')
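
The score returned above is a cosine similarity: the dot product of the two vectors divided by the product of their lengths. Here is a minimal sketch on tiny hand-made vectors (not real embeddings), using only the standard library so it runs without the model; the function name `cosine` is just for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # opposite directions

print(same_direction, orthogonal, opposite)
```

Only the direction matters, not the length: `[1, 2]` and `[2, 4]` count as maximally similar.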

### <font color='magenta'> Q4: Find a word that is more similar to *school* than *student* is, using this model </font>

In [None]:
# insert your code here

### <font color='magenta'>Q5 Find two words that have a cosine similarity less than .1 </font>
- How would you interpret the results?

In [None]:
# insert your code here

### <font color='magenta'> Q6 Try some other words. Any other interesting findings? </font>
- Give 2 more examples.
- How would you interpret the results?

In [None]:
# insert your code here

### Let's try some examples: words in a semantic space
$\rightarrow$ https://projector.tensorflow.org

### <font color='magenta'> Q7 Any interesting findings from the TensorFlow Projector page? </font>

(type in your response here)

## 1-2. Analogy from word vectors

<img src="https://www.tensorflow.org/images/linear-relationships.png" width="800">

#### Can we approximate the relationship between words by doing - and + operations?

- $woman - man + king \approx ?$
- How does this work?
    - $woman:man \approx x:king$
    - $\rightarrow woman - man \approx x - king$
    - $\rightarrow woman - man + king \approx x$
    - List top-10 words ($x$) that can solve the equation!
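
The arithmetic above can be tried on a toy example first. This is a sketch with hand-made 2-D vectors (axis 0 loosely "royalty", axis 1 loosely "gender"), not real embeddings; the helper names `cosine` and `analogy` are hypothetical, and a library implementation may differ in details such as normalization.

```python
import math

# Hand-made 2-D vectors for illustration only
toy = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Rank the remaining words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # the input words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in positive + negative]
    return sorted(candidates, key=lambda w: cosine(target, vectors[w]), reverse=True)

ranked = analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(ranked)
```

In this toy space, woman - man + king lands exactly on the vector for queen; with real embeddings the match is only approximate.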

In [None]:
w2v_mod.most_similar(positive=['woman', 'king'], negative=['man'])

- $Spain - Germany + Berlin \approx ?$
    - $\rightarrow Spain - Germany \approx x - Berlin$

In [None]:
w2v_mod.most_similar(positive=['Spain', 'Berlin'], negative=['Germany'])

### <font color='magenta'> Q8 Any other interesting examples? </font>
- Give 3 more examples.
- How would you interpret the results?

In [None]:
# insert your code here

## 1-3. Constructing interpretable semantic scales

- So far, we have seen that word vectors carry semantic information effectively (although not perfectly).
- Can we use the semantic space to produce more interpretable results?

- Let's try again with real data points [here](https://projector.tensorflow.org): *politics* words in a *bad-good* PCA space

In [None]:
from scipy import spatial

def cosine_similarity(x, y):
    return(1 - spatial.distance.cosine(x, y))

- Can we reproduce these results with our embedding model?

### Let's plot words in the 2D space
- Using Bad & Good axes
- Calculate the cosine similarity between each word being evaluated (violence, discussion, and issues) and each end of the scale (bad and good)

In [None]:
pol_words_sim_2d = pd.DataFrame([[cosine_similarity(w2v_mod['violence'], w2v_mod['good']), cosine_similarity(w2v_mod['violence'], w2v_mod['bad'])],
                                 [cosine_similarity(w2v_mod['discussion'], w2v_mod['good']), cosine_similarity(w2v_mod['discussion'], w2v_mod['bad'])],
                                 [cosine_similarity(w2v_mod['issues'], w2v_mod['good']), cosine_similarity(w2v_mod['issues'], w2v_mod['bad'])]],
                                index=['violence', 'discussion', 'issues'], columns=['good', 'bad'])

In [None]:
pol_words_sim_2d

- If we plot this:

In [None]:
sns.scatterplot(x='good', y='bad', data=pol_words_sim_2d, hue=pol_words_sim_2d.index)

- violence: less good, more bad
- discussion: less bad, more good
- issues: both bad and good

### Can we do this on a 1D scale?
(bad) --------------------?---- (good)

- First, let's create the vector for *bad-good* scale

In [None]:
scale_bad_good = w2v_mod['good'] - w2v_mod['bad']

In [None]:
cosine_similarity(w2v_mod['good'], w2v_mod['bad'])

In [None]:
len(scale_bad_good)

- Calculate the cosine similarity score of the word *violence* on the *bad-good* scale
    - $sim(V(violence), V(good) - V(bad))$

In [None]:
violence_score = cosine_similarity(w2v_mod['violence'], scale_bad_good)
violence_score

In [None]:
discussion_score = cosine_similarity(w2v_mod['discussion'], scale_bad_good)
discussion_score
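
How to read these scores: the scale vector points from *bad* toward *good*, so a positive cosine similarity places a word toward the good end and a negative one toward the bad end. A sketch with hand-made 2-D vectors (not real embeddings; all values are made up for illustration):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made vectors for illustration only
toy = {
    "good":       [1.0, 0.5],
    "bad":        [-1.0, 0.5],
    "violence":   [-0.8, 0.3],
    "discussion": [0.6, 0.4],
}

# Scale vector pointing from "bad" toward "good"
scale = [g - b for g, b in zip(toy["good"], toy["bad"])]

scores = {word: cosine(toy[word], scale) for word in ("violence", "discussion")}
print(scores)
```

Subtracting the two end points cancels whatever *good* and *bad* have in common (here, the second coordinate), leaving only the direction in which they differ.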

# 2. Sentiment Analysis with NLTK

"The Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language."
For more information, see: https://www.nltk.org/

We are going to use NLTK and spaCy to determine whether text expresses positive sentiment, negative sentiment, or is neutral.

In [None]:
# adapted from https://github.com/Jcharis/Natural-Language-Processing-Tutorials/blob/master/NLP%20with%20SpaCy-%20Adding%20Extensions%20Attributes%20in%20SpaCy(How%20to%20use%20sentiment%20analysis%20in%20SpaCy).ipynb
import nltk

In [None]:
!python -m nltk.downloader book

"VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media."

For more, see: https://github.com/cjhutto/vaderSentiment
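
To get a feel for what "lexicon-based" means, here is a toy sketch. The mini-lexicon and its scores are made up, the function `toy_compound` is hypothetical, and VADER's rules (negation, intensifiers, punctuation, capitalization) are ignored entirely. The squashing step and `alpha=15` are an assumption modelled on the normalization in the vaderSentiment source.

```python
import math

# Hypothetical mini-lexicon with made-up valence scores
toy_lexicon = {"great": 3.1, "good": 1.9, "terrible": -2.1, "bad": -2.5}

def toy_compound(text, alpha=15):
    """Sum the valence of known words, then squash the total into (-1, 1)."""
    total = sum(toy_lexicon.get(word, 0.0) for word in text.lower().split())
    return total / math.sqrt(total * total + alpha)

positive = toy_compound("The introduction was great")
negative = toy_compound("The conclusions were terrible")
mixed = toy_compound("great introduction but terrible conclusions")
neutral = toy_compound("The paper has five sections")

print(positive, negative, mixed, neutral)
```

Words that are not in the lexicon contribute nothing, so a text with no sentiment-bearing words scores exactly 0.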

In [None]:
nltk.download('vader_lexicon')

We are going to extend spaCy's functionality with the `SentimentIntensityAnalyzer` class from NLTK.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sent_analyzer = SentimentIntensityAnalyzer()
def sentiment_scores(docx):
    return sent_analyzer.polarity_scores(docx.text)

In [None]:
import spacy

In [None]:
# loading up the language model: English
nlp = spacy.load('en_core_web_md')

In [None]:
from spacy.tokens import Doc
Doc.set_extension("sentimenter", getter=sentiment_scores, force=True)  # force=True lets this cell be re-run without an error

In [None]:
nlp("This introduction was great but the conclusions were terrible")._.sentimenter

Let's apply this sentiment analysis to product reviews on Amazon

In [None]:
r = pd.read_csv('https://raw.githubusercontent.com/umsi-data-science/data/main/small_reviews.csv',index_col=0)
#random sample of original dataset at https://www.kaggle.com/snap/amazon-fine-food-reviews

In [None]:
r.head()

We'll use the `apply` function to run spaCy's `nlp` pipeline on each review and keep the compound sentiment score.

In [None]:
r['rating'] = r['Text'].apply(lambda x: nlp(x)._.sentimenter['compound'])

In [None]:
r[['Score','rating','Text']].head(10)

In [None]:
r.iloc[6].Text

In [None]:
sns.scatterplot(x='Score',y='rating',data=r)

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
model0 = smf.ols("rating ~ Score ", data=r)
model0.fit().summary()