<a href="https://colab.research.google.com/github/brentn1975/RESTfulAPI/blob/master/intro_to_nlp_workshop_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Google Colab

Colab is essentially Google's way of hosting a [Jupyter notebook](https://jupyter.org/), a very popular tool among data scientists!

It allows us to write code, documentation, and output visuals all in one place.

To be able to run and edit the code in this workshop, please make a copy for yourself:

`File > Save a copy in Drive`

This should open a new tab with your own copy of this notebook. It can take a minute to load.

Colab comes with a lot of great data science libraries pre-installed. 

Colab also gives you some options for running heavy computations, such as training a deep learning model. To access those options:

`Runtime > Change runtime type`, then select `GPU`, `TPU`, or `None`.

We don't need to change anything for this workshop, but it's a great resource if you start learning deep learning and don't have a powerful GPU at home.

# Introduction to NLP

# What is NLP?

Natural Language Processing is the field of leveraging computers to 
understand, analyze, manipulate, and even generate human language.

## Why do we need NLP?

> We can harness the untapped potential of unstructured text. 

> Whereas traditional databases and datasets contain highly structured data in row/columnar format, text is highly irregular and traditional approaches would not



## High-Level Capabilities:

*   Content Categorization
*   Document Summarization
*   Machine Translation
*   Sentiment Analysis
*   Speech-to-text and Text-to-speech
*   Topic Modeling


## How do we prepare the text for processing?

Steps of text cleaning:

1.   Acquire the text
2.   Remove punctuation
3.   Tokenize
4.   Remove stopwords
5.   Stemming or Lemmatization

After the text is clean we can follow up by vectorizing the data:
1.   Bag-of-words
2.   TF-IDF


## We can walk through some simple examples to start

We'll take an example sentence through the text cleaning and vectorization process.

Example:

> "Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.[1][2] Data science is related to data mining, machine learning and big data."


In [None]:
example = "Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.[1][2] Data science is related to data mining, machine learning and big data."

# lowercase it all
lower = example.lower()

Regular Expressions are a powerful way to manipulate text. [Here](https://www.dataquest.io/blog/regex-cheatsheet/) is an additional resource on some of the most common uses.

In [None]:
# We can use the re library to create regular expressions to parse text
import re
import string

# string.punctuation contains most traditional punctuation marks
print(string.punctuation)
# re.escape keeps characters such as ], \ and ^ from being read as regex syntax
pattern = "[{}]".format(re.escape(string.punctuation))

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
no_punctuation = re.sub(pattern, " ", lower)
print(no_punctuation)

data science is an inter disciplinary field that uses scientific methods  processes  algorithms and systems to extract knowledge and insights from many structural and unstructured data  1  2  data science is related to data mining  machine learning and big data 


In [None]:
# \d+ matches a run of digits (inside [...] the + would match a literal plus sign)
no_nums = re.sub(r"\d+", " ", no_punctuation)

print(no_nums)

data science is an inter disciplinary field that uses scientific methods  processes  algorithms and systems to extract knowledge and insights from many structural and unstructured data        data science is related to data mining  machine learning and big data 


In [None]:
# Next step is to tokenize, or split the data into individual chunks
tokens = no_nums.split(' ')

# Now let's inspect it
tokens[0:20]

['data',
 'science',
 'is',
 'an',
 'inter',
 'disciplinary',
 'field',
 'that',
 'uses',
 'scientific',
 'methods',
 '',
 'processes',
 '',
 'algorithms',
 'and',
 'systems',
 'to',
 'extract',
 'knowledge']

In [None]:
# remove the blanks
no_blanks = []

for token in tokens:
  if token != '':
    no_blanks.append(token)

In [None]:
no_blanks[0:20]

['data',
 'science',
 'is',
 'an',
 'inter',
 'disciplinary',
 'field',
 'that',
 'uses',
 'scientific',
 'methods',
 'processes',
 'algorithms',
 'and',
 'systems',
 'to',
 'extract',
 'knowledge',
 'and',
 'insights']
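
As an aside, here is a minimal sketch of a shortcut for the two cells above: calling `split()` with no argument splits on any run of whitespace, so no empty strings are produced and the blank-removal loop is not needed. The `sample` string is a shortened stand-in for `no_nums` (an assumption for illustration), not the workshop variable itself.

```python
# Shortened stand-in for `no_nums`, with the same doubled spaces
sample = "scientific methods  processes  algorithms and systems "

# split(' ') cuts at every single space, leaving '' between doubled spaces
with_blanks = sample.split(' ')

# split() with no argument treats any run of whitespace as one separator
without_blanks = sample.split()

print(with_blanks)
print(without_blanks)
```

The explicit loop is still worth knowing, since the same pattern works for any filtering condition.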

In [None]:
# We have to download the stopwords prior to use with NLTK (if it is the first
# time that you've done this on a new computer)
import nltk
nltk.download('stopwords')

# So now we can move on to stopword removal
from nltk.corpus import stopwords

# What do the stopwords look like?
print(stopwords.words('english')[0:10])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [None]:
# We'll remove the punctuation in stopwords as well
stops_nopunct = [re.sub(pattern, "", stop) for stop in stopwords.words('english')]

print(stops_nopunct[0:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'youre']


In [None]:
def remove_stopwords(words):
  no_stops = []
  for word in words:
    if word not in stops_nopunct:
      no_stops.append(word)
  return no_stops

In [None]:
no_stops = remove_stopwords(no_blanks)
no_stops

['data',
 'science',
 'inter',
 'disciplinary',
 'field',
 'uses',
 'scientific',
 'methods',
 'processes',
 'algorithms',
 'systems',
 'extract',
 'knowledge',
 'insights',
 'many',
 'structural',
 'unstructured',
 'data',
 'data',
 'science',
 'related',
 'data',
 'mining',
 'machine',
 'learning',
 'big',
 'data']
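
The same filtering can be sketched more compactly with a list comprehension and a `set`. This is only an illustration: `stop_set` below is a hypothetical mini stopword list, whereas the workshop uses NLTK's full English list.

```python
# Hypothetical mini stopword list, for illustration only
stop_set = {"is", "an", "that", "and", "to", "from"}

words = ["data", "science", "is", "an", "inter", "disciplinary", "field"]

# Membership tests on a set take constant time on average,
# whereas a list is scanned element by element
kept = [word for word in words if word not in stop_set]

print(kept)
```

On one sentence the difference is invisible, but on thousands of reviews a set lookup can be noticeably faster than scanning a list for every word.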

Additional Information on [PorterStemmer](https://tartarus.org/martin/PorterStemmer/) and [SnowballStemmer](http://snowball.tartarus.org/algorithms/english/stemmer.html)

---



In [None]:
# Must also download WordNet on NLTK (first time)
nltk.download('wordnet')

# Stemming vs Lemmatization
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer

# Stemmers
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')

# Lemmatizer
wn = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
# Let's try a word that has different tenses
test_words = ['waited', 'waits', 'waiting']

for word in test_words:
  print(ps.stem(word), ls.stem(word), ss.stem(word), wn.lemmatize(word))

wait wait wait waited
wait wait wait wait
wait wait wait waiting
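
The stemmers agree on `wait` because they strip suffixes by rule, while the lemmatizer looks words up in WordNet and, by default, treats each word as a noun, which is why `waited` and `waiting` come back unchanged. To make the "strip suffixes by rule" idea concrete, here is a toy sketch. `toy_stem` is a hypothetical helper and far simpler than the Porter, Lancaster, or Snowball algorithms.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["waited", "waits", "waiting"]:
    print(word, "->", toy_stem(word))
```

Real stemmers apply many more rules and exceptions, which is also why they sometimes produce non-words.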


Try each type of stemmer and lemmatizer on any words that you can think of and observe how they change!

In [None]:
# Your code here


In [None]:
import pandas as pd

# We'll create an empty list to store the outputs so we can compare them easily
diff_forms = []

# Process every word with each stemmer and with the lemmatizer
for word in no_blanks:
  diff_forms.append([ps.stem(word), ls.stem(word), ss.stem(word), 
                          wn.lemmatize(word)])

processer_names = ['Porter', 'Lancaster', 'Snowball', 'WordNet_Lemma']

diff_stems_df = pd.DataFrame(diff_forms, columns=processer_names)
diff_stems_df.head(10)

| | Porter | Lancaster | Snowball | WordNet_Lemma |
|---|---|---|---|---|
| 0 | data | dat | data | data |
| 1 | scienc | sci | scienc | science |
| 2 | is | is | is | is |
| 3 | an | an | an | an |
| 4 | inter | int | inter | inter |
| 5 | disciplinari | disciplin | disciplinari | disciplinary |
| 6 | field | field | field | field |
| 7 | that | that | that | that |
| 8 | use | us | use | us |
| 9 | scientif | sci | scientif | scientific |


Resource on [List Comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)

In [None]:
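# The vectorizers expect each document as a single string, so we rejoin the words.
# Stopwords are kept here (no_blanks) to match the outputs below; use no_stops to drop them.
stemmed_text = [' '.join([ps.stem(word) for word in no_blanks])]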

In [None]:
stemmed_text

['data scienc is an inter disciplinari field that use scientif method process algorithm and system to extract knowledg and insight from mani structur and unstructur data data scienc is relat to data mine machin learn and big data']

In [None]:
# Now we can move to vectorize the processed text

# Necessary imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer()
tfidf = TfidfVectorizer()

In [None]:
bow = cv.fit_transform(stemmed_text)
bow

<1x28 sparse matrix of type '<class 'numpy.int64'>'
	with 28 stored elements in Compressed Sparse Row format>

But what does this output look like?

In [None]:
print(bow.todense())

[[1 1 4 1 5 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1]]


In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())

['algorithm', 'an', 'and', 'big', 'data', 'disciplinari', 'extract', 'field', 'from', 'insight', 'inter', 'is', 'knowledg', 'learn', 'machin', 'mani', 'method', 'mine', 'process', 'relat', 'scienc', 'scientif', 'structur', 'system', 'that', 'to', 'unstructur', 'use']


In [None]:
pd.DataFrame(bow.todense(), columns=cv.get_feature_names_out())

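| | algorithm | an | and | big | data | disciplinari | extract | field | from | insight | inter | is | knowledg | learn | machin | mani | method | mine | process | relat | scienc | scientif | structur | system | that | to | unstructur | use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4 | 1 | 5 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 |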
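
To see what the bag-of-words step is doing, here is a minimal sketch using only the standard library. It counts the words of the second stemmed sentence and lays them out in alphabetical order, which mirrors the feature order shown above. It does not reproduce everything `CountVectorizer` does (for example its tokenization rules).

```python
from collections import Counter

# Second sentence of the stemmed example text
text = "data scienc is relat to data mine machin learn and big data"

counts = Counter(text.split())

# Columns are the sorted unique words, mirroring the alphabetical feature names
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

Each document becomes one such row of counts, and word order within the document is discarded.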


$tfidf(t, d, D) = tf(t,d) \cdot idf(t,D)$

In [None]:
tfidf_out = tfidf.fit_transform(stemmed_text)
tfidf_out.toarray()

array([[0.11470787, 0.11470787, 0.45883147, 0.11470787, 0.57353933,
        0.11470787, 0.11470787, 0.11470787, 0.11470787, 0.11470787,
        0.11470787, 0.22941573, 0.11470787, 0.11470787, 0.11470787,
        0.11470787, 0.11470787, 0.11470787, 0.11470787, 0.11470787,
        0.22941573, 0.11470787, 0.11470787, 0.11470787, 0.11470787,
        0.22941573, 0.11470787, 0.11470787]])
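
Why do these weights look like rescaled counts? With a single document, every term appears in that one document, so the idf factor is the same for every term and only the term frequencies and the final normalization matter. The sketch below works this through by hand. It assumes scikit-learn's default settings (smoothed idf, $\ln\frac{1+n}{1+df} + 1$, followed by L2 normalization) and is an illustration, not the library's implementation.

```python
import math
from collections import Counter

text = ("data scienc is an inter disciplinari field that use scientif method "
        "process algorithm and system to extract knowledg and insight from "
        "mani structur and unstructur data data scienc is relat to data mine "
        "machin learn and big data")

tf = Counter(text.split())
n_docs = 1

# Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1, with df = 1 for every term
idf = {term: math.log((1 + n_docs) / (1 + 1)) + 1 for term in tf}

raw = {term: tf[term] * idf[term] for term in tf}

# L2-normalise so the vector has unit length
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}

print(round(weights["data"], 8), round(weights["and"], 8), round(weights["big"], 8))
```

The idf term only starts to separate common words from rare ones once there are several documents, as in the real dataset that follows.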

### Now for a real dataset

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

In [None]:
# We'll load the IMDB Reviews dataset via TensorFlow's dataset library
imdb = tfds.load('imdb_reviews', split='train', shuffle_files=True)

In [None]:
# This cell may take a moment to run...
df = tfds.as_dataframe(imdb)

In [None]:
df = pd.DataFrame(df)

In [None]:
df.head()

In [None]:
# We'll convert the text column from bytes to text
df.text = df.text.apply(bytes.decode)

In [None]:
# How many positive reviews vs negative reviews are there?
df.label.value_counts()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

df.label.plot(kind='hist');

In [None]:
df.head()

In [None]:
type(df)

In [None]:
# What does an example review actually look like?
df.text[0]

In [None]:
df.columns

In [None]:
# Fortunately, there are many libraries that can do all the data
# cleaning we need in a couple of simple steps

In [None]:
import gensim

from gensim import parsing

In [None]:
# This may take a few seconds to run...
processed_text = parsing.preprocess_documents(df.text)

In [None]:
processed_text[0][0:10]

In [None]:
# Each entry of processed_text is a list of tokens; rejoin it into one string
df['cleaned_text'] = [' '.join(tokens) for tokens in processed_text]

In [None]:
df.head(3)

In [None]:
from sklearn.model_selection import train_test_split

# Now we will split our dataset into training and test sets
text_train, text_test, y_train, y_test = train_test_split(df.drop(['label', 'text'], axis=1),
                                                          df['label'], stratify=df['label'],
                                                          random_state=2020)

In [None]:
cv = CountVectorizer()
tfidf = TfidfVectorizer()

In [None]:
# We fit on only the training set
tfidf.fit(text_train.cleaned_text)

In [None]:
# And transform the text using our fit transformer
tfidf_out = tfidf.transform(text_train.cleaned_text)

In [None]:
tfidf_out

In [None]:
from sklearn.linear_model import LogisticRegression

import numpy as np

# Here we create a logistic regression model that can be used for binary predictions
lr = LogisticRegression()
lr.fit(tfidf_out, list(y_train))

In [None]:
from sklearn import metrics

# We reuse the vectorizer fit on the training set, so words that did not
# appear in the training data are ignored

test_tfidf_out = tfidf.transform(text_test.cleaned_text)

# Then we use that output matrix and store the logistic regression model's
# predictions

predicted = lr.predict(test_tfidf_out)

Learn all about [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html)

In [None]:
accuracy = metrics.accuracy_score(y_test, predicted)
print(f'Accuracy: {accuracy}')

precision = metrics.precision_score(y_test, predicted)
print(f'Precision: {precision}')

recall = metrics.recall_score(y_test, predicted)
print(f'Recall: {recall}')

f1 = metrics.f1_score(y_test, predicted)
print(f'F1 Score: {f1}')
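
To make these four numbers less of a black box, here is a small sketch that computes them by hand from the counts of true/false positives and negatives. The labels are hypothetical and chosen only for illustration. They are not taken from the IMDB results.

```python
# Hypothetical labels for illustration (1 = positive review, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # correctly predicted positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # negatives predicted as positive
fn = sum(t == 1 and p == 0 for t, p in pairs)  # positives predicted as negative
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly predicted negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Precision and recall can differ a lot even when accuracy looks reasonable, which is why it helps to report all of them.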

In [None]:
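# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2
metrics.RocCurveDisplay.from_estimator(lr, test_tfidf_out, y_test)

metrics.PrecisionRecallDisplay.from_estimator(lr, test_tfidf_out, y_test);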

In [None]:
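test_review = "This was the best movie ever, except for the Matrix"

# Clean and vectorize the new review exactly as we did the training data
test_cleaned = parsing.preprocess_documents([test_review])
test_review_tfidf = tfidf.transform([' '.join(x) for x in test_cleaned])

# A nonzero sum means at least one word was found in the training vocabulary
print(test_review_tfidf.todense().sum(axis=1))
lr.predict(test_review_tfidf)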

# What's next?

### Keep Learning:
The most important thing to do is keep learning!
- [Intro to Python Part 2](https://www.eventbrite.com/e/intro-to-python-part-2-live-online-tickets-127409317699)
- [Data Science Prep Course](https://bit.ly/DSIPREP-32q7lQj) 


See all upcoming Galvanize online events [here](https://www.hackreactor.com/webinars)

### More challenge ideas:
- Find new datasets to try and apply NLP
- Try using NLP to create features for a regression model
- Can you find any relationships between positive reviews and the product price?

### Stay Connected:
- LinkedIn: [https://www.linkedin.com/in/andrewmeans/](https://www.linkedin.com/in/andrewmeans/)

- Email: andrew.means@galvanize.com

Additional Exercise:

Try the preprocessing steps above on the dataset below.

In [None]:
# Pet Supply Reviews from Amazon provided by Stanford
!wget http://snap.stanford.edu/data/amazon/Pet_Supplies.txt.gz

--2020-12-08 01:12:33--  http://snap.stanford.edu/data/amazon/Pet_Supplies.txt.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49313367 (47M) [application/x-gzip]
Saving to: ‘Pet_Supplies.txt.gz’


2020-12-08 01:12:38 (9.08 MB/s) - ‘Pet_Supplies.txt.gz’ saved [49313367/49313367]



In [None]:
# Also required for the parser
!pip install simplejson


In [None]:
# Adapted from the parser provided on the original dataset website

import gzip
import simplejson

def parse(filename):
  # Each record is a run of "key: value" lines; a line with no colon ends it
  with gzip.open(filename, 'rt') as f:
    entry = {}
    for l in f:
      l = l.strip()
      colonPos = l.find(':')
      if colonPos == -1:
        yield entry
        entry = {}
        continue
      eName = l[:colonPos]
      rest = l[colonPos+2:]
      entry[eName] = rest
    yield entry
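
To get a feel for the input this parser expects, here is a small sketch that runs the same grouping logic on an in-memory sample. The record layout in `sample` is an assumption inferred from the parser (blocks of `key: value` lines separated by blank lines), and `parse_lines` is a hypothetical helper. Unlike `parse`, it skips empty records.

```python
def parse_lines(lines):
    """Group 'key: value' lines into one dict per record (hypothetical helper)."""
    entry = {}
    for line in lines:
        line = line.strip()
        colon_pos = line.find(':')
        if colon_pos == -1:
            # A line without a colon (here, a blank line) closes the current record
            if entry:
                yield entry
            entry = {}
            continue
        # Skip the colon and the space after it to get the value
        entry[line[:colon_pos]] = line[colon_pos + 2:]
    if entry:
        yield entry

# Assumed layout: one field per line, blank line between records
sample = """product/productId: B000O1CRYW
product/price: 6.95
review/score: 5.0

product/productId: B0002ARHAE
product/price: 3.73
review/score: 5.0
"""

records = list(parse_lines(sample.splitlines()))
print(records)
```

Note that every value comes back as a string, so numeric fields such as the score need converting before any arithmetic.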

In [None]:
review_rows = [e for e in parse("Pet_Supplies.txt.gz")]

In [None]:
review_df = pd.DataFrame.from_records(review_rows)

In [None]:
review_df.head()

| | product/productId | product/title | product/price | review/userId | review/profileName | review/helpfulness | review/score | review/time | review/summary | review/text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B000O1CRYW | Orbee Tuff Ball Orange - SMALL | 6.95 | A2FEQ9XL6ML51C | Just an everyday Dad | 1/1 | 5.0 | 1286064000 | Little Ball, for Little Dogs... | Great Toy, hard to find! We get ours online he... |
| 1 | B000O1CRYW | Orbee Tuff Ball Orange - SMALL | 6.95 | A183LI95B2WNUQ | V. J. Mcmillen "vmcmillen" | 1/1 | 5.0 | 1230249600 | glow ball | I have bought several of these small Orbee Tuf... |
| 2 | B000O1CRYW | Orbee Tuff Ball Orange - SMALL | 6.95 | A1LSSENM0XIQQR | gcoronado4 | 0/0 | 3.0 | 1309046400 | Too Big | It is a quality ball but the small is still to... |
| 3 | B000O1CRYW | Orbee Tuff Ball Orange - SMALL | 6.95 | A2E5PZE1PZVK38 | jerry | 0/0 | 5.0 | 1308873600 | no good | I gave it 5 stars because my little dog had so... |
| 4 | B0002ARHAE | Kent Marine Pro-Clear Freshwater Clarifier | 3.73 | A3PXLJE4OPIQTY | M. Thomas "sea_anemone" | 0/0 | 5.0 | 1356912000 | Best clarifier ever | I've used many products to try and help the wa... |


In [None]:
# Download the lexicon used by the VADER sentiment analyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()



In [None]:
for i in range(5):
  print(review_df['review/text'][i])
  print(vader.polarity_scores(review_df['review/text'][i]))
  print()

Great Toy, hard to find! We get ours online here or shipped from a friend in America!Our little Papillion-Yorkie mix loves it. Every night I am in the back garden kicking the ball for him! He destroys tennis balls by chewing off the fluff and our wee dog finds them to big to carry/catch. We stumbled across this in our local petstore and our dog was hooked! Sadly, local pet store no longer imports them.Stars all around for this one!
{'neg': 0.108, 'neu': 0.764, 'pos': 0.128, 'compound': 0.5999}

I have bought several of these small Orbee Tuff balls, because my dog simple loves them! The orange ball does glow in the dark that makes it easy to find. I have nothing but good things to say about these small balls. The only down side is Trix, my dog, keeps loosing them!
{'neg': 0.056, 'neu': 0.857, 'pos': 0.087, 'compound': 0.1962}

It is a quality ball but the small is still too big and heavy for my little yorkie to play with comfortably.
{'neg': 0.0, 'neu': 0.748, 'pos': 0.252, 'compound': 
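
The `compound` value is the single summary score most people use. One way to turn it into a label is sketched below. The cutoff of ±0.05 is a commonly used convention and should be treated as an assumption to tune, and `label_from_compound` is a hypothetical helper, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label (threshold is an assumed convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of the first two reviews printed above
for compound in (0.5999, 0.1962):
    print(compound, label_from_compound(compound))
```

Comparing these labels with the star ratings in `review/score` is one quick way to check how well the lexicon fits this dataset.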

In [None]:
# Score the first 500 reviews with VADER
first_500 = review_df['review/text'][0:500].apply(vader.polarity_scores)

In [None]:
plt.bar(review_df['review/score'][0:500], [x.get('pos') for x in first_500]);