
*Unit 4, Sprint 1, Module 1*

---
<h1 id="moduleTitle"> Natural Language Processing Introduction (Prepare)</h1>

"Natural" meaning - not computer languages but spoken/written human languages. The hard thing about NLP is that human languages are far less structured or consistent than computer languages. This is perhaps the largest source of difficulty when trying to get computers to "understand" human languages. How do you get a machine to understand sarcasm, and irony, and synonyms, connotation, denotation, nuance, and tone of voice --all without it having lived a lifetime of experience for context? If you think about it, our human brains have been exposed to quite a lot of training data to help us interpret languages, and even then we misunderstand each other pretty frequently.
    

<h2 id='moduleObjectives'>Learning Objectives</h2>

By the end of this module, a student should be able to:
* <a href="#p1">Objective 1</a>: Tokenize text
* <a href="#p2">Objective 2</a>: Remove stop words from text
* <a href="#p3">Objective 3</a>: Perform stemming and lemmatization on tokens

## Conda Environments (OMIT)

You will be completing each module this sprint on your machine. We will be using conda environments to manage the packages and their dependencies for this sprint's content. In a classroom setting, instructors typically abstract away environment management for you. However, environment management is an important professional data science skill. We showed you how to manage environments using pip and virtual environments during Unit 3, but in this sprint, we will introduce an environment management tool common in the data science community:

> __conda__: Package, dependency and environment management for any language—Python, R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.

The easiest way to install conda on your machine is via the [Anaconda Distribution](https://www.anaconda.com/distribution/) of Python & R. Once you have conda installed, read ["A Guide to Conda Environments"](https://towardsdatascience.com/a-guide-to-conda-environments-bc6180fc533). This article provides an introduction to some of the conda basics. If you need additional help getting started, the official ["Getting started with conda"](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html) guide will point you in the right direction.

:snake:

To get the sprint environment set up:

1. Open your command line tool (Terminal for MacOS, Anaconda Prompt for Windows)
2. Navigate to the folder with this sprint's content. There should be a `requirements.txt`
3. Run `conda create -n U4-S1-NLP python==3.7` => You can also rename the environment if you would like. Once the command completes, your conda environment should be ready.
4. Now we are going to add the required Python packages for this sprint. You will need to 'activate' the conda environment: `source activate U4-S1-NLP` on Terminal or `conda activate U4-S1-NLP` on Anaconda Prompt. Once your environment is activated, run `pip install -r requirements.txt`, which will install the required packages into your environment.
5. We are also going to add an IPython kernel reference to your conda environment, so we can use it from JupyterLab.
6. Next, run `python -m ipykernel install --user --name U4-S1-NLP --display-name "U4-S1-NLP (Python3)"` => This will add a JSON object to an IPython file, so JupyterLab will know that it can use this isolated instance of Python. :)
7. Last step: we need to install the models for Spacy. Run these commands: `python -m spacy download en_core_web_md` and `python -m spacy download en_core_web_lg`
8. Deactivate your conda environment and launch JupyterLab. You should now see "U4-S1-NLP (Python3)" in the list of available kernels on the launch screen.

# 0. Colab notebook setup
Start running the notebook here.

## 0.1 Download the required spacy module that we'll use later
*Note -- you need to restart the runtime right after running this cell!*

In [None]:
%%time
# You'll use en_core_web_sm for the sprint challenge due to memory constraints on Codegrader
#!python -m spacy download en_core_web_sm

# Locally (or on Colab) let's use en_core_web_md
!python -m spacy download en_core_web_md # Can do lg, but it takes a while
# Also on Colab, need to restart runtime after this step!
#      or else Colab won't find spacy

## 0.2 Restart the runtime!
Click on "Runtime" in the menu bar, and select "Restart runtime" from the dropdown menu.

## 0.3 Install dependencies

In [None]:
# Dependencies for the week (instead of conda)
# Run if you're using colab, otherwise you should have a local copy of the data
!wget https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/main/requirements.txt
!pip install -r requirements.txt

## 0.4 Import libraries and load packages

In [None]:
%%time
"""
Import Statements
"""

# Base
from collections import Counter
import re
import pandas as pd

# Plotting
import squarify
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# NLP Libraries
import spacy
from spacy.tokenizer import Tokenizer
from nltk.stem import PorterStemmer

## 0.5 Get the Amazon reviews data and `unzip` it
We can access the Amazon reviews data in this Colab notebook by cloning the `Unit-4-Sprint-1` repo!

In [None]:
# clone the Unit-4-Sprint-1 repo
!git clone https://github.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP.git
# Find the path to the reviews data zip file, using the file icon on the left sidebar
!unzip /content/DS-Unit-4-Sprint-1-NLP/module1-text-data/data/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv.zip

In [None]:
df = pd.read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')
print(df.shape)
df.head()

In [None]:
df['reviews.text'][100]

In [None]:
df['primaryCategories'][100]

In [None]:
type(df['reviews.text'][100])

In [None]:
df.info()

# 1. Tokenization and Text Preprocessing, Part 1
<a id="p1"></a>

## Overview

> **token**: an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing

> [_*Introduction to Information Retrieval*_](https://nlp.stanford.edu/IR-book/)


### The attributes of good tokens

* Should be stored in an iterable data structure
  - Allows analysis of the "semantic unit"
* Should be all the same case
  - Reduces the complexity of our data
* Should be free of non-alphanumeric characters (i.e., punctuation, whitespace)
  - Removes information that is probably not relevant to the analysis

Let's pretend we are trying to analyze the random sequence here. Question: what is the most common character in this sequence?

In [None]:
random_seq = "AABAAFBBBBCGCDDEEEFCFFDFFAFFZFGGGGHEAFJAAZBBFCZ"

A useful unit of analysis for us is going to be a letter or character

In [None]:
tokens = list(random_seq)
print(tokens)

Our tokens are already "good": in an iterable data structure, all the same case, and free of noise characters (punctuation, whitespace), so we can jump straight into analysis.

In [None]:
plt.figure(figsize=(7,7))
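# Pass the tokens as the x variable; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=tokens);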

The most common character in our sequence is "F". We can't just glance at the sequence to know which character is the most common. We (humans) struggle to subitize complex data (like random text sequences).

> __Subitizing__ is the ability to tell the number of objects in a set, quickly, without counting.

We need to chunk the data into countable pieces ("tokens") for us to analyze them. This inability to subitize text data is the motivation for our discussion today.
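
If you'd rather check the answer without a plot, here is a minimal sketch that counts the character tokens with `Counter` from the standard library. The variable names `char_counts` and `top_three` are just illustrative.

```python
from collections import Counter

random_seq = "AABAAFBBBBCGCDDEEEFCFFDFFAFFZFGGGGHEAFJAAZBBFCZ"

# Iterating over a string yields one character at a time,
# so Counter tallies how often each character token occurs.
char_counts = Counter(random_seq)

# most_common(n) returns the n largest tallies as (token, count) pairs.
top_three = char_counts.most_common(3)
print(top_three)
```

Once the sequence is chunked into tokens, counting is a one-liner. The hard part with real text is deciding what the tokens should be.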

### 1.1 Tokenizing with Pure Python

In [None]:
sample = "Friends, Romans, countrymen, lend me your ears;"

In [None]:
sample2 = sample +'..., 911'
print(sample2)

Use Python's built-in `re` (regular expressions) module

In [None]:
import re
re.sub('[^a-zA-Z 0-9]','', sample2)

##### Iterable Tokens

A string object in Python is already iterable. However, the item you iterate over is a character not a token:

```
from time import sleep
for num, character in enumerate(sample):
    sleep(.5)
    print(f"Char {num} - {character}", end="\r")
```

If we instead care about the words in our sample (our semantic unit), we can use the string method `.split()` to split the string on whitespace and create iterable units. :)

In [None]:
sample.split()

In [None]:
sample.split(',')

### 1.2 Case Normalization
A common data cleaning task with tokens is to standardize or normalize the case. Normalizing case reduces the chance that you have duplicate records for things which have practically the same semantic meaning. You can use either the `.lower()` or `.upper()` string methods to normalize case.

Consider the following example:

In [None]:
# Get the count of how many times each unique brand occurs
# Notice anything odd here?
print(df['brand'].unique())
print(df['brand'].value_counts())

#### Let's use `pandas` to fix the problem!
We `apply` the `.lower()` method

In [None]:
### BEGIN SOLUTION

### END SOLUTION

# Much cleaner
df['brand'].value_counts()
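
To see the effect of case normalization in isolation, here is a small plain-Python sketch. The `brands` list is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

# Hypothetical brand strings, invented for this example (not the real column values).
brands = ["Amazon", "amazon", "AMAZON", "AmazonBasics", "amazonbasics"]

# Without normalization, every spelling variant is counted as its own brand.
raw_counts = Counter(brands)

# Lowercasing first collapses the variants into one record per brand.
normalized_counts = Counter(brand.lower() for brand in brands)

print(len(raw_counts), "distinct values before normalizing")
print(len(normalized_counts), "distinct values after normalizing")
```

On a pandas column, the vectorized `.str.lower()` accessor applies the same transformation to every row at once.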

### 1.4 Using `regex` to remove punctuation
`regex` is a powerful mini-language that allows you to search and match patterns in strings. If you haven't used it before, no worries, now is your chance to learn a bit about it! As a software engineer you will find yourself using `regex` surprisingly often!


Read this excellent article [Easiest way to remember Regular Expressions (Regex)](https://towardsdatascience.com/easiest-way-to-remember-regular-expressions-regex-178ba518bebd) as a quick introduction! <br>

Also useful is the [regular expressions cheat sheet](https://www.dataquest.io/blog/regex-cheatsheet/) from dataquest.io

Finally [regex101](https://regex101.com/) offers an interactive `regex` checker, where you can test whether your `regex` code does what you intended it to do!

#### First `regex` example
Suppose we want to keep only alphanumeric characters and spaces.
Everything else is probably noise: just punctuation and other special characters. This one is a little bit more complicated than our previous example. Here we will have to import Python's `regex` package, `re` (regular expressions). <br>

The `regex` pattern for this task is `'[^a-zA-Z 0-9]'`, which matches any character that is **not** in the set {lower case letters, upper case letters, spaces, and digits}

In [None]:
sample = sample + '..., 911'
print(sample)

We'll use the `re.sub()` method to replace the characters matching that pattern with `''`, an empty string, effectively getting rid of them.

In [None]:
# replace (sub) "everything that is NOT lower-case or upper-case or numerical or space" with empty string ""
import re
sample = re.sub('[^a-zA-Z 0-9]', '', sample)

In [None]:
sample

Next we can use `python`'s  `lower()` and `split()` methods <br>
to convert upper case characters to lower case, then split the string on whitespace, producing a list of tokens.

In [None]:
#split into words and lower case
sample.lower().split()

Congratulations, you have just learned all the steps to clean and tokenize a text string!

### 1.5 Five Minute Challenge: build your own tokenizer
- Complete the function `tokenize` below
- Combine the methods which we discussed above to clean and tokenize a text string.
- Your function should remove punctuation and special characters, split the text string into words, and lower case all capital letters
- You can put the methods in any order you want

In [None]:
def tokenize(text):
    """Parses a string into a list of semantic units (words)

    Args:
        text (str): The string that the function will tokenize.

    Returns:
        list: tokens parsed out by the mechanics of your choice
    """

    ### BEGIN SOLUTION

    ### END SOLUTION
    return tokens

In [None]:
# this should be your output
tokenize(sample)
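
If you get stuck, here is one possible sketch of `tokenize` that chains the three steps from this section. It is only one of several reasonable orderings, and the `example` string is just the sample sentence written out again.

```python
import re

def tokenize(text):
    """Parse a string into a list of lowercase word tokens."""
    # Keep only letters, digits, and spaces (the same pattern used earlier in this section).
    cleaned = re.sub(r'[^a-zA-Z 0-9]', '', text)
    # Lowercase everything, then split on whitespace to get the word tokens.
    tokens = cleaned.lower().split()
    return tokens

example = "Friends, Romans, countrymen, lend me your ears;..., 911"
print(tokenize(example))
```

One design caveat: this pattern also deletes newlines and apostrophes, so words separated only by a line break get glued together and "it's" becomes "its". Replacing unwanted characters with a space instead of an empty string is a common alternative.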

# 2. Tokenization and text preprocessing, part 2

Our inability to analyze text data becomes quickly amplified in a business context. Consider the following:

A business which sells widgets also collects customer reviews of those widgets. When the business first started out, they had a human read the reviews to look for patterns. Now, the business sells thousands of widgets a month. The human readers can't keep up with the pace of reviews to synthesize an accurate analysis. They need some science to help them analyze their data.

Now, let's pretend that business is Amazon, and the widgets are Amazon products such as the Alexa, Echo, or other AmazonBasics products. Let's analyze their reviews with some counts. This dataset is available on [Kaggle](https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products/).

In [None]:
df.head(2)

###  2.1 Counting word occurrences

In [None]:
# Counting occurrences of each distinct review (whole text strings, not individual words)
df['reviews.text'].value_counts()[:10]

In [None]:
# Fraction of all reviews accounted for by each distinct review text
# (normalize=True divides by the number of reviews, not the number of words)
# Look at the 10 most frequent review texts
df['reviews.text'].value_counts(normalize=True)[:10]

### 2.2 Use your tokenizer to tokenize the reviews

In [None]:
### BEGIN SOLUTION

# tokenize reviews.text


### END SOLUTION

Document is a text string

In [None]:
df['reviews.text'].iloc[0]

Tokenized document is a list of tokens

In [None]:
df['tokens'].iloc[0]

Let's take a smaller subset of the data so that our demonstration code will run faster.

In [None]:
# view count of primaryCategories
df['primaryCategories'].value_counts()

In [None]:
# Take the subset of df whose primaryCategories value is exactly 'Electronics'
df = df[df['primaryCategories'] == 'Electronics'].copy()
print(df.shape)
df.head()

In [None]:
df['tokens'][:5]

### 2.3 Analyzing Tokens

In [None]:
# Object from Base Python
from collections import Counter

### BEGIN SOLUTION
# The object `Counter` takes an iterable, but you can instantiate an empty one and update it.


# Update it based on a split of each of our documents


# Print out the 10 most common words


### END SOLUTION

In [None]:
type(word_counts)
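
As a hint for the cell above, here is a minimal sketch of the "instantiate an empty `Counter` and update it" pattern. The `docs` list is a made-up stand-in for `df['tokens']`.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for df['tokens'].
docs = [
    ["great", "tablet", "great", "price"],
    ["battery", "life", "is", "great"],
    ["tablet", "for", "kids"],
]

# Start with an empty Counter, then fold in one document at a time.
word_counts = Counter()
for doc_tokens in docs:
    word_counts.update(doc_tokens)  # adds this document's token tallies to the running total

# The two most frequent tokens across all documents.
print(word_counts.most_common(2))
```

The same loop works if you iterate over a pandas Series of token lists instead of a plain list.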



Below we have created a function `count()` which takes a corpus of tokenized documents <br>`df['tokens']` as its input and returns a dataframe of word counts and associated descriptive statistics.<br><br>

To understand this `count()` function, let's use a "top-down" approach: <br>
so first, we'll scroll down to and read "Summary of the descriptive token statistics",<br> then read the code cell below "Make our count object" and have a look at the dataframe that is produced.<br><br>
Keeping the structure of that dataframe in mind will make it easier to see what the `count()` function is doing.<br>
Let's go through the code below and understand it line by line:

In [None]:
def count(token_lists):
    """
    Calculates some basic statistics about tokens in our corpus (a corpus is a collection of text documents)
    """
    # stores the count of each token
    word_counts = Counter()

    # stores the number of docs that each token appears in
    appears_in_docs = Counter()

    total_docs = len(token_lists)

    for token_list in token_lists:
        # stores count of every appearance of a token
        word_counts.update(token_list)

        # use set() in order to not count duplicates, thereby count the num of docs that each token appears in
        appears_in_docs.update(set(token_list))

    # build word count dataframe
    word_count_dict = zip(word_counts.keys(), word_counts.values())
    wc = pd.DataFrame(word_count_dict, columns = ['word', 'count'])

    # rank the word counts
    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()

    # calculate each token's fraction of the total token count
    wc['fraction_of_total'] = wc['count'].apply(lambda token_count: token_count / total)

    # calculate the cumulative fraction of the total, in rank order
    wc = wc.sort_values(by='rank')
    wc['cumulative_fraction_of_total'] = wc['fraction_of_total'].cumsum()

    # create dataframe for document stats
    t2 = zip(appears_in_docs.keys(), appears_in_docs.values())
    ac = pd.DataFrame(t2, columns=['word', 'appears_in_docs'])

    # merge word count stats with doc stats
    wc = ac.merge(wc, on='word')

    wc['appears_in_fraction_of_docs'] = wc['appears_in_docs'].apply(lambda x: x / total_docs)

    return wc.sort_values(by='rank')

Note that we use `set(token_list)` to update the count in `appears_in_docs`. <br>
In `python`, a `set` is an unordered collection of **unique** values; it can be written with curly braces around its elements, e.g. `{1, 2, 3}` (note that empty braces `{}` create a `dict`, not a `set`).<br>
Thus the Counter for a given word in `appears_in_docs` is incremented at most once per document, no matter how many times the word is used in that document.<br>
So the `appears_in_docs` Counter registers the total number of documents each word appears in.
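
The difference between the two counters is easiest to see on a tiny corpus. Here is a sketch of just that part of `count()`, using three made-up documents.

```python
from collections import Counter

# Hypothetical tokenized documents, invented for this example.
docs = [
    ["good", "good", "good", "battery"],
    ["battery", "died"],
    ["good", "price"],
]

word_counts = Counter()      # every appearance of a token
appears_in_docs = Counter()  # number of documents containing the token

for token_list in docs:
    word_counts.update(token_list)
    # set() drops repeats, so each token adds at most 1 per document.
    appears_in_docs.update(set(token_list))

print("good:", word_counts["good"], "appearances in", appears_in_docs["good"], "documents")
print("battery:", word_counts["battery"], "appearances in", appears_in_docs["battery"], "documents")
```

"good" is used four times but only in two documents, which is exactly the distinction between `count` and `appears_in_docs` in the dataframe.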

In [None]:
this_is_a_set = {1, 2, 2, 2, 3, 4, 5}
print(type(this_is_a_set))
print(this_is_a_set)

#### Summary of the descriptive token statistics

`word` The specific token that is being analyzed

`appears_in_docs` Number of documents that the word/token appears in

`count` The total number of appearances of that token within the corpus

`rank` Ranking of tokens by count

`fraction_of_total` Fraction of the total tokens that this token makes up

`cumulative_fraction_of_total` Sum of fractional total of ranked tokens, down to and including this token.

`appears_in_fraction_of_docs` Fraction of documents that token appears in
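
To make the `rank`, `fraction_of_total`, and `cumulative_fraction_of_total` columns concrete, here is a small plain-Python sketch with made-up counts. It mirrors the idea of those columns, not the exact pandas code in `count()`.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical token counts, invented for this example.
word_counts = Counter({"great": 5, "tablet": 3, "battery": 2})
total = sum(word_counts.values())

# most_common() with no argument sorts tokens from highest to lowest count.
ranked = word_counts.most_common()

# Each token's share of all tokens, and the running sum of those shares.
fractions = [token_count / total for _, token_count in ranked]
cumulative = list(accumulate(fractions))

for rank, ((word, token_count), fraction, running) in enumerate(
        zip(ranked, fractions, cumulative), start=1):
    print(rank, word, token_count, round(fraction, 2), round(running, 2))
```

The cumulative column always ends at 1.0, and how quickly it gets there tells you how much of the corpus the top-ranked tokens account for.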

Make our `count` object

In [None]:
# Use the count function
wc  = count(df['tokens'])
print(wc.shape)
wc.head()

In [None]:
# Cumulative Distribution Plot
plt.figure(figsize=(7,7))
sns.lineplot(x='rank', y='cumulative_fraction_of_total', data=wc);
plt.grid()

In [None]:
wc[wc['rank'] <= 250]['cumulative_fraction_of_total']

In [None]:
wc[wc['rank'] <= 100]['cumulative_fraction_of_total']

### `squarify` shows the most frequent words

In [None]:
import squarify
import matplotlib.pyplot as plt

wc_top20 = wc[wc['rank'] <= 20]

plt.figure(figsize=(7,7))
squarify.plot(sizes=wc_top20['fraction_of_total'], label=wc_top20['word'], alpha=.8 )
plt.axis('off')
plt.show()

### 2.4 Processing Raw Text with Spacy

Spacy's data model for documents is unique among NLP libraries. Instead of storing the document's components in various data structures, Spacy indexes components and simply stores the lookup information.

This is one reason Spacy is often considered more production-grade than a library like NLTK.

In [None]:
import spacy
nlp = spacy.load('en_core_web_md')

In [None]:
sample = """
Natural Language Processing Summary
The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics (Wikipedia).

“Natural Language Processing is a field that covers computer understanding and manipulation of human language, and it’s ripe with possibilities for news gathering,” Anthony Pesce said in Natural Language Processing in the kitchen. “You usually hear about it in the context of analyzing large pools of legislation or other document sets, attempting to discover patterns or root out corruption.”
"""

In [None]:
sample

In [None]:
doc = nlp(sample)
doc

In [None]:
print(type(doc))
dir(doc)

Let's create a tokenizer using `spacy`

In [None]:
%%time
# create a tokenizer using spacy

### BEGIN SOLUTION



# save tokens to df
df['spacy_tokens'] = ...

### END SOLUTION

Using `squarify` we can graphically display occurrence rates for the most common tokens

In [None]:
# pass df through count for stats
wc = count(df['spacy_tokens'])

# sort and keep top 20 tokens for plotting
wc_top20 = wc[wc['rank'] <= 20]

# plot stats
plt.figure(figsize=(7,7))
squarify.plot(sizes=wc_top20['fraction_of_total'], label=wc_top20['word'], alpha=.8 )
plt.axis('off')
plt.show()

In [None]:
wc_next20 = wc[(wc['rank'] > 20) & (wc['rank'] <= 40)]

# plot stats
plt.figure(figsize=(7,7))
squarify.plot(sizes=wc_next20['fraction_of_total'], label=wc_next20['word'], alpha=.8 )
plt.axis('off')
plt.show()

In [None]:
wc_3rd20 = wc[(wc['rank'] > 40) & (wc['rank'] <= 60)]

# plot stats
plt.figure(figsize=(7,7))
squarify.plot(sizes=wc_3rd20['fraction_of_total'], label=wc_3rd20['word'], alpha=.8 )
plt.axis('off')
plt.show()

## Challenge

In the module project, you will apply tokenization to another set of review data and produce visualizations of those tokens!

# 3. Stop Words (Learn)
<a id="p2"></a>

## Overview
Section Agenda
- What are they?
- How do we get rid of them using Spacy?
- Visualization
- Libraries of Stop Words
- Extending Stop Words
- Statistical trimming

In the visualizations above, you may have begun to notice a pattern. Most of the words don't really add much to our understanding of product reviews. Words such as "I", "and", "of", etc. have almost no semantic meaning to us. We call these useless words "stop words," because we should 'stop' ourselves from including them in the analysis.

Most NLP libraries have built-in lists of stop words that contain common English words: conjunctions, articles, adverbs, pronouns, and common verbs. The best practice, however, is to extend/customize these standard English stop words for your problem's domain. If I am studying political science, I may want to exclude the word "politics" from my analysis; it's so common that it does not add to my understanding.
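
Before we look at a real library's list, here is a minimal sketch of the idea in plain Python. Both stop word sets below are tiny and made up for illustration; real libraries ship lists with hundreds of entries.

```python
# Hypothetical generic stop words (a real list is much longer).
GENERIC_STOP_WORDS = {"i", "and", "of", "the", "it", "is", "a", "this"}

# Hypothetical domain-specific additions: words so common in product
# reviews that they tell us little about any single review.
DOMAIN_STOP_WORDS = {"amazon", "tablet"}

# Set union combines the generic list with the domain extensions.
STOP_WORDS = GENERIC_STOP_WORDS | DOMAIN_STOP_WORDS

tokens = ["i", "love", "this", "amazon", "tablet", "and", "the",
          "battery", "life", "is", "great"]

# Keep only the tokens that are not in the stop word set.
kept = [token for token in tokens if token not in STOP_WORDS]
print(kept)
```

Storing the stop words in a `set` keeps each membership check fast, which matters when you filter every token in a large corpus.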

## Follow Along

### Default Stop Words
Let's take a look at the standard stop words that came with our Spacy model:

In [None]:
# Spacy's Default Stop Words
print(nlp.Defaults.stop_words)
print(len(nlp.Defaults.stop_words))

Let's improve our `spacy` tokenizer by removing stop words and punctuation and lower-casing the tokens

In [None]:
%%time
# Use spacy to create a tokenizer that removes stop words

tokens = []

""" Update those tokens w/o stopwords"""
for doc in nlp.pipe(df['reviews.text']):

    doc_tokens = []

    for token in doc:
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token.text.lower())

    tokens.append(doc_tokens)

df['spacy_tokens_v2'] = tokens

In [None]:
df['spacy_tokens_v2']

In [None]:
# plot the stats
# pass tokens through count function
wc = count(df['spacy_tokens_v2'])

# sort and keep the top 20 words
wc_top20 = wc[wc['rank'] <= 20]

plt.figure(figsize=(8,8))
squarify.plot(sizes=wc_top20['fraction_of_total'], label=wc_top20['word'], alpha=.8 )
plt.axis('off')
plt.show()

### Extending Stop Words

In [None]:
print(type(nlp.Defaults.stop_words))
print(len(nlp.Defaults.stop_words))

In [None]:
STOP_WORDS = nlp.Defaults.stop_words.union(['batteries','I', 'amazon', 'i', 'Amazon', 'it', "it's", 'it.', 'the', 'this',])
print(len(STOP_WORDS))

In [None]:
# use spacy to create a tokenizer that removes stopwords using STOP_WORDS

tokens = []
for doc in nlp.pipe(df['reviews.text'], batch_size=500):

    doc_tokens = []

    for token in doc:
        if not token.is_punct and token.text.lower() not in STOP_WORDS:
            doc_tokens.append(token.text.lower())

    tokens.append(doc_tokens)

df['spacy_tokens_v3'] = tokens

wc = count(df['spacy_tokens_v3'])
wc_top20 = wc[wc['rank'] <= 20]

plt.figure(figsize=(8,8))
squarify.plot(sizes=wc_top20['fraction_of_total'], label=wc_top20['word'], alpha=.8 )
plt.axis('off')
plt.show()

### Statistical Trimming

So far, we have talked about stop words in relation to either broad English words or domain-specific stop words. Another common approach to stop word removal is statistical trimming. The basic idea: preserve the words that give the greatest amount of variation in your data.

Do you remember this graph?

In [None]:
sns.lineplot(x='rank', y='cumulative_fraction_of_total', data=wc);
plt.grid()

This graph tells us that only a *handful* of words represent 80% of the words in the overall corpus. We can interpret this in two ways:
1. The words that appear most frequently may not provide any insight into the meaning of the documents, since they are so prevalent.
2. Words that appear infrequently (at the end of the graph) also probably do not add much value, because they are mentioned so rarely.
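
Both interpretations suggest the same kind of filter: drop words that appear in too many documents or in too few. Here is a rough sketch on a made-up corpus. The two thresholds are arbitrary values picked for this toy example, not recommendations.

```python
from collections import Counter

# Hypothetical tokenized documents, invented for this example.
docs = [
    ["tablet", "great", "screen"],
    ["tablet", "slow", "charger"],
    ["tablet", "great", "price"],
    ["tablet", "great", "gift"],
]

total_docs = len(docs)

# Number of documents each word appears in (set() counts a word once per document).
appears_in_docs = Counter()
for token_list in docs:
    appears_in_docs.update(set(token_list))

# Arbitrary cutoffs for this toy corpus (assumptions, tune them for real data).
MIN_DOC_FRACTION = 0.5   # drop words found in fewer than half of the documents
MAX_DOC_FRACTION = 0.9   # drop words found in more than 90% of the documents

kept_vocab = {
    word
    for word, n_docs in appears_in_docs.items()
    if MIN_DOC_FRACTION <= n_docs / total_docs <= MAX_DOC_FRACTION
}
print(sorted(kept_vocab))
```

Here "tablet" is trimmed for being everywhere and the one-off words are trimmed for being rare, leaving only the word that actually varies between documents.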

Let's take a look at the words at the bottom and the top and make a decision for ourselves:

In [None]:
# Distribution of the fraction of documents each word appears in
sns.distplot(wc['appears_in_fraction_of_docs']);

In [None]:
# Keep only words that appear in at least 2.5% of documents, then re-plot the distribution
wc = wc[wc['appears_in_fraction_of_docs'] >= 0.025]
sns.distplot(wc['appears_in_fraction_of_docs']);

## Challenge

In the module project, you will apply stop word removal to a new corpus. You will focus on applying dictionary based stop word removal, but as a stretch goal, you should consider applying statistical stopword trimming.

# 4. Stemming & Lemmatization (Learn)
<a id="p3"></a>

## Overview

You can see from our example above that there is still some normalization to do to get a clean analysis. You may notice that there are many words (*e.g.*, 'batteries', 'battery') which share the same root word. We can use either stemming or lemmatization to trim our words down to the 'root' word.

__Section Agenda__:

- Which is which
- why use one v. other
- show side by side visualizations
- how to do it in spacy & nltk
- introduce PoS in here as well

## Follow Along

### 4.1 Stemming

> *a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.* - [Martin Porter](https://tartarus.org/martin/PorterStemmer/)

Some examples include:
- 'ing'
- 'ed'
- 's'

These rules are by no means comprehensive, but they are somewhere to start. Most stemming is done by well documented algorithms such as Porter, Snowball, and Dawson. Porter and its newer version Snowball are the most popular stemming algorithms today. For more information on various stemming algorithms check out [*"A Comparative Study of Stemming Algorithms"*](https://pdfs.semanticscholar.org/1c0c/0fa35d4ff8a2f925eb955e48d655494bd167.pdf)
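
To get a feel for rule-based suffix stripping, here is a deliberately naive toy stemmer that only knows the three endings listed above. It is **not** the Porter algorithm (which has many more rules and conditions), and `naive_stem` and `min_stem_length` are names made up for this sketch.

```python
# The three example endings from the list above, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= min_stem_length:
            return word[:-len(suffix)]
    # No rule applied: return the word unchanged.
    return word

for word in ["learning", "wanted", "finds", "tried", "is", "feed"]:
    print(word, "->", naive_stem(word))
```

Notice that "tried" becomes "tri", which is not a word. Stemmers chop according to rules; they do not check the result against a dictionary, and that is why stems can look odd.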


Spacy does not do stemming out of the box, but instead uses a different technique called *lemmatization* which we will discuss in the next section. Let's turn to an antique python package `nltk` for stemming.

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["is", "was", "be", "are", "messed", "mess", "feed", "mixed", "tied", "learned", "wanted", "tried", "finds", "learning"]

for word in words:
    print(ps.stem(word))

### 4.1.1 Five Minute Challenge

Apply the Porter stemming algorithm to the tokens in the `df` dataframe. Visualize the results in the treemap we have been using for this session.

In [None]:
from tqdm import tqdm
tqdm.pandas()

### BEGIN SOLUTION

# Put in a new column `stems`
def get_stems(text):

    return # TODO

df['stems'] = df['reviews.text'].progress_apply(get_stems)

### END SOLUTION

wc = count(df['stems'])
wc_top20 = wc[wc['rank'] <= 20]

plt.figure(figsize=(8,8))
squarify.plot(sizes=wc_top20['fraction_of_total'], label=wc_top20['word'], alpha=.8 )
plt.axis('off')
plt.show()

### 4.2 Lemmatization

You notice immediately that results are kinda funky - words just oddly chopped off. The Porter algorithm did exactly what it knows to do: chop off endings. Stemming works well in applications where humans don't have to worry about reading the results. Search engines and more broadly information retrieval algorithms use stemming. Why? Because it's fast.

Lemmatization, on the other hand, is more methodical. The goal is to transform a word into its base form, called a lemma. Plural nouns with funky spellings get transformed to their singular form. Verbs are all transformed to their base (infinitive) form. Nice tidy data for a visualization. :) However, this tidy data can come at a computational cost. Spacy does a pretty freaking good job of it though. Let's take a look:

In [None]:
sent = "men man women woman wolf wolves run runs running go going went gone"
doc = nlp(sent)

# Lemma Attributes
for token in doc:
    print(token.text, "\t", token.lemma_)

In [None]:
# spacy document object
type(doc)

In [None]:
# spacy token object
type(doc[0])
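
For contrast with stemming, here is a toy lookup-table lemmatizer in plain Python, run on the same sentence as the cell above. The table is hand-written for this example and is only a sketch of the idea; real lemmatizers, Spacy's included, rely on far larger resources and usually on part-of-speech information as well.

```python
# Hypothetical lookup table, hand-written for this example only.
LEMMA_TABLE = {
    "men": "man", "women": "woman", "wolves": "wolf",
    "runs": "run", "running": "run",
    "going": "go", "went": "go", "gone": "go",
}

def lookup_lemma(word):
    """Return the table's lemma for a word, or the lowercased word if it is not listed."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

sent = "men man women woman wolf wolves run runs running go going went gone"
lemmas = [lookup_lemma(word) for word in sent.split()]

for word, lemma in zip(sent.split(), lemmas):
    print(word, "\t", lemma)
```

A suffix rule could never map "went" to "go" or "wolves" to "wolf". Lemmatization needs knowledge about the vocabulary, which is where its extra computational cost comes from.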

Let's write a function to create tokens using the `spacy` lemmatizer

In [None]:
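# Re-load nlp without the parser and NER to speed up the pipeline
# Keep the tagger enabled: in spaCy v3 the English lemmatizer relies on its part-of-speech tags
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])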

In [None]:
# Wrap it all in a function
def get_lemmas(text):

    lemmas = []
    doc = nlp(text)

    ### BEGIN SOLUTION

    ### END SOLUTION

    return lemmas

In [None]:
df['lemmas'] = df['reviews.text'].progress_apply(get_lemmas)

In [None]:
cols = ['lemmas', 'reviews.text']
df[cols].head()

In [None]:
wc = count(df['lemmas'])
wc_top20 = wc[wc['rank'] <= 20]

plt.figure(figsize=(8,8))
squarify.plot(sizes=wc_top20['fraction_of_total'], label=wc_top20['word'], alpha=.8 )
plt.axis('off')
plt.show()

In [None]:
# To make this comparison more interesting, let's compare: All Amazon Reviews, Fire HD 8 only, and Kindle only
df['FireHD_8'] = df['name'].str.contains('fire hd 8', case=False)
df['Kindle'] = df['name'].str.contains('kindle', case=False)

# Use the Function for all reviews, Fire HD 8 only, and Kindle only
wc = count(df['lemmas'])
wc_fire_hd_8 = count(df[df['FireHD_8'] == True]['lemmas'])
wc_kindle = count(df[df['Kindle'] == True]['lemmas'])
print(wc.shape, wc_fire_hd_8.shape, wc_kindle.shape)

# Get top 20 word occurrences for each set of data
wc_top20 = wc[wc['rank'] <= 20]
wc_fire_top20 = wc_fire_hd_8[wc_fire_hd_8['rank'] <= 20]
wc_kindle_top20 = wc_kindle[wc_kindle['rank'] <= 20]

fig, axes = plt.subplots(1, 3, figsize=(20, 8))

axes[0].set_title('All Amazon Reviews')
squarify.plot(sizes=wc_top20['fraction_of_total'], label=wc_top20['word'], alpha=.8, ax=axes[0])
axes[0].axis('off')

axes[1].set_title('Fire HD 8 Tablet')
squarify.plot(sizes=wc_fire_top20['fraction_of_total'], label=wc_fire_top20['word'], alpha=.8, ax=axes[1])
axes[1].axis('off')

axes[2].set_title('Kindle')
squarify.plot(sizes=wc_kindle_top20['fraction_of_total'], label=wc_kindle_top20['word'], alpha=.8, ax=axes[2])
axes[2].axis('off')
plt.show()

## Challenge

You should know how to apply lemmatization with Spacy to a corpus of text.

## (Bonus Material) ScatterText

To run this section, go to your terminal and execute:

- pip install scattertext

## Scattertext Kindle vs. FireHD Comparison

In [None]:
!pip install scattertext

In [None]:
# Create a copy and add column with product tags
subset_df = df.copy()
subset_df.loc[subset_df['name'].str.contains('kindle', case=False), 'product'] = 'Kindle'
subset_df.loc[subset_df['name'].str.contains('fire hd 8', case=False), 'product'] = 'Fire HD 8'

# Drop reviews that aren't Kindle/Fire HD 8
subset_df.dropna(subset=['product'], inplace=True)

# Confirm shape and distribution of reviews
print(subset_df.shape)
subset_df['product'].value_counts()

In [None]:
import spacy
import scattertext as st

nlp = spacy.load("en_core_web_md")

corpus = st.CorpusFromPandas(subset_df,
                             category_col='product',
                             text_col='reviews.text',
                             nlp=nlp).build()

html = st.produce_scattertext_explorer(
    corpus,
    category='Kindle',
    category_name='Kindle',
    not_category_name='Fire HD 8',
    width_in_pixels=1000,
    metadata=subset_df['reviews.rating'])

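# Write the HTML inside a context manager so the file is closed properly
with open('./kindle_vs_firehd8.html', 'w', encoding='utf-8') as f:
    f.write(html)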

# Review

In this module, you've seen us apply Natural Language Processing techniques (tokenization, stop word removal, and lemmatization) to a corpus of Amazon text reviews. We analyzed those reviews using these techniques and found that Amazon customers generally appear satisfied with Amazon products, including their battery life.

You will apply similar techniques in today's [module project assignment](https://colab.research.google.com/drive/1tAShxk2KAL0iMp5kC7JGk7UujOEuIuSI?usp=sharing) to analyze coffee shop reviews from Yelp. Remember that the techniques of processing the text are just the beginning. There are many ways to slice and dice the data.

# Sources

* Spacy 101 - https://course.spacy.io
* NLTK Book - https://www.nltk.org/book/
* An Introduction to Information Retrieval - https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

## Advanced Resources & Techniques
- Named Entity Recognition (NER)
- Dependency Trees
- Generators
- Major libraries (NLTK, Spacy, Gensim)
