# Natural Language Pre-Processing & Vectorization

In [None]:
# Use this to install nltk if needed
# !pip install nltk
# !conda install -c anaconda nltk

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
from nltk import pos_tag
import string
import re

# Notice that these vectorizers are from `sklearn` and not `nltk`!
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer,\
HashingVectorizer

In [None]:
# Use this to download the stopwords if you haven't already - only ever needs to be run once
nltk.download("stopwords")

In [None]:
# Same with wordnet!
nltk.download('wordnet')

In [None]:
# And with parts of speech tagging!
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')
# Recent NLTK releases may instead ask for 'averaged_perceptron_tagger_eng'

# Learning Goals

- Describe the basic concepts of NLP
- Use pre-processing methods for NLP
    - Tokenization
    - Stopwords removal
- Normalize a lexicon with stemming and lemmatization
- Run feature engineering algorithms for NLP
    - Bag-of-words
    - Vectorization

# Overview of NLP

NLP allows computers to interact with text data in a structured and sensible way. In short, we will be breaking up series of texts into individual words (or groups of words), and isolating the words with **semantic value**.  We will then compare texts with similar distributions of these words, and group them together.

In this section, we will discuss some steps and approaches to common text data analytic procedures. Some of the applications of natural language processing are:
- Chatbots 
- Speech recognition and audio processing 
- Classifying documents 

Here is an example that uses some of the tools we use in this notebook.  
  - [chicago_justice classifier](https://github.com/chicago-justice-project/article-tagging/blob/master/lib/notebooks/bag-of-words-count-stemmed-binary.ipynb)

We will introduce you to the preprocessing steps, feature engineering, and other steps you need to take in order to format text data for machine learning tasks. 

We will also introduce you to [**NLTK**](https://www.nltk.org/) (Natural Language Toolkit), which will be our main tool for engaging with textual data.

<img src="images/nlp_process.png" style="width:1000px;">

# Preprocessing for NLP

The goal when pre-processing text data for NLP is to remove as many unnecessary words as possible while preserving as much semantic meaning as possible. Done well, this typically improves model performance.

You can think of this sort of like dimensionality reduction. The unique words in your corpus form a **vocabulary**, and each word in your vocabulary is essentially another feature in your model. So we want to get rid of unnecessary words and consolidate words that have similar meanings.

We will be working with a dataset which includes both satirical (The Onion) and real news (Reuters) articles. We refer to the entire set of articles as the **corpus**.  

Each article in the corpus is referred to as a **document**.

![the_onion](images/the_onion.jpeg) ![reuters](images/reuters.png)

In [None]:
# Read in the data, check it out
corpus = pd.read_csv('data/satire_nosatire.csv')
corpus.shape

In [None]:
# What does the data look like
corpus.tail()

Our goal is to detect satire, so our target class of 1 is associated with The Onion articles.  

In [None]:
# Check out a satirical article
corpus.loc[10]

In [None]:
# Check out a non-satirical article
corpus.loc[502]

It is a balanced dataset with 500 documents of each category. 

In [None]:
corpus.target.value_counts()

**Discussion:** Let's think about the use cases of being able to correctly separate satirical from authentic news. What might be a real-world use case?  

- 



# NLP Processing By Hand!

Let's go over the many steps involved in processing text data. While some functions or classes we'll use later will do many of these steps for us, we'll walk through each of these manually first so we can discuss why they're important!

## Tokenization 

In order to convert the texts into data suitable for machine learning, we need to break down the documents into smaller parts. 

The first step in doing that is **tokenization**.

Tokenization is the process of splitting documents into units of observation. We usually represent the tokens as __n-grams__, where n represents the number of consecutive words occurring in a document that we will consider a unit. In the case of unigrams (one-word tokens), the sentence "David works here" would be tokenized into:

- "David", "works", "here";

If we also want to consider bigrams, we would add:

- "David works" and "works here".
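
As a minimal sketch of the idea, n-grams can be built by sliding a window of size n over a list of tokens. The helper name `ngrams` below is our own and is not an NLTK or scikit-learn function.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "David works here".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['David', 'works', 'here']
print(bigrams)   # ['David works', 'works here']
```

A sentence with k words yields k - n + 1 n-grams, so the bigram list is always one item shorter than the unigram list.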

Let's consider a particular document in our corpus:

In [None]:
sample_document = corpus.iloc[1].body

In [None]:
sample_document

There are many ways to tokenize our document. 

It is a long string, so the first way we might consider is to split it by spaces.

In [None]:
# Split the document on single spaces
tokens = sample_document.split(' ')
tokens[:20]

But this is not ideal. We are trying to create a set of tokens with **high semantic value**.  In other words, we want to isolate text which best represents the meaning in each document.

## Text Cleaning

Most NL Pre-Processing will include the following tasks:

  1. Remove capitalization  
  2. Remove punctuation  
  3. Remove stopwords  
  4. Remove numerals
  
We can manually perform all of these tasks with string operations.
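
To preview where we are headed, here is a rough sketch that applies all four steps to one sentence using only string operations. The function name `clean_tokens`, the example sentence, and the tiny `TOY_STOPWORDS` set are made up for illustration; the walkthrough that follows uses NLTK's full stopword list instead.

```python
import string

# Hypothetical mini stopword list, for illustration only
TOY_STOPWORDS = {"the", "in", "a", "of", "and", "is"}

def clean_tokens(text):
    """Lowercase, strip punctuation and digits, drop stopwords and empties."""
    drop_chars = str.maketrans("", "", string.punctuation + string.digits)
    cleaned = []
    for token in text.split():
        token = token.lower().translate(drop_chars)
        if token and token not in TOY_STOPWORDS:
            cleaned.append(token)
    return cleaned

cleaned = clean_tokens("In 2019, the Senate passed 3 bills.")
print(cleaned)  # ['senate', 'passed', 'bills']
```

Note that tokens made up entirely of digits or punctuation become empty strings after translation, which is why the function checks `if token` before keeping one.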

### Capitalization

When we create our matrix of words associated with our corpus, **capital letters** will mess things up.  The semantic value of a word used at the beginning of a sentence is the same as that same word in the middle of the sentence - but Python won't treat them the same!

In [None]:
# This will prove that 'excessive' and 'Excessive' aren't the same
sentence_one =  "Excessive gerrymandering in small counties suppresses turnout." 
sentence_two =  "Turnout is suppressed in small counties by excessive gerrymandering."

Excessive = sentence_one.split(' ')[0]
excessive = sentence_two.split(' ')[-2]
print(excessive, Excessive)
excessive == Excessive

Although 'excessive' has the same semantic value in both sentences, the two forms will be treated as different tokens because of capitalization. Let's fix it!

In [None]:
# Remove capital letters by lowercasing every token
manual_cleanup = [token.lower() for token in tokens]
manual_cleanup[:20]

In [None]:
# Check the length of the original number of tokens in sample_document
print(f"Our initial token set for our sample document is {len(tokens)} words long")

In [None]:
# Check the length of the unique number of tokens in sample_document
print(f"Our initial token set for our sample document has {len(set(tokens))} unique words")

In [None]:
# Check the length of the unique number of tokens after lowercasing
print(f"After removing capitals, our sample document has {len(set(manual_cleanup))} unique words")

So lowercasing didn't change the number of unique words in our sample document. Part of the reason is that the tokens still carry punctuation!

### Punctuation

As with capitals, punctuation causes trouble: splitting on whitespace creates tokens that include punctuation, which will muck up our semantics.

Returning to the above example, 'gerrymandering' and 'gerrymandering.' will be treated as different tokens.

In [None]:
# Check out how the same word with and without punctuation aren't the same
no_punct = sentence_one.split(' ')[1]
punct = sentence_two.split(' ')[-1]
print(no_punct, punct)
no_punct == punct

In [None]:
# Can access punctuation using the string library
string.punctuation

In [None]:
# Fun way to clean this is using the translate function!
manual_cleanup = [s.translate(str.maketrans('', '', string.punctuation))\
                  for s in manual_cleanup]

In [None]:
print(f"After removing punctuation, our sample document has \
{len(set(manual_cleanup))} unique words")

We can see that in our sample document, we now have 3 fewer unique tokens! Still more work to do though...

### Stopwords

Stopwords are the **filler** words in a language: prepositions, articles, conjunctions. They have low semantic value, and often need to be removed.  

Luckily, NLTK has lists of stopwords ready for our use.

In [None]:
# Let's check out some English language stopwords
stopwords.words('english')[:10]

In [None]:
# NLTK has stopwords in other languages too!
stopwords.words('greek')[:10]

Let's see which stopwords are present in our sample document.

In [None]:
# Check out the first 10 stopwords in our sample document
english_stops = set(stopwords.words('english'))  # Build the lookup once, not once per token
stops = [token for token in manual_cleanup if token in english_stops]
stops[:10]

In [None]:
print(f'There are {len(stops)} instances of {len(set(stops))} \
stopwords in the sample document')

In [None]:
print(f'The {len(stops)} instances make up \
{len(stops)/len(manual_cleanup):.2%} of our text')

Let's also use the **FreqDist** tool to look at the makeup of our text before and after removal:

In [None]:
# This FreqDist class from NLTK will help us visualize the frequency of words
fdist = FreqDist(manual_cleanup)
plt.figure(figsize=(10, 10))
fdist.plot(30);

In [None]:
# Now let's remove stopwords - use list comprehension
sw = stopwords.words('english')
manual_cleanup = [token for token in manual_cleanup if token not in sw]

In [None]:
print(f'After removing stopwords, there are {len(set(manual_cleanup))} unique words left')

In [None]:
# Now let's check out that frequency distribution
fdist = FreqDist(manual_cleanup)
plt.figure(figsize=(10, 10))
fdist.plot(30);

### Numerals

Numerals also usually have low semantic value. Their removal can help improve our models. 

In [None]:
manual_cleanup = [s.translate(str.maketrans('', '', '0123456789')) \
                  for s in manual_cleanup]

In [None]:
# drop empty strings
manual_cleanup = [s for s in manual_cleanup if s != '']

In [None]:
print(f'After removing numerals, there are {len(set(manual_cleanup))} unique words left')

## Regex

Regex allows us to match strings based on a pattern.  This pattern comes from a language of identifiers, which we can begin exploring on the cheatsheet found here:
  -   https://regexr.com/

A few key symbols:
  - . : matches any character (except a newline, by default)
  - \d, \w, \s : match a digit, a word character (letter, digit, or underscore), and a whitespace character
  - *, ?, + : match 0 or more, 0 or 1, and 1 or more of the preceding element
  - [A-Z] : matches any capital letter
  - [a-z] : matches any lowercase letter
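
The symbols above can be tried out with Python's built-in `re` module. The sentence below is made up for illustration.

```python
import re

text = "Call 555-0199 before 5 PM, or email Bob."

digits = re.findall(r"\d", text)                # one match per digit
numbers = re.findall(r"\d+", text)              # runs of one or more digits
capitalized = re.findall(r"[A-Z][a-z]+", text)  # a capital followed by lowercase letters
spellings = re.findall(r"colou?r", "color colour")  # the 'u' is optional

print(numbers)      # ['555', '0199', '5']
print(capitalized)  # ['Call', 'Bob']
print(spellings)    # ['color', 'colour']
```

"PM" is not matched by `[A-Z][a-z]+` because neither capital is followed by a lowercase letter.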

Other helpful resources:
  - https://regexcrossword.com/
  - https://www.regular-expressions.info/tutorial.html

We can use regex to isolate numerals:

In [None]:
sample_document

In [None]:
pattern = '[0-9]'
number = re.findall(pattern, sample_document)
number

In [None]:
pattern2 = '[0-9]+'
number2 = re.findall(pattern2, sample_document)
number2

# Text Processing Using NLTK

## `RegexpTokenizer()`

scikit-learn and NLTK provide us with a suite of **tokenizers** for our text preprocessing convenience. So we don't have to do this all by hand every time!

Let's use a provided Regex pattern to do all that work for us!

Documentation: https://tedboy.github.io/nlps/generated/generated/nltk.tokenize.RegexpTokenizer.html


We can use [RegExr](https://regexr.com/) to explore exactly what our pattern, `([a-zA-Z]+(?:'[a-z]+)?)`, is doing.
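
To get a feel for the pattern without leaving Python, here is a small sketch using `re.findall` on a made-up sentence. The first part, `[a-zA-Z]+`, grabs a run of letters; the optional non-capturing group `(?:'[a-z]+)?` lets a straight apostrophe plus lowercase letters ride along, so contractions stay in one piece.

```python
import re

pattern = r"([a-zA-Z]+(?:'[a-z]+)?)"
sample = "I'm sure the 3 dogs aren't hers, OK?"

tokens = re.findall(pattern, sample)
print(tokens)  # ["I'm", 'sure', 'the', 'dogs', "aren't", 'hers', 'OK']

# The part after the apostrophe must be lowercase, so this one splits
shouted = re.findall(pattern, "DON'T")
print(shouted)  # ['DON', 'T']
```

Digits and punctuation never match, so they are dropped as a side effect of tokenizing.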

In [None]:
sample_document

In [None]:
# This pattern tokenizes AND removes punctuation
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
# Need to save the tokenizer that uses the regex pattern
tokenizer = RegexpTokenizer(pattern)
# And now tokenize our sample_document with tokenizer.tokenize
sample_tokens = tokenizer.tokenize(sample_document)

In [None]:
# Check out one of the middle sentences - notice what happened to I'm
sample_tokens[42:52]

In [None]:
# Still need to lowercase
sample_tokens = [token.lower() for token in sample_tokens]
# And remove stopwords
sample_tokens = [token for token in sample_tokens if token not in sw]

In [None]:
sample_document

In [None]:
print(f'We are down to {len(set(sample_tokens))} unique words')

# Exercise: Initial Text Pre-Processing

**Activity:** Use what you've learned to preprocess the fourth article. How do the length and the number of unique words in the article change?

In [None]:
# Grab the fourth document
fourth_document = corpus.iloc[3].body
fourth_document

In [None]:
# Check out how many words it has originally
print(f'The 4th document starts off with {len(set(fourth_document.split(" ")))} unique words')

In [None]:
# Your code here to process and reduce down to useful words!


## Stemming and Lemmatizing

### Stemming
Most of the semantic meaning of a word is held in the root, which is usually the beginning of a word.  Conjugations and plurality do not change the semantic meaning. "eat", "eats", and "eating" all have essentially the same meaning. The rest is grammatical variation for the sake of marking things like tense or person or number.   

Stemmers consolidate similar words by chopping off the ends of the words.
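
To make the "chopping" concrete, here is a deliberately naive sketch. It is not the Porter or Snowball algorithm; the suffix list and the `toy_stem` helper are hypothetical, and real stemmers apply many more rules.

```python
# Hypothetical suffix list; real stemmers such as Porter apply many more rules
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["eat", "eats", "eating", "mice"]]
print(stems)  # ['eat', 'eat', 'eat', 'mice']
```

Suffix chopping consolidates the regular forms of "eat" but leaves an irregular plural like "mice" untouched, and it can produce stems that are not real words (here `toy_stem("houses")` gives `'hous'`).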

<img src="images/stemmer.png" width=200>

There are different stemmers available.  The two we will use here are the **Porter** and **Snowball** stemmers.  A main difference between the two is how aggressively they stem, with Porter being the less aggressive.

In [None]:
p_stemmer = nltk.stem.PorterStemmer()
s_stemmer = nltk.stem.SnowballStemmer(language="english")

In [None]:
sample_tokens[0]

In [None]:
p_stemmer.stem(sample_tokens[0])

In [None]:
s_stemmer.stem(sample_tokens[0])

In [None]:
for word in sample_tokens:
    p_word = p_stemmer.stem(word)
    s_word = s_stemmer.stem(word)
    
    if p_word != s_word:
        print(word, p_word, s_word)

In [None]:
sample_doc = [p_stemmer.stem(word) for word in sample_tokens]

In [None]:
fdist = FreqDist(sample_doc)  # Plot the stemmed tokens, not the originals
plt.figure(figsize=(10, 10))
fdist.plot(30);

In [None]:
print(f'Stemming slightly reduced our token count: {len(set(sample_doc))} unique tokens')

### Lemmatizing

Lemmatizing is a bit more sophisticated than the stem choppers. Lemmatizing uses part-of-speech tagging to determine how to transform a word.

- Unlike Stemming, Lemmatization reduces the inflected words, properly ensuring that the root word belongs to the language. It can handle words such as "mouse", whose plural "mice" the stemmers would not lump together with the original. 

- In Lemmatization, the root word is called the "lemma". 

- A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

<img src="images/lemmer.png" width=300>

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
print(f'"Mice" becomes: {lemmatizer.lemmatize("mice")}')
print(f'"Media" becomes: {lemmatizer.lemmatize("media")}')

In [None]:
# However, look at the output below:
sentence = "He saw the trees get sawed down"
lemmed_sentence = [lemmatizer.lemmatize(token) for token in sentence.split(' ')]
lemmed_sentence

Lemmatizers depend, for their full functionality, on POS tagging, and **the default tag is 'noun'**.
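
Why the tag matters can be sketched with a toy lookup. The table and the `toy_lemmatize` helper below are hypothetical stand-ins; a real lemmatizer such as WordNet's consults a full lexical database rather than a hand-written dictionary.

```python
# Hypothetical lookup table keyed by (word, part of speech)
TOY_LEMMAS = {
    ("saw", "n"): "saw",    # the tool
    ("saw", "v"): "see",    # past tense of "see"
    ("trees", "n"): "tree",
    ("sawed", "v"): "saw",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

as_noun = toy_lemmatize("saw")           # falls back to the default tag, 'n'
as_verb = toy_lemmatize("saw", pos="v")
print(as_noun, as_verb)  # saw see
```

The same surface form maps to different lemmas depending on its part of speech, which is why a verb lemmatized under the default noun tag can come back unchanged.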

With a little bit of work, we can POS tag our text.

In [None]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
tokenizer = RegexpTokenizer(pattern)
sample_doc = tokenizer.tokenize(sample_document)
sample_doc = [token.lower() for token in sample_doc]
sample_doc = [token for token in sample_doc if token not in sw]
sample_document

In [None]:
# Use nltk's pos_tag to tag our words
# Does a pretty good job, but does make some mistakes

sample_doc_tagged = pos_tag(sample_doc)
sample_doc_tagged

In [None]:
# Then transform the tags into the tags of our lemmatizers

def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [None]:
sample_doc_tagged

In [None]:
sample_doc_tagged = [(token[0], get_wordnet_pos(token[1]))
             for token in sample_doc_tagged]

In [None]:
sample_doc_tagged

In [None]:
sample_doc_lemmed = [lemmatizer.lemmatize(token[0], token[1]) for token in sample_doc_tagged]

In [None]:
sample_doc[:20]

In [None]:
sample_doc_lemmed[:20]

In [None]:
print(f'There are {len(set(sample_doc_lemmed))} unique lemmas.')

In [None]:
fdist = FreqDist(sample_doc_lemmed)
plt.figure(figsize=(10, 10))
fdist.plot(30);

# Preparing Text for Modeling

The machine learning algorithms we have encountered so far represent features as variables that take on different values for each observation. For example, we represent individuals with distinct education levels, incomes, and such. However, in NLP, features are represented in a very different way. In order to pass text data to machine learning algorithms and perform classification, we need to represent the features in a sensible way. One such method is called **Bag-of-words (BoW)**.

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

- A vocabulary of known words.
- A measure of the presence of known words.

It is called a “bag” of words **because any information about the order or structure of words in the document is discarded**. The model is only concerned with whether known words occur in the document, not with **where** they may occur in the document. The intuition behind BoW is that a document is similar to another if they have similar contents. The Bag of Words method can be represented as a **Document Term Matrix**, in which each column is a unique n-gram from the vocabulary and each row is a document. Consider, for example, the following **corpus** of documents:

- Document 1: "I love dogs."
- Document 2: "I love cats."
- Document 3: "I love all animals."
- Document 4: "I hate dogs."

This corpus can be represented as:

$\downarrow$Doc\|Word$\rightarrow$|I|love|dogs|cats|all|animals|hate
-|-|-|-|-|-|-|-
Document_1|1|1|1|0|0|0|0
Document_2|1|1|0|1|0|0|0
Document_3|1|1|0|0|1|1|0
Document_4|1|0|1|0|0|0|1
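
The table above can be reproduced by hand with a few lines of plain Python. This is only a sketch: the tokenization (strip the final period, split on whitespace) is a simplifying assumption that happens to work for these four sentences.

```python
from collections import Counter

docs = ["I love dogs.", "I love cats.", "I love all animals.", "I hate dogs."]

# Tokenize naively: strip the trailing period and split on whitespace
tokenized = [doc.rstrip(".").split() for doc in docs]

# Build the vocabulary in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per document, one column per vocabulary word
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Because each row only records counts, "I love dogs" and "dogs love I" would produce identical rows, which is exactly the loss of word order described above.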

## Vectorization

In order to get these tokens from our documents, we're going to use tools called "vectorizers".

Documentation! https://scikit-learn.org/stable/modules/feature_extraction.html

### `CountVectorizer`

The most straightforward vectorizer in `sklearn.feature_extraction.text` is the `CountVectorizer`, which will simply count the number of each word type in each document.

In [None]:
# Convert a collection of text documents to a matrix of token counts

# Note that our Vectorizer wants our text BEFORE tokenizing
# So we need to use .join to un-tokenize
X = [" ".join(sample_doc_lemmed)]

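# Instantiate a CountVectorizer with just default arguments
vec = CountVectorizer()
X_vec = vec.fit_transform(X)  # Then fit_transform on X

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X_vec.toarray(), columns=vec.get_feature_names_out())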
df.head()

That is not very exciting for one document. The idea is to make a document term matrix for all of the words in our corpus.

In [None]:
# Revisit our corpus
corpus.head()

In [None]:
# Note that our vectorizer can use that same regex pattern to tokenize 
# while removing punctuation, AND it'll remove stopwords!
vec = CountVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)", 
                      stop_words=sw)
# Just exploring for the first 20 documents
X_vec = vec.fit_transform(corpus.body[:20])

df_tokens = pd.DataFrame(X_vec.toarray(), columns=vec.get_feature_names_out())
df_tokens.head()

In [None]:
# We can expand the number of ngrams! AKA not single word tokens
vec = CountVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)", 
                      stop_words=sw,
                      ngram_range=(1, 2))  # Must be a tuple: (min_n, max_n)
# Just exploring for the first 20 documents
X_vec = vec.fit_transform(corpus.body[:20])

df_bigrams = pd.DataFrame(X_vec.toarray(), columns=vec.get_feature_names_out())
df_bigrams.head()

Our document term matrix gets bigger and bigger, with more and more zeros, becoming sparser and sparser.

In [None]:
# One more time...
vec = CountVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)", 
                      stop_words=sw,
                      ngram_range=(1, 2))  # Must be a tuple: (min_n, max_n)
# But now fit to the entire corpus
X_vec = vec.fit_transform(corpus.body)

df = pd.DataFrame(X_vec.toarray(), columns=vec.get_feature_names_out())
df.head()

Luckily, we can set lower and upper limits on the document frequency of each term, that is, on how many documents a term appears in:

In [None]:
vec = CountVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)",
                      stop_words=sw, 
                      ngram_range=(1, 2),
                      min_df=2,    # Keep terms that appear in at least 2 documents
                      max_df=100)  # Drop terms that appear in more than 100 documents
X_vec = vec.fit_transform(corpus.body)

df_cv = pd.DataFrame(X_vec.toarray(), columns=vec.get_feature_names_out())
df_cv.head()

### `TfidfVectorizer`

There are many schemes for determining the values of each entry in a document term matrix, and one of the most common uses the TF-IDF algorithm -- "Term Frequency-Inverse Document Frequency". Essentially, tf-idf *normalizes* the raw counts of the document term matrix so that each value represents how important a word is in the given document.

> The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

**TF (Term Frequency)**: Term frequency is the frequency of the word in the document divided by the total words in the document.

**IDF (inverse document frequency)**: Inverse document frequency is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is generally calculated as the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$\text{idf}(w) = \log\left(\frac{\text{number of documents}}{\text{number of documents containing } w}\right)$$

tf-idf is the product of term frequency and inverse document frequency, or tf * idf. 
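
The definitions above can be worked through by hand on a toy corpus. This sketch assumes the natural logarithm and pre-tokenized, lowercased documents; the helper names are our own. `TfidfVectorizer` will not reproduce these exact numbers, because by default it uses raw counts for the term frequency, smooths the idf, and L2-normalizes each row.

```python
import math

docs = [
    ["i", "love", "dogs"],
    ["i", "love", "cats"],
    ["i", "love", "all", "animals"],
    ["i", "hate", "dogs"],
]

def tf(word, doc):
    """Count of the word in the document, divided by the document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Log of (number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("i", docs[1], docs))     # 0.0, since "i" appears in every document
print(tf_idf("cats", docs[1], docs))  # largest score in document 2
print(tf_idf("love", docs[1], docs))
```

A word that appears in every document gets an idf of log(1) = 0, so it contributes nothing, while a word unique to one document gets the largest weight. As written, `idf` assumes the word occurs somewhere in the corpus.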

In [None]:
tf_vec = TfidfVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)", stop_words=sw)
X_vec = tf_vec.fit_transform(corpus.body)

df = pd.DataFrame(X_vec.toarray(), columns=tf_vec.get_feature_names_out())
df.head()

In [None]:
corpus.iloc[313].body

In [None]:
df.iloc[313].sort_values(ascending=False)[:10]

Let's compare the tfidf to the count vectorizer output for one document.

In [None]:
vec = CountVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)", stop_words=sw)
X_vec = vec.fit_transform(corpus.body)

df_cv = pd.DataFrame(X_vec.toarray(), columns=vec.get_feature_names_out())

In [None]:
df_cv.iloc[313].sort_values(ascending=False)[:10]

The tf-idf weighting lessened the importance of some of the more common words, including "also", which arguably could have been on the stopword list.

It also assigns "nerds" more weight than "power".  

In [None]:
print(f'"Nerds" only shows up in document 313: {(df_cv["nerds"] != 0).sum()} document.')
print(f'"Power" shows up in {(df_cv["power"] != 0).sum()} documents!')

All the words are stored in a `.vocabulary_` attribute:

In [None]:
tf_vec.vocabulary_

### `HashingVectorizer`

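There is also a hashing vectorizer, which maps each token to a column index with a hash function instead of storing a vocabulary. Hashing is not encryption: the mapping is one-way, and different words can collide in the same column.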
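
The underlying idea, often called the hashing trick, can be sketched in a few lines. This is an illustration only: it uses MD5 from the standard library for a stable hash and a tiny, hypothetical `N_FEATURES`, whereas scikit-learn's `HashingVectorizer` uses MurmurHash3 and a far larger default number of columns.

```python
import hashlib

N_FEATURES = 8  # hypothetical and tiny on purpose; real use needs far more buckets

def bucket(token, n_features=N_FEATURES):
    """Map a token to a column index with a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_vectorize(tokens, n_features=N_FEATURES):
    """Count tokens per bucket; no vocabulary is ever stored."""
    row = [0] * n_features
    for token in tokens:
        row[bucket(token, n_features)] += 1
    return row

row = hash_vectorize(["senate", "passed", "bills", "senate"])
print(row)
```

The row always has `N_FEATURES` columns no matter how many distinct words appear, which is where the memory savings come from. The cost is that a column index cannot be turned back into a word.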

In [None]:
hvec = HashingVectorizer(token_pattern=r"([a-zA-Z]+(?:'[a-z]+)?)",
                         stop_words=sw,
                         n_features=2**12)  # Default is 2**20 columns, too wide to densify
X_vec = hvec.fit_transform(corpus.body)

df_hash = pd.DataFrame(X_vec.toarray())
df_hash.head()

### Vectorizer Summary:

Some rules of thumb about these vectorizers:

**Tf-Idf**: Probably the most commonly used. Useful when the goal is to distinguish the **content** of documents from others in the corpus.

**Count**: Useful when the words themselves matter. If the goal is instead about identifying authors by their words, then the fact that some word appears in many documents of the corpus may be important.

**Hashing**: The advantage here is speed and low memory usage. The disadvantage is that you lose the identities of the words being tokenized. Useful for very large datasets where the ultimate model may be a bit of a black box.

## Exercise

For a final exercise, work through the following:

Create a document term matrix of the full 1000-document corpus. 

The vocabulary should have no stopwords, numbers, or punctuation, and it should be lemmatized. 

Use a `TfidfVectorizer`

In [None]:
# Your code here!