<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Natural-Language-Pre-Processing" data-toc-modified-id="Natural-Language-Pre-Processing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Natural Language Pre-Processing</a></span></li><li><span><a href="#Learning-Goals" data-toc-modified-id="Learning-Goals-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Learning Goals</a></span></li><li><span><a href="#Overview-of-NLP" data-toc-modified-id="Overview-of-NLP-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Overview of NLP</a></span></li><li><span><a href="#Preprocessing-for-NLP" data-toc-modified-id="Preprocessing-for-NLP-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Preprocessing for NLP</a></span><ul class="toc-item"><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Tokenization</a></span></li></ul></li><li><span><a href="#Text-Cleaning" data-toc-modified-id="Text-Cleaning-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Text Cleaning</a></span><ul class="toc-item"><li><span><a href="#Capitalization" data-toc-modified-id="Capitalization-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Capitalization</a></span></li><li><span><a href="#Punctuation" data-toc-modified-id="Punctuation-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Punctuation</a></span></li><li><span><a href="#Stopwords" data-toc-modified-id="Stopwords-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Stopwords</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Numerals" data-toc-modified-id="Numerals-5.3.0.1"><span class="toc-item-num">5.3.0.1&nbsp;&nbsp;</span>Numerals</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Regex" data-toc-modified-id="Regex-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Regex</a></span><ul class="toc-item"><li><span><a href="#RegexpTokenizer()" data-toc-modified-id="RegexpTokenizer()-6.1"><span 
class="toc-item-num">6.1&nbsp;&nbsp;</span><code>RegexpTokenizer()</code></a></span></li></ul></li><li><span><a href="#Exercise:-NL-Pre-Processing" data-toc-modified-id="Exercise:-NL-Pre-Processing-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Exercise: NL Pre-Processing</a></span></li></ul></div>

# Natural Language Pre-Processing

In [None]:
# Use this to install nltk if needed
# !pip install nltk
# !conda install -c anaconda nltk

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
import matplotlib.pyplot as plt
import string
import re

In [None]:
# Use this to download the stopwords if you haven't already - only ever needs to be run once

nltk.download("stopwords")

# Learning Goals

- Describe the basic concepts of NLP
- Use pre-processing methods for NLP
    - Tokenization
    - Stopwords removal

# Overview of NLP

NLP allows computers to interact with text data in a structured and sensible way. In short, we will be breaking up series of texts into individual words (or groups of words), and isolating the words with **semantic value**.  We will then compare texts with similar distributions of these words, and group them together.

In this section, we will discuss some steps and approaches to common text data analytic procedures. Some of the applications of natural language processing are:
- Chatbots 
- Speech recognition and audio processing 
- Classifying documents 

Here is an example that uses some of the tools we use in this notebook.  
  - [chicago_justice classifier](https://github.com/chicago-justice-project/article-tagging/blob/master/lib/notebooks/bag-of-words-count-stemmed-binary.ipynb)

We will introduce you to the preprocessing steps, feature engineering, and other steps you need to take in order to format text data for machine learning tasks. 

We will also introduce you to [**NLTK**](https://www.nltk.org/) (Natural Language Toolkit), which will be our main tool for engaging with textual data.

<img src="images/nlp_process.png" style="width:1000px;">

# Preprocessing for NLP

The goal when pre-processing text data for NLP is to remove as many unnecessary words as possible while preserving as much semantic meaning as possible. Done well, this reduces noise in the features and can noticeably improve model performance.

You can think of this sort of like dimensionality reduction. The unique words in your corpus form a **vocabulary**, and each word in your vocabulary is essentially another feature in your model. So we want to get rid of unnecessary words and consolidate words that have similar meanings.
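
To make the vocabulary idea concrete, here is a minimal sketch on a made-up three-document corpus (the documents and variable names are illustrative and are not part of our dataset). Each unique token becomes one feature, so even a simple consolidation step such as lowercasing shrinks the feature space.

```python
# Toy corpus, invented for illustration only
toy_corpus = [
    "The cat sat",
    "the cat ran",
    "A dog ran",
]

# Vocabulary = set of unique whitespace-separated tokens across all documents
raw_vocab = {tok for doc in toy_corpus for tok in doc.split()}
lower_vocab = {tok.lower() for doc in toy_corpus for tok in doc.split()}

print(len(raw_vocab), sorted(raw_vocab))      # 'The' and 'the' count separately
print(len(lower_vocab), sorted(lower_vocab))  # one feature fewer after lowercasing
```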

We will be working with a dataset which includes both satirical (The Onion) and real news (Reuters) articles. We refer to the entire set of articles as the **corpus**.  

![the_onion](images/the_onion.jpeg) ![reuters](images/reuters.png)

In [None]:
corpus = pd.read_csv('data/satire_nosatire.csv')
corpus.shape

In [None]:
corpus.tail()

Our goal is to detect satire, so our target class of 1 is associated with The Onion articles.  

In [None]:
corpus.loc[10].body

In [None]:
corpus.loc[10].target

In [None]:
corpus.loc[502].body

In [None]:
corpus.loc[502].target

Each article in the corpus is referred to as a **document**.

It is a balanced dataset with 500 documents of each category. 

In [None]:
corpus.target.value_counts()

**Discussion:** Let's think about the use cases of being able to correctly separate satirical from authentic news. What might be a real-world use case?  

In [None]:
# Thoughts here



## Tokenization 

In order to convert the texts into data suitable for machine learning, we need to break down the documents into smaller parts. 

The first step in doing that is **tokenization**.

Tokenization is the process of splitting documents into units of observation. We usually represent the tokens as __n-grams__, where n represents the number of consecutive words occurring in a document that we will treat as a single unit. In the case of unigrams (one-word tokens), the sentence "David works here" would be tokenized into:

- "David", "works", "here".

If we also want to consider bigrams, we would add:

- "David works" and "works here".
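
The n-gram idea can be sketched in a few lines of plain Python. The helper `make_ngrams` below is a hypothetical name introduced for illustration; NLTK offers its own n-gram utilities, which this sketch does not rely on.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "David works here".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)

print(unigrams)  # ['David', 'works', 'here']
print(bigrams)   # ['David works', 'works here']
```

A document with fewer than n tokens simply yields an empty list, so the helper is safe to call on very short texts.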

Let's consider a particular document in our corpus:

In [None]:
sample_document = corpus.iloc[1].body

In [None]:
sample_document

There are many ways to tokenize our document. 

It is a long string, so the first way we might consider is to split it by spaces.

**Knowledge Check:** How would we split our documents into words using spaces?

<p>
</p>
<details>
    <summary><b><u>Click Here for Answer Code</u></b></summary>

    sample_document.split(' ')
    
</details>

In [None]:
# code


But this is not ideal. We are trying to create a set of tokens with **high semantic value**.  In other words, we want to isolate text which best represents the meaning in each document.
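
One reason splitting on a single space falls short can be seen on a small made-up string (not taken from the corpus): repeated spaces produce empty tokens, newlines stay glued inside tokens, and punctuation remains attached either way.

```python
text = "Wait,  what?\nNo way."  # made-up string with a double space and a newline

by_space = text.split(' ')      # split on each individual space character
by_whitespace = text.split()    # split on any run of whitespace

print(by_space)       # ['Wait,', '', 'what?\nNo', 'way.']
print(by_whitespace)  # ['Wait,', 'what?', 'No', 'way.']
```

Calling `split()` with no argument handles the whitespace problems, but the punctuation and capitalization issues still need the cleaning steps that follow.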

# Text Cleaning

Most NL Pre-Processing will include the following tasks:

  1. Remove capitalization  
  2. Remove punctuation  
  3. Remove stopwords  
  4. Remove numerals

We could manually perform all of these tasks with string operations.

## Capitalization

When we create our matrix of words associated with our corpus, **capital letters** will mess things up.  The semantic value of a word used at the beginning of a sentence is the same as that same word in the middle of the sentence.  In the two sentences:

sentence_one =  "Excessive gerrymandering in small counties suppresses turnout."   
sentence_two =  "Turnout is suppressed in small counties by excessive gerrymandering."  

'Excessive' and 'excessive' have the same semantic value, but they will be treated as different tokens because of the capital letter.

In [None]:
sentence_one =  "Excessive gerrymandering in small counties suppresses turnout." 
sentence_two =  "Turnout is suppressed in small counties by excessive gerrymandering."

Excessive = sentence_one.split(' ')[0]
excessive = sentence_two.split(' ')[-2]
print(excessive, Excessive)
excessive == Excessive

In [None]:
manual_cleanup = [word.lower() for word in sample_document.split(' ')]

In [None]:
print(f"Our initial token set for our sample document is {len(manual_cleanup)} words long")

In [None]:
print(f"Our initial token set for our sample document has \
{len(set(sample_document.split(' ')))} unique words")

In [None]:
print(f"After removing capitals, our sample document has \
{len(set(manual_cleanup))} unique words")

## Punctuation

As with capitals, punctuation will muck up our semantics: splitting on white space creates tokens that still have punctuation attached.  

Returning to the above example, 'gerrymandering' and 'gerrymandering.' will be treated as different tokens.

In [None]:
no_punct = sentence_one.split(' ')[1]
punct = sentence_two.split(' ')[-1]
print(no_punct, punct)
no_punct == punct

In [None]:
## Manual removal of punctuation

string.punctuation

In [None]:
manual_cleanup = [s.translate(str.maketrans('', '', string.punctuation))\
                  for s in manual_cleanup]

In [None]:
print(f"After removing punctuation, our sample document has \
{len(set(manual_cleanup))} unique words")

In [None]:
manual_cleanup[:20]
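
Note that `translate` deletes punctuation wherever it occurs, including inside a token. The sketch below, on a few made-up tokens, contrasts that with `str.strip`, which only trims punctuation from the two ends. Which behavior is preferable depends on the task; this is an illustration rather than a recommendation.

```python
import string

tokens = ["i'd", "gerrymandering.", "(turnout)", "well-known"]  # made-up examples

# Delete every punctuation character, wherever it appears in the token
remove_all = str.maketrans('', '', string.punctuation)
translated = [tok.translate(remove_all) for tok in tokens]

# Strip punctuation only from the start and end of each token
stripped = [tok.strip(string.punctuation) for tok in tokens]

print(translated)  # ['id', 'gerrymandering', 'turnout', 'wellknown']
print(stripped)    # ["i'd", 'gerrymandering', 'turnout', 'well-known']
```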

## Stopwords

Stopwords are the **filler** words in a language: prepositions, articles, conjunctions. They have low semantic value, and often need to be removed.  

Luckily, NLTK has lists of stopwords ready for our use.

In [None]:
stopwords.words('english')[:10]

In [None]:
stopwords.words('greek')[:10]

Let's see which stopwords are present in our sample document.

In [None]:
stops = [token for token in manual_cleanup if token in stopwords.words('english')]
stops[:10]

In [None]:
print(f'There are {len(stops)} instances of {len(set(stops))} \
stopwords in the sample document')

In [None]:
print(f'The {len(stops)} instances make up \
{len(stops)/len(manual_cleanup): 0.2%} of our text')
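
The same filtering logic can be sketched on a toy example. The stop list here is a small hand-written set standing in for `stopwords.words('english')`, purely as an assumption for illustration. Storing the stopwords in a `set` built once also keeps each membership test fast on longer documents.

```python
# Small hand-written stop list, a stand-in for NLTK's English list
toy_stopwords = {"the", "is", "in", "by", "a"}

tokens = ["turnout", "is", "suppressed", "in", "small", "counties",
          "by", "excessive", "gerrymandering"]

kept = [tok for tok in tokens if tok not in toy_stopwords]
removed = [tok for tok in tokens if tok in toy_stopwords]

print(kept)
print(f"{len(removed) / len(tokens):.2%} of the tokens were stopwords")
```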

Let's also use the **FreqDist** tool to look at the makeup of our text before and after removal:

In [None]:
fdist = FreqDist(manual_cleanup)
plt.figure(figsize=(10, 10))
fdist.plot(30);
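
`FreqDist` is built on Python's `collections.Counter`, so the counting behind the plot can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

tokens = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]  # made-up tokens

counts = Counter(tokens)          # maps each token to its number of occurrences
top_two = counts.most_common(2)   # the two most frequent tokens with their counts

print(top_two)  # [('the', 3), ('and', 2)]
```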

In [None]:
manual_cleanup = [token for token in manual_cleanup if\
                  token not in stopwords.words('english')]

In [None]:
sample_document

In [None]:
manual_cleanup[:10]

In [None]:
# We can also customize our stopwords list

custom_sw = stopwords.words('english')
custom_sw.extend(["i'd","say"] )
custom_sw[-10:]

In [None]:
# Apply the customized list; the default English list was already applied above
sw = custom_sw
manual_cleanup = [token for token in manual_cleanup if token not in sw]

In [None]:
print(f'After removing stopwords, there are {len(set(manual_cleanup))} unique words left')

In [None]:
fdist = FreqDist(manual_cleanup)
plt.figure(figsize=(10, 10))
fdist.plot(30);

#### Numerals

Numerals also usually have low semantic value. Their removal can help improve our models. 

In [None]:
manual_cleanup = [s.translate(str.maketrans('', '', '0123456789')) \
                  for s in manual_cleanup]

In [None]:
# drop empty strings

manual_cleanup = [s for s in manual_cleanup if s != '']

In [None]:
print(f'After removing numerals, there are {len(set(manual_cleanup))} unique words left')
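
The four manual steps can be collected into one helper. This is a minimal sketch under stated assumptions: `basic_clean` is a hypothetical name, the stop list is a small hand-written stand-in for NLTK's, and digits are removed before the stopword filter rather than after it, which differs slightly from the order used in the cells above.

```python
import string

def basic_clean(text, stop_words):
    """Lowercase, delete punctuation and digits, then drop stopwords and empty tokens."""
    drop_chars = str.maketrans('', '', string.punctuation + string.digits)
    tokens = [tok.lower().translate(drop_chars) for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in stop_words]

toy_stopwords = {"in", "is", "by", "the"}  # illustrative stand-in list
sentence = "Turnout is suppressed in 3 small counties by excessive gerrymandering."

cleaned = basic_clean(sentence, toy_stopwords)
print(cleaned)
```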

# Regex

Regex allows us to match strings based on a pattern.  This pattern comes from a language of identifiers, which we can begin exploring on the cheatsheet found here:
  -   https://regexr.com/

A few key symbols:
  - `.` : matches any character (except a newline, by default)  
  - `\d`, `\w`, `\s` : match a digit, a word character, a whitespace character  
  - `*`, `?`, `+` : match 0 or more, 0 or 1, 1 or more of the preceding element  
  - `[A-Z]` : matches any capital letter  
  - `[a-z]` : matches any lowercase letter  

Other helpful resources:
  - https://regexcrossword.com/
  - https://www.regular-expressions.info/tutorial.html

We can use regex to isolate numerals:

In [None]:
sample_document

In [None]:
pattern = '[0-9]'
number = re.findall(pattern, sample_document)
number

In [None]:
pattern2 = '[0-9]+'
number2 = re.findall(pattern2, sample_document)
number2

## `RegexpTokenizer()`

Sklearn and NLTK provide us with a suite of **tokenizers** for our text preprocessing convenience. So we don't have to do this all by hand every time!

In [None]:
sample_document

In [None]:
# Remember that the '?' indicates 0 or 1!

re.findall(r"([a-zA-Z]+(?:'[a-z]+)?)", "I'd")

In [None]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
tokenizer = RegexpTokenizer(pattern)
sample_doc = tokenizer.tokenize(sample_document)
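
To see what this pattern keeps and what it drops, here is a small sketch that applies the same expression with `re.findall` to a made-up sentence. It uses `re` directly rather than `RegexpTokenizer`, so treat it as an illustration of the pattern rather than of the tokenizer's internals.

```python
import re

# One or more letters, optionally followed by an apostrophe and more lowercase letters
pattern = r"[a-zA-Z]+(?:'[a-z]+)?"
text = "I'd say it's David's 2nd try."  # made-up sentence

tokens = re.findall(pattern, text)
print(tokens)  # ["I'd", 'say', "it's", "David's", 'nd', 'try']
```

Contractions and possessives survive as single tokens, while digits and punctuation are skipped. Note the side effect on "2nd": the digit is dropped and the fragment "nd" is left behind.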

In [None]:
sample_doc = [token.lower() for token in sample_doc]
sample_doc = [token for token in sample_doc if token not in sw]

In [None]:
sample_document

In [None]:
sample_doc[:10]

In [None]:
print(f'We are down to {len(set(sample_doc))} unique words')

# Exercise: NL Pre-Processing

**Activity:** Use what you've learned to preprocess the fourth article. How do the length and the number of unique words in the article change?


In [None]:
## Your code here
