## Text Normalization

Normalizing text is a core part of text mining and a step we'll take in every analysis. With practice, it becomes second nature. This notebook accompanies the lecture, where we mention six common types of text normalization:

1. Case folding
1. Removing punctuation
1. Handling numbers, dates, and times
1. Extracting special information
1. Removing stopwords
1. Correcting spelling

We'll work through a few examples of most of these, although we'll save spelling correction for another day.

In [1]:
import nltk
from nltk.book import *
from collections import Counter
from nltk.corpus import stopwords

from string import punctuation

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### 1. Case Folding

We'll often discover that having a mixture of upper and lower case doesn't serve us very well. Case folding helps us handle this. Let's start by finding every word among the 1,000 most frequent tokens in the chat corpus that appears with more than one capitalization.
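
As a warm-up, here is a minimal sketch of one way to spot words with several capitalizations. The `tokens` list is made up for illustration and `top_n` stands in for the 1,000-word cutoff. Grouping surface forms by their lowercased version is just one possible approach.

```python
from collections import Counter, defaultdict

# Hypothetical toy tokens standing in for the chat corpus.
tokens = ["Hi", "hi", "HI", "lol", "LOL", "ok", "hi", "JOIN", "lol", "the"]

# Keep only the most frequent surface forms (the toy list is far shorter than the cutoff).
top_n = 1000
top_forms = [form for form, _ in Counter(tokens).most_common(top_n)]

# Group each surface form under its lowercased (case-folded) version.
variants = defaultdict(set)
for form in top_forms:
    variants[form.lower()].add(form)

# Keep only the folded words that showed up with more than one capitalization.
multi_case = {folded: sorted(forms) for folded, forms in variants.items() if len(forms) > 1}
print(multi_case)
```

The same pattern should carry over to `text5` by swapping it in for `tokens`.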

In [None]:
# Your code here

Now for a slightly easier one: how many times are "the" and "The" used in _Moby Dick_?

In [None]:
# Your code here

### 2. Punctuation

Punctuation can be tricky to handle. The easiest thing is to remove it, but that's not always the best thing to do. To practice working with it, count the number of **unique** words that have punctuation in them in _Beowulf_. Print out a few to look at (although there are a lot, so maybe don't print them all).
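
To make the idea concrete, here is a small sketch on a made-up sentence (not taken from the _Beowulf_ file). It assumes that "word" means a whitespace-separated chunk and that "punctuation" means any character in `string.punctuation`. Both are simplifying assumptions.

```python
from string import punctuation

# Made-up sentence for illustration only.
text = "Lo! the king's war-band rode, far-famed; lo! the hall stood."

# Unique whitespace-separated words.
words = set(text.split())

# Keep the words that contain at least one punctuation character.
punct_chars = set(punctuation)
with_punct = {w for w in words if any(ch in punct_chars for ch in w)}

print(len(with_punct))
print(sorted(with_punct)[:5])
```

Note that "Lo!" and "lo!" count as different words here because nothing has been case folded.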

In [None]:
with open("beowulf.txt") as f:  # the context manager closes the file for us
    beowulf = f.read()

In [None]:
# Your code here

Now let's count the number of words that have punctuation in the _middle_ of the word. Let's also throw them in a `Counter` object and look at the most common. 
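
One possible way to test for punctuation in the middle of a word is to strip punctuation from both ends and check whether any is left. The sketch below uses a made-up token list, and the helper name `has_internal_punct` is hypothetical.

```python
from collections import Counter
from string import punctuation

# Made-up tokens for illustration only.
tokens = ["war-band", "king's", "rode,", "far-famed;", "king's", "'tis", "hall", "war-band", "king's"]

def has_internal_punct(token):
    """Return True if punctuation remains after trimming it from both ends."""
    core = token.strip(punctuation)
    return any(ch in punctuation for ch in core)

# Count every token (not just unique ones) with punctuation inside it.
internal = Counter(t for t in tokens if has_internal_punct(t))
print(internal.most_common(2))
```

Tokens such as "rode," and "'tis" are skipped because their punctuation sits only at an edge.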

In [None]:
# Your code here

### 3. Stopwords

Many common words contribute little to an analysis (and can take up a lot of space). These are called stopwords. Let's play around with the English stopwords.
1. Load in the English stopwords and assign them to a variable called `sw`. Print them out. Any surprises?
1. Look at the top words in _Moby Dick_ and _Sense and Sensibility_.
1. Look at the top words in both of those that _aren't_ stopwords. 
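
Here is a rough sketch of the filtering step on toy data. The set `toy_sw` is a tiny hypothetical stand-in for the list you would get from `stopwords.words("english")`, and `top_words` is a helper name invented for this example. It lowercases tokens before the lookup because the NLTK English stopwords are all lowercase.

```python
from collections import Counter

# Tiny hypothetical stopword set and made-up tokens, for illustration only.
toy_sw = {"the", "of", "and", "a", "to", "in"}
tokens = ["The", "whale", "and", "the", "sea", ",", "the", "whale", "of", "the", "sea", "!"]

def top_words(tokens, stop=frozenset(), n=3):
    """Most common alphabetic tokens, lowercased, skipping anything in `stop`."""
    words = (t.lower() for t in tokens if t.isalpha())
    return Counter(w for w in words if w not in stop).most_common(n)

print(top_words(tokens))          # stopwords dominate the raw counts
print(top_words(tokens, toy_sw))  # content words surface once they are removed
```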

In [None]:
# Your code here

## Stemming

Stemming is the process by which we move from a token to some "root" of that word. Let's explore one of the stemmers available through NLTK.

First, let's find all the words in the NLTK words corpus that end in "ing", then let's find those that have no vowels before an instance of "ing". You can access the words corpus with the confusing call of `nltk.corpus.words.words()`. To make it easier to deal with "y", let's just consider it a vowel.
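
The sketch below shows one reading of this task on a made-up word list: keep words that end in "ing", then keep those with no vowel (counting "y") anywhere before the first "ing". That reading of "an instance" is an assumption, so feel free to use another.

```python
import re

# Made-up word list standing in for nltk.corpus.words.words().
words = ["sing", "singing", "string", "bring", "king", "eating", "ingot", "dying", "ring", "Tring"]

# Step 1: words that end in "ing" (case-insensitive).
ing_words = [w for w in words if w.lower().endswith("ing")]

# Step 2: only non-vowels may appear before the first "ing"; "y" counts as a vowel.
no_vowel_pattern = re.compile(r"[^aeiouy]*ing")
no_vowel_before_ing = [w for w in ing_words if no_vowel_pattern.match(w.lower())]

print(len(ing_words), len(no_vowel_before_ing))
print(no_vowel_before_ing)
```

Under this reading "singing" qualifies, since only "s" precedes its first "ing", while "eating" and "dying" do not.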

In [None]:
# Your code here

Now let's play around with the Porter Stemmer in NLTK. First we'll look at 200 tokens from the inaugural addresses, both unstemmed and stemmed.

In [None]:
porter = nltk.PorterStemmer() # give it a short name.
start = 30000
distance = 200

print(" ".join(text4[start:(start + distance)]))
print("\n\n")
print(" ".join([porter.stem(w) for w in text4[start:(start + distance)]]))



Now for you: how many words are in the inaugural addresses? How many lowercase stems are in them? 
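
As a hint at the shape of the computation, here is a sketch that compares a token count with a count of distinct lowercase stems. It reads "how many stems" as "how many distinct stems", which is an assumption. The function `toy_stem` is a crude, made-up stand-in that strips a few suffixes; it is not the Porter algorithm. In the notebook you would call `porter.stem` instead.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip one common suffix if the word is long enough."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Made-up tokens for illustration only.
tokens = ["Govern", "governs", "governed", "governing", "People", "people", "the", "The"]

n_tokens = len(tokens)                                 # every token counts
unique_stems = {toy_stem(t.lower()) for t in tokens}   # lowercase first, then stem

print(n_tokens, len(unique_stems))
```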

In [None]:
# Your code here

---

Okay, let's have some "fun" and play around with some sets of characters that aren't words. `text5` is the chat corpus. Find the emoticons in there (it doesn't have to be perfect) and count up the happy and sad ones.
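
One imperfect way to go about it is a small regular expression for emoticon-style tokens such as `:)` and `:-(`. The pattern and the happy/sad split below are deliberately narrow assumptions, and the token list is made up, so expect to widen both once you look at the real chat data.

```python
import re
from collections import Counter

# Made-up tokens standing in for the chat corpus.
tokens = ["hi", ":)", ":-)", ":(", "lol", ":D", ":-(", ";)", ":)", "10:30"]

# Eyes, optional nose, then a mouth. This misses many real emoticons on purpose.
emoticon = re.compile(r"[:;=]-?[()DPp]")

found = Counter(t for t in tokens if emoticon.fullmatch(t))

# Classify by the mouth character: ")" or "D" is happy, "(" is sad.
happy = sum(count for e, count in found.items() if e[-1] in ")D")
sad = sum(count for e, count in found.items() if e[-1] == "(")

print(found)
print(happy, sad)
```

Using `fullmatch` keeps tokens like "10:30" from being picked up just because they contain a colon.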

In [None]:
chat = text5 # give it a nice name. 

In [None]:
# Your code here