## Text Normalization

Normalizing text is a critical part of text mining and a step we'll take on every single analysis. With enough practice it becomes second nature. This notebook accompanies the lecture, where we mention six common types of text normalization:

1. Case folding
1. Removing punctuation
1. Handling numbers, dates, and times
1. Extracting special information
1. Removing stopwords
1. Correcting spelling

We'll work through a few examples of most of these, although we'll save spelling correction for another day.
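
Before we dig into the corpora, here's a minimal sketch of how a few of these steps might chain together on a toy sentence. The `normalize` function, the sample sentence, and the `<num>` placeholder are all made up for illustration; they aren't from the lecture, and a real pipeline would make these choices more carefully.

```python
import re
from string import punctuation

def normalize(text):
    """Toy normalizer: case-fold, strip edge punctuation, mask digit runs."""
    tokens = []
    for raw in text.split():
        token = raw.lower()                      # case folding
        token = token.strip(punctuation)         # drop leading/trailing punctuation
        token = re.sub(r"\d+", "<num>", token)   # hypothetical placeholder for numbers
        if token:                                # skip tokens that were only punctuation
            tokens.append(token)
    return tokens

sample = "The year 1851 was BUSY: 3 ships, 42 sailors!"
normalized = normalize(sample)
print(normalized)
```

Note that the order matters here: the placeholder is added after stripping punctuation so its angle brackets survive.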

In [2]:
import nltk
from nltk.book import *
from collections import Counter
from nltk.corpus import stopwords

from string import punctuation

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### 1. Case Folding

We'll often discover that having a mixture of upper and lower case doesn't serve us very well. Case folding helps us handle this. Let's start by looking at the 1000 most frequent tokens in the chat corpus and finding every word that shows up there with more than one capitalization.

In [None]:
chat = text5
chat_count = Counter(chat)

I'll use a dictionary to hold all the words in the top 1000. The key will be the lowercase word and the value will be a list of every word that maps onto that lowercase word. 

In [None]:
case_collisions = dict() # maps each lowercase word to the list of forms seen for it

for word, count in chat_count.most_common(1000) :
    lc_word = word.lower()
    
    if lc_word not in case_collisions : 
        case_collisions[lc_word] = [word]
    else :
        case_collisions[lc_word].append(word)

In [None]:
for word, wlist in case_collisions.items() :
    if len(wlist) > 1 :
        print(f'The words {",".join(wlist)} map onto {word}') 
        # using the new-ish f strings
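
If the `if`/`else` bookkeeping above feels clunky, `collections.defaultdict` can do the same grouping. This is just a sketch on a made-up token list (not the chat corpus), and it stores sets rather than lists so repeated forms are only kept once.

```python
from collections import defaultdict

# Toy token list standing in for the chat corpus (an assumption for illustration).
tokens = ["LOL", "lol", "Hi", "hi", "HI", "ok", "lol"]

# Map each lowercase form to the set of surface forms seen for it.
variants = defaultdict(set)
for token in tokens:
    variants[token.lower()].add(token)

# Keep only lowercase forms that appear with more than one capitalization.
collisions = {lc: sorted(forms) for lc, forms in variants.items() if len(forms) > 1}
print(collisions)
```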

Now for a slightly easier one: how many times are "the" and "The" used in _Moby Dick_?

In [10]:
Counter(text1)['The']

612

In [9]:
Counter(text1)['the']

13721
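
Case folding merges counts like these into one. Here's a small sketch on a made-up token list (not _Moby Dick_) comparing raw counts with counts taken after lowercasing.

```python
from collections import Counter

# Made-up tokens for illustration only.
tokens = ["The", "whale", "and", "the", "sea", "THE", "Whale"]

raw_counts = Counter(tokens)
folded_counts = Counter(token.lower() for token in tokens)

# Before folding, each capitalization is its own entry.
print(raw_counts["The"], raw_counts["the"], raw_counts["THE"])
# After folding, they all land on the lowercase key.
print(folded_counts["the"], folded_counts["whale"])
```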

### 2. Punctuation

Punctuation can be tricky to handle. The easiest thing is to remove it, but that's not always the best thing to do. To practice playing around with it, count the number of **unique** words in _Beowulf_ that have punctuation in them. Print out a few to look at (although there are a lot, so maybe don't print them all).

In [None]:
with open("beowulf.txt") as infile : # closes the file once we've read it
    beowulf = infile.read()

In [None]:
# Let's grab every word with punctuation. 
# One straightforward way to do this is to make punctuation a 
# set and intersect it with the set of characters in the word. 

punct_set = set(punctuation)
punct_words = set() # since we want uniques

for word in beowulf.split() :
    wset = set(word)
    if punct_set.intersection(wset) :
        punct_words.add(word)
    
print(len(punct_words))

# Let's print 20 or so
print(list(punct_words)[:20])


In [None]:
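# While we're here, we can use the `str.isalnum` method to test if a string is alphanumeric. 
# This makes the code much simpler, though it flags any non-alphanumeric character, not just
# those in `string.punctuation`, so the count can come out higher than the one above.
# There are also methods like isalpha and isnumeric:
# https://docs.python.org/3/library/stdtypes.html#str.isalpha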
punct_set_2 = set() 

for word in beowulf.split() :
    if not word.isalnum() :
        punct_set_2.add(word)

print(len(punct_set_2))

Lots of that punctuation is at the end of words (e.g., "gallows." and "vain;"). Let's count the number of words that have punctuation in the _middle_ of the word. Let's also throw them in a `Counter` object and look at the most common. 

In [None]:
punct_mid_words = [] # Use a list so we can use a counter. 

for word in beowulf.split() :
    if not word.isalnum() and len(word) > 1:
        # Now we know there's punctuation somewhere in the word.
        # Keep it only if neither the first nor the last character is punctuation.
        if (word[0] not in punctuation and
            word[-1] not in punctuation) :
            punct_mid_words.append(word)


In [None]:
Counter(punct_mid_words).most_common(20)
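
One variation worth trying: strip punctuation off both ends first, then check what's left. Unlike the loop above, this sketch also catches words like "ring-giver," that have punctuation in the middle _and_ at the end. The helper name `has_internal_punct` and the sample phrase are made up for illustration.

```python
from string import punctuation

def has_internal_punct(word):
    """True if punctuation remains after stripping it from both ends."""
    core = word.strip(punctuation)
    return any(ch in punctuation for ch in core)

# Made-up sample, not a quotation from the poem.
sample = "The well-known ring-giver, o'er the whale-road; (hail!)".split()
internal = [w for w in sample if has_internal_punct(w)]
print(internal)
```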

### 3. Stopwords

There are many common words that don't help analysis that much (and can take up a lot of space). These are called stopwords. Let's play around with the English stopwords.
1. Load in the English stopwords and assign them to a variable called `sw`. Print them out. Any surprises?
1. Look at the top words in _Moby Dick_ and _Sense and Sensibility_.
1. Look at the top words in both of those that _aren't_ stopwords. 

In [None]:
sw = stopwords.words("english")
sw

In [None]:
Counter(text1).most_common(10)

In [None]:
Counter(text2).most_common(10)

To look at the same stats but without stopwords and non-alpha strings, I'm going to use a list comprehension. If you haven't seen these before, here's a nice [tutorial](https://www.youtube.com/watch?v=AhSvKGTh28Q). 

In [None]:
Counter([w for w in text1 if w.lower() not in sw and w.isalpha()]).most_common(10)

In [None]:
Counter([w for w in text2 if w.lower() not in sw and w.isalpha()]).most_common(10)
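
Here's the same filtering idea on a toy example, in case the list comprehension is hard to read at full speed. The stopword list below is a tiny made-up stand-in for NLTK's, stored as a `frozenset` so membership checks are fast. Unlike the cells above, this sketch also lowercases the tokens it keeps, so "Whale" and "whale" would be counted together.

```python
from collections import Counter

# Hypothetical mini stopword list; NLTK's English list is much longer.
toy_stopwords = frozenset({"the", "a", "of", "and", "to", "in", "was"})

tokens = "The whale and the ship sailed to the edge of the sea , and the whale dove".split()

# Keep alphabetic tokens that aren't stopwords, folded to lowercase.
content = [t.lower() for t in tokens
           if t.isalpha() and t.lower() not in toy_stopwords]

top = Counter(content).most_common(2)
print(content)
print(top)
```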

## Stemming

Stemming is the process by which we move from a token to some "root" of that word. Let's explore one of the stemmers available through NLTK.

First, let's find all the words in the NLTK words corpus that end in "ing", then let's find those that have no vowels before an instance of "ing". You can access the words corpus with the confusing call of `nltk.corpus.words.words()`. To make it easier to deal with "y", let's just consider it a vowel.

In [None]:
words = nltk.corpus.words.words()
vowels = set('aeiouy')

In [None]:
ing_words = [w for w in words if len(w) > 3 and w.endswith("ing")]

In [None]:
len(ing_words)

In [None]:
# Now let's find the subset that don't have a vowel before the 'ing'
ing_no_vowel = []

for word in ing_words :
    remainder = word[:-3].lower() # lowercase so capitalized vowels count too
    if len(set(remainder).intersection(vowels))==0 :
        ing_no_vowel.append(word)
        
ing_no_vowel

Now let's play around with the Porter Stemmer in NLTK. First we'll look at a couple hundred tokens of the inaugural addresses, both stemmed and not stemmed.

In [None]:
porter = nltk.PorterStemmer() # give it a short name.
start = 30000
distance = 200

print(" ".join(text4[start:(start + distance)]))
print("\n\n")
print(" ".join([porter.stem(w) for w in text4[start:(start + distance)]]))



Now for you: how many unique words are in the inaugural addresses? How many unique lowercase stems are in them?

In [None]:
# unique tokens in the inaugural addresses
print(len(set(text4)))

In [None]:
inaug_stemmed = {porter.stem(w.lower()) for w in text4}

print(len(inaug_stemmed))

print(len(set(text4))/len(inaug_stemmed))
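
That last ratio tells us roughly how many distinct tokens collapse onto each stem. To see the idea without loading a corpus, here's a sketch with a deliberately crude suffix stripper. To be clear, `crude_stem` is a made-up stand-in and is _not_ the Porter algorithm, which applies many more rules.

```python
def crude_stem(word):
    """Strip one common suffix if at least three letters remain. NOT Porter."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Made-up tokens for illustration.
tokens = ["Walk", "walks", "walked", "walking", "talk", "Talks", "talking", "sea"]

types = set(tokens)                                # distinct surface forms
stems = {crude_stem(t.lower()) for t in tokens}    # distinct lowercase stems

print(len(types), len(stems))
print(len(types) / len(stems))                     # tokens per stem
```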

---

Okay, let's have some "fun" and play around with some sets of characters that aren't words. Text 5 is the chat corpus. Find the emojis in there (really text emoticons like `:)`; it doesn't have to be perfect) and count up the happy and sad ones.

In [None]:
chat = text5 # give it a nice name. 

# Let's find emojis in chat. 
potential_emojis = {w for w in chat if ":" in w or ";" in w or "=" in w}

In [None]:
potential_emojis

Clearly we're catching some non-emojis, but let's assume we're getting most of the list. 

In [None]:
# Count happy vs sad
happy = [w for w in chat if w in {":-)",":)",":D",";-)","=)"}]
sad = [w for w in chat if w in {":-(",":(",";-(","=("}]

print(len(happy))
print(len(sad))
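
If you want something a bit tighter than hand-written sets, a regular expression can describe the shape of an emoticon: eyes, an optional nose, then a mouth. The pattern below is my own rough guess at that shape, and the token list is made up, so treat this as a sketch rather than a complete emoticon matcher.

```python
import re

# Eyes (: ; =), optional nose (-), mouth ( ")" "(" or "D" ). An assumed pattern.
EMOTICON = re.compile(r"[:;=]-?[()D]")

# Made-up tokens standing in for the chat corpus.
tokens = [":)", "hi", ":-(", ";-)", "10:30", "=D", ":(", "lol", ":)"]

# fullmatch requires the whole token to fit the pattern, so "10:30" is skipped.
found = [t for t in tokens if EMOTICON.fullmatch(t)]
happy_toy = [t for t in found if t[-1] in ")D"]
sad_toy = [t for t in found if t[-1] == "("]

print(len(happy_toy))
print(len(sad_toy))
```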