## Text Normalization

Normalizing text is a critical component of text mining and a step we'll take in every single analysis. With practice it becomes second nature. This notebook accompanies the lecture, where we mention six common types of text normalization:

1. Case folding
1. Removing punctuation
1. Handling numbers, dates, and times
1. Extracting special information
1. Removing stopwords
1. Correcting spelling

We'll work through a few examples of most of these, although we'll save spelling correction for another day.

In [13]:
import nltk
from nltk.book import *
from collections import Counter
from nltk.corpus import stopwords

from string import punctuation

### 1. Case Folding

We'll often discover that having a mixture of upper and lower case doesn't serve us very well. Case folding helps us handle this. Let's start with the 1000 most frequent tokens in the chat corpus and find the words that appear there with more than one capitalization.

In [14]:
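# Your code here
texts()
chat = text5
chat_count = Counter(chat)

# Map each lowercase form to the capitalizations seen among the top 1000 tokens.
collisions = dict()

for word, count in chat_count.most_common(1000) :
    lc_word = word.lower()  # call the method; `word.lower` alone is a method object

    if lc_word not in collisions :
        collisions[lc_word] = [word]
    else :
        collisions[lc_word].append(word)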

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [15]:
for word, wlist in collisions.items():
    if len(wlist) > 1 :
        print(f'The words {",".join(wlist)} map onto {word}')
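
The grouping step can also be sketched on a small, made-up token list, which makes the result easy to check by eye. The tokens and the `toy_` names here are hypothetical stand-ins for the chat corpus, not part of the lecture code.

```python
from collections import Counter, defaultdict

# Hypothetical toy tokens standing in for the chat corpus.
toy_tokens = ["LOL", "lol", "Hi", "hi", "hi", "ok", "Lol", "bye"]
toy_token_counts = Counter(toy_tokens)

# Group each distinct surface form under its lowercase version.
toy_variants = defaultdict(list)
for word in toy_token_counts:
    toy_variants[word.lower()].append(word)

# Keep only the lowercase forms that appear with more than one capitalization.
toy_collisions = {
    lc: sorted(forms) for lc, forms in toy_variants.items() if len(forms) > 1
}
print(toy_collisions)
```

Using `defaultdict(list)` removes the need for the explicit `if`/`else` branch when a lowercase form is seen for the first time.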

Now for a slightly easier one, how many times are "the" and "The" used in _Moby Dick_? 

In [16]:
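# Your code here
texts()
# Note: `'the' or 'The'` evaluates to just 'the', so indexing with it only
# returns the lowercase count. Look up 'The' separately and add the two
# counts to get the combined total.
Counter(text1)['the']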

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


13721
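
A caution on this kind of lookup: in Python, `'the' or 'The'` evaluates to `'the'` before the `Counter` is indexed, so it retrieves only the lowercase count. The sketch below uses a hypothetical toy token list (not _Moby Dick_) to show the difference between that lookup, summing two lookups, and case folding before counting.

```python
from collections import Counter

# Hypothetical toy tokens; with NLTK loaded you could use text1 instead.
toy_tokens = ["The", "whale", "and", "the", "sea", "and", "the", "ship"]
toy_counts = Counter(toy_tokens)

# The `or` expression yields 'the', so this is a single lowercase lookup.
lowercase_only = toy_counts['the' or 'The']

# Summing two lookups counts both capitalizations.
both_forms = toy_counts['the'] + toy_counts['The']

# Case folding before counting gives the same total in one step
# (it would also fold in any other capitalization, such as 'THE').
folded_total = Counter(w.lower() for w in toy_tokens)['the']

print(lowercase_only, both_forms, folded_total)
```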

### 2. Punctuation

Punctuation can be tricky to handle. The easiest thing is to remove it, but that's not always the best thing to do. To practice playing around with it, count the number of **unique** words that have punctuation in them in _Beowulf_. Print out a few to look at (although there are a lot, so maybe don't print them all).

In [17]:
beowulf = open("beowulf.txt").read()

In [18]:
# Your code here

punct_set = set(punctuation)
punct_words = set() # since we want uniques

for word in beowulf.split() :
    wset = set(word)
    if punct_set.intersection(wset) :
        punct_words.add(word)
    
print(len(punct_words))

# Let's print 20 or so
print(list(punct_words)[:20])

3478
['wiles,', "God's", 'burgstead!', 'chose.', 'other.', 'Gold-gay', 'well-known', 'gold-crown', 'gone.', '{35a}', 'kin;', 'Gifths,', 'taunt,', 'threatened.', 'battle,', 'precious,', 'watch,', 'face-cloth.', 'freight!', 'man.']


Now let's count the number of words that have punctuation in the _middle_ of the word. Let's also throw them in a `Counter` object and look at the most common. 

In [19]:
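# Your code here

# Caveat: `not word.isalnum()` flags any token containing a non-alphanumeric
# character, wherever it appears, so leading and trailing punctuation is
# counted too. Isolating mid-word punctuation needs an extra step, such as
# stripping punctuation from both ends before checking.
punct_set_2 = set() 

for word in beowulf.split() :
    if not word.isalnum() :
        punct_set_2.add(word)

print(len(punct_set_2))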

3478
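
One way to isolate punctuation in the _middle_ of a word is to strip punctuation from both ends of each token and then check whether any remains. The sketch below is a minimal version of that idea on a made-up sentence; the helper name `has_internal_punct` and the toy text are assumptions for illustration, and on the real data you would pass `beowulf.split()` instead.

```python
from collections import Counter
from string import punctuation

punct_chars = set(punctuation)

def has_internal_punct(token):
    """True if punctuation remains after stripping it from both ends."""
    core = token.strip(punctuation)
    return any(ch in punct_chars for ch in core)

# Hypothetical toy text standing in for Beowulf.
toy_text = "The gold-crown, well-known to God's kin; gone. The well-known hall!"

# A Counter keeps frequencies, so the most common forms can be inspected.
internal = Counter(w for w in toy_text.split() if has_internal_punct(w))

print(len(internal))
print(internal.most_common(3))
```

Tokens such as `kin;` or `gone.` are excluded because their only punctuation sits at the edge, while hyphenated compounds and possessives are kept.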


### 3. Stopwords

There are many common words that don't help analysis that much (and can take up a lot of space). These are called stopwords. Let's play around with the English stopwords.
1. Load in the English stopwords and assign them to a variable called `sw`. Print them out. Any surprises?
1. Look at the top words in _Moby Dick_ and _Sense and Sensibility_.
1. Look at the top words in both of those that _aren't_ stopwords. 

In [23]:
# Your code here

sw = stopwords.words("english")
sw

In [24]:
Counter(text1).most_common(10)

[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982)]

In [25]:
Counter(text2).most_common(10)

[(',', 9397),
 ('to', 4063),
 ('.', 3975),
 ('the', 3861),
 ('of', 3565),
 ('and', 3350),
 ('her', 2436),
 ('a', 2043),
 ('I', 2004),
 ('in', 1904)]

In [26]:
Counter([w for w in text1 if w.lower() not in sw and w.isalpha()]).most_common(10)

[('whale', 906),
 ('one', 889),
 ('like', 624),
 ('upon', 538),
 ('man', 508),
 ('ship', 507),
 ('Ahab', 501),
 ('ye', 460),
 ('old', 436),
 ('sea', 433)]
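
Since the same filter applies to any text, it can be wrapped in a small helper. The sketch below is one possible version; the function name `top_content_words` and the toy inputs are hypothetical, and in the notebook you could call it with `text2` and `sw`.

```python
from collections import Counter

def top_content_words(tokens, stopword_list, n=10):
    """Most common alphabetic tokens whose lowercase form is not a stopword."""
    stop_set = set(stopword_list)  # set membership is faster than scanning a list
    kept = [w for w in tokens if w.isalpha() and w.lower() not in stop_set]
    return Counter(kept).most_common(n)

# Hypothetical toy inputs for illustration.
toy_tokens = ["The", "whale", ",", "the", "ship", "and", "the", "whale", "."]
toy_stopwords = ["the", "and", "a", "of"]

toy_top = top_content_words(toy_tokens, toy_stopwords, n=2)
print(toy_top)
```

Like the cell above, this compares the lowercase form against the stopword list but counts the original capitalization, so `Ahab` and `ahab` would be tallied separately.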

## Stemming

Stemming is the process by which we move from a token to some "root" of that word. Let's explore one of the stemmers available through NLTK.

First, let's find all the words in the NLTK words corpus that end in "ing", then let's find those that have no vowels before an instance of "ing". You can access the words corpus with the confusing call of `nltk.corpus.words.words()`. To make it easier to deal with "y", let's just consider it a vowel.

In [27]:
# Your code here

words = nltk.corpus.words.words()
vowels = set('aeiouy')

In [28]:
ing_words = [w for w in words if len(w) > 3 and w.endswith("ing")]
len(ing_words)

5557
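
For the second part of the exercise, here is a minimal sketch that assumes "no vowels before `ing`" means the part of the word before the final "ing" contains none of `aeiouy`. The helper name and the toy word list are hypothetical; on the real data you could filter `ing_words` with the same function.

```python
toy_vowels = set("aeiouy")  # treating "y" as a vowel, as in the text

def no_vowel_before_ing(word):
    """True if the word ends in "ing" and the part before it has no vowels."""
    if len(word) <= 3 or not word.endswith("ing"):
        return False
    prefix = word[:-3].lower()
    return not any(ch in toy_vowels for ch in prefix)

# Hypothetical toy word list for illustration.
toy_words = ["sing", "string", "running", "thing", "flying", "ing", "Bring"]
toy_matches = [w for w in toy_words if no_vowel_before_ing(w)]
print(toy_matches)
```

Words like these matter for stemming because removing "ing" would leave a vowel-less fragment such as "s" or "str" rather than a usable root.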

Now let's play around with the Porter Stemmer in NLTK. First we'll look at a couple hundred tokens of the inaugural addresses, both unstemmed and stemmed.

In [29]:
porter = nltk.PorterStemmer() # give it a short name.
start = 30000
distance = 200

print(" ".join(text4[start:(start + distance)]))
print("\n\n")
print(" ".join([porter.stem(w) for w in text4[start:(start + distance)]]))



aid of that Almighty Power which has hitherto protected me and enabled me to bring to favorable issues other important but still greatly inferior trusts heretofore confided to me by my country . The broad foundation upon which our Constitution rests being the people -- a breath of theirs having made , as a breath can unmake , change , or modify it -- it can be assigned to none of the great divisions of government but to that of democracy . If such is its theory , those who are called upon to administer it must recognize as its leading principle the duty of shaping their measures so as to produce the greatest good to the greatest number . But with these broad admissions , if we would compare the sovereignty acknowledged to exist in the mass of our people with the power claimed by other sovereignties , even by those which have been considered most purely democratic , we shall find a most essential difference . All others lay claim to power limited only by their own will . The majority of

Now for you: how many distinct words are in the inaugural addresses? How many distinct lowercase stems are in them?

In [30]:
# Your code here

print(len(set(text4)))

10025


In [32]:
inaug_stemmed = {porter.stem(w.lower()) for w in text4}

print(len(inaug_stemmed))

5631


---

Okay, let's have some "fun" and play around with some sets of characters that aren't words. Text 5 is the chat corpus. Find the emojis in there (doesn't have to be perfect) and count up the happy and sad ones.

In [34]:
chat = text5 # give it a nice name. 

# Let's find emojis in chat. 
potential_emojis = {w for w in chat if ":" in w or ";" in w or "=" in w}

print(potential_emojis)

{':P', ':):):)', ':-)', '9:10', ';)', '2:55', "='s", '3:45', '6:38', ';p', '.;)', '=]', ']:)', '=(', ':/', ';', '=/', ':', '=)', ':-o', ':-(', '=p', ':blush:', ';0', '=O', ':(', ';-)', '.:', '=-\\', 'n;t', '10:49', '//www.wunderground.com/cgi-bin/findweather/getForecast?query=95953#FIR', ':-@', '6:53', ':o *', '6:41', 'http://www.shadowbots.com', '; ..', '4:03', '=', ';]', 'd=', '6:51', '7:45', '>:->', ':|', ':love:', '=D', ':tongue:', ';-(', '!=', ':]', 'o<|=D', 'capab;e', ':D', ':@', ':O', '=[', ':)', ':p', ':beer:', ':.', 'http://forums.talkcity.com/tc-adults/start '}


In [35]:
# Your code here

# Count happy vs sad
happy = [w for w in chat if w in {":-)",":)",":D",";-)","=)"}]
sad = [w for w in chat if w in {":-(",":(",";-(","=("}]

print(len(happy))
print(len(sad))

159
20
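
The membership tests above scan the corpus once per category. A sketch of an alternative counts every token once with `Counter` and then sums over each set, which also gives a per-symbol breakdown. The token list is a hypothetical stand-in for `text5`; the happy and sad sets are the ones used above.

```python
from collections import Counter

happy_set = {":-)", ":)", ":D", ";-)", "=)"}
sad_set = {":-(", ":(", ";-(", "=("}

# Hypothetical toy tokens standing in for the chat corpus.
toy_chat = ["hi", ":)", ":)", ":(", "lol", ":D", "9:10", "=("]
toy_chat_counts = Counter(toy_chat)

# A Counter returns 0 for missing keys, so unseen symbols add nothing.
happy_total = sum(toy_chat_counts[e] for e in happy_set)
sad_total = sum(toy_chat_counts[e] for e in sad_set)

# Per-symbol counts for the ones that actually occur.
breakdown = {e: toy_chat_counts[e] for e in happy_set | sad_set if toy_chat_counts[e]}

print(happy_total, sad_total)
print(breakdown)
```

Tokens such as `9:10` contain a colon but are not in either set, so they are ignored.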
