# Normalizing text

## The problem

The general situation is this:

* We have extracted the text as a string - a sequence of characters
* Now we want to convert that into a list of tokens. These will, generally speaking, correspond to words.
* We have to think, though, about what sequences of characters will count as instances of a token.
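To make that last point concrete, here is a minimal sketch of how two different answers to "what counts as a token" change the vocabulary we end up with. The sample sentence is made up for illustration and is not from the transcripts.

```python
import re

# Hypothetical sample sentence, not taken from the seasons corpus.
sample = "That's why the Earth orbits the sun. The sun, that's the center."

# Choice 1: split on whitespace only, so punctuation and case are kept.
whitespace_tokens = sample.split()

# Choice 2: lowercase first, then keep runs of letters and apostrophes.
word_tokens = re.findall(r"[a-z']+", sample.lower())

print(len(set(whitespace_tokens)), "distinct tokens with whitespace splitting")
print(len(set(word_tokens)), "distinct tokens after lowercasing and dropping punctuation")
```

Under the first choice, "sun." and "sun," are two different tokens; under the second they collapse into "sun".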

I'm going to get one of the seasons transcripts for us to use.

In [None]:
import re
from xml.etree import ElementTree

from seasons_module import load_entire_directory


def separate_turns(raw_text):
    # Each turn looks like "speaker<TAB>utterance<NEWLINE>";
    # return a list of (speaker, utterance) pairs.
    return re.findall(r"(\w+?)\t(.*?)\n", raw_text)


# raw_corpus_files maps file names to raw XML strings (as used below).
raw_corpus_files = load_entire_directory('corpora/seasons')
corpus_dict = {}
for fname, raw_file in raw_corpus_files.items():
    new_etree = ElementTree.XML(raw_file)
    entry_text = new_etree.find("BODY/SEASONS").text
    student_name = new_etree.find("PNAME").text
    human_code = new_etree.find("TCODE").text  # not used below
    separated_text = separate_turns(entry_text)
    # Keep only the student's own turns, one per line.
    student_text = ""
    for turn in separated_text:
        if turn[0] == student_name:
            student_text += turn[1] + "\n"
    # Drop the ".xml" extension to get a short key such as "angelapre".
    short_name = re.sub(r"\.xml", "", fname)
    corpus_dict[short_name] = student_text

In [None]:
corpus_dict.keys()

Let's work with the transcript for **angela** (the `angelapre` entry).

In [None]:
txt = corpus_dict["angelapre"]

In [None]:
txt

## Part 1: Tokenize the text

In [None]:
def my_tokenizer(txt):
    words = re.findall(r"(\S*)\s", txt)
    return words

In [None]:
print(my_tokenizer(txt))

Some issues at this point:

* Do we want "That's" and "that's" to be the same token? If so, we should convert to lower case.
* Some words have punctuation at the end: "rotating," "sun." Those will be treated as distinct tokens if we don't split it off.
* I see some words with "[]" around them.
* We have blank tokens and tokens that are just some symbols: '//', '-'
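Here is a small sketch that surfaces each of these issues programmatically. It uses the same whitespace-based pattern as `my_tokenizer`, applied to a made-up sample string (an assumption for illustration, not Angela's actual text).

```python
import re

# Hypothetical sample; note the double space between "It" and "is".
sample = "That's the [sun]. It  is rotating, // that's it\n"

tokens = re.findall(r"(\S*)\s", sample)
print(tokens)

# Empty tokens come from consecutive whitespace characters.
blank = [t for t in tokens if t == ""]

# Tokens with no letters or digits at all, such as "//".
symbol_only = [t for t in tokens if t and not re.search(r"\w", t)]

# Real words that still carry punctuation or brackets on an edge.
with_edges = [t for t in tokens
              if re.search(r"\w", t) and re.search(r"^\W|\W$", t)]

# Tokens that differ only by case.
case_variants = "That's" in tokens and "that's" in tokens

print(blank, symbol_only, with_edges, case_variants)
```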

First, let's get rid of the punctuation and the brackets:

In [None]:
def my_tokenizer(txt):
    words = re.findall(r"\[?(\S*?)[\s,.\]]", txt)
    return words

In [None]:
print(my_tokenizer(txt))

Now let's get rid of tokens that don't have any content, and also convert to lower case.

In [None]:
words = my_tokenizer(txt)
new_words = []
for w in words:
    if re.search(r"\w", w):
        new_words.append(w.lower())

In [None]:
print(new_words)
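If we want to reuse these steps on other transcripts, one option is to fold them into a single function. This is only a sketch: the name `tokenize_and_clean` is hypothetical, and the sample string is made up.

```python
import re

def tokenize_and_clean(text):
    """Tokenize, drop tokens with no word characters, and lowercase."""
    # Same pattern as above: an optional "[", then the shortest run of
    # non-space characters up to whitespace, a comma, a period, or "]".
    raw_tokens = re.findall(r"\[?(\S*?)[\s,.\]]", text)
    # Keep a token only if it has at least one letter, digit, or underscore.
    return [t.lower() for t in raw_tokens if re.search(r"\w", t)]

# Hypothetical sample, not from the corpus.
sample = "The [sun] is rotating, // ok.\n"
print(tokenize_and_clean(sample))
```

One thing to watch for: because the pattern stops at every period, a token like "3.5" would be split in two.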

For fun, let's compare to nltk's default tokenizer.
Note that it makes some different decisions than we did.

In [None]:
import nltk
print(nltk.word_tokenize(txt))

## Stemming

At this point, we are treating all forms of a word as distinct. For example, Angela says
"orbit" and "orbits." If we were counting up occurrences, we would currently put these into separate piles.
But sometimes we will want to reduce each word to its root.
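To see the "separate piles" problem, here is a minimal sketch using `collections.Counter` on a made-up token list. The `strip_plural_s` helper is a deliberately naive stand-in that I'm introducing for illustration; it is not how the nltk stemmers work.

```python
from collections import Counter

# Hypothetical tokens, not from the corpus.
tokens = ["the", "earth", "orbits", "the", "sun", "and",
          "the", "orbit", "is", "an", "orbit"]

# Without any normalization, "orbit" and "orbits" are counted separately.
counts = Counter(tokens)
print(counts["orbit"], counts["orbits"])

def strip_plural_s(word):
    # Naive rule: drop one trailing "s" from words longer than 3 letters.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# After normalization the two forms land in the same pile.
merged = Counter(strip_plural_s(t) for t in tokens)
print(merged["orbit"])
```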

This is hard. nltk includes multiple stemmers that we can use. Let's start with the Porter stemmer. It does some weird things.
Note that it even produces some strings that aren't real words.

In [None]:
import nltk
porter = nltk.PorterStemmer()
stemmed_words = [porter.stem(w) for w in new_words]
print(stemmed_words)

One immediate improvement we can make is to simply revert any stemmed words that aren't real words.

nltk's wordnet interface provides us with something that we can use like a dictionary.

In [None]:
from nltk.corpus import wordnet as wn
wn.synsets("dog")

In [None]:
stemmed_words = []
for w in new_words:
    w_temp = porter.stem(w)
    if len(wn.synsets(w_temp)) == 0:
        w_temp = w
    stemmed_words.append(w_temp)
print(stemmed_words)
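The same "stem, then revert if the result isn't a word" idea can be written as a small reusable function. This is a sketch under stated assumptions: `stem_with_fallback`, `toy_stem`, and `toy_vocabulary` are all hypothetical names, and the toy stemmer and vocabulary are stand-ins so the example runs without nltk.

```python
def stem_with_fallback(words, stem, is_real_word):
    """Stem each word, but keep the original if the stem is not a real word."""
    result = []
    for w in words:
        candidate = stem(w)
        result.append(candidate if is_real_word(candidate) else w)
    return result

# Toy stand-ins for a stemmer and a dictionary lookup.
toy_vocabulary = {"orbit", "rotate", "sun"}

def toy_stem(word):
    # Chop the first matching suffix; far cruder than a real stemmer.
    for suffix in ("ing", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(stem_with_fallback(["orbits", "rotating", "sun"],
                         toy_stem,
                         lambda w: w in toy_vocabulary))
```

With nltk loaded as in the cells above, we could pass `porter.stem` as `stem` and `lambda w: len(wn.synsets(w)) > 0` as `is_real_word`.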

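The wordnet lemmatizer is supposed to do both steps for us. Note that `lemmatize` treats every word as a noun unless we pass it a part of speech, so verb forms can come back unchanged: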

In [None]:
wnl = nltk.WordNetLemmatizer()

In [None]:
wstemmed_words = [wnl.lemmatize(w) for w in new_words]
print(wstemmed_words)

There's quite a bit of variability in what the stemmers do. 
Let's add a third stemmer, and look at what they all do with a list of test words:

In [None]:
lancaster = nltk.LancasterStemmer()

In [None]:
test_words = ["orbits", "books", "men", "women", "lying", "orbiting", "running", "run", "jumped", "ran", "jumps"]

In [None]:
results = []
for w in test_words:
    results.append(porter.stem(w))
print(results)

In [None]:
results = []
for w in test_words:
    results.append(lancaster.stem(w))
print(results)

In [None]:
results = []
for w in test_words:
    results.append(wnl.lemmatize(w))
print(results)
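Three separate printed lists are hard to compare by eye, so here is a sketch of a helper that lines the results up side by side. The names `compare_normalizers` and `format_table` are hypothetical, and the two toy normalizers are stand-ins so the example runs without nltk.

```python
def compare_normalizers(words, normalizers):
    """Return one row per word: the word, then each normalizer's output."""
    rows = []
    for w in words:
        rows.append([w] + [fn(w) for fn in normalizers.values()])
    return rows

def format_table(header, rows):
    """Format the header and rows as left-aligned text columns."""
    widths = [max(len(str(cell)) for cell in col) for col in zip(header, *rows)]
    lines = []
    for row in [header] + rows:
        lines.append("  ".join(str(cell).ljust(width)
                               for cell, width in zip(row, widths)))
    return "\n".join(lines)

# Toy stand-ins; these are not the nltk stemmers.
toy_normalizers = {
    "identity": lambda w: w,
    "strip_s": lambda w: w[:-1] if w.endswith("s") else w,
}

words = ["orbits", "men", "jumps"]
rows = compare_normalizers(words, toy_normalizers)
print(format_table(["word"] + list(toy_normalizers), rows))
```

To compare the real tools, we could pass `{"porter": porter.stem, "lancaster": lancaster.stem, "wordnet": wnl.lemmatize}` along with `test_words`.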