This notebook accompanies the lectures on tokenization, normalization, and stemming. 

Before we get started, you're going to need the NLTK book corpus. Here are the steps to install it: 

1. Open a console or command window.
1. Type `python` to start the Python interpreter. 
1. Type `import nltk` and hit enter.
1. Type `nltk.download()` and hit enter.
1. This will open a little window. 
1. Click "All Packages" at the top of the list. 
1. Click "Download"

Let me know if you run into any issues!

---

Now we can get started in earnest.

In [None]:
import nltk
from nltk.book import *
from collections import Counter

## Tokenization

Tokenization is the process by which we split text up into tokens. The simplest tokens are those split by whitespace. Let's begin by counting the words in a file that I've included in the repo: the text of _Beowulf_. 

1. Read the file into a variable that holds a (large) string. 
1. Look at the first 1000 characters of that string.
1. Split that string on whitespace (spaces, returns, tabs, etc.)
1. Count the number of tokens. 
1. Determine the most common 10 tokens. 
1. Find the tokens that include punctuation within the token. 

In [None]:
# Use a context manager so the file is closed once it has been read.
with open("beowulf.txt") as f:
    beowulf = f.read()

In [None]:
beowulf[:1000]

In [None]:
beo_tokens = beowulf.split()

In [None]:
len(beo_tokens)
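
To see why calling `split()` with no arguments suits step 3, here is a small sketch on a made-up string (`sample` is an illustration, not text taken from the file). With no separator, `str.split` treats any run of whitespace as a single delimiter and ignores leading and trailing whitespace. With an explicit `" "` it does neither.

```python
# A made-up sample string with mixed whitespace (illustrative only).
sample = "  Lo!  we have\theard\nof the glory  "

# No argument: split on any run of spaces, tabs, or newlines.
whitespace_tokens = sample.split()
print(whitespace_tokens)

# Explicit single space: tabs and newlines stay inside tokens,
# and repeated spaces produce empty strings.
space_tokens = sample.split(" ")
print(space_tokens)
```

The second list contains empty strings and one token with a tab and a newline still inside it, which is why the no-argument form is usually the better default for whitespace tokenization.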

There are many ways to count things. The handiest is the `Counter` data type. I'll illustrate its use and show you how you could do the same thing with a dictionary. 

In [None]:
# First, the Counter version
from collections import Counter  # Already imported at the top of the notebook; repeated so this cell runs on its own.

beo_counter = Counter(beo_tokens)

In [None]:
# Now we can do things like find the most common tokens
beo_counter.most_common(10)

A `Counter` is really just a dictionary, where the keys are the elements of the list that you fed in, and the values are the integer counts. Here's how you'd do the same thing with a dictionary. Notice how much more difficult it is, and the weird construction to sort the dictionary by values for reading out.

In [None]:
beo_dict = dict()

for t in beo_tokens :
    
    # Have to create the spot in the dictionary if it's not in there.
    if t not in beo_dict :
        beo_dict[t] = 0
    
    # And now increment the count
    beo_dict[t] += 1

In [None]:
# Printing out the top 10 is pretty tricky. Do the work to understand what's happening below here:
num_printed = 0

for token, count in sorted(beo_dict.items(), key=lambda item: -1*item[1]) :
    print(token + " had " + str(count) + " instances.")
    num_printed += 1
    
    if num_printed == 10 :
        break
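
The claim that a `Counter` is really a dictionary can be checked directly. Below is a minimal sketch on a made-up token list (`toy_tokens` is an assumption for illustration). It also shows `sorted(..., reverse=True)` as an alternative to negating the count in the sort key.

```python
from collections import Counter

# Made-up tokens, for illustration only.
toy_tokens = ["the", "hall", "the", "king", "hall", "the"]

toy_counter = Counter(toy_tokens)

# Manual dictionary count; dict.get supplies 0 for unseen tokens.
toy_dict = {}
for t in toy_tokens:
    toy_dict[t] = toy_dict.get(t, 0) + 1

# Counter is a subclass of dict and holds the same key/value pairs.
print(isinstance(toy_counter, dict))
print(dict(toy_counter) == toy_dict)

# Sort by count, largest first, and keep the top two.
top_two = sorted(toy_dict.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_two)
print(toy_counter.most_common(2))
```

Both approaches give the same top tokens here. `most_common` just saves you the sorting step.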

In [None]:
# Now let's count the number of tokens that have punctuation in them. 
# Python has an object that holds punctuation, so we can use that. 
from string import punctuation # usually would do this at the top

# We'll use a set trick, so need punctuation in a set
punct_set = set(punctuation)

beo_tokens_punct = []

for w in beo_tokens :
    w_set = set(w)
    overlap = w_set.intersection(punct_set)
    
    if len(overlap) > 0 :
        beo_tokens_punct.append(w)

        
print(beo_tokens_punct[:100])
print(len(beo_tokens_punct))
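
The set trick above can be hard to read at first. This small sketch on made-up tokens (`toy_tokens` is illustrative, not from the file) prints what the intersection returns for each token, then applies an equivalent `any()` test that some readers may find easier to follow.

```python
from string import punctuation

punct_set = set(punctuation)

# Made-up tokens, for illustration only.
toy_tokens = ["Hrothgar,", "mead-hall", "--", "king", "Lo!"]

# The intersection is the set of punctuation characters found in the token.
for w in toy_tokens:
    print(w, sorted(set(w) & punct_set))

# Equivalent filter: keep a token if any of its characters is punctuation.
with_punct = [w for w in toy_tokens if any(ch in punct_set for ch in w)]
print(with_punct)
```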

As you look at those tokens with punctuation, what do you notice? 

1. Lots of commas just stuck onto words.
1. Some tokens that are *just* punctuation (e.g., "--").
1. Lots of capitals that might not be what we want in our tokenization. 
1. A lot of tokens with punctuation. The next cell calculates the fraction. (Note that it's not unique counts yet.)

In [1]:
len(beo_tokens_punct)/len(beo_tokens)

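
One possible way to address the first three observations, sketched on made-up tokens: strip punctuation from the ends of each token, lowercase it, and drop anything left empty. This is an assumption about where normalization might go next, not a prescribed method, and the helper name `normalize` is hypothetical.

```python
from string import punctuation


def normalize(tokens):
    """Strip leading/trailing punctuation, lowercase, and drop empty results."""
    cleaned = (t.strip(punctuation).lower() for t in tokens)
    return [t for t in cleaned if t]


# Made-up tokens, for illustration only.
toy_tokens = ["Lo!", "the", "Spear-Danes,", "--", "The", "king."]

normalized = normalize(toy_tokens)
print(normalized)
```

Note that `str.strip` only removes characters from the ends of a token, so the hyphen inside "Spear-Danes" survives. Whether that is desirable depends on what you want to count.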

### Tokenization Second Exercise

Now let's try working with some NLTK data. Count the words (or, more precisely, tokens) in one of the first three books included in the book corpus (_Moby Dick_, _Sense and Sensibility_, and The Book of Genesis from the _Bible_).

1. Pick one of the texts and assign it to a new variable. It'll have a name like `text1` before you assign it. That variable was created when we imported everything from `nltk.book`.
1. Look at the structure of the variable.
1. Count the tokens as above. Use the `Counter` object.
1. Display the 10 most common tokens.

In [None]:
sense = text2

In [None]:
Counter(sense).most_common(10)

We pull in nine corpora (`text1` through `text9`) when we load the books, but there are a *ton* of books we get with NLTK. Here are the ones in NLTK's Project Gutenberg selection. 

In [None]:
for book in nltk.corpus.gutenberg.fileids() :
    print(book)