This notebook accompanies the lectures on tokenization, normalization, and stemming. 

Before we get started, you're going to need the NLTK book corpus. Here are the steps to install it: 

1. Open a console or command window.
1. Type `python` to start the Python interpreter.
1. Type `import nltk` and hit enter.
1. Type `nltk.download()` and hit enter.
1. This will open a little window. 
1. Click "All Packages" at the top of the list. 
1. Click "Download".

Let me know if you run into any issues!

---

Now we can get started in earnest.

In [1]:
import nltk
from nltk.book import *
from collections import Counter

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Tokenization

Tokenization is the process of splitting text into tokens. The simplest tokens are those separated by whitespace. Let's begin by counting the words in a file that I've included in the repo: the text of _Beowulf_.

1. Read the file into a variable that holds a (large) string. 
1. Look at the first 1000 characters of that string.
1. Split that string on whitespace (spaces, returns, tabs, etc.)
1. Count the number of tokens. 
1. Determine the most common 10 tokens. 
1. Find the tokens that include punctuation within the token. 
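
Before running this on the whole file, here is a minimal sketch of what "split on whitespace" means in Python. It uses one line copied from the _Beowulf_ text, which happens to contain a double space. The variable names (`line`, `tokens`, `naive`) are just for illustration and are not used elsewhere in the notebook.

```python
# One line from the Beowulf text; note the two spaces after "gifts:"
line = "gave him gifts:  a good king he!"

# With no argument, split() treats any run of whitespace
# (spaces, tabs, newlines) as a single separator.
tokens = line.split()
print(tokens)
# ['gave', 'him', 'gifts:', 'a', 'good', 'king', 'he!']

# Splitting on a single space instead leaves an empty string
# wherever two spaces sit next to each other.
naive = line.split(" ")
print(naive)
# ['gave', 'him', 'gifts:', '', 'a', 'good', 'king', 'he!']
```

That empty string would be counted as a token, which is why the cells below call `split()` with no argument.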

In [2]:
# Read in beowulf here
# The with-block closes the file once the text has been read
with open("beowulf.txt") as infile:
    beolargestring = infile.read()


In [3]:
# Display first 1000 characters
beolargestring[:1000]

"BEOWULF\n\nBy Anonymous\n\nTranslated by Gummere\n\n\n\n\nBEOWULF\n\n\n\n\nPRELUDE OF THE FOUNDER OF THE DANISH HOUSE\n\n\n\nLO, praise of the prowess of people-kings\nof spear-armed Danes, in days long sped,\nwe have heard, and what honor the athelings won!\nOft Scyld the Scefing from squadroned foes,\nfrom many a tribe, the mead-bench tore,\nawing the earls. Since erst he lay\nfriendless, a foundling, fate repaid him:\nfor he waxed under welkin, in wealth he throve,\ntill before him the folk, both far and near,\nwho house by the whale-path, heard his mandate,\ngave him gifts:  a good king he!\nTo him an heir was afterward born,\na son in his halls, whom heaven sent\nto favor the folk, feeling their woe\nthat erst they had lacked an earl for leader\nso long a while; the Lord endowed him,\nthe Wielder of Wonder, with world's renown.\nFamed was this Beowulf:  {0a} far flew the boast of him,\nson of Scyld, in the Scandian lands.\nSo becomes it a youth to quit him well\nwith his father's

In [7]:
# Split it on whitespace
beowords = beolargestring.split()
len(beowords)

26116

In [10]:
# Calculate the number of distinct tokens (the count above includes repeats)
from collections import Counter
beotokens = Counter(beowords)
len(beotokens)

6824

In [11]:
# Determine the 10 most common tokens
# Hint: check out the Counter object in the collections library
beotokens.most_common(10)

[('the', 1701),
 ('of', 1032),
 ('and', 689),
 ('to', 531),
 ('in', 452),
 ('his', 428),
 ('that', 322),
 ('he', 312),
 ('with', 286),
 ('was', 240)]

In [15]:
# Now calculate how many distinct tokens have punctuation in them.
# Hint: the string library has an object with all the punctuation
# marks in it
from string import punctuation
punc_set = set(punctuation)

# Keep every distinct token that shares at least one character with punc_set
beotokens_punc = []

for token in beotokens:
    if set(token) & punc_set:
        beotokens_punc.append(token)

print(len(beotokens_punc))
print(beotokens_punc[:100])




3478
['LO,', 'people-kings', 'spear-armed', 'Danes,', 'sped,', 'heard,', 'won!', 'foes,', 'tribe,', 'mead-bench', 'tore,', 'earls.', 'friendless,', 'foundling,', 'him:', 'welkin,', 'throve,', 'folk,', 'near,', 'whale-path,', 'mandate,', 'gifts:', 'he!', 'born,', 'halls,', 'while;', 'him,', 'Wonder,', "world's", 'renown.', 'Beowulf:', '{0a}', 'Scyld,', 'lands.', "father's", 'friends,', 'gift,', 'aged,', 'days,', 'willing,', 'nigh,', 'loyal:', 'clan.', 'moment,', 'God.', "ocean's", 'billow,', 'clansmen,', 'them,', 'ruled....', 'ring-dight', 'vessel,', 'ice-flecked,', 'outbound,', "atheling's", 'barge:', 'boat,', 'breaker-of-rings,', '{0b}', 'one.', 'him.', 'battle,', 'blade:', "o'er", 'away.', 'gifts,', "thanes'", 'treasure,', 'seas,', 'child.', 'standard,', 'gold-wove', 'banner;', 'ocean.', 'spirits,', 'mood.', 'sooth,', "'neath", 'heaven,', '--', 'freight!', 'Scyldings,', 'beloved,', 'world,', 'heir,', 'Healfdene,', 'life,', 'sturdy,', 'glad.', 'Then,', 'one,', 'four:', 'Heorogar,', 'H

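As you look at those tokens with punctuation, what do you notice? In the first 100, most are ordinary words with a trailing comma, period, colon, or other sentence punctuation still attached. There are also several hyphenated compounds (`people-kings`, `mead-bench`), words with apostrophes (`world's`, `o'er`), and bracketed markers such as `{0a}`.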
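
One way to check that impression is to tally which punctuation marks show up, and to separate tokens with punctuation only at the edges from tokens with punctuation inside the word. The sketch below is an illustration and was not part of the original exercise: `sample` is a handful of tokens copied by hand from the output above, and `mark_counts` and `internal` are made-up names. In the notebook you could pass `beotokens_punc` in place of `sample`.

```python
from collections import Counter
from string import punctuation

punc_set = set(punctuation)

# A few tokens copied from the output above (stand-in for beotokens_punc)
sample = ['LO,', 'people-kings', 'spear-armed', 'Danes,',
          'won!', "world's", 'ruled....', '{0a}']

# Tally each punctuation mark once per token it appears in
mark_counts = Counter()
for token in sample:
    mark_counts.update(set(token) & punc_set)

print(mark_counts.most_common())

# strip(punctuation) removes punctuation from both ends of a token,
# so anything still containing a mark has punctuation *inside* the word.
internal = [t for t in sample if set(t.strip(punctuation)) & punc_set]
print(internal)
# ['people-kings', 'spear-armed', "world's"]
```

Edge punctuation can be stripped without changing the word, while internal punctuation (hyphens, apostrophes) calls for a decision about whether the token should be split or kept whole.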

What fraction of distinct tokens contain punctuation? 3478/6824, or about 51%: almost exactly half.

In [19]:
len(beotokens_punc)/len(beotokens)

0.5096717467760844
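
Note that this fraction is over distinct tokens. A related question is what fraction of all token occurrences contain punctuation, which weights each token by how often it appears. The sketch below shows both calculations on a short made-up sentence (not from _Beowulf_); the names `text`, `counts`, `type_fraction`, and `occurrence_fraction` are hypothetical. In the notebook, `beotokens` would play the role of `counts`.

```python
from collections import Counter
from string import punctuation

punc_set = set(punctuation)

# A made-up sentence, used only to keep the arithmetic easy to follow
text = "the king, the earl, and the folk"
counts = Counter(text.split())


def has_punc(token):
    """Return True if the token contains any punctuation character."""
    return bool(set(token) & punc_set)


# Fraction of distinct tokens with punctuation: 2 of 5
type_fraction = sum(1 for tok in counts if has_punc(tok)) / len(counts)

# Fraction of token occurrences with punctuation: 2 of 7
occurrence_fraction = (
    sum(n for tok, n in counts.items() if has_punc(tok)) / sum(counts.values())
)

print(type_fraction, occurrence_fraction)
```

The two numbers can differ a lot, because very frequent tokens such as `the` count once in the first fraction but many times in the second.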

### Tokenization Second Exercise

Now let's try working with some NLTK data. Count the words (or, more precisely, tokens) in one of the first three books included in the book corpus (_Moby Dick_, _Sense and Sensibility_, and The Book of Genesis from the _Bible_).

1. Pick one of the texts and assign it to a new variable. It'll have a name like `text1` before you assign it. That variable was created when we imported everything from `nltk.book`.
1. Look at the structure of the variable.
1. Count the tokens as above. Use the `Counter` object.
1. Display the 10 most common tokens.

In [21]:
# Assign the text to the new variable.
bog = text3

In [23]:
# Count the distinct tokens
bogtokens = Counter(bog)
len(bogtokens)

2789

In [24]:
# Display the 10 most common tokens.
bogtokens.most_common(10)

[(',', 3681),
 ('and', 2428),
 ('the', 2411),
 ('of', 1358),
 ('.', 1315),
 ('And', 1250),
 ('his', 651),
 ('he', 648),
 ('to', 611),
 (';', 605)]
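
Unlike the whitespace-split _Beowulf_ tokens, these tokens have punctuation split off as separate items, and `and` and `And` are counted separately. A minimal sketch of one possible normalization, lowercasing and keeping only alphabetic tokens, follows. It is an assumption about how you might proceed rather than part of the exercise above, and `sample_tokens` is a short hand-typed list in the style of the Genesis tokens, standing in for `bog`.

```python
from collections import Counter

# Hand-typed stand-in for the NLTK token list (bog in the notebook)
sample_tokens = ['And', 'God', 'said', ',', 'Let', 'there', 'be', 'light',
                 ':', 'and', 'there', 'was', 'light', '.']

# isalpha() drops pure-punctuation tokens; lower() merges 'And' with 'and'
words = [tok.lower() for tok in sample_tokens if tok.isalpha()]
word_counts = Counter(words)

print(len(sample_tokens), len(words))
print(word_counts.most_common(3))
```

Applied to the full text, lowercasing would merge the `and` and `And` rows shown above, giving a combined count of at least 2428 + 1250 = 3678. Whether to fold case this way depends on the task, since it also erases the difference between sentence-initial words and proper nouns.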