This notebook accompanies the lectures on tokenization, normalization, and stemming. 

Before we get started, you're going to need the NLTK book corpus. Here are the steps to install it: 

1. Open a console or command window.
1. Type `python` to start the Python interpreter.
1. Type `import nltk` and hit enter.
1. Type `nltk.download()` and hit enter.
1. This will open a little window. 
1. Click "All Packages" at the top of the list. 
1. Click "Download".

Let me know if you run into any issues!
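
If the downloader window does not open (for example, on a remote machine without a display), a non-interactive download may work instead. The sketch below assumes that the `book` collection identifier covers the data that `nltk.book` loads; the helper name `download_book_collection` is made up for illustration.

```python
def download_book_collection():
    """Fetch the NLTK 'book' collection without opening the GUI window."""
    # Imported inside the function so that defining the helper does not
    # require NLTK to be importable yet.
    import nltk

    # Passing an identifier skips the interactive downloader window.
    # Assumption: "book" covers the corpora that `nltk.book` loads.
    return nltk.download("book")
```

Calling `download_book_collection()` once should be enough. It needs network access and can take a few minutes.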

---

Now we can get started in earnest.

In [1]:
import nltk

In [3]:
nltk.download()



showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
import nltk
from nltk.book import *
from collections import Counter

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Tokenization

Tokenization is the process of splitting text into tokens. The simplest tokens are those separated by whitespace. Let's begin by counting the words in a file that I've included in the repo: the text of _Beowulf_.

1. Read the file into a variable that holds a (large) string. 
1. Look at the first 1000 characters of that string.
1. Split that string on whitespace (spaces, returns, tabs, etc.).
1. Count the number of tokens. 
1. Determine the most common 10 tokens. 
1. Find the tokens that include punctuation within the token. 
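
Before running these steps on the full file, it may help to see how whitespace splitting behaves on a small string. The sample below is made up for illustration (it borrows a few words from the poem). It contrasts `split()` with no argument against `split(" ")`, which splits on single spaces only.

```python
from collections import Counter

# A made-up sample with spaces, a tab, and newlines between words.
sample = "LO, praise of\tthe prowess\nof people-kings\n\nof spear-armed Danes,"

# With no argument, str.split() splits on any run of whitespace
# and never returns empty strings.
tokens = sample.split()

# Splitting on a single space leaves tabs and newlines inside the pieces.
naive_tokens = sample.split(" ")

print(tokens)
print(naive_tokens)

# Counter tallies how often each token occurs.
print(Counter(tokens).most_common(1))
```

Note that punctuation stays attached to the neighboring word (`'LO,'`, `'Danes,'`), which is why the last step in the list above is worth doing.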

In [3]:
beowulf = open("beowulf.txt").read()

In [4]:
# Display first 1000 characters

beowulf[ : 1000]


"BEOWULF\n\nBy Anonymous\n\nTranslated by Gummere\n\n\n\n\nBEOWULF\n\n\n\n\nPRELUDE OF THE FOUNDER OF THE DANISH HOUSE\n\n\n\nLO, praise of the prowess of people-kings\nof spear-armed Danes, in days long sped,\nwe have heard, and what honor the athelings won!\nOft Scyld the Scefing from squadroned foes,\nfrom many a tribe, the mead-bench tore,\nawing the earls. Since erst he lay\nfriendless, a foundling, fate repaid him:\nfor he waxed under welkin, in wealth he throve,\ntill before him the folk, both far and near,\nwho house by the whale-path, heard his mandate,\ngave him gifts:  a good king he!\nTo him an heir was afterward born,\na son in his halls, whom heaven sent\nto favor the folk, feeling their woe\nthat erst they had lacked an earl for leader\nso long a while; the Lord endowed him,\nthe Wielder of Wonder, with world's renown.\nFamed was this Beowulf:  {0a} far flew the boast of him,\nson of Scyld, in the Scandian lands.\nSo becomes it a youth to quit him well\nwith his father's

In [5]:
# Split it on whitespace

beo_tokens = beowulf.split()

In [6]:
# Calculate the number of tokens

len(beo_tokens)


26116

In [7]:
# Determine the 10 most common tokens
# Hint: check out the Counter object in the collections library

beo_counter = Counter(beo_tokens)

beo_counter.most_common(10)

[('the', 1701),
 ('of', 1032),
 ('and', 689),
 ('to', 531),
 ('in', 452),
 ('his', 428),
 ('that', 322),
 ('he', 312),
 ('with', 286),
 ('was', 240)]

In [22]:
# Now calculate how many tokens have punctuation in them. 
# Hint: the string library has an object with all the punctuation
# marks in it
from string import punctuation

punct_set = set(punctuation)

beo_punct_tokens = []

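for word in beo_tokens:
    # Characters that this token shares with the punctuation set
    overlap = set(word).intersection(punct_set)

    # A non-empty overlap means the token contains punctuation
    if overlap:
        beo_punct_tokens.append(word)

print(beo_punct_tokens[:100])
print(len(beo_punct_tokens))

# Fraction of all tokens that contain at least one punctuation mark
len(beo_punct_tokens) / len(beo_tokens)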

['LO,', 'people-kings', 'spear-armed', 'Danes,', 'sped,', 'heard,', 'won!', 'foes,', 'tribe,', 'mead-bench', 'tore,', 'earls.', 'friendless,', 'foundling,', 'him:', 'welkin,', 'throve,', 'folk,', 'near,', 'whale-path,', 'mandate,', 'gifts:', 'he!', 'born,', 'halls,', 'folk,', 'while;', 'him,', 'Wonder,', "world's", 'renown.', 'Beowulf:', '{0a}', 'him,', 'Scyld,', 'lands.', "father's", 'friends,', 'gift,', 'him,', 'aged,', 'days,', 'willing,', 'nigh,', 'loyal:', 'clan.', 'moment,', 'God.', "ocean's", 'billow,', 'clansmen,', 'them,', 'Scyld,', 'ruled....', 'ring-dight', 'vessel,', 'ice-flecked,', 'outbound,', "atheling's", 'barge:', 'boat,', 'breaker-of-rings,', '{0b}', 'one.', 'him.', 'battle,', 'blade:', "o'er", 'away.', 'gifts,', "thanes'", 'treasure,', 'seas,', 'child.', "o'er", 'standard,', 'gold-wove', 'banner;', 'him,', 'ocean.', 'spirits,', 'mood.', 'sooth,', 'halls,', "'neath", 'heaven,', '--', 'freight!', 'Scyldings,', 'beloved,', 'folk,', 'world,', 'heir,', 'Healfdene,', 'life

0.2267192525654771

As you look at those tokens with punctuation, what do you notice? 

What fraction of tokens contain punctuation?
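
One pattern you may notice is that much of the punctuation sits at the edges of a token, while hyphens and apostrophes tend to sit inside it. The sketch below is one possible way to explore that, not the only one: it uses `str.strip` with `string.punctuation` on a handful of tokens copied from the output above.

```python
from string import punctuation

# A few tokens copied from the printed list above.
sample_tokens = ["LO,", "people-kings", "won!", "world's", "ruled....", "--", "{0a}"]

# str.strip removes the listed characters from both ends only,
# so internal hyphens and apostrophes survive.
stripped_tokens = [token.strip(punctuation) for token in sample_tokens]

for before, after in zip(sample_tokens, stripped_tokens):
    print(repr(before), "->", repr(after))

# Tokens that were nothing but punctuation become empty strings.
non_empty = [token for token in stripped_tokens if token]
print(len(sample_tokens), len(non_empty))
```

Whether edge punctuation should be stripped depends on the task; here it is only a way to see which tokens would change.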

### Tokenization Second Exercise

Now let's try working with some NLTK data. Count the words (or, more precisely, tokens) in one of the first three books included in the book corpus (_Moby Dick_, _Sense and Sensibility_, and The Book of Genesis from the _Bible_).

1. Pick one of the texts and assign it to a new variable. It'll have a name like `text1` before you assign it. That variable was created when we imported everything from `nltk.book`.
1. Look at the structure of the variable.
1. Count the tokens as above. Use the `Counter` object.
1. Display the 10 most common tokens.

In [23]:
# Assign the text to the new variable.

mobydick = text1

In [24]:
# Count the tokens and display the 10 most common

Counter(mobydick).most_common(10)

[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982)]
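
Unlike the whitespace-split _Beowulf_ tokens, the counts above include `','`, `'.'`, and `';'` as tokens in their own right, which suggests the NLTK text has already been tokenized with punctuation separated out. A minimal sketch of one way to count only word tokens, using a tiny made-up token list rather than the real corpus:

```python
from collections import Counter

# A tiny made-up token list shaped like the text above: punctuation
# marks are separate tokens.
tokens = ["Call", "me", "Ishmael", ".", "The", "sea", ",",
          "the", "ship", ",", "the", "whale", "."]

# Keep alphabetic tokens only, and lowercase them so that
# "The" and "the" are counted together.
words = [token.lower() for token in tokens if token.isalpha()]

print(Counter(tokens).most_common(3))
print(Counter(words).most_common(3))
```

The same list comprehension could be applied to `mobydick` in place of `tokens`. Note that `isalpha()` also drops tokens that contain digits, hyphens, or apostrophes, which may or may not be what you want.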

In [25]:
# List the files available in NLTK's Gutenberg corpus.

for book in nltk.corpus.gutenberg.fileids():
    print(book)

austen-emma.txt
austen-persuasion.txt
austen-sense.txt
bible-kjv.txt
blake-poems.txt
bryant-stories.txt
burgess-busterbrown.txt
carroll-alice.txt
chesterton-ball.txt
chesterton-brown.txt
chesterton-thursday.txt
edgeworth-parents.txt
melville-moby_dick.txt
milton-paradise.txt
shakespeare-caesar.txt
shakespeare-hamlet.txt
shakespeare-macbeth.txt
whitman-leaves.txt
