### Let's write an elementary tokenizer that uses words as tokens.

We will use Mark Twain's _Life On The Mississippi_ as a test bed. The text is in the accompanying file 'Life_On_The_Mississippi.txt'.

Here is a first, not terribly good, attempt at such a tokenizer:

In [1]:
wdict = {}
with open('Life_On_The_Mississippi.txt', 'r') as L:
    line = L.readline()
    nlines = 1
    while line:

        words = line.split()
        for word in words:
            if wdict.get(word) is not None:
                wdict[word] += 1
            else:
                wdict[word] = 1
        line = L.readline()
        nlines += 1

nitem = 0 ; maxitems = 100
for item in wdict.items():
    nitem += 1
    print(item)
    if nitem == maxitems: break


('\ufeffThe', 1)
('Project', 79)
('Gutenberg', 22)
('eBook', 4)
('of', 4469)
('Life', 5)
('on', 856)
('the', 8443)
('Mississippi', 104)
('This', 127)
('ebook', 2)
('is', 1076)
('for', 1017)
('use', 34)
('anyone', 4)
('anywhere', 8)
('in', 2381)
('United', 36)
('States', 26)
('and', 5692)
('most', 119)
('other', 223)
('parts', 5)
('world', 40)
('at', 676)
('no', 325)
('cost', 18)
('with', 1053)
('almost', 37)
('restrictions', 2)
('whatsoever.', 2)
('You', 92)
('may', 85)
('copy', 12)
('it,', 199)
('give', 67)
('it', 1382)
('away', 107)
('or', 561)
('re-use', 2)
('under', 112)
('terms', 22)
('License', 8)
('included', 2)
('this', 591)
('online', 4)
('www.gutenberg.org.', 4)
('If', 85)
('you', 813)
('are', 361)
('not', 680)
('located', 9)
('States,', 8)
('will', 287)
('have', 557)
('to', 3518)
('check', 4)
('laws', 13)
('country', 50)
('where', 152)
('before', 150)
('using', 10)
('eBook.', 2)
('Title:', 1)
('Author:', 1)
('Mark', 2)
('Twain', 2)
('Release', 1)
('date:', 1)
('July', 7)
('1

This is unsatisfactory for a few reasons:

* There are non-ASCII characters that should be stripped, such as the so-called "Byte-Order Mark" (BOM) \ufeff at the beginning of the text;

* There are punctuation marks, which we don't want to concern ourselves with;

* The same word can appear in all capitals, in lower case, or with only its initial letter capitalized, whereas we want every variant normalized to lower case.

Part 1 of this assignment: insert code in this loop to operate on the str variable 'line' so as to fix these problems before 'line' is split into words.

A hint to one possible way to do this: use the 'punctuation' character definition in the Python 'string' module together with the 'maketrans' and 'translate' methods of Python's str class to eliminate punctuation, and use the regular expression ('re') Python module to eliminate any non-ASCII characters. It is useful to know that the regular expression r'[^\x00-\x7f]' means "any character not in the vanilla ASCII set".
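As a minimal sketch of the hint applied to a single string (the helper name `normalize_line` and the sample sentence are made up for illustration and are not part of the assignment code), the three clean-up steps can be chained like this:

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Pattern matching any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def normalize_line(line):
    """Lower-case a line, then drop ASCII punctuation and non-ASCII characters."""
    line = line.lower()
    line = line.translate(PUNCT_TABLE)
    return NON_ASCII.sub('', line)

# Made-up sample line that starts with a BOM, as the first line of the file does.
sample = '\ufeffYou may copy it, give it away or re-use it.'
tokens = normalize_line(sample).split()
print(tokens)
```

Note that deleting the hyphen turns 're-use' into the single token 'reuse'; whether that is desirable depends on what the tokens are used for.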

Part 2: Add code to sort the contents of wdict by word occurrence frequency.  What are the top 100 most frequent word tokens?  Adding up occurrence frequencies starting from the most frequent words, how many distinct words make up the top 90% of word occurrences in this "corpus"?

For this part, the docs of Python's 'sorted' and of the helper 'itemgetter' from 'operator' reward study.
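A small illustration of how the two fit together, using a toy dictionary with made-up counts rather than the real wdict: `itemgetter(1)` picks the count out of each (word, count) pair, and `reverse=True` puts the largest counts first.

```python
from operator import itemgetter

# Toy word counts, invented for this example.
counts = {'river': 3, 'the': 10, 'pilot': 5}

# Sort the (word, count) pairs by count, largest first.
by_count = sorted(counts.items(), key=itemgetter(1), reverse=True)
print(by_count)  # [('the', 10), ('pilot', 5), ('river', 3)]
```

Because Python's sort is stable, words with equal counts keep the order in which they were first inserted into the dictionary.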

Write your modified code in the cell below.

In [3]:
import string
import re
wdict = {}
with open('Life_On_The_Mississippi.txt', 'r') as L:
    line = L.readline()
    nlines = 1
    while line:
        
        # Make everything lower-case
        line = line.lower()
        # Remove punctuation by "translating" each punctuation character to nothing
        line = line.translate(str.maketrans('', '', string.punctuation))
        # Remove non-ASCII characters using the re module
        line = re.sub(r'[^\x00-\x7F]', '', line)
        
        words = line.split()
        for word in words:
            if wdict.get(word) is not None:
                wdict[word] += 1
            else:
                wdict[word] = 1
        line = L.readline()
        nlines += 1

nitem = 0 ; maxitems = 100
for item in wdict.items():
    nitem += 1
    print(item)
    if nitem == maxitems: break


('the', 9255)
('project', 90)
('gutenberg', 87)
('ebook', 13)
('of', 4532)
('life', 89)
('on', 947)
('mississippi', 159)
('this', 781)
('is', 1148)
('for', 1095)
('use', 48)
('anyone', 5)
('anywhere', 18)
('in', 2593)
('united', 37)
('states', 54)
('and', 5892)
('most', 124)
('other', 270)
('parts', 9)
('world', 68)
('at', 750)
('no', 422)
('cost', 25)
('with', 1081)
('almost', 38)
('restrictions', 2)
('whatsoever', 2)
('you', 1033)
('may', 89)
('copy', 17)
('it', 2293)
('give', 81)
('away', 172)
('or', 581)
('reuse', 2)
('under', 119)
('terms', 26)
('license', 24)
('included', 3)
('online', 4)
('wwwgutenbergorg', 5)
('if', 381)
('are', 387)
('not', 722)
('located', 9)
('will', 301)
('have', 571)
('to', 3592)
('check', 4)
('laws', 17)
('country', 77)
('where', 174)
('before', 208)
('using', 11)
('title', 3)
('author', 3)
('mark', 24)
('twain', 26)
('release', 1)
('date', 18)
('july', 7)
('10', 10)
('2004', 1)
('245', 1)
('recently', 4)
('updated', 2)
('january', 3)
('1', 13)
('2021', 1

Some of the tokens are still a little odd (for example 'wwwgutenbergorg', or bare numbers such as '245'), but that's alright for a first go. We've at least tackled the three problems listed above and can now move on to sorting the dictionary.
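The odd-looking tokens are a side effect of deleting punctuation outright: characters inside a word are removed along with those at its edges. The short sketch below shows the effect on a few strings; three are taken from the first output above, while "don't" is an added example and not drawn from the output.

```python
import string

# Same punctuation-deleting table as in the cell above.
strip_punct = str.maketrans('', '', string.punctuation)

# "don't" is a hypothetical extra example; the others appear in the raw output.
raw_tokens = ['www.gutenberg.org.', 're-use', 'States,', "don't"]
cleaned = {raw: raw.lower().translate(strip_punct) for raw in raw_tokens}

for raw, clean in cleaned.items():
    print(repr(raw), '->', repr(clean))
```

Interior periods, hyphens, and apostrophes vanish, so URLs and hyphenated words collapse into single run-together tokens.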

In [4]:
from operator import itemgetter

In [9]:
wdict_sorted=sorted(wdict.items(), key=itemgetter(1), reverse=True)

In [15]:

for i in range(100):
    print(i+1, " most common word: ", wdict_sorted[i])

1  most common word:  ('the', 9255)
2  most common word:  ('and', 5892)
3  most common word:  ('of', 4532)
4  most common word:  ('a', 4053)
5  most common word:  ('to', 3592)
6  most common word:  ('in', 2593)
7  most common word:  ('it', 2293)
8  most common word:  ('i', 2205)
9  most common word:  ('was', 2093)
10  most common word:  ('that', 1724)
11  most common word:  ('he', 1402)
12  most common word:  ('is', 1148)
13  most common word:  ('for', 1095)
14  most common word:  ('with', 1081)
15  most common word:  ('you', 1033)
16  most common word:  ('his', 961)
17  most common word:  ('had', 961)
18  most common word:  ('but', 952)
19  most common word:  ('on', 947)
20  most common word:  ('as', 881)
21  most common word:  ('this', 781)
22  most common word:  ('they', 758)
23  most common word:  ('at', 750)
24  most common word:  ('not', 722)
25  most common word:  ('all', 720)
26  most common word:  ('by', 713)
27  most common word:  ('one', 686)
28  most common word:  ('there',

In [39]:
#Find how many words make up 90%
total_words = sum(wdict.values())
print(total_words)

147420


In [40]:
# Sort frequencies in descending order
sorted_freq = sorted(wdict.values(), reverse=True)
# Initialize running total
running_total = 0

# Iterate over word frequencies, most frequent first
for i in range(len(sorted_freq)):

    # i is a zero-based index, so i + 1 words have been counted so far
    num_words = i + 1
    running_total += sorted_freq[i]

    ratio = running_total / total_words
    if ratio > .9:
        break

print("The number of unique words that makes up 90% of the text is: ", num_words)

The number of unique words that makes up 90% of the text is:  3732
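The same cumulative count can be sketched as a small reusable function. The name `words_to_cover` and the toy counts are made up for illustration; the function numbers words from 1, so it returns a count of words rather than a list index.

```python
from itertools import accumulate

def words_to_cover(counts, fraction=0.9):
    """Return how many of the most frequent words are needed before the
    running total of occurrences exceeds `fraction` of all occurrences."""
    total = sum(counts.values())
    freqs = sorted(counts.values(), reverse=True)
    # accumulate yields the running totals; enumerate from 1 to count words.
    for n, running in enumerate(accumulate(freqs), start=1):
        if running / total > fraction:
            return n
    return len(freqs)

# Toy counts summing to 100, so the running totals are 50, 80, 95, 99, 100.
toy = {'the': 50, 'and': 30, 'of': 15, 'river': 4, 'pilot': 1}
print(words_to_cover(toy))  # 3
```

On the toy data the running total first exceeds 90% at the third word (95 of 100), so the function returns 3.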
