# Word Counting

This notebook introduces some of the basic tools and ideas for working with natural language (text), including tokenization and word counting.

## Imports

In [1]:
import types

## Tokenization

In [2]:
PUNCTUATION = '`~!@#$%^&*()_-+={[}]|\\:;"<,>.?/}\t\n'

Write a generator function, `remove_punctuation`, that removes punctuation from an iterator of words and yields the cleaned words:

* Strip the punctuation characters at the beginning and end of each word.
* Replace `-` by a space if found in the middle of the word and split on that white space to yield multiple words.
* If a word is all punctuation, don't yield it at all.
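
The three rules above map onto built-in string methods. The sketch below is only an illustration for a single word: `clean` and the short `PUNCT` set are hypothetical names, and the notebook's own `PUNCTUATION` constant is larger.

```python
# Hypothetical, shortened punctuation set used only for this illustration.
PUNCT = '!;:,.-'

def clean(word, punctuation=PUNCT):
    """Return the list of cleaned pieces for a single word."""
    # Rule 1: strip every leading and trailing punctuation character.
    stripped = word.strip(punctuation)
    # Rule 2: interior hyphens become spaces, then split on whitespace.
    pieces = stripped.replace('-', ' ').split()
    # Rule 3: an all-punctuation word strips to '' and produces no pieces.
    return pieces

print(clean('!data;'))          # ['data']
print(clean('!data-science:'))  # ['data', 'science']
print(clean('!!'))              # []
```

Note that `str.strip` treats its argument as a set of characters, not as a prefix or suffix string, which is what makes it fit the first rule.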

In [3]:
def gen(words, punctuation):
    for word in words:
        # str.strip removes every leading and trailing character found in `punctuation`,
        # regardless of the order in which those characters appear.
        word = word.strip(punctuation)
        # Interior hyphens separate words; skip the empty pieces left by runs such as '--'.
        for item in word.replace('-', ' ').split(' '):
            if item:
                yield item

In [4]:
def remove_punctuation(words, punctuation=PUNCTUATION):
    vals = gen(words, punctuation)
    return vals
    

In [5]:
remove_punctuation(['!data;'])

<generator object gen at 0x7f439839b948>

In [6]:
print(type(remove_punctuation(['!!'])))

<class 'generator'>


In [7]:
print(isinstance(remove_punctuation(['!!']), types.GeneratorType))

True


In [8]:
assert list(remove_punctuation(['!data;']))==['data']
assert list(remove_punctuation(['!data-science:']))==['data', 'science']
assert list(remove_punctuation(['!!']))==[]
assert isinstance(remove_punctuation(['!!']), types.GeneratorType)
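
The `<generator object ...>` output above appears because calling a generator function runs none of its body; values are produced only when something consumes them. A minimal sketch of that laziness, using a hypothetical `shout` generator and an infinite input that could never be turned into a list:

```python
import itertools
import types

def shout(words):
    """Generator function: the body runs only as values are requested."""
    for word in words:
        yield word.upper()

# itertools.cycle never ends, yet this call returns immediately.
stream = shout(itertools.cycle(['a', 'b']))
print(isinstance(stream, types.GeneratorType))  # True

# islice pulls just three values from the otherwise endless stream.
first_three = list(itertools.islice(stream, 3))
print(first_three)  # ['A', 'B', 'A']
```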

Write a generator function, `lower_words`, that makes each word in an iterator lowercase, yielding each lowercase word:

In [9]:
def to_lower(words):
    for word in words:
        # str.lower handles the whole string; no per-character loop is needed.
        yield word.lower()
                
    

In [10]:
def lower_words(words):
    words = to_lower(words)
    return words

In [11]:
assert isinstance(lower_words('AAA'), types.GeneratorType)
assert list(lower_words('This IS NOT LoWerCaSe'.split(' ')))==['this', 'is', 'not', 'lowercase']

[Stop words](https://en.wikipedia.org/wiki/Stop_words) are common words in text that are typically filtered out when performing natural language processing. Typical stop words are *and*, *of*, *a*, *the*, etc.

Write a generator function, `remove_stop_words`, that removes stop words from an iterator, yielding the results:

In [12]:
type([1,2])

list

In [13]:
def remove_words(words, stop):
    # Accept stop words as a space-separated string or as any iterable of words.
    if isinstance(stop, str):
        stop = stop.split()
    # A set gives constant-time membership tests; None or empty means no filtering.
    stop = set(stop) if stop else set()
    for word in words:
        if word not in stop:
            yield word
                

In [14]:
def remove_stop_words(words, stop_words=None):
    vals = remove_words(words, stop_words)
    return vals

In [15]:
assert list(remove_stop_words('the begin to the end a of the day'.split(' '), stop_words='a the')) == \
    ['begin', 'to', 'end', 'of', 'day']
assert list(remove_stop_words('the begin to the end a of the day'.split(' '), stop_words=['a', 'the'])) == \
    ['begin', 'to', 'end', 'of', 'day']
assert list(remove_stop_words('the begin to the end a of the day'.split(' '))) == \
    ['the', 'begin', 'to', 'the', 'end', 'a', 'of', 'the', 'day']
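
As a compact alternative, stop-word filtering can also be written as a generator expression over a `frozenset`. This is a sketch only: `drop_stop_words` is a hypothetical helper, and because it returns a generator expression rather than using `yield`, it is a plain function that returns a generator.

```python
def drop_stop_words(words, stop_words=()):
    """Lazily yield the words that are not in stop_words (hypothetical helper)."""
    stop = frozenset(stop_words)  # constant-time membership tests
    return (w for w in words if w not in stop)

kept = list(drop_stop_words(['the', 'end', 'of', 'the', 'day'], ['the', 'of']))
print(kept)  # ['end', 'day']
```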

[Tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking a string or line of text and returning a sequence of words, or *tokens*, with the following transforms applied:

* Punctuation removed
* All words lowercased
* Stop words removed

Write a generator function, `tokenize_line`, that yields tokenized words from an input line of text.

In [16]:
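def tokenize_line(line, stop_words=None, punctuation=PUNCTUATION):
    # Chain the generators directly so that no intermediate lists are built,
    # and pass `punctuation` through instead of silently using the default.
    rev_punc = remove_punctuation(line.split(' '), punctuation)
    lower_case = lower_words(rev_punc)
    rev_stop = remove_stop_words(lower_case, stop_words)
    return rev_stop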

In [17]:
assert isinstance(tokenize_line("This, is the way; that things will end"), types.GeneratorType)
assert list(tokenize_line("This, is the way; that things will end", stop_words=['the', 'is'])) == \
    ['this', 'way', 'that', 'things', 'will', 'end']
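
The same three transforms can be sketched as a chain of generator expressions, each consuming the previous one lazily. The function name `tokenize` and its short default punctuation set are assumptions for illustration, and this simplified version does not split hyphenated words.

```python
def tokenize(line, stop_words=(), punctuation='.,;:!?'):
    """Hypothetical one-function tokenizer: strip, lowercase, filter."""
    stop = frozenset(stop_words)
    # Stage 1: strip punctuation from both ends of each word.
    stripped = (w.strip(punctuation) for w in line.split())
    # Stage 2: lowercase, dropping words that were all punctuation.
    lowered = (w.lower() for w in stripped if w)
    # Stage 3: filter out the stop words.
    return (w for w in lowered if w not in stop)

tokens = list(tokenize("This, is the way; that things will end", ['the', 'is']))
print(tokens)  # ['this', 'way', 'that', 'things', 'will', 'end']
```

Nothing is computed until the final generator is consumed, so each word passes through all three stages one at a time.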

Write a generator function, `tokenize_lines`, that can yield the tokens in an iterator of lines of text.

In [18]:
def token_lines(lines, stop, punc):
    for line in lines:
        # Delegate to the per-line tokenizer and re-yield its tokens lazily.
        yield from tokenize_line(line, stop, punc)

In [19]:
def tokenize_lines(lines, stop_words=None, punctuation=PUNCTUATION):
    token = token_lines(lines, stop_words, punctuation)
    return token
    

In [20]:
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""

assert isinstance(tokenize_lines(wasteland.splitlines()), types.GeneratorType)

assert list(tokenize_lines(wasteland.splitlines(), stop_words='is the of and')) == \
    ['april','cruellest','month','breeding','lilacs','out','dead','land',
     'mixing','memory','desire','stirring','dull','roots','with','spring',
     'rain']
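
Flattening the tokens of many lines into one stream is also what `itertools.chain.from_iterable` does. The sketch below uses a hypothetical `flatten_tokens` helper with `str.split` standing in for a real tokenizer. It returns an `itertools.chain` object rather than a generator, so it would not pass an `isinstance(..., types.GeneratorType)` check, but it is equally lazy.

```python
import itertools

def flatten_tokens(lines, tokenize=str.split):
    """Lazily chain the tokens of every line (tokenize is a stand-in)."""
    return itertools.chain.from_iterable(tokenize(line) for line in lines)

flat = list(flatten_tokens(['a b', '', 'c']))
print(flat)  # ['a', 'b', 'c']
```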

## Counting words

Write a function, `count_words`, that takes an iterator of words and returns a dictionary whose keys are the unique words in the input and whose values are the word counts. Never assume that the input iterator is a concrete list or tuple.

In [21]:
def counting_words(words):
    curr_dict = {}
    # A for loop works for any iterable: lists, tuples, and generators alike.
    for word in words:
        curr_dict[word] = curr_dict.get(word, 0) + 1
    return curr_dict
            

In [22]:
def count_words(words):
    mydict = counting_words(words)
    return mydict

In [23]:
assert count_words(tokenize_line('This, and The-this from, and A a a')) == \
    {'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2}
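
The standard library offers the same counting behavior in `collections.Counter`, which consumes any iterable one item at a time. A minimal sketch, with `count_with_counter` as a hypothetical name:

```python
from collections import Counter

def count_with_counter(words):
    """Count words from any iterable, including a one-shot generator."""
    return dict(Counter(words))

# A generator expression, not a list, to show that no concrete sequence is required.
word_stream = (w for w in ['a', 'and', 'a', 'this', 'and', 'a'])
counts = count_with_counter(word_stream)
print(counts)  # {'a': 3, 'and': 2, 'this': 1}
```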

Write a function, `sort_word_counts`, that returns a list of sorted word counts:

* Each element of the list should be a `(word, count)` tuple.
* The list should be sorted by the word counts, with the highest counts coming first.
* To perform this sort, look at using the `sorted` function.

This can return a concrete list as the memory here is proportional to the number of unique words in the text.

In [24]:
def sort_dict(dic):
    # Sort the (word, count) pairs by count, highest first, and return a concrete
    # list so that the result can be sliced and indexed.
    return sorted(dic.items(), key=lambda item: item[1], reverse=True)
        

In [25]:
set(sort_dict(count_words(tokenize_line('This, and The-this from, and A a a'))))

{('a', 3), ('and', 2), ('from', 1), ('the', 1), ('this', 2)}

In [26]:
def sort_word_counts(wc):
    mydict = sort_dict(wc)
    return mydict

In [27]:
assert set(sort_word_counts(count_words(tokenize_line('This, and The-this from, and A a a')))) == \
    {('a', 3), ('and', 2), ('this', 2), ('the', 1), ('from', 1)}
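
When several words share a count, their relative order depends on how the sort handles ties. One way to make the order fully deterministic is a compound sort key. The sketch below (with the hypothetical name `sorted_counts`) sorts by descending count and breaks ties alphabetically:

```python
def sorted_counts(counts):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    # Negating the count sorts it in descending order while the word stays ascending.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

ranked = sorted_counts({'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2})
print(ranked)
# [('a', 3), ('and', 2), ('this', 2), ('from', 1), ('the', 1)]
```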

## File IO

Write a generator function, `files_to_lines`, that takes an iterator of filenames and yields the lines in all of those files. Never create a concrete list or tuple in this process, so that memory consumption stays $\mathcal{O}(1)$. Use a `with` statement to close each file properly.
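
A common pitfall here is confusing `readline()` with iteration over the file object. `readline()` returns a single string, and iterating over a string yields characters, whereas iterating over the file object itself yields lines. The sketch below uses `io.StringIO` as an in-memory stand-in for an open text file; `lines_of` is a hypothetical helper.

```python
import io

def lines_of(handle):
    """Yield lines lazily from an open text handle."""
    for line in handle:
        yield line

handle = io.StringIO('first\nsecond\n')

# readline() gives one string; list() then splits it into characters.
chars = list(handle.readline())
print(chars)  # ['f', 'i', 'r', 's', 't', '\n']

# Rewind and iterate over the handle itself to get whole lines.
handle.seek(0)
whole_lines = list(lines_of(handle))
print(whole_lines)  # ['first\n', 'second\n']
```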

In [44]:
def gen_files(files):
    for file in files:
        # The with statement closes the file when its lines are exhausted
        # or when the generator itself is closed.
        with open(file, 'r') as f:
            # Iterating over the file object yields one line at a time.
            for line in f:
                yield line

In [45]:
def files_to_lines(files):
    lines = gen_files(files)
    return lines

In [46]:
%%writefile file1.txt
This is the first line in the first file.
This is the secon line in the first file.

Overwriting file1.txt


In [47]:
%%writefile file2.txt
This is the first line in the second file.
This is the second line in the second file.

Overwriting file2.txt


In [48]:
list(files_to_lines(['file1.txt', 'file2.txt']))

In [51]:
assert isinstance(files_to_lines(['file1.txt', 'file2.txt']), types.GeneratorType)
assert list(files_to_lines(['file1.txt', 'file2.txt'])) == \
    ['This is the first line in the first file.\n',
     'This is the secon line in the first file.',
     'This is the first line in the second file.\n',
     'This is the second line in the second file.']

## All together now

Now use all of the above functions to perform tokenization and word counting for all of the text documents described by your instructor:

* You should be able to perform this in a memory efficient manner.
* Read your stop words from the included `stopwords.txt` file.
* Save your sorted word counts to a variable named `swc`.

In [None]:
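with open('stopwords.txt', 'r') as f:
    stop_words = f.read().split()  # works for one word per line or space-separated words
# Placeholder: list the text documents described by your instructor here.
filenames = []
lines = files_to_lines(filenames)
swc = list(sort_word_counts(count_words(tokenize_lines(lines, stop_words))))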

In [None]:
assert [word for word, count in swc[0:10]] == \
    ['said', 'one', 'mr', 'now', 'upon', 'will', 'little', 'time', 'man', 'like']

Create a horizontal bar chart for the top 50 words using text and simple calls to `print`:

* For each word, encode the count as a bar of `*` characters.
* You will have to scale the length of your bars to fit on the page.
* Provide a label for each bar that indicates which word the count applies to.
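
One possible shape for such a chart is sketched below. The name `print_bar_chart`, its parameters, and the sample counts are all assumptions for illustration. It scales every bar relative to the largest count so that the longest bar is exactly `width` characters, and it assumes all counts are positive.

```python
def print_bar_chart(word_counts, top=50, width=40):
    """Print one '*' bar per (word, count) pair; the longest bar is `width` wide."""
    rows = list(word_counts)[:top]
    if not rows:
        return []
    max_count = max(count for _, count in rows)
    label_width = max(len(word) for word, _ in rows)
    lines = []
    for word, count in rows:
        # Scale the bar, keeping at least one star for any word that appears.
        bar = '*' * max(1, round(width * count / max_count))
        # Right-align the labels so that all bars start in the same column.
        lines.append(f'{word:>{label_width}} | {bar} {count}')
    print('\n'.join(lines))
    return lines

# Made-up counts, already sorted from highest to lowest.
chart = print_bar_chart([('a', 4), ('and', 2), ('the', 1)], width=8)
```

Returning the formatted lines in addition to printing them is a design choice that makes the function easy to test. A version that only prints would satisfy the exercise equally well.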

In [None]:
# YOUR CODE HERE
raise NotImplementedError()