
# Word Counting

In [1]:
sentence = "This cat jumped over this other cat!"

In [2]:
def stem(word):
    """ Stem word to primitive form 
    
    >>> stem("Hello!")
    'hello'
    """
    return word.lower().rstrip(",.!)-*_?:;$'-\"").lstrip("-*'\"(_$'")

In [3]:
def wordcount(string):
    words = string.split()
    
    stemmed_words = []
    for word in words:
        stemmed_words.append(stem(word))
    
    counts = dict()
    for word in stemmed_words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    
    return counts

wordcount(sentence)

{'cat': 2, 'jumped': 1, 'other': 1, 'over': 1, 'this': 2}

Now we compute the same result with `map` and the `frequencies` function from `toolz`.

In [4]:
from toolz import frequencies

frequencies(map(stem, sentence.split()))

{'cat': 2, 'jumped': 1, 'other': 1, 'over': 1, 'this': 2}
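For readers without `toolz` installed, the same tally can be sketched with the standard library. This is only an illustration under the assumption that `collections.Counter` is an acceptable stand-in for `frequencies`; it is not how `toolz` implements it. `sentence` and `stem` are repeated so the block runs on its own.

```python
from collections import Counter

sentence = "This cat jumped over this other cat!"

def stem(word):
    """Lowercase a word and strip surrounding punctuation (copied from above)."""
    return word.lower().rstrip(",.!)-*_?:;$'-\"").lstrip("-*'\"(_$'")

# Counter tallies every item the iterable yields, much like the explicit
# counting loop in wordcount. Counter is a stand-in here, not toolz itself.
counts = dict(Counter(map(stem, sentence.split())))
print(counts)
```

The stemming and counting steps are unchanged; only the bookkeeping loop has been replaced by a library call.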

### Pipe

Lines of functional code can contain *a lot* of parentheses.  To get around this, let's learn the `pipe` function.  Consider the following code to do laundry.

```
# Do Laundry
clothes = ...
wet_clothes = wash(clothes)
dry_clothes = dry(wet_clothes)
result = fold(dry_clothes)
```

This pushes the data, `clothes`, through a pipeline of functions: `wash`, `dry`, and `fold`.  This is a common pattern.  Using `pipe` we push the clothes through the three transformations in sequence:

```
result = pipe(clothes, wash, dry, fold)
```   

Pipe pushes data (first argument) through a sequence of functions (rest of the arguments) from left to right.

In [5]:
from toolz import pipe

# Simple example
def double(x):
    return 2 * x

pipe(3, double, double, str)

'12'
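To make the left-to-right behaviour concrete, here is a minimal sketch of what a `pipe`-like function does, written with `functools.reduce`. The name `pipe_sketch` is hypothetical and exists only for illustration; it is not the `toolz` source.

```python
from functools import reduce

def pipe_sketch(data, *funcs):
    """Thread data through funcs from left to right.

    Hypothetical stand-in for toolz.pipe, for illustration only.
    """
    return reduce(lambda value, func: func(value), funcs, data)

def double(x):
    return 2 * x

# The nested call reads inside-out; the piped call reads left to right.
nested = str(double(double(3)))
piped = pipe_sketch(3, double, double, str)
print(nested, piped)
```

Both expressions apply `double` twice and then `str`, so they produce the same value; the piped form just lists the steps in the order they run.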

In [6]:
from toolz.curried import map
    
pipe(sentence, str.split, map(stem), frequencies)

{'cat': 2, 'jumped': 1, 'other': 1, 'over': 1, 'this': 2}
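Note that `map(stem)` above receives only a function and no sequence. The curried `map` from `toolz.curried` handles this by returning a new function that waits for the sequence. A rough standard-library analogue, assuming `functools.partial` is a fair approximation of that behaviour, looks like this:

```python
from functools import partial

def stem(word):
    """Lowercase a word and strip surrounding punctuation (copied from above)."""
    return word.lower().rstrip(",.!)-*_?:;$'-\"").lstrip("-*'\"(_$'")

# partial(map, stem) fixes the first argument of the builtin map and returns
# a callable that still expects the iterable. This approximates the curried
# map(stem); it is not the toolz implementation.
map_stem = partial(map, stem)

stemmed = list(map_stem("This cat jumped!".split()))
print(stemmed)
```

Because `map_stem` takes a single argument, it can be used as one stage of a `pipe` call.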

## Extending the example to multi-line files

We now implement wordcount on whole files rather than on single sentences, and see how each of the implementations above has to adapt.

In [7]:
def wordcount(file):

    counts = dict()
    
    for line in file:
        words = line.split()
        
        stemmed_words = []
        for word in words:
            stemmed_words.append(stem(word))

        for word in stemmed_words:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1
    
    return counts

with open('data/tale-of-two-cities.txt') as f:
    for i in range(112):  # Burn first 112 lines - they include the Gutenberg header
        next(f)
    result = wordcount(f)

result
    

{'': 35,
 '1': 2,
 '1.a': 1,
 '1.b': 1,
 '1.c': 2,
 '1.d': 1,
 '1.e': 2,
 '1.e.1': 5,
 '1.e.2': 1,
 '1.e.3': 1,
 '1.e.4': 1,
 '1.e.5': 1,
 '1.e.6': 1,
 '1.e.7': 3,
 '1.e.8': 4,
 '1.e.9': 3,
 '1.f': 1,
 '1.f.1': 1,
 '1.f.2': 1,
 '1.f.3': 4,
 '1.f.4': 1,
 '1.f.5': 1,
 '1.f.6': 1,
 '1500': 1,
 '1757': 1,
 '1767': 2,
 '1792': 1,
 '2': 1,
 '20%': 1,
 '2001': 1,
 '21': 1,
 '3': 3,
 '30': 1,
 '4': 3,
 '4557': 1,
 '5': 1,
 '5,000': 1,
 '50': 1,
 '501(c)(3': 2,
 '596-1887': 1,
 '60': 1,
 '64-6221541': 1,
 '801': 1,
 '809': 1,
 '84116': 1,
 '90': 2,
 '98-8.txt': 1,
 '98-8.zip': 1,
 '99712': 1,
 'a': 2967,
 'a--a': 1,
 'a-a-a-business': 1,
 'a-a-matter': 1,
 'a-buzz': 1,
 'a-tiptoe': 1,
 'aback': 1,
 'abandon': 1,
 'abandoned': 10,
 'abandoning': 1,
 'abandonment': 1,
 'abashed': 1,
 'abate': 1,
 'abated': 1,
 'abbaye': 4,
 'abbaye--in': 1,
 'abed': 2,
 'abhorrence': 1,
 'abide': 1,
 'abided': 1,
 'abiding': 1,
 'abilities': 1,
 'ability': 2,
 'abject': 1,
 'ablaze': 1,
 'able': 17,
 'aboard': 1,

In [8]:
from toolz import concat
from toolz.curried import drop

pipe('data/tale-of-two-cities.txt', open, drop(112), map(str.split), concat, map(stem), frequencies)

{'': 35,
 '1': 2,
 '1.a': 1,
 '1.b': 1,
 '1.c': 2,
 '1.d': 1,
 '1.e': 2,
 '1.e.1': 5,
 '1.e.2': 1,
 '1.e.3': 1,
 '1.e.4': 1,
 '1.e.5': 1,
 '1.e.6': 1,
 '1.e.7': 3,
 '1.e.8': 4,
 '1.e.9': 3,
 '1.f': 1,
 '1.f.1': 1,
 '1.f.2': 1,
 '1.f.3': 4,
 '1.f.4': 1,
 '1.f.5': 1,
 '1.f.6': 1,
 '1500': 1,
 '1757': 1,
 '1767': 2,
 '1792': 1,
 '2': 1,
 '20%': 1,
 '2001': 1,
 '21': 1,
 '3': 3,
 '30': 1,
 '4': 3,
 '4557': 1,
 '5': 1,
 '5,000': 1,
 '50': 1,
 '501(c)(3': 2,
 '596-1887': 1,
 '60': 1,
 '64-6221541': 1,
 '801': 1,
 '809': 1,
 '84116': 1,
 '90': 2,
 '98-8.txt': 1,
 '98-8.zip': 1,
 '99712': 1,
 'a': 2967,
 'a--a': 1,
 'a-a-a-business': 1,
 'a-a-matter': 1,
 'a-buzz': 1,
 'a-tiptoe': 1,
 'aback': 1,
 'abandon': 1,
 'abandoned': 10,
 'abandoning': 1,
 'abandonment': 1,
 'abashed': 1,
 'abate': 1,
 'abated': 1,
 'abbaye': 4,
 'abbaye--in': 1,
 'abed': 2,
 'abhorrence': 1,
 'abide': 1,
 'abided': 1,
 'abiding': 1,
 'abilities': 1,
 'ability': 2,
 'abject': 1,
 'ablaze': 1,
 'able': 17,
 'aboard': 1,
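The stages of the pipeline above can be read one at a time. The following sketch mirrors each stage with a standard-library counterpart and runs on a small in-memory text instead of the novel. The function name `wordcount_stream`, the `skip` parameter, and the sample text are all made up for illustration, and the stage-by-stage mapping to `toolz` is an approximation.

```python
import io
from collections import Counter
from itertools import chain, islice

def stem(word):
    """Lowercase a word and strip surrounding punctuation (copied from above)."""
    return word.lower().rstrip(",.!)-*_?:;$'-\"").lstrip("-*'\"(_$'")

def wordcount_stream(lines, skip=0):
    """Count stemmed words lazily over an iterable of lines (hypothetical helper)."""
    remaining = islice(lines, skip, None)      # roughly drop(skip)
    word_lists = map(str.split, remaining)     # one list of words per line
    words = chain.from_iterable(word_lists)    # roughly concat: flatten the lists
    return dict(Counter(map(stem, words)))     # stem each word, then tally

# A made-up three-line "file" whose first line plays the role of a header.
sample = io.StringIO("HEADER LINE\nThe cat sat.\nThe dog sat, too!\n")
result = wordcount_stream(sample, skip=1)
print(result)
```

With a real file, calling the helper inside a `with open(...) as f:` block closes the handle deterministically, whereas passing `open` as a pipeline stage leaves closing to the garbage collector.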

In [9]:
%timeit pipe('data/tale-of-two-cities.txt', open, drop(112), map(str.split), concat, map(stem), frequencies)

10 loops, best of 3: 141 ms per loop


In [10]:
%timeit with open('data/tale-of-two-cities.txt') as f: wordcount(f)

10 loops, best of 3: 139 ms per loop
