This notebook refactors
[tf-05.py](https://github.com/crista/exercises-in-programming-style/blob/master/05-pipeline/tf-05.py),
starting with the version from
[d521abd 2016-05-21 08:14:41 -0700](https://github.com/crista/exercises-in-programming-style/blob/d521abd5d7aac14af19aa7794aca9ee23c0f8cc5/05-pipeline/tf-05.py).
It was refactored by the audience at the 2016-08-29 [COhPy](http://cohpy.org)
[meeting](http://www.meetup.com/Central-Ohio-Python-Users-Group/events/228901519/).

The refactoring starts with cell #5. The cells before that set up the diff_python script, which aids refactoring and later review. It shows:
- changes in the source code from the previously executed cell
- whether or not the output is correct
  - if the output is not correct, the differences from the correct output
- execution time

The license in the following cell covers only this notebook
and is in addition to the LICENSE file in the parent directory
of this notebook.

The MIT License (MIT)

Copyright (c) 2016 James Prior, Travis Risner, Sam, Joe Friedrich, Russ Herrold, and Eric Floehr

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


---
First we run the program, and save its output.

The original code runs only in Python 2, but my virtual environment runs Python 3 by default, so Python 2 is used explicitly.

In [1]:
!python2 tf-05.py ../pride-and-prejudice.txt | tee good_output

mr  -  786
elizabeth  -  635
very  -  488
darcy  -  418
such  -  395
mrs  -  343
much  -  329
more  -  327
bennet  -  323
bingley  -  306
jane  -  295
miss  -  283
one  -  275
know  -  239
before  -  229
herself  -  227
though  -  226
well  -  224
never  -  220
sister  -  218
soon  -  216
think  -  211
now  -  209
time  -  203
good  -  201


When creating new cells interactively,
one knows exactly what the changes are because they were just made.
But when one looks at the cells later,
how does one know what all the little changes were?
It would be nice to see the differences
between one cell and another as the refactoring progresses.
So cell magic is used to show the difference
between a cell and the previously executed cell.

After that, any difference between what the output should be
and what it actually is, is shown.

One complication is that, because my trickery
executes cells outside the Jupyter notebook,
the cells do not have access to variables
in the notebook, and vice versa.

One nice thing about running the cells outside Jupyter
is that we know each cell has everything it needs
and does not rely on some result from a previous cell.
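
The notebook itself uses the diff and cmp commands for the two checks described above.
Purely as an illustration, here is a rough Python sketch of the same idea
using the standard difflib module.
The source strings and output strings are made up for the example,
and difflib prints a unified diff rather than the classic format that diff prints by default.

```python
import difflib

# Hypothetical "previous cell" and "current cell" sources.
old_source = "print_all(words[2:25])\n"
new_source = "print_all(words[0:25])\n"

# Check 1: what changed in the source code?
source_diff = list(difflib.unified_diff(
    old_source.splitlines(keepends=True),
    new_source.splitlines(keepends=True),
    fromfile='old.py',
    tofile='new.py',
))
print(''.join(source_diff), end='')

# Check 2: is the output still the same as the known good output?
good_output = "mr  -  786\nelizabeth  -  635\n"
new_output = "mr  -  786\nelizabeth  -  635\n"
if new_output == good_output:
    verdict = 'GOOD: the output is good'
else:
    verdict = 'ERROR: new_output is different from good_output'
print(verdict)
```

Running the real cells as separate programs, as diff_python does, has the extra benefit mentioned above: each cell has to be complete on its own.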

---
Create the diff_python script
that will be executed by %%script magic
to show differences between cells,
and differences in output from what it should be.

In [2]:
%%script bash

# As we refactor, it would be nice to see the difference between
# one cell and the previously executed cell.
# This script creates a shell script that
# does that when executed with the %%script diff_python
# at the beginning of a cell.
#
# To disable the diff command,
# Put a : and space in front of it. I.e.,
#     : diff old.py new.py
#
# meld yields a beautiful diff,
# but pops up a window for each cell executed.

program_name="${PATH%%:*}/diff_python"

cat >"$program_name" <<EOF
#!/usr/bin/env bash
cat >new.py
chmod +x new.py
if [ -e old.py ]; then
    diff old.py new.py
fi
time ./new.py ../pride-and-prejudice.txt >new_output
echo
if cmp -s new_output good_output; then
    echo GOOD: the output is good
else
    echo ERROR: new_output is different from good_output
    # md5sum good_output new_output
    diff good_output new_output
fi
mv new.py old.py
EOF
rm -f old.py
chmod +x "$program_name"

From now on,
each cell will start with the %%script diff_python magic.
The original code is repeated below with three changes:
the %%script diff_python magic is added at the beginning,
#!/usr/bin/env python is changed to #!/usr/bin/env python2,
and a bug is deliberately introduced for cmp to catch.
Running it also creates old.py, the baseline for later code differences.

In [3]:
%%script diff_python
#!/usr/bin/env python2
import sys, re, operator, string

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    word_freqs = {}
    for w in word_list:
        if w in word_freqs:
            word_freqs[w] += 1
        else:
            word_freqs[w] = 1
    return word_freqs

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.iteritems(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print word_freqs[0][0], ' - ', word_freqs[0][1]
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[2:25])



ERROR: new_output is different from good_output
1,2d0
< mr  -  786
< elizabeth  -  635



real	0m0.353s
user	0m0.340s
sys	0m0.008s


diff_python correctly detected the change in output,
so we know that diff_python works.

Next we undo that change so that the output is good again.

In [4]:
%%script diff_python
#!/usr/bin/env python2
import sys, re, operator, string

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    word_freqs = {}
    for w in word_list:
        if w in word_freqs:
            word_freqs[w] += 1
        else:
            word_freqs[w] = 1
    return word_freqs

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.iteritems(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print word_freqs[0][0], ' - ', word_freqs[0][1]
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


74c74
< print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[2:25])
---
> print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])

GOOD: the output is good



real	0m0.354s
user	0m0.324s
sys	0m0.028s


---
Now we start refactoring, one thing at a time.

Python 2 is [scheduled to retire in 2020](https://pythonclock.org/),
so let's port the program to Python 3.

In [5]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    word_freqs = {}
    for w in word_list:
        if w in word_freqs:
            word_freqs[w] += 1
        else:
            word_freqs[w] = 1
    return word_freqs

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print(word_freqs[0][0], ' - ', word_freqs[0][1])
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


1c1
< #!/usr/bin/env python2
---
> #!/usr/bin/env python3
61c61
<     return sorted(word_freq.iteritems(), key=operator.itemgetter(1), reverse=True)
---
>     return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)
68c68
<         print word_freqs[0][0], ' - ', word_freqs[0][1]
---
>         print(word_freqs[0][0], ' - ', word_freqs[0][1])

GOOD: the output is good



real	0m0.500s
user	0m0.468s
sys	0m0.024s


With the addition of the following line

    from __future__ import print_function
    
the above code would work with either Python 2 or Python 3.
Nobody thought of that at the meeting.

---

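Travis Risner came up with the next thing to try:
use a
[sortedcontainers](https://pypi.python.org/pypi/sortedcontainers).SortedDict
to avoid the sorted function in sort().

The idea was abandoned when we realized that sortedcontainers is not
in the standard library.
(The import in the cell below also misspells SortedDict as sortedDict,
but Python raises the ImportError for the missing module before that matters.)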

In [6]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from sortedcontainers import sortedDict # not standard library

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    word_freqs = SortedDict()
    for w in word_list:
        if w in word_freqs:
            word_freqs[w] += 1
        else:
            word_freqs[w] = 1
    return word_freqs

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    # return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)
    return word_freq

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print(word_freqs[0][0], ' - ', word_freqs[0][1])
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


2a3
> from sortedcontainers import sortedDict # not standard library
47c48
<     word_freqs = {}
---
>     word_freqs = SortedDict()
61c62,63
<     return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)
---
>     # return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)
>     return word_freq

ERROR: new_output is different from good_output
1,25d0
< mr  -  786
< elizabeth  -  635
< very  -  488
< darcy  -  418
< such  -  395
< mrs  -  343
< much  -  329
< more  -  327
< bennet  -  323
< bingley  -  306
< jane  -  295
< miss  -  283
< one  -  275
< know  -  239
< before  -  229
< herself  -  227
< though  -  226
< well  -  224
< never  -  220
< sister  -  218
< soon  -  216
< think  -  211
< now  -  209
< time  -  203
< good  -  201


Traceback (most recent call last):
  File "./new.py", line 3, in <module>
    from sortedcontainers import sortedDict # not standard library
ImportError: No module named 'sortedcontainers'

real	0m0.046s
user	0m0.036s
sys	0m0.008s


Sam improved the counting by using a 
[defaultdict](https://docs.python.org/3/library/collections.html#collections.defaultdict).
It definitely cleaned up the counting code,
eliminating the if/else structure.

There was confusion about how to use a defaultdict.
The first argument is a callable that takes no arguments
and returns the default value for a missing key.
int() called with no arguments returns 0,
so defaultdict(int) starts the count of each new word at 0.

In [7]:
int()

0
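
As a small illustration of that point
(the word list here is made up, not taken from the book),
defaultdict calls its callable with no arguments
whenever a missing key is looked up:

```python
from collections import defaultdict

# Hypothetical sample data for illustration only.
words = ['darcy', 'elizabeth', 'darcy', 'jane', 'darcy']

# int is passed uncalled; defaultdict calls int() for each missing key.
word_freqs = defaultdict(int)
for w in words:
    word_freqs[w] += 1  # a missing key starts at int(), which is 0

print(dict(word_freqs))

# Any callable that takes no arguments can serve as the factory.
always_seven = defaultdict(lambda: 7)
print(always_seven['unseen'])
```

Passing int() instead of int would hand defaultdict the value 0 rather than a callable, which raises a TypeError.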

In [8]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from collections import defaultdict

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    word_freqs = defaultdict(int)
    for w in word_list:
        word_freqs[w] += 1
    return word_freqs

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print(word_freqs[0][0], ' - ', word_freqs[0][1])
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


3c3
< from sortedcontainers import sortedDict # not standard library
---
> from collections import defaultdict
48c48
<     word_freqs = SortedDict()
---
>     word_freqs = defaultdict(int)
50,53c50
<         if w in word_freqs:
<             word_freqs[w] += 1
<         else:
<             word_freqs[w] = 1
---
>         word_freqs[w] += 1
62,63c59
<     # return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)
<     return word_freq
---
>     return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

GOOD: the output is good



real	0m0.487s
user	0m0.464s
sys	0m0.016s


John Cassidy suggested using a
[Counter](https://docs.python.org/3/library/collections.html#collections.Counter)
to further simplify the counting code.
This was so successful that
frequencies() is now just a thin wrapper around Counter().
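
To see why the wrapper is so thin, here is a minimal sketch
(with a made-up word list) showing that Counter produces
the same counts as the defaultdict loop:

```python
from collections import Counter, defaultdict

# Hypothetical sample data for illustration only.
words = ['mr', 'darcy', 'mr', 'jane', 'darcy', 'mr']

# Counting by hand, as in the defaultdict version.
manual = defaultdict(int)
for w in words:
    manual[w] += 1

# Counter does the same counting in one call.
counted = Counter(words)

print(dict(counted) == dict(manual))

# A missing word counts as 0 and is not added to the Counter.
print(counted['bingley'], 'bingley' in counted)
```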

In [9]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from collections import Counter

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    return Counter(word_list)

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print(word_freqs[0][0], ' - ', word_freqs[0][1])
        print_all(word_freqs[1:]);

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


3c3
< from collections import defaultdict
---
> from collections import Counter
48,51c48
<     word_freqs = defaultdict(int)
<     for w in word_list:
<         word_freqs[w] += 1
<     return word_freqs
---
>     return Counter(word_list)

GOOD: the output is good



real	0m0.485s
user	0m0.464s
sys	0m0.016s


Counter objects have a nifty
[most_common](https://docs.python.org/3/library/collections.html#collections.Counter.most_common)
method
which would have made the sorting trivial,
but no one spoke up about that at the meeting.
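
For the record, here is a sketch of what that could have looked like.
It was not done at the meeting, and the word list is made up for illustration.
most_common(n) returns the n most frequent (word, count) pairs,
which is what sort() followed by a slice produces:

```python
import operator
from collections import Counter

# Hypothetical sample data for illustration only.
words = ['mr', 'elizabeth', 'mr', 'darcy', 'elizabeth', 'mr']
word_freqs = Counter(words)

# The approach used in the notebook: sort all pairs, then slice.
by_sorted = sorted(
    word_freqs.items(), key=operator.itemgetter(1), reverse=True)[0:2]

# The Counter method does the same job in one call.
by_most_common = word_freqs.most_common(2)

print(by_most_common)
print(by_sorted == by_most_common)
```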

---

Eric Floehr recognized that there was an unnecessary semicolon,
so it was removed.

In [10]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from collections import Counter

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    return Counter(word_list)

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        print(word_freqs[0][0], ' - ', word_freqs[0][1])
        print_all(word_freqs[1:])

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


64c64
<         print_all(word_freqs[1:]);
---
>         print_all(word_freqs[1:])

GOOD: the output is good



real	0m0.493s
user	0m0.468s
sys	0m0.016s


What do word_freqs[0][0] and word_freqs[0][1] in print_all() mean?
They make the function hard to read.
Eric Floehr gave them meaningful names to make the code readable.
He used tuple unpacking to do that.
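
As a minimal sketch of tuple unpacking on its own,
using the first two rows of the good output as data:

```python
# Each entry is a (word, frequency) pair, as returned by sort().
word_freqs = [('mr', 786), ('elizabeth', 635)]

# Indexing works, but the reader has to remember what [0] and [1] mean.
print(word_freqs[0][0], ' - ', word_freqs[0][1])

# Unpacking gives each element of the pair a meaningful name.
word, frequency = word_freqs[0]
print(word, ' - ', frequency)
```

Both print calls produce the same line; only the readability differs.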

In [11]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from collections import Counter

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    return Counter(word_list)

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    if(len(word_freqs) > 0):
        word, frequency = word_freqs[0]
        print(word, ' - ', frequency)
        print_all(word_freqs[1:])

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


63c63,64
<         print(word_freqs[0][0], ' - ', word_freqs[0][1])
---
>         word, frequency = word_freqs[0]
>         print(word, ' - ', frequency)

GOOD: the output is good



real	0m0.486s
user	0m0.440s
sys	0m0.040s


Eric Floehr noticed that print_all()
was unnecessarily complicated with recursion,
so he refactored the function to use a simple for loop.
That made the function simple and easy to read. Yeah!
(The docstring still says "recursively"; it was not updated.)
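
Readability is not the only difference.
The sketch below uses hypothetical helper names, lines_recursive and lines_loop,
that collect lines instead of printing them.
It suggests that the recursive form also fails on long lists,
because Python limits recursion depth.
In the notebook the list is sliced to 25 entries, so the limit is never reached there.

```python
import sys

def lines_recursive(word_freqs):
    """Hypothetical mirror of the recursive print_all; returns lines."""
    if len(word_freqs) == 0:
        return []
    word, frequency = word_freqs[0]
    line = '{}  -  {}'.format(word, frequency)
    return [line] + lines_recursive(word_freqs[1:])

def lines_loop(word_freqs):
    """Hypothetical mirror of the for-loop print_all; returns lines."""
    return ['{}  -  {}'.format(word, frequency)
            for word, frequency in word_freqs]

# On a short list, both forms give the same result.
short = [('mr', 786), ('elizabeth', 635), ('very', 488)]
print(lines_recursive(short) == lines_loop(short))

# On a list longer than the recursion limit, only the loop survives.
long_list = [('word', 1)] * (sys.getrecursionlimit() + 100)
try:
    lines_recursive(long_list)
    recursion_failed = False
except RecursionError:
    recursion_failed = True
print(recursion_failed)
print(len(lines_loop(long_list)) == len(long_list))
```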

In [12]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from collections import Counter

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    return Counter(word_list)

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    for word, frequency in word_freqs:
        print(word, ' - ', frequency)

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


62,63c62
<     if(len(word_freqs) > 0):
<         word, frequency = word_freqs[0]
---
>     for word, frequency in word_freqs:
65d63
<         print_all(word_freqs[1:])

GOOD: the output is good



real	0m0.487s
user	0m0.460s
sys	0m0.024s


Joe Friedrich noticed that the docstring of
filter_chars_and_normalize() did not match the
behavior of the function,
so the docstring was corrected.

In [13]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from collections import Counter

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    All letters changed to lowercase.
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    return Counter(word_list)

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    for word, frequency in word_freqs:
        print(word, ' - ', frequency)

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


20a21
>     All letters changed to lowercase.

GOOD: the output is good



real	0m0.487s
user	0m0.460s
sys	0m0.024s


Russ Herrold moved .lower() from
filter_chars_and_normalize() to read_file()
and was inclined to completely absorb 
filter_chars_and_normalize() into read_file().

It works, but consolidating functionality into fewer functions
is not the point of this style,
so we did not keep it.
(It *is* the point of the
[code golf](https://en.wikipedia.org/wiki/Code_golf)
style such as in
[tf-06.py](https://github.com/crista/exercises-in-programming-style/blob/d521abd5d7aac14af19aa7794aca9ee23c0f8cc5/06-code-golf/tf-06.py).)

In [14]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from collections import Counter

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    lowercase contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read().lower()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data)

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = f.read().split(',')
    # add single-letter words
    stop_words.extend(list(string.ascii_lowercase))
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    return Counter(word_list)

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    for word, frequency in word_freqs:
        print(word, ' - ', frequency)

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


11c11
<     contents of the file as a string
---
>     lowercase contents of the file as a string
14c14
<         data = f.read()
---
>         data = f.read().lower()
21d20
<     All letters changed to lowercase.
24c23
<     return pattern.sub(' ', str_data).lower()
---
>     return pattern.sub(' ', str_data)

GOOD: the output is good



real	0m0.497s
user	0m0.480s
sys	0m0.012s


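Jim Prior changed stop_words from a list to a set.
A list is searched element by element,
whereas a set lookup takes roughly constant time on average,
so testing each word against stop_words became much faster.
Notice that the update method iterates over
string.ascii_lowercase directly, so the list() call is no longer needed.
Notice the big reduction in execution time:
real time dropped from 0m0.497s to 0m0.181s.
(The diff below also shows Russ Herrold's .lower() move being reverted,
since that change was not kept.)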

Review [20160523-cohpy-speed-of-searching-sets-and-lists.ipynb](http://nbviewer.jupyter.org/github/james-prior/cohpy/blob/master/20160523-cohpy-speed-of-searching-sets-and-lists.ipynb).
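
A minimal sketch of the difference, assuming a short made-up stop word string in place of ../stop_words.txt and a made-up word list. The names STOP_WORDS_TEXT, keep_with_list and keep_with_set are hypothetical. Actual timings depend on the machine, so none are shown here.

```python
import string
import timeit

# Hypothetical stand-in for the contents of ../stop_words.txt.
STOP_WORDS_TEXT = 'the,and,of,to,in,is,it,that,was,for,on,with,as,his,he'

# List version, built the way the earlier cell builds it.
stop_list = STOP_WORDS_TEXT.split(',')
stop_list.extend(list(string.ascii_lowercase))

# Set version; update() iterates over the string one character at a time.
stop_set = set(STOP_WORDS_TEXT.split(','))
stop_set.update(string.ascii_lowercase)

# Made-up sample input; 'whale' and 'ship' are not stop words.
words = ['the', 'whale', 'and', 'ship', 'x'] * 20000

def keep_with_list():
    return [w for w in words if w not in stop_list]

def keep_with_set():
    return [w for w in words if w not in stop_set]

list_seconds = timeit.timeit(keep_with_list, number=5)
set_seconds = timeit.timeit(keep_with_set, number=5)
print('list: {:.4f} s'.format(list_seconds))
print('set:  {:.4f} s'.format(set_seconds))
```

Both functions return the same words; only the cost of each membership test differs.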

In [15]:
%%script diff_python
#!/usr/bin/env python3
import sys, re, operator, string
from collections import Counter

#
# The functions
#
def read_file(path_to_file):
    """
    Takes a path to a file and returns the entire
    contents of the file as a string
    """
    with open(path_to_file) as f:
        data = f.read()
    return data

def filter_chars_and_normalize(str_data):
    """
    Takes a string and returns a copy with all nonalphanumeric
    chars replaced by white space
    All letters changed to lowercase.
    """
    pattern = re.compile('[\W_]+')
    return pattern.sub(' ', str_data).lower()

def scan(str_data):
    """
    Takes a string and scans for words, returning
    a list of words.
    """
    return str_data.split()

def remove_stop_words(word_list):
    """
    Takes a list of words and returns a copy with all stop
    words removed
    """
    with open('../stop_words.txt') as f:
        stop_words = set(f.read().split(','))
    # add single-letter words
    stop_words.update(string.ascii_lowercase)
    return [w for w in word_list if not w in stop_words]

def frequencies(word_list):
    """
    Takes a list of words and returns a dictionary associating
    words with frequencies of occurrence
    """
    return Counter(word_list)

def sort(word_freq):
    """
    Takes a dictionary of words and their frequencies
    and returns a list of pairs where the entries are
    sorted by frequency
    """
    return sorted(word_freq.items(), key=operator.itemgetter(1), reverse=True)

def print_all(word_freqs):
    """
    Takes a list of pairs where the entries are sorted by frequency and print them recursively.
    """
    for word, frequency in word_freqs:
        print(word, ' - ', frequency)

#
# The main function
#
print_all(sort(frequencies(remove_stop_words(scan(filter_chars_and_normalize(read_file(sys.argv[1]))))))[0:25])


11c11
<     lowercase contents of the file as a string
---
>     contents of the file as a string
14c14
<         data = f.read().lower()
---
>         data = f.read()
20a21
>     All letters changed to lowercase.
23c24
<     return pattern.sub(' ', str_data)
---
>     return pattern.sub(' ', str_data).lower()
38c39
<         stop_words = f.read().split(',')
---
>         stop_words = set(f.read().split(','))
40c41
<     stop_words.extend(list(string.ascii_lowercase))
---
>     stop_words.update(string.ascii_lowercase)

GOOD: the output is good



real	0m0.181s
user	0m0.160s
sys	0m0.020s


---

More opportunities:
- Name the magic numbers, such as 25.
- Give a meaningful name to sys.argv[1].
- Choose better names.
  - Avoid types in names. That often restricts code unnecessarily.
- Make better docstrings.
- Make the code PEP 8 compliant.
- Delete comments that belabor the obvious.
- Use .most_common() method of Counter object (as mentioned earlier).
- Compare heapq and collections.Counter.
- Write to work in Python 2 and Python 3.
- Put top level code in a main() function.
  - There is a comment about main function,
    but there is no main function, just top level code.
- Use generators.
  - Could handle very large files that are bigger than memory.
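
Several of these opportunities can be sketched together. The following is one possible Python 3 shape, not what was done at the meeting. The names TOP_COUNT, STOP_WORDS_PATH, read_lines, scan_words, load_stop_words and most_frequent are hypothetical, and passing stop_words as an argument is a design choice of this sketch. It names the magic number, names sys.argv[1], uses generators so the text is read a line at a time, uses Counter.most_common(), and puts the top level code in main().

```python
#!/usr/bin/env python3
import re
import string
import sys
from collections import Counter

TOP_COUNT = 25  # hypothetical name for the magic number 25
STOP_WORDS_PATH = '../stop_words.txt'  # same path the cells above use

def read_lines(path):
    """Yield lines one at a time, so the whole file need not fit in memory."""
    with open(path) as f:
        for line in f:
            yield line

def scan_words(lines):
    """Yield lowercase words; runs of nonalphanumerics and '_' separate them."""
    pattern = re.compile(r'[\W_]+')
    for line in lines:
        for word in pattern.sub(' ', line).lower().split():
            yield word

def load_stop_words(path):
    """Return the set of stop words in the file plus all single letters."""
    with open(path) as f:
        stop_words = set(f.read().split(','))
    stop_words.update(string.ascii_lowercase)
    return stop_words

def remove_stop_words(words, stop_words):
    """Yield only the words that are not in stop_words."""
    return (w for w in words if w not in stop_words)

def most_frequent(words, count):
    """Return the count most common (word, frequency) pairs."""
    return Counter(words).most_common(count)

def main(argv):
    if len(argv) != 2:
        print('usage: {} path_to_text_file'.format(argv[0]), file=sys.stderr)
        return
    text_path = argv[1]  # a meaningful name for what was sys.argv[1]
    stop_words = load_stop_words(STOP_WORDS_PATH)
    words = remove_stop_words(scan_words(read_lines(text_path)), stop_words)
    for word, frequency in most_frequent(words, TOP_COUNT):
        print(word, ' - ', frequency)

if __name__ == '__main__':
    main(sys.argv)
```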
---

Afterthoughts

Focus on readability before speed.

It would have been better to focus on one function at a time instead of jumping around.

It would have been better to state the constraints before starting
instead of winging it.
- Standard Python (so nothing that needs pip install to use).
- Stick to the style. In this case the style was that each function did one thing \([UNIX philosophy](https://en.wikipedia.org/wiki/The_unix_philosophy)\) and the function calls were nested.
  - The functions are pure functions.
    - Their input is only from the arguments.
    - The only output is the return value (except print_all()).
- Maintain the functionality within each function.
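
One wrinkle with the pure function constraint: remove_stop_words() in the cells above also reads ../stop_words.txt, so its input does not come only from its arguments. A minimal sketch of one way to separate the two concerns follows. It was not done at the meeting, and load_stop_words and sample_stop_words are hypothetical names. The extra parameter also departs from the one-argument nesting of the style, so this is a trade-off rather than a clear win.

```python
import string

def remove_stop_words(words, stop_words):
    """Return the words not in stop_words. No file access, no globals."""
    return [w for w in words if w not in stop_words]

def load_stop_words(path):
    """The impure part, kept separate: read comma-separated stop words."""
    with open(path) as f:
        stop_words = set(f.read().split(','))
    stop_words.update(string.ascii_lowercase)
    return stop_words

# Hypothetical in-memory stop words, so the example runs without any file.
sample_stop_words = {'the', 'a'}
filtered = remove_stop_words(['the', 'whale', 'a', 'ship'], sample_stop_words)
print(filtered)  # prints ['whale', 'ship']
```

With the file access pulled out, remove_stop_words() can be tested with plain in-memory data.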
