In [None]:
import re
import string
import collections

%load_ext line_profiler

In [None]:
with open("songofhiawatha.txt", "r") as f:
    poem = f.read()

In [None]:
poem = poem.split("\n")

We want to strip out all the punctuation first.

In [None]:
type(poem)

In [None]:
len(poem)

In [None]:
poem[:5]

In [None]:
exclude = set(string.punctuation).union(string.digits)

In [None]:
exclude

### A first pass at scrubbing out punctuation and numbers

In [None]:
def first(poem):
    newline = ''
    poem2 = []
    for line in poem:
        for char in line:
            if char not in exclude:
                newline += char
        poem2.append(newline.lower())
        newline = ''
        
    return poem2

In [None]:
%timeit first(poem)

In [None]:
%lprun -f first first(poem)

In [None]:
def second(poem):
    poem2 = []
    for line in poem:
        poem2.append(''.join([char for char in line.lower() if char not in exclude]))
    return poem2

In [None]:
%timeit second(poem)

In [None]:
%lprun -f second second(poem)

## Exercise: If one line is slowing you down

If the profiler shows a single line dominating the run time, the first thing to try is speeding up that line.
In `second` we still loop over every character of every line in Python, just in a more compact form (a list comprehension).
What if we could use something that replaces all the unwanted characters in one call?

```
Some people, when confronted with a problem, think “I know,
I'll use regular expressions.”  Now they have two problems.
```

But they can be faster.  Sometimes.
Give it a shot.

A couple of references:

```python
re.sub(pattern, replacement, input_string)
```

Two possibly suitable regex patterns to use:

Match every character that isn't a lowercase letter or a whitespace character (`\s` covers spaces, tabs, and newlines): `"[^a-z\s]"`

Match every character that isn't a letter or a whitespace character: `"[^a-zA-Z\s]"`
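To see how the first pattern behaves before applying it to the whole poem, here is a minimal sketch on a single made-up line. The sample string and the names `NOT_LOWER_OR_SPACE` and `scrub_line` are illustrative assumptions, not part of the notebook, and this is not meant as the finished `third`.

```python
import re

# Compile the pattern once and reuse it for every line.
# The raw string (r"...") keeps the backslash in \s intact.
NOT_LOWER_OR_SPACE = re.compile(r"[^a-z\s]")

def scrub_line(line):
    """Lowercase a line, then delete everything except a-z and whitespace."""
    return NOT_LOWER_OR_SPACE.sub("", line.lower())

# Hypothetical sample line containing punctuation and digits.
sample = "By the shores of Gitche Gumee, 1855!"
cleaned = scrub_line(sample)
print(repr(cleaned))
```

Note that the space before the digits survives, so the cleaned line can end with trailing whitespace; whether that matters depends on how the words are split later.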

In [None]:
def third(poem):
    ...

In [None]:
%timeit third(poem)

In [None]:
%lprun -f third third(poem)

In [None]:
def bag_of_words(poem):
    counters = [collections.Counter(line.split(' ')) for line in poem]
    counter = sum(counters, collections.Counter())
    return counter

In [None]:
def bag_of_words_re(poem):
    counters = [collections.Counter(re.findall(r"\w+", line)) for line in poem]
    counter = sum(counters, collections.Counter())
    return counter

## Try it out on the first 100 lines to see if it works

In [None]:
bag_of_words(third(poem[:100]))

In [None]:
%timeit bag_of_words(third(poem[:1000]))

In [None]:
%timeit bag_of_words_re(third(poem[:1000]))

## What should we have done before writing the regex version?

In [None]:
%lprun -f bag_of_words bag_of_words(third(poem[:1000]))

In [None]:
%lprun -f bag_of_words_re bag_of_words_re(third(poem[:1000]))

## Exercise: Make a faster bag-of-words

What have we noticed from the line profiling runs?  What is taking most of the time?

Remember that built-in methods generally run much faster than equivalent interpreted code that we write ourselves.

*Hint*: what if `poem` weren't a list? What might be faster?

### First exercise:

Try to write something that can handle scrubbing out all the characters in `exclude` faster than what we have above:

In [None]:
#%timeit <soln>

In [None]:
#%lprun -f <soln> <soln>(poem[:1000])
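One direction worth trying, sketched here under assumptions rather than offered as the intended solution: treat the poem as a single string and let the built-in `str.translate` delete the unwanted characters in one call. The name `scrub_text` and the two-line sample are hypothetical; `exclude` is rebuilt the same way as earlier in the notebook so the block runs on its own.

```python
import string

# Same characters the notebook collects in `exclude`.
exclude = set(string.punctuation).union(string.digits)

# The third argument of str.maketrans lists the characters to delete.
DELETE_TABLE = str.maketrans("", "", "".join(exclude))

def scrub_text(text):
    """Lowercase the text and drop every character in `exclude`."""
    return text.lower().translate(DELETE_TABLE)

# Hypothetical two-line sample kept as one string, not a list of lines.
sample = "Ye who love a nation's legends,\nLove the ballads of a people (1855)."
scrubbed = scrub_text(sample)
print(scrubbed.split("\n"))
```

Because the function takes and returns one string, it would be called on the text before `poem.split("\n")`, or on `"\n".join(poem)`; timing it with `%timeit` against `second` is the way to find out whether it is actually faster on your machine.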

### Second exercise:

Does our existing `bag_of_words` function work with the new scrubbing function?
If not, update the `bag_of_words` so it can use the output of the scrubbing function.

In [None]:
#%timeit <soln>
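If the scrubbing step returns one long string instead of a list of lines, a single `collections.Counter` over all the words can replace the sum of per-line counters. The sketch below uses two made-up, already-scrubbed lines to check that both approaches produce the same counts; the variable names are assumptions for illustration.

```python
import collections

# Hypothetical, already-scrubbed lines.
lines = ["by the shores of gitche gumee", "by the shining big sea water"]

# Per-line counters added together, as in the notebook's bag_of_words.
summed = sum(
    (collections.Counter(line.split()) for line in lines),
    collections.Counter(),
)

# One counter built from the joined text: one split, one counting pass.
single = collections.Counter(" ".join(lines).split())

print(summed == single)
print(single["by"], single["the"])
```

Here `str.split()` with no argument splits on any run of whitespace, so it does not produce the empty-string "words" that `split(' ')` yields when two spaces are adjacent; that is a small change in behavior from the original `bag_of_words`.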