Problem: Finding the "the"s
----

We are going to find the "the"s in a sentence.

In [7]:
reset -fs

In [2]:
# Let's make up a fictional sentence containing many "the"s.
text = 'theo thought the overly blithe theatrical theme was otherwise the most boring' 

Let's start with string methods
------

In [3]:
text.index('the') # Only finds the first occurrence, at index 0

0

In [4]:
text.count('the') # There are 7 total

7

In [5]:
# Find the index of every occurrence using string methods only
[i for i, _ in enumerate(text) if text.startswith('the', i)] # Linear in len(text) for a short, fixed pattern

[0, 13, 27, 31, 42, 53, 62]

Check out the [KMP algorithm](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) for a substring search that stays linear even when the pattern is long.
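As an alternative to the list comprehension above, the same indices can be collected by calling `str.find` repeatedly. This is only a sketch: `find_all` is a hypothetical helper name, not something from the standard library.

```python
def find_all(haystack, needle):
    """Return the start index of every (possibly overlapping) occurrence of needle."""
    if not needle:
        raise ValueError("needle must be non-empty")
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = haystack.find(needle, start + 1)
    return positions

text = 'theo thought the overly blithe theatrical theme was otherwise the most boring'
print(find_all(text, 'the'))  # [0, 13, 27, 31, 42, 53, 62]
```

`str.find` returns -1 when nothing is left to find, which ends the loop without raising the `ValueError` that `str.index` would.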

Let's use regex instead...
-------

<center><img src="images/two_problems.png" width="700"/></center>

In [12]:
import re

# The findall function takes the pattern to search for and the string to search in
re.findall?

In [13]:
results_the = re.findall(r'the', text) 

In [14]:
# We got all of them, like before
# However, the results are the matched strings, not indices
results_the # One "the" each for theo, the, blithe, theatrical, theme, otherwise, the

['the', 'the', 'the', 'the', 'the', 'the', 'the']

We are only interested in the independent "the".
We should ignore the "the"s inside other words: theo, blithe, theatrical, theme, otherwise

Let's look for 'the' as a word 

In [15]:
# Using literals
results_the = re.findall(r' the ', text) 
results_the

[' the ', ' the ']

In [16]:
# Using regex symbols
results_the = re.findall(r'[^\w]the[^\w]', text) 
results_the

[' the ', ' the ']

In [17]:
# Let's remove the extra padding
results_the = [item.strip() for item in results_the]
results_the

['the', 'the']

In [18]:
# Or use word boundaries (\b), which match positions and so add no padding
re.findall(r'\bthe\b', text) 

['the', 'the']
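`re.findall` returns the matched strings, so the positions are lost. If the indices are needed, as in the string-method version earlier, `re.finditer` is one way to get them. The sketch below also shows an edge case where the character-class pattern and `\b` differ; the `edge_case` string is made up for illustration.

```python
import re

text = 'theo thought the overly blithe theatrical theme was otherwise the most boring'

# finditer yields match objects, which know where each match starts and ends.
word_positions = [m.start() for m in re.finditer(r'\bthe\b', text)]
print(word_positions)  # [13, 62]

# \b is zero-width, so it also works at the very start or end of a string,
# where the character-class version has no neighbouring character to consume.
edge_case = 'the end'
print(re.findall(r'[^\w]the[^\w]', edge_case))  # []
print(re.findall(r'\bthe\b', edge_case))        # ['the']
```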

Regex Workflow
---
1. Create pattern in Plain English
2. Map to regex language
3. Make sure results are correct:
    - No false negatives: captures every example of the pattern
    - No false positives: everything captured is an example of the pattern
4. Don't over-engineer your regex.
    - Your goal is to Get Stuff Done, not write the best regex in the world
    - Filtering before and after is okay.
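Step 3 of the workflow can be made concrete with two small lists of examples. The sketch below is one possible way to do it; `check_pattern` and the example strings are hypothetical and chosen to mirror the "the" problem above.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return (missed, wrongly_matched) for a pattern and two lists of examples."""
    regex = re.compile(pattern)
    missed = [s for s in should_match if not regex.search(s)]
    wrongly_matched = [s for s in should_not_match if regex.search(s)]
    return missed, wrongly_matched

# Step 1 (plain English): the word "the" standing on its own.
# Step 2 (regex): r'\bthe\b'
# Step 3: confirm it on examples before running it on real data.
missed, wrongly_matched = check_pattern(
    r'\bthe\b',
    should_match=['the most boring', 'otherwise the most', 'was the'],
    should_not_match=['theo', 'blithe', 'theatrical', 'theme', 'otherwise'],
)
print(missed, wrongly_matched)  # [] []
```

Two empty lists mean the pattern passed both checks. Running the same check with the naive pattern `r'the'` would put all five near misses in `wrongly_matched`.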

Matching Phone Numbers (The "Hello, world!" of Regex)
------

`[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]` matches a US telephone number.

Let's Refactor!
------

Refactored: `\d\d\d-\d\d\d-\d\d\d\d`

### Metacharacter

A metacharacter is a character, or short sequence of characters, that has a special meaning and is NOT treated as a literal in the search expression. For example, `\d` means any digit.

Metacharacters are the special sauce of regex.



Quantifiers
-----

Allow you to specify how many times the preceding expression should match. 

`{n}` is an exact quantifier: the preceding expression must match exactly `n` times.

Refactored: `\d{3}-\d{3}-\d{4}`
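A quick way to convince yourself that the three versions of the phone pattern behave the same is to run them side by side. The sample strings below are made up: one well-formed number and three near misses.

```python
import re

patterns = [
    r'[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]',
    r'\d\d\d-\d\d\d-\d\d\d\d',
    r'\d{3}-\d{3}-\d{4}',
]

# Made-up sample strings, not real phone numbers.
samples = ['555-867-5309', '555-8675-309', '5558675309', '555-867-530']

# fullmatch requires the whole string to fit the pattern.
matches = {s: [bool(re.fullmatch(p, s)) for p in patterns] for s in samples}

for sample, results in matches.items():
    print(sample, results)
```

Only the first sample should come back `True` for all three patterns. One caveat: on Python 3 strings, `\d` also matches non-ASCII digits unless the `re.ASCII` flag is used, so it is slightly broader than `[0-9]`.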

Inexact quantifiers
-----

1. `?` question mark - zero or one 
2. `*` star - zero or more
3. `+` plus sign - one or more

"?" is zero or one
------

`z?a`

chunk (patterns to match): za, a   
junk (near misses to not match): z, zzza

[Test it](http://regexr.com/)

"*" is zero or more
------

`z*a`

chunk: za, a, zzza   
junk: z

[Test it](http://regexr.com/)

"+" is one or more
------

`z+a` 
    
chunk: za, zzza      
junk: z, a

[Test it](http://regexr.com/)
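The chunk and junk lists for `?`, `*` and `+` can also be checked in Python. This sketch assumes each example must match as a whole string, so it uses `re.fullmatch`; the `examples` dictionary simply restates the lists above.

```python
import re

examples = {
    r'z?a': {'chunk': ['za', 'a'], 'junk': ['z', 'zzza']},
    r'z*a': {'chunk': ['za', 'a', 'zzza'], 'junk': ['z']},
    r'z+a': {'chunk': ['za', 'zzza'], 'junk': ['z', 'a']},
}

for pattern, cases in examples.items():
    matched = [s for s in cases['chunk'] if re.fullmatch(pattern, s)]
    rejected = [s for s in cases['junk'] if not re.fullmatch(pattern, s)]
    print(pattern, 'matched:', matched, 'rejected:', rejected)

# With re.search the pattern may match anywhere inside the string,
# so z?a still finds "za" at the end of "zzza".
print(re.search(r'z?a', 'zzza').group())  # za
```

The last line is why the choice between `search` and `fullmatch` matters when judging a near miss.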

Check for understanding
------

Write a RegEx pattern (no literals):

chunk: za, zza, z   
junk: zaa, a

[Test it](http://regexr.com/)

`z+a?`

<center><h2>Takeaways</h2></center>

- Quantifiers allow you to specify how many times the preceding expression should match
- Quantifiers can be exact `{42}` or inexact `?+*`

<br>
<br> 
<br>

----

-----
Yet Another Regex Example: Finding the frequency of words ending in "ing"
-----

In this exercise we will use regular expressions to determine if a word ends with “ing”. If it does, we will add it to a dictionary and increase its count by 1.

The objective is to get familiar with the idea of using word counts.
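The dictionary-based counting described above can be sketched on a small, made-up sentence before moving to the full corpus. The `sample` text and the punctuation set passed to `strip` are assumptions for illustration only.

```python
import re

# Made-up sample text; the notebook itself uses the full Shakespeare corpus.
sample = "The king was singing while nothing was moving. Singing, the king said nothing."

counts = {}
for word in sample.split():
    # Drop surrounding punctuation, then test the ending with a regex.
    word = word.strip('.,;:!?').lower()
    if re.search(r'ing$', word):
        counts[word] = counts.get(word, 0) + 1

print(counts)  # {'king': 2, 'singing': 2, 'nothing': 2, 'moving': 1}
```

`dict.get(word, 0)` supplies the starting count for a word seen for the first time, which is the bookkeeping that `collections.Counter` does for you.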

In [35]:
from quilt.data.spiering.shakespeare import shakespeare

with open(shakespeare._data()) as f:
    shakespeare = f.read()

In [36]:
print("Shakespeare wrote about {:,} words.".format(len(shakespeare.split())))

Shakespeare wrote about 899,594 words.


In [37]:
# How would we find words that end in `ing` with string methods?
ing_words = [w for w in shakespeare.split() if w.endswith('ing')]

ing_words[:10]

['Making',
 'being',
 'all-eating',
 'Proving',
 'nothing',
 'being',
 'having',
 'never-resting',
 'willing',
 'Leaving']

In [38]:
len(ing_words)

9580

--- 
Let's write some regex
----

When I write some regex

![](http://i.imgur.com/8b5kNhQ.gif)

[Source](http://thecodinglove.com/post/85802561535/when-i-write-some-regex)

In [39]:
matches = re.findall(r'\b(\w+ing)\b', shakespeare)  

# \b     word boundary
# \w+    one or more word characters
# (...)  capturing group

In [40]:
len(matches)

13080

<br>
<details><summary>
Why are we getting a different number of matches between string methods and regex?
</summary>
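`str.split()` splits on whitespace only, so a token such as `coming?` keeps its punctuation and fails `endswith('ing')`. The regex word boundary `\b` also matches next to punctuation, so the regex finds more words.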
</details>
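The difference between the two counts is easy to reproduce on a small scale. The sentence below is made up, with punctuation attached to some of the words.

```python
import re

# Made-up sample with punctuation glued to some of the words.
sample = "Nothing, my lord. Is the king coming? A never-resting, loving thing."

by_split = [w for w in sample.split() if w.endswith('ing')]
by_regex = re.findall(r'\b(\w+ing)\b', sample)

print(by_split)  # ['king', 'loving']
print(by_regex)  # ['Nothing', 'king', 'coming', 'resting', 'loving', 'thing']
```

The string-method version misses every word followed by a comma, period, or question mark. The regex also returns only the last part of a hyphenated word (`resting` rather than `never-resting`), because a hyphen is not a word character.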

What is the most frequent `ing` word the Bard used?

In [41]:
from collections import Counter

In [42]:
word_counts = Counter(matches)
word_counts.most_common(1)

[('King', 1312)]

In [44]:
word_counts.most_common(4)

[('King', 1312), ('being', 548), ('nothing', 541), ('king', 388)]

In [47]:
word_counts = Counter(word.lower() for word in matches)

In [48]:
word_counts.most_common(5)

[('king', 1700),
 ('being', 664),
 ('nothing', 635),
 ('bring', 449),
 ('thing', 359)]

Write a regex to find all the words in Shakespeare that end in "ing" AND have a vowel immediately before the "ing" ending.

For example, "king" is a non-match but "being" is a match.

In [30]:
matches = re.findall(r'\b(\w+[ioeuay]ing)\b', shakespeare)  

In [31]:
matches[:10]

['being',
 'being',
 'going',
 'being',
 'being',
 'straying',
 'being',
 'unseeing',
 'being',
 'being']

In [32]:
'king' in matches

False

In [33]:
# Make it a unit test
assert 'king' not in matches

Regex Workflow
---
1. Create pattern in Plain English
2. Map to regex language
3. Make sure results are correct:
    - No false negatives: captures every example of the pattern
    - No false positives: everything captured is an example of the pattern
4. Don't over-engineer your regex.
    - Your goal is to Get Stuff Done, not write the best regex in the world
    - Filtering before and after is okay.

<br>
<br>
---