## A story about Katherine

**Katherine** went to the concert to see '**Catheryn** and the **Cathryns**'. She ran into her friend **Kathryn**, who introduced **Katherine** to her friend **Catherine**. Together, they enjoyed the concert while texting inaudible snippets to their mutual friend, **Kathrin**. Their mercurial friend, **katharine**, felt left out.

#### Find all instances of "Katherine"

In [1]:
story = "Katherine went to the concert to see 'Catheryn and the Cathryns'. She ran into her friend Kathryn, who introduced Katherine to her friend Catherine. Together, they enjoyed the concert while texting inaudible snippets to their mutual friend, Kathrin. Their mercurial friend, katharine, cried in the bathtub"

In [4]:
import re
attempt1 = re.compile(r'[ck]+ath[aey]?r[eiy]ne*\w*', re.I)  # raw string keeps \w intact
match1 = re.findall(attempt1, story)
print(match1)
print(len(match1))

['Katherine', 'Catheryn', 'Cathryns', 'Kathryn', 'Katherine', 'Catherine', 'Kathrin', 'katharine']
8


In [5]:
attempt2 = re.compile(r'[ck]ath\w+', re.I)  # raw string keeps \w intact
match2 = re.findall(attempt2, story)
print(match2)
print(len(match2))

['Katherine', 'Catheryn', 'Cathryns', 'Kathryn', 'Katherine', 'Catherine', 'Kathrin', 'katharine']
8


## Load Python's `re` module

- Regular expression functions are not loaded by default
- Need to load the `re` module at the beginning of your script

In [6]:
import re

## Regular expressions
- A regular expression, or _pattern_, is a construct that either matches – or doesn't match – all of or, more typically, part of a string
- String matching is an all-or-nothing proposition: the pattern either matches or it does not
    - a successful `re.search` returns a match object that we can work with
    - an unsuccessful `re.search` returns `None`, which evaluates as `False`
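
The two outcomes described above can be sketched as follows. The strings and the variable names `hit` and `miss` are illustrative assumptions, not part of the original notebook.

```python
import re

text = "Alicia Keys"

# A successful search returns a match object describing where the pattern hit
hit = re.search("Keys", text)
print(hit.group(0))  # the matched text
print(hit.span())    # (start, end) offsets of the match

# A failed search returns None, which is falsy in a conditional
miss = re.search("Beyonce", text)
print(miss)
```

Because `None` is falsy, the result of `re.search` can be tested directly in an `if` statement.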

## Regular expressions are often embedded within conditionals

    if re.search("pattern", string):
        do something
    else:
        do something different

## Exact matches are trivial

In [9]:
artist = 'Alicia.Keys'
if re.search('Key', artist):
    print("Found a match!")

Found a match!


## Metacharacters
   - regular expressions provide special characters (metacharacters) and character classes to help you search for more complex patterns
   - **character class**: a list of possible characters inside square brackets `[ ]`
   - a character class matches any single character within the class

        - [ACGTacgt]
        - [abcwxyz6789]
        - [a-cw-z6-9]
        - [ \t\n]

---

   - you can negate a character class to look for everything **but** the characters in your character class
        - [^ACGTacgt] (does my DNA sequence have non-nucleotide characters?)
        - [^ \t\n]

In [10]:
# find any uppercase letter
match = re.search("[A-Z]", artist)

# print the match (object)
print(artist)
print(match)

Alicia.Keys
<re.Match object; span=(0, 1), match='A'>


## How to find all the matches in a string

In [12]:
dna = 'ACGTRTAANNNNNNNNNNNNNNNNNNNNNNNNNN'
match = re.findall("[^ACGT]", dna)

# the match is returned in a list
print(match)

# print the first match
print(match[0])

# count the N's, then report them as a fraction of the sequence length
print(match.count('N'))
print(match.count('N')/len(dna))

['R', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N']
R
26
0.7647058823529411


## Character class shortcuts

| Shortcut | Character class | Description                |
| :------: | :-------------- | :------------------------- |
| \d       | [0-9]           | a digit                    |
| \D       | [^0-9]          | a non-digit                |
| \s       | [ \t\n\r\f\v]   | a whitespace character     |
| \S       | [^ \t\n\r\f\v]  | a non-whitespace character |
| \w       | [a-zA-Z0-9_]    | a 'word' character         |
| \W       | [^a-zA-Z0-9_]   | a 'non-word' character     |
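
A short sketch of the shortcuts in action, using a made-up `sample` string (an assumption for illustration). Note that the underscore counts as a word character, and that for Python 3 `str` patterns the shortcuts are Unicode-aware by default, so the bracketed classes above are the ASCII equivalents.

```python
import re

sample = "sample_01\tread 42"  # hypothetical input

digits = re.findall(r"\d", sample)   # each digit separately
words = re.findall(r"\w+", sample)   # runs of word characters, underscore included
spaces = re.findall(r"\s", sample)   # the tab and the space

print(digits)
print(words)
print(spaces)
```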


## Find non-word characters

In [16]:
my_string = 'AcgTaaC&&9\\t----653\\\\\\\\-\\n'
match = re.findall(r"\W", my_string)  # \W alone finds every non-word character

for i in match:
    print(i)

&
&
\
-
-
-
-
\
\
\
\
-
\


## Metacharacters

These characters serve a function when used in regular expressions: 

        \  |  (  )  [  ]  {  }  ^  $  *  +  ?  .

## The dot `.` is a wildcard character that matches any ONE character except a newline

In [30]:
artist = 'Alicia Augello Ke8s Alicia KeRs'
match = re.search("(Alicia )((Ke).s)", artist)

# print the entire match
print(match.group(0))

# print the part of the match we told it to capture
print(match.group(1))
print(match.group(2))
print(match.group(3))

Alicia KeRs
Alicia 
KeRs
Ke
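
Because the dot is a wildcard, matching a literal period requires escaping it with a backslash. The sketch below is a minimal illustration; the variable names are assumptions, and `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

artist = "Alicia.Keys"

# Unescaped, the dot is a wildcard, so it also matches a space
loose = re.search(r"Alicia.Keys", "Alicia Keys")

# Escaped, the dot matches only a literal period
strict_hit = re.search(r"Alicia\.Keys", artist)
strict_miss = re.search(r"Alicia\.Keys", "Alicia Keys")

# re.escape() escapes every metacharacter in a string
escaped = re.escape(artist)

print(loose.group(0))
print(strict_hit.group(0))
print(strict_miss)
print(escaped)
```

The same backslash escape applies to the other metacharacters listed earlier, such as `+`, `?`, and `(`.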


## Adding quantifiers: the "any old junk" construct
- the `.` matches any character except a newline
- the `*` means "match the preceding character, character class, or group 0 or more times", so the preceding item is optional and may repeat

In [31]:
artist = 'Alicia GoBBlee44444444444444444444Goo&&K==Keys is the best'
match = re.search("Alicia(.*)Keys", artist)
if match:
    print("Full match:", match.group(0))
    print("Captured:" + "[" + match.group(1) + "]")
else:
    print("No match")

Full match: Alicia GoBBlee44444444444444444444Goo&&K==Keys
Captured:[ GoBBlee44444444444444444444Goo&&K==]


## A more narrowly defined quantifier: `?`
- `?` means "match the preceding character or character class ZERO or ONE times"

In [36]:
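# note: this cell uses \s* (zero or more whitespace characters) for contrast;
# \s? would accept at most one whitespace character
artist = 'Alicia\t\t    \t\t\tKeys'
match = re.search(r"Alicia\s*Keys", artist)  # matches: \s* absorbs the tabs and spaces
print(match.group(0))

artist = 'Alicia  Keys'
match = re.search(r"Alicia\s*Keys", artist)  # matches: \s* absorbs both spaces
print(match.group(0))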

Alicia		    			Keys
Alicia  Keys
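
For a direct look at `?` itself, here is a minimal sketch with illustrative strings that are not from the original notebook. `\s?` accepts zero or one whitespace character, so two spaces are one too many.

```python
import re

pattern = r"Alicia\s?Keys"

one_space = re.search(pattern, "Alicia Keys")    # one space: matches
no_space = re.search(pattern, "AliciaKeys")      # zero spaces: matches
two_spaces = re.search(pattern, "Alicia  Keys")  # two spaces: no match

print(one_space.group(0))
print(no_space.group(0))
print(two_spaces)
```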


## A more narrowly defined quantifier: `+`
- `+` means "match the preceding character or character class one or more times"
- we can use standard functions to measure the match length

In [38]:
artist = 'Allllllllllllllicia Keys'
match = re.search("A(l)+", artist)
if match:
    print("Full match:", match.group(0))
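    # group(1) holds only the last repetition of (l), so its length is 1;
    # to capture the whole run of l's, put the quantifier inside the group: A(l+)
    print("There are", len(match.group(1)), "l's:", match.group(0))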
else:
    print("No match")

Full match: Allllllllllllll
There are 1 l's: Allllllllllllll
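
The count of 1 above comes from where the `+` sits. A quantifier placed after a group repeats the group, and the group keeps only its last repetition. The sketch below uses a shortened, hypothetical input to compare the two placements.

```python
import re

artist = "Alllicia Keys"  # shortened, hypothetical input with three l's

# (l)+ repeats the group, so group(1) keeps only the final "l"
repeated_group = re.search(r"A(l)+", artist)

# (l+) puts the quantifier inside the group, so group(1) keeps the whole run
whole_run = re.search(r"A(l+)", artist)

print(len(repeated_group.group(1)))
print(len(whole_run.group(1)))
```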


## Use parentheses to make precise, multi-character matches and captures
- surround the pattern you want to quantify in parentheses
- the entire unit in parentheses is evaluated by the quantifier 

### In this example, we want to match and capture an entire word, or string, rather than a single character

In [39]:
artist = 'AliciaAliciaAliciaAliciaAlicaKeys'
match = re.search("(Alicia)+", artist)
if match:
    print("Full match:", match.group(0))
    print("Captured:", match.group(1))
else:
    print("No match")

Full match: AliciaAliciaAliciaAlicia
Captured: Alicia


## A general, defined quantifier: `{}`

In [43]:
artist = 'AliciaAliciaAliciaAliciaAliciaAliciaAliciaAliciaAliciaAlicia Keys'
match = re.search("(Alicia){2,12}", artist)
if match:
    print("Full match:", match.group(0))
    print("Captured:", match.group(1))
else:
    print("No match:", match)

Full match: AliciaAliciaAliciaAliciaAliciaAliciaAliciaAliciaAliciaAlicia
Captured: Alicia
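
The braces accept three forms: `{m}` for exactly m repeats, `{m,}` for m or more, and `{m,n}` for between m and n. A minimal sketch, using a shorter illustrative string than the cell above:

```python
import re

artist = "AliciaAliciaAlicia Keys"  # three repeats

exactly_two = re.search(r"(Alicia){2}", artist)    # exactly 2 repeats
at_least_two = re.search(r"(Alicia){2,}", artist)  # 2 or more, as many as possible
two_to_five = re.search(r"(Alicia){2,5}", artist)  # between 2 and 5
too_many = re.search(r"(Alicia){4}", artist)       # needs 4, only 3 are present

print(exactly_two.group(0))
print(at_least_two.group(0))
print(two_to_five.group(0))
print(too_many)
```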


## `*`, `+`, and `{}` are greedy, meaning that they match as much as they possibly can

`artist = 'Alicia GoBBleeGoo&&K==Keys'`

### For this string, these two behave the same:
   - `re.search("Alicia.*Keys", artist)`
   - `re.search("Alicia.+Keys", artist)`
	
### The regular expression algorithm does the following:
1. Matches **Alicia**
1. `.*` grabs everything through the end of the string, including **Keys** (the "greedy" part)
1. Backtracks one letter at a time (**s**→**y**→**e**→**K**) until the last part of the regular expression (**Keys**) matches
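
The steps above can be observed with a string that contains **Keys** more than once. The input below is a hypothetical example: the greedy `.*` backtracks only as far as the last **Keys**, not the first. The last two lines show the one case where `.*` and `.+` differ, when nothing sits between the two literals.

```python
import re

artist = "Alicia Keys plays keys; Alicia Keys sings"  # hypothetical input

greedy = re.search(r"Alicia(.*)Keys", artist)
print("Full match:", greedy.group(0))
print("Captured:[" + greedy.group(1) + "]")

# .* accepts zero characters between the literals, .+ needs at least one
star = re.search(r"Alicia.*Keys", "AliciaKeys")
plus = re.search(r"Alicia.+Keys", "AliciaKeys")
print(star is not None, plus is not None)
```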

## Greedy vs. Non-greedy Quantifiers

In [None]:
artist = "<BOLD>Holy moly</BOLD>, it's <BOLD>Alicia Keys</BOLD>"

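# try to grab 'Holy moly' from this HTML string
# one possible pattern: the greedy .* runs to the LAST </BOLD>, so it captures too much
match = re.search(r"<BOLD>(.*)</BOLD>", artist)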
if match:
    print("Found a match:" + "[" + match.group(1) + "]")
else:
    print("No match:", match)

## Making the `*`, `+`, and `{}` quantifiers non-greedy with `?`

In [None]:
# add ? to make *, +, and {} NON-GREEDY
artist = "<BOLD>Holy moly</BOLD>, it's <BOLD>Alicia Keys</BOLD>. <BOLD>Wow</BOLD> "
# one possible pattern: .*? stops at the first </BOLD> it can reach
match = re.findall(r"<BOLD>(.*?)</BOLD>", artist)

for i in match:
    print(i)


#if match:
#    print("Found a match:" + "[" + match.group(1) + "]")
#else:
#    print("No match:", match)

### `?` completely alters the matching algorithm
1. Matches \<BOLD\>
1. Moves right, _reluctantly_ matching one character at a time
1. After each character match, tries to let the rest of the pattern (\</BOLD\>) match
1. Stops once all 7 characters in \</BOLD\> have matched

## Understand the behavior of `*`

### Does this match? Why?

In [None]:
artist = 'Beyonce'
match = re.search("", artist)
if match:
    print("Full match:", match.group(0))
    print("Captured:", match.group(1))
else:
    print("No match")
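
One way to explore the question is sketched below. The patterns are hypothetical choices, not the ones intended for the cell above. Because `*` accepts zero repetitions, a starred item that never appears in the string still lets the pattern match, and it captures an empty string.

```python
import re

artist = "Beyonce"

# z* is satisfied by zero z's, so the pattern still matches
match = re.search(r"Beyonce(z*)", artist)
print("Full match:", match.group(0))
print("Captured:", repr(match.group(1)))

# A pattern made only of a starred item matches the empty string at position 0
empty = re.search(r"z*", artist)
print(empty.span())
```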

## This example shows how to use a character class and how to capture parts of a match
1. the entire match is captured in group 0
2. you capture part of a match by putting it in parentheses
3. you can capture multiple parts of the match
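    - captures are numbered 1 to n by the position of each opening parenthesis, counting from left to right
    - with nested parentheses, the outer group gets the lower number and the groups inside it follow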

In [None]:
queenB = "Beyonce    Know&&l12es"
match = re.search("", queenB, re.I)
if match:
    print("capture 0:", match.group(0))
    print("capture 1:", match.group(1))
    print("capture 2:", match.group(2))
else:
    print("No match")
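
A sketch of one pattern that fits this string. The pattern is an assumption for illustration, not the one intended for the cell above. It uses the character class `[\w&]` and nested parentheses, and the group numbers follow the order of the opening parentheses from left to right.

```python
import re

queenB = "Beyonce    Know&&l12es"

# group 1: the outer parentheses, group 2: first name, group 3: last name
nested = re.search(r"((beyonce)\s+(know[\w&]+))", queenB, re.I)

print("capture 0:", nested.group(0))
print("capture 1:", nested.group(1))
print("capture 2:", nested.group(2))
print("capture 3:", nested.group(3))
```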

## Either/or matching if you're not picky

In [None]:
# either, or
artist = 'Alicia'
match = re.search("", artist)
if match:
    print("Found an artist:", match.group(0))
else:
    print("No match:", match)
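
Either/or matching uses the `|` metacharacter. The sketch below is a minimal illustration with made-up artist lists. Parentheses limit the alternation to part of a pattern.

```python
import re

artists = ["Alicia", "Beyonce", "Rihanna"]  # hypothetical inputs

# keep the names that match either alternative
hits = [a for a in artists if re.search(r"Alicia|Beyonce", a)]
print(hits)

# parentheses confine the either/or to the first name
grouped = re.search(r"(Alicia|Beyonce) Keys", "Alicia Keys")
print(grouped.group(1))
```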

## There are lots of modifiers that can be added to your regular expressions
- `re.I` allows for case-insensitive matching

In [None]:
artist = 'Alicia Keys'
match = re.search("", artist, re.I)
if match:
    print("Found a match:", match.group())
else:
    print("No match:", match)
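
A minimal sketch of the effect of `re.I` (an alias of `re.IGNORECASE`), using a lowercase pattern chosen for illustration. The inline form `(?i)` at the start of a pattern has the same effect.

```python
import re

artist = "Alicia Keys"

sensitive = re.search(r"alicia keys", artist)           # case matters: no match
insensitive = re.search(r"alicia keys", artist, re.I)   # flag argument
inline = re.search(r"(?i)alicia keys", artist)          # inline flag

print(sensitive)
print(insensitive.group())
print(inline.group())
```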

## Using variables inside regular expressions

In [None]:
cat1 = 'Peaches'
fact = 'We love ' + cat1
print(fact)

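# flags such as re.I go to re.compile(); the second argument of pat.search() is a start position, not a flag
pat = re.compile(cat1, re.I)
print(pat)
match = pat.search(fact)
if match:
    print(match.group(0))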
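
When a variable may contain metacharacters, wrapping it in `re.escape` keeps them literal. The sketch below uses a second, hypothetical cat name, `cat2`, to show the difference.

```python
import re

cat2 = "Mr. Whiskers"  # hypothetical name containing a metacharacter

# Unescaped, the dot in the name is a wildcard and matches the "s" in "Mrs"
unescaped = re.search(cat2, "We love Mrs Whiskers")

# Escaped, the dot must be a literal period
escaped_miss = re.search(re.escape(cat2), "We love Mrs Whiskers")
escaped_hit = re.search(re.escape(cat2), "We love Mr. Whiskers")

print(unescaped.group(0))
print(escaped_miss)
print(escaped_hit.group(0))
```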

# THE END