<center><img src="images/face_tat.png" width="35%"/></center>

In [34]:
reset -fs

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Explain why Regular Expressions (regex) are used to pattern-match strings.
- Explain when and when not to use regex.
- Compare and contrast regex to string methods.
- List common Python string types.
- Describe the performance of regex in terms of binary classification: True/False and Positive/Negative.

What is regex?
----

Regular expressions (regex) are a pattern-matching language. Regex is a "meta language": symbols that describe patterns in human-language text.

Instead of writing `0 1 2 3 4 5 6 7 8 9`, you can write `[0-9]` or `\d`.
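
A minimal sketch of that shorthand in Python, using the standard `re` module. The sample text is made up for illustration. One caveat: for `str` patterns in Python 3, `\d` also matches non-ASCII Unicode digits, so the two forms only agree on ASCII input like this.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "Order 66 shipped in 2024"

digits_class = re.findall(r"[0-9]", text)  # explicit character range
digits_short = re.findall(r"\d", text)     # shorthand for "a digit"

print(digits_class)
print(digits_short)
```

Both calls return the same list of single-digit strings for this input.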

It is a Domain-Specific Language (DSL): powerful, but limited.

It is like learning any new language (e.g., French or Chinese). You learn it piece by piece with lots of practice.

What other DSLs do you already know?
------

- SQL  
- Markdown
- TensorFlow

What are the uses of RegEx?
---



1. Find / Search
1. Find & Replace
2. Cleaning
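
The three uses above can be sketched with the standard `re` module. The sample text and the phone-like pattern are assumptions made up for this illustration, not part of the session's data.

```python
import re

# Hypothetical sample text with uneven spacing
text = "Call  555-1234 or   555-9876  today"

# 1. Find / search: the first phone-like token (re.search returns None if absent)
first = re.search(r"\d{3}-\d{4}", text)
print(first.group())

# 2. Find & replace: mask every phone-like token
masked = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", text)
print(masked)

# 3. Cleaning: collapse runs of whitespace into a single space
cleaned = re.sub(r"\s+", " ", text).strip()
print(cleaned)
```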

![cartoon](http://imgs.xkcd.com/comics/regular_expressions.png)

Don't forget about Python's `str` methods
-----

`str.<tab>`
    
str.find()

In [35]:
str.find?
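
A quick sketch of a few built-in `str` methods, for cases where no pattern is needed. The sample text is made up for illustration.

```python
# Hypothetical sample text
text = "calendar foo calandar"

found_at = text.find("foo")      # index of the first occurrence
missing = text.find("bar")       # -1 when the substring is absent
fixed = text.replace("calandar", "calendar")
starts = text.startswith("cal")

print(found_at, missing, fixed, starts)
```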

Regex vs. String methods
-----

1. String methods are easier to understand.
1. String methods express the intent more clearly. 

-----

1. Regex handles much broader use cases.
1. Regex can be language independent.
1. Regex can be faster at scale.
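
One way to see the trade-off is to solve the same small task both ways. This is a sketch under assumed data: the text and the list of spellings are made up for illustration.

```python
import re

# Hypothetical text and target spellings
text = "calendar foo calandar cal celender cali celander calaaaandar"
spellings = ["calendar", "calandar", "celender", "celander"]

# String methods: one explicit check per spelling, easy to read
count_str = sum(text.count(s) for s in spellings)

# Regex: a single pattern covers all the spellings at once
count_re = len(re.findall(r"c[ae]l[ae]nd[ae]r", text))

print(count_str, count_re)
```

The string-method version states its intent plainly but grows with every new spelling. The regex stays one line, although it quietly accepts all eight a/e combinations, not only the four that were listed.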

Spec (aka Specification for a piece of software)
----

> Write a regex to match: calendar, calandar, celender, celander  

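`c[a|e]l[a|e]nd[a|e]r` (a common first attempt, but inside `[]` the `|` is a literal pipe, so this also matches strings like `c|lendar`)  
`c[ae]l[ae]nd[ae]r` (the character class alone already means "a or e")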
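
A short check of how the two patterns differ. Inside a character class, `|` is an ordinary character rather than alternation, so the first pattern accepts a pipe wherever a vowel is expected. The test string below is a made-up near miss.

```python
import re

with_pipe = r"c[a|e]l[a|e]nd[a|e]r"
without_pipe = r"c[ae]l[ae]nd[ae]r"

# Hypothetical near miss containing literal pipe characters
odd = "c|l|nd|r"

pipe_matches = bool(re.fullmatch(with_pipe, odd))
plain_matches = bool(re.fullmatch(without_pipe, odd))

print(pipe_matches, plain_matches)
```

Both patterns match the four spellings in the spec; only the version with `|` also produces this false positive.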

Let's explore it with [regexr.com](http://regexr.com/)

Let's Test It
------

<center><img src="http://2.bp.blogspot.com/-iSc4t8I0ejg/TyNOkvH1kQI/AAAAAAABAVk/MGdD2eXhFPA/s1600/testing-darth-vader-300x240.jpg" width="700"/></center>

In [36]:
# Patterns to match
chunk = "calendar calandar celender celander".split()

In [37]:
# Near misses to not match
junk = "foo cal cali calaaaandar".split()

In [38]:
# Interleave them to create mock data
string = " ".join(item 
                  for pair in zip(chunk, junk) 
                  for item in pair)

In [39]:
print(string)

calendar foo calandar cal celender cali celander calaaaandar


In [40]:
import re

def test_regex(pattern, string, chunk, junk):
    """Given a regex pattern, find examples in string. 
    Match chunk. Do not match junk."""
    matches = re.findall(pattern, string)
    # Every item in chunk must be found, and nothing else
    assert sorted(matches) == sorted(chunk)
    # No item in junk may appear among the matches
    assert not set(matches) & set(junk)
    
    return "All tests pass! 😀"

In [41]:
# You match it with literals
literal1 = 'calendar'
literal2 = 'calandar'
literal3 = 'celender'
literal4 = 'celander'

pattern = "|".join([literal1, literal2, literal3, literal4])  
pattern

'calendar|calandar|celender|celander'

In [42]:
print(test_regex(pattern, string, chunk, junk))

All tests pass! 😀


### THERE MUST BE A BETTER WAY

Let's write it with regex language: 

`c[ae]l[ae]nd[ae]r`

<center><img src="https://imgs.xkcd.com/comics/automation.png" width="700"/></center>

In [43]:
# A little bit of meta-programming (strings all the way down!)
variable_vowels = "[ae]"
constant_letters = ["c","l","nd","r"]
pattern2 = variable_vowels.join(constant_letters)
pattern2

'c[ae]l[ae]nd[ae]r'

In [44]:
# Does our test still pass?
print(test_regex(pattern2, string, chunk, junk))

All tests pass! 😀


Types of Python strings
-----

In [45]:
# Vanilla String
"Hello, world!"

'Hello, world!'

In [46]:
# Formatted string (f-string)
greeting = "Hello"
f"{greeting}, world!"

'Hello, world!'

In [47]:
# Escape characters
print("Hello, \nworld!")

Hello, 
world!


In [48]:
# Raw strings treat backslashes as literal characters
print(r"Hello, \nworld!")

Hello, \nworld!


In [49]:
# What it looks like to the Python interpreter
r"Hello, \nworld!"

'Hello, \\nworld!'

https://docs.python.org/3/reference/lexical_analysis.html
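
Raw strings matter for regex because Python processes backslash escapes before the pattern ever reaches the `re` module. A small sketch, with made-up sample text: in a normal string `"\b"` is the backspace character, while in a raw string it stays as backslash plus `b`, which regex reads as a word boundary.

```python
import re

# Hypothetical sample text
text = "cal calendar"

# Normal string: "\b" becomes a backspace character, which is not in the text
plain = re.findall("\bcal\b", text)

# Raw string: the regex engine receives \b and treats it as a word boundary
raw = re.findall(r"\bcal\b", text)

print(plain)
print(raw)
```

Only the raw-string pattern finds the standalone word `cal`.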

What are Type I and Type II Errors in Statistics?
------

<center><img src="images/type_I_error.jpg" width="700"/></center>

Regex: Connection to statistical concepts
----

__False positives__ (Type I): Matching strings that we should __not__ have matched

__False negatives__ (Type II): __Not__ matching strings that we should have matched

Reducing the error rate for a task often involves two antagonistic efforts:

1. Minimizing false positives
2. Minimizing false negatives

In a perfect world, you would be able to minimize both, but in reality you often have to trade one for the other.
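
The trade-off can be made concrete by counting outcomes for a few candidate patterns. This is a sketch, not a standard API: `confusion_counts` is a hypothetical helper, and the two word lists are assumed labels for what should and should not match.

```python
import re

# Assumed labeled data: strings we want to match, and near misses we do not
should_match = ["calendar", "calandar", "celender", "celander"]
should_not_match = ["foo", "cal", "cali", "calaaaandar"]

def confusion_counts(pattern, positives, negatives):
    """Count TP, FP, FN, TN for whole-string matches of pattern (hypothetical helper)."""
    tp = sum(bool(re.fullmatch(pattern, s)) for s in positives)
    fp = sum(bool(re.fullmatch(pattern, s)) for s in negatives)
    return {"TP": tp, "FP": fp, "FN": len(positives) - tp, "TN": len(negatives) - fp}

too_loose = r"c.*"                 # accepts anything starting with "c"
too_strict = r"calendar"           # accepts only one spelling
balanced = r"c[ae]l[ae]nd[ae]r"    # accepts the a/e variants

loose_counts = confusion_counts(too_loose, should_match, should_not_match)
strict_counts = confusion_counts(too_strict, should_match, should_not_match)
balanced_counts = confusion_counts(balanced, should_match, should_not_match)

for name, counts in [("loose", loose_counts), ("strict", strict_counts), ("balanced", balanced_counts)]:
    print(name, counts)
```

Loosening a pattern tends to trade false negatives for false positives, and tightening it does the reverse.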

What are the advantages of regex?
------

1. Concise and powerful pattern matching DSL
2. Supported by many computer languages, including SQL

What are the disadvantages of regex?
------

1. Brittle 
2. Hard to write, and patterns can get complex before they are correct
3. Hard to read

<center><img src="images/regex.png" width="300"/></center>

Useful Tools:
----
- [Regex cheatsheet](http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/)
- [regexr.com](http://regexr.com/) Realtime regex engine
- [pythex.org](https://pythex.org/) Realtime Python regex engine

<center><h2>Takeaways</h2></center>

1. We use regex as a metalanguage to find string patterns in blocks of text
1. `r""` are your IRL friends for Python regex
1. We are just doing binary classification so use the same performance metrics
1. You'll make a lot of mistakes in regex 😩: 
    - False Positive: Matching something you should not have matched.
    - False Negative: Missing something you should have found.

<br>
<br> 
<br>

----

<center><h2>Bonus Material (mostly cartoons)</h2></center>

<center><img src="https://imgs.xkcd.com/comics/backslashes.png" width="700"/></center>

<center><img src="https://imgs.xkcd.com/comics/perl_problems.png" width="700"/></center>

<center><img src="https://imgs.xkcd.com/comics/regex_golf.png" width="700"/></center>

Regex Terms
----


- __target string__:	This term describes the string that we will be searching, that is, the string in which we want to find our match or search pattern.


- __search expression__: The pattern we use to find what we want. Most commonly called the regular expression. 


- __literal__:	A literal is any character we use in a search or matching expression, for example, to find 'ind' in 'windows' the 'ind' is a literal string - each character plays a part in the search, it is literally the string we want to find.

- __metacharacter__: A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example "." means any character.

Metacharacters are the special sauce of regex.

- __escape sequence__:	An escape sequence is a way of indicating that we want to use a metacharacter as a literal. 

In a regular expression an escape sequence involves placing the metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal. 

`'\.'` means match a literal period character (rather than any character).
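
The terms above can be tried out in a few lines. This is a sketch with made-up target strings; `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

# Hypothetical target string
target = "windows.exe"

literal_hits = re.findall(r"ind", target)   # literal: finds exactly "ind"
meta_hits = re.findall(r"w.n", target)      # metacharacter: "." stands for any character

any_char = re.findall(r".", "a.b")          # unescaped "." matches every character
only_dot = re.findall(r"\.", "a.b")         # escaped "\." matches only the period

escaped = re.escape("a.b")                  # builds the escaped pattern automatically

print(literal_hits, meta_hits, any_char, only_dot, escaped)
```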