In [1]:
%matplotlib inline

import re

import numpy as np
import pandas as pd
import seaborn as sns

# Feedback

- Thanks!
- Criticisms that appeared at least twice:
    - Clearer directions on assignments.
    - Workload is high (8 out of 61 responses)
    - Discussion podcasts end early.
    - More office hours.
    - Slow down when going over lecture examples.

### Part 1

# Canonicalization

* Lecture Credits (First half):
    - [Berkeley: DS100](https://docs.google.com/presentation/d/1ECr_XrDJXaLK-eGwWlydLjJu-3xwpzFcV4fGMSfwKwU/edit?usp=sharing)
    - [Princeton: COS226](http://www.cs.princeton.edu/courses/archive/spring17/cos226/lectures/54RegularExpressions.pdf)

## Joining on text

* Will these join? What are the problems?
* How would you clean the dataframes to join them?
<img src="imgs/image_0.png">

## Joining on text: cleaning the join key
* Upper vs lower case
* Words "county" and "parish" don't add information in these datasets
* Take care of common variants of words
* Take care of punctuation

In [142]:
county_and_state = pd.read_csv("data/county_and_state.csv")
county_and_pop = pd.read_csv("data/county_and_pop.csv")

In [143]:
display(county_and_state)
display(county_and_pop)

Unnamed: 0,County,State
0,De Witt County,IL
1,Lac qui Parle County,MN
2,Lewis and Clark County,MT
3,St John the Baptist Parish,LA


Unnamed: 0,County,Population
0,DeWitt,16798
1,Lac Qui Parle,8067
2,Lewis & Clark,55716
3,St. John the Baptist,43044


In [144]:
# Naive join
county_and_state.merge(county_and_pop)

Unnamed: 0,County,State,Population


In [145]:
def clean_county(county):
    return (county
            .lower()
            .strip()
            .replace(' ', '')
            .replace('county', '')
            .replace('parish', '')
            .replace('&', 'and')
            .replace('.', ''))

In [146]:
county_and_pop['County'] = county_and_pop['County'].apply(clean_county)
county_and_pop

Unnamed: 0,County,Population
0,dewitt,16798
1,lacquiparle,8067
2,lewisandclark,55716
3,stjohnthebaptist,43044


In [147]:
county_and_state['County'] = county_and_state['County'].apply(clean_county)
county_and_state

Unnamed: 0,County,State
0,dewitt,IL
1,lacquiparle,MN
2,lewisandclark,MT
3,stjohnthebaptist,LA


In [148]:
county_and_state.merge(county_and_pop)

Unnamed: 0,County,State,Population
0,dewitt,IL,16798
1,lacquiparle,MN,8067
2,lewisandclark,MT,55716
3,stjohnthebaptist,LA,43044


## Canonicalization
* Create a sequence of steps that transforms both columns into a single form.

<img src="imgs/image_1.png">

## Canonicalization

Replace each string with a unique representation.
- Used string methods
- Very brittle procedure; may only work for some fraction of the data.
- Hard to verify correctness.
- If given the choice, *parse* the data using a data model instead!
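
One way to make correctness easier to check is to look at which keys fail to join after canonicalization. The sketch below assumes a hypothetical `canonicalize` helper that mirrors the cleaning steps used earlier, and it adds one made-up placeholder row per table (state `XX`, population `-1`) purely to show a miss.

```python
import pandas as pd

def canonicalize(name):
    # Hypothetical helper that mirrors the cleaning steps used earlier.
    return (name.lower()
                .strip()
                .replace('&', 'and')
                .replace('.', '')
                .replace('county', '')
                .replace('parish', '')
                .replace(' ', ''))

# The last row of each table is a made-up placeholder whose two spellings
# ("Saint" vs "St.") do not canonicalize to the same key.
left = pd.DataFrame({
    'County': ['De Witt County', 'Lewis and Clark County', 'Saint Clair County'],
    'State': ['IL', 'MT', 'XX'],
})
right = pd.DataFrame({
    'County': ['DeWitt', 'Lewis & Clark', 'St. Clair'],
    'Population': [16798, 55716, -1],
})

left['key'] = left['County'].map(canonicalize)
right['key'] = right['County'].map(canonicalize)

# An outer join with indicator=True adds a `_merge` column recording whether
# each key was found in both tables or in only one of them.
merged = left.merge(right, on='key', how='outer', indicator=True)
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['key', '_merge']])
```

Any row in `unmatched` points at a spelling variant the cleaning steps do not yet handle.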

### Question: limitations of string methods 

* Suppose we want to extract the date and time from the following string:
```
170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585
```
* How would you do this?

### Limitations of string methods 

```
170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585
```
Steps:
1. Get string between `[` and `]`
2. Parse date as day/month/year:hour:min:sec without the timezone
    - in `strftime` format: `%d/%b/%Y:%H:%M:%S`
    
What could go wrong with this technique?

In [149]:
s = '''170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''

dt = s[s.find('[') + 1: s.find(']') - 6]
dt

'14/Mar/2018:12:09:20'

In [150]:
dt.split('/')

['14', 'Mar', '2018:12:09:20']

In [151]:
day, month, year_time = dt.split('/')

In [152]:
year, hour, minute, second =  year_time.split(':')

In [153]:
year, month, day, hour, minute, second

('2018', 'Mar', '14', '12', '09', '20')

In [154]:
dt

'14/Mar/2018:12:09:20'

In [155]:
# Without an explicit format, pandas has to guess the layout of the string.
pd.to_datetime(dt)


In [156]:
# better: if date format changes, will throw a descriptive error! (good!)
pd.to_datetime(dt, format='%d/%b/%Y:%H:%M:%S')

Timestamp('2018-03-14 12:09:20')
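
The steps above drop the timezone. As a minimal sketch, assuming the offset should be kept, the standard library's `datetime.strptime` can parse it with `%z`, and it raises `ValueError` when the string does not fit the format. The variable names are illustrative only.

```python
from datetime import datetime, timedelta

raw = '14/Mar/2018:12:09:20 -0800'

# %z parses a UTC offset such as -0800, giving a timezone-aware datetime.
parsed = datetime.strptime(raw, '%d/%b/%Y:%H:%M:%S %z')
print(parsed.isoformat())

# A string that does not fit the format raises ValueError instead of
# being silently mis-parsed.
try:
    datetime.strptime('2018-03-14 12:09:20', '%d/%b/%Y:%H:%M:%S %z')
    failed = False
except ValueError:
    failed = True
print(failed)
```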

### Part 2

# Regular Expressions

## Regular Expressions (`regex`)

* Fast, compact way of matching patterns in text
* Python library: `import re`
* Advantages: powerful; capable of matching very complex patterns.
* Disadvantages: 
    - It's still text processing, so brittle and likely to break.
    - Hard to understand: "write-only" language.

## Regular Expressions: Applications

* Source Code: IDE syntax highlighting, search and replace
* Google Code Search
* Scanning for viruses
* Validating text for data-entry

## You may already know regular expressions
* Google supports wildcard search `*` and union `|`
![google](imgs/regex_google.png)

## `re`

- comes with Python
- `import re` to import regular expression module
- `m = re.search(<pattern>, <string_to_search>)` finds the first match; `m.groups()` returns its capture groups
- `re.findall(<pattern>, <string_to_search>)` returns all matches
- `re.sub(<pattern>, <replacement>, <string_to_search>)` replaces every match

Also in Pandas: `Series.str.replace`, `Series.str.extract`, `Series.str.contains`.
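
A short sketch of the match object and of the pandas counterparts (`Series.str.contains`, `Series.str.extract`, and `Series.str.replace` with `regex=True` for substitution). The sample strings and variable names are made up for illustration.

```python
import re
import pandas as pd

text = 'foo bar baz'

# re.search returns a match object (or None); groups are defined with ( ).
m = re.search(r'(b)(a.)', text)
first = m.group(0)    # the whole match
parts = m.groups()    # one string per capture group

ser = pd.Series(['foo', 'bar', 'baz'])
has_ba = ser.str.contains(r'ba')                    # boolean Series
last_letter = ser.str.extract(r'ba(.)')             # one column per capture group
swapped = ser.str.replace(r'ba', 'BA', regex=True)  # regex substitution

print(first, parts)
print(has_ba.tolist(), swapped.tolist())
```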

In [55]:
re.search('ba', 'foo bar baz wibble wobble wubble')

<re.Match object; span=(4, 6), match='ba'>

In [56]:
re.findall('ba', 'foo bar baz wibble wobble wubble')

['ba', 'ba']

In [61]:
re.findall('w.bble', 'foo bar baz wibble wobble wubble')

['wibble', 'wobble', 'wubble']

In [62]:
re.findall('ba.', 'foo bar baz wibble wobble wubble')

['bar', 'baz']

In [64]:
re.findall('w[io]bble', 'foo bar baz wibble wobble wubble')

['wibble', 'wobble']

In [158]:
re.sub('w[io]bble', 'WOBBLE', 'foo bar baz wibble wobble wubble')

'foo bar baz WOBBLE WOBBLE wubble'

### Regular Expression Syntax

* The four 'building blocks' for all regular expressions:
    - "order" specifies "order of operations"

<img src="imgs/image_2.png">

In [159]:
re.search(pattern='ABA', string='ABAAAAAA')

<re.Match object; span=(0, 3), match='ABA'>

In [160]:
re.search(pattern='ABA', string='AAAAABAAAAAA')

<re.Match object; span=(4, 7), match='ABA'>

In [161]:
# examples of each
re.search(pattern='AB*A', string='ABBBBBBA')

<re.Match object; span=(0, 8), match='ABBBBBBA'>

In [162]:
re.search(pattern='AB*A', string='ABA')

<re.Match object; span=(0, 3), match='ABA'>

In [163]:
re.search(pattern='AB*A', string='AA')

<re.Match object; span=(0, 2), match='AA'>

In [164]:
# no match: re.search returns None, so nothing is displayed
re.search(pattern='AB*A', string='AB')
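
The cells above cover concatenation and `*`. Assuming the other two building blocks in the figure are "or" (`|`) and parentheses, a sketch of all four follows. It uses `re.fullmatch`, which requires the whole string to match and so makes the effect of each operator easier to see.

```python
import re

# Each entry pairs a pattern with a string it should match in full.
checks = {
    'concatenation': (r'AABAAB', 'AABAAB'),
    'or': (r'AA|BAAB', 'BAAB'),
    'closure': (r'AB*A', 'ABBBA'),
    'parenthesis': (r'(AB)*A', 'ABABA'),
}
for name, (pattern, string) in checks.items():
    print(name, bool(re.fullmatch(pattern, string)))

# Parentheses change what * applies to: AB*A repeats only B,
# while (AB)*A repeats the whole group AB.
print(bool(re.fullmatch(r'AB*A', 'ABABA')))
print(bool(re.fullmatch(r'(AB)*A', 'ABABA')))
```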

### Discussion

1. Give a regular expression that matches "moon", "moooon", etc. 
    - Should match any **even** positive number of 'o'.
2. Give a regex that matches muun, muuuun, moon, moooon, etc. 
    - Should match any **even** positive number of 'u' and 'o'.

<img src="imgs/image_3.png">

In [165]:
re.search('', 'moooon')

<re.Match object; span=(0, 0), match=''>

In [166]:
re.search('', 'muun')

<re.Match object; span=(0, 0), match=''>
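
One possible answer, sketched with only the four building blocks. It assumes question 2 means a word whose middle is either all o's or all u's (as in the listed examples), not a mix of both. `re.fullmatch` is used so that the whole word has to fit the pattern.

```python
import re

# Question 1: "oo" followed by zero or more extra "oo" pairs.
moon = r'moo(oo)*n'
# Question 2: either an even run of o's or an even run of u's.
moon_or_muun = r'm(oo(oo)*|uu(uu)*)n'

for word in ['moon', 'moooon', 'mooon', 'mn', 'muun', 'muuuun', 'muoon']:
    print(word,
          bool(re.fullmatch(moon, word)),
          bool(re.fullmatch(moon_or_muun, word)))
```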

### Expanded Regexp Syntax

* The following operations offer convenience for expressive matching.

<img src="imgs/image_4.png">

* Ex. `[A-E]+` is just shorthand for `(A|B|C|D|E)(A|B|C|D|E)*`
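
A quick way to convince yourself of this equivalence is to run both forms over a few sample strings and compare the results. The samples below are arbitrary.

```python
import re

shorthand = r'[A-E]+'
expanded = r'(A|B|C|D|E)(A|B|C|D|E)*'

samples = ['A', 'BEAD', 'ACE', '', 'F', 'ABF', 'abc']
agreement = []
for text in samples:
    short_ok = bool(re.fullmatch(shorthand, text))
    long_ok = bool(re.fullmatch(expanded, text))
    agreement.append(short_ok == long_ok)
    print(repr(text), short_ok, long_ok)
```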


### More Regular Expression Examples


<img src="imgs/image_5.png">

### Discussion Question

* Give a regular expression for any lowercase string that has a repeated vowel (i.e. noon, peel, festoon, looop, etc).
* Give a regular expression for any string that contains both a lowercase letter and a number, in any order. (marina20, 20marina, ma20rina etc).

<img src="imgs/image_6.png">

In [167]:
re.search('.*(aa|oo|ee|uu|ii)+.*', 'looop')

<re.Match object; span=(0, 5), match='looop'>

In [168]:
re.search('.*[a-z][0-9]+.*|.*[0-9]+[a-z].*', 'sdf78sdf')

<re.Match object; span=(0, 8), match='sdf78sdf'>
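
The two patterns above are slightly looser or stricter than the questions ask: the first accepts upper-case text, and the second needs the letter and the digit to be adjacent. A hedged alternative, using `re.fullmatch` and hypothetical variable names:

```python
import re

# Only lower-case letters are allowed around the doubled vowel.
repeated_vowel = r'[a-z]*(aa|ee|ii|oo|uu)[a-z]*'
# The letter and the digit may be separated by any other characters.
letter_and_digit = r'.*[a-z].*[0-9].*|.*[0-9].*[a-z].*'

for word in ['festoon', 'looop', 'Noon', 'moan']:
    print(word, bool(re.fullmatch(repeated_vowel, word)))

for word in ['marina20', '20marina', 'ma20rina', 'a-1', 'marina', '2020']:
    print(word, bool(re.fullmatch(letter_and_digit, word)))
```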

### Even More Regular Expressions

<img src="imgs/image_7.png">

* Anchors (`^` and `$`) will be useful on the HW.
* The non-greedy qualifier is useful, but won't be used in class.

In [169]:
re.search('a$', "Marina")

<re.Match object; span=(5, 6), match='a'>
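
A small sketch of both ideas: `^` and `$` pin a match to the start or end of the string, and `*?` is the non-greedy form of `*`, matching as little as possible. The tagged sample string is made up for illustration.

```python
import re

starts = re.search(r'^Mar', 'Marina')    # matches at the start
not_start = re.search(r'^ina', 'Marina') # None: "ina" is not at the start
ends = re.search(r'ina$', 'Marina')      # matches at the end

tagged = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.*>', tagged)     # .* grabs as much as it can
lazy = re.findall(r'<.*?>', tagged)      # .*? stops at the first '>'

print(greedy)
print(lazy)
```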

### More regexp syntax

For more regexp info, see:
* [DS100 Textbook](https://www.textbook.ds100.org/ch/08/text_regex.html)
* [Python Documentation](https://docs.python.org/3/library/re.html)

<img src="imgs/image_9.png">

## Still more regexp syntax

- Useful: `\b` is "word boundary"

In [170]:
re.findall('\\b\w+\\b', 'this is a test')

['this', 'is', 'a', 'test']

In [176]:
re.findall('\\b\w+\\b', 'this-is-a-test')

['this', 'is', 'a', 'test']

## Aside: "raw" strings

- Python string escape codes: `\n` for newline, `\b` backspace.
- Some of these sequences also have a meaning in regex syntax (`\b` is a word boundary).
- Place an `r` before string literal to make a "raw" string.
    - `\b` is interpreted as "\b", not "backspace".

In [171]:
# \b is an escape character for ASCII backspace
'\b'

'\x08'

In [172]:
r'\b'

'\\b'

In [173]:
re.findall('\b\w+\b', 'this is a test')

[]

In [174]:
# raw strings are helpful!
re.findall(r'\b\w+\b', 'this is a test')

['this', 'is', 'a', 'test']

## Capture Groups
* Use `(` regex `)` to define match groups within the pattern.
    - Useful for extracting several parts of the string at once.
* Notice the `\.` in the patterns below: otherwise, `.` would be a wildcard.

In [177]:
s

'170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

In [178]:
ip_pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
dt_pattern = r'\[(.*)\]'

In [179]:
re.findall(ip_pattern, s)

['170.242.51.168']

In [180]:
re.findall(dt_pattern, s)

['14/Mar/2018:12:09:20 -0800']

In [181]:
re.findall(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})', s)

[('170', '242', '51', '168')]

### Part 3


# Example: Log Parsing

In [182]:
s

'170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

In [183]:
import re
pat = '\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'
re.findall(pat, s)

[('14', 'Mar', '2018', '12', '09', '20')]

In [184]:
import re
pat = r'\[(.+)/(.+)/(.+):(.+):(.+):(.+) .+\]' # raw string
re.findall(pat, s)

[('14', 'Mar', '2018', '12', '09', '20')]
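
With a whole column of log lines, the same pattern can be applied through `Series.str.extract`, which returns one column per capture group. This is a sketch on a one-row Series holding the sample line. The column names are chosen here for readability and are not part of the original notebook.

```python
import pandas as pd

logs = pd.Series([
    '170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585',
])

pat = r'\[(.+)/(.+)/(.+):(.+):(.+):(.+) .+\]'

# One output column per capture group, one row per log line.
parts = logs.str.extract(pat)
parts.columns = ['day', 'month', 'year', 'hour', 'minute', 'second']
print(parts)
```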

## Regular Expressions: the more specific, the better!
* Be as specific in your pattern matching as possible
    - Easier to validate the extracted text
    - Easier for error handling (understanding what went wrong).
    
* A better date extraction pattern uses:
```
'\[([0-9]{2}\/[A-Z]{1}[a-z]{2}\/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2} -[0-9]{4})\]'
```

In [185]:
s

'170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

In [186]:
s = '170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585'

In [187]:
pat1 = r'\[(.+\/.+\/.+:.+:.+:.+ .+)\]'
re.findall(pat1, s)

['14/Mar/2018:12:09:20 -0800']

In [188]:
pat2 = r'\[([0-9]{2}\/[A-Z]{1}[a-z]{2}\/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2} -[0-9]{4})\]'
re.findall(pat2, s)[0]

'14/Mar/2018:12:09:20 -0800'

## Reading the Regular Expression
* Breaking down the expression:
```
'\[([0-9]{2}\/[A-Z]{1}[a-z]{2}\/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2} -[0-9]{4})\]'
```

* `[0-9]{2}` matches exactly 2 consecutive digits.
* `[A-Z]{1}` matches a single upper-case letter (the `{1}` is optional).
* `[a-z]{2}` matches 2 consecutive lower-case letters.
* The literal brackets `[` and `]` must be escaped with `\`. In Python, `/` needs no escape, so the `\/` here is harmless but optional.

In [189]:
# Why is pattern 2 better?
t = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

In [190]:
re.findall(pat, t)[0]

('adr', 'jduy', 'wffsdffs', 'r4s4', '4wsgdfd', 'asdf')

In [191]:
# Easy to check if something mis-parsed: empty list!
re.findall(pat2, t)

[]
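
A specific pattern also makes error handling straightforward. The sketch below wraps the extraction in a hypothetical `extract_timestamp` helper that raises when a line does not fit. As an assumed tweak, it uses `[+-]` so that positive UTC offsets are accepted as well, where the pattern above only accepts a leading `-`.

```python
import re
from datetime import datetime

# Same shape as the specific pattern above, with [+-] for the offset sign.
TIMESTAMP = re.compile(
    r'\[([0-9]{2}/[A-Z][a-z]{2}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2} [+-][0-9]{4})\]'
)

def extract_timestamp(line):
    """Return the timestamp in `line` as a datetime, or raise ValueError."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError(f'no timestamp found in: {line!r}')
    return datetime.strptime(match.group(1), '%d/%b/%Y:%H:%M:%S %z')

good = '170.242.51.168 - - [14/Mar/2018:12:09:20 -0800] "GET /my/home/ HTTP/1.1" 200 2585'
bad = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

print(extract_timestamp(good))
try:
    extract_timestamp(bad)
except ValueError as err:
    print('rejected:', err)
```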

## Limitations of Regular Expressions

Writing regular expressions is like writing a program.
* Need to know the syntax well.
* Can be easier to write than to read.
* Can be difficult to debug.

Regular expressions are terrible at certain types of problems. Examples:
* Anything involving counting (same number of instances of a and b).
* Anything involving complex structure (palindromes).
* Parsing highly complex text structure (e.g., [HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags))
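
To illustrate the counting limitation: the pattern `a*b*` can check that all a's come before all b's, but it cannot require equal numbers of each. A common workaround, sketched here with a hypothetical helper, is to let the regex check the shape and let ordinary Python do the counting.

```python
import re

def same_number_of_a_then_b(text):
    # Hypothetical helper: the regex checks the shape (a's, then b's),
    # and plain Python compares the lengths of the two captured runs.
    match = re.fullmatch(r'(a*)(b*)', text)
    return match is not None and len(match.group(1)) == len(match.group(2))

# The regex alone accepts an unbalanced string.
print(bool(re.fullmatch(r'a*b*', 'aab')))

for text in ['aabb', 'aab', 'abab', '']:
    print(repr(text), same_number_of_a_then_b(text))
```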

### Email address validation using regexp

<img src="imgs/image_8.png">

### Regexp for data validation
* See [postmortem](http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016)
![so_regexp](imgs/so_regexp.png)