# Chapter 4: Regex

In [27]:
%config Completer.use_jedi = False
import numpy as np

In [2]:
phone1 = "123-456-7890"

phone2 = "123 456 7890"

not_phone1 = "101 fastai"

In [3]:
import string
string.digits

'0123456789'

In [4]:
def check_phone(inp):
    valid_chars = string.digits + ' -()'
    for char in inp:
        if char not in valid_chars: return False
    return True

In [5]:
assert check_phone(phone1)
assert check_phone(phone2)
assert not check_phone(not_phone1)

In [6]:
# Attempt 2 without regex
not_phone2 = "1234"

In [7]:
import pytest

with pytest.raises(AssertionError): assert not check_phone(not_phone2)

In [8]:
def check_phone(inp):
    nums = string.digits
    valid_chars = nums + ' -()'
    num_counter = 0
    for char in inp:
        if char not in valid_chars: return False
        if char in nums: num_counter += 1
    return num_counter == 10

In [9]:
assert check_phone(phone1)
assert check_phone(phone2)
assert not check_phone(not_phone1)
assert not check_phone(not_phone2)

### Attempt 3 without regex
We also need to extract the digits. 

In [10]:
not_phone3 = "34 50 98 21 32"

with pytest.raises(AssertionError): assert not check_phone(not_phone3)

In [11]:
not_phone4 = "(34)(50)()()982132"

with pytest.raises(AssertionError): assert not check_phone(not_phone4)

## Introducing Regex
**Best Practice: Be as specific as possible.**

Regex is a Domain Specific Language (DSL): a powerful but limited language. 

Other DSLs: SQL, Markdown, TensorFlow. 

For a US phone number: `\d\d\d-\d\d\d-\d\d\d\d`

A **metacharacter** is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example, `\d` means any digit. **Metacharacters are the special sauce of regex**. 

### Quantifiers: 
A quantifier says how many times the preceding expression should match. It is written with {} curly braces. Refactoring the phone number pattern above: `\d{3}-\d{3}-\d{4}`. 
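As a minimal sketch, the long and the refactored phone patterns can be checked side by side. It assumes Python's built-in `re` module and uses `re.fullmatch` so the whole string must fit the pattern; the names `long_form`, `short_form`, and `candidates` are illustrative only.

```python
import re

# Both patterns describe: 3 digits, dash, 3 digits, dash, 4 digits.
long_form = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
short_form = re.compile(r"\d{3}-\d{3}-\d{4}")

candidates = ["123-456-7890", "123 456 7890", "101 fastai", "1234"]

# fullmatch requires the entire string to fit the pattern.
long_results = [bool(long_form.fullmatch(c)) for c in candidates]
short_results = [bool(short_form.fullmatch(c)) for c in candidates]

print(long_results)   # [True, False, False, False]
print(short_results)  # [True, False, False, False]
```

Note that this pattern only accepts the dashed format, so the space-separated number is rejected. Accepting it would need a broader pattern, for example one that allows either separator.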

### Inexact Quantifiers: 
1. `?` question mark: zero or one repeat. 
2. `*` star: zero or more repeats. 
3. `+` plus sign: one or more repeats. 

The best way to learn is through practice. Otherwise it's like reading lists of rules. 
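In that spirit, here is a small sketch of the three quantifiers `?`, `*`, and `+` in action. The patterns and test strings are made up for illustration, and `re.fullmatch` is used so that each string must match in full.

```python
import re

# ? : the preceding item appears zero or one time.
optional_u = re.compile(r"colou?r")
# * : the preceding item appears zero or more times.
zero_or_more = re.compile(r"ab*c")
# + : the preceding item appears one or more times.
one_or_more = re.compile(r"ab+c")

question_hits = [bool(optional_u.fullmatch(s)) for s in ["color", "colour", "colouur"]]
star_hits = [bool(zero_or_more.fullmatch(s)) for s in ["ac", "abc", "abbbc"]]
plus_hits = [bool(one_or_more.fullmatch(s)) for s in ["ac", "abc", "abbbc"]]

print(question_hits)  # [True, True, False]
print(star_hits)      # [True, True, True]
print(plus_hits)      # [False, True, True]
```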

### Pros and Cons: 
Pros:  
1. Concise and powerful pattern matching DSL
2. Supported by many computer languages, including SQL. 

Cons:
1. Brittle
2. Hard to write; patterns can get complex before they are correct. 
3. Hard to read. 

## Revisiting Tokenization. 
How do we make our own tokenizer? Create our own tokens? 

In [13]:
import re

In [20]:
re_punc = re.compile(r"""(["'().,;:/_?!—\-])""")  # capture punctuation so we can add spaces around it
re_apos = re.compile(r"n ' t ")  # n't, after re_punc has split it apart
re_bpos = re.compile(r" ' s")  # 's, after re_punc has split it apart
re_mult_space = re.compile(r"  *")  # one or more spaces (a space followed by zero or more spaces)

def simple_toks(sent): 
    sent = re_punc.sub(r" \1 ", sent)
    sent = re_apos.sub(r" n't ", sent)
    sent = re_bpos.sub(r" 's ", sent)
    sent = re_mult_space.sub(" ", sent)
    return sent.lower().split()

In [22]:
text = "I don't know who Kara's new friend is -- is it 'Mr. Toad'?"
" ".join(simple_toks(text))

"i do n't know who kara 's new friend is - - is it ' mr . toad ' ?"

In [23]:
text2 = re_punc.sub(r" \1 ", text); text2

"I don ' t know who Kara ' s new friend is  -  -  is it  ' Mr .  Toad '  ? "

In [24]:
text3 = re_apos.sub(r" n't ", text2); text3

"I do n't know who Kara ' s new friend is  -  -  is it  ' Mr .  Toad '  ? "

In [25]:
text4 = re_bpos.sub(r" 's ", text3); text4

"I do n't know who Kara 's  new friend is  -  -  is it  ' Mr .  Toad '  ? "

In [26]:
sentences = ['All this happened, more or less.',
             'The war parts, anyway, are pretty much true.',
             "One guy I knew really was shot for taking a teapot that wasn't his.",
             'Another guy I knew really did threaten to have his personal enemies killed by hired gunmen after the war.',
             'And so on.',
             "I've changed all their names."]

In [31]:
tokens = list(map(simple_toks, sentences))
[np.array(token) for token in tokens]

[array(['all', 'this', 'happened', ',', 'more', 'or', 'less', '.'],
       dtype='<U8'),
 array(['the', 'war', 'parts', ',', 'anyway', ',', 'are', 'pretty', 'much',
        'true', '.'], dtype='<U6'),
 array(['one', 'guy', 'i', 'knew', 'really', 'was', 'shot', 'for',
        'taking', 'a', 'teapot', 'that', 'was', "n't", 'his', '.'],
       dtype='<U6'),
 array(['another', 'guy', 'i', 'knew', 'really', 'did', 'threaten', 'to',
        'have', 'his', 'personal', 'enemies', 'killed', 'by', 'hired',
        'gunmen', 'after', 'the', 'war', '.'], dtype='<U8'),
 array(['and', 'so', 'on', '.'], dtype='<U3'),
 array(['i', "'", 've', 'changed', 'all', 'their', 'names', '.'],
       dtype='<U7')]

We need to convert them to integer ids. We also need to know our vocabulary, and have a way to convert between words and ids. 

In [32]:
import collections

In [34]:
PAD = 0
SOS = 1


def toks2ids(sentences):
    voc_cnt = collections.Counter(t for sent in sentences for t in sent)
    vocab = sorted(voc_cnt, key=voc_cnt.get, reverse=True)
    vocab.insert(PAD, "<PAD>")
    vocab.insert(SOS, "<SOS>")
    w2id = {w:i for i, w in enumerate(vocab)}
    ids = [[w2id[t] for t in sent] for sent in sentences]
    return ids, vocab, w2id, voc_cnt

In [35]:
ids, vocab, w2id, voc_cnt = toks2ids(tokens)
[np.array(id) for id in ids]

[array([ 5, 13, 14,  3, 15, 16, 17,  2]),
 array([ 6,  7, 18,  3, 19,  3, 20, 21, 22, 23,  2]),
 array([24,  8,  4,  9, 10, 11, 25, 26, 27, 28, 29, 30, 11, 31, 12,  2]),
 array([32,  8,  4,  9, 10, 33, 34, 35, 36, 12, 37, 38, 39, 40, 41, 42, 43,
         6,  7,  2]),
 array([44, 45, 46,  2]),
 array([ 4, 47, 48, 49,  5, 50, 51,  2])]

In [36]:
np.array(vocab)

array(['<PAD>', '<SOS>', '.', ',', 'i', 'all', 'the', 'war', 'guy',
       'knew', 'really', 'was', 'his', 'this', 'happened', 'more', 'or',
       'less', 'parts', 'anyway', 'are', 'pretty', 'much', 'true', 'one',
       'shot', 'for', 'taking', 'a', 'teapot', 'that', "n't", 'another',
       'did', 'threaten', 'to', 'have', 'personal', 'enemies', 'killed',
       'by', 'hired', 'gunmen', 'after', 'and', 'so', 'on', "'", 've',
       'changed', 'their', 'names'], dtype='<U8')

What could be a better name for the `vocab` variable above? 

In [37]:
np.array(w2id)

array({'<PAD>': 0, '<SOS>': 1, '.': 2, ',': 3, 'i': 4, 'all': 5, 'the': 6, 'war': 7, 'guy': 8, 'knew': 9, 'really': 10, 'was': 11, 'his': 12, 'this': 13, 'happened': 14, 'more': 15, 'or': 16, 'less': 17, 'parts': 18, 'anyway': 19, 'are': 20, 'pretty': 21, 'much': 22, 'true': 23, 'one': 24, 'shot': 25, 'for': 26, 'taking': 27, 'a': 28, 'teapot': 29, 'that': 30, "n't": 31, 'another': 32, 'did': 33, 'threaten': 34, 'to': 35, 'have': 36, 'personal': 37, 'enemies': 38, 'killed': 39, 'by': 40, 'hired': 41, 'gunmen': 42, 'after': 43, 'and': 44, 'so': 45, 'on': 46, "'": 47, 've': 48, 'changed': 49, 'their': 50, 'names': 51},
      dtype=object)
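Converting between words and ids works in both directions: the dictionary maps words to ids, and the list maps ids back to words. The following is a minimal sketch on a toy corpus. The names `toy_tokens`, `id2w`, `encode`, and `decode` are hypothetical and not part of the notebook above.

```python
import collections

PAD, SOS = 0, 1

# A tiny pre-tokenized corpus, standing in for `tokens` above.
toy_tokens = [["all", "this", "happened", "."], ["and", "so", "on", "."]]

counts = collections.Counter(t for sent in toy_tokens for t in sent)
id2w = sorted(counts, key=counts.get, reverse=True)  # most frequent first
id2w.insert(PAD, "<PAD>")
id2w.insert(SOS, "<SOS>")
w2id = {w: i for i, w in enumerate(id2w)}

def encode(sent):
    """Map a list of tokens to a list of integer ids."""
    return [w2id[t] for t in sent]

def decode(ids):
    """Map a list of integer ids back to tokens."""
    return [id2w[i] for i in ids]

encoded = [encode(s) for s in toy_tokens]
decoded = [decode(s) for s in encoded]

print(encoded)                # [[3, 4, 5, 2], [6, 7, 8, 2]]
print(decoded == toy_tokens)  # True
```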

### What are the uses of regex? 
1. Find / Search. 
2. Find & Replace. 
3. Cleaning. 

#### Don't forget about Python's `str` methods. 
`str.<tab>`  
`str.find()`

In [40]:
str.find?

### Regex vs String method. 
String:  
1. String methods are easier to understand
2. String methods express the intent more clearly. 

--- 

Regex:  
1. Regexes handle much broader use cases. 
2. Regex can be language independent. 
3. Regex can be faster at scale. 
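A rough illustration of the trade-off, on a made-up sentence: string methods are enough when the target is a fixed literal, while a regex helps when the target is a pattern. The variable names here are illustrative.

```python
import re

sentence = "Call 123-456-7890 or 987-654-3210 today."

# A string method is enough when the target is a fixed literal.
literal_index = sentence.find("Call")
with_literal_replaced = sentence.replace("today", "tomorrow")

# A regex is useful when the target is a pattern rather than a fixed string.
phone_numbers = re.findall(r"\d{3}-\d{3}-\d{4}", sentence)
masked = re.sub(r"\d{3}-\d{3}-\d{4}", "<PHONE>", sentence)

print(literal_index)          # 0
print(with_literal_replaced)  # Call 123-456-7890 or 987-654-3210 tomorrow.
print(phone_numbers)          # ['123-456-7890', '987-654-3210']
print(masked)                 # Call <PHONE> or <PHONE> today.
```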

### What about unicode? 

In [41]:
message = "😒🎦 🤢🍕"

re_frown = re.compile(r"😒|🤢")
re_frown.sub(r"😊", message)

'😊🎦 😊🍕'

### Regex Errors: 
**False positives** (Type I): Matching strings that we should **not** have matched.  
**False negatives** (Type II): **Not** matching strings that we should have matched.  

Reducing the error rate for a task often involves two antagonistic efforts: 
1. Minimizing false positives
2. Minimizing false negatives. 

**Important to test for both!**

In reality, you often have to trade one for the other. 
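One way to make that trade-off visible is to count both error types on a small labeled set. This is a sketch under assumptions: the labels in `labeled` are hypothetical, `count_errors` is an illustrative helper, and the two patterns are chosen only to show a strict and a loose extreme.

```python
import re

# Hypothetical labeled examples: (string, should_match).
labeled = [
    ("123-456-7890", True),
    ("123 456 7890", True),
    ("101 fastai", False),
    ("1234", False),
    ("34 50 98 21 32", False),
]

def count_errors(pattern, examples):
    """Return (false_positives, false_negatives) for a compiled pattern."""
    fp = fn = 0
    for text, expected in examples:
        matched = pattern.fullmatch(text) is not None
        if matched and not expected:
            fp += 1
        elif expected and not matched:
            fn += 1
    return fp, fn

strict = re.compile(r"\d{3}-\d{3}-\d{4}")  # misses the space-separated number
loose = re.compile(r"[\d -]+")             # accepts any run of digits, spaces, dashes

strict_errors = count_errors(strict, labeled)
loose_errors = count_errors(loose, labeled)

print(strict_errors)  # (0, 1)
print(loose_errors)   # (2, 0)
```

The strict pattern has no false positives but one false negative; the loose pattern has no false negatives but two false positives.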

Useful tools: 
- [Regex cheatsheet](http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/)
- [regexr.com](http://regexr.com/) Realtime regex engine.
- [pythex.org](https://pythex.org/) Realtime Python regex engine. 

### Summary
1. We use regex as a metalanguage to find string patterns in blocks of text. 
2. Raw strings (`r""`) are your IRL friends for Python regex. 
3. We are just doing binary classification, so use the same performance metrics. 
4. You'll make a lot of mistakes in regex. Think about false positives (FP) and false negatives (FN). 

### Regex Terms
- **target string**: This term describes the string that we will be searching (string in which we want to find our match or search pattern). 
- **search expression**: The pattern we use to find what we want. Most commonly called regular expression (regex). 
- **literal**: Any character we use in a search or matching expression, for example, to find 'ind' in 'windows' the 'ind' is a literal string - each character plays a part in the search, it is literally the string we want to find. 
- **metacharacter**: One or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example "." means any character. 

Metacharacters are the special sauce of regex. 

- **escape sequence**: A way of indicating that we want to use a metacharacter as a literal. 

In regex, an escape sequence involves placing the metacharacter `\` (backslash) in front of the metacharacter we want to use as a literal. `\.` means find a literal period character (not match any character). 
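A short sketch of the difference between an unescaped and an escaped period, using Python's `re` module. The sample string is made up; `re.escape` is the standard-library helper that escapes a literal string for use inside a pattern.

```python
import re

sample = "3.5 3x5 3-5"

# Unescaped "." is a metacharacter: it matches any character except a newline.
any_char = re.findall(r"3.5", sample)
# Escaped "\." matches only a literal period.
literal_dot = re.findall(r"3\.5", sample)
# re.escape builds the escaped form of an arbitrary literal string.
escaped = re.escape("3.5")

print(any_char)     # ['3.5', '3x5', '3-5']
print(literal_dot)  # ['3.5']
print(escaped)      # 3\.5
```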

### Regex workflow
1. Create the pattern in plain English. 
2. Map it to the regex language. 
3. Make sure the results are correct:  
    - No false negatives: every example of the pattern is captured. 
    - No false positives: everything captured is an example of the pattern. 
4. Don't over-engineer the regex. 
    - Your goal is to Get Stuff Done, not to write the best regex in the world. 
    - Filtering before and after is okay. 
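The workflow above can be sketched as a deliberately loose regex followed by a plain Python filter. This is one possible approach, not a definitive phone number parser: `candidate_re` and `extract_phone_digits` are hypothetical names, and the 10-digit rule is borrowed from the earlier `check_phone` attempt.

```python
import re
import string

# Step 1 (plain English): a run of digits, spaces, dashes, or parentheses
# that starts with a digit or "(" and ends with a digit or ")".
# Step 2 (regex): deliberately loose; the post-filter does the strict part.
candidate_re = re.compile(r"[\d(][\d ()-]*[\d)]")

def extract_phone_digits(text):
    """Return 10-digit strings found in text, using a loose regex plus a filter."""
    results = []
    for candidate in candidate_re.findall(text):
        digits = "".join(ch for ch in candidate if ch in string.digits)
        if len(digits) == 10:  # post-filter: keep only 10-digit candidates
            results.append(digits)
    return results

found = extract_phone_digits("Call 123-456-7890 or (123) 456 7890, not 1234.")
print(found)  # ['1234567890', '1234567890']
```

This sketch still accepts oddly grouped digits such as "34 50 98 21 32", so step 3 of the workflow (testing for false positives and false negatives) still applies.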