Regular Expressions
-------------------

Regular expressions (regexes, or REs) are a powerful, flexible, and concise language for matching text, from a few literal characters to complex patterns. The syntax takes some practice, but the learning curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Effort spent learning regular expressions pays off quickly, because tasks well suited to them abound. Indeed, regular expressions are among the most useful computing skills and a critical tool for data scientists.

This document presents basic regular expression syntax and covers common use cases: pattern matching, filtering, data extraction, and string replacement. The examples use grep, a Unix command that finds the lines of a text file that match a given pattern. We start by writing a small Python version of grep to work with.

In [None]:
import re # https://docs.python.org/3/library/re.html

def printMatches(text, regex_expression):
    # Print `text` once per match, with that match highlighted in yellow
    BACKGROUND_YELLOW = '\x1b[43m'  # ANSI escape code: yellow background
    COLOR_RESET = '\x1b[0m'         # ANSI escape code: back to default colors
    regex = re.compile(regex_expression)
    matches = regex.finditer(text)
    for m in matches:
        highlighted  = text[:m.start()] # the string before the regex match
        highlighted += BACKGROUND_YELLOW + text[m.start():m.end()] + COLOR_RESET
        highlighted += text[m.end():] # the string after the regex match
        print(highlighted)
        print('\n')

def grep(regex_expression, file_name):
    # Apply printMatches to every line of the file
    with open(file_name, "r") as f:
        content = f.read()
    for line in content.split("\n"):
        printMatches(line, regex_expression)

For today's class, we'll be taking a look at the ["Capital Project Tracker" from NYC Open Data.](https://data.cityofnewyork.us/Recreation/Capital-Project-Tracker/qiwj-i2jk)

One of the first things you may want to do is search for a literal: text that must appear in the document exactly as written. For instance, if we want to find every mention of the word "Highway":

_Note that we have uploaded our "ParkProjects.txt" file directly into our Colab instance so that we can search it._ 

In [None]:
grep('Highway', './ParkProjects.txt')

That's all well and good, but the goal here is to leverage the power and flexibility of regular expressions. That's where atoms come into play.

The simplest regular expressions are sequences of "atoms", each of which matches one piece of the text.

## The dot (.) atom 

Matches any single character other than \n (newline) 

In [None]:
grep('Avenue . and', './ParkProjects.txt')

In [None]:
grep('Avenue . and Avenue .', './ParkProjects.txt')
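To see the newline rule in isolation, here is a minimal sketch that runs the dot against a made-up string (the sample text is an assumption for illustration, not taken from ParkProjects.txt). It uses `re.findall`, which returns every non-overlapping match as a list.

```python
import re

# Made-up sample text (an assumption for illustration, not from ParkProjects.txt).
sample = "abc a-c a c a\nc"

# 'a', then any one character except a newline, then 'c'.
dot_matches = re.findall(r"a.c", sample)
print(dot_matches)  # ['abc', 'a-c', 'a c']
```

The last candidate, `a\nc`, is skipped because the character between `a` and `c` is a newline.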

## The bracket expression 

Defines a set of characters of which only one needs to be matched 

In [None]:
grep('Avenue [FG]', './ParkProjects.txt')

In [None]:
grep('[Pp]layground', './ParkProjects.txt')

You can also use ranges inside brackets. For instance, suppose we want the word "East" followed by any three digits:

In [None]:
grep('East [0-9][0-9][0-9]', './ParkProjects.txt')

In [None]:
grep('East [0-9][0-9][^0-9]', './ParkProjects.txt')
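The last pattern uses a negated bracket: `[^0-9]` matches any single character that is not a digit, so the regex finds "East" followed by exactly two digits and then something else. A small sketch on made-up street names (assumptions for illustration only) shows the difference between the two patterns:

```python
import re

# Hypothetical street names, for illustration only.
streets = ["East 103rd Street", "East 96th Street", "East 9th Street"]

three_digits = re.compile(r"East [0-9][0-9][0-9]")
two_digits_then_other = re.compile(r"East [0-9][0-9][^0-9]")

# For each street: (matches three digits?, matches two digits then a non-digit?)
results = [
    (s, bool(three_digits.search(s)), bool(two_digits_then_other.search(s)))
    for s in streets
]
for row in results:
    print(row)
```

Only "East 103rd Street" satisfies the first pattern, and only "East 96th Street" satisfies the second.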

## Anchors 

Are atoms used to define the location of a regex within a line:

- the `^` anchor specifies the beginning of the line 
- the `$` anchor specifies the end of a line
- the `\b` anchor specifies a word boundary 

In [None]:
grep('^Summary This project will reconstruct', './ParkProjects.txt')

In [None]:
grep('^Summary This project will construct', './ParkProjects.txt')
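The cells above only exercise `^`. As a rough sketch, here is how all three anchors behave on a few made-up lines (the lines are assumptions, not taken from the project file):

```python
import re

# Hypothetical lines, for illustration only.
lines = ["Park reconstruction", "Central Park", "Parking lot"]

starts_with_park = [l for l in lines if re.search(r"^Park", l)]
ends_with_park = [l for l in lines if re.search(r"Park$", l)]
# \b on both sides means "Park" must stand alone as a word.
whole_word_park = [l for l in lines if re.search(r"\bPark\b", l)]

print(starts_with_park)  # ['Park reconstruction', 'Parking lot']
print(ends_with_park)    # ['Central Park']
print(whole_word_park)   # ['Park reconstruction', 'Central Park']
```

Note that `\bPark\b` rejects "Parking lot": there is no word boundary between "Park" and "ing".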

---

## Metacharacters 

Include `\ ^ $ . | ? * + ( ) [ ] { }`

These metacharacters help us match various non-literal components of a sentence. To 'escape' one (that is, to search for the symbol itself), precede it with a backslash (`\`).

`.` Matches any single character. Within bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", but [a.c] matches only "a", ".", or "c".

`[]` Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z".

`[^ ]` Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c".

`^` Matches the starting position within the string.

`$` Matches the ending position of the string or the position just before a string-ending newline.

`()` Defines a marked subexpression.

`\n` Matches what the nth marked subexpression matched, where n is a digit from 1 to 9.

`*` Matches the preceding element zero or more times.

`{m,n}` Matches the preceding element at least m and not more than n times.

`?` Matches the preceding element zero or one time.

`+` Matches the preceding element one or more times.

`|` The alternation operator matches either the expression before or the expression after the operator
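The escaping rule is easy to get wrong, so here is a minimal sketch on a made-up string (the sample text is an assumption). It also uses `re.escape`, a helper in Python's `re` module that adds the backslashes for us.

```python
import re

# Made-up sample text, for illustration only.
sample = "Cost: $1.5 million (est.) vs $105 million"

# Nothing escaped: '$' is the end-of-string anchor, so nothing can match.
unescaped = re.findall(r"$1.5", sample)
# Only '$' escaped: the dot is still a wildcard, so '$105' matches too.
dollar_only = re.findall(r"\$1.5", sample)
# Both escaped: only the literal text '$1.5' matches.
escaped = re.findall(r"\$1\.5", sample)
# re.escape builds the fully escaped pattern from the literal text.
auto_escaped = re.findall(re.escape("$1.5"), sample)

print(unescaped)     # []
print(dollar_only)   # ['$1.5', '$105']
print(escaped)       # ['$1.5']
print(auto_escaped)  # ['$1.5']
```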

---

To gain insight into what your regular expression is doing at any time, I highly recommend using regexper.com (https://regexper.com/) which will allow you to see exactly what a given search is doing. 

For instance, check out https://regexper.com/#%5EMy%0A to see what we just did with '^My'

Here is a good cheat sheet for all the special characters, too, from Emma Wedekind: https://dev.to/emmawedekind/regex-cheat-sheet-2j2a

Finally, I'd also recommend RegEx101, a handy debugger for regular expressions: https://regex101.com/

---

## Shortcuts: 

`\d` Matches digits 0-9 <br>
`\D` Matches anything but \d <br>
`\w` Matches any alphanumeric character plus underscore <br>
`\W` Matches anything but \w <br>
`\s` Matches any whitespace character (space, tab, newline) <br>
`\S` Matches anything but \s
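A quick sketch of the shortcuts on a made-up string (an assumption for illustration):

```python
import re

# Made-up sample text, for illustration only.
sample = "Pier_42 opens at 9 am"

digits = re.findall(r"\d", sample)       # single digits
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)       # single whitespace characters
non_digits = re.findall(r"\D+", sample)  # runs of anything but digits

print(digits)      # ['4', '2', '9']
print(words)       # ['Pier_42', 'opens', 'at', '9', 'am']
print(spaces)      # [' ', ' ', ' ', ' ']
print(non_digits)  # ['Pier_', ' opens at ', ' am']
```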

---

## Operators:

## `'alternation' operator`

Defines two or more alternatives, any one of which is enough to produce a match:

In [None]:
grep('^Summary This project will (partially|expand)', './ParkProjects.txt')
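To see which alternative was actually taken, we can read it back from the parenthesized group. The sketch below runs the same pattern on three made-up summary lines (assumptions, not rows from the project file):

```python
import re

# Hypothetical summary lines, for illustration only.
lines = [
    "Summary This project will expand the playground",
    "Summary This project will partially rebuild the path",
    "Summary This project will reconstruct the courts",
]

pattern = re.compile(r"^Summary This project will (partially|expand)")

# group(1) holds whichever alternative matched; non-matching lines are skipped.
alternatives = [m.group(1) for m in map(pattern.search, lines) if m]
print(alternatives)  # ['expand', 'partially']
```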

## `Repetition operator `

Specifies that the symbol to be matched may be repeated:

Repetition Shortcuts: 

- `*` = `{0,}`: match the previous atom zero or more times
- `+` = `{1,}`: match the previous atom one or more times
- `?` = `{0,1}`: match the previous atom zero or one time

In [None]:
grep('Summary .{400,}$', './ParkProjects.txt')
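The shortcuts are easier to compare side by side. This sketch checks four made-up spellings against each operator using `re.fullmatch`, which succeeds only when the whole string matches the pattern (the word list is an assumption for illustration):

```python
import re

# Each pattern applies a different repetition operator to the letter 'u'.
patterns = {
    "?": r"colou?r",          # zero or one 'u'
    "*": r"colou*r",          # zero or more 'u'
    "+": r"colou+r",          # one or more 'u'
    "{1,2}": r"colou{1,2}r",  # one or two 'u'
}
words = ["color", "colour", "colouur", "colouuur"]

accepted = {
    op: [w for w in words if re.fullmatch(p, w)]
    for op, p in patterns.items()
}
for op, ws in accepted.items():
    print(op, ws)
```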

## `Group Operator`

With the group operator, when a sequence of characters is enclosed in parentheses, the operator that follows applies to the whole group, not only to the preceding character.

In [None]:
grep(r'name.* (\d){3,} .*', './ParkProjects.txt')
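A small sketch of the difference a group makes, using made-up strings (assumptions for illustration). It also shows a side effect of repeating a group such as `(\d){3,}`: the group remembers only its last repetition.

```python
import re

# Made-up sample text, for illustration only.
sample = "hahaha haa"

# Without a group, + applies only to the 'a' directly before it.
without_group = [m.group(0) for m in re.finditer(r"ha+", sample)]
# With a group, + applies to the whole 'ha' unit.
with_group = [m.group(0) for m in re.finditer(r"(ha)+", sample)]

print(without_group)  # ['ha', 'ha', 'ha', 'haa']
print(with_group)     # ['hahaha', 'ha']

# A repeated group keeps only its final repetition in group(1).
m = re.search(r"(\d){3,}", "lot 1234")
whole, last_digit = m.group(0), m.group(1)
print(whole, last_digit)  # 1234 4
```

If the goal is to capture all the digits, putting the repetition inside the parentheses, as in `(\d{3,})`, is the usual choice.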

## `Back References`

Sometimes it is handy to be able to refer to a match that was made earlier in a regex. To do so, we can use `back references`. 

For instance, if we want to check whether the first word of a line is the same as its last word: 

```python
^([a-zA-Z]{1,}).*\1$
```

1. `^` designates the beginning of a line,
2. `([a-zA-Z]{1,})` is looking for a word (this is our first subexpression),
3. `.*` is looking for any character, 
4. `\1` is referencing back to our first subexpression (a backreference), 
5. `$` designates the end of the line.
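The steps above can be tried out in Python as follows. The test lines are made up, and the second pattern (with `\b` word boundaries added) is our own tightening of the regex, not part of the original pattern: it guards against the backreference matching only the tail of a longer word.

```python
import re

first_equals_last = re.compile(r"^([a-zA-Z]{1,}).*\1$")
# Assumed variant: \b forces the repeated text to be a whole word.
whole_words_only = re.compile(r"^([a-zA-Z]{1,})\b.*\b\1$")

# Hypothetical lines, for illustration only.
lines = ["park by the park", "park by the river", "a banana"]

loose = [bool(first_equals_last.search(l)) for l in lines]
strict = [bool(whole_words_only.search(l)) for l in lines]

print(loose)   # [True, False, True]
print(strict)  # [True, False, False]
```

"a banana" passes the original pattern because the line ends with the letter "a", even though its last word is "banana".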

## Example: 

Find all lines in fields.txt that contain a number of the form [0-9]*x[0-9]x[0-9]*, where both occurrences of x are the same digit:

- `grep -E '([0-9])([0-9])\1' fields.txt`

Find all numbers that have two consecutive identical digits:

- `grep -E '([0-9])\1' fields.txt`
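The same two patterns work unchanged in Python. Since we do not have fields.txt here, this sketch runs them on a few made-up numbers (assumptions for illustration):

```python
import re

# Hypothetical numbers standing in for the contents of fields.txt.
numbers = ["40414", "12345", "10055", "97979"]

same_digit_one_apart = re.compile(r"([0-9])([0-9])\1")  # x, any digit, x again
same_digit_adjacent = re.compile(r"([0-9])\1")          # x immediately repeated

one_apart = [n for n in numbers if same_digit_one_apart.search(n)]
adjacent = [n for n in numbers if same_digit_adjacent.search(n)]

print(one_apart)  # ['40414', '97979']
print(adjacent)   # ['10055']
```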

---

# Exercise 1: What do these regexes match? 

In [None]:
# your code here

# Solution

In [None]:
grep('^Summary .* Street.$', './ParkProjects.txt')

In [None]:
grep(r'East (\d){2,}st', './ParkProjects.txt')

In [None]:
grep(r'Lenox .* (Pl\.|Place)', './ParkProjects.txt')

---

# Exercise 2: Imagine you have a file with telephone numbers in different formats: 

- 679-397-5255
- 2126660921
- 212-998-0902
- 888-888-2222
- 800-555-1211
- 800 555 1212
- 800.555.1213
- (800) 555-1214
- 1-800-555-1215
- 1(800)555-1216
- 800-555-1212-1234
- 800-555-1212x1234
- 800-555-1212 ext. 1234
- work 1-(800) 555.1212 #1234

# Your goal is to standardize everything in the form (xxx)xxx-xxxx.

To make the process interactive, go to http://regex101.com/?#python, copy and paste the numbers above into the text area called "Text String", and then try to write the regular expression. (Remember to put the "g" character in the small text field next to the regex: this has the same meaning as in sed, "find globally", so every match is found and not just the first occurrence.)

In [None]:
TextString = """

679-397-5255
2126660921
212-998-0902
888-888-2222
800-555-1211
800 555 1212
800.555.1213
(800) 555-1214
1-800-555-1215
1(800)555-1216
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

"""

# your code here

# Solution

In [None]:
TextString = """

679-397-5255
2126660921
212-998-0902
888-888-2222
800-555-1211
800 555 1212
800.555.1213
(800) 555-1214
1-800-555-1215
1(800)555-1216
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

"""

# Each part of the phone number is wrapped in parentheses (a group),
# which lets us grab the individual parts of the number

regex = re.compile(r'([2-9]\d{2})\D*(\d{3})\D*(\d{4})')
matches = regex.finditer(TextString)

phones = list()
for m in matches:
    area_code = m.group(1)
    first_three_digits = m.group(2)
    last_four_digits =  m.group(3)
    
    phone = "(" + area_code + ")" + first_three_digits + "-" + last_four_digits
            
    phones.append(phone)

# Note: [2-9] requires the area code to start with 2-9, so invalid area codes (e.g., 124, 125) are not captured
phones

---

# [Python's re Regular Expression Library](https://docs.python.org/3/library/re.html)

We are going to move away from using Panos's grep function now and focus on Python's regular expression library (re). To be clear, the regular expressions remain the same, but how we call them is different. 

A quick note on match groups, too. [_Documentation_](https://docs.python.org/3/library/re.html)

`group([group1, ...]) returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.`

For example: 
```python 
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
```

In [None]:
text = "Hello, my name is Alex Siegman. Please call me back at 212 555-9583 or email me at as13815@nyu.edu at your \
earliest convenience. Thank you."

In [None]:
import re
from re import search 

In [None]:
regex = re.compile(r'[Aa]lex')
matches = regex.finditer(text)

for match in matches: 
    print(match.group())

Let's try to match an email address:

In [None]:
regex = re.compile(r'\w+@\w+\..{3}')

matches = regex.finditer(text)

for match in matches: 
    print(match.group())                 

How about for a phone number? 

In [None]:
regex = re.compile(r'\d{3} \d{3}-\d{4}')

matches = regex.finditer(text)

for match in matches: 
    print(match.group()) 

And what if you have multiple matches in the same string? 

In [None]:
text = "Hello, my name is Alex Siegman. Please call me back at 212 555-9583 or 314 935-9981 or email me at as13815@nyu.edu at your \
earliest convenience. Thank you."

In [None]:
# we can use 'finditer', which returns an iterator of match objects

regex = re.compile(r'\d{3} \d{3}-\d{4}')

matches = regex.finditer(text)
for i, match in enumerate(matches):
    print(i+1, "==>", match.group())
    
# FYI: 'search' returns only the first match object
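To make the contrast concrete, here is a small sketch on a made-up message (an assumption for illustration). One detail worth remembering: `search` returns `None` when nothing matches, so check the result before calling `.group()`.

```python
import re

# Made-up message, for illustration only.
message = "Call 212 555-9583 or 314 935-9981."
phone = re.compile(r"\d{3} \d{3}-\d{4}")

# search stops at the first match.
first = phone.search(message)
first_number = first.group() if first else None

# finditer walks through every match.
all_numbers = [m.group() for m in phone.finditer(message)]

# With no match, search gives back None rather than a match object.
no_match = phone.search("no phone here")

print(first_number)  # 212 555-9583
print(all_numbers)   # ['212 555-9583', '314 935-9981']
print(no_match)      # None
```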

---

## Data Extraction 

It's awesome that we can return our matches here in our notebook, but what we really want to do is select the strings that match our regex and return them to a program to be processed. For example: 

In [None]:
regex = re.compile(r"""(\d{3}) # the area code
                       \D* # zero or more non-digits
                       (\d{3}) # three digits
                       \D* # zero or more non-digits
                       (\d{4}) # four digits
                    """, re.VERBOSE)

matches = regex.finditer(text)
for match in matches:
    print(match.group())
    print("Formatted:", match.group(1),"-", match.group(2), "-", match.group(3))
    print("===========")

What about our large 'file' of ill-formatted phone numbers from earlier? 

In [None]:
TextString = """

679-397-5255
2126660921
212-998-0902
888-888-2222
800-555-1211
800 555 1212
800.555.1213
(800) 555-1214
1-800-555-1215
1(800)555-1216
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

"""

In [None]:
regex = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})')
matches = regex.finditer(TextString)

phones = list()

for m in matches:
    area_code = m.group(1)
    first_three_digits = m.group(2)
    last_four_digits =  m.group(3)
    
    phone = "(" + area_code + ")" + first_three_digits + "-" + last_four_digits
            
    phones.append(phone)

phones

## String Replacement

String replacement (`.sub()`) returns a version of our text in which every match has been replaced with a substitute string. For instance, if we want to mask phone numbers in a document: 

In [None]:
regex = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})')

# `text` is the sample message defined earlier; every phone number in it is masked
newstring = regex.sub("XXX-XXX-XXXX", text)

print(newstring)
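The replacement string can also refer back to the groups captured by the pattern (`\1`, `\2`, `\3`), which gives a one-step way to reformat phone numbers instead of masking them. A minimal sketch on a made-up message (an assumption for illustration):

```python
import re

phone = re.compile(r"(\d{3})\D*(\d{3})\D*(\d{4})")

# Made-up message, for illustration only.
message = "Call 212 555-9583 or 314.935.9981."

# Fixed replacement: every number is masked.
masked = phone.sub("XXX-XXX-XXXX", message)
# Replacement with backreferences: every number is rewritten as (xxx)xxx-xxxx.
reformatted = phone.sub(r"(\1)\2-\3", message)

print(masked)       # Call XXX-XXX-XXXX or XXX-XXX-XXXX.
print(reformatted)  # Call (212)555-9583 or (314)935-9981.
```

The replacement is written as a raw string so that `\1` reaches the regex engine as a backreference rather than being read as a Python escape sequence.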

---

## [`Requests`](https://requests.readthedocs.io/en/master/)

In [3]:
import requests
url = 'http://www.stern.nyu.edu/faculty/search_name_form'

html = requests.get(url).text

In [4]:
html

'<!DOCTYPE html>\n<html lang="en">\n<head profile="https://www.w3.org/1999/xhtml/vocab">\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n  <meta charset="utf-8" /><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={licenseKey:"5b89eb49f4",applicationID:"65778391"};window.NREUM||(NREUM={}),__nr_require=function(n,e,t){function r(t){if(!e[t]){var i=e[t]={exports:{}};n[t][0].call(i.exports,function(e){var i=n[t][1][e];return r(i||e)},i,i.exports)}return e[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<t.length;i++)r(t[i]);return r}({1:[function(n,e,t){function r(){}function i(n,e,t){return function(){return o(n,[u.now()].concat(f(arguments)),e?null:this,t),e?void 0:this}}var o=n("handle"),a=n(4),f=n(5),c=n("ee").get("tracer"),u=n("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],d="ap