# Dealing with Data Spring 2022 – Class 6

------

Regular Expressions
-------------------

Regular expressions (aka 'regexes') constitute an extremely powerful, flexible and concise language for matching elements in text ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this learning curve is not particularly steep, and a newcomer can find themselves performing useful tasks with regular expressions almost immediately. Efforts spent learning regular expressions quickly pay off--tasks that are well suited for regular expressions abound. Indeed, regular expressions are one of the most useful computer skills, and an absolutely critical tool for data scientists.

This document presents basic regular expression syntax and covers common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. We will present examples using grep, a Unix command that finds the lines of a text file containing a given string. To work inside the notebook, we create a Python version of grep below.

_Please note that while you should be familiar with the concept of UDFs (user-defined functions) at this point, you are not expected to understand everything happening within this function (yet)._

In [None]:
import re # https://docs.python.org/3/library/re.html

def printMatches(text, regex_expression):
    BACKGROUND_YELLOW = '\x1b[43m'
    COLOR_RESET  = "\x1b[0m"
    regex= re.compile(regex_expression)
    matches = regex.finditer(text)
    for m in matches:
        highlighted  = text[:m.start()] # the string before the regex match
        highlighted += BACKGROUND_YELLOW + text[m.start():m.end()] + COLOR_RESET 
        highlighted += text[m.end():] # the string after the regex match
        print(highlighted)
        print('\n')

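def grep(regex_expression, file_name):
    # Read the whole file; the with-block closes it automatically
    with open(file_name, "r") as f:
        content = f.read()
    # Check every line and print each highlighted match
    for line in content.split("\n"):
        printMatches(line, regex_expression)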



For today's class, we'll be taking a look at a [transcript from Tesla's Q4 2019 Earnings Call, published by Motley Fool and copied to a .txt file](https://www.fool.com/earnings/call-transcripts/2020/01/30/tesla-inc-tsla-q4-2019-earnings-call-transcript.aspx).

One of the first things you may want to do is search for a literal – simply match the exact text in the document in question. 

_Note that we have uploaded our "TSLA2019_EarningsTranscript.txt" file directly into our Colab instance so that we can search it._ 

---

You can see in the syntax below that we pass through two inputs to our grep function: the regular expression (in this case, a literal string match) and the file we want to search.

Please note that it is not regular expressions themselves that are highlighting the match in yellow – that is what our grep function is doing. The regular expression is, however, identifying the match (more on that later).
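Since the transcript file is not reproduced here, the sketch below runs a literal search over a few made-up lines (they are assumptions for illustration, not quotes from the call). It uses `re` directly and only keeps the lines that contain the literal text.

```python
import re

# Hypothetical sample lines standing in for the transcript file
sample_lines = [
    "Thank you for joining the call.",
    "Model 3 production increased this quarter.",
    "We expect Model Y deliveries to begin soon.",
]

# A literal regex matches exactly the characters written in it
pattern = re.compile("Model")

# Keep only the lines in which the pattern occurs somewhere
matching_lines = [line for line in sample_lines if pattern.search(line)]
for line in matching_lines:
    print(line)
```

With the transcript uploaded, the equivalent call to our own function would be `grep("Model", "TSLA2019_EarningsTranscript.txt")`, which also highlights each match.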

---

# ⭕ **QUESTIONS?**

---

## The dot (.) atom 

`Matches any single character other than \n (newline)`
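A minimal sketch of the dot in action, using a made-up word list: `c.t` needs exactly one character between the "c" and the "t".

```python
import re

# Made-up words for illustration
words = ["cat", "cot", "cut", "ct", "coat"]

# fullmatch requires the whole word to fit the pattern
pattern = re.compile(r"c.t")
dot_matches = [w for w in words if pattern.fullmatch(w)]
print(dot_matches)  # ['cat', 'cot', 'cut']

# By default the dot does not match a newline character
newline_match = re.search(r"a.b", "a\nb")
print(newline_match)  # None
```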

## The bracket [ ] expression 

`Defines a set of characters of which only one needs to be matched`
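As a small illustration on a made-up sentence, `gr[ae]y` accepts either spelling of the word because the bracket stands for exactly one character from the set.

```python
import re

# Made-up sentence for illustration
sentence = "The gray cat and the grey dog ignored the groy typo."

# [ae] matches one character: either "a" or "e"
bracket_matches = re.findall(r"gr[ae]y", sentence)
print(bracket_matches)  # ['gray', 'grey']
```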

You can also incorporate ranges into your brackets. For instance, suppose we want a two-digit number followed by the word "percent".

^ It looks like there are no three-digit percentages mentioned in the file.
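The two searches described above might look like the following sketch. The sentence is a made-up stand-in for the transcript, so the numbers are assumptions and not figures from the call.

```python
import re

# Hypothetical sentence; not taken from the transcript
sample = "Margins rose 12 percent, costs fell 5 percent, and volume grew 48 percent."

# [0-9] is a range: any single digit from 0 to 9
two_digit = re.findall(r"[0-9][0-9] percent", sample)
three_digit = re.findall(r"[0-9][0-9][0-9] percent", sample)

print(two_digit)    # ['12 percent', '48 percent']
print(three_digit)  # []
```

Note that "5 percent" is skipped by the first pattern because it has only one digit.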

---

# ⭕ **QUESTIONS?**

---

## Metacharacters 

Metacharacters include `\ ^ $ . | ? * + ( ) [ ] { }`

These metacharacters help us match various non-literal components of a sentence. To 'escape' one of them (that is, to search for the symbol itself), you need to precede it with a backslash (\).

`.` Matches any single character other than a newline. Within bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", but [a.c] matches only "a", ".", or "c".

`[]` Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z".

`[^ ]` Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c".

`^` Matches the starting position within the string.

`$` Matches the ending position of the string or the position just before a string-ending newline.

`()` Defines a marked subexpression.

`*` Matches the preceding element zero or more times.

`{m,n}` Matches the preceding element at least m and not more than n times.

`?` Matches the preceding element zero or one time.

`+` Matches the preceding element one or more times.

`|` The alternation operator matches either the expression before or the expression after the operator
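To make the escaping rule concrete, here is a small sketch on a made-up sentence. It compares an unescaped dot, a backslash-escaped dot, and a dot inside brackets; `re.escape` is a standard-library helper that adds the backslashes for us.

```python
import re

# Made-up sentence for illustration
sentence = "Version 3.5 shipped; version 315 did not."

# Unescaped: the dot matches any character, so "315" also matches
unescaped = re.findall(r"3.5", sentence)
# Escaped: the backslash makes the dot literal
escaped = re.findall(r"3\.5", sentence)
# Inside brackets the dot is already literal
bracketed = re.findall(r"3[.]5", sentence)

print(unescaped)  # ['3.5', '315']
print(escaped)    # ['3.5']
print(bracketed)  # ['3.5']

# re.escape builds the escaped pattern from a plain string
auto_escaped = re.escape("3.5")
print(auto_escaped)  # 3\.5
```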

---

`[^ ]` Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c".
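A quick sketch of a negated set on a made-up phrase: every character that is neither a lowercase vowel nor a space.

```python
import re

# [^aeiou ] matches one character that is NOT a vowel and NOT a space
consonants = re.findall(r"[^aeiou ]", "regex rules")
print(consonants)  # ['r', 'g', 'x', 'r', 'l', 's']
```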


`^` Matches the starting position within the string.

`$` Matches the ending position of the string or the position just before a string-ending newline.
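The sketch below shows both anchors on three made-up lines (assumptions for illustration, not transcript text). `^My` only matches when "My" opens the line, and `margins$` only when "margins" closes it.

```python
import re

# Hypothetical lines; not taken from the transcript
lines = [
    "My first question is on margins",
    "Thanks. My question is on margins too",
    "Could you comment on margins",
]

# ^ anchors the match to the start of the string
starts_with_my = [l for l in lines if re.search(r"^My", l)]
# $ anchors the match to the end of the string
ends_with_margins = [l for l in lines if re.search(r"margins$", l)]

print(starts_with_my)
print(ends_with_margins)
```

The second line contains both "My" and "margins", but neither sits at the anchored position, so it appears in neither result.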

`{m,n}` Matches the preceding element at least m and not more than n times.
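As an illustration on made-up numbers, `[0-9]{3,4}` picks out runs of three or four digits.

```python
import re

# Runs of 3 to 4 digits; shorter runs are skipped
runs = re.findall(r"[0-9]{3,4}", "7 42 365 2019 123456")
print(runs)  # ['365', '2019', '1234']
```

The last number shows a subtlety: the pattern takes the first four digits of "123456", and the leftover "56" is too short to match again.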

`*` Matches the preceding element zero or more times.


`?` Matches the preceding element zero or one time.

`+` Matches the preceding element one or more times.
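A side-by-side sketch of the three quantifiers, applied to the letter "a" in a few made-up strings:

```python
import re

# Made-up candidates with zero, one, and two "a" characters
candidates = ["ct", "cat", "caat"]

# * allows zero or more "a"
star = [c for c in candidates if re.fullmatch(r"ca*t", c)]
# ? allows zero or one "a"
question = [c for c in candidates if re.fullmatch(r"ca?t", c)]
# + requires one or more "a"
plus = [c for c in candidates if re.fullmatch(r"ca+t", c)]

print(star)      # ['ct', 'cat', 'caat']
print(question)  # ['ct', 'cat']
print(plus)      # ['cat', 'caat']
```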

`|` The alternation operator matches either the expression before or the expression after the operator
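A minimal alternation sketch on a made-up sentence (an assumption, not a transcript quote): either of the two alternatives counts as a match.

```python
import re

# Hypothetical sentence for illustration
sentence = "Model 3 ramped up while Model Y and Model S were discussed."

# Match either "Model 3" or "Model Y", but not "Model S"
either = re.findall(r"Model 3|Model Y", sentence)
print(either)  # ['Model 3', 'Model Y']
```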

---

# ⭕ **QUESTIONS?**

---

# Exercise 1: 

> How would you search for the term "[Indecipherable]" in the transcript, including the square brackets? 

In [None]:
# your code here

---

# Exercise 2: 

> How would you find any matches for the phrase, "produced in [year]", not including the square brackets in this case? 

---

# ⭕ **QUESTIONS?**

---

To gain insight into what your regular expression is doing at any time, I highly recommend using regexper.com (https://regexper.com/) which will allow you to see exactly what a given search is doing. 

For instance, check out https://regexper.com/#%5EMy%0A to see what we just did with '^My'

Here is a good cheat sheet for all the special characters, too, from Emma Wedekind: https://dev.to/emmawedekind/regex-cheat-sheet-2j2a

Finally, I'd also recommend RegEx101, a handy debugger for regular expressions: https://regex101.com/

---

## Shortcuts: 

`\d` Matches digits 0-9 <br>
`\D` Matches anything but \d <br>
`\w` Matches any alphanumeric character plus underscore <br>
`\W` Matches anything but \w <br>
`\s` Matches any whitespace character (space, tab, newline) <br>
`\S` Matches anything but \s
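The sketch below tries each shortcut on short made-up strings, so the effect of each one is easy to see in isolation.

```python
import re

# \d+ : runs of digits
digits = re.findall(r"\d+", "Order 66 shipped on 2020-01-30")
print(digits)  # ['66', '2020', '01', '30']

# \D+ : runs of non-digits
non_digits = re.findall(r"\D+", "212-555-9583")
print(non_digits)  # ['-', '-']

# \w+ : runs of letters, digits, and underscores
words = re.findall(r"\w+", "as13815@nyu.edu")
print(words)  # ['as13815', 'nyu', 'edu']

# \S+ : runs of non-whitespace, so spaces, tabs, and newlines all separate
chunks = re.findall(r"\S+", "a  b\tc\nd")
print(chunks)  # ['a', 'b', 'c', 'd']
```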

---

# [Python's re Regular Expression Library](https://docs.python.org/3/library/re.html)

We are going to move away from Panos's grep function now and focus on Python's regular expression library (re). To be clear, the regular expressions themselves remain the same; what changes is how we call them.

A quick note on match groups, too. [_Documentation_](https://docs.python.org/2/library/re.html)

`group([group1, ...]) returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.`

For example: 
```python 
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
```

# Exercise 3: Imagine you have a file with telephone numbers in different formats: 

- 679-397-5255
- 2126660921
- 212-998-0902
- 888-888-2222
- 800-555-1211
- 800 555 1212
- 800.555.1213
- (800) 555-1214
- 1-800-555-1215
- 1(800)555-1216
- 800-555-1212-1234
- 800-555-1212x1234
- 800-555-1212 ext. 1234
- work 1-(800) 555.1212 #1234

# Your goal is to standardize everything in the form (xxx)-xxx-xxxx.

To make the process interactive, go to http://regex101.com/?#python, copy and paste the numbers above into the text area called "Text String", and then try to write the regular expression above it. (Remember to put the "g" character in the small text field next to the regex: this has the same meaning as in sed, and it means "find globally" the regex, not just the first occurrence.)

In [None]:
TextString = """

679-397-5255
2126660921
212-998-0902
888-888-2222
800-555-1211
800 555 1212
800.555.1213
(800) 555-1214
1-800-555-1215
1(800)555-1216
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

"""

# your code here

---

In [None]:
text = "Hello, my name is Alex Siegman. Please call me back at 212 555-9583 or email me at as13815@nyu.edu at your \
earliest convenience. Thank you."

Let's try and match for an email address:
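One possible sketch uses `re.search`, which returns the first match or `None`. The pattern here is a deliberately simple assumption (word characters and dots on both sides of an `@`), not a complete email validator.

```python
import re

text = ("Hello, my name is Alex Siegman. Please call me back at 212 555-9583 "
        "or email me at as13815@nyu.edu at your earliest convenience. Thank you.")

# One or more word characters or dots, an @, then the same again
email_match = re.search(r"[\w.]+@[\w.]+", text)
email = email_match.group(0) if email_match else None
print(email)  # as13815@nyu.edu
```

A caveat of this simple pattern: if the address ended a sentence, the trailing period would be swept into the match.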

How about for a phone number? 
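A sketch for the phone number in the same text, assuming the exact "xxx xxx-xxxx" layout seen there (other layouts would need a more flexible pattern):

```python
import re

text = ("Hello, my name is Alex Siegman. Please call me back at 212 555-9583 "
        "or email me at as13815@nyu.edu at your earliest convenience. Thank you.")

# Three digits, a space, three digits, a hyphen, four digits
phone_match = re.search(r"\d{3} \d{3}-\d{4}", text)
phone = phone_match.group(0) if phone_match else None
print(phone)  # 212 555-9583
```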

And what if you have multiple matches in the same string? 
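`re.search` stops at the first match, so for several matches we can reach for `re.findall` (a list of strings) or `re.finditer` (match objects with positions). The sentence below is made up, and its second phone number is a hypothetical addition for illustration.

```python
import re

# Made-up sentence with two phone numbers
message = "Call 212 555-9583 during the day or 917 555-0134 after six."

phone_pattern = r"\d{3} \d{3}-\d{4}"

# findall returns every non-overlapping match as a string
all_phones = re.findall(phone_pattern, message)
print(all_phones)  # ['212 555-9583', '917 555-0134']

# finditer yields match objects, which also know where each match sits
spans = [(m.start(), m.end()) for m in re.finditer(phone_pattern, message)]
print(spans)
```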

---

# ⭕ **QUESTIONS?**

---

## Data Extraction 

It's awesome that we can return our matches here in our notebook, but what we really want to do is select the strings that match our regex and return them to a program to be processed. For example: 
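A minimal sketch of extraction with match groups: the parentheses mark the pieces we want back, and `groups()` hands them to the program as a tuple. The dictionary keys are names chosen here for illustration.

```python
import re

text = ("Hello, my name is Alex Siegman. Please call me back at 212 555-9583 "
        "or email me at as13815@nyu.edu at your earliest convenience. Thank you.")

# Each pair of parentheses captures one part of the phone number
phone_match = re.search(r"(\d{3}) (\d{3})-(\d{4})", text)
area_code, exchange, line_number = phone_match.groups()

# Capture the user name and the domain of the email separately
email_match = re.search(r"([\w.]+)@([\w.]+)", text)
user, domain = email_match.groups()

# Collect the extracted pieces in a structure a program can process
contact = {
    "area_code": area_code,
    "exchange": exchange,
    "line_number": line_number,
    "email_user": user,
    "email_domain": domain,
}
print(contact)
```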

What about our large 'file' of ill-formatted phone numbers from earlier? 

In [None]:
TextString = """

679-397-5255
2126660921
212-998-0902
888-888-2222
800-555-1211
800 555 1212
800.555.1213
(800) 555-1214
1-800-555-1215
1(800)555-1216
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

"""

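One way to pull the ten main digits out of every line is sketched below. The pattern is an assumption that happens to fit these fourteen formats (optional parentheses around the area code and an optional hyphen, dot, or space between parts); it ignores the leading "1" and any extension.

```python
import re

TextString = """

679-397-5255
2126660921
212-998-0902
888-888-2222
800-555-1211
800 555 1212
800.555.1213
(800) 555-1214
1-800-555-1215
1(800)555-1216
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

"""

# \(? and \)? allow optional parentheses; [-. ]? allows an optional separator
phone_pattern = re.compile(r"\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})")

# With three groups, findall returns one (area, exchange, line) tuple per match
parts = phone_pattern.findall(TextString)

# Join each tuple into a plain ten-digit string for further processing
numbers = ["".join(p) for p in parts]
for n in numbers:
    print(n)
```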
## String Replacement

String Replacement (.sub()) allows us to return a version of our text where all instances that matched have been substituted with a replacement. For instance, if we want to mask phone numbers in a document: 
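A small masking sketch with `re.sub`, again assuming the "xxx xxx-xxxx" layout used in our sample text. The second call shows a backreference (`\1`), which puts a captured group back into the replacement.

```python
import re

text = ("Hello, my name is Alex Siegman. Please call me back at 212 555-9583 "
        "or email me at as13815@nyu.edu at your earliest convenience. Thank you.")

# Replace every matching phone number with a fixed mask
masked = re.sub(r"\d{3} \d{3}-\d{4}", "XXX XXX-XXXX", text)
print(masked)

# Keep the last four digits by capturing them and reusing them as \1
partially_masked = re.sub(r"\d{3} \d{3}-(\d{4})", r"XXX XXX-\1", text)
print(partially_masked)
```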

---

## [`Requests`](https://requests.readthedocs.io/en/master/)

Last but not least, a quick note on the requests library that will allow us to leverage our regular expressions to run over HTML from any site.
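As a rough sketch of how the two fit together: `requests.get` downloads a page, and a regex then runs over the returned HTML text. The helper names `fetch_html` and `extract_links`, the `href` pattern, and the sample HTML are all assumptions for illustration; the download helper is defined but not called here, since it needs network access.

```python
import re

def fetch_html(url):
    """Download a page and return its HTML as text (needs network access)."""
    import requests  # third-party library; imported here so the rest runs without it
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text

def extract_links(html):
    """Return the value of every double-quoted href attribute in the HTML."""
    return re.findall(r'href="([^"]+)"', html)

# Made-up HTML snippet standing in for a downloaded page
sample_html = '<p><a href="https://example.com/a">A</a> and <a href="/b">B</a></p>'
links = extract_links(sample_html)
print(links)  # ['https://example.com/a', '/b']
```

On a live page the two steps would chain as `extract_links(fetch_html("https://example.com"))`, where the URL is only a placeholder.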