# Regular expressions

## This workshop
- Brief introduction to regular expressions
- Continually introduce new features of regular expressions
- Series of short challenges for each feature to practice your skills

## Brief introduction

Regular expressions are of both theoretical and practical interest in computer science. For the theoretical side, see the [Wikipedia page on regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [on regular languages](https://en.wikipedia.org/wiki/Regular_language). For our purposes, a regular expression is a sequence of characters that defines a search pattern. We can use regular expressions to find particular patterns in text data. The text data could be an English sentence, e-mail addresses, TeX commands, Python source code, or anything you like. Once we've found a particular pattern, we can optionally replace it with some other text. **In this way, regular expressions are really just advanced "find and replace" techniques.**

Here are some tasks you can achieve easily and quickly with regular expressions:
- Extract all words within parentheses in a text file
- Find all products on a webpage that cost over \$250
- Replace all swear words with family-friendly alternatives
- Find every mention of a date in some text
- Locate all the phone numbers in a series of emails

**What are some situations in your work where regular expressions would be useful?**

Here are some example regular expressions:

- `(\w+ )(\1)`
- `n|a|b`
- `one`
- `\d+`

We can think of regular expressions as a tiny, highly specialized programming language. This "tiny, highly specialized programming language" is available in Python, R, C, Excel, etc. To use regular expressions, we have to learn their syntax, just like you did when you learnt Python or R syntax. The awesome thing about regular expressions is that their syntax is almost entirely identical across different programming languages. All the patterns we wrote above match the same patterns regardless of where you're using them. In this workshop, we'll be using Python. **But bear in mind, the regular expression syntax we learn today isn't unique to Python.**

There are some excellent resources on the web for learning and using regular expressions in Python. We'll refer to all of these throughout the workshop.
- [The documentation for Python 3's `re` module](https://docs.python.org/3/library/re.html)
- [Python regular expression cheat sheet (2.7, but mostly unchanged)](https://github.com/dlab-berkeley/regular-expressions-in-python/blob/master/regex_cheatsheet.pdf)
- [Pythex](https://pythex.org/)
- [PyRegex](http://www.pyregex.com/)
- [HOWTO Python regular expressions](https://docs.python.org/3/howto/regex.html)

One last note on terminology. Regular expressions are also called regexes, regex patterns or REs.

### Why use regular expressions?

What does text processing look like **without** regular expressions? Let's try to find phone numbers.

In [8]:
def is_phone_number(text):
    '''Return True if `text` is a valid US phone number.
    
    A phone number in the US is a string of digits with a 3-digit Area Code, 
    followed by hyphen, a group of three digits, another hyphen and a group 
    of four digits.'''
    # Test the length of the string
    if len(text) != 12:
        return False
    # Test that the first three characters are digits
    for i in range(3): 
        if not text[i].isnumeric():
            return False
    # Test that the fourth character is a '-'
    if text[3] != '-':
        return False
    # Test that the next three characters are digits
    for i in range(4,7):
        if not text[i].isnumeric():
            return False
    # Test that the next character is a '-'
    if text[7] != '-':
        return False
    # Test that the last four characters are digits
    for i in range(8,12): 
        if not text[i].isnumeric():
            return False
    # If we didn't fail any of the above tests, it's a valid US phone number
    return True

Let's use our new function on a test string.

In [11]:
test_string = '510-654-1220'
is_phone_number(test_string)

True

In [12]:
test_string = 'this is not a phone number'
is_phone_number(test_string)

False

Now we want to find **all** the phone numbers in a message. We can do that by looping through every substring of length 12 in our message, and using our `is_phone_number` function above.

In [14]:
message = 'Call me at 409-223-8952 tomorrow. 409-888-8498 is my office'
for i in range(len(message)):
    chunk = message[i:i+12]
    if is_phone_number(chunk):
        print('Phone number found: ' + chunk)

Phone number found: 409-223-8952
Phone number found: 409-888-8498


This definitely works. But there's so much overhead! Regular expressions allow us to be much more concise. Here's how we could do the same thing with regular expressions. In Python, we need to `import` the `re` module in order to use regular expressions. The regular expression for this example is assigned to the variable `phone_number_pattern`. The pattern is: "three digits followed by a '-', followed by three digits, followed by '-', followed by four digits". This is **a lot** easier to understand than our `is_phone_number` function.

In [17]:
import re
phone_number_pattern = r'\d{3}-\d{3}-\d{4}'
for number in re.findall(phone_number_pattern, message):
    print('Phone number found: '+ number)

Phone number found: 409-223-8952
Phone number found: 409-888-8498


#### Challenge 1 
What are some problems with our regular expression defined above? Will it find all valid US phone numbers? _Hint: Do people always write their phone number in exactly the format used above?_

### Matching characters exactly

Most letters and characters will simply match themselves. For example, the regular expression `regular` will match the string `regular` exactly.

In [20]:
pattern = 'regular'
test_string = 'we are practising our regular expressions'
re.findall(pattern, test_string)

['regular']

There are exceptions to this rule; some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. Really, learning regular expression syntax is just learning how to use these metacharacters. In this workshop, we'll discuss the following metacharacters:

`. ^ $ * + ? { } [ ] \ | ( )`

The first metacharacters we’ll look at are `[` and `]`. They’re used for specifying a **character class**, which is a set of characters that you wish to match.

In [23]:
ab_pattern = '[ab]'
test_string = 'abracadabra'
re.findall(ab_pattern, test_string)

['a', 'b', 'a', 'a', 'a', 'b', 'a']

#### Challenge 2
Find all the p's and q's in the test string below.

In [22]:
test_string = "Quick, there's a large goat filled with pizzaz. Is there a path to the queen of Zanzabar?"

#### Challenge 3
Find all the vowels in the test sentence below.

In [None]:
test_string = 'the quick brown fox jumped over the lazy dog'

Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, `[abc]` will match any of the characters a, b, or c; this is the same as `[a-c]`.
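As a small illustration of ranges, here is a sketch that checks `[a-c]` against `[abc]` and then combines two ranges in one class. The test strings and variable names are made up for this example.

```python
import re

# A range inside a character class: [a-c] behaves the same as [abc].
range_matches = re.findall(r'[a-c]', 'abracadabra')
listed_matches = re.findall(r'[abc]', 'abracadabra')
print(range_matches)                    # ['a', 'b', 'a', 'c', 'a', 'a', 'b', 'a']
print(range_matches == listed_matches)  # True

# Several ranges can sit in one class: digits 0-4 or letters x-z.
mixed_matches = re.findall(r'[0-4x-z]', 'x7y2z9 0')
print(mixed_matches)                    # ['x', 'y', '2', 'z', '0']
```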

#### Challenge 4
Find all the capital letters in the following string.

In [77]:
test_string = 'The 44th pPresident of the United States of America was Barack Obama'

You can match the characters not listed within the class by complementing the set. This is indicated by including a `^` as the first character of the class; a `^` anywhere else in the class simply matches the `^` character. For example, `[^5]` will match any character except `5`.

In [26]:
everything_but_t = '[^t]'
test_string = 'the quick brown fox jumped over the lazy dog'
re.findall(everything_but_t, test_string)[:5]

['h', 'e', ' ', 'q', 'u']

#### Challenge 5
Find all the consonants in the test sentence below.

In [27]:
test_string = 'the quick brown fox jumped over the lazy dog'

#### Challenge 6
Find all the `^` characters in the following test sentence.

In [34]:
test_string = """You can match the characters not listed within the class by complementing the set. 
This is indicated by including a ^ as the first character of the class; 
^ outside a character class will simply match the ^ character. 
For example, [^5] will match any character except 5."""

Challenge 6 is a bit of a trick. The problem is that we want to match the `^` character, but it's interpreted as a metacharacter, a character which has a special meaning. If we want to literally match the `^`, we have to "escape" its special meaning. For this, we use the `\`.
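Here is a minimal sketch of escaping on a different, made-up string. It compares an unescaped `^` with an escaped one; the variable names are just for illustration.

```python
import re

text = '2^10 is 1024, and 3^2 is 9'

# Unescaped, ^ is a metacharacter, so it does not match the caret symbols.
# Here it produces a single empty match at the very start of the text.
unescaped = re.findall(r'^', text)
print(unescaped)  # ['']

# Escaped with a backslash, it matches the literal ^ character.
escaped = re.findall(r'\^', text)
print(escaped)    # ['^', '^']
```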

#### Challenge 7
Find all the square brackets `[` and `]` in the following test string

In [38]:
test_string = "The first metacharacters we'll look at are [ and ]."

In [39]:
pattern = r'[^t]'

The backslash `\` has another use in regexes, in addition to escaping metacharacters. It's used as the first character in special two-character combinations that have special meanings. These special two-character combinations are really shorthand for sets of characters.

|      Character     |       Meaning      |   Shorthand for  |
|:------------------:|:------------------:|:----------:|
|        `\d`        |      any digit     | `[0-9]` |
|        `\D`        |    any non-digit   |    `[^0-9]`    |
|        `\s`        |   any whitespace   |    `[ \t\n\r\f\v]`    |
|        `\S`        | any non-whitespace |    `[^ \t\n\r\f\v]`    |
|        `\w`        | any word character |    `[a-zA-Z0-9_]`    |
| what do you think? | any non-word character |         `?`   |
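To see a few of these shorthands in action, here is a short sketch on made-up strings. The last row of the table is left for you to work out.

```python
import re

text = 'Room 42, Floor 7'

digits = re.findall(r'\d', text)          # every digit
print(digits)       # ['4', '2', '7']

spaces = re.findall(r'\s', text)          # every whitespace character
print(spaces)       # [' ', ' ', ' ']

non_digits = re.findall(r'\D', 'a1b2')    # everything that is not a digit
print(non_digits)   # ['a', 'b']

word_chars = re.findall(r'\w', 'a_1 !')   # letters, digits and underscore
print(word_chars)   # ['a', '_', '1']
```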

Now here's a quick tip. When writing regular expressions in Python, use raw strings instead of normal strings. Raw strings are preceded by an `r` in Python code. If we don't, the Python interpreter will try to convert backslashed characters before passing them to the regular expression engine. This will end in tears. You can read more about this [here](https://docs.python.org/3/library/re.html#module-re).
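The sketch below shows one way this can go wrong. It uses `\b`, which in a regex marks a word boundary (a feature not covered in this workshop) but in a normal Python string is the backspace character. The example strings are made up.

```python
import re

normal = '\b'   # Python converts this to a single backspace character
raw = r'\b'     # raw string: a backslash followed by the letter b
print(len(normal), len(raw))  # 1 2

text = 'cat concat cat'

# Raw string: the regex engine sees \b and treats it as a word boundary.
raw_matches = re.findall(r'\bcat\b', text)
print(raw_matches)     # ['cat', 'cat']

# Normal string: the regex engine sees backspace characters, so nothing matches.
normal_matches = re.findall('\bcat\b', text)
print(normal_matches)  # []
```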

#### Challenge 8
Find all three-digit prices in the following test sentence. Remember that `$` is a metacharacter, so it needs to be escaped.

In [40]:
test_string = 'The iPhone X costs over $999, while the Android competitor comes in at around $550.'

Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn’t be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.

| Character |        Meaning        |    Example    |                Matches               |
|:---------:|:---------------------:|:-------------:|:------------------------------------:|
|   `{n}`   |    exactly n times    |     `a{3}`    |                 'aaa'                |
|  `{n,m}`  | between n and m times | `[1-9]{2,4}`  |          '12', '123', '1234'         |
|    `?`    |      0 or 1 times     |   `colou?r`   |           'color', 'colour'          |
|    `*`    |    0 or more times    |    `data!*`   | 'data', 'data!', 'data!!', 'data!!!' |
|    `+`    |    1 or more times    |     `lo+l`    |        'lol', 'lool', 'loool'        |
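Here is a quick sketch that runs each quantifier from the table on a made-up test string. The variable names are only for this example.

```python
import re

optional_u = re.findall(r'colou?r', 'color colour colouur')
print(optional_u)    # ['color', 'colour']

one_or_more = re.findall(r'lo+l', 'll lol loool')
print(one_or_more)   # ['lol', 'loool']

zero_or_more = re.findall(r'data!*', 'data data!!!')
print(zero_or_more)  # ['data', 'data!!!']

exactly_three = re.findall(r'a{3}', 'aa aaa aaaa')
print(exactly_three) # ['aaa', 'aaa']

# Note: no space after the comma inside the braces.
two_to_four = re.findall(r'[1-9]{2,4}', '7 12 12345')
print(two_to_four)   # ['12', '1234']
```

Quantifiers grab as many repetitions as they can, which is why `12345` yields `1234` and leaves a lone `5` that is too short to match.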

#### Challenge 9
Find all prices in the following test sentence.

In [41]:
test_string = """The iPhone X costs over $999, while the Android competitor comes in at around $550.
Apple's MacBook Pro costs $1200, while just a few years ago it was $1700.
A new charger for the MacBook costs over $80.
"""

The regular expression syntax that we've seen so far covers most of the common use cases. Let's take a break from the syntax, and focus on Python's re module. It has some quirks that we should talk about, after which we'll get back to the syntax.

Up until now we've only used `re.findall`. This function takes two arguments, a `pattern` and a `text` to search through. It returns a list of all the substrings in `text` that follow `pattern`. 

Two other common functions are `re.match` and `re.search`. These take the same two arguments as `re.findall`. `re.search` looks through `text` for the **first** occurrence of `pattern`. `re.match` only looks at the start of `text`. Rather than returning a list, these two functions return a `match` object, which contains information about the substring in `text` that matches `pattern`. For example, it gives you the starting and ending index of the substring. If no such matching substring is found, they return `None`.
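The difference between the two functions is easiest to see side by side. This is a small sketch on a made-up string.

```python
import re

text = 'costs $999'
price_pattern = r'\$\d+'

# re.match only looks at the start of the text, where there is no price.
at_start = re.match(price_pattern, text)
print(at_start)          # None

# re.search scans the whole text and stops at the first occurrence.
anywhere = re.search(price_pattern, text)
print(anywhere.group())  # $999
print(anywhere.span())   # (6, 10)

# re.match succeeds when the pattern sits at the very beginning.
word_at_start = re.match(r'costs', text)
print(word_at_start.group())  # costs
```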

In [43]:
price_pattern = r'\$\d+'
test_string = """The iPhone X costs over $999, while the Android competitor comes in at around $550.
Apple's MacBook Pro costs $1200, while just a few years ago it was $1700.
A new charger for the MacBook costs over $80.
"""
m = re.search(price_pattern, test_string)
m

<_sre.SRE_Match object; span=(24, 28), match='$999'>

The `match` object has several methods and attributes; the most important ones are `group()`, `start()`, `end()` and `span()`. `group()` returns the string that matched the regex, `start()` and `end()` return the starting and ending indices of the match, and `span()` returns both indices as a tuple.

In [50]:
print(m.group())
print(m.start())
print(m.end())
print(m.span())

$999
24
28
(24, 28)


In general, I prefer just using `re.findall`, because I rarely need the information that `match` object instances give.

#### Challenge 10
Write a function called `first_vowel` that takes in a single word, and returns the first vowel. If there is no vowel in the word, it should return the string `"Hey, no vowel!"`.

In [51]:
def first_vowel(word):
    vowel_pattern = r'[aeiou]'
    m = re.search(vowel_pattern, word)
    if m:
        return m.group()
    return 'Hey, no vowel!'

In [54]:
print(first_vowel('hello'))
print(first_vowel('sky'))

e
Hey, no vowel!


So far we've just been finding, but I promised you advanced "find and replace"! That's what `re.sub` is for. `re.sub` takes three arguments: a `pattern` to look for, a `replacement` string to replace it with, and a `text` to look for `pattern` in.
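As a minimal sketch of `re.sub`, the example below masks digits and tidies up whitespace in two made-up strings.

```python
import re

# Replace every digit with an X.
masked = re.sub(r'\d', 'X', 'Call 510-654-1220 now')
print(masked)     # Call XXX-XXX-XXXX now

# Replace each run of whitespace with a single space.
collapsed = re.sub(r'\s+', ' ', 'too    many   spaces')
print(collapsed)  # too many spaces
```

`re.sub` returns a new string and leaves the original text unchanged.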

#### Challenge 11
Replace all the prices in the test string below with `"one million dollars"`.

In [56]:
test_string = """The iPhone X costs over $999, while the Android competitor comes in at around $550.
Apple's MacBook Pro costs $1200, while just a few years ago it was $1700.
A new charger for the MacBook costs over $80.
"""

So far we've used the module-level functions `re.findall` and friends. We can also `compile` a regex into a `pattern` object, which has methods with the same names as the module-level functions. Compiling mainly pays off when you reuse one pattern many times, for example when searching over huge texts. Otherwise it behaves the same as what we've been doing so far, so there's no need to complicate things. But you'll see it around, so it's good to know about.

In [59]:
vowel_pattern = re.compile(r'[aeiou]')
test_string = 'abracadabra'
vowel_pattern.findall(test_string)

['a', 'a', 'a', 'a', 'a']

You might also want to experiment with `re.split`.
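As a starting point, here is a small sketch of `re.split` on a made-up string. It splits wherever the pattern matches, in this case at a comma or semicolon followed by optional whitespace.

```python
import re

shopping = 'apples, pears;oranges,  kiwis'

# Split at each comma or semicolon, swallowing any whitespace that follows.
parts = re.split(r'[,;]\s*', shopping)
print(parts)  # ['apples', 'pears', 'oranges', 'kiwis']
```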

#### Challenge 12
You've received a problematic dataset from a fellow researcher, with some data entry errors/discrepancies. How would you use regular expressions to correct these errors?

1. Replace all instances of "district" or "District" with "County". 
2. Replace all instances of "Not available" or "[Name] looking up" with numeric codes.  

In [None]:
with open("data/usecase1/problem_dataset.csv", "r") as f:
    text = f.read()

# DO SOME REGEX MAGIC
# cleaned_text = ...

with open("data/usecase1/cleaned_dataset.csv", "w") as f:
    f.write(cleaned_text)


#### Challenge 13
Find all words in the following string about robots.

In [60]:
string = '''Robots are branching out. A new prototype soft robot takes inspiration from plants by growing to explore its environment.

Vines and some fungi extend from their tips to explore their surroundings. 
Elliot Hawkes of the University of California in Santa Barbara 
and his colleagues designed a bot that works 
on similar principles. Its mechanical body 
sits inside a plastic tube reel that extends 
through pressurized inflation, a method that some 
invertebrates like peanut worms (Sipunculus nudus)
also use to extend their appendages. The plastic 
tubing has two compartments, and inflating one 
side or the other changes the extension direction. 
A camera sensor at the tip alerts the bot when it’s 
about to run into something.

In the lab, Hawkes and his colleagues 
programmed the robot to form 3-D structures such 
as a radio antenna, turn off a valve, navigate a maze, 
swim through glue, act as a fire extinguisher, squeeze 
through tight gaps, shimmy through fly paper and slither 
across a bed of nails. The soft bot can extend up to 
72 meters, and unlike plants, it can grow at a speed of 
10 meters per second, the team reports July 19 in Science Robotics. 
The design could serve as a model for building robots 
that can traverse constrained environments

This isn’t the first robot to take 
inspiration from plants. One plantlike 
predecessor was a robot modeled on roots.'''

#### Challenge 14
We can use parentheses to capture certain parts of a regular expression's match.

In [72]:
price_pattern = r'\$(\d+)\.(\d{2})'
test_string = "The iPhone X costs over $999.99, while the Android competitor comes in at around $550.50."
m = re.search(price_pattern, test_string)
dollars, cents = m.group(1), m.group(2)
print(dollars)
print(cents)

999
99


Use parentheses to group together the area code of a US phone number. Write a function called `area_code` that takes in a string, and if it is a valid US phone number, returns the area code. If not, it should return the string `"Hey, not a phone number!"`.

#### Challenge 15
Parentheses can also be used to group together characters in a regular expression so that metacharacters can apply to the entire group, not just a single character.

In [75]:
bat_pattern = r'Bat(wo)?man'
test_string = 'Batwoman, Batman and Robin are good friends.'
re.findall(bat_pattern, test_string)

['wo', '']

What went wrong? Well, parentheses have a double life in regular expression syntax. They are used to signal groups like in Challenge 14, but also to let metacharacters apply to those groups. Those two uses interfere with each other. If we want the `?` to apply to the whole `wo` sequence, but we want the whole substring that matches, we have to use a non-capturing group.

In [76]:
bat_pattern = r'Bat(?:wo)?man'
test_string = 'Batwoman, Batman and Robin are good friends.'
re.findall(bat_pattern, test_string)

['Batwoman', 'Batman']
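The sketch below shows how `re.findall` reacts to capturing groups in general. With several groups it returns one tuple per match. If you want to keep the capturing group and still get the whole match, one option is `re.finditer` (not covered above), which yields `match` objects. The price string is made up.

```python
import re

# Two capturing groups: findall returns a tuple of groups for each match.
prices = 'A costs $999.99, B costs $550.50'
dollars_and_cents = re.findall(r'\$(\d+)\.(\d{2})', prices)
print(dollars_and_cents)  # [('999', '99'), ('550', '50')]

# finditer yields match objects; group(0) is always the whole match,
# even when the pattern contains a capturing group.
text = 'Batwoman, Batman and Robin are good friends.'
whole_matches = [m.group(0) for m in re.finditer(r'Bat(wo)?man', text)]
print(whole_matches)      # ['Batwoman', 'Batman']
```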