# Background

Before diving into learning regular expression syntax, let's spend a little time understanding what regular expressions are, what benefits they offer us, and what kinds of problems we can solve with them.

[This workshop](#section 1)<br>

[What are regular expressions?](#section 2)<br>

[Why use regular expressions?](#section 3)<br>

### Time
- Teaching: 20 minutes
- Exercises: 5 minutes

## This workshop<a id='section 1'></a>

In this workshop, we're going to focus on **using regular expressions to solve real-world problems**. Regular expressions can be found in all modern programming languages and in many text editors. For pedagogical and practical purposes, we'll be using Python. But remember that most of what you learn in this workshop will carry across to any environment in which regular expressions are available.

## What are regular expressions?<a id='section 2'></a>

Regular expressions are of both theoretical and practical interest in computer science. For the theoretical side, see the [Wikipedia page on regular expressions](https://en.wikipedia.org/wiki/Regular_expression) and [on regular languages](https://en.wikipedia.org/wiki/Regular_language). For our purposes, a regular expression is a sequence of characters that defines a search pattern. We can use regular expressions to find particular patterns in text data. The text data could be English sentences, e-mail addresses, TeX commands, Python source code, or anything you like. Once we've found a particular pattern, we can optionally replace it with some other text. **In this way, regular expressions are really just advanced "find and replace" techniques.**

Here are some tasks you can achieve easily and quickly with regular expressions:

* Extract all words within parentheses in a text file
* Find all products on a webpage that cost over \$250
* Replace swear words with family-friendly alternatives
* Extract all the dates and times in a document
* Locate all the phone numbers in a series of emails
* Find all words regardless of whether they are singular or plural, or past, present or future tense
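As a rough sketch of the first task in the list, here is one way it might look in Python. The sample sentence is invented for illustration, and the pattern is only one of several that would work; the syntax it uses is explained later in the workshop.

```python
import re

# Hypothetical sample text, invented for illustration.
text = 'The meeting (Tuesday) was moved to the annex (room 12).'

# \( and \) match literal parentheses.
# ([^)]*) captures everything up to the next closing parenthesis.
in_parens = re.findall(r'\(([^)]*)\)', text)
print(in_parens)  # ['Tuesday', 'room 12']
```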

**What are some situations in your work where regular expressions would be useful?**

In general, we use regular expressions when we have a large corpus of text data and we know that the data we want to extract follows specific rules.

Here are some example regular expressions:

* `(.*)`
* `family|ies`
* `fu+c?k`
* `n|a|b`
* `one`
* `\$\d+`
* `after.$`
* `the (\w)er .*, the (\w)er .*`
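To make a few of these patterns less abstract, the sketch below runs them through Python's `re` module. The sample strings are invented for illustration, and the variable names are only there for this example.

```python
import re

# The sample strings are invented for illustration.

# `family|ies` matches either "family" or "ies".
either = re.findall(r'family|ies', 'one family, two families')
print(either)   # ['family', 'ies']

# `one` matches those three letters anywhere, even inside longer words.
literal = re.findall(r'one', 'someone phoned')
print(literal)  # ['one', 'one']

# `after.$` matches "after" plus any one character at the end of the string.
ending = re.search(r'after.$', 'happily ever after!')
print(ending.group())  # after!

# A literal dollar sign has to be escaped, so `\$\d+` matches "$" followed by digits.
prices = re.findall(r'\$\d+', 'Was $300, now $249')
print(prices)   # ['$300', '$249']
```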

We can think of regular expressions as a tiny, highly specialized programming language. This language is available in Python, R, C, Excel, Microsoft Word, Google Docs, Vim, Emacs, Notepad++, and many other tools. To use regular expressions, we have to learn their syntax, just as we did when we learnt Python or R syntax. In fact, **learning to use regular expressions is 95% learning the syntax.** You can learn the whole syntax in a few hours, although becoming proficient takes more practice. The awesome thing about regular expressions is that their syntax is almost identical across programming languages and tools, apart from minor differences between implementations. The patterns above mean the same thing almost everywhere you can use them. In this workshop, we'll be using Python. **But bear in mind, the regular expression syntax we learn today isn't unique to Python.**

Another way to think about regular expressions is that they define a set of strings. For example, it might be the set of all possible URLs, or the set of all possible social security numbers. In a sense, you can think of the regular expression like this:

`\d{3}-\d{2}-\d{4}` = `{123-45-6789, 123-45-6780, 123-45-6781, ...}`

Regular expressions are useful because they are a concise way of representing the set of strings you care about. We can use the regular expression above to search for any strings in our data that are in that set. Another way to say that is that we search for strings in our data that **match the regular expression**.
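One way to picture the "set of strings" idea is to test whether a string belongs to the set. The sketch below uses `re.fullmatch`, which only succeeds if the whole string matches the pattern. The candidate strings and variable names are made up for illustration.

```python
import re

ssn_pattern = r'\d{3}-\d{2}-\d{4}'

# Invented candidates: only the first follows the 3-2-4 digit layout.
candidates = ['123-45-6789', '123-456-789', '12-345-6789']

# A string is "in the set" if the entire string matches the pattern.
membership = {s: re.fullmatch(ssn_pattern, s) is not None for s in candidates}
for s, in_set in membership.items():
    print(s, in_set)
```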


One last note on terminology. Regular expressions are also called regexes, regex patterns or REs.

## Why use regular expressions?<a id='section 3'></a>

To get a feel for why regular expressions are so useful, let's consider what text processing looks like without them. Let's try to find phone numbers.

In [1]:
def is_phone_number(text):
    '''Return True if `text` is a valid US phone number.
    
    A phone number in the US is a string of digits with a 3-digit area code,
    followed by a hyphen, a group of three digits, another hyphen, and a group
    of four digits.'''
    # Test the length of the string
    if len(text) != 12:
        return False
    # Test that the first three characters are digits
    for i in range(3): 
        if not text[i].isnumeric():
            return False
    # Test that the fourth character is a '-'
    if text[3] != '-':
        return False
    # Test that the next three characters are digits
    for i in range(4,7):
        if not text[i].isnumeric():
            return False
    # Test that the next character is a '-'
    if text[7] != '-':
        return False
    # Test that the last four characters are digits
    for i in range(8,12): 
        if not text[i].isnumeric():
            return False
    # If we didn't fail any of the above tests, it's a valid US phone number
    return True

Let's use our new function on a test string.

In [2]:
test_string = '510-654-1220'
is_phone_number(test_string)

True

In [3]:
test_string = 'this is not a phone number'
is_phone_number(test_string)

False

In [5]:
test_string = '415-677-211' # missing final digit
is_phone_number(test_string)

False

Now we want to find all the phone numbers in a message. We can do that by looping through every substring of length 12 in our message and applying our `is_phone_number` function to each one.

In [8]:
message = 'Call me at 409-223-8952 tomorrow. 409-888-8498 is my office'
for i in range(len(message)):
    chunk = message[i:i+12]
    if is_phone_number(chunk):
        print('Phone number found: ' + chunk)

Phone number found: 409-223-8952
Phone number found: 409-888-8498


This definitely works. But there's so much overhead! Regular expressions allow us to be much more concise. Here's how we could do the same thing with regular expressions. In Python, we need to import the `re` module in order to use them. The regular expression for this example is assigned to the variable `phone_number_pattern`. The pattern reads: "three digits followed by a '-', followed by three digits, followed by a '-', followed by four digits". This is **a lot** easier to understand than our `is_phone_number` function.

In [9]:
import re
# A raw string (r'...') stops Python from treating \d as a string escape
phone_number_pattern = r'\d{3}-\d{3}-\d{4}'
for number in re.findall(phone_number_pattern, message):
    print('Phone number found: ' + number)

Phone number found: 409-223-8952
Phone number found: 409-888-8498


### Challenge

_What are some problems with our regular expression defined above? Will it find all valid US phone numbers? Hint: Do people always write their phone number in exactly the format used above?_

### Solution
This won't work if people leave out the dashes, omit the area code, or put spaces between the digits.
In general, this simple regular expression is a brittle approach. By that, we mean that it will break (i.e. not work when we want it to) quite easily.
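As a rough sketch of how the pattern could be made more tolerant, the version below allows an optional area code (with or without parentheses) and accepts a dash, dot, space, or nothing between the groups. The name `flexible_pattern` and the sample message are invented for this example. It is still far from a complete validator: for instance, it would also match digits inside a longer number.

```python
import re

# Optional area code, optionally in parentheses, then optional separators.
# (?: ... ) groups without capturing, so findall returns the whole match.
flexible_pattern = r'(?:\(?\d{3}\)?[-. ]?)?\d{3}[-. ]?\d{4}'

# Invented message with four different ways of writing a number.
sample = 'Call 409-223-8952, (409) 888 8498, 4092238952 or 223-8952.'
found = re.findall(flexible_pattern, sample)
for number in found:
    print('Phone number found: ' + number)
```

The trade-off is visible straight away: the pattern catches more formats, but it is harder to read than the original one.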

In summary, we're interested in learning how to use regular expressions because:

> _To master regular expressions is to master your data_ (Friedl 2006: xvii)