# Regular Expressions

<style>
section.present > section.present { 
    max-height: 90%; 
    overflow-y: scroll;
}
</style>

<small><a href="https://colab.research.google.com/github/brandeis-jdelfino/cosi-10a/blob/main/lectures/notebooks/15_regexes.ipynb">Link to interactive slides on Google Colab</a></small>

# Announcements

* PS4 grades are in LATTE, let me know if you have questions.
* If you aren't sure how to interpret LATTE grades: if the class ended today, the grade cutoffs for "Problem set skills total" would be:
   * **A**: 68 - 72
   * **B**: 65 - 67.9
   * **C**: 61 - 64.9
   * These represent the grading from the syllabus: 90% of mastery skills for an A, 80% for a B, 70% for a C.
   * These assume you fully correct your quizzes.

## Exercise

Determine whether a string contains a phone number. 

Not one specific phone number, but anything that looks like a phone number, such as `555-1234`.

In [None]:
def is_phone_number(value):
    if len(value) != 8:
        return False
    if not value[0].isdigit():
        return False
    if not value[1].isdigit():
        return False
    if not value[2].isdigit():
        return False
    if value[3] != '-':
        return False
    if not value[4].isdigit():
        return False
    if not value[5].isdigit():
        return False
    if not value[6].isdigit():
        return False  
    if not value[7].isdigit():
        return False  
    return True

In [None]:
print(is_phone_number("555-1234"))
print(is_phone_number("5551234"))
print(is_phone_number("555-12345"))
print(is_phone_number("555-123"))

Now we can try `is_phone_number` at each index of a longer string...

In [None]:
def contains_phone_number(value):
    for i in range(len(value)-7):
        if is_phone_number(value[i:i+8]):
            return True
    return False


In [None]:
print(contains_phone_number("My phone number is 867-5309."))
print(contains_phone_number("My phone number is 867-5309"))
print(contains_phone_number("867-5309 is my phone number."))
print(contains_phone_number("867-5309"))
print(contains_phone_number("867-530 is not my phone number"))
print(contains_phone_number("867-530"))
print(contains_phone_number("My phone number is not 867-530"))

Ok, it's not pretty, but it works.

Now handle a variety of formats - with/without the dash, area codes `(617)867-5309`, `617-867-5309`, etc.

Ugh.

## Regular Expressions

Python (along with many other programming languages) provides **regular expressions**: a powerful "mini-language" for matching, extracting, and manipulating string data.

"Regular expression" is often shortened to "regex".

In [None]:
import re
def contains_phone_number(value):
    # The r prefix makes this a raw string, so the backslashes reach the regex engine unchanged.
    match = re.search(r'\d{3}-\d{4}', value)
    return match is not None

In [None]:
print(contains_phone_number("My phone number is 867-5309."))
print(contains_phone_number("My phone number is 867-5309"))
print(contains_phone_number("867-5309 is my phone number."))
print(contains_phone_number("867-5309"))
print(contains_phone_number("867-530 is not my phone number"))
print(contains_phone_number("867-530"))
print(contains_phone_number("My phone number is not 867-530"))
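Regex patterns are full of backslashes, and the Python documentation recommends writing them as raw strings (with an `r` prefix) so that Python does not try to interpret sequences like `\d` itself. A minimal sketch of the difference; the variable names and the sample sentence are made up for illustration:

```python
import re

# "\n" is one character (a newline); r"\n" is two: a backslash and an "n".
newline_len = len("\n")
raw_len = len(r"\n")
print(newline_len, raw_len)  # → 1 2

# A raw string keeps the backslash, so r"\d" is the same as "\\d".
same = r"\d" == "\\d"
print(same)  # → True

# The phone number pattern, written as a raw string.
found = re.search(r"\d{3}-\d{4}", "Call 867-5309").group()
print(found)  # → 867-5309
```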

## If you thought `def __init__(self)` looked weird...

`re.search('\d{3}-\d{4}', value)` is probably making your head spin

## Patterns

A regex "pattern" defines a set of matching rules.

A pattern might define the rules for a phone number, an email address, a date, or something more complicated.

Patterns can be used to ask questions like "does this string match this pattern?" or "can this pattern be found in this string?"

## The simplest pattern

In [None]:
pattern = "test"
print(re.search(pattern, "This is a test."))

"Normal" characters in a pattern match themselves.

The `search` function tries to find the pattern anywhere in a given string.

So, we're using the `search` function here to ask: "does the string `"This is a test."` contain the sequence `test`?"

In [None]:
pattern = "test"
print(re.search(pattern, "This is a test2."))


`search` returns a `Match` object that gives us some information about the match that was found - where it occurred in the string, and what the matched text was.

If no match is found, `search` returns `None`.
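A small sketch of pulling that information out of a `Match` object, using its `group()`, `start()`, `end()`, and `span()` methods. The variable names are only for illustration:

```python
import re

match = re.search("test", "This is a test.")
if match is not None:
    print(match.group())  # the matched text → test
    print(match.start())  # index where the match begins → 10
    print(match.end())    # index just past the match → 14
    print(match.span())   # (start, end) as a tuple → (10, 14)

# No match: search returns None, so check before calling methods on it.
no_match = re.search("test", "No match here")
print(no_match)  # → None
```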

## Special sequences

If we stuck only to normal characters, then regexes would be no better than saying `"test" in "This is a test."`

There are a number of special characters and sequences that offer much more power. We'll look at a few today.

## `[]`

Brackets denote a "character class" - a set of characters that can be matched:
* `[abc]` matches `"a"`, `"b"`, or `"c"`

`-` can be used to specify a range of characters:
* `[0-9]` matches any digit between `0` and `9`
* `[a-p]` matches any letter between `a` and `p`

You can specify multiple ranges together:
* `[a-zA-Z]` matches any upper or lower case letter

In [None]:
pattern = 'I got an [ABCDF]'
print(re.search(pattern, "I got an A"))
print(re.search(pattern, "I got an F :("))
print(re.search(pattern, "I got an d"))
print(re.search(pattern, "I got a B"))

print(re.match(pattern, ":( I got an F"))

`[ABCDF]` in the pattern matches any of the letters: `A`, `B`, `C`, `D`, `F`.

## `^`

Using `^` at the beginning of a character class means "match anything **except** these characters".

In [None]:
pattern = 'No vowels allowed: [^aeiou]'

print(re.search(pattern, "No vowels allowed: p"))
print(re.search(pattern, "No vowels allowed: a"))

pattern2 = '[^aeiou]*'
print(re.findall(pattern2, "No vowels allowed: p"))
print(re.search(pattern2, "fjlwfjksnb"))

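The first pattern matches any substring that starts with `"No vowels allowed: "`, followed by any single character that isn't a lowercase `a`, `e`, `i`, `o`, or `u`.

`pattern2` adds a `*` (covered below), so it matches runs of zero or more non-vowel characters.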

## Predefined character classes

There are some predefined character classes you can use:

| Sequence | Matches | Equivalent character class |
| --- | --- | --- | 
| `\d` | any digit | `[0-9]` |
| `\D` | any non-digit | `[^0-9]` |
| `\s` | any whitespace character | `[ \t\n\r\f\v]` |
| `\S` | any non-whitespace character | `[^ \t\n\r\f\v]` |
| `\w` | any alphanumeric character or underscore | `[a-zA-Z0-9_]` |
| `\W` | any character that is not alphanumeric or an underscore | `[^a-zA-Z0-9_]` |

In [None]:
pattern = r"Name:\s\w\w\w\sPhone:\s\d\d\d-\d\d\d\d"
print(re.search(pattern, "Name: Joe Phone: 555-1234"))
print(re.search(pattern, "Name:\tJoe\nPhone: 555-1234"))
print(re.search(pattern, "Name: Bill Phone: 555-1234"))
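To see what each predefined class picks up, it can help to run `findall` on a short string. This is only a sketch; the sample strings and variable names are made up for illustration. Note that `\w` includes the underscore:

```python
import re

sample = "a_1-b"

# \w matches letters, digits, and the underscore.
word_chars = re.findall(r"\w", sample)
print(word_chars)  # → ['a', '_', '1', 'b']

# \W matches everything else; here, only the dash.
non_word_chars = re.findall(r"\W", sample)
print(non_word_chars)  # → ['-']

# \d matches one digit at a time.
digits = re.findall(r"\d", "Room 101, floor 3")
print(digits)  # → ['1', '0', '1', '3']

# \s matches spaces, tabs, and newlines.
spaces = re.findall(r"\s", "a b\tc\n")
print(spaces)  # → [' ', '\t', '\n']
```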

## Repetition

The next set of special regex characters we'll look at allow us to introduce the idea of repetition. 

## `*` - zero or more times

`*` means "match the previous character (or character class, like `\s`) zero or more times"

In [None]:
pattern = r"Name:\s*\w\w\w"
print(re.search(pattern, "Name:Joe"))
print(re.search(pattern, "Name: Joe"))
print(re.search(pattern, "Name:  Joe"))
print(re.search(pattern, "Name:             Joe"))

## `?` - zero or one times

`?` means "match the previous character (or character class) zero or one times"

In [None]:
pattern = r"Name:\s?\w\w\w"
print(re.search(pattern, "Name:Joe"))
print(re.search(pattern, "Name: Joe"))
print(re.search(pattern, "Name:  Joe"))
print(re.search(pattern, "Name:             Joe"))

## `+` - one or more times

`+` means "match the previous character (or character class) one or more times"

In [None]:
pattern = r"Name:\s+\w\w\w"
print(re.search(pattern, "Name:Joe"))
print(re.search(pattern, "Name: Joe"))
print(re.search(pattern, "Name:  Joe"))
print(re.search(pattern, "Name:             Joe"))

## `{m,n}` - a specific range

`{m,n}` means "match the previous character (or character class) at least `m`, but not more than `n`, times"

In [None]:
pattern = r"Name:\s{1,3}\w\w\w"
print(re.search(pattern, "Name:Joe"))
print(re.search(pattern, "Name: Joe"))
print(re.search(pattern, "Name:  Joe"))
print(re.search(pattern, "Name:   Joe"))
print(re.search(pattern, "Name:    Joe"))
print(re.search(pattern, "Name:             Joe"))

You can also use a variant: `{m}`, which matches exactly `m` times:

In [None]:
pattern = r"Name:\s{2}\w\w\w"
print(re.search(pattern, "Name:Joe"))
print(re.search(pattern, "Name: Joe"))
print(re.search(pattern, "Name:  Joe"))
print(re.search(pattern, "Name:   Joe"))
print(re.search(pattern, "Name:             Joe"))
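One more variant shows up later in this lecture: `{m,}` leaves off the upper bound and means "at least `m` times". The sketch below also shows that repetition takes as many characters as the pattern allows. The sample strings and variable names are made up for illustration:

```python
import re

# {2,} means "two or more": the lone "a" is skipped.
runs = re.findall(r"a{2,}", "a aa aaa aaaa")
print(runs)  # → ['aa', 'aaa', 'aaaa']

# {1,3} could stop after one digit, but it grabs as many as allowed (three).
greedy = re.search(r"\d{1,3}", "12345").group()
print(greedy)  # → 123
```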

## `()` - Grouping

Parentheses - `()` - can be used to group multiple characters together, so that a repetition operator like `*` applies to the whole group.

In [None]:
#pattern = r"\w+(,\w+)*"
word = r"\w+"
comma_and_word = f",{word}"
pattern = f"{word}({comma_and_word})*"

print(re.search(pattern, "Any,length,list,matches"))
print(re.search(pattern, "Any,length,list,matches,"))
print(re.search(pattern, "Any"))
print(re.search(pattern, "Can't,use,non,alphanumeric"))

In [None]:
#pattern = "[^,]+(,[^,]+)*"
word = "[^,]+"
comma_and_word = f",{word}"
pattern = f"{word}({comma_and_word})*"
print(re.search(pattern, "Any,length,list,matches"))
print(re.search(pattern, "Any"))
print(re.search(pattern, "Can't,use,non,alphanumeric"))
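The effect of grouping is easiest to see by comparing the same pattern with and without parentheses. A minimal sketch, with made-up sample text and variable names:

```python
import re

# Without parentheses, * applies only to the "b".
without_group = re.search(r"ab*", "ababab").group()
print(without_group)  # → ab

# With parentheses, * applies to the whole group "ab".
with_group = re.search(r"(ab)*", "ababab").group()
print(with_group)  # → ababab
```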

## `|` - "or"

The pipe (`|`) character can be used to take two patterns, _A_ and _B_, and match either one.

It has very low precedence, so you often need parentheses:

In [None]:
pattern = "I'm (happy|sad)"
print(re.search(pattern, "I'm happy"))
print(re.search(pattern, "I'm sad"))
print(re.search(pattern, "I'm excited"))

print(re.findall(pattern, "I'm happy and I'm sad"))
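Two things in the cell above are worth a closer look. First, without parentheses the `|` splits the entire pattern in two. Second, when a pattern contains a group, `findall` returns the text matched by the group rather than the whole match. The sketch below shows both; the `(?:...)` non-capturing group is a feature not covered in this lecture, and the variable names are made up for illustration:

```python
import re

# Without parentheses, this means: "I'm happy" OR "sad".
loose = re.search(r"I'm happy|sad", "sad")
print(loose.group())  # → sad

# With parentheses, "I'm " is required before either word.
strict = re.search(r"I'm (happy|sad)", "sad")
print(strict)  # → None

text = "I'm happy and I'm sad"

# findall returns only what the group matched.
group_only = re.findall(r"I'm (happy|sad)", text)
print(group_only)  # → ['happy', 'sad']

# (?:...) groups without capturing, so findall returns the whole match.
whole = re.findall(r"I'm (?:happy|sad)", text)
print(whole)  # → ["I'm happy", "I'm sad"]
```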

## `re` module functions

The `re` module gives us a few useful functions:

| Function | Meaning |
| --- | --- |
| `match()` | Match at the beginning of the string only |
| `search()` | Match anywhere in the string |
| `findall()` | Return a list of all matches (as strings) found in the string |
| `split()` | Split a string wherever the pattern matches |
| `sub()` | Replace all pattern matches with a different string |

These all take a pattern to match against and a string to search in. `sub()` takes one more argument between those two: the replacement string.
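Here is a quick sketch of each function on the same sample string (the string, patterns, and variable names are made up for illustration). Note the argument order for `sub()`: pattern, replacement, then the string to search.

```python
import re

text = "one, two,three ,  four"

# match: only looks at the beginning of the string.
first_word = re.match(r"\w+", text).group()
print(first_word)  # → one

# search: finds the first match anywhere in the string.
first_t_word = re.search(r"t\w+", text).group()
print(first_t_word)  # → two

# findall: every match, as a list of strings.
words = re.findall(r"\w+", text)
print(words)  # → ['one', 'two', 'three', 'four']

# split: break the string apart at each comma and its surrounding whitespace.
pieces = re.split(r"\s*,\s*", text)
print(pieces)  # → ['one', 'two', 'three', 'four']

# sub: replace each messy separator with a tidy ", ".
tidy = re.sub(r"\s*,\s*", ", ", text)
print(tidy)  # → one, two, three, four
```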

## Phone number matching revisited

Our last pattern: `\d{3}-\d{4}`

We can decode this now: any 3 digits, a dash, and any 4 digits.

## Area codes

We can accept area codes, with optional parentheses (e.g. `(617) 555-1234` or `617 555-1234`)

In [None]:
area_code = r"\(?\d{3}\)?"
optional_space_or_dash = r"[\s-]?"
phone_number = r"\d{3}-\d{4}"
pattern = f"{area_code}{optional_space_or_dash}{phone_number}"
#pattern = r"\(?\d{3}\)?[\s-]?\d{3}-\d{4}"

print(re.search(pattern, "(617) 555-1234"))
print(re.search(pattern, "617 555-1234"))
print(re.search(pattern, "617-555-1234"))

Handle dash, space, or nothing for each separator:

In [None]:
area_code = r"\(?\d{3}\)?"
optional_space_or_dash = r"[\s-]?"
phone_number = r"\d{3}" + optional_space_or_dash + r"\d{4}"
pattern = f"{area_code}{optional_space_or_dash}{phone_number}"
#pattern = r"\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}"
print(re.search(pattern, "(617) 555-1234"))
print(re.search(pattern, "617 555-1234"))
print(re.search(pattern, "617-555-1234"))
print(re.search(pattern, "617555-1234"))
print(re.search(pattern, "617-5551234"))
print(re.search(pattern, "6175551234"))
print(re.search(pattern, "(617)-5551234"))

In [None]:
pattern = r"\(?\d{3}\)?[-\s]?\d{3}[-\s]?\d{4}"
print(re.search(pattern, "555-1234"))

Make the whole area code optional:

In [None]:
area_code = r"\(?\d{3}\)?"
optional_area_code = f"({area_code})?"
optional_space_or_dash = r"[\s-]?"
phone_number = r"\d{3}" + optional_space_or_dash + r"\d{4}"
pattern = f"{optional_area_code}{optional_space_or_dash}{phone_number}"
#pattern = r"(\(?\d{3}\)?)?[\s-]?\d{3}[\s-]?\d{4}"
print(re.search(pattern, "555-1234"))
print(re.search(pattern, "(617) 555-1234"))
print(re.search(pattern, "617 555-1234"))
print(re.search(pattern, "617-555-1234"))
print(re.search(pattern, "617555-1234"))
print(re.search(pattern, "617-5551234"))
print(re.search(pattern, "6175551234"))
print(re.search(pattern, "(617)-5551234"))
print(re.search(pattern, "555-1234"))

## Exercise

Write a regex that matches dates of the form: `MM/DD/YYYY`

A naive version:

In [None]:
pattern = r'\d\d/\d\d/\d\d\d\d'
print(re.match(pattern, "02/18/1974"))
print(re.match(pattern, "12/18/2021"))

Let's improve it - first, allow `MM/DD/YY` as well:

In [None]:
pattern = r'\d\d/\d\d/\d\d(\d\d)?'
print(re.match(pattern, "02/18/1974"))
print(re.match(pattern, "12/18/2021"))
print(re.match(pattern, "12/18/21"))
print(re.match(pattern, "12/18/213"))

Now allow single-digit days and months:

In [None]:
month = r'\d?\d'
day = r'\d?\d'
year = r'\d\d(\d\d)?'
pattern = f'{month}/{day}/{year}'
print(re.match(pattern, "02/18/1974"))
print(re.match(pattern, "12/18/2021"))
print(re.match(pattern, "12/18/21"))
print(re.match(pattern, "2/18/1974"))
print(re.match(pattern, "12/8/1974"))

Now disallow obviously invalid months/days:

In [None]:
month = r'(0?[1-9]|1[0-2])'  # 1-9 with an optional leading 0, or 10-12
day = r'[0-3]?\d'
year = r'\d\d(\d\d)?'
pattern = f'{month}/{day}/{year}'
#pattern = r'(0?[1-9]|1[0-2])/[0-3]?\d/\d\d(\d\d)?'
print(re.match(pattern, "02/18/1974"))
print(re.match(pattern, "12/18/2021"))
print(re.match(pattern, "12/18/21"))
print(re.match(pattern, "2/18/1974"))
print(re.match(pattern, "12/8/1974"))
print(re.match(pattern, "22/8/1974"))
print(re.match(pattern, "02/48/1974"))
print(re.match(pattern, "19/8/1974"))
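One thing to watch for: `re.match` only requires the pattern to match at the *beginning* of the string, so extra characters at the end are silently ignored. The `re` module also provides `fullmatch` (not covered in this lecture), which requires the whole string to match. A sketch using a date pattern like the ones above; the variable names are made up for illustration:

```python
import re

pattern = r"(0?[1-9]|1[0-2])/[0-3]?\d/\d\d(\d\d)?"

# match stops as soon as the pattern is satisfied and ignores the trailing "3".
prefix_match = re.match(pattern, "12/18/213")
print(prefix_match.group())  # → 12/18/21

# fullmatch rejects the string because "213" is not a 2 or 4 digit year.
whole_bad = re.fullmatch(pattern, "12/18/213")
print(whole_bad)  # → None

# A well-formed date still matches in full.
whole_ok = re.fullmatch(pattern, "12/18/21")
print(whole_ok.group())  # → 12/18/21
```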

## Exercise

Find valid hashtags and user mentions in a social media post.

Hashtags start with `#`, and can contain letters, numbers, and underscores, but cannot start with an underscore. They must be at least 3 characters long.

Usernames start with `@`, and can contain letters and numbers only. They must be between 2 and 24 characters long.

In [None]:
# \w also matches "_", so spell out the characters allowed in each position.
hash_pattern = r'#[a-zA-Z0-9]\w{2,}'
user_pattern = r'@[a-zA-Z0-9]{2,24}'
message = "Excited for #PythonRegex lecture! @User123, don't miss it. #RegexFun"
print(re.findall(hash_pattern, message))
print(re.findall(user_pattern, message))
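A caveat on the length limit: `{2,24}` stops a match from growing past 24 characters, but by itself it does not reject a longer username. `findall` will happily return the first 24 characters. One possible way to enforce the limit is a negative lookahead, `(?!...)`, a feature not covered in this lecture. The sketch below is an illustration under that assumption, with made-up variable names:

```python
import re

long_name = "@" + "a" * 30  # a 30-character username: too long to be valid

# Without a check after the name, the first 24 characters still match.
loose = re.findall(r"@[a-zA-Z0-9]{2,24}", long_name)
print(len(loose[0]))  # "@" plus 24 characters → 25

# (?![a-zA-Z0-9]) means "the next character must not be a letter or digit".
strict_pattern = r"@[a-zA-Z0-9]{2,24}(?![a-zA-Z0-9])"
strict = re.findall(strict_pattern, long_name)
print(strict)  # → []

# Valid usernames are still found.
valid = re.findall(strict_pattern, "Thanks @User123, see you there")
print(valid)  # → ['@User123']
```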

## Exercise

Extract quoted sections from text.

In [None]:
quote_pattern = '"[^"]*"'
re.findall(quote_pattern, 'Find some "quotes" in "this text"')
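If you want the quoted text without the surrounding quote marks, one option is to put parentheses around the part you care about: when a pattern contains a single group, `findall` returns just what the group matched. A minimal sketch, with made-up variable names:

```python
import re

text = 'Find some "quotes" in "this text"'

# No group: the quote marks are part of each result.
with_quotes = re.findall(r'"[^"]*"', text)
print(with_quotes)  # → ['"quotes"', '"this text"']

# One group: only the text between the quote marks is returned.
without_quotes = re.findall(r'"([^"]*)"', text)
print(without_quotes)  # → ['quotes', 'this text']
```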

Regular expressions have a number of other features we haven't talked about. 

They can be extremely powerful, and can also get extremely complex.

Here's a regular expression to validate email addresses (which have a [more complicated specification than you might think](https://datatracker.ietf.org/doc/html/rfc5322#section-3.4.1)):

```
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
```

The [Python tutorial on regular expressions](https://docs.python.org/3/howto/regex.html) is relatively accessible. It covers everything from this lecture, and more.