# Regular Expressions

[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (also called "regexp" or "regex") are patterns that let you find matching text.
Think of them like mathematical expressions for text: just as an equation can define a line (a collection of points), a regex can define a collection of strings.
When starting off, regular expressions can be pretty confusing.
But once you get comfortable using them, you start to see how they can be used in almost all of your everyday coding.

<center><img src="xkcd-regular-expressions.png"/></center>
<center style='font-size: small'>Comic courtesy of <a href='https://xkcd.com/208'>xkcd</a></center>

Aside from this assignment, here are some resources on regular expressions that you may find helpful:
 - [Text Tutorial](https://www.sitepoint.com/learn-regex/)
 - [Video Tutorial](https://www.youtube.com/watch?v=sa-TUpSx1JA)
 - [Cheat Sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)
 - [Regex Playground](https://regex101.com/) (Interactively create, test, and visualize regular expressions.)
 - [Python Regex Tutorial](https://docs.python.org/3/howto/regex.html)

Take your time and break each regex down one piece at a time.

## Regular Expressions in Python

We will be using Python for this exercise (hence the iPython notebook),
so we will be using the `re` Python standard library.
Almost every major programming language has regular expressions either built-in or supported in a standard library.
There may be slight variations in the syntax and semantics from language to language,
but the core functionality is the same.

In code and documentation, we will often refer to a regex as a "pattern".

### "Normal" Characters

The simplest regular expressions match strings in the same way that you would use another string to match a string
(as if you were using [`str.find()`](https://docs.python.org/3/library/stdtypes.html#str.find) or [`str.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace)).
Just type the characters that you want to match.

```
foo
```


### Special Characters

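The following characters have a special meaning inside a regex.
To match one of them literally, escape it with a backslash.

```
. ^ $ * + ? { } [ ] \ | ( )
```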

When writing regular expressions in Python, you will probably want to use a ["raw string"](https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences).
Raw strings do not interpret backslash escape sequences, so you don't have to double up backslashes and you won't accidentally create escape characters.

In [None]:
import re

string = "What is 'foo bar'?"
print(re.search('foo', string))

To test your understanding of the concepts throughout this assignment,
we will use a game called "Regex Golf".
In Regex Golf, you will have two lists of strings.
You want to match all the strings in the first list, while not matching any of the strings in the second list.

In [None]:
def regex_golf(regex, match_list = [], nomatch_list = []):
    errors = []

    if ((regex is None) or (regex == '')):
        print("Error: No regex provided.")
        return False
    
    for match_value in match_list:
        match = re.search(regex, match_value)
        if (match is None):
            errors.append("Error: Failed to match '%s'." % (match_value))
    
    for nomatch_value in nomatch_list:
        match = re.search(regex, nomatch_value)
        if (match is not None):
            errors.append("Error: Incorrectly matched '%s'." % (nomatch_value))

    if (len(errors) == 0):
        print("Great job!")
        return True
    else:
        print("You have some golfing errors, try again.")
        for error in errors:
            print("    " + error)
        return False

In [None]:
matches = [
    'foo',
    'foorbar',
    'football'
]

nomatches = [
    'fourty',
    'FOO',
    'bar',
    '123',
]

regex = r'foo'
regex_golf(regex, matches, nomatches)

<h3 style="color: darkorange;">★ Task 1: My First Match</h3>

Write a regular expression (assigned in the `TASK1_REGEX` variable) that matches the sequence "cat" (all lowercase).
Note that you don't have to match the entire string, just a part of it.

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 1 regular expression here.
TASK1_REGEX = r''

cats = ['cat', 'cats', 'some cat', 'categories']
non_cats = ['dog', 'cta']
regex_golf(TASK1_REGEX, match_list = cats, nomatch_list = non_cats)

In [None]:
string = 'A literal backslash: "\\"'
raw_string = r'A literal backslash: "\"'

print("string:     ", string)
print("raw string: ", raw_string)
print(string == raw_string)

### Warning: re.search() vs re.match()

Be careful about the subtle (and important) differences between the main matching functions in the `re` library.

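`re.search()` looks for a match anywhere within a string (the match can also be the entire string).
`re.match()` only tries to match starting at **the beginning** of the string (use `re.fullmatch()` if the pattern must match the entire string).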

Eriq's tip:
In almost all cases, just use `re.search()` instead of `re.match()`.
If you want to match the beginning or end of strings, use anchors.

In [None]:
# TODO: Example search() vs match()
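
As a minimal sketch of the difference, the cell below runs the same pattern through `re.search()`, `re.match()`, and `re.fullmatch()`. The sample string is reused from earlier in this assignment, and the variable names are only illustrative.

```python
import re

text = "What is 'foo bar'?"

# re.search() scans the string and finds 'foo' in the middle.
search_result = re.search(r'foo', text)

# re.match() only tries at the very start of the string, so it finds nothing here.
match_result = re.match(r'foo', text)

# re.fullmatch() requires the pattern to cover the entire string.
fullmatch_result = re.fullmatch(r'foo', text)

print(search_result is not None)
print(match_result is None)
print(fullmatch_result is None)

# A search anchored with ^ behaves like re.match().
anchored_result = re.search(r'^What', text)
print(anchored_result is not None)
```

All four lines should print `True`.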

## Character Classes

Character classes allow us to refer to any character inside of a set of characters.
Most regex languages/engines will have built-in character classes,
and also the ability to define custom character classes.

### Digits

The built-in digit character class is `\d`, and will match any digit (0-9).
The inverse class (not a digit) is also available using `\D`.
`\d` and `\D` do not overlap, and together they match every character.
This will be true for every class and its inverse that we will cover.

In [None]:
digits = ['0', '1', '2', '9']
non_digits = ['a', 'Z', '-', '!', ' ']

# Try out the digit character class.
regex = r'\d'
regex_golf(regex, match_list = digits, nomatch_list = non_digits)

# Now switch up the lists, and use the "non-digit" character class.
regex = r'\D'
regex_golf(regex, match_list = non_digits, nomatch_list = digits)

### "Word" Characters

"Word" characters are `a-z`, `A-Z`, `0-9`, and `_` (underscore),
and are all included in the "word" character class: `\w`.
So this includes all ASCII letters, digits, and underscore.
Like the digit character class, you can get the inverse class (not a word) using `\W`.

In [None]:
words = ['a', 'Z', '1', '0', '_']
non_words = ['-', '!', ' ']

# Try out the word character class.
regex = r'\w'
regex_golf(regex, match_list = words, nomatch_list = non_words)

# Now switch up the lists, and use the "non-word" character class.
regex = r'\W'
regex_golf(regex, match_list = non_words, nomatch_list = words)

<h3 style="color: darkorange;">★ Task 2: License Plates</h3>

Write a regular expression (assigned in the `TASK2_REGEX` variable) that matches standard (non-custom) California license plates within some string:
a digit, three word characters, and three digits (seven characters in total).

You may assume that:
 - All digits and letters (upper and lower case) can be used in license plates, **including** underscores '_'.
 - Digits and underscores count as word characters (even though the DMV does not agree).

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 2 regular expression here.
TASK2_REGEX = r''

plates = ['1ABC123', '0xyz987']
non_plates = ['123456', 'abcdefg']
regex_golf(TASK2_REGEX, match_list = plates, nomatch_list = non_plates)

### Whitespace

There is also a character class to match whitespace: `\s`.
Whitespace in this context includes characters like spaces, tabs, newlines, carriage returns, etc.
The inverse class (not whitespace) is available as `\S`.

In [None]:
# You may not be familiar with all of these whitespace characters
# (since we don't typically use half of them).
# These are: [space, tab, newline, carriage return, form feed, vertical tab].
whitespace = [' ', '\t', '\n', '\r', '\f', '\v']
non_whitespace = ['a', 'Z', '1', '0', '_', '-', '!']

# Try out the whitespace character class.
regex = r'\s'
regex_golf(regex, match_list = whitespace, nomatch_list = non_whitespace)

# Now switch up the lists, and use the "non-whitespace" character class.
regex = r'\S'
regex_golf(regex, match_list = non_whitespace, nomatch_list = whitespace)

### Any Character

You can represent (almost) any character using the `.` (dot) character class.
This will match anything except newlines (you have to enable a [special option](https://docs.python.org/3/library/re.html#re.DOTALL) for that behavior).
For this assignment, we will assume that all matches are always on one line.
To match a literal period, you need to escape it: `\.`.

In [None]:
anything = ['1', '0', 'a', 'Z', '_', ' ', '\t', '-', '!', '.']
non_anything = ['\n']

# Try out the anything character class.
regex = r'.'
regex_golf(regex, match_list = anything, nomatch_list = non_anything)
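
The cell below is a small sketch of the newline behavior described above, using the `re.DOTALL` option from the linked documentation. The sample strings and variable names are made up for illustration.

```python
import re

two_lines = "a\nb"

# By default, the dot does not match a newline.
default_match = re.search(r'a.b', two_lines)

# With re.DOTALL, the dot matches any character, including a newline.
dotall_match = re.search(r'a.b', two_lines, flags = re.DOTALL)

# An escaped dot only matches a literal period.
literal_dot = re.search(r'a\.b', 'a.b')
not_literal_dot = re.search(r'a\.b', 'axb')

print(default_match is None)
print(dotall_match is not None)
print(literal_dot is not None)
print(not_literal_dot is None)
```

All four lines should print `True`.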

<h3 style="color: darkorange;">★ Task 3: Mysterious Code</h3>

Imagine that you are writing a Python program that uses specific "codes".
These codes are four characters long: they start with any character and end with three digits.

You need to write a regex to find all the places in your program that you defined these codes.
Thankfully, you started every code variable with the string 'code_', and followed that with a single digit, letter, or underscore.

Write a regular expression (assigned in the `TASK3_REGEX` variable) that matches the definition of a code variable.

You may assume:
 - All strings you are trying to match are on one line (they will not have a newline in them). This assumption applies for this entire assignment.
 - Code strings will always use double quotes `"<code>"`.
 - A single space character will always be on either side of the assignment operator (equals sign).

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 3 regular expression here.
TASK3_REGEX = r''

code_assignments = [
    'code_a = "a123"',
    'code__ = "!098"',
    'code_b = "1098"',
]

non_code_assignments = [
    'a = "a123"',
    'code__ = "098"',
    'code_ = "1098"',
]

regex_golf(TASK3_REGEX, match_list = code_assignments, nomatch_list = non_code_assignments)

### Custom Character Classes

You can also create your own custom character class using square brackets: `[]`.
Any characters inside the square brackets are now inside the character class.
So `[abc]` will match any character that is an 'a', 'b', or 'c'.

You can invert a custom character class by putting a caret/hat character (`^`) directly after the opening square bracket.
So `[^abc]` will match any character that is **not** an 'a', 'b', or 'c'.
To match a literal caret/hat, you can escape it: `[abc\^]`.

You can also use a dash `-` to represent a range of characters.
You can make a range between lowercase characters `[a-z]`, uppercase characters `[A-Z]`, and digits `[0-9]`.
Note that a range follows character order, so you cannot range from a lowercase character to an uppercase one (`[a-Z]` is an error); use `[a-zA-Z]` instead.
To match a literal dash, you can escape it.
For example, `[a-z]` matches 'a' *through* 'z', but `[a\-z]` matches 'a', 'z', or '-'.

We can recreate some of our built-in character classes using custom character classes (these equivalences hold for ASCII text):
 - `\d` == `[0-9]`
 - `\D` == `[^0-9]`
 - `\w` == `[a-zA-Z0-9_]`
 - `\W` == `[^a-zA-Z0-9_]`
 - `\s` == `[ \t\n\r\f\v]`
 - `\S` == `[^ \t\n\r\f\v]`

In [None]:
abc = ['a', 'b', 'c']
non_abc = ['A', '1', ' ', '-', '!']

# Try out a custom character class.
regex = r'[abc]'
regex_golf(regex, match_list = abc, nomatch_list = non_abc)

# Now switch up the lists, and invert our custom character class.
regex = r'[^abc]'
regex_golf(regex, match_list = non_abc, nomatch_list = abc)

# We can also match some of the character classes we have seen in the past.

regex = r'[0-9]'
regex_golf(regex, match_list = digits, nomatch_list = non_digits)

regex = r'[a-zA-Z_0-9]'
regex_golf(regex, match_list = words, nomatch_list = non_words)

regex = r'[ \t\n\r\f\v]'
regex_golf(regex, match_list = whitespace, nomatch_list = non_whitespace)
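
The escaping rules for dashes and carets inside square brackets can be checked with a short sketch like the one below. `has_match()` is a hypothetical helper written just for this example, not part of the assignment.

```python
import re

def has_match(regex, text):
    # Hypothetical helper: True when the regex matches somewhere in the text.
    return re.search(regex, text) is not None

# `[a-z]` is a range, so it matches 'm' but not a dash.
range_results = [has_match(r'[a-z]', 'm'), has_match(r'[a-z]', '-')]

# `[a\-z]` is a set of three literal characters: 'a', '-', and 'z'.
dash_results = [has_match(r'[a\-z]', '-'), has_match(r'[a\-z]', 'm')]

# `[abc\^]` includes a literal caret, while `[^abc]` inverts the class.
caret_results = [has_match(r'[abc\^]', '^'), has_match(r'[^abc]', 'a')]

# A range from lowercase to uppercase is rejected when the pattern is compiled.
try:
    re.compile(r'[a-Z]')
    mixed_case_range_ok = True
except re.error:
    mixed_case_range_ok = False

print(range_results)
print(dash_results)
print(caret_results)
print(mixed_case_range_ok)
```

Each of the three lists should print as `[True, False]`, and the last line should print `False`.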

<h3 style="color: darkorange;">★ Task 4: Mysterious Code - Better</h3>

Let's improve upon Task 3 to make it more realistic.

 - Instead of the character after "code_" being a digit, letter, or underscore, force this character to be a lowercase letter.
 - Allow either a single tab or a single space to be used on either side of the assignment operator (equals sign).
 - Force the first character of the code to be a letter (lowercase or uppercase) or a digit.

Write a regular expression (assigned in the `TASK4_REGEX` variable) that matches the definition of a code variable as modified above.

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 4 regular expression here.
TASK4_REGEX = r''

code_assignments = [
    'code_a = "a123"',
    'code_b = "1098"',
    'code_c\t=\t"z395"',
]

non_code_assignments = [
    'a = "a123"',
    'code__ = "098"',
    'code_ = "1098"',
    'code_33 = "Z456"',
    'code__ = "!098"',
    'code_3 = "Z456"',
]

regex_golf(TASK4_REGEX, match_list = code_assignments, nomatch_list = non_code_assignments)

## Repetitions

Another core feature of regular expressions is the ability to handle repetition.

### Once or None

You can use a `?` (question mark) to declare that a character (or class) should appear once or not at all.
For example, `too?` will match both "to" and "too".
You can also apply repetition to character classes:
`to[onp]?` will match "to", "too", "ton" and "top" (but not all of "toon", where a search only matches the leading "too").
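
Here is a rough sketch of the `?` examples above. It uses `re.fullmatch()` so that each word has to be covered entirely by the pattern, which is a choice made for this sketch rather than something the assignment requires.

```python
import re

pattern = r'to[onp]?'
words = ['to', 'too', 'ton', 'top', 'toon']

# True when the pattern covers the whole word.
full_results = [re.fullmatch(pattern, word) is not None for word in words]
print(full_results)

# A plain search still finds a match inside 'toon', but only the first three characters.
partial = re.search(pattern, 'toon')
print(partial.group())
```

The list should print as `[True, True, True, True, False]`, and the search should print `too`.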

### None or Many

You can use a `*` (asterisk/star) to declare that a character (or class) can appear any number of times or not at all.
This is also called a ["Kleene Star"](https://en.wikipedia.org/wiki/Kleene_star).

For example, `to*` will match "t", "to", "too", and "tooooo".
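
A small sketch of the star in action follows. The pattern `to*` and the sample words are chosen just for illustration, and `re.fullmatch()` is used so that the whole word has to be covered by the pattern.

```python
import re

pattern = r'to*'
words = ['t', 'to', 'too', 'tooooo', 'ta']

# True when the pattern covers the whole word.
star_results = [re.fullmatch(pattern, word) is not None for word in words]
print(star_results)

# The star is greedy: it takes as many characters as it can.
greedy = re.search(pattern, 'tooot')
print(greedy.group())
```

The list should print as `[True, True, True, True, False]`, and the greedy search should print `tooo`.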

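### One or Many

You can use a `+` (plus) to declare that a character (or class) should appear one or more times.

### Specific Number

You can use curly braces to ask for a specific number of repetitions:
 - `{n}`: exactly `n` times.
 - `{n,m}`: between `n` and `m` times (inclusive).
 - `{,m}`: at most `m` times.
 - `{n,}`: at least `n` times.

In summary, the number of repetitions each operator allows is:
 - `?` (once or none): [0, 1]
 - `*` (none or many): [0, ∞)
 - `+` (one or many): [1, ∞)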
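
As a rough sketch of the "one or many" and "specific number" repetitions, the cell below assumes `+` is the operator for "one or many" and uses the curly-brace forms listed above. `fully_matches()` is a hypothetical helper written for this example.

```python
import re

def fully_matches(regex, text):
    # Hypothetical helper: True when the regex covers the whole string.
    return re.fullmatch(regex, text) is not None

# `+` needs at least one 'b'.
plus_results = [fully_matches(r'ab+', 'a'), fully_matches(r'ab+', 'abbb')]

# `{3}` means exactly three digits.
exact_results = [fully_matches(r'\d{3}', '123'), fully_matches(r'\d{3}', '1234')]

# `{2,4}` means two to four digits.
bounded_results = [fully_matches(r'\d{2,4}', '12'), fully_matches(r'\d{2,4}', '12345')]

# `{2,}` means at least two digits, and `{,2}` means at most two (including zero).
open_results = [fully_matches(r'\d{2,}', '12345'), fully_matches(r'\d{,2}', '')]

print(plus_results)
print(exact_results)
print(bounded_results)
print(open_results)
```

The first list should print as `[False, True]`, the next two as `[True, False]`, and the last as `[True, True]`.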

## Anchors

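Anchors match a position in the string rather than a character:
 - `^` matches at the beginning of the string.
 - `$` matches at the end of the string.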
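
A minimal sketch of both anchors, using `re.search()` on words borrowed from the golfing lists earlier in this assignment (the variable names are only illustrative):

```python
import re

# `^foo` only matches when 'foo' is at the start of the string.
starts_with_foo = re.search(r'^foo', 'football')
not_at_start = re.search(r'^foo', 'some foo')

# `ball$` only matches when 'ball' is at the end of the string.
ends_with_ball = re.search(r'ball$', 'football')
not_at_end = re.search(r'foo$', 'football')

# Using both anchors forces the pattern to cover the whole string.
exactly_foo = re.search(r'^foo$', 'foo')
more_than_foo = re.search(r'^foo$', 'football')

print(starts_with_foo is not None, not_at_start is None)
print(ends_with_ball is not None, not_at_end is None)
print(exactly_foo is not None, more_than_foo is None)
```

Each line should print `True True`.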

## Capture Groups

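Parentheses `()` group part of a pattern and capture the text that the group matched.

### Back References

A back reference (`\1`, `\2`, and so on) matches the same text that an earlier capture group matched.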
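
The cell below is a short sketch of capture groups and a back reference. The `width=42` string and the patterns are invented for illustration, and `\1` is assumed to refer to the first capture group, as it does in Python's `re` library.

```python
import re

# Two capture groups: the text before and after the equals sign.
pair = re.search(r'(\w*)=(\d*)', 'width=42')
print(pair.group(0))
print(pair.group(1))
print(pair.group(2))

# `(\w)\1` matches a word character followed by that same character again.
doubled = re.search(r'(\w)\1', 'football')
no_double = re.search(r'(\w)\1', 'bar')
print(doubled.group())
print(no_double is None)
```

`group(0)` is the whole match (`width=42`), while `group(1)` and `group(2)` hold `width` and `42`. The back reference finds `oo` in 'football' and nothing in 'bar'.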

<h3 style="color: darkorange;">★ Task 1.A</h3>

Edit the following function to return `True`.

In [None]:
def my_function():
    """
    The output of this function will be tested (it must return True).
    """

    return NotImplemented

my_function()

After editing `my_function()`, run the above code cell (CTRL+Enter).
Running the cell both defines the function (the `def` part) and runs it (the last line in the cell).
Note that by default the value of the last expression in a cell is displayed as output.

The above cell should now return `True` instead of `NotImplemented` or raising an exception.
Most (but not all) functions you will be asked to implement in the future will provide an implementation that runs,
but does not produce the correct output.
You will always be expected to edit the functions to return the correct results.