# Regular Expressions

[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (also called "regexp" or "regex") are patterns that let you find matching text.
Think of them as mathematical expressions for text: just as an equation can define a line (a collection of points), a regex can define a collection of strings.
When starting off, regular expressions can be pretty confusing.
But once you get comfortable using them, you start to see how they can be used in almost all of your everyday coding.

<center><img src="xkcd-regular-expressions.png"/></center>
<center style='font-size: small'>Comic courtesy of <a href='https://xkcd.com/208'>xkcd</a></center>

Aside from this assignment, here are some resources on regular expressions that you may find helpful:
 - [Text Tutorial](https://www.sitepoint.com/learn-regex/)
 - [Video Tutorial](https://www.youtube.com/watch?v=sa-TUpSx1JA)
 - [Cheat Sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)
 - [Regex Playground](https://regex101.com/) (Interactively create, test, and visualize regular expressions.)
 - [Python Regex Tutorial](https://docs.python.org/3/howto/regex.html)

Take your time and break each regex down one piece at a time.

## Regular Expressions in Python

We will be using Python for this exercise (hence the iPython notebook),
so we will be using the `re` Python standard library.
Almost every major programming language has regular expressions either built-in or supported in a standard library.
There may be slight variations in the syntax and semantics from language to language,
but the core functionality will all be the same.

### "Normal" Characters

The simplest regular expressions match strings in the same way that you would use another string to match a string
(as if you were using [`str.find()`](https://docs.python.org/3/library/stdtypes.html#str.find) or [`str.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace)).
Just type the characters that you want to match.

```
foo
```
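
As a minimal sketch of the idea above, the following compares a plain string search with the same literal used as a regex. The sample text is only an illustration.

```python
import re

text = "What is 'foo bar'?"

# Plain string search: index of the first occurrence, or -1 if absent.
index = text.find('foo')

# The same literal used as a regex: a match object, or None if absent.
match = re.search(r'foo', text)

print(index)          # 9
print(match.group())  # foo
print(match.start())  # 9
```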


### Special Characters

```
. ^ $ * + ? { } [ ] \ | ( )
```

These characters have special meanings inside a regex.
To match one of them literally, escape it with a backslash (for example, `\.` or `\$`).

When writing regular expressions in Python, you will probably want to use a ["raw string"](https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences).
Raw strings do not interpret escape sequences, so you do not have to double-escape backslashes or worry about accidentally creating an escape sequence.

To test your understanding of the concepts throughout this assignment,
we will use a game called "Regex Golf".
In Regex Golf, you will have two lists of strings.
You want to match all the strings in the first list, while not matching any of the strings in the second list.

In [None]:
import re

def regex_golf(regex, match_list, nomatch_list):
    success = True
    
    for match_value in match_list:
        match = re.search(regex, match_value)
        if (match is None):
            print("Error: Failed to match '%s'." % (match_value))
            success = False
    
    for nomatch_value in nomatch_list:
        match = re.search(regex, nomatch_value)
        if (match is not None):
            print("Error: Incorrectly matched '%s'." % (nomatch_value))
            success = False

    return success

In [None]:
matches = [
    'foo',
    'foobar',
    'football'
]

nomatches = [
    'fourty',
    'FOO',
    'bar',
    '123',
]

task1_regex = r''
regex_golf(task1_regex, matches, nomatches)

In [None]:
string = 'A literal backslash: "\\"'
raw_string = r'A literal backslash: "\"'

print("string:     ", string)
print("raw string: ", raw_string)
print(string == raw_string)

In [None]:
import re

string = "What is 'foo bar'?"
print(re.search('foo', string))

### Warning: re.search() vs re.match()

Be careful about the subtle (and important) differences between the main matching functions in the `re` library.

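`re.search()` looks for a match anywhere within a string (and the match can be the entire string).
`re.match()` only looks for a match at **the beginning** of the string; it does not require the match to reach the end.
To match **the entire** string, use `re.fullmatch()`.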

Eriq's tip:
In almost all cases, just use `re.search()` instead of `re.match()`.
If you want to match the beginning or end of strings, use anchors.
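
The difference can be sketched with a short example. The sample string is made up for illustration; the functions are the standard `re.search()`, `re.match()`, and `re.fullmatch()`.

```python
import re

text = 'say foo'

anywhere = re.search(r'foo', text)    # scans the whole string
at_start = re.match(r'foo', text)     # only tries the start of the string
prefix = re.match(r'say', text)       # a matching prefix is enough for match()
whole = re.fullmatch(r'say', text)    # the entire string must match
anchored = re.search(r'^say', text)   # search() plus an anchor

print(anywhere.span())   # (4, 7)
print(at_start)          # None
print(prefix.group())    # say
print(whole)             # None
print(anchored.group())  # say
```

Using `re.search()` with an explicit anchor keeps the intent visible in the pattern itself, which is one reason to prefer it.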

In [None]:
# TODO: Example search() vs match()

## Character Classes

Character classes allow us to refer to any character inside of a set of characters.
Most regex languages/engines will have built-in character classes,
and also the ability to define custom character classes.

### Digits

The built-in digit character class is `\d`, which matches any digit (0-9).
(In Python 3, `\d` also matches digits from other scripts unless you pass the `re.ASCII` flag.)
The inverse class (not a digit) is available as `\D`.
`\d` and `\D` do not overlap and together match every character;
this holds for every pair of character classes we will cover.
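
A small sketch of `\d` and `\D` in action, using `re.findall()` to list every matching character. The sample strings are invented for illustration.

```python
import re

# Every digit in the text, one character at a time.
digits = re.findall(r'\d', 'Room 42, floor 3')

# Every character that is NOT a digit.
non_digits = re.findall(r'\D', 'a1b2')

print(digits)      # ['4', '2', '3']
print(non_digits)  # ['a', 'b']
```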

### "Word" Characters

"Word" characters are the letters `a-z` and `A-Z`, the digits `0-9`, and `_` (underscore),
and are all included in the "word" character class: `\w`.
Like the digit character class, you can get the inverse class (not a word) using `\W`.
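
The following sketch shows which characters of a short made-up string count as "word" characters. Note that in Python's `re` module, `\w` matches digits as well as letters and the underscore.

```python
import re

sample = 'a_1-!'

word_chars = re.findall(r'\w', sample)      # letters, digits, underscore
non_word_chars = re.findall(r'\W', sample)  # everything else

print(word_chars)      # ['a', '_', '1']
print(non_word_chars)  # ['-', '!']
```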

### Whitespace

There is also a character class to match whitespace: `\s`.
Whitespace in this context includes characters like spaces, tabs, newlines, carriage returns, etc.
The inverse class (not whitespace) is available as `\S`.
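
As a quick sketch, `\s` can be used to split text on any single whitespace character, whichever kind it is. The sample strings are assumptions for illustration.

```python
import re

text = 'one two\tthree\nfour'

# Split wherever a single whitespace character (space, tab, newline) appears.
parts = re.split(r'\s', text)

# Collect every character that is not whitespace.
non_space = re.findall(r'\S', 'a b\n')

print(parts)      # ['one', 'two', 'three', 'four']
print(non_space)  # ['a', 'b']
```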

### Any Character

You can represent (almost) any character using the `.` (dot) character class.
It matches anything except a newline (to match newlines as well, enable the `re.DOTALL` flag).
To match a literal period, you need to escape it: `\.`.
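
The sketch below contrasts the dot, an escaped dot, and the dot with the `re.DOTALL` flag, which is the option in Python's `re` module that lets `.` match newlines. The sample text is made up.

```python
import re

text = 'bat bit b.t b\nt'

any_char = re.findall(r'b.t', text)                      # '.' skips the newline case
literal_dot = re.findall(r'b\.t', text)                  # only a real period
with_newline = re.findall(r'b.t', text, flags=re.DOTALL) # '.' also matches '\n'

print(any_char)      # ['bat', 'bit', 'b.t']
print(literal_dot)   # ['b.t']
print(with_newline)  # ['bat', 'bit', 'b.t', 'b\nt']
```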

### Custom Character Classes

You can also create your own custom character class using square brackets: `[]`.
Any characters inside the square brackets are now inside the character class.
So `[abc]` will match any character that is an 'a', 'b', or 'c'.

You can invert a custom character class by placing a caret/hat character (`^`) directly after the opening square bracket.
So `[^abc]` will match any character that is **not** an 'a', 'b', or 'c'.
To match a literal caret/hat, you can escape it: `[abc\^]`.

You can also use a dash `-` to represent a range of characters.
You can make a range between lowercase characters `[a-z]`, uppercase characters `[A-Z]`, and digits `[0-9]`.
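Note that a single range should not span lowercase and uppercase characters: `[a-Z]` is an error, and `[A-z]` also matches the punctuation characters that sit between 'Z' and 'a' in ASCII.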
To match a literal dash, you can escape it.
For example, `[a-z]` matches 'a' *through* 'z', but `[a\-z]` matches 'a', 'z', or '-'.
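
A short sketch of custom character classes, covering a plain set, an inverted set, a range, and escaped literals. All sample strings are invented for illustration.

```python
import re

in_set = re.findall(r'[abc]', 'cabbage')       # any of 'a', 'b', 'c'
not_in_set = re.findall(r'[^abc]', 'cabbage')  # anything except 'a', 'b', 'c'
in_range = re.findall(r'[a-z]', 'a-Z')         # lowercase letters only
with_dash = re.findall(r'[a\-z]', 'a-z m')     # 'a', 'z', or a literal '-'
with_caret = re.findall(r'[abc\^]', 'a^d')     # 'a', 'b', 'c', or a literal '^'

print(in_set)      # ['c', 'a', 'b', 'b', 'a']
print(not_in_set)  # ['g', 'e']
print(in_range)    # ['a']
print(with_dash)   # ['a', '-', 'z']
print(with_caret)  # ['a', '^']
```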

We can recreate the built-in character classes (for ASCII text) using custom character classes:
 - `\d` == `[0-9]`
 - `\D` == `[^0-9]`
 - `\w` == `[a-zA-Z0-9_]`
 - `\W` == `[^a-zA-Z0-9_]`
 - `\s` == `[ \t\n\r\f\v]`
 - `\S` == `[^ \t\n\r\f\v]`
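
One way to check such equivalences is to run both forms over every printable ASCII character and compare the results. This is only a sketch: it assumes `\w` is written as `[a-zA-Z0-9_]` (Python's `\w` covers digits and the underscore as well as letters) and passes `re.ASCII` so the built-in classes are restricted to ASCII.

```python
import re
import string

# Every printable ASCII character, including whitespace.
chars = string.printable

# Built-in class -> custom class assumed to be equivalent for ASCII text.
pairs = {
    r'\d': r'[0-9]',
    r'\D': r'[^0-9]',
    r'\w': r'[a-zA-Z0-9_]',
    r'\W': r'[^a-zA-Z0-9_]',
    r'\s': r'[ \t\n\r\f\v]',
    r'\S': r'[^ \t\n\r\f\v]',
}

results = {}
for builtin, custom in pairs.items():
    # The two classes agree if they pick out the same characters.
    builtin_hits = re.findall(builtin, chars, flags=re.ASCII)
    custom_hits = re.findall(custom, chars)
    results[builtin] = (builtin_hits == custom_hits)

for builtin, same in results.items():
    print(builtin, pairs[builtin], same)
```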

## Repetitions

Another core feature of regular expressions is the ability to handle repetition.

### Once or None

You can use a `?` (question mark) to declare that a character (or class) should appear once or not at all.
For example, `too?` will match both "to" and "too".
You can also apply repetition to character classes:
`to[onp]?` will match "to", "too", "ton" and "top" (but not "toon").
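
The claim above can be checked with a short sketch. It uses `re.fullmatch()` so that the whole word has to match; with `re.search()`, "toon" would still match because it contains "too".

```python
import re

words = ['to', 'too', 'ton', 'top', 'toon']

# Keep only the words that the pattern matches in full.
optional_o = [w for w in words if re.fullmatch(r'too?', w)]
optional_class = [w for w in words if re.fullmatch(r'to[onp]?', w)]

print(optional_o)      # ['to', 'too']
print(optional_class)  # ['to', 'too', 'ton', 'top']
```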

### None or Many

You can use a `*` (asterisk/star) to declare that a character (or class) can appear any number of times or not at all.
This is also called a ["Kleene Star"](https://en.wikipedia.org/wiki/Kleene_star).
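
A minimal sketch of the star, using made-up words. It also shows that `*` takes as many repetitions as it can, and that "zero times" means it can match an empty string.

```python
import re

words = ['gd', 'god', 'good', 'goood', 'gold']

# 'g', then any number of 'o' (including none), then 'd'.
matched = [w for w in words if re.fullmatch(r'go*d', w)]

# The star consumes as many digits as are available at the start...
leading_digits = re.search(r'\d*', '123abc').group()

# ...and succeeds with an empty match when there are none.
no_digits = re.search(r'\d*', 'abc').group()

print(matched)         # ['gd', 'god', 'good', 'goood']
print(leading_digits)  # 123
print(repr(no_digits)) # ''
```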

TODO: EXAMPLE

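In summary, each repetition operator allows a range of repeat counts:
 - Once or none: `?` allows 0 or 1 repetitions.
 - None or many: `*` allows 0 or more repetitions.
 - One or many: `+` allows 1 or more repetitions.
 - Specific number: `{n}` allows exactly n repetitions, `{n,m}` allows n through m (no space after the comma), `{,m}` allows at most m, and `{n,}` allows at least n.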
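
The repeat ranges can be sketched by testing several operators against strings of increasing length. The candidate strings are assumptions chosen for illustration, and `re.fullmatch()` is used so the repeat count has to cover the whole string.

```python
import re

candidates = ['', '1', '12', '123', '1234']
patterns = [r'\d*', r'\d+', r'\d{2}', r'\d{2,3}', r'\d{,2}', r'\d{3,}']

# For each pattern, keep the candidates it matches in full.
accepted = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern in patterns
}

for pattern, strings in accepted.items():
    print(pattern, strings)
```

Only `\d*` and `\d{,2}` accept the empty string, because they are the only two operators here whose range starts at zero.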

## Anchors

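Anchors match a position rather than a character:
 - `^` matches at the start of the string.
 - `$` matches at the end of the string (or just before a trailing newline).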
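
A brief sketch of both anchors with `re.search()`, where `^` pins a match to the start of the string and `$` pins it to the end. The sample text is made up.

```python
import re

text = 'foo bar foo'

everywhere = re.findall(r'foo', text)         # both occurrences
first = re.search(r'^foo', text).start()      # only the one at the start
last = re.search(r'foo$', text).start()       # only the one at the end
middle = re.search(r'^bar', text)             # 'bar' is not at the start
exact = re.search(r'^foo$', text)             # the whole string is not 'foo'

print(everywhere)  # ['foo', 'foo']
print(first)       # 0
print(last)        # 8
print(middle)      # None
print(exact)       # None
```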

## Capture Groups

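Wrapping part of a regex in parentheses, `()`, creates a capture group:
the text matched by that part is saved and can be retrieved from the match object.

Back reference: `\1`, `\2`, and so on match the same text that the corresponding earlier capture group matched.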
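
As a rough sketch, the example below reads captured text from a match object and then uses the back reference `\1` to find a doubled word. The sample strings and the simplified address pattern are assumptions for illustration only.

```python
import re

# Two capture groups: the part before '@' and the part after it.
match = re.search(r'(\w+)@(\w+)', 'contact: alice@example')

print(match.group(0))  # alice@example (the whole match)
print(match.group(1))  # alice
print(match.group(2))  # example
print(match.groups())  # ('alice', 'example')

# Back reference: \1 must repeat exactly what group 1 captured.
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')

print(doubled.group(0))  # is is
print(doubled.group(1))  # is
```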

<h3 style="color: darkorange;">★ Task 1.A</h3>

Edit the following function to return `True`.

In [None]:
def my_function():
    """
    The output of this function will be tested (it must return True).
    """

    return NotImplemented

my_function()

After editing `my_function()`, run the above code cell (CTRL+Enter).
Running the cell both defines the function (the `def` part) and runs it (the last line in the cell).
Note that by default the value of the last expression in a cell is displayed as output.

The above cell should now return `True` instead of `NotImplemented` or raising an exception.
Most (but not all) functions you will be asked to implement in the future will come with a placeholder implementation that runs,
but does not produce the correct output.
You will always be expected to edit the functions to return the correct results.