# Regular Expressions

[Regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (also called "regexp" or "regex") are patterns that let you find matching text.
Think of them like mathematical expressions for text (an equation can define a line (collection of points), and a regex can define a collection of strings).
When starting off, regular expressions can be pretty confusing.
But once you get comfortable using them, you start to see how they can be used in almost all of your everyday coding.

<center><img src="xkcd-regular-expressions.png" width="600px"/></center>
<center style='font-size: small'>Comic courtesy of <a href='https://xkcd.com/208'>xkcd</a></center>

Aside from this assignment, here are some resources on regular expressions that you may find helpful:
 - [Text Tutorial](https://www.sitepoint.com/learn-regex/)
 - [Video Tutorial](https://www.youtube.com/watch?v=sa-TUpSx1JA)
 - [Cheat Sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)
 - [Regex Playground](https://regex101.com/) (Interactively create, test, and visualize regular expressions.)
 - [Python Regex Tutorial](https://docs.python.org/3/howto/regex.html)

When working with regular expressions, make sure to take your time.
Treat a regular expression as a dense piece of code.
You don't expect to look at a code file and understand everything right away;
you take things piece by piece.
Do the same thing with regular expressions:
take it slow and look at them piece by piece.

## Regular Expressions in Python

We will be using Python for this exercise (hence the IPython notebook),
so we will be using the `re` module from the Python standard library.
Almost every major programming language has regular expressions either built-in directly or supported in a standard library.
There may be slight variations in the syntax and semantics from language to language,
but the core functionality will all be the same.

### re.search()

In this assignment (and probably most of your Python regex usage),
we will be using the function `re.search()`.
`re.search()` takes two required arguments,
first the regex and then the string to search in (we will often call this the "target").
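
As a small illustration of the call described above, the sketch below shows what `re.search()` hands back: a match object when the regex is found, and `None` otherwise. The variable names `found` and `missing` are only illustrative.

```python
import re

# re.search(regex, target) returns a match object on success, or None.
found = re.search(r'foo', "What is 'foo bar'?")
missing = re.search(r'123', 'abc')

print(found.group())  # The text that was matched.
print(found.span())   # The (start, end) indexes of the match in the target.
print(missing)        # No match was found, so this is None.
```

Because a failed search returns `None`, checking `is None` (or `is not None`) is a simple way to test whether a regex matched.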

There is another function, `re.match()`, that is subtly different
and that you will probably want to avoid.
`re.search()` looks for a match anywhere within a string (and the match can be the entire string).
`re.match()` only tries to match at **the beginning** of a string (and that match can be the entire string).
Later in this assignment, we will discuss how to recreate the functionality of `re.match()`
in a more explicit and less error-prone way.
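
To make the difference concrete, here is a minimal sketch that runs both functions on the same target. The target string `'hot dog'` is just an example chosen so that the pattern appears in the middle of the string.

```python
import re

target = 'hot dog'

# re.search() scans the whole target for the pattern.
search_result = re.search(r'dog', target)

# re.match() only tries at the very start of the target,
# so 'dog' is not found here.
match_result = re.match(r'dog', target)

print(search_result is not None)
print(match_result is not None)
```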

### "Normal" Characters

The simplest regular expressions can be used to match strings in the same way that you would use another string to match a string
(like if you were using [`str.find()`](https://docs.python.org/3/library/stdtypes.html#str.find) or [`str.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace)).
Just type the characters that you want to match.
In fact, in Python strings are used to represent regular expressions.
So most strings are already regular expressions
(but arbitrary strings may contain special symbols which are not valid regular expressions).

For example, the string `"foo"` can be used as a regular expression to match "foo", "food", "foo bar",
and infinitely many other strings that have "foo" as a substring.

In [None]:
import re

# The regex 'foo' matches the word 'foo' in the target.
target = "What is 'foo bar'?"
regex = 'foo'
print(re.search(regex, target))

# The regex 'dog' matches 'dogs'.
target = "dogs, cats, lizards"
regex = 'dog'
print(re.search(regex, target))

# Numbers are fine too.
target = "0123456789"
regex = '45'
print(re.search(regex, target))

# The entire target string can be matched.
target = "This string is a regex"
regex = "This string is a regex"
print(re.search(regex, target))

# Here we do not match, and None is returned from re.search().
target = "abc"
regex = "123"
print(re.search(regex, target))

### Special Characters

There are a few special characters that you will need to be aware of.
Here is a list of them, each of which will be discussed somewhere in this assignment:

 - `.`
 - `^`
 - `$`
 - `*`
 - `+`
 - `?`
 - `{`
 - `}`
 - `[`
 - `]`
 - `\`
 - `|`
 - `(`
 - `)`

When you want to match one of these characters literally, you will need to escape it with a backslash (`\`).
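
The sketch below illustrates escaping with the `.` character (covered in more detail later in this assignment). It also uses `re.escape()`, a standard library helper that escapes special characters in an arbitrary string for you; the example strings are made up for illustration.

```python
import re

# Unescaped, '.' is special, so this regex also matches '3x14'.
loose = re.search(r'3.14', '3x14')

# Escaped, '\.' only matches a literal period.
strict = re.search(r'3\.14', '3x14')
literal = re.search(r'3\.14', 'pi is 3.14')

# re.escape() builds a regex that matches the given text literally.
escaped = re.escape('1+1=2?')
escaped_result = re.search(escaped, 'Is 1+1=2? Yes.')

print(loose is not None, strict is not None, literal is not None)
print(escaped_result.group())
```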

### Raw Strings

When writing regular expressions in Python, you will probably want to use a ["raw string"](https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences).
Raw strings do not interpret escape sequences, so you don't have to double-escape backslashes or accidentally create escape sequences.
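
One place this matters is the sequence `\b`: in a normal Python string it is the backspace character, while in a regex it is a word boundary (covered later in this assignment). The following sketch, with made-up example strings, shows how the two string types differ.

```python
import re

# In a normal string, '\b' is a single backspace character.
normal = '\bcat\b'
# In a raw string, r'\b' stays as a backslash followed by 'b'.
raw = r'\bcat\b'

print(len(normal), len(raw))

# Only the raw string reaches the regex engine as intended.
normal_result = re.search(normal, 'a cat')
raw_result = re.search(raw, 'a cat')
print(normal_result is not None, raw_result is not None)
```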

In [None]:
string = 'A literal backslash: "\\"'
raw_string = r'A literal backslash: "\"'

print("string:     ", string)
print("raw string: ", raw_string)
print(string == raw_string)

To test your understanding of the concepts throughout this assignment,
we will use a game called "Regex Golf".
In Regex Golf, you will have two lists of strings.
You want to match all the strings in the first list, while not matching any of the strings in the second list.

In [None]:
def regex_golf(regex, match_list = [], nomatch_list = []):
    errors = []

    if ((regex is None) or (regex == '')):
        print("Error: No regex provided.")
        return False
    
    for match_value in match_list:
        match = re.search(regex, match_value)
        if (match is None):
            errors.append("Error: Failed to match '%s'." % (match_value))
    
    for nomatch_value in nomatch_list:
        match = re.search(regex, nomatch_value)
        if (match is not None):
            errors.append("Error: Incorrectly matched '%s'." % (nomatch_value))

    if (len(errors) == 0):
        print("Great job!")
        return True
    else:
        print("You have some golfing errors, try again.")
        for error in errors:
            print("    " + error)
        return False

In [None]:
matches = [
    'foo',
    'foobar',
    'football'
]

nomatches = [
    'forty',
    'FOO',
    'bar',
    '123',
]

regex = r'foo'
regex_golf(regex, matches, nomatches)

<h3 style="color: darkorange;">★ Task 1: My First Match</h3>

Write a regular expression (assigned in the `TASK1_REGEX` variable) that matches the sequence "cat" (all lowercase).
Note that you don't have to match the entire string, just a part of it.

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 1 regular expression here.
TASK1_REGEX = r''

cats = ['cat', 'cats', 'some cat', 'categories']
non_cats = ['dog', 'cta']
regex_golf(TASK1_REGEX, match_list = cats, nomatch_list = non_cats)

## Character Classes

Character classes allow us to refer to any **single** character inside of a set of characters.
Most regex languages/engines will have built-in character classes,
and also the ability to define custom character classes.

### Digits

The built-in digit character class is `\d`, and will match any single digit (0-9).
(For Python `str` patterns, `\d` also matches other Unicode decimal digits unless the `re.ASCII` flag is set.)
The inverse class (not a digit) is also available using `\D`.
`\d` and `\D` do not overlap and together match every character;
this will be true for most of the character classes we will cover.

In [None]:
digits = ['0', '1', '2', '9']
non_digits = ['a', 'Z', '-', '!', ' ']

# Try out the digit character class.
regex = r'\d'
regex_golf(regex, match_list = digits, nomatch_list = non_digits)

# Now switch up the lists, and use the "non-digit" character class.
regex = r'\D'
regex_golf(regex, match_list = non_digits, nomatch_list = digits)

### "Word" Characters

"Word" characters are `a-z`, `A-Z`, `0-9`, and `_` (underscore),
and are all included in the "word" character class: `\w`.
So this includes all ASCII letters, digits, and underscore.
Like the digit character class, you can get the inverse class (not a word character) using `\W`.

In [None]:
words = ['a', 'Z', '1', '0', '_']
non_words = ['-', '!', ' ']

# Try out the word character class.
regex = r'\w'
regex_golf(regex, match_list = words, nomatch_list = non_words)

# Now switch up the lists, and use the "non-word" character class.
regex = r'\W'
regex_golf(regex, match_list = non_words, nomatch_list = words)

<h3 style="color: darkorange;">★ Task 2: License Plates</h3>

Write a regular expression (assigned in the `TASK2_REGEX` variable) that matches standard (non-custom) California license plates within some string.
A CA license plate has the pattern:
a digit, three word characters, and three digits (seven characters in total).

You may assume that:
 - All digits/letters are used in license plates, **including** underscores '_' and upper/lower case letters.
 - Numbers and underscores count as word characters (even though the DMV does not agree).

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 2 regular expression here.
TASK2_REGEX = r''

plates = ['1ABC123', '0xyz987', '1234567']
non_plates = ['123', 'abcdefg']
regex_golf(TASK2_REGEX, match_list = plates, nomatch_list = non_plates)

### Whitespace

There is also a character class to match whitespace: `\s`.
Whitespace in this context includes characters like spaces, tabs, newlines, carriage returns, etc.
The inverse class (not whitespace) is available as `\S`.

In [None]:
# You may not be familiar with all of these whitespace characters
# (since we don't typically use half of them).
# These are: [space, tab, newline, carriage return, form feed, vertical tab].
whitespace = [' ', '\t', '\n', '\r', '\f', '\v']
non_whitespace = ['a', 'Z', '1', '0', '_', '-', '!']

# Try out the whitespace character class.
regex = r'\s'
regex_golf(regex, match_list = whitespace, nomatch_list = non_whitespace)

# Now switch up the lists, and use the "non-whitespace" character class.
regex = r'\S'
regex_golf(regex, match_list = non_whitespace, nomatch_list = whitespace)

### Any Character

You can represent (almost) any character using the `.` (dot) character class.
This will match anything except newlines (you have to enable a [special option](https://docs.python.org/3/library/re.html#re.DOTALL) for that behavior).
For this assignment, we will assume that all matches are always on one line.
To match a literal period, you need to escape it: `\.`.
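
As a quick sketch of the behavior described above, the following compares the default dot, the dot with the `re.DOTALL` flag, and an escaped dot. The target strings are made up for illustration.

```python
import re

# By default, '.' does not match a newline.
default_result = re.search(r'a.b', 'a\nb')

# With the re.DOTALL flag, '.' matches a newline too.
dotall_result = re.search(r'a.b', 'a\nb', flags=re.DOTALL)

# An escaped dot only matches a literal period.
literal_result = re.search(r'a\.b', 'a.b')
not_literal = re.search(r'a\.b', 'axb')

print(default_result is not None, dotall_result is not None)
print(literal_result is not None, not_literal is not None)
```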

In [None]:
anything = ['1', '0', 'a', 'Z', '_', ' ', '\t', '-', '!', '.']
non_anything = ['\n']

# Try out the anything character class.
regex = r'.'
regex_golf(regex, match_list = anything, nomatch_list = non_anything)

<h3 style="color: darkorange;">★ Task 3: Mysterious Code</h3>

Imagine that you are writing a Python program that uses specific "codes".
These codes are four characters long, start with any character, and then end with three digits.

You need to write a regex to find all the places in your program that you defined these codes.
Thankfully, you started every code variable with the string 'code_', and followed that with a single digit, letter, or underscore.

Write a regular expression (assigned in the `TASK3_REGEX` variable) that matches the definition of a code variable.

You may assume:
 - All strings you are trying to match are on one line (they will not have a newline in them). This assumption applies to this entire assignment.
 - Code strings will always use double quotes `"<code>"`.
 - A single space character will always be on either side of the assignment operator (equals sign).

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 3 regular expression here.
TASK3_REGEX = r''

code_assignments = [
    'code_a = "a123"',
    'code__ = "!098"',
    'code_b = "1098"',
]

non_code_assignments = [
    'a = "a123"',
    'code__ = "098"',
    'code_ = "1098"',
]

regex_golf(TASK3_REGEX, match_list = code_assignments, nomatch_list = non_code_assignments)

### Custom Character Classes

You can also create your own custom character class using square brackets: `[]`.
Any characters inside the square brackets are now inside the character class.
So `[abc]` will match any character that is an 'a', 'b', or 'c'.

You can invert a custom character class by having a caret/hat character directly after the opening square bracket.
So `[^abc]` will match any character that is **not** an 'a', 'b', or 'c'.
To match a literal caret/hat, you can escape it: `[abc\^]`.

You can also use a dash `-` to represent a range of characters.
You can make a range between lowercase characters `[a-z]`, uppercase characters `[A-Z]`, and digits `[0-9]`.
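Note that a single range should not span lowercase and uppercase characters: `[a-Z]` is an error, and `[A-z]` also matches the punctuation characters that sit between 'Z' and 'a' in ASCII.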
To match a literal dash, you can escape it.
For example, `[a-z]` matches 'a' *through* 'z', but `[a\-z]` matches 'a', 'z', or '-'.
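
The difference between a range and an escaped dash can be checked directly. This is a minimal sketch using single-character targets picked for illustration.

```python
import re

# [a-z] is a range: any lowercase letter from 'a' through 'z'.
in_range = re.search(r'[a-z]', 'm')
dash_vs_range = re.search(r'[a-z]', '-')

# [a\-z] is a set of exactly three characters: 'a', '-', and 'z'.
escaped_dash = re.search(r'[a\-z]', '-')
escaped_middle = re.search(r'[a\-z]', 'm')

print(in_range is not None, dash_vs_range is not None)
print(escaped_dash is not None, escaped_middle is not None)
```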

We can recreate some of our built-in character classes using custom character classes (these equivalences hold for ASCII text):
 - `\d` == `[0-9]`
 - `\D` == `[^0-9]`
 - `\w` == `[a-zA-Z0-9_]`
 - `\W` == `[^a-zA-Z0-9_]`
 - `\s` == `[ \t\n\r\f\v]`
 - `\S` == `[^ \t\n\r\f\v]`

In [None]:
abc = ['a', 'b', 'c']
non_abc = ['A', '1', ' ', '-', '!']

# Try out a custom character class.
regex = r'[abc]'
regex_golf(regex, match_list = abc, nomatch_list = non_abc)

# Now switch up the lists, and invert our custom character class.
regex = r'[^abc]'
regex_golf(regex, match_list = non_abc, nomatch_list = abc)

# We can also match some of the character classes we have seen in the past.

regex = r'[0-9]'
regex_golf(regex, match_list = digits, nomatch_list = non_digits)

regex = r'[a-zA-Z_0-9]'
regex_golf(regex, match_list = words, nomatch_list = non_words)

regex = r'[ \t\n\r\f\v]'
regex_golf(regex, match_list = whitespace, nomatch_list = non_whitespace)

<h3 style="color: darkorange;">★ Task 4: Mysterious Code - Better</h3>

Let's improve upon Task 3 to make it more realistic.

 - Instead of the character after "code_" being a digit, letter, or underscore, force this character to be a lowercase letter.
 - Allow either a single tab or space to be used on either side of the assignment operator (equals sign).
 - Force the first character of the code to be a letter (lowercase or uppercase) or a digit.

Write a regular expression (assigned in the `TASK4_REGEX` variable) that matches the definition of a code variable as modified above.

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 4 regular expression here.
TASK4_REGEX = r''

code_assignments = [
    'code_a = "a123"',
    'code_b = "1098"',
    'code_c\t=\t"z395"',
]

non_code_assignments = [
    'a = "a123"',
    'code__ = "098"',
    'code_ = "1098"',
    'code_33 = "Z456"',
    'code__ = "!098"',
    'code_3 = "Z456"',
]

regex_golf(TASK4_REGEX, match_list = code_assignments, nomatch_list = non_code_assignments)

## Anchors

When using regular expressions sometimes you will not just want to match something inside of a string/line,
but you may want to match the **entire** string/line.
To do this, you can use **anchors**.
Anchors do not match an actual character (they **do not consume** a character in your string),
but instead match the beginning or end of a string/line.

`^` (caret/hat) is the beginning anchor, and matches right before the first character in a string.
(With multiline matching enabled, it also matches right after each newline, i.e. at the beginning of each line.)
Remember `^` does not consume an actual character, but matches right before the first character.

`$` (dollar sign) is the end anchor, and matches right after the last character in a string, or right before a newline that ends the string.
(With multiline matching enabled, it also matches right before each newline, i.e. at the end of each line.)
Remember `$` does not consume an actual character, but matches right after the last character.

In many regular expression engines, you can enable ["multiline" matching](https://docs.python.org/3/library/re.html#re.MULTILINE)
which makes the anchors match at the beginning and end of every line instead of only the whole string.
The exact semantics of this option depend on the specific engine you are using.
Multiline matching is outside the scope of this assignment.

With the beginning anchor you can recreate the functionality of `re.match()` using `re.search()`:
just always start your regex with a caret.
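
The sketch below checks this equivalence on a few example targets. It assumes the default flags (no multiline matching); the `results` list is only there to collect the comparisons.

```python
import re

targets = ['dog', 'doggy', 'hot dog']

# For each target, compare an anchored re.search() with a plain re.match().
results = []
for target in targets:
    anchored_search = re.search(r'^dog', target) is not None
    plain_match = re.match(r'dog', target) is not None
    results.append((target, anchored_search, plain_match))
    print(target, anchored_search, plain_match)
```

Both calls agree for every target, but the anchored regex makes the "must start here" requirement visible in the pattern itself.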

In [None]:
# With no anchors, we match a lot of things that are dog-related.
regex = r'dog'
dog = ['dog', 'dogs', 'doggy', 'doge', 'hot dog']
non_dog = ['dg', 'do', 'dawg']
regex_golf(regex, match_list = dog, nomatch_list = non_dog)

# With both anchors, we only match exactly dog.
regex = r'^dog$'
dog = ['dog']
non_dog = ['dg', 'do', 'dawg', 'dogs', 'doggy', 'doge', 'hot dog']
regex_golf(regex, match_list = dog, nomatch_list = non_dog)

# We can also decide to only include one of the anchors for more flexibility.

regex = r'^dog'
dog = ['dog', 'dogs', 'doggy', 'doge']
non_dog = ['dg', 'do', 'dawg', 'hot dog']
regex_golf(regex, match_list = dog, nomatch_list = non_dog)

regex = r'dog$'
dog = ['dog', 'hot dog']
non_dog = ['dg', 'do', 'dawg', 'dogs', 'doggy', 'doge']
regex_golf(regex, match_list = dog, nomatch_list = non_dog)

### Word Boundaries

Sometimes, you will want to match the beginning or end of a word, instead of an entire string.
To do this, you can use "word boundaries".
A word boundary is a special sequence that matches the beginning or end of a "word".
Technically, it matches the empty position between a `\w` and a `\W` character (or vice versa), or between a `\w` character and the beginning or end of the string.

Think of word boundaries like anchors for words.
And like anchors, word boundaries do not consume any actual characters in your string.
In Python, a word boundary is represented by a `\b`.

For example, `\bdog\b` matches "dog", "(dog)", and "dog, cat, lemur"
but does not match "doggy" or "hotdog".

In [None]:
regex = r'\bdog\b'
dog = ['dog', '(dog)', 'dog, cat, lemur', 'hot dog']
non_dog = ['dg', 'do', 'dawg', 'dogs', 'doggy', 'doge', 'hotdog']
regex_golf(regex, match_list = dog, nomatch_list = non_dog)

<h3 style="color: darkorange;">★ Task 5: Finding Bad Data</h3>

Imagine that you are working with some chemists and they give you a big dump of data from some fancy chemical machines.
But, some of the machines are broken and sometimes give out bad numbers that are floating point hexadecimal numbers.
The chemists have told you that the bad numbers have these attributes:
 - They are hexadecimal and always start with a `0x`.
 - They are always floating point with two places after the point.
 - They are always between `0x10.00` and `0xff.ff` (inclusive).
 - They appear on a line all by themselves.
 - The data uses only lowercase letters for hexadecimal.

[Hexadecimal numbers](https://en.wikipedia.org/wiki/Hexadecimal) are base 16 numbers and are represented with the numbers 0 - 9 (like normal numbers) and a - f.
In code, they are typically prefixed with `0x` to differentiate them from decimal numbers.
So `0x5 == 5`, `0xa == 10`, `0xf == 15`, and `0x10 == 16`.

Your task is to write a regular expression (assigned in the `TASK5_REGEX` variable) that finds these bad data points.

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 5 regular expression here.
TASK5_REGEX = r''

bad_data = ['0x12.34', '0xfe.dc']
non_bad_data = ['12.34', 'fedc', 'other 0x12.34 junk']
regex_golf(TASK5_REGEX, match_list = bad_data, nomatch_list = non_bad_data)

## Repetitions

Another core feature of regular expressions is the ability to handle repetition.
There are several different ways to handle repetition in regular expressions
(and then a generic way that can cover all cases).
We call symbols that signal repetition operations "quantifiers".
In this section we will be dealing with repeating characters (or character classes),
but quantifiers can be applied to groups of characters (which we will discuss later).

### None or One

The simplest form of repetition is declaring that a character can appear once or not at all,
i.e. an optional character.
To do this, simply follow a character with a `?` (question mark).
For example, `too?` will match both "to" and "too".
You can apply repetition to character classes in the same way:
`to[onp]?` will match "to", "too", "ton" and "top", but it cannot match all of "toon" (only the leading "too").

In [None]:
# We can attach a quantifier to a character.
regex = r'^too?$'
match = ['to', 'too']
non_match = ['t', 'tooo', 'ta', 'tooooooooooooooooooooooooooooooooooooo']
regex_golf(regex, match_list = match, nomatch_list = non_match)

# We can also attach a quantifier to a character class.
regex = r'^\d\d?$'
match = ['0', '9', '00', '99']
non_match = ['', '-1', '100']
regex_golf(regex, match_list = match, nomatch_list = non_match)

# This includes custom character classes.
# This one matches a hexadecimal nibble (half a byte) or byte.
regex = r'^[0-9a-f][0-9a-f]?$'
match = ['0', 'f', '00', '5a', 'ff']
non_match = ['', 'z', 'zz', '000', 'ffff']
regex_golf(regex, match_list = match, nomatch_list = non_match)

### None or Many

You can use a `*` (asterisk/star) to declare that a character can appear any number of times or not at all.
This is also called a ["Kleene Star"](https://en.wikipedia.org/wiki/Kleene_star).

In [None]:
regex = r'^too*$'
match = ['to', 'too', 'tooo', 'tooooooooooooooooooooooooooooooooooooo']
non_match = ['t', 'ta']
regex_golf(regex, match_list = match, nomatch_list = non_match)

### One or Many

To match a character at least once and at most unlimited times,
you can use a `+` (plus).

In [None]:
regex = r'^too+$'
match = ['too', 'tooo', 'tooooooooooooooooooooooooooooooooooooo']
non_match = ['to', 't', 'ta']
regex_golf(regex, match_list = match, nomatch_list = non_match)

### General Repetition

Curly braces (`{}`) can be used for generalized repetition,
and they can cover all the cases we previously discussed and more.
The basic syntax is `{m,n}`,
where `m` is the *minimum* number of repetitions and `n` is the *maximum* number of repetitions.
`m` can be omitted if you want zero minimum repetitions,
and `n` can be omitted if you want infinite maximum repetitions.
Some regex engines, including Python's, allow you to write just `{n}`
when you want exactly `n` matches (so when `m == n`).

Therefore, you can use `to{1,2}` to match "to" and "too".

With this we can recreate all our other quantifiers:
 - `?` == `{0,1}`
 - `*` == `{0,}`
 - `+` == `{1,}`
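
The forms of the curly-brace syntax described above can be tried out with a small sketch. The patterns below use the single character `a` and anchors purely for illustration.

```python
import re

# {3} means exactly three repetitions.
exactly_three = re.search(r'^a{3}$', 'aaa')
too_many = re.search(r'^a{3}$', 'aaaa')

# {2,} means two or more repetitions.
two_or_more = re.search(r'^a{2,}$', 'aaaaa')

# {,2} means zero to two repetitions.
at_most_two = re.search(r'^a{,2}$', 'aaa')

# '?' and '{0,1}' accept the same strings.
question_mark = re.search(r'^too?$', 'to')
curly_braces = re.search(r'^too{0,1}$', 'to')

print(exactly_three is not None, too_many is not None)
print(two_or_more is not None, at_most_two is not None)
print(question_mark is not None, curly_braces is not None)
```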

In [None]:
regex = r'^to{1,2}$'
match = ['to', 'too']
non_match = ['t', 'ta', 'tooo', 'tooooooooooooooooooooooooooooooooooooo']
regex_golf(regex, match_list = match, nomatch_list = non_match)

<h3 style="color: darkorange;">★ Task 6: Finding Bad Data - Better</h3>

Let's improve our regex from Task 5 and make it more general.

For this task, we will make the following modifications from Task 5:
 - Instead of assuming that the bad numbers are all floating point,
     assume that they can be ints or floats (so there may be no point).
 - Instead of assuming that the bad numbers are in \[`0x10.00`, `0xff.ff`\],
     assume they are just non-negative.
 - Instead of assuming that there are exactly two hexadecimal digits after the point,
     assume that there can be any number (in cases where there is a point at all).
 - Assume that each number will have at least one hexadecimal digit whether or not there is a point.
 - Bad numbers with a trailing point may appear and should be matched.
     For example, `0x12.` should be matched, but **not** `0x12.34.`.

Your task is to write a regular expression (assigned in the `TASK6_REGEX` variable) that finds these bad data points.

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 6 regular expression here.
TASK6_REGEX = r''

bad_data = ['0x12.34', '0xfe.dc', '0x123456789.abcdef', '0xf', '0x0001.0', '0x12.']
non_bad_data = ['12.34', 'fedc', 'other 0x12.34 junk', '0x12.34.']
regex_golf(TASK6_REGEX, match_list = bad_data, nomatch_list = non_bad_data)

Notice that (hopefully) your regex has gotten simpler (or at least shorter) between Task 5 and Task 6 even though we allow many more cases.

## Grouping

The next core concept in regular expressions is "grouping" (also sometimes called "capture groups").
Grouping allows you to refer to more than one character at a time.
Whereas previously we were using quantifiers to repeat one character (or class) at a time,
we can instead repeat an entire group (which can be many characters (or classes) and even subgroups!).

To make a group in a regex, just surround your group with parentheses `()`, just like in math.
You can nest groups within groups.

For example, `\$1(,000)*` can match "\\$1", "\\$1,000", "\\$1,000,000", etc.
(Remember that we have to escape the dollar sign.)
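
Groups are often called capture groups because the regex engine also remembers the text that each group matched. As a minimal sketch (the pattern and target are made up for illustration), the match object's `group()` and `groups()` methods give access to that text.

```python
import re

match = re.search(r'(\d+)-(\d+)', 'pages 10-20')

print(match.group(0))  # The entire match.
print(match.group(1))  # The text matched by the first group.
print(match.group(2))  # The text matched by the second group.
print(match.groups())  # All groups as a tuple.
```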

In [None]:
regex = r'^\$1(,000)*$'
match = ['$1', '$1,000', '$1,000,000', '$1,000,000,000']
non_match = ['$,000', '$10', '$100', '$1000']
regex_golf(regex, match_list = match, nomatch_list = non_match)

# We can use nested groups.
regex = r'Look at that (really (super (duper )*)*)?cute dog.'
match = [
    'Look at that cute dog.',
    'Look at that really cute dog.',
    'Look at that really super cute dog.',
    'Look at that really super super cute dog.',
    'Look at that really super duper cute dog.',
    'Look at that really super duper duper cute dog.',
    'Look at that really super duper super duper cute dog.',
    'Look at that really super super duper super duper cute dog.',
]
non_match = ['Look at that ugly dog.']
regex_golf(regex, match_list = match, nomatch_list = non_match)

### Disjunctions

Disjunctions (also called "alternations" or just "or") let you choose between two different options in a regular expression.
They act just like your normal logical disjunction/or.
To use a disjunction, you use the pipe (`|`) character.

For example, `either|or` will match "either" or "or".
Note that the disjunction operator has a very low precedence,
so the disjunction applies to everything on either side and not just the characters to the immediate left and right.

Technically you do not need grouping to use disjunctions,
but it is easy to accidentally make subtle mistakes if you don't use the two together.
Like in math, extra parentheses may not be necessary but can be helpful for readability.
So in the above example, we can instead use `(either)|(or)` to hopefully create a more readable regex.

In [None]:
# Look very closely at this pattern and what it does and does not match.
# Because we didn't do any grouping, the anchors are actually part of the disjunction!
# So what we actually have here is r'^ab' OR r'c$'
regex = r'^ab|c$'
match = ['ab', 'c', 'ac', 'abc']
non_match = ['b']
regex_golf(regex, match_list = match, nomatch_list = non_match)

# This is probably what we intended in the above example.
regex = r'^(ab|c)$'
match = ['ab', 'c']
non_match = ['b', 'ac', 'abc']
regex_golf(regex, match_list = match, nomatch_list = non_match)

# You can chain together multiple disjunctions.
regex = r'^(a|b|c)$'
match = ['a', 'b', 'c']
non_match = ['ab', 'ac', 'abc']
regex_golf(regex, match_list = match, nomatch_list = non_match)

# Note that we don't need the extra parentheses,
# but they can help make things clear.

match = ['either', 'or']
non_match = ['eitheor', 'rr']

regex = r'^((either)|(or))$'
regex_golf(regex, match_list = match, nomatch_list = non_match)

regex = r'^(either|or)$'
regex_golf(regex, match_list = match, nomatch_list = non_match)

<h3 style="color: darkorange;">★ Task 7: Finding Bad Data - Best</h3>

Let's improve upon Task 6 one more time.

We have found out that the situation is worse than we thought!
It turns out that all numbers that are on a single line are bad!
This includes both hexadecimal **and** decimal numbers!

You may assume:
 - Bad numbers will no longer appear with a trailing point, e.g., `0x12.` should no longer be matched.
 - Scientific notation is not used.
 - Any number (hexadecimal or decimal) alone on a line is a bad number.
 - There may be any amount of whitespace before or after a number.
 - Bad numbers may be positive, zero, or negative (this includes both the hexadecimal and decimal numbers).
 - Positive numbers will not appear with a plus sign.
 - Hexadecimal numbers will still only include lowercase letters.

Your task is to write a regular expression (assigned in the `TASK7_REGEX` variable) that finds these bad data points.

A small golfing instance is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 7 regular expression here.
TASK7_REGEX = r''

bad_data = [
    '0x12.34', '0xfe.dc', '0x123456789.abcdef', '0xf', '0x0001.0',
    '0', '1', '2.3', '-45.67',
]
non_bad_data = ['a.12', '+3', 'fedc', 'other 0x12.34 junk', '0x12.', '0x12.34.']
regex_golf(TASK7_REGEX, match_list = bad_data, nomatch_list = non_bad_data)

### Back Reference

When you use a group in your regex, you can actually refer back to the text it matched (called a "backreference")
in other parts of your regex.
In Python, a backreference is written as `\N`, where `N` is the number of the group (for example, `\1` for the first group).
A group's number is determined by the order of its opening parenthesis (starting with 1).
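
Group numbering with nested groups can be easy to misread, so here is a small sketch that prints what each numbered group captured. The pattern is deliberately tiny and only meant as an illustration.

```python
import re

# Groups are numbered by the position of their opening parenthesis:
# group 1 is ((a)(b)), group 2 is (a), and group 3 is (b).
match = re.search(r'((a)(b))', 'ab')

print(match.group(1))
print(match.group(2))
print(match.group(3))
```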

For example, `("|')foo\1` will match `"foo"` and `'foo'` (note the order of quotes),
but not `"foo'`.
So only correctly quoted strings get matched.

In [None]:
# Note that we had to escape the single quote,
# not for regex reasons but because we used a single quote for our Python string.
regex = r'^("|\')foo\1$'
match = ['"foo"', "'foo'"]
non_match = ['"foo\'', '\'foo"']
regex_golf(regex, match_list = match, nomatch_list = non_match)

# We can match an HTML tag.
regex = r'^<(\w+)>.*</\1>$'
match = ['<a>link</a>', '<span>Some text!</span>', '<html><body><div>Yay!</div></body></html>']
non_match = ['<p></a>']
regex_golf(regex, match_list = match, nomatch_list = non_match)

Using a backreference during matching is useful,
but the true strength of backreferences is using them with replacements.
Up until now we have only been focused on matching,
but you will probably use regex more in your daily life in find-replace operations.

There are several replace functions available in Python's `re` library,
with the most common being [`re.sub`](https://docs.python.org/3/library/re.html#re.sub).
`re.sub()` takes three required arguments: the regex, the replacement string, and the target string.
The function then returns the replaced string (or the original target string if no replacements were made).
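
Before adding backreferences, here is a minimal sketch of plain `re.sub()` calls. It also uses the optional `count` argument of `re.sub()`, which limits how many replacements are made; the example strings are made up.

```python
import re

# All non-overlapping matches are replaced.
replaced = re.sub(r'cat', 'dog', 'cat, cat, lizard')

# With no match, the target comes back unchanged.
unchanged = re.sub(r'cat', 'dog', 'lizard')

# The optional count argument limits the number of replacements.
first_only = re.sub(r'cat', 'dog', 'cat, cat, lizard', count=1)

print(replaced)
print(unchanged)
print(first_only)
```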

Backreferences can be used in the replacement string (the second parameter) to represent the exact text that was matched by a group.

In [None]:
regex = r'My name is (.+)\.'
replacement = r'Hello, \1!'
target = "My name is Sammy Slug."
print(re.sub(regex, replacement, target))

# Sometimes you will not want anything in the target string aside from your group.
# In this case, you can use anchors and .* to consume anything before and after your match.
regex = r'^.*(\d{3})\D*(\d{3})\D*(\d{4}).*$'
replacement = r'\1\2\3'
target = "Call me back at (555) 123-4567, thanks."
print(re.sub(regex, replacement, target))

# Remember that a group's number is determined by the location of its opening parenthesis.
regex = r'^.*(\d+)\s+((dog)|(cat)|(spotted lizard))s?.*$'
replacement = r'\1 - \2'
target = "Sammy has 1 parrot and 3 dogs."
print(re.sub(regex, replacement, target))

<h3 style="color: darkorange;">★ Task 8: Mysterious Code - Best</h3>

Let's improve upon Task 4 one more time.
Now we don't just want to find these codes, but we want to modify them!

We want to replace the name of each code variable so that it has the actual code in the name.
Instead of:
```
code_a = "a123"
```
We want:
```
code_a123 = "a123"
```

To do this, you will need to complete two parts:
 - `TASK8_REGEX` -- A regular expression that matches the code assignment statement (like in Task 4).
 - `TASK8_REPLACEMENT` -- A replacement string that will be used together with your regex to modify our code.

Specifics:
 - Make no assumptions about the amount and type of whitespace on either side of the assignment operator (except that it will not be a newline) in the target string.
 - The replacement string should have exactly one space character on either side of the assignment operator.
 - The replacement string should use double quotes around the code (as the existing code already does).
 - You may assume that the entire assignment statement will be on one line.

A small test is provided to get you started, but the autograder will check more cases.

In [None]:
# Put your Task 8 regular expression and replacement string here.
TASK8_REGEX = r''
TASK8_REPLACEMENT = r''

old_strings = [
    'code_a = "a123"',
    'code_b     =     "1098"',
    'code_c\t=\t"z395"',
]

new_strings = [
    'code_a123 = "a123"',
    'code_1098 = "1098"',
    'code_z395 = "z395"',
]

for i in range(len(old_strings)):
    actual = re.sub(TASK8_REGEX, TASK8_REPLACEMENT, old_strings[i])

    expected = new_strings[i]
    if (actual == expected):
        print("Good job, string %d is correct!" % (i))
    else:
        print("Missed string %d. Expected '%s', found '%s'." % (i, expected, actual))

## Congratulations!

Congratulations, you now know about regular expressions!
Of course there are more features you can learn,
but you know enough of the basics to cover most situations,
and you have the knowledge, resources, and vocabulary to learn about any other situations that you may encounter.