# Denison DA210/CS181 Homework 3.d - Step 1

Before you turn this notebook in, make sure everything runs as expected: **restart the kernel** and then **run all cells**.

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [None]:
import re

---

## Part A - Python `re` module

We can use the Python `re` module to work with regular expressions.  There are four main functions we can make use of in `re`:

Function | Return Type | Short Description
:-------:|-------------|-----------------------------------------------------------------
`search()` | *Match Object* | Find first match of pattern *anywhere* in target
`match()` | *Match Object* | Find match of pattern at *start* of target
`findall()` | *List of strings* | Find all matches and return list of matched strings, or list of tuples of captured groups
`finditer()` | *Match Object Iterator* | Find all matches through an iterator of successful match objects
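
To make the differences in the table concrete, here is a minimal sketch that calls each of the four functions once. The target string and the variable names are made up for illustration and are not used elsewhere in the assignment.

```python
import re

# Hypothetical target string, chosen only for this illustration
target = "Lulu is 10 years old, Max is 7."

# search(): first match anywhere in the target
first = re.search(r"\d+", target)
print(first.group(), first.start())   # 10 8

# match(): succeeds only if the pattern matches at the start of the target
at_start = re.match(r"\d+", target)
print(at_start)                       # None

# findall(): list of strings when the pattern has no capture groups...
all_numbers = re.findall(r"\d+", target)
print(all_numbers)                    # ['10', '7']

# ...and a list of tuples when the pattern has two or more capture groups
pairs = re.findall(r"(\w+) is (\d+)", target)
print(pairs)                          # [('Lulu', '10'), ('Max', '7')]

# finditer(): iterator of match objects, one per match
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", target)]
print(spans)                          # [('10', (8, 10)), ('7', (29, 30))]
```

Note that `search()` and `match()` return `None` when there is no match, so the result should be checked before calling `.group()` on it.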

In this assignment, though, we'll abstract the use of `re` to two functions.  The first function returns a list of match info: `assembleMatches(pattern, text)`.

In [None]:
def assembleMatches(pattern, text):
    """
    Returns a list of string-index tuples for each match of the given
    pattern in the given text.
    """
    return [(m.group(), m.start()) for m in re.finditer(pattern, text)]

In Python, we'll also use "raw strings", in which backslashes are kept as literal characters instead of being interpreted as Python escape sequences, so a regex token such as `\d` reaches the `re` module unchanged.  A Python raw string has an `r` preceding the opening quotation mark.

We can see how `assembleMatches` works with the following examples:
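
As a small illustration of why the `r` prefix matters, the sketch below compares a normal string and a raw string. The strings used here are made up for the example.

```python
import re

# In a normal string, Python turns "\n" into a single newline character.
normal = "\n"
# In a raw string, the backslash is kept, so this is two characters: \ and n.
raw = r"\n"
print(len(normal), len(raw))   # 1 2

# Without the r prefix, "\b" becomes a backspace character before the
# regex engine ever sees it, so the pattern looks for literal backspaces.
without_r = re.findall("\bis\b", "this is it")
print(without_r)               # []

# With the r prefix, \b reaches the regex engine as a word boundary.
with_r = re.findall(r"\bis\b", "this is it")
print(with_r)                  # ['is']

# Doubling every backslash is the equivalent spelling without a raw string.
same = "\\bis\\b" == r"\bis\b"
print(same)                    # True
```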

In [None]:
# Specify the pattern as a raw string
pattern = r"Lulu" # the exact string Lulu (note the preceding r)

# Target text is just a string to search for matches in
target = "Lulu is 10 years old."

# Here are the matches:
assembleMatches(pattern, target)

In [None]:
# Specify the pattern as a raw string
pattern = r"[\w]+" # one or more word characters (letters, digits, or underscore)

# Target text is just a string to search for matches in
target = "Lulu is 10 years old."

# Here are the matches:
assembleMatches(pattern, target)

In [None]:
# Specify the pattern as a raw string
pattern = r"[\d]+" # one or more digits

# Target text is just a string to search for matches in
target = "Lulu is 10 years old."

# Here are the matches:
assembleMatches(pattern, target)

In [None]:
# Specify the pattern as a raw string
pattern = r"[\D]+" # one or more non-digits

# Target text is just a string to search for matches in
target = "Lulu is 10 years old."

# Here are the matches:
assembleMatches(pattern, target)
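
The shorthand classes used in these examples come in complementary pairs: the uppercase form matches exactly the characters the lowercase form does not. The sketch below runs each pair against the same target using `re.findall` directly; the dictionary name `results` is just for this illustration.

```python
import re

target = "Lulu is 10 years old."

# Each lowercase class and its uppercase complement, applied to the same target
results = {p: re.findall(p, target)
           for p in (r"\d+", r"\D+", r"\w+", r"\W+", r"\s+", r"\S+")}

for p, matches in results.items():
    print(p, matches)
# \d+ ['10']
# \D+ ['Lulu is ', ' years old.']
# \w+ ['Lulu', 'is', '10', 'years', 'old']
# \W+ [' ', ' ', ' ', ' ', '.']
# \s+ [' ', ' ', ' ', ' ']
# \S+ ['Lulu', 'is', '10', 'years', 'old.']
```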

---

## Part B - Writing your own regexes

In this assignment, you'll just need to provide a regular expression as a raw Python string and assign it to the appropriate `pattern` variable (e.g., `pattern1`).

**Q1:** Write a regular expression that matches complete words that begin with `t` and then `h`, followed by two more letters (i.e., you should match all words that are four letters long and start with `th`).  Assign your regular expression to `pattern1`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Debugging cell
text1 = "Does this text match that pattern?"
assembleMatches(pattern1, text1)

In [None]:
# Testing cell
text1 = "Does this text match that pattern?"

assert assembleMatches(pattern1, text1) == [('this', 5), ('that', 21)]

**Q2:** Outside of the US, it is common to write dates in the form `year.month.day`, e.g., `2020.01.05` for `January 05, 2020`. Write a regular expression (`pattern2`) that matches a date written in this form.  Note that single digits for the month and day will use a leading zero.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Debugging cell
text2 = "2021.02.15 is the first day of week 3; but 123.45.6789 might be mistaken for a social security number"
assembleMatches(pattern2, text2)

In [None]:
# Testing cell
text2 = "2021.02.15 is the first day of week 3; but 123.45.6789 might be mistaken for a social security number"

assert assembleMatches(pattern2, text2) == [('2021.02.15', 0)]

---

## Part C - Capture groups

We can use a "capture group", written as a parenthesized part of the pattern, to acquire part of the match from a regular expression.  Again, we'll use a function to abstract out the use of the `re` module, and focus instead on the regular expression patterns.

In [None]:
def assembleCaptures(pattern, text):
    """
    Returns, for each match of the given pattern in the given text,
    a list of captures: the entire match (group 0) followed by each
    capture group in the pattern.
    
    Each capture is a tuple of the captured string and the index
    in the text where the capture begins.
    """
    res = []
    for m in re.finditer(pattern, text):
        grp = [(m.group(i), m.start(i)) for i in range(len(m.groups())+1)]
        res.append(grp)
    return res
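
To see what this structure looks like, here is a minimal sketch that repeats the same loop on a small made-up example (times written as `HH:MM`). The text, the pattern, and the name `captures` are assumptions for illustration only.

```python
import re

# Hypothetical text and pattern, chosen only for this illustration
text = "Trains leave at 09:30 and 14:05."
pattern = r"(\d{2}):(\d{2})"   # group 1: hours, group 2: minutes

captures = []
for m in re.finditer(pattern, text):
    n_groups = len(m.groups())   # number of capture groups in the pattern
    # Index 0 is the entire match; indices 1..n_groups are the capture groups
    captures.append([(m.group(i), m.start(i)) for i in range(n_groups + 1)])

print(captures)
# [[('09:30', 16), ('09', 16), ('30', 19)], [('14:05', 26), ('14', 26), ('05', 29)]]
```

Each inner list starts with the entire match, so the first capture group of the first match is `captures[0][1]`, not `captures[0][0]`.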

Consider the following example, which matches US phone numbers and captures the entire phone number and, separately, the area code, 3-digit prefix, and 4-digit line number:

In [None]:
text1 = "Looking to match 555-123-4567 and (800) 721-6432 but 123.45.6789 might be mistaken for a social security number"

# Group 1: area code:           ([\d]{3})
# Group 2: 3-digit prefix:      ([\d]{3})
# Group 3: 4-digit line number: ([\d]{4})
pattern = r"\({0,1}([\d]{3})[\-)\s]{1,2}([\d]{3})\-([\d]{4})"
assembleCaptures(pattern, text1)
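
A dense pattern like this one can be easier to read when spread over several lines. As a sketch, the same phone-number pattern is rewritten below with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments. The names `compact` and `verbose` are introduced here for illustration; the assignment itself only needs the one-line form.

```python
import re

text = ("Looking to match 555-123-4567 and (800) 721-6432 but 123.45.6789 "
        "might be mistaken for a social security number")

# The one-line pattern from the example
compact = r"\({0,1}([\d]{3})[\-)\s]{1,2}([\d]{3})\-([\d]{4})"

# The same pattern, one piece per line, compiled with re.VERBOSE
verbose = re.compile(r"""
    \({0,1}         # optional opening parenthesis
    ([\d]{3})       # group 1: area code
    [\-)\s]{1,2}    # one or two separators: hyphen, closing parenthesis, or whitespace
    ([\d]{3})       # group 2: 3-digit prefix
    \-              # a hyphen
    ([\d]{4})       # group 3: 4-digit line number
    """, re.VERBOSE)

compact_groups = re.findall(compact, text)
verbose_groups = verbose.findall(text)

print(verbose_groups)                    # [('555', '123', '4567'), ('800', '721', '6432')]
print(compact_groups == verbose_groups)  # True
```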

**Q3:** Write a regular expression pattern (`pattern3`) that matches all `import` statements in a target consisting of Python source code.  An `import` is contained in a single line, but there could be leading whitespace (if indented in a block) or trailing whitespace.

Your result should capture both the entire matched line and, separately, the module that is imported.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Debugging cell
text3 = """import pandas
import re  

def f(x):
    import math
    return math.sqrt(x)
"""

assembleCaptures(pattern3, text3)

In [None]:
# Testing cell
text3 = """import pandas
import re  

def f(x):
    import math
    return math.sqrt(x)
"""

res = assembleCaptures(pattern3, text3)
assert res[0][1] == ('pandas', 7)
assert res[1][1] == ('re', 21)
assert res[2][1] == ('math', 48)

**Q4:** Write a regular expression pattern (`pattern4`) that matches all variable assignment statements in a target consisting of Python source code.  A variable assignment statement has the form `variable = expression`, where `variable` must be any valid Python identifier (e.g., containing only letters, numbers, and underscores, and not starting with a number), and `expression` is assumed to be the remainder of the line.  Note that the assignment statement may have leading or trailing whitespace, which you should ignore, and that spaces around the equals sign (`=`) are optional.

Your result should capture the entire matched line and, separately, the variable name and the expression.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Debugging cell
text4 = """
def f(x):
    y = x + 2
    a = x / y
    return a

a = 7
b = 10 + f(a)
print(a+b)
"""

assembleCaptures(pattern4, text4)

In [None]:
# Testing cell
text4 = """
def f(x):
    y = x + 2
    a = x / y
    return a

a = 7
b = 10 + f(a)
print(a+b)
"""

res = assembleCaptures(pattern4, text4)
assert res[0][1] == ('y', 15)
assert res[0][2] == ('x + 2', 19)
assert res[1][1] == ('a', 29)
assert res[1][2] == ('x / y', 33)
assert res[2][1] == ('a', 53)
assert res[2][2] == ('7', 57)
assert res[3][1] == ('b', 59)
assert res[3][2] == ('10 + f(a)', 63)

---

## Part D

**Q5:** How much time (in minutes/hours) did you spend on this homework assignment?

YOUR ANSWER HERE

**Q6:** Who was your partner for this assignment?  If you worked alone, say so instead.

YOUR ANSWER HERE