# Lesson 3 - Introduction - Language

Python has a syntax for statements, expressions, classes with operator overloading etc. It also contains some mini-languages like string formatting and regular expressions. These are small, highly specialized languages that serve a specific purpose. The basic idea in this unit is that one can develop a domain-specific language to represent and solve specific problems. To this end, we will describe what a *language* is, what a *grammar* is, the difference between a *compiler* and an *interpreter*, and how to use languages as a design tool.

## Regular Expressions

REs are an example of a language. They can be expressed by strings like `'a*b*c*'`. To make sense of inputs like these we need to understand what the possible *grammars* are and which *languages* those grammars correspond to.

A **grammar** is a description of a language and a **language** is a set of strings. In our example above, `'a*b*c*'` is the description of a grammar, and strings like `abc`, `aaabbcc`, `ccccc` belong to the language associated with that grammar.
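The distinction can be checked mechanically. The sketch below uses the standard `re` module (not the matcher built in this lesson) to test which strings belong to the language described by the grammar `'a*b*c*'`. The list of candidate strings is made up for illustration.

```python
import re

grammar = 'a*b*c*'   # the description of the language
candidates = ['abc', 'aaabbcc', 'ccccc', '', 'cba', 'abca']   # hypothetical test strings

# re.fullmatch succeeds only if the whole string fits the grammar.
language_members = [s for s in candidates if re.fullmatch(grammar, s)]
print(language_members)   # ['abc', 'aaabbcc', 'ccccc', '']
```

Note that the empty string is a member too, since every starred part may repeat zero times.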

Representing grammars as strings, like `'a*b*c*'` or `'a+b?'`, is convenient for small expressions, but it becomes unwieldy with longer ones. We are going to use a representation that is more **compositional**: an **Application Programming Interface (API)** (what the programmer uses, as opposed to the UI, which is what the user uses), i.e., a series of function calls that describe the grammar of a RE. Python has a `re` module, so we could leverage the functions in that module, but the point of this unit is not to learn how to use REs, but rather how to build a **language processor**.

The **API calls** listed below are the building blocks of our language description, i.e., of our grammar.

- `lit(s)` is the **literal** string `s`. `lit('a')` describes the language consisting only of character string `'a'`, i.e., `{'a'}`, and nothing else.
- `seq(x, y)` is the **sequence** of x and y, meaning the application of RE `y` on what is returned by the application of `x` to the input. `seq(lit('a'), lit('b'))` describes the language consisting only of the string `'ab'`, i.e. `{'ab'}`.
- `alt(x, y)` stands for **alternatives**. `alt(lit('a'), lit('b'))` corresponds to two possibilities: either `a` or `b`, i.e., `{'a', 'b'}`. The output is what we would get by applying `x` *or* `y` to the input.
- `star(x)` means **zero or more repetitions** of `x`. `star(lit('a'))` therefore corresponds to `{'', 'a', 'aa', 'aaa', ...}` and so on.
- `oneof(s)` is the same as `alt(c1, c2, ...)` where `c1`, `c2`, ... are the characters that string `s` is made of. `oneof('abc')` matches `{'a', 'b', 'c'}`. Note that the input of `oneof()` is a string of characters, and not other metacharacters (**TODO** confirm that this is true).
- `eol` means **end of line**. It matches only the end of a character string and nothing else, so it matches the empty string `{''}`, but only at the end. `seq(lit('a'), eol)` matches `{'a'}` if it is at the end of the line (it would be the equivalent of `'a$'`).
- `dot` matches any single character: `{'a', 'b', 'c', ...}`.
- `plus(x)` means **one or more repetitions** of `x`. **NOTE** `matchset()` has no dedicated case for it; in the section "Filling Out The API" it is built from `seq` and `star`.

The API calls above implement a subset of regular expression metacharacters. These are based on [Rob Pike's regular expression matcher](https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html). The operators above define patterns that are searched for by these two functions:

- `search(pattern, text)`: returns a string that is the earliest match of the pattern in the text, and if there is more than one match in the same location, it will be the longest of those.
- `match(pattern, text)`: matches only if the pattern occurs at the very start of `text`, so `match('def', 'abcdef')` would return `None`.

The naming of these functions is different from that used by Pike, but is consistent with the naming used in the `re` module. Below we provide the implementations of `search()` and `match()`, plus a couple of utility functions. Of particular interest is the structure of `match_star(p, pattern, text)`, which uses a clever recursive call.

The `match1(p, text)` function returns `True` if the first character of `text` is `p` *or* if `p` is `.`, i.e., if the first character is allowed to be *any* character. It returns `False` for an empty `text`.

In [1]:
def match1(p, text):
    """Return True if first character of text matches pattern character p."""
    if not text:
        return False
    return p == '.' or p == text[0]

`search(pattern, text)` returns `True` if `pattern` is found anywhere in `text`. If the first character of `pattern` is `^`, `pattern` must appear at the very beginning of `text`. Note that this function depends on the function `match(pattern, text)` defined below.

In [2]:
def search(pattern, text):
    "Return True if pattern appears anywhere in text."
    if pattern.startswith('^'):
        return match(pattern[1:], text)
    else:
        return match('.*' + pattern, text)

`match(pattern, text)` returns `True` if `pattern` appears at the start of `text`. It covers the following special cases:

- If `pattern` is the empty string `''` it returns `True`. This is the same behaviour as `re.match()`.
- If `pattern` is `'$'` it returns `True` only when the end and the beginning of the string coincide, which only happens for the empty string.
- If `pattern` is a multicharacter string and the *second* character is either `'*'` or `'?'`, it splits the pattern into three pieces: the first character `p`, the operator `op`, and the pattern `pat`.
  - If `op` is `'*'` it calls `match_star(p, pat, text)`.
  - If `op` is `'?'` (zero or one occurrences of the preceding character) it checks whether `p` matches the first character of `text` and whether `pat` matches the rest of `text`. If so, it returns `True`. Otherwise, it checks whether `pat` matches (all of) `text`.
- Finally, if none of the above holds, it checks whether the first character of `pattern` matches the first character of `text` with `match1()` and uses a recursive call, `match(pattern[1:], text[1:])`, to match the rest of the pattern against the rest of the text.

In [3]:
def match(pattern, text):
    "Return True if pattern appears at the start of text."
    if pattern == '':
        return True
    elif pattern == '$':
        return (text == '')
    elif len(pattern) > 1 and pattern[1] in '*?':
        p, op, pat = pattern[0], pattern[1], pattern[2:]
        if op == '*':
            return match_star(p, pat, text)
        elif op == '?':
            if match1(p, text) and match(pat, text[1:]):
                return True
            else:
                return match(pat, text)
    else:
        return (match1(pattern[0], text) and
                match(pattern[1:], text[1:]))

`match_star(p, pattern, text)` returns `True` if zero or more instances of `p` are followed by `pattern`. As above, matching an arbitrary number of instances is achieved via a recursive call to `match_star()` itself.

In [4]:
def match_star(p, pattern, text):
    """Return True if any number of char p, followed by pattern,
    matches text."""
    return (match(pattern, text) or  # match zero times
            (match1(p, text) and     # match exactly one time
             match_star(p, pattern, text[1:])))  # Brilliant!
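To see the four functions working together, here is a self-contained sketch that restates them compactly (so the block runs on its own) and tries a few patterns. The sample patterns and texts are made up for illustration.

```python
def match1(p, text):
    "True if the first character of text matches pattern character p."
    return bool(text) and (p == '.' or p == text[0])

def match_star(p, pattern, text):
    "True if any number of p, followed by pattern, matches text."
    return (match(pattern, text) or
            (match1(p, text) and match_star(p, pattern, text[1:])))

def match(pattern, text):
    "True if pattern appears at the start of text."
    if pattern == '':
        return True
    elif pattern == '$':
        return text == ''
    elif len(pattern) > 1 and pattern[1] in '*?':
        p, op, pat = pattern[0], pattern[1], pattern[2:]
        if op == '*':
            return match_star(p, pat, text)
        # op == '?': use p once if possible, otherwise skip it
        return (match1(p, text) and match(pat, text[1:])) or match(pat, text)
    else:
        return match1(pattern[0], text) and match(pattern[1:], text[1:])

def search(pattern, text):
    "True if pattern appears anywhere in text."
    if pattern.startswith('^'):
        return match(pattern[1:], text)
    return match('.*' + pattern, text)

print(match('def', 'abcdef'))       # False: 'def' is not at the start
print(search('def', 'abcdef'))      # True
print(match('a*b', 'aaab'))         # True
print(match('colou?r', 'color'))    # True
print(search('^abc', 'xabc'))       # False
```

`search()` simply prepends `'.*'` to the pattern, so an unanchored search reduces to a match at the start of the text.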

## Concept Inventory

Let's make a list of the concepts we need to consider in this unit:

- Patterns.
- Texts we want to match the patterns against, and the result of this matching.

This is pretty much it, however, we will also consider the following concepts:

- A concept of **partial result**.
- A notion of **control over iteration**.

To understand why these additional notions are needed, consider the following example: our pattern is `'a*b+'` and our text is `'aaab'`. In our API the pattern is `seq(star(lit('a')), plus(lit('b')))`. The first part, `star(lit('a'))`, can match zero, one, two, or three `'a'`s. If it stops after the first `'a'`, the second part, `'b+'`, fails because the next character is another `'a'`; only after the first part has consumed all three `'a'`s does the second part find a `'b'`. We therefore need a mechanism to iterate through all the possible prefixes matched by the first part of the pattern, which is quite tricky. A similar situation occurs when we have to choose between alternatives. In such cases we need some form of control over the iteration.

It turns out (no explanation provided in the lesson) that representing these partial results as a **set of remainders of the text** is a good choice. By *remainder* we mean everything left after matching `pattern`. For example, if `pattern = '^a'` and `text = 'abacus'`, the remainder is `'bacus'`. The `matchset(pattern, text)` function below returns this set of remainders. In the case of `star(lit('a'))`, `matchset(star(lit('a')), 'aaab')` would then return `set(['aaab', 'aab', 'ab', 'b'])`.

To implement `matchset(pattern, text)` we first introduce a utility function `components(pattern)` which breaks a pattern into three parts: the operator `op` and the arguments `x` and `y`, which will be `None` if missing.

**Important**: here `pattern` is a *tuple* of the form `(op, x, y)`, for example `('lit', 'abc')`.

In [5]:
null = frozenset()  # Acts as Null

def components(pattern):
    """Return the op, x, and y arguments; x and y are None if missing.
    pattern is a tuple (op, x, y) with x, y optional."""
    x = pattern[1] if len(pattern) > 1 else None
    y = pattern[2] if len(pattern) > 2 else None
    return pattern[0], x, y
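A quick way to get a feel for `components()` is to call it on tuples of different lengths. The sketch below restates the function so it runs on its own; the sample tuples are illustrative.

```python
def components(pattern):
    "Return the op, x, and y arguments; x and y are None if missing."
    x = pattern[1] if len(pattern) > 1 else None
    y = pattern[2] if len(pattern) > 2 else None
    return pattern[0], x, y

print(components(('eol',)))           # ('eol', None, None)
print(components(('lit', 'abc')))     # ('lit', 'abc', None)
print(components(('seq', ('lit', 'a'), ('lit', 'b'))))
# ('seq', ('lit', 'a'), ('lit', 'b'))
```

Padding with `None` lets the caller always unpack three values, whatever the operator's arity.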

In [6]:
#----------------
# User Instructions
#
# The function, matchset, takes a pattern and a text as input
# and returns a set of remainders. For example, if matchset 
# were called with the pattern star(lit(a)) and the text 
# 'aaab', matchset would return a set with elements 
# {'aaab', 'aab', 'ab', 'b'}, since a* can consume none, one, two
# or all three of the a's in the text.
#
# dot:   matches any character.
# oneof: matches any of the characters in the string it is 
#        called with. oneof('abc') will match a or b or c.

def matchset(pattern, text):
    "Match pattern at start of text; return a set of remainders of text."
    op, x, y = components(pattern)
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null
    elif 'seq' == op:
        return set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
    elif 'alt' == op:
        return matchset(x, text) | matchset(y, text)
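    elif 'dot' == op:
        # Any single character matches; an empty text has nothing to consume.
        return set([text[1:]]) if text else null
    elif 'oneof' == op:
        # x is a collection of characters; the first character must be one of them.
        return set([text[1:]]) if text and text[0] in x else null
    elif 'eol' == op:
        return set(['']) if text == '' else null
    elif 'star' == op:
        # Skip t1 == text (x consumed nothing) to avoid infinite recursion.
        return (set([text]) |
                set(t2 for t1 in matchset(x, text) if t1 != text
                    for t2 in matchset(pattern, t1)))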
    else:
        raise ValueError('unknown pattern: %s' % pattern)

When `op == 'alt'` the function returns the *union* of `matchset(x, text)` and `matchset(y, text)`: any remainder reachable through either alternative is a valid partial result. The recursive calls are needed because `x` and `y` may themselves be composite expressions.

Let's try to understand the case for `'seq'`. Given `('seq', x, y)`, the function evaluates the expression

```python
set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
```

Here `matchset(x, text)` returns a set whose elements, denoted `t1`, become the inputs of `matchset(y, t1)`, which in turn returns a set whose elements we denote `t2`. The result collects every `t2`, i.e., every remainder left after matching `x` and then `y`.
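The flow of partial results through `'seq'` can be traced on the `'aaab'` example from the Concept Inventory. The sketch below is a reduced `matchset()` that handles only `'lit'`, `'seq'` and `'star'`; it is an illustration under that assumption, not the full function.

```python
null = frozenset()

def matchset(pattern, text):
    "Reduced matchset: only 'lit', 'seq' and 'star' are handled."
    op = pattern[0]
    if op == 'lit':
        s = pattern[1]
        return {text[len(s):]} if text.startswith(s) else null
    elif op == 'seq':
        x, y = pattern[1], pattern[2]
        return {t2 for t1 in matchset(x, text) for t2 in matchset(y, t1)}
    elif op == 'star':
        x = pattern[1]
        return {text} | {t2 for t1 in matchset(x, text) if t1 != text
                         for t2 in matchset(pattern, t1)}
    raise ValueError('unknown pattern: %r' % (pattern,))

a_star = ('star', ('lit', 'a'))
b = ('lit', 'b')

partial = matchset(a_star, 'aaab')             # the t1 values
final = matchset(('seq', a_star, b), 'aaab')   # the t2 values

print(sorted(partial))   # ['aaab', 'aab', 'ab', 'b']
print(final)             # {''}
```

Of the four partial results, only `'b'` can be continued by `lit('b')`, so the sequence as a whole leaves a single remainder, the empty string.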

In [7]:
def test():
    assert matchset(("lit", "abc"), "abcdef") == set(["def"])
    assert matchset(
        ("seq", ("lit", "hi "), ("lit", "there ")), "hi there nice to meet you"
    ) == set(["nice to meet you"])
    assert matchset(("alt", ("lit", "dog"), ("lit", "cat")), "dog and cat") == set(
        [" and cat"]
    )
    assert matchset(("dot",), "am i missing something?") == set(
        ["m i missing something?"]
    )
    assert matchset(("oneof", "a"), "aabc123") == set(["abc123"])
    assert matchset(("eol",), "") == set([""])
    assert matchset(("eol",), "not end of line") == frozenset([])
    assert matchset(("star", ("lit", "hey")), "heyhey!") == set(
        ["!", "heyhey!", "hey!"]
    )
    return "tests pass"

print(test())

tests pass


### Search and Match

With this definition of `matchset()` we can implement `search(pattern, text)` and `match(pattern, text)`.

In [8]:
null = frozenset()

def search(pattern, text):
    "Match pattern anywhere in text; return longest earliest match or None."
    for i in range(len(text) + 1):  # + 1 so an empty match at the end can be found
        m = match(pattern, text[i:])
        if m is not None:  # an empty match '' is still a match
            return m

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = matchset(pattern, text)
    if remainders:
        shortest = min(remainders, key=len)
        # The match is whatever precedes the shortest remainder.
        return text[:len(text) - len(shortest)]

def test():
    assert match(('star', ('lit', 'a')),'aaabcd') == 'aaa'
    assert match(('alt', ('lit', 'b'), ('lit', 'c')), 'ab') == None
    assert match(('alt', ('lit', 'b'), ('lit', 'a')), 'ab') == 'a'
    assert search(('alt', ('lit', 'b'), ('lit', 'c')), 'ab') == 'b'
    return 'tests pass'

print(test())

tests pass


## Filling Out The API

In the code below we provide the implementations of the various API calls. When we call an operator as a function, say `lit('hello')`, we obtain a tuple containing the name of the operator and, depending on the operator, its arguments, which may themselves be operator tuples, e.g., `('lit', 'hello')`.

In [9]:
def lit(string):
    return ('lit', string)

def seq(x, y):
    return ('seq', x, y)

def alt(x, y):
    return ('alt', x, y)

def star(x):
    return ('star', x)

def plus(x):
    return ('seq', x, ('star', x))

def opt(x):
    return alt(lit(''), x) #opt(x) means that x is optional

def oneof(chars):
    return ('oneof', tuple(chars))

dot = ('dot',)
eol = ('eol',)

def test():
    assert lit('abc') == ('lit', 'abc')
    assert seq(('lit', 'a'), ('lit', 'b')) == ('seq', ('lit', 'a'), ('lit', 'b'))
    assert alt(('lit', 'a'), ('lit', 'b')) == ('alt', ('lit', 'a'), ('lit', 'b'))
    assert star(('lit', 'a')) == ('star', ('lit', 'a'))
    assert plus(('lit', 'c')) == ('seq', ('lit', 'c'), ('star', ('lit', 'c')))
    assert opt(('lit', 'x')) == ('alt', ('lit', ''), ('lit', 'x'))
    assert oneof('abc') == ('oneof', ('a', 'b', 'c'))
    return 'tests pass'

print(test())

tests pass
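As an illustration, the pattern `a*b+` from the Concept Inventory can be built with these calls and inspected as a nested tuple. The block restates the four constructors it needs so that it runs on its own.

```python
def lit(string):
    return ('lit', string)

def seq(x, y):
    return ('seq', x, y)

def star(x):
    return ('star', x)

def plus(x):
    return ('seq', x, ('star', x))   # x+ is x followed by x*

pattern = seq(star(lit('a')), plus(lit('b')))   # a*b+
print(pattern)
# ('seq', ('star', ('lit', 'a')), ('seq', ('lit', 'b'), ('star', ('lit', 'b'))))
```

The nesting of the tuple mirrors the nesting of the function calls, which is what makes this representation compositional.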


## Compiling

Let's summarize how *interpreters* work. In the case of REs we have *patterns*, e.g., `(a|b)+`, and we have *languages*, i.e., set of strings like `{'a', 'b', 'ab', 'ba', ...}` defined by the pattern, and then we have **interpreters** like `matchset(pattern, text)` which return a set of strings. We say that `matchset` is an interpreter because it takes a pattern as a data structure and operates over that pattern. As we can see from its implementation, it has a big `if-elif-else` statement where it checks what type of operator we have in order to select the next action.

There is an inherent inefficiency, in that the pattern is only defined once, but we may want to apply that same pattern to many different texts. Every time we have to go through the sequence of `if-elif-else` in order to figure out what type of operator we have, but we should already know that.

There is another type of language processor, called the **compiler**, which does all this work once, the very first time the pattern is defined. Whereas an interpreter takes a pattern and a text and operates on those, a compiler works in two steps. In the first step a compilation function takes just the pattern and returns a *compiled object*, let's call it `c`. In the second step the compiled object is executed with the text as its argument: `c(text)`. While in the interpreter all the work is done by the interpreter itself, in our case by `matchset()`, with a compiler some of the work is done once during the compilation stage and the rest happens every time we get a new text.

In the case of `lit` our API returns:

```python
def lit(s):
    return ('lit', s)
```

We defined `matchset(pattern, text)` such that when the pattern contains `'lit'` we compute the remainders as:

```python
def matchset(pattern, text):
    op, x, y = components(pattern)
    # other code
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null
    # other code
```

Now, as soon as we construct a literal, instead of getting a tuple we will get a *function* that returns the set that `matchset()` would have given us. We can then apply this function to `text`.

```python
def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null
```
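A short usage sketch makes the two stages visible: the pattern is compiled once and the resulting function is then applied to several texts. The sample texts are made up for illustration.

```python
null = frozenset()

def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null

pat = lit('abc')   # compilation step: happens once

# execution step: happens for every new text
for text in ['abcdef', 'abc', 'xyz']:
    print(repr(text), '->', pat(text))
# 'abcdef' -> {'def'}
# 'abc' -> {''}
# 'xyz' -> frozenset()
```

No operator dispatch happens inside the loop: the decision that this pattern is a literal was made when `lit('abc')` was called.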

## Lower Level Compilers

We can define a pattern, say `pat = lit('a')`, which is now a function rather than a tuple; calling `pat(text)` gives us the set of remainders. In an interpreter we have patterns that describe the strings, i.e. the language. In a compiler we have two sets of descriptions to deal with: a description of what the pattern looks like and a description of what the compiled code looks like. In our case, the compiled code consists of Python functions, which are a good target representation because they are flexible. Compilers for languages like C generate the actual machine instructions for the computer, and this is a complex process. An intermediate approach is to generate code for a virtual machine; Java and Python follow this approach. The `dis()` function from the `dis` module disassembles a function and prints the *bytecode* that the Python virtual machine executes.

In [10]:
import dis
import math

dis.dis(lambda x, y: math.sqrt(x**2 + y**2))

  4           0 RESUME                   0
              2 LOAD_GLOBAL              0 (math)
             14 LOAD_METHOD              1 (sqrt)
             36 LOAD_FAST                0 (x)
             38 LOAD_CONST               1 (2)
             40 BINARY_OP                8 (**)
             44 LOAD_FAST                1 (y)
             46 LOAD_CONST               1 (2)
             48 BINARY_OP                8 (**)
             52 BINARY_OP                0 (+)
             56 PRECALL                  1
             60 CALL                     1
             70 RETURN_VALUE


In the code below, we implement the compiled versions of `lit(s)`, `seq(x, y)`, `alt(x, y)` and the other operators. The point to remember is that `x` and `y` are now functions that return a set, therefore the function compiled by `alt(x, y)` returns the union of `x(text)`, which is a set, and `y(text)`, which is another set.

In [11]:
def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null

def seq(x, y):
    return lambda text: set().union(*map(y, x(text)))

def alt(x, y):
    return lambda text: x(text).union(y(text))
        
null = frozenset([])

def oneof(chars):
    return lambda t: set([t[1:]]) if (t and t[0] in chars) else null

def star(x): return lambda t: (set([t]) | 
                               set(t2 for t1 in x(t) if t1 != t
                                   for t2 in star(x)(t1)))

dot = lambda t: set([t[1:]]) if t else null
eol = lambda t: set(['']) if t == '' else null

def test():
    g = alt(lit('a'), lit('b'))
    assert g('abc') == set(['bc'])
    return 'test passes'

print(test())

test passes


With this approach, `match(pattern, text)` can be written as follows.

In [12]:
def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = pattern(text)
    if remainders:
        shortest = min(remainders, key=len)
        return text[:len(text)-len(shortest)]
    
def test():
    assert match(star(lit('a')), 'aaaaabbbaa') == 'aaaaa'
    assert match(lit('hello'), 'hello how are you?') == 'hello'
    assert match(lit('x'), 'hello how are you?') == None
    assert match(oneof('xyz'), 'x**2 + y**2 = r**2') == 'x'
    assert match(oneof('xyz'), '   x is here!') == None
    return 'tests pass'

print(test())

tests pass


## Recognizers and Generators

What we have done so far is the **recognizer task**. We have a function `match(pattern, text)` which *recognizes* if the prefix of `text` is in the language defined by `pattern`.

The **generator task** takes a pattern `pattern` and generates the complete language defined by that pattern. For example, the pattern `(a|b)(a|b)` generates the language `{'aa', 'ab', 'ba', 'bb'}`. If the pattern is `a*` the corresponding language is an infinite set. We could use a generator function to generate each element of such a language one at a time, but we will instead limit the sizes of the strings we want, and this will always produce finite sets. We will take the compiler's approach: instead of calling the generator as a function on the pattern, `gen(pat)`, we will have the generator compiled into the pattern. `pat` will therefore be a function, which we apply to a set of integers representing the lengths that we want to retrieve, and which returns a set of strings. For example, given `pat = a*`, `pat({1, 2, 3})` should return all strings of length 1, 2, or 3, i.e., `{'a', 'aa', 'aaa'}`. The functions below implement the generators for the various operations.

This is the whole compiler.

In [13]:
def lit(s):
    return lambda Ns: set([s]) if len(s) in Ns else null

def alt(x, y):
    return lambda Ns: x(Ns) | y(Ns)

def star(x):
    return lambda Ns: opt(plus(x))(Ns)

def plus(x):
    return lambda Ns: genseq(x, star(x), Ns, startx=1) #Tricky

def oneof(chars):
    return lambda Ns: set(chars) if 1 in Ns else null

def seq(x, y):
    return lambda Ns: genseq(x, y, Ns)

def opt(x):
    return alt(epsilon, x)

dot = oneof('?')    # You could expand the alphabet to more chars.
epsilon = lit('')   # The pattern that matches the empty string.

def test():
    
    f = lit('hello')
    assert f(set([1, 2, 3, 4, 5])) == set(['hello'])
    assert f(set([1, 2, 3, 4]))    == null 
    
    g = alt(lit('hi'), lit('bye'))
    assert g(set([1, 2, 3, 4, 5, 6])) == set(['bye', 'hi'])
    assert g(set([1, 3, 5])) == set(['bye'])
    
    h = oneof('theseletters')
    assert h(set([1, 2, 3])) == set(['t', 'h', 'e', 's', 'l', 'r'])
    assert h(set([2, 3, 4])) == null
    
    return 'tests pass'

print(test())

tests pass


We can make the compiler more efficient. For example, in the definition of `lit(s)` we call `set([s])` every time the output of `lit(s)` is called. This seems wasteful. A better way of writing it is:

In [14]:
def lit(s):
    set_s = set([s])  # We create this only once
    # Every time we call the function below, we refer to the set_s defined above
    return lambda Ns: set_s if len(s) in Ns else null

Similarly, we can pull out the `set(chars)` in the definition of `oneof(chars)`.
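A minimal sketch of that change, following the same pattern as `lit(s)`; the local name `set_chars` is chosen here for illustration.

```python
null = frozenset()

def oneof(chars):
    set_chars = set(chars)   # built once, when the pattern is compiled
    # Every call to the returned function reuses the same set object.
    return lambda Ns: set_chars if 1 in Ns else null

h = oneof('theseletters')
print(h({1, 2, 3}) == {'t', 'h', 'e', 's', 'l', 'r'})   # True
print(h({2, 3, 4}) == null)                             # True
```

One consequence of sharing the set is that callers receive the same object every time, so they should treat the result as read-only.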

We still must define `genseq()`. If we pass two arguments `x, y` to `seq(x, y)` (not `genseq()`), we get back a function of `Ns` that returns the set of matching texts. In this respect, `seq(x, y)` delays the computation of the output, whereas `genseq(x, y, Ns)` calculates the output set immediately. One thing we know about this function is that it will have to call `x(Nx)`, where `Nx` is a set of lengths we don't yet know, and `y(Ny)`, where `Ny` is a possibly different set of lengths; it then has to concatenate the results and check whether the length of each concatenation is in the allowable set `Ns`. What do we know about `Nx` and `Ny` with respect to `Ns`? `Ns` could be a dense set, say `{0, 1, 2, ..., 10}`, or a sparse set, say just `{10}`, but in either case the length of the `x` part plus the length of the `y` part can be at most 10, so the `x` part can have any length up to 10. For `y(Ny)` we have two choices: we could wait for `x(Nx)` to return its results and pass them through `y`, or we could compute both at once, then combine them and see which combinations fit. The second option is easier, because then `Ny` can also contain any length up to 10 in our example. So both `Nx` and `Ny` range over every length up to 10 inclusive, and for each pair of results we add up the lengths and check whether the total is in `Ns`. A candidate solution for the `genseq()` function is shown below.

In [15]:
def genseq(x, y, Ns):
    Nss = range(max(Ns) + 1)
    return set(m1 + m2
               for m1 in x(Nss)
               for m2 in y(Nss)
               if len(m1 + m2) in Ns)

This function, however, can give rise to infinite recursion. Where do we use recursion? In two functions: `plus()` and `star()`. But `star()` is defined in terms of `plus()`, so it is `plus()` that we need to fix. We are essentially defining `x+` as `xx*`, i.e., `seq(x, star(x))`, and `x*` in turn expands to an optional `x+`. For the recursion to end, each recursive `x*` must be asked for strictly shorter strings than the call that spawned it. The candidate `genseq()` above does not guarantee this, since it hands `y` the full range of lengths up to `max(Ns)` every time. Even a version that subtracted the length of the `x` match would stall whenever `x` can produce the empty string, as in `pat = plus(opt(a))`, where `opt(a)` is either `a` or the empty string: if the first `x` contributes no characters, the recursive `x*` is asked for exactly the same lengths as before, and we keep going forever.

This is why we have `startx=1` in `plus()`, i.e., we always ask the first `x` to have a length of at least 1, so that the recursive `star(x)` is left with strictly fewer characters. This is how we break the recursion. We redefine `genseq()` as:

In [16]:
def genseq(x, y, Ns, startx=0):
    "Set of matches to xy whose total len is in Ns, with x-match's len in Ns-len(..."
    # Tricky part: x+ is defined as x+ = x x*
    # To stop the recursion, the first x must generate at least 1 char,
    # and then the recursive x* has that many fewer characters. We use
    # startx=1 to say that x must match at least 1 character.
    if not Ns:
        return null
    xmatches = x(set(range(startx, max(Ns) + 1)))
    Ns_x = set(len(m) for m in xmatches)
    Ns_y = set(n - m for n in Ns for m in Ns_x if n - m >= 0)
    ymatches = y(Ns_y)
    return set(m1 + m2
               for m1 in xmatches for m2 in ymatches
               if len(m1 + m2) in Ns)

def test_gen():
    def N(hi):
        return set(range(hi + 1))
    a, b, c = map(lit, 'abc')
    assert star(oneof('ab'))(N(2)) == set(['', 'a', 'aa', 'ab', 'ba', 'bb', 'b'])
    assert (seq(star(a), seq(star(b), star(c)))(set([4])) == set(
        ['aaaa', 'aaab', 'aaac', 'aabb', 'aabc', 'aacc', 'abbb', 'abbc',
         'abcc', 'accc', 'bbbb', 'bbbc', 'bbcc', 'bccc', 'cccc']))
    assert (seq(plus(a), seq(plus(b), plus(c)))(set([5])) == set(
        ['aaabc', 'aabbc', 'aabcc', 'abbbc', 'abbcc', 'abccc']))
    assert (seq(oneof('bcfhrsm'), lit('at'))(N(3)) == set(
        ['bat', 'cat', 'fat', 'hat', 'mat', 'rat', 'sat']))
    assert (seq(star(alt(a, b)), opt(c))(set([3])) == set(
        ['aaa', 'aab', 'aac', 'aba', 'abb', 'abc', 'baa', 'bab',
         'bac', 'bba', 'bbb', 'bbc']))
    assert lit('hello')(set([5])) == set(['hello'])
    assert lit('hello')(set([4])) == set()
    assert lit('hello')(set([6])) == set()
    return 'test_gen passes'

print(test_gen())

test_gen passes
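The problematic pattern `plus(opt(a))` can be tried directly against the `startx` version of `genseq()`. The sketch below restates only the generator functions it needs so that it runs on its own; the set of lengths `{0, 1, 2}` is an arbitrary choice.

```python
null = frozenset()

def lit(s):
    return lambda Ns: {s} if len(s) in Ns else null

def alt(x, y):
    return lambda Ns: x(Ns) | y(Ns)

def opt(x):
    return alt(epsilon, x)

def star(x):
    return lambda Ns: opt(plus(x))(Ns)

def plus(x):
    return lambda Ns: genseq(x, star(x), Ns, startx=1)

def genseq(x, y, Ns, startx=0):
    "Matches to xy whose total length is in Ns; x must produce >= startx chars."
    if not Ns:
        return null
    xmatches = x(set(range(startx, max(Ns) + 1)))
    Ns_x = {len(m) for m in xmatches}
    Ns_y = {n - m for n in Ns for m in Ns_x if n - m >= 0}
    ymatches = y(Ns_y)
    return {m1 + m2 for m1 in xmatches for m2 in ymatches
            if len(m1 + m2) in Ns}

epsilon = lit('')
a = lit('a')

result = plus(opt(a))({0, 1, 2})   # terminates instead of recursing forever
print(sorted(result))              # ['a', 'aa']
```

A side effect of forcing the first `x` to produce at least one character is that the empty string is not generated for this pattern, even though `(a?)+` can match it.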


In all the above we have taken advantage of the *composability* of Python functions. Functions, unlike statements and expressions, which can only be composed by the programmer, can be composed dynamically. Functions also provide *control over time*: we can divide some of the work we want to do such that we do some now and some later. Expressions don't allow this separation.

## Changing `seq()`

The `seq()` function is binary, in the sense that it takes two arguments. If we want a sequence of four objects, say `a, b, c, d`, we need to call `seq(a, seq(b, seq(c, d)))`. It would be much easier if we could just write `seq(a, b, c, d)`. We want to refactor this function, but aren't we changing its API? We should ask ourselves:

1. Which other functions does `seq()` interact with in our program?
2. If I change `seq()`, are these changes *backward compatible*? In other words, do I have to modify also the functions `seq()` interacts with?
3. Are the changes *internal* or *external*? Am I changing something inside `seq()` that doesn't affect the callers, or am I changing the interface to the outside world?

### Function mapping: decorators

We can refactor `seq()` without changing the API. To do this, we need to *map* our binary function `f(x, y)` and convert it to an n-ary function `g(x, y, ...)`. This mapping is done via a function, `n_ary()` in the example below, that takes the binary function `f(x, y)` and returns an n-ary function.

In [17]:
def n_ary(f):
    """Given binary function f(x, y), return an n_ary function such that
    f(x, y, z) = f(x, f(y, z)) etc. Also allow for f(x) = x."""
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

myseq = n_ary(myseq)
print(myseq('a', 'b', 'c'))

('myseq', 'a', ('myseq', 'b', 'c'))


The pattern above is so common in Python that there is a special notation: the **decorator notation**. We can leverage it as follows:

In [18]:
@n_ary
def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

print(myseq('a', 'b', 'c'))

('myseq', 'a', ('myseq', 'b', 'c'))


One limitation, however, is that if we check the docstring of the decorated function, this is what we get:

In [19]:
help(myseq)

Help on function n_ary_f in module __main__:

n_ary_f(x, *args)



Luckily `functools` has a function called `update_wrapper()` that takes two functions and copies the name, the documentation plus other things from the old function to the new function. For this, we need to modify the definition of `n_ary()`.

In [20]:
from functools import update_wrapper

def n_ary(f):
    """Given binary function f(x, y), return an n_ary function such that
    f(x, y, z) = f(x, f(y, z)) etc. Also allow for f(x) = x."""
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    update_wrapper(n_ary_f, f)  # update_wrapper(new_fn, old_fn) 
    return n_ary_f

@n_ary
def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

In [21]:
help(myseq)

Help on function myseq in module __main__:

myseq(x, y)
    My own seq function (meh)



An even better approach is to create our own decorator that adds the `update_wrapper()` call to, in the case above, `n_ary_f()`. We will call this new decorator `@decorator`. If we use `n_ary()` as a decorator on `myseq()` and apply `@decorator` to `n_ary()`, we have two updates to consider: one for the function we want to decorate, and one for the decorator itself. The pattern we want to follow is:

```python
def decorator(d):  # d is a decorator function
    def _d(f):
        return update_wrapper(d(f), f)
    update_wrapper(_d, d)
    return _d
```

With this setup, `n_ary = decorator(n_ary)` would be updated by `update_wrapper(_d, d)` and `myseq = n_ary(myseq)` would be updated by `update_wrapper(d(f), f)`.

This is what we ultimately get:

In [22]:
def decorator(d):
    "Make function d a decorator. d wraps a function f"
    def _d(f):
        return update_wrapper(d(f), f)
    update_wrapper(_d, d)
    return _d

@decorator
def n_ary(f):
    """Given binary function f(x, y), return an n-ary function such that
    f(x, y, z) = f(x, f(y, z)), etc. Also allow for f(x) = x."""
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

@n_ary
def myseq(x, y):
    return ('myseq', x, y)

In [23]:
help(myseq)

Help on function myseq in module __main__:

myseq(x, y)



Even more confusingly, the following code also works (due to [Darius Bacon](https://github.com/darius)). **TODO** understand what's going on here. Video 46 - Decorated Decorators Solution

In [24]:
def decorator(d):
    "Make function d a decorator. d wraps a function fn."
    return lambda fn: update_wrapper(d(fn), fn)

decorator = decorator(decorator)
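One way to convince ourselves that the short version behaves like the longer one is to apply it and inspect the names and docstrings. The sketch below is self-contained and reuses `n_ary()` and `myseq()` from above; the docstrings are shortened for illustration.

```python
from functools import update_wrapper

def decorator(d):
    "Make function d a decorator. d wraps a function fn."
    return lambda fn: update_wrapper(d(fn), fn)

# After this line, decorator(d) returns the lambda above, updated to look like d.
decorator = decorator(decorator)

@decorator
def n_ary(f):
    "Turn binary f(x, y) into an n-ary function."
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

@n_ary
def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

print(n_ary.__name__)          # n_ary
print(myseq.__name__)          # myseq
print(myseq.__doc__)           # My own seq function (meh)
print(myseq('a', 'b', 'c'))    # ('myseq', 'a', ('myseq', 'b', 'c'))
```

The first application, `decorator(decorator)`, plays the role of `update_wrapper(_d, d)` in the longer version: it makes whatever `decorator(d)` returns look like `d`. The inner `update_wrapper(d(fn), fn)` then makes the decorated function look like `fn`.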

## Cache Management

We want to leverage the concept of **memoization**. Particularly with recursive functions, we end up making the same function calls over and over again. If the result of those calls does not change and takes a long time to compute, it is better to store each input and its corresponding result in a cache. For example:

```python
cache = {}

def fib(n):
    if n in cache:
        return cache[n]
    # Compute the result once and store it before returning
    cache[n] = result = 1 if n <= 1 else fib(n - 1) + fib(n - 2)
    return result
```

However, we may have many functions where we may want to use memoization, and we don't want to rewrite the code above over and over. We can implement this with a decorator, let's call it `@memo`. It looks like this:

In [25]:
@decorator
def memo(f):
    """Decorator that caches the return value for each call to f(args).
    Then, when called again with same args, we can just look it up."""
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # Some elements of args can't be a dict key
            return f(*args)
    return _f

Here we have used a `try-except` pattern rather than an `if-else` one. It's like asking for forgiveness (`try-except`) as opposed to asking for permission (`if-else`). In this case we use `try-except` because there is a second type of exception to handle: `TypeError`, which is raised when an argument is not hashable, for example if we use a list as a key. Mutable objects such as lists are not hashable because their hash value could change while they sit in the dictionary. Suppose we used a particularly simple hash function for lists of integers, say the sum of the elements: `y = [1, 2, 3]` would be associated with the hash value 6. If, however, we modify the list so that `y[0] = 10`, the hash value becomes 15, and the entry stored under 6 can no longer be found.
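
The difference between hashable and unhashable arguments can be checked directly. The helper `is_hashable()` below is hypothetical (it is not part of the lesson code) and only wraps the built-in `hash()` in the same `try-except` style.

```python
def is_hashable(obj):
    "Hypothetical helper: True if obj can be used as a dictionary key."
    try:
        hash(obj)
        return True
    except TypeError:
        return False

print(is_hashable((1, 2, 3)))    # True: tuples of numbers are immutable
print(is_hashable([1, 2, 3]))    # False: lists are mutable
print(is_hashable(([1], 2)))     # False: a tuple containing a list
```

Since `memo` uses the whole `args` tuple as the key, a single unhashable argument is enough to trigger the `TypeError` branch, in which case the function is simply called without caching.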

To see how effective our `@memo` decorator is we may compare the decorated version of the function with the original one. We may measure time or, more interestingly, the number of function calls. We did something like this in a previous lesson, but now we will do it with a decorator.

In [26]:
@decorator
def countcalls(f):
    "Decorator that makes the function count calls to it, in callcounts[f]."
    def _f(*args):
        callcounts[_f] += 1
        return f(*args)
    callcounts[_f] = 0
    return _f

callcounts = {}

@countcalls
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

fib(10)

89

In [27]:
@countcalls
@memo
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

fib(10)

89
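
To compare the two versions side by side, the sketch below keeps them under two different names, `fib_plain` and `fib_memo` (names chosen here for illustration), and reads the counters after one call to each. It repeats the decorators so that it can be run on its own.

```python
from functools import update_wrapper

def decorator(d):
    def _d(f):
        return update_wrapper(d(f), f)
    update_wrapper(_d, d)
    return _d

callcounts = {}

@decorator
def countcalls(f):
    def _f(*args):
        callcounts[_f] += 1
        return f(*args)
    callcounts[_f] = 0
    return _f

@decorator
def memo(f):
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            return f(*args)
    return _f

@countcalls
def fib_plain(n):
    return 1 if n <= 1 else fib_plain(n - 1) + fib_plain(n - 2)

@countcalls
@memo
def fib_memo(n):
    return 1 if n <= 1 else fib_memo(n - 1) + fib_memo(n - 2)

plain_result = fib_plain(10)
memo_result = fib_memo(10)
print(callcounts[fib_plain])  # 177
print(callcounts[fib_memo])   # 19
```

Because `@countcalls` is the outer decorator, cache hits are counted too; the 19 calls are the top-level call plus two recursive calls for each `n` from 2 to 10.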

**TODO** this testing function is currently useless. It already treats the API calls as functions rather than as the tuples used above.

One fascinating thing is that, in the case of the function *without* memoization, if we compute the number of function calls for `fib(n)`, let's call it `calls[n]`, and divide by the number of calls for `fib(n-1)`, let's call it `calls[n-1]`, this ratio tends to the golden ratio $(1 + \sqrt{5})/2$ for n tending to infinity.
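
The claim can be checked numerically without running `fib` at all, because the call counts satisfy the recurrence `calls[n] = 1 + calls[n-1] + calls[n-2]` with `calls[0] = calls[1] = 1`. The function `calls()` below is a small helper written for this check, not part of the lesson code.

```python
def calls(n):
    "Number of calls made by the unmemoized fib(n), fib(n) itself included."
    counts = [1, 1]                      # fib(0) and fib(1) make one call each
    for _ in range(2, n + 1):
        # One call for fib(n) itself plus the calls made by its two children
        counts.append(1 + counts[-1] + counts[-2])
    return counts[n]

golden = (1 + 5 ** 0.5) / 2
for n in (5, 10, 20, 30):
    print(n, calls(n) / calls(n - 1), golden)
```

The printed ratios approach the golden ratio quickly as `n` grows.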

## Trace Tool

We have seen three main types of tools:

1. Debugging tools: `countcalls()`.
2. Performance tools: `memo()`.
3. Expressiveness tools: `n_ary()`.

Expressiveness tools give you more power to say more about your language. Performance tools make things faster. We want to add a debugging tool called `trace()` that helps to debug functions by printing every call and every return. In our Fibonacci example, every time we make a call we indent to the right. Every time we return we de-indent to the left.

In [28]:
@decorator
def trace(f):
    indent = '   '  # Indentation per nesting level (assumed value; not defined in the original)
    def _f(*args):
        signature = '%s(%s)' % (f.__name__, ', '.join(map(repr, args)))
        print('%s--> %s' % (trace.level * indent, signature))
        trace.level += 1
        try:
            result = f(*args)
            print('%s<-- %s === %s' % ((trace.level - 1) * indent, signature, result))
        finally:
            # If f raises an exception, we still want to decrement the level.
            trace.level -= 1
        return result
    trace.level = 0
    return _f
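
A short usage sketch shows the kind of output to expect. It repeats `decorator` and `trace` so that it can be run on its own, and it assumes an indentation unit of three spaces.

```python
from functools import update_wrapper

def decorator(d):
    def _d(f):
        return update_wrapper(d(f), f)
    update_wrapper(_d, d)
    return _d

@decorator
def trace(f):
    indent = '   '   # assumed indentation unit: three spaces per level
    def _f(*args):
        signature = '%s(%s)' % (f.__name__, ', '.join(map(repr, args)))
        print('%s--> %s' % (trace.level * indent, signature))
        trace.level += 1
        try:
            result = f(*args)
            print('%s<-- %s === %s' % ((trace.level - 1) * indent, signature, result))
        finally:
            trace.level -= 1
        return result
    trace.level = 0
    return _f

@trace
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

fib(2)
# --> fib(2)
#    --> fib(1)
#    <-- fib(1) === 1
#    --> fib(0)
#    <-- fib(0) === 1
# <-- fib(2) === 2
```

Each arrow pointing right marks a call and each arrow pointing left marks the matching return, so the nesting of the recursion can be read off the indentation.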

## Disable Decorator

We want to have one more debugging tool that we will call `disabled()`. This is just the identity function, and the idea is to deactivate some of the decorators we may have used. If we redefine `trace = disabled` and reload our program, `@trace` will return the function itself, unchanged.
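
Since `disabled()` is described as the identity function, a minimal sketch needs only a couple of lines:

```python
def disabled(f):
    "Identity decorator: return f unchanged."
    return f

trace = disabled   # switch tracing off without touching the decorated code

@trace
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

print(fib(10))     # 89, with no trace output
```

The decorated code stays exactly as it was; only the one-line rebinding of `trace` needs to be added or removed.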

## Back To Languages

Imagine we want to manipulate algebraic expressions like `y + (3 + x)`. Regular expressions can handle parentheses nested to a fixed depth, but not to an arbitrary depth (the problem is making sure that right parentheses match left parentheses). We will need **context-free languages** to solve this problem.

If we have an expression like `m * x + b`, read as `(m * x) + b`, we want to make sure that `x` is parsed together with `m` and not together with `b`. In other words, we want to be able to write expressions like `... + ... + ...`, where multiplications are part of the `...`. We refer to the `...` parts as **terms**. We want to write a grammar that defines the language of these expressions. Remember that the grammar is the *description* and the language is the set of all possible strings described by the grammar.

In general we say that an expression consists of a term followed by an operator (for simplicity, only `+` or `-`) and another expression, or else of a single term. The `| Term` alternative covers the case where the whole expression consists of one single term; this is the base case of our recursive definition of an expression.

```
Exp => Term [-+] Exp | Term
```

The rule for a term is similar, with factors joined by `*` or `/`:

```
Term => Factor [*/] Term | Factor
```

For example:

- `a`: an expression consisting of one term, `a`.
- `a + b`: an expression consisting of two terms.
- `a + b + c`: an expression consisting of three terms.

We can then write the rules for the remaining categories (factors, function calls, variables, and numbers). Norvig writes down the following as the complete grammar for our language. We don't yet have something to interpret it, but we are operating on a "wishful thinking" basis.

```text
Exp => Term [+-] Exp | Term
Term => Factor [*/] Term | Factor
Factor => Funcall | Var | Num | [(] Exp [)]
Funcall => Var [(] Exps [)]
Exps => Exp [,] Exps | Exp
Var => [a-zA-Z_]\w*
Num => [-+]?[0-9]+([.][0-9]*)?
```

Now we need a function that can take the string above and turn it into something a parser can use, ideally a function `G = grammar(...)` where `...` is the set of rules above, represented as a string. What would be a good format for `G`? A dictionary whose keys are the names on the left-hand side of the rules and whose values are objects representing the right-hand sides. Norvig's choice is to represent each right-hand side as a tuple of alternatives. He uses tuples and not sets because order matters. Each element of the tuple is a sequence, more precisely a list, and each element of the list is an atom, where an atom is either the name of one of the categories in the rules above or a regular expression matching a token. It would look something like

```python
G = {'Exp': (['Term', '[+-]', 'Exp'], ['Term']),
     'Term': (['Factor', '[*/]', 'Term'], ['Factor'])}
```

We could have written the grammar as a dictionary from the beginning, but the idea here is to first imagine the language as we wish it was, and then and only then write it in a way the computer can understand. All we have to do is write the function `grammar()` that converts the string of rules into a dictionary.

In [29]:
description = r"""
Exp     => Term [+-] Exp | Term
Term    => Factor [*/] Term | Factor
Factor  => Funcall | Var | Num | [(] Exp [)]
Funcall => Var [(] Exps [)]
Exps    => Exp [,] Exps | Exp
Var     => [a-zA-Z_]\w*
Num     => [-+]?[0-9]+([.][0-9]*)?
"""

def split(text, sep=None, maxsplit=-1):
    "Like str.split applied to text, but strips whitespace from each piece."
    return [t.strip() for t in text.strip().split(sep, maxsplit) if t]

def grammar(description):
    """Convert a description to a grammar."""
    G = {}
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G

In [30]:
tmp = grammar(description)
tmp

{'Exp': (['Term', '[+-]', 'Exp'], ['Term']),
 'Term': (['Factor', '[*/]', 'Term'], ['Factor']),
 'Factor': (['Funcall'], ['Var'], ['Num'], ['[(]', 'Exp', '[)]']),
 'Funcall': (['Var', '[(]', 'Exps', '[)]'],),
 'Exps': (['Exp', '[,]', 'Exps'], ['Exp']),
 'Var': (['[a-zA-Z_]\\w*'],),
 'Num': (['[-+]?[0-9]+([.][0-9]*)?'],)}

The grammar so defined would parse an expression like `m*x+b` correctly, but would fail on `m * x + b`. The following version of `grammar()` allows for spaces.

In [31]:
def grammar(description, whitespace=r'\s*'):
    """Convert a description to a grammar. Each line is a rule for a
    non-terminal symbol; it looks like this:
        Symbol => A1 A2 ... | B1 B2 ... | C1 C2 ...
    where the right-hand side is one or more alternatives, separated by '|'.
    Each alternative is a sequence of atoms, separated by spaces.
    An atom is either a symbol on some left-hand side, or it is a regular
    expression that will be passed to re.match to match a token.
    Notation for *, +, or ? not allowed in a rule alternative (but ok within
    a token). Use '\' to continue long lines. You must include spaces or tabs
    around '=>' and '|'. That's within the grammar description itself.
    The grammar that gets defined allows whitespace between tokens by default.
    Specify '' as the second argument to grammar() to disallow this (or supply
    any regular expression to describe allowable whitespace between tokens)."""
    G = {' ': whitespace}
    description = description.replace('\t', ' ') # Replace tabs with spaces
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G
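
A brief usage sketch of this version follows. It repeats `split()` and `grammar()` without docstrings so that it runs on its own, and it uses a shortened three-rule grammar chosen only for illustration.

```python
def split(text, sep=None, maxsplit=-1):
    return [t.strip() for t in text.strip().split(sep, maxsplit) if t]

def grammar(description, whitespace=r'\s*'):
    G = {' ': whitespace}
    description = description.replace('\t', ' ')
    for line in split(description, '\n'):
        lhs, rhs = split(line, ' => ', 1)
        alternatives = split(rhs, ' | ')
        G[lhs] = tuple(map(split, alternatives))
    return G

# Shortened grammar, for illustration only
G = grammar(r"""
Exp  => Term [+-] Exp | Term
Term => Var [*/] Term | Var
Var  => [a-zA-Z_]\w*
""")

print(G[' '])     # \s*
print(G['Exp'])   # (['Term', '[+-]', 'Exp'], ['Term'])
print(G['Var'])   # (['[a-zA-Z_]\\w*'],)
```

The whitespace pattern is stored under the key `' '`, which can never clash with a symbol name because symbol names are stripped of spaces.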

## Parser

We now need a parser, i.e., a function `parse(symbol, text, G)`. `text` is the text we want to parse and `G` is the grammar. This function returns a single result, i.e., a single remainder and not a set of remainders as in the previous case. This makes parsing unambiguous. For example, if we want to parse an `Exp` we first try the first alternative (remember, `Exp => Term [+-] Exp | Term`). If it succeeds, we don't look at the other alternative. We read rules from left to right, which is why earlier we said that order matters. If we wrote `Exp => Term | Term [+-] Exp` and tried to parse `a + 3`, the parser would stop at `a` and leave ` + 3` unparsed.

Previously we used regular expressions as *recognizers*. Here we use them as **parsers**, where they recognize whether an expression is part of the language, but they also provide an internal structure, the parse tree. Ultimately `parse(symbol, text, G)` will return a tuple with the parse tree and the remainder. If we parse

```python
parse('Exp', 'a * x', G)
```

we want to get back

```python
(['Exp', ['Term', ['Factor', ['Var', 'a']], '*', ['Term', ['Factor', ['Var', 'x']]]]],
 '')
```

The first row is the parse tree; the second row, containing only the empty string, is the remainder, which is empty because we consumed the whole expression. We will assume that failure returns the tuple `(None, None)`.

There are four cases we need to parse. We must be able to parse

1. An expression: `Exp`.
2. A regular expression: `[+-]`.
3. Alternatives: `([...], [...])`.
4. A list of atoms representing a sequence: `[..., ..., ...]`.

We will associate the following functions with the 4 cases above.

1. `parse_atom()`.
2. A variable `tokenizer`.
3. Included in `parse_atom()`.
4. `parse_sequence()`.

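The tokenizer has two jobs: first, it has to handle whitespace that occurs before the token; second, it has to match the token itself and capture it. `parse_sequence()` parses the atoms of a sequence one after the other and fails as soon as one of them fails.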
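
The tokenizer is just a regular expression template: the whitespace pattern followed by a capturing group for the token. A small sketch, assuming the default whitespace pattern `\s*`:

```python
import re

whitespace = r'\s*'
tokenizer = whitespace + '(%s)'      # same construction as grammar[' '] + '(%s)'

text = '  + b'
m = re.match(tokenizer % '[+-]', text)
print(repr(m.group(1)))              # '+'
print(repr(text[m.end():]))          # ' b'
```

`m.group(1)` is the token without the leading whitespace, and slicing at `m.end()` gives the remainder that is handed to the next atom.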

In [32]:
import re

def parse(start_symbol, text, grammar):
    """Example call: parse('Exp', '3*x + b', G).
    Returns a (tree, remainder) pair. If remainder is '', it parsed the whole
    string. Failure iff remainder is None. This is a deterministic PEG parser,
    so rule order (left-to-right) matters. Do 'E => T op E | T', putting the
    longest parse first; don't do 'E => T | T op E'.
    Also, no left recursion allowed: don't do 'E => E op T'."""
    tokenizer = grammar[' '] + '(%s)'

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None:
                return Fail
            result.append(tree)
        return result, text

    @memo
    def parse_atom(atom, text):
        if atom in grammar:  # Non-terminal: tuple of alternatives
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None:
                    return [atom] + tree, rem
            return Fail
        else:  # Terminal: match characters against start of text
            m = re.match(tokenizer % atom, text)
            return Fail if (not m) else (m.group(1), text[m.end():])
    
    # Body of parse:
    return parse_atom(start_symbol, text)

Fail = (None, None)

You may notice there is a `@memo` decorator above `parse_atom`. Why? Suppose we have a very long term `(x+y....)` and we parse it using the rule `Exp => Term [+-] Exp | Term`. We parse the long term, then we look for a `[+-]`. If we do not find it, we fall back to the alternative, `Term`, which implies that we have to go back and parse the long term again. This is inefficient. We would like our parser to do this work only once. Adding the `@memo` decorator makes this parser faster.

`parse_atom(atom, text)` takes two strings as arguments, and strings are hashable. `parse(start_symbol, text, grammar)`, however, also takes the `grammar` argument, which is a dictionary and therefore not hashable, so we cannot memoize the whole `parse()` function.
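
To see the pieces working together, here is a condensed, self-contained sketch of the grammar builder and the parser. It makes two simplifying assumptions: a plain dictionary inside `parse()` stands in for the `@memo` decorator, and the grammar leaves out function calls to keep the example short.

```python
import re

def split(text, sep=None, maxsplit=-1):
    return [t.strip() for t in text.strip().split(sep, maxsplit) if t]

def grammar(description, whitespace=r'\s*'):
    G = {' ': whitespace}
    for line in split(description.replace('\t', ' '), '\n'):
        lhs, rhs = split(line, ' => ', 1)
        G[lhs] = tuple(map(split, split(rhs, ' | ')))
    return G

Fail = (None, None)

def parse(start_symbol, text, grammar):
    tokenizer = grammar[' '] + '(%s)'
    cache = {}   # plain dict standing in for the @memo decorator

    def parse_sequence(sequence, text):
        result = []
        for atom in sequence:
            tree, text = parse_atom(atom, text)
            if text is None:
                return Fail
            result.append(tree)
        return result, text

    def parse_atom(atom, text):
        if (atom, text) not in cache:
            cache[atom, text] = parse_atom_uncached(atom, text)
        return cache[atom, text]

    def parse_atom_uncached(atom, text):
        if atom in grammar:   # Non-terminal: try each alternative in order
            for alternative in grammar[atom]:
                tree, rem = parse_sequence(alternative, text)
                if rem is not None:
                    return [atom] + tree, rem
            return Fail
        m = re.match(tokenizer % atom, text)   # Terminal: match a token
        return Fail if not m else (m.group(1), text[m.end():])

    return parse_atom(start_symbol, text)

# Simplified grammar (no function calls), for illustration only
G = grammar(r"""
Exp    => Term [+-] Exp | Term
Term   => Factor [*/] Term | Factor
Factor => Var | Num | [(] Exp [)]
Var    => [a-zA-Z_]\w*
Num    => [-+]?[0-9]+([.][0-9]*)?
""")

tree, remainder = parse('Exp', 'a * x', G)
print(tree)
print(repr(remainder))                      # ''

# Putting the short alternative first makes the parser stop too early
G_bad = dict(G, Exp=(['Term'], ['Term', '[+-]', 'Exp']))
print(repr(parse('Exp', 'a + 3', G_bad)[1]))   # ' + 3'
```

The first call consumes the whole input, while the reordered grammar `G_bad` leaves ` + 3` unparsed, which is the rule-order effect described earlier.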

The grammar below is based on the rules defined at [www.w3.org](http://www.w3.org/Addressing/URL/5_BNF.html) and can be used to parse URLs. The `verify()` function collects the tokens that appear on the left-hand side (lhs) of a rule and those that appear on the right-hand side (rhs). `Terminals` are tokens that are on the rhs but not on the lhs. `Suspects` are the terminals that look like they should appear on the lhs but don't (they consist only of letters and digits, like a symbol name). `Orphans` are tokens that appear on the lhs but not on the rhs, and are therefore useless, unless they are the start symbol (here `url`).

In [33]:
URL = grammar(
    """
url => httpaddress | ftpaddress | mailtoaddress
httpaddress => http:// hostport /path? ?search?
ftpaddress => ftp:// login /path ; ftptype | ftp:// login / path
/path? => path | ()
?search? => [?] search | ()
mailtoaddress => mailto: xalphas @ hostname
hostport => host : port | host
host => hostname | hostnumber
hostname => ialpha . hostname | ialpha
hostnumber => digits . digits . digits . digits
ftptype => A formcode | E formcode | I | L digits
formcode => [NTC]
port => digits | path
path => void | segment / path | segment
segment => xalphas
search => xalphas + search | xalphas
login => userpassword hostport | hostport
userpassword => user : password @ | user @
user => alphanum2 user | alphanum2
password => alphanum2 password | password
path => void | segment / path | segment
void => ()
digits => digits digits | digit
digit => [0-9]
alpha => [a-zA-Z]
safe => [-$_@.&+]
extra => [()!*''""]
escape => % hex hex
hex => [0-9a-fA-F]
alphanum => alpha | digit
alphanums => alpha | digit | [-_.+]
ialpha => alpha xalphas | alpha
xalphas => xalpha xalphas | xalpha
xalpha => alpha | digit | safe | extra | escape
""", whitespace='()')

# Helper function to verify grammars
def verify(G):
    lhstokens = set(G) - set([' '])
    rhstokens = set(t for alts in G.values() for alt in alts for t in alt)
    def show(title, tokens):
        print(title, '=', ' '.join(sorted(tokens)))
    show('Non-Terms', G)
    show('Terminals', rhstokens - lhstokens)
    show('Suspects ', [t for t in (rhstokens - lhstokens) if t.isalnum()])
    show('Orphans ', lhstokens - rhstokens)


In [34]:
verify(URL)

Non-Terms =   /path? ?search? alpha alphanum alphanums digit digits escape extra formcode ftpaddress ftptype hex host hostname hostnumber hostport httpaddress ialpha login mailtoaddress password path port safe search segment url user userpassword void xalpha xalphas
Terminals = % ( () ) + . / /path : ; @ A E I L [()!*''""] [-$_@.&+] [-_.+] [0-9] [0-9a-fA-F] [?] [NTC] [a-zA-Z] alphanum2 ftp:// http:// mailto:
Suspects  = A E I L alphanum2
Orphans  = alphanum alphanums url


## Summary

This unit was about tools: how to build useful tools and how to apply them to components of a domain. We focused on one particular tool: language. We explored the ability to define our own language rather than rely on what Python makes available to us. We talked about grammars, interpreters, and compilers. The other tool we explored is functions. We saw that functions are very powerful in ways statements cannot be, because statements are not as composable as functions are. If you want to reuse a statement the only thing you can do is copy and paste it, and this is a problem when you need to update the statement. With functions we do not have this limitation because we can compose them rather than copy and paste them. When you modify a function, the change automatically takes effect wherever it is called. We talked about decorators as functions, about functions as objects, and we showed some patterns on how to put them together.

## Homework - JSON parser

The homework requires writing a grammar for the JSON language. You can look at [json.org](http://www.json.org). There is a little grammar on the right-hand side of that page, but it is not in the format we expect, so it needs to be translated. You should be able to parse the top level of the text, called `value`.

```python
JSON = grammar(
    ## your description here
)
```

In [35]:
def json_parse(text):
    return parse('value', text, JSON)

def json_test():
    for e in examples:
        print(e, ' == ', parse('value', e, JSON))

examples = ['["testing", 1, 2, 3]',
            '-123.456e+789',
            '{"age": 21, "state": "CO", "occupation": "rides the rodeo"}']

## Solution

In [36]:
JSON = grammar(
"""
object => { } | { members }
members => pair , members | pair
pair => string : value
array => [[] []] | [[] elements []]
elements => value , elements | value
value => string | number | object | array | true | false | null
string => "[^"]*"
number => int frac exp | int frac | int exp | int
int => -?[1-9][0-9]*
frac => [.][0-9]+
exp => [eE][-+]?[0-9]+
""", whitespace=r'\s*'
)

## Homework - Inverse Function

We want to be able to compute inverse functions, and we want to do this once, in general, rather than separately for each function. For example, it is straightforward to write a function to compute the square of a number using only elementary operations (multiplication), but writing a function that computes the square root using only elementary operations is a lot harder (Newton's method). We would like to be able to write just `sqrt = inverse(square)` and be done with it. We are going to consider only functions that are defined on the non-negative numbers and are monotonically increasing. Here is a simple definition, and the goal is to write a more efficient version. The final version should have a runtime closer to the logarithm of the input to `f_1`. Two hints for this algorithm:

1. Binary search.
2. Newton's method.

In [37]:
def inverse(f, delta=1/128.):
    """Given a function y = f(x) that is a monotonically increasing function on
    non-negative numbers, return the function x = f_1(y) that is an approximate
    inverse, picking the closest value to the inverse within delta."""
    def f_1(y):
        x = 0
        while f(x) < y:
            x += delta
        # Now x is too big, x-delta is too small; pick the closest to y
        return x if (f(x) - y < y - f(x - delta)) else x - delta
    return f_1

def square(x):
    return x * x

sqrt = inverse(square)

print(sqrt(100))

10.0


## Solution

We start with x = 1. As long as f(x) is less than y, we double x; we keep doubling until f(x) reaches or overshoots y, which gives us `lo` and `hi`. We then use binary search to zero in on the right value.

In [38]:
def inverse(f, delta=1/1024.):
    """Given a function y = f(x) that is a monotonically increasing function on
    non-negative numbers, return the function x = f_1(y) that is an approximate
    inverse, found by binary search with step delta."""
    def f_1(y):
        lo, hi = find_bounds(f, y)
        return binary_search(f, y, lo, hi, delta)
    return f_1

def find_bounds(f, y):
    "Find values lo, hi such that f(lo) <= y <= f(hi)"
    # Keep doubling x until f(x) >= y; that's hi;
    # and lo will be either the previous x or 0.
    x = 1.
    while f(x) < y:
        x = x * 2.
    lo = 0 if (x == 1) else x/2.
    return lo, x

def binary_search(f, y, lo, hi, delta):
    "Given f(lo) <= y <= f(hi), return x such that f(x) is within delta of y."
    # Continually split the region in half
    while lo <= hi:
        x = (lo + hi) / 2.
        if f(x) < y:
            lo = x + delta
        elif f(x) > y:
            hi = x - delta
        else:
            return x
    return hi if (f(hi) - y < y - f(lo)) else lo
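
The second hint, Newton's method, is not worked out above. The sketch below is one possible way to apply it, not the solution from the lesson: the name `inverse_newton`, its parameters, and the use of a central-difference estimate for the derivative are all assumptions made for illustration.

```python
def inverse_newton(f, tolerance=1e-9, h=1e-6, max_steps=100):
    """Hypothetical Newton-based variant of inverse(): solve f(x) = y by
    repeatedly following the tangent line. All names and defaults here
    are assumptions, not part of the lesson code."""
    def f_1(y):
        x = 1.0                                       # starting guess
        for _ in range(max_steps):
            error = f(x) - y
            if abs(error) <= tolerance:
                break
            slope = (f(x + h) - f(x - h)) / (2 * h)   # numerical derivative
            if slope == 0:
                break                                 # flat spot: give up
            x = max(x - error / slope, 0.0)           # stay non-negative
        return x
    return f_1

def square(x):
    return x * x

def cube(x):
    return x * x * x

sqrt = inverse_newton(square)
cbrt = inverse_newton(cube)
print(sqrt(100))
print(cbrt(27))
```

For smooth functions such as these, Newton's method typically needs only a handful of steps, at the cost of evaluating `f` three times per step to estimate the slope.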

## Homework - Find HTML Tags

Finding HTML tags with `str.find()` is not robust to spaces, line-feeds etc. We are looking only for opening tags, not closing ones. Something like `<a href="unit3.py">`, i.e., an opening angle bracket, a tag, optional attributes and a closing bracket. Write a function `findtags(text)` that returns a list of strings. You can use Python's `re` module, our own implementation, the context-free grammar we created, whatever.

## Solution

The solution below uses the `re` module. For readability we split the regular expression looking for optional attributes and the one looking for tags.

In [39]:
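def findtags(text):
    # Attributes: zero or more name="value" pairs (non-capturing group, so
    # that re.findall returns whole tags rather than tuples of groups)
    parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    return re.findall(tags, text)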
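
One detail of `re.findall()` matters for this exercise: when a pattern contains more than one capturing group, it returns a list of tuples rather than a list of strings. The two simplified patterns below (shorter than the ones in the solution, and chosen only for illustration) show the difference.

```python
import re

text = '<a href="x">'

# Two capturing groups: each match is a tuple (whole tag, last attribute)
capturing = re.findall(r'(<\w+(\s\w+="[^"]*")*>)', text)

# Inner group made non-capturing with (?:...): each match is a plain string
non_capturing = re.findall(r'(<\w+(?:\s\w+="[^"]*")*>)', text)

print(capturing)       # [('<a href="x">', ' href="x"')]
print(non_capturing)   # ['<a href="x">']
```

Since `findtags(text)` is supposed to return a list of strings, the attribute sub-pattern should be a non-capturing group.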

## Challenge Problem - Grammar and Parser for Regular Expressions

Having an API for regular expressions is useful, but the expression `plus(opt(alt(lit('a'), lit('b'))))` can be expressed in string form as `(a|b)?+`, which is much more concise. The goal of this challenge is to write a grammar and a parser for regular expressions.

You should first build an RE grammar using the tools we provided, then a parser `parse('RE', text, REGRAMMAR)` that returns some sort of tree. That tree is not yet in the API form, so we need one more function to convert the tree format to the API format.

```python
REGRAMMAR = grammar("""
RE => ## your description here
""", whitespace='')

def parse_re(pattern):
    return convert(parse('RE', pattern, REGRAMMAR))

def convert(tree):
    ## your code here
```