# Lesson 3 - Introduction - Language

Python has a syntax for statements, expressions, classes and operator overloading (e.g. `x + y` where `x` and `y` are instances of a given class. Here the `+` operator is overloaded). It also contains some mini-languages like string formatting. These are small, highly specialized languages that serve a specific purpose. The overarching idea in this unit is that one can develop a domain-specific language to represent and solve specific problems. To this end, we will describe what a *language* is, what a *grammar* is, the difference between a *compiler* and an *interpreter*, and how to use languages as a design tool.

## Regular Expressions

Regular expressions (REs) are an example of a language that can be expressed as strings, for example `'a*b*c*'`. To make sense of inputs like these we need to understand what the possible *grammars* are and which *languages* those grammars correspond to. Regular expressions operate on strings and use patterns represented by strings, but they can become fairly complicated and have nested structures that make them look more like trees than like strings. In the rest of this unit we will work with an internal representation of REs based on trees, which differs from the string form above. The input will still be a string, but it will be represented internally as a tree, and we will be dealing with trees from now on.

A **grammar** is a description of a language and a **language** is a set of strings. In our example above, `'a*b*c*'` is the description of the grammar, and strings like `abc`, `aaabbcc`, `ccccc` belong to the language associated with that grammar.
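
As a quick illustration of the grammar/language distinction, the sketch below uses Python's built-in `re` module (not the matcher developed in this unit) to test which strings belong to the language described by `'a*b*c*'`. The helper name `in_language` is made up for this example.

```python
import re

def in_language(grammar, string):
    """Return True if the whole string belongs to the language of grammar."""
    return re.fullmatch(grammar, string) is not None

grammar = 'a*b*c*'
for s in ['abc', 'aaabbcc', 'ccccc', '', 'cba']:
    print(repr(s), in_language(grammar, s))
```

The empty string is in the language as well, since every starred part may repeat zero times, while `'cba'` is not, because the letters are out of order.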

Representing grammars as strings, like `'a*b*c*'` or `'a+b?'`, is convenient for small expressions, but it can become complex with longer ones. We are going to use a representation that is more **compositional**. We are going to describe an **Application Programming Interface (API)** (as opposed to the UI), i.e., a series of function calls that can be used to describe the grammar of a RE.

In the example above, the grammar is described via a Regular Expression. Python has the `re` module, so we could leverage the functions in that module, but the point of this unit is not to learn how to use REs, but rather how to build a **language processor**.

The **API calls** listed below are the building blocks of our language description, i.e., of our grammar.

- `lit(s)` is the **literal** of string `s`. `lit('a')` describes the language consisting only of character string `'a'`, i.e., `{'a'}`, and nothing else.
- `seq(x, y)` is the **sequence** of x and y. `seq(lit('a'), lit('b'))` describes the language consisting only of the string `'ab'`, i.e., `{'ab'}`. **Question**: is it a *concatenation*?
- `alt(x, y)` stands for **alternatives**. `alt(lit('a'), lit('b'))` corresponds to two possibilities: either `a` or `b`, i.e., `{'a', 'b'}`.
- `star(x)` means **zero or more repetitions** of `x`. `star(lit('a'))` therefore corresponds to `{'', 'a', 'aa', 'aaa', ...}` and so on.
- `oneof(s)` is the same as `alt(c1, c2, ...)` where `c1`, `c2`, ... are the characters that string `s` is made of. `oneof('abc')` matches `{'a', 'b', 'c'}`.
- `eol` means **end of line**. It matches only the end of a character string and nothing else, so it matches the empty string `{''}`, but only at the end. `seq(lit('a'), eol)` matches `{'a'}` if it is at the end of the line.
- `dot` matches any single character, `{'a', 'b', 'c', ...}`.
- `plus(x)` means **one or more repetitions** of `x`. **NOTE**: it has no case of its own in `matchset()`; later in this unit it is defined in terms of `seq` and `star`.

The API calls above implement a subset of regular expression metacharacters. These are based on [Rob Pike's regular expression matcher](https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html). The operators above define patterns that are searched for by these two functions:

- `search(pattern, text)`: returns a string that is the earliest match of the pattern in the text, and if there is more than one match in the same location, it will be the longest of those.
- `match(pattern, text)`: matches only if the pattern occurs at the very start of `text`, so `match('def', 'abcdef')` would return `None`.

The naming of these functions is different from that used by Pike, but is consistent with the naming used in the `re` module. Below we provide the implementations of `search()` and `match()`, plus a couple of utility functions. Of particular interest is the structure of `match_star(p, pattern, text)`, which uses a clever recursive call.
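
For comparison, this is how the functions of the same name behave in the standard `re` module. It is only an illustration of the naming convention: `re` returns match objects rather than the strings or booleans used in this unit.

```python
import re

# re.match only succeeds if the pattern matches at the very start of the text
print(re.match('def', 'abcdef'))            # None
print(re.match('abc', 'abcdef').group())    # abc

# re.search scans the text and reports the earliest match
print(re.search('def', 'abcdef').group())   # def
```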

The `match1(p, text)` function returns `True` if the first character of `text` is `p` *or* if `p` is `.`, i.e., if the first character is allowed to be *any* character. It returns `False` when `text` is empty.

In [24]:
def match1(p, text):
    """Return True if first character of text matches pattern character p."""
    if not text:
        return False
    return p == '.' or p == text[0]

`search(pattern, text)` returns `True` if `pattern` is found anywhere in `text`. If the first character of `pattern` is `^`, `pattern` must appear at the very beginning of `text`. Note that this function depends on the function `match(pattern, text)` defined below.

In [25]:
def search(pattern, text):
    "Return True if pattern appears anywhere in text."
    if pattern.startswith('^'):
        return match(pattern[1:], text)
    else:
        return match('.*' + pattern, text)

`match(pattern, text)` returns `True` if `pattern` appears at the start of `text`. It covers the following special cases:

- `pattern` is the empty string `''`. It returns `True`, because the empty pattern matches at the start of any text.
- `pattern` is `'$'`. It returns `True` only when `text` is empty, i.e., when we have already reached the end of the text.
- If `pattern` has more than one character and the *second* character is either `'*'` or `'?'`, it splits the pattern into three pieces: the first character `p`, the operator `op`, and the rest of the pattern `pat`.
  - If `op` is `'*'` it calls `match_star(p, pat, text)`.
  - If `op` is `'?'` (zero or one occurrences of the preceding character) it checks whether `p` matches the first character of `text` and `pat` matches the rest of `text`. If so, it returns `True`. Otherwise, it checks whether `pat` matches `text` itself, which is the case of zero occurrences.
- Finally, if none of the above holds, it checks whether the first character of `pattern` matches the first character of `text` with `match1()` and calls `match(pattern[1:], text[1:])` recursively to match the rest of the pattern against the rest of the text.

In [26]:
def match(pattern, text):
    "Return True if pattern appears at the start of text."
    if pattern == '':
        return True
    elif pattern == '$':
        return (text == '')
    elif len(pattern) > 1 and pattern[1] in '*?':
        p, op, pat = pattern[0], pattern[1], pattern[2:]
        if op == '*':
            return match_star(p, pat, text)
        elif op == '?':
            if match1(p, text) and match(pat, text[1:]):
                return True
            else:
                return match(pat, text)
    else:
        return (match1(pattern[0], text) and
                match(pattern[1:], text[1:]))

`match_star(p, pattern, text)` returns `True` if zero or more instances of `p` are followed by `pattern`. As above, matching an arbitrary number of instances is achieved via a recursive call to `match_star()` itself.

In [27]:
def match_star(p, pattern, text):
    """Return True if any number of char p, followed by pattern,
    matches text."""
    return (match(pattern, text) or  # match zero times
            (match1(p, text) and     # match exactly one time
             match_star(p, pattern, text[1:])))  # Brilliant!
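
To see the four functions working together, here is a self-contained restatement that can be run on its own. It is a sketch of the matcher described in this section, with the general single-character case of `match()` placed at the outer `else`; the sample patterns and texts are made up for illustration.

```python
def match1(p, text):
    "Return True if the first character of text matches pattern character p."
    if not text:
        return False
    return p == '.' or p == text[0]

def match_star(p, pattern, text):
    "Return True if any number of p, followed by pattern, matches text."
    return (match(pattern, text) or                     # zero occurrences of p
            (match1(p, text) and                        # consume one p ...
             match_star(p, pattern, text[1:])))         # ... and try again

def match(pattern, text):
    "Return True if pattern appears at the start of text."
    if pattern == '':
        return True
    elif pattern == '$':
        return text == ''
    elif len(pattern) > 1 and pattern[1] in '*?':
        p, op, pat = pattern[0], pattern[1], pattern[2:]
        if op == '*':
            return match_star(p, pat, text)
        # op == '?': try one occurrence of p first, then zero occurrences
        return (match1(p, text) and match(pat, text[1:])) or match(pat, text)
    else:
        return match1(pattern[0], text) and match(pattern[1:], text[1:])

def search(pattern, text):
    "Return True if pattern appears anywhere in text."
    if pattern.startswith('^'):
        return match(pattern[1:], text)
    return match('.*' + pattern, text)

print(search('def', 'abcdef'))   # True
print(match('def', 'abcdef'))    # False
print(match('a*b?c$', 'aaac'))   # True
print(search('^b', 'abc'))       # False
```

For `match_star('a', 'b', 'aab')` the recursion first tries zero `a`s (`match('b', 'aab')` fails), then consumes one `a` and retries on `'ab'`, then consumes another and finally succeeds with `match('b', 'b')`.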

## Concept Inventory

Let's make a list of the concepts we need to consider in this unit:

- Patterns.
- Texts we want to match the patterns against, and the result of this matching.

This is pretty much it, however, we will also consider the following concepts:

- A concept of **partial result**.
- A notion of **control over iteration**.

To understand why these additional notions are needed, consider the following example: our pattern is `'a*b+'` and our text is `'aaab'`. If the pattern is represented in our API as `seq(star(lit('a')), plus(lit('b')))`, the first part, `star(lit('a'))`, could match just the first character, but then the second part of the pattern, `'b+'`, would not match what follows. Only after matching all three `'a'`s do we finally reach a `'b'`. We therefore need a mechanism to iterate through all the possible substrings matched by the first part of the pattern, and this seems quite tricky. A similar situation occurs when we need to choose between alternatives. In such cases we need some control over this iteration.

It turns out (no explanation provided in the lesson) that representing these partial results as a **set of remainders of the text** is a good choice. By *remainder* we mean everything left after matching `pattern`. For example, if `pattern = '^a'` and `text = 'abacus'`, the remainder is `'bacus'`. The `matchset(pattern, text)` function below returns this set of remainders. For example, `matchset(star(lit('a')), text='aaab')` would return `set(['aaab', 'aab', 'ab', 'b'])`.

To implement `matchset(pattern, text)` we first introduce a utility function `components(pattern)` which breaks a pattern into three parts: the operator `op` and the arguments `x` and `y`, which will be `None` if missing.

**Important**: here `pattern` is a *tuple* of the form `(op, x, y)`, for example `('lit', 'abc')`.

In [28]:
null = frozenset()  # Acts as Null

def components(pattern):
    """Return the op, x, and y arguments; x and y are None if missing.
    pattern is a tuple (op, x, y) with x, y optional."""
    x = pattern[1] if len(pattern) > 1 else None
    y = pattern[2] if len(pattern) > 2 else None
    return pattern[0], x, y
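
A small usage sketch of `components()`, repeated here so that it runs on its own, shows how the missing arguments are padded with `None`:

```python
def components(pattern):
    "Return the op, x, and y arguments; x and y are None if missing."
    x = pattern[1] if len(pattern) > 1 else None
    y = pattern[2] if len(pattern) > 2 else None
    return pattern[0], x, y

print(components(('seq', ('lit', 'a'), ('lit', 'b'))))  # ('seq', ('lit', 'a'), ('lit', 'b'))
print(components(('lit', 'abc')))                       # ('lit', 'abc', None)
print(components(('dot',)))                             # ('dot', None, None)
```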

In [29]:
#----------------
# User Instructions
#
# The function, matchset, takes a pattern and a text as input
# and returns a set of remainders. For example, if matchset 
# were called with the pattern star(lit('a')) and the text
# 'aaab', matchset would return a set with elements 
# {'aaab', 'aab', 'ab', 'b'}, since a* can consume none, one, two
# or all three of the a's in the text.
#
# dot:   matches any character.
# oneof: matches any of the characters in the string it is 
#        called with. oneof('abc') will match a or b or c.

def matchset(pattern, text):
    "Match pattern at start of text; return a set of remainders of text."
    op, x, y = components(pattern)
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null
    elif 'seq' == op:
        return set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
    elif 'alt' == op:
        return matchset(x, text) | matchset(y, text)
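    elif 'dot' == op:
        # dot needs at least one character to consume
        return set([text[1:]]) if text else null
    elif 'oneof' == op:
        # x holds the allowed characters; consume one if text starts with any
        return set([text[1:]]) if (text and text[0] in x) else null
    elif 'eol' == op:
        return set(['']) if text == '' else null
    elif 'star' == op:
        # t1 != text is checked before recursing, so an x that matches
        # the empty string cannot cause infinite recursion
        return (set([text]) |
                set(t2 for t1 in matchset(x, text) if t1 != text
                    for t2 in matchset(pattern, t1)))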
    else:
        raise ValueError('unknown pattern: %s' % pattern)

When `op == 'alt'` the function returns the *union* of `matchset(x, text)` and `matchset(y, text)`: a remainder is valid if it results from matching either alternative. Computing both sets with recursive calls accounts for the fact that `x` and `y` may themselves be composite expressions.

Let's try to understand the case for `'seq'`. Given `('seq', x, y)`, the expression

```python
set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
```

works as follows: `matchset(x, text)` returns a set whose elements, indicated by `t1`, become the inputs of `matchset(y, t1)`, which in turn returns a set whose elements we indicate with `t2`. The result is the set of all such `t2`.
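
The same idea can be written with explicit loops. The sketch below is not part of the lesson's code: `lit_remainders`, `star_a_remainders` and `seq_remainders` are hypothetical stand-ins for the corresponding `matchset()` cases, used only to make the iteration over partial results visible.

```python
null = frozenset()

def lit_remainders(s, text):
    "Hypothetical stand-in for matchset(('lit', s), text)."
    return {text[len(s):]} if text.startswith(s) else null

def star_a_remainders(text):
    "Hypothetical stand-in for matchset(('star', ('lit', 'a')), text)."
    remainders = {text}
    while text.startswith('a'):
        text = text[1:]
        remainders.add(text)
    return remainders

def seq_remainders(first, second, text):
    "Same result as the set comprehension, written as explicit loops."
    result = set()
    for t1 in first(text):        # every way the first part can match
        for t2 in second(t1):     # every way the second part can continue
            result.add(t2)
    return result

def b(text):
    return lit_remainders('b', text)

print(star_a_remainders('aaab'))
print(seq_remainders(star_a_remainders, b, 'aaab'))   # {''}
```

Of the four remainders left by `a*`, only `'b'` lets the second part succeed, so the sequence leaves the single remainder `''`.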

In [30]:
def test():
    assert matchset(("lit", "abc"), "abcdef") == set(["def"])
    assert matchset(
        ("seq", ("lit", "hi "), ("lit", "there ")), "hi there nice to meet you"
    ) == set(["nice to meet you"])
    assert matchset(("alt", ("lit", "dog"), ("lit", "cat")), "dog and cat") == set(
        [" and cat"]
    )
    assert matchset(("dot",), "am i missing something?") == set(
        ["m i missing something?"]
    )
    assert matchset(("oneof", "a"), "aabc123") == set(["abc123"])
    assert matchset(("eol",), "") == set([""])
    assert matchset(("eol",), "not end of line") == frozenset([])
    assert matchset(("star", ("lit", "hey")), "heyhey!") == set(
        ["!", "heyhey!", "hey!"]
    )
    return "tests pass"

print(test())

tests pass


### Search and Match

With this definition of `matchset()` we can implement `search(pattern, text)` and `match(pattern, text)`.

In [31]:
#---------------
# User Instructions
#
# Complete the search and match functions. Match should
# match a pattern only at the start of the text. Search
# should match anywhere in the text.

null = frozenset()

def search(pattern, text):
    "Match pattern anywhere in text; return longest earliest match or None."
    for i in range(len(text)):
        m = match(pattern, text[i:])
        if m is not None:  # an empty match '' is still a match
            return m
        
def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = matchset(pattern, text)
    if remainders:
        shortest = min(remainders, key=len)
        # The longest match is the text minus the shortest remainder
        return text[:len(text) - len(shortest)]

def test():
    assert match(('star', ('lit', 'a')),'aaabcd') == 'aaa'
    assert match(('alt', ('lit', 'b'), ('lit', 'c')), 'ab') == None
    assert match(('alt', ('lit', 'b'), ('lit', 'a')), 'ab') == 'a'
    assert search(('alt', ('lit', 'b'), ('lit', 'c')), 'ab') == 'b'
    return 'tests pass'

print(test())

tests pass
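
The key step in `match()` is turning a set of remainders into the longest match. The helper below, `longest_match`, is a hypothetical name introduced only to isolate that step.

```python
def longest_match(text, remainders):
    """Return the longest prefix of text implied by a set of remainders,
    or None if the set is empty (no match)."""
    if not remainders:
        return None
    shortest = min(remainders, key=len)
    return text[:len(text) - len(shortest)]

print(longest_match('aaab', {'aaab', 'aab', 'ab', 'b'}))  # aaa
print(longest_match('aa', {'a'}))                          # a
print(longest_match('abc', set()))                         # None
```

Slicing by length is used here instead of locating the remainder with `str.find`, because the remainder may also occur earlier in the text, as in the second call.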


## Filling Out The API

In the code below we provide the implementations of the various API calls. When we call an operator as a function, say `lit('hello')`, we obtain a tuple containing the name of the operator and, depending on the operator, the arguments or other operator-arguments tuples, e.g., `('lit', 'hello')`.

In [32]:
def lit(string):
    return ('lit', string)

def seq(x, y):
    return ('seq', x, y)

def alt(x, y):
    return ('alt', x, y)

def star(x):
    return ('star', x)

def plus(x):
    return ('seq', x, ('star', x))

def opt(x):
    return alt(lit(''), x) #opt(x) means that x is optional

def oneof(chars):
    return ('oneof', tuple(chars))

dot = ('dot',)
eol = ('eol',)

def test():
    assert lit('abc') == ('lit', 'abc')
    assert seq(('lit', 'a'), ('lit', 'b')) == ('seq', ('lit', 'a'), ('lit', 'b'))
    assert alt(('lit', 'a'), ('lit', 'b')) == ('alt', ('lit', 'a'), ('lit', 'b'))
    assert star(('lit', 'a')) == ('star', ('lit', 'a'))
    assert plus(('lit', 'c')) == ('seq', ('lit', 'c'), ('star', ('lit', 'c')))
    assert opt(('lit', 'x')) == ('alt', ('lit', ''), ('lit', 'x'))
    assert oneof('abc') == ('oneof', ('a', 'b', 'c'))
    return 'tests pass'

print(test())

tests pass
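
As a small usage sketch, the pattern `'a*b+'` from the Concept Inventory can be built compositionally. The four constructors are repeated so that the block runs on its own.

```python
def lit(string):
    return ('lit', string)

def seq(x, y):
    return ('seq', x, y)

def star(x):
    return ('star', x)

def plus(x):
    return ('seq', x, ('star', x))

pattern = seq(star(lit('a')), plus(lit('b')))   # a*b+
print(pattern)
```

The printed value is a nested tuple, i.e., a tree: a `seq` node whose children are the `star` node for `a*` and the expansion of `b+` into `b` followed by `b*`.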


## Compiling

Let's summarize how *interpreters* work. In the case of REs we have *patterns*, e.g., `(a|b)+`, and we have *languages*, i.e., set of strings like `{'a', 'b', 'ab', 'ba', ...}` defined by the pattern, and then we have **interpreters** like `matchset(pattern, text)` which return a set of strings. We say that `matchset` is an interpreter because it takes a pattern as a data structure and operates over that pattern. As we can see from its implementation, it has a big `if-elif-else` statement where it checks what type of operator we have in order to select the next action.

There is an inherent inefficiency, in that the pattern is only defined once, but we may want to apply that same pattern to many different texts. Every time we have to go through the sequence of `if-elif-else` in order to figure out what type of operator we have, but we should already know that.

There is another type of language processor, called the **compiler**, which does all this work at once, the very first time the pattern is defined. Whereas an interpreter takes a pattern and a text and operates on those, a compiler has two steps. In the first step there is a compilation function which takes just the pattern and returns a *compiled object*, let's call it `c`. Then the compiled object is executed taking the text as argument: `c(text)`. While in the interpreter all the work is done by the interpreter itself, in our case by `matchset()`, in the case of a compiler some of the work is done during the compilation stage and some happens every time we get a new text.

In the case of `lit` our API returns:

```python
def lit(s):
    return ('lit', s)
```

We defined `matchset(pattern, text)` such that when the pattern contains `'lit'` we compute the remainders as:

```python
def matchset(pattern, text):
    op, x, y = components(pattern)
    # other code
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null
    # other code
```

Now, as soon as we construct a literal, instead of getting a tuple we will get a *function* that returns the set that `matchset()` would have given us. We can then apply this function to `text`.

```python
def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null
```
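
The two styles can be put side by side. In this sketch the names `lit_tuple`, `interpret` and `lit_compiled` are made up to keep both versions in one namespace; only the `lit` case is covered.

```python
null = frozenset()

# Interpreter style: the pattern is data, inspected on every call
def lit_tuple(s):
    return ('lit', s)

def interpret(pattern, text):
    op, s = pattern
    if op == 'lit':
        return {text[len(s):]} if text.startswith(s) else null
    raise ValueError('unknown pattern: %r' % (pattern,))

# Compiler style: the operator is resolved once, when the pattern is built
def lit_compiled(s):
    return lambda text: {text[len(s):]} if text.startswith(s) else null

pattern = lit_tuple('ab')
compiled = lit_compiled('ab')
for text in ['abc', 'abd', 'xyz']:
    # Both approaches produce the same set of remainders
    assert interpret(pattern, text) == compiled(text)
    print(text, compiled(text))
```

The compiled function skips the `if op == ...` dispatch on every call, which is where the saving comes from when one pattern is applied to many texts.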

## Lower Level Compilers

We can define a pattern, say `pat = lit('a')`, which is now a function, not a tuple, and which gives us the set of remainders. In an interpreter we have patterns that describe the strings, i.e., the language. In a compiler we have two descriptions to deal with: a description of what the pattern looks like and a description of what the compiled code looks like. In our case, the compiled code consists of Python functions, which are a good target representation because they are flexible. Compilers for languages like C generate actual machine instructions for the computer, and this is a complex process. An intermediate approach is to generate code for a virtual machine; Java and Python follow this approach. The `dis()` function from the `dis` module disassembles a code object and prints the *bytecode* that the Python virtual machine executes.

In [33]:
import dis
import math

dis.dis(lambda x, y: math.sqrt(x**2 + y**2))

  4           0 RESUME                   0
              2 LOAD_GLOBAL              0 (math)
             14 LOAD_METHOD              1 (sqrt)
             36 LOAD_FAST                0 (x)
             38 LOAD_CONST               1 (2)
             40 BINARY_OP                8 (**)
             44 LOAD_FAST                1 (y)
             46 LOAD_CONST               1 (2)
             48 BINARY_OP                8 (**)
             52 BINARY_OP                0 (+)
             56 PRECALL                  1
             60 CALL                     1
             70 RETURN_VALUE


In the code below, we implement the compiled functions for `lit(s)`, `seq(x, y)` and `alt(x, y)`. The point to remember is that `x` and `y` are each functions returning a set; therefore the compiled version of `alt(x, y)` returns the union of `x(text)`, which is a set, and `y(text)`, which is another set.

In [34]:
def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null

def seq(x, y):
    return lambda text: set().union(*map(y, x(text)))

def alt(x, y):
    return lambda text: x(text).union(y(text))
        
null = frozenset([])

def oneof(chars):
    return lambda t: set([t[1:]]) if (t and t[0] in chars) else null

def star(x): return lambda t: (set([t]) | 
                               set(t2 for t1 in x(t) if t1 != t
                                   for t2 in star(x)(t1)))

dot = lambda t: set([t[1:]]) if t else null
eol = lambda t: set(['']) if t == '' else null

def test():
    g = alt(lit('a'), lit('b'))
    assert g('abc') == set(['bc'])
    return 'test passes'

print(test())

test passes


With this approach, `match(pattern, text)` can be written as follows.

In [35]:
def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = pattern(text)
    if remainders:
        shortest = min(remainders, key=len)
        return text[:len(text)-len(shortest)]
    
def test():
    assert match(star(lit('a')), 'aaaaabbbaa') == 'aaaaa'
    assert match(lit('hello'), 'hello how are you?') == 'hello'
    assert match(lit('x'), 'hello how are you?') == None
    assert match(oneof('xyz'), 'x**2 + y**2 = r**2') == 'x'
    assert match(oneof('xyz'), '   x is here!') == None
    return 'tests pass'

print(test())

tests pass
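
The benefit of compiling shows up when one pattern is reused on several texts. The sketch below restates only the compiled functions it needs, using set literals, and builds the pattern `a*b` once; the sample texts are made up.

```python
null = frozenset()

def lit(s):
    return lambda text: {text[len(s):]} if text.startswith(s) else null

def seq(x, y):
    return lambda text: set().union(*map(y, x(text)))

def star(x):
    return lambda t: ({t} | {t2 for t1 in x(t) if t1 != t
                             for t2 in star(x)(t1)})

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = pattern(text)
    if remainders:
        shortest = min(remainders, key=len)
        return text[:len(text) - len(shortest)]

a_star_b = seq(star(lit('a')), lit('b'))   # compiled once: a*b
for text in ['aaab!', 'b', 'ca']:
    print(repr(text), '->', match(a_star_b, text))
```

The three calls give `'aaab'`, `'b'` and `None` respectively; no operator dispatch happens while matching, only calls to the closures built at construction time.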


## Recognizers and Generators

What we have done so far is the **recognizer task**. We have a function `match(pattern, text)` which *recognizes* if the prefix of `text` is in the language defined by `pattern`.

The **generator task** takes a pattern `pattern` and generates the complete language defined by that pattern. For example, the pattern `(a|b)(a|b)` generates the language `{'aa', 'ab', 'ba', 'bb'}`. If the pattern is `a*` the corresponding language is an infinite set. We could use a generator function to generate each element of such a language one at a time, but we will instead limit the sizes of the strings we want, and this will always produce finite sets. We will take the compiler's approach: instead of calling the generator as a function on the pattern, as in `gen(pat)`, we will have the generator compiled into the pattern. `pat` will therefore be a function that we apply to a set of integers representing the lengths we want to retrieve, and that returns a set of strings. For example, given the pattern `a*`, `pat({1, 2, 3})` should return all strings of length 1, 2, or 3, i.e., `{'a', 'aa', 'aaa'}`. The functions below implement the generators for the various operations.

This is the whole compiler.

In [36]:
def lit(s):
    return lambda Ns: set([s]) if len(s) in Ns else null

def alt(x, y):
    return lambda Ns: x(Ns) | y(Ns)

def star(x):
    return lambda Ns: opt(plus(x))(Ns)

def plus(x):
    return lambda Ns: genseq(x, star(x), Ns, startx=1) #Tricky

def oneof(chars):
    return lambda Ns: set(chars) if 1 in Ns else null

def seq(x, y):
    return lambda Ns: genseq(x, y, Ns)

def opt(x):
    return alt(epsilon, x)

dot = oneof('?')    # You could expand the alphabet to more chars.
epsilon = lit('')   # The pattern that matches the empty string.

def test():
    
    f = lit('hello')
    assert f(set([1, 2, 3, 4, 5])) == set(['hello'])
    assert f(set([1, 2, 3, 4]))    == null 
    
    g = alt(lit('hi'), lit('bye'))
    assert g(set([1, 2, 3, 4, 5, 6])) == set(['bye', 'hi'])
    assert g(set([1, 3, 5])) == set(['bye'])
    
    h = oneof('theseletters')
    assert h(set([1, 2, 3])) == set(['t', 'h', 'e', 's', 'l', 'r'])
    assert h(set([2, 3, 4])) == null
    
    return 'tests pass'

print(test())

tests pass


We can make the compiler more efficient. For example, in the definition of `lit(s)` we call `set([s])` every time the output of `lit(s)` is called. This seems wasteful. A better way of writing it is:

In [37]:
def lit(s):
    set_s = set([s])  # We create this only once
    # Every time we call the function below, we refer to the set_s defined above
    return lambda Ns: set_s if len(s) in Ns else null

Similarly, we can pull out the `set(chars)` in the definition of `oneof(chars)`.
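
A minimal sketch of that change, following the same pattern as `lit(s)`; the variable name `set_chars` is an assumption:

```python
null = frozenset()

def oneof(chars):
    set_chars = set(chars)  # built once, when the pattern is compiled
    return lambda Ns: set_chars if 1 in Ns else null

vowel = oneof('aeiou')
print(sorted(vowel({1, 2})))   # ['a', 'e', 'i', 'o', 'u']
print(vowel({2, 3}))           # frozenset()
```

Every call now returns the same set object, so callers should treat the result as read-only.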

We still must define `genseq()`. If we pass two arguments `x, y` to `seq(x, y)` (not `genseq()`), we get back a function of `Ns`, `fn(Ns)`, which returns the set of matching texts. In this respect, `seq(x, y)` delays the computation of the output. `genseq(x, y, Ns)`, instead, calculates the output set immediately. One thing we know about this function is that it will have to call `x(Nx)`, where `Nx` is a set of lengths we don't yet know, then call `y(Ny)`, where `Ny` is a possibly different set of lengths, then concatenate the results and check whether the length of each concatenation is in the allowable set `Ns`. What do we know about `Nx` and `Ny` with respect to `Ns`? `Ns` could be a dense set, say `{0, 1, 2, ..., 10}`, or a sparse set, say just `{10}`, but in either case the length of the `x` part plus the length of the `y` part is at most 10, so `Nx` can contain anything up to 10. For `y(Ny)` we have two choices: we could wait for `x(Nx)` to return its results and derive `Ny` from them, or we could compute both sides at once and then combine them and see whether the lengths match up. The second option is easier, because then `Ny` can also be any length up to 10 in our example. So both `Nx` and `Ny` can be anything up to 10 inclusive, and for each pair of results we concatenate them and check whether the total length is in `Ns`. A candidate solution for the `genseq()` function is shown below.

In [38]:
def genseq(x, y, Ns):
    Nss = range(max(Ns) + 1)
    return set(m1 + m2
               for m1 in x(Nss)
               for m2 in y(Nss)
               if len(m1 + m2) in Ns)

This function, however, can give rise to infinite recursion. Where do we use recursion? In two functions, `plus()` and `star()`, but `star()` is defined in terms of `plus()`, so we need to fix `plus()` to avoid infinite recursion. We are essentially defining `x+` as `xx*`, i.e., `seq(x, star(x))`. The trouble is that nothing forces each recursive step to consume characters. This is easiest to see with `pat = plus(opt(a))`: `opt(a)` is either `a` or the empty string, so on every pass through the loop we may pick the empty string, the lengths still allowed by `Ns` never shrink, and we keep going forever.

This is why we have `startx=1` in `plus()`: we always require the first `x` to have a length of at least 1, and this is how we break the recursion. We redefine `genseq()` as:

In [39]:
def genseq(x, y, Ns, startx=0):
    "Set of matches to xy whose total len is in Ns, with x-match's len in Ns-len(..."
    # Tricky part: x+ is defined as x+ = x x*
    # To stop the recursion, the first x must generate at least 1 char,
    # and then the recursive x* has that many fewer characters. We use
    # startx=1 to say that x must match at least 1 character.
    if not Ns:
        return null
    xmatches = x(set(range(startx, max(Ns) + 1)))
    Ns_x = set(len(m) for m in xmatches)
    Ns_y = set(n - m for n in Ns for m in Ns_x if n - m >= 0)
    ymatches = y(Ns_y)
    return set(m1 + m2
               for m1 in xmatches for m2 in ymatches
               if len(m1 + m2) in Ns)

def test_gen():
    def N(hi):
        return set(range(hi + 1))
    a, b, c = map(lit, 'abc')
    assert star(oneof('ab'))(N(2)) == set(['', 'a', 'aa', 'ab', 'ba', 'bb', 'b'])
    assert (seq(star(a), seq(star(b), star(c)))(set([4])) == set(
        ['aaaa', 'aaab', 'aaac', 'aabb', 'aabc', 'aacc', 'abbb', 'abbc',
         'abcc', 'accc', 'bbbb', 'bbbc', 'bbcc', 'bccc', 'cccc']))
    assert (seq(plus(a), seq(plus(b), plus(c)))(set([5])) == set(
        ['aaabc', 'aabbc', 'aabcc', 'abbbc', 'abbcc', 'abccc']))
    assert (seq(oneof('bcfhrsm'), lit('at'))(N(3)) == set(
        ['bat', 'cat', 'fat', 'hat', 'mat', 'rat', 'sat']))
    assert (seq(star(alt(a, b)), opt(c))(set([3])) == set(
        ['aaa', 'aab', 'aac', 'aba', 'abb', 'abc', 'baa', 'bab',
         'bac', 'bba', 'bbb', 'bbc']))
    assert lit('hello')(set([5])) == set(['hello'])
    assert lit('hello')(set([4])) == set()
    assert lit('hello')(set([6])) == set()
    return 'test_gen passes'

print(test_gen())

test_gen passes
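
To check that the troublesome pattern `plus(opt(a))` now terminates, the sketch below restates the generator functions it needs (with set literals and the hoisted `set_s`) so that it runs on its own.

```python
null = frozenset()

def lit(s):
    set_s = {s}
    return lambda Ns: set_s if len(s) in Ns else null

def alt(x, y):
    return lambda Ns: x(Ns) | y(Ns)

def opt(x):
    return alt(epsilon, x)

def star(x):
    return lambda Ns: opt(plus(x))(Ns)

def plus(x):
    return lambda Ns: genseq(x, star(x), Ns, startx=1)

def genseq(x, y, Ns, startx=0):
    "Matches to xy whose total length is in Ns; x must have length >= startx."
    if not Ns:
        return null
    xmatches = x(set(range(startx, max(Ns) + 1)))
    Ns_x = {len(m) for m in xmatches}
    Ns_y = {n - m for n in Ns for m in Ns_x if n - m >= 0}
    ymatches = y(Ns_y)
    return {m1 + m2 for m1 in xmatches for m2 in ymatches
            if len(m1 + m2) in Ns}

epsilon = lit('')

a = lit('a')
pattern = plus(opt(a))                 # (a?)+
print(sorted(pattern({0, 1, 2, 3})))   # ['a', 'aa', 'aaa']
```

Each recursive call to `plus` receives strictly smaller lengths, so the recursion bottoms out at the empty `Ns`. One side effect, visible in the output, is that the empty string is not generated even though `(a?)+` can match it, because the first `x` is forced to produce at least one character.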


In all the above we have taken advantage of the *composability* of Python functions. Functions, unlike statements and expressions, which can only be composed by the programmer, can be composed dynamically. Functions also provide *control over time*: we can divide some of the work we want to do such that we do some now and some later. Expressions don't allow this separation.

## Changing `seq()`

The `seq()` function is binary, in the sense that it takes two arguments. If we want a sequence of four objects, say `a, b, c, d`, we need to call `seq(a, seq(b, seq(c, d)))`. It would be much easier if we could just write `seq(a, b, c, d)`. We want to refactor this function, but aren't we changing its API? We should ask ourselves:

1. Which other functions does `seq()` interact with in our program?
2. If I change `seq()`, are these changes *backward compatible*? In other words, do I have to modify also the functions `seq()` interacts with?
3. Are the changes *internal* or *external*? Am I changing something inside `seq()` that doesn't affect the callers, or am I changing the interface to the outside world?

### Function mapping: decorators

We can refactor `seq()` without changing the API. To do this, we need to *map* our binary function `f(x, y)` and convert it to an n-ary function `g(x, y, ...)`. This mapping is done via a function, `n_ary()` in the example below, that takes the binary function `f(x, y)` and returns an n-ary function.

In [40]:
def n_ary(f):
    """Given binary function f(x, y), return an n_ary function such that
    f(x, y, z) = f(x, f(y, z)) etc. Also allow for f(x) = x."""
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

myseq = n_ary(myseq)
print(myseq('a', 'b', 'c'))

('myseq', 'a', ('myseq', 'b', 'c'))


The pattern above is so common in Python that there is a special notation: the **decorator notation**. We can leverage it as follows:

In [41]:
@n_ary
def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

print(myseq('a', 'b', 'c'))

('myseq', 'a', ('myseq', 'b', 'c'))
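
Applied to the tuple-building `seq()` from earlier in this unit, the same decorator gives the n-ary call we wanted while keeping existing two-argument calls unchanged. The definitions are repeated so that the sketch is self-contained.

```python
def n_ary(f):
    "Given binary function f(x, y), return an n-ary version of it."
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

def lit(s):
    return ('lit', s)

@n_ary
def seq(x, y):
    return ('seq', x, y)

a, b, c, d = map(lit, 'abcd')
assert seq(a, b) == ('seq', a, b)                     # binary calls behave as before
assert seq(a, b, c, d) == seq(a, seq(b, seq(c, d)))   # n-ary call nests to the right
print(seq(a, b, c))
```

Because two-argument calls return exactly what they did before, the change is backward compatible for every caller of `seq()`.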


One limitation, however, is that if we check the docstring of the decorated function, this is what we get:

In [42]:
help(myseq)

Help on function n_ary_f in module __main__:

n_ary_f(x, *args)



Luckily `functools` has a function called `update_wrapper()` that takes two functions and copies the name, the documentation plus other things from the old function to the new function. For this, we need to modify the definition of `n_ary()`.

In [43]:
from functools import update_wrapper

def n_ary(f):
    """Given binary function f(x, y), return an n_ary function such that
    f(x, y, z) = f(x, f(y, z)) etc. Also allow for f(x) = x."""
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    update_wrapper(n_ary_f, f)  # update_wrapper(new_fn, old_fn) 
    return n_ary_f

@n_ary
def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

In [44]:
help(myseq)

Help on function myseq in module __main__:

myseq(x, y)
    My own seq function (meh)
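
The standard library also offers `functools.wraps`, a decorator form of `update_wrapper()`. The lesson does not use it; the sketch below only shows that it has the same effect on `n_ary()`.

```python
from functools import wraps

def n_ary(f):
    "Given binary function f(x, y), return an n-ary version of it."
    @wraps(f)  # copies f's name, docstring, etc. onto n_ary_f
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

@n_ary
def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

print(myseq.__name__)   # myseq
print(myseq.__doc__)    # My own seq function (meh)
```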



An even better approach is to create our own decorator that adds the `update_wrapper()` call to, in the case above, `n_ary_f()`. We will call this new decorator `@decorator`. If we use `n_ary()` as a decorator on `myseq()` and apply `@decorator` to `n_ary()`, we have two updates to consider: one for the function we want to decorate, and one for the decorator itself. The pattern we want to follow is:

```python
def decorator(d):  # d is a decorator function
    def _d(f):
        return update_wrapper(d(f), f)
    update_wrapper(_d, d)
    return _d
```

With this setup, `n_ary = decorator(n_ary)` would be updated by `update_wrapper(_d, d)` and `myseq = n_ary(myseq)` would be updated by `update_wrapper(d(f), f)`.

This is what we ultimately get:

In [47]:
def decorator(d):
    "Make function d a decorator. d wraps a function f"
    def _d(f):
        return update_wrapper(d(f), f)
    update_wrapper(_d, d)
    return _d

@decorator
def n_ary(f):
    """Given binary function f(x, y) return an n-ary function such that
    f(x, y, z) = f(x, f(y, z)), etc. Also allow for f(x) = x."""
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

@n_ary
def myseq(x, y):
    return ('myseq', x, y)

In [48]:
help(myseq)

Help on function myseq in module __main__:

myseq(x, y)



Even more confusingly, the following code also works (due to [Darius Bacon](https://github.com/darius)). **TODO** understand what's going on here. Video 46 - Decorated Decorators Solution

In [46]:
def decorator(d):
    "Make function d a decorator. d wraps a function fn."
    return lambda fn: update_wrapper(d(fn), fn)

decorator = decorator(decorator)
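
One way to read the self-application above is to give the two stages different names. The sketch below does that; `decorator_v0` is a name introduced here only for illustration, and the shortened docstring of `n_ary` is ours. Substituting `d = decorator_v0` into the lambda shows that the final `decorator` first applies the original `decorator_v0` to its argument and then copies the argument's metadata onto the result.

```python
from functools import update_wrapper

def decorator_v0(d):
    "Make function d a decorator. d wraps a function fn."
    return lambda fn: update_wrapper(d(fn), fn)

# With d = decorator_v0 the lambda becomes:
#     lambda fn: update_wrapper(decorator_v0(fn), fn)
decorator = decorator_v0(decorator_v0)

@decorator
def n_ary(f):
    "Turn binary f(x, y) into an n-ary function."
    def n_ary_f(x, *args):
        return x if not args else f(x, n_ary_f(*args))
    return n_ary_f

@n_ary
def myseq(x, y):
    "My own seq function (meh)"
    return ('myseq', x, y)

print(n_ary.__name__)   # n_ary
print(myseq.__name__)   # myseq
print(myseq(1, 2, 3))   # ('myseq', 1, ('myseq', 2, 3))
```

So `@decorator` updates `n_ary`, and `n_ary` in turn updates every function it decorates. The resulting `decorator` itself is still a bare lambda, so its own `__name__` is `'<lambda>'`.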

## Cache Management

We want to leverage the concept of **memoization**. Particularly with recursive functions, we end up making the same function calls over and over again. If the result of a call depends only on its arguments and takes a long time to compute, it is better to store each input and its corresponding result in a cache. For example:

```python
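cache = {}  # maps n to the already computed fib(n)

def fib(n):
    if n in cache:
        return cache[n]
    # Compute the result once and remember it for later calls.
    cache[n] = result = 1 if n <= 1 else fib(n - 1) + fib(n - 2)
    return result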
```

However, we may have many functions where we may want to use memoization, and we don't want to rewrite the code above over and over. We can implement this with a decorator, let's call it `@memo`. It looks like this:

In [63]:
@decorator
def memo(f):
    """Decorator that caches the return value for each call to f(args).
    Then, when called again with same args, we can just look it up."""
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            # Some element of args can't be a dict key; call f without caching
            return f(*args)
    return _f

Here we have used a `try-except` pattern rather than an `if-else` one. It is like asking for forgiveness (`try-except`) as opposed to asking for permission (`if-else`). In this case we use `try-except` because there is a second kind of exception to handle: `TypeError`, which is raised when an argument is not hashable, for example when a list is used as a dictionary key. An `if args in cache` test would raise the same `TypeError`, so asking for permission would not spare us the `try`. Lists are not hashable because they are mutable. Suppose we used a particularly simple hash function for lists of integers, say the sum of the elements: `y = [1, 2, 3]` would be associated with the hash value 6. If, however, we modify the list so that `y[0] = 10`, the hash value becomes 15, and a lookup would no longer find the entry that was stored under 6.
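
The `TypeError` case can be reproduced with a plain dictionary. In this small sketch, `try_store` is a hypothetical helper written only for the demonstration; it is not part of `memo`.

```python
cache = {}

def try_store(key, value):
    "Hypothetical helper: store value under key, or report that key is unhashable."
    try:
        cache[key] = value
        return 'stored'
    except TypeError:
        return 'unhashable'

print(try_store((1, 2, 3), 6))   # stored      (tuples are hashable)
print(try_store([1, 2, 3], 6))   # unhashable  (lists are not)

# Even the "asking for permission" membership test fails for a list:
try:
    [1, 2, 3] in cache
    membership = 'ok'
except TypeError:
    membership = 'TypeError'
print(membership)                # TypeError
```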

To see how effective our `@memo` decorator is we may compare the decorated version of the function with the original one. We may measure time or, more interestingly, the number of function calls. We did something like this in a previous lesson, but now we will do it with a decorator.

In [64]:
@decorator
def countcalls(f):
    "Decorator that makes the function count calls to it, in callcounts[f]."
    def _f(*args):
        callcounts[_f] += 1
        return f(*args)
    callcounts[_f] = 0
    return _f

callcounts = {}

@countcalls
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

fib(10)

89

In [65]:
@countcalls
@memo
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

fib(10)

89
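
Both versions return 89, so the difference only shows up in `callcounts`. The following self-contained sketch repeats the decorators from this section (with shortened docstrings) and gives the two functions distinct names, `slow_fib` and `fast_fib`, which are our own, so that both counters can be inspected side by side.

```python
from functools import update_wrapper

def decorator(d):
    "Make function d a decorator. d wraps a function f."
    def _d(f):
        return update_wrapper(d(f), f)
    update_wrapper(_d, d)
    return _d

callcounts = {}

@decorator
def countcalls(f):
    "Count the calls to f in callcounts."
    def _f(*args):
        callcounts[_f] += 1
        return f(*args)
    callcounts[_f] = 0
    return _f

@decorator
def memo(f):
    "Cache the return value of f for each tuple of hashable args."
    cache = {}
    def _f(*args):
        try:
            return cache[args]
        except KeyError:
            cache[args] = result = f(*args)
            return result
        except TypeError:
            return f(*args)
    return _f

@countcalls
def slow_fib(n):
    return 1 if n <= 1 else slow_fib(n - 1) + slow_fib(n - 2)

@countcalls
@memo
def fast_fib(n):
    return 1 if n <= 1 else fast_fib(n - 1) + fast_fib(n - 2)

slow_fib(10)
fast_fib(10)
print(callcounts[slow_fib])   # 177
print(callcounts[fast_fib])   # 19
```

Because `@countcalls` sits outside `@memo`, cache hits are counted too: the memoized version makes one initial call plus two calls for each of the nine values from 2 to 10, while the plain version recomputes the same subproblems many times.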

**TODO** this testing function is currently useless. It already treats the API calls as functions rather than as the tuples used above.

In [17]:
def test_search():
    a, b, c = lit('a'), lit('b'), lit('c')
    abcstars = seq(star(a), seq(star(b), star(c)))
    dotstar = star(dot)
    assert search(lit('def'), 'abcdefg') == 'def'
    assert search(seq(lit('def'), eol), 'abcdef') == 'def'
    assert search(seq(lit('def'), eol), 'abcdefg') is None
    assert search(a, 'not the start') == 'a'
    assert match(a, 'not the start') is None
    assert match(abcstars, 'aaabbbccccccccdef') == 'aaabbbcccccccc'
    assert match(abcstars, 'junk') == ''
    assert all(match(seq(abcstars, eol), s) == s for s in 'abc aaabbccc aaaabcccc'.split())
    assert all(match(seq(abcstars, eol), s) is None for s in 'cab aaabbcccd aaaa-b-cccc'.split())
    r = seq(lit('ab'), seq(dotstar, seq(lit('aca'), seq(dotstar, seq(a, eol)))))
    assert all(search(r, s) is not None for s in 'abracadabra abacaa about-acacia-flora'.split())
    assert all(match(seq(c, seq(dotstar, b)), s) is not None for s in 'cab cob carob cb carbuncle'.split())
    assert not any(match(seq(c, seq(dot, b)), s) for s in 'carb cb across scab'.split())
    return 'test_search passes'

# print(test_search())