# Lesson 3 - Introduction - Language

Python has a syntax for statements, expressions, classes and operator overloading (e.g. `x + y` where `x` and `y` are instances of a given class. Here the `+` operator is overloaded). It also contains some mini-languages like string formatting. These are small, highly specialized languages that serve a specific purpose. The overarching idea in this unit is that one can develop a domain-specific language to represent and solve specific problems. To this end, we will describe what a *language* is, what a *grammar* is, the difference between a *compiler* and an *interpreter*, and how to use languages as a design tool.

## Regular Expressions

REs are an example of a language that can be expressed as strings, for example `'a*b*c*'`. To make sense of inputs like these we need to understand what the possible *grammars* are and which *languages* those grammars correspond to. Regular expressions operate on strings and use patterns represented by strings, but they can become fairly complicated and have nested structures that make them look more like trees than like strings. In the rest of this unit we will work with an internal representation of REs based on trees, which differs from the string form above. The input will still be in the form of strings, but the patterns will be represented internally as trees, and we will be dealing with trees from now on.

A **grammar** is a description of a language and a **language** is a set of strings. In our example above, `'a*b*c*'` is the description of the grammar, and strings like `abc`, `aaabbcc`, and `ccccc` form the language associated with that grammar.

Representing grammars as strings, like `'a*b*c*'` or `'a+b?'`, is convenient for small expressions, but it can become complex with longer ones. We are going to use a representation that is more **compositional**. We are going to describe an **Application Programming Interface (API)** (as opposed to the UI), i.e., a series of function calls that can be used to describe the grammar of a RE.

In the example above, the grammar is described via a Regular Expression. Python has the `re` module, so we could leverage the functions in that module, but the point of this unit is not to learn how to use REs, but rather how to build a **language processor**.
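For reference only, the grammar/language distinction can be checked with the standard `re` module before we build our own processor. The sketch below tests whether a few strings belong to the language described by `a*b*c*`; the helper name `in_language` is an assumption made for this example and is not part of the unit.

```python
import re

def in_language(grammar, s):
    """Return True if the whole string s belongs to the language of grammar."""
    # fullmatch requires the pattern to cover the entire string.
    return re.fullmatch(grammar, s) is not None

grammar = 'a*b*c*'
for s in ['abc', 'aaabbcc', 'ccccc', '', 'cba']:
    print(repr(s), in_language(grammar, s))
```

Note that the empty string is also in the language, since every starred part may repeat zero times.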

The **API calls** listed below are the building blocks of our language description, i.e., of our grammar.

- `lit(s)` is the **literal** of string `s`. `lit('a')` describes the language consisting only of character string `'a'`, i.e., `{'a'}`, and nothing else.
- `seq(x, y)` is the **sequence** of `x` and `y`. `seq(lit('a'), lit('b'))` would describe the language consisting only of the string `'ab'`, i.e., `{'ab'}`. **Question**: is it a *concatenation*?
- `alt(x, y)` stands for **alternatives**. `alt(lit('a'), lit('b'))` would correspond to two possibilities: either `a` or `b`, i.e., `{'a', 'b'}`.
- `star(x)` means **zero or more repetitions** of `x`. `star(lit('a'))` would therefore correspond to `{'', 'a', 'aa', 'aaa', ...}` and so on.
- `oneof(s)` is the same as `alt(c1, c2, ...)` where `c1`, `c2`, ... are the characters string `s` is made of. `oneof('abc')` matches `{'a', 'b', 'c'}`.
- `eol` means **end of line**: it matches only the end of a character string and nothing else, so it matches the empty string `{''}`, but only at the end. `seq(lit('a'), eol)` matches `{'a'}` if it is at the end of the line.
- `dot` matches any possible character `{'a', 'b', 'c', ...}`.
- `plus(x)` means **one or more repetitions** of `x`. **NOTE**: `matchset` has no dedicated case for it in this unit; the constructor defined later expresses it as `seq(x, star(x))`.

The API calls above implement a subset of regular expression metacharacters. These are based on [Rob Pike's regular expression matcher](https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html).

Below we create a function to test the implementation of these operators. The test will not currently pass, since we have not implemented the API calls. The test function relies on two functions:

- `search(pattern, text)`: returns a string that is the earliest match of the pattern in the text, and if there is more than one match in the same location, it will be the longest of those.
- `match(pattern, text)`: matches only if the pattern occurs at the very start of `text`, so `match('def', 'abcdef')` would return `None`.

Below we provide the implementations of `search()` and `match()`, plus a couple of utility functions. Of particular interest is the structure of `match_star(p, pattern, text)`, which uses a clever recursive call. Since we have not yet implemented the API calls, `test_search()` will not pass.

The `match1(p, text)` function returns `True` if the first character of `text` is `p` *or* if `p` is `.`, i.e., if the first character is allowed to be *any* character.

In [5]:
def match1(p, text):
    """Return True if first character of text matches pattern character p."""
    if not text:
        return False
    return p == '.' or p == text[0]

`search(pattern, text)` returns `True` if `pattern` is found anywhere in `text`. If the first character of `pattern` is `^`, `pattern` must appear at the very beginning of `text`. Note that this function depends on the function `match(pattern, text)` defined below.

In [6]:
def search(pattern, text):
    "Return True if pattern appears anywhere in text."
    if pattern.startswith('^'):
        return match(pattern[1:], text)
    else:
        return match('.*' + pattern, text)

`match(pattern, text)` returns `True` if `pattern` appears at the start of `text`. It covers the following special cases:

- `pattern` is the empty string `''`. It returns `True`, because an empty pattern places no constraint on the text: once the whole pattern has been consumed, whatever remains of `text` is irrelevant.
- `pattern` is `'$'`. It returns `True` only if `text` is the empty string, i.e., when the end of the text has been reached.
- If `pattern` is a multicharacter string and the *second* character is either `'*'` or `'?'`, it splits the pattern into three pieces: the first character `p`, the operator `op`, and the pattern `pat`.
  - If `op` is `'*'` it calls `match_star(p, pat, text)`.
  - If `op` is `'?'` it checks whether `p` matches the first character of `text` and whether `pat` matches the rest of `text`. If this is the case, it returns `True`. If this condition is `False`, it applies the case when the first character is not matched, and checks whether `pat` matches (all of) `text`.
- Finally, if none of the above holds, it checks whether the first character of `pattern` matches the first character of `text` and calls `match(pattern[1:], text[1:])` recursively to match the rest of the pattern against the rest of the text.

In [7]:
def match(pattern, text):
    "Return True if pattern appears at the start of text."
    if pattern == '':
        return True
    elif pattern == '$':
        return (text == '')
    elif len(pattern) > 1 and pattern[1] in '*?':
        p, op, pat = pattern[0], pattern[1], pattern[2:]
        if op == '*':
            return match_star(p, pat, text)
        elif op == '?':
            if match1(p, text) and match(pat, text[1:]):
                return True
            else:
                return match(pat, text)
    else:
        return (match1(pattern[0], text) and
                match(pattern[1:], text[1:]))

`match_star(p, pattern, text)` returns `True` if zero or more instances of `p` are followed by `pattern`. As above, matching an arbitrary number of instances is achieved via a recursive call to `match_star()` itself.

In [8]:
def match_star(p, pattern, text):
    """Return True if any number of char p, followed by pattern,
    matches text."""
    return (match(pattern, text) or  # match zero times
            (match1(p, text) and     # match exactly one time
             match_star(p, pattern, text[1:])))
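To see the four functions above working together, the sketch below restates them in one block and runs a few checks. It assumes that the final `else` of `match()` pairs with the outer `if`/`elif` chain, so that ordinary characters are handled; the sample patterns and texts are made up for illustration.

```python
def match1(p, text):
    """Return True if first character of text matches pattern character p."""
    if not text:
        return False
    return p == '.' or p == text[0]

def match_star(p, pattern, text):
    """Return True if any number of char p, followed by pattern, matches text."""
    return (match(pattern, text) or                 # p repeated zero times
            (match1(p, text) and                    # or consume one p ...
             match_star(p, pattern, text[1:])))     # ... and try again

def match(pattern, text):
    """Return True if pattern appears at the start of text."""
    if pattern == '':
        return True
    elif pattern == '$':
        return text == ''
    elif len(pattern) > 1 and pattern[1] in '*?':
        p, op, pat = pattern[0], pattern[1], pattern[2:]
        if op == '*':
            return match_star(p, pat, text)
        else:  # op == '?': p is used once, or skipped altogether
            return (match1(p, text) and match(pat, text[1:])) or match(pat, text)
    else:
        return match1(pattern[0], text) and match(pattern[1:], text[1:])

def search(pattern, text):
    """Return True if pattern appears anywhere in text."""
    if pattern.startswith('^'):
        return match(pattern[1:], text)
    return match('.*' + pattern, text)

print(search('def', 'abcdef'))      # pattern found in the middle of the text
print(search('^def', 'abcdef'))     # anchored at the start, so no match
print(match('a*b', 'aaab'))         # star consumes all three a's
print(match('colou?r', 'color'))    # the optional u is skipped
print(match('ab$', 'abc'))          # $ requires the text to end after ab
```

The calls print `True`, `False`, `True`, `True` and `False` respectively.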

## Concept Inventory

Let's make a list of the concepts we need to consider in this unit:

- Patterns.
- Texts we want to match the patterns against, and the result of this matching.

This is pretty much it; however, we will also consider the following concepts:

- A concept of **partial result**.
- A notion of **control over iteration**.

To understand why these additional notions are needed, let's consider the following example: our pattern is `'a*b+'` and our text is `'aaab'`. If our pattern is represented in our API as `seq(star(lit('a')), plus(lit('b')))`, the first part, `star(lit('a'))`, could match just the first character, but then the second part of the pattern, `'b+'`, would not match what follows. Only after matching all three `'a'`s do we finally match a `'b'`. We would then need a mechanism to iterate through all possible substrings matching the first part of the pattern, and this seems quite tricky. A similar situation occurs if we need to choose between alternatives. In such cases we need some form of control over this kind of iteration.

It is difficult to see it now, but if we choose to represent these partial results as a **set of remainders of the text**, where by *remainder* we mean everything left after the match, then everything falls into place, and much of the above trickiness goes away. We are therefore going to introduce a function called `matchset(pattern, text)` that returns this set of remainders. For example, for `matchset(star(lit('a')), 'aaab')` the output would be the set `{'aaab', 'aab', 'ab', 'b'}`.

To implement `matchset(pattern, text)` we first introduce a utility function `components(pattern)` which breaks a pattern into three parts: the operator `op` and the arguments `x` and `y`, which will be `None` if missing.

**Important**: here `pattern` is a *tuple* of the form `(op, x, y)`, for example `('lit', 'abc')`.

In [10]:
null = frozenset()  # Acts as Null

def components(pattern):
    "Return the op, x, and y arguments; x and y are None if missing."
    x = pattern[1] if len(pattern) > 1 else None
    y = pattern[2] if len(pattern) > 2 else None
    return pattern[0], x, y

In [11]:
#----------------
# User Instructions
#
# The function, matchset, takes a pattern and a text as input
# and returns a set of remainders. For example, if matchset 
# were called with the pattern star(lit(a)) and the text 
# 'aaab', matchset would return a set with elements 
# {'aaab', 'aab', 'ab', 'b'}, since a* can consume none, one, two
# or all three of the a's in the text.
#
# dot:   matches any character.
# oneof: matches any of the characters in the string it is 
#        called with. oneof('abc') will match a or b or c.

def matchset(pattern, text):
    "Match pattern at start of text; return a set of remainders of text."
    op, x, y = components(pattern)
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null
    elif 'seq' == op:
        return set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
    elif 'alt' == op:
        return matchset(x, text) | matchset(y, text)
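    elif 'dot' == op:
        # dot needs at least one character to consume.
        return set([text[1:]]) if text else null
    elif 'oneof' == op:
        # x holds the allowed characters; match only if text starts with one.
        return set([text[1:]]) if text and text[0] in x else null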
    elif 'eol' == op:
        return set(['']) if text == '' else null
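    elif 'star' == op:
        # Check t1 != text before recursing, so that a pattern which
        # consumes nothing cannot recurse forever.
        return (set([text]) |
                set(t2 for t1 in matchset(x, text) if t1 != text
                    for t2 in matchset(pattern, t1)))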
    else:
        raise ValueError('unknown pattern: %s' % pattern)

When `op == 'alt'` the function returns `matchset(x, text) | matchset(y, text)`, i.e., the *union* of the remainders produced by `x` and those produced by `y`. The recursive calls account for the fact that `x` and `y` may themselves be composite expressions.

Let's try to understand the case for `'seq'`. Given `('seq', x, y)`, the expression

```python
set(t2 for t1 in matchset(x, text) for t2 in matchset(y, t1))
```

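applies `matchset` recursively: it first matches `x` against `text`, which yields a set of remainders `t1`, and then matches `y` against each `t1`. The result is the set of all remainders `t2` left after both parts have matched.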
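The following sketch traces the `'seq'` case on the earlier `'aaab'` example. It uses a cut-down `matchset` restricted to `lit`, `seq` and `star` so that the block runs on its own; the variable names `a_star` and `b` are chosen just for this illustration.

```python
null = frozenset()

def matchset(pattern, text):
    """Cut-down matchset handling only lit, seq and star."""
    op = pattern[0]
    if op == 'lit':
        x = pattern[1]
        return {text[len(x):]} if text.startswith(x) else null
    if op == 'seq':
        _, x, y = pattern
        # Every remainder t1 left by x is fed to y.
        return {t2 for t1 in matchset(x, text) for t2 in matchset(y, t1)}
    if op == 'star':
        x = pattern[1]
        return {text} | {t2 for t1 in matchset(x, text) if t1 != text
                         for t2 in matchset(pattern, t1)}
    raise ValueError('unknown pattern: %r' % (pattern,))

a_star = ('star', ('lit', 'a'))
b = ('lit', 'b')

# Step 1: a* can leave any of these remainders.
print(sorted(matchset(a_star, 'aaab')))      # ['aaab', 'aab', 'ab', 'b']
# Step 2: only the remainder 'b' lets the literal b match, leaving ''.
print(matchset(('seq', a_star, b), 'aaab'))  # {''}
```

Because all candidate remainders are carried along as a set, no explicit backtracking is needed: the remainders that do not lead to a match simply contribute nothing to the result.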

In [12]:
def test():
    assert matchset(("lit", "abc"), "abcdef") == set(["def"])
    assert matchset(
        ("seq", ("lit", "hi "), ("lit", "there ")), "hi there nice to meet you"
    ) == set(["nice to meet you"])
    assert matchset(("alt", ("lit", "dog"), ("lit", "cat")), "dog and cat") == set(
        [" and cat"]
    )
    assert matchset(("dot",), "am i missing something?") == set(
        ["m i missing something?"]
    )
    assert matchset(("oneof", "a"), "aabc123") == set(["abc123"])
    assert matchset(("eol",), "") == set([""])
    assert matchset(("eol",), "not end of line") == frozenset([])
    assert matchset(("star", ("lit", "hey")), "heyhey!") == set(
        ["!", "heyhey!", "hey!"]
    )
    return "tests pass"

print(test())

tests pass


## Filling Out The API

In the code below we provide the implementations of the various API calls. When we call an operator as a function, say `lit('hello')`, we obtain a tuple containing the name of the operator and, depending on the operator, the arguments or other operator-argument tuples, e.g., `('lit', 'hello')`.

In [13]:
def lit(string):
    return ('lit', string)

def seq(x, y):
    return ('seq', x, y)

def alt(x, y):
    return ('alt', x, y)

def star(x):
    return ('star', x)

def plus(x):
    return ('seq', x, ('star', x))

def opt(x):
    return alt(lit(''), x) #opt(x) means that x is optional

def oneof(chars):
    return ('oneof', tuple(chars))

dot = ('dot',)
eol = ('eol',)

def test():
    assert lit('abc') == ('lit', 'abc')
    assert seq(('lit', 'a'), ('lit', 'b')) == ('seq', ('lit', 'a'), ('lit', 'b'))
    assert alt(('lit', 'a'), ('lit', 'b')) == ('alt', ('lit', 'a'), ('lit', 'b'))
    assert star(('lit', 'a')) == ('star', ('lit', 'a'))
    assert plus(('lit', 'c')) == ('seq', ('lit', 'c'), ('star', ('lit', 'c')))
    assert opt(('lit', 'x')) == ('alt', ('lit', ''), ('lit', 'x'))
    assert oneof('abc') == ('oneof', ('a', 'b', 'c'))
    return 'tests pass'

print(test())

tests pass


### Search and Match

In [14]:
#---------------
# User Instructions
#
# Complete the search and match functions. Match should
# match a pattern only at the start of the text. Search
# should match anywhere in the text.

null = frozenset()

def search(pattern, text):
    "Match pattern anywhere in text; return longest earliest match or None."
    # len(text) + 1 so that the empty suffix is tried too (e.g. for eol).
    for i in range(len(text) + 1):
        m = match(pattern, text[i:])
        if m is not None:  # an empty match '' is still a match
            return m

def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = matchset(pattern, text)
    if remainders:
        shortest = min(remainders, key=len)
        # The match is whatever precedes the shortest remainder.
        return text[:len(text) - len(shortest)]

def test():
    assert match(('star', ('lit', 'a')),'aaabcd') == 'aaa'
    assert match(('alt', ('lit', 'b'), ('lit', 'c')), 'ab') == None
    assert match(('alt', ('lit', 'b'), ('lit', 'a')), 'ab') == 'a'
    assert search(('alt', ('lit', 'b'), ('lit', 'c')), 'ab') == 'b'
    return 'tests pass'

print(test())

tests pass


## Compiling

Let's summarize how *interpreters* work. In the case of REs we have *patterns*, e.g., `(a|b)+`, and we have *languages*, i.e., sets of strings like `{'a', 'b', 'ab', 'ba', ...}` defined by the pattern, and then we have **interpreters** like `matchset(pattern, text)` which return a set of strings. We say that `matchset` is an interpreter because it takes a pattern as a data structure and operates over that pattern. As we can see from its implementation, it has a big `if-elif-else` statement where it checks what type of operator we have in order to select the next action.

There is an inherent inefficiency, in that the pattern is only defined once, but we may want to apply that same pattern to many different texts. Every time we have to go through the sequence of `if-elif-else` in order to figure out what type of operator we have, but we should already know that.

There is another type of language processor, called the **compiler**, which does all this work once, the very first time the pattern is defined. Whereas an interpreter takes a pattern and a text and operates on those, a compiler has two steps. In the first step there is a compilation function which takes just the pattern and returns a *compiled object*, let's call it `c`. Then there is the execution of that compiled object, where we apply it to the text, i.e., `c(text)`, to get the result. While in the interpreter all the work is done by the interpreter itself, in our case by `matchset()`, in the case of a compiler some of the work is done during the compilation stage and some happens every time we get a new text.
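The two steps can be sketched as follows. This is only an illustration under stated assumptions: `compile_pattern` is a hypothetical helper, not part of the unit, which walks a tuple pattern once and returns a function of `text`; it covers only `lit`, `alt` and `seq`.

```python
null = frozenset()

def compile_pattern(pattern):
    """Hypothetical helper: turn a tuple pattern into a function of text."""
    op = pattern[0]
    if op == 'lit':
        s = pattern[1]
        return lambda text: {text[len(s):]} if text.startswith(s) else null
    if op == 'alt':
        fx, fy = compile_pattern(pattern[1]), compile_pattern(pattern[2])
        return lambda text: fx(text) | fy(text)
    if op == 'seq':
        fx, fy = compile_pattern(pattern[1]), compile_pattern(pattern[2])
        return lambda text: {t2 for t1 in fx(text) for t2 in fy(t1)}
    raise ValueError('unknown pattern: %r' % (pattern,))

# Step 1: compile once. The dispatch on op happens here only.
c = compile_pattern(('seq', ('lit', 'a'), ('alt', ('lit', 'b'), ('lit', 'c'))))

# Step 2: execute many times. No if-elif chain is consulted any more.
for text in ['abx', 'acy', 'adz']:
    print(text, c(text))
```

The operator dispatch is paid once per pattern rather than once per text, which is the saving the compiler approach is after.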

Let's see how it works with a reduced form of `matchset(pattern, text)` limited to the `lit` case.

**TODO**: this testing function is currently useless. It already treats the API calls as functions rather than as the tuples used above.

In [18]:
def litmatchset(pattern, text):
    op, x, y = components(pattern)
    if 'lit' == op:
        return set([text[len(x):]]) if text.startswith(x) else null

We are going to take individual statements like the one above and put them in various parts of the compiler; each of those parts is going to live in the constructor for the individual type of pattern. In the case of `lit` we defined

```python
def lit(s):
    return ('lit', s)
```

but now we are going to return a function, precisely a function whose body is the expression on line 4 of `litmatchset`. Now, as soon as we construct a literal, instead of getting a tuple we get the function from the text to the result that `matchset()` would have given us. We can then apply this function to `text`.

In [23]:
def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null

lit('a')('a string')

{' string'}

## Lower Level Compilers

We can define a pattern, say `pat = lit('a')`, which is now a function, not a tuple, and which gives us the set of the remainders. In an interpreter we have patterns that describe the strings, i.e., the language. In a compiler we have two sets of descriptions to deal with: a description of what the pattern looks like and a description of what the compiled code looks like. In our case, the compiled code consists of Python functions, which are a good target representation because they are flexible. Compilers for languages like C generate code that is the actual machine instructions for the computer, and this is a complex process. There is also an intermediate approach that generates code for a virtual machine. Java and Python follow this approach. The `dis(code)` function from the `dis` module displays the *bytecode* that the Python virtual machine executes.

In [25]:
import dis
import math

dis.dis(lambda x, y: math.sqrt(x**2 + y**2))

  4           0 RESUME                   0
              2 LOAD_GLOBAL              0 (math)
             14 LOAD_METHOD              1 (sqrt)
             36 LOAD_FAST                0 (x)
             38 LOAD_CONST               1 (2)
             40 BINARY_OP                8 (**)
             44 LOAD_FAST                1 (y)
             46 LOAD_CONST               1 (2)
             48 BINARY_OP                8 (**)
             52 BINARY_OP                0 (+)
             56 PRECALL                  1
             60 CALL                     1
             70 RETURN_VALUE


In the code below, we implement the compiled functions for `lit(s)`, `seq(x, y)` and `alt(x, y)`, along with the remaining operators. The point to remember is that each of `x` and `y` is a function returning a set; therefore the function compiled for `alt(x, y)` returns the union of `x(text)`, which is a set, and `y(text)`, which is another set.

In [27]:
def lit(s):
    return lambda text: set([text[len(s):]]) if text.startswith(s) else null

def seq(x, y):
    return lambda text: set().union(*map(y, x(text)))

def alt(x, y):
    return lambda text: x(text).union(y(text))
        
null = frozenset([])

def oneof(chars):
    return lambda t: set([t[1:]]) if (t and t[0] in chars) else null

def star(x): return lambda t: (set([t]) | 
                               set(t2 for t1 in x(t) if t1 != t
                                   for t2 in star(x)(t1)))

dot = lambda t: set([t[1:]]) if t else null
eol = lambda t: set(['']) if t == '' else null

def test():
    g = alt(lit('a'), lit('b'))
    assert g('abc') == set(['bc'])
    return 'test passes'

print(test())

test passes


With this approach, `match(pattern, text)` can be written as follows.

In [29]:
def match(pattern, text):
    "Match pattern against start of text; return longest match found or None."
    remainders = pattern(text)  # pattern is already a compiled function
    if remainders:
        shortest = min(remainders, key=len)
        return text[:len(text)-len(shortest)]
    
# null = frozenset([])  # Defined in a cell above

def test():
    assert match(star(lit('a')), 'aaaaabbbaa') == 'aaaaa'
    assert match(lit('hello'), 'hello how are you?') == 'hello'
    assert match(lit('x'), 'hello how are you?') == None
    assert match(oneof('xyz'), 'x**2 + y**2 = r**2') == 'x'
    assert match(oneof('xyz'), '   x is here!') == None
    return 'tests pass'

print(test())

tests pass


## Recognizers and Generators

What we have done so far is the **recognizer task**. We have a function `match(pattern, text)` which *recognizes* if the prefix of `text` is in the language defined by `pattern`.

The **generator task** takes a pattern `pattern` and generates the complete language defined by that pattern. For example, the pattern `(a|b)(a|b)` generates the language `{'aa', 'ab', 'ba', 'bb'}`. If the pattern is `a*`, the corresponding language is an infinite set. We could use a generator function to produce the elements of such a language one at a time, but we will instead limit the lengths of the strings we want, which always produces finite sets. We will take the compiler's approach: instead of calling a generator function `gen(pat)` on the pattern, we will have the generator compiled into the pattern. `pat` will therefore be a function, which we apply to a set of integers representing the lengths we want to retrieve, and which returns a set of strings. For example, if `pat` represents `a*`, `pat({1, 2, 3})` should return all strings of length 1, 2, or 3, i.e., `{'a', 'aa', 'aaa'}`. The functions below implement the generators for the various operations.

This is the whole compiler, apart from `genseq()`, which is discussed further below.

In [30]:
def lit(s):
    return lambda Ns: set([s]) if len(s) in Ns else null

def alt(x, y):
    return lambda Ns: x(Ns) | y(Ns)

def star(x):
    return lambda Ns: opt(plus(x))(Ns)

def plus(x):
    return lambda Ns: genseq(x, star(x), Ns, startx=1) #Tricky

def oneof(chars):
    return lambda Ns: set(chars) if 1 in Ns else null

def seq(x, y):
    return lambda Ns: genseq(x, y, Ns)

def opt(x):
    return alt(epsilon, x)

dot = oneof('?')    # You could expand the alphabet to more chars.
epsilon = lit('')   # The pattern that matches the empty string.

def test():
    
    f = lit('hello')
    assert f(set([1, 2, 3, 4, 5])) == set(['hello'])
    assert f(set([1, 2, 3, 4]))    == null 
    
    g = alt(lit('hi'), lit('bye'))
    assert g(set([1, 2, 3, 4, 5, 6])) == set(['bye', 'hi'])
    assert g(set([1, 3, 5])) == set(['bye'])
    
    h = oneof('theseletters')
    assert h(set([1, 2, 3])) == set(['t', 'h', 'e', 's', 'l', 'r'])
    assert h(set([2, 3, 4])) == null
    
    return 'tests pass'

print(test())

tests pass


We can make the compiler more efficient. For example, in the definition of `lit(s)` we call `set([s])` every time the output of `lit(s)` is called. This seems wasteful. A better way of writing it is:

In [None]:
def lit(s):
    set_s = set([s])  # We create this only once
    # Every time we call the function below, we refer to the set_s defined above
    return lambda Ns: set_s if len(s) in Ns else null

Similarly, we can pull out the `set(chars)` in the definition of `oneof(chars)`.
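A minimal sketch of that change, following the same idea as `lit(s)`; the local name `set_chars` is chosen for this example.

```python
null = frozenset()

def oneof(chars):
    set_chars = set(chars)  # Built once, when the pattern is constructed
    # Every call of the returned function reuses set_chars.
    return lambda Ns: set_chars if 1 in Ns else null

h = oneof('theseletters')
print(sorted(h({1, 2, 3})))   # the distinct characters of the string
print(h({2, 3, 4}))           # no string of length 1 is allowed
```

One consequence of sharing the set is that callers should treat the returned set as read-only, since mutating it would affect later calls.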

We still must define `genseq()`. If we pass two arguments `x, y` to `seq(x, y)` (not `genseq()`), it returns a function of `Ns`, `fn(Ns)`, which returns a set of texts that match. In this respect, `seq(x, y)` delays the computation of the output. `genseq(x, y, Ns)`, instead, calculates the output set immediately. One thing we know about this function is that we will have to call `x(Nx)`, where `Nx` is a set of numbers which we don't yet know, and then `y(Ny)`, where `Ny` is a possibly different set of numbers; then we have to concatenate the results and check whether the concatenation of some `x` and some `y` has a length allowed by `Ns`. What do we know about `Nx` and `Ny` with respect to `Ns`? `Ns` could be a dense set, say `{0, 1, 2, ..., 10}`, or it could be a sparse set, say just `{10}`, but in either case the lengths satisfy `Nx + Ny <= 10`, and `Nx` can be anything up to 10, therefore

**TO BE CONTINUED** - VIDEO Genseq
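As a hedged sketch only, here is one way the reasoning above could be turned into code. The body of `genseq()` and the meaning given to `startx` (the smallest length allowed for the `x` part, used by `plus` to stop the recursion) are assumptions made for this example, not necessarily the course's implementation. The other definitions repeat the ones above so that the block runs on its own.

```python
null = frozenset()

def lit(s):
    set_s = set([s])
    return lambda Ns: set_s if len(s) in Ns else null

def alt(x, y):  return lambda Ns: x(Ns) | y(Ns)
def star(x):    return lambda Ns: opt(plus(x))(Ns)
def plus(x):    return lambda Ns: genseq(x, star(x), Ns, startx=1)
def seq(x, y):  return lambda Ns: genseq(x, y, Ns)
def opt(x):     return alt(epsilon, x)

epsilon = lit('')

def genseq(x, y, Ns, startx=0):
    """Assumed sketch: strings xy whose total length is in Ns."""
    if not Ns:
        return null
    # x may use any length from startx up to the largest allowed total.
    xmatches = x(set(range(startx, max(Ns) + 1)))
    Ns_x = set(len(m) for m in xmatches)
    # y must fill whatever length is left over for some allowed total.
    Ns_y = set(n - m for n in Ns for m in Ns_x if n - m >= 0)
    ymatches = y(Ns_y)
    return set(m1 + m2 for m1 in xmatches for m2 in ymatches
               if len(m1 + m2) in Ns)

a = lit('a')
print(sorted(star(a)({1, 2, 3})))        # ['a', 'aa', 'aaa']
print(seq(lit('a'), lit('b'))({2}))      # {'ab'}
```

With `startx=1`, the first `x` in `plus(x)` must produce at least one character, so each recursive `star(x)` is asked for strictly shorter strings and the recursion terminates.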

In [17]:
def test_search():
    a, b, c = lit('a'), lit('b'), lit('c')
    abcstars = seq(star(a), seq(star(b), star(c)))
    dotstar = star(dot)
    assert search(lit('def'), 'abcdefg') == 'def'
    assert search(seq('def', eol), 'abcdef') == 'def'
    assert search(seq(lit('def'), eol), 'abcdefg') == None
    assert search(a, 'not the start') == 'a'
    assert match(a, 'not the start') == None
    assert match(abcstars, 'aaabbbccccccccdef') == 'aaabbbcccccccc'
    assert match(abcstars, 'junk') == ''
    assert all(match(seq(abcstars, eol), s) == s for s in 'abc aaabbccc aaaabcccc'.split())
    assert all(match(seq(abcstars, eol), s) == None for s in 'cab aaabbcccd aaaa-b-cccc'.split())
    r = seq(lit('ab'), seq(dotstar, seq(lit('aca'), seq(dotstar, seq(a, eol)))))
    assert all(search(r, s) is not None for s in 'abracadabra abacaa about-acacia-flora'.split())
    assert all(match(seq(c, seq(dotstar, b)), s) is not None for s in 'cab cob carob cb carbuncle'.split())
    assert not any(match(seq(c, seq(dot, b)), s) for s in 'carb cb across scab'.split())
    return 'test_search passes'

# print(test_search())