# assignment-03


## Part 1: Searching Unsorted Lists

As we know, the binary search algorithm takes as input a sorted list
of length $n$ and a specified key and is able to find it (or conclude
that it is not in the list) in $O(\log n)$ time. Let's consider a
slightly different problem in which we are given an unsorted list $L$
with a key $x$, and we want to determine whether $x$ is in $L$. For each
part below, design an algorithm using the prescribed sequence
operation. Note that you can preprocess the list as needed.

**1a)** Use `iterate` to implement the `isearch` stub, and check that your
code passes the test cases given by `test_isearch` (feel free to add
additional cases). 

.  
.  

**1b)** What is the work and span of this algorithm?

> $O(n)$ work and $O(n)$ span

.  
.  


**1c)** Now, use `reduce` to implement the `rsearch` stub. Test it with `test_rsearch`.


.  

**1d)** What is the work and span of the resulting algorithm, assuming that `reduce` is implemented as specified in the lecture notes?

> $O(n)$ work and $O(\lg n)$ span

.  


**1e)** Finally, let's consider another implementation of `reduce` as given
by `ureduce` in `main.py`. That is, if you replace `reduce` from part 1c) with
`ureduce` then there should be no difference in output. However, what
is the work and span of the resulting algorithm for `rsearch`?

> $W(n) = W(n/3) + W(2n/3) + 1$

> $W(n) = O(n)$

> leaf dominated (1 -> 2 -> 4 -> ...)

> we know that the total number of leaves is $n$ -- no part of the input is inspected more than once.

> $S(n) = O(\lg n)$

> balanced: (1->1->1->1)

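> span = (span of the slowest node on a level) $\times$ (number of levels); the deepest path always follows the $2n/3$ branch, so $S(n) = S(2n/3) + 1$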

> base of log doesn't matter for big O

> $\lg_{3/2} n \in O(\lg n)$

.  
.  
.  
> Why doesn't base of log matter for big O? recall change of base formula:

> $\log_a n = \frac{\log_b n}{\log_b a}$, where $\log_b a$ is a constant.
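The recurrences above can also be checked empirically. The sketch below is not part of `main.py`; the helper `measure` and its `split` parameter are hypothetical names introduced only for this illustration. It mirrors the recursion shape of `reduce` (split at $n/2$) and `ureduce` (split at $n/3$), counts one unit of work per call to `f`, and takes the maximum over the two branches for span, assuming the branches run in parallel.

```python
def measure(split, n):
    """Return (work, span) for a reduce-style recursion on n elements.

    split(n) is the size of the left sublist. Work counts calls to the
    combining function f; span is the longest chain of dependent calls,
    assuming the two recursive calls run in parallel.
    """
    if n <= 1:          # empty or singleton list: f is never called
        return 0, 0
    if n == 2:          # both reduce and ureduce call f exactly once here
        return 1, 1
    left = split(n)
    work_l, span_l = measure(split, left)
    work_r, span_r = measure(split, n - left)
    return work_l + work_r + 1, max(span_l, span_r) + 1


# Compare the balanced split (reduce) with the 1/3 : 2/3 split (ureduce).
for n in [16, 256, 4096]:
    print(n, measure(lambda m: m // 2, n), measure(lambda m: m // 3, n))
```

Both splits call `f` exactly $n - 1$ times, so the work is the same. The uneven split only stretches the span by a constant factor, which is what $\lg_{3/2} n \in O(\lg n)$ predicts.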

In [33]:
### PART 1: SEARCHING UNSORTED LISTS

# search an unordered list L for a key x using iterate
def isearch(L, key):
    ###TODO
    def search_f(key, cur_output, next_input):
        return next_input == key or cur_output
    return iterate(lambda c,n: search_f(key, c, n), False, L)
    ###

def test_isearch():
    assert isearch([1, 3, 5, 4, 2, 9, 7], 2) == (2 in [1, 3, 5, 4, 2, 9, 7])
    assert isearch([1, 3, 5, 2, 9, 7], 7) == (7 in [1, 3, 5, 2, 9, 7])
    assert isearch([1, 3, 5, 2, 9, 7], 99) == (99 in [1, 3, 5, 2, 9, 7])
    assert isearch([], 2) == (2 in [])


def iterate(f, x, a):
    # done. do not change me.
    if len(a) == 0:
        return x
    else:
        return iterate(f, f(x, a[0]), a[1:])

# search an unordered list L for a key x using reduce
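def rsearch(L, x):
    ###TODO
    # Preprocess: map every element to a boolean, then OR the booleans together.
    # Combining raw elements is wrong: reduce returns a[0] unchanged for a
    # one-element list, and in Python False == 0, so searching for 0 could
    # match an intermediate False result.
    return reduce(lambda a, b: a or b, False, [v == x for v in L])
    ###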

def test_rsearch():
    assert rsearch([1, 3, 5, 4, 2, 9, 7], 2) == (2 in [1, 3, 5, 4, 2, 9, 7])
    assert rsearch([1, 3, 5, 2, 9, 7], 7) == (7 in [1, 3, 5, 2, 9, 7])
    assert rsearch([1, 3, 5, 2, 9, 7], 99) == (99 in [1, 3, 5, 2, 9, 7])
    assert rsearch([], 2) == (2 in [])

def reduce(f, id_, a):
    # done. do not change me.
    if len(a) == 0:
        return id_
    elif len(a) == 1:
        return a[0]
    else:
        # can call these in parallel
        res = f(reduce(f, id_, a[:len(a)//2]),
                 reduce(f, id_, a[len(a)//2:]))
        return res

    
def ureduce(f, id_, a):
    if len(a) == 0:
        return id_
    elif len(a) == 1:
        return a[0]
    elif len(a) == 2:
        return f(a[0], a[1])
    else:
        # can call these in parallel
        return f(ureduce(f, id_, a[:len(a)//3]),
                 ureduce(f, id_, a[len(a)//3:]))
    
# def ureduce(f, id_, a):
#     if len(a) == 0:
#         return id_
#     elif len(a) == 1:
#         return a[0]
#     else:
#         # can call these in parallel
#         return f(reduce(f, id_, a[:len(a)//3]),
#                  reduce(f, id_, a[len(a)//3:]))


test_isearch()
test_rsearch()
# rsearch([1, 3, 5, 4, 2, 9, 7,10,2,102,1,201,2], 2)


## Part 2: Document indexing

A key component of search engines is a data structure called an **inverted index** which maps each word to the list of documents it appears in.

Assume we have three documents with ids 0,1,2:

```python
[
    ('document one is cool is it', 0),
    ('document two is also cool', 1),
    ('document three is kinda neat', 2)
]
```

then an inverted index would be

```python
[('also', [1]),
 ('cool', [0, 1]),
 ('document', [0, 1, 2]),
 ('is', [0, 1, 2]),
 ('it', [0]),
 ('kinda', [2]),
 ('neat', [2]),
 ('one', [0]),
 ('three', [2]),
 ('two', [1])]
```

To implement this in map-reduce, we will implement our own map and reduce functions, similar to `recitation-04`.

The map function `doc_index_map` is already complete. E.g.

```python
>>> doc_index_map('document one is cool is it', 0)
    [('document', 0), ('one', 0), ('is', 0), ('cool', 0), ('is', 0), ('it', 0)]
```

The reduce function is also implemented, but it has a bug:

```python
>>> doc_index_reduce(['is', [0,0,1,2]])
    ('is', [0,0,1,2])
```

The problem is that document ids are duplicated in the final output (e.g., `0` in the above example).

While of course we could just fix `doc_index_map` to not emit duplicates, we will instead modify the `doc_index_reduce` function. We will do so with the help of another function `dedup` which takes in two sorted, deduplicated lists and returns their concatenation without any duplicates:

```python
>>> dedup([1,2,3], [3,4,5])
[1,2,3,4,5]
```

1a. Implement `dedup` **in constant time** and test it with `test_dedup`. 

1b. Modify the `doc_index_reduce` function to use both `dedup` and `reduce`. Test it with `test_doc_index_reduce`.
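The constant-time check works because of a precondition that is easy to miss: when both lists are sorted and deduplicated, and every element of the first list is less than or equal to every element of the second, a duplicate can only occur at the boundary between them. The sketch below traces one possible way the merges unfold when `reduce` is applied to singleton lists. The wrapper `traced_dedup` and the `merges` log are hypothetical names added for illustration, and `reduce` is a local copy with the same shape as the one in this assignment.

```python
def dedup(a, b):
    # Only the boundary can hold a duplicate, so compare a's last element
    # with b's first element. The `a and b` guard covers empty lists.
    if a and b and a[-1] == b[0]:
        return a[:-1] + b
    return a + b


def reduce(f, id_, a):
    # Divide-and-conquer reduce: split in half, combine the two results.
    if len(a) == 0:
        return id_
    elif len(a) == 1:
        return a[0]
    return f(reduce(f, id_, a[:len(a)//2]),
             reduce(f, id_, a[len(a)//2:]))


merges = []  # log of (left, right, merged) for every call to dedup


def traced_dedup(a, b):
    out = dedup(a, b)
    merges.append((a, b, out))
    return out


# Sorted docids with a duplicate, wrapped as singleton lists.
result = reduce(traced_dedup, [], [[d] for d in [0, 0, 1, 2]])
for a, b, out in merges:
    print(a, '+', b, '->', out)
print(result)  # [0, 1, 2]
```

Note that this relies on the docid list arriving in sorted order, which `collect` provides by sorting the pairs before grouping them.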

In [61]:
from collections import defaultdict

def run_map_reduce(map_f, reduce_f, docs):
    # done. do not change me.
    """    
    The main map reduce logic.
    
    Params:
      map_f......the mapping function
      reduce_f...the reduce function
      docs.......list of input records
    """
    # 1. call map_f on each element of docs and flatten the results
    # e.g., [('i', 1), ('am', 1), ('sam', 1), ('i', 1), ('am', 1), ('sam', 1), ('is', 1), ('ham', 1)]
    pairs = flatten(list(map(map_f, docs)))
    # 2. group all pairs by their key
    # e.g., [('am', [1, 1]), ('ham', [1]), ('i', [1, 1]), ('is', [1]), ('sam', [1, 1])]
    groups = collect(pairs)
    # 3. reduce each group to the final answer
    # e.g., [('am', 2), ('ham', 1), ('i', 2), ('is', 1), ('sam', 2)]
    return [reduce_f(g) for g in groups]


def doc_index_map(doc_tuple):
    """
    Params:
      doc_tuple....a tuple (docstring, docid)
    Returns:
      a list of tuples of form (word, docid), where word is a whitespace delimited element of this string.

    Note that the returned list can contain duplicates.
    E.g.
    >>> doc_index_map('document one is cool is it', 0)
    [('document', 0), ('one', 0), ('is', 0), ('cool', 0), ('is', 0), ('it', 0)]
    """
    ### done. do not change me.
    doc, docid = doc_tuple[0], doc_tuple[1]
    return [(token, docid) for token in doc.split()]

def dedup(a, b):
    """
    Return a concatenation of two lists without any duplicates.
    Assume that input lists a and b already sorted and deduplicated.
    This should be done in _constant_ time (ignoring any time to create or concatenate lists).
    e.g.
    >>> dedup([1,2,3], [3,4,5])
    [1,2,3,4,5]
    """
    ###TODO
    if a and b and a[-1] == b[0]:  # guard against empty lists; only the boundary can repeat
        return a[:-1] + b
    else:
        return a+b
    ###
    
def doc_index_reduce(group):
    """
    Fix this function to instead call the reduce and dedup functions
    to return the _unique_ list of document ids that this word appears in.
    
    Params:
      group...a tuple of the form (word, list_of_docids), indicating the docids containing this word, with duplicates.
    Returns:
      tuple of form (word, list_of_docids), where duplicate docids have been removed.
      
    >>> doc_index_reduce(['is', [0,0,1,2]])
    ('is', [0,1,2])
    """
    # fix this line
    # return (group[0], group[1])
    ###TODO
    return (group[0], reduce(dedup, [], [[x] for x in group[1]]))
    ###
    
def test_dedup():
    assert dedup([1,2,3], [3,4,5]) == [1,2,3,4,5]
    assert dedup([1,2,3], [5,6]) == [1,2,3,5,6]
    
def test_doc_index_reduce():
    assert doc_index_reduce(['is', [0,0,1,2]]) == ('is', [0,1,2])
    assert doc_index_reduce(['is', [0,0,0,0,1,1,1,1,1,1,2,2,2,2]]) == ('is', [0,1,2])

def test_index():
    res = run_map_reduce(doc_index_map, doc_index_reduce,
               [('document one is cool is it', 0),
                ('document two is also cool', 1),
                ('document three is kinda neat', 2)
               ])    
    assert res == [('also', [1]),
                   ('cool', [0, 1]),
                   ('document', [0, 1, 2]),
                   ('is', [0, 1, 2]),
                   ('it', [0]),
                   ('kinda', [2]),
                   ('neat', [2]),
                   ('one', [0]),
                   ('three', [2]),
                   ('two', [1])]
    

def collect(pairs):
    """
    Implements the collect function (see text Vol II Ch2)
    >>> collect([('i', 1), ('am', 1), ('sam', 1), ('i', 1)])
    [('am', [1]), ('i', [1, 1]), ('sam', [1])]    
    """
    ### done
    result = defaultdict(list)
    for pair in sorted(pairs):
        result[pair[0]].append(pair[1])
    return list(result.items())


def plus(x, y):
    # done. do not change me.
    return x + y

def iterate(f, x, a):
    # done. do not change me.
    """
    Params:
      f.....function to apply
      x.....return when a is empty
      a.....input sequence
    """
    if len(a) == 0:
        return x
    else:
        return iterate(f, f(x, a[0]), a[1:])
    
def flatten(sequences):
    # done. do not change me.
    return iterate(plus, [], sequences)

def reduce(f, id_, a):
    # done. do not change me.
    if len(a) == 0:
        return id_
    elif len(a) == 1:
        return a[0]
    else:
        return f(reduce(f, id_, a[:len(a)//2]),
                 reduce(f, id_, a[len(a)//2:]))

In [69]:
# as of python 3.7, dictionaries preserve insertion order.
d = dict()
d['a'] = 1
d['b'] = 2
print(d.items())

e = dict()
e['b'] = 1
e['a'] = 2
print(e.items())

dict_items([('a', 1), ('b', 2)])
dict_items([('b', 1), ('a', 2)])


## Part 3: Parenthesis Matching

A common task of compilers is to ensure that parentheses are matched. That is, each open parenthesis is followed at some point by a closed parenthesis. Furthermore, a closed parenthesis can only appear if there is a corresponding open parenthesis before it. So, the following are valid:

- `( ( a ) b )`
- `a () b ( c ( d ) )`

but these are invalid:

- `( ( a )`
- `(a ) ) b (`

Below, we'll solve this problem three different ways, using iterate, scan, and divide and conquer.

**2a. iterative solution** Implement `parens_match_iterative`, a solution to this problem using the `iterate` function. **Hint**: consider using a single counter variable to keep track of whether there are more open or closed parentheses. How can you update this value while iterating from left to right through the input? What must be true of this value at each step for the parentheses to be matched? To complete this, complete the `parens_update` function and the `parens_match_iterative` function. The `parens_update` function will be called in combination with `iterate` inside `parens_match_iterative`. Test your implementation with `test_parens_match_iterative`.

.  
. 



**2b.** What are the recurrences for the Work and Span of this solution? What are their Big Oh solutions?


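> $W(n) = W(n-1) + 1 \in O(n)$, treating each call to `parens_update` as constant work

> $S(n) = S(n-1) + 1 \in O(n)$, since each step depends on the result of the previous one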

.  
. 



**2c. scan solution** Implement `parens_match_scan`, a solution to this problem using `scan`. **Hint**: We have given you the function `paren_map`, which maps `(` to `1`, `)` to `-1`, and everything else to `0`. How can you pass this function to `scan` to solve the problem? You may also find the `min_f` function useful here. Implement `parens_match_scan` and test with `test_parens_match_scan`.

.  
. 



**2d.** Assume that any `map`s are done in parallel, and that we use the efficient implementation of `scan` from class. What are the recurrences for the Work and Span of this solution? 

```python
    history, last = scan(plus, 0, list(map(paren_map, mylist)))
    return last == 0 and reduce(min_f, 0, history) >= 0
```

> - map has $O(n)$ work and $O(1)$ span
> - scan has $O(n)$ work and $O(\lg n)$ span
> - reduce has $O(n)$ work and $O(\lg n)$ span
> - So, combination has $O(n)$ work and $O(\lg n)$ span
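The $O(n)$ work and $O(\lg n)$ span quoted for `scan` do not hold for the list-comprehension version given in this notebook, which calls `reduce` once per prefix. As a rough sketch of one common work-efficient formulation (scan by contraction), which may differ in detail from the version presented in class: combine adjacent pairs, recurse on the half-length sequence, then expand. It assumes `f` is associative and `id_` is its identity. The names `scan_exclusive` and `scan_inclusive` are hypothetical and chosen for this illustration.

```python
def plus(x, y):
    return x + y


def scan_exclusive(f, id_, a):
    """Contraction-based scan: returns (exclusive prefixes, total)."""
    if len(a) == 0:
        return [], id_
    if len(a) == 1:
        return [id_], a[0]
    # Contract: combine adjacent pairs (pairs are independent, so this is a parallel map).
    pairs = [f(a[i], a[i + 1]) if i + 1 < len(a) else a[i]
             for i in range(0, len(a), 2)]
    # Recurse on the half-length sequence.
    partial, total = scan_exclusive(f, id_, pairs)
    # Expand: even positions copy a partial result, odd positions add one element.
    prefixes = [partial[i // 2] if i % 2 == 0 else f(partial[i // 2], a[i - 1])
                for i in range(len(a))]
    return prefixes, total


def scan_inclusive(f, id_, a):
    """Same output convention as this notebook's `scan`: prefix i includes a[i]."""
    prefixes, total = scan_exclusive(f, id_, a)
    return [f(p, x) for p, x in zip(prefixes, a)], total


print(scan_inclusive(plus, 0, [1, -1, -1, 1]))  # ([1, 0, -1, 0], 0)
```

With the contract and expand steps treated as parallel maps, the recurrences are $W(n) = W(n/2) + O(n)$, which is root dominated and gives $O(n)$, and $S(n) = S(n/2) + O(1)$, which is balanced and gives $O(\lg n)$.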


**2e. divide and conquer solution** Implement `parens_match_dc_helper`, a divide and conquer solution to the problem. A key observation is that we *cannot* simply solve each subproblem using the above solutions and combine the results. E.g., consider '((()))', which would be split into '(((' and ')))', neither of which is matched. Yet, the whole input is matched. Instead, we'll have to keep track of two numbers: the number of unmatched right parentheses (R), and the number of unmatched left parentheses (L). `parens_match_dc_helper` returns a tuple (R,L). So, if the input is just '(', then `parens_match_dc_helper` returns (0,1), indicating that there is 1 unmatched left parens and 0 unmatched right parens. Analogously, if the input is just ')', then the result should be (1,0). The main difficulty is deciding how to merge the returned values for the two recursive calls. E.g., if (i,j) is the result for the left half of the list, and (k,l) is the output of the right half of the list, how can we compute the proper return value (R,L) using only i,j,k,l? Try a few example inputs to guide your solution, then test with `test_parens_match_dc_helper`.



.  
. 





**2f.** What are the recurrences for the Work and Span of this solution? What are their Big Oh solutions?

> $W(n) = 2W(n/2) + 1$

> leaf dominated: 1 -> 2 -> 4 -> 8 -> ... 

> number of leaves is $2^{\lg n}$

> -> $O(n)$

.  
.  

> $S(n) = S(n/2) + 1$

> balanced: 1 -> 1 -> 1 -> ...

> -> $O(\lg n)$



In [70]:
# parens match
import math    

def iterate(f, x, a):
    """
    Params:
      f.....function to apply
      x.....return when a is empty
      a.....input sequence
    """
    if len(a) == 0:
        return x
    else:
        return iterate(f, f(x, a[0]), a[1:])

def parens_update(current_output, next_input):
    """
    This function will be passed to the `iterate` function to 
    solve the balanced parenthesis problem.
    
    Like all functions used by iterate, it takes in:
    current_output....the cumulative output thus far (e.g., the running sum when doing addition)
    next_input........the next value in the input
    
    Returns:
      the updated value of `current_output`
    """
    ###TODO    
    # Almost, but not quite (consider ") ("): a plain counter ends at 0 even
    # though a close paren appears before its matching open paren.
    # if next_input == '(':            # new open parens
    #     return current_output + 1
    # elif next_input == ')':          # new close parens
    #     return current_output - 1
    # else:
    #     return current_output
        
    
    if current_output == -math.inf:  # in an invalid state; carry it forward
        return current_output
    if next_input == '(':            # new open parens 
        return current_output + 1
    elif next_input == ')':          # new close parens
        if current_output <= 0:      # close before an open -> invalid
            return -math.inf
        else:                        # valid
            return current_output - 1
    else:                            # ignore non-parens input
        return current_output
    ###
    
def parens_match_iterative(mylist):
    """
    Implement the iterative solution to the parens matching problem.
    This function should call `iterate` using the `parens_update` function.
    
    Params:
      mylist...a list of strings
    Returns
      True if the parentheses are matched, False otherwise
      
    e.g.,
    >>> parens_match_iterative(['(', 'a', ')'])
    True
    >>> parens_match_iterative(['('])
    False
    """
    ### TODO
    return iterate(parens_update, 0, mylist) == 0
    ###

def test_parens_match_iterative():
    assert parens_match_iterative(['(', ')']) == True
    assert parens_match_iterative(['(']) == False
    assert parens_match_iterative([')']) == False
    assert parens_match_iterative(['(', 'a', ')', '(', ')']) == True
    assert parens_match_iterative(['(',  '(', '(', ')', ')', ')']) == True
    assert parens_match_iterative(['(', '(', ')']) == False
    assert parens_match_iterative(['(', 'a', ')', ')', '(']) == False
    assert parens_match_iterative([]) == True
    
test_parens_match_iterative()

In [45]:
def scan(f, id_, a):
    """
    This is a horribly inefficient implementation of scan
    only to understand what it does.
    We'll discuss how to make it more efficient later.
    """
    return (
            [reduce(f, id_, a[:i+1]) for i in range(len(a))],
             reduce(f, id_, a)
           )

def paren_map(x):
    """
    Returns 1 if input is '(', -1 if ')', 0 otherwise.
    This will be used by your `parens_match_scan` function.
    
    Params:
       x....an element of the input to the parens match problem (e.g., '(' or 'a')
       
    >>> paren_map('(')
    1
    >>> paren_map(')')
    -1
    >>> paren_map('a')
    0
    """
    if x == '(':
        return 1
    elif x == ')':
        return -1
    else:
        return 0

def min_f(x,y):
    """
    Returns the min of x and y. Useful for `parens_match_scan`.
    """
    if x < y:
        return x
    return y

def parens_match_scan(mylist):
    """
    Implement a solution to the parens matching problem using `scan`.
    This function should make one call each to `scan`, `map`, and `reduce`
    
    Params:
      mylist...a list of strings
    Returns
      True if the parentheses are matched, False otherwise
      
    e.g.,
    >>> parens_match_scan(['(', 'a', ')'])
    True
    >>> parens_match_scan(['('])
    False
    
    """
    ###TODO
    history, last = scan(plus, 0, list(map(paren_map, mylist)))
    print(history, last)
    return last == 0 and reduce(min_f, 0, history) >= 0
    ###

def test_parens_match_scan():
    assert parens_match_scan(['(', ')']) == True
    assert parens_match_scan(['(']) == False
    assert parens_match_scan([')']) == False
    assert parens_match_scan(['(', 'a', ')', '(', ')']) == True
    assert parens_match_scan(['(',  '(', '(', ')', ')', ')']) == True
    assert parens_match_scan(['(', '(', ')']) == False
    assert parens_match_scan(['(', 'a', ')', ')', '(']) == False
    assert parens_match_scan([]) == True

# test_parens_match_scan()    
parens_match_scan(['(', ')', ')', '('])

[1, 0, -1, 0] 0


False

In [60]:
# D&C parens_match

def parens_match_dc_helper(mylist):
    """
    Recursive, divide and conquer solution to the parens match problem.
    
    Returns:
      tuple (R, L), where R is the number of unmatched right parentheses, and
      L is the number of unmatched left parentheses. This output is used by 
      parens_match_dc to return the final True or False value
    """
    ###TODO
    # Base cases
    if len(mylist) == 0:
        return (0, 0)  # tuple, consistent with the other return values
    elif len(mylist) == 1:
        if mylist[0] == '(':
            return (0, 1) # one unmatched (
        elif mylist[0] == ')':
            return (1, 0) # one unmatched )    
        else:
            return (0, 0)
    r1,l1 = parens_match_dc_helper(mylist[:len(mylist)//2])
    r2,l2 = parens_match_dc_helper(mylist[len(mylist)//2:])
    # Combination:
    # Return the tuple (R,L) using some combination of the values r1,l1,r2,l2 defined above.
    # This should be done in constant time.
    if l1 > r2:
        return (r1, (l1 - r2) + l2)
    else:
        return ( (r2 - l1) + r1,   l2)
    ###
    # if we did this, would return negative values 
    # return ((r2-l1)+r1, (l1-r2)+l2)
    
def parens_match_dc(mylist):
    """
    Calls parens_match_dc_helper. If the result is (0,0),
    that means there are no unmatched parentheses, so the input is valid.
    
    Returns:
      True if parens_match_dc_helper returns (0,0); otherwise False
    """
    # done.
    n_unmatched_right, n_unmatched_left = parens_match_dc_helper(mylist)  # helper returns (R, L)
    return n_unmatched_left==0 and n_unmatched_right==0

def test_parens_match_dc():
    assert parens_match_dc(['(', ')']) == True
    assert parens_match_dc(['(']) == False
    assert parens_match_dc([')']) == False
    assert parens_match_dc(['(', 'a', ')', '(', ')']) == True
    assert parens_match_dc(['(',  '(', '(', ')', ')', ')']) == True
    assert parens_match_dc(['(', '(', ')']) == False
    assert parens_match_dc(['(', 'a', ')', ')', '(']) == False
    assert parens_match_dc([]) == True    

test_parens_match_dc()
# parens_match_dc(['(', 'a', ')', ')', '(', ])

|input 1 | input 2|$R_1$|$L_1$|$R_2$|$L_2$| ->|$R_o$|$L_o$|
|--------|--------|-----|-----|-----|-----|-  |-----|-----|
|   (    | )      | 0   | 1   | 1   | 0   |.  |  0  |  0  |
|   ( (  | )      | 0   | 2   | 1   | 0   |.  |  0  |  1  |
|  ( ( ( | ) ) (  | 0   | 3   | 2   | 1   |.  |  0  |  2  |
|   (    | ) ) (  | 0   | 1   | 2   | 1   |.  |  1  |  1  |
|   (    | ) (    | 0   | 1   | 1   | 1   |.  |  0  |  1  |
|   )    | (      | 1   | 0   | 0   | 1   |.  |  1  |  1  |
| ( )    | ( )    | 0   | 0   | 0   | 0   |.  |  0  |  0  |
| ( a    | a )    | 0   | 1   | 1   | 0   |.  |  0  |  0  |
| ( )    | ) (    | 0   | 0   | 1   | 1   |.  |  1  |  1  |
| ( (    | ) )    | 0   | 2   | 2   | 0   |.  |  0  |  0  |
| ( (    | ) (    | 0   | 2   | 1   | 1   |.  |  0  |  2  |
| ) (    | ) )    | 1   | 1   | 2   | 0   |.  |  2  |  0  |
| a )    | ( )    | 1   | 0   | 0   | 0   |.  |  1  |  0  |

- $R_1$: number of unmatched right parentheses in input 1
- $L_1$: number of unmatched left parentheses in input 1
- $R_2$: number of unmatched right parentheses in input 2
- $L_2$: number of unmatched left parentheses in input 2

If $L_1 > R_2$ $~~~~\Rightarrow~~~~$ $L_o = (L_1 - R_2) + L_2$

If $L_1 \le R_2$ $~~~~\Rightarrow~~~~$ $L_o = L_2$

If $L_1 > R_2$ $~~~~\Rightarrow~~~~$ $R_o = R_1$

If $L_1 \le R_2$ $~~~~\Rightarrow~~~~$ $R_o = (R_2 - L_1) + R_1$
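The four cases above collapse into a single rule: exactly $\min(L_1, R_2)$ pairs are matched across the boundary, and those pairs are removed from both counts. A minimal sketch follows; the function name `combine` and the `matched` variable are hypothetical, introduced here to illustrate the rule rather than taken from the assignment stub.

```python
def combine(left, right):
    """Merge the (R, L) results of the left and right halves.

    Unmatched '(' from the left half pair up with unmatched ')' from the
    right half; min(l1, r2) such pairs form, and the rest stay unmatched.
    """
    r1, l1 = left
    r2, l2 = right
    matched = min(l1, r2)
    return (r1 + r2 - matched, l1 + l2 - matched)


# A few rows from the table: (R1, L1), (R2, L2) -> (Ro, Lo)
print(combine((0, 1), (1, 0)))  # '('   + ')'    -> (0, 0)
print(combine((0, 1), (2, 1)))  # '('   + ')) (' -> (1, 1)
print(combine((1, 0), (0, 1)))  # ')'   + '('    -> (1, 1)
print(combine((0, 2), (1, 1)))  # '(('  + ') ('  -> (0, 2)
```

Because `matched` never exceeds $L_1$ or $R_2$, neither count can go negative, which is the problem the case split is designed to avoid.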

In [56]:
l1 = 2
r1 = 0
l2 = 1
r2 = 1

if l1 > r2:
    print('return_one:  %d %d' % (r1, l2 + l1 - r2))
else:
    print('return_two %d %d' % (r1 + r2 - l1, l2))


return_one:  0 2
