# CSPB 3104 Assignment 5:

***
# Instructions

This assignment is to be completed as a python3 notebook.  When you upload, please upload the completed notebook (ipynb file).

The questions provided below will ask you to either write code or
write answers in the form of markdown.

 Markdown syntax guide is here: [click here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

Using markdown you can typeset formulae using latex.
This way you can write nice readable answers with formulae like this:

The algorithm runs in time $\Theta\left(n^{2.1\log_2(\log_2( n \log^*(n)))}\right)$, 
where $\log^*(n)$ is the iterated logarithm.

__Double click anywhere on this box to find out how your instructor typeset it. Press Shift+Enter to go back.__

***

## Question 1: AVL Trees.

 AVL trees are another kind of self-balancing binary search tree (BST), sometimes used in place of red-black trees.
 The key property of an AVL tree is that 

 *for all nodes $n$ in the tree*, $\left|\ \text{height}(n.\text{left}) - \text{height}(n.\text{right}) \right| \leq 1$

 In words, the height of the left subtree and right subtree at any node can differ by at most $1$.
 
 Let $h$ be the height of an AVL tree and $n$ be the number of nodes in the tree.  The goal of this problem is to prove a relationship between $h$ and $n$.  We've broken this into two steps:

 (A) Prove that $n \geq F_h$, where $F_h$ is the $h^{th}$ Fibonacci number. ($F_0 = 1, F_1 = 1, F_2 = 2, \ldots $)
  (*Hint* Use strong induction with two base cases. First establish the property for all AVL trees of heights 0 and 1. Next, assuming
  it holds for trees of height $\leq h$, prove it for trees of height $h+1$ ).
  
  
  Next, it is a fact that for any $k \geq 30$, $F_k \geq 1.5^k$.
  
 (B) Using the above fact and the result from part A,  show that $h = \Theta(\log(n))$.

 (C) We will briefly examine inserting a node into an AVL tree through an example. On the left, we have shown an AVL tree and to the right we show the result after a BST insert has happened.

![AVL Tree Before and After Insertion](avl-tree-insert-problem-img.jpeg "AVL Tree Insertion" )

  Devise a sequence of left and right rotations that will restore the AVL tree property.
Explain for each rotation which node we are rotating at and in which direction. If you wish, you may insert images showing the trees before/after rotation using markdown (see how we inserted the image above), but do not forget to upload the images with the submission.






 ### Answer 1 (Expected length: 15 lines)

__(A)__

Using Strong Induction with two base cases:

Base cases:

$h = 0$: the AVL tree consists of only the root node, so $n = 1 \geq 1 = F_0$.

$h = 1$: the tree has a root and at least one child, so $n \geq 2 \geq 1 = F_1$.

For the inductive step, let $k \geq 1$ and assume that every AVL tree of height $j \leq k$ has at least $F_j$ nodes.

Assuming the inductive hypothesis is true, we now have to prove that every AVL tree of height $k + 1$ has $n \geq F_{k+1}$ nodes.

Consider the root of an AVL tree of height $k + 1$. At least one of its subtrees has height exactly $k$; without loss of generality, let it be the left one. The AVL property, as stated above, requires $\left|\ \text{height}(\text{left}) - \text{height}(\text{right}) \right| \leq 1$, so the right subtree has height $k$ or $k - 1$. Both subtrees are themselves AVL trees of height at most $k$.

Let $n_{\text{left}}$ and $n_{\text{right}}$ be the numbers of nodes in the two subtrees. Using the inductive hypothesis, we obtain $n_{\text{left}} \geq F_k$. Since the Fibonacci numbers are nondecreasing, we also obtain $n_{\text{right}} \geq F_{k-1}$ in either case.

Adding both of these together, we obtain $n_{\text{left}} + n_{\text{right}} \geq F_k + F_{k-1} = F_{k+1}$. To obtain $n$, we add 1 for the root, so $n = 1 + n_{\text{left}} + n_{\text{right}}$. Adding one only increases the left side, so the inequality still holds.

Substituting, we have $n \geq 1 + F_{k+1} > F_{k+1}$, which completes the induction.
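The bound can also be checked numerically. The sketch below is not part of the proof. It assumes the standard recurrence for the fewest nodes in an AVL tree of height $h$, namely $N(h) = 1 + N(h-1) + N(h-2)$ with $N(0) = 1$ and $N(1) = 2$, and compares it with $F_h$ under the convention $F_0 = F_1 = 1$. The function names are illustrative.

```python
def fib(h):
    """Fibonacci numbers with the assignment's convention F_0 = F_1 = 1."""
    a, b = 1, 1
    for _ in range(h):
        a, b = b, a + b
    return a


def min_avl_nodes(h):
    """Fewest nodes in an AVL tree of height h (height counted in edges)."""
    a, b = 1, 2  # N(0), N(1)
    for _ in range(h):
        a, b = b, 1 + a + b  # N(h) = 1 + N(h-1) + N(h-2)
    return a


# Print h, the minimum node count, and F_h side by side.
for h in range(8):
    print(h, min_avl_nodes(h), fib(h))
```

In every row the minimum node count is at least $F_h$, which is what part (A) claims for all AVL trees of that height.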

__(B)__ 

Assuming the proof and the above fact are true:

We can use this to prove that $h = \Theta(\log_2 n)$.

Upper bound: for $h \geq 30$, part (A) and the fact above give $n \geq F_h \geq 1.5^h$.

Taking the logarithm of both sides gives $\log_2(n) \geq \log_2(1.5^h)$.

This can be further simplified to $\log_2(n) \geq h \cdot \log_2(1.5)$, by the power rule for logarithms.

Here $\log_2(1.5) \approx 0.585$ is a positive constant.

So $h \leq \frac{\log_2(n)}{\log_2(1.5)} \approx 1.71 \log_2(n)$, that is, $h = O(\log_2 n)$.

Lower bound: any binary tree of height $h$ has at most $2^{h+1} - 1$ nodes, so $n \leq 2^{h+1} - 1$ and $h \geq \log_2(n + 1) - 1$, that is, $h = \Omega(\log_2 n)$.

Combining both bounds, $h = \Theta(\log_2 n)$.
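As a numeric sanity check (a sketch, not a proof), the code below takes the sparsest AVL tree of each height, using the assumed recurrence $N(h) = 1 + N(h-1) + N(h-2)$, and compares $h$ against $\log_2(n+1) - 1$ and $\log_{1.5}(n)$. The first quantity comes from the fact that a binary tree of height $h$ has at most $2^{h+1} - 1$ nodes; the second from $n \geq 1.5^h$.

```python
import math


def min_avl_nodes(h):
    """Fewest nodes in an AVL tree of height h (height counted in edges)."""
    a, b = 1, 2  # N(0), N(1)
    for _ in range(h):
        a, b = b, 1 + a + b
    return a


for h in (5, 10, 20, 40):
    n = min_avl_nodes(h)
    lower = math.log2(n + 1) - 1   # h can never be below this
    upper = math.log(n, 1.5)       # h can never be above this
    print(f"h={h:2d}  n={n}  {lower:.2f} <= h <= {upper:.2f}")
```

Both quantities are constant multiples of $\log_2 n$ up to lower-order terms, so $h$ is sandwiched between two logarithmic functions of $n$.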

__(C)__ 

To obtain a balanced tree, two rotations need to be performed.

First, a left rotation at the root (node 18) needs to be performed. The resulting tree would look like:

root: 22, left child of root $n_0$ = 18, right child of root $n_1$ = 29

$n_0$ = 18, left child of $n_0$ is $n_2$ = 12, right child of $n_0$ is $n_3$ = 21

$n_1$ = 29, left child of $n_1$ is $n_4$ = 26, left child of $n_4$ is $n_5$ = 24

$n_2$ = 12, right child of $n_2$ is $n_6$ = 16

This tree is the result of a left rotation on the root.

To obtain a balanced AVL tree, a right rotation must then be performed at node 29, the root of the right subtree. Altogether, a left rotation at the root followed by a right rotation at node 29 restores the AVL property.

The resulting tree would look like the following (the left subtree of the root would be unaffected):

root: 22, left child of root $n_0$ = 18, right child of root $n_1$ = 26

$n_0$ = 18, left child of $n_0$ is $n_2$ = 12, right child of $n_0$ is $n_3$ = 21

$n_1$ = 26, left child of $n_1$ is $n_4$ = 24, right child of $n_1$ is $n_5$ = 29

$n_2$ = 12, right child of $n_2$ is $n_6$ = 16


The resulting tree satisfies the AVL property, as the heights of the left and right subtrees of every node differ by at most 1.
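The two rotations can be replayed in a short Python sketch. The figure is not reproduced here, so the starting tree below is an assumption: it is reconstructed by undoing the first rotation on the tree listed above. The `Node` class and the helper names are illustrative only.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def height(node):
    """Height in edges; an empty subtree has height -1."""
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))


def is_avl(node):
    """True if every node's subtrees differ in height by at most 1."""
    if node is None:
        return True
    balanced = abs(height(node.left) - height(node.right)) <= 1
    return balanced and is_avl(node.left) and is_avl(node.right)


def rotate_left(x):
    """Promote x.right to subtree root; returns the new root."""
    y = x.right
    x.right, y.left = y.left, x
    return y


def rotate_right(x):
    """Promote x.left to subtree root; returns the new root."""
    y = x.left
    x.left, y.right = y.right, x
    return y


def inorder(node):
    """Keys in sorted order if the BST property holds."""
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)


# Assumed tree after the BST insert of 24 (reconstructed, not taken from the figure).
tree = Node(18,
            Node(12, None, Node(16)),
            Node(22, Node(21), Node(29, Node(26, Node(24)))))
balanced_before = is_avl(tree)

root = rotate_left(tree)                  # left rotation at the root (18)
root.right = rotate_right(root.right)     # right rotation at node 29
print(balanced_before, is_avl(root), inorder(root))
```

Rotations never change the in-order sequence of keys, so the result is still a valid BST; only the shape changes.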



***
## Question 2: Bloom Filters


 A bloom filter is a fast set data structure that maintains a set $S$ of keys. One can insert keys into the set and test whether a given key $k$ belongs to the set. It may be used in applications where the keys are "complicated" objects such as TCP packets or images that are expensive to compare with each other. 
 

 The data structure is an array $T$ of Booleans of size $m$, together with $l$ different hash functions $h_1, \ldots, h_l$.
 Initially, `T[i] = FALSE` for all `i`.

 If a key $k$ is to be inserted 
 we first compute $i_1 = h_1(k), \ldots, i_l = h_l(k)$ and then we set $T[i_1] = \cdots = T[i_l] = \text{TRUE}$.

 __Note:  A bloom filter is *not* a hash table, but they both use hash functions in interesting ways.__

 __(A)__ Suppose we wish to find out if an element $k$ is a member of the set by checking if
$T[h_1(k)], \ldots, T[h_l(k)]$ are all true. Explain whether this can lead to a *false positive* i.e,
the approach wrongly concludes that $k$ belongs to the set when it was never inserted; or *false negative*
i.e, the approach wrongly concludes that $k$ does not belong to the set when it does.

 __(B)__ Suppose our hash functions are guaranteed to be uniform. I.e, for any randomly chosen
key $k$, for any hash function $h_i$ and cell $j$, 
  $$ \mathbb{P}( h_i(k) = j)  = \frac{1}{m} $$
 If $n$ keys are chosen at random and inserted into the filter, compute the probability that any given cell $T[j]$ is still set to FALSE after this.

 __(C)__ Use the results from the previous part to estimate the probability of a false positive. I.e., some $l$ cells
$i_1, i_2, \ldots, i_l$ are simultaneously set to TRUE.

 



### Answer 2 { Expected Size: 15 lines}

__(A)__

This approach can lead to false positives. Inserting a key involves applying every hash function to the key and setting each resulting position in the array to TRUE. When checking a key $k$ that was never inserted, each of its positions $h_1(k), \ldots, h_l(k)$ may already have been set to TRUE by other keys whose hashes collide with those of $k$. If that happens for all $l$ positions, the check reports $k$ as present even though $k$ was never inserted. This is a false positive. This approach does not lead to false negatives: inserting $k$ sets all $l$ of its positions to TRUE, and no operation ever resets a cell to FALSE, so a later check for $k$ always succeeds.
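A minimal Bloom filter sketch makes both claims concrete. The class and method names are illustrative, and the $l$ hash functions are simulated by salting SHA-256 with the function index, which is an assumption and not something the assignment specifies.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, l):
        self.m, self.l = m, l
        self.bits = [False] * m  # T[i] = FALSE for all i

    def _indexes(self, key):
        # Assumed hash family: SHA-256 of "i:key", reduced modulo m.
        for i in range(self.l):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, key):
        for j in self._indexes(key):
            self.bits[j] = True

    def might_contain(self, key):
        return all(self.bits[j] for j in self._indexes(key))


bf = BloomFilter(m=64, l=3)
inserted = [f"key{i}" for i in range(20)]
for key in inserted:
    bf.insert(key)

# Every inserted key must be reported present (no false negatives).
no_false_negatives = all(bf.might_contain(k) for k in inserted)

# Keys that were never inserted may still be reported present.
never_inserted = [f"other{i}" for i in range(1000)]
false_positives = sum(bf.might_contain(k) for k in never_inserted)
print(no_false_negatives, false_positives)
```

Because `insert` only ever writes `True`, an inserted key can never fail the membership check; any `never_inserted` key that passes it is a false positive.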

__(B)__

For a single hash function applied to a single key, the probability that it sets a specific cell $j$ to TRUE is $\frac{1}{m}$. Conversely, the probability that it does not set cell $j$ to TRUE is $(1 - \frac{1}{m})$.

As this is only for one key, the probability needs to be calculated for $n$ keys. This means that for a single hash function, the probability that all $n$ keys (hashed with a single hash function) do not set the cell to true is:

$(1 - \frac{1}{m})^n$

The above assumes only one hash function, when in reality there are $l$ hash functions in a bloom filter, so inserting $n$ keys performs $nl$ hash evaluations. Assuming these are independent, the probability that cell $j$ remains FALSE after inserting $n$ keys is:

$(1 - \frac{1}{m})^{nl}$
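This formula can be compared against a small Monte Carlo simulation. The sketch below models each of the $nl$ hash evaluations as an independent uniform draw from the $m$ cells, which is the idealized assumption behind the formula. The parameter values, trial count and function names are arbitrary choices for illustration.

```python
import random


def prob_cell_stays_false(m, n, l):
    """Closed form: probability that one fixed cell is never hit."""
    return (1 - 1 / m) ** (n * l)


def simulate_cell_stays_false(m, n, l, trials=5000, seed=0):
    """Estimate the same probability by sampling n*l uniform cell choices."""
    rng = random.Random(seed)
    stays_false = 0
    for _ in range(trials):
        hit = any(rng.randrange(m) == 0 for _ in range(n * l))
        stays_false += not hit
    return stays_false / trials


m, n, l = 50, 10, 3
formula = prob_cell_stays_false(m, n, l)
estimate = simulate_cell_stays_false(m, n, l)
print("formula:   ", formula)
print("simulation:", estimate)
```

The two numbers should agree up to sampling noise; increasing `trials` tightens the match.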

__(C)__

To estimate the probability of a false positive for a given key (all cells that the key hashes to are set to TRUE), we use the formula obtained in part (B).

Given that we calculated the probability for cell $j$ to remain FALSE after inserting $n$ keys, we can obtain the complement (the probability that it was set to TRUE) as follows:

$(1 - (1 - \frac{1}{m})^{nl})$

This probability only holds for a single specific cell. A key that was never inserted is hashed $l$ times and is reported as present only if all $l$ of its cells are TRUE. Treating the cells as independent (an approximation), the probability of a false positive is:

$P$ = $(1 - (1 - \frac{1}{m})^{nl})^l$
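To get a feel for the magnitude, the estimate can be evaluated for concrete parameters. The sketch below also prints the commonly used approximation $(1 - e^{-nl/m})^l$, which follows from $(1 - \frac{1}{m})^{nl} \approx e^{-nl/m}$ for large $m$. That approximation is not part of the answer above, and the parameter values and function names are arbitrary examples.

```python
import math


def false_positive_rate(m, n, l):
    """Estimate from the text: (1 - (1 - 1/m)^(n*l))^l."""
    return (1 - (1 - 1 / m) ** (n * l)) ** l


def false_positive_rate_approx(m, n, l):
    """Large-m approximation: (1 - exp(-n*l/m))^l."""
    return (1 - math.exp(-n * l / m)) ** l


m, n = 1000, 100
for l in (1, 3, 7, 15):
    exact = false_positive_rate(m, n, l)
    approx = false_positive_rate_approx(m, n, l)
    print(f"l={l:2d}  estimate={exact:.5f}  approx={approx:.5f}")
```

Varying `l` shows the trade-off in the formula: more hash functions require more cells to match, but they also fill the array faster.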



## Testing your solutions -- Do not edit code beyond this point