# Heap Data Structure
In this notebook we will look at the heap data structure and come up with a simple implementation in Python.

We have seen queues and stacks before, which support FIFO (First In, First Out) and LIFO (Last In, First Out) ordering of their elements respectively. They are used in BFS and DFS traversals of graphs/trees respectively. We will now look at another special-purpose data structure, the heap, which has very specific use cases.

When choosing a data structure, it is important to think about which operations we will perform frequently. For example, in Dijkstra's algorithm, the part that goes through all edges and vertices costs $\theta(m + n)$. However, finding the next vertex with the lowest Dijkstra score requires $\theta(n)$, so the entire algorithm has complexity $\theta((m + n) \cdot n)$, which is quadratic. Imagine we have a data structure that gives us the next vertex to pick in $\theta(\log(n))$; then our algorithm's complexity is $\theta((m + n) \cdot \log(n))$, which is much faster than quadratic. Because choosing the next vertex is a frequent operation in Dijkstra's algorithm, making it faster makes the entire algorithm faster.

With this in mind, let's define the heap data structure.
***
A heap data structure lets us maintain the minimum (or maximum) value of an evolving set of objects.

The key here is the word evolving. Finding the minimum or maximum of a fixed set of values can be done in linear time. However, maintaining the minimum (or maximum) of an evolving stream of objects while supporting two operations, Extract-Min (or Extract-Max) and Insert, is not straightforward. We might think of keeping the numbers sorted, or of keeping them in an unordered list, so let's look at the time complexity of these operations in each case.

- Case 1: Keeping a sorted array

    - Extract-Min: Extracting the minimum value from a sorted array takes $\theta(1)$.
    - Insert: Building the array from an initial list of numbers requires $\theta(n \cdot \log(n))$, and each subsequent insert requires linear time, $\theta(n)$.
- Case 2: Keeping an unordered linked list

    - Extract-Min: Scanning the unordered linked list to extract the minimum takes $\theta(n)$.
    - Insert: This is straightforward; we just add the object to the end of the linked list in $\theta(1)$.

As we can see, both options have one linear-time operation, either Insert or Extract-Min. What we need is a data structure that performs both operations much faster than linear time.
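
The two baseline strategies can be sketched with plain Python lists. This is only an illustration: the class names `SortedStore` and `UnsortedStore` are made up for this example, and a Python list stands in for the linked list of Case 2. The sorted store keeps its values in descending order so that the minimum sits at the end and can be popped in constant time.

```python
class SortedStore:
    """Case 1: keep the values sorted (descending, so the minimum is last)."""

    def __init__(self, values=()):
        self.items = sorted(values, reverse=True)   # initial sort: O(n log n)

    def insert(self, value):
        # Linear scan for the slot; list.insert then shifts elements: O(n)
        i = 0
        while i < len(self.items) and self.items[i] > value:
            i += 1
        self.items.insert(i, value)

    def extract_min(self):
        return self.items.pop()                     # minimum is at the end: O(1)


class UnsortedStore:
    """Case 2: keep the values in arrival order."""

    def __init__(self, values=()):
        self.items = list(values)

    def insert(self, value):
        self.items.append(value)                    # O(1) amortized

    def extract_min(self):
        # Scan every element to locate the minimum: O(n)
        i = min(range(len(self.items)), key=self.items.__getitem__)
        # Move it to the end so it can be removed without shifting
        self.items[i], self.items[-1] = self.items[-1], self.items[i]
        return self.items.pop()


for store in (SortedStore([5, 3, 8]), UnsortedStore([5, 3, 8])):
    store.insert(1)
    print(type(store).__name__, [store.extract_min() for _ in range(4)])
```

Both stores return the values in increasing order; they differ only in which of the two operations pays the linear cost.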

The heap data structure will give us the following running time guarantees:

| Operation | Complexity |
| --- | --- |
| Insert | $\theta(\log n)$ |
| Extract-Min | $\theta(\log n)$ |
| Find-Min | $\theta(1)$ |
| Delete | $\theta(\log n)$ |
| Heapify | $\theta(n)$ |

A naive implementation of Find-Min simply extracts the minimum and inserts it back in $\theta(\log(n))$ time, but we will implement the data structure so that it does this in constant time.

Similarly, a naive Heapify can simply sort the input in $\theta(n \log(n))$ time (or perform Insert on all n elements), but we will see how to heapify an unordered array in linear time.
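
Until we write our own heap, the operations above can be tried out with Python's standard `heapq` module, which treats an ordinary list as a min-heap. The snippet below is a small sketch with made-up sample data. `heapq` does not offer a public function for deleting an arbitrary element, so the Delete operation is not shown.

```python
import heapq

data = [9, 4, 7, 1, 8, 2]

heapq.heapify(data)              # Heapify: rearranges the list in place, O(n)
smallest = data[0]               # Find-Min: the minimum is always at index 0, O(1)

heapq.heappush(data, 0)          # Insert: O(log n)
first = heapq.heappop(data)      # Extract-Min: O(log n), removes and returns 0
second = heapq.heappop(data)     # the next smallest value, 1

print(smallest, first, second)   # 1 0 1
```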

Before we implement this data structure, let's look at a very good use of it. Let's start with the selection sort algorithm, which we implement below.



In [11]:
def selectionSort(arr):
    print('Before Sorting', arr)
    for i in range(len(arr)-1):
        minidx = i
        for j in range(i+1,len(arr)):
            if arr[minidx] > arr[j]:
                minidx = j
        if minidx != i:
            arr[i], arr[minidx] = arr[minidx], arr[i]
    print('After Sorting', arr)
    
selectionSort([2, 4, 1, 6, 9, 7, 3, 5, 8])

('Before Sorting', [2, 4, 1, 6, 9, 7, 3, 5, 8])
('After Sorting', [1, 2, 3, 4, 5, 6, 7, 8, 9])


As we see above, selection sort scans all elements after index i to find the minimum value and, if that value is smaller than the element at i, swaps it into position i. Thus the first iteration makes n - 1 comparisons and each subsequent iteration makes one fewer than the previous one. The number of comparisons for an array of size n is therefore (n - 1) + (n - 2) + ... + 1 = $\frac{n(n - 1)}{2}$, which is $\theta(n^2)$.

As we can see, the most frequent operation is finding the minimum starting at an index. This is a good use for a heap: we first heapify the array in linear time and then extract the minimum element n times at $\theta(\log(n))$ each, giving a total time complexity of $\theta(n \cdot \log(n))$.
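
The idea can be sketched with the standard `heapq` module standing in for the heap we have not written yet. The function name `heap_sort` is chosen for this example only.

```python
import heapq


def heap_sort(values):
    """Return the values in ascending order using a min-heap."""
    heap = list(values)          # copy so the caller's list is left untouched
    heapq.heapify(heap)          # build the heap in O(n)
    # n Extract-Min operations at O(log n) each: O(n log n) in total
    return [heapq.heappop(heap) for _ in range(len(heap))]


print(heap_sort([2, 4, 1, 6, 9, 7, 3, 5, 8]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```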

We also know that no comparison-based sorting algorithm can do better than $\theta(n \cdot \log(n))$ in the worst case. This means that a heap which heapifies in linear time cannot perform Extract-Min faster than $\theta(\log(n))$: anything faster would give us a comparison-based sorting algorithm better than $\theta(n \cdot \log(n))$, which is not possible.
***
Quiz 10.1

The answer is (b), $\theta(n \cdot \log(n))$
***
One application of heaps is median maintenance. The goal of this problem is to find the median of a stream of numbers after each new number arrives. Finding the median of a static list of numbers is not difficult. However, doing so efficiently for a stream of numbers requires two heaps. Let us write a Python implementation of this problem. Since we haven't implemented heaps ourselves yet, we will use Python's standard heap module, heapq.

In [81]:
class MedianMaintenance:
    
    def __init__(self):
        self.minHeap, self.maxHeap = [], []
        
    def __repr__(self):
        return str([-1*i for i in list(self.maxHeap)[::-1]]  + list(self.minHeap))
        #return str([1*i for i in list(self.maxHeap)]  + list(self.minHeap))


        
    def addElement(self, element):
        import heapq
        
        if len(self.minHeap) == 0 or self.minHeap[0] < element:
            #print("min heap push: {}".format(element))
            heapq.heappush(self.minHeap, element)
        else: 
            #print("max heap push: {}".format(-element))
            heapq.heappush(self.maxHeap, -element)

        # ------------  ------------
        # |  Max Heap|  |  Min Heap|  
        # ------------  ------------
        # Two heaps are used as above: the values up to and including the median are in the max heap on the left,
        # and the values after the median are in the min heap on the right.
        # We keep the max heap at most 1 element longer than the min heap. With an even number of elements
        # both heaps have the same length; with an odd number of elements the max heap has one element more
        # than the min heap. The two loops below let us maintain this invariant.
        # 
            
        while len(self.minHeap) > len(self.maxHeap):
            heapq.heappush(self.maxHeap, -heapq.heappop(self.minHeap))
            
        while len(self.maxHeap) - len(self.minHeap) > 1:
            heapq.heappush(self.minHeap, -heapq.heappop(self.maxHeap))
        
        if len(self.maxHeap) != 0  and len(self.minHeap) != 0 :
            if -self.maxHeap[0] > self.minHeap[0]:
                heapq.heappush(self.minHeap, -heapq.heappop(self.maxHeap))
                heapq.heappush(self.maxHeap, -heapq.heappop(self.minHeap))
        
            
    def median(self):        
        #Odd number of elements, the top of the Max priority queue (left half) is the median
        #Even elements, the median is top of left heap
        return -self.maxHeap[0]
            
    
stream = MedianMaintenance()

#li = [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9]
li = [9, 10, 6, 2, 7, 1, 5, 8, 3, 4, 15, 17, 13]
#li = [6331, 2793, 1640]

for i in li:
    stream.addElement(i)
    print('Heap {} has median {}'.format(stream, stream.median()))
        

Heap [9] has median 9
Heap [9, 10] has median 9
Heap [6, 9, 10] has median 9
Heap [2, 6, 9, 10] has median 6
Heap [6, 2, 7, 9, 10] has median 7
Heap [1, 2, 6, 7, 10, 9] has median 6
Heap [2, 1, 5, 6, 7, 10, 9] has median 6
Heap [2, 1, 5, 6, 7, 8, 9, 10] has median 6
Heap [3, 2, 1, 5, 6, 7, 8, 9, 10] has median 6
Heap [1, 2, 4, 3, 5, 6, 7, 9, 10, 8] has median 5
Heap [4, 1, 2, 5, 3, 6, 7, 8, 9, 10, 15] has median 6
Heap [4, 1, 2, 5, 3, 6, 7, 8, 9, 10, 15, 17] has median 6
Heap [5, 4, 1, 2, 6, 3, 7, 8, 10, 9, 13, 15, 17] has median 7


## Challenging problem
Download the following text file:  Median.txt <br>
The goal of this problem is to implement the "Median Maintenance" algorithm (covered in the Week 3 lecture on heap applications). The text file contains a list of the integers from 1 to 10000 in unsorted order; you should treat this as a stream of numbers, arriving one by one. Letting $x_i$ denote the $i$th number of the file, the $k$th median $m_k$ is defined as the median of the numbers $x_1, \ldots, x_k$. (So, if $k$ is odd, then $m_k$ is the $((k+1)/2)$th smallest number among $x_1, \ldots, x_k$; if $k$ is even, then $m_k$ is the $(k/2)$th smallest number among $x_1, \ldots, x_k$.)

In the box below you should type the sum of these 10000 medians, modulo 10000 (i.e., only the last 4 digits). That is, you should compute $(m_1 + m_2 + m_3 + \cdots + m_{10000}) \bmod 10000$.
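
To make the definition concrete, here is a slow reference sketch that applies it literally by keeping every number seen so far in a sorted list. The function name `median_sum_by_definition` and the short sample stream are chosen for this example; it is meant for checking a heap-based solution on small inputs, not as the intended solution, since each `bisect.insort` call costs linear time.

```python
import bisect


def median_sum_by_definition(stream, modulus=10000):
    """Sum the running medians m_1..m_k of the stream, modulo `modulus`."""
    seen = []                      # all numbers so far, kept in sorted order
    total = 0
    for x in stream:
        bisect.insort(seen, x)     # O(k) insertion into the sorted list
        k = len(seen)
        # k odd:  the ((k+1)/2)-th smallest; k even: the (k/2)-th smallest.
        # With zero-based indexing both cases are index (k - 1) // 2.
        total += seen[(k - 1) // 2]
    return total % modulus


# Running medians of this stream are 9, 9, 9, 6, 7
print(median_sum_by_definition([9, 10, 6, 2, 7]))   # 40
```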

TODO : OPTIONAL EXERCISE: Compare the performance achieved by heap-based and search-tree-based implementations of the algorithm.



In [80]:
import urllib3

stream = MedianMaintenance()

# Test case
http = urllib3.PoolManager()
r1 = http.request('GET', "https://d3c33hcgiwev3.cloudfront.net/_6ec67df2804ff4b58ab21c12edcb21f8_Median.txt?Expires=1562112000&Signature=fvYNpxU9Nq4cQRSz-Np3ZI5zXgrPGDloTI8EL53doDAL4Q6phAFpfXfaXWBMg0Y5u2ITRTF3d86qP--TovGaS4PODf0yvb~Rcl~GPwN1QSQS6jdqq7-3VpZg2OxmjnH2SmKpQtWl2qzBeSb0tYOGTYV2ueA2LhKxoV0G3WSAV9E_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A")
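# In Python 3, r1.data is bytes: decode it first, then split on any whitespace
# (this handles '\r\n' line endings and drops the trailing empty entry)
IntegerMatrixString = r1.data.decode('utf-8').split()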
median_sum = 0
for i in IntegerMatrixString:
    stream.addElement(int(i))
    median_sum += stream.median()
print('Heap {} \n\nhas median {}'.format(stream, stream.median()))
print('\n\n\nSum of 10000 medians of the stream:  {}'.format(median_sum % 10000))




Heap [3466, 558, 63, 1918, 1140, 1, 2112, 2144, 1170, 2537, 255, 1321, 2365, 2633, 39, 1518, 1850, 1865, 2886, 3302, 2348, 2511, 3448, 2655, 3375, 779, 929, 2634, 422, 3724, 2363, 1297, 2120, 1042, 1008, 1810, 713, 738, 636, 2322, 1507, 683, 2311, 591, 267, 498, 2026, 1643, 1079, 3484, 3370, 1168, 495, 1868, 1254, 735, 1213, 1472, 2287, 1741, 1481, 2793, 235, 2142, 1307, 442, 355, 1495, 1787, 3823, 1568, 674, 292, 3148, 298, 1240, 1144, 1946, 286, 2077, 340, 656, 126, 1275, 1789, 3705, 1467, 3499, 2134, 3610, 251, 1470, 80, 975, 248, 1890, 767, 507, 737, 2574, 2770, 3180, 112, 49, 2111, 2773, 899, 1280, 398, 572, 2556, 3275, 3202, 2705, 1874, 3196, 631, 1688, 635, 3404, 2263, 1060, 1847, 3292, 2866, 52, 827, 2362, 974, 1779, 1737, 2333, 1994, 47, 2844, 2890, 698, 1071, 427, 1870, 2688, 868, 2811, 2960, 1864, 927, 581, 463, 144, 1709, 3039, 1428, 406, 1250, 1331, 449, 733, 1471, 1460, 3321, 2598, 1223, 1055, 1824, 164, 1273, 922, 700, 1539, 780, 1232, 163, 813, 885, 864, 196, 471, 4, 14

We will now look at an application of heaps to implement Dijkstra's algorithm. Recall that the crux of Dijkstra's algorithm is to greedily pick the next vertex with the lowest Dijkstra score. In the absence of a data structure like a heap, we have to look at all edges where one vertex has its Dijkstra score calculated and the other lies outside the frontier. This is an expensive operation, and without a heap the algorithm is slow in the worst case (the $\theta((m + n) \cdot n)$ bound discussed at the start of this notebook). Let's implement it with a heap, using the heapq package, in the following code snippet.