# CH10 Heaps

In [1]:
# Some important notes:
# A heap is a specialized binary tree. Specifically, it is a complete binary tree.
# The keys must satisfy the heap property: the key at each node is at least as great as the keys stored at its children.
# A max-heap can be implemented as an array => the children of the node at index i are at indices 2i+1 and 2i+2.
# A max-heap supports O(logn) insertions, O(1) time lookup for the max element, and O(logn) deletion of the max element.
# The extract-max operation is defined to delete and return the maximum element.
# Use a heap when the problem needs the largest or smallest elements and does not need fast lookup, deletion, or search for arbitrary elements.
# A heap is a good choice when you need to compute the k largest (use a min-heap) or k smallest (use a max-heap) elements in a collection.
# Python's heapq module provides the following functionality:
# - heapq.heapify(L), which transforms the elements in L into a heap in-place,
# - heapq.nlargest(k, L) (heapq.nsmallest(k, L)), which returns the k largest (smallest) elements in L,
# - heapq.heappush(h, e), which pushes a new element on the heap,
# - heapq.heappop(h), which pops the smallest element from the heap,
# - heapq.heappushpop(h, a), which pushes a on the heap and then pops and returns the smallest element, and
# - e = h[0], which returns the smallest element on the heap without popping it.
# - heapq.merge(*iterables), which merges multiple sorted inputs into a single sorted output.
# - heapq only supports min-heap functionality. To use this module to build a max-heap, negate the input elements.
# enumerate(iterable) takes a collection (e.g. a tuple) and returns an iterator of (index, element) pairs.
# The index is a counter that starts at 0 by default.
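
The notes above can be tried out directly. The following is a minimal sketch of the listed heapq calls, including the negation trick for a max-heap; the sample values are made up for illustration.

```python
import heapq

values = [5, 1, 8, 3, 9, 2]  # made-up sample data

# Min-heap: heapify in place, then peek at and pop the smallest element.
min_heap = list(values)
heapq.heapify(min_heap)
smallest = min_heap[0]              # peek without popping
popped = heapq.heappop(min_heap)    # removes and returns the smallest

# Max-heap via negation: store -x, negate again on the way out.
max_heap = [-x for x in values]
heapq.heapify(max_heap)
largest = -heapq.heappop(max_heap)

# heappushpop pushes first, then pops the smallest of the combined contents.
h = [4, 6, 7]
heapq.heapify(h)
returned = heapq.heappushpop(h, 5)  # 4 leaves the heap, 5 stays

print(smallest, popped, largest, returned)
print(heapq.nlargest(2, values), heapq.nsmallest(2, values))
print(list(enumerate(['a', 'b'])))  # (index, element) pairs
```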

In [2]:
# Task: Suppose you were asked to write a program which takes a sequence of strings presented in "streaming" fashion: you cannot back up to read an earlier value. 
# Your program must compute the k longest strings in the sequence. All that is required is the k longest strings; it is not required to order these strings.

# We need to remove the shortest string each time a longer string appears. So, we can build a min-heap keyed on string length to solve this problem.
# Time Complexity: O(nlogk), where O(logk) is the time to process each string, i.e. the time to add an element and remove the minimum element from the heap
import itertools
import heapq
def top_k(k, stream):
    stream = iter(stream) # make sure the first k items are consumed, even when a list is passed in
    min_heap = [(len(s), s) for s in itertools.islice(stream, k)]
    heapq.heapify(min_heap) # Transform the list into a heap, in-place, in linear time.
    
    for next_string in stream:
        # Only a string longer than the shortest candidate can enter the heap; this check reduces the number of push-pops.
        # min_heap[0] is a (length, string) pair, so compare against min_heap[0][0].
        if min_heap and len(next_string) > min_heap[0][0]:
            # Push item on the heap, then pop and return the smallest item from the heap. 
            # The combined action runs more efficiently than heappush() followed by a separate call to heappop().
            heapq.heappushpop(min_heap, (len(next_string), next_string))
    
    # heapq.nsmallest(k, min_heap) returns a list with the k smallest entries, equivalent to sorted(min_heap)[:k].
    # The heap holds at most k entries, so this returns all of them in increasing order of length.
    return [p[1] for p in heapq.nsmallest(k, min_heap)]

stream = ['abcc', 'ad', 'e', 'aaaaaaa', 'b']
k = 2
print(top_k(k, stream)) # Prints the k longest strings in the stream

['abcc', 'aaaaaaa']
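
As a cross-check on the hand-written heap loop, the same result can be obtained with heapq.nlargest and key=len. This is a sketch, not the approach used in the cell above: the names top_k_with_nlargest and string_stream are made up for illustration, and the generator stands in for a stream that can be read only once.

```python
import heapq

def top_k_with_nlargest(k, stream):
    # key=len ranks strings by length; the result is ordered longest first.
    return heapq.nlargest(k, stream, key=len)

def string_stream():
    # A generator can be consumed only once, like a real stream.
    yield from ['abcc', 'ad', 'e', 'aaaaaaa', 'b']

print(top_k_with_nlargest(2, string_stream()))
```

Unlike the min-heap version, the result here comes back ordered from longest to shortest, which the task does not require.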


## 10.1 Merge sorted files

In [3]:
# You are given 500 files, each containing stock trade information for an S&P 500 company. 
# Each trade is encoded by a line in the following format: 1232111, AAPL, 30, 456.12.
# The first number is the time of the trade expressed as the number of milliseconds since the start
# of the day's trading. Lines within each file are sorted in increasing order of time. The remaining
# values are the stock symbol, number of shares, and price. You are to create a single file containing
# all the trades from the 500 files, sorted in order of increasing trade times. The individual files are
# of the order of 5-100 megabytes; the combined file will be of the order of five gigabytes. In the
# abstract, we are trying to solve the following problem.

# Write a program that takes as input a set of sorted sequences and computes the union of these sequences as a sorted sequence. 
# For example, if the input is <3,5,7>, <0,6>, and <0,6,28>, then the output is <0,0,3,5,6,6,7,28>.

# Brute Force: Concatenate all the files and then sort the result. Time Complexity: O(nlogn), where n is the total number of entries in all files
# Optimized Approach: Create a min-heap of size k (here k = 500, one slot per file) and initialize it with the first element of each file.
# Repeatedly pop the smallest element from the heap and push the next element from the file that the popped element came from.

# Time Complexity: O(nlogk), where logk is the time taken to extract-min and insert a new element.
# Space Complexity: O(k) + additional space to store the result
def merge_sorted_arrays(sorted_arrays):
    min_heap = []
    # Build a list of iterators for each sorted array
    sorted_arrays_iters = [iter(x) for x in sorted_arrays]
    
    # Create the min-heap with the first element taken from each sorted array
    for array_num, it in enumerate(sorted_arrays_iters): # enumerate() pairs each iterator with its index, which identifies the source array
        first_element = next(it, None)
        if first_element is not None:
            heapq.heappush(min_heap, (first_element, array_num))
    
    result  = []
    while min_heap:
        smallest_element, smallest_array_num = heapq.heappop(min_heap)
        smallest_array_iter = sorted_arrays_iters[smallest_array_num]
        result.append(smallest_element)
        next_element = next(smallest_array_iter, None)
        if next_element is not None:
            heapq.heappush(min_heap, (next_element, smallest_array_num))
    return result

l1 = [1,2,3,4,5, 1000]
l2 = [100,200,300]
l3 = [50,60,70,150]
sorted_arrays = [l1, l2, l3]
print(f'The final sorted merged list: {merge_sorted_arrays(sorted_arrays)}')

def merge_sorted_arrays_pythonic(sorted_arrays):
    # heapq.merge(*iterables) Merge multiple sorted inputs into a single sorted output 
    # (for example, merge timestamped entries from multiple log files). Returns an iterator over the sorted values.
    return list(heapq.merge(*sorted_arrays))
print(f'The final sorted merged list: {merge_sorted_arrays_pythonic(sorted_arrays)}')

# Alternatively, the k files can be merged two at a time using the merge step from merge sort.
# Time Complexity is the same, O(nlogk), as there are logk stages and each stage processes all n entries
# Space Complexity: pairwise merging needs O(n) space for intermediate results, which is more than the O(k) used by the heap

The final sorted merged list: [1, 2, 3, 4, 5, 50, 60, 70, 100, 150, 200, 300, 1000]
The final sorted merged list: [1, 2, 3, 4, 5, 50, 60, 70, 100, 150, 200, 300, 1000]
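
For the original file-merging task, heapq.merge can be pointed at the files themselves, because it is lazy and accepts a key function. The sketch below is an illustration under stated assumptions: trade_time and merge_trade_files are hypothetical helper names, io.StringIO objects stand in for open files, and the trade lines are made-up sample data in the format described above.

```python
import heapq
import io

def trade_time(line):
    # Hypothetical parser: the timestamp is the first comma-separated field.
    return int(line.split(',')[0])

def merge_trade_files(files):
    # heapq.merge is lazy: it keeps one pending line per input, not whole files.
    return heapq.merge(*files, key=trade_time)

# io.StringIO objects stand in for open files; the trades are made-up sample data.
file_a = io.StringIO('100,AAPL,30,456.12\n900,AAPL,10,456.20\n')
file_b = io.StringIO('250,MSFT,5,310.00\n300,MSFT,7,310.05\n')

merged = [line.strip() for line in merge_trade_files([file_a, file_b])]
print(merged)
```

With real files, the merged lines would be written to the output file one at a time instead of being collected in a list, which keeps memory use proportional to the number of files.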


## 10.2 Sort an increasing-decreasing array

In [5]:
# An array is said to be k-increasing-decreasing if elements repeatedly increase up to a certain index
# after which they decrease, then again increase, a total of k times.
# Design an efficient algorithm for sorting a k-increasing-decreasing array.

# Brute Force: Sort the array normally. Time Complexity: O(NlogN)
# Optimized Approach: Split the array into its increasing and decreasing runs and reverse the decreasing ones. This gives K sorted subarrays, which can be merged in O(NlogK) time.
def sort_k_increasing_decreasing_array(A):
    sorted_subarrays = []
    INCREASING, DECREASING = range(2)
    subarray_type = INCREASING
    start_idx = 0
    for i in range(1, len(A) + 1): 
        # append if reached end of the array or at the end of a decreasing or an increasing seq
        if((i==len(A) or (A[i-1]<A[i] and subarray_type == DECREASING) or (A[i-1] >= A[i] and subarray_type == INCREASING))):
            # a decreasing run is appended in reverse so that every stored subarray is sorted in increasing order
            sorted_subarrays.append(A[start_idx:i] if subarray_type == INCREASING else A[i-1:start_idx-1:-1])
            # new seq started at i
            start_idx = i
            subarray_type = DECREASING if subarray_type == INCREASING else INCREASING
    return merge_sorted_arrays(sorted_subarrays)

A = [57, 131, 493, 294, 221, 339, 418, 452, 442, 190]
print(f'Sorted Array: {sort_k_increasing_decreasing_array(A)}')

Sorted Array: [57, 131, 190, 221, 294, 339, 418, 442, 452, 493]
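
To make the decomposition step visible, the sketch below prints the sorted runs before merging them. It is a standalone variant for illustration: the helper name monotone_runs is an assumption, and heapq.merge is swapped in for the hand-written merge so that the block runs on its own.

```python
import heapq

def monotone_runs(A):
    # Split A into maximal runs that alternate between increasing and
    # decreasing, and return every run in increasing order.
    runs, start, increasing = [], 0, True
    for i in range(1, len(A) + 1):
        if (i == len(A)
                or (increasing and A[i - 1] >= A[i])
                or (not increasing and A[i - 1] < A[i])):
            run = A[start:i]
            runs.append(run if increasing else run[::-1])
            start, increasing = i, not increasing
    return runs

A = [57, 131, 493, 294, 221, 339, 418, 452, 442, 190]
runs = monotone_runs(A)
print(runs)
print(list(heapq.merge(*runs)))
```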


## 10.3 Sort an almost sorted array

In [7]:
# Write a Program which takes as input a very long sequence of numbers and prints the numbers in
# sorted order. Each number is at most k away from its correctly sorted position. (Such an array is
# sometimes referred to as being k-sorted.) For example, no number in the sequence (3, -1, 2, 6, 4, 5, 8)
# is more than 2 away from its final sorted position.

# Brute Force: Place the input in an array and sort it. Time Complexity: O(nlogn) Space Complexity: O(n)
# Optimized Approach: Take advantage of the almost-sorted property. The smallest number of the input seq must be among the first k+1 elements, e.g. positions 0, 1, 2 when k = 2.
# So, keep k+1 numbers in a min-heap, then repeatedly extract the min into the result and add the next input number.
# Time Complexity: O(nlogk) Additional Space Complexity: O(k)
def sort_approximately_sorted_array(sequence, k):
    result = []
    min_heap = []
    # push the first k elements into the heap (fewer if the sequence is shorter than k)
    for i in range(min(k, len(sequence))):
        heapq.heappush(min_heap, sequence[i])
    
    # for every new element, push it into the heap (which then holds k+1 elements) and pop the min
    for i in range(k, len(sequence)):
        smallest = heapq.heappushpop(min_heap, sequence[i])
        result.append(smallest)
    
    # seq is exhausted, pop elements from min_heap and append them to the result
    while min_heap:
        smallest = heapq.heappop(min_heap)
        result.append(smallest)
    return result

sequence = [3, -1, 2, 6, 4, 5, 8]
k = 2
print(f'Sorted Array: {sort_approximately_sorted_array(sequence, k)}')

Sorted Array: [-1, 2, 3, 4, 5, 6, 8]
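
Since the task describes a very long sequence, a generator-based variant may be closer to the intended setting: it never indexes into the input and holds only k+1 numbers at a time. This is a sketch under the assumption that k >= 0; the name sort_k_sorted_stream is made up for illustration.

```python
import heapq
import itertools

def sort_k_sorted_stream(stream, k):
    stream = iter(stream)
    # Hold k + 1 elements: the smallest remaining value is always among them.
    min_heap = list(itertools.islice(stream, k + 1))
    heapq.heapify(min_heap)
    for x in stream:
        # heapreplace pops the smallest element and then pushes x.
        yield heapq.heapreplace(min_heap, x)
    # Input exhausted: drain what is left in increasing order.
    while min_heap:
        yield heapq.heappop(min_heap)

print(list(sort_k_sorted_stream(iter([3, -1, 2, 6, 4, 5, 8]), 2)))
```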


## 10.4 Compute the k closest stars

In [11]:
# Consider a coordinate system for the Milky Way, in which Earth is at (0,0,0). Model stars as points,
# and assume distances are in light years. The Milky Way consists of approximately 10^12 stars, and
# their coordinates are stored in a file.

# Task: How would you compute the k stars which are closest to Earth?

# Brute Force: If RAM is not a limitation, we can place the input in an array, sort it, and take the k smallest. Space Complexity: O(N), but the given dataset is too large to store in RAM.
# Optimized Approach: Keep the k closest stars seen so far in a max-heap keyed on distance, so that the farthest of these candidates is always at the root.
# In the code below, every new star is pushed onto the heap; once the heap holds k+1 stars, the farthest one is popped, so at most k candidates remain after each iteration.
# Time Complexity: O(nlogk) Space Complexity: O(k)
import math
class Star: 
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z
    
    @property
    def distance(self):
        return math.sqrt(self.x**2 + self.y**2 + self.z**2)
    
    def __lt__(self, rhs):
        return self.distance < rhs.distance

def find_closest_k_stars(stars, k):
    max_heap = []
    for star in stars:
        heapq.heappush(max_heap, (-star.distance, star)) # negation as we want max heap property
        if len(max_heap) == k+1:
            heapq.heappop(max_heap)
    
    return [s[1] for s in heapq.nlargest(k, max_heap)] 

stars = []
stars.append(Star(1,1,1))
stars.append(Star(4,5,6))
stars.append(Star(1,2,3))
stars.append(Star(11,12,13))
stars.append(Star(21,25,26))
k = 2
result = find_closest_k_stars(stars, k)
for s in result:
    print(f'({s.x},{s.y},{s.z})', end=" ")
    

(1,1,1) (1,2,3) 
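
The same selection can be sketched with heapq.nsmallest and a key function. This variant makes two assumptions that are not in the cell above: stars are plain (x, y, z) tuples rather than Star objects, and squared distance is used as the key, which preserves the ordering while avoiding the square root. The names squared_distance and closest_k are made up for illustration.

```python
import heapq

def squared_distance(star):
    # star is an (x, y, z) tuple; Earth is at the origin.
    x, y, z = star
    return x * x + y * y + z * z

def closest_k(stars, k):
    # Accepts any iterable, so the stars can be read lazily from a file.
    return heapq.nsmallest(k, stars, key=squared_distance)

stars = [(1, 1, 1), (4, 5, 6), (1, 2, 3), (11, 12, 13), (21, 25, 26)]
print(closest_k(iter(stars), 2))
```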

In [12]:
# Variant: Design an O(nlogk) time algorithm that reads a sequence of n elements and for each
# element, starting from the kth element, prints the kth largest element read up to that point. The
# length of the sequence is not known in advance. Your algorithm cannot use more than O(k)
# additional storage. What are the worst-case inputs for your algorithm?

# Approach: Store the first k elements in a min-heap. The kth largest element is the min element of the heap. 
# When a new element comes in, check if it is larger than the smallest element of the heap; if so, replace that smallest element with it.

# Worst case input: an increasing seq, because every new element is larger than the heap's min and triggers an O(logk) replacement.
# A decreasing seq is the cheapest input: after the first k elements, no new element enters the heap.
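
The approach described for this variant can be sketched as a generator. This is a minimal illustration that assumes k >= 1; the name kth_largest_stream is made up, and values are yielded rather than printed so that they are easy to check.

```python
import heapq

def kth_largest_stream(stream, k):
    # min_heap holds the k largest elements seen so far; its root is the kth largest.
    min_heap = []
    for x in stream:
        if len(min_heap) < k:
            heapq.heappush(min_heap, x)
        elif x > min_heap[0]:
            # Replace the current kth largest with the larger newcomer.
            heapq.heapreplace(min_heap, x)
        if len(min_heap) == k:
            yield min_heap[0]

print(list(kth_largest_stream([5, 1, 9, 3, 7, 8], 3)))
```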

## 10.5 Compute the median of online data

In [14]:
# Design an algorithm for computing the running median of a sequence.
# For example: input=<1,0,3,5,2,0,1> Output:<1,0.5,1,2,2,1.5,1>

# Brute Force: Store all elements seen so far in an array and recompute the median after each new element. Time Complexity: O(n^2)
# But this solution is not incremental. A median divides the collection into two halves. When a new element is added to the collection, the halves can change by at
# most one element, and the element to be moved is the largest of the smaller half or the smallest of the larger half.
# Optimized Approach: Use two heaps: a max_heap for the smaller half and a min_heap for the larger half, and keep the heaps balanced in size.
# Time Complexity: O(logn) per entry, corresponding to insertion and extraction from a heap
def online_median(sequence):
    min_heap = []
    max_heap = [] # values in max_heap are negative
    result = []
    
    for x in sequence:
        # Push new element into min_heap, pop the smallest element from min_heap and place it in max_heap
        heapq.heappush(max_heap, -heapq.heappushpop(min_heap, x))
        # If even number of elements => both should have equal len else min_heap should have one element more than max_heap
        if(len(max_heap) > len(min_heap)):
            heapq.heappush(min_heap, -heapq.heappop(max_heap))
        result.append((0.5 * (min_heap[0] + -max_heap[0])) if len(min_heap) == len(max_heap) else min_heap[0])
    return result

sequence = [1, 0, 3, 5, 2, 1]
print(f'Medians:{online_median(sequence)}')
    

Medians:[1, 0.5, 1, 2.0, 2, 1.5]
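
The two-heap idea can also be packaged as a small class that accepts one value at a time, which matches the online setting more directly. This is a sketch: the class name RunningMedian and its methods are made up for illustration, median() assumes at least one value has been added, and statistics.median is used only as a brute-force reference.

```python
import heapq
import statistics

class RunningMedian:
    def __init__(self):
        self._upper = []  # min-heap holding the larger half
        self._lower = []  # max-heap (negated values) holding the smaller half

    def add(self, x):
        # Route x through the upper half; its smallest value moves to the lower half.
        heapq.heappush(self._lower, -heapq.heappushpop(self._upper, x))
        # Rebalance so that upper has the same size as lower, or one more.
        if len(self._lower) > len(self._upper):
            heapq.heappush(self._upper, -heapq.heappop(self._lower))

    def median(self):
        if len(self._upper) == len(self._lower):
            return 0.5 * (self._upper[0] - self._lower[0])
        return self._upper[0]

sequence = [1, 0, 3, 5, 2, 0, 1]
tracker = RunningMedian()
medians = []
for i, x in enumerate(sequence):
    tracker.add(x)
    medians.append(tracker.median())
    # Cross-check against the brute-force median of the prefix.
    assert medians[-1] == statistics.median(sequence[:i + 1])
print(medians)
```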


## 10.6 Compute the k largest elements in a max-heap

In [16]:
# Given a max-heap, represented as an array A, design an algorithm that computes the k
# largest elements stored in the max-heap. You cannot modify the heap. For example, if the array representation is
# <561,314,401,28,156,359,271,11,3>, the four largest elements are 561, 314, 401, and 359.

# Brute Force: Perform k extract-max operations, but this approach modifies the heap.
# Optimized Approach: Keep a second max-heap of candidates, seeded with the root of A. Repeatedly pop the largest candidate, append it to the result, and push its children (looked up in A) as new candidates.
# Time Complexity: O(klogk), since the candidate heap never holds more than k+1 entries. Space Complexity: O(k)
def k_largest_in_binary_heap(A, k):
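    if k <= 0 or not A:
        return []
    
    candidate_max_heap = []
    candidate_max_heap.append((-A[0], 0)) # entries are (-value, index in A); negation gives max-heap behavior
    result = []
    for _ in range(min(k, len(A))): # never ask for more elements than the heap contains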
        # Append root of the max_heap to the result
        candidate_idx = candidate_max_heap[0][1]
        result.append(-heapq.heappop(candidate_max_heap)[0])
        
        # Push left and right child of appended root to the max_heap
        left_child_idx = 2 * candidate_idx + 1
        if left_child_idx < len(A):
            heapq.heappush(candidate_max_heap, (-A[left_child_idx], left_child_idx))
        
        right_child_idx = 2 * candidate_idx + 2
        if right_child_idx < len(A):
            heapq.heappush(candidate_max_heap, (-A[right_child_idx], right_child_idx))
    return result

A = [561, 314, 401, 28, 156, 359, 271, 11, 3]
k = 4
print(f'{k} largest elements: {k_largest_in_binary_heap(A,k)}')

4 largest elements: [561, 401, 359, 314]
