## 10.1 Merge Sorted Files 

Write a program that takes as input a set of sorted sequences and computes the union of these sequences as a sorted sequence. For example, if the input is <3,5,7>, <0,6>, and <0,6,28>, then the output is <0,0,3,5,6,6,7,28>.

In [6]:
import heapq

In [4]:
def merge_sorted_arrays(sorted_arrays: list) -> list:
    min_heap = []
    # Builds a list of iterators for each array in sorted_arrays
    sorted_arrays_iters = [iter(x) for x in sorted_arrays]
    
    # Puts first element from each iterator in min_heap
    for i, it in enumerate(sorted_arrays_iters):
        first_element = next(it, None)
        if first_element is not None:
            heapq.heappush(min_heap, (first_element, i))
            
    result = []
    while min_heap:
        smallest_entry, smallest_array_i = heapq.heappop(min_heap)
        smallest_array_iter = sorted_arrays_iters[smallest_array_i]
        result.append(smallest_entry)
        next_element = next(smallest_array_iter, None)
        if next_element is not None:
            heapq.heappush(min_heap, (next_element, smallest_array_i))
    return result 

In [5]:
sorted_arrays = [[3,5,7], [0,6], [0,6,28]]
merge_sorted_arrays(sorted_arrays)

[0, 0, 3, 5, 6, 6, 7, 28]

In [7]:
from heapq import heappush, heappop
heap = []
data = [1,3,5,7,9,2,4,6,8,0]
for item in data:
    heappush(heap, item)

In [8]:
print(heap)

[0, 1, 2, 6, 3, 5, 4, 7, 8, 9]


In [9]:
ordered = []
while heap:
    ordered.append(heappop(heap))

In [10]:
ordered

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [11]:
data.sort()
data == ordered

True

In [12]:
heap = []
data = [(1, 'J'), (4, 'N'), (3, 'H'), (2, 'O')]
for item in data:
    heappush(heap, item)

In [13]:
while heap:
    print(heappop(heap)[1])

J
O
H
N


In [14]:
# Pythonic solution: heapq.merge() accepts multiple sorted inputs and returns an iterator.
def merge_sorted_arrays_pythonic(sorted_arrays):
    return list(heapq.merge(*sorted_arrays))

In [15]:
merge_sorted_arrays_pythonic(sorted_arrays)

[0, 0, 3, 5, 6, 6, 7, 28]

Let k be the number of input sequences and n the total number of elements. Then there are no more than k elements in the min-heap. Both extract-min and insert take O(log k) time. Hence, we can do the merge in O(n log k) time. The space complexity is O(k) beyond the space needed to write the final result. In particular, if the data comes from files and is written to a file, instead of arrays, we would need only O(k) additional storage.
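The O(k) space bound is easier to see with a generator that yields merged values one at a time instead of building a result list. The following is a minimal sketch under that assumption; the name `merge_sorted_streams` is hypothetical and not part of the solution above. Each heap entry carries its stream index, so the heap never holds more than one entry per input.

```python
import heapq
from typing import Iterable, Iterator


def merge_sorted_streams(streams: Iterable[Iterable[int]]) -> Iterator[int]:
    """Lazily yield the merged values of several sorted iterables."""
    iterators = [iter(s) for s in streams]
    min_heap = []
    # Seed the heap with the first value of every non-empty stream.
    for i, it in enumerate(iterators):
        for first in it:
            min_heap.append((first, i))
            break
    heapq.heapify(min_heap)

    while min_heap:
        value, i = min_heap[0]
        yield value
        try:
            # Replace the yielded entry with the next value from the same stream.
            heapq.heapreplace(min_heap, (next(iterators[i]), i))
        except StopIteration:
            # That stream is exhausted, so the heap shrinks by one entry.
            heapq.heappop(min_heap)
```

Because the inputs are consumed through iterators, the same function could be fed lines parsed from open files, although that usage is an assumption and is not exercised here.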

Alternatively, we could recursively merge the k files, two at a time, using the merge step from merge sort. We would go from k to k/2, then k/4, etc. files. There would be log k stages, and each has time complexity O(n), so the time complexity is the same as that of the heap-based approach, i.e., O(n log k). The space complexity of any reasonable implementation of merge sort would end up being O(n), which is considerably worse than the heap-based approach when k << n.
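The pairwise alternative can be sketched as follows. This is an illustration only, working on in-memory lists rather than files; the helper names `merge_two` and `merge_pairwise` are hypothetical. Each pass through the `while` loop is one of the log k stages, and each stage touches every element once.

```python
def merge_two(a: list, b: list) -> list:
    """Merge step from merge sort for two sorted lists."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of these two tails is non-empty.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


def merge_pairwise(sorted_arrays: list) -> list:
    """Merge k sorted lists two at a time, halving their count per stage."""
    arrays = [list(a) for a in sorted_arrays]
    while len(arrays) > 1:
        next_stage = []
        for i in range(0, len(arrays), 2):
            if i + 1 < len(arrays):
                next_stage.append(merge_two(arrays[i], arrays[i + 1]))
            else:
                # Odd array out: carry it over to the next stage unchanged.
                next_stage.append(arrays[i])
        arrays = next_stage
    return arrays[0] if arrays else []
```

Every stage allocates new lists holding all n elements, which is where the O(n) extra space comes from.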

## 10.2 Sort an Increasing-Decreasing Array

An array is said to be k-increasing-decreasing if elements repeatedly increase up to a certain index after which they decrease, then again increase, a total of k times. Design an efficient algorithm for sorting a k-increasing-decreasing array.

In [20]:
def sort_k_increasing_decreasing_array(A: list) -> list:
    # Decomposes A into a set of sorted subarrays
    sorted_subarrays = []
    increasing, decreasing = range(2)
    subarray_type = increasing 
    start_idx = 0
    for i in range(1, len(A)+1):
        if (i == len(A) or #A is ended. Adds the last subarray.
            (A[i -1] < A[i] and subarray_type == decreasing) or 
            (A[i-1] >= A[i] and subarray_type == increasing)):
                sorted_subarrays.append(A[start_idx : i] if subarray_type == increasing
                                       else A[i-1: start_idx -1: -1])
                start_idx = i
                subarray_type = (decreasing
                                if subarray_type == increasing else increasing)
    return merge_sorted_arrays(sorted_subarrays)
        

In [21]:
A =[57, 131, 493, 294, 221, 339, 418, 452, 442, 190]
sort_k_increasing_decreasing_array(A)

[57, 131, 190, 221, 294, 339, 418, 442, 452, 493]

In [24]:
import itertools

In [29]:
# Pythonic solution, uses a stateful object to trace the monotonic subarrays. 
def sort_k_increasing_decreasing_array_pythonic(A):
    class Monotonic:
        def __init__(self):
            self._last = float('-inf')
        def __call__(self, curr):
            result = curr < self._last
            self._last = curr
            return result 
    
    return merge_sorted_arrays([
        list(group)[::-1 if is_decreasing else 1]
        for is_decreasing, group in itertools.groupby(A, Monotonic())
    ])

In [30]:
sort_k_increasing_decreasing_array_pythonic(A)

[57, 131, 190, 221, 294, 339, 418, 442, 452, 493]

The time complexity is O(n log k).

## 10.3 Sort an Almost-sorted Array

Often data is almost-sorted -- for example, a server receives timestamped stock quotes and earlier quotes may arrive slightly after later quotes because of differences in server loads and network routes.

Write a program which takes as input a very long sequence of numbers and prints the numbers in sorted order. Each number is at most k away from its correctly sorted position. 

**Hint:** How many numbers must you read after reading the ith number to be sure you can place it in the correct location? k+1.

The brute-force solution is to put the sequence in an array, sort it, and then print it. The time complexity is O(n log n), where n is the length of the input sequence. The space complexity is O(n).

We can do better by taking advantage of the almost-sorted property. Specifically, after we have read k+1 numbers, the smallest number in the group must be smaller than all following numbers. We need to store k+1 numbers and want to be able to efficiently extract the minimum number and add a new number. A min-heap is exactly what we need. We add the first k numbers to a min-heap. Now we add additional numbers to the min-heap and extract the minimum from the heap.

In [3]:
import itertools

In [24]:
def sort_approximately_sorted_array(sequence, k: int) -> list:
    min_heap = []
    # Adds the first k elements into min_heap. Stop if there are fewer than k elements.
    for x in itertools.islice(sequence, k):
        heapq.heappush(min_heap, x)
        
    result = []
    # For every new element, add it to min_heap and extract the smallest. 
    for x in sequence[k:]:
        smallest = heapq.heappushpop(min_heap, x)
        result.append(smallest)
        
    # sequence is exhausted, iteratively extracts the remaining elements
    while min_heap:
        smallest = heapq.heappop(min_heap)
        result.append(smallest)
    
    return result 

In [25]:
sequence = [3, -1, 2, 6 , 4, 5, 8]
k = 2
sort_approximately_sorted_array(sequence, k)

[-1, 2, 3, 4, 5, 6, 8]

In [27]:
sequence = [4,3,1,6,5,4,9,8,7,10,11,12]
k = 3
sort_approximately_sorted_array(sequence,k)

[1, 3, 4, 4, 5, 6, 7, 8, 9, 10, 11, 12]

In [19]:
min_heap = []
for x in itertools.islice(sequence, k):
    print(x)
    heapq.heappush(min_heap, x)
    print(min_heap)

3
[3]
-1
[-1, 3]


In [20]:
result = []
for x in sequence[k:]:
    smallest = heapq.heappushpop(min_heap, x)
    print(smallest)
    result.append(smallest)
    print(result)

-1
[-1]
2
[-1, 2]
3
[-1, 2, 3]
4
[-1, 2, 3, 4]
5
[-1, 2, 3, 4, 5]


In [21]:
print(min_heap)

[6, 8]


In [22]:
while min_heap:
    smallest = heapq.heappop(min_heap)
    result.append(smallest)
    print(smallest)
    

6
8


In [23]:
print(result )

[-1, 2, 3, 4, 5, 6, 8]


The time complexity is O(n log k). The space complexity is O(k).
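Since the problem describes a very long sequence, a generator that consumes any iterable and yields values as soon as they are known to be in position keeps memory at O(k) even when the input is not a list. This is a hedged sketch of the same heap technique; the name `sort_almost_sorted_stream` is hypothetical.

```python
import heapq
import itertools
from typing import Iterable, Iterator


def sort_almost_sorted_stream(sequence: Iterable[int], k: int) -> Iterator[int]:
    """Yield the values of a k-almost-sorted iterable in sorted order."""
    it = iter(sequence)
    # Buffer the first k values (fewer if the input is shorter).
    min_heap = list(itertools.islice(it, k))
    heapq.heapify(min_heap)

    # Each new value pushes out the smallest buffered value.
    for x in it:
        yield heapq.heappushpop(min_heap, x)

    # Input exhausted: drain what is left in the buffer.
    while min_heap:
        yield heapq.heappop(min_heap)
```

Using a single iterator for both the buffering step and the main loop avoids slicing, so the function also works on generators and file-like inputs.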

## 10.4 Compute the k Closest Stars

Consider a coordinate system for the Milky Way, in which Earth is at (0,0,0). Model stars as points, and assume distances are in light years. The Milky Way consists of approximately 10^12 stars, and their coordinates are stored in a file.

How would you compute the k stars which are closest to Earth? 

In [32]:
import math 

In [33]:
class Star:
    def __init__(self, x:float, y: float, z:float) -> None:
        self.x, self.y, self.z = x, y, z
    
    @property 
    def distance(self) -> float:
        return math.sqrt(self.x**2 + self.y**2 + self.z**2)
    
    def __lt__(self, rhs: 'Star') -> bool:
        return self.distance < rhs.distance 
    
    
def find_closest_k_stars(stars, k: int) -> list:
    # max_heap to store the closest k stars seen so far.
    max_heap = []
    for star in stars:
        # add each star to the max-heap. If the max-heap size exceeds k, remove
        # the maximum element from the max-heap.
        # As python has only min-heap, insert tuple (negative of distance, star)
        # to sort in reversed distance order. 
        heapq.heappush(max_heap, (-star.distance, star))
        if len(max_heap) == k+1:
            heapq.heappop(max_heap)
        
    # heapq.nlargest on the (-distance, star) entries yields the stars sorted
    # from closest to furthest.
    return [s[1] for s in heapq.nlargest(k, max_heap)]

In [34]:
A = Star(3,4,5)

In [35]:
A.distance

7.0710678118654755

In [36]:
B = Star(1,1,1)
C = Star(-0.5,0.7,0.9)
D = Star(9,9,9)
E = Star(0.1,0.1,0.1)
F = Star(-1,-1,-1)
G = Star(0,9,111)

In [38]:
stars = [A,B,C,D,E,F,G]
new_star = find_closest_k_stars(stars, 2)

In [41]:
for star in new_star:
    print(star.x, star.y, star.z)

0.1 0.1 0.1
-0.5 0.7 0.9


The time complexity is O(n log k) and the space complexity is O(k). 
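As a cross-check, the standard library's `heapq.nsmallest` selects the k smallest items of an iterable by a key. The sketch below assumes stars are given as plain `(x, y, z)` tuples rather than `Star` objects, and the name `closest_k_points` is hypothetical. It ranks by squared distance, which preserves the ordering without computing a square root.

```python
import heapq
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]


def closest_k_points(points: Iterable[Point], k: int) -> List[Point]:
    """Return the k points nearest the origin, closest first."""
    return heapq.nsmallest(k, points, key=lambda p: p[0] ** 2 + p[1] ** 2 + p[2] ** 2)


points = [(3, 4, 5), (1, 1, 1), (-0.5, 0.7, 0.9), (9, 9, 9),
          (0.1, 0.1, 0.1), (-1, -1, -1), (0, 9, 111)]
nearest = closest_k_points(points, 2)
```

For the coordinates used in the cells above, `nearest` holds `(0.1, 0.1, 0.1)` followed by `(-0.5, 0.7, 0.9)`, which agrees with the max-heap solution.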

## 10.5 Compute the Median of Online Data 

You want to compute the running median of a sequence of numbers. The sequence is presented to you in a streaming fashion: you cannot back up to read an earlier value, and you need to output the median after reading in each new element. For example, if the input is 1,0,3,5,2,0,1 the output is 1,0.5,1,2,2,1.5,1.

Design an algorithm for computing the running median of a sequence. 

**Sol:** The brute-force approach is to store all the elements seen so far in an array and compute the median using, for example, Solution 11.8 on Page 163 for finding the kth smallest entry in an array. This has time complexity O(n^2) for computing the running median for the first n elements. 
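A brute-force baseline can be sketched as below. Note the substitution: instead of the selection algorithm the text refers to, this version keeps the elements seen so far in a sorted list using `bisect.insort`, which costs O(n) per insertion and so also gives O(n^2) overall. The function name `running_median_brute_force` is hypothetical.

```python
import bisect


def running_median_brute_force(sequence) -> list:
    """Running median via a sorted list of everything seen so far."""
    seen = []
    result = []
    for x in sequence:
        # Insert x while keeping `seen` sorted; shifting elements costs O(n).
        bisect.insort(seen, x)
        mid = len(seen) // 2
        if len(seen) % 2:
            # Odd count: the middle element is the median.
            result.append(seen[mid])
        else:
            # Even count: average the two middle elements.
            result.append(0.5 * (seen[mid - 1] + seen[mid]))
    return result
```

A baseline like this is mainly useful for checking a faster implementation on small inputs.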

The shortcoming of the brute-force approach is that it is not incremental, i.e., it does not take advantage of the previous computation. Note that the median of a collection divides the collection into two equal parts. When a new element is added to the collection, the parts can change by at most one element, and the element to be moved is the largest of the smaller half or the smallest of the larger half.

We can use two heaps, a max-heap for the smaller half and a min-heap for the larger half. We will keep these heaps balanced in size. The max-heap has the property that we can efficiently extract the largest element in the smaller part; the min-heap is similar. 

In [42]:
def online_median(sequence) -> list:
    # min_heap stores the larger half seen so far
    min_heap = []
    # max_heap stores the smaller half seen so far
    max_heap = []
    result = []
    
    for x in sequence:
        heapq.heappush(max_heap, -heapq.heappushpop(min_heap, x))
        # Ensure min_heap and max_heap have equal number of elements if an even 
        # number of elements is read; otherwise, min_heap must have one more 
        # element than max_heap
        if len(max_heap) > len(min_heap):
            heapq.heappush(min_heap, -heapq.heappop(max_heap))
            
        result.append(0.5 *(min_heap[0] + (-max_heap[0]))
                     if len(min_heap) == len(max_heap) else min_heap[0])
        
    return result 

In [47]:
sequence = [1,0,3,5,2,0,1]
online_median(sequence)

[1, 0.5, 1, 2.0, 2, 1.5, 1]

In [48]:
min_heap = []
max_heap = []
result = []

In [49]:
for x in sequence:
    print(x)
    heapq.heappush(max_heap, -heapq.heappushpop(min_heap, x))
    print(max_heap)
    print(min_heap)
    if len(max_heap) > len(min_heap):
        heapq.heappush(min_heap, -heapq.heappop(max_heap))
    print(max_heap)
    print(min_heap)

1
[-1]
[]
[]
[1]
0
[0]
[1]
[0]
[1]
3
[-1, 0]
[3]
[0]
[1, 3]
5
[-1, 0]
[3, 5]
[-1, 0]
[3, 5]
2
[-2, 0, -1]
[3, 5]
[-1, 0]
[2, 5, 3]
0
[-1, 0, 0]
[2, 5, 3]
[-1, 0, 0]
[2, 5, 3]
1
[-1, -1, 0, 0]
[2, 5, 3]
[-1, 0, 0]
[1, 2, 3, 5]


The time complexity per entry is O(log n), corresponding to insertion and extraction from a heap. 

## 10.6 Compute the k Largest Elements in a Max-heap

A heap contains limited information about the ordering of elements, so unlike a sorted array or a balanced BST, naive algorithms for computing the k largest elements have a time complexity that depends linearly on the number of elements in the collection.

Given a max-heap, represented as an array A, design an algorithm that computes the k largest elements stored in the max-heap. You cannot modify the heap.

**Sol:** The ideal data structure for tracking the index to process next is one that supports fast insertion and fast extract-max, i.e., a max-heap. So our algorithm is to create a max-heap of candidates, initialized to hold the index 0, which serves as a reference to A[0]. The indices in the max-heap are ordered according to the corresponding values in A. We then iteratively perform k extract-max operations from the max-heap. Each extraction of an index i is followed by inserting the indices of i's left child, 2i+1, and right child, 2i+2, into the max-heap, assuming these children exist.

In [50]:
def k_largest_in_binary_heap(A: list, k: int) -> list:
    if k <= 0 or not A:
        return []
    
    # Stores the (-value, index)-pair in candidate_max_heap. 
    # This heap is ordered by value field. Uses the negative of value to get the effect 
    # of a max heap.
    candidate_max_heap = []
    # The largest element in A is at index 0.
    candidate_max_heap.append((-A[0],0))
    result = []
    # A holds only len(A) elements, so never extract more than that.
    for _ in range(min(k, len(A))):
        candidate_idx = candidate_max_heap[0][1]
        result.append(-heapq.heappop(candidate_max_heap)[0])
        
        left_child_idx = 2* candidate_idx + 1
        if left_child_idx < len(A):
            heapq.heappush(candidate_max_heap, (-A[left_child_idx], left_child_idx))
        right_child_idx = 2* candidate_idx + 2
        if right_child_idx < len(A):
            heapq.heappush(candidate_max_heap, (-A[right_child_idx], right_child_idx))
    return result 

In [65]:
A = [561,314,401,28,156,359,271,11,3]

In [53]:
k_largest_in_binary_heap(A,4)

[561, 401, 359, 314]

The total number of insertion and extract-max operations is O(k), yielding an O(k log k) time complexity and an O(k) additional space complexity. This algorithm does not modify the original heap.

In [66]:
candidate_max_heap = []
candidate_max_heap.append((-A[0],0))
result = []

In [67]:
for _ in range(4):
    candidate_idx = candidate_max_heap[0][1]
    print(candidate_max_heap[0][0])
    print(candidate_idx)
    result.append(-heapq.heappop(candidate_max_heap)[0])
    print(result)
    left_child_idx = 2*candidate_idx + 1
    if left_child_idx < len(A):
        heapq.heappush(candidate_max_heap, (-A[left_child_idx], left_child_idx))
    right_child_idx = 2*candidate_idx +2 
    if right_child_idx < len(A):
        heapq.heappush(candidate_max_heap, (-A[right_child_idx], right_child_idx))
        

-561
0
[561]
-401
2
[561, 401]
-359
5
[561, 401, 359]
-314
1
[561, 401, 359, 314]
