# **10.1 Merge Sorted Files**

#### Write a program that takes a set of sorted sequences as input and computes the union of the sequences as a sorted sequence 
- `[3,5,7],[0,6],[0,6,28]` -> `[0,0,3,5,6,6,7,28]`
- given 500 files, each containing stock trade information for an S&P 500 company 
- each trade encoded by a line in the following format: `1232111,AAPL,30,456.12`
    - e[0] = milliseconds since the start of the day's trading
    - e[1] = stock symbol
    - e[2] = number of shares
    - e[3] = price 
- create a single file containing all the trades from the 500 files 
    - sorted in order of increasing time 
- individual files are on the order of 5-100 megabytes
    - the combined file will be on the order of 5 gigabytes 
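
As a minimal sketch of the line format described above, a trade line can be split into typed fields. The `Trade` type, its field names, and `parse_trade` are hypothetical names chosen for illustration; the source only describes the positions `e[0]` to `e[3]`.

```python
from typing import NamedTuple


class Trade(NamedTuple):
    # Field names are illustrative assumptions, not part of the original notes.
    time_ms: int   # e[0]: milliseconds since the start of the day's trading
    symbol: str    # e[1]: stock symbol
    shares: int    # e[2]: number of shares
    price: float   # e[3]: price


def parse_trade(line: str) -> Trade:
    """Split one comma-separated trade line into typed fields."""
    time_ms, symbol, shares, price = line.strip().split(",")
    return Trade(int(time_ms), symbol, int(shares), float(price))


trade = parse_trade("1232111,AAPL,30,456.12")
print(trade.time_ms, trade.symbol)  # 1232111 AAPL
```

Only `time_ms` matters for the merge itself, since the combined file is ordered by increasing time.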

---
#### Brute Force: `O(n log n)` time 
- concatenate all n elements into a single array, then sort 
- does not use the fact that each individual sequence is already sorted 
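
For comparison, the brute-force idea can be sketched in a couple of lines. The function name `merge_brute_force` is a hypothetical choice; the approach holds all n elements in memory at once, which is the main drawback for the 5 gigabyte case.

```python
from itertools import chain
from typing import List


def merge_brute_force(sorted_arrays: List[List[int]]) -> List[int]:
    # Concatenate every sequence, then sort the whole thing from scratch.
    # This treats the input as unsorted data and needs O(n) memory.
    return sorted(chain.from_iterable(sorted_arrays))


print(merge_brute_force([[3, 5, 7], [0, 6], [0, 6, 28]]))
# [0, 0, 3, 5, 6, 6, 7, 28]
```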
---

### Min Heap:
- ideal for maintaining a collection of elements when we need to add arbitrary values and extract the smallest element 
- `[3,5,7], [0,6], [0,6,28]`
    - min-heap initialized to the first entry of each array -> `[3,0,0]`
        - extract smallest entry `0` and add it to the output = `[0]`
    - min-heap add `6` -> `[3,0,6]` or `[3,6,0]` (internal order doesn't matter) 
        - extract `0` and add it to the output = `[0,0]`
    - min-heap add `6` -> `[3,6,6]`
        - extract `3` and add it to the output = `[0,0,3]`
    - min-heap add `5` -> `[5,6,6]`
        - extract `5` and add it to the output = `[0,0,3,5]`
    - min-heap add `7` -> `[7,6,6]`
        - extract `6` and add it to the output = `[0,0,3,5,6]`
    - min-heap NO ADD (the extracted `6` came from `[0,6]`, which is now exhausted) -> `[7,6]`
        - extract `6` and add it to the output = `[0,0,3,5,6,6]`
    - min-heap add `28` -> `[7,28]`
        - extract `7` and add it to the output = `[0,0,3,5,6,6,7]`
    - min-heap `[28]`
        - extract `28` and add it to the output = `[0,0,3,5,6,6,7,28]`
                                                

In [2]:
from typing import List, Tuple
import heapq


def merge_sort(s_array: List[List[int]]) -> List[int]:
    
    # each heap entry is a (value, index of source array) tuple 
    min_heap: List[Tuple[int,int]] = []
    
    # build list of iterators for each array in s_array
    s_array_iterators = [iter(x) for x in s_array]
    
    
    # put first element from each iterator in min_heap 
    for i, it in enumerate(s_array_iterators):
        first_element = next(it, None)
        if first_element is not None:
            # push (value, i) so we know which array the value came from 
            heapq.heappush(min_heap, (first_element, i))
   
    result = []
    while min_heap:
        # pops smallest element from the heap 
        # smallest = element 
        # smallest_array indexes to what array the smallest element came from 
        smallest, smallest_array = heapq.heappop(min_heap)
        smallest_array_iter = s_array_iterators[smallest_array]
        result.append(smallest)
        
        next_element = next(smallest_array_iter, None)
        if next_element is not None:
            # push next smallest element into the heap 
            heapq.heappush(min_heap, (next_element, smallest_array))
            
            
    return result

In [3]:
s_array = [[3,5,7],[0,6],[0,6,28]]

merge_sort(s_array)

[0, 0, 3, 5, 6, 6, 7, 28]

#### Merge in `O(n log k)` time
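- n = total number of elements, k = number of input sequences 
    - no more than k elements in the min-heap at any time
- extracting the min value = `O(log k)` time
- inserting the next value = `O(log k)` time 
- each of the n elements is inserted once and extracted once -> `O(n log k)` total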

#### `O(k)` space beyond what is needed to write the final result to a file 
- min-heap only holds `k` items at a time
- if the data is read from files and written to a file instead of being held in arrays
    - only `O(k)` additional space is needed
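
A minimal sketch of the file-based version, assuming each input file is already sorted by time and contains no blank lines. The names `trade_time` and `merge_trade_files` are hypothetical. It relies on `heapq.merge()` with its `key` parameter and on file objects yielding one line at a time, so only about one pending line per file is held in memory. Note that keeping 500 files open at once may exceed the default open-file limit on some systems.

```python
import heapq
from contextlib import ExitStack
from typing import List


def trade_time(line: str) -> int:
    # e[0] of each line is milliseconds since the start of the day's trading.
    return int(line.split(",", 1)[0])


def merge_trade_files(input_paths: List[str], output_path: str) -> None:
    """Merge time-sorted trade files into a single time-sorted file."""
    with ExitStack() as stack:
        # ExitStack closes every opened file when the block exits.
        files = [stack.enter_context(open(path)) for path in input_paths]
        out = stack.enter_context(open(output_path, "w"))
        # File objects are lazy line iterators, and heapq.merge reads them lazily.
        for line in heapq.merge(*files, key=trade_time):
            # Guard against a final line that lacks a trailing newline.
            out.write(line if line.endswith("\n") else line + "\n")
```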
---

### Pythonic Solution
- uses `heapq.merge()`, which takes multiple sorted inputs and returns an iterator over the merged values 

In [4]:
def python_merge(sorted_arrays):
    return list(heapq.merge(*sorted_arrays))

In [5]:
s_array = [[3,5,7],[0,6],[0,6,28]]

python_merge(s_array)

[0, 0, 3, 5, 6, 6, 7, 28]

#### Merge in `O(n log k)`
- n = total number of elements and k = number of input sequences 
#### `O(k)` Space Complexity
- min-heap has at most `k` items at any given point during execution (not counting the `O(n)` output list) 
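
The `O(k)` figure applies when the merged values are consumed one at a time rather than collected with `list(...)`. As a small illustration, `heapq.merge()` can even merge unbounded sorted streams, because it only pulls values on demand. The two `itertools.count` streams below are made-up inputs for the demonstration.

```python
import heapq
from itertools import count, islice

evens = count(0, 2)  # 0, 2, 4, ... (never ends)
odds = count(1, 2)   # 1, 3, 5, ... (never ends)

# merge() returns an iterator immediately; nothing is consumed yet.
merged = heapq.merge(evens, odds)

# Pull only the first six merged values.
first_six = list(islice(merged, 6))
print(first_six)  # [0, 1, 2, 3, 4, 5]
```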

---
### Recursive Merge Sort
- merge the k files two at a time, using the merge step of Merge Sort (Ch. 13) 
    - `O(n log k)` Time Complexity
        - `log k` stages, each with time complexity `O(n)`
        - same as the heap-based approach `O(n log k)`
    - `O(n)` space complexity
        - worse than the heap's `O(k)` when `k < n` 
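
A minimal in-memory sketch of this idea, assuming the sequences fit in lists. The names `merge_two` and `merge_recursive` are hypothetical. Splitting the list of sequences in half gives about `log k` levels of recursion, and each level touches every element once.

```python
from typing import List


def merge_two(a: List[int], b: List[int]) -> List[int]:
    """Two-way merge of two sorted lists in O(len(a) + len(b)) time."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    # At most one of these tails is non-empty.
    result.extend(a[i:])
    result.extend(b[j:])
    return result


def merge_recursive(sorted_arrays: List[List[int]]) -> List[int]:
    """Merge k sorted lists by recursively merging each half of the list of lists."""
    if not sorted_arrays:
        return []
    if len(sorted_arrays) == 1:
        return list(sorted_arrays[0])
    mid = len(sorted_arrays) // 2
    return merge_two(merge_recursive(sorted_arrays[:mid]),
                     merge_recursive(sorted_arrays[mid:]))


print(merge_recursive([[3, 5, 7], [0, 6], [0, 6, 28]]))
# [0, 0, 3, 5, 6, 6, 7, 28]
```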
---