# Exercise 6.3: Extract largest numbers from stream

In this exercise we consider an application of priority queues. We want to implement a function that extracts the $m$ largest numbers from a long stream of numbers. The memory requirement of the implementation should be linear in $m$, and the running time for processing $n$ numbers from the stream and printing the $m$ largest numbers should be in $O(n\log_2 m + m)$.

In your implementation, you should build on a min priority queue. We use an interface analogous to that of the (max) priority queue from notebook `heaps.ipynb`, but the internal implementation uses the Python module `heapq`. You do not need to understand this internal implementation (and should not access it directly); use the following class `MinPQ` only via the provided methods. Please note that we also added a method `print_all_elements` that prints the currently contained items as a list.

In [None]:
import heapq

class MinPQ:
    def __init__(self):
        self.heap = []

    def is_empty(self):
        return len(self.heap) == 0

    def size(self):
        return len(self.heap)

    def insert(self, item):
        heapq.heappush(self.heap, item)

    def minimum(self):
        if self.is_empty():
            raise Exception("invalid operation on empty pq")
        return self.heap[0]

    def extract_min(self):
        if self.is_empty():
            raise Exception("invalid operation on empty pq")
        return heapq.heappop(self.heap)

    def print_all_elements(self):
        print(self.heap)

For testing our function (see below), we need a data stream of numbers. For this purpose, we use a generator that yields random numbers drawn from a standard normal distribution (mean 0, standard deviation 1).

In [None]:
import random

def numberGenerator():
    while True:
        yield random.gauss(0, 1)

The `numberGenerator` generates the numbers on demand, one after the other. We can use `next` to get the next number:

In [None]:
g = numberGenerator()
for i in range(20):
    print(next(g))
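
For testing, it can help to make the stream reproducible. The following is a minimal sketch and not part of the exercise: `seeded_number_generator` and its `seed` parameter are hypothetical names introduced here, and `itertools.islice` is shown as an alternative to calling `next` in a loop.

```python
import itertools
import random

def seeded_number_generator(seed):
    # Hypothetical helper (not part of the exercise): a private random.Random
    # instance makes the stream repeatable for a fixed seed.
    rng = random.Random(seed)
    while True:
        yield rng.gauss(0, 1)

# islice takes a finite prefix of the infinite generator.
first_run = list(itertools.islice(seeded_number_generator(42), 5))
second_run = list(itertools.islice(seeded_number_generator(42), 5))
print(first_run == second_run)
```

Two generators created with the same seed produce the same numbers, so a test run can be repeated exactly.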

Your task is to implement the following function `largest_m_numbers`, which should print the `m` largest numbers seen after processing `n` numbers from a data stream. The parameter `data_stream` is a number generator like the one above, providing the stream. If `m > n`, the implementation should print only `n` numbers (which should happen naturally, without special handling).

In [None]:
def largest_m_numbers(m, data_stream, n):
    priority_queue = MinPQ()
    for _ in range(n):
        # the "_" in this for loop is just a variable name (we could as well have used "i").
        # In Python, it indicates that the variable is not used (by convention).
        next_number = next(data_stream)
        ...
        # TODO suitably process next_number

    # TODO print the m largest numbers

Once you have finished the implementation, you can use it as follows:

In [None]:
g = numberGenerator()
largest_m_numbers(5, g, 1000000)

When extracting the 5 largest of 1000000 numbers drawn from the given normal distribution, you can expect all printed numbers to be greater than 4.
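
The threshold of 4 can be made plausible with a short calculation. The sketch below computes the probability that a standard normal sample exceeds 4 and the resulting expected number of such samples among 1000000 draws, which is roughly 32. The Poisson approximation in the last step is an assumption of this sketch, not part of the exercise.

```python
import math

n = 1_000_000
# Tail probability P(Z > 4) of a standard normal variable Z.
tail_probability = 0.5 * math.erfc(4 / math.sqrt(2))
# Expected number of samples above 4 among n independent draws.
expected_count = n * tail_probability
# Poisson approximation (an assumption of this sketch) of the probability
# that fewer than 5 of the n samples exceed 4.
prob_fewer_than_5 = sum(
    math.exp(-expected_count) * expected_count**k / math.factorial(k)
    for k in range(5)
)
print(expected_count, prob_fewer_than_5)
```

Since far more than 5 samples above 4 are expected, an output containing a number below 4 is possible but very unlikely.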

Let's also add a more controlled example, where the data stream contains the numbers 1 to 1000 in random order. If your implementation is correct, it should output a list containing the numbers 996, 997, ..., 1000 (in some order).

In [None]:
l = list(range(1, 1001))
random.shuffle(l)
g = iter(l)
largest_m_numbers(5, g, 1000)
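
To compare against an independent reference, one option is the standard library function `heapq.nlargest`. This is only a sketch for checking results, not the intended solution; the variable names `numbers` and `reference` are chosen here for illustration.

```python
import heapq
import random

numbers = list(range(1, 1001))
random.shuffle(numbers)

# Reference result from the standard library; heapq.nlargest returns the
# values in descending order.
reference = heapq.nlargest(5, iter(numbers))
print(sorted(reference))
```

Because the order of the printed items may differ, it is easiest to compare the sorted lists.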