# Generators

This session covers the following topics:
- what are generators and how to use them
- generator functions and expressions
- the **yield** statement
- advanced generator methods
- building data pipelines with multiple generators

## Using generators

A generator function is a special kind of function that returns a lazy iterator, that is, an iterator that produces a value only when asked for it.
Iterators are objects that you can loop over, similar to a list.
However, unlike lists, **a lazy iterator does not store its contents in memory**.

### Example 1

Let's try to use a generator in a simple example; a common use case for generators is working with data streams or CSV files.
Given a large CSV file, let's say we need to count the number of rows.

The snippet below is one way of doing so. Can you figure out what should go on the commented line?

In [None]:
def csv_reader(file_name):
    # read the whole file into memory, then split it into rows
    with open(file_name) as file:
        result = file.read().split("\n")
    return result


csv_content = csv_reader("./resources/airport_log.csv")
row_count = 0

# hint: we need to iterate through csv_content and increment the row count; how can we do that?
# ...

print(f"Row count is {row_count}")


In this example, `csv_content` is a list. To populate it, `csv_reader()` opens the file and loads its entire contents into memory. Then the program iterates through the list and increments the row count.

Pretty reasonable, right? But how would this design work for very large files? What if the file is larger than the available memory?

Most probably you will get a `MemoryError`, or your machine will slow to a crawl.

So, how could we handle these very large files? Using generator functions!
We will redefine the csv_reader() as a generator function by using the **yield** keyword instead of **return**.

In [None]:
def csv_reader(file_name):
    # the with block closes the file once the generator is exhausted
    with open(file_name, "r") as file:
        for row in file:
            yield row

csv_content = csv_reader("./resources/airport_log.csv")
row_count = 0

# hint: we need to iterate through csv_content and increment the row count; how can we do that?
# ...

print(f"Row count is {row_count}")

In this version, you open the file, loop through each line, and yield each row instead of returning it.
You can also define a **generator expression**, which looks very similar to a list comprehension and lets you create a generator without defining a function:

`csv_gen = (row for row in open(file_name))`

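Remember the key difference between `yield` and `return` in the loop above:
- using `yield` makes the call return a generator object that produces the rows one at a time
- using `return` would exit the function on the first iteration and give back the first line of the file *only*.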
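
To make the difference concrete, here is a minimal sketch that uses an in-memory list instead of a file. The names `first_only` and `all_rows`, as well as the sample `rows`, are made up for this illustration.

```python
def first_only(rows):
    # return exits the function during the first iteration
    for row in rows:
        return row


def all_rows(rows):
    # yield hands back one row at a time and keeps the loop alive
    for row in rows:
        yield row


rows = ["id,city", "1,CLJ", "2,AMS"]  # hypothetical sample data

first = first_only(rows)
gen = all_rows(rows)

print(first)      # id,city
print(type(gen))  # <class 'generator'>
print(list(gen))  # ['id,city', '1,CLJ', '2,AMS']
```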

### Example 2
Another use case for generators is generating an infinite sequence.
To get a finite sequence, you would call `range()` and evaluate it in a list context:

In [1]:
a = range(5)
list(a)

[0, 1, 2, 3, 4]

Generating an **infinite** sequence would require a generator, since your computer's memory is finite.

In [None]:
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1

In this function, you first initialize the variable `num` and start an infinite loop.
Then, you immediately `yield num` so that you can capture the initial state. This mimics the action of `range()`.
After `yield`, you increment `num`.

If you iterate over this generator in a for loop, it will run forever! (or until you stop the program manually :) )
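
If you only need the first few values, one option is `itertools.islice` from the standard library, which stops pulling from the generator after a given number of items. The sketch below repeats the `infinite_sequence()` definition so that it runs on its own; the cut-off values are arbitrary.

```python
from itertools import islice


def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1


# islice stops after 5 items, so building the list terminates
first_five = list(islice(infinite_sequence(), 5))
print(first_five)  # [0, 1, 2, 3, 4]

# a plain for loop works too, as long as you break out of it yourself
for value in infinite_sequence():
    if value > 2:
        break
    print(value)
```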

You can also call next() on the generator object directly. This is especially useful for testing a generator in the console:

In [None]:
gen = infinite_sequence()

print(next(gen))
print(next(gen))
print(next(gen))

## So what are generators really?

Generator functions look and act just like regular functions, but with one defining characteristic: they use the Python `yield` keyword instead of `return`.

Looking at the `infinite_sequence()` definition, `yield` indicates where a value is sent back to the caller, without exiting the function afterward, unlike `return`.

Instead, the **state** of the function is remembered. That way, when `next()` is called on a generator object, either explicitly or implicitly within a loop, the `num` variable is incremented and then yielded again.

Generator expressions are very similar to other comprehensions in Python.

## Building generators with generator expressions
Generator expressions, just like list comprehensions, allow you to create a generator object with a single line of code.

They are useful in the same cases where list comprehensions are used, but they do not build and hold the whole object in memory before iteration.

That means a generator expression uses far less memory than the equivalent list comprehension, because it produces its values one at a time instead of storing them all.
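
Because a generator expression produces its values one at a time, it pairs naturally with built-in functions that consume any iterable, such as `sum()`, `max()` or `any()`. The short sketch below uses made-up inputs to show this.

```python
# when a generator expression is the only argument,
# the extra pair of parentheses can be dropped
total = sum(num**2 for num in range(5))
print(total)  # 30

largest = max(len(word) for word in ["lazy", "iterator", "yield"])
print(largest)  # 8

# any() stops consuming the generator as soon as it finds a match
has_even = any(num % 2 == 0 for num in [1, 3, 4, 7])
print(has_even)  # True
```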

In [None]:
list_comprehension = [num**2 for num in range(5)]
generator_expression = (num**2 for num in range(5))

print(list_comprehension)
print(generator_expression)

In this example, the expressions look very similar; can you spot the difference between them?
Hint: check the output for confirmation.

## But what about the `yield` statement?

The main job of the `yield` statement is to control the flow of a generator function, similar to `return` with a few extra benefits.

When you call a generator function or use a generator expression, it returns an iterator. When you call `next()` on the resulting iterator, the function body is executed up to the next `yield`.

When the `yield` statement is hit, the function execution is suspended (not stopped completely, which is what happens with `return`) and the yielded value is given to the caller. While the function is suspended, its state is preserved (variable bindings local to the generator, internal stack, instruction pointer etc).

This allows you to resume the function whenever you call one of the generator's methods: evaluation picks up right after the `yield` where it stopped.
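
One way to watch the suspend-and-resume behaviour is to put `print()` calls around the `yield`. The `countdown()` generator below is a hypothetical example written for this purpose.

```python
def countdown(start):
    print("generator started")
    while start > 0:
        print(f"about to yield {start}")
        yield start
        print(f"resumed after yielding {start}")
        start -= 1
    print("generator finished")


gen = countdown(2)  # nothing is printed yet: the body has not started running
print("created")

print(next(gen))  # runs the body up to the first yield
print(next(gen))  # resumes right after the first yield, runs to the second
```

Note that "generator started" appears only after "created": calling the generator function builds the generator object, and the body starts running on the first `next()`.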

Let's see what happens when using multiple `yield` statements.

In [None]:
def many_yields():
    yield "I remember when"
    yield "I remember, I remember when I lost my mind"
    yield "There was something so pleasant about that place"
    yield "Even your emotions have an echo in so much space"
    yield "Can't remember the rest so I'll stop here."

# named lyrics rather than iter, to avoid shadowing the built-in iter()
lyrics = many_yields()
print(next(lyrics))
print(next(lyrics))
print(next(lyrics))
print(next(lyrics))
print(next(lyrics))
# iterator exhausted. let's call next() one more time.
print(next(lyrics))

Because we called `next()` after the generator was exhausted, we got a `StopIteration` exception. Unless your generator is infinite, you can iterate through it only once.
This exception is merely a signal that the iterator has reached its end.
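
In practice you rarely see this exception, because there are a few ways to deal with it. The sketch below uses a hypothetical `two_values()` generator to show three of them: catching the exception, passing a default to `next()`, and letting a for loop handle it.

```python
def two_values():
    yield "first"
    yield "second"


gen = two_values()
print(next(gen))  # first
print(next(gen))  # second

# option 1: catch the exception explicitly
try:
    next(gen)
except StopIteration:
    print("generator is exhausted")

# option 2: give next() a default to return instead of raising
print(next(gen, "no more values"))  # no more values

# option 3: a for loop handles StopIteration for you and simply ends
for value in two_values():
    print(value)
```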

## Exercise: create a data pipeline using generators

Data pipelines allow you to process large volumes of data without maxing out your machine's memory.
You may use the provided sample dataset, or pick a different one from the public datasets available online.

For the given dataset, let's say you are interested in going to Amsterdam from Cluj-Napoca and you want to know the average ticket price.
We assume all ticket prices are integers.

You will analyse this file and get a total average of the ticket prices.

### Strategy
- read every line of the file
- split each line into a list of values
- extract the column names
- create a dictionary for each row, with the column names as keys and the row's values as values
- filter out the rows you are not interested in
- calculate the average ticket price for the records you are interested in
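
The same strategy can be sketched on a small, made-up dataset with different columns. Everything in this example (the `raw_lines` data, the column names, the variable names) is an assumption for illustration and is not taken from the airport log file.

```python
from statistics import mean

# hypothetical in-memory "file"; a real pipeline would iterate over open(...)
raw_lines = [
    "city,temperature\n",
    "Cluj-Napoca,21\n",
    "Amsterdam,17\n",
    "Cluj-Napoca,25\n",
]

# each stage is a generator expression that pulls from the previous one
lines = (line.rstrip("\n") for line in raw_lines)
fields = (line.split(",") for line in lines)
cols = next(fields)  # the first row holds the column names
records = (dict(zip(cols, row)) for row in fields)
cluj_temps = (
    int(record["temperature"])
    for record in records
    if record["city"] == "Cluj-Napoca"
)

# apart from the header row, no line is processed
# until mean() starts consuming the last stage
avg = mean(cluj_temps)
print(avg)  # 23
```

Each stage holds only one row at a time, so the memory use stays roughly constant no matter how many lines the input has.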

Normally, you can do this using a package like `pandas`, but for this case, a few generators should do the trick.

Start from the code snippet below.

In [None]:
# we will use mean in order to get the average price
from statistics import mean

# generate an iterator for the lines in the file
lines = ...

# split each line into a list and put the values into an iterator
list_line = ...

# use next() to store the column names into a list
cols = ...

# create dictionaries and unite them with zip()
# the keys are the column names stored in cols
# the values are the rows in list form, list_line
airport_logs_dicts = ...

# filter the rows
# we are interested in tickets from CLJ to AMS
clj_ams_prices = (
    int(airport_logs_dict["ticket_price"])
    for airport_logs_dict in airport_logs_dicts
    if ( ... )
)

# for testing purposes - check all the prices - comment this after testing
while (i := next(clj_ams_prices, False)):
    print(i)

# uncomment this after checking all the prices
# avg_ticket_price = mean(clj_ams_prices)
# print(f"The average ticket price: ${avg_ticket_price}")


### Generator vs List vs Tuple

In [7]:
import sys

a_list = []
for i in range(1, 10000):
    a_list.append(i)

tup = tuple(a_list)
gen = (x for x in a_list)

print(type(a_list))
print(type(tup))
print(type(gen))

print('size of list is', sys.getsizeof(a_list))
print('size of tup is', sys.getsizeof(tup))
print('size of gen is', sys.getsizeof(gen))

<class 'list'>
<class 'tuple'>
<class 'generator'>
size of list is 85176
size of tup is 80032
size of gen is 112


In [8]:
a_list = []
for i in range(1, 3):
    a_list.append(i)

tup = tuple(a_list)
gen = (x for x in a_list)

print('gen')
for x in gen:
    print(x)

print('tup')
for x in tup:
    print(x)

print('list')
for x in a_list:
    print(x)

print('gen')
for x in gen:
    print(x)

print('tup')
for x in tup:
    print(x)

print('list')
for x in a_list:
    print(x)

gen
1
2
tup
1
2
list
1
2
gen
tup
1
2
list
1
2
