# Yield Generator

According to the <a href='https://docs.python.org/3/glossary.html'>Python Documentation</a>, a generator is a <q>function which returns a generator iterator.  It looks like a normal function except that it contains yield expressions for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function.</q>

## Behavior

<q>Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator iterator resumes, it picks up where it left off (in contrast to functions which start fresh on every invocation).</q>
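
The suspension described above can be observed by driving a generator manually with <code>next()</code>. This is a minimal sketch; the <code>countdown</code> function is a hypothetical name introduced only for illustration.

```python
def countdown(n: int):
    """Yield n, n-1, ..., 1, pausing after each value."""
    while n > 0:
        yield n  # execution pauses here until the next value is requested
        n -= 1   # resumes here, with the local variable n preserved


gen = countdown(3)
first = next(gen)   # runs the body up to the first yield
second = next(gen)  # resumes after the yield, then pauses again
rest = list(gen)    # consumes whatever is left
print(first, second, rest)  # 3 2 [1]

# An exhausted generator raises StopIteration on the next request
try:
    next(gen)
except StopIteration:
    print('exhausted')
```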

In [None]:
def generator(n: int):
    for num in range(n):
        yield num**2


for num in generator(10):
    # 0 1 4 9 16 25 36 49 64 81
    print(f'{num}', end=' ')


Just as a regular function can have multiple <code>return</code> statements, a generator can have multiple <code>yield</code> statements.

In [None]:
def generator(n: int):
    for num in range(n):
        if num % 2 == 0:
            yield num**2
        else:
            yield num**3


for num in generator(10):
    # 0 1 4 27 16 125 36 343 64 729
    print(f'{num}', end=' ')


## Performance

According to <a href='https://realpython.com/introduction-to-python-generators/'>Kyle Stratis</a>, generators are frequently used to read or write large amounts of data one piece at a time. They are excellent for saving memory, but slower than list comprehensions.


### Memory Usage

Using <code>psutil.Process</code>, we can measure the memory currently used by the process in bytes; dividing by <code>1024**2</code> converts it to MB.

In [None]:
from os import getpid
from psutil import Process
import sys


def memory_usage():
    # resident set size (RSS) of the current process, converted from bytes to MB
    process = Process(getpid())
    return process.memory_info().rss / float(1024 ** 2)


Sequences such as lists and tuples always hold every element in memory, even though a <code>for</code> loop uses only one value at a time.

In [None]:
def sequence(n: int) -> list[int]:
    return [i**2 for i in range(0, n)]


print(f'Memory Usage (Before): {memory_usage():.2f}MB')
# WARNING: allocates almost 4GB of memory, may raise MemoryError
numbers = sequence(100000000)
# sys.getsizeof reports the list object only, not the integers it references
print(f'Sizeof: {sys.getsizeof(numbers) / float(1024 ** 2):.2f}MB')
print(f'Memory Usage (After): {memory_usage():.2f}MB')

# large chunks of data should be deleted once they are no longer needed
del numbers
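
Note that <code>sys.getsizeof</code> on a list counts only the list object (its header and one pointer per element), not the integers it points to. The sketch below, with the hypothetical names <code>shallow</code> and <code>deep</code>, adds the element sizes for a closer estimate. It still ignores allocator overhead and shared small integers, so treat the result as an approximation.

```python
import sys

numbers = [i**2 for i in range(1000)]

# Size of the list object itself: header plus one pointer per element
shallow = sys.getsizeof(numbers)
# Adding the size of each int object gives a closer estimate of the total
deep = shallow + sum(sys.getsizeof(n) for n in numbers)

print(f'List only: {shallow:,} bytes')
print(f'List plus elements: {deep:,} bytes')
```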


Generators produce a single value at a time, either on each iteration of a <code>for</code> loop or each time <code>next</code> is called.

In [None]:
def generator(n: int):
    for i in range(n):
        yield i**2


print(f'Memory (Before): {memory_usage():.2f}MB')
# Safe to execute: no value is computed until the generator is iterated
numbers = generator(100000000)
print(f'Sizeof: {sys.getsizeof(numbers):,} bytes')
print(f'Memory (After): {memory_usage():.2f}MB')
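
As a rough check of the point above, the size of a generator object does not depend on how many values it will eventually produce. This sketch uses the hypothetical names <code>squares</code>, <code>small_size</code> and <code>large_size</code>, and assumes CPython, where a generator object stores only its paused frame.

```python
import sys


def squares(n: int):
    for i in range(n):
        yield i**2


small = squares(10)
large = squares(10**12)  # no values are computed yet

# Both objects hold only the paused frame, so their sizes match
small_size = sys.getsizeof(small)
large_size = sys.getsizeof(large)
print(small_size, large_size)

# Values are produced only on request
first_three = [next(large) for _ in range(3)]
print(first_three)  # [0, 1, 4]
```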


### Speed

Even though generators are optimal for memory usage, a list comprehension tends to run faster for small amounts of data or less intensive operations, as long as enough memory is available.

In [None]:
import cProfile

# Each value is multiplied by two
cProfile.run('sum([i * 2 for i in range(10000000)])')
cProfile.run('sum((i * 2 for i in range(10000000)))')

# Each value is squared
cProfile.run('sum([i ** 2 for i in range(10000000)])')
cProfile.run('sum((i ** 2 for i in range(10000000)))')
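
Profiling overhead can affect the two versions differently, because each resumption of a generator expression is recorded as a separate call. As a cross-check, the same comparison can be sketched with <code>timeit</code> from the standard library. The size <code>N</code> and the repetition count are arbitrary assumptions chosen so the sketch runs quickly, and the timings will vary by machine.

```python
import timeit

N = 100_000  # hypothetical size, kept small so the sketch runs quickly

# Time 10 runs of each version; both compute the same sum
list_time = timeit.timeit(lambda: sum([i * 2 for i in range(N)]), number=10)
gen_time = timeit.timeit(lambda: sum(i * 2 for i in range(N)), number=10)

print(f'List comprehension: {list_time:.3f}s')
print(f'Generator expression: {gen_time:.3f}s')
```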


## Examples

### Fibonacci

Computing a long Fibonacci sequence is expensive: the recursive approach requires memoization to speed up the calculations and a lot of memory to store the results. A generator addresses both problems, because each value is produced one at a time from the previous two, which means fewer computations and less memory. In the measurements below, however, it is much slower than a list comprehension.

#### Memory Usage Test

In my case, the generator version uses 40x less memory than the list comprehension version.

In [None]:
def fib(n: int):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b


print(f'Memory Usage (Before): {memory_usage():.2f}MB')
numbers = fib(300000)
print(f'Sizeof: {sys.getsizeof(numbers):,} bytes')
print(f'Memory Usage (After): {memory_usage():.2f}MB')
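
Because the generator is lazy, only the values actually requested are computed. A minimal sketch using <code>itertools.islice</code> to take the first ten values; the name <code>fib_gen</code> is hypothetical and simply repeats the generator above so the block is self-contained.

```python
from itertools import islice


def fib_gen(n: int):
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b


# Only ten values are computed, even though up to 300000 were requested
first_ten = list(islice(fib_gen(300000), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```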


In [None]:
from functools import lru_cache


# maxsize bounds the growth of the cache
@lru_cache(maxsize=5)
def fib(n):
    if n in {0, 1}:
        return n
    return fib(n - 1) + fib(n - 2)


print(f'Memory Usage (Before): {memory_usage():.2f}MB')
# WARNING: allocates almost 4GB of memory, may raise MemoryError
numbers = [fib(n) for n in range(0, 300000)]
# sys.getsizeof reports the list object only, not the integers it references
print(f'Sizeof: {sys.getsizeof(numbers):,} bytes')
print(f'Memory Usage (After): {memory_usage():.2f}MB')

# large chunks of data should be deleted once they are no longer needed
del numbers


#### Speed Test

In my case, the generator version is 5x slower than the list comprehension version.

In [None]:
import cProfile

cProfile.run('sum([fib(i) for i in range(100000)])')
cProfile.run('sum((fib(i) for i in range(100000)))')


### Reading large files

Files too large to fit in memory can be processed with a generator that yields one row at a time; approaches that load the whole file store a lot of data in memory. Reading lazily is slower than reading directly into memory, however, so developers must weigh the file size, the available memory and the required performance.

In [None]:
import dotenv
import os

# load environment variables
dotenv.load_dotenv()
folder = os.getenv('TEMP_FOLDER')
filename = os.path.join(folder, 'huge.csv')

# creates a row generator
def read_huge_file(path):
    # read-only mode; the context manager closes the file when iteration ends
    with open(path, "r") as file:
        for row in file:
            yield row


# counts the number of rows in the file
# (named rows to avoid shadowing the generator function defined earlier)
rows = read_huge_file(filename)
count: int = 0

print(f'Memory Usage (Before): {memory_usage():.2f}MB')
for row in rows:
    # each row is expected to hold a single integer
    num = int(row)
    count += 1
print(f'Sizeof: {sys.getsizeof(rows):,} bytes')
print(f'Row {count = }')
print(f'Memory Usage (After): {memory_usage():.2f}MB')
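
The same pattern can be tried without a real <code>huge.csv</code>. The sketch below writes a small throwaway file with one integer per line and consumes it lazily. The names <code>read_rows</code>, <code>tmp_path</code>, <code>row_count</code> and <code>total</code> are hypothetical, and the file layout is an assumption based on the <code>int(row)</code> call above.

```python
import os
import tempfile


def read_rows(path):
    # the context manager closes the file once iteration finishes
    with open(path, 'r') as file:
        for row in file:
            yield row


# Write a small throwaway file with one integer per line
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.writelines(f'{i}\n' for i in range(1000))
    tmp_path = tmp.name

try:
    # sum() pulls one row at a time; the full file is never held in memory
    row_count = sum(1 for _ in read_rows(tmp_path))
    total = sum(int(row) for row in read_rows(tmp_path))
finally:
    os.remove(tmp_path)

print(row_count, total)  # 1000 499500
```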


Using <code>pandas.read_csv</code> is a faster option, but the whole file is loaded into memory.

In [None]:
import pandas as pd
import dotenv
import os

dotenv.load_dotenv()
folder = os.getenv('TEMP_FOLDER')
filename = os.path.join(folder, 'huge.csv')

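print(f'Memory Usage (Before): {memory_usage():.2f}MB')
# header=None: the generator version above parses every line as data,
# so the file is assumed to have no header row
df = pd.read_csv(filename, header=None)
count: int = len(df.index)
print(f'Sizeof: {sys.getsizeof(df):,} bytes')
print(f'Row {count = }')
print(f'Memory Usage (After): {memory_usage():.2f}MB')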

# large chunks of data should be deleted once they are no longer needed
del df
