# An Introduction to Generators in Python
#### Charles M Rice
-----

Our data sources seem to keep getting bigger, leaving what humans can manage on our own in the dust. And while our computers have, thankfully, more than kept up, there are still datasets that bring the hardiest machine to its digital knees.

Fortunately, Python has a built-in tool that lets us handle large sequences in batches: generator functions. 

Today, we're going to learn about these handy generators, see how they differ from 'normal' Python functions, and then build one to handle a real-world data-processing task.

First things first: what is a generator?

Officially:
>***Generator***: 
>A function which returns a generator iterator. It looks like a normal function except that it contains yield expressions for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. [Python.org Glossary](https://docs.python.org/3/glossary.html#term-generator)

The official definition is excellent, if you already know what a generator is. Let's try to translate into English.

## Generators Iterate

A generator is a function. So far, so good. We know what functions are: functions take an input, perform some operation, and return an output, whether it's a number, a string, or some other data structure. Then they end.

For example, say you wanted a function that returned the first $n$ values of a Fibonacci sequence:

The function below does exactly what it's supposed to do: calculate the numbers in the Fibonacci sequence recursively, store the whole thing in ```sequence```, and then return ```sequence``` all at once. It runs quickly and effectively, but once it's run, that's it. Unless you assign the result of ```fibonacci(n)``` to a variable, ```sequence``` is gone. And if you want the next ten numbers in the sequence, you have to run it again, with a larger value of $n$.

In [23]:
def fibonacci(n):
    if n == 0:
        return [0]
    elif n == 1:
        return [0, 1]
    else:
        sequence = fibonacci(n-1)  # recursive call builds the list up to n-1
        sequence.append(sequence[-1] + sequence[-2])  # next term = sum of the last two
        return sequence

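Recursion has its uses, certainly, but every recursive call adds another frame to the call stack, so it ties up more and more system resources as $n$ gets larger, and Python's default recursion limit (1,000 calls) caps how far it can go. And what if you didn't want all the integers at once, in a list, but several at a time? Therein lies the magic of generators!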

Let's rewrite our Fibonacci function now using a generator instead of recursion:

In [34]:
def fibonacci_generator():
    a,b = 0,1 #This replaces the if/elif steps in the former function
    while True: # The while statement here keeps the function 'live'
        yield a #replaces the return statement
        a,b = b, a+b

fiboGen = fibonacci_generator()

In [33]:
next(fiboGen)

0

Note that where ```fibonacci``` was an ordinary function, ```fibonacci_generator``` is a generator function, and calling it gives us a generator object, ```fiboGen```. The two approaches do (more or less) the same thing, but there are two big syntactical differences between the original function and our shiny new generator:
- ```yield``` replaces ```return``` in a generator, and tells ```fibonacci_generator``` to produce the output for one iteration, then wait for further instruction
- ```next()``` is both the executor of the function and the further instruction. Each call resumes ```fiboGen``` where it last paused, uses the saved values of ```a``` and ```b``` to produce the next output, and then waits until called again.
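To make the pause-and-resume behaviour visible, here is a minimal sketch using a hypothetical `countdown` generator (not part of the Fibonacci example) with `print` calls on either side of the `yield`. Nothing in the body runs until the first `next()` call, and each later call picks up on the line after the `yield`.

```python
def countdown(start):
    """Hypothetical example generator: counts down from `start` to 1."""
    print("generator body starts running")
    while start > 0:
        yield start                  # hand back one value, then pause here
        print("resumed after yield")
        start -= 1


counter = countdown(2)   # creates the generator; nothing is printed yet
first = next(counter)    # runs up to the first yield
second = next(counter)   # resumes after the yield, loops, yields again

print(first, second)     # 2 1
```

A third `next(counter)` would resume once more, find the loop condition false, and raise `StopIteration` because the function body has finished.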

For example:

In [None]:
next(fiboGen)

In [None]:
# And again...
next(fiboGen)

In [None]:
# And again...
next(fiboGen)

In [None]:
# And again...
next(fiboGen)
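Calling `next()` by hand is fine for a demo, but to pull several values at a time it is usually easier to slice the generator. The sketch below assumes the same generator function as above (repeated so the block runs on its own, with a fresh generator named `fibo_gen`) and uses `itertools.islice` from the standard library to take a fixed number of values per batch.

```python
from itertools import islice


def fibonacci_generator():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


fibo_gen = fibonacci_generator()

# Take the first ten values, then the next five; the generator keeps its place
first_ten = list(islice(fibo_gen, 10))
next_five = list(islice(fibo_gen, 5))

print(first_ten)   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print(next_five)   # [55, 89, 144, 233, 377]
```

Because the generator remembers its state between calls, the second batch continues where the first one stopped rather than starting over.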

In fact, once it's created, ```fiboGen``` will keep returning the next value in the sequence until the sun implodes or the generator is scrapped altogether, whichever comes first. (Incidentally, Python is not the only language with generators, but its syntax for them is notably elegant.) So how can we use these nifty built-in generators to help us with data science?

## So how often do you need the Fibonacci sequence?

As every data scientist knows, sometimes you're working with a dataset that's too big to handle all at once, but too small to merit using Big Data tools. We're going to take a look at the [liquor sales in the state of Iowa](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) for our example here. It's a large dataset of about 13 million rows by 24 columns; it would take a long time to load under the best of circumstances. Even this segment of it, which is about 10% of the whole, is a hefty 300+ MB.

In [1]:
# Import the libraries we'll be using
import numpy as np
import pandas as pd

liquor = "Iowa_Liquor_Sales_reduced.csv"

In [2]:
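def get_chunk(filename):
    # Iterate over the file object directly: it reads one line at a time,
    # whereas readlines() would load the whole file into memory first
    with open(filename, "rb") as table:
        for line in table:
            yield line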
            
chunk = get_chunk(liquor)

In [None]:
next(chunk)

In [None]:
next(chunk)
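If one line at a time is too fine-grained, the same idea extends to fixed-size batches of lines. The following is a sketch only: `get_batches` is a hypothetical helper, and the small temporary CSV with made-up `store` and `bottles` columns stands in for the real liquor file so the block can run anywhere.

```python
import os
import tempfile
from itertools import islice


def get_batches(filename, batch_size):
    """Yield lists of up to `batch_size` lines from `filename`."""
    with open(filename, "r") as table:
        while True:
            # islice pulls at most batch_size lines from the open file
            batch = list(islice(table, batch_size))
            if not batch:        # an empty batch means end of file
                return
            yield batch


# Build a tiny stand-in file: one header line plus five data rows
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("store,bottles\n")
    for i in range(5):
        tmp.write(f"{i},{i * 2}\n")
    demo_path = tmp.name

batch_sizes = [len(batch) for batch in get_batches(demo_path, 4)]
os.remove(demo_path)

print(batch_sizes)   # [4, 2]
```

Only one batch of lines is held in memory at a time, so the batch size becomes a dial for trading memory against the number of iterations.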

And we can even read the iterations into a pandas dataframe:

In [None]:
def pandas_chunk(filename):
    for chunk in pd.read_csv(filename, chunksize=5):
        yield chunk

pc = pandas_chunk(liquor)

In [None]:
next(pc)

Et voilà! Our data is now being parsed and loaded into a tidy pandas dataframe piece by piece. Since we are not loading the entire dataset at one time, we can see much more quickly how the data are structured and where we might need to do serious cleaning or handle null values, without tying up resources storing a 300 MB dataset in memory.

## Quick Aside: Expressive Generators

As with other iterative elements of Python, generators can also be compressed into single-line statements like list comprehensions. They are called **drumroll** generator expressions!

Here's an example of a list comprehension right on top of a generator expression:

In [None]:
lst = [i*10 for i in range(10)]

In [None]:
gen = (i*10 for i in range(10))

They look basically the same, don't they? Apart from the parentheses/brackets, they almost are. The difference is that ```lst``` holds all ten integers in memory, so it gives you the same ten integers every time you use it. ```gen``` produces them one at a time: each call to ```next(gen)``` returns the next integer, and once the tenth has been produced, any further call raises ```StopIteration``` and the generator is inert thereafter.
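A short sketch of that difference, reusing the two expressions above: the list can be summed as often as you like, while the generator expression can be consumed exactly once.

```python
lst = [i * 10 for i in range(10)]
gen = (i * 10 for i in range(10))

list_total_1 = sum(lst)   # 450
list_total_2 = sum(lst)   # 450 again: the list is still there

gen_total_1 = sum(gen)    # 450: this consumes every value in the generator
gen_total_2 = sum(gen)    # 0: nothing is left to add up

try:
    next(gen)
    exhausted = False
except StopIteration:
    exhausted = True      # an exhausted generator raises StopIteration

print(list_total_1, list_total_2, gen_total_1, gen_total_2, exhausted)
```

If you need the values more than once, either build a list or create a fresh generator expression each time.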

## To Conclude

Generators are a remarkable tool, and we data scientists can make great use of them. They are useful well beyond parsing a single large file: they can also be used to assess or analyze multiple large files. For example, if we wanted to find a specific word pattern across a thousand-plus text documents without tying up too many resources, we could use a generator to automate the search. It's even possible to build full pipelines using only generator functions!
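As a rough sketch of what such a pipeline could look like, the block below chains two hypothetical generator functions, `read_lines` and `matching`, over a few in-memory strings that stand in for text documents. With real files, `read_lines` would open each path and iterate over it instead.

```python
import re


def read_lines(documents):
    """Yield (name, line) pairs from an iterable of (name, text) documents."""
    for name, text in documents:
        for line in text.splitlines():
            yield name, line


def matching(lines, pattern):
    """Pass through only the (name, line) pairs whose line matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    for name, line in lines:
        if regex.search(line):
            yield name, line


# Made-up stand-ins for documents on disk
documents = [
    ("a.txt", "Generators are lazy.\nLists are eager."),
    ("b.txt", "Nothing to see here."),
    ("c.txt", "A lazy pipeline\nstays small in memory."),
]

# Each stage pulls one line at a time from the stage before it
hits = list(matching(read_lines(documents), r"\blazy\b"))

print(hits)   # [('a.txt', 'Generators are lazy.'), ('c.txt', 'A lazy pipeline')]
```

No stage ever holds more than one line at a time, so adding further stages (cleaning, counting, writing out) does not change the memory footprint much.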

So, Dataquest community, what do you think? Will generators become a regular feature of your coding? Where else do you think they might be useful in data science?