# DataFrame from generators

This is a simple example of how to use Python generator functions to create [pandas][pandas] DataFrames.

[pandas]: https://pandas.pydata.org/

## Preliminaries

In [1]:
# Import pandas
import pandas as pd

In [2]:
# Some dummy data
some_birds = ['Owl', 'Crow', 'Dove', 'Bat']
some_fish = ['Shark', 'Cod', 'Sushi', 'Beaver']

If you are not familiar with `zip()`, check out [the documentation][zip]. Here is a quick intro:

[zip]: https://docs.python.org/3/library/functions.html#zip

In [3]:
# The two lists can be combined pairwise with zip, e.g.:
for bird, fish in zip(some_birds, some_fish):
    print(f'Bird: {bird} \t Fish: {fish}')

Bird: Owl 	 Fish: Shark
Bird: Crow 	 Fish: Cod
Bird: Dove 	 Fish: Sushi
Bird: Bat 	 Fish: Beaver
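
The two lists above happen to have the same length. As a minimal sketch of what happens otherwise, the snippet below uses shortened, hypothetical lists (`birds`, `fish`) to show that `zip()` stops at the shortest input, while `itertools.zip_longest` pads the missing values:

```python
from itertools import zip_longest

# Hypothetical lists of unequal length (not the lists defined above)
birds = ['Owl', 'Crow', 'Dove']   # three items
fish = ['Shark', 'Cod']           # only two items

# zip() stops as soon as the shortest input is exhausted
pairs = list(zip(birds, fish))
print(pairs)

# zip_longest() keeps going and fills the gaps with fillvalue
padded = list(zip_longest(birds, fish, fillvalue=None))
print(padded)
```

With `zip()`, `'Dove'` is silently dropped because it has no matching fish, which is worth keeping in mind when the inputs come from different sources.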


Below is an example of a [generator function][gf]:

[gf]: https://wiki.python.org/moin/Generators

In [4]:
# generator function that yields pairs of birds and fishes
def my_generator():
    for bird, fish in zip(some_birds, some_fish):
        yield bird, fish

In [5]:
# generator instances will be "used up"
my_gen = my_generator()

# Iterate over the generator with next()
print(next(my_gen))
print(next(my_gen))
print(next(my_gen))
print(next(my_gen))

('Owl', 'Shark')
('Crow', 'Cod')
('Dove', 'Sushi')
('Bat', 'Beaver')


Now there are no pairs left, so the next call to `next()` raises `StopIteration`:

In [6]:
try:
    next(my_gen)
except StopIteration:
    print('Generator has reached its end.')

Generator has reached its end.
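
Catching `StopIteration` is one option. Another, sketched below, is to give `next()` a default value as its second argument, which is returned instead of raising once the generator is exhausted. `pair_generator` is a hypothetical one-pair stand-in for the generator above:

```python
def pair_generator():
    """Hypothetical stand-in that yields a single pair."""
    yield 'Owl', 'Shark'

gen = pair_generator()

first = next(gen, None)    # the only pair the generator yields
second = next(gen, None)   # generator is exhausted, so the default is returned

print(first)
print(second)
```

This only works cleanly if the default (here `None`) can never be a legitimate yielded value.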


As long as the generator is finite, it can be turned into a list, like any other iterable:

In [7]:
list(my_generator())

[('Owl', 'Shark'), ('Crow', 'Cod'), ('Dove', 'Sushi'), ('Bat', 'Beaver')]
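
If a generator never stops on its own, `list()` would never return. One possible way to cap it is `itertools.islice`, which takes only the first few items. `endless_pairs` below is a hypothetical infinite generator used purely for illustration:

```python
from itertools import count, islice

def endless_pairs():
    """Hypothetical infinite generator: yields (n, n squared) forever."""
    for n in count(1):
        yield n, n * n

# Take only the first three items instead of exhausting the generator
first_three = list(islice(endless_pairs(), 3))
print(first_three)
```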

In [8]:
# A for loop steps through an iterator
for pair in my_generator():
    print(pair)

('Owl', 'Shark')
('Crow', 'Cod')
('Dove', 'Sushi')
('Bat', 'Beaver')
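
Note that the cells above call the generator function again each time. A small sketch of why, using a hypothetical `numbers` generator: a generator object can be iterated only once, and a second pass over the same object yields nothing.

```python
def numbers():
    """Hypothetical generator function yielding three integers."""
    yield from (1, 2, 3)

gen = numbers()

first_pass = list(gen)        # consumes the generator object
second_pass = list(gen)       # nothing is left, so this is empty
fresh_pass = list(numbers())  # calling the function again gives a fresh generator

print(first_pass, second_pass, fresh_pass)
```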


## Use with pandas
Passing a generator object (the result of calling the generator function) as the `data` parameter of [`pd.DataFrame`][df] creates a DataFrame with one row per `yield`.
Each yielded [tuple][t] becomes a row: its first element goes in the first column, its second in the second column, and so on.

[df]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
[t]: https://www.tutorialspoint.com/python/python_tuples.htm

In [9]:
# Create DataFrame
columns = ['Flyers', 'Swimmers']
pd.DataFrame(my_generator(), columns=columns)

|   | Flyers | Swimmers |
|---|--------|----------|
| 0 | Owl    | Shark    |
| 1 | Crow   | Cod      |
| 2 | Dove   | Sushi    |
| 3 | Bat    | Beaver   |
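
As a variation on the cell above, the generator can yield dictionaries instead of tuples, in which case pandas takes the column names from the keys and `columns=` is not needed. This is a minimal sketch, assuming a reasonably recent pandas version. `labelled_pairs` is a hypothetical variant of `my_generator()`:

```python
import pandas as pd

def labelled_pairs():
    """Hypothetical variant of my_generator() that yields one dict per row."""
    birds = ['Owl', 'Crow']
    fish = ['Shark', 'Cod']
    for bird, fish_name in zip(birds, fish):
        yield {'Flyers': bird, 'Swimmers': fish_name}

# Column names are taken from the dictionary keys
df = pd.DataFrame(labelled_pairs())
print(df)
```

Yielding dicts is more verbose per row, but it keeps each value next to its column name, which can help when rows have many fields.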


### Complete example
Here is a complete, isolated example:

In [10]:
import numpy as np
import pandas as pd

In [11]:
def get_data_from_imagined_file(year, month):
    """Simulate reading one month's data by drawing random labels."""
    # Temperature odds depend on the season (May to September vs. the rest)
    t_probs = [0.4, 0.4, 0.1, 0.1] if 4 < month < 10 else [0.01, 0.39, 0.5, 0.1]
    # Weather odds depend on whether the year is 2000 or later
    w_probs = [0.3, 0.1, 0.4, 0.1, 0.05, 0.05] if year >= 2000 else [0.1, 0.1, 0.2, 0.2, 0.35, 0.05]
    temp = np.random.choice(['Too warm', 'Nice', 'Too cold', 'Unknown'], p=t_probs)
    weather = np.random.choice(['Cloudy', 'Rainy', 'Sunny', 'Hail', 'Froggy', 'Unknown'], p=w_probs)
    return temp, weather

def create_weather_data_from_some_organized_folder_structure():
    """Yield one (year, month, temperature, weather) row per month."""
    for year in range(1999, 2001):
        for month in range(1, 13):
            temperature, weather = get_data_from_imagined_file(year, month)
            yield year, month, temperature, weather

columns = ['Year', 'Month', 'Temp', 'Sky']
# Pass the generator object straight to the DataFrame constructor
data = create_weather_data_from_some_organized_folder_structure()
pd.DataFrame(data, columns=columns)

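|   | Year | Month | Temp     | Sky    |
|---|------|-------|----------|--------|
| 0 | 1999 | 1     | Too warm | Froggy |
| 1 | 1999 | 2     | Too cold | Froggy |
| 2 | 1999 | 3     | Unknown  | Hail   |
| 3 | 1999 | 4     | Too cold | Rainy  |
| 4 | 1999 | 5     | Too warm | Sunny  |
| 5 | 1999 | 6     | Too warm | Sunny  |
| 6 | 1999 | 7     | Nice     | Sunny  |
| 7 | 1999 | 8     | Too warm | Sunny  |
| 8 | 1999 | 9     | Too warm | Hail   |
| 9 | 1999 | 10    | Nice     | Froggy |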
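
Because the data is drawn at random, the table above will differ on every run. If repeatable output is wanted, one possible approach is a seeded NumPy random generator. The sketch below is a simplified, hypothetical variant (`weather_rows`, with its own labels and probabilities), not the functions defined above:

```python
import numpy as np
import pandas as pd

def weather_rows(seed):
    """Hypothetical, simplified generator: one random sky label per month."""
    rng = np.random.default_rng(seed)  # seeded generator for repeatable draws
    skies = ['Cloudy', 'Rainy', 'Sunny']
    for month in range(1, 13):
        yield month, rng.choice(skies, p=[0.3, 0.3, 0.4])

# Two DataFrames built from generators with the same seed
df_a = pd.DataFrame(weather_rows(seed=42), columns=['Month', 'Sky'])
df_b = pd.DataFrame(weather_rows(seed=42), columns=['Month', 'Sky'])

print(df_a.equals(df_b))
```

Creating the random generator inside the generator function means every fresh call starts from the same state, so each DataFrame built from it contains the same rows.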
