# Generators

In [38]:
# Before starting, we run the "Generate data" notebook to make sure we have everything we need.
%run "./01. Generate data.ipynb"

## `for` loops in Python are very flexible

Classic loop

In [4]:
for i in range(4):
    print(i)

0
1
2
3


Loop over any list

In [5]:
for word in ['Cheese', 'Sausage', 'Bread']:
    print(word)

Cheese
Sausage
Bread


Loop over a dictionary

In [6]:
italian_to_english = {
    'ciao': 'hi',
    'sottaceti': 'pickles',
    'pizza': 'pizza',
}

for italian, english in italian_to_english.items():
    print(italian, '->', english)

ciao -> hi
sottaceti -> pickles
pizza -> pizza


Loop over a text file

In [7]:
with open('consumption_201710.csv') as f:
    for line in f:
        print(line, end='')

USER_ID,TV_201710_M,TV_201710_A,TV_201710_N,VOD_201710_M,VOD_201710_A,VOD_201710_N
0,1690,2515,4285,892,953,2805
1,1243,1952,3105,1240,1259,3711
2,1203,1797,2910,200,162,538
3,312,461,787,1256,1261,3663
4,242,379,609,215,226,729
5,279,394,677,97,106,307
6,105,123,222,911,916,2703
7,468,677,1015,233,272,695
8,1702,2607,4301,238,212,660
9,547,774,1409,183,228,549
10,926,1362,2302,668,690,1972
11,316,433,721,1215,1208,3517
12,1483,2049,3428,638,680,1862
13,306,487,887,1397,1487,4256
14,462,679,1049,138,148,464
15,823,1216,2011,740,689,2201
16,1914,2761,4415,375,377,1228
17,217,373,610,1333,1250,3867
18,798,1116,1866,493,498,1589
19,2012,3080,4808,1024,1004,2922
20,1494,2299,3917,895,985,2620
21,1026,1562,2602,1375,1350,4055
22,641,931,1518,416,454,1320
23,767,1190,1853,960,947,2863
24,808,1219,1993,605,599,1826
25,782,1124,1874,232,269,759
26,2040,2971,5039,1341,1308,3921
27,845,1324,2151,1101,1133,3226
28,1668,2586,4146,479,478,1533
29,633,883,1514,1026,1071,3243
30,1272,1999,3119,415,43

Libraries often define their own "iterators"

In [8]:
import pandas as pd

world_pop = pd.DataFrame(
    columns=['Country', '2000', '2015', '2030'], 
    data=[['China', 1270, 1376, 1416],
          ['India', 1053, 1311, 1528],
          ['United States', 283, 322, 356],
          ['Indonesia', 212, 258, 295]],
)
world_pop

         Country  2000  2015  2030
0          China  1270  1376  1416
1          India  1053  1311  1528
2  United States   283   322   356
3      Indonesia   212   258   295


In [None]:
for column_name in world_pop:
    print(column_name)

In [None]:
for idx, row in world_pop.iterrows():
    print('{0:-^7}'.format(idx))
    print(row)

**IMPORTANT CONCEPT: these iterators do not create a list in memory over which `for` iterates!**
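A rough way to see the difference is to compare the memory footprint of a lazy `range` object with that of the equivalent list. This is a minimal sketch; the names `lazy` and `eager` are only for illustration, and the exact byte counts depend on the Python build.

```python
import sys

lazy = range(1000000)          # values are computed on demand
eager = list(range(1000000))   # every value is stored in memory

print(sys.getsizeof(lazy))     # small, and independent of the length
print(sys.getsizeof(eager))    # grows with the number of elements

# A `for` loop asks the iterator for one value at a time, like this:
it = iter(lazy)
print(next(it), next(it))
```
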

## Defining your own for-loop thingy: generators

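A "generator" is written like a function, but instead of returning a single value it yields a sequence of values, one at a time, to the `for` loop that consumes it.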

### First contact

Iterate over the first `n` odd numbers.

In [None]:
def odd_numbers(n):
    """ Generator for the first `n` odd numbers. """
    for i in range(n):
        # Use `yield` instead of `return`: the next iteration resumes execution from here
        yield i * 2 + 1

for i in odd_numbers(5):
    print(i)
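To see what "resuming" means, one can drive a generator by hand with `next` instead of a `for` loop. The sketch below repeats `odd_numbers` with an extra `print` (added here only for illustration) so that it is visible when the body actually runs.

```python
def odd_numbers(n):
    """ Generator for the first `n` odd numbers (with tracing added). """
    for i in range(n):
        print('about to yield', i * 2 + 1)
        yield i * 2 + 1

gen = odd_numbers(2)   # nothing is printed yet: the body has not started
first = next(gen)      # runs the body up to the first `yield`
second = next(gen)     # resumes after the first `yield`, stops at the second

try:
    next(gen)          # the loop is over, so the generator is exhausted
except StopIteration:
    print('generator exhausted')
```

A `for` loop does the same thing behind the scenes: it calls `next` repeatedly and stops when `StopIteration` is raised.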

Second example: the first `n` natural numbers not divisible by `divisor`.

In [1]:
def not_divisibles(n, divisor):
    current = 0
    while n > 0:
        if (current % divisor != 0):
            yield current
            n -= 1
        current += 1

In [2]:
for x in not_divisibles(7, 3):
    print(x)

1
2
4
5
7
8
10


Generated content does not need to be deterministic, or even finite! Each value is computed on the fly, only when the loop asks for it.

In [None]:
import numpy as np

def generate_n_random_numbers(n):
    for i in range(n):
        yield np.random.uniform()

for x in generate_n_random_numbers(5):
    print(x)
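The example above is random but still finite. A generator that never ends could look like the sketch below; `random_walk` is a hypothetical name, and the standard-library `random` module is used instead of NumPy to keep the block self-contained. Because the generator is infinite, the consumer decides when to stop, here with `itertools.islice`.

```python
import random
from itertools import islice

def random_walk(start=0):
    """ Endless generator: each step moves one unit up or down. """
    position = start
    while True:
        yield position
        position += random.choice([-1, 1])

# Take only the first 5 positions; the rest are never computed.
steps = list(islice(random_walk(), 5))
print(steps)
```
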

### Hands-on: Your first generator

Write a generator called `even` that yields the even numbers between 0 and `n`.

Expected:
```
for i in even(7):
    print(i)
```

outputs

```
0
2
4
6
```

In [10]:
# Your code here
def even(n):
    for i in range(n):
        if i % 2 == 0:
            yield i

for j in even(7):
    print(j)

0
2
4
6


### Hands-on: Recognize the smell of generators

Get rid of the smell in the code below by defining a generator.

In [11]:
for i in range(9):
    if i % 3 == 0:
        continue
    print('Square is', i ** 2)

for j in range(5):
    if j % 2 == 0:
        continue
    print('Cube is', j ** 3)

for k in range(13):
    if k % 5 == 0:
        continue
    print('A' * k)

Square is 1
Square is 4
Square is 16
Square is 25
Square is 49
Square is 64
Cube is 1
Cube is 27
A
AA
AAA
AAAA
AAAAAA
AAAAAAA
AAAAAAAA
AAAAAAAAA
AAAAAAAAAAA
AAAAAAAAAAAA


In [17]:
# Your code here
def skipper(n, divisor):
    for i in range(n):
        if i%divisor!=0:
            yield i
            
for j in skipper(9,3):
    print('Square is', j ** 2)

for j in skipper(5,2):
    print('Cube is', j ** 3)

for j in skipper(13,5):
    print('A'* j)

Square is 1
Square is 4
Square is 16
Square is 25
Square is 49
Square is 64
Cube is 1
Cube is 27
A
AA
AAA
AAAA
AAAAAA
AAAAAAA
AAAAAAAA
AAAAAAAAA
AAAAAAAAAAA
AAAAAAAAAAAA


### Hands-on: All pairs (skip)

We have a list of subjects whose individual performance needs to be compared in pairs.

Write a generator called `all_pairs` that yields all pairs of items from a list.

E.g. `all_pairs(['A', 'B', 'C'])` will yield three sets `{'A', 'B'}`, `{'A', 'C'}`, `{'B', 'C'}` (not necessarily in this order).

Suggestion: start by writing a solution for this task without generators, then transform the `for` loops into a generator.

In [22]:
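# Your code here
def all_pairs(items):
    """ Generator for all unordered pairs of elements of `items`. """
    for i in range(len(items) - 1):
        for j in range(i + 1, len(items)):
            yield {items[i], items[j]}

for pair in all_pairs(['A', 'B', 'C']):
    # Sets are unordered: sort each pair to get a reproducible printout
    print(sorted(pair))

['A', 'B']
['A', 'C']
['B', 'C']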


### Hands-on: A common generators pattern

Write a generator called `without_punctuation` that iterates over a list of strings and removes punctuation characters at the end of each string. If a string is empty after the removal, it is skipped.

For instance, `without_punctuation(['Apple', 'Banana...', 'Carrot!!', '*$', '!Dinosaur'])` would yield `Apple`, `Banana`, `Carrot`, and `!Dinosaur`.

(see the `.rstrip` method of strings, and the constant `punctuation` in the module `string`)

In [37]:
# Your code here
from string import punctuation

li = ['Apple', 'Banana...', 'Carrot!!', '*$', '!Dinosaur']

def without_punctuation(li):
    for word in li:
        # Remove trailing punctuation only; leading characters are kept
        clean = word.rstrip(punctuation)
        # Skip strings that are empty after the removal
        if clean != '':
            yield clean

for word in without_punctuation(li):
    print(word)

Apple
Banana
Carrot
!Dinosaur

In [33]:

from string import punctuation
'!asd!!.'.rstrip('!.')


'!asd'

The pattern in the exercise above is the one that I find most common in my code: moving recurring filtering, clean-up, and transformation steps out of `for` loops and into a generator.

It comes up all the time when processing messy data.
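As an illustration of the same pattern on messy numeric data, here is a minimal sketch. The generator `clean_numbers`, the `'N/A'` marker, and the sample `readings` are all made up for this example.

```python
def clean_numbers(raw_values):
    """ Yield floats parsed from strings, skipping blanks and 'N/A'. """
    for raw in raw_values:
        text = raw.strip()                       # clean up
        if not text or text.upper() == 'N/A':    # filter
            continue
        yield float(text)                        # transform

readings = [' 3.5', '', 'N/A', '7\n', '  10.25 ']
total = sum(clean_numbers(readings))
print(total)  # 20.75
```

The loop that consumes `clean_numbers` (here `sum`) no longer needs to know anything about how the data is cleaned.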

## Generators can be chained

In [None]:
def readfiles(filenames):
    """ Generator that yields all lines from multiple files. """
    for filename in filenames:
        # `with` makes sure each file is closed once we are done with it
        with open(filename) as f:
            for line in f:
                yield line


def filter_pattern(lines, pattern):
    """ Generator that yields all lines that contain a certain string. """
    for line in lines:
        if pattern in line:
            yield line


def pprint_with_line_numbers(lines):
    """ Format each line in a pretty string. """
    for idx, line in enumerate(lines):
        yield '{} - "{}"'.format(idx, line.strip())


filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']

for line in pprint_with_line_numbers(filter_pattern(readfiles(filenames), pattern='REM')):
    print(line)
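Chained generators process one item at a time: each stage pulls the next item from the stage before it only when needed. The sketch below makes this visible by recording what happens in a list. The names `source`, `only_rem`, and `events`, and the in-memory lines, are hypothetical stand-ins for the file-based pipeline above.

```python
events = []

def source(items):
    """ Stand-in for reading lines from files. """
    for item in items:
        events.append('read ' + item)
        yield item

def only_rem(lines):
    """ Keep only the lines that contain 'REM'. """
    for line in lines:
        if 'REM' in line:
            events.append('keep ' + line)
            yield line

pipeline = only_rem(source(['1,2', 'REM a', '3,4', 'REM b']))

first = next(pipeline)
print(first)   # REM a
print(events)  # ['read 1,2', 'read REM a', 'keep REM a']
```

After the first match, the lines `'3,4'` and `'REM b'` have not been read yet: no intermediate list of all lines is ever built.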


## Real-life example: ETL workflow for PayTV data

Switch to the other notebook

### Hands-on: Get rid of the smell!

The code below parses three CSV files that contain comment lines, i.e. lines that start with one of the prefixes `'# '`, `'-- '`, or `'REM '`.

**Get rid of the smell!**

In [39]:
# Script that computes the sum of all the columns in 3 CSV files that contain commented lines

comment_prefixes = ['# ', '-- ', 'REM ']

filename1 = 'first_commented_data.csv'
print('Load data from', filename1)
with open(filename1, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data1 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])


filename2 = 'second_commented_data.csv'
print('Load data from', filename2)
with open(filename2, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data2 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])


filename3 = 'third_commented_data.csv'
print('Load data from', filename3)
with open(filename3, 'rt') as f:
    valid_lines = []
    for line in f:
        for prefix in comment_prefixes:
            if line.startswith(prefix):
                break
        else:
            data = [int(x) for x in line.split(',')]
            valid_lines.append(data)

data3 = pd.DataFrame(valid_lines, columns=['unci', 'dunci', 'trinci', 'quari'])

print(data1.sum() + data2.sum() + data3.sum())

Load data from first_commented_data.csv
Load data from second_commented_data.csv
Load data from third_commented_data.csv
unci      254425
dunci     244622
trinci    233027
quari     245013
dtype: int64


In [43]:
# Your code here
comment_prefixes = ['# ', '-- ', 'REM ']
filenames = ['first_commented_data.csv', 'second_commented_data.csv', 'third_commented_data.csv']

    

def readfiles(filenames):
    """ Generator that yields all lines from multiple files. """
    for filename in filenames:
        # `with` makes sure each file is closed once we are done with it
        with open(filename) as f:
            for line in f:
                yield line


def filter_pattern(lines, pattern):
    """ Generator that yields all lines that do not contain a certain string. """
    for line in lines:
        if pattern not in line:
            yield line
            

for line in filter_pattern(readfiles(filenames), pattern='# '):
    print(line)


504,367,41,62

REM Skip me

62,646,15,806

750,441,530,107

820,846,461,531

711,160,700,964

913,900,624,193

385,893,502,120

353,696,450,763

77,440,859,81

868,672,379,354

964,711,639,146

414,279,234,912

693,511,6,106

534,80,88,362

41,505,153,77

588,532,189,458

798,500,861,882

93,334,949,219

512,438,503,68

465,498,950,579

156,869,552,961

101,304,344,143

587,647,200,680

759,652,687,582

843,408,917,280

REM Do not bother

478,753,202,304

119,460,629,250

863,552,424,457

443,847,190,820

97,886,261,915

313,833,676,766

929,663,265,525

201,414,913,329

460,863,117,523

547,103,304,150

90,282,3,529

295,290,998,936

76,449,200,479

51,95,947,729

620,310,275,810

745,46,353,46

453,567,143,783

1,435,337,263

781,235,462,409

140,692,174,861

893,412,504,882

472,666,302,549

422,784,69,919

956,266,847,393

REM Skip me

698,178,55,881

883,408,751,26

-- Skip me

REM Ignore this line

767,156,119,214

762,887,310,111

335,439,495,277

125,699,29,452

389,365,564,599
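The attempt above removes only the `'# '` comments, so lines starting with `'REM '` and `'-- '` still get through. One possible way to finish the refactoring is sketched below. The generator names `skip_comments` and `parse_ints` and the in-memory `sample` lines are assumptions made for this example; with real files, the lines would come from a generator such as `readfiles`, and the parsed rows could be passed to `pd.DataFrame`.

```python
COMMENT_PREFIXES = ('# ', '-- ', 'REM ')

def skip_comments(lines, prefixes=COMMENT_PREFIXES):
    """ Yield only the lines that do not start with a comment prefix. """
    for line in lines:
        # `str.startswith` accepts a tuple and checks all prefixes at once
        if not line.startswith(tuple(prefixes)):
            yield line

def parse_ints(lines):
    """ Turn each comma-separated line into a list of integers. """
    for line in lines:
        yield [int(x) for x in line.split(',')]

# Made-up sample standing in for the content of the CSV files
sample = [
    '# header comment\n',
    '1,2,3,4\n',
    'REM Skip me\n',
    '10,20,30,40\n',
    '-- Skip me\n',
]

rows = list(parse_ints(skip_comments(sample)))
column_sums = [sum(column) for column in zip(*rows)]
print(column_sums)  # [11, 22, 33, 44]
```
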

## itertools (time permitting)

A tour of the content of `itertools`.

https://docs.python.org/3.6/library/itertools.html


A typical case that shows up in my code: going through combinations of experimental conditions.

In [None]:
from itertools import product

concentrations = [1, 10, 100]
times = [60, 120, 180]
applications = [1, 2, 3]

for idx, (concentration, time, application) in enumerate(product(concentrations, times, applications)):
    print('Run experiment #{}'.format(idx))
    print('Concentration', concentration)
    print('Time', time)
    print('Applications', application)
    print()

Another common case is computing a statistic on all pairs of variables.

In [None]:
df = pd.DataFrame(
    data = [[1, 0.1, 32],
            [4, 0.3, 11],
            [8, 0.9, 1],
            [12, 0.12, -4]],
    columns=['unci', 'dunci', 'trinci']
)
df

In [None]:
# Without itertools

n_cols = df.shape[1]
for idx1 in range(n_cols):
    for idx2 in range(idx1 + 1, n_cols):
        corr = (df.iloc[:, idx1] * df.iloc[:, idx2]).sum()
        print(df.columns[idx1], df.columns[idx2], corr)

In [None]:
# With itertools
from itertools import combinations

for col1, col2 in combinations(df.columns, 2):
    corr = (df.loc[:, col1] * df.loc[:, col2]).sum()
    print(col1, col2, corr)
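The two versions visit the same pairs in the same order. The short check below uses a plain list of column names (copied from the DataFrame above) instead of pandas, so that it runs on its own.

```python
from itertools import combinations

columns = ['unci', 'dunci', 'trinci']

# Pairs produced by the nested index loops
nested = []
for idx1 in range(len(columns)):
    for idx2 in range(idx1 + 1, len(columns)):
        nested.append((columns[idx1], columns[idx2]))

# Pairs produced by itertools
pairs = list(combinations(columns, 2))

print(pairs)
# [('unci', 'dunci'), ('unci', 'trinci'), ('dunci', 'trinci')]
print(nested == pairs)  # True
```
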

### Hands-on

Write a generator that deals cards at random from a deck of cards.
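One possible approach, given here only as a sketch: build the deck with `itertools.product`, shuffle it, and yield one card at a time. The name `deal_cards`, the 52-card deck, and the representation of a card as a `(rank, suit)` tuple are assumptions, since the exercise does not specify them.

```python
import random
from itertools import product

RANKS = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
SUITS = ['clubs', 'diamonds', 'hearts', 'spades']

def deal_cards():
    """ Yield every card of a 52-card deck once, in random order. """
    deck = list(product(RANKS, SUITS))
    random.shuffle(deck)
    for card in deck:
        yield card

dealer = deal_cards()
hand = [next(dealer) for _ in range(5)]   # deal a hand of five cards
print(hand)
```

Because the deck is shuffled once and then consumed, no card can be dealt twice from the same generator.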