# Working with the itertools module

We have a set of four *.csv* files with presorted data, where the primary key is the `SSN`.

## Goals

1. [Goal 1](#Goal-1)
Create a (lazy) iterator for each file that returns named tuples with the appropriate data types
2. [Goal 2](#Goal-2)
Create a single iterable that combines the data from all four files
3. [Goal 3](#Goal-3)
Based on 'update_status.csv', filter out the records whose last update date is `< 3/1/2017` (March 1, 2017)
4. [Goal 4](#Goal-4)
Using the filtered data from Goal 3, count the number of cars of each make, grouped by gender

## Goal 1
Create a (lazy) iterator for each file that returns named tuples with the appropriate data types

In [377]:
# import modules
import csv
from collections import namedtuple
from datetime import datetime
from functools import partial
from itertools import chain, compress, groupby, tee

In [2]:
# navigate to data folder
import os

os.chdir('./Project_itertools_data')

In [3]:
sorted(os.listdir()) # NB the order is important

['employment.csv', 'personal_info.csv', 'update_status.csv', 'vehicles.csv']

In [4]:
# define the reader function using csv
def csv_parser(file, *,  delimiter=',', quotechar='"', include_header=False):
    with open(file) as f:
        if not include_header:
            next(f) # skip header
        reader = csv.reader(f, delimiter=delimiter, quotechar=quotechar)
        yield from reader
        
def header_class(file, class_name: str):
    # csv_parser opens the file itself, so no extra open() is needed here
    reader = csv_parser(file, include_header=True)
    return namedtuple(class_name, next(reader))
 
def row_parser(row, dtype):
    row = (func(r) for func, r in zip(dtype, row))
    return row

def iter_files(fname, ntuple, dtype):
    reader = csv_parser(fname)
    for row in reader:
        parsed_row = row_parser(row, dtype)
        yield ntuple(*parsed_row)
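
To see the laziness at work without touching the data folder, here is a minimal, self-contained sketch. It assumes an in-memory CSV (via `io.StringIO`) instead of a real file; the helper `iter_rows` is hypothetical, and the second sample row is made up for the illustration.

```python
import csv
import io
from collections import namedtuple

# Sample text standing in for a file; the second data row is made up
SAMPLE = (
    "ssn,vehicle_make,model_year\n"
    "100-53-9824,Oldsmobile,1993\n"
    "000-00-0001,Ford,1997\n"
)

def iter_rows(text, class_name, dtype):
    """Lazily yield one typed named tuple per data row (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    row_class = namedtuple(class_name, next(reader))  # header row -> field names
    for row in reader:
        # apply one converter per column, exactly like row_parser does
        yield row_class(*(convert(value) for convert, value in zip(dtype, row)))

rows = iter_rows(SAMPLE, 'Vehicle', (str, str, int))
first = next(rows)  # only the header and the first data row have been parsed so far
print(first)
```

Nothing is read until `next` is called, and each call parses exactly one more row.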

In [5]:
# define dtypes parser
def parse_date(value, *, default=None, fmt='%Y-%m-%dT%H:%M:%SZ'):
    format_ = fmt
    try:
        return datetime.strptime(value, format_)
    except ValueError:
        return default    
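
As a quick check of the date parser, the sketch below repeats `parse_date` so that it runs on its own and feeds it one well-formed and one malformed value. The well-formed string is an assumption about how the raw file stores the timestamp, inferred from the default format.

```python
from datetime import datetime

def parse_date(value, *, default=None, fmt='%Y-%m-%dT%H:%M:%SZ'):
    # same logic as the notebook's parse_date, repeated to keep the sketch self-contained
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return default

good = parse_date('2017-10-07T00:14:42Z')   # matches the default format
bad = parse_date('07/10/2017')              # wrong format -> falls back to the default
fallback = parse_date('07/10/2017', default=datetime.min)

print(good, bad, fallback)
```

Keep in mind that a `None` default cannot be compared with a `datetime`, so a later filter such as `row.last_updated >= datetime(2017, 3, 1)` would raise a `TypeError` on a row whose date failed to parse.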

Now let's inspect the header of each file to work out which converters we need to parse each column into a proper type

In [7]:
employment_nt = header_class('employment.csv', 'Employment')
employment_nt._fields

('employer', 'department', 'employee_id', 'ssn')

In [8]:
employment_dtype = (str, str, str, str)

In [9]:
veichles_nt = header_class('vehicles.csv', 'Vehicle')
veichles_nt._fields

('ssn', 'vehicle_make', 'vehicle_model', 'model_year')

In [10]:
veichles_dtype = (str, str, str, int)

In [11]:
personal_info_nt = header_class('personal_info.csv', 'PersonalInfo')
personal_info_nt._fields

('ssn', 'first_name', 'last_name', 'gender', 'language')

In [12]:
personal_info_dtype = (str, str, str, str, str)

In [13]:
update_status_nt = header_class('update_status.csv', 'UpdateStatus')
update_status_nt._fields

('ssn', 'last_updated', 'created')

In [14]:
update_status_dtype = (str, parse_date, parse_date)

As stated above, the files share a common, presorted field that uniquely identifies each record (the primary key).

Now, to process all the files together, we can create a series of tuples that contain the pieces needed for parsing:

* the file names
* the corresponding named tuples
* the dtypes corresponding to each file

N.B. The three tuples must follow the same ordering, since we are going to zip them together.

In [15]:
fnames = sorted(os.listdir())
ntuples = (employment_nt, personal_info_nt, update_status_nt, veichles_nt)
dtypes = (employment_dtype, personal_info_dtype, update_status_dtype, veichles_dtype)

# check if the order is consistent
list(zip(fnames, ntuples, dtypes))[0]

('employment.csv', __main__.Employment, (str, str, str, str))
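
Since `zip` silently stops at the shortest input and cannot tell whether the tuples are in matching order, a small sanity check can help. The sketch below is one possible check, not part of the original notebook: `check_alignment` is a hypothetical helper, and plain `str` stands in for `parse_date` so that the block runs on its own.

```python
fnames = ('employment.csv', 'personal_info.csv', 'update_status.csv', 'vehicles.csv')
field_names = (
    ('employer', 'department', 'employee_id', 'ssn'),
    ('ssn', 'first_name', 'last_name', 'gender', 'language'),
    ('ssn', 'last_updated', 'created'),
    ('ssn', 'vehicle_make', 'vehicle_model', 'model_year'),
)
dtypes = (
    (str, str, str, str),
    (str, str, str, str, str),
    (str, str, str),        # the notebook uses parse_date for the two date columns
    (str, str, str, int),
)

def check_alignment(fnames, field_names, dtypes):
    """Return the file names whose column count and converter count disagree."""
    if not (len(fnames) == len(field_names) == len(dtypes)):
        raise ValueError('the three tuples must have the same length')
    return [fname
            for fname, names, dtype in zip(fnames, field_names, dtypes)
            if len(names) != len(dtype)]

mismatches = check_alignment(fnames, field_names, dtypes)
print(mismatches)  # an empty list means every file has one converter per column
```

This only catches orderings that change the number of columns; swapping two files with the same column count (here employment and vehicles) would still go unnoticed.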

Now we can iterate over all the files at the same time    

In [16]:
for fname, ntuple, dtype, in zip(fnames, ntuples, dtypes):
    file_iter = iter_files(fname, ntuple, dtype)
    print(fname.center(40, '*'))
    for _ in range(1): # check
        print(next(file_iter))
    print()
    

*************employment.csv*************
Employment(employer='Stiedemann-Bailey', department='Research and Development', employee_id='29-0890771', ssn='100-53-9824')

***********personal_info.csv************
PersonalInfo(ssn='100-53-9824', first_name='Sebastiano', last_name='Tester', gender='Male', language='Icelandic')

***********update_status.csv************
UpdateStatus(ssn='100-53-9824', last_updated=datetime.datetime(2017, 10, 7, 0, 14, 42), created=datetime.datetime(2016, 1, 24, 21, 19, 30))

**************vehicles.csv**************
Vehicle(ssn='100-53-9824', vehicle_make='Oldsmobile', vehicle_model='Bravada', model_year=1993)



## Goal 2
Create a single iterable that combines the data from all four files

#### N.B. The following is my approach; afterwards we will review Fred's approach (the better one, of course)

In [174]:
result = [iter_files(fname, ntuple, dtype)
          for fname, ntuple, dtype in zip(fnames, ntuples, dtypes)]

def combine_files(result):
    r1, r2, r3, r4 = result
    for r in zip(r1, r2, r3, r4):
        # merging the dicts keeps a single 'ssn' entry, since duplicate keys collapse
        temp = {**r[0]._asdict(), **r[1]._asdict(), **r[2]._asdict(), **r[3]._asdict()}
        # NB a new 'Data' class is created for every row
        combined = namedtuple('Data', temp.keys())
        yield combined(*temp.values())

In [175]:
combined = combine_files(result)

print(next(combined), end='\n\n')

for _ in range(10):
    print(next(combined).employer)

Data(employer='Stiedemann-Bailey', department='Research and Development', employee_id='29-0890771', ssn='100-53-9824', first_name='Sebastiano', last_name='Tester', gender='Male', language='Icelandic', last_updated=datetime.datetime(2017, 10, 7, 0, 14, 42), created=datetime.datetime(2016, 1, 24, 21, 19, 30), vehicle_make='Oldsmobile', vehicle_model='Bravada', model_year=1993)

Nicolas and Sons
Connelly Group
Upton LLC
Zemlak-Olson
Kohler, Bradtke and Davis
Roberts, Torphy and Dach
Lind-Jast
Bashirian-Lueilwitz
Windler, Marks and Haley
Leffler-Hahn
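
The dictionary-merge idea can be sketched on its own as follows. This is only an illustration under simplifying assumptions: the named tuples carry a reduced set of fields, and `combine_rows` is a hypothetical helper that builds the `Data` class once instead of once per row.

```python
from collections import namedtuple

# Reduced field sets, for illustration only
Employment = namedtuple('Employment', 'employer ssn')
Vehicle = namedtuple('Vehicle', 'ssn vehicle_make')

def combine_rows(row_groups):
    """Merge each group of named tuples into one 'Data' row (hypothetical helper)."""
    data_class = None
    for rows in row_groups:
        merged = {}
        for row in rows:
            merged.update(row._asdict())  # a repeated key such as 'ssn' is stored once
        if data_class is None:
            data_class = namedtuple('Data', merged.keys())  # built once, then reused
        yield data_class(*merged.values())

pairs = [(Employment('Stiedemann-Bailey', '100-53-9824'),
          Vehicle('100-53-9824', 'Oldsmobile'))]
combined_row = next(combine_rows(pairs))
print(combined_row)
```

The merge relies on every file using the same name for the shared key; a differently named key column would simply end up as an extra field.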


#### Fred's solution:

First, we create an inclusion/exclusion tuple for each file, with one flag per column, so that we can choose exactly which fields to retain (in this case everything, but only one copy of `ssn`)

In [176]:
employment_fields = (True, True, True, False)
personal_info_fields = (True, True, True, True, True)
veichles_fields = (False, True, True, True)
update_status_fields = (False, True, True)

fields = (employment_fields, personal_info_fields, update_status_fields, veichles_fields)
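
To make the role of these flags concrete, here is a small sketch of how `itertools.compress` uses them. It only covers two of the four files, with the field names copied from the headers shown earlier.

```python
from itertools import chain, compress

employment_names = ('employer', 'department', 'employee_id', 'ssn')
personal_info_names = ('ssn', 'first_name', 'last_name', 'gender', 'language')

employment_fields = (True, True, True, False)        # drop the ssn of this file
personal_info_fields = (True, True, True, True, True)  # keep everything, ssn included

# chain both the names and the flags so that they line up position by position
all_names = chain(employment_names, personal_info_names)
selectors = chain(employment_fields, personal_info_fields)

kept = tuple(compress(all_names, selectors))
print(kept)
```

`compress` keeps an item only where the selector at the same position is truthy, so the resulting field list contains `ssn` exactly once.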

In [259]:
def iter_files(fname, ntuple, dtype):
    reader = csv_parser(fname)
    for row in reader:
        parsed_row = row_parser(row, dtype)
        yield ntuple(*parsed_row)


def create_combo_named_tuple(fnames, fields):
    fields = chain.from_iterable(fields)
    # os.path.splitext drops the extension; str.strip('.csv') strips characters, not a suffix
    field_names = chain.from_iterable(header_class(fname, os.path.splitext(fname)[0])._fields for fname in fnames)
    compressed_field_names = compress(field_names, fields)
    return namedtuple('Data', compressed_field_names)
    
        
def iter_combined(fnames, ntuple, dtype, fields):
    # NB the 'ntuple' and 'dtype' arguments are not used: the zip below reads the module-level 'ntuples' and 'dtypes'
    # create the joint namedtuple with only the specified fields
    combo_nt = create_combo_named_tuple(fnames, fields)
    # zipped_tuples is an iterator of 4-tuples, each holding one parsed row from every file's iter_files iterator
    zipped_tuples = zip(*(iter_files(fname, ntuple, dtype)
                          for fname, ntuple, dtype in zip(fnames, ntuples, dtypes)))
    # now we merge the 4 rows inside each tuple:
    # merged_iter is another iterator, where each item chains together the values of one row from each of the 4 files
    merged_iter = (chain.from_iterable(zipped_tuple) for zipped_tuple in zipped_tuples)
    # now we want to compress the data based on the values in 'fields' to retain only what we want
    # first we need to chain the fields together so that they have the same length as each merged row
    fields = tuple(chain.from_iterable(fields))
    for row in merged_iter:
        # careful! a one-shot iterator would be exhausted after the first row, which is why 'fields' was turned into a tuple above
        compressed_row = compress(row, fields)
        yield combo_nt(*compressed_row)

In [260]:
combined = iter_combined(fnames, ntuple, dtype, fields)

print(next(combined), end='\n\n')

for _ in range(10):
    print(next(combined).employer)

Data(employer='Stiedemann-Bailey', department='Research and Development', employee_id='29-0890771', ssn='100-53-9824', first_name='Sebastiano', last_name='Tester', gender='Male', language='Icelandic', last_updated=datetime.datetime(2017, 10, 7, 0, 14, 42), created=datetime.datetime(2016, 1, 24, 21, 19, 30), vehicle_make='Oldsmobile', vehicle_model='Bravada', model_year=1993)

Nicolas and Sons
Connelly Group
Upton LLC
Zemlak-Olson
Kohler, Bradtke and Davis
Roberts, Torphy and Dach
Lind-Jast
Bashirian-Lueilwitz
Windler, Marks and Haley
Leffler-Hahn
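
The warning about `fields` being exhausted is easy to reproduce in isolation. The sketch below uses made-up rows and compares a one-shot iterator of selectors with a reusable tuple.

```python
from itertools import compress

rows = [('a', 1, 'x'), ('b', 2, 'y')]   # made-up data
selectors = (True, False, True)

# a one-shot iterator is used up by the first row...
one_shot = iter(selectors)
broken = [tuple(compress(row, one_shot)) for row in rows]

# ...while a tuple can be iterated again for every row
fixed = [tuple(compress(row, selectors)) for row in rows]

print(broken)
print(fixed)
```

With the one-shot iterator, every row after the first comes out empty, and no error is raised to signal the problem.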


## Goal 3
Based on 'update_status.csv', filter out the records whose last update date is `< 3/1/2017` (March 1, 2017)

As always, my idea comes first and then Fred's. His solution is more general.

In [261]:
def filter_on_update(combo_nt, date):
    # 'date' is a 'day/month/year' string, while datetime wants (year, month, day)
    day, month, year = map(int, date.split('/'))
    if combo_nt.last_updated >= datetime(year, month, day): # keep rows updated on or after the cutoff
        yield combo_nt
        
        
 
def iter_combined_filter_date(fnames, ntuple, dtype, fields):
    # NB the 'ntuple' and 'dtype' arguments are not used: the zip below reads the module-level 'ntuples' and 'dtypes'
    # create the joint namedtuple with only the specified fields
    combo_nt = create_combo_named_tuple(fnames, fields)
    # zipped_tuples is an iterator of 4-tuples, each holding one parsed row from every file's iter_files iterator
    zipped_tuples = zip(*(iter_files(fname, ntuple, dtype)
                          for fname, ntuple, dtype in zip(fnames, ntuples, dtypes)))
    # now we merge the 4 rows inside each tuple:
    # merged_iter is another iterator, where each item chains together the values of one row from each of the 4 files
    merged_iter = (chain.from_iterable(zipped_tuple) for zipped_tuple in zipped_tuples)
    # now we want to compress the data based on the values in 'fields' to retain only what we want
    # first we need to chain the fields together so that they have the same length as each merged row
    fields = tuple(chain.from_iterable(fields))
    for row in merged_iter:
        # careful! a one-shot iterator would be exhausted after the first row, which is why 'fields' was turned into a tuple above
        compressed_row = compress(row, fields)
        # the cutoff is hard-coded here as day/month/year
        yield from filter_on_update(combo_nt(*compressed_row), '25/3/2018')

In [262]:
filter_combined = iter_combined_filter_date(fnames, ntuple, dtype, fields)

for row in filter_combined:
    print(row.last_updated)

2018-03-26 06:15:57
2018-03-27 21:18:39
2018-03-27 09:56:49
2018-03-26 22:14:13
2018-03-26 10:53:26
2018-03-28 19:40:00
2018-03-27 00:00:32
2018-03-25 08:52:19
2018-03-30 05:53:21
2018-03-30 21:07:08
2018-03-27 01:24:58


#### Fred's solution:
This uses the built-in `filter` function; the big advantage (apart from being cleaner) is that we can specify any filter on any field by passing a `lambda` function as `key`

In [265]:
def iter_combined_filterd(fnames, ntuple, dtype, fields, *, key=None):
    combined = iter_combined(fnames, ntuple, dtype, fields)
    yield from filter(key, combined)

In [267]:
filter_combined = iter_combined_filterd(fnames, ntuple, dtype, fields, key=lambda row: row.last_updated >= datetime(2018, 3, 25))
for row in filter_combined:
    print(row.last_updated)

2018-03-26 06:15:57
2018-03-27 21:18:39
2018-03-27 09:56:49
2018-03-26 22:14:13
2018-03-26 10:53:26
2018-03-28 19:40:00
2018-03-27 00:00:32
2018-03-25 08:52:19
2018-03-30 05:53:21
2018-03-30 21:07:08
2018-03-27 01:24:58
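
The flexibility of the `key` argument can be shown on a tiny in-memory dataset. In this sketch the `Row` tuple, the helper `iter_filtered`, and two of the three rows are made up; only the pattern mirrors the function above.

```python
from collections import namedtuple
from datetime import datetime

Row = namedtuple('Row', 'ssn last_updated gender')   # reduced field set, for illustration
rows = [
    Row('100-53-9824', datetime(2017, 10, 7, 0, 14, 42), 'Male'),
    Row('000-00-0001', datetime(2016, 5, 1), 'Female'),             # made-up row
    Row('000-00-0002', datetime(2018, 3, 25, 8, 52, 19), 'Female'), # made-up row
]

def iter_filtered(rows, *, key=None):
    """Lazily yield the rows for which key(row) is truthy (hypothetical helper)."""
    yield from filter(key, rows)

recent = list(iter_filtered(rows, key=lambda r: r.last_updated >= datetime(2017, 3, 1)))
females = list(iter_filtered(rows, key=lambda r: r.gender == 'Female'))
everything = list(iter_filtered(rows))   # key=None keeps every truthy item

print(len(recent), len(females), len(everything))
```

With `key=None`, `filter` falls back to the truthiness of each item, and a non-empty named tuple is always truthy, so the default passes every row through.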


## Goal 4
Using the filtered data from Goal 3, count the number of cars of each make, grouped by gender

As always, my idea comes first and then Fred's. His solution is more general.

In [317]:
def iter_combined_filterd_grouped(fnames, ntuple, dtype, fields, *, _filter=None, _group=None):
    # NB '_group' is accepted but not used: the grouping keys are hard-coded below
    combined = iter_combined(fnames, ntuple, dtype, fields)
    filterd = filter(_filter, combined)
    _sorted = sorted(filterd, key=lambda row: row.gender)
    groupped = groupby(_sorted, key=lambda row: row.gender)
    for group in groupped:
        print(group[0].center(40, '*'), end='\n\n') # male or female
        group_sorted = sorted(group[1], key=lambda row: row.vehicle_make)
        group_sorted = groupby(group_sorted, key=lambda row: row.vehicle_make)
        group_sorted = ((row[0], len(list(row[1]))) for row in group_sorted)
        # yield the (make, count) pairs of this gender, most common first
        yield from sorted(group_sorted, key=lambda x: x[1], reverse=True)
        print()
            

In [323]:
filter_combined = iter_combined_filterd_grouped(fnames, ntuple, dtype, fields,
                                                _filter=lambda row: row.last_updated >= datetime(2017, 3, 1),
                                                _group=lambda row: row.gender)
for _ in range(2):
    print(next(filter_combined))

*****************Female*****************

('Chevrolet', 42)
('Ford', 42)


#### Fred's solution:
First, a `group_key` function is created

In [348]:
def group_key(item):
    return (item.gender, item.vehicle_make)

Now we can create our filtered dataset, then sort it and group it by the 'group_key'

In [368]:
filter_combined = iter_combined_filterd(fnames, ntuple, dtype, fields, key=lambda row: row.last_updated >= datetime(2017, 3, 1))
sorted_combined = sorted(filter_combined, key=group_key)
grouped = groupby(sorted_combined, key=group_key)

Now, we want to be able to select one gender at a time; we can create a generator with an if condition

In [369]:
group_by_gender = (item for item in grouped if item[0][0] == 'Female')
for _ in range(2):
    print(next(group_by_gender))
    
# N.B. 'grouped' has now been partially consumed; to use it again we need to recreate it from 'sorted_combined'

(('Female', 'Acura'), <itertools._grouper object at 0x00000177E72AD488>)
(('Female', 'Aston Martin'), <itertools._grouper object at 0x00000177E6E2F248>)
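
That a `groupby` object is a one-shot iterator can be checked with a few made-up pairs. The sketch below groups identical consecutive tuples (no `key` function) and then tries to iterate a second time.

```python
from itertools import groupby

# made-up, already sorted (gender, make) pairs
pairs = [('Female', 'Acura'), ('Female', 'Acura'), ('Female', 'Ford'), ('Male', 'Ford')]

grouped = groupby(pairs)
first_pass = [(key, len(list(group))) for key, group in grouped]
second_pass = list(grouped)   # nothing is left: 'grouped' was consumed by the first pass

print(first_pass)
print(second_pass)
```

Each group must also be consumed before moving on to the next key, because all the groups share the same underlying iterator.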


Now we want to clean up the key, since we don't need to display the gender, and to compute the length of each group; then we need to sort by that length. Note that this cell regroups the whole dataset, so the resulting table mixes both genders.

In [375]:
grouped = groupby(sorted_combined, key=group_key)
cleaned = ((item[0][1], len(list(item[1]))) for item in grouped)
sort = sorted(cleaned, key=lambda x: x[1], reverse=True) 

for i in range(3):
    print(*sort[i])

Chevrolet 42
Ford 42
Ford 40
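
The count-then-sort step can be sketched on a handful of made-up makes. The example also shows why the data must be sorted first: `groupby` only merges consecutive equal keys. The helper `count_runs` is hypothetical, and `sum(1 for _ in group)` is used as an alternative to `len(list(group))` that avoids building a temporary list.

```python
from itertools import groupby

makes = ['Ford', 'Chevrolet', 'Ford', 'Acura', 'Ford', 'Chevrolet']   # made-up data

def count_runs(items):
    """Count the length of each run of consecutive equal items (hypothetical helper)."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(items)]

unsorted_counts = count_runs(makes)         # every run has length 1 here
sorted_counts = count_runs(sorted(makes))   # sorting brings equal makes together
table = sorted(sorted_counts, key=lambda pair: pair[1], reverse=True)

print(unsorted_counts)
print(table)
```

Without the sort, the same make shows up several times with partial counts.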


#### Improved Approach
Now, what Fred points out is that this approach is not efficient, since we sort the whole dataset on the compound (gender, make) key, and the cost of sorting grows faster than linearly with the number of rows. Instead, we can first split the dataset by gender and only then sort each part by vehicle make. So let's see this approach.

So first we are going to filter the dataset by gender, creating two distinct generators

In [396]:
def group_key(item):
    return item.vehicle_make

data = iter_combined_filterd(fnames, ntuple, dtype, fields, key=lambda row: row.last_updated >= datetime(2017, 3, 1))
data1, data2 = tee(data, 2)

data_m = (item for item in data1 if item.gender == 'Male')
sorted_data_m = sorted(data_m, key=group_key)
grouped = groupby(sorted_data_m, key=group_key)
cleaned = ((data[0], len(list(data[1]))) for data in grouped)
table_m = sorted(cleaned, key=lambda x : x[1], reverse=True)

data_f = (item for item in data2 if item.gender == 'Female')
sorted_data_f = sorted(data_f, key=group_key)
grouped = groupby(sorted_data_f, key=group_key)
cleaned = ((data[0], len(list(data[1]))) for data in grouped)
table_f = sorted(cleaned, key=lambda x : x[1], reverse=True)

In [399]:
for i in range(2):
    print(*table_m[i], *table_f[i])

Ford 40 Chevrolet 42
Chevrolet 30 Ford 42
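
The splitting step with `tee` can be looked at in isolation. The sketch below uses a made-up generator of (gender, make) pairs in place of the filtered dataset.

```python
from itertools import tee

def rows():
    # made-up (gender, make) pairs standing in for the filtered dataset
    yield from [('Male', 'Ford'), ('Female', 'Chevrolet'), ('Male', 'Acura'), ('Female', 'Ford')]

source = rows()
males_src, females_src = tee(source, 2)   # two independent views over one iterator

males = [make for gender, make in males_src if gender == 'Male']
females = [make for gender, make in females_src if gender == 'Female']

print(males, females)
```

Because the first copy is consumed completely before the second one is touched, `tee` has to buffer every row internally. The itertools documentation notes that in this situation `list()` is faster than `tee()`, so materializing the data once may be the simpler choice when it fits in memory.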


Now, we only need to wrap what we have done in a function to clean up the process:

In [400]:
filter_key = lambda row: row.last_updated >= datetime(2017, 3, 1)
group_key = lambda row: row.vehicle_make

def group_data(fnames, ntuple, dtype, fields, *, filter_key, group_key, gender):
    data = iter_combined_filterd(fnames, ntuple, dtype, fields, key=filter_key)
    # use the local 'data', not the global 'data2' left over from the previous cell
    filtered = (item for item in data if item.gender == gender)
    sorted_data = sorted(filtered, key=group_key) # it is the only step that is not lazy
    grouped = groupby(sorted_data, key=group_key)
    cleaned = ((data[0], len(list(data[1]))) for data in grouped)
    table = sorted(cleaned, key=lambda x : x[1], reverse=True)
    return table

In [407]:
result_female = group_data(fnames, ntuple, dtype, fields, filter_key=filter_key, group_key=group_key, gender='Female')
print(*result_female[:2])

('Chevrolet', 42) ('Ford', 42)


The last improvement we can make is to reduce the filtering steps from two to one, by specifying the gender directly in the filter_key

In [409]:
filter_key = lambda row: row.last_updated >= datetime(2017, 3, 1) and row.gender == 'Female'
group_key = lambda row: row.vehicle_make

def group_data(fnames, ntuple, dtype, fields, *, filter_key, group_key):
    data = iter_combined_filterd(fnames, ntuple, dtype, fields, key=filter_key)
    sorted_data = sorted(data, key=group_key) # it is the only step that is not lazy
    # group the locally sorted data, not the global 'sorted_data_f' from an earlier cell
    grouped = groupby(sorted_data, key=group_key)
    cleaned = ((data[0], len(list(data[1]))) for data in grouped)
    table = sorted(cleaned, key=lambda x : x[1], reverse=True)
    return table

result_female = group_data(fnames, ntuple, dtype, fields, filter_key=filter_key, group_key=group_key)
print(*result_female[:2])

('Chevrolet', 42) ('Ford', 42)
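
For comparison only, the same kind of table can be built without sorting the rows at all, by counting (gender, make) pairs in a single pass with `collections.Counter`. This is a different technique from the `groupby` pipeline used in this notebook, shown here as a sketch on made-up rows; `table_for` is a hypothetical helper.

```python
from collections import Counter, namedtuple

Row = namedtuple('Row', 'gender vehicle_make')   # reduced field set, made-up rows
rows = [
    Row('Female', 'Ford'),
    Row('Male', 'Ford'),
    Row('Female', 'Chevrolet'),
    Row('Female', 'Ford'),
]

# one pass over the data, no sorting of the rows themselves
counts = Counter((row.gender, row.vehicle_make) for row in rows)

def table_for(counts, gender):
    """Return (make, count) pairs for one gender, most common first (hypothetical helper)."""
    pairs = ((make, n) for (g, make), n in counts.items() if g == gender)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

table_female = table_for(counts, 'Female')
table_male = table_for(counts, 'Male')
print(table_female, table_male)
```

Only the small table of distinct makes is sorted here, while the rows themselves are consumed lazily and never stored.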
