# Working with the itertools module

We have a set of four *.csv* files with presorted data, where the primary key is the `SSN`.

## Goals

1. [Goal 1](#Goal-1)
Create a (lazy) iterator for each file that returns named tuples with appropriate data types
2. [Goal 2](#Goal-2)
Create a single iterable that combines the data from all four files

## Goal 1
Create a (lazy) iterator for each file that returns named tuples with appropriate data types

In [1]:
# import modules
import csv
from collections import namedtuple
from functools import partial


In [2]:
# navigate to data folder
import os

os.chdir('./Project_itertools_data')

['employment.csv', 'vehicles.csv', 'personal_info.csv', 'update_status.csv']

In [15]:
sorted(os.listdir()) # NB the order is important

['employment.csv', 'personal_info.csv', 'update_status.csv', 'vehicles.csv']

In [36]:
# define the reader function using csv
def csv_parser(file, *,  delimiter=',', quotechar='"', include_header=False):
    with open(file) as f:
        if not include_header:
            next(f) # skip header
        reader = csv.reader(f, delimiter=delimiter, quotechar=quotechar)
        yield from reader
        
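def header(file, class_name: str):
    # build a namedtuple class whose fields are the column names in the header row
    reader = csv_parser(file, include_header=True)
    try:
        return namedtuple(class_name, next(reader))
    finally:
        reader.close()  # stop the generator so the file is closed right away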
 
def row_parser(row, dtype):
    row = (func(r) for func, r in zip(dtype, row))
    return row

def iter_files(fname, ntuple, dtype):
    reader = csv_parser(fname)
    for row in reader:
        parsed_row = row_parser(row, dtype)
        yield ntuple(*parsed_row)
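
To see what "lazy" means for generator functions like the ones above, here is a minimal sketch. It assumes an in-memory `io.StringIO` buffer in place of a real file, and `lazy_rows` is a hypothetical stand-in for `csv_parser` that also records when its body starts running.

```python
import csv
import io

def lazy_rows(buffer, log):
    # hypothetical stand-in for csv_parser; logs the moment the body first runs
    log.append('started')
    next(buffer)  # skip the header line
    yield from csv.reader(buffer)

log = []
buffer = io.StringIO('ssn,model_year\n100-53-9824,1993\n101-84-0356,2005\n')
rows = lazy_rows(buffer, log)

started_before = list(log)   # still empty: creating the generator runs no code
first = next(rows)           # only now is the header skipped and one row read
```

Nothing is read until the first `next` call, and each later call pulls exactly one more row.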

In [48]:
# define dtypes parser
from datetime import datetime

def parse_date(value, *, default=None, fmt='%Y-%m-%dT%H:%M:%SZ'):
    # return `default` when the value does not match the expected format
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return default
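
As a quick illustration of the format string, the sketch below calls `datetime.strptime` directly. The sample timestamp is an assumption, shaped like the values that appear in `update_status.csv`.

```python
from datetime import datetime

FMT = '%Y-%m-%dT%H:%M:%SZ'  # same format string used as the default above

# sample value (an assumption) in the expected shape
parsed = datetime.strptime('2017-10-07T00:14:42Z', FMT)

# a value in another shape raises ValueError, which parse_date maps to `default`
try:
    datetime.strptime('07/10/2017', FMT)
    failed = False
except ValueError:
    failed = True
```

The trailing `Z` in the format is matched as a literal character, so the result is a naive `datetime` without timezone information.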

Now let's inspect the header of each file to work out which parsers we need to convert every field to a proper type.

In [49]:
employment_nt = header('employment.csv', 'Employment')
employment_nt._fields

('employer', 'department', 'employee_id', 'ssn')

In [50]:
employment_dtype = (str, str, str, str)

In [51]:
veichles_nt = header('vehicles.csv', 'Vehicle')
veichles_nt._fields

('ssn', 'vehicle_make', 'vehicle_model', 'model_year')

In [52]:
veichles_dtype = (str, str, str, int)

In [53]:
personal_info_nt = header('personal_info.csv', 'PersonalInfo')
personal_info_nt._fields

('ssn', 'first_name', 'last_name', 'gender', 'language')

In [54]:
personal_info_dtype = (str, str, str, str, str)

In [55]:
update_status_nt = header('update_status.csv', 'UpdateStatus')
update_status_nt._fields

('ssn', 'last_updated', 'created')

In [56]:
update_status_dtype = (str, parse_date, parse_date)
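
To show how a dtype tuple is applied, here is a small sketch that casts one raw row by hand. The field names and values are taken from the `vehicles.csv` header and first row shown in this notebook; the variable names `raw_row` and `cast` are only illustrative.

```python
from collections import namedtuple

# field names copied from the vehicles.csv header shown above
Vehicle = namedtuple('Vehicle', ['ssn', 'vehicle_make', 'vehicle_model', 'model_year'])
dtype = (str, str, str, int)

# csv.reader yields every field as a string
raw_row = ['100-53-9824', 'Oldsmobile', 'Bravada', '1993']

# pair each converter with its field, then unpack the results into the namedtuple
vehicle = Vehicle(*(cast(value) for cast, value in zip(dtype, raw_row)))
```

Because `zip` pairs items by position, the dtype tuple must list one converter per column, in column order.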

As stated above, the files share a presorted field that uniquely identifies each record (the primary key).

To process all the files together, we can create a series of tuples that contain the pieces needed for parsing:

* the file names
* the corresponding namedtuple classes
* the dtypes corresponding to each file

N.B. the three tuples must use the same ordering, since we are going to zip them together
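
One possible way to make that ordering harder to get wrong is to keep the pieces for each file together in a single mapping and derive the sequences from it. This is only a sketch: `SPECS`, `ordered_names`, `ordered_classes` and `ordered_dtypes` are hypothetical names, and plain strings stand in for the namedtuple classes.

```python
# hypothetical mapping: file name -> (namedtuple class name, dtype tuple)
SPECS = {
    'vehicles.csv': ('Vehicle', (str, str, str, int)),
    'employment.csv': ('Employment', (str, str, str, str)),
}

# sorting the keys fixes the order once; the other sequences simply follow it
ordered_names = sorted(SPECS)
ordered_classes = [SPECS[name][0] for name in ordered_names]
ordered_dtypes = [SPECS[name][1] for name in ordered_names]
```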

In [57]:
fnames = sorted(os.listdir())
ntuples = (employment_nt, personal_info_nt, update_status_nt, veichles_nt)
dtypes = (employment_dtype, personal_info_dtype, update_status_dtype, veichles_dtype)

# check if the order is consistent
list(zip(fnames, ntuples, dtypes))[0]

('employment.csv', __main__.Employment, (str, str, str, str))

Now we can loop over all the files and peek at the first row of each

In [64]:
for fname, ntuple, dtype in zip(fnames, ntuples, dtypes):
    file_iter = iter_files(fname, ntuple, dtype)
    print(fname.center(40, '*'))
    for _ in range(1): # check
        print(next(file_iter))
    print()
    

*************employment.csv*************
Employment(employer='Stiedemann-Bailey', department='Research and Development', employee_id='29-0890771', ssn='100-53-9824')

***********personal_info.csv************
PersonalInfo(ssn='100-53-9824', first_name='Sebastiano', last_name='Tester', gender='Male', language='Icelandic')

***********update_status.csv************
UpdateStatus(ssn='100-53-9824', last_updated=datetime.datetime(2017, 10, 7, 0, 14, 42), created=datetime.datetime(2016, 1, 24, 21, 19, 30))

**************vehicles.csv**************
Vehicle(ssn='100-53-9824', vehicle_make='Oldsmobile', vehicle_model='Bravada', model_year=1993)
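
For peeking at more than one row, `itertools.islice` is an alternative to calling `next` in a `range` loop. A minimal sketch, where `numbers` is a hypothetical stand-in for any lazy iterator such as the one returned by `iter_files`:

```python
from itertools import islice

def numbers():
    # hypothetical stand-in for iter_files: any lazy iterator works here
    for n in range(1_000_000):
        yield n

it = numbers()
preview = list(islice(it, 3))   # consumes only the first three items
following = next(it)            # iteration resumes where islice stopped
```

Note that the previewed items are consumed from the underlying iterator, so a fresh iterator is needed to start again from the first row.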



## Goal 2
Create a single iterable that combines the data from all four files

In [112]:
def combine_files(fnames, ntuples, dtypes):
    # one lazy row iterator per file
    iterators = [iter_files(fname, ntuple, dtype)
                 for fname, ntuple, dtype in zip(fnames, ntuples, dtypes)]
    # the files are presorted by ssn, so rows at the same position are assumed
    # to describe the same person
    for rows in zip(*iterators):
        combined = {}
        for row in rows:
            combined.update(row._asdict())
        yield combined

In [113]:
combined = combine_files(fnames, ntuples, dtypes)

In [114]:

next(combined)

{'employer': 'Stiedemann-Bailey',
 'department': 'Research and Development',
 'employee_id': '29-0890771',
 'ssn': '100-53-9824',
 'first_name': 'Sebastiano',
 'last_name': 'Tester',
 'gender': 'Male',
 'language': 'Icelandic',
 'last_updated': datetime.datetime(2017, 10, 7, 0, 14, 42),
 'created': datetime.datetime(2016, 1, 24, 21, 19, 30),
 'vehicle_make': 'Oldsmobile',
 'vehicle_model': 'Bravada',
 'model_year': 1993}
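
Pairing rows by position relies on every file holding exactly one row per `ssn`, in the same order. If that were not guaranteed, a key-based merge would be one alternative. The sketch below is not the approach used in this notebook: it uses `heapq.merge` and `itertools.groupby` on made-up in-memory rows, and `combine_by_key` is a hypothetical helper.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter

# made-up rows, each source already sorted by 'ssn'
people = [{'ssn': '100', 'first_name': 'Ann'}, {'ssn': '101', 'first_name': 'Bob'}]
cars = [{'ssn': '101', 'vehicle_make': 'GMC'}]

def combine_by_key(*sources, key='ssn'):
    get_key = itemgetter(key)
    # merge lazily interleaves the sorted sources; groupby collects equal keys
    for _, group in groupby(merge(*sources, key=get_key), key=get_key):
        combined = {}
        for row in group:
            combined.update(row)
        yield combined

result = list(combine_by_key(people, cars))
```

Both `merge` and `groupby` are lazy, so this keeps the one-row-at-a-time behaviour, at the cost of requiring every source to be sorted by the key.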


In [81]:
c = {**a._asdict(), **b._asdict()}

In [82]:
c


{'ssn': '101-84-0356',
 'vehicle_make': 'GMC',
 'vehicle_model': 'Yukon',
 'model_year': 2005}
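
The merge with `{**x, **y}` keeps a shared key only once. A small sketch with made-up records shows this; `Left`, `Right` and their values are hypothetical.

```python
from collections import namedtuple

# hypothetical record types that share the 'ssn' field
Left = namedtuple('Left', ['ssn', 'language'])
Right = namedtuple('Right', ['ssn', 'model_year'])

left = Left('000-00-0000', 'English')
right = Right('000-00-0000', 2005)

merged = {**left._asdict(), **right._asdict()}
# 'ssn' appears once; had the two values differed, the one from `right`
# would have replaced the one from `left`
```

This is why `ssn` shows up a single time in the combined dictionaries, in the position where it was first inserted.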