# Data Engineering in Python with databolt - Quickly Load Any Type of CSV or Excel Data (d6tlib/d6tstack)

Vendors often send large datasets in multiple files. Often there are missing and misaligned columns between files that have to be manually cleaned. With DataBolt File Stack you can easily stack them together into one consistent dataset.

Features include:
* Quickly check column consistency across multiple files
* Fix added/missing columns
* Fix renamed columns
* Check Excel tabs for consistency across files
* Quickly extract data from messy Excel files into clean CSV data
* Out of core functionality to process large files
* Export to pandas, CSV, SQL, parquet

In this workbook we will demonstrate the usage of the d6tstack library.

In [1]:
import importlib
import pandas as pd
import glob

import d6tstack.combine_csv as d6tc

## Get sample data

We've created some dummy sample data which you can download. 

In [54]:
import urllib.request
import zipfile

cfg_fname_sample = 'test-data.zip'
# download the sample archive from the d6tstack repository
urllib.request.urlretrieve("https://github.com/d6t/d6tstack/raw/master/"+cfg_fname_sample, cfg_fname_sample)
# extract into the current directory; the context manager closes the archive
with zipfile.ZipFile(cfg_fname_sample, 'r') as zip_ref:
    zip_ref.extractall('.')

## Use Case: Checking Column Consistency

Let's say you receive a bunch of CSV files that you want to ingest, for example into pandas, dask, pyspark or a database.

In [2]:
cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-clean-*.csv'))
print(cfg_fnames)

['test-data/input/test-data-input-csv-clean-mar.csv', 'test-data/input/test-data-input-csv-clean-feb.csv', 'test-data/input/test-data-input-csv-clean-jan.csv']


### Check column consistency across all files

Even if you think the files have a consistent column layout, it is worthwhile using `d6tstack` to assert that this is actually the case. The check is quick even with many large files.

In [3]:
# get previews
c = d6tc.CombinerCSV(cfg_fnames, all_strings=True) # all_strings=True makes reading faster
col_preview = c.preview_columns()

In [4]:
print('all columns equal?', c.is_all_equal())
print('')
print('which columns are present in which files?')
print('')
print(c.is_col_present().reset_index(drop=True))
print('')
print('in what order do columns appear in the files?')
print('')
print(col_preview['df_columns_order'].reset_index(drop=True))

all columns equal? True

which columns are present in which files?

                            filename  cost  date profit sales
0  test-data-input-csv-clean-mar.csv  True  True   True  True
1  test-data-input-csv-clean-feb.csv  True  True   True  True
2  test-data-input-csv-clean-jan.csv  True  True   True  True

in what order do columns appear in the files?

                            filename cost date profit sales
0  test-data-input-csv-clean-mar.csv    0    1      2     3
1  test-data-input-csv-clean-feb.csv    0    1      2     3
2  test-data-input-csv-clean-jan.csv    0    1      2     3
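
To make the idea behind this check concrete, here is a minimal sketch using only the Python standard library. It is not how `d6tstack` is implemented; the in-memory sample files and the helper `read_header` are hypothetical stand-ins for illustration.

```python
import csv
import io

# Hypothetical in-memory stand-ins for three monthly CSV files.
sample_files = {
    "jan.csv": "date,sales,cost,profit\n2011-01-01,100,-80,20\n",
    "feb.csv": "date,sales,cost,profit\n2011-02-01,200,-90,110\n",
    "mar.csv": "date,sales,cost,profit\n2011-03-01,300,-100,200\n",
}

def read_header(text):
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(text)))

headers = {name: read_header(text) for name, text in sample_files.items()}

# Columns are consistent when every file has the same names in the same order.
first_header = next(iter(headers.values()))
all_equal = all(header == first_header for header in headers.values())

# Position of each column in each file, similar in spirit to a column-order table.
column_order = {
    name: {col: idx for idx, col in enumerate(header)}
    for name, header in headers.items()
}

print("all columns equal?", all_equal)
for name, order in column_order.items():
    print(name, order)
```

Only the header row of each file is read, which is why a check like this stays cheap even when the files themselves are large.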


### Preview Combined Data

You can see a preview of what the combined data from all files will look like.

In [5]:
c.preview_combine()

Unnamed: 0,cost,date,profit,sales,filename
0,-100,2011-03-01,200,300,test-data-input-csv-clean-mar.csv
1,-100,2011-03-02,200,300,test-data-input-csv-clean-mar.csv
2,-100,2011-03-03,200,300,test-data-input-csv-clean-mar.csv
0,-90,2011-02-01,110,200,test-data-input-csv-clean-feb.csv
1,-90,2011-02-02,110,200,test-data-input-csv-clean-feb.csv
2,-90,2011-02-03,110,200,test-data-input-csv-clean-feb.csv
0,-80,2011-01-01,20,100,test-data-input-csv-clean-jan.csv
1,-80,2011-01-02,20,100,test-data-input-csv-clean-jan.csv
2,-80,2011-01-03,20,100,test-data-input-csv-clean-jan.csv


### Read All Files to Pandas

You can quickly load the combined data into a pandas dataframe with a single command. 

In [6]:
c.combine().head()

Unnamed: 0,cost,date,profit,sales,filename
0,-100,2011-03-01,200,300,test-data-input-csv-clean-mar.csv
1,-100,2011-03-02,200,300,test-data-input-csv-clean-mar.csv
2,-100,2011-03-03,200,300,test-data-input-csv-clean-mar.csv
3,-100,2011-03-04,200,300,test-data-input-csv-clean-mar.csv
4,-100,2011-03-05,200,300,test-data-input-csv-clean-mar.csv


## Use Case: Identifying and fixing inconsistent columns

The first case was clean: all files had the same columns. Very often, though, the data schema changes over time, with columns being added or deleted. Let's look at a case where an extra column was added.

In [6]:
cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))
print(cfg_fnames)

['test-data/input/test-data-input-csv-colmismatch-mar.csv', 'test-data/input/test-data-input-csv-colmismatch-feb.csv', 'test-data/input/test-data-input-csv-colmismatch-jan.csv']


In [7]:
# get previews
c = d6tc.CombinerCSV(cfg_fnames, all_strings=True) # all_strings=True makes reading faster
col_preview = c.preview_columns()

In [11]:
print('all columns equal?', c.is_all_equal())
print('')
print('which columns are unique?', col_preview['columns_unique'])
print('')
print('which files have unique columns?')
print('')
print(c.is_col_present_unique())

all columns equal? False

which columns are unique? ['profit2']

which files have unique columns?

                                        profit2
filename                                       
test-data-input-csv-colmismatch-mar.csv    True
test-data-input-csv-colmismatch-feb.csv   False
test-data-input-csv-colmismatch-jan.csv   False
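
Conceptually, finding the unique columns comes down to set operations on the file headers. The sketch below illustrates that idea with hypothetical header lists; it is an assumption about the logic, not the library's actual code.

```python
# Hypothetical headers; only the March file carries the extra column.
headers = {
    "jan.csv": ["date", "sales", "cost", "profit"],
    "feb.csv": ["date", "sales", "cost", "profit"],
    "mar.csv": ["date", "sales", "cost", "profit", "profit2"],
}

column_sets = [set(cols) for cols in headers.values()]
columns_all = sorted(set.union(*column_sets))            # present in any file
columns_common = sorted(set.intersection(*column_sets))  # present in every file
columns_unique = sorted(set(columns_all) - set(columns_common))

# For each non-common column, record which files contain it.
presence = {
    name: {col: col in cols for col in columns_unique}
    for name, cols in headers.items()
}

print("unique columns:", columns_unique)
for name, flags in presence.items():
    print(name, flags)
```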


In [9]:
c.preview_combine() # keep all columns

Unnamed: 0,cost,date,profit,sales,profit2,filename
0,-100,2011-03-01,200,300,400.0,test-data-input-csv-colmismatch-mar.csv
1,-100,2011-03-02,200,300,400.0,test-data-input-csv-colmismatch-mar.csv
2,-100,2011-03-03,200,300,400.0,test-data-input-csv-colmismatch-mar.csv
0,-90,2011-02-01,110,200,,test-data-input-csv-colmismatch-feb.csv
1,-90,2011-02-02,110,200,,test-data-input-csv-colmismatch-feb.csv
2,-90,2011-02-03,110,200,,test-data-input-csv-colmismatch-feb.csv
0,-80,2011-01-01,20,100,,test-data-input-csv-colmismatch-jan.csv
1,-80,2011-01-02,20,100,,test-data-input-csv-colmismatch-jan.csv
2,-80,2011-01-03,20,100,,test-data-input-csv-colmismatch-jan.csv


In [10]:
c.preview_combine(is_col_common=True) # keep only common columns

Unnamed: 0,cost,date,profit,sales,filename
0,-100,2011-03-01,200,300,test-data-input-csv-colmismatch-mar.csv
1,-100,2011-03-02,200,300,test-data-input-csv-colmismatch-mar.csv
2,-100,2011-03-03,200,300,test-data-input-csv-colmismatch-mar.csv
0,-90,2011-02-01,110,200,test-data-input-csv-colmismatch-feb.csv
1,-90,2011-02-02,110,200,test-data-input-csv-colmismatch-feb.csv
2,-90,2011-02-03,110,200,test-data-input-csv-colmismatch-feb.csv
0,-80,2011-01-01,20,100,test-data-input-csv-colmismatch-jan.csv
1,-80,2011-01-02,20,100,test-data-input-csv-colmismatch-jan.csv
2,-80,2011-01-03,20,100,test-data-input-csv-colmismatch-jan.csv


## Use Case: Aligning renamed columns and selecting a subset of columns

Say a column has been renamed and now the data doesn't line up with the data from the old column name. You can fix this by passing a `columns_rename` mapping to `CombinerCSV`, which renames the columns and lines up the data automatically. With `columns_select` you can also load just a subset of columns.

In [18]:
cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-renamed-*.csv'))
c = d6tc.CombinerCSV(cfg_fnames, all_strings=True) # all_strings=True makes reading faster
print(c.is_col_present_unique())

                                    revenue  sales
filename                                          
test-data-input-csv-renamed-mar.csv    True  False
test-data-input-csv-renamed-feb.csv   False   True
test-data-input-csv-renamed-jan.csv   False   True


The column `sales` was renamed to `revenue` in the March file, which causes problems when reading the files: the values end up in two separate columns.

In [21]:
col_preview = c.preview_columns()
c.preview_combine()[['filename']+col_preview['columns_unique']]

Unnamed: 0,filename,revenue,sales
0,test-data-input-csv-renamed-mar.csv,300.0,
1,test-data-input-csv-renamed-mar.csv,300.0,
2,test-data-input-csv-renamed-mar.csv,300.0,
0,test-data-input-csv-renamed-feb.csv,,200.0
1,test-data-input-csv-renamed-feb.csv,,200.0
2,test-data-input-csv-renamed-feb.csv,,200.0
0,test-data-input-csv-renamed-jan.csv,,100.0
1,test-data-input-csv-renamed-jan.csv,,100.0
2,test-data-input-csv-renamed-jan.csv,,100.0


You can pass the columns you want to rename to `columns_rename` and it will rename and align those columns.

In [22]:
# only select particular columns
cfg_col_sel = ['date','sales','cost','profit'] # don't select profit2
# rename columns
cfg_col_rename = {'sales':'revenue'} # rename all instances of sales to revenue

In [23]:
c = d6tc.CombinerCSV(cfg_fnames, all_strings=True, columns_rename = cfg_col_rename, columns_select = cfg_col_sel) 
c.preview_combine() 


Unnamed: 0,date,revenue,cost,profit,filename
0,2011-03-01,300,-100,200,test-data-input-csv-renamed-mar.csv
1,2011-03-02,300,-100,200,test-data-input-csv-renamed-mar.csv
2,2011-03-03,300,-100,200,test-data-input-csv-renamed-mar.csv
0,2011-02-01,200,-90,110,test-data-input-csv-renamed-feb.csv
1,2011-02-02,200,-90,110,test-data-input-csv-renamed-feb.csv
2,2011-02-03,200,-90,110,test-data-input-csv-renamed-feb.csv
0,2011-01-01,100,-80,20,test-data-input-csv-renamed-jan.csv
1,2011-01-02,100,-80,20,test-data-input-csv-renamed-jan.csv
2,2011-01-03,100,-80,20,test-data-input-csv-renamed-jan.csv
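
The rename-then-align step can be pictured as follows. This is a minimal standard-library sketch with hypothetical inputs, not the `d6tstack` implementation. Note one assumption: here the `select` list uses the new column names, which may differ from how `columns_select` is interpreted by the library.

```python
import csv
import io

# Hypothetical inputs: the March file uses "revenue" where the others use "sales".
sample_files = {
    "feb.csv": "date,sales,cost,profit\n2011-02-01,200,-90,110\n",
    "mar.csv": "date,revenue,cost,profit\n2011-03-01,300,-100,200\n",
}
rename = {"sales": "revenue"}                    # old name -> new name
select = ["date", "revenue", "cost", "profit"]   # output columns, in order

combined = []
for name, text in sample_files.items():
    for row in csv.DictReader(io.StringIO(text)):
        # Apply the rename first so both spellings land under one key.
        renamed = {rename.get(col, col): value for col, value in row.items()}
        # Keep only the selected columns, in a fixed order.
        record = {col: renamed.get(col) for col in select}
        record["filename"] = name
        combined.append(record)

for record in combined:
    print(record)
```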


## Use Case: Identifying a change in column order

If you read your files into a database, a change in column order is a real problem: the files look the same, but the columns are in fact arranged differently. Tools such as dask or SQL loaders assume the column order is the same in every file. With `d6tstack` you can easily identify and fix such a case.

In [24]:
cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-reorder-*.csv'))
print(cfg_fnames)

['test-data/input/test-data-input-csv-reorder-jan.csv', 'test-data/input/test-data-input-csv-reorder-mar.csv', 'test-data/input/test-data-input-csv-reorder-feb.csv']


In [25]:
# get previews
c = d6tc.CombinerCSV(cfg_fnames, all_strings=True) # all_strings=True makes reading faster
col_preview = c.preview_columns()

Here we can see that the columns are not equal across files: the column order differs.

In [26]:
print('all columns equal?', col_preview['is_all_equal'])
print('')
print('in what order do columns appear in the files?')
print('')
print(col_preview['df_columns_order'].reset_index(drop=True))

all columns equal? False

in what order do columns appear in the files?

                              filename cost date profit sales
0  test-data-input-csv-reorder-jan.csv    2    0      3     1
1  test-data-input-csv-reorder-mar.csv    3    0      2     1
2  test-data-input-csv-reorder-feb.csv    2    0      3     1


In [27]:
c.preview_combine() # automatically puts it in the right order

Unnamed: 0,date,sales,cost,profit,filename
0,2011-01-01,100,-80,20,test-data-input-csv-reorder-jan.csv
1,2011-01-02,100,-80,20,test-data-input-csv-reorder-jan.csv
2,2011-01-03,100,-80,20,test-data-input-csv-reorder-jan.csv
0,2011-03-01,300,-100,200,test-data-input-csv-reorder-mar.csv
1,2011-03-02,300,-100,200,test-data-input-csv-reorder-mar.csv
2,2011-03-03,300,-100,200,test-data-input-csv-reorder-mar.csv
0,2011-02-01,200,-90,110,test-data-input-csv-reorder-feb.csv
1,2011-02-02,200,-90,110,test-data-input-csv-reorder-feb.csv
2,2011-02-03,200,-90,110,test-data-input-csv-reorder-feb.csv
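
To see why column order matters, the following sketch contrasts reading by position with reading by name. The two sample files and the helper functions are hypothetical and use only the standard library; they illustrate the failure mode, not how any particular loader works internally.

```python
import csv
import io

# Hypothetical files with the same columns in a different order.
jan = "date,sales,cost,profit\n2011-01-01,100,-80,20\n"
mar = "date,sales,profit,cost\n2011-03-01,300,200,-100\n"

target = ["date", "sales", "cost", "profit"]

def rows_by_position(text):
    """Ignore the header and trust that columns arrive in the target order."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [dict(zip(target, row)) for row in reader]

def rows_by_name(text):
    """Use the header to look each target column up by name."""
    return [{col: row[col] for col in target}
            for row in csv.DictReader(io.StringIO(text))]

wrong = rows_by_position(mar)[0]
right = rows_by_name(mar)[0]
print("by position:", wrong)  # cost and profit are silently swapped
print("by name:    ", right)
```

Reading by position raises no error; the swapped values simply land in the wrong columns, which is what makes this kind of change hard to spot.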


# Customize separator and pass pd.read_csv() params

You can pass additional parameters such as separators and any params for `pd.read_csv()` to the combiner.

In [28]:
c = d6tc.CombinerCSV(cfg_fnames, sep=',',all_strings=True, read_csv_params={'header': None})
col_preview = c.preview_columns()
print(col_preview)

{'files_columns': {'test-data/input/test-data-input-csv-reorder-jan.csv': ['date', 'sales', 'cost', 'profit'], 'test-data/input/test-data-input-csv-reorder-mar.csv': ['date', 'sales', 'profit', 'cost'], 'test-data/input/test-data-input-csv-reorder-feb.csv': ['date', 'sales', 'cost', 'profit']}, 'columns_all': ['cost', 'date', 'profit', 'sales'], 'columns_common': ['cost', 'date', 'profit', 'sales'], 'columns_unique': [], 'is_all_equal': False, 'df_columns_present':                                                                                filename  \
file_path                                                                                 
test-data/input/test-data-input-csv-reorder-jan...  test-data-input-csv-reorder-jan.csv   
test-data/input/test-data-input-csv-reorder-mar...  test-data-input-csv-reorder-mar.csv   
test-data/input/test-data-input-csv-reorder-feb...  test-data-input-csv-reorder-feb.csv   

                                                    cost  date profit sale

# CSV out of core functionality

If your files are large, you don't want to read them all into memory and then save. Instead, you can write directly to the output file.

In [29]:
c.combine_save('test-data/output/test.csv')

True
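
The general out-of-core pattern is to stream rows from each input straight into the output so that only one row is held in memory at a time. Below is a minimal sketch of that pattern with the standard library; the function `combine_csv_streaming` and the demo files are hypothetical and are not part of `d6tstack`.

```python
import csv
import os
import tempfile

def combine_csv_streaming(paths, out_path, columns):
    """Copy rows from each input to out_path one at a time; return rows written."""
    written = 0
    with open(out_path, "w", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=columns + ["filename"])
        writer.writeheader()
        for path in paths:
            with open(path, newline="") as f_in:
                for row in csv.DictReader(f_in):
                    # Align by column name; missing columns become empty strings.
                    record = {col: row.get(col, "") for col in columns}
                    record["filename"] = os.path.basename(path)
                    writer.writerow(record)
                    written += 1
    return written

# Hypothetical demo data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    inputs = []
    for month, body in [("jan", "2011-01-01,100\n"), ("feb", "2011-02-01,200\n")]:
        path = os.path.join(tmp_dir, f"demo-{month}.csv")
        with open(path, "w", newline="") as f:
            f.write("date,sales\n" + body)
        inputs.append(path)

    out_file = os.path.join(tmp_dir, "combined.csv")
    n_rows = combine_csv_streaming(inputs, out_file, ["date", "sales"])
    with open(out_file, newline="") as f:
        combined_rows = list(csv.DictReader(f))

print(n_rows, combined_rows)
```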

### Detect CSV settings across all files

In [30]:
# find the CSV settings common to all files
cfg_sniff = d6tc.sniff_settings_csv(cfg_fnames)
print(cfg_sniff)


{'delim': ',', 'skiprows': 0, 'has_header': True, 'header': 0}
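
For a single file, Python's standard library offers a comparable facility in `csv.Sniffer`. The sketch below runs it on a hypothetical semicolon-separated sample; the `settings` dictionary and its keys are chosen here for illustration and are not the structure returned by `d6tstack`.

```python
import csv

# Hypothetical sample using a semicolon as the delimiter.
sample = (
    "date;sales;cost\n"
    "2011-01-01;100;-80\n"
    "2011-01-02;110;-85\n"
    "2011-01-03;120;-90\n"
)

sniffer = csv.Sniffer()
# Restrict the candidate delimiters so digits or dashes are never picked.
dialect = sniffer.sniff(sample, delimiters=",;\t|")

settings = {
    "delim": dialect.delimiter,
    "has_header": sniffer.has_header(sample),
}
print(settings)
```

Sniffing is heuristic, so it is worth running on a sample from every file and confirming that the detected settings agree before combining.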


# Excel Functionality

In [31]:
import importlib
import pandas as pd
import glob

import d6tstack.combine_csv as d6tc
from d6tstack.sniffer import XLSSniffer
from d6tstack.convert_xls import XLStoCSVMultiFile
from d6tstack.utils import PrintLogger

In [32]:
cfg_fnames = list(glob.glob('test-data/input/test-data-input-xls-mult-*.xlsx'))
print(cfg_fnames)

['test-data/input/test-data-input-xls-mult-feb.xlsx', 'test-data/input/test-data-input-xls-mult-jan.xlsx', 'test-data/input/test-data-input-xls-mult-mar.xlsx']


### Sniff excel sheets across files

In [33]:
# finds sheets across all files
sniffer = XLSSniffer(cfg_fnames)


In [34]:
print('all files have same sheet count?', sniffer.all_same_count())
print('')
print('all files have same sheet names?', sniffer.all_same_names())
print('')
print('all files contain sheet?', sniffer.all_contain_sheetname('Sheet1'))
print('')
print('detailed dataframe')
print('')
print(sniffer.df_xls_sheets.reset_index(drop=True).head())

all files have same sheet count? True

all files have same sheet names? True

all files contain sheet? True

detailed dataframe

                           file_name sheets_count sheets_idx      sheets_names
0  test-data-input-xls-mult-feb.xlsx            2     [0, 1]  [Sheet1, Sheet2]
1  test-data-input-xls-mult-jan.xlsx            2     [0, 1]  [Sheet1, Sheet2]
2  test-data-input-xls-mult-mar.xlsx            2     [0, 1]  [Sheet1, Sheet2]
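
The three checks above reduce to simple comparisons once the sheet names per file are known. The sketch below shows that logic on a hypothetical mapping of file names to sheet names; in practice the names would have to be read from the workbooks first, and this is not the `XLSSniffer` implementation.

```python
# Hypothetical sheet names per workbook.
sheets_by_file = {
    "feb.xlsx": ["Sheet1", "Sheet2"],
    "jan.xlsx": ["Sheet1", "Sheet2"],
    "mar.xlsx": ["Sheet1", "Sheet2"],
}

sheet_lists = list(sheets_by_file.values())

# Same number of sheets in every file?
all_same_count = len({len(names) for names in sheet_lists}) == 1
# Same sheet names, in the same order, in every file?
all_same_names = len({tuple(names) for names in sheet_lists}) == 1
# Does every file contain a given sheet?
all_contain_sheet1 = all("Sheet1" in names for names in sheet_lists)

print(all_same_count, all_same_names, all_contain_sheet1)
```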


### Use the print logger

In [35]:
logger = PrintLogger()

### Convert excel to csv

In [36]:
convertor = XLStoCSVMultiFile(cfg_fnames[:3], 'idx_global', 0, if_exists='replace', logger=logger)
files_out = convertor.convert_all()
print(files_out)

converting file: test-data-input-xls-mult-feb.xlsx | sheet: 0 ok
converting file: test-data-input-xls-mult-jan.xlsx | sheet: 0 ok
converting file: test-data-input-xls-mult-mar.xlsx | sheet: 0 ok
['test-data/input/test-data-input-xls-mult-feb.xlsx-0.csv', 'test-data/input/test-data-input-xls-mult-jan.xlsx-0.csv', 'test-data/input/test-data-input-xls-mult-mar.xlsx-0.csv']


### Read messy excel to pandas

In [37]:
from d6tstack.utils import read_excel_advanced
cfg_path = 'test-data/adv_excel_data/read_excel_adv - sample3.xlsx'
df=read_excel_advanced(cfg_path, header_xls_start="A10", header_xls_end="G10")
df.head()

Unnamed: 0,Product Code,Product Description,Weight (KG),Units,Cost,Ordered quantity,Total cost
0,SLFA300,SALMON FILLET A-TRIM,1,KG,11.07,10,110.7
