# d6tstack with Dask

Dask is a great library for out-of-core computing. But if the input files are not consistently organized, it quickly breaks. For example:

1) If columns differ between files, dask won't even read the data, and the error doesn't tell you what you need to do to fix it.

2) If the column order is rearranged between files, dask will read the data, but into the wrong columns, and you won't notice it.

Dask can't handle those scenarios on its own. With d6tstack you can easily fix the situation with just a few lines of code!

For more instructions, examples and documentation see https://github.com/d6t/d6tstack

## Base Case: Columns are same between all files
As a base case, we have input files which have consistent input columns and thus can be easily read in dask.

In [1]:
import dask.dataframe as dd

# consistent format
ddf = dd.read_csv('test-data/input/test-data-input-csv-clean-*.csv')
ddf.compute()



Unnamed: 0,date,sales,cost,profit
0,2011-02-01,200,-90,110
1,2011-02-02,200,-90,110
2,2011-02-03,200,-90,110
3,2011-02-04,200,-90,110
4,2011-02-05,200,-90,110
5,2011-02-06,200,-90,110
6,2011-02-07,200,-90,110
7,2011-02-08,200,-90,110
8,2011-02-09,200,-90,110
9,2011-02-10,200,-90,110


## Problem Case 1: Columns are different between files
That worked well. But what happens if your input files have inconsistent columns across files? Say for example one file has a new column that the other files don't have.

In [23]:
# inconsistent format: one file has an extra column
ddf = dd.read_csv('test-data/input/test-data-input-csv-colmismatch-*.csv')
ddf.compute()


ValueError: Length mismatch: Expected axis has 5 elements, new values have 4 elements

## Fixing the problem with d6tstack
Urgh! There's no way to use these files in dask. You don't even know what's going on. What file caused the problem? Why did it cause a problem? All you know is one file has more columns than the first file.

You can either manually process those files or use d6tstack to easily check for such a situation and fix it with a few lines of code - no manual processing required. Let's take a look!

In [4]:
import glob
import d6tstack.combine_csv

cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))
c = d6tstack.combine_csv.CombinerCSV(cfg_fnames)

# check columns
print('all equal',c.is_all_equal())
print('')
c.is_column_present()

sniffing columns ok
all equal False



file_path,date,sales,cost,profit,profit2
test-data/input/test-data-input-csv-colmismatch-feb.csv,True,True,True,True,False
test-data/input/test-data-input-csv-colmismatch-jan.csv,True,True,True,True,False
test-data/input/test-data-input-csv-colmismatch-mar.csv,True,True,True,True,True


Before using dask, you can quickly check with d6tstack whether all columns are consistent by calling `d6tstack.combine_csv.CombinerCSV.is_all_equal()`. If they are not consistent, `d6tstack.combine_csv.CombinerCSV.is_column_present()` shows which files are causing problems. In this case there is a new column "profit2" in "test-data-input-csv-colmismatch-mar.csv".
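To make the idea behind this check concrete, here is a minimal standard-library sketch of a column presence table. It is not d6tstack's implementation; the helper names `read_header` and `column_presence`, the short file names and the sample rows are all made up for illustration.

```python
import csv
import io


def read_header(fileobj):
    """Return the first CSV row (the header) without reading the data rows."""
    return next(csv.reader(fileobj))


def column_presence(headers):
    """Map each file name to {column: present?} over the union of all columns.

    `headers` is a dict of file name -> list of column names.
    """
    all_columns = []
    for cols in headers.values():
        for col in cols:
            if col not in all_columns:
                all_columns.append(col)  # keep first-seen order
    return {
        name: {col: col in cols for col in all_columns}
        for name, cols in headers.items()
    }


# In-memory stand-ins for three monthly files (hypothetical contents).
files = {
    "feb.csv": io.StringIO("date,sales,cost,profit\n2011-02-01,200,-90,110\n"),
    "jan.csv": io.StringIO("date,sales,cost,profit\n2011-01-01,100,-80,20\n"),
    "mar.csv": io.StringIO(
        "date,sales,cost,profit,profit2\n2011-03-01,300,-100,200,400\n"
    ),
}

headers = {name: read_header(f) for name, f in files.items()}
presence = column_presence(headers)
# The files are "all equal" only if every header is identical.
all_equal = len({tuple(cols) for cols in headers.values()}) == 1

print("all equal", all_equal)
for name, row in presence.items():
    print(name, row)
```

Only the header row of each file is read, so a check along these lines stays cheap even when the files themselves are large.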

**Let's use d6tstack to fix the situation.** We will use out-of-core processing with `d6tstack.combine_csv.CombinerCSV.to_csv_align()` to rewrite each input file with consistent columns. Any missing data is filled with NaN (to keep only common columns use `cfg_col_sel=c.col_preview['columns_common']`). It takes just a line or two of code!
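As a rough picture of what aligning a file means, the sketch below rewrites one CSV so that it has a fixed set of columns in a fixed order, leaving fields empty where the source has no such column. It uses only the standard library and is an illustration under stated assumptions, not how d6tstack does it; `align_csv` and the sample data are hypothetical.

```python
import csv
import io


def align_csv(source, target, columns):
    """Rewrite `source` into `target` using exactly `columns`, in that order.

    Values are matched by column name, not by position. Columns missing
    from the source are written as empty fields, which CSV readers such as
    pandas and dask load as NaN.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(
        target,
        fieldnames=columns,
        restval="",             # value for columns the source lacks
        extrasaction="ignore",  # drop source columns not listed in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Target layout: the union of columns across all files (assumed known here).
columns = ["date", "sales", "cost", "profit", "profit2"]

jan = io.StringIO("date,sales,cost,profit\n2011-01-01,100,-80,20\n")
out = io.StringIO()
align_csv(jan, out, columns)
aligned = out.getvalue()
print(aligned)
```

Because rows are processed one at a time, memory use does not grow with file size. Matching by name also means the same helper would handle files whose columns appear in a different order.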

In [6]:
# out-of-core combining
fnames = d6tstack.combine_csv.CombinerCSV(cfg_fnames).to_csv_align(output_dir='test-data/output/')

sniffing columns ok
writing test-data/output/d6tstack-test-data-input-csv-colmismatch-feb.csv ok
writing test-data/output/d6tstack-test-data-input-csv-colmismatch-jan.csv ok
writing test-data/output/d6tstack-test-data-input-csv-colmismatch-mar.csv ok


NB: Instead of `to_csv_align()` you can also run `to_csv_combine()` which creates a single combined file.

Now you can read this in dask and do whatever you wanted to do in the first place.

In [7]:
# consistent format
ddf = dd.read_csv('test-data/output/d6tstack-test-data-input-csv-colmismatch-*.csv')
ddf.compute()

Unnamed: 0,date,sales,cost,profit,profit2,filepath,filename
0,2011-02-01,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
1,2011-02-02,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
2,2011-02-03,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
3,2011-02-04,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
4,2011-02-05,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
5,2011-02-06,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
6,2011-02-07,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
7,2011-02-08,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
8,2011-02-09,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv
9,2011-02-10,200,-90,110,,test-data/input/test-data-input-csv-colmismatc...,test-data-input-csv-colmismatch-feb.csv


## Problem Case 2: Columns are reordered between files
This is a sneaky case. The columns are the same but the order is different! Dask will read everything just fine without a warning but your data is totally messed up!

In the example below, the "profit" column contains data from the "cost" column!
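The mechanism behind this kind of silent mix-up can be reproduced with the standard library alone. The sketch below is not dask's actual code path; it only illustrates what happens when rows from one file are labeled by position with the header of another file. The sample data and the helper names `read_positional` and `read_by_name` are hypothetical.

```python
import csv
import io

# Two hypothetical files: in the second, "cost" and "profit" are swapped.
jan = "date,sales,cost,profit\n2011-01-01,100,-80,20\n"
mar = "date,sales,profit,cost\n2011-03-01,300,200,-100\n"


def read_positional(text, columns):
    """Label each row with `columns` by position, ignoring the file's own header."""
    rows = list(csv.reader(io.StringIO(text)))
    return [dict(zip(columns, row)) for row in rows[1:]]


def read_by_name(text):
    """Label each row using the file's own header."""
    return list(csv.DictReader(io.StringIO(text)))


first_header = next(csv.reader(io.StringIO(jan)))

wrong = read_positional(mar, first_header)  # assumes the first file's layout
right = read_by_name(mar)                   # trusts each file's own header

print("positional:", wrong[0])
print("by name:   ", right[0])
```

With positional labeling, the "profit" field of the March row ends up holding the cost value, and nothing raises an error because the number of columns still matches.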

In [8]:
# inconsistent format: column order differs between files
ddf = dd.read_csv('test-data/input/test-data-input-csv-reorder-*.csv')
ddf.compute()

Unnamed: 0,date,sales,cost,profit
0,2011-02-01,200,-90,110
1,2011-02-02,200,-90,110
2,2011-02-03,200,-90,110
3,2011-02-04,200,-90,110
4,2011-02-05,200,-90,110
5,2011-02-06,200,-90,110
6,2011-02-07,200,-90,110
7,2011-02-08,200,-90,110
8,2011-02-09,200,-90,110
9,2011-02-10,200,-90,110


In [9]:
cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-reorder-*.csv'))
c = d6tstack.combine_csv.CombinerCSV(cfg_fnames)

# check columns
col_sniff = c.sniff_columns()
print('all columns equal?' , c.is_all_equal())
print('')
print('in what order do columns appear in the files?')
print('')
col_sniff['df_columns_order'].reset_index(drop=True)

sniffing columns ok
all columns equal? False

in what order do columns appear in the files?



Unnamed: 0,date,sales,cost,profit
0,0,1,2,3
1,0,1,2,3
2,0,1,3,2


Again, this is a useful check to run before loading data into dask: you can see right away that the columns don't line up. It's very fast because it only reads the headers, so from a QA perspective there's no reason not to do it.
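A header-only order check can be sketched in a few lines of standard-library Python. This is an illustration, not d6tstack's implementation; `read_header`, `column_order`, the short file names and the sample rows are assumptions made for the example.

```python
import csv
import io


def read_header(fileobj):
    """Read only the first CSV row; the data rows are never touched."""
    return next(csv.reader(fileobj))


def column_order(header, reference):
    """Position of each reference column inside `header`.

    Assumes both hold the same column names (raises ValueError otherwise).
    """
    return [header.index(col) for col in reference]


# In-memory stand-ins for three monthly files (hypothetical contents).
files = {
    "feb.csv": io.StringIO("date,sales,cost,profit\n2011-02-01,200,-90,110\n"),
    "jan.csv": io.StringIO("date,sales,cost,profit\n2011-01-01,100,-80,20\n"),
    "mar.csv": io.StringIO("date,sales,profit,cost\n2011-03-01,300,200,-100\n"),
}

headers = {name: read_header(f) for name, f in files.items()}
reference = headers["feb.csv"]  # treat the first file as the reference layout
order = {name: column_order(h, reference) for name, h in headers.items()}
same_order = all(h == reference for h in headers.values())

print("same order in all files?", same_order)
for name, positions in order.items():
    print(name, positions)
```

Any row of positions that is not strictly increasing points to a file whose columns are arranged differently from the reference.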

As above, the fix is the same few lines of code with d6tstack.

In [10]:
# out-of-core combining
fnames = d6tstack.combine_csv.CombinerCSV(cfg_fnames).to_csv_align(output_dir='test-data/output/')

sniffing columns ok
writing test-data/output/d6tstack-test-data-input-csv-reorder-feb.csv ok
writing test-data/output/d6tstack-test-data-input-csv-reorder-jan.csv ok
writing test-data/output/d6tstack-test-data-input-csv-reorder-mar.csv ok


In [11]:
# consistent format
ddf = dd.read_csv('test-data/output/d6tstack-test-data-input-csv-reorder-*.csv')
ddf.compute()

Unnamed: 0,date,sales,cost,profit,filepath,filename
0,2011-02-01,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
1,2011-02-02,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
2,2011-02-03,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
3,2011-02-04,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
4,2011-02-05,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
5,2011-02-06,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
6,2011-02-07,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
7,2011-02-08,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
8,2011-02-09,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
9,2011-02-10,200,-90,110,test-data/input/test-data-input-csv-reorder-fe...,test-data-input-csv-reorder-feb.csv
