## Preprocessing

This notebook is intended as a companion to Section 3 - Preprocessing.

To demonstrate the various transformations discussed, we'll populate a simple toy dataset for illustrative purposes.

In [1]:
import pandas as pd
import numpy as np

df_train = pd.DataFrame({'number':[0, 33, 134, 333, 42, 0.33, 15.5, -1, np.nan],
                         'category':['circle','circle','circle','square','square','triangle', 1234, np.nan, 'square'],
                         'binary':['yes','yes','yes','yes','no','no','no',np.nan,'yes'],
                         'datetime':['1/01/2017 1:00am', '3/10/2018 6:30am', '5/31/2018 10:15am', '7/26/2019 1:40pm', '9/05/2019 6:45pm', '12/25/2019', '11:30pm', np.nan, 'string'],
                         'address':['1234 North Peterson St Orlando, FL 32714',
                                   '2345 South Anderson St Altamonte Springs, FL 32715',
                                   '3456 South Peterson St Maitland, FL 32789',
                                   '4567 North Peterson St Orlando, FL 32714',
                                   '5678 Avenue St Orlando, FL 32714',
                                   '6789 South Peterson St Maitland, FL 32789',
                                   '5858 North Other St Altamonte Springs, FL 32715',
                                   None,
                                   'Orlando, FL'],
                        })

df_train

Unnamed: 0,number,category,binary,datetime,address
0,0.0,circle,yes,1/01/2017 1:00am,"1234 North Peterson St Orlando, FL 32714"
1,33.0,circle,yes,3/10/2018 6:30am,"2345 South Anderson St Altamonte Springs, FL 3..."
2,134.0,circle,yes,5/31/2018 10:15am,"3456 South Peterson St Maitland, FL 32789"
3,333.0,square,yes,7/26/2019 1:40pm,"4567 North Peterson St Orlando, FL 32714"
4,42.0,square,no,9/05/2019 6:45pm,"5678 Avenue St Orlando, FL 32714"
5,0.33,triangle,no,12/25/2019,"6789 South Peterson St Maitland, FL 32789"
6,15.5,1234,no,11:30pm,"5858 North Other St Altamonte Springs, FL 32715"
7,-1.0,,,,
8,,square,yes,string,"Orlando, FL"


### under automation

In [2]:
#Let's run a quick automated application
#we'll turn off shuffling for ease of inspection
#and turn off printouts to avoid clutter

from Automunge import *
am = AutoMunge()

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False)

Inspecting the returned train set, the suffix appenders on each column header show which transforms were applied as part of that column's derivation. In each case, the returned data is supplemented with a 'NArw' column signalling the presence of infill.

The numeric feature "number" had a 'nmbr' transform applied which is a z-score normalization. 
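
As a rough illustration of what a z-score normalization does, here is a minimal pandas sketch. It assumes a population standard deviation (ddof=0) and mean infill for missing entries. Those choices are assumptions made for illustration and not a statement of how 'nmbr' is implemented internally, although for the non-missing entries they are consistent with the number_nmbr values returned in this notebook.

```python
import numpy as np
import pandas as pd

def zscore(series: pd.Series) -> pd.Series:
    """Subtract the mean and divide by the standard deviation (NaN ignored)."""
    mean = series.mean()
    std = series.std(ddof=0)  # assumption: population standard deviation
    normalized = (series - mean) / std
    # assumption: missing entries receive the mean, i.e. 0 after normalization
    return normalized.fillna(0.0)

number = pd.Series([0, 33, 134, 333, 42, 0.33, 15.5, -1, np.nan])
scaled = zscore(number)
print(scaled.round(6).tolist())
```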

The categoric feature "binary" had a 'bnry' transform applied, which is a single column binarization.

The categoric feature "category" also had a binarization by the '1010' transform returning three columns. 

The datetime feature "datetime" had entries segregated by time scale (year, month/day, hour/minute/second) with separate transforms by sine and cosine to accommodate periodicity, and also had bins aggregated for business hours, weekdays, and holidays. Note that the intermediate transform 'tmzn' visible by suffix appender is there to allow designation of a desired timezone by parameter.
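
To make the sine and cosine idea concrete, here is a minimal sketch for the hour/minute/second time scale. It treats the time of day as a fraction of a 24 hour period and maps it onto the unit circle. The function name and column names are hypothetical, and this is an illustration of the periodicity encoding rather than the library's own code, although the values for 1:00am and 6:30am agree with the datetime_tmzn_hmss and datetime_tmzn_hmsc columns returned in this notebook.

```python
import numpy as np
import pandas as pd

def time_of_day_sin_cos(timestamps: pd.Series) -> pd.DataFrame:
    """Map the time of day onto the unit circle so 11:59pm sits next to 12:00am."""
    seconds = (timestamps.dt.hour * 3600
               + timestamps.dt.minute * 60
               + timestamps.dt.second)
    angle = 2 * np.pi * seconds / 86400  # fraction of a 24 hour day
    return pd.DataFrame({'sin': np.sin(angle), 'cos': np.cos(angle)})

stamps = pd.to_datetime(pd.Series(['2017-01-01 01:00', '2018-03-10 06:30']))
encoded = time_of_day_sin_cos(stamps)
print(encoded.round(6))
```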

The final categoric set, "address", was subjected to a hashing (aka the "hashing trick"). Because every entry in this set was unique, it was given a parsed hashing in which distinct words found within entries, based on a space separator, were individually encoded. (Returning eight 'hash' columns, address_hash_0 through address_hash_7, because the max word count in an entry was 8 words.)

Not shown, another scenario of hashing is applied under automation for cases where the all unique threshold hasn't been reached but number of unique entries in a categoric set is still above the configurable "numbercategoryheuristic" threshold which defaults to 255. In this case hashing is applied to aggregate unique entries without parsing for distinct words.
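
For intuition on the parsed hashing, here is a minimal sketch of the hashing trick applied per word. The use of md5, the bucket count, and the names hash_words, n_buckets, and max_words are all assumptions chosen for illustration, so the integers will not match the library's output. What it does show is that a word shared between entries lands in the same bucket, and that shorter entries are padded with 0.

```python
import hashlib

def hash_words(entry: str, n_buckets: int = 50, max_words: int = 8) -> list:
    """Encode each space separated word as an integer bucket, padding with 0."""
    codes = []
    for word in entry.split(' ')[:max_words]:
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        # buckets run 1..n_buckets; 0 is reserved for padding
        codes.append(int(digest, 16) % n_buckets + 1)
    return codes + [0] * (max_words - len(codes))

a = hash_words('1234 North Peterson St Orlando, FL 32714')
b = hash_words('4567 North Peterson St Orlando, FL 32714')
print(a)
print(b)
```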

In [3]:
pd.set_option('display.max_columns', 100)

train

Unnamed: 0,number_nmbr,binary_bnry,number_NArw,category_NArw,category_1010_0,category_1010_1,category_1010_2,binary_NArw,datetime_NArw,datetime_tmzn_year,datetime_tmzn_mdsn,datetime_tmzn_mdcs,datetime_tmzn_hmss,datetime_tmzn_hmsc,datetime_tmzn_bshr,datetime_tmzn_wkdy,datetime_tmzn_hldy,address_NArw,address_hash_0,address_hash_1,address_hash_2,address_hash_3,address_hash_4,address_hash_5,address_hash_6,address_hash_7
0,-0.644929,1,0,0,0,0,1,0,0,-1.500117,0.5145554,0.857457,0.258819,0.965926,0,0,0,0,5,4,23,19,24,23,16,0
1,-0.33916,1,0,0,0,0,1,0,0,-0.825064,0.9857698,-0.168101,0.991445,-0.130526,0,0,0,0,19,20,4,19,30,31,23,41
2,0.596678,1,0,0,0,0,1,0,0,-0.825064,1.224647e-16,-1.0,0.442289,-0.896873,1,1,0,0,24,20,23,19,43,23,12,0
3,2.440556,1,0,0,0,1,0,0,0,-0.150012,-0.8207635,-0.571268,-0.422618,-0.906308,1,1,0,0,32,4,23,19,24,23,16,0
4,-0.255769,0,0,0,0,1,0,0,0,-0.150012,-0.9961947,0.087156,-0.980785,0.19509,0,1,0,0,20,37,19,24,23,16,0,0
5,-0.641871,0,0,0,0,1,1,0,0,-0.150012,0.4098203,0.912166,0.0,1.0,0,1,1,0,29,20,23,19,43,23,12,0
6,-0.50131,0,0,0,0,0,0,0,0,1.200094,-0.0348995,-0.999391,-0.130526,0.991445,0,1,0,0,16,4,38,19,30,31,23,41
7,-0.654195,1,0,1,0,0,0,1,1,-0.649551,-0.01783955,0.459045,-0.011968,0.440456,0,1,0,1,24,0,0,0,0,0,0,0
8,-0.360567,1,1,0,0,1,0,0,1,-0.312024,-0.3757586,-0.18343,-0.19084,-0.094732,0,1,0,0,24,23,0,0,0,0,0,0


Note that in some cases the set of columns returned from an input feature may not all be adjacent in the returned dataframe. This was an intentional design decision. If grouping coherence is desired, a variation can be applied with the assignparam parameter, which will activate grouping coherence of returned sets.

```
assignparam = {'global_assignparam' : {'inplace' : False}}
```
Otherwise the set of columns returned from an input feature can be accessed from the returned postprocess_dict by:
```
postprocess_dict['column_map']['<input_feature_header>']
```

In [4]:
train[postprocess_dict['column_map']['binary']]

Unnamed: 0,binary_bnry,binary_NArw
0,1,0
1,1,0
2,1,0
3,1,0
4,0,0
5,0,0
6,0,0
7,1,1
8,1,0


To be complete, we'll quickly demonstrate the defaults under automation for label sets:

Numeric labels are left un-normalized with the 'exc2' transform, which is basically a pass-through transform for numeric sets. Note that label sets are not supplemented with the NArw column. Since this returns a single column labels set, it is received as a Pandas series instead of a dataframe.

In [5]:
#numeric label set

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = 'number',
             shuffletrain = False,
             printstatus = False)

labels

0      0.00
1     33.00
2    134.00
3    333.00
4     42.00
5      0.33
6     15.50
7     -1.00
8     -1.00
Name: number_exc2, dtype: float32

Categoric labels are instead given an ordinal transform by the 'ordl' transform. 

In [6]:
#categoric label set

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = 'category',
             shuffletrain = False,
             printstatus = False)

labels

0    1
1    1
2    1
3    2
4    2
5    3
6    0
7    4
8    2
Name: category_ordl, dtype: uint8

### assigning transforms

The paper notes several types of custom transforms available for assignment. Here we'll demonstrate a few.

__________

"Numeric features may be assigned to any range of transformations, normalizations, and bin aggregations"

=> there is a lot to choose from. Here we'll demonstrate a transformation set (assembled with our "family tree primitives") which includes an upstream log transform 'log0' followed by a downstream min/max scaling 'mnmx' as well as equal population bin aggregations 'bnep', which we'll assign to our numeric feature.

We'll pass a parameter to the bin aggregations to designate the bin count.

In [7]:
#family trees are defined in the transformdict
#here the upstream transforms are newt, bnep, and NArw
#and since parents is a primitive with offspring
#the downstream primitives are inspected 
#and mnmx is performed downstream of newt
transformdict =  {'newt' : {'parents' : ['newt'], \
                            'siblings': [], \
                            'auntsuncles' : ['bnep'], \
                            'cousins' : ['NArw'], \
                            'children' : [], \
                            'niecesnephews' : [], \
                            'coworkers' : ['mnmx'], \
                            'friends' : []}}

#since we're defining a new root category 'newt' in a transformdict
#we'll need a corresponding processdict entry
#we'll designate that the 'newt' category
#is associated with a log0 transform
#as applied above in the parents primitive
processdict = {'newt' : {'functionpointer' : 'log0'}}

#we'll assign this root category to column 'number'
assigncat = {'newt' : ['number']}

#since bnep returns 5 bins and our toy data set only has 9 entries
#let's go ahead and set a parameter to return fewer bins
#bnep accepts parameter 'bincount' as documented in read me
assignparam = {'default_assignparam' : {'bnep' : {'bincount' : 3}}}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             transformdict = transformdict,
             processdict = processdict, 
             assigncat = assigncat,
             MLinfill = False,
             assignparam = assignparam)

#and we can then inspect the returned train set to view output
train[postprocess_dict['column_map']['number']]

#note that since the root category had a processdict entry derived 
#from a functionpointer to log0
#the associated NArowtype is applied which treats entries <=0 as subject to infill

Unnamed: 0,number_NArw,number_log0_mnmx,number_bnep_0,number_bnep_1,number_bnep_2
0,1,0.631899,1,0,0
1,0,0.665794,0,1,0
2,0,0.868393,0,0,1
3,0,1.0,0,0,1
4,0,0.700661,0,0,1
5,0,0.0,1,0,0
6,0,0.556543,0,1,0
7,1,0.631899,1,0,0
8,1,0.631899,0,0,0


In [8]:
#Here again is the input column that was the source for comparison

df_train['number']

0      0.00
1     33.00
2    134.00
3    333.00
4     42.00
5      0.33
6     15.50
7     -1.00
8       NaN
Name: number, dtype: float64
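
For readers who want to trace the arithmetic, the three steps can be approximated in plain pandas. This is a sketch under stated assumptions (base 10 logarithm, mean infill after scaling, and pd.qcut for the equal population bins), not the library's implementation, although with these assumptions the results line up with the number_log0_mnmx and number_bnep columns shown above.

```python
import numpy as np
import pandas as pd

number = pd.Series([0, 33, 134, 333, 42, 0.33, 15.5, -1, np.nan])

# log transform: entries <= 0 (and NaN) are treated as missing
logged = np.log10(number.where(number > 0))

# min-max scaling of the valid entries, then mean infill for the rest
scaled = (logged - logged.min()) / (logged.max() - logged.min())
scaled = scaled.fillna(scaled.mean())

# equal population bins on the raw values, with a bin count of 3
bins = pd.qcut(number, q=3, labels=False)
onehot_bins = pd.get_dummies(bins).astype(int)

print(scaled.round(6).tolist())
print(bins.tolist())
```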

__________

"Sequential numeric features may be supplemented by proxies for derivatives"

There are several pre-defined transformation sets that may serve to supplement sequential numeric features with proxies for derivatives (where by sequential we mean like time series data).

The dxdt family of transforms (dxdt, d2dt, d3dt, etc) approximate a derivative by deltas between adjacent time steps followed by a normalization. Note that if a larger interval is desired for the delta a user can pass the 'periods' parameter through assignparam to designate.

The dxdt root category returns the original data normalized by the retn transform supplemented by a dxdt proxy for first order derivative (like velocity) also normalized by retn, i.e. when dxdt is assigned to an input column with header 'column', it would return the set {'column_retn', 'column_dxdt_retn', 'column_NArw'}. Similarly, the d2dt root category would return an additional column with a proxy for a second order derivative (like acceleration) as the set {'column_retn', 'column_dxdt_retn', 'column_dxdt_dxdt_retn', 'column_NArw'}.

(As an aside, the retn transform is a type of normalization with sign retention, more detail in the paper Numeric Encoding Options with Automunge.)

Another variant is available by the dxd2 family (dxd2, d2d2, d3d2, etc) which applies a more denoised signal by subtracting the average of multiple time steps between some interval. (In base configuration this is the average of the last two rows minus the average of the preceding two rows, with increased intervals available again by the 'periods' parameter.)

In [9]:
#to demonstrate let's assign the d2dt root category to column 'number'
assigncat = {'d2dt' : ['number']}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             assigncat = assigncat,
             MLinfill = False)

#and we can then inspect the returned train set to view output
train[postprocess_dict['column_map']['number']]

Unnamed: 0,number_retn,number_NArw,number_dxdt_retn,number_dxdt_dxdt_retn
0,0.0,0,0.067347,0.0
1,0.098802,0,0.067347,0.0
2,0.401198,0,0.206122,0.091975
3,0.997006,0,0.406122,0.132552
4,0.125749,0,-0.593878,-0.662762
5,0.000988,0,-0.085041,0.337238
6,0.046407,0,0.030959,0.07688
7,-0.002994,0,-0.033673,-0.042836
8,0.208394,1,0.0,0.022318


In [10]:
#Here again is the input column that was the source for comparison

df_train['number']

0      0.00
1     33.00
2    134.00
3    333.00
4     42.00
5      0.33
6     15.50
7     -1.00
8       NaN
Name: number, dtype: float64
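
The delta calculations themselves are simple to sketch in pandas. The helper names below are hypothetical, and the scaling step is an assumption: dividing by the range happens to reproduce the number_dxdt_retn values above for data that spans zero, while the full definition of retn is left to the paper.

```python
import numpy as np
import pandas as pd

def dxdt(series: pd.Series, periods: int = 1) -> pd.Series:
    """First order proxy: delta between rows that are `periods` steps apart."""
    return series.diff(periods)

def dxd2(series: pd.Series, periods: int = 2) -> pd.Series:
    """Denoised proxy: mean of the last `periods` rows minus the mean of the
    `periods` rows before them."""
    rolling_mean = series.rolling(periods).mean()
    return rolling_mean - rolling_mean.shift(periods)

number = pd.Series([0, 33, 134, 333, 42, 0.33, 15.5, -1])

velocity = dxdt(number)
acceleration = dxdt(velocity)  # applying the delta twice, as in d2dt
smoothed = dxd2(number)

# assumption: scaling by the range, only valid here because the deltas span zero
velocity_scaled = velocity / (velocity.max() - velocity.min())

print(velocity_scaled.round(6).tolist())
```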

__________

"Categoric features may be subject to encodings like ordinal, one-hot, binarization, hashing,"

Let's go ahead and demonstrate each of these side by side for comparison.

In [11]:
#family trees are defined in the transformdict
#here we'll apply ordinal via 'ord3', one-hot via 'text', 
#binarization via '1010', and hashing via 'hash'


transformdict =  {'newt' : {'parents' : [], \
                            'siblings': [], \
                            'auntsuncles' : ['ord3', 'text', '1010', 'hash'], \
                            'cousins' : ['NArw'], \
                            'children' : [], \
                            'niecesnephews' : [], \
                            'coworkers' : [], \
                            'friends' : []}}

#since we're defining a new root category 'newt' in a transformdict
#we'll need a corresponding processdict entry
#we'll just apply comparable to ord3 which is an arbitrary choice
processdict = {'newt' : {'functionpointer' : 'ord3'}}

#we'll assign this root category to column 'category'
assigncat = {'newt' : ['category']}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             transformdict = transformdict,
             processdict = processdict, 
             assigncat = assigncat,
             MLinfill = False)

#and we can then inspect the returned train set to view output
train[postprocess_dict['column_map']['category']]

Unnamed: 0,category_hash,category_NArw,category_ord3,category_1234,category_circle,category_square,category_triangle,category_1010_0,category_1010_1,category_1010_2
0,5,0,0,0,1,0,0,0,0,1
1,5,0,0,0,1,0,0,0,0,1
2,5,0,0,0,1,0,0,0,0,1
3,6,0,1,0,0,1,0,0,1,0
4,6,0,1,0,0,1,0,0,1,0
5,9,0,3,0,0,0,1,0,1,1
6,5,0,2,1,0,0,0,0,0,0
7,6,1,4,0,0,0,0,1,0,0
8,6,0,1,0,0,1,0,0,1,0


Here is a summary of each transformation category and associated returned columns:
- hash: category_hash
- NArw: category_NArw
- ord3: category_ord3
- text: category_1234, category_circle, category_square, category_triangle
- 1010: category_1010_0, category_1010_1, category_1010_2

Note that the 'text' one-hot encoding is somewhat unusual in that suffix appenders are associated with the activation. Another variant of one-hot encoding is available with privacy preserving suffix appenders as category 'onht'.
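
To see how these encodings relate to one another, here is a minimal sketch on a small string column using pandas and NumPy. The integer assignment (order of first appearance) is an assumption of this sketch and may differ from the ordering the library uses. The point is the shape of each result: one column for ordinal, one column per unique entry for one-hot, and roughly log2 of the unique entry count for binarization.

```python
import numpy as np
import pandas as pd

category = pd.Series(['circle', 'circle', 'square', 'triangle', 'square'])

# ordinal: one integer per unique entry (here in order of first appearance)
codes, uniques = pd.factorize(category)

# one-hot: one column per unique entry, named after the entry
onehot = pd.get_dummies(category).astype(int)

# binarization: write each ordinal code in base 2, one column per bit
n_bits = max(1, int(np.ceil(np.log2(len(uniques)))))
bit_positions = np.arange(n_bits)[::-1]
binarized = (codes[:, None] >> bit_positions) & 1

print(codes.tolist())
print(onehot.columns.tolist())
print(binarized.tolist())
```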

In [12]:
#Here again is the input column that was the source for comparison

df_train['category']

0      circle
1      circle
2      circle
3      square
4      square
5    triangle
6        1234
7         NaN
8      square
Name: category, dtype: object

__________

"or even parsed categoric encoding [11] with an increased information retention in comparison to one-hot encoding by a vectorization as a function of grammatical structure shared between entries."

There are several variants of parsed categoric encodings in the library. We go over them in some detail in the paper Parsed Categoric Encodings with Automunge, particularly the or19 root category. Here we'll demonstrate another variant available in the library as or23, which is similar to or19 but replaces the spl9/sp10 chain with a sp19, which is similar to splt with concurrent activations but with returned activations aggregated into a binarization. (Forgive the esoteric vocabulary, for more detail see the paper Parsed Categoric Encodings.)

In [13]:
#we'll assign the or23 root category to column 'address'
assigncat = {'or23' : ['address']}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             assigncat = assigncat,
             MLinfill = False)

#and we can then inspect the returned train set to view output
train[postprocess_dict['column_map']['address']]

Unnamed: 0,address_UPCS_ord3,address_NArw,address_UPCS_nmcm,address_UPCS_sp19_0,address_UPCS_sp19_1,address_UPCS_sp19_2
0,0,0,32714.0,1,0,1
1,1,0,32715.0,1,0,0
2,2,0,32789.0,1,1,0
3,3,0,32714.0,1,0,1
4,4,0,32714.0,0,1,0
5,6,0,32789.0,1,1,0
6,5,0,32715.0,0,1,1
7,8,1,32735.714844,0,0,0
8,7,0,32735.714844,0,0,1


In [14]:
#Here again is the input column that was the source for comparison

df_train['address']

0             1234 North Peterson St Orlando, FL 32714
1    2345 South Anderson St Altamonte Springs, FL 3...
2            3456 South Peterson St Maitland, FL 32789
3             4567 North Peterson St Orlando, FL 32714
4                     5678 Avenue St Orlando, FL 32714
5            6789 South Peterson St Maitland, FL 32789
6      5858 North Other St Altamonte Springs, FL 32715
7                                                 None
8                                          Orlando, FL
Name: address, dtype: object

For additional comparison, here is the or19 version that was discussed in detail in Parsed Categoric Encodings with Automunge.

In [15]:
#we'll assign the or19 root category to column 'address'
assigncat = {'or19' : ['address']}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             assigncat = assigncat,
             MLinfill = False)

#and we can then inspect the returned train set to view output
train[postprocess_dict['column_map']['address']]

Unnamed: 0,address_NArw,address_UPCS_spl9_ord3,address_UPCS_spl9_sp10_ord3,address_UPCS_nmc7_nmbr,address_UPCS_1010_0,address_UPCS_1010_1,address_UPCS_1010_2,address_UPCS_1010_3
0,0,0,1,-0.68876,0,0,0,0
1,0,2,0,-0.657041,0,0,0,1
2,0,1,2,1.690181,0,0,1,0
3,0,0,1,-0.68876,0,0,1,1
4,0,3,1,-0.68876,0,1,0,0
5,0,1,2,1.690181,0,1,1,0
6,0,2,0,-0.657041,0,1,0,1
7,1,5,0,0.0,1,0,0,0
8,0,4,0,0.0,0,1,1,1


__________

"Categoric sets may be collectively aggregated into a single common binarization. "

The aggregation of categoric sets into a common binarization is a form of dimensionality reduction available in the library, available for activation by the Binary automunge(.) parameter. Note that this only aggregates categoric entries populated with boolean integer entries (so not applied to ordinal or hashed sets). Note that inversion is supported.

First let's demonstrate on our toy data set without the Binary dimensionality reduction applied. We'll assign two categoric sets to '1010' which means they will be individually binarized.

In [16]:
#we'll assign the 1010 root category to two categoric columns
assigncat = {'1010' : ['category', 'binary']}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             assigncat = assigncat,
             MLinfill = False)

#we'll go ahead and inspect the entire returned train set
train

Unnamed: 0,number_nmbr,number_NArw,category_NArw,category_1010_0,category_1010_1,category_1010_2,binary_NArw,binary_1010_0,binary_1010_1,datetime_NArw,datetime_tmzn_year,datetime_tmzn_mdsn,datetime_tmzn_mdcs,datetime_tmzn_hmss,datetime_tmzn_hmsc,datetime_tmzn_bshr,datetime_tmzn_wkdy,datetime_tmzn_hldy,address_NArw,address_hash_0,address_hash_1,address_hash_2,address_hash_3,address_hash_4,address_hash_5,address_hash_6,address_hash_7
0,-0.644929,0,0,0,0,1,0,0,1,0,-1.500117,0.5145554,0.857457,0.258819,0.965926,0,0,0,0,5,4,23,19,24,23,16,0
1,-0.33916,0,0,0,0,1,0,0,1,0,-0.825064,0.9857698,-0.168101,0.991445,-0.130526,0,0,0,0,19,20,4,19,30,31,23,41
2,0.596678,0,0,0,0,1,0,0,1,0,-0.825064,1.224647e-16,-1.0,0.442289,-0.896873,1,1,0,0,24,20,23,19,43,23,12,0
3,2.440556,0,0,0,1,0,0,0,1,0,-0.150012,-0.8207635,-0.571268,-0.422618,-0.906308,1,1,0,0,32,4,23,19,24,23,16,0
4,-0.255769,0,0,0,1,0,0,0,0,0,-0.150012,-0.9961947,0.087156,-0.980785,0.19509,0,1,0,0,20,37,19,24,23,16,0,0
5,-0.641871,0,0,0,1,1,0,0,0,0,-0.150012,0.4098203,0.912166,0.0,1.0,0,1,1,0,29,20,23,19,43,23,12,0
6,-0.50131,0,0,0,0,0,0,0,0,0,1.200094,-0.0348995,-0.999391,-0.130526,0.991445,0,1,0,0,16,4,38,19,30,31,23,41
7,-0.654195,0,1,1,0,0,1,1,0,1,1.200094,-0.0348995,-0.999391,-0.130526,0.991445,0,0,0,1,24,0,0,0,0,0,0,0
8,0.0,1,0,0,1,0,0,0,1,1,1.200094,-0.0348995,-0.999391,-0.130526,0.991445,0,0,0,0,24,23,0,0,0,0,0,0


Here we see that between category and binary we ended up with 5 returned binarized columns (category_1010_0, category_1010_1, category_1010_2, binary_1010_0, binary_1010_1). Now let's try again with the Binary dimensionality reduction applied.

In [17]:
#we'll assign the 1010 root category to two categoric columns
assigncat = {'1010' : ['category', 'binary']}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             assigncat = assigncat,
             Binary = True,
             MLinfill = False)

#we'll go ahead and inspect the entire returned train set
train

Unnamed: 0,number_nmbr,datetime_tmzn_year,datetime_tmzn_mdsn,datetime_tmzn_mdcs,datetime_tmzn_hmss,datetime_tmzn_hmsc,address_hash_0,address_hash_1,address_hash_2,address_hash_3,address_hash_4,address_hash_5,address_hash_6,address_hash_7,Binary_1010_0,Binary_1010_1,Binary_1010_2,Binary_1010_3
0,-0.644929,-1.500117,0.5145554,0.857457,0.258819,0.965926,5,4,23,19,24,23,16,0,0,0,0,1
1,-0.33916,-0.825064,0.9857698,-0.168101,0.991445,-0.130526,19,20,4,19,30,31,23,41,0,0,0,1
2,0.596678,-0.825064,1.224647e-16,-1.0,0.442289,-0.896873,24,20,23,19,43,23,12,0,0,0,1,0
3,2.440556,-0.150012,-0.8207635,-0.571268,-0.422618,-0.906308,32,4,23,19,24,23,16,0,0,1,0,0
4,-0.255769,-0.150012,-0.9961947,0.087156,-0.980785,0.19509,20,37,19,24,23,16,0,0,0,0,1,1
5,-0.641871,-0.150012,0.4098203,0.912166,0.0,1.0,29,20,23,19,43,23,12,0,0,1,0,1
6,-0.50131,1.200094,-0.0348995,-0.999391,-0.130526,0.991445,16,4,38,19,30,31,23,41,0,0,0,0
7,-0.654195,1.200094,-0.0348995,-0.999391,-0.130526,0.991445,24,0,0,0,0,0,0,0,0,1,1,0
8,0.0,1.200094,-0.0348995,-0.999391,-0.130526,0.991445,24,23,0,0,0,0,0,0,0,1,1,1


Here we see a reduction to only four returned binarized columns in a common aggregation (Binary_1010_0, Binary_1010_1, Binary_1010_2, Binary_1010_3). Note that the reduction in returned columns may be particularly noticeable when there is a lot of correlation between categoric features.
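
The idea behind a common binarization can be sketched as follows: treat each row's full pattern of boolean activations as a single combined category, then binarize that combined category. The function name, the 'Binary_' column naming, and the toy input are hypothetical, and this is only an illustration of the concept rather than the library's procedure. When the input columns are correlated, few distinct patterns occur and the column count drops accordingly.

```python
import numpy as np
import pandas as pd

def common_binarization(boolean_df: pd.DataFrame) -> pd.DataFrame:
    """Replace a set of 0/1 columns with one binary code per distinct row."""
    # treat each row of activations as a single combined category
    combined = boolean_df.astype(str).agg(''.join, axis=1)
    codes, uniques = pd.factorize(combined)
    n_bits = max(1, int(np.ceil(np.log2(len(uniques)))))
    bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1
    columns = [f'Binary_{i}' for i in range(n_bits)]
    return pd.DataFrame(bits, columns=columns, index=boolean_df.index)

# five boolean columns in which only four distinct row patterns occur
boolean_df = pd.DataFrame({'a_0': [0, 0, 1, 1, 0, 1],
                           'a_1': [0, 1, 0, 1, 0, 1],
                           'a_2': [0, 1, 0, 1, 0, 1],
                           'b_0': [1, 0, 1, 0, 1, 0],
                           'b_1': [0, 0, 1, 1, 0, 1]})

reduced = common_binarization(boolean_df)
print(reduced)
```

Here five input columns reduce to two, since four distinct patterns fit in two bits.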

__________

"Categoric labels may have label smoothing applied [12], or fitted smoothing where null values are fit to class distributions."

Label smoothing is available by the 'smth' root category, or fitted smoothing by the 'fsmh' root category. In each case the value for activations defaults to 0.9 and may be designated by passing an 'activation' parameter to assignparam. The null values under smth are based on the unique entry count, and in fsmh are fit to class distributions corresponding to the activation associated with an entry.

Note that smoothing is applied by default to train data and not to test data. To turn on smoothing for test data by default, apply the 'testsmooth' parameter through assignparam; to selectively apply smoothing to test sets passed to postmunge, use the traindata postmunge(.) parameter.

Here we'll first demonstrate the smth transform, as well as passing an alternate activation value to increase from default of 0.9 to 0.95.

In [18]:
#we'll assign the smth root category to the category column, 
#which we'll also designate as a label set
#(this can also be assigned to features in train set if desired)

assigncat = {'smth' : ['category']}

labels_column = 'category'

#we'll update the activation value from 0.9 to 0.95
assignparam = {'default_assignparam' : {'smth' : {'activation' : 0.95}}}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = labels_column, 
             shuffletrain = False,
             printstatus = False,
             assigncat = assigncat,
             assignparam = assignparam,
             Binary = True,
             MLinfill = False)

#since we are interested in the labels set it will be returned in labels
labels

Unnamed: 0,category_NArw,category_smth_0,category_smth_1,category_smth_2,category_smth_3,category_smth_4
0,0,0.0125,0.95,0.0125,0.0125,0.0125
1,0,0.0125,0.95,0.0125,0.0125,0.0125
2,0,0.0125,0.95,0.0125,0.0125,0.0125
3,0,0.0125,0.0125,0.95,0.0125,0.0125
4,0,0.0125,0.0125,0.95,0.0125,0.0125
5,0,0.0125,0.0125,0.0125,0.95,0.0125
6,0,0.95,0.0125,0.0125,0.0125,0.0125
7,1,0.0125,0.0125,0.0125,0.0125,0.95
8,0,0.0125,0.0125,0.95,0.0125,0.0125
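
The arithmetic behind these values is straightforward to sketch in NumPy: the activated class receives the activation value and the remainder is split evenly over the other classes. The function name is hypothetical and this is an illustration rather than the library's code. With five classes and an activation of 0.95, each null value comes out as 0.05 / 4 = 0.0125, as in the output above.

```python
import numpy as np

def smooth_labels(codes: np.ndarray, n_classes: int, activation: float = 0.9) -> np.ndarray:
    """One-hot encode integer labels, then soften the 1s and 0s."""
    onehot = np.eye(n_classes)[codes]
    null_value = (1.0 - activation) / (n_classes - 1)
    return onehot * activation + (1.0 - onehot) * null_value

# ordinal codes of the 'category' labels as returned by 'ordl' earlier
codes = np.array([1, 1, 1, 2, 2, 3, 0, 4, 2])
smoothed = smooth_labels(codes, n_classes=5, activation=0.95)
print(smoothed[0])
```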


Now let's try again but this time with fitted smoothing via fsmh.

In [19]:
#we'll assign the fsmh root category to the category column, 
#which we'll also designate as a label set
#(this can also be assigned to features in train set if desired)

assigncat = {'fsmh' : ['category']}

labels_column = 'category'

#we'll update the activation value from 0.9 to 0.95
assignparam = {'default_assignparam' : {'fsmh' : {'activation' : 0.95}}}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             labels_column = labels_column, 
             shuffletrain = False,
             printstatus = False,
             assigncat = assigncat,
             assignparam = assignparam,
             Binary = True,
             MLinfill = False)

#since we are interested in the labels set it will be returned in labels
labels

Unnamed: 0,category_NArw,category_smth_0,category_smth_1,category_smth_2,category_smth_3,category_smth_4
0,0,0.008333,0.95,0.025,0.008333,0.008333
1,0,0.008333,0.95,0.025,0.008333,0.008333
2,0,0.008333,0.95,0.025,0.008333,0.008333
3,0,0.008333,0.025,0.95,0.008333,0.008333
4,0,0.008333,0.025,0.95,0.008333,0.008333
5,0,0.00625,0.01875,0.01875,0.95,0.00625
6,0,0.95,0.01875,0.01875,0.00625,0.00625
7,1,0.00625,0.01875,0.01875,0.00625,0.95
8,0,0.008333,0.025,0.95,0.008333,0.008333
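
Fitted smoothing can be sketched the same way, except that the leftover mass in each row is shared among the other classes in proportion to how often each occurs in the labels. The function name is hypothetical and the proportional split is our reading of the output above, which it reproduces for this toy set (class counts of 1, 3, 3, 1, 1).

```python
import numpy as np

def fitted_smooth_labels(codes: np.ndarray, n_classes: int, activation: float = 0.9) -> np.ndarray:
    """Like label smoothing, but null values follow the class distribution."""
    counts = np.bincount(codes, minlength=n_classes).astype(float)
    smoothed = np.zeros((len(codes), n_classes))
    for row, label in enumerate(codes):
        others = counts.copy()
        others[label] = 0.0  # exclude the activated class from the split
        smoothed[row] = (1.0 - activation) * others / others.sum()
        smoothed[row, label] = activation
    return smoothed

codes = np.array([1, 1, 1, 2, 2, 3, 0, 4, 2])
fitted = fitted_smooth_labels(codes, n_classes=5, activation=0.95)
print(fitted.round(6))
```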


__________

"Data augmentation transformations [10] may be applied which make use of noise injection, including several variants for both numeric and categoric features."

Data augmentation transforms were discussed in the paper Numeric Encoding Options with Automunge. Root categories for numeric transforms are available as DPnb (z-score normalized), DPmm (min-max normalized), DPrt (retain normalized), and root categories for categoric transforms are available as DPbn (binary for two value set), DPod (ordinal), DPoh (one hot), DP10 (binarized). 

Here we'll show a numeric and a categoric to demonstrate, just to pick one how about DPmm and DP10.

Note that the whole point of data augmentation is to increase the number of training samples by adding noise injection, and so the workflow is slightly different since the same set is redundantly encoded with and without noise injection. We'll accomplish this by processing the same df_train set as both train and test data in automunge and concatenating the results. Note that alternatively additional copies of the data can be prepared in postmunge.

Note that the convention for the DP family of transforms is that noise is injected to train data by default and not to test data. If you would like noise injected to test data, you can do so in postmunge(.) by activating the traindata parameter.

In [21]:
#Let's demonstrate applying noise injection to features 'number' and 'category'
assigncat = {'DPmm' : ['number'],
             'DP10' : ['category']}

#these transforms accept parameters to designate noise distribution profile
#as well as the ratio of entries to have noise injected
#the ratio defaults to 0.03, here we'll increase it to 0.5 for visualization purposes
assignparam = {'default_assignparam' : {'DPmm' : {'flip_prob' : 0.5},
                                        'DP10' : {'flip_prob' : 0.5}}}

#we'll process the df_train as both train data (with noise injection) 
#and test data (without noise injection)
#and concatenate the results to increase the number of training samples
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             df_test = df_train,
             shuffletrain = False,
             printstatus = False,
             assigncat = assigncat,
             assignparam = assignparam,
             MLinfill = False)

#concatenate the train and test sets to increase sample count
train   = pd.concat([train, test], axis=0, ignore_index=True)
train_ID = pd.concat([train_ID, test_ID], axis=0, ignore_index=True)
labels  = pd.concat([labels, test_labels], axis=0, ignore_index=True)

#here is what the output looks like
#we'll just inspect those features with noise injected
train[postprocess_dict['column_map']['number'] + postprocess_dict['column_map']['category']]

Unnamed: 0,number_NArw,number_mnmx_DPmm,category_NArw,category_ord3_DPod_1010_0,category_ord3_DPod_1010_1,category_ord3_DPod_1010_2
0,0,0.002994,0,0,0,0
1,0,0.101796,0,0,0,0
2,0,0.404648,0,0,0,0
3,0,0.961622,0,1,0,0
4,0,0.128743,0,0,0,1
5,0,0.003982,0,0,1,1
6,0,0.045695,0,0,1,0
7,0,0.0,1,1,0,0
8,1,0.211388,0,0,0,1
9,0,0.002994,0,0,0,0


In the above dataframe, rows 0-8 have noise injected to a randomly sampled 50% of entries, and rows 9-17 are without noise injections.
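
For intuition, the augmentation workflow can be sketched in NumPy on already encoded data. Everything here is an assumption made for illustration: the function names, Gaussian noise with a hypothetical sigma of 0.06, clipping to the 0-1 range, and uniform resampling of categoric codes. Only the flip_prob ratio and its 0.03 default come from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_numeric_noise(values, flip_prob=0.03, sigma=0.06):
    """Add Gaussian noise to a random subset of entries scaled to [0, 1]."""
    values = np.asarray(values, dtype=float)
    mask = rng.random(values.shape) < flip_prob  # which entries get noise
    noise = rng.normal(0.0, sigma, size=values.shape)
    return np.clip(values + mask * noise, 0.0, 1.0)  # assumption: clip to range

def inject_categoric_noise(codes, n_classes, flip_prob=0.03):
    """Replace a random subset of ordinal codes with a uniformly drawn class."""
    codes = np.asarray(codes)
    mask = rng.random(codes.shape) < flip_prob
    random_codes = rng.integers(0, n_classes, size=codes.shape)
    return np.where(mask, random_codes, codes)

clean_numeric = np.array([0.003, 0.102, 0.404, 1.0, 0.129])
clean_codes = np.array([1, 1, 2, 3, 0])

noisy_numeric = inject_numeric_noise(clean_numeric, flip_prob=0.5)
noisy_codes = inject_categoric_noise(clean_codes, n_classes=5, flip_prob=0.5)

# augmentation: stack the noisy copy on top of the clean copy
augmented_numeric = np.concatenate([noisy_numeric, clean_numeric])
augmented_codes = np.concatenate([noisy_codes, clean_codes])
print(augmented_numeric.shape, augmented_codes.shape)
```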

__________

"Sets of transformations to be directed at a target feature can be assembled which include generations and branches of derivations by making use of our “family tree primitives” [13], as can be used to redundantly encode a feature set in multiple configurations of varying information content."

We already demonstrated a family tree specification above. I'll show it again here so you can see it in context.

In [22]:
#family trees are defined in the transformdict
#here the upstream transforms are newt, bnep, and NArw
#and since parents is a primitive with offspring
#the downstream primitives are inspected 
#and mnmx is performed downstream of newt
transformdict = {'newt' : {'parents' : ['newt'],
                           'siblings': [],
                           'auntsuncles' : ['bnep'],
                           'cousins' : ['NArw'],
                           'children' : [],
                           'niecesnephews' : [],
                           'coworkers' : ['mnmx'],
                           'friends' : []}}

#since we're defining a new root category 'newt' in a transformdict
#we'll need a corresponding processdict entry
#we'll designate that the 'newt' category
#is associated with a log0 transform
#as applied above in the parents primitive
processdict = {'newt' : {'functionpointer' : 'log0'}}

#we'll assign this root category to column 'number'
assigncat = {'newt' : ['number']}

#since bnep returns 5 bins and our toy data set only has 9 entries
#let's go ahead and set a parameter to return fewer bins
#bnep accepts parameter 'bincount' as documented in the READ ME
assignparam = {'default_assignparam' : {'bnep' : {'bincount' : 3}}}

#Then go ahead and implement automunge
#since this is a toy dataset we'll turn off ml infill

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             transformdict = transformdict,
             processdict = processdict, 
             assigncat = assigncat,
             MLinfill = False,
             assignparam = assignparam)

#and we can then inspect the returned train set to view output
train[postprocess_dict['column_map']['number']]

#note that since the root category had a processdict entry derived 
#from a functionpointer to log0
#the associated NArowtype is applied which treats entries <=0 as subject to infill

Unnamed: 0,number_NArw,number_log0_mnmx,number_bnep_0,number_bnep_1,number_bnep_2
0,1,0.631899,1,0,0
1,0,0.665794,0,1,0
2,0,0.868393,0,0,1
3,0,1.0,0,0,1
4,0,0.700661,0,0,1
5,0,0.0,1,0,0
6,0,0.556543,0,1,0
7,1,0.631899,1,0,0
8,1,0.631899,0,0,0
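
To make the traversal concrete, here is a simplified sketch of how the primitives in the cell above resolve into chains of transformations. It is an illustration under stated assumptions rather than the library's code: the helper `walk` is hypothetical, it only handles categories defined in the passed `transformdict`, and it assumes that `parents` and `siblings` (upstream) and `children` and `niecesnephews` (downstream) are the primitives whose entries have offspring.

```python
# Simplified illustration of family tree traversal (not the library's code).

def walk(root, transformdict):
    """Return the chains of categories applied to a column assigned to root."""
    chains = []

    def descend(category, prefix):
        # category was found in a primitive with offspring, so the
        # downstream primitives of its own family tree are inspected
        tree = transformdict[category]
        chain = prefix + [category]
        for entry in tree['children'] + tree['niecesnephews']:
            descend(entry, chain)
        for entry in tree['coworkers'] + tree['friends']:
            chains.append(chain + [entry])
        # assumption: the intermediate output is kept only when no
        # replacement primitive (children, coworkers) takes its place
        if not (tree['children'] or tree['coworkers']):
            chains.append(chain)

    root_tree = transformdict[root]
    for entry in root_tree['parents'] + root_tree['siblings']:
        descend(entry, [])
    for entry in root_tree['auntsuncles'] + root_tree['cousins']:
        chains.append([entry])
    return chains

transformdict = {'newt': {'parents': ['newt'], 'siblings': [],
                          'auntsuncles': ['bnep'], 'cousins': ['NArw'],
                          'children': [], 'niecesnephews': [],
                          'coworkers': ['mnmx'], 'friends': []}}

chains = walk('newt', transformdict)
print(chains)  # [['newt', 'mnmx'], ['bnep'], ['NArw']]
```

The three chains line up with the returned headers above: `number_log0_mnmx` (the header shows `log0` because `newt` points to that function), the `number_bnep_*` bins, and `number_NArw`.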


__________

"Such transformation sets may be accessed from those predefined in an internal library for simple assignment or alternatively may be custom configured. "

We demonstrated custom configuration with the preceding family tree. The transformation sets predefined in the internal library simply require assigning a column to a root category in assigncat.
```
assigncat = {'or19' : ['category']}
```
The READ ME documentation includes a comprehensive survey of transformations and transformation sets available in the library.

__________

"Even the transformation functions themselves may be custom defined with only minimal requirements of simple data structures."

The conventions for defining custom transformation functions are documented in the READ ME under section "Custom Transformation Functions".

__________

"Through application statistics of the features are recorded to facilitate detection of distribution drift."

The drift statistics are recorded by default for train data passed to automunge. Drift reports are available for comparing subsequent test data passed to postmunge by activating the driftreport parameter. Results of the assessment can be viewed in the dictionary returned from postmunge as postreports_dict['driftreport'].
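
The idea behind such a comparison can be sketched with pandas alone. This is a hedged illustration, not the library's report format: the helper `numeric_drift_stats` and its dictionary keys are hypothetical, and the shifted `test_col` is simulated data.

```python
import numpy as np
import pandas as pd

def numeric_drift_stats(series):
    """Collect a few summary statistics for a numeric column (hypothetical helper)."""
    values = pd.to_numeric(series, errors='coerce')
    return {'mean': values.mean(),
            'std': values.std(ddof=0),   # population std, about 107.92 for the toy column
            'max': values.max(),
            'min': values.min(),
            'pct_nan': values.isna().mean()}

train_col = pd.Series([0, 33, 134, 333, 42, 0.33, 15.5, -1, np.nan])
test_col = train_col + 10  # simulated shift in subsequent data

orig_stats = numeric_drift_stats(train_col)
new_stats = numeric_drift_stats(test_col)

# difference of each statistic between the two samples
drift = {key: new_stats[key] - orig_stats[key] for key in orig_stats}
```

Here a constant shift moves the mean, max, and min by 10 while leaving the standard deviation and the share of missing entries unchanged. That is the kind of signal a drift comparison is meant to surface.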

In [25]:
#Here we'll just access a drift report 
#comparing the same data passed to automunge and postmunge

#first process data in automunge
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             MLinfill = False)

#then process additional data in postmunge using postprocess_dict returned from automunge
#activate the driftreport parameter to assemble drift stat comparison
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
             df_train,
             driftreport = True,
             printstatus = False)

#we can view results in postmunge printouts or in report returned here
postreports_dict['driftreport']

#note that the report includes drift stats associated with the received feature
#as well as drift stats collected with each transformation function applied

{'number': {'origreturnedcolumns_list': ['number_NArw', 'number_nmbr'],
  'newreturnedcolumns_list': ['number_NArw', 'number_nmbr'],
  'drift_category': 'nmbr',
  'orignotinnew': {},
  'newnotinorig': {},
  'newreturnedcolumn': {'number_NArw': {'orignormparam': {'pct_NArw': 0.1111111111111111},
    'newnormparam': {'pct_NArw': 0.1111111111111111}},
   'number_nmbr': {'orignormparam': {'mean': 69.60375,
     'std': 107.92468600110679,
     'max': 333.0,
     'min': -1.0,
     'offset': 0,
     'multiplier': 1,
     'cap': False,
     'floor': False},
    'newnormparam': {'mean': 69.60375,
     'std': 107.92468600110679,
     'max': 333.0,
     'min': -1.0,
     'offset': 0,
     'multiplier': 1,
     'cap': False,
     'floor': False}}}},
 'category': {'origreturnedcolumns_list': ['category_1010_0',
   'category_1010_1',
   'category_1010_2',
   'category_NArw'],
  'newreturnedcolumns_list': ['category_1010_0',
   'category_1010_1',
   'category_1010_2',
   'category_NArw'],
  'drift_ca

__________

"Inversion is available to recover the original form of data found preceding transformations, as may be useful to recover the original form of labels after inference."

Inversion operations are performed in postmunge. When activating inversion, we'll need to designate whether the target is a test set or labels.

Here we'll demonstrate encoding the train set with automunge and then recovering the original form with postmunge inversion.

In [26]:
#first process data in automunge
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             shuffletrain = False,
             printstatus = False,
             MLinfill = False)

#then we'll invert the train set returned from automunge
df_invert, recovered_list, inversion_info_dict = \
am.postmunge(postprocess_dict, 
             train, 
             inversion='test', 
             printstatus=False)

#Here is the recovered data after inversion
#(Note that we have a convention of applying an arbitrary plug value of 'zzzinfill'
#for entries that were not successfully recovered.)

#note that a few transforms don't support inversion
#such as transforms for datetime sets or hashing 
#(so columns 'datetime' and 'address' aren't recovered)
#for details on which columns are recovered, turn on printouts with printstatus

df_invert

Unnamed: 0,number,binary,category
0,0.0,yes,circle
1,33.000004,yes,circle
2,134.0,yes,circle
3,333.0,yes,square
4,42.000004,no,square
5,0.330002,no,triangle
6,15.500004,no,1234
7,-1.0,yes,zzzinfill
8,69.603752,yes,square
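
The round trip can be illustrated for a single numeric column. This is a minimal sketch assuming z-score normalization with mean infill, which is consistent with the `nmbr` statistics in the drift report but is not the library's code; `encode` and `invert` are hypothetical helper names.

```python
import numpy as np
import pandas as pd

def encode(series):
    """z-score normalize with mean infill; returns the encoding and fitted parameters."""
    mean, std = series.mean(), series.std(ddof=0)
    filled = series.fillna(mean)           # missing entries take the mean
    return (filled - mean) / std, {'mean': mean, 'std': std}

def invert(encoded, params):
    """Undo the normalization using the parameters fitted on the train data."""
    return encoded * params['std'] + params['mean']

number = pd.Series([0, 33, 134, 333, 42, 0.33, 15.5, -1, np.nan])
encoded, params = encode(number)
recovered = invert(encoded, params)

# the NaN entry cannot be recovered; inversion returns the infill value instead
print(recovered.round(5).tolist())
```

Row 8 of the inverted output above shows the same effect: the missing `number` entry comes back as the column mean, 69.603752, rather than as NaN.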


__________

"Or of course if the data is received already numerically encoded the library can simply be applied as a tool for missing data infill."

Since our toy data set isn't already numerically encoded, I'll just note the parameter. What takes place here is as follows: a numeric set with any non-integer entries is treated as a continuous set for regression, and an all-integer set is treated as a categoric encoding for classification, unless the number of unique entries in the integer set exceeds a heuristic of 75% of the train data, in which case it is treated as a continuous integer set for regression. We assume missing data is received as NaN.

Note that this heuristic is not perfect. There may be cases where an integer set that would be treated as a classification target should instead be treated as a continuous set for regression, in which case we suggest assigning the column to exc8 in assigncat.

The associated automunge parameter is:
```
powertransform = 'infill'
```
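
The heuristic described above can be sketched as follows. This illustrates the stated rule only and is not the library's implementation; the function name `infer_infill_task` and the `threshold` argument are hypothetical.

```python
import numpy as np
import pandas as pd

def infer_infill_task(column, threshold=0.75):
    """Decide whether a numerically encoded column is a regression or classification target."""
    values = column.dropna()                       # missing data is assumed to arrive as NaN
    if values.empty:
        return 'undetermined'
    all_integer = bool(np.all(np.mod(values, 1) == 0))
    if not all_integer:
        return 'regression'                        # any non-integer entry: continuous set
    # all-integer set: categoric unless most entries are unique
    if values.nunique() > threshold * len(column):
        return 'regression'
    return 'classification'

categoric = pd.Series([0, 1, 1, 0, 2, np.nan, 1, 0])
continuous = pd.Series([0.5, 1, 2, 3])
integer_ids = pd.Series(range(100))

results = [infer_infill_task(s) for s in (categoric, continuous, integer_ids)]
print(results)  # ['classification', 'regression', 'regression']
```

The third case shows where the rule can misfire: a column of integer identifiers or counts with mostly unique values is routed to regression, while a low-cardinality integer measurement is routed to classification.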