# Introduction

Notebook adapted for bank marketing dataset by DA Feb 2020

# Prerequisites

Before proceeding with the demonstration we'll conduct a few data preparations. Note that Automunge needs the following prerequisites to operate:

- tabular data in Pandas dataframe or Numpy array format
- "tidy data" (meaning one feature per column and one observation per row)
- if available, a label column included in the set, with its column name passed to the function as a string
- a "train" data set intended to train a machine learning model and if available a "test" set intended to generate predictions from the same model
- the train and test data must have consistently formatted data and consistent column headers
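
The last two prerequisites are easy to check up front. Here's a minimal sketch of such a check; the helper name `check_consistent_columns` and the toy dataframes are my own assumptions for illustration, not part of Automunge.

```python
import pandas as pd

def check_consistent_columns(df_train, df_test, labels_column=None):
    """Return (missing_from_test, extra_in_test) lists of column headers.

    Hypothetical helper for illustration only, not an Automunge function.
    The label column, if named, is ignored on both sides.
    """
    train_cols = [c for c in df_train.columns if c != labels_column]
    test_cols = [c for c in df_test.columns if c != labels_column]
    missing = [c for c in train_cols if c not in test_cols]
    extra = [c for c in test_cols if c not in train_cols]
    return missing, extra

# Toy frames standing in for real train / test data
toy_train = pd.DataFrame({'age': [30, 41], 'job': ['admin.', 'technician'], 'y': [0, 1]})
toy_test = pd.DataFrame({'age': [25], 'job': ['services']})

missing, extra = check_consistent_columns(toy_train, toy_test, labels_column='y')
print(missing, extra)
```

Two empty lists mean the headers line up once the label column is set aside.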

OK, with introductions complete, let's go ahead and manually munge to meet these requirements.

# Data imports and preliminary munging

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os

In [15]:
#First set the file paths

train_transaction_filepath = "bank_marketing_train_classification.csv"
test_transaction_filepath = "bank_marketing_test_withlabel.csv"

In [16]:
#Now let's import them as dataframes. Both CSV files carry their row index in the
#first column, so we'll read that column in as the dataframe index (index_col=0).
#Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer
#versions pass on_bad_lines='skip' instead.

train_transaction = pd.read_csv(train_transaction_filepath, error_bad_lines=False, index_col=0)
test_transaction = pd.read_csv(test_transaction_filepath, error_bad_lines=False, index_col=0)

In [17]:
#train_transaction.rename(columns={'Unnamed: 0': 'id'}, inplace=True) # DA use index_col=0 on read instead
#test_transaction.rename(columns={'Unnamed: 0': 'id'}, inplace=True) # DA

from sklearn.model_selection import train_test_split
big_train, tiny_train = train_test_split(train_transaction, test_size=0.2, random_state=42) # small_train 0.002
big_test, tiny_test = train_test_split(test_transaction, test_size=0.2, random_state=42) # small_test 0.002

print(list(tiny_train))
print("")
print("tiny_train.shape = ", tiny_train.shape)
print("big_train.shape = ", big_train.shape)
print("")
print(list(tiny_test))
print("")
print("tiny_test.shape = ", tiny_test.shape)
print("big_test.shape = ", big_test.shape)

['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y']

tiny_train.shape =  (6590, 21)
big_train.shape =  (26360, 21)

['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y']

tiny_test.shape =  (1648, 21)
big_test.shape =  (6590, 21)


In [18]:
# DA
pd.value_counts(tiny_test['y'].values, sort=False)

0    1482
1     166
dtype: int64

In [7]:
# DA
pd.value_counts(tiny_train['y'].values, sort=False)

0    5834
1     756
dtype: int64

# Automunge install and initialize

In [7]:
#Ok here's where we import our tool with pip install. Note that this step requires  
#access to the internet. (Note this import procedure changed with version 2.58.)

# ! pip install Automunge

# #or to upgrade (we currently roll out upgrades pretty frequently)
# ! pip install Automunge --upgrade

Collecting Automunge
  Downloading https://files.pythonhosted.org/packages/60/99/d126fc4e3adf4ca3554acce00b94a4a5e49e4e7da4bda56cd9ec81c608f1/Automunge-2.58-py3-none-any.whl (208kB)
Installing collected packages: Automunge
Successfully installed Automunge-2.58


In [8]:
#And then we initialize the class.

from Automunge import Automunger
am = Automunger.AutoMunge()

# Ok let's give it a shot

Well, at the risk of overwhelming the reader, I'm just going to throw out a full application. Basically, we pass the train set and, if available, a consistently formatted test set to the function, and it returns normalized and numerically encoded sets suitable for the direct application of machine learning. The function returns a series of sets (some of which may be empty, depending on the options selected). I find it helps to just copy and paste the full range of arguments and returned sets from the documentation for each application.

In [26]:
#So first let's just try a generic application with our tiny_train set. Note tiny_train here
#represents our train set. If a labels column is available we should include and designate, 
#and any columns we want to exclude from processing we can designate as "ID columns" which
#will be carved out and consistently shuffled and partitioned.

#Note here we're only demonstrating on the set with the reduced number of rows to save time.


train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'y', trainID_column = False, \
            testID_column = False, valpercent1=0.0, valpercent2 = 0.0, \
            shuffletrain = False, TrainLabelFreqLevel = False, powertransform = False, \
            binstransform = False, MLinfill = False, infilliterate=1, randomseed = 42, \
            numbercategoryheuristic = 15, pandasoutput = True, NArw_marker = False, \
            featureselection = True, featurepct = 1.0, featuremetric = .02, \
            featuremethod = 'pct', PCAn_components = None, PCAexcl = [], \
            ML_cmnd = {'MLinfill_type':'default', \
                       'MLinfill_cmnd':{'RandomForestClassifier':{}, 'RandomForestRegressor':{}}, \
                       'PCA_type':'default', \
                       'PCA_cmnd':{}}, \
            assigncat = {'mnmx':[], 'mnm2':[], 'mnm3':[], 'mnm4':[], 'mnm5':[], 'mnm6':[], \
                         'nmbr':[], 'nbr2':[], 'nbr3':[], 'MADn':[], 'MAD2':[], 'MAD3':[], \
                         'bins':[], 'bint':[], \
                         'bxcx':[], 'bxc2':[], 'bxc3':[], 'bxc4':[], \
                         'log0':[], 'log1':[], 'pwrs':[], \
                         'bnry':[], 'text':[], 'ordl':[], 'ord2':[], \
                         'date':[], 'dat2':[], 'wkdy':[], 'bshr':[], 'hldy':[], \
                         'excl':[], 'exc2':[], 'exc3':[], 'null':[], 'eval':[]}, \
            assigninfill = {'stdrdinfill':[], 'MLinfill':[], 'zeroinfill':[], 'oneinfill':[], \
                            'adjinfill':[], 'meaninfill':[], 'medianinfill':[]}, \
            transformdict = {}, processdict = {}, \
            printstatus = True)


_______________
Begin Feature Importance evaluation

_______________
Begin Automunge processing

error, non integer index passed without columns named
error, non integer index passed without columns named
evaluating column:  age
processing column:  age
    root category:  nmbr
 returned columns:
['age_nmbr']

evaluating column:  job
processing column:  job
    root category:  1010
 returned columns:
['job_1010_3', 'job_1010_2', 'job_1010_1', 'job_1010_0']

evaluating column:  marital
processing column:  marital
    root category:  1010
 returned columns:
['marital_1010_0', 'marital_1010_2', 'marital_1010_1']

evaluating column:  education
processing column:  education
    root category:  1010
 returned columns:
['education_1010_3', 'education_1010_1', 'education_1010_2', 'education_1010_0']

evaluating column:  default
processing column:  default
    root category:  1010
 returned columns:
['default_1010_1', 'default_1010_0']

evaluating column:  housing
processing column:  housing
   

In [28]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6590 entries, 37286 to 32558
Data columns (total 37 columns):
age_nmbr               6590 non-null float32
job_1010_0             6590 non-null int8
job_1010_1             6590 non-null int8
job_1010_2             6590 non-null int8
job_1010_3             6590 non-null int8
marital_1010_0         6590 non-null int8
marital_1010_1         6590 non-null int8
marital_1010_2         6590 non-null int8
education_1010_0       6590 non-null int8
education_1010_1       6590 non-null int8
education_1010_2       6590 non-null int8
education_1010_3       6590 non-null int8
default_1010_0         6590 non-null int8
default_1010_1         6590 non-null int8
housing_1010_0         6590 non-null int8
housing_1010_1         6590 non-null int8
loan_1010_0            6590 non-null int8
loan_1010_1            6590 non-null int8
contact_bnry           6590 non-null int8
month_1010_0           6590 non-null int8
month_1010_1           6590 non-null int8
mon

In [23]:
new_train = pd.concat([train, labels], axis=1)
# new_test = pd.concat([test, test_labels], axis=1)

In [24]:
new_train.to_csv('bank_marketing_train_forML_Automunge.csv')
# new_test.to_csv('bank_marketing_test_forML_Automunge.csv')

So what's going on here is we're calling the function am.automunge and assigning the returned sets to a series of objects:
```
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
```

Again we don't have to include all of the parameters when calling the function, but I find it helpful just to copy and paste them all. For example if we just wanted to defer to defaults we could just call:
```
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train)
```

Those sets returned from the function call are as follows:

- __train, trainID, labels__ : these are the sets intended to train a machine learning model. (The ID set is simply any columns we wanted to exclude from transformations comparably partitioned and shuffled)
- __validation1, validationID1, validationlabels1__ : these are sets carved out from the train set intended for hyperparameter tuning validation based on the designated validation1 ratio (defaults to 0.0)
- __validation2, validationID2, validationlabels2__ : these are sets carved out from the train set intended for final model validation based on the designated validation2 ratio (defaults to 0.0)
- __test, testID, testlabels__ : these are the sets derived from any passed test set intended to generate predictions from the machine learning model trained from the train set, consistently processed as the train set
- __labelsencoding_dict__ : this is a dictionary which may prove useful for reverse encoding predictions generated from the machine learning model to be trained from the train set
- __finalcolumns_train, finalcolumns_test__ : a list of the columns returned from the transformation, may prove useful in case one wants to ensure consistent column labeling which is required for subsequent processing of any future test data
- __featureimportance__ : this stores the results of the feature importance evaluation if user elects to conduct
- __postprocess_dict__ : this dictionary should be saved as it may be used as an input to the postmunge function to consistently process any subsequently available test data

Let's take a look at a few items of interest from the returned sets.

Notice that the returned sets now include a suffix appended to each column name. These suffixes identify what types of transformations were performed. Here we see a few different types of suffixes:

In [10]:
#suffixes identifying steps of transformation
list(train)

['age_nmbr',
 'job_1010_0',
 'job_1010_1',
 'job_1010_2',
 'job_1010_3',
 'marital_1010_0',
 'marital_1010_1',
 'marital_1010_2',
 'education_1010_0',
 'education_1010_1',
 'education_1010_2',
 'education_1010_3',
 'default_1010_0',
 'default_1010_1',
 'housing_1010_0',
 'housing_1010_1',
 'loan_1010_0',
 'loan_1010_1',
 'contact_bnry',
 'month_1010_0',
 'month_1010_1',
 'month_1010_2',
 'month_1010_3',
 'day_of_week_1010_0',
 'day_of_week_1010_1',
 'day_of_week_1010_2',
 'duration_nmbr',
 'campaign_nmbr',
 'pdays_nmbr',
 'previous_nmbr',
 'poutcome_1010_0',
 'poutcome_1010_1',
 'emp_var_rate_nmbr',
 'cons_price_idx_nmbr',
 'cons_conf_idx_nmbr',
 'euribor3m_nmbr',
 'nr_employed_nmbr']

In [11]:
#And here's what the returned data looks like.
train.head()

| | age_nmbr | job_1010_0 | job_1010_1 | job_1010_2 | job_1010_3 | marital_1010_0 | marital_1010_1 | marital_1010_2 | education_1010_0 | education_1010_1 | ... | campaign_nmbr | pdays_nmbr | previous_nmbr | poutcome_1010_0 | poutcome_1010_1 | emp_var_rate_nmbr | cons_price_idx_nmbr | cons_conf_idx_nmbr | euribor3m_nmbr | nr_employed_nmbr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12447 | -0.867146 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | ... | -0.211589 | 0.196069 | -0.359965 | 1 | 0 | 0.853047 | -0.225669 | 0.970467 | 0.782994 | 0.852924 |
| 25489 | 0.755016 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | ... | -0.569394 | -5.10029 | 5.759129 | 1 | 1 | -0.748378 | 2.060937 | -2.231219 | -1.474193 | -2.803102 |
| 567 | 0.468752 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | ... | -0.569394 | -5.116291 | 1.679733 | 1 | 1 | -1.196777 | -1.177989 | -1.229331 | -1.353633 | -0.930166 |
| 38408 | 0.755016 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | ... | 1.577441 | 0.196069 | -0.359965 | 1 | 0 | 0.853047 | -0.225669 | 0.970467 | 0.785301 | 0.852924 |
| 8815 | 0.182488 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0.146217 | 0.196069 | -0.359965 | 1 | 0 | 0.853047 | 1.538976 | -0.271003 | 0.78184 | 0.852924 |


Upon inspection:
- job, marital, education, default, housing, loan, month, day_of_week, and poutcome each have a series of '1010' suffixes, which represent the columns of a binary encoding of a categorical set (note that a single row can have more than one of these columns activated, unlike a one-hot encoding)
- each of age, duration, campaign, pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, and nr_employed has the suffix 'nmbr', which represents a z-score normalization
- contact has the suffix 'bnry', which represents a binary (0/1) encoding
- no column in this set received the suffix 'ordl', which would represent an ordinal (integer) encoding
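
To make the '1010' columns in the listing above a bit more concrete, here's a rough sketch of how a multi-column binary encoding can be built. This is only the general technique in plain Python: the function name `binary_encode` is mine, and the ordering of categories and any handling of missing values are assumptions that may well differ from what Automunge does internally.

```python
def binary_encode(values):
    """Encode categories as fixed-width lists of 0/1 bits (illustrative only).

    Categories are numbered in sorted order, which is an assumption;
    Automunge's own ordering and treatment of missing values may differ.
    """
    categories = sorted(set(values))
    # number of bits needed to give every category a distinct code
    width = max(1, (len(categories) - 1).bit_length())
    codes = {cat: [int(bit) for bit in format(i, f'0{width}b')]
             for i, cat in enumerate(categories)}
    return [codes[v] for v in values], width

marital = ['married', 'single', 'divorced', 'married']
encoded, width = binary_encode(marital)
print(width)    # 2
print(encoded)  # [[0, 1], [1, 0], [0, 0], [0, 1]]
```

The appeal over one-hot encoding is fewer columns: the column count grows with the logarithm of the number of categories rather than with the number itself.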

Automunge uses suffix appenders to track the steps of transformations. For example, one could assign transformations to a column which resulted in multiple suffix appenders, such as say 'column1_bxcx_nmbr', which would represent a column with original header 'column1' upon which was performed two steps of transformation, a box-cox power law transform followed by a z-score normalization.
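
If you ever need to trace a returned column back to its source, the suffixes can be peeled off in code. The sketch below is my own helper (`split_suffixes` is a hypothetical name, not an Automunge function); it assumes we still have the list of original headers, since headers like 'day_of_week' contain underscores themselves.

```python
def split_suffixes(returned_column, original_columns):
    """Split a returned header into (original header, list of suffix steps).

    Picks the longest original header that prefixes the returned name,
    because original headers may themselves contain underscores.
    """
    matches = [c for c in original_columns
               if returned_column == c or returned_column.startswith(c + '_')]
    if not matches:
        raise ValueError(f'no original column matches {returned_column!r}')
    source = max(matches, key=len)
    remainder = returned_column[len(source):].lstrip('_')
    steps = remainder.split('_') if remainder else []
    return source, steps

original = ['age', 'day_of_week', 'cons_price_idx', 'column1']
print(split_suffixes('column1_bxcx_nmbr', original))   # ('column1', ['bxcx', 'nmbr'])
print(split_suffixes('day_of_week_1010_2', original))  # ('day_of_week', ['1010', '2'])
```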

# Labels

When we conducted the transformation we also designated a label column which was included in the set, so let's take a peek at the returned labels.

In [33]:
list(labels)

'y_bnry'

In [34]:
#as you can see the returned values on the labels column are consistently encoded
#as were passed
labels[list(labels)[0]].unique() # DA

array([0, 1], dtype=int64)

In [35]:
#Note that if our original labels weren't yet binary encoded, we could inspect the 
#returned labelsencoding_dict object to determine the basis of encoding.

#Here we just see that the 1 value originated from values 1, and the 0 value
#originated from values 0 - a trivial example, but this could be helpful if
#we had passed a column containing values ['cat', 'dog'] for instance.

labelsencoding_dict

{'bnry': {'y_bnry': {'missing': 0,
   1: 1,
   0: 0,
   'extravalues': [],
   'oneratio': 0.11471927162367224,
   'zeroratio': 0.8852807283763278}}}
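
For the ['cat', 'dog'] case mentioned in the comment above, reverse encoding could look something like the sketch below. Note the assumptions: the dictionary here is made up, and I'm assuming the layout matches the output shown above, with the integer keys 1 and 0 holding the original values that were encoded as 1 and 0. The layout may differ between Automunge versions, so inspect your own labelsencoding_dict first.

```python
def decode_binary_labels(predictions, encoding_entry):
    """Map encoded 0/1 predictions back to the original label values.

    Assumes the integer keys 1 and 0 of the entry hold the original
    values that were encoded as 1 and 0 (an assumption, see above).
    """
    return [encoding_entry[int(p)] for p in predictions]

# Hypothetical dictionary for a label column 'pet' that held 'cat' / 'dog'
example_dict = {'bnry': {'pet_bnry': {1: 'dog', 0: 'cat', 'extravalues': []}}}

decoded = decode_binary_labels([0, 1, 1], example_dict['bnry']['pet_bnry'])
print(decoded)  # ['cat', 'dog', 'dog']
```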

In [29]:
for keys,values in featureimportance.items():
    print(keys)
    print('metric = ', values['metric'])
    print('metric2 = ', values['metric2'])
    print()

age_nmbr
metric =  0.004137931034482678
metric2 =  0.0

job_1010_0
metric =  0.00505747126436773
metric2 =  0.0055172413793103114

job_1010_1
metric =  0.00505747126436773
metric2 =  0.0036781609195402076

job_1010_2
metric =  0.00505747126436773
metric2 =  0.006896551724137834

job_1010_3
metric =  0.00505747126436773
metric2 =  0.003218390804597626

marital_1010_0
metric =  0.004137931034482678
metric2 =  0.0036781609195402076

marital_1010_1
metric =  0.004137931034482678
metric2 =  0.0027586206896551557

marital_1010_2
metric =  0.004137931034482678
metric2 =  0.003218390804597626

education_1010_0
metric =  0.0018390804597701038
metric2 =  0.0018390804597701038

education_1010_1
metric =  0.0018390804597701038
metric2 =  0.0022988505747125743

education_1010_2
metric =  0.0018390804597701038
metric2 =  0.0018390804597701038

education_1010_3
metric =  0.0018390804597701038
metric2 =  0.0027586206896551557

default_1010_0
metric =  0.0009195402298850519
metric2 =  0.001379310344827

# Subsequent consistent processing with postmunge(.)

Another important object returned from the automunge application is what we call the "postprocess_dict". In fact, good practice is to always save externally any postprocess_dict returned from an application of automunge whose output was used to train a machine learning model. Why? Well, using this postprocess_dict object, we can pass any subsequently available "test" data that we want to use to generate predictions from that machine learning model to postmunge, giving fully consistent processing and encoding. Let's demonstrate. 
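
On the "save externally" point, one common way to do it is with Python's built-in pickle module. This is just a sketch: the dictionary below is a made-up stand-in, the file name and temp-directory location are arbitrary, and I'm assuming the real postprocess_dict pickles cleanly.

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary returned by automunge (contents are made up)
postprocess_dict = {'example_key': 'example_value'}

# Arbitrary location for the demo; in practice pick somewhere durable
path = os.path.join(tempfile.gettempdir(), 'postprocess_dict.pickle')

# Save the dictionary to disk
with open(path, 'wb') as handle:
    pickle.dump(postprocess_dict, handle)

# ...and later, load it back before calling postmunge
with open(path, 'rb') as handle:
    restored_dict = pickle.load(handle)

print(restored_dict == postprocess_dict)  # True
```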

When we performed a train_test_split above to derive the "tiny_train" set, we also ended up with a bigger set called "big_train", along with "tiny_test" and "big_test" split from the test file. Let's try applying the postmunge function to consistently process one of these.

Note a few pre-requisites for the application of postmunge:

- requires passing a postprocess_dict that was derived from the application of automunge
- consistently formatted data as the train set used in the application of automunge from which the postprocess_dict was derived
- consistent column labeling as the train set used in the application of automunge from which the postprocess_dict was derived

And there we have it, let's demonstrate the postmunge function on the set "tiny_test" we prepared above.

In [22]:
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, tiny_test, testID_column = False, \
             labelscolumn = False, pandasoutput=True, printstatus=True)

_______________
Begin Postmunge processing

error, non integer index passed without columns named
error, different number of original columns in train and test sets


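TypeError: cannot unpack non-iterable NoneType object

Note that this call fails. The likely cause is that tiny_test still contains the label column 'y' while labelscolumn is passed as False, so postmunge counts 21 original columns against the 20 feature columns seen by automunge. Passing labelscolumn = 'y', or dropping the 'y' column before the call, should bring the column counts into agreement.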

In [37]:
#And if we're doing our job right then this set should be formatted exactly like that returned
#from automunge, let's take a look.

test.head()

| | id_nmbr | age_nmbr | job_1010_0 | job_1010_1 | job_1010_2 | job_1010_3 | marital_1010_0 | marital_1010_1 | marital_1010_2 | education_1010_0 | ... | campaign_nmbr | pdays_nmbr | previous_nmbr | poutcome_1010_0 | poutcome_1010_1 | emp_var_rate_nmbr | cons_price_idx_nmbr | cons_conf_idx_nmbr | euribor3m_nmbr | nr_employed_nmbr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 706 | -1.07954 | 0.564173 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | -0.569394 | 0.196069 | -0.359965 | 1 | 0 | 0.853047 | 1.538976 | -0.271003 | 0.781263 | 0.852924 |
| 5968 | 0.357053 | -0.867146 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | ... | -0.569394 | 0.196069 | -0.359965 | 1 | 0 | 0.853047 | 0.593569 | -0.467024 | 0.779533 | 0.852924 |
| 1665 | 1.366708 | -1.15341 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | -0.211589 | 0.196069 | -0.359965 | 1 | 0 | 0.660876 | 0.724923 | 0.905127 | 0.723002 | 0.340113 |
| 6676 | -1.653569 | -1.344253 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | ... | 0.861829 | 0.196069 | -0.359965 | 1 | 0 | 0.660876 | 0.724923 | 0.905127 | 0.723002 | 0.340113 |
| 5606 | 0.091798 | -0.676304 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | -0.569394 | 0.196069 | -0.359965 | 1 | 0 | 0.853047 | 0.593569 | -0.467024 | 0.782417 | 0.852924 |


In [17]:
#Looks good! 

#So if we wanted to generate predictions from a machine learning model trained 
#on a train set processed with automunge, we now have a way to consistently 
#prepare data with postmunge.

# Let's explore a (few) of the automunge parameters

Ok let's take a look at a few of the optional methods available here. First here again is what a full automunge call looks like:

```
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(df_train, df_test = False, labels_column = False, trainID_column = False, \
            testID_column = False, valpercent1=0.0, valpercent2 = 0.0, \
            shuffletrain = False, TrainLabelFreqLevel = False, powertransform = False, \
            binstransform = False, MLinfill = False, infilliterate=1, randomseed = 42, \
            numbercategoryheuristic = 15, pandasoutput = True, NArw_marker = True, \
            featureselection = False, featurepct = 1.0, featuremetric = .02, \
            featuremethod = 'pct', PCAn_components = None, PCAexcl = [], \
            ML_cmnd = {'MLinfill_type':'default', \
                       'MLinfill_cmnd':{'RandomForestClassifier':{}, 'RandomForestRegressor':{}}, \
                       'PCA_type':'default', \
                       'PCA_cmnd':{}}, \
            assigncat = {'mnmx':[], 'mnm2':[], 'mnm3':[], 'mnm4':[], 'mnm5':[], 'mnm6':[], \
                         'nmbr':[], 'nbr2':[], 'nbr3':[], 'MADn':[], 'MAD2':[], 'MAD3':[], \
                         'bins':[], 'bint':[], \
                         'bxcx':[], 'bxc2':[], 'bxc3':[], 'bxc4':[], \
                         'log0':[], 'log1':[], 'pwrs':[], \
                         'bnry':[], 'text':[], 'ordl':[], 'ord2':[], \
                         'date':[], 'dat2':[], 'wkdy':[], 'bshr':[], 'hldy':[], \
                         'excl':[], 'exc2':[], 'exc3':[], 'null':[], 'eval':[]}, \
            assigninfill = {'stdrdinfill':[], 'MLinfill':[], 'zeroinfill':[], 'oneinfill':[], \
                            'adjinfill':[], 'meaninfill':[], 'medianinfill':[]}, \
            transformdict = {}, processdict = {}, \
            printstatus = True)
```

So let's just go through these one by one. (This section is kind of diving into the weeds, not required reading)

__df_train__ and __df_test__ First note that if we want we can pass two different pandas dataframe sets to automunge, such as might be beneficial if we have one set with labels (a "train" set) and one set without (a "test" set). Note that the normalization parameters are all derived just from the train set, and applied for consistent processing of any test set if included. Again a prerequisite is that any train and test set must have consistently labeled columns and consistently formatted data, with the exception of any designated "ID" columns or "label" columns, which will be carved out and consistently shuffled and partitioned. Note too that we can pass these sets with a non-integer-range index or even multi-column indexes, in which case the index columns will be carved out and returned as part of the ID sets, consistently shuffled and partitioned. If we only want to process a train set we can pass the test set as "False".

__labels_column__ is intended for passing string identifiers of a column that will be treated as labels. Note that some of the methods require the inclusion of labels, such as feature importance evaluation or label frequency levelizer (for oversampling rows with lower frequency labels).

__trainID_column__ and __testID_column__ are intended for passing strings or lists of strings identifying columns that will be carved out before processing and consistently shuffled and partitioned. 

__valpercent1__ and __valpercent2__ parameters are intended as floats between 0-1 that indicate the ratio of the sets that will be carved out for the two validation sets. If shuffletrain is activated then the sets will be carved out randomly, else they will be taken from the bottom sequential rows of the train set and randomly partitioned between the two validation sets. Note that these values default to 0.

__shuffletrain__ parameter indicates whether the train set will be (can you guess?) yep you were right the answer is shuffled.

__TrainLabelFreqLevel__ parameter indicates whether the train set will have the oversampling method applied where rows with lower frequency labels are copied for more equal distribution of labels, such as might be beneficial for oversampling in the training operation.

__powertransform__ parameter indicates whether default numerical column evaluation will include an inference of distribution properties to assign between z-score normalization, min-max scaling, or box-cox power law transformation. Note this one is still somewhat rough around the edges and we will continue to refine methods going forward.

__binstransform__ indicates whether default z-score normalization application will include the development of bins sets identifying a point's placement with respect to number of standard deviations from the mean.

__MLinfill__ indicates whether default infill methods will predict infill for missing points using machine learning models trained on the rest of the set in generalized and automated fashion. Note that this method benefits from increased scale of data in the train set, and models derived from the train set are used for consistent prediction methods for the test set.

__infilliterate__ an integer indicating how many times the predictive methods for MLinfill will be iterated, such as may be beneficial for particularly messy data.

__randomseed__ seed for randomness for all of the random seeded methods such as predictive algorithms for ML infill, feature importance, PCA, shuffling, etc.

__numbercategoryheuristic__ an integer indicating for categorical sets the threshold between processing with one-hot-encoding vs ordinal methods

__pandasoutput__ quite simply True means returned sets are pandas dataframes, False means Numpy arrays (defaults to Numpy arrays)

__NArw_marker__ indicates whether returned columns will include a derived column indicating rows that were subject to infill (can be identified with the suffix "NArw")

__featureselection__ indicates whether a feature importance evaluation will be performed (using the shuffle permutation method). Note this requires the inclusion of a designated labels column in the train set. Results are presented in the returned object "featureimportance".
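
For anyone unfamiliar with the shuffle permutation idea, here's a bare-bones sketch in plain Python: shuffle one column at a time and see how much a model's accuracy drops. Everything in it (the toy data, the stand-in `predict` function, the helper names) is made up for illustration, and it says nothing about how Automunge derives its own "metric" and "metric2" values.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the prediction matches the label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(labels)

def permutation_importance(predict, rows, labels, seed=42):
    """Accuracy drop observed when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        shuffled_col = [row[col] for row in rows]
        rng.shuffle(shuffled_col)
        shuffled_rows = [row[:col] + [value] + row[col + 1:]
                         for row, value in zip(rows, shuffled_col)]
        importances.append(baseline - accuracy(predict, shuffled_rows, labels))
    return importances

# Toy data: the label depends only on column 0, column 1 is noise
rows = [[i, (i * 7) % 5] for i in range(20)]
labels = [1 if row[0] >= 10 else 0 for row in rows]

def predict(row):
    """Stand-in for a trained model; it only ever looks at column 0."""
    return 1 if row[0] >= 10 else 0

importances = permutation_importance(predict, rows, labels)
print(importances)
```

Shuffling the column the model relies on hurts accuracy, while shuffling the noise column changes nothing, so its importance comes out as exactly zero.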

__featurepct__ if feature selection performed and featuremethod == 'pct', indicates what percent of columns will be retained from the feature importance dimensionality reduction (columns are ranked by importance and the low percent are trimmed). Note that a value of 1.0 means no trimming will be done.

__featuremetric__ if feature selection performed and featuremethod == 'metric', indicates what threshold of importance metric will be required for retained columns from the feature importance dimensionality reduction (columns feature importance metrics are derived and those below this threshold are trimmed). Note that a value of 0.0 means no trimming will be done.

__featuremethod__ accepts values of 'pct' or 'metric', indicating the method used for any feature importance dimensionality reduction

__PCAn_components__ Triggers PCA dimensionality reduction when != None. Can be a float indicating percent of columns to retain in PCA or an integer indicating number of columns to retain. The tool evaluates whether set is suitable for kernel PCA, sparse PCA, or PCA. Alternatively, a user can assign a desired PCA method in the ML_cmnd['PCA_type']. Note that a value of None means no PCA dimensionality reduction will be performed unless the scale of data is below a heuristic based on the number of features. (A user can also just turn off default PCA with ML_cmnd['PCA_type'])

__PCAexcl__ a list of any columns to be excluded from PCA transformations

__ML_cmnd__ allows a user to pass parameters to the predictive algorithms used in ML infill, feature importance, and PCA (I won't go into full detail here, although note one handy feature is we can tell the algorithm to exclude boolean columns from PCA which is useful)

__assigncat__ allows a user to assign distinct columns to different processing methods, for those columns that they don't want to defer to default automated processing. For example a user could designate columns for min-max scaling instead of z-score, or box-cox power law transform, or you know we've got a whole library of methods that we're continuing to build out. These are defined in our READ ME. Simply pass the column header string identifier to the list associated with any of these root categories.

__assigninfill__ allows a user to assign distinct columns to different infill methods for missing or improperly formatted data, for those columns that they don't want to defer to default automated infill, which could be either standard infill (mean to numerical sets, most common to binary, and boolean identifier to categorical), or ML infill if it was selected. 

__transformdict__ and __processdict__ allow a user to design custom trees of transformations or even custom processing functions such as documented in our essays that no one reads. Once defined a column can be assigned to these methods in the assigncat.

__printstatus__ You know, like, prints the status during operation. Self-explanatory!


Now we'll demonstrate a few.

# trainID_column, shuffletrain, valpercent1

In [18]:
#great well let's try a few of these out. How about the ID columns, let's see what happens when we pass one.
#Let's just pick an arbitrary one, TransactionDT

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', trainID_column = 'TransactionDT', \
             valpercent1=0.20, shuffletrain = True, pandasoutput=True, printstatus = False)

In [19]:
#Now we'll find that the TransactionDT column is missing from the train set, left 
#unaltered instead in the ID set, paired with the Transaction ID which was put
#in the ID set because it was a non-integer range index column (thus if we wanted
#to reassign the original index column we could simply copy the TransactionID column
#from the ID set back to the processed train set)

trainID.head()

| | TransactionDT | TransactionID |
|---|---|---|
| 0 | 6621556 | 3259400 |
| 1 | 12432282 | 3466055 |
| 2 | 7308608 | 3282364 |
| 3 | 8986607 | 3349548 |
| 4 | 7419202 | 3287303 |


In [20]:
#note that since our automunge call included a validation ratio, we'll find 
#a portion of the sets partitioned in the validation sets, here for instance
#is the validation ID set

#(we'll also find returned sets in the validation1, and validationlabels1)

#note that since we activated the shuffletrain option these are randomly
#selected from the train set

validationID1.head()

| | TransactionDT | TransactionID |
|---|---|---|
| 0 | 10134372 | 3388943 |
| 1 | 6131082 | 3243230 |
| 2 | 5529859 | 3220899 |
| 3 | 10863840 | 3416676 |
| 4 | 12881202 | 3480925 |
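
To give a feel for what a shuffled 20% validation carve-out amounts to, here's a small pandas sketch. The helper `carve_validation` and the toy dataframe are assumptions of mine for illustration; Automunge's own partitioning may differ in detail.

```python
import pandas as pd

def carve_validation(df, valpercent, shuffle=True, seed=42):
    """Split off the last `valpercent` share of rows as a validation set.

    With shuffle=True the rows are shuffled first, so the carve-out is
    random; otherwise it is simply the bottom rows of the set.
    """
    if shuffle:
        df = df.sample(frac=1.0, random_state=seed)
    n_val = int(round(len(df) * valpercent))
    split = len(df) - n_val
    return df.iloc[:split], df.iloc[split:]

toy = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})
train_part, val_part = carve_validation(toy, valpercent=0.20)
print(train_part.shape, val_part.shape)  # (8, 2) (2, 2)
```

Because the original index travels with each row, the two parts can always be traced back to the source rows, which is the same role the ID sets play above.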


# TrainLabelFreqLevel

In [21]:
#Let's take a look at TrainLabelFreqLevel, which serves to copy rows such as to
#(approximately) levelize the frequency of labels found in the set.

#First let's look at the shape of a train set returned from an automunge
#application without this option selected

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', TrainLabelFreqLevel=False, \
             pandasoutput=True, printstatus=False)

print("train.shape = ", train.shape)

train.shape =  (1182, 37)


In [22]:
#OK now let's try again with the option selected. If there was a material discrepancy in label frequency
#we should see more rows included in the returned set

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', TrainLabelFreqLevel=True, \
             pandasoutput=True, printstatus=False)

print("train.shape = ", train.shape)

train.shape =  (2302, 37)
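
The jump in row count is what you would expect from oversampling the rarer label. Here is a minimal sketch of that idea in plain Python; `levelize_labels` is a hypothetical helper, and the whole-number multiplier is an assumption made for illustration rather than a description of how Automunge computes its copies.

```python
from collections import Counter

def levelize_labels(rows, labels):
    """Toy oversampling: repeat rows of rarer labels until counts are roughly level."""
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = [], []
    for row, label in zip(rows, labels):
        # whole-number multiplier, so the leveling is only approximate
        copies = round(target / counts[label])
        out_rows.extend([row] * copies)
        out_labels.extend([label] * copies)
    return out_rows, out_labels

rows = list(range(10))
labels = [0] * 8 + [1] * 2          # imbalanced: eight of label 0, two of label 1
new_rows, new_labels = levelize_labels(rows, labels)
print(Counter(new_labels))
```

Because rows are duplicated rather than synthesized, this kind of leveling belongs on the train set only.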


# binstransform

In [23]:
#binstransform just means that default numerical sets will include an additional set of bins identifying
#number of standard deviations from the mean. We have to be careful with this one if we don't have a lot
#of data as it adds a fair bit of dimensionality

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', binstransform=True, \
             pandasoutput=True, printstatus=False)

print("list(train):")
list(train)

list(train):


['dist2_nmbr_bint_t<-2',
 'dist2_nmbr_bint_t-21',
 'dist2_nmbr_bint_t-10',
 'dist2_nmbr_bint_t+01',
 'dist2_nmbr_bint_t+12',
 'dist2_nmbr_bint_t>+2',
 'dist1_nmbr_bint_t<-2',
 'dist1_nmbr_bint_t-21',
 'dist1_nmbr_bint_t-10',
 'dist1_nmbr_bint_t+01',
 'dist1_nmbr_bint_t+12',
 'dist1_nmbr_bint_t>+2',
 'addr2_26.0',
 'addr2_60.0',
 'addr2_87.0',
 'addr1_nmbr_bint_t<-2',
 'addr1_nmbr_bint_t-21',
 'addr1_nmbr_bint_t-10',
 'addr1_nmbr_bint_t+01',
 'addr1_nmbr_bint_t+12',
 'addr1_nmbr_bint_t>+2',
 'card5_nmbr_bint_t<-2',
 'card5_nmbr_bint_t-21',
 'card5_nmbr_bint_t-10',
 'card5_nmbr_bint_t+01',
 'card5_nmbr_bint_t+12',
 'card5_nmbr_bint_t>+2',
 'card4_american express',
 'card4_discover',
 'card4_mastercard',
 'card4_visa',
 'card3_nmbr_bint_t<-2',
 'card3_nmbr_bint_t-21',
 'card3_nmbr_bint_t-10',
 'card3_nmbr_bint_t+01',
 'card3_nmbr_bint_t+12',
 'card3_nmbr_bint_t>+2',
 'card2_nmbr_bint_t<-2',
 'card2_nmbr_bint_t-21',
 'card2_nmbr_bint_t-10',
 'card2_nmbr_bint_t+01',
 'card2_nmbr_bint_t+12'

In [24]:
#so the interpretation should be that columns with a suffix including "bint" indicate 
#bins for number of standard deviations from the mean. For example, nmbr_bint_t+01
#would indicate values between the mean and +1 standard deviation.
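
To make the bin suffixes concrete, here is a rough sketch of binning by standard deviations from the mean. The suffix strings are copied from the column names above; everything else (the function name, the use of population standard deviation, which side of each boundary is inclusive) is an assumption for illustration.

```python
import statistics

BIN_SUFFIXES = ['<-2', '-21', '-10', '+01', '+12', '>+2']

def zscore_bins(values):
    """One-hot encode each value by how many standard deviations it sits from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)     # assumes the values are not all identical
    edges = [-2, -1, 0, 1, 2]           # bin boundaries in units of std
    encoded = []
    for v in values:
        z = (v - mean) / std
        index = sum(z > edge for edge in edges)  # number of boundaries exceeded, 0..5
        encoded.append([1 if i == index else 0 for i in range(len(BIN_SUFFIXES))])
    return encoded

values = [0.0, 4.0, 6.0, 10.0]
bins = zscore_bins(values)
for v, row in zip(values, bins):
    print(v, dict(zip(BIN_SUFFIXES, row)))
```

Each source column turns into six boolean columns this way, which is where the added dimensionality comes from.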

# MLinfill

In [25]:
#So MLinfill changes the default infill method from standardinfill (which means mean for 
#numerical sets, most common for binary, and boolean marker for categorical), to a predictive
#method in which a machine learning model is trained for each column to predict infill based
#on properties of the rest of the set. This one's pretty neat, but a caution that it performs 
#better with more data, as you would expect.

#Let's demonstrate. First here's an application without MLinfill; we'll turn on the NArw_marker option
#to output an identifier of rows subject to infill

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', MLinfill=False, \
             NArw_marker=True, pandasoutput=True, printstatus=False)

print("train.head()")
train.head()

train.head()


Unnamed: 0,addr2_26.0,addr2_60.0,addr2_87.0,card4_american express,card4_discover,card4_mastercard,card4_visa,ProductCD_C,ProductCD_H,ProductCD_R,...,card6_bnry,addr1_NArw,addr1_nmbr,addr2_NArw,dist1_NArw,dist1_nmbr,dist2_NArw,dist2_nmbr,P_emaildomain_NArw,P_emaildomain_ordl
0,0,0,1,0,0,1,0,0,0,0,...,0,0,1.025535,0,1,0.0,1,0.0,0,2
1,0,0,1,0,0,0,1,0,0,0,...,0,0,1.977745,0,0,-0.523147,1,0.0,0,30
2,0,0,1,0,0,1,0,0,0,0,...,0,0,0.255207,0,0,-0.495666,1,0.0,0,10
3,0,0,0,0,0,0,1,1,0,0,...,0,1,0.0,1,1,0.0,1,0.0,0,10
4,0,0,1,0,0,0,1,0,0,0,...,0,0,0.084023,0,1,0.0,1,0.0,0,11


In [26]:
#So upon inspection it looks like we had a few infill points on
#columns originating from dist1 (as identified by the NArw columns)
#so let's focus on that

#As you can see the plug value here is just the mean which for a 
#z-score normalized set is 0

columns = ['dist1_nmbr', 'dist1_NArw']
train[columns].head()

Unnamed: 0,dist1_nmbr,dist1_NArw
0,0.0,1
1,-0.523147,0
2,-0.495666,0
3,0.0,1
4,0.0,1
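
Why the plug value comes out as exactly 0 can be seen in a few lines of plain Python. This is a sketch under stated assumptions, not Automunge code: `zscore_with_mean_infill` is a made-up name, and population standard deviation is used for simplicity.

```python
import math
import statistics

def zscore_with_mean_infill(values):
    """Return (normalized values, NArw marker) with missing entries set to the mean."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    mean = statistics.mean(present)
    std = statistics.pstdev(present)
    normalized, narw = [], []
    for v in values:
        missing = v is None or math.isnan(v)
        narw.append(1 if missing else 0)
        # the mean maps to 0 after z-score normalization, so 0.0 is the plug value
        normalized.append(0.0 if missing else (v - mean) / std)
    return normalized, narw

dist1 = [float('nan'), 2.0, 4.0, float('nan'), 6.0]
nmbr, narw = zscore_with_mean_infill(dist1)
print(nmbr)
print(narw)
```

The marker column is what lets a downstream model tell a genuine mean-valued entry (the 4.0 above) apart from an infilled one.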


In [27]:
#Now let's try with MLinfill

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', MLinfill=True, \
             NArw_marker=True, pandasoutput=True, printstatus=False)

print("train[columns].head()")
train[columns].head()

train[columns].head()


Unnamed: 0,dist1_nmbr,dist1_NArw
0,1.486171,1
1,-0.523147,0
2,-0.495666,0
3,-0.146145,1
4,-0.364578,1


In [28]:
#As you can see the method predicted a unique infill value to each row subject to infill
#(as identified by the NArw column). We didn't include a lot of data with this small demonstration
#set, so I expect the accuracy of this method would improve with a bigger set
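
The idea of predictive infill can be sketched without any libraries. Note the swap: the cells in this notebook reference random forest models for MLinfill, whereas the sketch below uses a one-feature least-squares line purely to stay dependency-free. `fit_line` and `ml_infill` are hypothetical names and the data is invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def ml_infill(feature, target):
    """Fill missing (None) targets with predictions from a model fit on the complete rows."""
    known = [(x, y) for x, y in zip(feature, target) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [slope * x + intercept if y is None else y
            for x, y in zip(feature, target)]

addr1 = [1.0, 2.0, 3.0, 4.0, 5.0]          # fully populated feature
dist1 = [2.0, None, 6.0, 8.0, None]        # column with gaps to fill
filled = ml_infill(addr1, dist1)
print(filled)
```

Each missing row gets its own prediction, unlike the single plug value of mean infill, and the quality of those predictions depends on how many complete rows are available to fit on.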

# numbercategoryheuristic

In [29]:
# numbercategoryheuristic just changes the threshold for number of unique values in a categorical set
#between processing a categorical set via one-hot encoding or ordinal processing (sequential integer encoding)

#for example consider the returned column for the email domain set in the data, if we look above we see the
#set was processed as ordinal, let's see why

print("number of unique values in P_emaildomain column pre-processing")
print(len(train['P_emaildomain_ordl']))

number of unique values in P_emaildomain column pre-processing
1182


In [30]:
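#Careful reading this one: len() on a column returns the number of rows (1182 here), not the number
#of unique values, and the ordinal encodings shown in the earlier train.head() output do repeat, so
#this does not show a unique entry per row (train['P_emaildomain_ordl'].nunique() would give the
#distinct count). A column that really did have a unique entry per row would not be a good candidate
#for inclusion at all and might be better served carved out into the ID set until such time as we can
#extract some info from it prior to processing. But the point is that the set was processed as ordinal
#because its number of unique values exceeded the numbercategoryheuristic threshold of 15; had we set
#the threshold to 1478 instead we could have derived as many as 1477 one-hot-encoded columns from a
#set like this, which obviously would be an issue for this scale of data.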
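
Here is a small sketch of the threshold decision in plain Python. It also prints the row count next to the distinct-value count, since the two are easy to confuse. `choose_encoding` is a hypothetical helper, the sample data is invented, and whether the real comparison is inclusive of the threshold is an assumption.

```python
def choose_encoding(values, numbercategoryheuristic=15):
    """Pick an encoding from the count of distinct values, not the count of rows."""
    unique_count = len(set(values))
    # assumption: at or below the threshold use one-hot, above it use ordinal
    return 'one-hot' if unique_count <= numbercategoryheuristic else 'ordinal'

card4 = ['visa', 'mastercard', 'visa', 'discover', 'visa']
emails = ['domain%d.com' % (i % 40) for i in range(200)]

print(len(emails), len(set(emails)))   # rows vs distinct values
print(choose_encoding(card4))
print(choose_encoding(emails))
```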

# pandasoutput

In [31]:
#pandasoutput just tells whether to return pandas dataframes or numpy arrays (defaults to numpy, which
#is a more universally eligible input to the different machine learning frameworks)

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud',  \
             pandasoutput=False, NArw_marker = False, printstatus=False)

print("type(train)")
print(type(train))

type(train)
<class 'numpy.ndarray'>


In [32]:
#note that if we return numpy arrays and want to view the column headers 
#(which remember track the steps of transformations in their suffix appenders)
#good news, that's available in the returned finalcolumns_train
print("finalcolumns_train")
finalcolumns_train

finalcolumns_train


['addr2_26.0',
 'addr2_60.0',
 'addr2_87.0',
 'card4_american express',
 'card4_discover',
 'card4_mastercard',
 'card4_visa',
 'ProductCD_C',
 'ProductCD_H',
 'ProductCD_R',
 'ProductCD_S',
 'ProductCD_W',
 'TransactionDT_nmbr',
 'TransactionAmt_nmbr',
 'card1_nmbr',
 'card2_nmbr',
 'card3_nmbr',
 'card5_nmbr',
 'card6_bnry',
 'addr1_nmbr',
 'dist1_nmbr',
 'dist2_nmbr',
 'P_emaildomain_ordl']

In [33]:
#or with pandasoutput = True

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud',  \
             pandasoutput=True, NArw_marker = True, printstatus=False)

print("type(train)")
print(type(train))

type(train)
<class 'pandas.core.frame.DataFrame'>


# NArw_marker

In [34]:
#The NArw marker helpfully outputs from each column a marker indicating what rows were
#subject to infill. Let's quickly demonstrate. First here again are the returned columns
#without this feature activated.

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', \
             NArw_marker=False, pandasoutput=True, printstatus=False)

print("list(train)")
list(train)

list(train)


['addr2_26.0',
 'addr2_60.0',
 'addr2_87.0',
 'card4_american express',
 'card4_discover',
 'card4_mastercard',
 'card4_visa',
 'ProductCD_C',
 'ProductCD_H',
 'ProductCD_R',
 'ProductCD_S',
 'ProductCD_W',
 'TransactionDT_nmbr',
 'TransactionAmt_nmbr',
 'card1_nmbr',
 'card2_nmbr',
 'card3_nmbr',
 'card5_nmbr',
 'card6_bnry',
 'addr1_nmbr',
 'dist1_nmbr',
 'dist2_nmbr',
 'P_emaildomain_ordl']

In [35]:
#Now with NArw_marker turned on.

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', \
             NArw_marker=True, pandasoutput=True, printstatus=False)

print("list(train)")
list(train)

list(train)


['addr2_26.0',
 'addr2_60.0',
 'addr2_87.0',
 'card4_american express',
 'card4_discover',
 'card4_mastercard',
 'card4_visa',
 'ProductCD_C',
 'ProductCD_H',
 'ProductCD_R',
 'ProductCD_S',
 'ProductCD_W',
 'TransactionDT_NArw',
 'TransactionDT_nmbr',
 'TransactionAmt_NArw',
 'TransactionAmt_nmbr',
 'ProductCD_NArw',
 'card1_NArw',
 'card1_nmbr',
 'card2_NArw',
 'card2_nmbr',
 'card3_NArw',
 'card3_nmbr',
 'card4_NArw',
 'card5_NArw',
 'card5_nmbr',
 'card6_NArw',
 'card6_bnry',
 'addr1_NArw',
 'addr1_nmbr',
 'addr2_NArw',
 'dist1_NArw',
 'dist1_nmbr',
 'dist2_NArw',
 'dist2_nmbr',
 'P_emaildomain_NArw',
 'P_emaildomain_ordl']

In [36]:
#If we inspect one of these we'll see a marker for what rows were subject to infill
#(actually already did this a few cells ago but just to be complete)

columns = ['dist1_nmbr', 'dist1_NArw']
train[columns].head()



Unnamed: 0,dist1_nmbr,dist1_NArw
0,1.486171,1
1,-0.523147,0
2,-0.495666,0
3,-0.146145,1
4,-0.364578,1


# featureselection

In [37]:
#featureselection performs a feature importance evaluation with the permutation method. 
#(basically trains a machine learning model, and then measures impact to accuracy 
#after randomly shuffling each feature)

#Let's try it out. Note that this method requires the inclusion of a labels column.

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', NArw_marker=False, \
             featureselection=True, pandasoutput=True, printstatus=False)

In [38]:
#Now we can view the results like so.
#(a future iteration of the tool will improve the reporting method, for now this works)
for keys,values in featureimportance.items():
    print(keys)
    print('shuffleaccuracy = ', values['shuffleaccuracy'])
    print('baseaccuracy = ', values['baseaccuracy'])
    print('metric = ', values['metric'])
    print('metric2 = ', values['metric2'])
    print()

addr2_26.0
shuffleaccuracy =  0.9769820971867008
baseaccuracy =  0.9744245524296675
metric =  -0.002557544757033292
metric2 =  -0.002557544757033292

addr2_60.0
shuffleaccuracy =  0.9769820971867008
baseaccuracy =  0.9744245524296675
metric =  -0.002557544757033292
metric2 =  -0.002557544757033292

addr2_87.0
shuffleaccuracy =  0.9769820971867008
baseaccuracy =  0.9744245524296675
metric =  -0.002557544757033292
metric2 =  0.0

card4_american express
shuffleaccuracy =  0.9769820971867008
baseaccuracy =  0.9744245524296675
metric =  -0.002557544757033292
metric2 =  -0.002557544757033292

card4_discover
shuffleaccuracy =  0.9769820971867008
baseaccuracy =  0.9744245524296675
metric =  -0.002557544757033292
metric2 =  -0.002557544757033292

card4_mastercard
shuffleaccuracy =  0.9769820971867008
baseaccuracy =  0.9744245524296675
metric =  -0.002557544757033292
metric2 =  0.0

card4_visa
shuffleaccuracy =  0.9769820971867008
baseaccuracy =  0.9744245524296675
metric =  -0.00255754475703329

In [39]:
#I suspect the small size of this demonstration set impacted these results.

#Note that for interpreting these, "metric" represents the impact
#after shuffling the entire set of columns originating from the same feature, and a larger
#metric implies more importance.
#metric2 is derived after shuffling all but the current column originating from the same
#feature, and a smaller metric2 implies greater relative importance within that set of
#derived features. In case you were wondering.
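
For readers new to the permutation method, here is a stripped-down sketch. It computes a per-column `metric` as base accuracy minus shuffled accuracy, which matches the sign convention in the printout above, but it shuffles one column at a time and does not reproduce `metric2`. No model is actually trained: the fixed `predict` function is a stand-in, and all names and data are invented for illustration.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, seed=0):
    """Return base accuracy and {column index: base accuracy - shuffled accuracy}."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    metrics = {}
    for col in range(len(rows[0])):
        shuffled_col = [r[col] for r in rows]
        rng.shuffle(shuffled_col)          # break the link between this column and the labels
        shuffled_rows = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled_col)]
        metrics[col] = base - accuracy(predict, shuffled_rows, labels)
    return base, metrics

# toy data: column 0 determines the label, column 1 is noise
data_rng = random.Random(1)
rows = [[i % 2, data_rng.random()] for i in range(200)]
labels = [r[0] for r in rows]

def predict(row):
    """Stand-in for a trained model that has learned to read column 0."""
    return row[0]

base, metrics = permutation_importance(predict, rows, labels)
print(base, metrics)
```

A column the model ignores scores exactly zero, while shuffling the informative column costs a large share of the accuracy. Near-zero or slightly negative values across the board, as in the printout above, suggest the model was not leaning on any single column.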

# PCAn_components, PCAexcl

In [40]:
#Now if we want to apply some kind of dimensionality reduction, we can do so
#via Principal Component Analysis (PCA), a type of unsupervised learning.

#a few defaults here: PCA is automatically performed if number of features > 50% number of rows
#(can be turned off via ML_cmnd)
#also the PCA type defaults to kernel PCA for all non-negative sets, sparse PCA otherwise, or regular
#PCA if PCAn_components is passed as a percent. (All via scikit-learn PCA methods)

#If there are any columns we want to exclude from PCA, we can specify them in PCAexcl

#We can also pass parameters to the PCA call via the ML_cmnd

#Let's demonstrate, here we'll reduce to four PCA derived sets, arbitrarily excluding 
#from the transformation columns derived from dist1


train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', NArw_marker=False, \
             PCAn_components=4, PCAexcl=['dist1'], \
             pandasoutput=True, printstatus=False)

print("derived columns")
list(train)


derived columns


['PCAcol0', 'PCAcol1', 'PCAcol2', 'PCAcol3', 'dist1_nmbr']

In [41]:
#Noting that any subsequently available data can easily be consistently prepared as follows
#with postmunge (by simply passing the postprocess_dict object returned from automunge, which
#you did remember to save, right? If not no worries, it's also possible to consistently process
#by passing the test set with the exact same original train set to automunge)

test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, tiny_test, testID_column = False, \
             labelscolumn = False, pandasoutput=True, printstatus=False)

list(test)

['PCAcol0', 'PCAcol1', 'PCAcol2', 'PCAcol3', 'dist1_nmbr']

In [42]:
#Another useful method might be to exclude any boolean columns from the PCA
#dimensionality reduction. We can do that with ML_cmnd by passing the following:

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', NArw_marker=False, \
             PCAn_components=4, PCAexcl=['dist1'], \
             pandasoutput=True, printstatus=False, \
             ML_cmnd = {'MLinfill_type':'default', \
                        'MLinfill_cmnd':{'RandomForestClassifier':{}, \
                                         'RandomForestRegressor':{}}, \
                        'PCA_type':'default', \
                        'PCA_cmnd':{'bool_PCA_excl':True}})

print("derived columns")
list(train)


derived columns


['PCAcol0',
 'PCAcol1',
 'PCAcol2',
 'PCAcol3',
 'dist1_nmbr',
 'addr2_26.0',
 'addr2_60.0',
 'addr2_87.0',
 'card4_american express',
 'card4_discover',
 'card4_mastercard',
 'card4_visa',
 'ProductCD_C',
 'ProductCD_H',
 'ProductCD_R',
 'ProductCD_S',
 'ProductCD_W',
 'card6_bnry']

# assigncat

In [43]:
#A really important part is that we don't have to defer to the automated evaluation of
#column properties to determine processing methods, we can also assign distinct processing
#methods to specific columns.

#Now let's try assigning a few different methods to the numerical sets:

#remember we're assigning based on the original column names before the appended suffixes

#How about let's arbitrarily select min-max scaling for these columns 
minmax_list = ['card1', 'card2', 'card3']

#And since we previously saw that TransactionAmt might have some skewness based on our
#prior powertransform evaluation, let's set that to 'pwrs' which puts it into bins
#based on powers of 10
pwrs_list = ['TransactionAmt']

#Let's say we don't feel the P_emaildomain is very useful, we can just delete it with null
null_list = ['P_emaildomain']

#and if there's a column we want to exclude from processing, we can exclude with excl
#note that any column we exclude from processing needs to be already numerically encoded
#if we want to use any of our predictive methods like MLinfill, feature importance, PCA
#on other columns. (excl just passes data untouched, exc2 performs a modeinfill just in 
#case some missing points are found.)
exc2_list = ['card5']

#and we'll leave the rest to default methods

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', NArw_marker=False, \
             pandasoutput=True, printstatus=False, \
             assigncat = {'mnmx':minmax_list, 'mnm2':[], 'mnm3':[], 'mnm4':[], 'mnm5':[], 'mnm6':[], \
                         'nmbr':[], 'nbr2':[], 'nbr3':[], 'MADn':[], 'MAD2':[], \
                         'bins':[], 'bint':[], \
                         'bxcx':[], 'bxc2':[], 'bxc3':[], 'bxc4':[], \
                         'log0':[], 'log1':[], 'pwrs':pwrs_list, \
                         'bnry':[], 'text':[], 'ordl':[], 'ord2':[], \
                         'date':[], 'dat2':[], 'wkdy':[], 'bshr':[], 'hldy':[], \
                         'excl':[], 'exc2':exc2_list, 'exc3':[], 'null':null_list, 'eval':[]})

print("derived columns")
list(train)

derived columns


['addr2_26.0',
 'addr2_60.0',
 'addr2_87.0',
 'card5_exc2_bins_s<-2',
 'card5_exc2_bins_s-21',
 'card5_exc2_bins_s-10',
 'card5_exc2_bins_s+01',
 'card5_exc2_bins_s+12',
 'card5_exc2_bins_s>+2',
 'card4_american express',
 'card4_discover',
 'card4_mastercard',
 'card4_visa',
 'ProductCD_C',
 'ProductCD_H',
 'ProductCD_R',
 'ProductCD_S',
 'ProductCD_W',
 'TransactionAmt_10^-1',
 'TransactionAmt_10^0',
 'TransactionAmt_10^1',
 'TransactionAmt_10^2',
 'TransactionAmt_10^3',
 'TransactionDT_nmbr',
 'card1_mnmx',
 'card2_mnmx',
 'card3_mnmx',
 'card6_bnry',
 'addr1_nmbr',
 'dist1_nmbr',
 'dist2_nmbr']

In [44]:
#Here's what the resulting derivations look like
train.head()

Unnamed: 0,addr2_26.0,addr2_60.0,addr2_87.0,card5_exc2_bins_s<-2,card5_exc2_bins_s-21,card5_exc2_bins_s-10,card5_exc2_bins_s+01,card5_exc2_bins_s+12,card5_exc2_bins_s>+2,card4_american express,...,TransactionAmt_10^2,TransactionAmt_10^3,TransactionDT_nmbr,card1_mnmx,card2_mnmx,card3_mnmx,card6_bnry,addr1_nmbr,dist1_nmbr,dist2_nmbr
0,0,0,1,0,0,0,1,0,0,0,...,1,0,1.023439,0.391159,0.763527,0.453608,0,1.025535,0.104258,-0.848042
1,0,0,1,0,0,0,1,0,0,0,...,1,0,1.642753,0.663074,0.442886,0.453608,0,1.977745,-0.523147,-0.326374
2,0,0,1,0,0,0,1,0,0,0,...,0,0,-0.10202,0.481874,0.022044,0.453608,0,0.255207,-0.495666,-0.462592
3,0,1,0,0,1,0,0,0,0,0,...,1,0,-0.38353,0.855628,0.891784,0.814433,0,0.256705,-0.287355,-1.221946
4,0,0,1,0,0,0,1,0,0,0,...,1,0,-0.120128,0.833266,0.781563,0.453608,0,0.084023,-0.312272,-1.111596
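
Two of the assigned transformations are simple enough to sketch directly. The code below is an illustration of min-max scaling and power-of-ten binning in plain Python, not the library's `mnmx` or `pwrs` code; the function names and sample values are made up, and the binning assumes strictly positive inputs.

```python
import math

def minmax_scale(values):
    """Scale values linearly into the range 0..1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def power_of_ten_bin(value):
    """Label a positive value by its order of magnitude, e.g. 250 -> '10^2'."""
    return '10^%d' % math.floor(math.log10(value))

card1 = [1000, 5500, 10000]
amounts = [0.5, 7.0, 49.99, 250.0, 3100.0]

scaled = minmax_scale(card1)
amount_bins = [power_of_ten_bin(a) for a in amounts]
print(scaled)
print(amount_bins)
```

The bin labels line up with the `TransactionAmt_10^...` column suffixes in the output above, with each label becoming its own boolean column.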


# assigninfill

In [45]:
#We can also assign distinct infill methods to each column. Let's demonstrate. 
#I remember when we were looking at MLinfill that one of our columns had a few NArw
#(rows subject to infill), let's try a different infill method on those 

#how about we try adjinfill which carries the value from an adjacent row

#remember we're assigning columns based on their title prior to the suffix appendings

train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', \
             NArw_marker=True, pandasoutput=True, printstatus=False, \
             assigninfill = {'adjinfill':['dist1']})

columns = ['dist1_nmbr', 'dist1_NArw']
train[columns].head()

Unnamed: 0,dist1_nmbr,dist1_NArw
0,-0.523147,1
1,-0.523147,0
2,-0.495666,0
3,-0.495666,1
4,-0.495666,1
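
The output above is consistent with filling from the preceding row and falling back to the following row when a gap sits at the top of the column. That reading is an inference from these five rows rather than something confirmed by the library's documentation. A sketch of that behavior, with `adjacent_infill` as a hypothetical name:

```python
def adjacent_infill(values):
    """Fill None entries from the preceding row, falling back to the following row."""
    filled = list(values)
    # forward pass: carry the last seen value downward
    for i in range(1, len(filled)):
        if filled[i] is None:
            filled[i] = filled[i - 1]
    # backward pass: handles gaps at the very start of the column
    for i in range(len(filled) - 2, -1, -1):
        if filled[i] is None:
            filled[i] = filled[i + 1]
    return filled

dist1_nmbr = [None, -0.523147, -0.495666, None, None]
filled = adjacent_infill(dist1_nmbr)
print(filled)
```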


# transformdict and processdict

In [46]:
#transformdict and processdict are for more advanced users. They allow the user to design
#custom compositions of transformations, or even incorporate their own custom defined
#transformation functions into use on the platform. I won't go into full detail on these methods
#here, I documented these a bunch in the essays which I'll link to below, but here's a taste.

#Say that we have a numerical set to which we want to apply multiple transformations. Let's just
#make a few up, say that we have a set with fat tail characteristics, and we want to do multiple
#transformations including a Box-Cox transformation, a z-score transformation on that output, as
#well as a set of bins for powers of 10. Well our 'TransactionAmt' column might be a good candidate
#for that. Let's show how.

#Here we define our custom transformdict using our "family tree primitives"
#Note that we always need to use at least one replacement primitive, if a column is intended to be left
#intact we can include an excl transform as a replacement primitive.

#here are the primitive definitions
# 'parents' :           upstream / first generation / replaces column / with offspring
# 'siblings':           upstream / first generation / supplements column / with offspring
# 'auntsuncles' :       upstream / first generation / replaces column / no offspring
# 'cousins' :           upstream / first generation / supplements column / no offspring
# 'children' :          downstream parents / offspring generations / replaces column / with offspring
# 'niecesnephews' :     downstream siblings / offspring generations / supplements column / with offspring
# 'coworkers' :         downstream auntsuncles / offspring generations / replaces column / no offspring
# 'friends' :           downstream cousins / offspring generations / supplements column / no offspring

#So let's define our custom transformdict for a new root category we'll call 'cstm'
transformdict = {'cstm' : {'parents' : ['bxcx'], \
                           'siblings': [], \
                           'auntsuncles' : [], \
                           'cousins' : ['pwrs'], \
                           'children' : [], \
                           'niecesnephews' : [], \
                           'coworkers' : [], \
                           'friends' : []}}

#Note that since bxcx is a parent category, it will look for offspring in the primitives associated
#with the bxcx root category in the library, and find there a downstream nmbr category

#Note that since we are defining a new root category, we also have to define a few parameters for it,
#as demonstrated here. Further detail on this step is available in the documentation. If you're not sure you might
#want to try just copying an entry from the README.

#Note that since cstm is only a root category and not included in the family tree primitives we don't have to
#define a processing function (for the dualprocess/singleprocess/postprocess entries), we can just enter None

processdict = {'cstm' : {'dualprocess' : None, \
                         'singleprocess' : None, \
                         'postprocess' : None, \
                         'NArowtype' : 'numeric', \
                         'MLinfilltype' : 'numeric', \
                         'labelctgy' : 'nmbr'}}

#We can then pass this transformdict to the automunge call and assign the intended column in assigncat
train, trainID, labels, \
validation1, validationID1, validationlabels1, \
validation2, validationID2, validationlabels2, \
test, testID, testlabels, \
labelsencoding_dict, finalcolumns_train, finalcolumns_test, \
featureimportance, postprocess_dict = \
am.automunge(tiny_train, df_test = False, labels_column = 'isFraud', \
             NArw_marker=True, pandasoutput=True, printstatus=False, \
             assigncat = {'cstm':['TransactionAmt']}, \
             transformdict = transformdict, processdict = processdict)

print("list(train)")
list(train)

list(train)


['addr2_26.0',
 'addr2_60.0',
 'addr2_87.0',
 'card4_american express',
 'card4_discover',
 'card4_mastercard',
 'card4_visa',
 'ProductCD_C',
 'ProductCD_H',
 'ProductCD_R',
 'ProductCD_S',
 'ProductCD_W',
 'TransactionAmt_10^-1',
 'TransactionAmt_10^0',
 'TransactionAmt_10^1',
 'TransactionAmt_10^2',
 'TransactionAmt_10^3',
 'TransactionDT_NArw',
 'TransactionDT_nmbr',
 'TransactionAmt_bxcx_nmbr',
 'ProductCD_NArw',
 'card1_NArw',
 'card1_nmbr',
 'card2_NArw',
 'card2_nmbr',
 'card3_NArw',
 'card3_nmbr',
 'card4_NArw',
 'card5_NArw',
 'card5_nmbr',
 'card6_NArw',
 'card6_bnry',
 'addr1_NArw',
 'addr1_nmbr',
 'addr2_NArw',
 'dist1_NArw',
 'dist1_nmbr',
 'dist2_NArw',
 'dist2_nmbr',
 'P_emaildomain_NArw',
 'P_emaildomain_ordl']
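
Before passing a hand-written transformdict, a quick sanity check can catch a missing primitive or an empty tree. The helper below is a hypothetical convenience and not part of Automunge. It assumes, from the primitive table in the cell above, that the replacement primitives required for a root category are the first-generation ones marked "replaces column" (`parents` and `auntsuncles`).

```python
PRIMITIVES = ['parents', 'siblings', 'auntsuncles', 'cousins',
              'children', 'niecesnephews', 'coworkers', 'friends']
# assumption: first generation primitives that replace the source column
REPLACEMENT_PRIMITIVES = ['parents', 'auntsuncles']

def check_transformdict(transformdict):
    """Return a list of problems found in a transformdict (empty list if none)."""
    problems = []
    for root, tree in transformdict.items():
        missing = [p for p in PRIMITIVES if p not in tree]
        if missing:
            problems.append('%s: missing primitives %s' % (root, missing))
        if not any(tree.get(p) for p in REPLACEMENT_PRIMITIVES):
            problems.append('%s: no replacement primitive populated' % root)
    return problems

good = {'cstm': {'parents': ['bxcx'], 'siblings': [], 'auntsuncles': [],
                 'cousins': ['pwrs'], 'children': [], 'niecesnephews': [],
                 'coworkers': [], 'friends': []}}
bad = {'cstm': {'parents': [], 'cousins': ['pwrs']}}

print(check_transformdict(good))
print(check_transformdict(bad))
```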

In [47]:
#and then of course the user also has the ability to define their own transformation functions to
#incorporate into the platform, I'll defer to the essays for that bit in the interest of brevity

# postmunge

In [48]:
#And the final bit which I'll just reiterate here is that automunge facilitates the simplest means
#for consistent processing of subsequently available data with just a single function call,
#all you need is the postprocess_dict object returned from the original automunge call

#This even works when we passed custom transformdict entries as was the case with the last postprocess_dict
#derived in the last example, however if you're defining custom transformation functions, for now you
#need to save those custom function definitions and redefine them in the new notebook when applying postmunge

#Here again is a demonstration of postmunge. Since the last postprocess_dict we returned
#was with our custom transformations in the preceding example, the 'TransactionAmt' column will
#be processed consistently

test, testID, testlabels, \
labelsencoding_dict, finalcolumns_test = \
am.postmunge(postprocess_dict, tiny_test, testID_column = False, \
             labelscolumn = False, pandasoutput=True, printstatus=True)

_______________
Begin Postmunge processing

processing column:  TransactionDT
    root category:  nmbr
 returned columns:
['TransactionDT_NArw', 'TransactionDT_nmbr']

processing column:  TransactionAmt
    root category:  cstm
 returned columns:
['TransactionAmt_bxcx_nmbr', 'TransactionAmt_10^0', 'TransactionAmt_10^2', 'TransactionAmt_10^1', 'TransactionAmt_10^3', 'TransactionAmt_10^-1']

processing column:  ProductCD
    root category:  text
 returned columns:
['ProductCD_W', 'ProductCD_H', 'ProductCD_S', 'ProductCD_R', 'ProductCD_NArw', 'ProductCD_C']

processing column:  card1
    root category:  nmbr
 returned columns:
['card1_NArw', 'card1_nmbr']

processing column:  card2
    root category:  nmbr
 returned columns:
['card2_nmbr', 'card2_NArw']

processing column:  card3
    root category:  nmbr
 returned columns:
['card3_nmbr', 'card3_NArw']

processing column:  card4
    root category:  text
 returned columns:
['card4_visa', 'card4_mastercard', 'card4_discover', 'card4_american

In [49]:
list(test)

['addr2_26.0',
 'addr2_60.0',
 'addr2_87.0',
 'card4_american express',
 'card4_discover',
 'card4_mastercard',
 'card4_visa',
 'ProductCD_C',
 'ProductCD_H',
 'ProductCD_R',
 'ProductCD_S',
 'ProductCD_W',
 'TransactionAmt_10^-1',
 'TransactionAmt_10^0',
 'TransactionAmt_10^1',
 'TransactionAmt_10^2',
 'TransactionAmt_10^3',
 'TransactionDT_NArw',
 'TransactionDT_nmbr',
 'TransactionAmt_bxcx_nmbr',
 'ProductCD_NArw',
 'card1_NArw',
 'card1_nmbr',
 'card2_NArw',
 'card2_nmbr',
 'card3_NArw',
 'card3_nmbr',
 'card4_NArw',
 'card5_NArw',
 'card5_nmbr',
 'card6_NArw',
 'card6_bnry',
 'addr1_NArw',
 'addr1_nmbr',
 'addr2_NArw',
 'dist1_NArw',
 'dist1_nmbr',
 'dist2_NArw',
 'dist2_nmbr',
 'P_emaildomain_NArw',
 'P_emaildomain_ordl']
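
The reason a saved dictionary is enough for consistent processing is that every parameter is learned from the train set once and then reused. Here is a toy version of that pattern with z-score normalization; `fit_zscore`, `apply_zscore`, and the `params` dictionary are illustrative stand-ins and say nothing about what `postprocess_dict` actually contains.

```python
import statistics

def fit_zscore(train_values):
    """Learn normalization parameters from the train set only."""
    return {'mean': statistics.mean(train_values),
            'std': statistics.pstdev(train_values)}

def apply_zscore(params, values):
    """Apply previously learned parameters to any later data."""
    return [(v - params['mean']) / params['std'] for v in values]

train_amounts = [10.0, 20.0, 30.0]
test_amounts = [20.0, 40.0]

params = fit_zscore(train_amounts)               # plays the role of postprocess_dict
train_scaled = apply_zscore(params, train_amounts)
test_scaled = apply_zscore(params, test_amounts)  # uses train mean/std, not test statistics
print(train_scaled)
print(test_scaled)
```

Normalizing the test set with its own mean and standard deviation would put the two sets on different scales, which is the inconsistency this pattern avoids.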

# Closing thoughts

Great, well I certainly appreciate your attention and the opportunity to share. I suppose the next step for me is to try and hone my entry and perhaps get on the leaderboard. That'd be cool. 

Oh, before I go, if you'd like to see more I recently published my first collection of essays titled "From the Diaries of John Henry", a big chunk of which documents the development of Automunge. Check it out, it's all online.

[turingsquared.com](http://turingsquared.com)

Or for more on Automunge our website and contact info is available at 

[automunge.com](https://www.automunge.com)