The following demonstrations correspond to Section 4 of the paper.

## Code Demonstration

Jupyter notebook install and imports are as follows:

In [1]:
!pip install Automunge



In [2]:
from Automunge import *
am = AutoMunge()

The automunge(.) function accepts as input a Pandas dataframe or tabular Numpy array of training data, and optionally corresponding test data. If either set includes a label column, its header should be designated; the same applies to any index header or list of headers to exclude from the ML infill basis. For Numpy input, headers are the integer column indices and labels should be positioned as the final column.
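As a minimal sketch of the Numpy convention just described, the following arranges a small array so that the labels sit in the final column. The toy values and the `label_position` variable are assumptions made for illustration; they are not part of the Automunge API.

```python
import numpy as np

# Hypothetical toy data: column 0 holds the labels, columns 1-2 hold features.
raw = np.array([
    [0, 22.0, 7.25],
    [1, 38.0, 71.28],
    [1, 26.0, 7.93],
])

label_position = 0  # assumed position of the label column in the raw array

# Move the label column to the end, keeping the feature order unchanged.
features = np.delete(raw, label_position, axis=1)
labels = raw[:, [label_position]]
np_train = np.hstack([features, labels])

# With Numpy input the column "headers" are simply the integer positions.
headers = list(range(np_train.shape[1]))
```

After this rearrangement the label column is referenced by the last integer in `headers`.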

In [3]:
# For simplicity we'll base this notebook on the Titanic data set

In [4]:
import pandas as pd

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

# labels_column = '<labels_column_header>'
# trainID_column = '<ID_column_header>'

labels_column = 'Survived'
trainID_column = 'PassengerId'

In [5]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


These data sets can be passed to automunge(.) to automatically encode and impute. The function returns 10 sets (9 dataframes and 1 dictionary), some of which may be empty depending on parameter settings; we suggest the following optional naming convention. The final set, the "postprocess_dict", is the key for consistently preparing additional data in postmunge(.). If a validation set is desired, it can be partitioned from df_train with valpercent and prepared on the train set basis. Shuffling is on by default for train data and off by default for test data; the associated parameter is shown for reference.

Here we demonstrate the assigncat parameter, which assigns the root category of a transformation set to a target column and overrides the default transform under automation. We also demonstrate the assigninfill parameter, which assigns an alternate infill convention to a column. ML infill and NArw column aggregation are on by default; their activation parameters are shown for reference. If the data is already numerically encoded and the user only wants infill, they can pass the parameter powertransform = 'infill'.

In [6]:
parsed_categoric_target_column = 'Ticket'
infill_target_column = 'Cabin'

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train, 
             df_test = df_test,
             labels_column = labels_column, 
             trainID_column = trainID_column,
             valpercent = 0.2,
             shuffletrain = True,
             assigncat = {'or23' : [parsed_categoric_target_column] },
             assigninfill = {'modeinfill' : [infill_target_column] },
             MLinfill = True,
             NArw_marker = True)

_______________
Begin Automunge processing

evaluating column:  Pclass
processing column:  Pclass
    root category:  1010
 returned columns:
['Pclass_NArw', 'Pclass_1010_0', 'Pclass_1010_1']

evaluating column:  Name
processing column:  Name
    root category:  hash
 returned columns:
['Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13']

evaluating column:  Sex
processing column:  Sex
    root category:  bnry
 returned columns:
['Sex_bnry', 'Sex_NArw']

evaluating column:  Age
processing column:  Age
    root category:  nmbr
 returned columns:
['Age_nmbr', 'Age_NArw']

evaluating column:  SibSp
processing column:  SibSp
    root category:  nmbr
 returned columns:
['SibSp_nmbr', 'SibSp_NArw']

evaluating column:  Parch
processing column:  Parch
    root category:  nmbr
 returned columns:
['Parch_nmbr', 'Parch_NArw']
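
To give some intuition for the 'modeinfill' assignment and the NArw marker columns seen in the call above, here is a conceptual sketch in plain pandas. It is not Automunge's implementation; the toy column is hypothetical, and we assume the marker simply flags rows whose original entry was missing.

```python
import pandas as pd

# Hypothetical toy column with missing entries, standing in for 'Cabin'.
df = pd.DataFrame({'Cabin': ['C85', None, 'C85', 'C123', None]})

# NArw-style marker: 1 where the original entry was missing, else 0.
df['Cabin_NArw'] = df['Cabin'].isna().astype(int)

# Mode infill: replace missing entries with the most common observed value.
most_common = df['Cabin'].mode()[0]
df['Cabin'] = df['Cabin'].fillna(most_common)
```

The marker is computed before the infill so that the record of which rows were missing is not lost once the gaps are filled.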

A list of columns returned from some particular input feature can be accessed with postprocess_dict['column_map']['\<input_feature_header\>']. A report classifying the returned column types (such as continuous, boolean, ordinal, onehot, binary, etc.) and their groupings can be accessed with postprocess_dict['columntype_report'].

In [7]:
#Here's an example of viewing columns 
#returned from some input feature

input_feature_header = 'Ticket'

train[postprocess_dict['column_map'][input_feature_header]].head()

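| | Ticket_UPCS_ord3 | Ticket_NArw | Ticket_UPCS_nmcm | Ticket_UPCS_sp19_0 | Ticket_UPCS_sp19_1 | Ticket_UPCS_sp19_2 | Ticket_UPCS_sp19_3 | Ticket_UPCS_sp19_4 | Ticket_UPCS_sp19_5 | Ticket_UPCS_sp19_6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 176 | 536 | 0 | 3101312.0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 492 | 26 | 0 | 17757.0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 390 | 362 | 0 | 349247.0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 613 | 537 | 0 | 392078.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 580 | 468 | 0 | 2466.0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |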


In [8]:
#And here is what the report of column types looks like

postprocess_dict['columntype_report']

{'continuous': ['Age_nmbr',
  'SibSp_nmbr',
  'Parch_nmbr',
  'Fare_nmbr',
  'Ticket_UPCS_nmcm'],
 'boolean': ['Sex_bnry',
  'Pclass_NArw',
  'Name_NArw',
  'Sex_NArw',
  'Age_NArw',
  'SibSp_NArw',
  'Parch_NArw',
  'Ticket_NArw',
  'Fare_NArw',
  'Cabin_NArw',
  'Embarked_NArw'],
 'ordinal': ['Ticket_UPCS_ord3',
  'Name_hash_0',
  'Name_hash_1',
  'Name_hash_2',
  'Name_hash_3',
  'Name_hash_4',
  'Name_hash_5',
  'Name_hash_6',
  'Name_hash_7',
  'Name_hash_8',
  'Name_hash_9',
  'Name_hash_10',
  'Name_hash_11',
  'Name_hash_12',
  'Name_hash_13'],
 'onehot': [],
 'onehot_sets': [],
 'binary': ['Pclass_1010_0',
  'Pclass_1010_1',
  'Ticket_UPCS_sp19_0',
  'Ticket_UPCS_sp19_1',
  'Ticket_UPCS_sp19_2',
  'Ticket_UPCS_sp19_3',
  'Ticket_UPCS_sp19_4',
  'Ticket_UPCS_sp19_5',
  'Ticket_UPCS_sp19_6',
  'Cabin_1010_0',
  'Cabin_1010_1',
  'Cabin_1010_2',
  'Cabin_1010_3',
  'Cabin_1010_4',
  'Cabin_1010_5',
  'Cabin_1010_6',
  'Cabin_1010_7',
  'Embarked_1010_0',
  'Embarked_1010_1'],
 'b
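
One way to work with a report of this shape is to invert it into a lookup from returned column header to column type. The sketch below uses an abbreviated stand-in dictionary (a hypothetical subset of the report above), and `invert_report` is a helper name we introduce for illustration; it is not an Automunge function.

```python
# Abbreviated stand-in for postprocess_dict['columntype_report'] (hypothetical subset).
columntype_report = {
    'continuous': ['Age_nmbr', 'Fare_nmbr'],
    'boolean': ['Sex_bnry', 'Age_NArw'],
    'ordinal': ['Ticket_UPCS_ord3'],
    'onehot': [],
    'binary': ['Pclass_1010_0', 'Pclass_1010_1'],
}

def invert_report(report):
    """Map each returned column header to its reported type."""
    column_to_type = {}
    for column_type, columns in report.items():
        for column in columns:
            # Keep plain headers only; skip any nested groupings of columns.
            if isinstance(column, str):
                column_to_type[column] = column_type
    return column_to_type

column_to_type = invert_report(columntype_report)
continuous_columns = columntype_report['continuous']
```

With the real report, a list such as `continuous_columns` could then be used to select the matching columns from the returned train set, for example `train[continuous_columns]`.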

If the returned train set is to be used for training a model that may go into production, the postprocess_dict should be saved externally, such as with the pickle library. 
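
A minimal sketch of saving and reloading with pickle follows. The helper names, the file name, and the stand-in dictionary are assumptions for illustration; in practice the dictionary returned by automunge(.) would be passed instead.

```python
import os
import pickle
import tempfile

def save_postprocess_dict(postprocess_dict, path):
    """Write the dictionary to disk in binary pickle format."""
    with open(path, 'wb') as f:
        pickle.dump(postprocess_dict, f)

def load_postprocess_dict(path):
    """Read a previously saved dictionary back from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in dictionary for illustration only.
example_dict = {'column_map': {'Sex': ['Sex_bnry', 'Sex_NArw']}}

# A temporary directory is used here so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'postprocess_dict.pickle')
    save_postprocess_dict(example_dict, path)
    restored = load_postprocess_dict(path)
```

As with any pickle file, it should only be loaded from a trusted source, since unpickling can execute arbitrary code.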

We can then prepare additional data on the train set basis with postmunge(.).

In [9]:
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
             df_test)

_______________
Begin Postmunge processing

______

processing column:  Pclass
    root category:  1010

 returned columns:
['Pclass_NArw', 'Pclass_1010_0', 'Pclass_1010_1']

______

processing column:  Name
    root category:  hash

 returned columns:
['Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13']

______

processing column:  Sex
    root category:  bnry

 returned columns:
['Sex_bnry', 'Sex_NArw']

______

processing column:  Age
    root category:  nmbr

 returned columns:
['Age_nmbr', 'Age_NArw']

______

processing column:  SibSp
    root category:  nmbr

 returned columns:
['SibSp_nmbr', 'SibSp_NArw']

______

processing column:  Parch
    root category:  nmbr

 returned columns:
['Parch_nmbr', 'Parch_NArw']

______

processing column:  Ticket
    root category:  or23

 returned columns:
['Ticket_UPCS_