### Table of Contents


• __Appendix B__: Automunge demonstration

...

• __Appendix F__: Noise injection tutorial

  – F.1: DP transformation categories

  – F.2: Parameter assignment

...
  
  – F.6: Data augmentation with noise

  – F.7: Alternate random samplers

  – F.8: QRAND library sampling

  – F.9: All together now

  – F.10: Noise directed at existing data pipelines

• __Appendix G__: Advanced noise profiles

  – G.1: Noise parameter randomization

  – G.2: Noise profile composition

  – G.3: Protected attributes

...

### Appendix B: Automunge demonstration

The Automunge interface is channeled through two master functions, automunge(.) for preparing data and postmunge(.) for preparing additional corresponding data. As an example, for a training set dataframe df_train which includes a label feature ‘labels’, automunge(.) can be applied under automation as:

In [1]:
#!pip install Automunge
from Automunge import *
am = AutoMunge()

In [2]:
import pandas as pd

In [3]:
#titanic set available from Kaggle: https://www.kaggle.com/c/titanic/data
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
labels_column = 'Survived'
trainID_column = 'PassengerId'

df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 labels_column = labels_column)

_______________
Begin Automunge

______

versioning serial stamp:
_8.18_640834806539

Automunge returned train column set: 
['PassengerId_nmbr', 'Pclass_nmbr', 'Sex_bnry', 'Age_nmbr', 'SibSp_nmbr', 'Parch_nmbr', 'Fare_nmbr', 'PassengerId_NArw', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Ticket_hash_0', 'Ticket_hash_1', 'Ticket_hash_2', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

Automunge returned ID column set: 
['Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



Some of the returned sets may be empty based on parameter selections. Using the returned dictionary postprocess_dict, corresponding data can then be prepared on a consistent basis with postmunge(.).

In [5]:
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
                 df_test)

_______________
Begin Postmunge

Postmunge returned test column set: 
['PassengerId_nmbr', 'Pclass_nmbr', 'Sex_bnry', 'Age_nmbr', 'SibSp_nmbr', 'Parch_nmbr', 'Fare_nmbr', 'PassengerId_NArw', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Ticket_hash_0', 'Ticket_hash_1', 'Ticket_hash_2', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

_______________
Postmunge returned ID column set: 
['Automunge_index']

_______________
Postmunge Complete



In [6]:
train

Unnamed: 0,PassengerId_nmbr,Pclass_nmbr,Sex_bnry,Age_nmbr,SibSp_nmbr,Parch_nmbr,Fare_nmbr,PassengerId_NArw,Pclass_NArw,Name_NArw,...,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
551,0.411884,-0.369157,1,-0.185806,-0.474279,-0.473408,-0.124850,0,0,0,...,0,0,0,0,0,0,0,0,1,1
100,-1.340567,0.826913,0,-0.116967,-0.474279,-0.473408,-0.489167,0,0,0,...,0,0,0,0,0,0,0,0,1,1
96,-1.356109,-1.565228,1,2.843141,-0.474279,-0.473408,0.049302,0,0,0,...,0,0,0,1,1,0,1,0,0,1
352,-0.361370,0.826913,1,-1.011883,0.432550,0.767199,-0.502582,0,0,0,...,0,0,0,0,0,0,0,0,0,1
69,-1.461023,0.826913,1,-0.254646,1.339380,-0.473408,-0.473739,0,0,0,...,0,0,0,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
453,0.031086,-1.565228,1,1.328667,0.432550,-0.473408,1.145020,0,0,0,...,1,0,1,0,1,1,1,0,0,1
118,-1.270624,-1.565228,1,-0.392326,-0.474279,0.767199,4.332899,0,0,0,...,0,1,0,0,1,0,1,0,0,1
397,-0.186514,-0.369157,1,1.122148,-0.474279,-0.473408,-0.124850,0,0,0,...,0,0,0,0,0,0,0,0,1,1
746,1.169596,0.826913,1,-0.943043,0.432550,0.767199,-0.240559,0,0,0,...,0,0,0,0,0,0,0,0,1,1


In [7]:
test

Unnamed: 0,PassengerId_nmbr,Pclass_nmbr,Sex_bnry,Age_nmbr,SibSp_nmbr,Parch_nmbr,Fare_nmbr,PassengerId_NArw,Pclass_NArw,Name_NArw,...,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
0,1.733022,0.826913,1,0.330491,-0.474279,-0.473408,-0.490508,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1.736908,0.826913,0,1.190988,0.432550,-0.473408,-0.507194,0,0,0,...,0,0,0,0,0,0,0,0,1,1
2,1.740794,-0.369157,1,2.223584,-0.474279,-0.473408,-0.453112,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1.744680,0.826913,1,-0.185806,-0.474279,-0.473408,-0.473739,0,0,0,...,0,0,0,0,0,0,0,0,1,1
4,1.748565,0.826913,0,-0.530005,0.432550,0.767199,-0.400792,0,0,0,...,0,0,0,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,3.337817,0.826913,1,0.467827,-0.474279,-0.473408,-0.486064,0,0,0,...,0,0,0,0,0,0,0,0,1,1
414,3.341703,-1.565228,0,0.640270,-0.474279,-0.473408,1.543379,0,0,0,...,0,0,0,0,0,0,0,0,0,1
415,3.345588,0.826913,1,0.605850,-0.474279,-0.473408,-0.502163,0,0,0,...,0,0,0,0,0,0,0,0,1,1
416,3.349474,0.826913,1,-0.206458,-0.474279,-0.473408,-0.486064,0,0,0,...,0,0,0,0,0,0,0,0,1,1


To engineer a custom set of transformations, one can populate a transformdict and processdict entry for a new transformation category we’ll call ‘newt’. The functionpointer is used to match ‘newt’ to the processdict entries applied for ‘nmbr’, which is for z-score normalization. The transformdict is used to populate transformation category entries to the family tree primitives [Table 1] Teague (2021a) associated with a root category. The first four primitives are for upstream transforms. Since parents is a primitive with offspring, after applying transforms for the ‘newt’ entry, the downstream primitives from newt’s family tree will be inspected to apply ‘bsor’ for ordinal encoded standard deviation bins to the output of the upstream transform. The upstream ‘NArw’ is used to aggregate missing data markers. The assigncat parameter is used to assign ‘newt’ as a root category to a target input column ‘targetcolumn’. There are also many preconfigured trees available in the library.

In [8]:
processdict =  {'newt' : {'functionpointer'   : 'nmbr'}}    

transformdict =  {'newt' : {'parents'         : ['newt'],
                                    'siblings'             : [],
                                    'auntsuncles'      : [],
                                    'cousins'             : ['NArw'],
                                    'children'             : [],
                                    'niecesnephews' : [],
                                    'coworkers'         : [],
                                    'friends'              : ['bsor']}}
                        
assigncat = {'newt' : ['targetcolumn']}

This transformation set will return columns with headers logging the applied transformation categories as: ‘column_newt’ (z-score normalization), ‘column_newt_bsor’ (ordinal encoded standard deviation bins), and ‘column_NArw’ (missing data markers). In an alternate configuration ‘bsor’ could be entered to an upstream primitive; the arrangement here is just an example to demonstrate applying generations of transformations. Since friends is a supplement primitive, the upstream output ‘column_newt’ to which the ‘bsor’ transform is applied is retained in the returned data. And since cousins and friends are primitives without offspring, no further generations are inspected after applying their entries.

Parameters can be passed to the transformations through assignparam, as demonstrated here for updating a parameter setting so that the number of standard deviation bins for ‘bsor’ as applied to column ‘column’ is increased from the default of 6 to 7. Since 7 is an odd number, the center bin will straddle the mean.

In [9]:
assignparam = {'bsor' : {'column' : {'bincount' : 7}}}
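To make the three returned columns concrete, here is a minimal standard-library sketch of what a ‘newt’ family tree of this shape produces. It is not the library's implementation: the function name `newt_family`, the use of a population standard deviation, the mean-imputation stand-in for missing entries, and the exact bin boundaries (one standard deviation wide, centered so that an odd bincount straddles the mean) are all assumptions made for illustration.

```python
import bisect
import math
import statistics

def newt_family(values, bincount=6):
    """Hypothetical helper: z-scores, ordinal std-dev bins, missing markers."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    mean = statistics.fmean(present)
    std = statistics.pstdev(present) or 1.0  # guard against zero variance

    # Assumed bin boundaries in units of standard deviation, one unit wide:
    # bincount=6 -> [-2, -1, 0, 1, 2]; bincount=7 -> [-2.5, ..., 2.5]
    boundaries = [i - bincount / 2 for i in range(1, bincount)]

    out = {"column_newt": [], "column_newt_bsor": [], "column_NArw": []}
    for v in values:
        missing = v is None or math.isnan(v)
        # missing entries are set to the mean (z = 0) as a simple stand-in
        z = 0.0 if missing else (v - mean) / std
        out["column_newt"].append(z)
        out["column_newt_bsor"].append(bisect.bisect_right(boundaries, z))
        out["column_NArw"].append(int(missing))
    return out

result = newt_family([1.0, 2.0, None, 3.0], bincount=7)
```

With an even bincount a boundary falls exactly on the mean, so values just above and just below the mean land in different bins; with an odd bincount they share the center bin.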

Under automation, auto ML models are trained for each feature and missing marker activations are aggregated in the returned sets to support missing data imputation. These options can be deactivated with the MLinfill and NArw_marker parameters. The function automatically shuffles the rows of training data and defaults to not shuffling rows of test data. To retain the order of train set rows, one can deactivate the shuffletrain parameter.

In [10]:
shuffletrain = False

There is an option to mask the returned feature headers and order of columns for purposes of retaining privacy of model basis by the privacy_encode parameter, which can later be inverted with a postmunge(.) inversion operation if desired. This option can also be combined with encryption of the postprocess_dict by the encrypt_key parameter. Here is an example of privacy encoding without encryption.

In [11]:
privacy_encode = True
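The idea of masking headers and column order, while keeping a key that allows later inversion, can be sketched as follows. This is only an illustration of the concept under stated assumptions, not how the library implements privacy_encode: the helpers `mask_columns` and `unmask_columns`, and the representation of a data set as a dictionary of column lists, are hypothetical.

```python
import random

def mask_columns(columns, seed=None):
    """Shuffle column order and replace headers with integers.

    `columns` maps header -> list of values. Returns the masked columns
    and the key needed to invert the masking.
    """
    rng = random.Random(seed)
    headers = list(columns)
    rng.shuffle(headers)
    key = {i: header for i, header in enumerate(headers)}
    masked = {i: columns[header] for i, header in key.items()}
    return masked, key

def unmask_columns(masked, key, original_order):
    """Restore the original headers and column order from the key."""
    restored = {key[i]: values for i, values in masked.items()}
    return {header: restored[header] for header in original_order}

columns = {"Age_nmbr": [0.1, -0.2], "Sex_bnry": [1, 0], "Fare_nmbr": [0.5, 0.4]}
masked, key = mask_columns(columns, seed=0)
recovered = unmask_columns(masked, key, list(columns))
```

In this sketch the key plays the role that the postprocess_dict plays in the text: without it, the integer headers carry no information about the underlying features.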

Putting it all together in an automunge(.) call simply means passing our parameter specifications.

In [12]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 labels_column = labels_column,
                 processdict = processdict,
                 transformdict = transformdict,
                 assigncat = assigncat,
                 assignparam = assignparam,
                 shuffletrain = shuffletrain,
                 privacy_encode = privacy_encode)

_______________
Begin Automunge

______

versioning serial stamp:
_8.18_714045975680

Automunge returned train column set: 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]

Automunge returned ID column set: 
['Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



One can then save the returned postprocess_dict, such as by downloading with the pickle library, to use as a key for preparing additional corresponding data on a consistent basis with postmunge(.).


In [13]:
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
                 df_test)

_______________
Begin Postmunge

Postmunge returned test column set: 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]

_______________
Postmunge returned ID column set: 
['Automunge_index']

_______________
Postmunge Complete



Assigning noise injection root categories to targeted input columns is also done through the automunge(.) assigncat parameter, and once assigned they will be carried through as the basis for postmunge(.). Here we demonstrate assigning DPnb as the root category for a list of numeric features, DPod for a list of categoric features, and DPmm for a specific targeted numeric feature.

In [14]:
numeric_features_list = ['Age', 'Fare']
categoric_features_list = ['Pclass', 'Sex', 'Embarked']

assigncat = \
{'DPnb' : numeric_features_list,
 'DPod' : categoric_features_list,
 'DPmm' : '<targetcolumn>'}

To default to applying noise injection under automation one can take advantage of the automunge(.) powertransform parameter, which is used to select between scenarios for default transformations applied under automation. powertransform accepts specification as ‘DP1’ or ‘DP2’, resulting in automated encodings applying noise injection, as further detailed in the read me powertransform parameter writeup (or the DT and DB equivalents DT1 / DT2 / DB1 / DB2 for different default train and test noise configurations [Appendix C]).

Transformation category specific parameters can be passed to transformation functions through the automunge(.) assignparam parameter, which will then be carried through as the basis for preparing additional data in postmunge. In order of precedence, parameter assignments may be designated targeting a transformation category as applied to a specific column header with suffix appenders, a transformation category as applied to an input column header (which may include multiple instances), all instances of a specific transformation category, all transformation categories, or may be initialized as default parameters when defining a transformation category.

Here we demonstrate passing three different kinds of assignparam specifications.

In [15]:
assignparam = \
{'global_assignparam'  : 
    {'testnoise': True},
 'default_assignparam' : 
    {'DPod' : {'flip_prob' : 0.05}},
 'DPmm' : 
    {'targetcolumn' : 
        {'noisedistribution' : 'abs_normal',
         'sigma' : 0.02}}}

• ‘global_assignparam’ passes a specified parameter to all transformation functions applied to all columns; if a function does not accept that parameter, it is simply ignored. In this demonstration we turn on test noise injection for all transforms via the ‘testnoise’ parameter.

• ‘default_assignparam’ passes a specified parameter to all instances of a specified tree category (where tree category refers to the entries to the family tree primitives of a root category assigned to a column, and in many cases the tree category will be the same as the root category). Here we demonstrate updating the ‘flip_prob’ parameter from the 0.03 default for all instances of the DPod transform, which represents the ratio of entries that will be targeted for injection.

• To target parameters to specific categories as applied to specific columns, one can specify as {category : {column : {parameter : value}}}. Here we demonstrate targeting the application of the DPmm transform to a column ‘targetcolumn’ in order to apply all positive signed noise injections by setting the ‘noisedistribution’ parameter to ‘abs_normal’, and also reducing the standard deviation of the injections from the default of 0.03 to 0.02 with the ‘sigma’ setting. ‘targetcolumn’ refers to the header configuration received as input to a transform without the returned suffix.
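The order of precedence described above can be sketched as a small merge routine. This is a simplified illustration, not the library's internal logic: `resolve_params` is a hypothetical helper, the two column-level cases (with and without suffix appenders) are collapsed into one, and the default dictionaries are placeholders (only the 0.03 defaults for ‘flip_prob’ and ‘sigma’ come from the text; the ‘normal’ default for ‘noisedistribution’ is assumed).

```python
def resolve_params(assignparam, category, column, defaults):
    """Merge parameter sources from lowest to highest precedence."""
    params = dict(defaults)

    # 1. global_assignparam: offered to every transform, ignored when the
    #    transform does not accept the parameter
    for name, value in assignparam.get("global_assignparam", {}).items():
        if name in defaults:
            params[name] = value

    # 2. default_assignparam: every instance of the tree category
    params.update(assignparam.get("default_assignparam", {}).get(category, {}))

    # 3. tree category as applied to a specific column
    params.update(assignparam.get(category, {}).get(column, {}))
    return params

# Same specification as In [15]
assignparam = {
    "global_assignparam": {"testnoise": True},
    "default_assignparam": {"DPod": {"flip_prob": 0.05}},
    "DPmm": {"targetcolumn": {"noisedistribution": "abs_normal", "sigma": 0.02}},
}

# Placeholder defaults for illustration only
dpod_defaults = {"flip_prob": 0.03, "testnoise": False}
dpmm_defaults = {"noisedistribution": "normal", "sigma": 0.03, "testnoise": False}

dpod_params = resolve_params(assignparam, "DPod", "Sex", dpod_defaults)
dpmm_params = resolve_params(assignparam, "DPmm", "targetcolumn", dpmm_defaults)
```

Under these assumptions the DPod instance ends up with ‘flip_prob’ of 0.05 and test noise on, while the DPmm instance on ‘targetcolumn’ additionally receives its column-specific distribution and sigma.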


Having defined our assignparam specification dictionary, it can then be passed to the automunge(.) assignparam parameter. As a caveat, it’s important to keep in mind that targeting a category for assignparam specification is based on that category’s use as a tree category (as opposed to use as a root category), which in some cases may be different. The read me documentation on noise injection details any cases where a noise injection parameter acceptance may be a tree category differing from the root category, as is the case for a few of the hashing noise injections. Having defined our relevant parameters, we can then pass them to an automunge(.) call.

In [16]:
#!pip install Automunge
from Automunge import *
am = AutoMunge()

#import train and test data sets
import pandas as pd
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
labels_column = 'Survived'
trainID_column = 'PassengerId'

#prepare the data for machine learning
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 df_test = df_test,
                 labels_column = labels_column, 
                 trainID_column = trainID_column,
                 assigncat = assigncat,
                 assignparam = assignparam)

#download postprocess_dict with pickle

_______________
Begin Automunge

______

versioning serial stamp:
_8.18_607549145618

Automunge returned train column set: 
['Pclass_DPo4_DPod', 'Sex_DPo4_DPod', 'Age_DPn3_DPnb', 'SibSp_nmbr', 'Parch_nmbr', 'Fare_DPn3_DPnb', 'Embarked_DPo4_DPod', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Ticket_hash_0', 'Ticket_hash_1', 'Ticket_hash_2', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw']

Automunge returned ID column set: 
['PassengerId', 'Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



In [17]:
train

Unnamed: 0,Pclass_DPo4_DPod,Sex_DPo4_DPod,Age_DPn3_DPnb,SibSp_nmbr,Parch_nmbr,Fare_DPn3_DPnb,Embarked_DPo4_DPod,Pclass_NArw,Name_NArw,Name_hash_0,...,Cabin_NArw,Cabin_1010_0,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw
490,1,1,-0.313921,0.432550,-0.473408,-0.246260,1,0,0,180,...,1,0,0,0,0,0,0,0,0,0
263,2,1,0.709110,-0.474279,-0.473408,-0.648058,1,0,0,195,...,0,0,0,1,0,1,1,1,1,0
259,3,2,1.397507,-0.474279,0.767199,-0.124850,1,0,0,608,...,1,0,0,0,0,0,0,0,0,0
321,1,1,-0.185806,-0.474279,-0.473408,-0.489167,1,0,0,600,...,1,0,0,0,0,0,0,0,0,0
635,2,2,-0.116967,-0.474279,-0.473408,-0.386454,1,0,0,224,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
217,3,1,0.846789,0.432550,-0.473408,-0.104726,1,0,0,528,...,1,0,0,0,0,0,0,0,0,0
295,2,1,0.881078,-0.474279,-0.473408,-0.090221,2,0,0,833,...,1,0,0,0,0,0,0,0,0,0
595,1,1,0.433751,0.432550,0.767199,-0.162078,1,0,0,860,...,1,0,0,0,0,0,0,0,0,0
533,1,2,-0.004900,-0.474279,2.007806,-0.198133,2,0,0,849,...,1,0,0,0,0,0,0,0,0,0


In [18]:
test.head()

Unnamed: 0,Pclass_DPo4_DPod,Sex_DPo4_DPod,Age_DPn3_DPnb,SibSp_nmbr,Parch_nmbr,Fare_DPn3_DPnb,Embarked_DPo4_DPod,Pclass_NArw,Name_NArw,Name_hash_0,...,Cabin_NArw,Cabin_1010_0,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw
0,1,1,0.330491,-0.474279,-0.473408,-0.490508,3,0,0,369,...,1,0,0,0,0,0,0,0,0,0
1,1,2,1.190988,0.43255,-0.473408,-0.507194,1,0,0,428,...,1,0,0,0,0,0,0,0,0,0
2,3,1,2.223584,-0.474279,-0.473408,-0.453112,3,0,0,926,...,1,0,0,0,0,0,0,0,0,0
3,1,1,-0.185806,-0.474279,-0.473408,-0.473739,1,0,0,813,...,1,0,0,0,0,0,0,0,0,0
4,1,2,-0.530005,0.43255,0.767199,-0.400792,1,0,0,642,...,1,0,0,0,0,0,0,0,0,0


In [19]:
labels

Unnamed: 0,Survived_lbbn
490,0
263,0
259,1
321,0
635,1
...,...
217,0
295,0
595,0
533,1


In addition to preparing our training data and any validation or test data, this function also populates the postprocess_dict dictionary, which we recommend downloading with pickle if you intend to train a model with the returned data (pickle code demonstrations provided in read me). The postprocess_dict can then be uploaded in a separate notebook to prepare additional corresponding test data on a consistent basis, as may be used for inference.
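The save and reload round trip with pickle can be sketched as follows. The dictionary contents and the file name here are placeholders for illustration; the real postprocess_dict returned from automunge(.) is far larger.

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary returned by automunge(.)
postprocess_dict = {"example_key": "example_value"}

# Hypothetical file location, chosen only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "postprocess_dict.pickle")

# In the training notebook: save the dictionary
with open(path, "wb") as handle:
    pickle.dump(postprocess_dict, handle)

# In the inference notebook: load it back before calling postmunge(.)
with open(path, "rb") as handle:
    loaded_dict = pickle.load(handle)
```

Since unpickling can execute arbitrary code, a saved postprocess_dict should only be loaded from a source that is trusted.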

In [20]:
#!pip install Automunge
from Automunge import *
am = AutoMunge()

#import test data
import pandas as pd
df_test = pd.read_csv('test.csv')

#upload postprocess_dict with pickle
#now prepare the test data on a consistent basis
#traindata parameter accepts boolean, defaulting to False
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
                 df_test,
                 traindata = False)

_______________
Begin Postmunge

Postmunge returned test column set: 
['Pclass_DPo4_DPod', 'Sex_DPo4_DPod', 'Age_DPn3_DPnb', 'SibSp_nmbr', 'Parch_nmbr', 'Fare_DPn3_DPnb', 'Embarked_DPo4_DPod', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Ticket_hash_0', 'Ticket_hash_1', 'Ticket_hash_2', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw']

_______________
Postmunge returned ID column set: 
['PassengerId', 'Automunge_index']

_______________
Postmunge Complete



### Appendix F: Noise injection tutorial

##### F.1 DP transformation categories

The DP family of transforms is surveyed in the read me’s library of transformations section as Differential Privacy Noise Injections. The noise injections can be performed in conjunction with numeric normalizations or categoric encodings, the options for which were surveyed in [Appendix D] [Table 3].

Here is an example of assigning some of these root categories to received features with headers ‘Fare’, ‘Pclass’, and ‘Sex’. DTnb is z-score normalization with Gaussian noise to test data, shown here assigned to Fare. DBod is ordinal encoding with weighted activation flips to both train and test data, shown here assigned to Pclass and Sex. (To just inject to train data, the identifier string for that default configuration replaces the DT or DB prefix with DP.)

In [21]:
assigncat = {'DTnb' : 'Fare',
                 'DBod' : ['Pclass', 'Sex']}

assigncat

{'DTnb': 'Fare', 'DBod': ['Pclass', 'Sex']}
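As a rough illustration of the two kinds of injection named above, the following standard-library sketch adds Gaussian noise to a share of z-scored entries and resamples a share of ordinal entries in proportion to the activations seen in training data. The function names are hypothetical and the sampling details are assumptions; the library's transforms expose further options not shown here.

```python
import random
from collections import Counter

def gaussian_noise(z_scores, sigma, flip_prob, rng):
    """Add N(0, sigma) noise to roughly a flip_prob share of entries."""
    return [z + rng.gauss(0.0, sigma) if rng.random() < flip_prob else z
            for z in z_scores]

def weighted_flips(codes, train_codes, flip_prob, rng):
    """Resample roughly a flip_prob share of entries, weighted by how
    often each category was activated in the training data."""
    counts = Counter(train_codes)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return [rng.choices(categories, weights=weights)[0]
            if rng.random() < flip_prob else code
            for code in codes]

rng = random.Random(0)
noisy_fare = gaussian_noise([-0.49, 0.05, 1.15], sigma=0.5, flip_prob=1.0, rng=rng)
noisy_pclass = weighted_flips([1, 3, 3, 2], train_codes=[1, 2, 3, 3, 3],
                              flip_prob=0.05, rng=rng)
```

Note that in this sketch a resampled categoric entry can land on its original value, so the realized share of changed entries may be lower than flip_prob.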

##### F.2 Parameter assignment

Each of these transformations accepts optional parameter specifications to vary from the defaults. Parameters are passed to transformations through the automunge(.) assignparam parameter. As we described in Appendix D, parameter assignments through assignparam can be conducted in three ways: global_assignparam passes the setting to every transform applied to every column, default_assignparam passes the same setting to every instance of a specific transformation’s tree category identifier applied to any column, and in the third option a parameter setting can be assigned to a specific transformation tree category identifier passed to a specific column (where that column may be an input column or a derived column with suffix appender passed to the transform). Note that the difference between a tree category and a root category Teague (2021a) is that a root category is the identifier of the family tree of transformation categories assigned to a column in the assigncat parameter, while a tree category is an entry to one of those family tree primitives which is used to access the transformation function. To restate for clarity, the (column) string designates either the input column header (before suffixes are applied) or an intermediate column header with suffixes that serves as input to the target transformation.

In [22]:
assignparam = {
  'global_assignparam'  : {'testnoise': True},
  'default_assignparam' : {'DPnb' : {'sigma' : 0.5}},
  'DPod' : {'Pclass'   : {'testnoise' : False}}}

assignparam

{'global_assignparam': {'testnoise': True},
 'default_assignparam': {'DPnb': {'sigma': 0.5}},
 'DPod': {'Pclass': {'testnoise': False}}}

For noise injections that are performed in conjunction with a normalization or encoding, the noise transform is generally applied in a different tree category than the encoding transform, so if parameters are desired to be passed to the encoding, assignparam will need to target a different tree category for the encoding than for the noise. Generally speaking, the noise transform family trees have been configured so that the noise tree category matches the root category, which was intentional for simplicity of parameter assignment (with an exception for DPhs for esoteric reasons). To view the full family tree such as to inspect the encoding tree category, the set of family trees associated with various root categories are provided in the code repository as FamilyTrees.md.

Note that assignparam can also be used to deviate from the default train or test noise injection settings. As noted above, the convention for the string identifiers of noise root categories is that ‘DP’ injects noise to train and not test data, ‘DT’ injects noise to test and not train data, and ‘DB’ injects noise to both train and test data. These are the defaults, but each of these can be updated by parameter assignment with assignparam specification of ‘trainnoise’ or ‘testnoise’ parameters.

As noted in [Appendix C], subsequent data passed to postmunge(.) can be treated as either test data or train data, and in both cases can also have noise deactivated. The postmunge(.) traindata parameter defaults to False to prepare the data as test data and accepts entries of {False, True, ‘test_no_noise’, ‘train_no_noise’}.
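The interaction between the root category prefix, the ‘trainnoise’ / ‘testnoise’ overrides, and the postmunge(.) traindata parameter can be summarized in a small decision function. This is a sketch of the conventions as described in the text; `noise_is_injected` and `PREFIX_DEFAULTS` are hypothetical names and not part of the library.

```python
# Default (trainnoise, testnoise) settings implied by each prefix
PREFIX_DEFAULTS = {
    "DP": (True, False),   # noise to train data only
    "DT": (False, True),   # noise to test data only
    "DB": (True, True),    # noise to both
}

def noise_is_injected(root_category, traindata=False,
                      trainnoise=None, testnoise=None):
    """Return True if noise would be applied to the data being prepared."""
    default_train, default_test = PREFIX_DEFAULTS[root_category[:2]]
    # assignparam overrides take precedence over the prefix defaults
    trainnoise = default_train if trainnoise is None else trainnoise
    testnoise = default_test if testnoise is None else testnoise

    # the two string options treat data as train or test with noise off
    if traindata in ("train_no_noise", "test_no_noise"):
        return False
    return trainnoise if traindata is True else testnoise

# DPnb leaves test data clean by default, unless testnoise is turned on
default_case = noise_is_injected("DPnb")
overridden_case = noise_is_injected("DPnb", testnoise=True)
```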

Most of the noise injection transforms share common parameters between those targeting numeric or categoric entries.

In [23]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 df_test = df_test,
                 labels_column = labels_column, 
                 trainID_column = trainID_column,
                 assigncat = assigncat,
                 assignparam = assignparam)

_______________
Begin Automunge

______

versioning serial stamp:
_8.18_649902266688

Automunge returned train column set: 
['Pclass_DBo4_DBod', 'Sex_DBo4_DBod', 'Age_nmbr', 'SibSp_nmbr', 'Parch_nmbr', 'Fare_DTn3_DTnb', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Ticket_hash_0', 'Ticket_hash_1', 'Ticket_hash_2', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

Automunge returned ID column set: 
['PassengerId', 'Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



In [24]:
train.head()

Unnamed: 0,Pclass_DBo4_DBod,Sex_DBo4_DBod,Age_nmbr,SibSp_nmbr,Parch_nmbr,Fare_DTn3_DTnb,Pclass_NArw,Name_NArw,Name_hash_0,Name_hash_1,...,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
735,1,1,-0.082547,-0.474279,-0.473408,-0.324071,0,0,1015,293,...,0,0,0,0,0,0,0,0,1,1
374,1,2,-1.83796,2.246209,0.767199,-0.223957,0,0,486,469,...,0,0,0,0,0,0,0,0,1,1
218,2,2,0.158392,-0.474279,-0.473408,0.88719,0,0,177,469,...,1,0,1,1,1,1,0,0,0,1
659,2,1,1.948225,-0.474279,2.007806,1.631419,0,0,30,293,...,1,1,0,1,1,0,1,0,0,1
357,3,2,0.57143,-0.474279,-0.473408,-0.386454,0,0,675,469,...,0,0,0,0,0,0,0,0,1,1


In [25]:
#assumes DPmm and DPnb have been assigned in assigncat
assignparam = {
  'DPmm' : {'Age' : {'noisedistribution'         : 'abs_normal',
                     'noise_scaling_bias_offset' : False}},
  'default_assignparam' : {'DPnb' : {'flip_prob' : 0.1}},
  'global_assignparam'  : {'testnoise': True},
}

In [26]:
assigncat = {'DTnb' : 'Fare',
             'DTmm' : 'Age',
             'DBod' : ['Pclass', 'Sex']}

In [27]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 df_test = df_test,
                 labels_column = labels_column, 
                 trainID_column = trainID_column,
                 assigncat = assigncat,
                 assignparam = assignparam)

_______________
Begin Automunge

______

versioning serial stamp:
_8.18_807635116332

Automunge returned train column set: 
['Pclass_DBo4_DBod', 'Sex_DBo4_DBod', 'SibSp_nmbr', 'Parch_nmbr', 'Fare_DTn3_DTnb', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'Age_DTm2_DTmm', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Ticket_hash_0', 'Ticket_hash_1', 'Ticket_hash_2', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

Automunge returned ID column set: 
['PassengerId', 'Automunge_index']

Automunge returned label column set: 
['Survived_lbbn']

_______________
Automunge Complete



In [28]:
train.head()

Unnamed: 0,Pclass_DBo4_DBod,Sex_DBo4_DBod,SibSp_nmbr,Parch_nmbr,Fare_DTn3_DTnb,Pclass_NArw,Name_NArw,Name_hash_0,Name_hash_1,Name_hash_2,...,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
674,3,1,-0.474279,-0.473408,-0.648058,0,0,270,293,822,...,0,0,0,0,0,0,0,0,1,1
168,2,1,-0.474279,-0.473408,-0.126359,0,0,900,293,839,...,0,0,0,0,0,0,0,0,1,1
701,2,1,-0.474279,-0.473408,-0.119064,0,0,888,293,184,...,1,1,1,1,0,0,1,0,1,1
403,1,1,0.43255,-0.473408,-0.329102,0,0,712,293,154,...,0,0,0,0,0,0,0,0,1,1
508,1,1,-0.474279,-0.473408,-0.194778,0,0,282,293,670,...,0,0,0,0,0,0,0,0,1,1


In [29]:
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
                 df_test)

_______________
Begin Postmunge

Postmunge returned test column set: 
['Pclass_DBo4_DBod', 'Sex_DBo4_DBod', 'SibSp_nmbr', 'Parch_nmbr', 'Fare_DTn3_DTnb', 'Pclass_NArw', 'Name_NArw', 'Name_hash_0', 'Name_hash_1', 'Name_hash_2', 'Name_hash_3', 'Name_hash_4', 'Name_hash_5', 'Name_hash_6', 'Name_hash_7', 'Name_hash_8', 'Name_hash_9', 'Name_hash_10', 'Name_hash_11', 'Name_hash_12', 'Name_hash_13', 'Sex_NArw', 'Age_NArw', 'Age_DTm2_DTmm', 'SibSp_NArw', 'Parch_NArw', 'Ticket_NArw', 'Ticket_hash_0', 'Ticket_hash_1', 'Ticket_hash_2', 'Fare_NArw', 'Cabin_NArw', 'Cabin_1010_0', 'Cabin_1010_1', 'Cabin_1010_2', 'Cabin_1010_3', 'Cabin_1010_4', 'Cabin_1010_5', 'Cabin_1010_6', 'Cabin_1010_7', 'Embarked_NArw', 'Embarked_1010_0', 'Embarked_1010_1']

_______________
Postmunge returned ID column set: 
['PassengerId', 'Automunge_index']

_______________
Postmunge Complete



In [30]:
test.head()

Unnamed: 0,Pclass_DBo4_DBod,Sex_DBo4_DBod,SibSp_nmbr,Parch_nmbr,Fare_DTn3_DTnb,Pclass_NArw,Name_NArw,Name_hash_0,Name_hash_1,Name_hash_2,...,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
0,1,1,-0.474279,-0.473408,-0.490508,0,0,369,293,119,...,0,0,0,0,0,0,0,0,1,0
1,1,2,0.43255,-0.473408,-0.507194,0,0,428,470,119,...,0,0,0,0,0,0,0,0,1,1
2,3,1,-0.474279,-0.473408,-0.453112,0,0,926,293,642,...,0,0,0,0,0,0,0,0,1,0
3,1,1,-0.474279,-0.473408,-0.473739,0,0,813,293,342,...,0,0,0,0,0,0,0,0,1,1
4,1,2,0.43255,0.767199,-0.400792,0,0,642,470,27,...,0,0,0,0,0,0,0,0,1,1


##### F.6 Data augmentation with noise

Data augmentation refers to increasing the size of a training set with manipulations to increase variety. In the image modality it is common to achieve data augmentation by way of adjustments like image cropping, rotations, color shift, etc. Here we are simply injecting noise to training data for similar effect. In a deep learning benchmark performed in Teague (2020a) it was found that this type of data augmentation was fairly benign with a fully represented data set, but was increasingly beneficial with underserved training data. Note that this type of data augmentation can be performed in conjunction with non-deterministic inference by simply injecting to both train and test data.

Data augmentation can be realized by assigning noise transforms in conjunction with the automunge(.) noise_augment parameter, which accepts the number of additional duplicates to prepare, e.g. noise_augment=1 would double the size of the training set returned from automunge(.). For cases where too much duplication starts to run into memory constraints, additional duplicates can also be prepared with postmunge(.), which also has a noise_augment parameter and accepts the traindata parameter to distinguish whether a data set is to be treated as train or test data.

Under the default configuration, when noise_augment is received as an integer dtype, one of the duplicates will be prepared without noise. If noise_augment is received as a float with an integer value (e.g. 2.0), all of the duplicates will be prepared with noise.
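
The bookkeeping implied by these two paragraphs can be sketched in plain Python. This is only an illustration under one reading of the text (noise_augment counts additional duplicates, an integer keeps one copy free of noise, a float injects noise into every copy); the helper name augmentation_plan is hypothetical and is not part of the Automunge API.

```python
def augmentation_plan(n_rows, noise_augment):
    """Return (total_rows, noisy_copies, clean_copies) for a training set.

    Assumed reading of the text: noise_augment counts *additional*
    duplicates, an int keeps one copy noise-free, and a float such as
    2.0 injects noise into every copy.
    """
    if isinstance(noise_augment, bool) or noise_augment < 0:
        raise ValueError("noise_augment should be a non-negative int or float")
    extra = int(noise_augment)
    total_copies = 1 + extra
    clean_copies = 1 if isinstance(noise_augment, int) else 0
    noisy_copies = total_copies - clean_copies
    return n_rows * total_copies, noisy_copies, clean_copies


# 891 is the Titanic training row count used throughout this appendix
plan_int = augmentation_plan(891, 1)      # doubles the set, one clean copy
plan_float = augmentation_plan(891, 2.0)  # triples the set, all copies noisy
```

Under this reading, the noise_augment = 2.0 call used in the demonstration would return three noisy copies of the 891 training rows.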

Here is an example of preparing data augmentation for the data set loaded earlier.

In [31]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 powertransform = 'DP2',
                 noise_augment = 2.0,
                 printstatus = False)

train.head()

Unnamed: 0,Survived_DPb2_DPbn,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,PassengerId_NArw,PassengerId_DPrt,Survived_NArw,Pclass_NArw,Pclass_DPrt,Name_NArw,...,Parch_NArw,Parch_DPrt,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw
2212,0,1,25,1,0,0.450853,1,0,0.0,0,...,0,0.0,0,320,0,0,0,0.051822,0,0
742,1,1,6,1,0,0.447191,1,0,0.5,0,...,0,0.0,0,541,0,0,0,0.020495,1,0
1442,1,1,7,1,0,1.0,1,0,0.5,0,...,0,0.0,0,53,0,0,0,0.050749,1,0
1371,1,1,2,1,0,1.0,1,0,1.0,0,...,0,0.333333,0,371,132,0,0,0.091543,1,0
1540,0,0,3,1,0,1.0,1,0,1.0,0,...,0,0.0,0,371,196,0,0,0.014737,1,0


##### F.7 Alternate random samplers

The random sampling for noise injection defaults to numpy’s PCG64, which is based on the PCG pseudo random number generator algorithm O’Neill (2014). On its own this generator is not truly random; it relies on seedings of entropy provided by the operating system, which are then enhanced through use. To support integration of enhanced randomness profiles, both automunge(.) and postmunge(.) accept parameters for entropy_seeds and random_generator.

__entropy_seeds__ accepts an integer or list/array of integers which may serve as a supplemental source of entropy for the numpy.random generator to enhance randomness properties.

__random_generator__ accepts input of a numpy.random.Generator formatted random sampler. An example could be numpy.random.MT19937 for Mersenne Twister, or even an external library with a numpy.random formatted generator, such as one that samples with the support of quantum circuits.

Specifications of entropy_seeds and random_generator are specific to an automunge(.) or postmunge(.) call; in other words, they are not returned in the populated postprocess_dict. The two parameters can also be passed in tandem, for sampling with a custom generator with custom supplemental entropy seeds.
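
For readers less familiar with the numpy.random interfaces named above, the following sketch shows how numpy itself combines a list of integer seeds with an alternate bit generator. It uses only numpy calls and does not show how Automunge consumes the two parameters internally; the variable names and the noise scale are illustrative.

```python
import numpy as np

# Supplemental entropy, e.g. integers obtained from an external source.
entropy_seeds = [4, 5, 6]

# SeedSequence mixes the integers into initial state for a bit generator.
seed_sequence = np.random.SeedSequence(entropy_seeds)

# MT19937 is a bit generator; Generator wraps it to expose samplers.
rng = np.random.Generator(np.random.MT19937(seed_sequence))
noise = rng.normal(loc=0.0, scale=0.06, size=5)

# The same seeds reproduce the same draws, which is why fresh entropy
# matters whenever reproducible noise is not wanted.
rng_repeat = np.random.Generator(
    np.random.MT19937(np.random.SeedSequence(entropy_seeds)))
noise_repeat = rng_repeat.normal(loc=0.0, scale=0.06, size=5)
```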

If an alternate library does not have a numpy.random formatted generator, its output can be channeled to entropy_seeds for similar benefit. Here is an example of specifying an alternate generator and supplemental entropy seedings.

In [32]:
import numpy

random_generator = numpy.random.MT19937
entropy_seeds = [4,5,6]

In [33]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 powertransform = 'DP2',
                 random_generator = random_generator,
                 entropy_seeds = entropy_seeds,
                 noise_augment = 2.0,
                 printstatus = False)

train.head()

Unnamed: 0,Survived_DPb2_DPbn,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,PassengerId_NArw,PassengerId_DPrt,Survived_NArw,Pclass_NArw,Pclass_DPrt,Name_NArw,...,Parch_NArw,Parch_DPrt,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw
2396,1,1,45,1,0,1.0,1,0,1.0,0,...,0,0.0,0,365,0,0,0,0.015713,1,0
937,1,1,144,2,0,0.051685,1,0,1.0,0,...,0,0.0,0,51,0,0,0,0.030254,1,0
576,0,0,7,1,0,0.568539,1,0,0.5,0,...,0,0.333333,0,161,0,0,0,0.050749,1,0
995,1,1,3,1,0,0.116854,1,0,1.0,0,...,0,0.0,0,699,0,0,0,0.014498,1,0
1411,0,0,76,1,0,1.0,1,0,0.0,0,...,0,0.0,0,631,0,0,0,0.1825,0,0


In the default case the same bank of entropy seeds is fed to each sampling operation with a shuffle. The library also supports different types of sampling scenarios that can be supported by entropy seedings. Alternate sampling scenarios can be specified to automunge(.) or postmunge(.) by the sampling_dict parameter. Here are a few scenarios to illustrate.

1) In one scenario, instead of passing externally sampled supplemental entropy seeds, a user can pass a custom generator for internal sampling of entropy seeds. Here is an example of using a custom generator to sample entropy seeds and the default generator PCG64 for the sampling applied in the transformations. The sampling_type bulk_seeds means that a unique seed will be generated for each sampled entry. When not sampling externally, this scenario may be beneficial for improving the utilization rate of quantum hardware, since the quantum sampling will only take place once per automunge(.) or postmunge(.) call and latency will be governed by the sampler instead of pandas operations.

In [34]:
#random_generator = (custom numpy formatted generator)
random_generator = numpy.random.MT19937
entropy_seeds = False
sampling_dict = \
{'sampling_type' : 'bulk_seeds',
 'extra_seed_generator' : 'custom',
 'sampling_generator' : 'PCG64',
 }

In [35]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 powertransform = 'DP2',
                 random_generator = random_generator,
                 entropy_seeds = entropy_seeds,
                 sampling_dict = sampling_dict,
                 printstatus = False)

train.head()

Unnamed: 0,Survived_DPb2_DPbn,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,PassengerId_NArw,PassengerId_DPrt,Survived_NArw,Pclass_NArw,Pclass_DPrt,Name_NArw,...,Parch_NArw,Parch_DPrt,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw
639,1,1,45,1,0,0.717978,1,0,1.0,0,...,0,0.0,0,928,0,0,0,0.031425,1,0
426,0,0,7,1,0,0.478652,1,0,0.5,0,...,0,0.0,0,115,0,0,0,0.050749,1,0
348,0,1,7,1,0,0.391011,1,0,1.0,0,...,0,0.166667,0,371,68,0,0,0.031035,1,0
48,1,1,146,2,0,0.053933,1,0,1.0,0,...,0,0.0,0,536,0,0,0,0.042315,1,0
562,1,1,58,1,0,0.631461,1,0,0.5,0,...,0,0.0,0,127,0,0,0,0.02635,1,0


2) In another scenario a user may want to reduce their sampling budget by only accessing one entropy seed for each set of entries. This is accessed with the sampling_type of sampling_seed.

In [36]:
#random_generator = (custom numpy formatted generator)
random_generator = numpy.random.MT19937
entropy_seeds = False
sampling_dict = \
{'sampling_type' : 'sampling_seed',
 'extra_seed_generator' : 'custom',
 'sampling_generator' : 'PCG64',
 }

In [37]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 powertransform = 'DP2',
                 random_generator = random_generator,
                 entropy_seeds = entropy_seeds,
                 sampling_dict = sampling_dict,
                 printstatus = False)

train.head()

Unnamed: 0,Survived_DPb2_DPbn,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,PassengerId_NArw,PassengerId_DPrt,Survived_NArw,Pclass_NArw,Pclass_DPrt,Name_NArw,...,Parch_NArw,Parch_DPrt,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw
93,1,1,3,1,0,0.104494,1,0,1.0,0,...,0,0.333333,0,371,178,0,0,0.04016,1,0
54,1,1,66,2,0,0.060674,1,0,0.0,0,...,0,0.166667,0,729,0,0,0,0.120975,0,0
848,1,1,22,1,0,0.952809,1,0,0.5,0,...,0,0.166667,0,388,0,0,0,0.07564,1,0
534,1,0,6,1,0,0.6,1,0,1.0,0,...,0,0.0,0,951,0,0,0,0.016908,1,0
255,0,0,38,2,0,0.286517,1,0,1.0,0,...,0,0.333333,0,772,0,0,0,0.029758,1,0


3) There may be a case where a source of supplemental entropy seeds isn’t available as a numpy.random formatted generator. In this case, in order to apply one of the alternate sampling_type scenarios, a user may want to know how many seeds need to be sampled externally and passed through the entropy_seeds parameter. This budget can be found by first running the automunge(.) call without entropy seeding specifications to generate the report returned as postprocess_dict[‘sampling_report_dict’]. (Note that this isn’t needed if sampling seeds internally with a custom generator.) The sampling_report_dict reports requirements separately for train and test data, and in the bulk_seeds case has a row count basis. (If not passing test data to automunge(.), the test budget can be omitted. For postmunge(.), the use of the train or test budget should align with the postmunge(.) traindata parameter.) For example, if a user wishes to derive a set of entropy seeds to support a bulk_seeds sampling type, they can produce a report and derive the budget as follows:

In [38]:
#first run automunge(.) to populate postprocess_dict
#using comparable category and parameter assignments
#we recommend running initially with default sampling_type
#to populate sampling_report_dict for test data even if df_test not provided
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 powertransform = 'DP2',
                 random_generator = False,
                 entropy_seeds = False,
                 sampling_dict = False,
                 printstatus = False)

In [39]:
#access the sampling_report_dict in the returned postprocess_dict
postprocess_dict['sampling_report_dict']

{'rowcount_basis_train': 891,
 'rowcount_basis_test': 1,
 'bulk_seeds_total_train': 45811,
 'bulk_seeds_total_test': 3,
 'bulk_seeds_stochastic_count_train': 18867,
 'bulk_seeds_stochastic_count_test': 0,
 'sampling_seed_total_train': 111,
 'sampling_seed_total_test': 3,
 'transform_seed_total': 27}

In [40]:
#access the sampling_report_dict in the returned postprocess_dict
sampling_report_dict = postprocess_dict['sampling_report_dict']

#a bulk_seeds sampling_type budget will need to take account of row counts
rowcount_train = df_train.shape[0]
rowcount_test = df_test.shape[0]

#the budget can be derived as
train_budget = \
sampling_report_dict['bulk_seeds_total_train'] * rowcount_train \
/ sampling_report_dict['rowcount_basis_train']

test_budget = \
sampling_report_dict['bulk_seeds_total_test'] * rowcount_test \
/ sampling_report_dict['rowcount_basis_test']

#number of external seeds needed for bulk seeds case:
seed_count = \
train_budget + test_budget

#this number of seeds can then be passed to the entropy_seeds parameter

externally_sampled_seeds_list = [1234,3456124,32546]

random_generator = False
entropy_seeds = externally_sampled_seeds_list
sampling_dict = \
{'sampling_type' : 'bulk_seeds',
 'extra_seed_generator' : 'PCG64',
 'sampling_generator' : 'PCG64',
 }

In [41]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             powertransform = 'DP2',
             random_generator = random_generator,
             entropy_seeds = entropy_seeds,
             sampling_dict = sampling_dict,
             printstatus = False)

train.head()

Unnamed: 0,Survived_DPb2_DPbn,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,PassengerId_NArw,PassengerId_DPrt,Survived_NArw,Pclass_NArw,Pclass_DPrt,Name_NArw,...,Parch_NArw,Parch_DPrt,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw
213,1,1,0,1,0,0.239326,1,0,0.5,0,...,0,0.0,0,228,0,0,0,0.025374,1,0
430,0,1,25,1,0,0.483146,1,0,0.0,0,...,0,0.0,0,320,0,0,0,0.051822,0,0
202,0,0,3,2,0,0.266085,1,0,1.0,0,...,0,0.012137,0,431,255,0,0,0.031671,1,0
716,0,0,94,2,0,0.804494,1,0,0.0,0,...,0,0.0,0,637,54,0,0,0.444099,0,0
682,1,1,129,1,0,0.766292,1,0,1.0,0,...,0,0.0,0,864,0,0,0,0.018006,1,0
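
The budget arithmetic from the cell above can be wrapped in a small helper that also rounds up to a whole number of seeds. This is a sketch rather than library code: the function name bulk_seed_budget is hypothetical, the report values are copied from the sampling_report_dict printed earlier, and the 418 row count for df_test is an assumption about the test file.

```python
import math


def bulk_seed_budget(report, rowcount_train, rowcount_test=0):
    """Scale the reported bulk_seeds totals to the actual row counts."""
    train = (report['bulk_seeds_total_train'] * rowcount_train
             / report['rowcount_basis_train'])
    test = (report['bulk_seeds_total_test'] * rowcount_test
            / report['rowcount_basis_test'])
    # round up so the externally sampled seed list is never short
    return math.ceil(train) + math.ceil(test)


# values copied from the sampling_report_dict shown in this section
report = {'rowcount_basis_train': 891,
          'rowcount_basis_test': 1,
          'bulk_seeds_total_train': 45811,
          'bulk_seeds_total_test': 3}

# 891 train rows as reported; 418 test rows is an assumed count for df_test
seed_count = bulk_seed_budget(report, 891, 418)
```

With these inputs the train budget equals the reported total (the row count matches its basis), and the test budget scales the three per-row seeds up to the assumed 418 rows.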


##### F.8 QRAND library sampling

In [42]:
#recently validated in separate notebook

# from qrand import QuantumBitGenerator
# from qrand.platforms import QiskitPlatform
# from qrand.protocols import HadamardProtocol
# from qiskit import IBMQ

# provider = IBMQ.load_account()
# platform = QiskitPlatform(provider)
# protocol = HadamardProtocol()
# bitgen = QuantumBitGenerator(platform, protocol)

# #then can initialize automunge(.) or postmunge(.) parameters
# #for each sampling being channeled through the quantum circuit
# random_generator = bitgen
# entropy_seeds = False
# sampling_dict = \
# {'sampling_type' : 'default',
#  'extra_seed_generator' : 'off',
#  'sampling_generator' : 'custom',
#  }
 
# #or if you only want to access the quantum circuit once per data set
# #can initialize in this alternate configuration for similar result
# random_generator = bitgen
# entropy_seeds = False
# sampling_dict = \
# {'sampling_type' : 'bulk_seeds',
#  'extra_seed_generator' : 'custom',
#  'sampling_generator' : 'PCG64',
#  }

In [43]:
# train, train_ID, labels, \
# val, val_ID, val_labels, \
# test, test_ID, test_labels, \
# postprocess_dict = \
# am.automunge(df_train,
#              powertransform = 'DP2',
#              random_generator = random_generator,
#              entropy_seeds = entropy_seeds,
#              sampling_dict = sampling_dict,
#              printstatus = False)

# train.head()

##### F.9 All together now

Let’s do a quick demonstration tying it all together. Here we’ll apply the powertransform = ‘DP2’ option for noise under automation, override a few of the default transforms with assigncat, assign a few deviations to transformation parameters via assignparam, add some additional entropy seeds from some other resource, and prepare a few additional training data duplicates for data augmentation purposes.

In [44]:
powertransform = 'DP2'
assigncat = {'DPh2' : 'Name'}
noise_augment = 2.
entropy_seeds = [432,6,243,561232,89]

#(Age is a feature header in the Titanic data set)
assignparam = {
    'DPrt' : {'Age': {'noisedistribution'         : 'abs_normal',
                            'noise_scaling_bias_offset' : False}},
    'default_assignparam' : {'DPrt' : {'flip_prob' : 0.1}},
    'global_assignparam'  : {'testnoise': True}}

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
                 labels_column = labels_column,
                 trainID_column = trainID_column,
                 powertransform = powertransform,
                 assigncat = assigncat,
                 noise_augment = noise_augment,
                 entropy_seeds = entropy_seeds,
                 assignparam = assignparam,
                 printstatus = False)

train.head()

Unnamed: 0,Name_DPh2_DPo7,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,Pclass_NArw,Pclass_DPrt,Name_NArw,Sex_NArw,Age_NArw,Age_DPrt,...,Parch_NArw,Parch_DPrt,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw
2310,315,1,145,1,0,1.0,0,0,0,0.484795,...,0,0.0,0,646,0,0,0,0.015469,1,0
206,49,0,145,1,0,1.0,0,0,0,0.560191,...,0,0.0,0,702,0,0,0,0.015127,1,0
268,60,0,1,1,0,0.550094,0,0,0,0.426476,...,0,0.333333,0,371,85,0,0,0.054164,1,0
1367,199,1,7,1,0,0.5,0,0,0,0.421965,...,0,0.0,0,699,0,0,0,0.040989,1,0
1910,244,0,144,2,0,1.0,0,0,1,0.24958,...,0,0.166667,0,486,81,0,0,0.04364,0,0


In [45]:
test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
                 df_test)

test.head()

_______________
Begin Postmunge

Postmunge returned test column set: 
['Name_DPh2_DPo7', 'Sex_DPb2_DPbn', 'Cabin_DPo4_DPod', 'Embarked_DPo4_DPod', 'Pclass_NArw', 'Pclass_DPrt', 'Name_NArw', 'Sex_NArw', 'Age_NArw', 'Age_DPrt', 'SibSp_NArw', 'SibSp_DPrt', 'Parch_NArw', 'Parch_DPrt', 'Ticket_NArw', 'Ticket_DPhs_0_mlhs_DPod', 'Ticket_DPhs_1_mlhs_DPod', 'Ticket_DPhs_2_mlhs_DPod', 'Fare_NArw', 'Fare_DPrt', 'Cabin_NArw', 'Embarked_NArw']

_______________
Postmunge returned ID column set: 
['PassengerId', 'Automunge_index']

_______________
Postmunge Complete



Unnamed: 0,Name_DPh2_DPo7,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,Pclass_NArw,Pclass_DPrt,Name_NArw,Sex_NArw,Age_NArw,Age_DPrt,...,Parch_NArw,Parch_DPrt,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw
0,494,1,146,3,0,1.0,0,0,0,0.428248,...,0,0.0,0,439,0,0,0,0.015282,1,0
1,803,0,3,1,0,1.0,0,0,0,1.0,...,0,0.0,0,343,0,0,0,0.013663,1,0
2,850,1,54,3,0,0.5,0,0,0,1.0,...,0,0.0,0,509,0,0,0,0.018909,1,0
3,1013,1,129,1,0,1.0,0,0,0,0.334004,...,0,0.0,0,297,0,0,0,0.016908,1,0
4,497,0,3,1,0,1.0,0,0,0,0.271174,...,0,0.166667,0,3,0,0,0,0.023984,1,0


Similarly we can prepare additional test data in postmunge(.) using the postprocess_dict returned from automunge(.), which since we set testnoise as globally activated will result in injected noise in the default traindata=False case.

In [46]:
entropy_seeds = [2345, 77887, 2342, 7878789]

traindata = False

test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
                 df_test,
                 entropy_seeds = entropy_seeds,
                 traindata=traindata,
                 printstatus=False)

test.head()

Unnamed: 0,Name_DPh2_DPo7,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,Pclass_NArw,Pclass_DPrt,Name_NArw,Sex_NArw,Age_NArw,Age_DPrt,...,Parch_NArw,Parch_DPrt,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw
0,494,1,146,3,0,1.0,0,0,0,0.428248,...,0,0.0,0,439,0,0,0,0.015282,1,0
1,803,0,3,1,0,1.0,0,0,0,1.0,...,0,0.0,0,343,0,0,0,0.013663,1,0
2,850,1,54,3,0,0.5,0,0,0,1.0,...,0,0.0,0,509,0,0,0,0.018909,1,0
3,1013,1,129,1,0,1.0,0,0,0,0.334004,...,0,0.0,0,297,0,0,0,0.016908,1,0
4,497,0,3,1,0,1.0,0,0,0,0.271174,...,0,0.166667,0,3,0,0,0,0.021448,1,0


##### F.10 Noise directed at existing data pipelines

One more thing. When noise is directed at an existing data pipeline, such as for incorporation of noise into test data for an inference operation on a previously trained model, there may be a desire to inject noise without other edits to a dataframe. This is possible by passing the dataframe as a df_train to an automunge(.) call to populate a postprocess_dict, with assignment of the various features to one of these four pass-through categories:

• __DPne__: pass-through numeric with gaussian (or laplace) noise, comparable parameter support to DPnb

• __DPse__: pass-through with swap noise (e.g. for categoric data), comparable parameter support to DPmc

• __DPpc__: pass-through with weighted categoric noise (categoric activation flips), comparable parameter support to DPod

• __excl__: pass-through without noise

Once populated, the postprocess_dict can be used to prepare additional data in postmunge(.), which has lower latency. Note that DPse injects swap noise by accessing an alternate row entry for a target. This type of noise may not be suitable for test data injections in a scenario where inference may be run on a test set with one or very few samples. The convention in the library is that data is received in a tidy form (one column per feature and one row per observation), so ideally categoric features should be received in a single column configuration for targeting with DPse.
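
The caveat about small test sets follows from how swap noise works, which can be sketched without the library. The function below is an illustration only, not the DPse implementation; the name swap_noise and the choice to exclude the target's own row as a donor are assumptions.

```python
import random


def swap_noise(column, flip_prob, rng):
    """Replace each entry, with probability flip_prob, by another row's entry."""
    n = len(column)
    noised = list(column)
    for i in range(n):
        if n > 1 and rng.random() < flip_prob:
            # pick a donor row other than the current one
            donor = rng.randrange(n - 1)
            if donor >= i:
                donor += 1
            noised[i] = column[donor]
    return noised


rng = random.Random(0)
embarked = ['S', 'C', 'S', 'Q', 'S', 'C']
swapped = swap_noise(embarked, flip_prob=0.5, rng=rng)

# with a single row there is no alternate entry to draw from
single = swap_noise(['S'], flip_prob=1.0, rng=rng)
```

Every noised entry is drawn from the column itself, so a one-row inference batch has nothing to swap with and is returned unchanged.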

Note that DPne will return entries as float data type, converting any non-numeric to NaN. The default noise scale for DPne (sigma=0.06 / test_sigma=0.03) is set to align with z-score normalized data. For the DPne pass-through transform, since the feature may not be z-score normalized, the scaling is adjusted by multiplication with the evaluated standard deviation of the feature as found in the training data by use of the defaulted parameter rescale_sigmas = True. This adjustment factor is derived based on the training data used to fit the postprocess_dict, and that same basis is carried through to postmunge(.). If a user doesn’t have access to the training data, they can fit the postprocess_dict to a representative df_test routed as the automunge(.) df_train.
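
The rescaling described here can be illustrated with a short standard library sketch. It is not the DPne implementation: the helper names are hypothetical, the population standard deviation is used for simplicity (the library may evaluate it differently), and flip_prob is passed explicitly rather than taken from any library default.

```python
import math
import random
import statistics


def to_float(value):
    """Coerce an entry to float, mapping non-numeric entries to NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def fit_scale(train_values, sigma=0.06):
    """Derive the noise scale from training data (the rescale_sigmas idea)."""
    numeric = [v for v in map(to_float, train_values) if not math.isnan(v)]
    return sigma * statistics.pstdev(numeric)


def inject_gaussian(values, scale, flip_prob, rng):
    """Add N(0, scale) noise to roughly a flip_prob fraction of entries."""
    noised = []
    for v in map(to_float, values):
        if rng.random() < flip_prob:
            v += rng.gauss(0.0, scale)
        noised.append(v)
    return noised


rng = random.Random(0)
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]  # Fare values from df_train.head()
scale = fit_scale(fares)                    # fit once on the training basis
unchanged = inject_gaussian(fares, scale, flip_prob=0.0, rng=rng)
noised = inject_gaussian(fares + ['n/a'], scale, flip_prob=1.0, rng=rng)
```

The scale is fit once and then reused for later data, mirroring how the basis fit in automunge(.) is carried through to postmunge(.).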

Having populated the postprocess_dict, additional inference data can be channeled through postmunge, which has latency benefits.

In [47]:
numeric_features = ['Age', 'Fare']

categoric_features = \
['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked']

passthrough_features = ['PassengerId']

#assign the features to one of DTne/DTse/excl
#The DT configuration defaults to injecting noise just to test data
assigncat = \
{'DTne' : numeric_features, #numeric features receiving gaussian noise
 'DTse' : categoric_features, #categoric features receiving swap noise
 'excl' : passthrough_features,
}

#if we want to update the noise parameters they can be applied in assignparam
#shown here are the defaults
assignparam = \
{'default_assignparam' : 
  {'DTne' : {'test_sigma' : 0.06,
                 'rescale_sigmas' : True},
   'DTse' : {'test_flip_prob' : 0.01}}}

#We'll also deactivate shuffletrain to retain order of rows
shuffletrain = False

#note that the family trees for DPne / DPse / DPpc / excl 
#do not include NArw aggregation
#so no need to deactivate NArw_marker
#they are also already excluded from infill based on process_dict specification
#so no need to deactivate MLinfill

#the orig_headers parameter retains original column headers 
#without suffix appenders
orig_headers = True

#this operation can fit the postprocess_dict to the df_test (or df_train)
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_test,
                 assigncat = assigncat,
                 assignparam = assignparam,
                 shuffletrain = shuffletrain,
                 orig_headers = orig_headers,
                 printstatus = False)
            
train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [48]:
#we can then use the populated postprocess_dict to run postmunge(.)
#which has better latency than automunge(.)
#the entropy seeding parameters are shown with their defaults for reference

test, test_ID, test_labels, \
postreports_dict = \
am.postmunge(postprocess_dict, 
                 df_test,
                 printstatus = False,
                 random_generator = False,
                 entropy_seeds = False,
                 sampling_dict = {}
                 )

test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


The returned dataframe test can then be passed to inference. The order of columns in the returned dataframe will be retained for these transforms, and the orig_headers parameter retains original column headers without suffix appenders.

The postmunge(.) call can then be repeated as additional inference data becomes available, and could be applied sequentially to streams of data in inference.

### Appendix G - Advanced noise profiles

The noise profiles discussed thus far have mostly been a composition of two perturbation vectors, one arising from a Bernoulli sampling to select injection targets and a second arising from either a distribution sampling for numeric or a choice sampling for categoric. There may be use cases where a user desires some additional variations in the form of distribution. This section will survey a few more advanced noise profile composition methods available. Compositions beyond those discussed here are also available through custom defined transformation functions, which are available by use of a simple template.

In some sense, the methods discussed here will be a form of probabilistic programming, although not a Turing complete one. For Turing complete distribution composition, we recommend channeling through custom defined transformation functions that make use of external libraries with such capability, e.g. Tran et al. (2017). Custom transformations can apply a simple template detailed in the README Teague (2021a) section “Custom Transformation Functions.”

##### G.1 Noise parameter randomization

The generic numeric noise injection parameters were surveyed in [Appendix H.3] and their defaults presented in [Table 4]. Similarly, the generic categoric noise injection parameters were surveyed in [Appendix H.4] with defaults presented in [Table 5]. For each of the parameters related to noise scaling, weighting, or specification, the library offers options to randomize their derivation by a random sampling, including support for such sampling to be conducted with the support of entropy seeding. The random sampling of parameter values can be activated either by passing a parameter as a list of candidate values, for a choice sampling between them, or, for parameters with float specification, by passing it as an arbitrary scipy stats Virtanen et al. (2020) formatted distribution, for a shaped sampling.

One parameter that we did not cover in great detail in earlier discussions is the ‘retain_basis’ parameter. This parameter is relevant to noise parameter randomization, and refers to the practice of applying a common or unique noise parameter sampling between a feature as prepared in the test data received by automunge(.) and each subsequent postmunge(.) call. We expect that in most cases a user will desire a common noise profile between initial test data prepared in automunge(.) and subsequent test data prepared in postmunge(.) for inference, as is the default True setting. A consistent noise profile should be appropriate when relying on a corresponding noise profile injected to the training data. We speculate that there may be cases where non-deterministic inference could benefit from a uniquely sampled noise profile across inference operations. Deactivating the retain_basis option can either be conducted specific to a feature, or may be conducted globally using an assignparam[‘global_assignparam’] specification.
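
As a rough mental model of the two randomization modes and of retain_basis, consider the sketch below. It is an assumption-laden illustration, not the library's logic: resolve_param, NoiseParam, and NormalSpec are hypothetical names, and NormalSpec only mimics the .rvs() method of a scipy stats frozen distribution so that the example runs with the standard library alone.

```python
import random


class NormalSpec:
    """Hypothetical stand-in for a scipy.stats frozen distribution."""

    def __init__(self, loc, scale, rng):
        self.loc, self.scale, self.rng = loc, scale, rng

    def rvs(self):
        return self.rng.gauss(self.loc, self.scale)


def resolve_param(spec, rng):
    """Turn a parameter specification into one concrete value."""
    if isinstance(spec, list):
        return rng.choice(spec)   # choice sampling between candidates
    if hasattr(spec, 'rvs'):
        return spec.rvs()         # shaped sampling from a distribution
    return spec                   # plain value, used as given


class NoiseParam:
    """Resolve a spec once (retain_basis=True) or afresh on every call."""

    def __init__(self, spec, rng, retain_basis=True):
        self.spec, self.rng, self.retain_basis = spec, rng, retain_basis
        self.basis = resolve_param(spec, rng)

    def value(self):
        if self.retain_basis:
            return self.basis
        return resolve_param(self.spec, self.rng)


rng = random.Random(0)
sigma = NoiseParam([0.02, 0.03, 0.04], rng, retain_basis=True)
flip_prob = NoiseParam(NormalSpec(0.2, 0.1, rng), rng, retain_basis=False)

sigma_draws = [sigma.value() for _ in range(3)]      # one shared value
flip_draws = [flip_prob.value() for _ in range(3)]   # resampled each time
```

A normal distribution can return values outside [0, 1]; how out-of-range probabilities are handled is not addressed by this sketch.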

In [49]:
#Here we demonstrate two forms of parameter randomization

#each DPod transform flip_prob parameter 
#will sample from a normal distribution of mean 0.2 and scale 0.1

#the DPmm sigma parameter targeting Fare column
#will sample from a set of candidate values shown as [0.02, 0.03, 0.04]

#the default is that this sampling is applied both for train and test data
#to retain the train basis for test data can activate the retain_basis parameter

assigncat = \
{'DPod' : ['Pclass'],
 'DPmm' : ['Fare']}

from scipy import stats

assignparam = \
{'global_assignparam'  : 
    {'testnoise': True},
 'default_assignparam' : 
    {'DPod' : {'flip_prob' : stats.norm(loc=0.2, scale=0.1)}},
 'DPmm' : 
    {'Fare' : 
        {'sigma' : [0.02, 0.03, 0.04]}}}

In [50]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_test,
             assigncat = assigncat,
             assignparam = assignparam,
             printstatus = False)
            
train.head()

Unnamed: 0,PassengerId_nmbr,Pclass_DPo4_DPod,Sex_bnry,Age_nmbr,SibSp_nmbr,Parch_nmbr,PassengerId_NArw,Pclass_NArw,Name_NArw,Name_hash_0,...,Cabin_1010_0,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
66,-1.179534,1,0,-0.865412,-0.498872,-0.399769,0,0,0,592,...,0,0,0,0,0,0,0,0,1,0
131,-0.641501,2,1,1.602643,-0.498872,-0.399769,0,0,0,506,...,0,1,0,0,0,1,1,0,0,1
202,-0.053803,2,1,1.179547,0.616254,-0.399769,0,0,0,360,...,0,1,0,1,0,0,0,0,0,1
201,-0.062081,2,1,-2.111427,-0.498872,1.638076,0,0,0,194,...,0,0,0,0,0,0,0,0,1,1
192,-0.136578,1,1,-1.323765,0.616254,0.619154,0,0,0,395,...,0,0,0,0,0,0,0,0,1,1


##### G.2 Noise profile composition

Another channel for adding additional perturbation vectors into a noise profile is available by composing sets of noise injection transforms using the family tree primitives [Table 1]. The primitives are for purposes of specifying the order, type, and retention of derivations applied when a ‘root category’ is assigned to an input feature, where each derivation is associated with a ‘tree category’ populated in either the root category’s family or some downstream family tree accessed from origination of the root category. As they are implemented with elements of recursion, they can be used to specify transformation sets that include generations and branches of univariate derivations. Thus, multiple noise injection operations can be applied to a single returned set, potentially including noise of different scaling and/or injection ratios.

Here is an example of family tree specification for a numeric injection of Gaussian noises with two different profiles as applied downstream of a z-score normalization.

In [51]:
transformdict = {}
transformdict.update({'DPnb' : {'parents'       : ['DPn3'],
                                          'siblings'      : [],
                                          'auntsuncles'   : [],
                                          'cousins'       : ['NArw'],
                                          'children'      : [],
                                          'niecesnephews' : [],
                                          'coworkers'     : ['DPnb2'],
                                          'friends'       : []}})

transformdict.update({'DPn3' : {'parents'       : ['DPn3'],
                                          'siblings'      : [],
                                          'auntsuncles'   : [],
                                          'cousins'       : [],
                                          'children'      : ['DPnb'],
                                          'niecesnephews' : [],
                                          'coworkers'     : [],
                                          'friends'       : []}})

processdict = {}
processdict.update({'DPnb' : {'functionpointer' : 'DPnb',
                                        'defaultparams' : {'sigma':0.5,
                                                                 'flip_prob':0.0001}}})
                                                 
processdict.update({'DPnb2' : {'functionpointer' : 'DPnb',
                                         'defaultparams' : {'sigma':0.05,
                                                                  'flip_prob':0.03}}})
                                                                  
processdict.update({'DPn3' : {'functionpointer' : 'nmbr'}})

In [52]:
train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             transformdict = transformdict,
             processdict = processdict,
             shuffletrain = shuffletrain,
             assigncat = {'DPnb':'Age'},
             printstatus = False)
                          
train.head()

Unnamed: 0,PassengerId_nmbr,Survived_bnry,Pclass_nmbr,Sex_bnry,Age_DPn3_DPnb_DPnb2,SibSp_nmbr,Parch_nmbr,Fare_nmbr,PassengerId_NArw,Survived_NArw,...,Cabin_1010_1,Cabin_1010_2,Cabin_1010_3,Cabin_1010_4,Cabin_1010_5,Cabin_1010_6,Cabin_1010_7,Embarked_NArw,Embarked_1010_0,Embarked_1010_1
0,-1.729137,1,0.826913,1,-0.530005,0.43255,-0.473408,-0.502163,0,1,...,0,0,0,0,0,0,0,0,1,1
1,-1.725251,0,-1.565228,0,0.57143,0.43255,-0.473408,0.786404,0,1,...,1,0,1,0,0,1,0,0,0,1
2,-1.721365,0,0.826913,0,-0.254646,-0.474279,-0.473408,-0.48858,0,1,...,0,0,0,0,0,0,0,0,1,1
3,-1.71748,0,-1.565228,0,0.364911,0.43255,-0.473408,0.420494,0,1,...,0,1,1,1,0,0,0,0,1,1
4,-1.713594,1,0.826913,1,0.364911,-0.474279,-0.473408,-0.486064,0,1,...,0,0,0,0,0,0,0,0,1,1
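
As a rough illustration of what a composed profile of this kind amounts to, the following sketch applies two Gaussian injections in sequence to a z-score normalized feature using plain NumPy. It is not the Automunge implementation: the helper `inject_gaussian`, the synthetic stand-in for the Age column, and the order of the two injections are assumptions made for illustration, with `sigma` read as the noise standard deviation and `flip_prob` as the ratio of entries receiving noise.

```python
import numpy as np

def inject_gaussian(x, sigma, flip_prob, rng):
    """Add N(0, sigma) noise to entries selected with probability flip_prob (hypothetical helper)."""
    mask = rng.random(x.shape) < flip_prob
    noise = rng.normal(0.0, sigma, size=x.shape)
    return x + mask * noise

rng = np.random.default_rng(0)

# Synthetic stand-in for a numeric column such as Age (values are assumptions)
age = rng.normal(30.0, 14.0, size=10_000)

# z-score normalization, the role played by the nmbr function pointer
z = (age - age.mean()) / age.std()

# Rare, larger-scale injection followed by a more frequent, smaller-scale injection
step1 = inject_gaussian(z, sigma=0.5, flip_prob=0.0001, rng=rng)
step2 = inject_gaussian(step1, sigma=0.05, flip_prob=0.03, rng=rng)

# Share of entries altered by either injection
changed = np.mean(step2 != z)
```

Composing the two profiles yields a mixture in which most entries are untouched, a few percent receive small perturbations, and a very small share receive large ones.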


##### G.3 Protected attributes

As noted in the related work discussion in Section 7, one possible consequence of noise injection is that segments of a feature set corresponding to categories of an adjacent protected feature may be impacted unequally, since the segment distributions can differ while the noise profile is shared. Without mitigation, this may contribute to loss discrepancy between categories of that protected feature (Khani & Liang, 2020). The mitigation available in Automunge was inspired by the description of loss discrepancy in that work, and might be considered another contribution of this paper.

Loss discrepancy may arise in conjunction with noise because segments of a feature that are not randomly sampled may not share the distribution profile of the aggregate feature. The segments of a noise-targeted feature that correspond to the attributes of an adjacent protected feature may have this property. Thus, when a single noise profile is injected into segments with different scalings, those segments are unequally impacted.
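
The unequal impact can be seen in a small numerical sketch. The two synthetic segments, their spreads, and the helper `noise_to_signal` below are hypothetical and chosen only to make the effect visible. The same noise profile is drawn for both segments, and the ratio of noise variance to segment variance is compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic segments of one feature with different spreads (assumed values)
seg_narrow = rng.normal(0.0, 0.5, size=50_000)
seg_wide = rng.normal(0.0, 2.0, size=50_000)

sigma_noise = 0.5  # a single noise profile shared by both segments

def noise_to_signal(segment, sigma, rng):
    """Empirical ratio of injected noise variance to the segment's own variance."""
    noise = rng.normal(0.0, sigma, size=segment.shape)
    return noise.var() / segment.var()

# In expectation the ratios are (0.5/0.5)^2 and (0.5/2.0)^2 respectively,
# so the narrow segment is perturbed far more heavily relative to its spread.
r_narrow = noise_to_signal(seg_narrow, sigma_noise, rng)
r_wide = noise_to_signal(seg_wide, sigma_noise, rng)
```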

The Automunge solution is an as-yet untested hypothesis with a clean implementation. When a user specifies an adjacent protected feature for a numeric noise feature, the noise scale is adjusted separately for each segment of the noise target feature that corresponds to an attribute of the adjacent feature, with the train data basis carried through to test data. For example, if the aggregate feature has standard deviation σa and a segment has standard deviation σs, the noise for that segment can be scaled by multiplying by the ratio σs/σa. Similarly, if a protected feature is specified for a categoric noise feature, the weights derived from frequency counts can be calculated for each segment individually. In both cases, the segment noise distributions will share a common profile between train and test data, and so will the aggregate noise distribution as long as the distribution properties of the protected feature remain consistent. These options can be activated by passing an input header string to the protected_feature parameter of a distribution-sampled or weighted categoric noise transform through assignparam, with support for up to one protected feature per noise-targeted feature.

In [53]:
assignparam = \
{'global_assignparam'  : 
    {'protected_feature': 'Pclass'}}

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
am.automunge(df_train,
             assignparam = assignparam,
             powertransform = 'DP2',
             printstatus = False)
                          
train.head()

Unnamed: 0,Survived_DPb2_DPbn,Sex_DPb2_DPbn,Cabin_DPo4_DPod,Embarked_DPo4_DPod,PassengerId_NArw,PassengerId_DPrt,Survived_NArw,Name_NArw,Name_DPhs_0_mlhs_DPod,Name_DPhs_1_mlhs_DPod,...,Ticket_NArw,Ticket_DPhs_0_mlhs_DPod,Ticket_DPhs_1_mlhs_DPod,Ticket_DPhs_2_mlhs_DPod,Fare_NArw,Fare_DPrt,Cabin_NArw,Embarked_NArw,Pclass_NArw,Pclass_DPrt
338,0,1,129,1,0,0.379775,1,0,450,37,...,0,114,0,0,0,0.015713,1,0,0,1.0
137,1,1,20,1,0,0.153933,1,0,13,37,...,0,361,0,0,0,0.103644,0,0,0,0.0
225,1,1,3,1,0,0.252809,1,0,424,37,...,0,244,29,0,0,0.01825,1,0,0,1.0
500,1,1,129,1,0,0.561798,1,0,343,37,...,0,73,9,0,0,0.016908,1,0,0,1.0
90,1,1,129,1,0,0.101124,1,0,835,37,...,0,761,0,0,0,0.015713,1,0,0,1.0
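
Independently of the library, the segment rescaling described in this section can be sketched with pandas. This is a minimal illustration under stated assumptions, not the Automunge implementation: the function names, the toy dataframe, and the fallback ratio of 1.0 for attributes unseen in the train data are all hypothetical.

```python
import numpy as np
import pandas as pd

def fit_segment_scales(train, target, protected):
    """Ratio sigma_s / sigma_a for each protected attribute, fit on train data."""
    sigma_a = train[target].std()
    return (train.groupby(protected)[target].std() / sigma_a).to_dict()

def inject_scaled_noise(df, target, protected, scales, sigma, rng):
    """Gaussian noise whose per-row scale is sigma * (sigma_s / sigma_a)."""
    # Attributes not seen in train fall back to the aggregate scale (assumption)
    ratio = df[protected].map(scales).fillna(1.0).to_numpy()
    noise = rng.normal(0.0, 1.0, size=len(df)) * sigma * ratio
    return df[target].to_numpy() + noise

def fit_segment_weights(train, target, protected):
    """Per-segment category frequencies for weighted categoric sampling."""
    return {
        attribute: segment[target].value_counts(normalize=True).to_dict()
        for attribute, segment in train.groupby(protected)
    }

rng = np.random.default_rng(0)

# Toy train set: segment "a" is narrow, segment "b" is wide
train = pd.DataFrame({
    "group": ["a"] * 500 + ["b"] * 500,
    "x": np.concatenate([rng.normal(0, 0.5, 500), rng.normal(0, 2.0, 500)]),
    "c": ["u"] * 400 + ["v"] * 100 + ["u"] * 100 + ["v"] * 400,
})

scales = fit_segment_scales(train, "x", "group")
noisy_x = inject_scaled_noise(train, "x", "group", scales, sigma=0.1, rng=rng)
weights = fit_segment_weights(train, "c", "group")
```

Because `scales` and `weights` are fit once on the train data, the same dictionaries can be reused when injecting noise into test data, which mirrors carrying the train data basis through to test data.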
