<h1>ANOVOS - Data Transformer<span class="tocSkip"></span></h1>
<p>This notebook walks through the functions in the "data transformer" module of the ANOVOS package and shows how to invoke each of them.</p>
<div class="toc"><ul class="toc-item"><li><span><a href="#Attribute-Binning-(discretization)" data-toc-modified-id="Attribute-Binning-1">Attribute Binning (discretization)</a></span></li><li><span><a href="#Monotonic-Binning" data-toc-modified-id="Monotonic-Binning-2">Monotonic Binning</a></span></li><li><span><a href="#Categorical-Attribute-to-Numerical-Attribute-Conversion" data-toc-modified-id="Categorical-Attribute-to-Numerical-Attribute-Conversion-3">Categorical Attribute to Numerical Attribute Conversion</a></span></li><li><span><a href="#Missing-Value-Imputation" data-toc-modified-id="Missing-Value-Imputation">Missing Value Imputation</a></span></li><li><span><a href="#Outlier-Categories-Treatment" data-toc-modified-id="Outlier-Categories-Treatment-5">Outlier Categories Treatment</a></span></li></ul></div>

**Setting Spark Session**

In [1]:
# The star import brings the pre-configured `spark` session used in the cells below into scope
from anovos.shared.spark import *

**Input/Output Path** 

In [2]:
inputPath = "../data/income_dataset/csv"
outputPath = "../output/income_dataset/data_transformer"

**Read Input Data** 

In [3]:
from anovos.data_ingest.data_ingest import read_dataset
from pyspark.sql import functions as F
df = read_dataset(spark, file_path = inputPath, file_type = "csv",
                  file_configs = {"header": "True", "delimiter": "," , "inferSchema": "True"})
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Attribute Binning (discretization)
- API specification of function **attribute_binning** can be found <a href="../api_specification/anovos/data_transformer/transformers.html#anovos.data_transformer.transformers.attribute_binning">here</a>
- Supports numerical attributes only. Two binning options are available: equal-range binning (every bin has the same width) and equal-frequency binning (every bin holds roughly the same number of rows)
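The difference between the two binning options can be illustrated with a minimal pure-Python sketch. It is not the Anovos implementation: the helper names (`equal_range_cutoffs`, `equal_frequency_cutoffs`, `assign_bin`) are hypothetical, and the 1-based, right-closed bin convention is an assumption inferred from the outputs in this notebook.

```python
from bisect import bisect_left
from statistics import quantiles


def equal_range_cutoffs(values, bin_size):
    """Cutoffs that split [min, max] into bin_size intervals of equal width."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / bin_size
    return [lo + width * i for i in range(1, bin_size)]


def equal_frequency_cutoffs(values, bin_size):
    """Cutoffs at the k/bin_size quantiles, so bins hold similar row counts."""
    present = [v for v in values if v is not None]
    return quantiles(present, n=bin_size, method="inclusive")


def assign_bin(value, cutoffs):
    """1-based bin index (assumed convention); nulls stay null.

    A value equal to a cutoff falls into the lower bin (right-closed bins).
    """
    if value is None:
        return None
    return bisect_left(cutoffs, value) + 1


education_num = [1, 7, 9, 13, 16, None]  # made-up sample values
cutoffs = equal_range_cutoffs(education_num, bin_size=5)
print(cutoffs)                                          # [4.0, 7.0, 10.0, 13.0]
print([assign_bin(v, cutoffs) for v in education_num])  # [1, 2, 3, 4, 5, None]
```

With heavily tied data, several quantile cutoffs can coincide, so equal-frequency binning may produce fewer distinct bins than `bin_size`. That may be why some outputs in this section report fewer unique values than the requested number of bins.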

In [4]:
from anovos.data_transformer.transformers import attribute_binning

In [5]:
# Example 1 - Equal range binning + append transformed columns at the end
odf = attribute_binning(spark, idf=df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_range", 
                        bin_size=5, output_mode="append", print_impact=True)

odf.toPandas().head(5)

+--------------------+-------------+
|           attribute|unique_values|
+--------------------+-------------+
|education-num_binned|            5|
|hours-per-week_bi...|            5|
+--------------------+-------------+



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,hours-per-week_binned,education-num_binned
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,3.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3.0,3.0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3.0,4.0


In [6]:
# Distinct values after binning
odf.select('hours-per-week_binned').distinct().orderBy('hours-per-week_binned').toPandas().head(10)

Unnamed: 0,hours-per-week_binned
0,
1,1.0
2,2.0
3,3.0
4,4.0
5,5.0


In [7]:
# Example 2 - Equal frequency binning + replace original columns by transformed ones (default)
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_frequency", 
                        bin_size=5, print_impact=True)

odf.toPandas().head(5)

+--------------+-------------+
|     attribute|unique_values|
+--------------+-------------+
|hours-per-week|            4|
| education-num|            4|
+--------------+-------------+



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,2.0,4.0


In [8]:
# Distinct values after binning
odf.select('hours-per-week').distinct().orderBy('hours-per-week').toPandas().head(10)

Unnamed: 0,hours-per-week
0,
1,1.0
2,2.0
3,4.0
4,5.0


In [9]:
# Example 3 - Equal frequency binning + save binning model
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_frequency", 
                        bin_size=5, pre_existing_model=False, model_path=outputPath + "/attribute_binning")

odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,2.0,4.0


In [10]:
# Example 4 - Equal frequency binning + use pre-saved model
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], 
                        pre_existing_model=True, model_path=outputPath + "/attribute_binning")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,2.0,4.0
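
Examples 3 and 4 above fit the binning once, persist it under `model_path`, and later reapply it without refitting. The on-disk format Anovos uses is not shown in this notebook, so the sketch below only illustrates the fit-save-reuse pattern with a hypothetical JSON file (`cutoffs.json`) and hypothetical helper names.

```python
import json
import os
import tempfile


def save_binning_model(cutoffs_by_col, model_path):
    """Persist fitted cutoffs (hypothetical JSON layout, not the Anovos format)."""
    os.makedirs(model_path, exist_ok=True)
    with open(os.path.join(model_path, "cutoffs.json"), "w") as fh:
        json.dump(cutoffs_by_col, fh)


def load_binning_model(model_path):
    """Reload cutoffs so new data is binned exactly like the training data."""
    with open(os.path.join(model_path, "cutoffs.json")) as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "attribute_binning")
    fitted = {"education-num": [4.0, 7.0, 10.0, 13.0]}  # made-up cutoffs
    save_binning_model(fitted, model_path)
    restored = load_binning_model(model_path)
    print(restored == fitted)
```

Reusing saved cutoffs keeps bin boundaries stable between a training run and a later scoring run, which is the usual reason for a pre-existing model option.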


# Monotonic Binning
- API specification of function **monotonic_binning** can be found <a href="../api_specification/anovos/data_transformer/transformers.html#anovos.data_transformer.transformers.monotonic_binning">here</a>
- Bin size is computed dynamically
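
One way a bin count can be chosen dynamically is to start from a maximum and reduce it until the event rate moves in a single direction across the bins. The sketch below shows that idea in pure Python with equal-range bins. It is an assumption about the general technique, not the Anovos algorithm: the notebook does not state the library's actual criterion, and every function name here is hypothetical.

```python
from bisect import bisect_left


def equal_range_cutoffs(values, n_bins):
    """Cutoffs that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def event_rates(values, labels, event_label, cutoffs):
    """Share of rows carrying event_label in each non-empty bin, in bin order."""
    totals = [0] * (len(cutoffs) + 1)
    events = [0] * (len(cutoffs) + 1)
    for value, label in zip(values, labels):
        idx = bisect_left(cutoffs, value)
        totals[idx] += 1
        events[idx] += int(label == event_label)
    return [e / t for e, t in zip(events, totals) if t]


def is_monotonic(rates):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(rates, rates[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def monotonic_bin_count(values, labels, event_label, max_bins=10):
    """Largest bin count (searching downward) whose event rates are monotonic."""
    for n_bins in range(max_bins, 1, -1):
        cutoffs = equal_range_cutoffs(values, n_bins)
        if is_monotonic(event_rates(values, labels, event_label, cutoffs)):
            return n_bins
    return 1


# Made-up toy data: 4 bins give rates 0.5, 0.0, 1.0, 1.0 (not monotonic),
# while 3 bins give 1/3, 1/2, 1.0 (increasing).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
income = ["<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", ">50K"]
print(monotonic_bin_count(hours, income, ">50K", max_bins=4))  # 3
```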

In [11]:
from anovos.data_transformer.transformers import monotonic_binning

In [12]:
# Example 1 - Equal Range Binning + append transformed columns at the end
odf = monotonic_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], label_col="income", 
                        event_label=">50K", bin_method="equal_range", output_mode="append")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,hours-per-week_binned,education-num_binned
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,2.0,6.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1.0,6.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,2.0,4.0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,2.0,3.0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,2.0,6.0


In [13]:
# Distinct values for hours-per-week after binning 
odf.select("hours-per-week_binned").distinct().orderBy('hours-per-week_binned').toPandas()

Unnamed: 0,hours-per-week_binned
0,
1,1.0
2,2.0
3,3.0


In [14]:
# Example 2 - Equal Frequency Binning + replace original columns by transformed ones (default)
odf = monotonic_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], label_col="income", 
                        event_label=">50K", bin_method="equal_frequency")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,2.0,12.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,12.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,2.0,3.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,2.0,12.0


In [15]:
# Distinct values for hours-per-week after binning
odf.select("hours-per-week").distinct().orderBy('hours-per-week').toPandas()

Unnamed: 0,hours-per-week
0,
1,1.0
2,2.0
3,6.0
4,7.0


# Categorical Attribute to Numerical Attribute Conversion
- API specification of function **cat_to_num_unsupervised** can be found <a href="../api_specification/anovos/data_transformer/transformers.html#anovos.data_transformer.transformers.cat_to_num_unsupervised">here</a>
- Supports label encoding and one-hot encoding
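
The two encodings can be sketched in a few lines of pure Python. This is an illustration rather than the Anovos implementation: `build_index` and `one_hot` are hypothetical helpers, the `frequencyDesc` option name is an assumption (only `alphabetAsc` appears in this notebook), and the alphabetical tie-break for equally frequent categories is also assumed.

```python
from collections import Counter


def build_index(values, index_order="frequencyDesc"):
    """Map each category to an integer; nulls are ignored when fitting."""
    counts = Counter(v for v in values if v is not None)
    if index_order == "frequencyDesc":
        # Most frequent category gets 0; ties broken alphabetically (assumption)
        ordered = sorted(counts, key=lambda c: (-counts[c], c))
    elif index_order == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported index_order: {index_order}")
    return {category: i for i, category in enumerate(ordered)}


def one_hot(values, index):
    """One 0/1 column per category; null or unseen values become all zeros."""
    return [[int(index.get(v) == i) for i in range(len(index))] for v in values]


race = ["White", "White", "Black", "White", "Black", "Asian"]  # made-up sample
by_frequency = build_index(race)
print(by_frequency)                              # {'White': 0, 'Black': 1, 'Asian': 2}
print(build_index(race, "alphabetAsc"))          # {'Asian': 0, 'Black': 1, 'White': 2}
print(one_hot(["White", "Black"], by_frequency)) # [[1, 0, 0], [0, 1, 0]]
```

Label encoding keeps one column per attribute but imposes an arbitrary order on the categories. One-hot encoding avoids that order at the cost of one extra column per category.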

In [16]:
from anovos.data_transformer.transformers import cat_to_num_unsupervised

In [17]:
# Example 1 - with mandatory arguments (Label Encoding)
odf = cat_to_num_unsupervised(spark, df)
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,15092,,10.0,77516.0,4.889391,,2.0,13.0,1.0,3.0,1.0,0.0,0.0,2174.0,0.0,40.0,42,0
1,22895,,1.0,83311.0,4.920702,,2.0,13.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,13.0,42,0
2,28742,38.0,0.0,215646.0,5.333741,,0.0,9.0,2.0,9.0,1.0,0.0,0.0,0.0,0.0,40.0,42,0
3,4124,53.0,0.0,234721.0,5.370552,,5.0,7.0,0.0,9.0,0.0,1.0,0.0,0.0,0.0,40.0,42,0
4,11005,,0.0,338409.0,5.529442,,2.0,13.0,0.0,0.0,4.0,1.0,1.0,0.0,0.0,40.0,10,0


In [18]:
# Example 2 - 'all' columns (excluding drop_cols) + print impact
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa'], print_impact=True)
odf.toPandas().head(5)

Before
+-------+------+-----+-----------+-------+-----------+-----+------------+-------------+--------------+----------------+------------+-------+-----+------------+------------+--------------+--------------+------+
|summary|   ifa|  age|  workclass| fnlwgt|     logfnl|empty|   education|education-num|marital-status|      occupation|relationship|   race|  sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+-------+------+-----+-----------+-------+-----------+-----+------------+-------------+--------------+----------------+------------+-------+-----+------------+------------+--------------+--------------+------+
|  count| 32561|32500|      32558|  32546|      12168|    0|       32040|        32530|         32135|           32549|       32557|  32247|32557|       32548|       32549|         32452|         32561| 32561|
|    min|10000a|   17|    Private|  12285|4.283617786| null|        10th|            1|             ?|               ?|           *|      *|    ?|       

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,10.0,77516.0,4.889391,,2.0,13.0,1.0,3.0,1.0,0.0,0.0,2174.0,0.0,40.0,42,0
1,2a,,1.0,83311.0,4.920702,,2.0,13.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,13.0,42,0
2,3a,38.0,0.0,215646.0,5.333741,,0.0,9.0,2.0,9.0,1.0,0.0,0.0,0.0,0.0,40.0,42,0
3,4a,53.0,0.0,234721.0,5.370552,,5.0,7.0,0.0,9.0,0.0,1.0,0.0,0.0,0.0,40.0,42,0
4,5a,,0.0,338409.0,5.529442,,2.0,13.0,0.0,0.0,4.0,1.0,1.0,0.0,0.0,40.0,10,0


In [19]:
# Example 3 - 'all' columns (excluding drop_cols) + assign unique integers based on alphabetical order (asc)
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa'], index_order='alphabetAsc')
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,1.0,77516.0,4.889391,,9.0,13.0,4.0,1.0,3.0,6.0,2.0,2174.0,0.0,40.0,41,0
1,2a,,8.0,83311.0,4.920702,,9.0,13.0,3.0,4.0,2.0,6.0,2.0,0.0,0.0,13.0,41,0
2,3a,38.0,6.0,215646.0,5.333741,,11.0,9.0,1.0,6.0,3.0,6.0,2.0,0.0,0.0,40.0,41,0
3,4a,53.0,6.0,234721.0,5.370552,,1.0,7.0,3.0,6.0,2.0,4.0,2.0,0.0,0.0,40.0,41,0
4,5a,,6.0,338409.0,5.529442,,9.0,13.0,3.0,10.0,7.0,4.0,1.0,0.0,0.0,40.0,6,0


In [20]:
# Example 4 - selected categorical columns + one hot encoding (method_type=0) + print impact
odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race', 'sex'], method_type=0, print_impact=True)
odf.toPandas().head(5)

Before
root
 |-- ifa: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)

After
root
 |-- ifa: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- educatio

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race_0,race_1,race_2,race_3,race_4,race_5,race_6,race_7,race_8,race_9
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,1,0,0,0,0,0,0,0,0,0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,1,0,0,0,0,0,0,0,0,0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,1,0,0,0,0,0,0,0,0,0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,0,1,0,0,0,0,0,0,0,0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,0,1,0,0,0,0,0,0,0,0


In [21]:
# Example 5 - one hot encoding + save model
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa', 'empty'], method_type=0,
                              pre_existing_model=False, model_path=outputPath)
odf.limit(10).toPandas().head(5)

Unnamed: 0,ifa,age,fnlwgt,logfnl,empty,education-num,capital-gain,capital-loss,hours-per-week,occupation_0,...,native-country_34,native-country_35,native-country_36,native-country_37,native-country_38,native-country_39,native-country_40,native-country_41,native-country_42,native-country_43
0,1a,,77516,4.889391,,13,2174,0,40,0,...,0,0,0,0,0,0,0,0,1,0
1,2a,,83311,4.920702,,13,0,0,13,0,...,0,0,0,0,0,0,0,0,1,0
2,3a,38.0,215646,5.333741,,9,0,0,40,0,...,0,0,0,0,0,0,0,0,1,0
3,4a,53.0,234721,5.370552,,7,0,0,40,0,...,0,0,0,0,0,0,0,0,1,0
4,5a,,338409,5.529442,,13,0,0,40,1,...,0,0,0,0,0,0,0,0,0,0


In [22]:
# Example 6 - one hot encoding + use pre-saved model
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa', 'empty'], method_type=0,
                              pre_existing_model=True, model_path=outputPath)
odf.limit(10).toPandas().head(5)

Unnamed: 0,ifa,age,fnlwgt,logfnl,empty,education-num,capital-gain,capital-loss,hours-per-week,occupation_0,...,native-country_34,native-country_35,native-country_36,native-country_37,native-country_38,native-country_39,native-country_40,native-country_41,native-country_42,native-country_43
0,1a,,77516,4.889391,,13,2174,0,40,0,...,0,0,0,0,0,0,0,0,1,0
1,2a,,83311,4.920702,,13,0,0,13,0,...,0,0,0,0,0,0,0,0,1,0
2,3a,38.0,215646,5.333741,,9,0,0,40,0,...,0,0,0,0,0,0,0,0,1,0
3,4a,53.0,234721,5.370552,,7,0,0,40,0,...,0,0,0,0,0,0,0,0,1,0
4,5a,,338409,5.529442,,13,0,0,40,1,...,0,0,0,0,0,0,0,0,0,0


# Missing Value Imputation
- API specification of function **imputation_MMM** can be found <a href="../api_specification/anovos/data_transformer/transformers.html#anovos.data_transformer.transformers.imputation_MMM">here</a>
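
As the function name suggests, the imputation relies on simple central-tendency statistics: mean or median for numerical attributes and mode for categorical ones. The pure-Python sketch below illustrates that idea. It is not the Anovos code: the helper names are hypothetical, and treating median as the default for numerical columns is an assumption.

```python
from statistics import mean, median, mode


def fit_fill_value(column, method_type="median"):
    """Learn the replacement value for one column from its non-null entries."""
    present = [v for v in column if v is not None]
    if not present:
        return None  # nothing to learn from an all-null column
    if all(isinstance(v, (int, float)) for v in present):
        return mean(present) if method_type == "mean" else median(present)
    return mode(present)  # categorical column: most frequent value


def impute(column, fill_value):
    """Replace nulls with the learned value; other entries are untouched."""
    return [fill_value if v is None else v for v in column]


age = [None, None, 38, 53, 29]          # made-up sample
sex = ["Male", "Female", "Male", None]  # made-up sample
print(impute(age, fit_fill_value(age)))  # [38, 38, 38, 53, 29]
print(impute(sex, fit_fill_value(sex)))  # ['Male', 'Female', 'Male', 'Male']
```

A column with no observed values has no statistic to fall back on, so in this sketch it stays null.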

In [23]:
from anovos.data_transformer.transformers import imputation_MMM

In [24]:
# Example 1 - with mandatory arguments + print impact
odf = imputation_MMM(spark, df, print_impact=True)

+--------------+-------------------+------------------+
|     attribute|missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
| education-num|                 31|                 0|
|     workclass|                  3|                 0|
|     education|                521|                 0|
|          race|                314|                 0|
|  relationship|                  4|                 0|
|  capital-gain|                 13|                 0|
|  capital-loss|                 12|                 0|
|           age|                 61|                 0|
|hours-per-week|                109|                 0|
|        fnlwgt|                 15|                 0|
|marital-status|                426|                 0|
|           sex|                  4|                 0|
|    occupation|                 12|                 0|
|        logfnl|              20393|                 0|
|         empty|              32561|            

In [25]:
# Example 2 - use mean for numerical columns + append transformed columns at the end
odf = imputation_MMM(spark, df, list_of_cols='all', method_type="mean", output_mode="append")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,logfnl_imputed,hours-per-week_imputed,occupation_imputed,workclass_imputed,empty_imputed,marital-status_imputed,race_imputed,sex_imputed,education_imputed,relationship_imputed
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,4.889391,40,Adm-clerical,State-gov,,Never-married,White,Male,Bachelors,Not-in-family
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,4.920702,13,Exec-managerial,Self-emp-not-inc,,Married-civ-spouse,White,Male,Bachelors,Husband
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,5.333741,40,Handlers-cleaners,Private,,Divorced,White,Male,HS-grad,Not-in-family
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,5.370552,40,Handlers-cleaners,Private,,Married-civ-spouse,Black,Male,11th,Husband
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,5.529442,40,Prof-specialty,Private,,Married-civ-spouse,Black,Female,Bachelors,Wife


In [26]:
# Check which value was imputed for rows where education-num was originally missing
odf.select('education-num', 'education-num_imputed').where(F.col("education-num").isNull()).distinct().toPandas().head(5)

Unnamed: 0,education-num,education-num_imputed
0,,10


In [27]:
# Example 3 - save model
odf = imputation_MMM(spark, df, pre_existing_model=False, model_path=outputPath)

In [28]:
# Example 4 - use pre-saved model
odf = imputation_MMM(spark, df, pre_existing_model=True, model_path=outputPath)
odf.toPandas().head(5)

Unnamed: 0,ifa,native-country,income,education-num,age,fnlwgt,capital-loss,capital-gain,logfnl,hours-per-week,occupation,workclass,empty,marital-status,race,sex,education,relationship
0,1a,UnitedStates,<=50K,13,37,77516,0,2174,4.889391,40,Adm-clerical,State-gov,,Never-married,White,Male,Bachelors,Not-in-family
1,2a,UnitedStates,<=50K,13,37,83311,0,0,4.920702,13,Exec-managerial,Self-emp-not-inc,,Married-civ-spouse,White,Male,Bachelors,Husband
2,3a,UnitedStates,<=50K,9,38,215646,0,0,5.333741,40,Handlers-cleaners,Private,,Divorced,White,Male,HS-grad,Not-in-family
3,4a,UnitedStates,<=50K,7,53,234721,0,0,5.370552,40,Handlers-cleaners,Private,,Married-civ-spouse,Black,Male,11th,Husband
4,5a,Cuba,<=50K,13,37,338409,0,0,5.529442,40,Prof-specialty,Private,,Married-civ-spouse,Black,Female,Bachelors,Wife


In [29]:
# Example 5 - selected columns + use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts, measures_of_centralTendency
from anovos.data_ingest.data_ingest import write_dataset
# Pre-compute the statistics once and persist them so that imputation_MMM can reuse them
write_dataset(measures_of_counts(spark, df), outputPath + "/missing", "parquet", file_configs={"mode": "overwrite"})
write_dataset(measures_of_centralTendency(spark, df), outputPath + "/mode", "parquet", file_configs={"mode": "overwrite"})

odf = imputation_MMM(spark, df, list_of_cols=['marital-status', 'sex', 'occupation', 'age'], 
                     stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                     stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"}, print_impact=True)
odf.toPandas().head(5)

+--------------+-------------------+------------------+
|     attribute|missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|           age|                 61|                 0|
|marital-status|                426|                 0|
|    occupation|                 12|                 0|
|           sex|                  4|                 0|
+--------------+-------------------+------------------+



Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,relationship,race,capital-gain,capital-loss,hours-per-week,native-country,income,age,marital-status,occupation,sex
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Not-in-family,White,2174.0,0.0,40.0,UnitedStates,<=50K,37,Never-married,Adm-clerical,Male
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Husband,White,0.0,0.0,13.0,UnitedStates,<=50K,37,Married-civ-spouse,Exec-managerial,Male
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Not-in-family,White,0.0,0.0,40.0,UnitedStates,<=50K,38,Divorced,Handlers-cleaners,Male
3,4a,Private,234721.0,5.370552,,11th,7.0,Husband,Black,0.0,0.0,40.0,UnitedStates,<=50K,53,Married-civ-spouse,Handlers-cleaners,Male
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Wife,Black,0.0,0.0,40.0,Cuba,<=50K,37,Married-civ-spouse,Prof-specialty,Female


# Outlier Categories Treatment
- API specification of function **outlier_categories** can be found <a href="../api_specification/anovos/data_transformer/transformers.html#anovos.data_transformer.transformers.outlier_categories">here</a>
- Supports two ways of detecting outlier categories: by maximum number of categories and by coverage (%)
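
Both detection modes can be sketched in pure Python: keep the most frequent categories until either the cap or the coverage target is reached, and replace everything else with a catch-all label. This is a sketch under assumptions, not the Anovos implementation. The helper names are hypothetical, and the rule that the catch-all label counts toward `max_category` (so at most `max_category - 1` original categories survive) is inferred from the output counts in this notebook.

```python
from collections import Counter


def frequent_categories(values, coverage=1.0, max_category=50):
    """Categories to keep, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    kept, covered = [], 0
    for category, n in counts.most_common():
        # Stop once the cap (minus the catch-all slot) or the coverage is reached
        if len(kept) >= max_category - 1 or covered / total >= coverage:
            break
        kept.append(category)
        covered += n
    return kept


def treat_outlier_categories(values, kept, other="others"):
    """Replace every category outside `kept` with the catch-all label."""
    keep = set(kept)
    return [v if v is None or v in keep else other for v in values]


education = ["a"] * 6 + ["b"] * 3 + ["c"]  # made-up categories, 10 rows
print(frequent_categories(education, coverage=0.9))    # ['a', 'b']
print(frequent_categories(education, max_category=2))  # ['a']
```

Here `a` and `b` together cover 90% of the rows, so `c` is folded into the catch-all label under the coverage rule. Under the cap of 2, only `a` survives alongside the catch-all label.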

In [30]:
from anovos.data_transformer.transformers import outlier_categories

In [31]:
# Example 1 - 'all' columns (excluding drop_cols) + max 15 categories + append transformed columns at the end
odf = outlier_categories(spark, df, drop_cols=['ifa'], max_category=15, output_mode='append')
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,occupation_outliered,workclass_outliered,income_outliered,empty_outliered,marital-status_outliered,race_outliered,sex_outliered,education_outliered,relationship_outliered,native-country_outliered
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,Adm-clerical,State-gov,<=50K,,Never-married,White,Male,Bachelors,Not-in-family,others
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,Exec-managerial,Self-emp-not-inc,<=50K,,Married-civ-spouse,White,Male,Bachelors,Husband,others
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,Handlers-cleaners,Private,<=50K,,Divorced,White,Male,HS-grad,Not-in-family,others
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,Handlers-cleaners,Private,<=50K,,Married-civ-spouse,Black,Male,11th,Husband,others
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,Prof-specialty,Private,<=50K,,Married-civ-spouse,Black,Female,Bachelors,Wife,Cuba


In [32]:
# Example 2 - selected columns + max 10 categories
odf = outlier_categories(spark, df, list_of_cols=['education', 'occupation', 'native-country'], 
                         max_category=10, print_impact=True)

+--------------+-------------------+
|     attribute|uniqueValues_before|
+--------------+-------------------+
|     education|                 16|
|    occupation|                 15|
|native-country|                 44|
+--------------+-------------------+

+--------------+------------------+
|     attribute|uniqueValues_after|
+--------------+------------------+
|     education|                10|
|    occupation|                10|
|native-country|                10|
+--------------+------------------+



In [33]:
# Example 3 - selected columns + cover 90% values
odf = outlier_categories(spark, df, list_of_cols=['education', 'occupation', 'native-country'], 
                         coverage=0.9, print_impact=True)

+--------------+-------------------+
|     attribute|uniqueValues_before|
+--------------+-------------------+
|     education|                 16|
|    occupation|                 15|
|native-country|                 44|
+--------------+-------------------+

+--------------+------------------+
|     attribute|uniqueValues_after|
+--------------+------------------+
|     education|                 9|
|    occupation|                11|
|native-country|                 3|
+--------------+------------------+



In [34]:
# Example 4 - max 15 categories + save model
odf = outlier_categories(spark, df, drop_cols=['ifa'], max_category=15, 
                         pre_existing_model=False, model_path=outputPath, print_impact=True)

+--------------+-------------------+
|     attribute|uniqueValues_before|
+--------------+-------------------+
|    occupation|                 15|
|     workclass|                 11|
|        income|                  2|
|         empty|                  0|
|marital-status|                  7|
|          race|                  9|
|           sex|                  3|
|     education|                 16|
|  relationship|                  8|
|native-country|                 44|
+--------------+-------------------+

+--------------+------------------+
|     attribute|uniqueValues_after|
+--------------+------------------+
|    occupation|                15|
|     workclass|                11|
|        income|                 2|
|         empty|                 0|
|marital-status|                 7|
|          race|                 9|
|           sex|                 3|
|     education|                15|
|  relationship|                 8|
|native-country|                15|
+------------

In [35]:
# Example 5 - use pre-saved model
odf = outlier_categories(spark, df, drop_cols=['ifa'], pre_existing_model=True, model_path=outputPath, print_impact=True)

+--------------+-------------------+
|     attribute|uniqueValues_before|
+--------------+-------------------+
|    occupation|                 15|
|     workclass|                 11|
|        income|                  2|
|         empty|                  0|
|marital-status|                  7|
|          race|                  9|
|           sex|                  3|
|     education|                 16|
|  relationship|                  8|
|native-country|                 44|
+--------------+-------------------+

+--------------+------------------+
|     attribute|uniqueValues_after|
+--------------+------------------+
|    occupation|                15|
|     workclass|                10|
|        income|                 2|
|         empty|                 0|
|marital-status|                 7|
|          race|                 9|
|           sex|                 3|
|     education|                15|
|  relationship|                 8|
|native-country|                15|
+------------