# ANOVOS - Data Transformer
This notebook lists the functions in the "data transformer" module of the ANOVOS package and shows how to invoke each one.
- [Attribute Binning](#Attribute-Binning)
- [Monotonic Binning](#Monotonic-Binning)
- [Categorical Attribute to Numerical Attribute Conversion](#Categorical-Attribute-to-Numerical-Attribute-Conversion)
    - [Categorical to Numerical - Unsupervised](#Categorical-to-Numerical---Unsupervised)
    - [Categorical to Numerical - Supervised](#Categorical-to-Numerical---Supervised)
- [Attribute Rescaling](#Attribute-Rescaling)
    - [Z Standardization](#Z-Standardization)
    - [IQR Standardization](#IQR-Standardization)
    - [Normalization](#Normalization)
- [Missing Value Imputation](#Missing-Value-Imputation)
    - [Imputation MMM](#Imputation-MMM)
    - [Imputation Sklearn](#Imputation-Sklearn)
    - [Imputation Matrix Factorization](#Imputation-Matrix-Factorization)
    - [Auto Imputation](#Auto-Imputation)
- [Latent Features Generation](#Latent-Features-Generation)
    - [PCA Latent Features](#PCA-Latent-Features)
- [Feature Transformation](#Feature-Transformation)
- [Box Cox Transformation](#Box-Cox-Transformation)
- [Outlier Categories Treatment](#Outlier-Categories-Treatment)
- [Expression Parser](#Expression-Parser)

**Setting Spark Session**

In [11]:
from anovos.shared.spark import *

sc.setLogLevel("ERROR")
import warnings
warnings.filterwarnings('ignore')

**Input/Output Path** 

In [6]:
inputPath = "../data/income_dataset/csv"
outputPath = "../output/income_dataset/data_transformer"

**Read Input Data** 

In [7]:
from anovos.data_ingest.data_ingest import read_dataset
from pyspark.sql import functions as F
df = read_dataset(spark, file_path = inputPath, file_type = "csv",
                  file_configs = {"header": "True", "delimiter": "," , "inferSchema": "True"})
df = df.drop("dt_1", "dt_2")
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Attribute Binning
- API specification of function **attribute_binning** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
- Two binning options: equal range binning (every bin has the same width) and equal frequency binning (every bin holds roughly the same number of rows)
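
To make the difference between the two binning options concrete, here is a minimal pure-Python sketch. It is not the ANOVOS implementation: the helper names (`equal_range_edges`, `equal_frequency_edges`, `assign_bin`) and the sample values are assumptions made purely for illustration.

```python
# Illustrative sketch only; helper names and data are hypothetical.
def equal_range_edges(values, bin_size):
    """Return bin_size - 1 cut points that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_size
    return [lo + width * i for i in range(1, bin_size)]


def equal_frequency_edges(values, bin_size):
    """Return cut points taken at evenly spaced positions of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (n * i) // bin_size)] for i in range(1, bin_size)]


def assign_bin(value, edges):
    """Map a value to a 1-based bin index; a missing value (None) stays None."""
    if value is None:
        return None
    return 1 + sum(1 for edge in edges if value > edge)


hours = [1, 13, 20, 35, 40, 40, 40, 40, 45, 50, 60, 99]
range_edges = equal_range_edges(hours, 5)
freq_edges = equal_frequency_edges(hours, 5)
print([assign_bin(h, range_edges) for h in hours])
print([assign_bin(h, freq_edges) for h in hours])
```

With heavily tied data, equal-frequency cut points can coincide (two of them are 40 in this sample), which leaves a bin empty. That is one plausible reason the equal-frequency examples in this notebook report fewer distinct bins than `bin_size`.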

In [4]:
from anovos.data_transformer.transformers import attribute_binning

In [5]:
# Example 1 - Equal range binning + append transformed columns at the end
odf = attribute_binning(spark, idf=df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_range", 
                        bin_size=5, output_mode="append", print_impact=True)

odf.toPandas().head(5)

+---------------------+-------------+
|attribute            |unique_values|
+---------------------+-------------+
|education-num_binned |5            |
|hours-per-week_binned|5            |
+---------------------+-------------+



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,education-num_binned,hours-per-week_binned
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,4.0,3.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,4.0,1.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3.0,3.0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,2.0,3.0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,4.0,3.0


In [6]:
# Distinct values after binning
odf.select('hours-per-week_binned').distinct().orderBy('hours-per-week_binned').toPandas().head(10)

Unnamed: 0,hours-per-week_binned
0,
1,1.0
2,2.0
3,3.0
4,4.0
5,5.0


In [7]:
# Example 2 - Equal frequency binning + replace original columns by transformed ones (default)
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_frequency", 
                        bin_size=5, print_impact=True)

odf.toPandas().head(5)

+--------------+-------------+
|attribute     |unique_values|
+--------------+-------------+
|education-num |4            |
|hours-per-week|4            |
+--------------+-------------+



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,education-num,hours-per-week
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,4.0,2.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,4.0,1.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,2.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,1.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,4.0,2.0


In [8]:
# Distinct values after binning
odf.select('hours-per-week').distinct().orderBy('hours-per-week').toPandas().head(10)

Unnamed: 0,hours-per-week
0,
1,1.0
2,2.0
3,4.0
4,5.0


In [9]:
# Example 3 - Equal frequency binning + save binning model
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_frequency", 
                        bin_size=5, pre_existing_model=False, model_path=outputPath + "/attribute_binning")

odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,education-num,hours-per-week
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,4.0,2.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,4.0,1.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,2.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,1.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,4.0,2.0


In [10]:
# Example 4 - Equal frequency binning + use pre-saved model
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], 
                        pre_existing_model=True, model_path=outputPath + "/attribute_binning")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,education-num,hours-per-week
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,4.0,2.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,4.0,1.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,1.0,2.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,1.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,4.0,2.0


# Monotonic Binning
- API specification of function **monotonic_binning** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Bin size is computed dynamically
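
One way to picture a dynamically computed bin size is a search that starts from a maximum number of bins and lowers it until the event rate moves in a single direction across bins. The sketch below follows that idea under stated assumptions: the function names, the equal-range binning, and the simple "all increasing or all decreasing" check are hypothetical choices, and the actual stopping rule inside **monotonic_binning** may differ (for example, it could rely on a rank correlation).

```python
# Illustrative sketch only; the stopping rule used by ANOVOS may differ.
def event_rates(values, labels, n_bins):
    """Equal-range bin the values; return the event rate of each non-empty bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant column: everything falls into the first bin
    hits, totals = [0] * n_bins, [0] * n_bins
    for value, label in zip(values, labels):
        idx = min(int((value - lo) / width), n_bins - 1)
        totals[idx] += 1
        hits[idx] += label  # label is 1 for the event, 0 otherwise
    return [h / t for h, t in zip(hits, totals) if t > 0]


def is_monotonic(seq):
    """True when the sequence never decreases or never increases."""
    pairs = list(zip(seq, seq[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def monotonic_bin_count(values, labels, max_bins=10):
    """Largest bin count (from max_bins down to 2) with monotonic event rates."""
    for n_bins in range(max_bins, 1, -1):
        if is_monotonic(event_rates(values, labels, n_bins)):
            return n_bins
    return 1


hours = [10, 20, 30, 40, 50, 60]
is_event = [0, 1, 0, 0, 1, 1]  # 1 where income is ">50K" in this toy sample
print(monotonic_bin_count(hours, is_event, max_bins=6))
```

Because the search only stops once the trend is monotonic, the resulting number of bins depends on the data, which matches the differing bin counts seen for each column in the examples that follow.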

In [11]:
from anovos.data_transformer.transformers import monotonic_binning

In [12]:
# Example 1 - Equal Range Binning + append transformed columns at the end
odf = monotonic_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], label_col="income", 
                        event_label=">50K", bin_method="equal_range", output_mode="append")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,education-num_binned,hours-per-week_binned
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,6.0,2.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,6.0,1.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,4.0,2.0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,6.0,2.0


In [13]:
# Distinct values for hours-per-week after binning 
odf.select("hours-per-week_binned").distinct().orderBy('hours-per-week_binned').toPandas()

Unnamed: 0,hours-per-week_binned
0,
1,1.0
2,2.0
3,3.0


In [14]:
# Example 2 - Equal Frequency Binning + replace original columns by transformed ones (default)
odf = monotonic_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], label_col="income", 
                        event_label=">50K", bin_method="equal_frequency")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,income,education-num,hours-per-week
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,UnitedStates,<=50K,12.0,2.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,UnitedStates,<=50K,12.0,1.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,UnitedStates,<=50K,3.0,2.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,UnitedStates,<=50K,2.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,Cuba,<=50K,12.0,2.0


In [15]:
# Distinct values for hours-per-week after binning
odf.select("hours-per-week").distinct().orderBy('hours-per-week').toPandas()

Unnamed: 0,hours-per-week
0,
1,1.0
2,2.0
3,6.0
4,7.0


# Categorical Attribute to Numerical Attribute Conversion

## Categorical to Numerical - Unsupervised
- API specification of function **cat_to_num_unsupervised** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports label encoding (the default) and one hot encoding (method_type=0)
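
The two encodings can be sketched in a few lines of plain Python. This is only an illustration, not the ANOVOS code: the function names are hypothetical, and ordering the codes by descending frequency is an assumption about the default behaviour.

```python
# Illustrative sketch only; function names and ordering rule are assumptions.
from collections import Counter


def label_encode(column):
    """Assign integer codes by descending frequency (ties broken alphabetically)."""
    counts = Counter(v for v in column if v is not None)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {value: code for code, value in enumerate(ordered)}
    return [mapping.get(v) for v in column], mapping


def one_hot_encode(column, mapping):
    """Expand each value into a list of 0/1 indicators, one slot per category."""
    rows = []
    for value in column:
        row = [0] * len(mapping)
        if value in mapping:  # missing values keep an all-zero row
            row[mapping[value]] = 1
        rows.append(row)
    return rows


sex = ["Male", "Male", "Female", "Male", "Female", None]
codes, mapping = label_encode(sex)
print(codes)
print(one_hot_encode(sex, mapping))
```

Label encoding keeps a single column per attribute, whereas one hot encoding adds one indicator column per category, which is why the one hot examples below show columns such as race_0 and sex_0.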

In [4]:
from anovos.data_transformer.transformers import cat_to_num_unsupervised

In [17]:
# Example 1 - with mandatory arguments (Label Encoding)
odf = cat_to_num_unsupervised(spark, df)
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,15092,,10.0,77516.0,4.889391,,2.0,13.0,1.0,3.0,1.0,0.0,0.0,2174.0,0.0,40.0,42,0
1,22895,,1.0,83311.0,4.920702,,2.0,13.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,13.0,42,0
2,28742,38.0,0.0,215646.0,5.333741,,0.0,9.0,2.0,9.0,1.0,0.0,0.0,0.0,0.0,40.0,42,0
3,4124,53.0,0.0,234721.0,5.370552,,5.0,7.0,0.0,9.0,0.0,1.0,0.0,0.0,0.0,40.0,42,0
4,11005,,0.0,338409.0,5.529442,,2.0,13.0,0.0,0.0,4.0,1.0,1.0,0.0,0.0,40.0,10,0


In [18]:
# Example 2 - 'all' columns (excluding drop_cols) + print impact
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa'], print_impact=True)
odf.toPandas().head(5)

Before

+-------+------+-----+-----------+-------+-----------+-----+------------+-------------+--------------+----------------+------------+-------+-----+------------+------------+--------------+--------------+------+
|summary|ifa   |age  |workclass  |fnlwgt |logfnl     |empty|education   |education-num|marital-status|occupation      |relationship|race   |sex  |capital-gain|capital-loss|hours-per-week|native-country|income|
+-------+------+-----+-----------+-------+-----------+-----+------------+-------------+--------------+----------------+------------+-------+-----+------------+------------+--------------+--------------+------+
|count  |32561 |32500|32558      |32546  |12168      |0    |32040       |32530        |32135         |32549           |32557       |32247  |32557|32548       |32549       |32452         |32561         |32561 |
|min    |10000a|17   | Private   |12285  |4.283617786|null |10th        |1            |?             |?               |*           |*      |?    |0           |0

+-------+------+-----+---------+-------+-----------+-----+---------+-------------+--------------+----------+------------+-----+-----+------------+------------+--------------+--------------+------+
|summary|ifa   |age  |workclass|fnlwgt |logfnl     |empty|education|education-num|marital-status|occupation|relationship|race |sex  |capital-gain|capital-loss|hours-per-week|native-country|income|
+-------+------+-----+---------+-------+-----------+-----+---------+-------------+--------------+----------+------------+-----+-----+------------+------------+--------------+--------------+------+
|count  |32561 |32500|32558    |32546  |12168      |0    |32040    |32530        |32135         |32549     |32557       |32247|32557|32548       |32549       |32452         |32561         |32561 |
|min    |10000a|17   |0        |12285  |4.283617786|null |0        |1            |0             |0         |0           |0    |0    |0           |0           |1             |0             |0     |
|max    |9a    

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,10.0,77516.0,4.889391,,2.0,13.0,1.0,3.0,1.0,0.0,0.0,2174.0,0.0,40.0,42,0
1,2a,,1.0,83311.0,4.920702,,2.0,13.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,13.0,42,0
2,3a,38.0,0.0,215646.0,5.333741,,0.0,9.0,2.0,9.0,1.0,0.0,0.0,0.0,0.0,40.0,42,0
3,4a,53.0,0.0,234721.0,5.370552,,5.0,7.0,0.0,9.0,0.0,1.0,0.0,0.0,0.0,40.0,42,0
4,5a,,0.0,338409.0,5.529442,,2.0,13.0,0.0,0.0,4.0,1.0,1.0,0.0,0.0,40.0,10,0


In [19]:
# Example 3 - 'all' columns (excluding drop_cols) + assign unique integers based on alphabetical order (asc)
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa'], index_order='alphabetAsc')
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,1.0,77516.0,4.889391,,9.0,13.0,4.0,1.0,3.0,6.0,2.0,2174.0,0.0,40.0,41,0
1,2a,,8.0,83311.0,4.920702,,9.0,13.0,3.0,4.0,2.0,6.0,2.0,0.0,0.0,13.0,41,0
2,3a,38.0,6.0,215646.0,5.333741,,11.0,9.0,1.0,6.0,3.0,6.0,2.0,0.0,0.0,40.0,41,0
3,4a,53.0,6.0,234721.0,5.370552,,1.0,7.0,3.0,6.0,2.0,4.0,2.0,0.0,0.0,40.0,41,0
4,5a,,6.0,338409.0,5.529442,,9.0,13.0,3.0,10.0,7.0,4.0,1.0,0.0,0.0,40.0,6,0


In [20]:
# Example 4 - selected categorical columns + one hot encoding (method_type=0) + print impact
odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race', 'sex'], method_type=0, print_impact=True)
odf.toPandas().head(5)

Before
root
 |-- ifa: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)

After
root
 |-- ifa: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- educatio

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race_4,race_5,race_6,race_7,race_8,race_9,sex_0,sex_1,sex_2,sex_3
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,0,0,0,0,0,0,1,0,0,0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,0,0,0,0,0,0,1,0,0,0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,0,0,0,0,0,0,1,0,0,0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,0,0,0,0,0,0,1,0,0,0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,0,0,0,0,0,0,0,1,0,0


In [12]:
# Example 4 (repeated) - same one hot encoding call as above, displayed with show() instead of toPandas()
odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race', 'sex'], method_type=0, print_impact=True)
odf.show(1, False)

Before
root
 |-- ifa: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)

After
root
 |-- ifa: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- educatio

+---+----+----------+------+-----------+-----+---------+-------------+--------------+------------+-------------+------------+------------+--------------+--------------+------+------+------+------+------+------+------+------+------+------+------+-----+-----+-----+-----+
|ifa|age |workclass |fnlwgt|logfnl     |empty|education|education-num|marital-status|occupation  |relationship |capital-gain|capital-loss|hours-per-week|native-country|income|race_0|race_1|race_2|race_3|race_4|race_5|race_6|race_7|race_8|race_9|sex_0|sex_1|sex_2|sex_3|
+---+----+----------+------+-----------+-----+---------+-------------+--------------+------------+-------------+------------+------------+--------------+--------------+------+------+------+------+------+------+------+------+------+------+------+-----+-----+-----+-----+
|1a |null| State-gov|77516 |4.889391354|null |Bachelors|13           |Never-married |Adm-clerical|Not-in-family|2174        |0           |40            |UnitedStates  |<=50K |1     |0     |0


In [9]:
# Example 5 - one hot encoding + save model
odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race', 'sex'], method_type=0, 
                              pre_existing_model=False, model_path=outputPath)
odf.show(1, False)

+---+----+----------+------+-----------+-----+---------+-------------+--------------+------------+-------------+------------+------------+--------------+--------------+------+------+------+------+------+------+------+------+------+------+------+-----+-----+-----+-----+
|ifa|age |workclass |fnlwgt|logfnl     |empty|education|education-num|marital-status|occupation  |relationship |capital-gain|capital-loss|hours-per-week|native-country|income|race_0|race_1|race_2|race_3|race_4|race_5|race_6|race_7|race_8|race_9|sex_0|sex_1|sex_2|sex_3|
+---+----+----------+------+-----------+-----+---------+-------------+--------------+------------+-------------+------------+------------+--------------+--------------+------+------+------+------+------+------+------+------+------+------+------+-----+-----+-----+-----+
|1a |null| State-gov|77516 |4.889391354|null |Bachelors|13           |Never-married |Adm-clerical|Not-in-family|2174        |0           |40            |UnitedStates  |<=50K |1     |0     |0


In [10]:
# Example 6 - one hot encoding + use pre-saved model
odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race', 'sex'], method_type=0, 
                              pre_existing_model=True, model_path=outputPath)
odf.show(1, False)

+---+----+----------+------+-----------+-----+---------+-------------+--------------+------------+-------------+------------+------------+--------------+--------------+------+------+------+------+------+------+------+------+------+------+------+-----+-----+-----+-----+
|ifa|age |workclass |fnlwgt|logfnl     |empty|education|education-num|marital-status|occupation  |relationship |capital-gain|capital-loss|hours-per-week|native-country|income|race_0|race_1|race_2|race_3|race_4|race_5|race_6|race_7|race_8|race_9|sex_0|sex_1|sex_2|sex_3|
+---+----+----------+------+-----------+-----+---------+-------------+--------------+------------+-------------+------------+------------+--------------+--------------+------+------+------+------+------+------+------+------+------+------+------+-----+-----+-----+-----+
|1a |null| State-gov|77516 |4.889391354|null |Bachelors|13           |Never-married |Adm-clerical|Not-in-family|2174        |0           |40            |UnitedStates  |<=50K |1     |0     |0



## Categorical to Numerical - Supervised
- API specification of function **cat_to_num_supervised** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
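
The encoded values printed in the examples below lie between 0 and 1, which is consistent with replacing each category by its event rate, that is, the share of rows in that category whose label equals event_label. The following pure-Python sketch shows that idea only; the function name, the rounding to four decimals, and the toy data are assumptions, not the ANOVOS implementation.

```python
# Illustrative sketch only; name, rounding and data are hypothetical.
from collections import defaultdict


def event_rate_encoding(categories, labels, event_label):
    """Map each category to the fraction of its rows carrying event_label."""
    totals = defaultdict(int)
    events = defaultdict(int)
    for category, label in zip(categories, labels):
        if category is None:
            continue  # missing categories are left unencoded
        totals[category] += 1
        events[category] += int(label == event_label)
    return {cat: round(events[cat] / totals[cat], 4) for cat in totals}


relationship = ["Husband", "Husband", "Wife", "Own-child", "Husband", "Wife"]
income = [">50K", "<=50K", ">50K", "<=50K", ">50K", ">50K"]
mapping = event_rate_encoding(relationship, income, ">50K")
encoded = [mapping.get(value) for value in relationship]
print(mapping)
print(encoded)
```

Because the mapping is learned from the label column, it should be fitted on training data and then reused, which is the purpose of the model_path and pre_existing_model arguments shown in Examples 3 and 4.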

In [23]:
from anovos.data_transformer.transformers import cat_to_num_supervised

In [24]:
# Example 1 - 'all' columns (excluding drop_cols) + print impact 
odf = cat_to_num_supervised(spark, idf=df, list_of_cols="all", drop_cols="ifa", 
                            label_col="income", event_label=">50K", print_impact=True)

Before: 
+-------+--------------+------------+-----+-----------+-----+--------------+----------------+-------+------------+
|summary|marital-status|education   |sex  |workclass  |empty|native-country|occupation      |race   |relationship|
+-------+--------------+------------+-----+-----------+-----+--------------+----------------+-------+------------+
|count  |32135         |32040       |32557|32558      |0    |32561         |32549           |32247  |32557       |
|min    |?             |10th        |?    | Private   |null |*             |?               |*      |*           |
|max    |Widowed       |Some-college|Male |Without-pay|null |Yugoslavia    |Transport-moving|Whitess|Wife        |
+-------+--------------+------------+-----+-----------+-----+--------------+----------------+-------+------------+

After: 
+-------+--------------+---------+------+---------+------+--------------+----------+-----+------------+
|summary|marital-status|education|sex   |workclass|empty |native-country|occupatio

In [25]:
# Example 2 - selected categorical columns + append generated columns + print impact
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status'],
                            label_col="income", event_label=">50K", output_mode="append", print_impact=True)

Before: 
+-------+------------+--------------+
|summary|relationship|marital-status|
+-------+------------+--------------+
|count  |32557       |32135         |
|min    |*           |?             |
|max    |Wife        |Widowed       |
+-------+------------+--------------+

After: 
+-------+--------------------+----------------------+
|summary|relationship_encoded|marital-status_encoded|
+-------+--------------------+----------------------+
|count  |32557               |32135                 |
|min    |0.0                 |0.0458                |
|max    |0.4748              |0.4471                |
+-------+--------------------+----------------------+



In [26]:
# Example 3 - selected categorical columns + append generated column + save model
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status', 'workclass'], 
                            label_col="income", event_label=">50K", model_path=outputPath, output_mode="append")

In [27]:
# Example 4 - selected categorical columns + use pre-saved model
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status'], 
                            label_col="income", event_label=">50K", pre_existing_model=True, 
                            model_path=outputPath, print_impact=True)

Before: 
+-------+------------+--------------+
|summary|relationship|marital-status|
+-------+------------+--------------+
|count  |32557       |32135         |
|min    |*           |?             |
|max    |Wife        |Widowed       |
+-------+------------+--------------+

After: 
+-------+------------+--------------+
|summary|relationship|marital-status|
+-------+------------+--------------+
|count  |32557       |32135         |
|min    |0.0         |0.0458        |
|max    |0.4748      |0.4471        |
+-------+------------+--------------+



# Attribute Rescaling

## Z Standardization
- API specification of function **z_standardization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
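
Z standardization rescales a column to zero mean and unit standard deviation via z = (x - mean) / stddev. The sketch below illustrates the arithmetic in plain Python; the function name is hypothetical, and the use of the sample standard deviation and the handling of missing values are assumptions rather than a description of the ANOVOS internals.

```python
# Illustrative sketch only; not the ANOVOS implementation.
import statistics


def z_standardize(values):
    """Return (x - mean) / stddev for each value; None stays None."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    std = statistics.stdev(present)  # sample standard deviation (assumption)
    if std == 0:
        raise ValueError("constant column: standard deviation is zero")
    return [None if v is None else (v - mean) / std for v in values]


ages = [17, 25, 38, 53, None, 85]
scaled = z_standardize(ages)
print(scaled)
```

After the transformation the mean of the non-missing values is numerically zero, which mirrors the tiny mean values (on the order of 1E-15) in the "After" tables below.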

In [28]:
from anovos.data_transformer.transformers import z_standardization

In [29]:
# Example 1 - with mandatory arguments
odf = z_standardization(spark, idf=df)

In [30]:
# Example 2 - selected columns + print impact
odf = z_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272212|13.508497735339255|105563.06445057027|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+-----------------------+----------------------+---------------------+
|summary|hours-per-week         |age                   |fnlwgt               |
+-------+-----------------------+----------------------+---------------------+
|count  |32452                  |32500                 |32546                |
|mean   |-1.2156752033466475E-15|1.4483798775717543E-16|9.16157

In [31]:
# Example 3 - 'all' columns + save model
odf = z_standardization(spark, idf=df, list_of_cols='all', model_path=outputPath)

In [32]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = z_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                        pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272212|13.508497735339255|105563.06445057027|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+-----------------------+----------------------+---------------------+
|summary|hours-per-week_scaled  |age_scaled            |fnlwgt_scaled        |
+-------+-----------------------+----------------------+---------------------+
|count  |32452                  |32500                 |32546                |
|mean   |-1.2156752033466475E-15|1.4483798775717543E-16|9.16157

## IQR Standardization
- API specification of function **IQR_standardization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
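
To make the scaling concrete, here is a minimal pure-Python sketch of IQR-based standardization. It assumes the transformation is `(x - median) / (Q3 - Q1)` and that nulls are left untouched. The helper name `iqr_standardize` and the sample values are hypothetical, and Spark's percentile estimates may differ slightly from the exact quantiles computed here.

```python
import statistics


def iqr_standardize(values):
    """Scale non-null values as (x - median) / (Q3 - Q1); None stays None."""
    observed = [v for v in values if v is not None]
    # Quartiles of the observed values (Q1, median, Q3).
    q1, median, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the column cannot be scaled this way")
    return [None if v is None else (v - median) / iqr for v in values]


# Hypothetical column with one missing value.
hours = [10, 20, None, 30, 40, 50]
print(iqr_standardize(hours))  # [-1.0, -0.5, None, 0.0, 0.5, 1.0]
```

Because the median and IQR are less sensitive to extreme values than the mean and standard deviation, this scaling is usually preferred for skewed attributes.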

In [33]:
from anovos.data_transformer.transformers import IQR_standardization

In [34]:
# Example 1 - with mandatory arguments
odf = IQR_standardization(spark, idf=df)


In [35]:
# Example 2 - selected columns + print impact
odf = IQR_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272212|13.508497735339255|105563.06445057027|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+-------------------+-------------------+-------------------+
|summary|hours-per-week     |age                |fnlwgt             |
+-------+-------------------+-------------------+-------------------+
|count  |32452              |32500              |32546              |
|mean   |0.04994453346480964|0.07928906882590958|0.10904578788801947|
|stddev |2.3828675338544496 |

In [36]:
# Example 3 - 'all' columns + save model
odf = IQR_standardization(spark, idf=df, list_of_cols='all', model_path=outputPath)

In [37]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = IQR_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                          pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272212|13.508497735339255|105563.06445057027|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+---------------------+-------------------+-------------------+
|summary|hours-per-week_scaled|age_scaled         |fnlwgt_scaled      |
+-------+---------------------+-------------------+-------------------+
|count  |32452                |32500              |32546              |
|mean   |0.04994453346480964  |0.07928906882590958|0.10904578788801947|
|stddev |2.38286753

## Normalization
- API specification of function **normalization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
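
As an illustration, the following sketch shows min-max normalization in plain Python, assuming the rescaling is `(x - min) / (max - min)` so that every non-null value lands in `[0, 1]`. The helper name `min_max_normalize` and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale non-null values to [0, 1]; None stays None."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    if hi == lo:
        raise ValueError("Column is constant; min-max scaling is undefined")
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Hypothetical column with one missing value.
ages = [17, 34, None, 51, 85]
print(min_max_normalize(ages))  # [0.0, 0.25, None, 0.5, 1.0]
```

The minimum and maximum are the only parameters such a transformation needs, which is why they can be saved once and reused on new data.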

In [38]:
from anovos.data_transformer.transformers import normalization

In [39]:
# Example 1 - with mandatory arguments
odf = normalization(idf=df)

In [40]:
# Example 2 - selected columns + print impact
odf = normalization(idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272212|13.508497735339255|105563.06445057027|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+-------------------+-------------------+-------------------+
|summary|hours-per-week     |age                |fnlwgt             |
+-------+-------------------+-------------------+-------------------+
|count  |32452              |32500              |32546              |
|mean   |0.42204001986123296|0.3162719473860012 |0.12054769144314768|
|stddev |0.1281111549974956 |0.19865437746694356|0.07169358238758683|
|min    |0.0                |0.0                |0.0                |
|max    |1.0                |1.0                |1.0                |
+-------+-------------------+-------------------+-------------------+




                                                                                

In [41]:
# Example 3 - 'all' columns + save model
odf = normalization(idf=df, list_of_cols='all', model_path=outputPath)

In [42]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = normalization(idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                    pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |age               |fnlwgt            |
+-------+------------------+------------------+------------------+
|count  |32452             |32500             |32546             |
|mean   |40.24972266732405 |38.506492307692305|189781.83180728814|
|stddev |11.914337669272212|13.508497735339255|105563.06445057027|
|min    |1                 |17                |12285             |
|max    |94                |85                |1484705           |
+-------+------------------+------------------+------------------+

After: 
+-------+---------------------+---------------------+------------------+
|summary|hours-per-week_scaled|age_scaled           |fnlwgt_scaled     |
+-------+---------------------+---------------------+------------------+
|count  |32452                |32500                |32546             |
|mean   |2.6166481363438883   |0.008839874255685852 |1.8978372962177825|
|stddev |0.7942891880106049   |0.0031011243650607176|1.0556412016446013|
|min    |0.0                  |0.003902663          |0.12285123        |
|max    |6.2                  |0.019513315          |14.8471985        |
+-------+---------------------+---------------------+------------------+




                                                                                

# Missing Value Imputation

## Imputation MMM
- API specification of function **imputation_MMM** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Two options for numerical attributes: median and mean
- Mode is the only option for categorical attributes
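
The three fill strategies can be sketched in plain Python as follows. This is only an illustration of the idea, not the Anovos implementation; the helper name `impute_mmm` and the sample columns are hypothetical.

```python
import statistics


def impute_mmm(values, method="median"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    elif method == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"Unknown method: {method}")
    return [fill if v is None else v for v in values]


# Hypothetical numerical and categorical columns.
age = [25, None, 38, 53, 90]
sex = ["Male", None, "Female", "Male"]

print(impute_mmm(age, "median"))  # [25, 45.5, 38, 53, 90]
print(impute_mmm(age, "mean"))    # [25, 51.5, 38, 53, 90]
print(impute_mmm(sex, "mode"))    # ['Male', 'Male', 'Female', 'Male']
```

Note how the outlier `90` pulls the mean above the median, which is one reason to prefer the median for skewed attributes.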

In [43]:
from anovos.data_transformer.transformers import imputation_MMM

In [44]:
# Example 1 - with mandatory arguments + print impact
odf = imputation_MMM(spark, df, print_impact=True)

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|education-num |31                 |0                 |
|workclass     |3                  |0                 |
|education     |521                |0                 |
|race          |314                |0                 |
|relationship  |4                  |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|age           |61                 |0                 |
|hours-per-week|109                |0                 |
|fnlwgt        |15                 |0                 |
|marital-status|426                |0                 |
|sex           |4                  |0                 |
|occupation    |12                 |0                 |
|logfnl        |20393              |0                 |
|empty         |32561              |0           

In [45]:
# Example 2 - use mean for numerical columns + append transformed columns at the end
odf = imputation_MMM(spark, df, list_of_cols='all', method_type="mean", output_mode="append")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,logfnl_imputed,hours-per-week_imputed,education_imputed,marital-status_imputed,sex_imputed,workclass_imputed,empty_imputed,occupation_imputed,race_imputed,relationship_imputed
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,4.889391,40,Bachelors,Never-married,Male,State-gov,,Adm-clerical,White,Not-in-family
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,4.920702,13,Bachelors,Married-civ-spouse,Male,Self-emp-not-inc,,Exec-managerial,White,Husband
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,5.333741,40,HS-grad,Divorced,Male,Private,,Handlers-cleaners,White,Not-in-family
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,5.370552,40,11th,Married-civ-spouse,Male,Private,,Handlers-cleaners,Black,Husband
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,5.529442,40,Bachelors,Married-civ-spouse,Female,Private,,Prof-specialty,Black,Wife


In [46]:
odf.select('education-num', 'education-num_imputed').where(F.col("education-num").isNull()).distinct().toPandas().head(5)

Unnamed: 0,education-num,education-num_imputed
0,,10


In [47]:
# Example 3 - save model
odf = imputation_MMM(spark, df, pre_existing_model=False, model_path=outputPath)

In [48]:
# Example 4 - use pre-saved model
odf = imputation_MMM(spark, df, pre_existing_model=True, model_path=outputPath)
odf.toPandas().head(5)

Unnamed: 0,ifa,native-country,income,education-num,age,fnlwgt,capital-loss,capital-gain,logfnl,hours-per-week,education,marital-status,sex,workclass,empty,occupation,race,relationship
0,1a,UnitedStates,<=50K,13,37,77516,0,2174,4.889391,40,Bachelors,Never-married,Male,State-gov,,Adm-clerical,White,Not-in-family
1,2a,UnitedStates,<=50K,13,37,83311,0,0,4.920702,13,Bachelors,Married-civ-spouse,Male,Self-emp-not-inc,,Exec-managerial,White,Husband
2,3a,UnitedStates,<=50K,9,38,215646,0,0,5.333741,40,HS-grad,Divorced,Male,Private,,Handlers-cleaners,White,Not-in-family
3,4a,UnitedStates,<=50K,7,53,234721,0,0,5.370552,40,11th,Married-civ-spouse,Male,Private,,Handlers-cleaners,Black,Husband
4,5a,Cuba,<=50K,13,37,338409,0,0,5.529442,40,Bachelors,Married-civ-spouse,Female,Private,,Prof-specialty,Black,Wife


In [49]:
# Example 5 - selected columns + use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts, measures_of_centralTendency
from anovos.data_ingest.data_ingest import write_dataset
missing = write_dataset(measures_of_counts(spark, df),outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})
mode = write_dataset(measures_of_centralTendency(spark, df),outputPath+"/mode","parquet", file_configs={"mode":"overwrite"})

odf = imputation_MMM(spark, df, list_of_cols=['marital-status', 'sex', 'occupation', 'age'], 
                     stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                     stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"}, print_impact=True)
odf.toPandas().head(5)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|marital-status|426                |0                 |
|occupation    |12                 |0                 |
|sex           |4                  |0                 |
+--------------+-------------------+------------------+



Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,relationship,race,capital-gain,capital-loss,hours-per-week,native-country,income,age,marital-status,sex,occupation
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Not-in-family,White,2174.0,0.0,40.0,UnitedStates,<=50K,37,Never-married,Male,Adm-clerical
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Husband,White,0.0,0.0,13.0,UnitedStates,<=50K,37,Married-civ-spouse,Male,Exec-managerial
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Not-in-family,White,0.0,0.0,40.0,UnitedStates,<=50K,38,Divorced,Male,Handlers-cleaners
3,4a,Private,234721.0,5.370552,,11th,7.0,Husband,Black,0.0,0.0,40.0,UnitedStates,<=50K,53,Married-civ-spouse,Male,Handlers-cleaners
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Wife,Black,0.0,0.0,40.0,Cuba,<=50K,37,Married-civ-spouse,Female,Prof-specialty


## Imputation Sklearn
- API specification of function **imputation_sklearn** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
- 2 options supported: KNN and regression
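
For intuition about the KNN option, here is a small pure-Python sketch of nearest-neighbour imputation. It assumes a missing cell is filled with the average of that column over the `k` closest rows, with distance measured only on columns both rows have observed. The helper name `knn_impute` and the toy data are hypothetical, and the real function relies on scikit-learn rather than code like this.

```python
import math


def knn_impute(rows, k=2):
    """Fill None cells with the column mean over the k nearest rows."""

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Only rows that have column j observed can act as donors.
            donors = sorted(
                (distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            )
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy data: the third row is missing its second column.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]]
print(knn_impute(rows, k=2)[2])  # [3.0, 15.0]
```

The two closest rows to `[3.0, None]` are `[2.0, 20.0]` and `[1.0, 10.0]`, so the gap is filled with their average, 15.0.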

In [50]:
from anovos.data_transformer.transformers import imputation_sklearn

In [51]:
df = df.drop('empty')

In [52]:
print(df.count())
print(df.dropna().count())

32561
11641


In [53]:
# Example 1 - with mandatory arguments + KNN method  + print impact
odf = imputation_sklearn(spark, idf=df, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|education-num |31                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|age           |61                 |0                 |
|hours-per-week|109                |0                 |
|fnlwgt        |15                 |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



In [54]:
# Example 2 - selected columns + regression method + print impact
odf = imputation_sklearn(spark, idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                         method_type='regression', print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num

                                                                                

+-------------+-------------------+------------------+
|attribute    |missingCount_before|missingCount_after|
+-------------+-------------------+------------------+
|education-num|31                 |0                 |
|capital-gain |13                 |0                 |
|capital-loss |12                 |0                 |
|age          |61                 |0                 |
+-------------+-------------------+------------------+



In [55]:
# Example 3 - KNN method + smaller sample_size + save model
odf = imputation_sklearn(spark, idf=df, sample_size=1000, model_path=outputPath+'/KNN')

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

In [56]:
from anovos.data_analyzer.stats_generator import measures_of_counts
x = measures_of_counts(spark, odf)

# Check that no missing values remain after imputation
x.orderBy('missing_count').toPandas() 

                                                                                

Unnamed: 0,attribute,fill_count,fill_pct,missing_count,missing_pct,nonzero_count,nonzero_pct
0,ifa,32561,1.0,0,0.0,,
1,education-num,32561,1.0,0,0.0,32561.0,1.0
2,capital-gain,32561,1.0,0,0.0,2715.0,0.0834
3,capital-loss,32561,1.0,0,0.0,1520.0,0.0467
4,income,32561,1.0,0,0.0,,
5,age,32561,1.0,0,0.0,32561.0,1.0
6,hours-per-week,32561,1.0,0,0.0,32561.0,1.0
7,fnlwgt,32561,1.0,0,0.0,32561.0,1.0
8,native-country,32561,1.0,0,0.0,,
9,logfnl,32561,1.0,0,0.0,32561.0,1.0


In [57]:
# Example 4 - KNN method + pre-saved model + append new columns + print impact
odf = imputation_sklearn(spark, idf=df, pre_existing_model=True, model_path=outputPath+'/KNN', 
                         output_mode='append', print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

+--------------+-------------------+----------------------+-------------+
|attribute     |missingCount_before|attribute_after       |missing_count|
+--------------+-------------------+----------------------+-------------+
|capital-gain  |13                 |capital-gain_imputed  |0            |
|capital-loss  |12                 |capital-loss_imputed  |0            |
|fnlwgt        |15                 |fnlwgt_imputed        |0            |
|hours-per-week|109                |hours-per-week_imputed|0            |
|age           |61                 |age_imputed           |0            |
|education-num |31                 |education-num_imputed |0            |
|logfnl        |20393              |logfnl_imputed        |0            |
+--------------+-------------------+----------------------+-------------+



In [58]:
# Example 5 - regression method + smaller sample_size + save model
odf = imputation_sklearn(spark, idf=df, method_type='regression', sample_size=1000, 
                         model_path=outputPath+'/regression')

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

In [59]:
# Example 6 - regression method + pre-saved model + append new columns + print impact
odf = imputation_sklearn(spark, idf=df, method_type='regression', pre_existing_model=True, 
                         model_path=outputPath+'/regression', output_mode='append', print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

+--------------+-------------------+----------------------+-------------+
|attribute     |missingCount_before|attribute_after       |missing_count|
+--------------+-------------------+----------------------+-------------+
|capital-gain  |13                 |capital-gain_imputed  |0            |
|capital-loss  |12                 |capital-loss_imputed  |0            |
|fnlwgt        |15                 |fnlwgt_imputed        |0            |
|hours-per-week|109                |hours-per-week_imputed|0            |
|age           |61                 |age_imputed           |0            |
|education-num |31                 |education-num_imputed |0            |
|logfnl        |20393              |logfnl_imputed        |0            |
+--------------+-------------------+----------------------+-------------+



In [60]:
# Example 7 - use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = imputation_sklearn(spark, df, stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                         print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|hours-per-week|109                |0                 |
|education-num |31                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|fnlwgt        |15                 |0                 |
|logfnl        |20393              |0                 |
|age           |61                 |0                 |
+--------------+-------------------+------------------+



## Imputation Matrix Factorization
- API specification of function **imputation_matrixFactorization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [61]:
from anovos.data_transformer.transformers import imputation_matrixFactorization

In [62]:
# Example 1 - all columns with missing values + print impact
odf = imputation_matrixFactorization(spark, idf=df, id_col='ifa', print_impact=True)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|education-num |31                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|age           |61                 |0                 |
|hours-per-week|109                |0                 |
|fnlwgt        |15                 |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



In [63]:
# Example 2 - selected columns + print impact
odf = imputation_matrixFactorization(spark, idf=df, 
                                     list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                     id_col='ifa', print_impact=True)

                                                                                

+-------------+-------------------+------------------+
|attribute    |missingCount_before|missingCount_after|
+-------------+-------------------+------------------+
|education-num|31                 |0                 |
|capital-gain |13                 |0                 |
|capital-loss |12                 |0                 |
|age          |61                 |0                 |
+-------------+-------------------+------------------+



In [64]:
# Example 3 - use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = imputation_matrixFactorization(spark, df, 
                                     stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                                     print_impact=True)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|hours-per-week|109                |0                 |
|education-num |31                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|logfnl        |20393              |0                 |
|fnlwgt        |15                 |0                 |
|age           |61                 |0                 |
+--------------+-------------------+------------------+



## Auto Imputation
- API specification of function **auto_imputation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
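
The examples in this section print one score per candidate method and report the lowest-scoring one as the best. A rough sketch of that selection idea follows, under the assumption that each score is an RMSE computed on known values that were temporarily hidden. Only two simple candidates are compared here, and the helper names `rmse_on_masked` and `pick_best_method` as well as the sample data are hypothetical.

```python
import math
import statistics


def rmse_on_masked(values, masked_positions, fill_value):
    """RMSE between the hidden true values and the value used to fill them."""
    errors = [(values[i] - fill_value) ** 2 for i in masked_positions]
    return math.sqrt(sum(errors) / len(errors))


def pick_best_method(values, masked_positions):
    """Score each candidate on the hidden cells and return the lowest RMSE."""
    # The fill values are learned only from cells that were not hidden.
    training = [v for i, v in enumerate(values) if i not in masked_positions]
    candidates = {
        "MMM-mean": statistics.mean(training),
        "MMM-median": statistics.median(training),
    }
    scores = [
        (name, rmse_on_masked(values, masked_positions, fill))
        for name, fill in candidates.items()
    ]
    best = min(scores, key=lambda pair: pair[1])[0]
    return scores, best


# Fully observed toy column; positions 1 and 5 are hidden for evaluation.
values = [1, 2, 3, 4, 100, 2, 3]
scores, best = pick_best_method(values, masked_positions={1, 5})
print(scores)
print("Best Imputation Method: ", best)
```

In this toy column the outlier `100` inflates the mean, so the median reconstructs the hidden values more closely and is selected.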

In [67]:
from anovos.data_transformer.transformers import auto_imputation

In [66]:
# Example 1 - all columns with missing values + print impact
auto_imputation(spark, df, id_col='ifa', print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...


[('MMM-mean', 3.527763864712216), ('MMM-median', 3.92952606466687), ('KNN', 3.9519728198861728), ('regression', 3.498162349389901), ('matrix_factorization', 5.876657881935616)]
Best Imputation Method:  regression



                                                                                

DataFrame[ifa: string, age: float, fnlwgt: float, logfnl: float, education-num: float, capital-gain: float, capital-loss: float, hours-per-week: float, native-country: string, income: string, education: string, sex: string, marital-status: string, workclass: string, occupation: string, race: string, relationship: string, index: int]

In [69]:
# Example 2 - selected columns + customized null_pct + print impact
odf = auto_imputation(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'],
                                   id_col='ifa', null_pct=0.5, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num


[('MMM-mean', 8.45557875153865), ('MMM-median', 16.711628363631945), ('KNN', 8.863458483329977), ('regression', 9.384259370085479), ('matrix_factorization', 11.739216950649823)]
Best Imputation Method:  MMM-mean



                                                                                

In [70]:
# Example 3 - selected columns + use pre-saved stats + print impact
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = auto_imputation(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                      id_col='ifa', stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                      print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num

[('MMM-mean', 3.7619303948670058), ('MMM-median', 4.2195222297059685), ('KNN', 4.037273402022951), ('regression', 3.7190839693796875), ('matrix_factorization', 4.725047562065969)]
Best Imputation Method:  regression



                                                                                

# Latent Features Generation

## PCA Latent Features
- API specification of function **PCA_latentFeatures** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [79]:
from anovos.data_transformer.transformers import PCA_latentFeatures

In [80]:
# Example 1 - with mandatory arguments + print impact
odf = PCA_latentFeatures(spark, df, standardization=True, print_impact=True)
odf.limit(5).toPandas()


                                                                                

Explained Variance:  0.9866



                                                                                

+-------+--------------------+---------------------+-------------------+---------------------+--------------------+---------------------+
|summary|latent_0            |latent_1             |latent_2           |latent_3             |latent_4            |latent_5             |
+-------+--------------------+---------------------+-------------------+---------------------+--------------------+---------------------+
|count  |12085               |12085                |12085              |12085                |12085               |12085                |
|mean   |0.006990473625180995|0.0016755667182939693|0.00531931809024164|-0.009834625663188677|0.007587318127886196|-0.008249612368146014|
|stddev |1.381797632041396   |1.1228193008371192   |1.015036593603689  |0.9838047392914232   |0.9419796818084403  |0.8971511772848009   |
|min    |-3.6521552          |-3.5124547           |-8.272692          |-2.186532            |-7.3917856          |-7.4833627           |
|max    |9.021795            |8.73

Unnamed: 0,ifa,workclass,empty,education,marital-status,occupation,relationship,race,sex,native-country,income,latent_0,latent_1,latent_2,latent_3,latent_4,latent_5
0,3a,Private,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,UnitedStates,<=50K,0.54468,-0.323292,-0.042623,0.146981,0.248483,-0.113231
1,4a,Private,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,UnitedStates,<=50K,0.70714,-0.398122,0.210681,1.422995,0.590206,-0.326371
2,6a,Private,,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States,<=50K,1.169402,0.817424,-0.208944,-0.57067,-0.319456,1.234487
3,7a,Private,,,,Other-service,Not-in-family,Black,Female,Jamaica,<=50K,-0.014027,-2.157362,0.256786,1.788715,-0.760998,-0.507208
4,8a,Self-emp-not-inc,,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,>50K,0.342476,0.209419,0.112829,0.979675,0.671754,0.07136


In [81]:
# Example 2 - selected columns + customized explained_variance_cutoff + print impact
odf = PCA_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                         explained_variance_cutoff=0.6, standardization=True, print_impact=True)
odf.limit(5).toPandas()

                                                                                

Explained Variance:  0.7943


                                                                                

+-------+---------------------+----------------------+---------------------+
|summary|latent_0             |latent_1              |latent_2             |
+-------+---------------------+----------------------+---------------------+
|count  |32466                |32466                 |32466                |
|mean   |1.6552674503466662E-4|-6.3589239401116544E-6|2.8812043641422073E-4|
|stddev |1.0870763482758605   |1.0143077110112224    |0.9826621520658779   |
|min    |-9.827796            |-9.095775             |-2.3905334           |
|max    |2.9694445            |7.914514              |4.3574286            |
+-------+---------------------+----------------------+---------------------+



Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income,latent_0,latent_1,latent_2
0,3a,Private,215646,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,UnitedStates,<=50K,0.423203,0.10286,0.229906
1,4a,Private,234721,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,UnitedStates,<=50K,0.380385,-0.019731,1.574839
2,6a,Private,284582,5.454207,,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,40,United-States,<=50K,-0.725312,0.091507,-0.90499
3,7a,Private,160187,5.204627,,,,Other-service,Not-in-family,Black,Female,16,Jamaica,<=50K,0.991146,0.023124,1.760285
4,8a,Self-emp-not-inc,209642,5.321478,,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,45,United-States,>50K,-0.058382,-0.019048,1.084266
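
In Example 2, a cutoff of 0.6 on four input columns yields three latent features with a reported explained variance of 0.7943. That is consistent with keeping the smallest number of principal components whose cumulative explained-variance ratio reaches the cutoff. The sketch below shows that selection rule in plain Python. The function name `components_for_cutoff` is hypothetical, and the eigenvalues are illustrative numbers chosen to resemble four standardized columns, not values taken from ANOVOS.

```python
from itertools import accumulate


def components_for_cutoff(eigenvalues, cutoff):
    """Return (k, ratio): the fewest components whose cumulative
    explained-variance ratio is at least `cutoff`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        ratio = cumulative / total
        if ratio >= cutoff:
            return k, ratio
    # Guard against floating-point shortfall: fall back to all components.
    return len(ordered), 1.0


# Illustrative eigenvalues of a 4x4 correlation matrix (they sum to about 4).
illustrative_eigenvalues = [1.18, 1.03, 0.97, 0.82]
k, ratio = components_for_cutoff(illustrative_eigenvalues, cutoff=0.6)
print(k, round(ratio, 4))
```

With nearly uncorrelated standardized inputs, each component explains roughly an equal share, so two components fall short of 0.6 and a third is needed.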


In [82]:
# Example 3 - all numerical columns + save model
odf = PCA_latentFeatures(spark, df, model_path=outputPath, standardization=True)


                                                                                

In [83]:
# Example 4 - all numerical columns + use pre-saved model
odf = PCA_latentFeatures(spark, df, pre_existing_model=True, model_path=outputPath, standardization=True, 
                         print_impact=True)
odf.limit(5).toPandas()

                                                                                

Explained Variance:  0.9866
+-------+--------------------+---------------------+-------------------+---------------------+--------------------+---------------------+
|summary|latent_0            |latent_1             |latent_2           |latent_3             |latent_4            |latent_5             |
+-------+--------------------+---------------------+-------------------+---------------------+--------------------+---------------------+
|count  |12085               |12085                |12085              |12085                |12085               |12085                |
|mean   |0.006990473625180995|0.0016755667182939693|0.00531931809024164|-0.009834625663188677|0.007587318127886196|-0.008249612368146014|
|stddev |1.381797632041396   |1.1228193008371192   |1.015036593603689  |0.9838047392914232   |0.9419796818084403  |0.8971511772848009   |
|min    |-3.6521552          |-3.5124547           |-8.272692          |-2.186532            |-7.3917856          |-7.4833627           |
|max  

Unnamed: 0,ifa,workclass,empty,education,marital-status,occupation,relationship,race,sex,native-country,income,latent_0,latent_1,latent_2,latent_3,latent_4,latent_5
0,3a,Private,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,UnitedStates,<=50K,0.54468,-0.323292,-0.042623,0.146981,0.248483,-0.113231
1,4a,Private,,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,UnitedStates,<=50K,0.70714,-0.398122,0.210681,1.422995,0.590206,-0.326371
2,6a,Private,,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States,<=50K,1.169402,0.817424,-0.208944,-0.57067,-0.319456,1.234487
3,7a,Private,,,,Other-service,Not-in-family,Black,Female,Jamaica,<=50K,-0.014027,-2.157362,0.256786,1.788715,-0.760998,-0.507208
4,8a,Self-emp-not-inc,,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,>50K,0.342476,0.209419,0.112829,0.979675,0.671754,0.07136


In [84]:
# Example 5 - all numerical columns + use pre-saved stats + print impact
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = PCA_latentFeatures(spark, df, standardization=True, 
                         stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                         print_impact=True)

Explained Variance:  0.9866
+-------+--------------------+---------------------+-------------------+---------------------+--------------------+---------------------+
|summary|latent_0            |latent_1             |latent_2           |latent_3             |latent_4            |latent_5             |
+-------+--------------------+---------------------+-------------------+---------------------+--------------------+---------------------+
|count  |12085               |12085                |12085              |12085                |12085               |12085                |
|mean   |0.006990473625180995|0.0016755667182939693|0.00531931809024164|-0.009834625663188677|0.007587318127886196|-0.008249612368146014|
|stddev |1.381797632041396   |1.1228193008371192   |1.015036593603689  |0.9838047392914232   |0.9419796818084403  |0.8971511772848009   |
|min    |-3.6521552          |-3.5124547           |-8.272692          |-2.186532            |-7.3917856          |-7.4833627           |
|max  

In [85]:
# Example 6 - use pre-saved standardization model
odf = PCA_latentFeatures(spark, df, standardization=True,
                         standardization_configs={"pre_existing_model": True, "model_path": outputPath}, 
                         print_impact=True)

                                                                                

Explained Variance:  0.9866
+-------+--------------------+---------------------+-------------------+---------------------+--------------------+---------------------+
|summary|latent_0            |latent_1             |latent_2           |latent_3             |latent_4            |latent_5             |
+-------+--------------------+---------------------+-------------------+---------------------+--------------------+---------------------+
|count  |12085               |12085                |12085              |12085                |12085               |12085                |
|mean   |0.006990473625180995|0.0016755667182939693|0.00531931809024164|-0.009834625663188677|0.007587318127886196|-0.008249612368146014|
|stddev |1.381797632041396   |1.1228193008371192   |1.015036593603689  |0.9838047392914232   |0.9419796818084403  |0.8971511772848009   |
|min    |-3.6521552          |-3.5124547           |-8.272692          |-2.186532            |-7.3917856          |-7.4833627           |
|max  

In [86]:
# Example 7 - impute missing values before calculation
odf = PCA_latentFeatures(spark, df, standardization=True, imputation=True, print_impact=True)

                                                                                

Explained Variance:  0.9635




+-------+--------------------+--------------------+-------------------+--------------------+---------------------+---------------------+
|summary|latent_0            |latent_1            |latent_2           |latent_3            |latent_4             |latent_5             |
+-------+--------------------+--------------------+-------------------+--------------------+---------------------+---------------------+
|count  |32561               |32561               |32561              |32561               |32561                |32561                |
|mean   |0.014842900705817235|0.037540262676611896|0.00452105304795438|0.010756455649333153|0.0016000300296498654|-0.006853182355680893|
|stddev |1.1508209468691626  |1.0680661788123114  |1.0138911021934351 |0.9808178415429736  |0.938791376448427    |0.8967908005138374   |
|min    |-8.268953           |-3.943358           |-7.9534187         |-2.5194323          |-7.6249423           |-3.4757562           |
|max    |5.2560782           |9.891342   
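
With `imputation=True` the summary reports 32561 rows per latent feature, versus 12085 in the earlier examples, whose output also starts at `ifa` 3a rather than 1a. This suggests that rows with a missing value in any input column are left out of the PCA unless they are imputed first. The sketch below contrasts the two behaviours on a tiny hypothetical dataset. The helper names and the use of mean imputation are assumptions for illustration; the notebook does not show which imputation method ANOVOS applies here.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 38.0, "hours": 40.0},
    {"age": None, "hours": 13.0},
    {"age": 53.0, "hours": None},
    {"age": 28.0, "hours": 40.0},
]


def complete_cases(rows):
    """Keep only rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows):
    """Fill each missing value with the mean of its column (stand-in method)."""
    cols = list(rows[0].keys())
    fills = {c: mean(r[c] for r in rows if r[c] is not None) for c in cols}
    return [{c: (r[c] if r[c] is not None else fills[c]) for c in cols} for r in rows]


print(len(complete_cases(rows)), "rows without imputation")
print(len(impute_mean(rows)), "rows with imputation")
```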


                                                                                

# Feature Transformation
- API specification of function **feature_transformation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [87]:
from anovos.data_transformer.transformers import feature_transformation

In [88]:
# Example 1: sqrt 
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='sqrt', print_impact=True)

Before:
+-------+------------------+-----------------+------------------+------------------+
|summary|capital-gain      |capital-loss     |education-num     |age               |
+-------+------------------+-----------------+------------------+------------------+
|count  |32548             |32549            |32530             |32500             |
|mean   |1077.6959567408135|87.3360164674798 |10.080971411005226|38.506492307692305|
|stddev |7386.624857802765 |403.0310072565718|2.5725103263986977|13.508497735339255|
|min    |0                 |0                |1                 |17                |
|max    |99999             |4356             |16                |85                |
+-------+------------------+-----------------+------------------+------------------+

After:
+-------+------------------+------------------+------------------+------------------+
|summary|capital-gain      |capital-loss      |education-num     |age               |
+-------+------------------+------------------+

In [89]:
# Example 2: natural log (ln) + append generated columns
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='ln', output_mode='append', print_impact=True)

Before:
+-------+------------------+-----------------+------------------+------------------+
|summary|capital-gain      |capital-loss     |education-num     |age               |
+-------+------------------+-----------------+------------------+------------------+
|count  |32548             |32549            |32530             |32500             |
|mean   |1077.6959567408135|87.3360164674798 |10.080971411005226|38.506492307692305|
|stddev |7386.624857802765 |403.0310072565718|2.5725103263986977|13.508497735339255|
|min    |0                 |0                |1                 |17                |
|max    |99999             |4356             |16                |85                |
+-------+------------------+-----------------+------------------+------------------+

After:
+-------+------------------+-------------------+------------------+-------------------+
|summary|capital-gain_ln   |capital-loss_ln    |education-num_ln  |age_ln             |
+-------+------------------+---------------

In [90]:
# Example 3: round to 1 decimal place
odf = feature_transformation(idf=odf, 
                             list_of_cols=['education-num_ln', 'capital-gain_ln', 'capital-loss_ln', 'age_ln'], 
                             method_type='roundN', N=1, print_impact=True)

Before:
+-------+------------------+-------------------+-------------------+------------------+
|summary|education-num_ln  |age_ln             |capital-loss_ln    |capital-gain_ln   |
+-------+------------------+-------------------+-------------------+------------------+
|count  |32530             |32500              |1519               |2710              |
|mean   |2.2689316480632096|3.588027147805643  |7.508497766226061  |8.819883472603545 |
|stddev |0.3168442727686073|0.35895718528658876|0.25675668323690815|1.0158964531089267|
|min    |0.0               |2.833213344056216  |5.043425116919247  |4.736198448394496 |
|max    |2.772588722239781 |4.442651256490317  |8.37930948405285   |11.512915464920228|
+-------+------------------+-------------------+-------------------+------------------+

After:
+-------+-------------------+-------------------+-------------------+------------------+
|summary|education-num_ln   |age_ln             |capital-loss_ln    |capital-gain_ln   |
+-------+-----

In [91]:
# Example 4: square
odf = feature_transformation(idf=df, list_of_cols='age', method_type='sq', print_impact=True)

Before:
+-------+------------------+
|summary|age               |
+-------+------------------+
|count  |32500             |
|mean   |38.506492307692305|
|stddev |13.508497735339255|
|min    |17                |
|max    |85                |
+-------+------------------+

After:
+-------+------------------+
|summary|age               |
+-------+------------------+
|count  |32500             |
|mean   |1665.2238461538461|
|stddev |1154.2085383349045|
|min    |289.0             |
|max    |7225.0            |
+-------+------------------+
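
The examples above apply one element-wise function per column, selected by `method_type`. The plain-Python sketch below mirrors the five methods used in this section so the printed statistics can be checked by hand: squaring maps the `age` range 17 to 85 onto 289 to 7225, as in Example 4. It also returns `None` for the natural log of non-positive values, which is an assumption consistent with the `capital-loss_ln` and `capital-gain_ln` counts dropping to 1519 and 2710 in the "Before" table of Example 3. The function `transform` is hypothetical and is not part of ANOVOS.

```python
import math


def transform(value, method_type, N=None):
    """Apply one of the transformations used in the examples to a single value."""
    if value is None:
        return None
    if method_type == "sqrt":
        return math.sqrt(value) if value >= 0 else None
    if method_type == "ln":
        # Assumption: undefined results are represented as missing values.
        return math.log(value) if value > 0 else None
    if method_type == "sq":
        return value ** 2
    if method_type == "roundN":
        # Python rounds ties to even; Spark's tie-breaking may differ.
        return round(value, N)
    if method_type == "remainderDivByN":
        return value % N
    raise ValueError(f"unsupported method_type: {method_type}")


print(transform(17, "sq"), transform(85, "sq"))
print(transform(16, "ln"), transform(0, "ln"))
print(transform(17, "remainderDivByN", N=10))
```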



In [92]:
# Example 5: remainder after division by 10
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='remainderDivByN', N=10, print_impact=True)

Before:
+-------+------------------+-----------------+------------------+------------------+
|summary|capital-gain      |capital-loss     |education-num     |age               |
+-------+------------------+-----------------+------------------+------------------+
|count  |32548             |32549            |32530             |32500             |
|mean   |1077.6959567408135|87.3360164674798 |10.080971411005226|38.506492307692305|
|stddev |7386.624857802765 |403.0310072565718|2.5725103263986977|13.508497735339255|
|min    |0                 |0                |1                 |17                |
|max    |99999             |4356             |16                |85                |
+-------+------------------+-----------------+------------------+------------------+

After:
+-------+-------------------+------------------+------------------+------------------+
|summary|capital-gain       |capital-loss      |education-num     |age               |
+-------+-------------------+----------------

# Box Cox Transformation
- API specification of function **boxcox_transformation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [93]:
from anovos.data_transformer.transformers import boxcox_transformation

In [94]:
# Example 1 - all numerical columns (excluding drop_cols) + print impact
odf = boxcox_transformation(df, drop_cols=['capital-loss', 'capital-gain'], print_impact=True)

Transformed Columns:  ['education-num', 'hours-per-week', 'logfnl', 'fnlwgt', 'age']
Best BoxCox Parameter(s):  [3, 3, 1, 1, 0]
Before:
+--------+-------------------+--------------------+-------------------+------------------+------------------+
|summary |education-num      |hours-per-week      |logfnl             |fnlwgt            |age               |
+--------+-------------------+--------------------+-------------------+------------------+------------------+
|count   |32530              |32452               |12168              |32546             |32500             |
|mean    |10.080971411005226 |40.24972266732405   |5.2054654851899365 |189781.83180728814|38.506492307692305|
|stddev  |2.5725103263986977 |11.914337669272212  |0.27424241727170395|105563.06445057027|13.508497735339255|
|min     |1                  |1                   |4.283617786        |12285             |17                |
|max     |16                 |94                  |6.088696941        |1484705           |85  

In [95]:
# Example 2 - selected columns + user-specified lambda value + append generated column + print impact
odf = boxcox_transformation(df, list_of_cols='age', boxcox_lambda=0, output_mode='append', print_impact=True)

Transformed Columns:  ['age']
Best BoxCox Parameter(s):  [0]
Before:
+--------+------------------+
|summary |age               |
+--------+------------------+
|count   |32500             |
|mean    |38.506492307692305|
|stddev  |13.508497735339255|
|min     |17                |
|max     |85                |
|skewness|0.5127993362812463|
+--------+------------------+

After:
+--------+--------------------+
|summary |age_bxcx_0          |
+--------+--------------------+
|count   |32500               |
|mean    |3.588027147805643   |
|stddev  |0.35895718528658876 |
|min     |2.833213344056216   |
|max     |4.442651256490317   |
|skewness|-0.14607838263666595|
+--------+--------------------+
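
With `boxcox_lambda=0`, the "After" statistics match the natural log of `age`: the minimum 2.8332 is ln 17 and the maximum 4.4427 is ln 85. The sketch below implements the textbook Box-Cox definition, $(x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $\ln x$ for $\lambda = 0$, for a single positive value. Only the $\lambda = 0$ case is confirmed by the output above; whether ANOVOS uses this exact form for non-zero lambda is not shown in this notebook, so treat that branch as an assumption.

```python
import math


def boxcox(x, lam):
    """Textbook Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox is defined for positive values only")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


print(boxcox(17, 0), boxcox(85, 0))
```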



# Outlier Categories Treatment
- API specification of function **outlier_categories** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports two ways of detecting outlier categories: by maximum number of categories (`max_category`) and by coverage (`coverage`, the fraction of rows the retained categories must cover)

In [96]:
from anovos.data_transformer.transformers import outlier_categories

In [97]:
# Example 1 - 'all' columns (excluding drop_cols) + max 15 categories + append transformed columns at the end
odf = outlier_categories(spark, df, drop_cols=['ifa'], max_category=15, output_mode='append')
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,relationship_outliered,marital-status_outliered,sex_outliered,workclass_outliered,occupation_outliered,education_outliered,race_outliered,native-country_outliered,income_outliered,empty_outliered
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,Not-in-family,Never-married,Male,State-gov,Adm-clerical,Bachelors,White,others,<=50K,
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,Husband,Married-civ-spouse,Male,Self-emp-not-inc,Exec-managerial,Bachelors,White,others,<=50K,
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,Not-in-family,Divorced,Male,Private,Handlers-cleaners,HS-grad,White,others,<=50K,
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,Husband,Married-civ-spouse,Male,Private,Handlers-cleaners,11th,Black,others,<=50K,
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,Wife,Married-civ-spouse,Female,Private,Prof-specialty,Bachelors,Black,Cuba,<=50K,


In [98]:
# Example 2 - selected columns + max 10 categories
odf = outlier_categories(spark, df, list_of_cols=['education', 'occupation', 'native-country'], 
                         max_category=10, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|occupation    |15                 |
|native-country|44                 |
|education     |16                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|occupation    |10                |
|native-country|10                |
|education     |10                |
+--------------+------------------+



In [99]:
# Example 3 - selected columns + cover 90% values
odf = outlier_categories(spark, df, list_of_cols=['education', 'occupation', 'native-country'], 
                         coverage=0.9, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|occupation    |15                 |
|native-country|44                 |
|education     |16                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|occupation    |11                |
|native-country|3                 |
|education     |9                 |
+--------------+------------------+
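
In Example 2 every column ends with exactly `max_category` unique values, and in Example 1 the `native-country_outliered` column contains the label `others`. Both observations fit a rule that keeps the most frequent categories and merges the rest into a single `others` bucket. The sketch below implements that rule in plain Python for both modes. The function name `group_rare_categories` and the exact tie-breaking and counting details are assumptions, so its counts need not match ANOVOS in every case.

```python
from collections import Counter


def group_rare_categories(values, max_category=None, coverage=None, label="others"):
    """Replace infrequent categories with `label`.

    max_category: upper bound on distinct values, counting `label` itself.
    coverage: keep the most frequent categories until they cover this
              fraction of the non-missing rows.
    """
    counts = Counter(v for v in values if v is not None)
    ranked = [cat for cat, _ in counts.most_common()]

    if max_category is not None and len(ranked) > max_category:
        # Reserve one slot for the merged bucket.
        ranked = ranked[: max_category - 1]

    if coverage is not None:
        total = sum(counts.values())
        kept, running = [], 0
        for cat in ranked:
            if running >= coverage * total:
                break
            kept.append(cat)
            running += counts[cat]
        ranked = kept

    keep = set(ranked)
    return [v if (v is None or v in keep) else label for v in values]


sample = ["a"] * 7 + ["b"] * 2 + ["c"]
print(sorted(set(group_rare_categories(sample, max_category=2))))
print(sorted(set(group_rare_categories(sample, coverage=0.8))))
```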



In [100]:
# Example 4 - max 15 categories + save model
odf = outlier_categories(spark, df, drop_cols=['ifa'], max_category=15, 
                         pre_existing_model=False, model_path=outputPath, print_impact=True)


+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|relationship  |8                  |
|marital-status|7                  |
|workclass     |11                 |
|occupation    |15                 |
|empty         |0                  |
|education     |16                 |
|race          |9                  |
|native-country|44                 |
|income        |2                  |
|sex           |3                  |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|relationship  |8                 |
|marital-status|7                 |
|workclass     |11                |
|occupation    |15                |
|empty         |0                 |
|education     |15                |
|race          |9                 |
|native-country|15                |
|income        |2                 |
|sex           |3                 |
+------------

In [101]:
# Example 5 - use pre-saved model
odf = outlier_categories(spark, df, drop_cols=['ifa'], pre_existing_model=True, model_path=outputPath, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|relationship  |8                  |
|marital-status|7                  |
|workclass     |11                 |
|occupation    |15                 |
|empty         |0                  |
|education     |16                 |
|race          |9                  |
|native-country|44                 |
|income        |2                  |
|sex           |3                  |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|relationship  |8                 |
|marital-status|7                 |
|workclass     |10                |
|occupation    |15                |
|empty         |0                 |
|education     |15                |
|race          |9                 |
|native-country|15                |
|income        |2                 |
|sex           |3                 |
+------------
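
Examples 4 and 5 follow a fit-once, reuse-later pattern: the first run stores what it learned under `model_path`, and the second run applies it without refitting. The sketch below shows the same pattern in plain Python, using a JSON file to hold the list of retained categories. The file name `kept_categories.json` and all helper names are hypothetical; the storage format ANOVOS actually uses is not shown in this notebook.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def fit_kept_categories(values, max_category):
    """Learn which categories to retain (one slot is reserved for 'others')."""
    counts = Counter(v for v in values if v is not None)
    ranked = [cat for cat, _ in counts.most_common()]
    return ranked if len(ranked) <= max_category else ranked[: max_category - 1]


def save_model(kept, model_path):
    path = Path(model_path)
    path.mkdir(parents=True, exist_ok=True)
    (path / "kept_categories.json").write_text(json.dumps(kept))


def load_model(model_path):
    return json.loads((Path(model_path) / "kept_categories.json").read_text())


def apply_model(values, kept, label="others"):
    keep = set(kept)
    return [v if (v is None or v in keep) else label for v in values]


with tempfile.TemporaryDirectory() as model_path:
    # First run: fit on training data and persist the result.
    train = ["a", "a", "a", "b", "b", "c"]
    save_model(fit_kept_categories(train, max_category=2), model_path)

    # Later run: reuse the stored categories on new data, including unseen ones.
    new_data = ["a", "c", "d", None]
    result = apply_model(new_data, load_model(model_path))
    print(result)
```

Reusing a stored model keeps the mapping stable across datasets: a category that was rare at fit time stays in the `others` bucket even if it is frequent in the new data.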

# Expression Parser
- API specification of function **expression_parser** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>

In [102]:
from anovos.data_transformer.transformers import expression_parser

In [103]:
# Example 1 - 2 generated columns + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain-capital-loss'], print_impact=True)

Columns Added:  ['f0', 'f1']
+-------+------------------+-----------------+
|summary|f0                |f1               |
+-------+------------------+-----------------+
|count  |32392             |32548            |
|mean   |78.75373549024451 |990.3572569743149|
|stddev |18.619824518135385|7410.325259409019|
|min    |20                |-4356            |
|max    |158               |99999            |
+-------+------------------+-----------------+



In [104]:
# Example 2 - 2 generated columns (with division) + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain/capital-loss'], print_impact=True)

Columns Added:  ['f0', 'f1']
+-------+------------------+----+
|summary|f0                |f1  |
+-------+------------------+----+
|count  |32392             |1519|
|mean   |78.75373549024451 |0.0 |
|stddev |18.619824518135385|0.0 |
|min    |20                |0.0 |
|max    |158               |0.0 |
+-------+------------------+----+



In [105]:
# Example 3 - 2 generated columns + customized postfix + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain - capital-loss'], postfix="_new", print_impact=True)

Columns Added:  ['f0_new', 'f1_new']
+-------+------------------+-----------------+
|summary|f0_new            |f1_new           |
+-------+------------------+-----------------+
|count  |32392             |32548            |
|mean   |78.75373549024451 |990.3572569743149|
|stddev |18.619824518135385|7410.325259409019|
|min    |20                |-4356            |
|max    |158               |99999            |
+-------+------------------+-----------------+
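
The counts in these summaries reflect how missing values flow through an expression. The sum `age + hours-per-week` has 32392 non-null results, fewer than either input column (32500 and 32452 in the tables shown earlier). The quotient `capital-gain/capital-loss` has only 1519, the same as the number of non-zero `capital-loss` values. Both are consistent with a result being null whenever an operand is null or a divisor is zero. The sketch below reproduces that behaviour on a few hypothetical rows; the helper names are illustrative and are not part of ANOVOS.

```python
def add(a, b):
    """Null-propagating addition."""
    return None if a is None or b is None else a + b


def divide(a, b):
    """Null-propagating division; division by zero also yields None."""
    if a is None or b is None or b == 0:
        return None
    return a / b


# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 38, "hours-per-week": 40, "capital-gain": 0, "capital-loss": 0},
    {"age": None, "hours-per-week": 13, "capital-gain": 2174, "capital-loss": 0},
    {"age": 53, "hours-per-week": None, "capital-gain": 0, "capital-loss": 1902},
    {"age": 28, "hours-per-week": 40, "capital-gain": 0, "capital-loss": 1408},
]

f0 = [add(r["age"], r["hours-per-week"]) for r in rows]
f1 = [divide(r["capital-gain"], r["capital-loss"]) for r in rows]


def non_null(column):
    return sum(v is not None for v in column)


print("f0:", f0, "count:", non_null(f0))
print("f1:", f1, "count:", non_null(f1))
```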

