# ANOVOS - Data Transformer
This notebook lists the functions in the "data transformer" module of the ANOVOS package and shows how to invoke each of them.
- [Attribute Binning](#Attribute-Binning)
- [Monotonic Binning](#Monotonic-Binning)
- [Categorical Attribute to Numerical Attribute Conversion](#Categorical-Attribute-to-Numerical-Attribute-Conversion)
    - [Categorical to Numerical - Unsupervised](#Categorical-to-Numerical---Unsupervised)
    - [Categorical to Numerical - Supervised](#Categorical-to-Numerical---Supervised)
- [Attribute Rescaling](#Attribute-Rescaling)
    - [Z Standardization](#Z-Standardization)
    - [IQR Standardization](#IQR-Standardization)
    - [Normalization](#Normalization)
- [Missing Value Imputation](#Missing-Value-Imputation)
    - [Imputation MMM](#Imputation-MMM)
    - [Imputation Sklearn](#Imputation-Sklearn)
    - [Imputation Matrix Factorization](#Imputation-Matrix-Factorization)
    - [Auto Imputation](#Auto-Imputation)
- [Latent Features Generation](#Latent-Features-Generation)
    - [Autoencoder Latent Features](#Autoencoder-Latent-Features)
    - [PCA Latent Features](#PCA-Latent-Features)
- [Feature Transformation](#Feature-Transformation)
- [Box Cox Transformation](#Box-Cox-Transformation)
- [Outlier Categories Treatment](#Outlier-Categories-Treatment)
- [Expression Parser](#Expression-Parser)

**Setting Spark Session**

In [1]:
#set run type variable
run_type = "local" # "local", "emr", "databricks", "ak8s"

In [3]:
#For run_type Azure Kubernetes, run the following block 
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if run_type == "ak8s":
    fs_path="<insert conf spark.hadoop.fs master url here> ex: spark.hadoop.fs.azure.sas.<container>.<account_name>.blob.core.windows.net"
    auth_key="<insert value of sas_token here>"
    master_url="<insert kubernetes master url path here> ex: k8s://"
    docker_image="<insert name docker image here>"
    kubernetes_namespace ="<insert kubernetes namespace here>"

    # Create Spark config for our Kubernetes based cluster manager
    sparkConf = SparkConf()
    sparkConf.setMaster(master_url)
    sparkConf.setAppName("Anovos_pipeline")
    sparkConf.set("spark.submit.deployMode","client")
    sparkConf.set("spark.kubernetes.container.image", docker_image)
    sparkConf.set("spark.kubernetes.namespace", kubernetes_namespace)
    sparkConf.set("spark.executor.instances", "4")
    sparkConf.set("spark.executor.cores", "4")
    sparkConf.set("spark.executor.memory", "16g")
    sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
    sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    sparkConf.set(fs_path,auth_key)
    sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
    sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3,io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20,org.apache.spark:spark-avro_2.12:3.2.1")

    # Initialize our Spark cluster; this step actually
    # creates the worker nodes.
    spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
    sc = spark.sparkContext
    
    # [Optional but recommended] Set up the root path, for example "wasbs://.../output/"
    # Add root_path=default_root_path to your function call when calling 
        # auto_imputation
        # autoencoder_latentFeatures
        # PCA_latentFeatures
    # Add model_path=default_root_path to your function call when calling 
        # cat_to_num_supervised
    default_root_path = "<insert a root_path to save intermediate_data data>"

#For other run types import from anovos.shared.
else:
    from anovos.shared.spark import *
    auth_key = "NA"

In [4]:
sc.setLogLevel("ERROR")
import warnings
warnings.filterwarnings('ignore')

**Input/Output Path** 

In [5]:
inputPath = "../data/income_dataset/csv"
outputPath = "../output/income_dataset/data_transformer"

**Read Input Data** 

In [6]:
from anovos.data_ingest.data_ingest import read_dataset
from pyspark.sql import functions as F
df = read_dataset(spark, file_path = inputPath, file_type = "csv",
                  file_configs = {"header": "True", "delimiter": "," , "inferSchema": "True"})
df = df.drop("dt_1", "dt_2")
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,latitude,longitude,geohash
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,-38.624096,177.982468,rb68np99
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,-37.73563,176.164047,rckm712q
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,-39.536491,176.832321,rckndgte
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,Black,Female,0.0,0.0,40.0,Cuba,<=50K,-41.128094,175.033722,rckq4596


# Attribute Binning
- API specification of function **attribute_binning** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
- Two binning options: equal range binning (every bin spans the same value width) and equal frequency binning (every bin holds roughly the same number of rows)
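
To make the difference between the two options concrete, here is a minimal plain-Python sketch. It is not the ANOVOS implementation: the helper names (`equal_range_cutoffs`, `equal_frequency_cutoffs`, `assign_bin`) and the toy data are hypothetical, and the tie-handling rule is an assumption chosen for illustration.

```python
from bisect import bisect_left

def equal_range_cutoffs(values, bin_size):
    """Cut points that split [min, max] into bin_size intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_size
    return [lo + width * i for i in range(1, bin_size)]

def equal_frequency_cutoffs(values, bin_size):
    """Cut points taken at evenly spaced ranks, so bins hold similar row counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bin_size - 1] for i in range(1, bin_size)]

def assign_bin(value, cutoffs):
    """1-based bin index; a value equal to a cut point goes to the lower bin (assumption)."""
    return bisect_left(cutoffs, value) + 1

# Hypothetical toy column with a heavily repeated value
hours = [1, 10, 20, 30, 40, 40, 40, 40, 50, 99]
range_cuts = equal_range_cutoffs(hours, bin_size=5)
freq_cuts = equal_frequency_cutoffs(hours, bin_size=5)
print(sorted({assign_bin(h, range_cuts) for h in hours}))
print(sorted({assign_bin(h, freq_cuts) for h in hours}))
```

In this toy data the repeated value 40 produces two identical equal-frequency cut points, so bin 4 stays empty. Heavily repeated values are one plausible reason an equal-frequency run can report fewer distinct bins than the requested `bin_size`.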

In [7]:
from anovos.data_transformer.transformers import attribute_binning

In [8]:
# Example 1 - Equal range binning + append transformed columns at the end
odf = attribute_binning(spark, idf=df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_range", 
                        bin_size=5, output_mode="append", print_impact=True)

odf.toPandas().head(5)

+---------------------+-------------+
|attribute            |unique_values|
+---------------------+-------------+
|education-num_binned |5            |
|hours-per-week_binned|5            |
+---------------------+-------------+



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,capital-gain,capital-loss,hours-per-week,native-country,income,latitude,longitude,geohash,hours-per-week_binned,education-num_binned
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,2174.0,0.0,40.0,UnitedStates,<=50K,-38.624096,177.982468,rb68np99,3.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,0.0,0.0,13.0,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,0.0,0.0,40.0,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,3.0,3.0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,0.0,0.0,40.0,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,3.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,0.0,0.0,40.0,Cuba,<=50K,-41.128094,175.033722,rckq4596,3.0,4.0


In [9]:
# Distinct values after binning
odf.select('hours-per-week_binned').distinct().orderBy('hours-per-week_binned').toPandas().head(10)

Unnamed: 0,hours-per-week_binned
0,
1,1.0
2,2.0
3,3.0
4,4.0
5,5.0


In [10]:
# Example 2 - Equal frequency binning + replace original columns by transformed ones (default)
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_frequency", 
                        bin_size=5, print_impact=True)

odf.toPandas().head(5)

+--------------+-------------+
|attribute     |unique_values|
+--------------+-------------+
|hours-per-week|4            |
|education-num |4            |
+--------------+-------------+



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,...,sex,capital-gain,capital-loss,native-country,income,latitude,longitude,geohash,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,...,Male,2174.0,0.0,UnitedStates,<=50K,-38.624096,177.982468,rb68np99,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,...,Male,0.0,0.0,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,...,Male,0.0,0.0,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,...,Male,0.0,0.0,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,...,Female,0.0,0.0,Cuba,<=50K,-41.128094,175.033722,rckq4596,2.0,4.0


In [11]:
# Distinct values after binning
odf.select('hours-per-week').distinct().orderBy('hours-per-week').toPandas().head(10)

Unnamed: 0,hours-per-week
0,
1,1.0
2,2.0
3,4.0
4,5.0


In [12]:
# Example 3 - Equal frequency binning + save binning model
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], method_type="equal_frequency", 
                        bin_size=5, pre_existing_model=False, model_path=outputPath + "/attribute_binning")

odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,...,sex,capital-gain,capital-loss,native-country,income,latitude,longitude,geohash,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,...,Male,2174.0,0.0,UnitedStates,<=50K,-38.624096,177.982468,rb68np99,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,...,Male,0.0,0.0,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,...,Male,0.0,0.0,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,...,Male,0.0,0.0,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,...,Female,0.0,0.0,Cuba,<=50K,-41.128094,175.033722,rckq4596,2.0,4.0


In [13]:
# Example 4 - Equal frequency binning + use pre-saved model
odf = attribute_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], 
                        pre_existing_model=True, model_path=outputPath + "/attribute_binning")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,...,sex,capital-gain,capital-loss,native-country,income,latitude,longitude,geohash,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,...,Male,2174.0,0.0,UnitedStates,<=50K,-38.624096,177.982468,rb68np99,2.0,4.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,...,Male,0.0,0.0,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0,1.0,4.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,...,Male,0.0,0.0,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,2.0,1.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,...,Male,0.0,0.0,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,2.0,1.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,...,Female,0.0,0.0,Cuba,<=50K,-41.128094,175.033722,rckq4596,2.0,4.0


# Monotonic Binning
- API specification of function **monotonic_binning** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Bin size is computed dynamically
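
The notebook only states that the bin size is computed dynamically. The sketch below shows one way such a search could work: try progressively fewer equal-range bins until the event rate moves in a single direction across bins. The search strategy, the function names, and the `max_bins`/`min_bins` parameters are assumptions for illustration and may differ from what `monotonic_binning` actually does.

```python
def event_rates_by_bin(values, labels, event_label, n_bins):
    """Event rate per equal-range bin, in bin order; empty bins are skipped."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        raise ValueError("values must not all be identical")
    events = [0] * n_bins
    counts = [0] * n_bins
    for v, y in zip(values, labels):
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
        events[idx] += int(y == event_label)
    return [e / c for e, c in zip(events, counts) if c]

def is_monotonic(seq):
    """True if seq never decreases or never increases."""
    pairs = list(zip(seq, seq[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

def monotonic_bin_count(values, labels, event_label, max_bins=20, min_bins=2):
    """Largest bin count in [min_bins, max_bins] whose event rates are monotonic."""
    for n_bins in range(max_bins, min_bins - 1, -1):
        if is_monotonic(event_rates_by_bin(values, labels, event_label, n_bins)):
            return n_bins
    return None

# Hypothetical toy data: higher values are more likely to carry the event label
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", ">50K"]
print(monotonic_bin_count(values, labels, ">50K", max_bins=8))
```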

In [14]:
from anovos.data_transformer.transformers import monotonic_binning

In [15]:
# Example 1 - Equal Range Binning + append transformed columns at the end
odf = monotonic_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], label_col="income", 
                        event_label=">50K", bin_method="equal_range", output_mode="append")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,capital-gain,capital-loss,hours-per-week,native-country,income,latitude,longitude,geohash,hours-per-week_binned,education-num_binned
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,2174.0,0.0,40.0,UnitedStates,<=50K,-38.624096,177.982468,rb68np99,2.0,6.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,0.0,0.0,13.0,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0,1.0,6.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,0.0,0.0,40.0,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,2.0,4.0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,0.0,0.0,40.0,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,2.0,3.0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,0.0,0.0,40.0,Cuba,<=50K,-41.128094,175.033722,rckq4596,2.0,6.0


In [16]:
# Distinct values for hours-per-week after binning 
odf.select("hours-per-week_binned").distinct().orderBy('hours-per-week_binned').toPandas()

Unnamed: 0,hours-per-week_binned
0,
1,1.0
2,2.0
3,3.0


In [17]:
# Example 2 - Equal Frequency Binning + replace original columns by transformed ones (default)
odf = monotonic_binning(spark, df, list_of_cols=["education-num", "hours-per-week"], label_col="income", 
                        event_label=">50K", bin_method="equal_frequency")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,marital-status,occupation,relationship,...,sex,capital-gain,capital-loss,native-country,income,latitude,longitude,geohash,hours-per-week,education-num
0,1a,,State-gov,77516.0,4.889391,,Bachelors,Never-married,Adm-clerical,Not-in-family,...,Male,2174.0,0.0,UnitedStates,<=50K,-38.624096,177.982468,rb68np99,2.0,14.0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,Married-civ-spouse,Exec-managerial,Husband,...,Male,0.0,0.0,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0,1.0,14.0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,Divorced,Handlers-cleaners,Not-in-family,...,Male,0.0,0.0,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,2.0,3.0
3,4a,53.0,Private,234721.0,5.370552,,11th,Married-civ-spouse,Handlers-cleaners,Husband,...,Male,0.0,0.0,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,2.0,2.0
4,5a,,Private,338409.0,5.529442,,Bachelors,Married-civ-spouse,Prof-specialty,Wife,...,Female,0.0,0.0,Cuba,<=50K,-41.128094,175.033722,rckq4596,2.0,14.0


In [18]:
# Distinct values for hours-per-week after binning
odf.select("hours-per-week").distinct().orderBy('hours-per-week').toPandas()

Unnamed: 0,hours-per-week
0,
1,1.0
2,2.0
3,5.0
4,6.0


# Categorical Attribute to Numerical Attribute Conversion

## Categorical to Numerical - Unsupervised
- API specification of function **cat_to_num_unsupervised** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports label encoding (default) and one-hot encoding
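
As a rough illustration of the two encodings, the plain-Python sketch below builds a category-to-integer index and derives one-hot flags from it. The helper names are hypothetical. Defaulting to frequency-descending order (the default ordering of Spark's `StringIndexer`) and breaking ties alphabetically are assumptions, not confirmed behaviour of `cat_to_num_unsupervised`.

```python
from collections import Counter

def label_index(values, index_order="frequencyDesc"):
    """Map each distinct non-null category to an integer code."""
    present = [v for v in values if v is not None]
    if index_order == "alphabetAsc":
        ordered = sorted(set(present))
    else:  # most frequent category gets 0; ties broken alphabetically (assumption)
        counts = Counter(present)
        ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {cat: code for code, cat in enumerate(ordered)}

def one_hot(value, index):
    """One 0/1 flag per known category; nulls and unseen values give all zeros."""
    row = [0] * len(index)
    if value in index:
        row[index[value]] = 1
    return row

# Hypothetical toy column
workclass = ["Private", "State-gov", "Private", None, "Self-emp", "Private", "State-gov"]
index = label_index(workclass)
print(index)
print([one_hot(v, index) for v in workclass[:3]])
```

The index built here plays the role of the saved model: persisting it and reusing it later keeps the codes stable across datasets.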

In [19]:
from anovos.data_transformer.transformers import cat_to_num_unsupervised

In [20]:
# Example 1 - with mandatory arguments (Label Encoding) + print impact
odf = cat_to_num_unsupervised(spark, df, print_impact=True)
odf.toPandas().head(5)

                                                                                

Before


                                                                                

+-------+-----+------------+-------+----------------+-----------+------------+--------------+-----+------+--------------+
|summary|sex  |education   |race   |occupation      |workclass  |relationship|marital-status|empty|income|native-country|
+-------+-----+------------+-------+----------------+-----------+------------+--------------+-----+------+--------------+
|count  |32557|32040       |32247  |32549           |32558      |32557       |32135         |0    |32561 |32561         |
|min    |?    |10th        |*      |?               | Private   |*           |?             |null |<=50K |*             |
|max    |Male |Some-college|Whitess|Transport-moving|Without-pay|Wife        |Widowed       |null |>50K  |Yugoslavia    |
+-------+-----+------------+-------+----------------+-----------+------------+--------------+-----+------+--------------+

After
+-------+-----+---------+-----+----------+---------+------------+--------------+-----+------+--------------+
|summary|sex  |education|race 

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,latitude,longitude,geohash
0,1a,,10.0,77516.0,4.889391,,2.0,13.0,1.0,3.0,...,0.0,0.0,2174.0,0.0,40.0,42,0,-38.624096,177.982468,rb68np99
1,2a,,1.0,83311.0,4.920702,,2.0,13.0,0.0,2.0,...,0.0,0.0,0.0,0.0,13.0,42,0,-40.880497,174.992142,rckjypw0
2,3a,38.0,0.0,215646.0,5.333741,,0.0,9.0,2.0,9.0,...,0.0,0.0,0.0,0.0,40.0,42,0,-37.73563,176.164047,rckm712q
3,4a,53.0,0.0,234721.0,5.370552,,5.0,7.0,0.0,9.0,...,1.0,0.0,0.0,0.0,40.0,42,0,-39.536491,176.832321,rckndgte
4,5a,,0.0,338409.0,5.529442,,2.0,13.0,0.0,0.0,...,1.0,1.0,0.0,0.0,40.0,10,0,-41.128094,175.033722,rckq4596


In [21]:
# Example 2 - 'all' columns (excluding drop_cols) + assign unique integers based on alphabetical order (asc)
odf = cat_to_num_unsupervised(spark, df, list_of_cols='all', drop_cols=['ifa'], index_order='alphabetAsc')
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,latitude,longitude,geohash
0,1a,,1.0,77516.0,4.889391,,9.0,13.0,4.0,1.0,...,6.0,2.0,2174.0,0.0,40.0,41,0,-38.624096,177.982468,rb68np99
1,2a,,8.0,83311.0,4.920702,,9.0,13.0,3.0,4.0,...,6.0,2.0,0.0,0.0,13.0,41,0,-40.880497,174.992142,rckjypw0
2,3a,38.0,6.0,215646.0,5.333741,,11.0,9.0,1.0,6.0,...,6.0,2.0,0.0,0.0,40.0,41,0,-37.73563,176.164047,rckm712q
3,4a,53.0,6.0,234721.0,5.370552,,1.0,7.0,3.0,6.0,...,4.0,2.0,0.0,0.0,40.0,41,0,-39.536491,176.832321,rckndgte
4,5a,,6.0,338409.0,5.529442,,9.0,13.0,3.0,10.0,...,4.0,1.0,0.0,0.0,40.0,6,0,-41.128094,175.033722,rckq4596


In [22]:
# Example 3 - selected categorical columns + one hot encoding + print impact
odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race', 'sex'], 
                              method_type="onehot_encoding", print_impact=True)
odf.toPandas().head(5)

Before
root
 |-- sex: string (nullable = true)
 |-- race: string (nullable = true)

After
root
 |-- sex_0: integer (nullable = true)
 |-- sex_1: integer (nullable = true)
 |-- sex_2: integer (nullable = true)
 |-- sex_3: integer (nullable = true)
 |-- race_0: integer (nullable = true)
 |-- race_1: integer (nullable = true)
 |-- race_2: integer (nullable = true)
 |-- race_3: integer (nullable = true)
 |-- race_4: integer (nullable = true)
 |-- race_5: integer (nullable = true)
 |-- race_6: integer (nullable = true)
 |-- race_7: integer (nullable = true)
 |-- race_8: integer (nullable = true)
 |-- race_9: integer (nullable = true)



Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race_0,race_1,race_2,race_3,race_4,race_5,race_6,race_7,race_8,race_9
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,1,0,0,0,0,0,0,0,0,0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,1,0,0,0,0,0,0,0,0,0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,1,0,0,0,0,0,0,0,0,0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,0,1,0,0,0,0,0,0,0,0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,0,1,0,0,0,0,0,0,0,0


In [23]:
# Example 4 - one hot encoding + save model
odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race', 'sex'], method_type="onehot_encoding", 
                              pre_existing_model=False, model_path=outputPath)
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race_0,race_1,race_2,race_3,race_4,race_5,race_6,race_7,race_8,race_9
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,1,0,0,0,0,0,0,0,0,0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,1,0,0,0,0,0,0,0,0,0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,1,0,0,0,0,0,0,0,0,0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,0,1,0,0,0,0,0,0,0,0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,0,1,0,0,0,0,0,0,0,0


In [24]:
# Example 5 - one hot encoding + use pre-saved model
odf = cat_to_num_unsupervised(spark, df, list_of_cols=['race', 'sex'], method_type="onehot_encoding", 
                              pre_existing_model=True, model_path=outputPath)
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race_0,race_1,race_2,race_3,race_4,race_5,race_6,race_7,race_8,race_9
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,1,0,0,0,0,0,0,0,0,0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,1,0,0,0,0,0,0,0,0,0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,1,0,0,0,0,0,0,0,0,0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,0,1,0,0,0,0,0,0,0,0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,0,1,0,0,0,0,0,0,0,0


## Categorical to Numerical - Supervised
- API specification of function **cat_to_num_supervised** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
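
The printed summaries in the examples below show encoded values between 0 and 1, which is consistent with replacing each category by the share of its rows that carry the event label. The sketch below illustrates that idea in plain Python. The helper name `event_rate_encoding` is hypothetical, and skipping nulls and rounding to four decimals are assumptions rather than documented behaviour.

```python
from collections import defaultdict

def event_rate_encoding(categories, labels, event_label):
    """Share of rows carrying event_label within each non-null category."""
    totals = defaultdict(int)
    events = defaultdict(int)
    for cat, y in zip(categories, labels):
        if cat is None:  # nulls are left unencoded in this sketch
            continue
        totals[cat] += 1
        events[cat] += int(y == event_label)
    return {cat: round(events[cat] / totals[cat], 4) for cat in totals}

# Hypothetical toy data
relationship = ["Wife", "Wife", "Husband", "Husband", "Husband", "Own-child", None]
income = [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"]
mapping = event_rate_encoding(relationship, income, event_label=">50K")
print(mapping)
print([mapping.get(r) for r in relationship])
```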

In [25]:
from anovos.data_transformer.transformers import cat_to_num_supervised

In [26]:
# Example 1 - 'all' columns (excluding drop_cols) + print impact 
odf = cat_to_num_supervised(spark, idf=df, list_of_cols="all", drop_cols="ifa", 
                            label_col="income", event_label=">50K", print_impact=True)

Before: 
+-------+-----+------------+-------+----------------+-----------+------------+--------------+-----+--------+--------------+
|summary|sex  |education   |race   |occupation      |workclass  |relationship|marital-status|empty|geohash |native-country|
+-------+-----+------------+-------+----------------+-----------+------------+--------------+-----+--------+--------------+
|count  |32557|32040       |32247  |32549           |32558      |32557       |32135         |0    |32561   |32561         |
|min    |?    |10th        |*      |?               | Private   |*           |?             |null |pxxxv250|*             |
|max    |Male |Some-college|Whitess|Transport-moving|Without-pay|Wife        |Widowed       |null |rcsj7dsd|Yugoslavia    |
+-------+-----+------------+-------+----------------+-----------+------------+--------------+-----+--------+--------------+

After: 
+-------+------+---------+-----+----------+---------+------------+--------------+------+-------+--------------+
|s

In [27]:
# Example 2 - selected categorical columns + append generated columns + print impact
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status'],
                            label_col="income", event_label=">50K", output_mode="append", print_impact=True)

Before: 
+-------+------------+--------------+
|summary|relationship|marital-status|
+-------+------------+--------------+
|count  |32557       |32135         |
|min    |*           |?             |
|max    |Wife        |Widowed       |
+-------+------------+--------------+

After: 
+-------+--------------------+----------------------+
|summary|relationship_encoded|marital-status_encoded|
+-------+--------------------+----------------------+
|count  |32557               |32135                 |
|min    |0.0                 |0.0458                |
|max    |0.4748              |0.4471                |
+-------+--------------------+----------------------+



In [28]:
# Example 3 - selected categorical columns + append generated column + save model
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status', 'workclass'], 
                            label_col="income", event_label=">50K", model_path=outputPath, output_mode="append")

In [29]:
# Example 4 - selected categorical columns + use pre-saved model
odf = cat_to_num_supervised(spark, idf=df, list_of_cols=['relationship', 'marital-status'], 
                            label_col="income", event_label=">50K", pre_existing_model=True, 
                            model_path=outputPath, print_impact=True)

Before: 
+-------+------------+--------------+
|summary|relationship|marital-status|
+-------+------------+--------------+
|count  |32557       |32135         |
|min    |*           |?             |
|max    |Wife        |Widowed       |
+-------+------------+--------------+

After: 
+-------+------------+--------------+
|summary|relationship|marital-status|
+-------+------------+--------------+
|count  |32557       |32135         |
|min    |0.0         |0.0458        |
|max    |0.4748      |0.4471        |
+-------+------------+--------------+



# Attribute Rescaling

## Z Standardization
- API specification of function **z_standardization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
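
Z standardization rescales a column to zero mean and unit standard deviation via (x - mean) / stddev. The plain-Python sketch below separates learning the parameters from applying them, which mirrors the save-model and pre-saved-model pattern used in the examples. The function names are hypothetical, and using the sample standard deviation and passing nulls through unchanged are assumptions.

```python
from statistics import mean, stdev

def fit_z_params(values):
    """Learn (mean, sample standard deviation) from the non-null values."""
    present = [v for v in values if v is not None]
    return mean(present), stdev(present)

def apply_z_params(values, params):
    """Rescale with previously learned parameters; nulls stay null."""
    mu, sigma = params
    return [None if v is None else (v - mu) / sigma for v in values]

# Hypothetical toy column
ages = [17, 38, 53, None, 28, 85]
params = fit_z_params(ages)            # analogous to fitting and saving a model
scaled = apply_z_params(ages, params)  # analogous to reusing a pre-saved model
print(params)
print(scaled)
```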

In [30]:
from anovos.data_transformer.transformers import z_standardization

In [31]:
# Example 1 - with mandatory arguments
odf = z_standardization(spark, idf=df)

In [32]:
# Example 2 - selected columns + print impact
odf = z_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |fnlwgt            |age               |
+-------+------------------+------------------+------------------+
|count  |32452             |32546             |32500             |
|mean   |40.24972266732405 |189781.83180728814|38.506492307692305|
|stddev |11.914337669272234|105563.06445056995|13.508497735339269|
|min    |1                 |12285             |17                |
|max    |94                |1484705           |85                |
+-------+------------------+------------------+------------------+

After: 
+-------+-----------------------+---------------------+--------------------+
|summary|hours-per-week         |fnlwgt               |age                 |
+-------+-----------------------+---------------------+--------------------+
|count  |32452                  |32546                |32500               |
|mean   |-1.1475269561563802E-15|8.623621355166213E-17|-2.8367052296884

In [33]:
# Example 3 - 'all' columns + save model
odf = z_standardization(spark, idf=df, list_of_cols='all', model_path=outputPath)

In [34]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = z_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                        pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |fnlwgt            |age               |
+-------+------------------+------------------+------------------+
|count  |32452             |32546             |32500             |
|mean   |40.24972266732405 |189781.83180728814|38.506492307692305|
|stddev |11.914337669272234|105563.06445056995|13.508497735339269|
|min    |1                 |12285             |17                |
|max    |94                |1484705           |85                |
+-------+------------------+------------------+------------------+

After: 
+-------+-----------------------+---------------------+--------------------+
|summary|hours-per-week_scaled  |fnlwgt_scaled        |age_scaled          |
+-------+-----------------------+---------------------+--------------------+
|count  |32452                  |32546                |32500               |
|mean   |-1.1475269561563802E-15|8.623621355166213E-17|-2.8367052296884

## IQR Standardization
- API specification of function **IQR_standardization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
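
IQR standardization is a robust alternative to z standardization: each value is centred on the median and divided by the interquartile range, (x - median) / (Q3 - Q1). The sketch below shows the arithmetic in plain Python. The function name is hypothetical, and the quantile interpolation method is an assumption, since Spark typically computes approximate quantiles.

```python
from statistics import median, quantiles

def iqr_standardize(values):
    """Centre on the median and scale by the interquartile range; nulls stay null."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    med = median(present)
    return [None if v is None else (v - med) / (q3 - q1) for v in values]

# Hypothetical toy column
print(iqr_standardize([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

Because the median and quartiles are barely affected by extreme values, a few outliers change the scaling far less than they would with the mean and standard deviation.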

In [35]:
from anovos.data_transformer.transformers import IQR_standardization

In [36]:
# Example 1 - with mandatory arguments
odf = IQR_standardization(spark, idf=df)

In [37]:
# Example 2 - selected columns + print impact
odf = IQR_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |fnlwgt            |age               |
+-------+------------------+------------------+------------------+
|count  |32452             |32546             |32500             |
|mean   |40.24972266732405 |189781.83180728814|38.506492307692305|
|stddev |11.914337669272234|105563.06445056995|13.508497735339269|
|min    |1                 |12285             |17                |
|max    |94                |1484705           |85                |
+-------+------------------+------------------+------------------+

After: 
+-------+-------------------+-------------------+-------------------+
|summary|hours-per-week     |fnlwgt             |age                |
+-------+-------------------+-------------------+-------------------+
|count  |32452              |32546              |32500              |
|mean   |0.04994453346480975|0.10249343317802741|0.07928906882591061|
|stddev |2.3828675338544456 |

In [38]:
# Example 3 - 'all' columns + save model
odf = IQR_standardization(spark, idf=df, list_of_cols='all', model_path=outputPath)

In [39]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = IQR_standardization(spark, idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                          pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |fnlwgt            |age               |
+-------+------------------+------------------+------------------+
|count  |32452             |32546             |32500             |
|mean   |40.24972266732405 |189781.83180728814|38.506492307692305|
|stddev |11.914337669272234|105563.06445056995|13.508497735339269|
|min    |1                 |12285             |17                |
|max    |94                |1484705           |85                |
+-------+------------------+------------------+------------------+

After: 
+-------+---------------------+-------------------+-------------------+
|summary|hours-per-week_scaled|fnlwgt_scaled      |age_scaled         |
+-------+---------------------+-------------------+-------------------+
|count  |32452                |32546              |32500              |
|mean   |0.04994453346480975  |0.10249343317802741|0.07928906882591061|
|stddev |2.38286753

## Normalization
- API specification of function **normalization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
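
The impact tables in this section are consistent with min-max scaling, where each value `x` is mapped to `(x - min) / (max - min)` so that the column spans 0 to 1. The following is a plain-Python check rather than Anovos code: it takes the `min`, `max` and `mean` printed in the "Before" table for the three selected columns and reproduces the "After" means. The helper name `min_max_scale` is hypothetical.

```python
def min_max_scale(value, col_min, col_max):
    """Map value linearly so that col_min -> 0.0 and col_max -> 1.0."""
    return (value - col_min) / (col_max - col_min)


# (min, max, mean) copied from the "Before" impact table in this notebook
before = {
    "hours-per-week": (1, 94, 40.24972266732405),
    "fnlwgt": (12285, 1484705, 189781.83180728814),
    "age": (17, 85, 38.506492307692305),
}

# The scaling is linear, so the mean of the scaled column
# equals the scaled value of the original mean.
scaled_means = {
    name: min_max_scale(mean, lo, hi) for name, (lo, hi, mean) in before.items()
}

for name, value in scaled_means.items():
    print(f"{name}: {value:.6f}")
```

The printed values agree with the `mean` row of the "After" tables to about six decimal places.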

In [40]:
from anovos.data_transformer.transformers import normalization

In [41]:
# Example 1 - with mandatory arguments
odf = normalization(idf=df)

                                                                                

In [42]:
# Example 2 - 'all' columns + print impact
odf = normalization(idf=df, list_of_cols='all', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+-----------------+-------------------+-----------------+------------------+------------------+------------------+
|summary|fnlwgt            |education-num     |age               |longitude        |logfnl             |capital-loss     |hours-per-week    |latitude          |capital-gain      |
+-------+------------------+------------------+------------------+-----------------+-------------------+-----------------+------------------+------------------+------------------+
|count  |32546             |32530             |32500             |32561            |12168              |32549            |32452             |32561             |32548             |
|mean   |189781.83180728814|10.080971411005226|38.506492307692305|174.250656002674 |5.2054654851899365 |87.3360164674798 |40.24972266732405 |-39.53292998854152|1077.6959567408135|
|stddev |105563.06445056995|2.5725103263986973|13.508497735339269|4.885018027625808|0.27424

                                                                                

+-------+-------------------+------------------+------------------+--------------------+-------------------+--------------------+-------------------+-------------------+--------------------+
|summary|fnlwgt             |education-num     |age               |longitude           |logfnl             |capital-loss        |hours-per-week     |latitude           |capital-gain        |
+-------+-------------------+------------------+------------------+--------------------+-------------------+--------------------+-------------------+-------------------+--------------------+
|count  |32546              |32530             |32500             |32561               |12168              |32549               |32452              |32561              |32548               |
|mean   |0.12054769144314768|0.6053981130856241|0.3162719473860012|0.9885226896953745  |0.5106965512077881 |0.020049590646946842|0.42204001986123296|0.08497928854415603|0.010777067434618917|
|stddev |0.07169358238758665|0.17150068662382

In [43]:
# Example 3 - selected columns + save model + print impact
odf = normalization(idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                    pre_existing_model=False, model_path=outputPath, print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |fnlwgt            |age               |
+-------+------------------+------------------+------------------+
|count  |32452             |32546             |32500             |
|mean   |40.24972266732405 |189781.83180728814|38.506492307692305|
|stddev |11.914337669272234|105563.06445056995|13.508497735339269|
|min    |1                 |12285             |17                |
|max    |94                |1484705           |85                |
+-------+------------------+------------------+------------------+

After: 
+-------+-------------------+-------------------+------------------+
|summary|hours-per-week     |fnlwgt             |age               |
+-------+-------------------+-------------------+------------------+
|count  |32452              |32546              |32500             |
|mean   |0.42204001986123296|0.12054769144314768|0.3162719473860012|
|stddev |0.12811115499749567|0.07169358238758665|0.1986543774669423|
|min    |0.0                |0.0                |0.0               |
|max    |1.0                |1.0                |1.0               |
+-------+-------------------+-------------------+------------------+



In [44]:
# Example 4 - selected columns + append new columns + use pre-saved model + print impact
odf = normalization(idf=df, list_of_cols=['fnlwgt', 'age', 'hours-per-week'], 
                    pre_existing_model=True, model_path=outputPath, output_mode='append', print_impact=True)

Before: 
+-------+------------------+------------------+------------------+
|summary|hours-per-week    |fnlwgt            |age               |
+-------+------------------+------------------+------------------+
|count  |32452             |32546             |32500             |
|mean   |40.24972266732405 |189781.83180728814|38.506492307692305|
|stddev |11.914337669272234|105563.06445056995|13.508497735339269|
|min    |1                 |12285             |17                |
|max    |94                |1484705           |85                |
+-------+------------------+------------------+------------------+

After: 
+-------+---------------------+-------------------+------------------+
|summary|hours-per-week_scaled|fnlwgt_scaled      |age_scaled        |
+-------+---------------------+-------------------+------------------+
|count  |32452                |32546              |32500             |
|mean   |0.42204001986123296  |0.12054769144314768|0.3162719473860012|
|stddev |0.12811115499749567  |0.07169358238758665|0.1986543774669423|
|min    |0.0                  |0.0                |0.0               |
|max    |1.0                  |1.0                |1.0               |
+-------+---------------------+-------------------+------------------+
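
Examples 3 and 4 above follow a fit-once, reuse-later pattern: the first call learns the scaling parameters and saves them under `model_path`, and the second call with `pre_existing_model=True` reuses them instead of refitting. With `output_mode='append'`, the scaled values land in new `_scaled` columns, as the last impact table shows. A minimal plain-Python sketch of that pattern follows. The JSON file, its name, and both helper functions are assumptions for illustration and say nothing about how Anovos stores its model.

```python
import json
import os
import tempfile


def fit_min_max(rows, cols):
    """Learn per-column [min, max], ignoring missing values."""
    params = {}
    for c in cols:
        observed = [r[c] for r in rows if r[c] is not None]
        params[c] = [min(observed), max(observed)]
    return params


def apply_min_max(rows, params, append=False):
    """Scale columns with learned parameters; optionally keep the originals."""
    out = []
    for r in rows:
        new = dict(r)
        for c, (lo, hi) in params.items():
            target = c + "_scaled" if append else c
            new[target] = None if r[c] is None else (r[c] - lo) / (hi - lo)
        out.append(new)
    return out


train = [{"age": 17}, {"age": 51}, {"age": 85}]
# Hypothetical file name inside a temporary directory
model_file = os.path.join(tempfile.mkdtemp(), "normalization.json")

# "Save model": fit once and persist the parameters
with open(model_file, "w") as fh:
    json.dump(fit_min_max(train, ["age"]), fh)

# "Use pre-saved model": load the parameters and apply them to other data
with open(model_file) as fh:
    saved = json.load(fh)
scored = apply_min_max([{"age": 34}], saved, append=True)
print(scored)
```

Reusing saved parameters keeps the scaling identical across datasets, which matters when the same transformation has to be applied to training and scoring data.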



# Missing Value Imputation

## Imputation MMM
- API specification of function **imputation_MMM** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- 2 options for numerical attributes: median and mean
- Mode is the only option for categorical attributes
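
To make the three options concrete, here is a small plain-Python sketch of mean, median and mode imputation on toy lists. It is not the Anovos implementation; the function `impute_mmm`, its `method` argument, and the sample values are all hypothetical.

```python
from statistics import mean, median, mode


def impute_mmm(values, method="median"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[method](observed)
    return [fill if v is None else v for v in values]


ages = [38, 53, None, 28, 37, None]      # numerical attribute
sexes = ["Male", None, "Female", "Male"]  # categorical attribute

print(impute_mmm(ages, "median"))
print(impute_mmm(ages, "mean"))
print(impute_mmm(sexes, "mode"))
```

The fill value is computed from the observed entries only, so the missing cells do not influence the statistic that replaces them.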

In [45]:
from anovos.data_transformer.transformers import imputation_MMM

In [46]:
# Example 1 - with mandatory arguments + print impact
odf = imputation_MMM(spark, df, print_impact=True)

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education     |521                |0                 |
|education-num |31                 |0                 |
|empty         |32561              |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
|marital-status|426                |0                 |
|occupation    |12                 |0                 |
|race          |314                |0                 |
|relationship  |4                  |0                 |
|sex           |4                  |0                 |
|workclass     |3                  |0                 |
+--------------+-------------------+------------------+

In [47]:
# Example 2 - use mean for numerical columns + append transformed columns at the end
odf = imputation_MMM(spark, df, list_of_cols='all', method_type="mean", output_mode="append")
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,logfnl_imputed,capital-gain_imputed,education_imputed,occupation_imputed,sex_imputed,race_imputed,workclass_imputed,relationship_imputed,empty_imputed,marital-status_imputed
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,4.889391,2174,Bachelors,Adm-clerical,Male,White,State-gov,Not-in-family,,Never-married
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,4.920702,0,Bachelors,Exec-managerial,Male,White,Self-emp-not-inc,Husband,,Married-civ-spouse
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,5.333741,0,HS-grad,Handlers-cleaners,Male,White,Private,Not-in-family,,Divorced
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,5.370552,0,11th,Handlers-cleaners,Male,Black,Private,Husband,,Married-civ-spouse
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,5.529442,0,Bachelors,Prof-specialty,Female,Black,Private,Wife,,Married-civ-spouse


In [48]:
odf.select('education-num', 'education-num_imputed').where(F.col("education-num").isNull()).distinct().toPandas().head(5)

Unnamed: 0,education-num,education-num_imputed
0,,10


In [49]:
# Example 3 - save model
odf = imputation_MMM(spark, df, pre_existing_model=False, model_path=outputPath)

In [50]:
# Example 4 - use pre-saved model
odf = imputation_MMM(spark, df, pre_existing_model=True, model_path=outputPath)
odf.toPandas().head(5)

Unnamed: 0,ifa,native-country,income,latitude,longitude,geohash,fnlwgt,education-num,age,capital-loss,...,logfnl,capital-gain,education,race,occupation,sex,workclass,relationship,empty,marital-status
0,1a,UnitedStates,<=50K,-38.624096,177.982468,rb68np99,77516,13,37,0,...,4.889391,2174,Bachelors,White,Adm-clerical,Male,State-gov,Not-in-family,,Never-married
1,2a,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0,83311,13,37,0,...,4.920702,0,Bachelors,White,Exec-managerial,Male,Self-emp-not-inc,Husband,,Married-civ-spouse
2,3a,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,215646,9,38,0,...,5.333741,0,HS-grad,White,Handlers-cleaners,Male,Private,Not-in-family,,Divorced
3,4a,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,234721,7,53,0,...,5.370552,0,11th,Black,Handlers-cleaners,Male,Private,Husband,,Married-civ-spouse
4,5a,Cuba,<=50K,-41.128094,175.033722,rckq4596,338409,13,37,0,...,5.529442,0,Bachelors,Black,Prof-specialty,Female,Private,Wife,,Married-civ-spouse


In [51]:
# Example 5 - selected columns + use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts, measures_of_centralTendency
from anovos.data_ingest.data_ingest import write_dataset
write_dataset(measures_of_counts(spark, df), outputPath+"/missing", "parquet", file_configs={"mode":"overwrite"})
write_dataset(measures_of_centralTendency(spark, df), outputPath+"/mode", "parquet", file_configs={"mode":"overwrite"})

odf = imputation_MMM(spark, df, list_of_cols=['marital-status', 'sex', 'occupation', 'age'], 
                     stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                     stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"}, print_impact=True)
odf.toPandas().head(5)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|marital-status|426                |0                 |
|occupation    |12                 |0                 |
|sex           |4                  |0                 |
+--------------+-------------------+------------------+



Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,relationship,race,capital-gain,...,hours-per-week,native-country,income,latitude,longitude,geohash,age,sex,occupation,marital-status
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Not-in-family,White,2174.0,...,40.0,UnitedStates,<=50K,-38.624096,177.982468,rb68np99,37,Male,Adm-clerical,Never-married
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Husband,White,0.0,...,13.0,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0,37,Male,Exec-managerial,Married-civ-spouse
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Not-in-family,White,0.0,...,40.0,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,38,Male,Handlers-cleaners,Divorced
3,4a,Private,234721.0,5.370552,,11th,7.0,Husband,Black,0.0,...,40.0,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,53,Male,Handlers-cleaners,Married-civ-spouse
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Wife,Black,0.0,...,40.0,Cuba,<=50K,-41.128094,175.033722,rckq4596,37,Female,Prof-specialty,Married-civ-spouse


## Imputation Sklearn
- API specification of function **imputation_sklearn** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
- 2 options supported: KNN and regression
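
As an illustration of the KNN option, the sketch below fills a missing cell with the average of that column over the nearest complete rows. It is a simplified plain-Python version written for this notebook, not the scikit-learn or Anovos implementation: the function `knn_impute`, the choice of `k=2`, and the toy data are assumptions.

```python
import math


def knn_impute(rows, k=2):
    """Fill None cells with the column mean over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        seen = [i for i, v in enumerate(row) if v is not None]

        # Euclidean distance measured only on the columns this row has observed
        def dist(other):
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in seen))

        neighbours = sorted(complete, key=dist)[:k]
        imputed.append([
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ])
    return imputed


data = [
    [25.0, 9.0],
    [27.0, 10.0],
    [52.0, 13.0],
    [55.0, 14.0],
    [26.0, None],  # second attribute is missing
]
print(knn_impute(data)[-1])
```

The last row is closest to the first two rows on its observed attribute, so its missing value is filled with the average of their second attribute.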

In [52]:
from anovos.data_transformer.transformers import imputation_sklearn

In [53]:
# Drop the 'empty' column: it is null in all 32561 rows, so there is nothing to learn from it
df = df.drop('empty')

In [54]:
print(df.count())
print(df.dropna().count())

32561
11641


In [55]:
# Example 1 - with mandatory arguments + KNN method  + print impact
odf = imputation_sklearn(spark, idf=df, run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...


+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education-num |31                 |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



In [56]:
# Example 2 - selected columns + regression method + print impact
odf = imputation_sklearn(spark, idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                         method_type='regression', run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num


+-------------+-------------------+------------------+
|attribute    |missingCount_before|missingCount_after|
+-------------+-------------------+------------------+
|age          |61                 |0                 |
|capital-gain |13                 |0                 |
|capital-loss |12                 |0                 |
|education-num|31                 |0                 |
+-------------+-------------------+------------------+



In [57]:
# Example 3 - KNN method + smaller sample_size + save model
odf = imputation_sklearn(spark, idf=df, sample_size=1000, model_path=outputPath+'/KNN', run_type=run_type, auth_key=auth_key)

In [58]:
from anovos.data_analyzer.stats_generator import measures_of_counts
x = measures_of_counts(spark, odf)

# Check that no missing values remain after imputation
x.orderBy('missing_count').toPandas() 

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...


Unnamed: 0,attribute,fill_count,fill_pct,missing_count,missing_pct,nonzero_count,nonzero_pct
0,age,32561,1.0,0,0.0,32561.0,1.0
1,capital-gain,32561,1.0,0,0.0,2723.0,0.0836
2,capital-loss,32561,1.0,0,0.0,1531.0,0.047
3,education-num,32561,1.0,0,0.0,32561.0,1.0
4,fnlwgt,32561,1.0,0,0.0,32561.0,1.0
5,geohash,32561,1.0,0,0.0,,
6,hours-per-week,32561,1.0,0,0.0,32561.0,1.0
7,ifa,32561,1.0,0,0.0,,
8,income,32561,1.0,0,0.0,,
9,latitude,32561,1.0,0,0.0,32561.0,1.0


In [59]:
# Example 4 - KNN method + pre-saved model + append new columns + print impact
odf = imputation_sklearn(spark, idf=df, pre_existing_model=True, model_path=outputPath+'/KNN', 
                         output_mode='append', run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...


+--------------+-------------------+----------------------+-------------+
|attribute     |missingCount_before|attribute_after       |missing_count|
+--------------+-------------------+----------------------+-------------+
|age           |61                 |age_imputed           |0            |
|capital-gain  |13                 |capital-gain_imputed  |0            |
|capital-loss  |12                 |capital-loss_imputed  |0            |
|education-num |31                 |education-num_imputed |0            |
|fnlwgt        |15                 |fnlwgt_imputed        |0            |
|hours-per-week|109                |hours-per-week_imputed|0            |
|logfnl        |20393              |logfnl_imputed        |0            |
+--------------+-------------------+----------------------+-------------+



In [60]:
# Example 5 - regression method + smaller sample_size + save model
odf = imputation_sklearn(spark, idf=df, method_type='regression', sample_size=1000, 
                         model_path=outputPath+'/regression', run_type=run_type, auth_key=auth_key)

In [61]:
# Example 6 - regression method + pre-saved model + append new columns + print impact
odf = imputation_sklearn(spark, idf=df, pre_existing_model=True, model_path=outputPath+'/regression', 
                         output_mode='append', run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...


+--------------+-------------------+----------------------+-------------+
|attribute     |missingCount_before|attribute_after       |missing_count|
+--------------+-------------------+----------------------+-------------+
|age           |61                 |age_imputed           |0            |
|capital-gain  |13                 |capital-gain_imputed  |0            |
|capital-loss  |12                 |capital-loss_imputed  |0            |
|education-num |31                 |education-num_imputed |0            |
|fnlwgt        |15                 |fnlwgt_imputed        |0            |
|hours-per-week|109                |hours-per-week_imputed|0            |
|logfnl        |20393              |logfnl_imputed        |0            |
+--------------+-------------------+----------------------+-------------+



In [62]:
# Example 7 - use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = imputation_sklearn(spark, df, stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                         run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...


+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education-num |31                 |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



## Imputation Matrix Factorization
- API specification of function **imputation_matrixFactorization** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only
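
Matrix factorization imputes a missing cell by approximating the row-by-attribute matrix as a product of low-rank factors, fitted on the observed cells only, and then reading the missing cell off the reconstruction. The sketch below shows the idea with a rank-1 alternating least squares fit in plain Python on a toy matrix. It is not the Anovos implementation; `rank1_als`, `fill_missing`, and the data are hypothetical.

```python
def rank1_als(matrix, iterations=100):
    """Fit matrix[i][j] ~ u[i] * v[j] on observed cells by alternating least squares."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u, v = [1.0] * n_rows, [1.0] * n_cols
    for _ in range(iterations):
        # Solve for each row factor with the column factors held fixed
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = sum(matrix[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Solve for each column factor with the row factors held fixed
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = sum(matrix[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    return u, v


def fill_missing(matrix, u, v):
    """Keep observed cells and fill None cells from the reconstruction u[i] * v[j]."""
    return [
        [u[i] * v[j] if cell is None else cell for j, cell in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Toy matrix that is exactly rank 1; the hidden cell would be 9.0
data = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, None],
]
u, v = rank1_als(data)
filled = fill_missing(data, u, v)
print(round(filled[2][2], 4))
```

Real data is not exactly low rank, so in practice the reconstruction is an estimate and the rank and regularization become tuning choices.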

In [63]:
from anovos.data_transformer.transformers import imputation_matrixFactorization

In [64]:
# Example 1 - all columns with missing values + print impact
odf = imputation_matrixFactorization(spark, idf=df, id_col='ifa', print_impact=True)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education-num |31                 |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



In [65]:
# Example 2 - selected columns + print impact
odf = imputation_matrixFactorization(spark, idf=df, 
                                     list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                     id_col='ifa', print_impact=True)

                                                                                

+-------------+-------------------+------------------+
|attribute    |missingCount_before|missingCount_after|
+-------------+-------------------+------------------+
|age          |61                 |0                 |
|capital-gain |13                 |0                 |
|capital-loss |12                 |0                 |
|education-num|31                 |0                 |
+-------------+-------------------+------------------+



In [66]:
# Example 3 - use pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = imputation_matrixFactorization(spark, df, 
                                     stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                                     print_impact=True)

                                                                                

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|age           |61                 |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|education-num |31                 |0                 |
|fnlwgt        |15                 |0                 |
|hours-per-week|109                |0                 |
|logfnl        |20393              |0                 |
+--------------+-------------------+------------------+



## Auto Imputation
- API specification of function **auto_imputation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>

In [67]:
from anovos.data_transformer.transformers import auto_imputation

In [68]:
# Example 1 - all columns with missing values + print impact
auto_imputation(spark, df, id_col='ifa', run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
- _4
- ...
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num
- fnlwgt
- ...


[('MMM-mean', 3.905632160684866), ('MMM-median', 4.346316148237697), ('KNN', 4.2059300523943115), ('regression', 3.8829553978413944), ('matrix_factorization', 4.383045492544182)]
Best Imputation Method:  regression


DataFrame[ifa: string, age: float, fnlwgt: float, logfnl: float, education-num: float, capital-gain: float, capital-loss: float, hours-per-week: float, native-country: string, income: string, latitude: double, longitude: double, geohash: string, sex: string, race: string, education: string, occupation: string, workclass: string, relationship: string, marital-status: string, index: int]
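
In the outputs of this section, the method reported as best is the one with the lowest score in the printed list. The notebook does not state how the score is computed, so the sketch below only assumes it is an error measure where lower is better, and reproduces the selection from the values printed for Example 1.

```python
# (method, score) pairs copied from the output of Example 1
scores = [
    ("MMM-mean", 3.905632160684866),
    ("MMM-median", 4.346316148237697),
    ("KNN", 4.2059300523943115),
    ("regression", 3.8829553978413944),
    ("matrix_factorization", 4.383045492544182),
]

# Assumption: lower score means better imputation quality
best_method, best_score = min(scores, key=lambda pair: pair[1])
print("Best Imputation Method: ", best_method)
```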

In [69]:
# Example 2 - selected columns + customized null_pct + print impact
odf = auto_imputation(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'],
                      id_col='ifa', null_pct=0.5, run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num


[('MMM-mean', 7.923311924558329), ('MMM-median', 15.622517927011389), ('KNN', 9.229465537973077), ('regression', 8.890673237796118), ('matrix_factorization', 6.996145629311291)]
Best Imputation Method:  matrix_factorization


In [70]:
# Example 3 - selected columns + use pre-saved stats + print impact
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = auto_imputation(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                      id_col='ifa', stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                      run_type=run_type, auth_key=auth_key, print_impact=True)

Feature names unseen at fit time:
- _0
- _1
- _2
- _3
Feature names seen at fit time, yet now missing:
- age
- capital-gain
- capital-loss
- education-num


[('MMM-mean', 4.047327904504398), ('MMM-median', 4.540678537978075), ('KNN', 4.409580971384189), ('regression', 4.006836796483809), ('matrix_factorization', 4.713820213919539)]
Best Imputation Method:  regression
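
In both printed score lists above, the method reported as best is the one with the lowest score, which is consistent with the score being an error measure. The following is a minimal sketch of that selection in plain Python, assuming lower is better; `pick_best` is a hypothetical helper and not part of Anovos.

```python
# Scores copied from the two printed outputs above.
first_scores = [
    ("MMM-mean", 7.923311924558329),
    ("MMM-median", 15.622517927011389),
    ("KNN", 9.229465537973077),
    ("regression", 8.890673237796118),
    ("matrix_factorization", 6.996145629311291),
]
second_scores = [
    ("MMM-mean", 4.047327904504398),
    ("MMM-median", 4.540678537978075),
    ("KNN", 4.409580971384189),
    ("regression", 4.006836796483809),
    ("matrix_factorization", 4.713820213919539),
]


def pick_best(scores):
    """Return the method name with the smallest score (assumption: lower is better)."""
    return min(scores, key=lambda pair: pair[1])[0]


print(pick_best(first_scores))
print(pick_best(second_scores))
```

Because the ranking depends on which columns are imputed, the selected method can differ between runs on the same dataset, as it does here.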


# Latent Features Generation

## Autoencoder Latent Features
- API specification of function **autoencoder_latentFeatures** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [71]:
from anovos.data_transformer.transformers import autoencoder_latentFeatures

In [72]:
# Example 1 - with mandatory arguments + print impact
odf = autoencoder_latentFeatures(spark, df, run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,latitude,longitude,geohash
0,1a,,State-gov,77516,4.889391,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,UnitedStates,<=50K,-38.624096,177.982468,rb68np99
1,2a,,Self-emp-not-inc,83311,4.920702,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0
2,3a,38.0,Private,215646,5.333741,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,UnitedStates,<=50K,-37.73563,176.164047,rckm712q
3,4a,53.0,Private,234721,5.370552,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,UnitedStates,<=50K,-39.536491,176.832321,rckndgte
4,5a,,Private,338409,5.529442,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,-41.128094,175.033722,rckq4596


In [73]:
# Example 2 - selected columns + fewer epochs + larger batch size + print impact
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'],
                                 epochs=50, batch_size=528, run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,latitude,longitude,geohash
0,1a,,State-gov,77516,4.889391,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,UnitedStates,<=50K,-38.624096,177.982468,rb68np99
1,2a,,Self-emp-not-inc,83311,4.920702,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0
2,3a,38.0,Private,215646,5.333741,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,UnitedStates,<=50K,-37.73563,176.164047,rckm712q
3,4a,53.0,Private,234721,5.370552,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,UnitedStates,<=50K,-39.536491,176.832321,rckndgte
4,5a,,Private,338409,5.529442,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,-41.128094,175.033722,rckq4596


In [74]:
# Example 3 - selected columns + smaller sample_size used for training + save model
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'],
                                 sample_size=20000, model_path=outputPath, run_type=run_type, auth_key=auth_key)
odf.limit(5).toPandas()

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,latitude,longitude,geohash
0,1a,,State-gov,77516,4.889391,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,UnitedStates,<=50K,-38.624096,177.982468,rb68np99
1,2a,,Self-emp-not-inc,83311,4.920702,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,UnitedStates,<=50K,-40.880497,174.992142,rckjypw0
2,3a,38.0,Private,215646,5.333741,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,UnitedStates,<=50K,-37.73563,176.164047,rckm712q
3,4a,53.0,Private,234721,5.370552,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,UnitedStates,<=50K,-39.536491,176.832321,rckndgte
4,5a,,Private,338409,5.529442,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,-41.128094,175.033722,rckq4596


In [75]:
# Example 4 - use pre-saved model
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                 pre_existing_model=True, model_path=outputPath, run_type=run_type, auth_key=auth_key, print_impact=True)

In [76]:
# Example 5 - selected columns + use pre-saved stats + print impact
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                 stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                                 run_type=run_type, auth_key=auth_key, print_impact=True)

In [77]:
# Example 6 - use pre-saved standardization model
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                 standardization_configs={"pre_existing_model": True, "model_path": outputPath}, 
                                 run_type=run_type, auth_key=auth_key, print_impact=True)

In [78]:
# Example 7 - impute missing values before calculation
odf = autoencoder_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                                 imputation=True, run_type=run_type, auth_key=auth_key, print_impact=True)

## PCA Latent Features
- API specification of function **PCA_latentFeatures** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [79]:
from anovos.data_transformer.transformers import PCA_latentFeatures

In [80]:
# Example 1 - with mandatory arguments + print impact
odf = PCA_latentFeatures(spark, df, standardization=True, run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Explained Variance:  0.9892

+-------+---------------------+----------------------+---------------------+---------------------+---------------------+---------------------+---------------------+----------------------+
|summary|latent_0             |latent_1              |latent_2             |latent_3             |latent_4             |latent_5             |latent_6             |latent_7              |
+-------+---------------------+----------------------+---------------------+---------------------+---------------------+---------------------+---------------------+----------------------+
|count  |12085                |12085                 |12085                |12085                |12085                |12085                |12085                |12085                 |
|mean   |-0.007176269432713798|-0.0014481854362107835|0.0069587292898954526|-0.013046236140803887|0.0010059698741191339|-0.007285535987645989|-0.008536972717610026|-0.0030189979595461407|
|stddev |1.3819132714838467   |1.122882522077078     |1.0156

Unnamed: 0,ifa,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income,geohash,latent_0,latent_1,latent_2,latent_3,latent_4,latent_5,latent_6,latent_7
0,3a,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,UnitedStates,<=50K,rckm712q,-0.533405,0.30899,-0.13204,0.550614,0.301941,-0.260926,-0.079102,-0.333057
1,4a,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,UnitedStates,<=50K,rckndgte,-0.705774,0.3947,0.164807,1.054631,-0.987722,-0.577889,-0.314984,-0.510917
2,6a,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States,<=50K,rb68nhrb,-1.152625,-0.837139,-0.315987,0.213388,0.982082,0.293066,1.273045,0.023457
3,7a,Private,,,Other-service,Not-in-family,Black,Female,Jamaica,<=50K,rb037vbc,0.006493,2.163871,0.256257,1.019012,-1.509674,0.782596,-0.524798,-0.266838
4,8a,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,>50K,rb6bb7jm,-0.320516,-0.236217,-0.082729,1.511231,0.064315,-0.707327,0.117424,0.086532


In [81]:
# Example 2 - selected columns + customized explained_variance_cutoff + print impact
odf = PCA_latentFeatures(spark, df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                         explained_variance_cutoff=0.6, standardization=True, run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Explained Variance:  0.7943

+-------+---------------------+---------------------+----------------------+
|summary|latent_0             |latent_1             |latent_2              |
+-------+---------------------+---------------------+----------------------+
|count  |32466                |32466                |32466                 |
|mean   |1.6552674503466662E-4|6.3589239401116544E-6|-2.8812043635684855E-4|
|stddev |1.0870763482758592   |1.0143077110112213   |0.9826621520658773    |
|min    |-9.827796            |-7.914514            |-4.3574286            |
|max    |2.9694445            |9.095775             |2.3905334             |
+-------+---------------------+---------------------+----------------------+



Unnamed: 0,ifa,workclass,fnlwgt,logfnl,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,income,latitude,longitude,geohash,latent_0,latent_1,latent_2
0,3a,Private,215646,5.333741,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,UnitedStates,<=50K,-37.73563,176.164047,rckm712q,0.423203,-0.10286,-0.229906
1,4a,Private,234721,5.370552,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,UnitedStates,<=50K,-39.536491,176.832321,rckndgte,0.380385,0.019731,-1.574839
2,6a,Private,284582,5.454207,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,40,United-States,<=50K,-36.697498,174.721725,rb68nhrb,-0.725312,-0.091507,0.90499
3,7a,Private,160187,5.204627,,,Other-service,Not-in-family,Black,Female,16,Jamaica,<=50K,-40.933228,175.547897,rb037vbc,0.991146,-0.023124,-1.760285
4,8a,Self-emp-not-inc,209642,5.321478,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,45,United-States,>50K,-35.81871,174.511719,rb6bb7jm,-0.058382,0.019048,-1.084266
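
In Example 2, a cutoff of 0.6 on four input columns produces three latent columns with a reported explained variance of 0.7943. One reading of this is that the function keeps the smallest number of components whose cumulative explained-variance ratio reaches the cutoff. The sketch below illustrates that rule in plain Python. The helper `components_for_cutoff`, the `>=` comparison, and the eigenvalues are all assumptions made for illustration; none of them are taken from Anovos or from this dataset.

```python
from itertools import accumulate


def components_for_cutoff(eigenvalues, cutoff):
    """Return (k, ratio) for the smallest k whose cumulative
    explained-variance ratio reaches the cutoff."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        ratio = running / total
        if ratio >= cutoff:
            return k, ratio
    # Cutoff not reachable (e.g. cutoff > 1): keep every component.
    return len(ordered), 1.0


# Hypothetical eigenvalues for four standardized columns (not from the notebook).
hypothetical_eigenvalues = [1.2, 1.0, 0.98, 0.82]
k, ratio = components_for_cutoff(hypothetical_eigenvalues, cutoff=0.6)
print(k, round(ratio, 4))
```

With these illustrative eigenvalues, two components cover 55% of the variance, so a third is needed to pass the 0.6 cutoff. This also shows why the reported explained variance can sit well above the requested cutoff.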


In [82]:
# Example 3 - all columns + save model
odf = PCA_latentFeatures(spark, df, model_path=outputPath, standardization=True, run_type=run_type, auth_key=auth_key)

In [83]:
# Example 4 - all columns + use pre-saved model
odf = PCA_latentFeatures(spark, df, pre_existing_model=True, model_path=outputPath, standardization=True, 
                         run_type=run_type, auth_key=auth_key, print_impact=True)
odf.limit(5).toPandas()

Explained Variance:  0.9892

+-------+---------------------+----------------------+---------------------+---------------------+---------------------+---------------------+---------------------+----------------------+
|summary|latent_0             |latent_1              |latent_2             |latent_3             |latent_4             |latent_5             |latent_6             |latent_7              |
+-------+---------------------+----------------------+---------------------+---------------------+---------------------+---------------------+---------------------+----------------------+
|count  |12085                |12085                 |12085                |12085                |12085                |12085                |12085                |12085                 |
|mean   |-0.007176269432713798|-0.0014481854362107835|0.0069587292898954526|-0.013046236140803887|0.0010059698741191339|-0.007285535987645989|-0.008536972717610026|-0.0030189979595461407|
|stddev |1.3819132714838467   |1.122882522077078     |1.0156

Unnamed: 0,ifa,workclass,education,marital-status,occupation,relationship,race,sex,native-country,income,geohash,latent_0,latent_1,latent_2,latent_3,latent_4,latent_5,latent_6,latent_7
0,3a,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,UnitedStates,<=50K,rckm712q,-0.533405,0.30899,-0.13204,0.550614,0.301941,-0.260926,-0.079102,-0.333057
1,4a,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,UnitedStates,<=50K,rckndgte,-0.705774,0.3947,0.164807,1.054631,-0.987722,-0.577889,-0.314984,-0.510917
2,6a,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States,<=50K,rb68nhrb,-1.152625,-0.837139,-0.315987,0.213388,0.982082,0.293066,1.273045,0.023457
3,7a,Private,,,Other-service,Not-in-family,Black,Female,Jamaica,<=50K,rb037vbc,0.006493,2.163871,0.256257,1.019012,-1.509674,0.782596,-0.524798,-0.266838
4,8a,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,>50K,rb6bb7jm,-0.320516,-0.236217,-0.082729,1.511231,0.064315,-0.707327,0.117424,0.086532


In [84]:
# Example 5 - all columns + use pre-saved stats + print impact
from anovos.data_analyzer.stats_generator import measures_of_counts
from anovos.data_ingest.data_ingest import write_dataset

write_dataset(measures_of_counts(spark, df), outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})

odf = PCA_latentFeatures(spark, df, standardization=True, 
                         stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                         run_type=run_type, auth_key=auth_key, print_impact=True)

                                                                                

Explained Variance:  0.9892




+-------+---------------------+----------------------+---------------------+---------------------+---------------------+---------------------+---------------------+----------------------+
|summary|latent_0             |latent_1              |latent_2             |latent_3             |latent_4             |latent_5             |latent_6             |latent_7              |
+-------+---------------------+----------------------+---------------------+---------------------+---------------------+---------------------+---------------------+----------------------+
|count  |12085                |12085                 |12085                |12085                |12085                |12085                |12085                |12085                 |
|mean   |-0.007176269432713798|-0.0014481854362107835|0.0069587292898954526|-0.013046236140803887|0.0010059698741191339|-0.007285535987645989|-0.008536972717610026|-0.0030189979595461407|
|stddev |1.3819132714838467   |1.122882522077078     |1.0156

                                                                                

In [85]:
# Example 6 - use pre-saved standardization model
odf = PCA_latentFeatures(spark, df, standardization=True,
                         standardization_configs={"pre_existing_model": True, "model_path": outputPath}, 
                         run_type=run_type, auth_key=auth_key, print_impact=True)

Explained Variance:  0.9892




+-------+---------------------+----------------------+---------------------+---------------------+---------------------+---------------------+---------------------+----------------------+
|summary|latent_0             |latent_1              |latent_2             |latent_3             |latent_4             |latent_5             |latent_6             |latent_7              |
+-------+---------------------+----------------------+---------------------+---------------------+---------------------+---------------------+---------------------+----------------------+
|count  |12085                |12085                 |12085                |12085                |12085                |12085                |12085                |12085                 |
|mean   |-0.007176269432713798|-0.0014481854362107835|0.0069587292898954526|-0.013046236140803887|0.0010059698741191339|-0.007285535987645989|-0.008536972717610026|-0.0030189979595461407|
|stddev |1.3819132714838467   |1.122882522077078     |1.0156

                                                                                

In [86]:
# Example 7 - impute missing values before calculation
odf = PCA_latentFeatures(spark, df, standardization=True, imputation=True, run_type=run_type, auth_key=auth_key, print_impact=True)

Explained Variance:  0.9723




+-------+---------------------+--------------------+---------------------+---------------------+--------------------+--------------------+---------------------+---------------------+
|summary|latent_0             |latent_1            |latent_2             |latent_3             |latent_4            |latent_5            |latent_6             |latent_7             |
+-------+---------------------+--------------------+---------------------+---------------------+--------------------+--------------------+---------------------+---------------------+
|count  |32561                |32561               |32561                |32561                |32561               |32561               |32561                |32561                |
|mean   |-0.014850290235659282|0.037480620428929926|0.0017535751359506586|-0.004223198522921649|0.010891314617903328|9.564596892539407E-4|0.0016409298551183353|-0.006840066973801671|
|stddev |1.150891003362958    |1.0682412258371443  |1.0274408247129856   |1.013864079

                                                                                

# Feature Transformation
- API specification of function **feature_transformation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [87]:
from anovos.data_transformer.transformers import feature_transformation

In [88]:
# Example 1: sqrt 
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='sqrt', print_impact=True)

Before:
+-------+-----------------+------------------+------------------+------------------+
|summary|capital-loss     |age               |education-num     |capital-gain      |
+-------+-----------------+------------------+------------------+------------------+
|count  |32549            |32500             |32530             |32548             |
|mean   |87.3360164674798 |38.506492307692305|10.080971411005226|1077.6959567408135|
|stddev |403.0310072565711|13.508497735339269|2.5725103263986973|7386.624857802761 |
|min    |0                |17                |1                 |0                 |
|max    |4356             |85                |16                |99999             |
+-------+-----------------+------------------+------------------+------------------+

After:
+-------+-----------------+-----------------+------------------+------------------+
|summary|capital-loss     |age              |education-num     |capital-gain      |
+-------+-----------------+-----------------+------

In [89]:
# Example 2: natural log (ln) + append generated columns
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='ln', output_mode='append', print_impact=True)

Before:
+-------+-----------------+------------------+------------------+------------------+
|summary|capital-loss     |age               |education-num     |capital-gain      |
+-------+-----------------+------------------+------------------+------------------+
|count  |32549            |32500             |32530             |32548             |
|mean   |87.3360164674798 |38.506492307692305|10.080971411005226|1077.6959567408135|
|stddev |403.0310072565711|13.508497735339269|2.5725103263986973|7386.624857802761 |
|min    |0                |17                |1                 |0                 |
|max    |4356             |85                |16                |99999             |
+-------+-----------------+------------------+------------------+------------------+

After:
+-------+-------------------+------------------+------------------+------------------+
|summary|capital-loss_ln    |age_ln            |education-num_ln  |capital-gain_ln   |
+-------+-------------------+----------------

In [90]:
# Example 3: round to 1 decimal place
odf = feature_transformation(idf=odf, 
                             list_of_cols=['education-num_ln', 'capital-gain_ln', 'capital-loss_ln', 'age_ln'], 
                             method_type='roundN', N=1, print_impact=True)

Before:
+-------+------------------+------------------+-------------------+------------------+
|summary|capital-gain_ln   |age_ln            |capital-loss_ln    |education-num_ln  |
+-------+------------------+------------------+-------------------+------------------+
|count  |2710              |32500             |1519               |32530             |
|mean   |8.819883472603658 |3.588027147805673 |7.508497766226014  |2.268931648063395 |
|stddev |1.0158964531089265|0.3589571852865876|0.25675668323690803|0.3168442727686081|
|min    |4.736198448394496 |2.833213344056216 |5.043425116919247  |0.0               |
|max    |11.512915464920228|4.442651256490317 |8.37930948405285   |2.772588722239781 |
+-------+------------------+------------------+-------------------+------------------+

After:
+-------+------------------+------------------+------------------+-------------------+
|summary|capital-gain_ln   |age_ln            |capital-loss_ln   |education-num_ln   |
+-------+------------------
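
In the "Before" table of Example 3, `capital-gain_ln` has only 2710 non-null values and `capital-loss_ln` only 1519, although the source columns have about 32.5k values each and a printed minimum of 0. This is consistent with the natural log being undefined at 0 and the result becoming null. The sketch below reproduces that behaviour in plain Python, using `None` for null. `safe_ln` and `round_n` are hypothetical helpers, and the input list is illustrative.

```python
import math


def safe_ln(value):
    """Natural log that returns None for missing or non-positive input."""
    if value is None or value <= 0:
        return None
    return math.log(value)


def round_n(value, n):
    """Round to n decimal places, passing None through unchanged."""
    return None if value is None else round(value, n)


# Illustrative values; 0 and 99999 are the printed min and max of capital-gain.
capital_gain = [2174, 0, 0, None, 99999]
logged = [safe_ln(v) for v in capital_gain]
rounded = [round_n(v, 1) for v in logged]
non_null = sum(v is not None for v in logged)

print(logged)
print(rounded)
print(non_null)
```

Only two of the five inputs survive the log, which mirrors the drop in `count` seen in the table. Where zeros carry meaning, shifting the column before taking the log (for example `ln(x + 1)`) is a common alternative, though that is a modelling choice rather than something this notebook does.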

In [91]:
# Example 4: square
odf = feature_transformation(idf=df, list_of_cols='age', method_type='sq', print_impact=True)

Before:
+-------+------------------+
|summary|age               |
+-------+------------------+
|count  |32500             |
|mean   |38.506492307692305|
|stddev |13.508497735339269|
|min    |17                |
|max    |85                |
+-------+------------------+

After:
+-------+------------------+
|summary|age               |
+-------+------------------+
|count  |32500             |
|mean   |1665.2238461538461|
|stddev |1154.208538334907 |
|min    |289.0             |
|max    |7225.0            |
+-------+------------------+
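
The printed minimum and maximum of `age` move from 17 and 85 to 289.0 and 7225.0, that is, each value is squared. As a rough mental model, several `method_type` values used in this section behave like element-wise functions. The sketch below is a plain-Python stand-in, not the Anovos implementation; the dictionary `TRANSFORMS` and the helper `apply_transform` are hypothetical, and only the `method_type` strings come from the notebook.

```python
import math

# Plain-Python stand-ins for some method_type values used in this section.
TRANSFORMS = {
    "sqrt": lambda x, n=None: math.sqrt(x),
    "sq": lambda x, n=None: float(x * x),
    "roundN": lambda x, n: round(x, n),
    "remainderDivByN": lambda x, n: x % n,
}


def apply_transform(values, method_type, n=None):
    """Apply one transformation element-wise, passing None (null) through."""
    func = TRANSFORMS[method_type]
    return [None if v is None else func(v, n) for v in values]


ages = [17, 38, 53, None, 85]  # illustrative values within the printed range
squared = apply_transform(ages, "sq")
remainders = apply_transform(ages, "remainderDivByN", n=10)

print(squared)
print(remainders)
```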



In [92]:
# Example 5: remainder divided by 10
odf = feature_transformation(idf=df, list_of_cols=['education-num', 'capital-gain', 'capital-loss', 'age'], 
                             method_type='remainderDivByN', N=10, print_impact=True)

Before:
+-------+-----------------+------------------+------------------+------------------+
|summary|capital-loss     |age               |education-num     |capital-gain      |
+-------+-----------------+------------------+------------------+------------------+
|count  |32549            |32500             |32530             |32548             |
|mean   |87.3360164674798 |38.506492307692305|10.080971411005226|1077.6959567408135|
|stddev |403.0310072565711|13.508497735339269|2.5725103263986973|7386.624857802761 |
|min    |0                |17                |1                 |0                 |
|max    |4356             |85                |16                |99999             |
+-------+-----------------+------------------+------------------+------------------+

After:
+-------+------------------+------------------+------------------+-------------------+
|summary|capital-loss      |age               |education-num     |capital-gain       |
+-------+------------------+-----------------

# Box Cox Transformation
- API specification of function **boxcox_transformation** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports numerical attributes only

In [93]:
from anovos.data_transformer.transformers import boxcox_transformation

In [94]:
# Example 1 - all numerical columns except drop_cols + print impact
odf = boxcox_transformation(df, drop_cols=['capital-loss', 'capital-gain', 'latitude', 'longitude', 'geohash'], print_impact=True)

Transformed Columns:  ['fnlwgt', 'education-num', 'age', 'logfnl', 'hours-per-week']
Best BoxCox Parameter(s):  [1, 3, 0, 1, 3]
Before:
+--------+------------------+--------------------+------------------+-------------------+--------------------+
|summary |fnlwgt            |education-num       |age               |logfnl             |hours-per-week      |
+--------+------------------+--------------------+------------------+-------------------+--------------------+
|count   |32546             |32530               |32500             |12168              |32452               |
|mean    |189781.83180728814|10.080971411005226  |38.506492307692305|5.2054654851899365 |40.24972266732405   |
|stddev  |105563.06445056995|2.5725103263986973  |13.508497735339269|0.27424241727170395|11.914337669272234  |
|min     |12285             |1                   |17                |4.283617786        |1                   |
|max     |1484705           |16                  |85                |6.088696941       

In [95]:
# Example 2 - selected column + user-specified lambda value + append generated column + print impact
odf = boxcox_transformation(df, list_of_cols='age', boxcox_lambda=0, output_mode='append', print_impact=True)

Transformed Columns:  ['age']
Best BoxCox Parameter(s):  [0]
Before:
+--------+------------------+
|summary |age               |
+--------+------------------+
|count   |32500             |
|mean    |38.506492307692305|
|stddev  |13.508497735339269|
|min     |17                |
|max     |85                |
|skewness|0.5127993362812433|
+--------+------------------+

After:
+--------+--------------------+
|summary |age_bxcx_0          |
+--------+--------------------+
|count   |32500               |
|mean    |3.588027147805673   |
|stddev  |0.3589571852865876  |
|min     |2.833213344056216   |
|max     |4.442651256490317   |
|skewness|-0.14607838263666723|
+--------+--------------------+
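
With `boxcox_lambda=0`, the printed minimum and maximum of `age_bxcx_0` (2.8332... and 4.4427...) equal ln(17) and ln(85), which matches the λ = 0 case of the Box-Cox transform. The sketch below implements the textbook definition in plain Python. The `(x**lam - 1) / lam` branch is the standard formula and is an assumption here, since the notebook does not show how Anovos handles λ ≠ 0; `boxcox` is a hypothetical helper.

```python
import math


def boxcox(x, lam):
    """Textbook Box-Cox transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


# 17 and 85 are the printed min and max of `age`.
print(boxcox(17, 0), boxcox(85, 0))
print(boxcox(17, 1))
```

The drop in skewness from about 0.51 to about -0.15 in the output above is the kind of effect the transform is meant to achieve.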



# Outlier Categories Treatment
- API specification of function **outlier_categories** can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>
- Supports two ways of detecting outlier categories: by maximum number of categories and by coverage (%)
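
The two detection modes can be sketched in plain Python as follows. This is an illustration under assumptions, not the Anovos implementation: the helper names are hypothetical, and the choice to count the replacement label toward `max_category` is inferred from the outputs in this section, where columns are capped at exactly the requested number of unique values. The label `outlier_categories` is the one that appears in the notebook output.

```python
from collections import Counter

OUTLIER_LABEL = "outlier_categories"  # label seen in the notebook output


def keep_by_max_category(values, max_category):
    """Keep the most frequent categories so that, together with the
    replacement label, at most max_category distinct values remain."""
    counts = Counter(v for v in values if v is not None)
    if len(counts) <= max_category:
        return set(counts)
    return {cat for cat, _ in counts.most_common(max_category - 1)}


def keep_by_coverage(values, coverage):
    """Keep the most frequent categories until their cumulative share
    of non-null rows reaches the coverage threshold."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    kept, covered = set(), 0
    for cat, count in counts.most_common():
        if covered / total >= coverage:
            break
        kept.add(cat)
        covered += count
    return kept


def treat(values, kept):
    """Replace every non-null value outside `kept` with the outlier label."""
    return [v if v is None or v in kept else OUTLIER_LABEL for v in values]


# Illustrative column: 10 rows, 4 categories.
education = ["HS-grad"] * 5 + ["Bachelors"] * 3 + ["Masters"] + ["11th"]
by_count = sorted(set(treat(education, keep_by_max_category(education, 3))))
by_coverage = sorted(set(treat(education, keep_by_coverage(education, 0.8))))

print(by_count)
print(by_coverage)
```

In this toy column both modes keep the same two categories, because the top two already cover 80% of the rows. On skewed real columns the two modes can diverge sharply, as `native-country` does in Examples 2 and 3.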

In [96]:
from anovos.data_transformer.transformers import outlier_categories

In [97]:
# Example 1 - 'all' columns (excluding drop_cols) + max 15 categories + append transformed columns at the end
odf = outlier_categories(spark, df, drop_cols=['ifa'], max_category=15, output_mode='append')
odf.limit(5).toPandas()

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,education,education-num,marital-status,occupation,relationship,...,sex_outliered,education_outliered,race_outliered,occupation_outliered,workclass_outliered,relationship_outliered,marital-status_outliered,income_outliered,geohash_outliered,native-country_outliered
0,1a,,State-gov,77516.0,4.889391,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,...,Male,Bachelors,White,Adm-clerical,State-gov,Not-in-family,Never-married,<=50K,outlier_categories,outlier_categories
1,2a,,Self-emp-not-inc,83311.0,4.920702,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,...,Male,Bachelors,White,Exec-managerial,Self-emp-not-inc,Husband,Married-civ-spouse,<=50K,outlier_categories,outlier_categories
2,3a,38.0,Private,215646.0,5.333741,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,...,Male,HS-grad,White,Handlers-cleaners,Private,Not-in-family,Divorced,<=50K,outlier_categories,outlier_categories
3,4a,53.0,Private,234721.0,5.370552,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,...,Male,11th,Black,Handlers-cleaners,Private,Husband,Married-civ-spouse,<=50K,outlier_categories,outlier_categories
4,5a,,Private,338409.0,5.529442,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,...,Female,Bachelors,Black,Prof-specialty,Private,Wife,Married-civ-spouse,<=50K,outlier_categories,Cuba


In [98]:
# Example 2 - selected columns + max 10 categories
odf = outlier_categories(spark, df, list_of_cols=['education', 'occupation', 'native-country'], 
                         max_category=10, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|education     |16                 |
|occupation    |15                 |
|native-country|44                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|education     |10                |
|occupation    |10                |
|native-country|10                |
+--------------+------------------+



In [99]:
# Example 3 - selected columns + cover 90% values
odf = outlier_categories(spark, df, list_of_cols=['education', 'occupation', 'native-country'], 
                         coverage=0.9, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|education     |16                 |
|occupation    |15                 |
|native-country|44                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|education     |9                 |
|occupation    |11                |
|native-country|3                 |
+--------------+------------------+



In [100]:
# Example 4 - max 15 categories + save model
odf = outlier_categories(spark, df, drop_cols=['ifa'], max_category=15, 
                         pre_existing_model=False, model_path=outputPath, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|sex           |3                  |
|race          |9                  |
|education     |16                 |
|occupation    |15                 |
|workclass     |11                 |
|relationship  |8                  |
|marital-status|7                  |
|income        |2                  |
|geohash       |3558               |
|native-country|44                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|sex           |3                 |
|race          |9                 |
|education     |15                |
|occupation    |15                |
|workclass     |11                |
|relationship  |8                 |
|marital-status|7                 |
|income        |2                 |
|geohash       |38                |
|native-country|15                |
+------------

In [101]:
# Example 5 - use pre-saved model
odf = outlier_categories(spark, df, drop_cols=['ifa'], pre_existing_model=True, model_path=outputPath, print_impact=True)

+--------------+-------------------+
|attribute     |uniqueValues_before|
+--------------+-------------------+
|sex           |3                  |
|race          |9                  |
|education     |16                 |
|occupation    |15                 |
|workclass     |11                 |
|relationship  |8                  |
|marital-status|7                  |
|income        |2                  |
|geohash       |3558               |
|native-country|44                 |
+--------------+-------------------+

+--------------+------------------+
|attribute     |uniqueValues_after|
+--------------+------------------+
|sex           |3                 |
|race          |9                 |
|education     |15                |
|occupation    |15                |
|workclass     |10                |
|relationship  |8                 |
|marital-status|7                 |
|income        |2                 |
|geohash       |38                |
|native-country|15                |
+--------------+------------------+
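
The general idea behind capping the number of categories can be sketched in plain Python. This is a simplified illustration, not the ANOVOS implementation: it assumes that the most frequent `max_category - 1` values are kept and everything else is merged into a catch-all label. The helper name `cap_categories` and the label `"others"` are hypothetical, chosen only for this example.

```python
from collections import Counter


def cap_categories(values, max_category, other_label="others"):
    """Keep the (max_category - 1) most frequent values; map the rest to other_label.

    Hypothetical helper for illustration only, not part of ANOVOS.
    """
    counts = Counter(v for v in values if v is not None)
    # Nothing to do when the attribute is already within the limit.
    if len(counts) <= max_category:
        return list(values)
    keep = {v for v, _ in counts.most_common(max_category - 1)}
    # Missing values are left untouched; rare values are merged.
    return [v if (v is None or v in keep) else other_label for v in values]


# Toy data: 6 distinct values, capped to 3 categories.
countries = ["US"] * 6 + ["MX"] * 3 + ["IN"] * 2 + ["DE", "FR", "JP"]
capped = cap_categories(countries, max_category=3)
print(sorted(set(capped)))
```

Attributes that already have at most `max_category` distinct values pass through unchanged in this sketch, which is why only high-cardinality attributes would be affected.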

# Expression Parser
- The API specification of the **expression_parser** function can be found <a href="https://docs.anovos.ai/api/data_transformer/transformers.html">here</a>.

In [102]:
from anovos.data_transformer.transformers import expression_parser

In [103]:
# Example 1 - 2 generated columns + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain-capital-loss'], print_impact=True)

Columns Added:  ['f0', 'f1']
+-------+-----------------+-----------------+
|summary|f0               |f1               |
+-------+-----------------+-----------------+
|count  |32392            |32548            |
|mean   |78.75373549024451|990.3572569743149|
|stddev |18.61982451813538|7410.325259409036|
|min    |20               |-4356            |
|max    |158              |99999            |
+-------+-----------------+-----------------+



In [104]:
# Example 1 (variant) - 2 generated columns using a division expression + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain/capital-loss'], print_impact=True)

Columns Added:  ['f0', 'f1']
+-------+-----------------+----+
|summary|f0               |f1  |
+-------+-----------------+----+
|count  |32392            |1519|
|mean   |78.75373549024451|0.0 |
|stddev |18.61982451813538|0.0 |
|min    |20               |0.0 |
|max    |158              |0.0 |
+-------+-----------------+----+
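
The much lower count for `f1` (1519 against 32392 for `f0`) is consistent with how Spark SQL treats division under its default (non-ANSI) settings: dividing by zero gives NULL, and the summary counts only non-null values. The sketch below imitates that behaviour in plain Python. The function `safe_divide` and the sample rows are made up for illustration and are not taken from the dataset or from ANOVOS.

```python
def safe_divide(numerator, denominator):
    """Mimic Spark SQL's default division: NULL inputs or a zero divisor give NULL (None)."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


# Illustrative (capital-gain, capital-loss) pairs, not real rows from the dataset.
rows = [(2174, 0), (0, 0), (0, 1902), (0, 1408), (None, 0)]
f1 = [safe_divide(gain, loss) for gain, loss in rows]

# Only non-null results would contribute to count, mean, min and max.
non_null = [v for v in f1 if v is not None]
print(len(non_null), non_null)
```

Under this reading, only rows with a non-zero `capital-loss` survive, and a mean, min and max of 0.0 would suggest that `capital-gain` is zero in all of those rows.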



In [105]:
# Example 2 - 2 generated columns + customized postfix + print impact
odf = expression_parser(df, ['age + hours-per-week', 'capital-gain - capital-loss'], postfix="_new", print_impact=True)

Columns Added:  ['f0_new', 'f1_new']
+-------+-----------------+-----------------+
|summary|f0_new           |f1_new           |
+-------+-----------------+-----------------+
|count  |32392            |32548            |
|mean   |78.75373549024451|990.3572569743149|
|stddev |18.61982451813538|7410.325259409036|
|min    |20               |-4356            |
|max    |158              |99999            |
+-------+-----------------+-----------------+

