# ANOVOS - Association Evaluator
This notebook lists the functions in the "association evaluator" module of the ANOVOS package and shows how to invoke each of them.
- [Correlation Matrix](#Correlation-Matrix)
- [Variable Clustering](#Variable-Clustering)
- [Information Value (IV)](#Information-Value-(IV))
- [Information Gain (IG)](#Information-Gain-(IG))

**Setting Spark Session**

In [1]:
#set run type variable
run_type = "local" # "local", "emr", "databricks", "ak8s"

In [2]:
#For run_type Azure Kubernetes, run the following block 
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if run_type == "ak8s":
    fs_path="<insert conf spark.hadoop.fs master url here> ex: spark.hadoop.fs.azure.sas.<container>.<account_name>.blob.core.windows.net"
    auth_key="<insert value of sas_token here>"
    master_url="<insert kubernetes master url path here> ex: k8s://"
    docker_image="<insert name docker image here>"
    kubernetes_namespace ="<insert kubernetes namespace here>"

    # Create Spark config for our Kubernetes based cluster manager
    sparkConf = SparkConf()
    sparkConf.setMaster(master_url)
    sparkConf.setAppName("Anovos_pipeline")
    sparkConf.set("spark.submit.deployMode","client")
    sparkConf.set("spark.kubernetes.container.image", docker_image)
    sparkConf.set("spark.kubernetes.namespace", kubernetes_namespace)
    sparkConf.set("spark.executor.instances", "4")
    sparkConf.set("spark.executor.cores", "4")
    sparkConf.set("spark.executor.memory", "16g")
    sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
    sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    sparkConf.set(fs_path,auth_key)
    sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
    sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3,io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20,org.apache.spark:spark-avro_2.12:3.2.1")

    # Initialize our Spark cluster, this will actually
    # generate the worker nodes.
    spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
    sc = spark.sparkContext

#For other run types import from anovos.shared.
else:
    from anovos.shared.spark import *
    auth_key = "NA"

2022-06-07 11:54:39.501 | INFO     | anovos.shared.spark:init_spark:54 - Getting spark session, context and sql context app_name: Anovos_pipeline



In [3]:
sc.setLogLevel("ERROR")
import warnings
warnings.filterwarnings('ignore')

**Input/Output Path**

In [4]:
inputPath = "../data/income_dataset/csv"
outputPath = "../output/income_dataset/data_analyzer"

In [5]:
from anovos.data_ingest.data_ingest import read_dataset

In [6]:
df = read_dataset(
    spark,
    file_path=inputPath,
    file_type="csv",
    file_configs={"header": "True", "delimiter": ",", "inferSchema": "True"},
)
df = df.drop("dt_1", "dt_2")
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Correlation Matrix
- The API specification of the **correlation_matrix** function can be found <a href="https://docs.anovos.ai/api/data_analyzer/association_evaluator.html">here</a>.

In [7]:
from anovos.data_analyzer.association_evaluator import correlation_matrix

In [8]:
# Example 1 - 'all' columns (excluding drop_cols); high-cardinality columns such as 'ifa' MUST be dropped
odf = correlation_matrix(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf.toPandas()

2022-06-06 19:24:45,995 INFO [histogram_filler_base]: Filling 136 specified histograms. auto-binning.
100%|█████████████████████████████████████████████████████████████| 136/136 [00:08<00:00, 16.43it/s]
                                                                                

Unnamed: 0,attribute,age,capital-gain,capital-loss,education,education-num,fnlwgt,hours-per-week,income,logfnl,marital-status,native-country,occupation,race,relationship,sex,workclass
0,age,1.0,0.210563,0.13696,0.351468,0.30771,0.093912,0.374326,0.371282,0.095748,0.567636,0.090161,0.317458,0.03757,0.518391,0.136002,0.299177
1,capital-gain,0.210563,1.0,0.0,0.232282,0.278753,0.0,0.134373,0.363159,0.0,0.147185,0.0,0.150385,0.0,0.220468,0.077803,0.104854
2,capital-loss,0.13696,0.0,1.0,0.137968,0.097767,0.0,0.067687,0.236992,0.0,0.139274,0.024224,0.075212,0.0,0.133671,0.090275,0.050194
3,education,0.351468,0.232282,0.137968,1.0,0.999993,0.0628,0.214868,0.408941,0.084964,0.615107,0.416369,0.495816,0.250895,0.385173,0.111322,0.234612
4,education-num,0.30771,0.278753,0.097767,0.999993,1.0,0.064602,0.217432,0.467219,0.087599,0.193128,0.421516,0.590182,0.119653,0.227612,0.112019,0.227179
5,fnlwgt,0.093912,0.0,0.0,0.0628,0.064602,1.0,0.030981,0.045367,0.960338,0.069952,0.17202,0.079118,0.134013,0.037772,0.018828,0.061078
6,hours-per-week,0.374326,0.134373,0.067687,0.214868,0.217432,0.030981,1.0,0.310192,0.079669,0.26409,0.073563,0.346109,0.114166,0.322045,0.323359,0.279103
7,income,0.371282,0.363159,0.236992,0.408941,0.467219,0.045367,0.310192,1.0,0.067692,0.591336,0.115731,0.446535,0.131476,0.452741,0.323706,0.230552
8,logfnl,0.095748,0.0,0.0,0.084964,0.087599,0.960338,0.079669,0.067692,1.0,0.059919,0.196478,0.110344,0.187069,0.064678,0.031693,0.080529
9,marital-status,0.567636,0.147185,0.139274,0.615107,0.193128,0.069952,0.26409,0.591336,0.059919,1.0,0.152067,0.351355,0.246052,0.713109,0.547093,0.201391


In [9]:
# Example 2 - selected columns
odf = correlation_matrix(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf.toPandas()

2022-06-06 19:24:57,129 INFO [histogram_filler_base]: Filling 15 specified histograms. auto-binning.
100%|███████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 21.68it/s]
                                                                                

Unnamed: 0,attribute,age,fnlwgt,race,sex,workclass
0,age,1.0,0.093912,0.03757,0.136002,0.299177
1,fnlwgt,0.093912,1.0,0.134013,0.018828,0.061078
2,race,0.03757,0.134013,1.0,0.608577,0.085917
3,sex,0.136002,0.018828,0.608577,1.0,0.186017
4,workclass,0.299177,0.061078,0.085917,0.186017,1.0


In [10]:
# Example 3 - selected columns + presaved stats
from anovos.data_analyzer.stats_generator import measures_of_cardinality
from anovos.data_ingest.data_ingest import write_dataset
unique = write_dataset(measures_of_cardinality(spark, df),outputPath+"/unique","parquet", file_configs={"mode":"overwrite"})

odf = correlation_matrix(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'],
                                  stats_unique={"file_path":outputPath+"/unique", "file_type": "parquet"})
odf.toPandas()

2022-06-06 19:25:02,820 INFO [histogram_filler_base]: Filling 15 specified histograms. auto-binning.
100%|███████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 19.70it/s]


Unnamed: 0,attribute,age,fnlwgt,race,sex,workclass
0,age,1.0,0.093912,0.03757,0.136002,0.299177
1,fnlwgt,0.093912,1.0,0.134013,0.018828,0.061078
2,race,0.03757,0.134013,1.0,0.608577,0.085917
3,sex,0.136002,0.018828,0.608577,1.0,0.186017
4,workclass,0.299177,0.061078,0.085917,0.186017,1.0
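
Once the matrix is in pandas, a common next step is to list the column pairs whose association exceeds some cut-off. The sketch below does this on the values copied from the selected-columns output above. The helper name `strong_pairs` and the 0.25 threshold are assumptions chosen for illustration and are not part of ANOVOS.

```python
import io

import pandas as pd

# Values copied from the selected-columns correlation output above.
csv_text = """attribute,age,fnlwgt,race,sex,workclass
age,1.0,0.093912,0.03757,0.136002,0.299177
fnlwgt,0.093912,1.0,0.134013,0.018828,0.061078
race,0.03757,0.134013,1.0,0.608577,0.085917
sex,0.136002,0.018828,0.608577,1.0,0.186017
workclass,0.299177,0.061078,0.085917,0.186017,1.0
"""
corr = pd.read_csv(io.StringIO(csv_text)).set_index("attribute")


def strong_pairs(corr, threshold):
    """Return (col_a, col_b, value) for each unordered pair at or above threshold.

    Hypothetical helper: it only walks the upper triangle of the matrix so
    that each pair is reported once and the diagonal is skipped.
    """
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = float(corr.loc[a, b])
            if value >= threshold:
                pairs.append((a, b, value))
    # Strongest association first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


pairs = strong_pairs(corr, threshold=0.25)  # threshold is an illustrative choice
print(pairs)
```

With this cut-off only the race/sex and age/workclass pairs are reported; a different threshold would of course change the list.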


# Variable Clustering
- The API specification of the **variable_clustering** function can be found <a href="https://docs.anovos.ai/api/data_analyzer/association_evaluator.html">here</a>.
- Valid only on smaller datasets that fit into a pandas dataframe. The sample size can be controlled with the sample_size argument (default value: 100,000).

In [11]:
from anovos.data_analyzer.association_evaluator import variable_clustering

In [12]:
# Example 1 - with mandatory arguments only (the remaining arguments use their default values)
odf = variable_clustering(spark, df)
odf.toPandas()

Unnamed: 0,Cluster,Attribute,RS_Ratio
0,0,hours-per-week,0.8264
1,0,marital-status,0.4999
2,0,relationship,0.3526
3,0,sex,0.3369
4,1,logfnl,0.2273
5,1,fnlwgt,0.2277
6,2,income,0.5764
7,2,education-num,0.4163
8,2,occupation,0.5893
9,2,capital-loss,0.8975
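
The notebook does not define the RS_Ratio column. In the variable-clustering literature it is conventionally the "1 - R²" ratio: one minus the R² of a variable with its own cluster, divided by one minus its R² with the nearest other cluster. The sketch below assumes that ANOVOS follows this convention; the function name `rs_ratio` and the input values are hypothetical.

```python
def rs_ratio(r2_own, r2_nearest):
    """Conventional 1 - R^2 ratio used in variable clustering.

    r2_own:     R^2 of the variable with its own cluster component.
    r2_nearest: R^2 of the variable with the nearest other cluster.
    Lower values indicate a variable that fits its own cluster well and is
    poorly explained by the neighbouring cluster.
    """
    if not 0.0 <= r2_own <= 1.0:
        raise ValueError("r2_own must lie in [0, 1]")
    if not 0.0 <= r2_nearest < 1.0:
        raise ValueError("r2_nearest must lie in [0, 1)")
    return (1.0 - r2_own) / (1.0 - r2_nearest)


# Hypothetical inputs, not taken from the output above.
print(rs_ratio(0.8, 0.2))
# A variable that is alone in its cluster has r2_own == 1, so the ratio is 0.
print(rs_ratio(1.0, 0.3))
```

Under this reading, a variable with a small RS_Ratio is a good representative of its cluster.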


In [13]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = variable_clustering(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf.toPandas()

Unnamed: 0,Cluster,Attribute,RS_Ratio
0,0,hours-per-week,0.8264
1,0,marital-status,0.4999
2,0,relationship,0.3526
3,0,sex,0.3369
4,1,logfnl,0.2271
5,1,fnlwgt,0.2277
6,2,income,0.5764
7,2,education-num,0.4163
8,2,occupation,0.5893
9,2,capital-loss,0.8975


In [14]:
# Example 3 - selected columns
odf = variable_clustering(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf.toPandas()

Unnamed: 0,Cluster,Attribute,RS_Ratio
0,0,workclass,0.5303
1,0,age,0.4587
2,0,fnlwgt,0.827
3,1,race,0.4607
4,1,sex,0.4618


In [15]:
# Example 4 - only numerical columns (a user warning is shown because no encoding is required when there are no categorical columns)
odf = variable_clustering(spark, idf = df, list_of_cols= ['age','education-num','capital-gain'])
odf.toPandas()

Unnamed: 0,Cluster,Attribute,RS_Ratio
0,0,age,0.764
1,0,capital-gain,0.4886
2,0,education-num,0.5839


In [16]:
# Example 5 - only categorical columns
odf = variable_clustering(spark, idf = df, list_of_cols= ['sex','race','workclass'])
odf.toPandas()

Unnamed: 0,Cluster,Attribute,RS_Ratio
0,0,race,0.4606
1,0,sex,0.4605
2,1,workclass,0.0


In [17]:
# Example 6 - Change in Sample Size
odf = variable_clustering(spark, idf = df, list_of_cols= 'all', sample_size=10000)
odf.toPandas()

Unnamed: 0,Cluster,Attribute,RS_Ratio
0,0,hours-per-week,0.8462
1,0,marital-status,0.4953
2,0,relationship,0.3586
3,0,sex,0.3328
4,1,logfnl,0.2217
5,1,fnlwgt,0.2224
6,2,income,0.5328
7,2,education-num,0.4581
8,2,occupation,0.6609
9,2,capital-loss,0.8752


In [18]:
# Example 7 - selected columns + presaved stats
from anovos.data_analyzer.stats_generator import measures_of_cardinality, measures_of_centralTendency
from anovos.data_ingest.data_ingest import write_dataset
unique = write_dataset(measures_of_cardinality(spark, df),outputPath+"/unique","parquet", file_configs={"mode":"overwrite"})
mode = write_dataset(measures_of_centralTendency(spark, df),outputPath+"/mode","parquet", file_configs={"mode":"overwrite"})

odf = variable_clustering(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'],
                                  stats_unique={"file_path":outputPath+"/unique", "file_type": "parquet"},
                                  stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"})
odf.toPandas()

                                                                                

Unnamed: 0,Cluster,Attribute,RS_Ratio
0,0,workclass,0.5303
1,0,age,0.4587
2,0,fnlwgt,0.827
3,1,race,0.4607
4,1,sex,0.4618


# Information Value (IV)
- The API specification of the **IV_calculation** function can be found <a href="https://docs.anovos.ai/api/data_analyzer/association_evaluator.html">here</a>.
- Supports only a binary target variable.
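
Information Value is conventionally computed from the weight of evidence (WoE) of each category or bin, which is why a binary target with a designated event label is needed. The following is a minimal pandas sketch of that textbook formula, not the ANOVOS implementation. The function name `information_value`, the 0.5 smoothing constant, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def information_value(feature, target, event_label):
    """Textbook IV of a categorical (or pre-binned) feature vs. a binary target."""
    if target.nunique() != 2:
        raise ValueError("IV is defined here for a binary target only")
    # Relabel the target as event / non_event relative to event_label.
    label = pd.Series(
        np.where(target == event_label, "event", "non_event"), index=target.index
    )
    counts = pd.crosstab(feature, label).reindex(
        columns=["event", "non_event"], fill_value=0
    )
    # Assumption: add 0.5 to every cell so that empty cells do not yield log(0).
    events = counts["event"] + 0.5
    non_events = counts["non_event"] + 0.5
    pct_event = events / events.sum()
    pct_non_event = non_events / non_events.sum()
    woe = np.log(pct_non_event / pct_event)
    return float(((pct_non_event - pct_event) * woe).sum())


# Toy data: category "a" is mostly events, category "b" mostly non-events.
feature = pd.Series(["a"] * 4 + ["b"] * 4)
target = pd.Series([">50K"] * 3 + ["<=50K"] + [">50K"] + ["<=50K"] * 3)
iv = information_value(feature, target, event_label=">50K")
print(round(iv, 4))
```

Swapping the event and non-event roles leaves the IV unchanged because both factors of each term flip sign together.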

In [19]:
from anovos.data_analyzer.association_evaluator import IV_calculation

In [20]:
# Example 1 - with mandatory arguments only (the remaining arguments use their default values)
odf = IV_calculation(spark, df, label_col='income', event_label=">50K")
odf.toPandas()



Unnamed: 0,attribute,iv
0,relationship,1.5348
1,marital-status,1.339
2,age,1.0704
3,occupation,0.7772
4,education,0.7345
5,education-num,0.6984
6,hours-per-week,0.4563
7,capital-gain,0.3138
8,sex,0.3037
9,workclass,0.1625


In [21]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = IV_calculation(spark, idf = df, list_of_cols='all', drop_cols=['ifa'], label_col='income', event_label=">50K")
odf.toPandas()

Unnamed: 0,attribute,iv
0,relationship,1.5348
1,marital-status,1.339
2,age,1.0704
3,occupation,0.7772
4,education,0.7345
5,education-num,0.6984
6,hours-per-week,0.4563
7,capital-gain,0.3138
8,sex,0.3037
9,workclass,0.1625


In [22]:
# Example 3 - selected columns
odf = IV_calculation(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], label_col='income', event_label=">50K")
odf.toPandas()

Unnamed: 0,attribute,iv
0,age,1.0704
1,sex,0.3037
2,workclass,0.1625
3,race,0.0697
4,fnlwgt,0.0087


In [23]:
# Example 4 - selected columns + encoding configs (bin_method 'equal_range' instead of the default 'equal_frequency')
odf = IV_calculation(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], label_col='income', 
                    event_label=">50K", encoding_configs={'bin_method': 'equal_range', 
                                                          'bin_size': 10, 'monotonicity_check': 0})
odf.toPandas()

Unnamed: 0,attribute,iv
0,age,1.0436
1,sex,0.3037
2,workclass,0.1625
3,race,0.0697
4,fnlwgt,0.0016
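
The two bin methods differ in how the cut points are chosen: equal-range bins have the same width, while equal-frequency bins hold roughly the same number of rows. The sketch below illustrates the general distinction with pandas `cut` and `qcut` on made-up values. It is not the ANOVOS binning code, whose edge handling may differ.

```python
import pandas as pd

# Made-up, heavily skewed values (one large outlier).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal range: two bins of equal width over [min, max].
equal_range = pd.cut(values, bins=2, labels=False)

# Equal frequency: two bins split at the median.
equal_frequency = pd.qcut(values, q=2, labels=False)

# Rows per bin, ordered by bin index.
range_counts = equal_range.value_counts().sort_index().tolist()
frequency_counts = equal_frequency.value_counts().sort_index().tolist()

print("equal range:    ", range_counts)
print("equal frequency:", frequency_counts)
```

On skewed columns, equal-range binning can place almost all rows in one bin, which is one reason a metric computed on binned values can change when the bin method changes.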


In [24]:
# Example 5 - selected columns + encoding configs (bin_size 20 instead of the default 10)
odf = IV_calculation(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], label_col='income', 
                    event_label=">50K", encoding_configs={'bin_method': 'equal_frequency', 
                                                          'bin_size': 20, 'monotonicity_check': 0})
odf.toPandas()

Unnamed: 0,attribute,iv
0,age,1.1592
1,sex,0.3037
2,workclass,0.1625
3,race,0.0697
4,fnlwgt,0.016


In [25]:
# Example 6 - selected columns + encoding configs (monotonicity check enabled)
odf = IV_calculation(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], label_col='income', 
                    event_label=">50K", encoding_configs={'bin_method': 'equal_frequency', 
                                                          'bin_size': 10, 'monotonicity_check': 1})
odf.toPandas()

Unnamed: 0,attribute,iv
0,age,0.5842
1,sex,0.3037
2,workclass,0.1625
3,race,0.0697
4,fnlwgt,0.0087


# Information Gain (IG)
- The API specification of the **IG_calculation** function can be found <a href="https://docs.anovos.ai/api/data_analyzer/association_evaluator.html">here</a>.
- Supports only a binary target variable.
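
Information Gain is conventionally the entropy of the target minus the entropy that remains after splitting on the attribute. The following is a minimal pure-Python sketch of that definition, not the ANOVOS implementation. The function names, the base-2 logarithm, and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a sequence of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature, target):
    """Entropy of target minus its conditional entropy given feature."""
    total = len(target)
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    # Weighted average of the entropy inside each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target) - conditional


# Toy data: the first feature separates the classes perfectly, the second not at all.
target = [">50K", ">50K", "<=50K", "<=50K"]
perfect = information_gain(["a", "a", "b", "b"], target)
useless = information_gain(["a", "b", "a", "b"], target)
print(perfect, useless)
```

For a balanced binary target the gain is bounded by one bit, reached only when the attribute determines the label exactly.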

In [26]:
from anovos.data_analyzer.association_evaluator import IG_calculation

In [27]:
# Example 1 - with mandatory arguments only (the remaining arguments use their default values)
odf = IG_calculation(spark, df, label_col='income', event_label=">50K")
odf.toPandas()

Unnamed: 0,attribute,ig
0,relationship,0.1654
1,marital-status,0.1565
2,age,0.0935
3,education,0.0932
4,occupation,0.0931
5,education-num,0.0883
6,hours-per-week,0.0565
7,capital-gain,0.0428
8,sex,0.0372
9,workclass,0.0217


In [28]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = IG_calculation(spark, idf = df, list_of_cols='all', drop_cols=['ifa'], label_col='income', event_label=">50K")
odf.toPandas()

Unnamed: 0,attribute,ig
0,relationship,0.1654
1,marital-status,0.1565
2,age,0.0935
3,education,0.0932
4,occupation,0.0931
5,education-num,0.0883
6,hours-per-week,0.0565
7,capital-gain,0.0428
8,sex,0.0372
9,workclass,0.0217


In [29]:
# Example 3 - selected columns
odf = IG_calculation(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], label_col='income', event_label=">50K")
odf.toPandas()

Unnamed: 0,attribute,ig
0,age,0.0935
1,sex,0.0372
2,workclass,0.0217
3,race,0.0086
4,fnlwgt,0.0011


In [30]:
# Example 4 - selected columns + encoding configs (bin_method 'equal_range' instead of the default 'equal_frequency')
odf = IG_calculation(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], label_col='income', 
                    event_label=">50K", encoding_configs={'bin_method': 'equal_range', 
                                                          'bin_size': 10, 'monotonicity_check': 0})
odf.toPandas()

Unnamed: 0,attribute,ig
0,age,0.0918
1,sex,0.0372
2,workclass,0.0217
3,race,0.0086
4,fnlwgt,0.0002


In [31]:
# Example 5 - selected columns + encoding configs (bin_size 20 instead of the default 10)
odf = IG_calculation(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], label_col='income', 
                    event_label=">50K", encoding_configs={'bin_method': 'equal_frequency', 
                                                          'bin_size': 20, 'monotonicity_check': 0})
odf.toPandas()

Unnamed: 0,attribute,ig
0,age,0.0968
1,sex,0.0372
2,workclass,0.0217
3,race,0.0086
4,fnlwgt,0.0021


In [32]:
# Example 6 - selected columns + encoding configs (monotonicity check enabled)
odf = IG_calculation(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], label_col='income', 
                    event_label=">50K", encoding_configs={'bin_method': 'equal_frequency', 
                                                          'bin_size': 10, 'monotonicity_check': 1})
odf.toPandas()

Unnamed: 0,attribute,ig
0,age,0.0688
1,sex,0.0372
2,workclass,0.0217
3,race,0.0086
4,fnlwgt,0.0011
