# ANOVOS - Quality Checkers
This notebook lists the functions in the "quality checker" module of the ANOVOS package and shows how to invoke each of them.
- [Row Level Checks](#Row-Level-Checks)
    - [Duplicate Detection](#Duplicate-Detection)
    - [Null Detection (Row-wise)](#Null-Detection-(Row-wise))
- [Column Level Checks](#Column-Level-Checks)
    - [Null Detection (Column-wise)](#Null-Detection-(Column-wise))
    - [Outlier Detection](#Outlier-Detection)
    - [IDness Detection](#IDness-Detection)
    - [Biasedness Detection](#Biasedness-Detection)
    - [Invalid Entries Detection](#Invalid-Entries-Detection)

**Setting Spark Session**

In [1]:
#set run type variable
run_type = "local" # "local", "emr", "databricks", "ak8s"

In [5]:
# For run_type "ak8s" (Azure Kubernetes), run the following block
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if run_type == "ak8s":
    fs_path="<insert conf spark.hadoop.fs master url here> ex: spark.hadoop.fs.azure.sas.<container>.<account_name>.blob.core.windows.net"
    auth_key="<insert value of sas_token here>"
    master_url="<insert kubernetes master url path here> ex: k8s://"
    docker_image="<insert name docker image here>"
    kubernetes_namespace ="<insert kubernetes namespace here>"

    # Create Spark config for our Kubernetes based cluster manager
    sparkConf = SparkConf()
    sparkConf.setMaster(master_url)
    sparkConf.setAppName("Anovos_pipeline")
    sparkConf.set("spark.submit.deployMode","client")
    sparkConf.set("spark.kubernetes.container.image", docker_image)
    sparkConf.set("spark.kubernetes.namespace", kubernetes_namespace)
    sparkConf.set("spark.executor.instances", "4")
    sparkConf.set("spark.executor.cores", "4")
    sparkConf.set("spark.executor.memory", "16g")
    sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
    sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    sparkConf.set(fs_path,auth_key)
    sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
    sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3,io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20,org.apache.spark:spark-avro_2.12:3.2.1")

    # Initialize our Spark cluster, this will actually
    # generate the worker nodes.
    spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
    sc = spark.sparkContext

#For other run types import from anovos.shared.
else:
    from anovos.shared.spark import *
    auth_key = "NA"

In [3]:
sc.setLogLevel("ERROR")
import warnings
warnings.filterwarnings('ignore')

**Input/Output Path**

In [4]:
inputPath = "../data/income_dataset/csv"
outputPath = "../output/income_dataset/data_analyzer"

In [6]:
from anovos.data_ingest.data_ingest import read_dataset

In [7]:
df = read_dataset(spark, file_path = inputPath, file_type = "csv",file_configs = {"header": "True", 
                                                                                  "delimiter": "," , 
                                                                                  "inferSchema": "True"})
df = df.drop("dt_1", "dt_2")
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Row Level Checks

## Duplicate Detection
- API specification of function **duplicate_detection** can be found <a href="https://docs.anovos.ai/api/data_analyzer/quality_checker.html">here</a>

In [8]:
from anovos.data_analyzer.quality_checker import duplicate_detection

In [10]:
# Example 1 - with mandatory arguments and print_impact=True (remaining arguments use their default values)
odf, odf_stats = duplicate_detection(spark, df, print_impact=True)
odf_stats.toPandas()

No. of Rows: 32561
No. of UNIQUE Rows: 32561
No. of Duplicate Rows: 0
Percentage of Duplicate Rows: 0.0

Unnamed: 0,metric,value
0,rows_count,32561.0
1,unique_rows_count,32561.0
2,duplicate_rows,0.0
3,duplicate_pct,0.0


In [12]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = duplicate_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'], print_impact=True)
odf_stats.toPandas()

No. of Rows: 32561
No. of UNIQUE Rows: 32548
No. of Duplicate Rows: 13
Percentage of Duplicate Rows: 0.0004


Unnamed: 0,metric,value
0,rows_count,32561.0
1,unique_rows_count,32548.0
2,duplicate_rows,13.0
3,duplicate_pct,0.0004


In [13]:
# Example 3 - selected columns
odf, odf_stats = duplicate_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'], print_impact=True)
odf_stats.toPandas()

No. of Rows: 32561
No. of UNIQUE Rows: 30601
No. of Duplicate Rows: 1960
Percentage of Duplicate Rows: 0.0602


Unnamed: 0,metric,value
0,rows_count,32561.0
1,unique_rows_count,30601.0
2,duplicate_rows,1960.0
3,duplicate_pct,0.0602


In [14]:
# Example 4 - with treatment (Deduplication)
odf, odf_stats = duplicate_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'],treatment=True, print_impact=True)
print(odf.count())
odf_stats.toPandas()

No. of Rows: 32561
No. of UNIQUE Rows: 30601
No. of Duplicate Rows: 1960
Percentage of Duplicate Rows: 0.0602
30601


Unnamed: 0,metric,value
0,rows_count,32561.0
1,unique_rows_count,30601.0
2,duplicate_rows,1960.0
3,duplicate_pct,0.0602
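
The four metrics reported by the examples above can be reproduced on a toy dataset. The following is a minimal plain-Python sketch, not the Anovos implementation (which runs on Spark); the helper name `duplicate_stats` and the sample rows are hypothetical. It assumes that rows are compared only on the selected columns and that the first occurrence of each duplicate is kept, which is consistent with `duplicate_pct = duplicate_rows / rows_count` in the outputs (for instance 1960 / 32561 ≈ 0.0602).

```python
def duplicate_stats(rows, cols):
    """Return (deduplicated rows, metrics) for the chosen column subset."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[c] for c in cols)  # rows are compared only on `cols`
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)  # keep the first occurrence
    total = len(rows)
    unique = len(unique_rows)
    stats = {
        "rows_count": total,
        "unique_rows_count": unique,
        "duplicate_rows": total - unique,
        "duplicate_pct": round((total - unique) / total, 4) if total else 0.0,
    }
    return unique_rows, stats


# Hypothetical sample rows: unique on "id", duplicated on ("age", "sex").
rows = [
    {"id": "1a", "age": 38, "sex": "Male"},
    {"id": "2a", "age": 38, "sex": "Male"},
    {"id": "3a", "age": 53, "sex": "Female"},
    {"id": "4a", "age": 38, "sex": "Male"},
]
deduped, stats = duplicate_stats(rows, ["age", "sex"])
print(stats)
```

As in Examples 1 and 2, including an identifier column in the comparison hides duplicates that become visible once that column is excluded.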


## Null Detection (Row-wise)
- API specification of function **nullRows_detection** can be found <a href="https://docs.anovos.ai/api/data_analyzer/quality_checker.html">here</a>

In [10]:
from anovos.data_analyzer.quality_checker import nullRows_detection

In [11]:
# Example 1 - with mandatory arguments (remaining arguments use their default values)
odf, odf_stats = nullRows_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,null_cols_count,row_count,row_pct,flagged
0,1,11641,0.3575,0
1,2,20003,0.6143,0
2,3,879,0.027,0
3,4,19,0.0006,0
4,5,12,0.0004,0
5,8,4,0.0001,0
6,9,3,0.0001,0


In [12]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = nullRows_detection(spark, idf = df, list_of_cols='all', drop_cols=['age'], treatment_threshold=0.4)
odf_stats.toPandas()

Unnamed: 0,null_cols_count,row_count,row_pct,flagged
0,1,11665,0.3583,0
1,2,20005,0.6144,0
2,3,855,0.0263,0
3,4,23,0.0007,0
4,5,6,0.0002,0
5,8,7,0.0002,1


In [13]:
# Example 3 - selected columns
odf, odf_stats = nullRows_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf_stats.toPandas()

Unnamed: 0,null_cols_count,row_count,row_pct,flagged
0,0,32181,0.9883,0
1,1,366,0.0112,0
2,2,11,0.0003,0
3,3,3,0.0001,0


In [14]:
# Example 4 - with treatment (row removal)
odf, odf_stats = nullRows_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75)
print(odf.count())
odf_stats.toPandas()

32561


Unnamed: 0,null_cols_count,row_count,row_pct,treated
0,1,11641,0.3575,0
1,2,20003,0.6143,0
2,3,879,0.027,0
3,4,19,0.0006,0
4,5,12,0.0004,0
5,8,4,0.0001,0
6,9,3,0.0001,0
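
The row-wise statistics can be illustrated with a small plain-Python sketch. It is not the Anovos implementation; the helper name `null_row_stats`, the default threshold value, and the sample rows are hypothetical. The flagging rule is an assumption: a group of rows is flagged when its share of null columns exceeds `treatment_threshold`. That rule agrees with Example 2, where 8 null columns out of 17 (about 0.47) is flagged at a threshold of 0.4 while 5 out of 17 (about 0.29) is not.

```python
from collections import Counter


def null_row_stats(rows, cols, treatment_threshold=0.8):
    """Group rows by how many of `cols` are null and flag the sparse groups."""
    total = len(rows)
    # Number of null columns per row, then how many rows share each count.
    counts = Counter(sum(row.get(c) is None for c in cols) for row in rows)
    stats = []
    for null_cols, row_count in sorted(counts.items()):
        stats.append({
            "null_cols_count": null_cols,
            "row_count": row_count,
            "row_pct": round(row_count / total, 4),
            # Assumed rule: flag when the null share exceeds the threshold.
            "flagged": int(null_cols / len(cols) > treatment_threshold),
        })
    return stats


# Hypothetical sample with five columns.
cols = ["a", "b", "c", "d", "e"]
rows = [
    {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5},              # 0 nulls
    {"a": None, "b": 2, "c": 3, "d": 4, "e": 5},           # 1 null
    {"a": None, "b": None, "c": None, "d": 4, "e": 5},     # 3 nulls
    {"a": None, "b": 2, "c": 3, "d": 4, "e": None},        # 2 nulls
]
row_stats = null_row_stats(rows, cols, treatment_threshold=0.4)
for entry in row_stats:
    print(entry)
```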


# Column Level Checks

## Null Detection (Column-wise)
- API specification of function **nullColumns_detection** can be found <a href="https://docs.anovos.ai/api/data_analyzer/quality_checker.html">here</a>

In [15]:
from anovos.data_analyzer.quality_checker import nullColumns_detection

In [16]:
# Example 1 - with mandatory arguments (remaining arguments use their default values)
odf, odf_stats = nullColumns_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,missing_count,missing_pct
0,education-num,31,0.001
1,workclass,3,0.0001
2,education,521,0.016
3,race,314,0.0096
4,relationship,4,0.0001
5,capital-gain,13,0.0004
6,capital-loss,12,0.0004
7,age,61,0.0019
8,hours-per-week,109,0.0033
9,fnlwgt,15,0.0005


In [17]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf_stats.toPandas()

Unnamed: 0,attribute,missing_count,missing_pct
0,education-num,31,0.001
1,workclass,3,0.0001
2,education,521,0.016
3,race,314,0.0096
4,relationship,4,0.0001
5,capital-gain,13,0.0004
6,capital-loss,12,0.0004
7,income,0,0.0
8,age,61,0.0019
9,hours-per-week,109,0.0033


In [18]:
# Example 3 - selected columns
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf_stats.toPandas()

Unnamed: 0,attribute,missing_count,missing_pct
0,workclass,3,0.0001
1,race,314,0.0096
2,age,61,0.0019
3,fnlwgt,15,0.0005
4,sex,4,0.0001


In [19]:
# Example 4 - with treatment (row removal)
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                                       treatment_method="row_removal", print_impact=True)

+--------------+-------------+-----------+
|     attribute|missing_count|missing_pct|
+--------------+-------------+-----------+
|           ifa|            0|        0.0|
| education-num|           31|      0.001|
|     workclass|            3|     1.0E-4|
|     education|          521|      0.016|
|          race|          314|     0.0096|
|  relationship|            4|     1.0E-4|
|  capital-gain|           13|     4.0E-4|
|  capital-loss|           12|     4.0E-4|
|        income|            0|        0.0|
|           age|           61|     0.0019|
|hours-per-week|          109|     0.0033|
|        fnlwgt|           15|     5.0E-4|
|native-country|            0|        0.0|
|marital-status|          426|     0.0131|
|           sex|            4|     1.0E-4|
|    occupation|           12|     4.0E-4|
|        logfnl|        20393|     0.6263|
+--------------+-------------+-----------+
only showing top 17 rows

Before Count: 32561
After Count: 11641


In [20]:
# Example 5 - with treatment (column removal)
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                        treatment_method="column_removal", treatment_configs={'treatment_threshold':0.5},print_impact=True)

+--------------+-------------+-----------+
|     attribute|missing_count|missing_pct|
+--------------+-------------+-----------+
|           ifa|            0|        0.0|
| education-num|           31|      0.001|
|     workclass|            3|     1.0E-4|
|     education|          521|      0.016|
|          race|          314|     0.0096|
|  relationship|            4|     1.0E-4|
|  capital-gain|           13|     4.0E-4|
|  capital-loss|           12|     4.0E-4|
|        income|            0|        0.0|
|           age|           61|     0.0019|
|hours-per-week|          109|     0.0033|
|        fnlwgt|           15|     5.0E-4|
|native-country|            0|        0.0|
|marital-status|          426|     0.0131|
|           sex|            4|     1.0E-4|
|    occupation|           12|     4.0E-4|
|        logfnl|        20393|     0.6263|
|         empty|        32561|        1.0|
+--------------+-------------+-----------+

Removed Columns:  ['logfnl', 'empty']
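
The column-wise statistics and the `column_removal` treatment can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Anovos implementation: the helper names `null_column_stats` and `columns_to_remove` and the sample rows are hypothetical, and the rule "remove a column when `missing_pct` exceeds `treatment_threshold`" is inferred from the output above, where `logfnl` (0.6263) and `empty` (1.0) are removed at a threshold of 0.5.

```python
def null_column_stats(rows, cols):
    """Count missing values per column and express them as a share of rows."""
    total = len(rows)
    stats = []
    for c in cols:
        missing = sum(row.get(c) is None for row in rows)
        stats.append({
            "attribute": c,
            "missing_count": missing,
            "missing_pct": round(missing / total, 4),
        })
    return stats


def columns_to_remove(stats, treatment_threshold):
    """Assumed rule: drop columns whose missing share exceeds the threshold."""
    return [s["attribute"] for s in stats if s["missing_pct"] > treatment_threshold]


# Hypothetical sample rows.
rows = [
    {"ifa": "1a", "age": None, "logfnl": 4.89, "empty": None},
    {"ifa": "2a", "age": 38, "logfnl": None, "empty": None},
    {"ifa": "3a", "age": 53, "logfnl": None, "empty": None},
    {"ifa": "4a", "age": 28, "logfnl": None, "empty": None},
]
col_stats = null_column_stats(rows, ["ifa", "age", "logfnl", "empty"])
removed = columns_to_remove(col_stats, treatment_threshold=0.5)
print(col_stats)
print("Removed Columns: ", removed)
```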


In [21]:
# Example 6 - with treatment (Median & Mode)
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"median", 
                                                    "pre_existing_model":False,"model_path":"NA",
                                                    "output_mode":"replace"},print_impact=True)

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|ifa           |0                  |0                 |
|education-num |31                 |0                 |
|workclass     |3                  |0                 |
|education     |521                |0                 |
|race          |314                |0                 |
|relationship  |4                  |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|income        |0                  |0                 |
|age           |61                 |0                 |
|hours-per-week|109                |0                 |
|fnlwgt        |15                 |0                 |
|native-country|0                  |0                 |
|marital-status|426                |0                 |
|sex           |4                  |0           

In [22]:
# Example 7 - with treatment (Mean & Mode) and saving model
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"mean", 
                                                    "pre_existing_model":False,"model_path":outputPath,
                                                    "output_mode":"replace"},print_impact=True)

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|ifa           |0                  |0                 |
|education-num |31                 |0                 |
|workclass     |3                  |0                 |
|education     |521                |0                 |
|race          |314                |0                 |
|relationship  |4                  |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|income        |0                  |0                 |
|age           |61                 |0                 |
|hours-per-week|109                |0                 |
|fnlwgt        |15                 |0                 |
|native-country|0                  |0                 |
|marital-status|426                |0                 |
|sex           |4                  |0           

In [23]:
# Example 8 - with treatment (Mean & Mode) and using pre-saved model
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"mean", 
                                                    "pre_existing_model":True,"model_path":outputPath,
                                                    "output_mode":"replace"},print_impact=True)

+--------------+-------------------+------------------+
|attribute     |missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|ifa           |0                  |0                 |
|education-num |31                 |0                 |
|workclass     |3                  |0                 |
|education     |521                |0                 |
|race          |314                |0                 |
|relationship  |4                  |0                 |
|capital-gain  |13                 |0                 |
|capital-loss  |12                 |0                 |
|income        |0                  |0                 |
|age           |61                 |0                 |
|hours-per-week|109                |0                 |
|fnlwgt        |15                 |0                 |
|native-country|0                  |0                 |
|marital-status|426                |0                 |
|sex           |4                  |0           

In [24]:
# Example 9 - with treatment (Mean & Mode), using pre-saved model and appending imputed columns
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"mean", 
                                                    "pre_existing_model":True,"model_path":outputPath,
                                                    "output_mode":"append"},print_impact=True)

+--------------+-------------------+----------------------+-------------+
|attribute     |missingCount_before|attribute_after       |missing_count|
+--------------+-------------------+----------------------+-------------+
|race          |314                |race_imputed          |0            |
|capital-gain  |13                 |capital-gain_imputed  |0            |
|capital-loss  |12                 |capital-loss_imputed  |0            |
|sex           |4                  |sex_imputed           |0            |
|fnlwgt        |15                 |fnlwgt_imputed        |0            |
|hours-per-week|109                |hours-per-week_imputed|0            |
|occupation    |12                 |occupation_imputed    |0            |
|marital-status|426                |marital-status_imputed|0            |
|age           |61                 |age_imputed           |0            |
|education-num |31                 |education-num_imputed |0            |
|workclass     |3                  |wo

In [25]:
# Example 10 - with treatment (Mean & Mode), using pre-saved model + presaved stats
from anovos.data_analyzer.stats_generator import measures_of_counts, measures_of_centralTendency, measures_of_cardinality
from anovos.data_ingest.data_ingest import write_dataset
missing = write_dataset(measures_of_counts(spark, df),outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})
unique = write_dataset(measures_of_cardinality(spark, df),outputPath+"/unique","parquet", file_configs={"mode":"overwrite"})
mode = write_dataset(measures_of_centralTendency(spark, df),outputPath+"/mode","parquet", file_configs={"mode":"overwrite"})

odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"mean", 
                                                    "pre_existing_model":True,"model_path":outputPath,
                                                    "output_mode":"append"},
                    stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                    stats_unique={"file_path":outputPath+"/unique", "file_type": "parquet"}, 
                    stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"}, print_impact=True)

+--------------+-------------------+----------------------+-------------+
|attribute     |missingCount_before|attribute_after       |missing_count|
+--------------+-------------------+----------------------+-------------+
|hours-per-week|109                |hours-per-week_imputed|0            |
|education-num |31                 |education-num_imputed |0            |
|capital-loss  |12                 |capital-loss_imputed  |0            |
|capital-gain  |13                 |capital-gain_imputed  |0            |
|fnlwgt        |15                 |fnlwgt_imputed        |0            |
|logfnl        |20393              |logfnl_imputed        |0            |
|age           |61                 |age_imputed           |0            |
|marital-status|426                |marital-status_imputed|0            |
|relationship  |4                  |relationship_imputed  |0            |
|occupation    |12                 |occupation_imputed    |0            |
|workclass     |3                  |wo
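
The "MMM" treatment used in Examples 6 to 10 fills numerical columns with their mean or median and categorical columns with their mode. The sketch below shows the idea in plain Python; it is not the Anovos implementation. The helper names `fit_imputer` and `apply_imputer` and the sample rows are hypothetical, and splitting the work into a fit step and an apply step is only meant to mirror the save-then-reuse pattern of `pre_existing_model`. As a simplification, the `"append"` branch adds an `_imputed` column for every fitted column.

```python
from statistics import mean, median, mode


def fit_imputer(rows, num_cols, cat_cols, method_type="median"):
    """Learn one fill value per column from the non-null values."""
    center = median if method_type == "median" else mean
    model = {}
    for c in num_cols:
        model[c] = center([r[c] for r in rows if r[c] is not None])
    for c in cat_cols:
        model[c] = mode([r[c] for r in rows if r[c] is not None])
    return model


def apply_imputer(rows, model, output_mode="replace"):
    """Fill nulls in place ("replace") or in new <col>_imputed columns ("append")."""
    out = []
    for r in rows:
        new = dict(r)
        for c, fill in model.items():
            value = r[c] if r[c] is not None else fill
            if output_mode == "replace":
                new[c] = value
            else:
                new[c + "_imputed"] = value  # original column is left untouched
        out.append(new)
    return out


# Hypothetical sample rows.
rows = [
    {"age": 20, "sex": "Male"},
    {"age": None, "sex": "Male"},
    {"age": 30, "sex": None},
    {"age": 70, "sex": "Female"},
]
model = fit_imputer(rows, num_cols=["age"], cat_cols=["sex"], method_type="median")
replaced = apply_imputer(rows, model, output_mode="replace")
appended = apply_imputer(rows, model, output_mode="append")
print(model)
print(replaced[1], appended[2])
```

Fitting once and applying the stored fill values to new data keeps the imputation consistent between runs, which is the motivation for saving the model in Example 7 and reusing it in Examples 8 to 10.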

## Outlier Detection
- API specification of function **outlier_detection** can be found <a href="https://docs.anovos.ai/api/data_analyzer/quality_checker.html">here</a>
- Calculated only for numerical columns

In [8]:
from anovos.data_analyzer.quality_checker import outlier_detection

In [9]:
# Example 1 - with mandatory arguments (remaining arguments use their default values)
# Treatment will be applied with "value_replacement" method_type and the treated dataframe will be returned

odf = outlier_detection(spark, df)

df.select(['age','education-num','capital-gain','logfnl']).describe().show()
odf.select(['age','education-num','capital-gain','logfnl']).describe().show()

+-------+------------------+------------------+------------------+-------------------+
|summary|               age|     education-num|      capital-gain|             logfnl|
+-------+------------------+------------------+------------------+-------------------+
|  count|             32500|             32530|             32548|              12168|
|   mean|38.506492307692305|10.080971411005226|1077.6959567408135| 5.2054654851899365|
| stddev|13.508497735339255|2.5725103263986977| 7386.624857802765|0.27424241727170395|
|    min|                17|                 1|                 0|        4.283617786|
|    max|                85|                16|             99999|        6.088696941|
+-------+------------------+------------------+------------------+-------------------+



+-------+------------------+------------------+-----------------+-------------------+
|summary|               age|     education-num|     capital-gain|             logfnl|
+-------+------------------+------------------+-----------------+-------------------+
|  count|             32500|             32530|            32548|              12168|
|   mean|           38.4866|10.080971411005226| 292.871850804965| 5.2052811608347405|
| stddev|13.450031532874966|2.5725103263986977|996.8784267578802|0.27377636354326645|
|    min|              17.0|               1.0|              0.0|        4.283617786|
|    max|              75.5|              16.0|           3908.0|        5.823027151|
+-------+------------------+------------------+-----------------+-------------------+
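
In the output above the counts are unchanged while the maxima shrink (for example, `age` goes from 85 to 75.5), which suggests that "value_replacement" caps outliers at the detection bounds. The plain-Python sketch below contrasts that behaviour with the other two `treatment_method` values used in this section. It is an illustration only, not the Anovos implementation; the helper name `treat_outliers`, the bounds, and the sample values are hypothetical.

```python
def treat_outliers(values, lower, upper, treatment_method="value_replacement"):
    """Apply one of three treatments to values outside [lower, upper]."""
    def is_outlier(v):
        return v is not None and (v < lower or v > upper)

    if treatment_method == "value_replacement":
        # Cap outliers at the nearest bound; nulls stay null.
        return [None if v is None else min(max(v, lower), upper) for v in values]
    if treatment_method == "null_replacement":
        # Replace outliers with null; the number of entries is unchanged.
        return [None if is_outlier(v) else v for v in values]
    if treatment_method == "row_removal":
        # Drop the entries (rows) that contain an outlier.
        return [v for v in values if not is_outlier(v)]
    raise ValueError(f"unknown treatment_method: {treatment_method}")


# Hypothetical values and bounds.
ages = [17, 38, None, 53, 85]
capped = treat_outliers(ages, lower=17, upper=75.5)
nulled = treat_outliers(ages, 17, 75.5, "null_replacement")
removed = treat_outliers(ages, 17, 75.5, "row_removal")
print(capped, nulled, removed)
```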

In [10]:
# Example 2 - change in detection configs - only use the percentile method + row_removal treatment
# odf_stats is returned only when print_impact is True

odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols='all', drop_cols=['age'], 
                                   detection_side="lower", detection_configs={'pctile_lower': 0.02}, 
                                   treatment_method="row_removal", print_impact=True)

print("Number of rows before treatment:", df.count())
print("Number of rows after treatment:", odf.count())

+--------------+--------------+--------------+------------------------+
|attribute     |lower_outliers|upper_outliers|excluded_due_to_skewness|
+--------------+--------------+--------------+------------------------+
|education-num |551           |0             |0                       |
|capital-gain  |0             |0             |0                       |
|logfnl        |124           |0             |0                       |
|hours-per-week|458           |0             |0                       |
|fnlwgt        |822           |0             |0                       |
|capital-loss  |0             |0             |1                       |
+--------------+--------------+--------------+------------------------+

Number of rows before treatment: 32561
Number of rows after treatment: 30750

In [11]:
# Example 3 - detect outliers on both ends + null_replacement treatment
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'],
                                   detection_side="both", treatment_method="null_replacement", print_impact=True)

df.select(['age','education-num','capital-gain','logfnl']).describe().show(1)
odf.select(['age','education-num','capital-gain','logfnl']).describe().show(1)

+-------------+--------------+--------------+------------------------+
|attribute    |lower_outliers|upper_outliers|excluded_due_to_skewness|
+-------------+--------------+--------------+------------------------+
|education-num|1196          |0             |0                       |
|age          |0             |193           |0                       |
|capital-gain |0             |1908          |0                       |
|logfnl       |592           |27            |0                       |
+-------------+--------------+--------------+------------------------+

+-------+-----+-------------+------------+------+
|summary|  age|education-num|capital-gain|logfnl|
+-------+-----+-------------+------------+------+
|  count|32500|        32530|       32548| 12168|
+-------+-----+-------------+------------+------+
only showing top 1 row

+-------+-----+-------------+------------+------+
|summary|  age|education-num|capital-gain|logfnl|
+-------+-----+-------------+------------+------+
|  coun

In [13]:
# Example 7 - use 2 methods for detection + saving model
# Using the config below, a value is considered an outlier if it is declared an outlier by at least 1 methodology

odf, odf_stats = outlier_detection(spark, df, list_of_cols= ['age','education-num','capital-loss','logfnl'], 
                                   detection_side="both", 
                                   detection_configs={
                                       "pctile_lower": 0.05, "pctile_upper": 0.95,
                                       "IQR_lower": 1.5, "IQR_upper": 1.5,
                                       "min_validation": 1}, 
                                   pre_existing_model=False, model_path=outputPath, print_impact=True)

+-------------+--------------+--------------+------------------------+
|attribute    |lower_outliers|upper_outliers|excluded_due_to_skewness|
+-------------+--------------+--------------+------------------------+
|education-num|1709          |988           |0                       |
|age          |1656          |1726          |0                       |
|logfnl       |677           |718           |0                       |
|capital-loss |0             |0             |1                       |
+-------------+--------------+--------------+------------------------+
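
The `min_validation` setting can be read as a vote across detection methods. The plain-Python sketch below illustrates that reading with two methods, percentile cut-offs and IQR fences; it is not the Anovos implementation and it ignores the skewness exclusion shown in the table. The helper names `outlier_bounds` and `flag_outliers`, the exact bound formulas, and the sample data are assumptions made for illustration.

```python
from statistics import quantiles


def outlier_bounds(values, pctile_lower=0.05, pctile_upper=0.95,
                   iqr_lower=1.5, iqr_upper=1.5):
    """Compute (lower, upper) bounds per method. Percentiles must lie in 0.01..0.99."""
    data = sorted(v for v in values if v is not None)
    # 99 cut points: pct[k - 1] is the k-th percentile.
    pct = quantiles(data, n=100, method="inclusive")
    q1, q3 = pct[24], pct[74]
    iqr = q3 - q1
    return {
        "pctile": (pct[round(pctile_lower * 100) - 1], pct[round(pctile_upper * 100) - 1]),
        "IQR": (q1 - iqr_lower * iqr, q3 + iqr_upper * iqr),
    }


def flag_outliers(values, bounds, min_validation=1):
    """Return -1 (lower outlier), 1 (upper outlier) or 0 for each value."""
    flags = []
    for v in values:
        if v is None:
            flags.append(0)
            continue
        low_votes = sum(v < lo for lo, _ in bounds.values())
        high_votes = sum(v > hi for _, hi in bounds.values())
        if low_votes >= min_validation:
            flags.append(-1)
        elif high_votes >= min_validation:
            flags.append(1)
        else:
            flags.append(0)
    return flags


# Hypothetical data: 0..99 plus one extreme value.
values = list(range(100)) + [1000]
bounds = outlier_bounds(values)
one_vote = flag_outliers(values, bounds, min_validation=1)
two_votes = flag_outliers(values, bounds, min_validation=2)
print(bounds)
print(one_vote.count(-1), one_vote.count(1), two_votes.count(-1), two_votes.count(1))
```

With `min_validation=1` either method is enough to flag a value, so more values are flagged; raising it to 2 keeps only the values that both methods agree on.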



In [15]:
# Example 8 - using pre-saved model
odf, odf_stats = outlier_detection(spark, df, list_of_cols= ['age','education-num','capital-loss','logfnl'], 
                                   detection_side="upper", pre_existing_model=True, 
                                   model_path=outputPath, print_impact=True)

+-------------+--------------+--------------+------------------------+
|attribute    |lower_outliers|upper_outliers|excluded_due_to_skewness|
+-------------+--------------+--------------+------------------------+
|education-num|0             |988           |0                       |
|age          |0             |1726          |0                       |
|logfnl       |0             |718           |0                       |
|capital-loss |0             |0             |1                       |
+-------------+--------------+--------------+------------------------+



In [18]:
# Example 9 - using pre-saved model and appending treated columns
# Only columns present in the saved model will be used

odf = outlier_detection(spark, idf = df, detection_side="lower", pre_existing_model=True, 
                        model_path=outputPath, output_mode="append")

print(odf.columns)
df.select(['education-num','capital-loss','logfnl']).describe().show()
odf.select(['education-num_outliered','capital-loss','logfnl_outliered']).describe().show()

['ifa', 'age', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income', 'education-num_outliered', 'age_outliered', 'logfnl_outliered']
+-------+------------------+-----------------+-------------------+
|summary|     education-num|     capital-loss|             logfnl|
+-------+------------------+-----------------+-------------------+
|  count|             32530|            32549|              12168|
|   mean|10.080971411005226| 87.3360164674798| 5.2054654851899365|
| stddev|2.5725103263986977|403.0310072565718|0.27424241727170395|
|    min|                 1|                0|        4.283617786|
|    max|                16|             4356|        6.088696941|
+-------+------------------+-----------------+-------------------+

+-------+-----------------------+-----------------+-------------------+
|summary|education-num_outliered

## IDness Detection
- API specification of function **IDness_detection** can be found <a href="https://docs.anovos.ai/api/data_analyzer/quality_checker.html">here</a>
- Supports only categorical columns

In [37]:
from anovos.data_analyzer.quality_checker import IDness_detection

In [38]:
# Example 1 - with mandatory arguments (remaining arguments use their default values)
odf, odf_stats = IDness_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,unique_values,IDness,flagged
0,ifa,32561,1.0,1
1,education-num,16,0.0005,0
2,workclass,11,0.0003,0
3,education,16,0.0005,0
4,race,9,0.0003,0
5,relationship,8,0.0002,0
6,capital-gain,119,0.0037,0
7,capital-loss,92,0.0028,0
8,income,2,0.0001,0
9,age,69,0.0021,0


In [39]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = IDness_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'], treatment_threshold=0.75)
odf_stats.toPandas()

Unnamed: 0,attribute,unique_values,IDness,flagged
0,education-num,16,0.0005,0
1,workclass,11,0.0003,0
2,education,16,0.0005,0
3,race,9,0.0003,0
4,relationship,8,0.0002,0
5,capital-gain,119,0.0037,0
6,capital-loss,92,0.0028,0
7,income,2,0.0001,0
8,age,69,0.0021,0
9,hours-per-week,89,0.0027,0


In [40]:
# Example 3 - selected categorical columns
odf, odf_stats = IDness_detection(spark, idf = df, list_of_cols= ['sex','race','workclass'])
odf_stats.toPandas()

Unnamed: 0,attribute,unique_values,IDness,flagged
0,workclass,11,0.0003,0
1,race,9,0.0003,0
2,sex,3,0.0001,0


In [41]:
# Example 4 - with treatment (column removal)
odf, odf_stats = IDness_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75, print_impact=True)

+--------------+-------------+------+-------+
|     attribute|unique_values|IDness|treated|
+--------------+-------------+------+-------+
|           ifa|        32561|   1.0|      1|
| education-num|           16|5.0E-4|      0|
|     workclass|           11|3.0E-4|      0|
|     education|           16|5.0E-4|      0|
|          race|            9|3.0E-4|      0|
|  relationship|            8|2.0E-4|      0|
|  capital-gain|          119|0.0037|      0|
|  capital-loss|           92|0.0028|      0|
|        income|            2|1.0E-4|      0|
|           age|           69|0.0021|      0|
|hours-per-week|           89|0.0027|      0|
|        fnlwgt|        21640|0.6649|      0|
|native-country|           44|0.0014|      0|
|marital-status|            7|2.0E-4|      0|
|           sex|            3|1.0E-4|      0|
|    occupation|           15|5.0E-4|      0|
|         empty|            0|  null|      0|
+--------------+-------------+------+-------+

Removed Columns:  ['ifa']


In [42]:
# Example 5 - with treatment (column removal) + pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_cardinality
from anovos.data_ingest.data_ingest import write_dataset
unique = write_dataset(measures_of_cardinality(spark, df),outputPath+"/unique","parquet", file_configs={"mode":"overwrite"})

odf, odf_stats = IDness_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75, 
                                  stats_unique={"file_path":outputPath+"/unique", "file_type": "parquet"}, print_impact=True)

+--------------+-------------+------+-------+
|     attribute|unique_values|IDness|treated|
+--------------+-------------+------+-------+
|native-country|           44|0.0014|      0|
|hours-per-week|           89|0.0027|      0|
|marital-status|            7|2.0E-4|      0|
| education-num|           16|5.0E-4|      0|
|  relationship|            8|2.0E-4|      0|
|  capital-gain|          119|0.0037|      0|
|  capital-loss|           92|0.0028|      0|
|    occupation|           15|5.0E-4|      0|
|     education|           16|5.0E-4|      0|
|     workclass|           11|3.0E-4|      0|
|        fnlwgt|        21640|0.6649|      0|
|        income|            2|1.0E-4|      0|
|          race|            9|3.0E-4|      0|
|           sex|            3|1.0E-4|      0|
|           age|           69|0.0021|      0|
|           ifa|        32561|   1.0|      1|
|         empty|            0|  null|      0|
+--------------+-------------+------+-------+

Removed Columns:  ['ifa']
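
The IDness values reported above are consistent with the ratio of distinct values to non-null rows: `ifa` has 32561 unique values and scores 1.0, while the all-null column `empty` has no score. The plain-Python sketch below illustrates that ratio and the threshold comparison on toy columns. It is an illustration under stated assumptions, not the ANOVOS implementation: the helper name `idness` is hypothetical, and the `>=` comparison against the threshold is assumed.

```python
def idness(values):
    """Ratio of distinct non-null values to non-null values (None if all values are null)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # mirrors the null IDness shown for the 'empty' column
    return len(set(non_null)) / len(non_null)


# Toy columns (hypothetical data, not taken from the notebook's dataset)
ids = ["a1", "b2", "c3", "d4"]          # every value is unique
sex = ["Male", "Female", "Male", None]  # 2 distinct values over 3 non-null rows

threshold = 0.75  # same value as treatment_threshold in the examples above
for name, col in [("ids", ids), ("sex", sex)]:
    score = idness(col)
    # Assumption: a column is flagged when its score reaches the threshold
    flagged = int(score is not None and score >= threshold)
    print(name, round(score, 4), flagged)
```

With `treatment=True`, flagged columns are the ones removed from the output dataframe, as the "Removed Columns" line above shows for `ifa`.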


## Biasedness Detection
- API specification of function **biasedness_detection** can be found <a href="https://docs.anovos.ai/api/data_analyzer/quality_checker.html">here</a>
- Supports only discrete columns (string + integer datatypes)
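
To make the `mode`, `mode_rows`, `mode_pct`, and `flagged` columns in the outputs of this section easier to read, here is a minimal plain-Python sketch of how such a summary could be derived for one column. It assumes that `mode_pct` is the mode frequency divided by the number of non-null rows and that a column is flagged when `mode_pct` reaches the threshold. The helper `mode_summary` and its default threshold of 0.8 are assumptions made for illustration, not part of the ANOVOS API.

```python
from collections import Counter


def mode_summary(values, threshold=0.8):
    """Return (mode, mode_rows, mode_pct, flagged) for one column of values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None, 0, None, 0
    # Most frequent value and how many rows carry it
    mode, mode_rows = Counter(non_null).most_common(1)[0]
    # Assumption: the share is computed over non-null rows only
    mode_pct = mode_rows / len(non_null)
    return mode, mode_rows, round(mode_pct, 4), int(mode_pct >= threshold)


# Toy column (hypothetical data): 9 of 10 non-null rows share one value
race = ["White"] * 9 + ["Black"] + [None]
print(mode_summary(race))
```

A column dominated by a single value carries little information, which is why columns above the threshold are candidates for removal when `treatment=True`.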

In [43]:
from anovos.data_analyzer.quality_checker import biasedness_detection

In [44]:
# Example 1 - with mandatory arguments only (remaining arguments use their default values)
odf, odf_stats = biasedness_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,mode,mode_rows,mode_pct,flagged
0,ifa,176a,1.0,0.0,0
1,education-num,9,10491.0,0.3225,0
2,workclass,Private,22685.0,0.6968,0
3,education,HS-grad,10490.0,0.3274,0
4,race,White,27791.0,0.8618,1
5,relationship,Husband,13185.0,0.405,0
6,capital-gain,0,29838.0,0.9167,1
7,capital-loss,0,31030.0,0.9533,1
8,income,<=50K,24720.0,0.7592,0
9,age,36,897.0,0.0276,0


In [45]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = biasedness_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'], treatment_threshold=0.75)
odf_stats.toPandas()

Unnamed: 0,attribute,mode,mode_rows,mode_pct,flagged
0,education-num,9,10491.0,0.3225,0
1,workclass,Private,22685.0,0.6968,0
2,education,HS-grad,10490.0,0.3274,0
3,race,White,27791.0,0.8618,1
4,relationship,Husband,13185.0,0.405,0
5,capital-gain,0,29838.0,0.9167,1
6,capital-loss,0,31030.0,0.9533,1
7,income,<=50K,24720.0,0.7592,1
8,age,36,897.0,0.0276,0
9,hours-per-week,40,15215.0,0.4688,0


In [46]:
# Example 3 - selected columns ('logfnl' is not a discrete column, so it is absent from the output)
odf, odf_stats = biasedness_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','logfnl'])
odf_stats.toPandas()

Unnamed: 0,attribute,mode,mode_rows,mode_pct,flagged
0,workclass,Private,22685,0.6968,0
1,race,White,27791,0.8618,1
2,age,36,897,0.0276,0
3,sex,Male,21783,0.6691,0


In [47]:
# Example 4 - with treatment (column removal)
odf, odf_stats = biasedness_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75, print_impact=True)

+--------------+------------------+---------+--------+-------+
|     attribute|              mode|mode_rows|mode_pct|treated|
+--------------+------------------+---------+--------+-------+
|           ifa|              778a|        1|     0.0|      0|
| education-num|                 9|    10491|  0.3225|      0|
|     workclass|           Private|    22685|  0.6968|      0|
|     education|           HS-grad|    10490|  0.3274|      0|
|          race|             White|    27791|  0.8618|      1|
|  relationship|           Husband|    13185|   0.405|      0|
|  capital-gain|                 0|    29838|  0.9167|      1|
|  capital-loss|                 0|    31030|  0.9533|      1|
|        income|             <=50K|    24720|  0.7592|      1|
|           age|                36|      897|  0.0276|      0|
|hours-per-week|                40|    15215|  0.4688|      0|
|        fnlwgt|            164190|       13|  4.0E-4|      0|
|native-country|     United-States|    29166|  0.8957| 

In [48]:
# Example 5 - with treatment (column removal) + pre-saved stats
from anovos.data_analyzer.stats_generator import measures_of_centralTendency
from anovos.data_ingest.data_ingest import write_dataset
mode = write_dataset(measures_of_centralTendency(spark, df),outputPath+"/mode","parquet", file_configs={"mode":"overwrite"})

odf, odf_stats = biasedness_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75,
                                      stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"}, print_impact=True)

+--------------+------------------+---------+--------+-------+
|     attribute|              mode|mode_rows|mode_pct|treated|
+--------------+------------------+---------+--------+-------+
|hours-per-week|                40|    15215|  0.4688|      0|
| education-num|                 9|    10491|  0.3225|      0|
|  capital-gain|                 0|    29838|  0.9167|      1|
|  capital-loss|                 0|    31030|  0.9533|      1|
|marital-status|Married-civ-spouse|    14957|  0.4654|      0|
|        fnlwgt|            164190|       13|  4.0E-4|      0|
|native-country|     United-States|    29166|  0.8957|      1|
|           age|                36|      897|  0.0276|      0|
|    occupation|    Prof-specialty|     4136|  0.1271|      0|
|  relationship|           Husband|    13185|   0.405|      0|
|     education|           HS-grad|    10490|  0.3274|      0|
|     workclass|           Private|    22685|  0.6968|      0|
|        income|             <=50K|    24720|  0.7592| 

## Invalid Entries Detection
- API specification of function **invalidEntries_detection** can be found <a href="https://docs.anovos.ai/api/data_analyzer/quality_checker.html">here</a>
- Supports only discrete columns (string + integer datatypes)

In [49]:
from anovos.data_analyzer.quality_checker import invalidEntries_detection

In [50]:
# Example 1 - with mandatory arguments only (remaining arguments use their default values)
odf, odf_stats = invalidEntries_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,invalid_entries,invalid_count,invalid_pct
0,hours-per-week,,0,0.0
1,race,*|?,22,0.0007
2,capital-loss,,0,0.0
3,workclass,?,1846,0.0567
4,empty,,0,0.0
5,education,?,33,0.001
6,education-num,,0,0.0
7,occupation,?,1861,0.0572
8,relationship,*|?,18,0.0006
9,marital-status,?,23,0.0007


In [51]:
# Example 2 - selected columns + auto detection (by default)
odf, odf_stats = invalidEntries_detection(spark, df, list_of_cols= ['age','sex','race','workclass','logfnl'])
odf_stats.toPandas()

Unnamed: 0,attribute,invalid_entries,invalid_count,invalid_pct
0,race,*|?,22,0.0007
1,workclass,?,1846,0.0567
2,sex,?,9,0.0003
3,age,,0,0.0
4,logfnl,,0,0.0


In [52]:
# Example 3 - manual detection: treat Self-emp-not-inc and Self-emp-inc as invalid entries
odf, odf_stats = invalidEntries_detection(spark, df, list_of_cols='workclass', detection_type="manual", 
                                          invalid_entries=["self-emp.*"], treatment_method='null_replacement')
odf_stats.toPandas()

Unnamed: 0,attribute,invalid_entries,invalid_count,invalid_pct
0,workclass,Self-emp-not-inc|Self-emp-inc,3656,0.1123
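
Example 3 passes the pattern `"self-emp.*"` and the reported invalid entries are `Self-emp-not-inc` and `Self-emp-inc`, which suggests that manual patterns behave like case-insensitive regular expressions. The sketch below reproduces that behaviour with Python's `re` module on a toy column, including replacement of the matches by null. It is an illustration only: the helper `replace_invalid` is hypothetical, and the use of a full-string, case-insensitive match is an assumption rather than a statement about how ANOVOS matches patterns internally.

```python
import re


def replace_invalid(values, invalid_patterns):
    """Replace values matching any pattern with None; return (cleaned, matched_entries)."""
    # Assumption: patterns are case-insensitive and must match the whole value
    compiled = [re.compile(p, re.IGNORECASE) for p in invalid_patterns]
    cleaned, found = [], set()
    for v in values:
        if v is not None and any(rx.fullmatch(v) for rx in compiled):
            found.add(v)
            cleaned.append(None)  # null replacement
        else:
            cleaned.append(v)
    return cleaned, sorted(found)


# Toy column (hypothetical data)
workclass = ["Private", "Self-emp-inc", "State-gov", "Self-emp-not-inc", "?"]
cleaned, found = replace_invalid(workclass, ["self-emp.*"])
print(cleaned)
print("|".join(found))
```

Note that `"?"` is left untouched here because only the manually supplied pattern is applied. Combining manual patterns with automatic detection corresponds to `detection_type="both"` in the next example.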


In [53]:
# Example 4 - manual and auto detection (both): treat only Self-emp-not-inc and Self-emp-inc as valid entries
odf, odf_stats = invalidEntries_detection(spark, df, list_of_cols='workclass', detection_type="both", 
                                          valid_entries=["self-emp.*"], treatment_method='null_replacement')
odf_stats.show(1, False)

+---------+--------------------------------------------------------------------------------------+-------------+-----------+
|attribute|invalid_entries                                                                       |invalid_count|invalid_pct|
+---------+--------------------------------------------------------------------------------------+-------------+-----------+
|workclass| State-gov|Local-gov|State-gov|Private|Without-pay|Federal-gov|Never-worked| Private|?|28902        |0.8876     |
+---------+--------------------------------------------------------------------------------------+-------------+-----------+



In [54]:
# Example 5 - with treatment (invalid entries replaced by null)
odf, odf_stats = invalidEntries_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','logfnl'], 
                                          treatment=True, print_impact=True)

df.select(['age','sex','race','workclass','logfnl']).describe().show()
odf.select(['age','sex','race','workclass','logfnl']).describe().show()

+---------+---------------+-------------+-----------+
|attribute|invalid_entries|invalid_count|invalid_pct|
+---------+---------------+-------------+-----------+
|     race|            *|?|           22|     7.0E-4|
|workclass|              ?|         1846|     0.0567|
|      sex|              ?|            9|     3.0E-4|
|      age|               |            0|        0.0|
|   logfnl|               |            0|        0.0|
+---------+---------------+-------------+-----------+

+-------+------------------+-----+-------+-----------+-------------------+
|summary|               age|  sex|   race|  workclass|             logfnl|
+-------+------------------+-----+-------+-----------+-------------------+
|  count|             32500|32557|  32247|      32558|              12168|
|   mean|38.506492307692305| null|   null|       null| 5.2054654851899365|
| stddev|13.508497735339255| null|   null|       null|0.27424241727170395|
|    min|                17|    ?|      *|    Private|        4

In [55]:
# Example 6 - with treatment (invalid entries replaced by null) + append columns

odf, odf_stats = invalidEntries_detection(spark, idf = df, list_of_cols= ['sex','race','workclass'], 
                                          treatment=True, output_mode="append", print_impact=True)

df.select(['sex','race','workclass']).describe().show()
odf.select(['sex_invalid','race_invalid','workclass_invalid']).describe().show()

+---------+---------------+-------------+-----------+
|attribute|invalid_entries|invalid_count|invalid_pct|
+---------+---------------+-------------+-----------+
|     race|            *|?|           22|     7.0E-4|
|workclass|              ?|         1846|     0.0567|
|      sex|              ?|            9|     3.0E-4|
+---------+---------------+-------------+-----------+

+-------+-----+-------+-----------+
|summary|  sex|   race|  workclass|
+-------+-----+-------+-----------+
|  count|32557|  32247|      32558|
|   mean| null|   null|       null|
| stddev| null|   null|       null|
|    min|    ?|      *|    Private|
|    max| Male|Whitess|Without-pay|
+-------+-----+-------+-----------+

+-------+-----------+------------+-----------------+
|summary|sex_invalid|race_invalid|workclass_invalid|
+-------+-----------+------------+-----------------+
|  count|      32548|       32225|            30712|
|   mean|       null|        null|             null|
| stddev|       null|        nu