<h1>ANOVOS - Quality Checker<span class="tocSkip"></span></h1>
<p>This notebook lists the functions in the "quality checker" module of the ANOVOS package and shows how each of them can be invoked.</p>
<div class="toc"><ul class="toc-item"><li><span><a href="#Row-Level-Checks" data-toc-modified-id="Row-Level-Checks-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Row Level Checks</a></span><ul class="toc-item"><li><span><a href="#Duplicate-Detection" data-toc-modified-id="Duplicate-Detection-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Duplicate Detection</a></span></li><li><span><a href="#Null-Detection-(Row-wise)" data-toc-modified-id="Null-Detection-(Row-wise)-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Null Detection (Row-wise)</a></span></li></ul></li><li><span><a href="#Column-Level-Checks" data-toc-modified-id="Column-Level-Checks-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Column Level Checks</a></span><ul class="toc-item"><li><span><a href="#Null-Detection-(Column-wise)" data-toc-modified-id="Null-Detection-(Column-wise)-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Null Detection (Column-wise)</a></span></li><li><span><a href="#Outlier-Detection" data-toc-modified-id="Outlier-Detection-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Outlier Detection</a></span></li><li><span><a href="#IDness-Detection" data-toc-modified-id="IDness-Detection-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>IDness Detection</a></span></li><li><span><a href="#Biasedness-Detection" data-toc-modified-id="Biasedness-Detection-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Biasedness Detection</a></span></li><li><span><a href="#Invalid-Entries-Detection" data-toc-modified-id="Invalid-Entries-Detection-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Invalid Entries Detection</a></span></li></ul></li></ul></div>

**Setting Spark Session**

In [1]:
from anovos.shared.spark import *

**Input/Output Path**

In [2]:
inputPath = "../data/income_dataset/csv"
outputPath = "../output/income_dataset/data_analyzer"

In [3]:
from anovos.data_ingest.data_ingest import read_dataset

In [4]:
df = read_dataset(spark, file_path = inputPath, file_type = "csv",file_configs = {"header": "True", 
                                                                           "delimiter": "," , 
                                                                           "inferSchema": "True"})
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Row Level Checks

## Duplicate Detection
- API specification of function **duplicate_detection** can be found <a href="../api_specification/anovos/data_analyzer/quality_checker.html#anovos.data_analyzer.quality_checker.duplicate_detection">here</a>

In [5]:
from anovos.data_analyzer.quality_checker import duplicate_detection

In [6]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf, odf_stats = duplicate_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,metric,value
0,rows_count,32561
1,unique_rows_count,32561


In [7]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = duplicate_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf_stats.toPandas()

Unnamed: 0,metric,value
0,rows_count,32561
1,unique_rows_count,32548


In [8]:
# Example 3 - selected columns
odf, odf_stats = duplicate_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf_stats.toPandas()

Unnamed: 0,metric,value
0,rows_count,32561
1,unique_rows_count,30601


In [9]:
# Example 4 - with treatment (Deduplication)
odf, odf_stats = duplicate_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'],treatment=True)
print(odf.count())
odf_stats.toPandas()

30601


Unnamed: 0,metric,value
0,rows_count,32561
1,unique_rows_count,30601
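
The two metrics reported above can be reproduced on a toy dataset in plain Python. This is only a sketch of the idea (count all rows, then count distinct rows over the chosen columns); `duplicate_stats` and the toy rows are hypothetical names introduced for illustration and are not part of ANOVOS.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, object]


def duplicate_stats(
    rows: Sequence[Row], cols: Optional[List[str]] = None
) -> Tuple[List[Row], Dict[str, int]]:
    """Return de-duplicated rows plus rows_count / unique_rows_count."""
    if cols is None:
        # Assumption: with no column list, every column takes part in the comparison.
        cols = sorted({c for r in rows for c in r})
    seen = set()
    unique_rows = []
    for r in rows:
        key = tuple(r.get(c) for c in cols)
        if key not in seen:
            # Keep the first occurrence of each distinct key.
            seen.add(key)
            unique_rows.append(r)
    stats = {"rows_count": len(rows), "unique_rows_count": len(unique_rows)}
    return unique_rows, stats


toy = [
    {"id": "1a", "age": 38, "sex": "Male"},
    {"id": "2a", "age": 38, "sex": "Male"},
    {"id": "3a", "age": 53, "sex": "Female"},
]
_, stats_all = duplicate_stats(toy)
deduped, stats_subset = duplicate_stats(toy, ["age", "sex"])
print(stats_all)
print(stats_subset)
```

As in the examples above, an ID-like column keeps every row unique, while restricting the comparison to a subset of columns exposes duplicates.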


## Null Detection (Row-wise)
- API specification of function **nullRows_detection** can be found <a href="../api_specification/anovos/data_analyzer/quality_checker.html#anovos.data_analyzer.quality_checker.nullRows_detection">here</a>

In [10]:
from anovos.data_analyzer.quality_checker import nullRows_detection

In [11]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf, odf_stats = nullRows_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,null_cols_count,row_count,row_pct,flagged
0,1,11641,0.3575,0
1,2,20003,0.6143,0
2,3,879,0.027,0
3,4,19,0.0006,0
4,5,12,0.0004,0
5,8,4,0.0001,0
6,9,3,0.0001,0


In [12]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = nullRows_detection(spark, idf = df, list_of_cols='all', drop_cols=['age'], treatment_threshold=0.4)
odf_stats.toPandas()

Unnamed: 0,null_cols_count,row_count,row_pct,flagged
0,1,11665,0.3583,0
1,2,20005,0.6144,0
2,3,855,0.0263,0
3,4,23,0.0007,0
4,5,6,0.0002,0
5,8,7,0.0002,1


In [13]:
# Example 3 - selected columns
odf, odf_stats = nullRows_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf_stats.toPandas()

Unnamed: 0,null_cols_count,row_count,row_pct,flagged
0,0,32181,0.9883,0
1,1,366,0.0112,0
2,2,11,0.0003,0
3,3,3,0.0001,0


In [14]:
# Example 4 - with treatment (row removal)
odf, odf_stats = nullRows_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75)
print(odf.count())
odf_stats.toPandas()

32561


Unnamed: 0,null_cols_count,row_count,row_pct,flagged
0,1,11641,0.3575,0
1,2,20003,0.6143,0
2,3,879,0.027,0
3,4,19,0.0006,0
4,5,12,0.0004,0
5,8,4,0.0001,0
6,9,3,0.0001,0
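
The shape of the statistics table above can be sketched in plain Python. The flagging rule used here (flag a row when its share of null columns exceeds `treatment_threshold`) is an assumption; it agrees with the outputs above, where the rows with 8 null columns are flagged only at a threshold of 0.4, but the library's exact comparison may differ. `null_rows_stats` and the default of 0.8 are hypothetical.

```python
from collections import Counter


def null_rows_stats(rows, cols, treatment_threshold=0.8):
    """Group rows by their number of null columns and flag the sparse ones."""
    total = len(rows)
    # Number of null columns per row, then how many rows share each number.
    counts = Counter(sum(1 for c in cols if r.get(c) is None) for r in rows)
    table = []
    for n_null in sorted(counts):
        table.append(
            {
                "null_cols_count": n_null,
                "row_count": counts[n_null],
                "row_pct": round(counts[n_null] / total, 4),
                # Assumed rule: share of null columns strictly above the threshold.
                "flagged": int(n_null / len(cols) > treatment_threshold),
            }
        )
    return table


toy_cols = ["age", "sex", "race", "workclass"]
toy_rows = [
    {"age": 38, "sex": "Male", "race": "White", "workclass": "Private"},
    {"age": None, "sex": "Male", "race": "White", "workclass": "Private"},
    {"age": None, "sex": None, "race": None, "workclass": "Private"},
    {"age": 53, "sex": "Female", "race": None, "workclass": "Private"},
]
row_table = null_rows_stats(toy_rows, toy_cols, treatment_threshold=0.5)
for line in row_table:
    print(line)
```

With treatment enabled, the flagged groups would be the rows that get removed.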


# Column Level Checks

## Null Detection (Column-wise)
- API specification of function **nullColumns_detection** can be found <a href="../api_specification/anovos/data_analyzer/quality_checker.html#anovos.data_analyzer.quality_checker.nullColumns_detection">here</a>

In [15]:
from anovos.data_analyzer.quality_checker import nullColumns_detection

In [16]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf, odf_stats = nullColumns_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,missing_count,missing_pct
0,education-num,31,0.001
1,workclass,3,0.0001
2,education,521,0.016
3,race,314,0.0096
4,relationship,4,0.0001
5,capital-gain,13,0.0004
6,capital-loss,12,0.0004
7,age,61,0.0019
8,hours-per-week,109,0.0033
9,fnlwgt,15,0.0005


In [17]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf_stats.toPandas()

Unnamed: 0,attribute,missing_count,missing_pct
0,education-num,31,0.001
1,workclass,3,0.0001
2,education,521,0.016
3,race,314,0.0096
4,relationship,4,0.0001
5,capital-gain,13,0.0004
6,capital-loss,12,0.0004
7,income,0,0.0
8,age,61,0.0019
9,hours-per-week,109,0.0033


In [18]:
# Example 3 - selected columns
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf_stats.toPandas()

Unnamed: 0,attribute,missing_count,missing_pct
0,workclass,3,0.0001
1,race,314,0.0096
2,age,61,0.0019
3,fnlwgt,15,0.0005
4,sex,4,0.0001


In [19]:
# Example 4 - with treatment (row removal)
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                                       treatment_method="row_removal", print_impact=True)

+--------------+-------------+-----------+
|     attribute|missing_count|missing_pct|
+--------------+-------------+-----------+
|           ifa|            0|        0.0|
| education-num|           31|      0.001|
|     workclass|            3|     1.0E-4|
|     education|          521|      0.016|
|          race|          314|     0.0096|
|  relationship|            4|     1.0E-4|
|  capital-gain|           13|     4.0E-4|
|  capital-loss|           12|     4.0E-4|
|        income|            0|        0.0|
|           age|           61|     0.0019|
|hours-per-week|          109|     0.0033|
|        fnlwgt|           15|     5.0E-4|
|native-country|            0|        0.0|
|marital-status|          426|     0.0131|
|           sex|            4|     1.0E-4|
|    occupation|           12|     4.0E-4|
|        logfnl|        20393|     0.6263|
+--------------+-------------+-----------+
only showing top 17 rows

Before Count: 32561
After Count: 11641


In [20]:
# Example 5 - with treatment (column removal)
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                        treatment_method="column_removal", treatment_configs={'treatment_threshold':0.5},print_impact=True)

Removed Columns:  ['logfnl', 'empty']
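
The per-column statistics and the column-removal decision can be illustrated with a small plain-Python sketch. It assumes that `missing_pct` is the missing count divided by the total row count and that a column is removed when `missing_pct` exceeds `treatment_threshold`; both helper functions are hypothetical and the library's exact comparison may differ.

```python
def null_columns_stats(rows, cols):
    """Missing count and missing share for each column."""
    total = len(rows)
    stats = []
    for c in cols:
        missing = sum(1 for r in rows if r.get(c) is None)
        stats.append(
            {
                "attribute": c,
                "missing_count": missing,
                "missing_pct": round(missing / total, 4),
            }
        )
    return stats


def columns_to_remove(stats, treatment_threshold):
    """Assumed rule: drop columns whose missing share is above the threshold."""
    return [s["attribute"] for s in stats if s["missing_pct"] > treatment_threshold]


toy_rows = [
    {"a": 1, "b": None, "c": "x"},
    {"a": 2, "b": None, "c": None},
    {"a": 3, "b": None, "c": "y"},
    {"a": 4, "b": 5, "c": "z"},
]
col_stats = null_columns_stats(toy_rows, ["a", "b", "c"])
removed = columns_to_remove(col_stats, treatment_threshold=0.5)
print(col_stats)
print("Removed Columns: ", removed)
```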


In [21]:
# Example 6 - with treatment (Median & Mode)
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"median", 
                                                    "pre_existing_model":False,"model_path":"NA",
                                                    "output_mode":"replace"},print_impact=True)

+--------------+-------------------+------------------+
|     attribute|missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|           ifa|                  0|                 0|
| education-num|                 31|                 0|
|     workclass|                  3|                 0|
|     education|                521|                 0|
|          race|                314|                 0|
|  relationship|                  4|                 0|
|  capital-gain|                 13|                 0|
|  capital-loss|                 12|                 0|
|        income|                  0|                 0|
|           age|                 61|                 0|
|hours-per-week|                109|                 0|
|        fnlwgt|                 15|                 0|
|native-country|                  0|                 0|
|marital-status|                426|                 0|
|           sex|                  4|            
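
The idea behind this kind of imputation can be sketched in plain Python: numerical columns are filled with a central value (mean or median) and categorical columns with the mode. The code below is a minimal illustration under that assumption, not the ANOVOS implementation; `fit_imputer` and `apply_imputer` are hypothetical helpers, and the `_imputed` suffix mirrors the column names that appear in the append-mode outputs of this notebook.

```python
from statistics import mean, median, mode


def fit_imputer(rows, cols, method_type="median"):
    """Learn one fill value per column from the non-null entries."""
    model = {}
    for c in cols:
        present = [r[c] for r in rows if r.get(c) is not None]
        if not present:
            continue  # nothing to learn from a fully empty column
        if all(isinstance(v, (int, float)) for v in present):
            model[c] = median(present) if method_type == "median" else mean(present)
        else:
            model[c] = mode(present)
    return model


def apply_imputer(rows, model, output_mode="replace"):
    """Fill nulls in place ("replace") or in new *_imputed columns ("append")."""
    out = []
    for r in rows:
        new = dict(r)
        for c, fill in model.items():
            target = c if output_mode == "replace" else c + "_imputed"
            new[target] = fill if r.get(c) is None else r[c]
        out.append(new)
    return out


toy_rows = [
    {"age": 38, "workclass": "Private"},
    {"age": None, "workclass": "Private"},
    {"age": 53, "workclass": None},
    {"age": 28, "workclass": "State-gov"},
]
model = fit_imputer(toy_rows, ["age", "workclass"], method_type="median")
replaced = apply_imputer(toy_rows, model, output_mode="replace")
appended = apply_imputer(toy_rows, model, output_mode="append")
print(model)
```

Separating the fitting step from the application step is what makes it possible to save the learned values once and reuse them on another dataset.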

In [22]:
# Example 7 - with treatment (Mean & Mode) and saving model
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"mean", 
                                                    "pre_existing_model":False,"model_path":outputPath,
                                                    "output_mode":"replace"},print_impact=True)

+--------------+-------------------+------------------+
|     attribute|missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|           ifa|                  0|                 0|
| education-num|                 31|                 0|
|     workclass|                  3|                 0|
|     education|                521|                 0|
|          race|                314|                 0|
|  relationship|                  4|                 0|
|  capital-gain|                 13|                 0|
|  capital-loss|                 12|                 0|
|        income|                  0|                 0|
|           age|                 61|                 0|
|hours-per-week|                109|                 0|
|        fnlwgt|                 15|                 0|
|native-country|                  0|                 0|
|marital-status|                426|                 0|
|           sex|                  4|            

In [23]:
# Example 8 - with treatment (Mean & Mode) and using pre-saved model
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"mean", 
                                                    "pre_existing_model":True,"model_path":outputPath,
                                                    "output_mode":"replace"},print_impact=True)

+--------------+-------------------+------------------+
|     attribute|missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|           ifa|                  0|                 0|
| education-num|                 31|                 0|
|     workclass|                  3|                 0|
|     education|                521|                 0|
|          race|                314|                 0|
|  relationship|                  4|                 0|
|  capital-gain|                 13|                 0|
|  capital-loss|                 12|                 0|
|        income|                  0|                 0|
|           age|                 61|                 0|
|hours-per-week|                109|                 0|
|        fnlwgt|                 15|                 0|
|native-country|                  0|                 0|
|marital-status|                426|                 0|
|           sex|                  4|            

In [24]:
# Example 9 - with treatment (Mean & Mode), using pre-saved model and appending imputed columns
odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"mean", 
                                                    "pre_existing_model":True,"model_path":outputPath,
                                                    "output_mode":"append"},print_impact=True)

+--------------+-------------------+--------------------+-------------+
|     attribute|missingCount_before|     attribute_after|missing_count|
+--------------+-------------------+--------------------+-------------+
|          race|                314|        race_imputed|            0|
|  capital-gain|                 13|capital-gain_imputed|            0|
|  capital-loss|                 12|capital-loss_imputed|            0|
|           sex|                  4|         sex_imputed|            0|
|        fnlwgt|                 15|      fnlwgt_imputed|            0|
|hours-per-week|                109|hours-per-week_im...|            0|
|    occupation|                 12|  occupation_imputed|            0|
|marital-status|                426|marital-status_im...|            0|
|           age|                 61|         age_imputed|            0|
| education-num|                 31|education-num_imp...|            0|
|     workclass|                  3|   workclass_imputed|       

In [25]:
# Example 10 - with treatment (Mean & Mode), using pre-saved model + presaved stats
from anovos.data_analyzer.stats_generator import measures_of_counts, measures_of_centralTendency, measures_of_cardinality
from anovos.data_ingest.data_ingest import write_dataset
missing = write_dataset(measures_of_counts(spark, df),outputPath+"/missing","parquet", file_configs={"mode":"overwrite"})
unique = write_dataset(measures_of_cardinality(spark, df),outputPath+"/unique","parquet", file_configs={"mode":"overwrite"})
mode = write_dataset(measures_of_centralTendency(spark, df),outputPath+"/mode","parquet", file_configs={"mode":"overwrite"})

odf, odf_stats = nullColumns_detection(spark, idf = df, list_of_cols= 'all', treatment=True, 
                     treatment_method="MMM", treatment_configs={'method_type':"mean", 
                                                    "pre_existing_model":True,"model_path":outputPath,
                                                    "output_mode":"append"},
                    stats_missing={"file_path":outputPath+"/missing", "file_type": "parquet"}, 
                    stats_unique={"file_path":outputPath+"/unique", "file_type": "parquet"}, 
                    stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"}, print_impact=True)

+--------------+-------------------+--------------------+-------------+
|     attribute|missingCount_before|     attribute_after|missing_count|
+--------------+-------------------+--------------------+-------------+
|hours-per-week|                109|hours-per-week_im...|            0|
| education-num|                 31|education-num_imp...|            0|
|  capital-gain|                 13|capital-gain_imputed|            0|
|  capital-loss|                 12|capital-loss_imputed|            0|
|        fnlwgt|                 15|      fnlwgt_imputed|            0|
|        logfnl|              20393|      logfnl_imputed|            0|
|           age|                 61|         age_imputed|            0|
|marital-status|                426|marital-status_im...|            0|
|  relationship|                  4|relationship_imputed|            0|
|    occupation|                 12|  occupation_imputed|            0|
|     workclass|                  3|   workclass_imputed|       

## Outlier Detection
- API specification of function **outlier_detection** can be found <a href="../api_specification/anovos/data_analyzer/quality_checker.html#anovos.data_analyzer.quality_checker.outlier_detection">here</a>
- Calculated only for numerical columns

In [26]:
from anovos.data_analyzer.quality_checker import outlier_detection

In [27]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf, odf_stats = outlier_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,lower_outliers,upper_outliers
0,education-num,0,0
1,hours-per-week,0,1005
2,fnlwgt,0,1105
3,capital-gain,0,1908
4,capital-loss,0,1519
5,age,0,193
6,logfnl,0,27


In [28]:
# Example 2 - 'all' columns (excluding drop_cols) + change in detection configs
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols='all', drop_cols=['age'], detection_configs={'pctile_lower': 0.02, 'pctile_upper': 0.98,
                                                                                     'stdev_lower': 3.0, 'stdev_upper': 3.5,
                                                                                     'IQR_lower': 1.75, 'IQR_upper': 2.5,
                                                                                     'min_validation': 2})
odf_stats.toPandas()

Unnamed: 0,attribute,lower_outliers,upper_outliers
0,education-num,0,0
1,hours-per-week,0,717
2,fnlwgt,0,285
3,capital-gain,0,855
4,capital-loss,0,1383
5,logfnl,0,0
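
The keys of `detection_configs` suggest three detection methods (percentile, standard deviation, and interquartile range) combined through `min_validation`. The sketch below is one reading of those keys, stated as an assumption: each method proposes a lower and an upper bound, and a value counts as an outlier when at least `min_validation` methods place it outside their bound. `percentile`, `outlier_counts`, and the config values are hypothetical, and the library's actual rule may differ.

```python
from statistics import mean, stdev


def percentile(sorted_vals, p):
    """Linear-interpolation percentile of an already sorted list (0 <= p <= 1)."""
    k = (len(sorted_vals) - 1) * p
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def outlier_counts(values, cfg):
    """Count lower/upper outliers by majority vote of three bound estimates."""
    vals = sorted(v for v in values if v is not None)
    mu, sd = mean(vals), stdev(vals)
    q1, q3 = percentile(vals, 0.25), percentile(vals, 0.75)
    iqr = q3 - q1
    lower_bounds = [
        percentile(vals, cfg["pctile_lower"]),
        mu - cfg["stdev_lower"] * sd,
        q1 - cfg["IQR_lower"] * iqr,
    ]
    upper_bounds = [
        percentile(vals, cfg["pctile_upper"]),
        mu + cfg["stdev_upper"] * sd,
        q3 + cfg["IQR_upper"] * iqr,
    ]
    need = cfg["min_validation"]
    lower = sum(1 for v in vals if sum(v < b for b in lower_bounds) >= need)
    upper = sum(1 for v in vals if sum(v > b for b in upper_bounds) >= need)
    return {"lower_outliers": lower, "upper_outliers": upper}


# Illustrative configuration; the values are chosen for the toy data only.
cfg = {
    "pctile_lower": 0.05, "pctile_upper": 0.95,
    "stdev_lower": 3.0, "stdev_upper": 3.0,
    "IQR_lower": 1.5, "IQR_upper": 1.5,
    "min_validation": 2,
}
toy_values = list(range(1, 21)) + [200]
counts = outlier_counts(toy_values, cfg)
print(counts)
```

In this toy data only the value 200 lies beyond the bound of at least two methods, so it is the single upper outlier; the smallest value is at most outside the percentile bound alone and is therefore not counted.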


In [29]:
# Example 3 - selected numerical columns
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'])
odf_stats.toPandas()

Unnamed: 0,attribute,lower_outliers,upper_outliers
0,age,0,193
1,education-num,0,0
2,capital-gain,0,1908
3,logfnl,0,27


In [30]:
# Example 4 - with treatment (row removal)
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'], 
                                   treatment=True, treatment_method="row_removal", print_impact=True)
df.select(['age','education-num','capital-gain','logfnl']).describe().show()
odf.select(['age','education-num','capital-gain','logfnl']).describe().show()

  "Columns dropped from outlier treatment due to highly skewed distribution: " + (',').join(skewed_cols))


+-------------+--------------+--------------+
|    attribute|lower_outliers|upper_outliers|
+-------------+--------------+--------------+
|          age|             0|           193|
|education-num|             0|             0|
| capital-gain|             0|          1908|
|       logfnl|             0|            27|
+-------------+--------------+--------------+

+-------+------------------+------------------+------------------+-------------------+
|summary|               age|     education-num|      capital-gain|             logfnl|
+-------+------------------+------------------+------------------+-------------------+
|  count|             32500|             32530|             32548|              12168|
|   mean|38.506492307692305|10.080971411005226|1077.6959567408135| 5.2054654851899365|
| stddev|13.508497735339255|2.5725103263986977| 7386.624857802765|0.27424241727170395|
|    min|                17|                 1|                 0|        4.283617786|
|    max|             

Treating capital-loss in the above example would result in a column with a single value (i.e., the skewness would be further aggravated).

In [31]:
# Example 5 - with treatment (null_replacement i.e. outliers are replaced by null)
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'], 
                                   treatment=True, treatment_method="null_replacement", print_impact=True)

df.select(['age','education-num','capital-gain','logfnl']).describe().show()
odf.select(['age','education-num','capital-gain','logfnl']).describe().show()

  "Columns dropped from outlier treatment due to highly skewed distribution: " + (',').join(skewed_cols))


+-------------+--------------+--------------+
|    attribute|lower_outliers|upper_outliers|
+-------------+--------------+--------------+
|          age|             0|           193|
|education-num|             0|             0|
| capital-gain|             0|          1908|
|       logfnl|             0|            27|
+-------------+--------------+--------------+

+-------+------------------+------------------+------------------+-------------------+
|summary|               age|     education-num|      capital-gain|             logfnl|
+-------+------------------+------------------+------------------+-------------------+
|  count|             32500|             32530|             32548|              12168|
|   mean|38.506492307692305|10.080971411005226|1077.6959567408135| 5.2054654851899365|
| stddev|13.508497735339255|2.5725103263986977| 7386.624857802765|0.27424241727170395|
|    min|                17|                 1|                 0|        4.283617786|
|    max|             

In [32]:
# Example 6 - with treatment (value_replacement i.e. outliers are replaced by maximum/minimum permissible value)
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'], 
                                   treatment=True, treatment_method="value_replacement", print_impact=True)

df.select(['age','education-num','capital-gain','logfnl']).describe().show()
odf.select(['age','education-num','capital-gain','logfnl']).describe().show()

+-------------+--------------+--------------+
|    attribute|lower_outliers|upper_outliers|
+-------------+--------------+--------------+
|          age|             0|           193|
|education-num|             0|             0|
| capital-gain|             0|          1908|
|       logfnl|             0|            27|
+-------------+--------------+--------------+

+-------+------------------+------------------+------------------+-------------------+
|summary|               age|     education-num|      capital-gain|             logfnl|
+-------+------------------+------------------+------------------+-------------------+
|  count|             32500|             32530|             32548|              12168|
|   mean|38.506492307692305|10.080971411005226|1077.6959567408135| 5.2054654851899365|
| stddev|13.508497735339255|2.5725103263986977| 7386.624857802765|0.27424241727170395|
|    min|                17|                 1|                 0|        4.283617786|
|    max|             
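
The three treatment methods used in the examples above differ only in what happens to a value that falls outside the permissible range. The following plain-Python sketch illustrates that difference for a single column, assuming the lower and upper bounds are already known; `treat_outliers` is a hypothetical helper, and leaving nulls untouched is a choice made for this sketch rather than documented library behaviour.

```python
def treat_outliers(values, lower, upper, treatment_method):
    """Apply one of the three treatments to a single column of values."""

    def inside(v):
        return lower <= v <= upper

    if treatment_method == "row_removal":
        # Drop the entries (rows) that hold an outlier.
        return [v for v in values if v is None or inside(v)]
    if treatment_method == "null_replacement":
        # Keep the row but blank out the outlying value.
        return [v if v is None or inside(v) else None for v in values]
    if treatment_method == "value_replacement":
        # Clip the value to the nearest permissible bound.
        return [v if v is None else min(max(v, lower), upper) for v in values]
    raise ValueError("unknown treatment_method: " + treatment_method)


ages = [17, 25, 40, 90, None]
removed = treat_outliers(ages, 18, 80, "row_removal")
nulled = treat_outliers(ages, 18, 80, "null_replacement")
clipped = treat_outliers(ages, 18, 80, "value_replacement")
print(removed, nulled, clipped)
```

Row removal shrinks the dataset, null replacement shifts the problem to missing-value handling, and value replacement keeps every row at the cost of piling values up at the bounds.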

In [33]:
# Example 7 - with treatment (value_replacement) and saving model
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-loss','logfnl'], 
                                   treatment=True, treatment_method="value_replacement", 
                                   pre_existing_model=False,model_path=outputPath, drop_cols=['ifa'],print_impact=True)

df.select(['age','education-num','capital-loss','logfnl']).describe().show()
odf.select(['age','education-num','capital-loss','logfnl']).describe().show()

  "Columns dropped from outlier treatment due to highly skewed distribution: " + (',').join(skewed_cols))


+-------------+--------------+--------------+
|    attribute|lower_outliers|upper_outliers|
+-------------+--------------+--------------+
|          age|             0|           193|
|education-num|             0|             0|
| capital-loss|             0|          1519|
|       logfnl|             0|            27|
+-------------+--------------+--------------+

+-------+------------------+------------------+-----------------+-------------------+
|summary|               age|     education-num|     capital-loss|             logfnl|
+-------+------------------+------------------+-----------------+-------------------+
|  count|             32500|             32530|            32549|              12168|
|   mean|38.506492307692305|10.080971411005226| 87.3360164674798| 5.2054654851899365|
| stddev|13.508497735339255|2.5725103263986977|403.0310072565718|0.27424241727170395|
|    min|                17|                 1|                0|        4.283617786|
|    max|                85| 

In [34]:
# Example 8 - with treatment (value_replacement) and using pre-saved model
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-loss','logfnl'], 
                                   treatment=True, treatment_method="value_replacement", 
                                   pre_existing_model=True,model_path=outputPath,print_impact=True)

df.select(['age','education-num','capital-loss','logfnl']).describe().show()
odf.select(['age','education-num','capital-loss','logfnl']).describe().show()

+-------------+--------------+--------------+
|    attribute|lower_outliers|upper_outliers|
+-------------+--------------+--------------+
|          age|             0|           193|
|education-num|             0|             0|
| capital-loss|             0|          1519|
|       logfnl|             0|            27|
+-------------+--------------+--------------+

+-------+------------------+------------------+-----------------+-------------------+
|summary|               age|     education-num|     capital-loss|             logfnl|
+-------+------------------+------------------+-----------------+-------------------+
|  count|             32500|             32530|            32549|              12168|
|   mean|38.506492307692305|10.080971411005226| 87.3360164674798| 5.2054654851899365|
| stddev|13.508497735339255|2.5725103263986977|403.0310072565718|0.27424241727170395|
|    min|                17|                 1|                0|        4.283617786|
|    max|                85| 

In [35]:
# Example 9 - with treatment (value_replacement), using pre-saved model and appending treated columns
odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-loss','logfnl'], 
                                   treatment=True, treatment_method="value_replacement", 
                                   pre_existing_model=True,model_path=outputPath,
                                   output_mode="append",print_impact=True)

df.select(['age','education-num','capital-loss','logfnl']).describe().show()
odf.select(['age_outliered','education-num_outliered','capital-loss','logfnl_outliered']).describe().show()

+-------------+--------------+--------------+
|    attribute|lower_outliers|upper_outliers|
+-------------+--------------+--------------+
|          age|             0|           193|
|education-num|             0|             0|
| capital-loss|             0|          1519|
|       logfnl|             0|            27|
+-------------+--------------+--------------+

+-------+------------------+------------------+-----------------+-------------------+
|summary|               age|     education-num|     capital-loss|             logfnl|
+-------+------------------+------------------+-----------------+-------------------+
|  count|             32500|             32530|            32549|              12168|
|   mean|38.506492307692305|10.080971411005226| 87.3360164674798| 5.2054654851899365|
| stddev|13.508497735339255|2.5725103263986977|403.0310072565718|0.27424241727170395|
|    min|                17|                 1|                0|        4.283617786|
|    max|                85| 

In [36]:
# Example 10 - with treatment (value_replacement), using pre-saved model + presaved stats
from anovos.data_analyzer.stats_generator import measures_of_cardinality
from anovos.data_ingest.data_ingest import write_dataset
unique = write_dataset(measures_of_cardinality(spark, df),outputPath+"/unique","parquet", file_configs={"mode":"overwrite"})

odf, odf_stats = outlier_detection(spark, idf = df, list_of_cols= ['age','education-num','capital-loss','logfnl'], 
                                   treatment=True, treatment_method="value_replacement", 
                                   pre_existing_model=True,model_path=outputPath,
                                   stats_unique={"file_path":outputPath+"/unique", "file_type": "parquet"},print_impact=True)

df.select(['age','education-num','capital-loss','logfnl']).describe().show()
odf.select(['age','education-num','capital-loss','logfnl']).describe().show()

+-------------+--------------+--------------+
|    attribute|lower_outliers|upper_outliers|
+-------------+--------------+--------------+
|          age|             0|           193|
|education-num|             0|             0|
| capital-loss|             0|          1519|
|       logfnl|             0|            27|
+-------------+--------------+--------------+

+-------+------------------+------------------+-----------------+-------------------+
|summary|               age|     education-num|     capital-loss|             logfnl|
+-------+------------------+------------------+-----------------+-------------------+
|  count|             32500|             32530|            32549|              12168|
|   mean|38.506492307692305|10.080971411005226| 87.3360164674798| 5.2054654851899365|
| stddev|13.508497735339255|2.5725103263986977|403.0310072565718|0.27424241727170395|
|    min|                17|                 1|                0|        4.283617786|
|    max|                85| 

## IDness Detection
- API specification of function **IDness_detection** can be found <a href="../api_specification/anovos/data_analyzer/quality_checker.html#anovos.data_analyzer.quality_checker.IDness_detection">here</a>
- Supports only categorical columns

In [37]:
from anovos.data_analyzer.quality_checker import IDness_detection

In [38]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf, odf_stats = IDness_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,unique_values,IDness,flagged
0,ifa,32561,1.0,1
1,workclass,11,0.0003,0
2,education,16,0.0005,0
3,race,9,0.0003,0
4,relationship,8,0.0002,0
5,income,2,0.0001,0
6,native-country,44,0.0014,0
7,marital-status,7,0.0002,0
8,sex,3,0.0001,0
9,occupation,15,0.0005,0


In [39]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = IDness_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'], treatment_threshold=0.75)
odf_stats.toPandas()

Unnamed: 0,attribute,unique_values,IDness,flagged
0,workclass,11,0.0003,0
1,education,16,0.0005,0
2,race,9,0.0003,0
3,relationship,8,0.0002,0
4,income,2,0.0001,0
5,native-country,44,0.0014,0
6,marital-status,7,0.0002,0
7,sex,3,0.0001,0
8,occupation,15,0.0005,0
9,empty,0,,0


In [40]:
# Example 3 - selected categorical columns
odf, odf_stats = IDness_detection(spark, idf = df, list_of_cols= ['sex','race','workclass'])
odf_stats.toPandas()

Unnamed: 0,attribute,unique_values,IDness,flagged
0,workclass,11,0.0003,0
1,race,9,0.0003,0
2,sex,3,0.0001,0


In [41]:
# Example 4 - with treatment (column removal)
odf, odf_stats = IDness_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75, print_impact=True)

+--------------+-------------+------+-------+
|     attribute|unique_values|IDness|flagged|
+--------------+-------------+------+-------+
|           ifa|        32561|   1.0|      1|
|     workclass|           11|3.0E-4|      0|
|     education|           16|5.0E-4|      0|
|          race|            9|3.0E-4|      0|
|  relationship|            8|2.0E-4|      0|
|        income|            2|1.0E-4|      0|
|native-country|           44|0.0014|      0|
|marital-status|            7|2.0E-4|      0|
|           sex|            3|1.0E-4|      0|
|    occupation|           15|5.0E-4|      0|
|         empty|            0|  null|      0|
+--------------+-------------+------+-------+

Removed Columns:  ['ifa']


In [42]:
# Example 5 - with treatment (column removal) + pre-saved statistics
from anovos.data_analyzer.stats_generator import measures_of_cardinality
from anovos.data_ingest.data_ingest import write_dataset
unique = write_dataset(measures_of_cardinality(spark, df),outputPath+"/unique","parquet", file_configs={"mode":"overwrite"})

odf, odf_stats = IDness_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75, 
                                  stats_unique={"file_path":outputPath+"/unique", "file_type": "parquet"}, print_impact=True)

+--------------+-------------+------+-------+
|     attribute|unique_values|IDness|flagged|
+--------------+-------------+------+-------+
|native-country|           44|0.0014|      0|
|marital-status|            7|2.0E-4|      0|
|  relationship|            8|2.0E-4|      0|
|    occupation|           15|5.0E-4|      0|
|     education|           16|5.0E-4|      0|
|     workclass|           11|3.0E-4|      0|
|        income|            2|1.0E-4|      0|
|          race|            9|3.0E-4|      0|
|           ifa|        32561|   1.0|      1|
|           sex|            3|1.0E-4|      0|
|         empty|            0|  null|      0|
+--------------+-------------+------+-------+

Removed Columns:  ['ifa']


## Biasedness Detection
- The API specification of the **biasedness_detection** function can be found <a href="../api_specification/anovos/data_analyzer/quality_checker.html#anovos.data_analyzer.quality_checker.biasedness_detection">here</a>
- Supports only discrete columns (string + integer datatypes)
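
To make the `mode` and `mode_pct` columns in the outputs of this section concrete, here is a minimal sketch that finds the most frequent value of a column and the share of rows it covers, then flags the column when that share reaches a threshold. It is plain Python rather than the ANOVOS implementation: the helper names (`mode_share`, `is_biased`), the toy data, the use of non-null rows as the denominator, and the inclusive `>=` comparison are all assumptions.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def mode_share(values: List[Optional[Hashable]]) -> Tuple[Optional[Hashable], Optional[float]]:
    """Return (most frequent non-null value, its share of non-null rows)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None, None
    mode, count = Counter(non_null).most_common(1)[0]
    return mode, round(count / len(non_null), 4)


def is_biased(values: List[Optional[Hashable]], threshold: float) -> int:
    """Return 1 if the mode covers at least `threshold` of the rows, else 0."""
    _, share = mode_share(values)
    return int(share is not None and share >= threshold)


# Hypothetical toy columns
toy_race = ["White"] * 9 + ["Black"]
toy_sex = ["Male", "Female", "Male", "Female", "Male"]

race_result = mode_share(toy_race)
sex_result = mode_share(toy_sex)
race_flag = is_biased(toy_race, threshold=0.75)
sex_flag = is_biased(toy_sex, threshold=0.75)
print(race_result, race_flag)
print(sex_result, sex_flag)
```

The threshold changes which columns are flagged: in the outputs of this section, `income` (mode share 0.7592) is not flagged in Example 1 but is flagged once `treatment_threshold=0.75` is passed.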

In [43]:
from anovos.data_analyzer.quality_checker import biasedness_detection

In [44]:
# Example 1 - mandatory arguments only (remaining arguments use their default values)
odf, odf_stats = biasedness_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,mode,mode_pct,flagged
0,ifa,254a,0.0,0
1,education-num,9,0.3225,0
2,workclass,Private,0.6968,0
3,education,HS-grad,0.3274,0
4,race,White,0.8618,1
5,relationship,Husband,0.405,0
6,capital-gain,0,0.9167,1
7,capital-loss,0,0.9533,1
8,income,<=50K,0.7592,0
9,age,36,0.0276,0


In [45]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = biasedness_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'], treatment_threshold=0.75)
odf_stats.toPandas()

Unnamed: 0,attribute,mode,mode_pct,flagged
0,education-num,9,0.3225,0
1,workclass,Private,0.6968,0
2,education,HS-grad,0.3274,0
3,race,White,0.8618,1
4,relationship,Husband,0.405,0
5,capital-gain,0,0.9167,1
6,capital-loss,0,0.9533,1
7,income,<=50K,0.7592,1
8,age,36,0.0276,0
9,hours-per-week,40,0.4688,0


In [46]:
# Example 3 - selected columns
odf, odf_stats = biasedness_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','logfnl'])
odf_stats.toPandas()

Unnamed: 0,attribute,mode,mode_pct,flagged
0,workclass,Private,0.6968,0
1,race,White,0.8618,1
2,age,36,0.0276,0
3,sex,Male,0.6691,0


In [47]:
# Example 4 - with treatment (column removal)
odf, odf_stats = biasedness_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75, print_impact=True)

+--------------+------------------+--------+-------+
|     attribute|              mode|mode_pct|flagged|
+--------------+------------------+--------+-------+
|           ifa|              245a|     0.0|      0|
| education-num|                 9|  0.3225|      0|
|     workclass|           Private|  0.6968|      0|
|     education|           HS-grad|  0.3274|      0|
|          race|             White|  0.8618|      1|
|  relationship|           Husband|   0.405|      0|
|  capital-gain|                 0|  0.9167|      1|
|  capital-loss|                 0|  0.9533|      1|
|        income|             <=50K|  0.7592|      1|
|           age|                36|  0.0276|      0|
|hours-per-week|                40|  0.4688|      0|
|        fnlwgt|            164190|  4.0E-4|      0|
|native-country|     United-States|  0.8957|      1|
|marital-status|Married-civ-spouse|  0.4654|      0|
|           sex|              Male|  0.6691|      0|
|    occupation|    Prof-specialty|  0.1271|  

In [48]:
# Example 5 - with treatment (column removal) + pre-saved statistics
from anovos.data_analyzer.stats_generator import measures_of_centralTendency
from anovos.data_ingest.data_ingest import write_dataset
mode = write_dataset(measures_of_centralTendency(spark, df),outputPath+"/mode","parquet", file_configs={"mode":"overwrite"})

odf, odf_stats = biasedness_detection(spark, idf = df, list_of_cols= 'all', treatment=True, treatment_threshold=0.75,
                                      stats_mode={"file_path":outputPath+"/mode", "file_type": "parquet"}, print_impact=True)

+--------------+------------------+--------+-------+
|     attribute|              mode|mode_pct|flagged|
+--------------+------------------+--------+-------+
|hours-per-week|                40|  0.4688|      0|
| education-num|                 9|  0.3225|      0|
|  capital-gain|                 0|  0.9167|      1|
|  capital-loss|                 0|  0.9533|      1|
|marital-status|Married-civ-spouse|  0.4654|      0|
|        fnlwgt|            164190|  4.0E-4|      0|
|native-country|     United-States|  0.8957|      1|
|           age|                36|  0.0276|      0|
|    occupation|    Prof-specialty|  0.1271|      0|
|  relationship|           Husband|   0.405|      0|
|     workclass|           Private|  0.6968|      0|
|     education|           HS-grad|  0.3274|      0|
|        income|             <=50K|  0.7592|      1|
|          race|             White|  0.8618|      1|
|           ifa|              385a|     0.0|      0|
|           sex|              Male|  0.6691|  

## Invalid Entries Detection
- The API specification of the **invalidEntries_detection** function can be found <a href="../api_specification/anovos/data_analyzer/quality_checker.html#anovos.data_analyzer.quality_checker.invalidEntries_detection">here</a>
- Supports only discrete columns (string + integer datatypes)
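
The idea behind the report columns (`invalid_entries`, `invalid_count`, `invalid_pct`) and the null-replacement treatment can be sketched in plain Python as follows. This is not the ANOVOS implementation: the token list `SUSPECT_TOKENS`, the helper names, and the toy data are hypothetical, and the library's actual detection rules are described in its API specification. The `_invalid` suffix follows the column names used with `output_mode="append"` in this section.

```python
from typing import Dict, List, Optional

# Hypothetical token list for illustration only; the library's own rules may differ
SUSPECT_TOKENS = {"?", "*", "-", "na", "n/a", "null", ""}


def _is_suspect(value: Optional[str]) -> bool:
    """True for non-null strings that match a suspect token after normalisation."""
    return value is not None and value.strip().lower() in SUSPECT_TOKENS


def find_invalid(values: List[Optional[str]]) -> Dict[str, object]:
    """Summarise suspect entries: distinct tokens, their count, and share of all rows."""
    hits = [v for v in values if _is_suspect(v)]
    return {
        "invalid_entries": "|".join(sorted(set(hits))),
        "invalid_count": len(hits),
        "invalid_pct": round(len(hits) / len(values), 4) if values else 0.0,
    }


def treat_invalid(values: List[Optional[str]]) -> List[Optional[str]]:
    """Replace suspect entries with None (the null treatment)."""
    return [None if _is_suspect(v) else v for v in values]


# Hypothetical toy columns
toy = {
    "race": ["White", "?", "Black", "*", "White"],
    "sex": ["Male", "Female", "?", "Male", "Male"],
}
report = {col: find_invalid(vals) for col, vals in toy.items()}

# "append" style: keep the original columns and add treated copies as <col>_invalid
appended = dict(toy)
appended.update({f"{col}_invalid": treat_invalid(vals) for col, vals in toy.items()})
print(report)
print(appended)
```

Keeping the original column next to the treated copy makes it easy to compare counts before and after treatment, which is what the `describe()` calls in Examples 4 and 5 do.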

In [49]:
from anovos.data_analyzer.quality_checker import invalidEntries_detection

In [50]:
# Example 1 - mandatory arguments only (remaining arguments use their default values)
odf, odf_stats = invalidEntries_detection(spark, df)
odf_stats.toPandas()

Unnamed: 0,attribute,invalid_entries,invalid_count,invalid_pct
0,occupation,?,1861,0.0572
1,income,,0,0.0
2,education-num,,0,0.0
3,race,*|?,22,0.0007
4,education,?,33,0.001
5,empty,,0,0.0
6,hours-per-week,,0,0.0
7,native-country,*|?,583,0.0179
8,workclass,?,1846,0.0567
9,relationship,*|?,18,0.0006


In [51]:
# Example 2 - 'all' columns (excluding drop_cols)
odf, odf_stats = invalidEntries_detection(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf_stats.toPandas()

Unnamed: 0,attribute,invalid_entries,invalid_count,invalid_pct
0,occupation,?,1861,0.0572
1,income,,0,0.0
2,education-num,,0,0.0
3,race,*|?,22,0.0007
4,education,?,33,0.001
5,empty,,0,0.0
6,hours-per-week,,0,0.0
7,native-country,*|?,583,0.0179
8,workclass,?,1846,0.0567
9,relationship,*|?,18,0.0006


In [52]:
# Example 3 - selected columns
odf, odf_stats = invalidEntries_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','logfnl'])
odf_stats.toPandas()

Unnamed: 0,attribute,invalid_entries,invalid_count,invalid_pct
0,race,*|?,22,0.0007
1,workclass,?,1846,0.0567
2,age,,0,0.0
3,sex,?,9,0.0003
4,logfnl,,0,0.0


In [53]:
# Example 4 - with treatment (invalid entries replaced by null)
odf, odf_stats = invalidEntries_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','logfnl'], 
                                          treatment=True, print_impact=True)

df.select(['age','sex','race','workclass','logfnl']).describe().show()
odf.select(['age','sex','race','workclass','logfnl']).describe().show()

+---------+---------------+-------------+-----------+
|attribute|invalid_entries|invalid_count|invalid_pct|
+---------+---------------+-------------+-----------+
|     race|            *|?|           22|     7.0E-4|
|workclass|              ?|         1846|     0.0567|
|      age|               |            0|        0.0|
|      sex|              ?|            9|     3.0E-4|
|   logfnl|               |            0|        0.0|
+---------+---------------+-------------+-----------+

+-------+------------------+-----+-------+-----------+-------------------+
|summary|               age|  sex|   race|  workclass|             logfnl|
+-------+------------------+-----+-------+-----------+-------------------+
|  count|             32500|32557|  32247|      32558|              12168|
|   mean|38.506492307692305| null|   null|       null| 5.2054654851899365|
| stddev|13.508497735339255| null|   null|       null|0.27424241727170395|
|    min|                17|    ?|      *|    Private|        4

In [54]:
# Example 5 - with treatment (invalid entries replaced by null) + append columns

odf, odf_stats = invalidEntries_detection(spark, idf = df, list_of_cols= ['age','sex','race','workclass','logfnl'], 
                                          treatment=True, output_mode="append", print_impact=True)

df.select(['age','sex','race','workclass','logfnl']).describe().show()
odf.select(['age_invalid','sex_invalid','race_invalid','workclass_invalid','logfnl_invalid']).describe().show()

+---------+---------------+-------------+-----------+
|attribute|invalid_entries|invalid_count|invalid_pct|
+---------+---------------+-------------+-----------+
|     race|            *|?|           22|     7.0E-4|
|workclass|              ?|         1846|     0.0567|
|      age|               |            0|        0.0|
|      sex|              ?|            9|     3.0E-4|
|   logfnl|               |            0|        0.0|
+---------+---------------+-------------+-----------+

+-------+------------------+-----+-------+-----------+-------------------+
|summary|               age|  sex|   race|  workclass|             logfnl|
+-------+------------------+-----+-------+-----------+-------------------+
|  count|             32500|32557|  32247|      32558|              12168|
|   mean|38.506492307692305| null|   null|       null| 5.2054654851899365|
| stddev|13.508497735339255| null|   null|       null|0.27424241727170395|
|    min|                17|    ?|      *|    Private|        4