<h1>Quick Start<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Install-ANOVOS" data-toc-modified-id="Install-ANOVOS-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Install ANOVOS</a></span></li><li><span><a href="#Feature-Engineering-using-ANOVOS" data-toc-modified-id="Feature-Engineering-using-ANOVOS-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Feature Engineering using ANOVOS</a></span><ul class="toc-item"><li><span><a href="#Set-Up-Spark-Session" data-toc-modified-id="Set-Up-Spark-Session-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Set Up Spark Session</a></span></li><li><span><a href="#Read-Input-Dataset" data-toc-modified-id="Read-Input-Dataset-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Read Input Dataset</a></span></li><li><span><a href="#Integration-with-other-datasets" data-toc-modified-id="Integration-with-other-datasets-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Integration with other datasets</a></span></li><li><span><a href="#Basic-ETL-Transformation" data-toc-modified-id="Basic-ETL-Transformation-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Basic ETL Transformation</a></span></li><li><span><a href="#Descriptive-Statistics" data-toc-modified-id="Descriptive-Statistics-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Descriptive Statistics</a></span></li><li><span><a href="#Data-Quality-Inspection-&amp;-Treatment" data-toc-modified-id="Data-Quality-Inspection-&amp;-Treatment-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Data Quality Inspection &amp; Treatment</a></span><ul class="toc-item"><li><span><a href="#Row-Quality-Checks" data-toc-modified-id="Row-Quality-Checks-2.6.1"><span class="toc-item-num">2.6.1&nbsp;&nbsp;</span>Row Quality Checks</a></span></li><li><span><a href="#Column-Quality-Checks" data-toc-modified-id="Column-Quality-Checks-2.6.2"><span class="toc-item-num">2.6.2&nbsp;&nbsp;</span>Column Quality Checks</a></span></li><li><span><a href="#Example:-Null-Detection-&amp;-Treatment-" data-toc-modified-id="Example:-Null-Detection-&amp;-Treatment--2.6.3"><span class="toc-item-num">2.6.3&nbsp;&nbsp;</span>Example: Null Detection &amp; Treatment <br></a></span></li><li><span><a href="#Example:-Outlier-Detection-&amp;-Treatment-" data-toc-modified-id="Example:-Outlier-Detection-&amp;-Treatment--2.6.4"><span class="toc-item-num">2.6.4&nbsp;&nbsp;</span>Example: Outlier Detection &amp; Treatment <br></a></span></li></ul></li><li><span><a href="#Attribute-Associations" data-toc-modified-id="Attribute-Associations-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Attribute Associations</a></span><ul class="toc-item"><li><span><a href="#Attribute-Attribute-Association" data-toc-modified-id="Attribute-Attribute-Association-2.7.1"><span class="toc-item-num">2.7.1&nbsp;&nbsp;</span>Attribute-Attribute Association</a></span></li><li><span><a href="#Attribute-Target-Association" data-toc-modified-id="Attribute-Target-Association-2.7.2"><span class="toc-item-num">2.7.2&nbsp;&nbsp;</span>Attribute-Target Association</a></span></li></ul></li></ul></li><li><span><a href="#Drift-&amp;-Stability-Analysis-using-ANOVOS" data-toc-modified-id="Drift-&amp;-Stability-Analysis-using-ANOVOS-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Drift &amp; Stability Analysis using ANOVOS</a></span><ul class="toc-item"><li><span><a href="#Drift-Statistics" data-toc-modified-id="Drift-Statistics-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Drift Statistics</a></span></li><li><span><a href="#Stability-Index" data-toc-modified-id="Stability-Index-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Stability Index</a></span></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

# Install ANOVOS
Install ANOVOS using pip:
```bash
pip3 install anovos
```

Note that examples in this tutorial will be running in local environment.

# Feature Engineering using ANOVOS
* ANOVOS is an open-source tool for feature engineering using PySpark.
* In this guide, we will use ANOVOS to perform feature engineering on the income dataset.
* Full list of supported functions and detailed examples for each module can be found in the following notebooks:
    * example_notebook_1_data_ingest.ipynb
    * example_notebook_2.1_stats_generator.ipynb
    * example_notebook_2.2_quality_checkers.ipynb
    * example_notebook_2.3_association_evaluator.ipynb
    * example_notebook_3_data_drift.ipynb
    * example_notebook_4_data_transformer.ipynb
    * example_notebook_5_data_report.ipynb

## Set Up Spark Session

In [1]:
from anovos.shared.spark import spark, sc, sqlContext
sc.getConf().get("spark.jars.packages")

'io.github.histogrammar:histogrammar_2.11:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.11:1.0.20,org.apache.spark:spark-avro_2.11:2.4.0'

## Read Input Dataset
* Function read_dataset (from anovos.data_ingest.data_ingest) can be used to read data. It supports csv, parquet and avro file types.

In [2]:
inputPath = "../data/income_dataset/csv"

In [3]:
from anovos.data_ingest.data_ingest import read_dataset
df = read_dataset(spark, file_path = inputPath, file_type = "csv", file_configs = {"header": "True", 
                                                                           "delimiter": "," , 
                                                                           "inferSchema": "True"})
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


## Integration with other datasets
* Supports concatenation and joining of the multiple datasets
* In below example, we will use concatenate our input dataset and another dataset of same schema
    * In case there is a mismatch in the column names, it is recommended to rename the columns before concatenation (as illustrated in Step 4 - ETL Transformation), unless the column positioning are exactly same. In this, concatenation can be done by method "index".
* Refer **example_notebook_1_data_ingest.ipynb** for the full list of supported functions and detailed examples

In [4]:
from anovos.data_ingest.data_ingest import concatenate_dataset

concatPath = "../data/income_dataset/parquet"
df_concat = read_dataset(spark, concatPath,"parquet")
df = concatenate_dataset(df, df_concat, method_type ="name")

## Basic ETL Transformation
* Supports selecting, deleting, renaming and recasting (i.e. editing datatype) columns
* In below example, we will use function rename_column (from anovos.data_ingest.data_ingest) to rename selected columns.
    * We will replace all hyphens in column names by underscores.
* Refer **example_notebook_1_data_ingest.ipynb** for the full list of supported functions and detailed examples

In [5]:
from anovos.data_ingest.data_ingest import rename_column
df = rename_column(idf=df, list_of_cols=['education-num','marital-status','capital-gain', 'capital-loss', 'hours-per-week', 'native-country'], 
                           list_of_newcols=['education_num','marital_status','capital_gain', 'capital_loss', 'hours_per_week', 'native_country'])
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


## Descriptive Statistics
* Descriptive statistics are categorized into different metric types - global_summary, measures_of_counts, measures_of_centralTendency, measures_of_cardinality, measures_of_dispersion, measures_of_percentiles, measures_of_shape.
* Refer **example_notebook_2.1_stats_generator.ipynb** for the full list of supported functions and detailed examples.

In [6]:
from anovos.data_analyzer.stats_generator import global_summary, measures_of_counts, measures_of_centralTendency, measures_of_cardinality, measures_of_dispersion, measures_of_percentiles, measures_of_shape

for func in [global_summary, measures_of_counts, measures_of_centralTendency, measures_of_cardinality, measures_of_dispersion, measures_of_percentiles, measures_of_shape]:
    print(func.__name__,":\n")
    stats = func(spark, df)
    stats.show()

global_summary :

No. of Rows: 65,122
No. of Columns: 18
Numerical Columns: 7
['capital_loss', 'capital_gain', 'fnlwgt', 'age', 'logfnl', 'hours_per_week', 'education_num']
Categorical Columns: 11
['relationship', 'income', 'native_country', 'sex', 'ifa', 'empty', 'education', 'marital_status', 'race', 'workclass', 'occupation']
+---------------+--------------------+
|         metric|               value|
+---------------+--------------------+
|     rows_count|               65122|
|  columns_count|                  18|
|  numcols_count|                   7|
|   numcols_name|capital_loss, cap...|
|  catcols_count|                  11|
|   catcols_name|relationship, inc...|
|othercols_count|                   0|
| othercols_name|                    |
+---------------+--------------------+

measures_of_counts :

+--------------+----------+--------+-------------+-----------+-------------+-----------+
|     attribute|fill_count|fill_pct|missing_count|missing_pct|nonzero_count|nonzero_pct|


## Data Quality Inspection & Treatment

* Data quality can be checked at both row level and column level. For each identified quality issue, treatment option(s) are available to fix them.
* These functions can be used in 3 ways:
    1. Only detection
    2. detection, followed by treatment
    3. Both detection & treatment together <br>
    Depending upon the requirement and the degree of familarity with the data, one of the above approaches can be taken. In examples 6A & 6B, we are taking the first approach i.e. only detecting the quality issues. Further in this section, we will demonstrate the second approach where we will analyze the statistics first (detection) and then will do treatment based on the statistics.
* Refer **example_notebook_2.2_quality_checkers.ipynb** for the full list of supported functions and detailed examples.

### Row Quality Checks
At row level, we do duplicate detection and null detection (i.e. %columns missing for a row) checks.

In [7]:
from anovos.data_analyzer.quality_checker import duplicate_detection, nullRows_detection

for func in [duplicate_detection, nullRows_detection]:
    print(func.__name__,":\n")
    df, stats = func(spark, df)
    stats.show()

duplicate_detection :

+-----------------+-----+
|           metric|value|
+-----------------+-----+
|       rows_count|65122|
|unique_rows_count|32561|
+-----------------+-----+

nullRows_detection :

+---------------+---------+-------+-------+
|null_cols_count|row_count|row_pct|flagged|
+---------------+---------+-------+-------+
|              1|    23282| 0.3575|      0|
|              2|    40006| 0.6143|      0|
|              3|     1758|  0.027|      0|
|              4|       38| 6.0E-4|      0|
|              5|       24| 4.0E-4|      0|
|              8|        8| 1.0E-4|      0|
|              9|        6| 1.0E-4|      0|
+---------------+---------+-------+-------+



### Column Quality Checks
At column level, we do the following checks:
* Null Detection (i.e. %rows missing for a column)
* Outlier Detection
* IDness Detection
* Biasedness Detection
* Invalid Entries Detection

In [8]:
from anovos.data_analyzer.quality_checker import nullColumns_detection, outlier_detection, IDness_detection, biasedness_detection, invalidEntries_detection

for func in [nullColumns_detection, outlier_detection, IDness_detection, biasedness_detection, invalidEntries_detection]:
    print(func.__name__,":\n")
    df, stats = func(spark, df)
    stats.show()

nullColumns_detection :

+--------------+-------------+-----------+
|     attribute|missing_count|missing_pct|
+--------------+-------------+-----------+
|     workclass|            6|     1.0E-4|
|     education|         1042|      0.016|
|          race|          628|     0.0096|
|  relationship|            8|     1.0E-4|
|           age|          122|     0.0019|
| education_num|           62|      0.001|
|        fnlwgt|           30|     5.0E-4|
|hours_per_week|          218|     0.0033|
|marital_status|          852|     0.0131|
|           sex|            8|     1.0E-4|
|    occupation|           24|     4.0E-4|
|  capital_loss|           24|     4.0E-4|
|  capital_gain|           26|     4.0E-4|
|        logfnl|        40786|     0.6263|
|         empty|        65122|        1.0|
+--------------+-------------+-----------+

outlier_detection :

+--------------+--------------+--------------+
|     attribute|lower_outliers|upper_outliers|
+--------------+--------------+-----------


<!-- * anovos.data_analyzer.stats_generator generates descriptive statistics related to the ingested data.
* Other available attribute-wise measures include: 
    * Measures of central tendency (mean, median, mode)
    * Measures of cardinality (count of unique values)
    * Measures of dispersion (stddev, variance, coefficient of variance, IQR, range)
    * Measures of shape (skewness, kurtosis)
* Detailed examples can be found in notebook: anovos_examples_stats_generator.ipynb -->

### Example: Null Detection & Treatment <br>
**1. Check the fill rate for each attribute.** <br>
* Using nullColumn_detection to detect missing values for each column
* According to the result shown below, missing_pct for columns logfnl and empty are greater 0.5 indicating that more than half of their values are null. With this acquired information, we decide to remove columns with high missing values.

In [9]:
from anovos.data_analyzer.quality_checker import nullColumns_detection
df, df_stats = nullColumns_detection(spark, df)
df_stats.toPandas()

Unnamed: 0,attribute,missing_count,missing_pct
0,workclass,6,0.0001
1,education,1042,0.016
2,race,628,0.0096
3,relationship,8,0.0001
4,age,122,0.0019
5,education_num,62,0.001
6,fnlwgt,30,0.0005
7,hours_per_week,218,0.0033
8,marital_status,852,0.0131
9,sex,8,0.0001


**2. Remove attributes with high missing values.** <br>
* We will use nullColumns_detection function to remove columns with more than 50% of null values.
* According to the result shown below, 2 columns: logfnl and empty are removed.

In [10]:
from anovos.data_analyzer.quality_checker import nullColumns_detection
df, df_stats = nullColumns_detection(spark, idf = df, treatment=True, treatment_method='column_removal', 
                                                 treatment_configs={'treatment_threshold':0.5}, print_impact=True)

Removed Columns:  ['logfnl', 'empty']


**3. Imputation for the rest of null values**
* imputation_MMM (from anovos.data_transformer.transformers) can be used for missing value imputation.
* By default, median is used for numerical attributes and mode is used for categorical attributes.
* Alternate approach is to remove rows with missing values using treatment_method = "row_removal"

In [11]:
df, df_stats = nullColumns_detection(spark, idf = df, treatment=True, treatment_method="MMM",print_impact=True)

+--------------+-------------------+------------------+
|     attribute|missingCount_before|missingCount_after|
+--------------+-------------------+------------------+
|     workclass|                  6|                 0|
|     education|               1042|                 0|
|          race|                628|                 0|
|  relationship|                  8|                 0|
|           age|                122|                 0|
| education_num|                 62|                 0|
|        fnlwgt|                 30|                 0|
|hours_per_week|                218|                 0|
|marital_status|                852|                 0|
|           sex|                  8|                 0|
|    occupation|                 24|                 0|
|  capital_loss|                 24|                 0|
|  capital_gain|                 26|                 0|
+--------------+-------------------+------------------+



### Example: Outlier Detection & Treatment <br>

* Function outlier_detection (from anovos.data_analyzer.quality_checker) can be used to detect and treat outliers for numerical columns.
* Outlier values can be replaced by null so that it can be imputed by a reliable imputation methodology (null_replacement). It can also be replaced by maximum or minimum permissible value for an attribute (value_replacement). Lastly, rows can be removed if it is identified with any outlier (row_removal). In this example, we will show outlier treatment by value_replacement.
* (Exception) Outlier treatment for highly skewed attributes may result in a column with all null values or single value. Such highly skewed columns are highlighted and dropped from the treatment, as was the case with capital_loss column in below example.

In [12]:
from anovos.shared.utils import attributeType_segregation
num_cols = attributeType_segregation(df)[0]
df.select(num_cols).describe().toPandas()

Unnamed: 0,summary,age,fnlwgt,capital_loss,hours_per_week,capital_gain,education_num
0,count,65122.0,65122.0,65122.0,65122.0,65122.0,65122.0
1,mean,38.50367003470409,189776.50327078407,87.303829734959,40.248886704953776,1077.2656859433064,10.080894321427474
2,stddev,13.495891564630757,105538.22737200318,402.9571247026128,11.894295761101466,7385.124786724627,2.571266876312627
3,min,17.0,12285.0,0.0,1.0,0.0,1.0
4,max,85.0,1484705.0,4356.0,94.0,99999.0,16.0


In [13]:
from anovos.data_analyzer.quality_checker import outlier_detection

df, df_stats = outlier_detection(spark, idf = df, treatment=True, treatment_method="value_replacement", print_impact=True)
df.select(num_cols).describe().toPandas()

  "Columns dropped from outlier treatment due to highly skewed distribution: " + (',').join(skewed_cols))


+--------------+--------------+--------------+
|     attribute|lower_outliers|upper_outliers|
+--------------+--------------+--------------+
|  capital_loss|             0|          3038|
|  capital_gain|             0|          3816|
|        fnlwgt|             0|          2160|
|           age|             0|           386|
|hours_per_week|             0|          2010|
| education_num|             0|             0|
+--------------+--------------+--------------+



Unnamed: 0,summary,age,fnlwgt,capital_loss,hours_per_week,capital_gain,education_num
0,count,65122.0,65122.0,65122.0,65122.0,65122.0,65122.0
1,mean,38.48381499339701,186638.81956942356,87.303829734959,39.8794877307208,292.754921531894,10.080894321427474
2,stddev,13.437477144264315,94667.73725774506,402.9571247026128,10.991152220014046,996.6889191376254,2.571266876312627
3,min,17.0,12285.0,0.0,1.0,0.0,1.0
4,max,75.5,409916.5,4356.0,60.0,3908.0,16.0


## Attribute Associations

* We can quantitatively measure the degree of relationship between different attributes (via correlation, variable clustering) and/or between an attribute & the binary target variable (via Information Gain, Information Value).
* Refer **example_notebook_2.3_association_evaluator.ipynb** for the full list of supported functions and detailed examples.

### Attribute-Attribute Association
It is recommended to remove ID column such as 'ifa' from the analysis.

In [14]:
from anovos.data_analyzer.association_evaluator import correlation_matrix

corr = correlation_matrix(spark, df, drop_cols = 'ifa')
corr.toPandas()

2021-11-05 09:23:17,220 INFO [histogram_filler_base]: Filling 120 specified histograms. auto-binning.
100%|█████████████████████████████████████████████████████████████| 120/120 [00:07<00:00, 17.08it/s]


Unnamed: 0,age,capital_gain,capital_loss,education,education_num,fnlwgt,hours_per_week,income,marital_status,native_country,occupation,race,relationship,sex,workclass,attribute
0,1.0,0.180816,0.132076,0.316594,0.317617,0.121418,0.461348,0.365258,0.559902,0.134247,0.308361,0.067485,0.513414,0.172005,0.289748,age
1,0.180816,1.0,0.022663,0.155519,0.158197,0.05818,0.135023,0.387702,0.174997,0.107153,0.136402,0.01741,0.179372,0.114395,0.120407,capital_gain
2,0.132076,0.022663,1.0,0.118881,0.119101,0.051659,0.082616,0.237558,0.14234,0.09776,0.09036,0.003155,0.140909,0.095004,0.068748,capital_loss
3,0.316594,0.155519,0.118881,1.0,0.999785,0.094612,0.22172,0.465532,0.613498,0.419918,0.490598,0.252035,0.456012,0.144319,0.228391,education
4,0.317617,0.158197,0.119101,0.999785,1.0,0.094117,0.222831,0.467338,0.191516,0.427727,0.492333,0.122495,0.283807,0.124135,0.230485,education_num
5,0.121418,0.05818,0.051659,0.094612,0.094117,1.0,0.070784,0.060557,0.075357,0.218384,0.103932,0.191767,0.070877,0.062808,0.0918,fnlwgt
6,0.461348,0.135023,0.082616,0.22172,0.222831,0.070784,1.0,0.310803,0.270612,0.127192,0.351546,0.115591,0.328746,0.333822,0.28735,hours_per_week
7,0.365258,0.387702,0.237558,0.465532,0.467338,0.060557,0.310803,1.0,0.408186,0.120112,0.385544,0.096856,0.599361,0.13073,0.187006,income
8,0.559902,0.174997,0.14234,0.613498,0.191516,0.075357,0.270612,0.408186,1.0,0.097251,0.282907,0.23634,0.707997,0.433531,0.170804,marital_status
9,0.134247,0.107153,0.09776,0.419918,0.427727,0.218384,0.127192,0.120112,0.097251,1.0,0.236265,0.640655,0.181705,0.103353,0.479582,native_country


In [15]:
from anovos.data_analyzer.association_evaluator import variable_clustering

vc = variable_clustering(spark, df, drop_cols = 'ifa')
vc.toPandas()

Unnamed: 0,Cluster,Attribute,RS_Ratio
0,0,relationship,0.3577
1,0,sex,0.3462
2,0,marital_status,0.5203
3,0,hours_per_week,0.8109
4,1,capital_loss,0.9171
5,1,occupation,0.4071
6,1,education_num,0.3979
7,1,fnlwgt,0.9845
8,2,native_country,0.3733
9,2,race,0.3776


### Attribute-Target Association
* Supports only binary target variable.
* In below example, we will calculation Information Value (IV) for each attribute against the target variable. 
    * IV measures how well an attribute is able to distinguish between a binary target variable i.e. label 0 from label 1, and hence helps in ranking attributes on the basis of their importance. 
    * Similar analysis can be done using Information Gain (IG).

In [16]:
from anovos.data_analyzer.association_evaluator import IV_calculation

IV = IV_calculation(spark, df, label_col="income", event_label=">50K")
IV.toPandas()

Unnamed: 0,attribute,iv
0,relationship,1.5348
1,marital_status,1.2958
2,age,1.0735
3,occupation,0.7772
4,education,0.7228
5,education_num,0.6984
6,hours_per_week,0.456
7,capital_gain,0.3138
8,sex,0.3036
9,workclass,0.162


# Drift & Stability Analysis using ANOVOS
Examines the dataset stability wrt the baseline dataset (via computing drift statistics) and/or wrt the historical datasets (via computing stability index).

Refer **example_notebook_3_data_drift.ipynb** for the full list of supported functions and detailed examples.

## Drift Statistics
* Drift analysis is primarily done to measure the change in distribution of the baseline dataset on which the model is trained (source distribution) and the input data (target distribution) on which prediction is to be made.
* Change in distributions is measured by drift statistics such as Population Stability Index (PSI), Jensen-Shannon Divergence (JS), Kolmogorov-Smirnov (KS), Hellinger Distance (HD). In default settings, Only PSI is calculated.
* List of columns for drift analysis must exists in the baseline dataset.

In [17]:
targetPath = "../data/income_dataset/csv"
sourcePath  = "../data/income_dataset/source"

In [18]:
from anovos.data_ingest.data_ingest import read_dataset

In [19]:
df_source = read_dataset(spark, file_path = sourcePath, file_type = "csv", file_configs = {"header": "True", 
                                                                           "delimiter": "," , 
                                                                           "inferSchema": "True"})
df_target = read_dataset(spark, file_path = targetPath, file_type = "csv", file_configs = {"header": "True", 
                                                                           "delimiter": "," , 
                                                                           "inferSchema": "True"})

In [20]:
from anovos.data_drift.drift_detector import drift_statistics
drift_statistics(spark, df_target, df_source, method_type="all").toPandas()

Unnamed: 0,attribute,PSI,JSD,HD,KS,flagged
0,ifa,1.6802,0.2001,0.4527,1.6295,1
1,marital-status,0.0002,0.0,0.0045,0.0024,0
2,capital-loss,0.0001,0.0,0.0029,0.0003,0
3,relationship,0.0002,0.0,0.0044,0.0042,0
4,capital-gain,0.0001,0.0,0.0038,0.0008,0
5,income,0.0,0.0,0.0021,0.0025,0
6,hours-per-week,0.0002,0.0,0.0052,0.0015,0
7,sex,0.0,0.0,0.0016,0.0019,0
8,native-country,0.0021,0.0003,0.0162,0.0023,0
9,fnlwgt,0.0001,0.0,0.0031,0.001,0


## Stability Index
* Summarise the stability of NUMERICAL attribute over multiple time periods by calculating its Stability Index. 
* Interpretation of Stability Index: <br>
    0-1: Very Unstable <br>
    1-2: Unstable <br>
    2-3: Marginally Stable <br>
    3-3.5: Stable <br>
    3.5-4: Very Stable
* List of columns for stability analysis must exists in the all datasets.

In [21]:
from anovos.data_ingest.data_ingest import read_dataset

df1 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/1", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})
df2 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/2", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})
df3 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/3", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})
df4 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/4", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})
df5 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/5", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})
df6 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/6", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})
df7 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/7", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})
df8 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/8", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})
df9 = read_dataset(spark, file_path = "../data/income_dataset/stability_index/9", file_type = "csv", file_configs = {"header": "True", "delimiter": "," ,"inferSchema": "True"})

In [22]:
from anovos.data_drift.drift_detector import stabilityIndex_computation
stabilityIndex_computation(spark, df1,df2,df3,df4,df5,df6,df7,df7,df9).toPandas()

Unnamed: 0,attribute,mean_cv,stddev_cv,kurtosis_cv,mean_si,stddev_si,kurtosis_si,stability_index,flagged
0,capital_loss,0.3217,0.1766,0.4372,1,2,1,1.3,-
1,capital_gain,0.4413,0.2598,0.5383,1,1,0,0.8,-
2,gender_label,0.1307,0.0297,0.1631,2,4,2,2.6,-
3,fnlwgt,0.0429,0.062,0.2494,3,3,1,2.6,-
4,age,0.2324,0.1943,0.0719,1,2,3,1.7,-
5,income_label,0.521,0.2871,1.6823,0,1,0,0.3,-
6,hours_per_week,0.0524,0.0462,0.1059,3,3,2,2.8,-
7,logfnl,0.0034,0.0319,0.0373,4,3,3,3.5,-
8,education_num,0.0224,0.0791,0.111,4,3,2,3.3,-


# Summary
Above data analysis and preprocessing steps are commonly seen in machine learning model training. ANOVOS attempts to simplify the steps required and save the time spent on data wrangling and feature engineering.