<h1>ANOVOS - Statistic Generator<span class="tocSkip"></span></h1>
<p>This notebook lists the functions in the "stats generator" module of the ANOVOS package and shows how to invoke each of them.</p>
<div class="toc"><ul class="toc-item"><li><span><a href="#Global-Summary" data-toc-modified-id="Global-Summary-1">Global Summary</a></span></li><li><span><a href="#Measures-of-Counts" data-toc-modified-id="Measures-of-Counts-2">Measures of Counts</a></span></li><li><span><a href="#Measures-of-Central-Tendency" data-toc-modified-id="Measures-of-Central-Tendency-3">Measures of Central Tendency</a></span></li><li><span><a href="#Measures-of-Cardinality" data-toc-modified-id="Measures-of-Cardinality-4">Measures of Cardinality</a></span></li><li><span><a href="#Measures-of-Dispersion" data-toc-modified-id="Measures-of-Dispersion-5">Measures of Dispersion</a></span></li><li><span><a href="#Measures-of-Percentiles" data-toc-modified-id="Measures-of-Percentiles-6">Measures of Percentiles</a></span></li><li><span><a href="#Measures-of-Shape" data-toc-modified-id="Measures-of-Shape-7">Measures of Shape</a></span></li></ul></div>

**Setting Spark Session**

In [1]:
from anovos.shared.spark import *

**Input/Output Path**

In [2]:
inputPath = "../data/income_dataset/csv"
outputPath = "../output/income_dataset/data_analyzer"

In [3]:
from anovos.data_ingest.data_ingest import read_dataset

In [4]:
df = read_dataset(spark, file_path=inputPath, file_type="csv",
                  file_configs={"header": "True",
                                "delimiter": ",",
                                "inferSchema": "True"})
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Global Summary
- API specification of function **global_summary** can be found <a href="../api_specification/anovos/data_analyzer/stats_generator.html#anovos.data_analyzer.stats_generator.global_summary">here</a>

In [5]:
from anovos.data_analyzer.stats_generator import global_summary

In [6]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf = global_summary(spark, df)
odf.toPandas()

No. of Rows: 32,561
No. of Columns: 18
Numerical Columns: 7
['fnlwgt', 'hours-per-week', 'capital-loss', 'education-num', 'logfnl', 'capital-gain', 'age']
Categorical Columns: 11
['workclass', 'race', 'relationship', 'native-country', 'income', 'sex', 'occupation', 'education', 'ifa', 'marital-status', 'empty']


Unnamed: 0,metric,value
0,rows_count,32561
1,columns_count,18
2,numcols_count,7
3,numcols_name,"fnlwgt, hours-per-week, capital-loss, educatio..."
4,catcols_count,11
5,catcols_name,"workclass, race, relationship, native-country,..."
6,othercols_count,0
7,othercols_name,


In [7]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = global_summary(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf.toPandas()

No. of Rows: 32,561
No. of Columns: 17
Numerical Columns: 7
['fnlwgt', 'hours-per-week', 'capital-loss', 'education-num', 'logfnl', 'capital-gain', 'age']
Categorical Columns: 10
['workclass', 'race', 'relationship', 'native-country', 'income', 'sex', 'occupation', 'education', 'marital-status', 'empty']


Unnamed: 0,metric,value
0,rows_count,32561
1,columns_count,17
2,numcols_count,7
3,numcols_name,"fnlwgt, hours-per-week, capital-loss, educatio..."
4,catcols_count,10
5,catcols_name,"workclass, race, relationship, native-country,..."
6,othercols_count,0
7,othercols_name,


In [8]:
# Example 3 - selected columns
odf = global_summary(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf.toPandas()

No. of Rows: 32,561
No. of Columns: 5
Numerical Columns: 2
['fnlwgt', 'age']
Categorical Columns: 3
['workclass', 'race', 'sex']


Unnamed: 0,metric,value
0,rows_count,32561
1,columns_count,5
2,numcols_count,2
3,numcols_name,"fnlwgt, age"
4,catcols_count,3
5,catcols_name,"workclass, race, sex"
6,othercols_count,0
7,othercols_name,


# Measures of Counts

- API specification of function **measures_of_counts** can be found <a href="../api_specification/anovos/data_analyzer/stats_generator.html#anovos.data_analyzer.stats_generator.measures_of_counts">here</a>
- Non-zero count and percentage are calculated only for numerical columns
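
To make the output columns concrete, the following plain-Python sketch computes the same six measures for a single column. It is only an illustration, not the ANOVOS implementation: the helper `count_measures` and the sample list are hypothetical, and using the total row count as the denominator is an assumption inferred from the sample output in this notebook (for example, 2710 / 32561 ≈ 0.0832 for capital-gain).

```python
def count_measures(values):
    """Return fill, missing, and non-zero counts and ratios for one column.

    Assumption: None marks a missing value, and every ratio is taken
    relative to the total number of rows.
    """
    total = len(values)
    filled = [v for v in values if v is not None]
    fill_count = len(filled)
    missing_count = total - fill_count
    # Non-zero count only makes sense for numerical columns.
    nonzero_count = sum(1 for v in filled if v != 0)
    return {
        "fill_count": fill_count,
        "fill_pct": round(fill_count / total, 4),
        "missing_count": missing_count,
        "missing_pct": round(missing_count / total, 4),
        "nonzero_count": nonzero_count,
        "nonzero_pct": round(nonzero_count / total, 4),
    }


# Hypothetical numerical column with two missing values and three zeros.
sample = [0, 5, None, 0, 12, 0, None, 3]
result = count_measures(sample)
print(result)
```

For a categorical column, the non-zero fields would simply be left empty, which is what the sample outputs in this section show.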

In [9]:
from anovos.data_analyzer.stats_generator import measures_of_counts, nonzeroCount_computation

In [10]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf = measures_of_counts(spark, df)
odf.toPandas()

Unnamed: 0,attribute,fill_count,fill_pct,missing_count,missing_pct,nonzero_count,nonzero_pct
0,ifa,32561,1.0,0,0.0,,
1,education-num,32530,0.999,31,0.001,32530.0,0.999
2,workclass,32558,0.9999,3,0.0001,,
3,education,32040,0.984,521,0.016,,
4,race,32247,0.9904,314,0.0096,,
5,relationship,32557,0.9999,4,0.0001,,
6,capital-gain,32548,0.9996,13,0.0004,2710.0,0.0832
7,capital-loss,32549,0.9996,12,0.0004,1519.0,0.0467
8,income,32561,1.0,0,0.0,,
9,age,32500,0.9981,61,0.0019,32500.0,0.9981


In [11]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = measures_of_counts(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf.toPandas()

Unnamed: 0,attribute,fill_count,fill_pct,missing_count,missing_pct,nonzero_count,nonzero_pct
0,education-num,32530,0.999,31,0.001,32530.0,0.999
1,workclass,32558,0.9999,3,0.0001,,
2,education,32040,0.984,521,0.016,,
3,race,32247,0.9904,314,0.0096,,
4,relationship,32557,0.9999,4,0.0001,,
5,capital-gain,32548,0.9996,13,0.0004,2710.0,0.0832
6,capital-loss,32549,0.9996,12,0.0004,1519.0,0.0467
7,income,32561,1.0,0,0.0,,
8,age,32500,0.9981,61,0.0019,32500.0,0.9981
9,hours-per-week,32452,0.9967,109,0.0033,32452.0,0.9967


In [12]:
# Example 3 - selected columns
odf = measures_of_counts(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf.toPandas()

Unnamed: 0,attribute,fill_count,fill_pct,missing_count,missing_pct,nonzero_count,nonzero_pct
0,workclass,32558,0.9999,3,0.0001,,
1,race,32247,0.9904,314,0.0096,,
2,age,32500,0.9981,61,0.0019,32500.0,0.9981
3,fnlwgt,32546,0.9995,15,0.0005,32546.0,0.9995
4,sex,32557,0.9999,4,0.0001,,


In [13]:
# Example 4 - only numerical columns
odf = measures_of_counts(spark, idf = df, list_of_cols= ['age','education-num','capital-gain'])
odf.toPandas()

Unnamed: 0,attribute,fill_count,fill_pct,missing_count,missing_pct,nonzero_count,nonzero_pct
0,education-num,32530,0.999,31,0.001,32530,0.999
1,capital-gain,32548,0.9996,13,0.0004,2710,0.0832
2,age,32500,0.9981,61,0.0019,32500,0.9981


In [14]:
# Example 5 - only categorical columns (a user warning is shown because the non-zero computation is skipped when no numerical column is present)
odf = measures_of_counts(spark, idf = df, list_of_cols= ['sex','race','workclass'])
odf.toPandas()



Unnamed: 0,attribute,fill_count,fill_pct,missing_count,missing_pct,nonzero_count,nonzero_pct
0,workclass,32558,0.9999,3,0.0001,,
1,race,32247,0.9904,314,0.0096,,
2,sex,32557,0.9999,4,0.0001,,


# Measures of Central Tendency

- API specification of function **measures_of_centralTendency** can be found <a href="../api_specification/anovos/data_analyzer/stats_generator.html#anovos.data_analyzer.stats_generator.measures_of_centralTendency">here</a>
- Mode and mode percentage are calculated only for discrete columns (string and integer datatypes)
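
As a rough illustration of what these measures mean, here is a minimal plain-Python sketch for one discrete column. The helper `central_tendency` and the sample data are hypothetical and not part of ANOVOS. Dividing the mode count by the non-null count is an assumption that appears consistent with the sample output (capital-gain has 32548 − 2710 = 29838 zeros, and 29838 / 32548 ≈ 0.9167). A distributed engine may also compute the median approximately, so exact values can differ.

```python
from collections import Counter
from statistics import mean, median


def central_tendency(values):
    """Return mean, median, mode, and mode share for one discrete numerical column.

    Assumption: None marks a missing value and is ignored everywhere,
    including the denominator of mode_pct.
    """
    filled = [v for v in values if v is not None]
    mode_value, mode_count = Counter(filled).most_common(1)[0]
    return {
        "mean": round(mean(filled), 4),
        "median": median(filled),
        "mode": mode_value,
        "mode_pct": round(mode_count / len(filled), 4),
    }


# Hypothetical integer column with two missing values.
ages = [36, 36, 40, None, 52, 36, 28, None]
stats = central_tendency(ages)
print(stats)
```

For a string column only the mode and its share are meaningful, which matches the empty mean and median cells in the outputs of this section.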

In [15]:
from anovos.data_analyzer.stats_generator import measures_of_centralTendency

In [16]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf = measures_of_centralTendency(spark, df)
odf.toPandas()

Unnamed: 0,attribute,mean,median,mode,mode_pct
0,ifa,,,99a,0.0
1,education-num,10.081,10.0,9,0.3225
2,workclass,,,Private,0.6968
3,education,,,HS-grad,0.3274
4,race,,,White,0.8618
5,relationship,,,Husband,0.405
6,capital-gain,1077.696,0.0,0,0.9167
7,capital-loss,87.336,0.0,0,0.9533
8,income,,,<=50K,0.7592
9,age,38.5065,37.0,36,0.0276


In [17]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = measures_of_centralTendency(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf.toPandas()

Unnamed: 0,attribute,mean,median,mode,mode_pct
0,education-num,10.081,10.0,9,0.3225
1,workclass,,,Private,0.6968
2,education,,,HS-grad,0.3274
3,race,,,White,0.8618
4,relationship,,,Husband,0.405
5,capital-gain,1077.696,0.0,0,0.9167
6,capital-loss,87.336,0.0,0,0.9533
7,income,,,<=50K,0.7592
8,age,38.5065,37.0,36,0.0276
9,hours-per-week,40.2497,40.0,40,0.4688


In [18]:
# Example 3 - selected columns
odf = measures_of_centralTendency(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf.toPandas()

Unnamed: 0,attribute,mean,median,mode,mode_pct
0,workclass,,,Private,0.6968
1,race,,,White,0.8618
2,age,38.5065,37.0,36,0.0276
3,fnlwgt,189781.8318,178353.0,164190,0.0004
4,sex,,,Male,0.6691


In [19]:
# Example 4 - only numerical columns
odf = measures_of_centralTendency(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'])
odf.toPandas()

Unnamed: 0,attribute,mean,median,mode,mode_pct
0,education-num,10.081,10.0,9.0,0.3225
1,capital-gain,1077.696,0.0,0.0,0.9167
2,age,38.5065,37.0,36.0,0.0276
3,logfnl,5.2055,5.2524,,


In [20]:
# Example 5 - only categorical columns
odf = measures_of_centralTendency(spark, idf = df, list_of_cols= ['sex','race','workclass'])
odf.toPandas()

Unnamed: 0,attribute,mean,median,mode,mode_pct
0,workclass,,,Private,0.6968
1,race,,,White,0.8618
2,sex,,,Male,0.6691


# Measures of Cardinality

- API specification of function **measures_of_cardinality** can be found <a href="../api_specification/anovos/data_analyzer/stats_generator.html#anovos.data_analyzer.stats_generator.measures_of_cardinality">here</a>
- Cardinality is calculated only for discrete columns (string and integer datatypes)
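
The two output columns can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the ANOVOS code: the helper `cardinality` is hypothetical, and IDness is assumed to be the number of unique values divided by the non-null count, which appears consistent with the sample output (fnlwgt: 21640 / 32546 ≈ 0.6649).

```python
def cardinality(values):
    """Return the distinct-value count and IDness for one column.

    Assumption: None marks a missing value and is excluded from both
    the unique count and the denominator.
    """
    filled = [v for v in values if v is not None]
    unique_values = len(set(filled))
    return {
        "unique_values": unique_values,
        "IDness": round(unique_values / len(filled), 4),
    }


# An identifier-like column: every value is distinct, so IDness is 1.0.
id_stats = cardinality(["1a", "2a", "3a", "4a"])
# A low-cardinality column with one missing value.
sex_stats = cardinality(["Male", "Female", "Male", None])
print(id_stats, sex_stats)
```

An IDness close to 1.0 suggests the column behaves like a row identifier, as the `ifa` column does in this dataset.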

In [21]:
from anovos.data_analyzer.stats_generator import measures_of_cardinality

In [22]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf = measures_of_cardinality(spark, df)
odf.toPandas()

Unnamed: 0,attribute,unique_values,IDness
0,ifa,32561,1.0
1,education-num,16,0.0005
2,workclass,11,0.0003
3,education,16,0.0005
4,race,9,0.0003
5,relationship,8,0.0002
6,capital-gain,119,0.0037
7,capital-loss,92,0.0028
8,income,2,0.0001
9,age,69,0.0021


In [23]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = measures_of_cardinality(spark, idf = df, list_of_cols='all', drop_cols=['ifa'])
odf.toPandas()

Unnamed: 0,attribute,unique_values,IDness
0,education-num,16,0.0005
1,workclass,11,0.0003
2,education,16,0.0005
3,race,9,0.0003
4,relationship,8,0.0002
5,capital-gain,119,0.0037
6,capital-loss,92,0.0028
7,income,2,0.0001
8,age,69,0.0021
9,hours-per-week,89,0.0027


In [24]:
# Example 3 - selected columns
odf = measures_of_cardinality(spark, idf = df, list_of_cols= ['age','sex','race','workclass','fnlwgt'])
odf.toPandas()

Unnamed: 0,attribute,unique_values,IDness
0,workclass,11,0.0003
1,race,9,0.0003
2,age,69,0.0021
3,fnlwgt,21640,0.6649
4,sex,3,0.0001


In [25]:
# Example 4 - only numerical columns
odf = measures_of_cardinality(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'])
odf.toPandas()

Unnamed: 0,attribute,unique_values,IDness
0,education-num,16,0.0005
1,capital-gain,119,0.0037
2,age,69,0.0021
3,logfnl,10036,0.8248


In [26]:
# Example 5 - only categorical columns
odf = measures_of_cardinality(spark, idf = df, list_of_cols= ['sex','race','workclass'])
odf.toPandas()

Unnamed: 0,attribute,unique_values,IDness
0,workclass,11,0.0003
1,race,9,0.0003
2,sex,3,0.0001


# Measures of Dispersion

- API specification of function **measures_of_dispersion** can be found <a href="../api_specification/anovos/data_analyzer/stats_generator.html#anovos.data_analyzer.stats_generator.measures_of_dispersion">here</a>
- Supports only numerical columns
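
The relationships between the output columns can be read off the sample results: `cov` matches the coefficient of variation (for age, 13.5085 / 38.5065 ≈ 0.3508), `IQR` matches the 75th minus the 25th percentile, and `range` matches max minus min. The sketch below reproduces these definitions in plain Python. The helper `dispersion` is hypothetical, and the use of sample (n − 1) formulas and of the "inclusive" quantile method are assumptions, so values from ANOVOS may differ slightly.

```python
from statistics import mean, quantiles, stdev, variance


def dispersion(values):
    """Return dispersion measures for one numerical column.

    Assumptions: None marks a missing value; stddev and variance use
    the sample (n - 1) formulas; quartiles use linear interpolation.
    """
    filled = [v for v in values if v is not None]
    q1, _, q3 = quantiles(filled, n=4, method="inclusive")
    sd = stdev(filled)
    return {
        "stddev": round(sd, 4),
        "variance": round(variance(filled), 4),
        "cov": round(sd / mean(filled), 4),  # coefficient of variation
        "IQR": q3 - q1,
        "range": max(filled) - min(filled),
    }


# Hypothetical numerical column with one missing value.
d = dispersion([2, 4, 4, 4, 5, 5, 7, 9, None])
print(d)
```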

In [27]:
from anovos.data_analyzer.stats_generator import measures_of_dispersion

In [28]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf = measures_of_dispersion(spark, df)
odf.toPandas()

Unnamed: 0,attribute,stddev,variance,cov,IQR,range
0,education-num,2.5725,6.6178,0.2552,3.0,15.0
1,capital-gain,7386.6249,54562230.0,6.8541,0.0,99999.0
2,capital-loss,403.031,162434.0,4.6147,0.0,4356.0
3,age,13.5085,182.4796,0.3508,20.0,68.0
4,hours-per-week,11.9143,141.9505,0.296,5.0,93.0
5,fnlwgt,105563.0645,11143560000.0,0.5562,119159.0,1472420.0
6,logfnl,0.2742,0.0752,0.0527,0.3052,1.8051


In [29]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = measures_of_dispersion(spark, idf = df, list_of_cols='all', drop_cols=['capital-loss'])
odf.toPandas()

Unnamed: 0,attribute,stddev,variance,cov,IQR,range
0,education-num,2.5725,6.6178,0.2552,3.0,15.0
1,capital-gain,7386.6249,54562230.0,6.8541,0.0,99999.0
2,age,13.5085,182.4796,0.3508,20.0,68.0
3,hours-per-week,11.9143,141.9505,0.296,5.0,93.0
4,fnlwgt,105563.0645,11143560000.0,0.5562,119159.0,1472420.0
5,logfnl,0.2742,0.0752,0.0527,0.3052,1.8051


In [30]:
# Example 3 - selected numerical columns
odf = measures_of_dispersion(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'])
odf.toPandas()

Unnamed: 0,attribute,stddev,variance,cov,IQR,range
0,education-num,2.5725,6.6178,0.2552,3.0,15.0
1,capital-gain,7386.6249,54562230.0,6.8541,0.0,99999.0
2,age,13.5085,182.4796,0.3508,20.0,68.0
3,logfnl,0.2742,0.0752,0.0527,0.3052,1.8051


# Measures of Percentiles

- API specification of function **measures_of_percentiles** can be found <a href="../api_specification/anovos/data_analyzer/stats_generator.html#anovos.data_analyzer.stats_generator.measures_of_percentiles">here</a>
- Supports only numerical columns
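
For readers who want to see how such a percentile table can be derived, here is a minimal plain-Python sketch using the exact nearest-rank definition. The helpers `nearest_rank_percentile` and `percentile_summary` are hypothetical. ANOVOS may use a different (possibly approximate) quantile algorithm, so treat this as an illustration of the output layout rather than of the library internals.

```python
def nearest_rank_percentile(sorted_values, pct):
    """Return the smallest value with at least pct percent of the data at or below it."""
    n = len(sorted_values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = max(1, (pct * n + 99) // 100)
    return sorted_values[rank - 1]


def percentile_summary(values, pcts=(1, 5, 10, 25, 50, 75, 90, 95, 99)):
    """Return min, the requested percentiles, and max for one numerical column.

    Assumption: None marks a missing value and is ignored.
    """
    filled = sorted(v for v in values if v is not None)
    summary = {"min": filled[0]}
    for p in pcts:
        summary[f"{p}%"] = nearest_rank_percentile(filled, p)
    summary["max"] = filled[-1]
    return summary


# Hypothetical column holding the integers 1..100 plus one missing value.
summary = percentile_summary(list(range(1, 101)) + [None])
print(summary)
```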

In [31]:
from anovos.data_analyzer.stats_generator import measures_of_percentiles

In [32]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf = measures_of_percentiles(spark, df)
odf.toPandas()

Unnamed: 0,attribute,min,1%,5%,10%,25%,50%,75%,90%,95%,99%,max
0,education-num,1.0,3.0,5.0,7.0,9.0,10.0,12.0,13.0,14.0,16.0,16.0
1,capital-gain,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5013.0,15024.0,99999.0
2,capital-loss,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1980.0,4356.0
3,age,17.0,17.0,19.0,22.0,28.0,37.0,48.0,58.0,63.0,73.0,85.0
4,hours-per-week,1.0,8.0,18.0,24.0,40.0,40.0,45.0,55.0,60.0,72.0,94.0
5,fnlwgt,12285.0,27162.0,39388.0,65624.0,117833.0,178353.0,236992.0,329026.0,379522.0,509866.0,1484705.0
6,logfnl,4.2836,4.4322,4.5937,4.8203,5.0729,5.2524,5.3781,5.5178,5.5768,5.7073,6.0887


In [33]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = measures_of_percentiles(spark, idf = df, list_of_cols='all', drop_cols=['capital-gain'])
odf.toPandas()

Unnamed: 0,attribute,min,1%,5%,10%,25%,50%,75%,90%,95%,99%,max
0,education-num,1.0,3.0,5.0,7.0,9.0,10.0,12.0,13.0,14.0,16.0,16.0
1,capital-loss,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1980.0,4356.0
2,age,17.0,17.0,19.0,22.0,28.0,37.0,48.0,58.0,63.0,73.0,85.0
3,hours-per-week,1.0,8.0,18.0,24.0,40.0,40.0,45.0,55.0,60.0,72.0,94.0
4,fnlwgt,12285.0,27162.0,39388.0,65624.0,117833.0,178353.0,236992.0,329026.0,379522.0,509866.0,1484705.0
5,logfnl,4.2836,4.4322,4.5937,4.8203,5.0729,5.2524,5.3781,5.5178,5.5768,5.7073,6.0887


In [34]:
# Example 3 - selected numerical columns
odf = measures_of_percentiles(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'])
odf.toPandas()

Unnamed: 0,attribute,min,1%,5%,10%,25%,50%,75%,90%,95%,99%,max
0,education-num,1.0,3.0,5.0,7.0,9.0,10.0,12.0,13.0,14.0,16.0,16.0
1,capital-gain,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5013.0,15024.0,99999.0
2,age,17.0,17.0,19.0,22.0,28.0,37.0,48.0,58.0,63.0,73.0,85.0
3,logfnl,4.2836,4.4322,4.5937,4.8203,5.0729,5.2524,5.3781,5.5178,5.5768,5.7073,6.0887


# Measures of Shape

- API specification of function **measures_of_shape** can be found <a href="../api_specification/anovos/data_analyzer/stats_generator.html#anovos.data_analyzer.stats_generator.measures_of_shape">here</a>
- Supports only numerical columns
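
Skewness and kurtosis can be sketched from central moments as follows. The helper `shape_measures` is hypothetical, and both the population-moment formulas and the excess-kurtosis convention (subtracting 3) are assumptions. The negative kurtosis reported for age in the sample output is at least consistent with an excess-kurtosis definition.

```python
def shape_measures(values):
    """Return skewness and excess kurtosis for one numerical column.

    Assumptions: None marks a missing value; population central moments
    are used; kurtosis is reported as excess kurtosis (normal = 0).
    """
    filled = [v for v in values if v is not None]
    n = len(filled)
    mu = sum(filled) / n
    m2 = sum((v - mu) ** 2 for v in filled) / n  # second central moment
    m3 = sum((v - mu) ** 3 for v in filled) / n  # third central moment
    m4 = sum((v - mu) ** 4 for v in filled) / n  # fourth central moment
    return {
        "skewness": round(m3 / m2 ** 1.5, 4),
        "kurtosis": round(m4 / m2 ** 2 - 3, 4),
    }


# A symmetric column has zero skewness.
symmetric = shape_measures([1, 2, 3, 4, 5])
# A column with a long right tail has positive skewness.
right_tailed = shape_measures([1, 1, 1, 1, 10, None])
print(symmetric, right_tailed)
```

Large positive skewness together with high kurtosis, as seen for capital-gain in this section, typically indicates a column dominated by one value with a few extreme outliers.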

In [35]:
from anovos.data_analyzer.stats_generator import measures_of_shape

In [36]:
# Example 1 - with mandatory arguments only (the remaining arguments take their default values)
odf = measures_of_shape(spark, df)
odf.toPandas()

Unnamed: 0,attribute,skewness,kurtosis
0,fnlwgt,1.447,6.217
1,hours-per-week,-0.0756,1.9953
2,capital-loss,4.5935,20.3642
3,education-num,-0.3116,0.6236
4,logfnl,-0.854,0.8365
5,capital-gain,11.9516,154.7243
6,age,0.5128,-0.3418


In [37]:
# Example 2 - 'all' columns (excluding drop_cols)
odf = measures_of_shape(spark, idf = df, list_of_cols='all', drop_cols=['capital-gain'])
odf.toPandas()

Unnamed: 0,attribute,skewness,kurtosis
0,fnlwgt,1.447,6.217
1,hours-per-week,-0.0756,1.9953
2,capital-loss,4.5935,20.3642
3,education-num,-0.3116,0.6236
4,logfnl,-0.854,0.8365
5,age,0.5128,-0.3418


In [39]:
# Example 3 - selected numerical columns
odf = measures_of_shape(spark, idf = df, list_of_cols= ['age','education-num','capital-gain','logfnl'])
odf.toPandas()

Unnamed: 0,attribute,skewness,kurtosis
0,education-num,-0.3116,0.6236
1,logfnl,-0.854,0.8365
2,age,0.5128,-0.3418
3,capital-gain,11.9516,154.7243
