## cat_var_stats

This tutorial shows how to get the most out of the cat_var_stats function in the eda_mds package. It uses the Titanic dataset from seaborn-data, which contains data on the passengers of the Titanic and whether they survived the accident. The cat_var_stats function takes a pandas DataFrame as its argument.

In [4]:
import pandas as pd
from src.eda_mds.cat_var_stats import cat_var_stats
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv') # Let's read the dataframe

After importing the dataset, let's run the cat_var_stats function.

In [5]:
cat_var_stats(df)

Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------


Column: embarked
Number of unique values: 3
Frequency of values:
S: 72.28%
C: 18.86%
Q: 8.64%
nan: 0.22%
------------------------------------


Column: class
Number of unique values: 3
Frequency of values:
Third: 55.11%
First: 24.24%
Second: 20.65%
------------------------------------


Column: who
Number of unique values: 3
Frequency of values:
man: 60.27%
woman: 30.42%
child: 9.32%
------------------------------------


Column: adult_male
Number of unique values: 2
Frequency of values:
True: 60.27%
False: 39.73%
------------------------------------


Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
G, A, F values can be binned into "other" category as they are lower than binning threshold
------------------------------------


Column: embark_town
Nu

cat_var_stats iterates over each categorical column and prints a summary of it. An example output for the 'sex' column can be seen below:
```console 
Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
```
It prints the column name, the number of unique values, and finally the percentage of each unique value.
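
To make these three pieces of information concrete, here is a minimal sketch of how a similar summary could be computed with plain pandas. It is not the source code of cat_var_stats: the helper name summarize_column and the toy DataFrame are assumptions made for illustration only.

```python
import pandas as pd


def summarize_column(series: pd.Series) -> dict:
    """Hypothetical helper: unique-value count and percentage share per value."""
    # dropna=False keeps missing values as their own entry in the shares
    shares = series.value_counts(normalize=True, dropna=False) * 100
    return {
        "column": series.name,
        "n_unique": series.nunique(),  # nunique() ignores NaN by default
        "percentages": shares.round(2).to_dict(),
    }


# Toy data standing in for the Titanic 'sex' column
toy = pd.DataFrame({"sex": ["male", "male", "female", "male"]})
summary = summarize_column(toy["sex"])

print(f"Column: {summary['column']}")
print(f"Number of unique values: {summary['n_unique']}")
print("Frequency of values:")
for value, pct in summary["percentages"].items():
    print(f"{value}: {pct:.2f}%")
```

In this sketch missing values get a percentage but are not counted as a unique value, which would be consistent with the 'embarked' output above, where nan is listed while the number of unique values is 3.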

For columns with underrepresented values, it also gives binning suggestions based on a threshold. Such a suggestion can be seen for the 'deck' column of the Titanic dataset.
```console 
Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
G, A, F values can be binned into "other" category as they are lower than binning threshold
```
This output was generated with the default threshold of 2 percent, but users can define their own threshold with the binning_threshold argument of the function.
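
The comparison against the threshold can be sketched as follows. This is an assumption about the logic, not the package's actual implementation: the helper name values_below_threshold and the toy data are hypothetical, and only the binning_threshold parameter name and its default of 2 percent come from the tutorial.

```python
import pandas as pd


def values_below_threshold(series: pd.Series, binning_threshold: float = 2.0) -> list:
    """Hypothetical helper: values whose share of all rows is below the threshold (in percent)."""
    # Shares are computed over all rows, including missing ones
    shares = series.value_counts(normalize=True, dropna=False) * 100
    rare = shares[shares < binning_threshold]
    # Missing values are never suggested for binning in this sketch
    return [value for value in rare.index if pd.notna(value)]


# Toy column: C makes up 60%, B 37% and E 3% of the rows
deck = pd.Series(["C"] * 60 + ["B"] * 37 + ["E"] * 3, name="deck")

default_rare = values_below_threshold(deck)                      # default 2 percent
wider_rare = values_below_threshold(deck, binning_threshold=4)   # user-defined 4 percent
print(default_rare, wider_rare)
```

With the default threshold nothing in the toy column qualifies, while raising the threshold to 4 percent picks up 'E'.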

In [7]:
cat_var_stats(df, binning_threshold=4)  # Let's run the function again with a user defined threshold

Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------


Column: embarked
Number of unique values: 3
Frequency of values:
S: 72.28%
C: 18.86%
Q: 8.64%
nan: 0.22%
------------------------------------


Column: class
Number of unique values: 3
Frequency of values:
Third: 55.11%
First: 24.24%
Second: 20.65%
------------------------------------


Column: who
Number of unique values: 3
Frequency of values:
man: 60.27%
woman: 30.42%
child: 9.32%
------------------------------------


Column: adult_male
Number of unique values: 2
Frequency of values:
True: 60.27%
False: 39.73%
------------------------------------


Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
E, G, D, A, F values can be binned into "other" category as they are lower than binning threshold
------------------------------------


Column: embark_t

With the newly defined threshold value, the binning recommendation now includes 'E' and 'D' as well.
```console 
Binning recommendations:
E, G, D, A, F values can be binned into "other" category as they are lower than binning threshold
```
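
Acting on such a recommendation is left to the user. One possible way to do it in pandas is sketched below; the helper bin_rare_values and the toy series are hypothetical and not part of eda_mds, and the sketch assumes an object (string) column such as the one produced by pd.read_csv.

```python
import pandas as pd


def bin_rare_values(series: pd.Series, rare_values: list, other_label: str = "other") -> pd.Series:
    """Hypothetical helper: replace the listed values with a single label."""
    # Series.where keeps entries where the condition is True and replaces the rest;
    # missing values are not in rare_values, so they stay missing
    return series.where(~series.isin(rare_values), other_label)


# Toy stand-in for the 'deck' column, including one missing value
toy_deck = pd.Series(["C", "E", "G", None, "B", "D", "A", "F"], name="deck")

# Values taken from the recommendation printed for binning_threshold=4
binned = bin_rare_values(toy_deck, ["E", "G", "D", "A", "F"])
print(binned.value_counts(dropna=False))
```

Whether missing values should stay missing or get their own category is a separate decision that this sketch leaves untouched.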
