# Example usage

In this demonstration, we will show how to use `eda_mds` for conducting Exploratory Data Analysis (EDA). 

Imagine we are beginning a new data science project. 
As with any project, exploratory data analysis (EDA) is a crucial first step to understand the nature of the data we are working with. `eda_mds` helps with this by: 
- characterizing `null` values using `info_na`
- highlighting outliers with `describe_outliers`
- summarizing categorical variables with `cat_var_stats`
- calculating variable correlations with `cor_eda`

We will walk through each of these steps using the `titanic` dataset from [`seaborn-datasets`](https://github.com/mwaskom/seaborn-data), a messy dataset containing information about passengers aboard the [RMS Titanic](https://en.wikipedia.org/wiki/Titanic), including whether they survived.


In [1]:
# import modules
import pandas as pd
import numpy as np

from eda_mds import info_na, describe_outliers, cat_var_stats, cor_eda

In [2]:
# import the titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')

## `info_na()` 

In this section, we will explore `info_na()`, a function in `eda_mds` that extends the behaviour of `pd.DataFrame.info()`. 
To motivate its use, we will begin the exploratory data analysis with base `pandas`, then compare the steps and output needed to obtain the same information with `info_na()`.  

Missing data points can significantly degrade model performance, and many models fail outright when given null values, so characterizing them is essential to quantifying data quality. 
This characterization informs the strategy for handling missing values: removing them, imputing them, or otherwise replacing them. 
In some cases, missing values are concentrated in specific rows or columns. 
Let's see what base `pandas` tells us about missing values: 


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


`pandas.DataFrame.info` shows us how many values in a dataset are non-null by column, alongside the data types. 
Here, we can see that some columns, particularly `deck`, are missing significant amounts of data. 

While this may seem like enough information at first glance, there are more questions to ask: 
- What about rows of data? 
- How much data will be lost if we remove, say, all rows with null values? 
- Is missing data randomly dispersed or is it focused in some rows? 

Let's see if we can answer these questions: 

In [4]:
n_rows_any_null = df.isna().any(axis=1).sum()
n_rows = df.shape[0]
print(f"{n_rows_any_null} rows with any null value. ({n_rows_any_null / n_rows * 100:.2f}%)")

709 rows with any null value. (79.57%)


If we remove all rows with null values, we will lose about 80% of our dataset! 
Thankfully, the column-level counts above show that most of the missing values are in the `deck` column. 
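To make the consequence concrete, here is a minimal sketch on a small, made-up frame (the `toy` data is an assumption for illustration, not the titanic dataset). It compares how many rows survive when we drop every incomplete row against dropping the sparse column first.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): `deck` is mostly missing, `age` has a single gap
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0, 29.0],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10, 8.46],
})

# Option 1: drop every row that contains any null value
rows_left_naive = len(toy.dropna())

# Option 2: drop the sparse column first, then drop the remaining incomplete rows
rows_left_after_col_drop = len(toy.drop(columns=["deck"]).dropna())

print(f"rows kept, dropping rows only: {rows_left_naive}")
print(f"rows kept, dropping `deck` first: {rows_left_after_col_drop}")
```

On this toy frame, removing the sparse column first keeps most of the rows, which is the kind of decision that column-level and row-level null counts are meant to inform.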

Are there any rows that have more than one null value, or all null values?

In [5]:
n_rows_all_null = df.isna().all(axis=1).sum()
mean_null_rows = df.isna().sum(axis=1).mean().round(2)
max_null_rows = df.isna().sum(axis=1).max()

print(f"{n_rows_all_null} rows have all-null values")
print(f"{mean_null_rows:0.2f}: average null values per row")
print(f"{max_null_rows}: max number of null values in a row")

0 rows have all-null values
0.98: average null values per row
2: max number of null values in a row


It appears that `deck` is the primary contributor of null values. 

The largest number of null values in any row is two, and on average each row is missing about one value. 
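The mean and maximum only hint at whether missing values are spread evenly or concentrated in a few rows. One way to look at this directly, sketched here on a made-up frame (the `toy` data is an assumption), is to tabulate how many rows have 0, 1, 2, ... null values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with nulls spread unevenly across rows
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, None, "C", "E"],
    "fare": [7.25, 8.05, 53.10, 8.46],
})

# Number of null values in each row
nulls_per_row = toy.isna().sum(axis=1)

# Number of rows for each null count, ordered by null count
null_count_distribution = nulls_per_row.value_counts().sort_index()
print(null_count_distribution)
```

A distribution dominated by a single count suggests that one column drives the missingness, while a long tail suggests a few heavily fragmented rows.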


This exercise shows the extra steps needed to more fully characterize a dataset. 
While this is only a few extra lines of code, it becomes tedious over time. 
`info_na` simplifies this process: 

In [6]:
info_na(df)


type: <class 'pandas.core.frame.DataFrame'>
shape: (891, 15)
memory usage: 398.1 KB
--------
columns:
 #      column  null count  null %   dtype
 0    survived           0    0.00   int64
 1      pclass           0    0.00   int64
 2         sex           0    0.00  object
 3         age         177   19.87 float64
 4       sibsp           0    0.00   int64
 5       parch           0    0.00   int64
 6        fare           0    0.00 float64
 7    embarked           2    0.22  object
 8       class           0    0.00  object
 9         who           0    0.00  object
10  adult_male           0    0.00    bool
11        deck         688   77.22  object
12 embark_town           2    0.22  object
13       alive           0    0.00  object
14       alone           0    0.00    bool
-----
rows:
total rows            891.00
any null count        709.00
any null %             79.57
all null count          0.00
all null %              0.00
mean null count         0.98
std.dev null count     

We can see that many of the values we computed before are provided, alongside the information given by `pandas.DataFrame.info`. 

This summarizes the primary use case of `info_na()`: characterizing missing values in a dataset in more detail - an essential task in most data science projects.

## `describe_outliers()`
### Numerical Insights
We'll first use `describe_outliers()` to observe the distribution of each numeric column in the titanic dataset. This can be done by simply passing in our dataframe, `df`, without any additional parameters.

In [7]:
describe_outliers(df)

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| dtype | int64 | int64 | float64 | int64 | int64 | float64 |
| Non-null count | 891 | 891 | 714 | 891 | 891 | 891 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| standard deviation | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min value | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% percentile | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% (median) | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% percentile | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max value | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
| lower-tail outliers | 0 | 0 | 0 | 0 | 0 | 0 |


The output resembles the result of `pandas.DataFrame.describe()`. It additionally includes counts of lower-tail and upper-tail outliers, along with the data type of each column.

Looking at the `float64` columns, we can see that `age` has some null values and 11 upper-tail outliers. 
Together with the mean, median, and standard deviation, this gives us a better idea of the shape of the distribution: it is right-skewed. 
Similarly, `fare` is more heavily right-skewed, with even more upper-tail outliers.
These distributions could be explored further, including possible correlations. 
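As a quick sanity check on skew, we can compare the mean with the median and compute the sample skewness using plain `pandas`. The sketch below uses a short, made-up list of fares (an assumption for illustration) rather than the full column.

```python
import pandas as pd

# Toy fares (hypothetical): mostly small values with one very large one
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 14.45, 26.55, 31.00, 71.28, 512.33])

# In a right-skewed distribution the mean is pulled above the median
mean_above_median = fares.mean() > fares.median()

# Sample skewness: positive values indicate a longer right tail
skewness = fares.skew()

print(f"mean > median: {mean_above_median}, skewness: {skewness:.2f}")
```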

#### Adjusting Outlier Detection

The `threshold` argument tunes the sensitivity of outlier detection. A value higher than the default of 1.5 decreases sensitivity. In the example below, increasing the threshold reduces the number of upper-tail outliers for `age` from 11 to 5.

*Note that outlier detection uses the standard IQR rule: a value is a lower-tail outlier if it is `<= Q1 - threshold*IQR` and an upper-tail outlier if it is `>= Q3 + threshold*IQR`, where `IQR = Q3 - Q1`.*
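The rule in the note above can be sketched with plain `pandas`. This is an illustrative helper under stated assumptions, not the `eda_mds` implementation: the function name `count_iqr_outliers` and the toy `ages` series are hypothetical, and nulls are assumed to be ignored.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, threshold: float = 1.5) -> tuple[int, int]:
    """Count lower- and upper-tail outliers with the IQR rule (illustrative only)."""
    clean = values.dropna()  # assumption: nulls are ignored
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - threshold * iqr
    upper_bound = q3 + threshold * iqr
    n_lower = int((clean <= lower_bound).sum())
    n_upper = int((clean >= upper_bound).sum())
    return n_lower, n_upper

# Toy ages (hypothetical): Q1 = 22.75, Q3 = 36.75, IQR = 14
ages = pd.Series([2, 20, 22, 25, 28, 30, 33, 38, 40, 70], dtype="float64")

default_counts = count_iqr_outliers(ages)              # upper bound 57.75, so 70 is flagged
wide_counts = count_iqr_outliers(ages, threshold=3.0)  # upper bound 78.75, so nothing is flagged
print(default_counts, wide_counts)
```

Raising the threshold widens both bounds, so fewer values fall outside them.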

In [8]:
describe_outliers(df, threshold=1.8)

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| dtype | int64 | int64 | float64 | int64 | int64 | float64 |
| Non-null count | 891 | 891 | 714 | 891 | 891 | 891 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| standard deviation | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min value | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% percentile | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% (median) | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% percentile | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max value | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
| lower-tail outliers | 0 | 0 | 0 | 0 | 0 | 0 |


### Options for Categorical Columns

While these summary statistics are primarily meaningful for numerical columns, setting the `numeric` argument to `False` also returns the non-numerical columns. 

In [9]:
describe_outliers(df, threshold=1.8, numeric=False)

| | adult_male | age | alive | alone | class | deck | embark_town | embarked | fare | parch | pclass | sex | sibsp | survived | who |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dtype | bool | float64 | object | bool | object | object | object | object | float64 | int64 | int64 | object | int64 | int64 | object |
| Non-null count | 891 | 714 | 891 | 891 | 891 | 203 | 889 | 889 | 891 | 891 | 891 | 891 | 891 | 891 | 891 |
| mean | | 29.699118 | | | | | | | 32.204208 | 0.381594 | 2.308642 | | 0.523008 | 0.383838 | |
| standard deviation | | 14.526497 | | | | | | | 49.693429 | 0.806057 | 0.836071 | | 1.102743 | 0.486592 | |
| min value | | 0.42 | | | | | | | 0.0 | 0.0 | 1.0 | | 0.0 | 0.0 | |
| 25% percentile | | 20.125 | | | | | | | 7.9104 | 0.0 | 2.0 | | 0.0 | 0.0 | |
| 50% (median) | | 28.0 | | | | | | | 14.4542 | 0.0 | 3.0 | | 0.0 | 0.0 | |
| 75% percentile | | 38.0 | | | | | | | 31.0 | 0.0 | 3.0 | | 1.0 | 1.0 | |
| max value | | 80.0 | | | | | | | 512.3292 | 6.0 | 3.0 | | 8.0 | 1.0 | |
| lower-tail outliers | | 0.0 | | | | | | | 0.0 | 0.0 | 0.0 | | 0.0 | 0.0 | |


This displays all columns in the dataset, sorted alphabetically by column name. Examining the dtypes of both numeric and categorical columns is essential to verify correct encoding in case modifications are necessary.
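As a minimal sketch of such a modification, suppose we decide that `class` should be an ordered categorical rather than a plain `object` column. The toy frame and the chosen ordering are illustrative assumptions, and this step uses plain `pandas` rather than anything from `eda_mds`.

```python
import pandas as pd

# Toy frame (hypothetical) mimicking two of the titanic columns
toy = pd.DataFrame({
    "class": ["Third", "First", "Second", "Third"],
    "alone": [True, False, False, True],
})

# Assumed ordering of the ticket classes, from highest to lowest
class_order = ["First", "Second", "Third"]

# Re-encode the object column as an ordered categorical
toy["class"] = pd.Categorical(toy["class"], categories=class_order, ordered=True)

print(toy.dtypes)
```

With an ordered encoding, comparisons and sorting follow the declared order instead of alphabetical order.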

Regarding categorical columns, a couple of notable observations are: two columns are encoded as booleans, and the `deck` column predominantly consists of `NaN` values. Further exploration of categorical columns can be accomplished using the `cat_var_stats()` function.

## `cat_var_stats()`
This section shows how to use the `cat_var_stats` function in the `eda_mds` package. The function takes a `pandas.DataFrame` as its argument.

With the dataset already imported, let's run `cat_var_stats`:

In [10]:
cat_var_stats(df)

Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------


Column: embarked
Number of unique values: 3
Frequency of values:
S: 72.28%
C: 18.86%
Q: 8.64%
nan: 0.22%
------------------------------------


Column: class
Number of unique values: 3
Frequency of values:
Third: 55.11%
First: 24.24%
Second: 20.65%
------------------------------------


Column: who
Number of unique values: 3
Frequency of values:
man: 60.27%
woman: 30.42%
child: 9.32%
------------------------------------


Column: adult_male
Number of unique values: 2
Frequency of values:
True: 60.27%
False: 39.73%
------------------------------------


Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
G, A, F values can be binned into "other" category as they are lower than binning threshold
------------------------------------


Column: embark_town
Nu

`cat_var_stats` iterates over each categorical column and prints a summary of it. The output for the `sex` column is repeated below as an example:
```console 
Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
```
It prints the column name, the number of unique values, and finally the percentage of rows taken up by each unique value.
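Comparable numbers can be obtained with plain `pandas`, as in the sketch below. It assumes, based on the output shown in this walkthrough, that missing values get their own frequency entry but do not count as a unique value. The toy `sex` series is made up for illustration.

```python
import pandas as pd

# Toy column (hypothetical) with one missing value
sex = pd.Series(["male", "female", "male", "male", None], name="sex")

# Share of each value in percent, with missing values as their own group
frequencies = sex.value_counts(normalize=True, dropna=False).mul(100).round(2)

# Number of distinct non-null values
n_unique = sex.nunique()

print(f"Column: {sex.name}")
print(f"Number of unique values: {n_unique}")
print(frequencies)
```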

For columns with underrepresented values, it also gives binning suggestions according to a threshold. In the titanic dataset, such a suggestion appears for the `deck` column.
```console 
Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
G, A, F values can be binned into "other" category as they are lower than binning threshold
```
This output was generated with the default binning threshold of 2%, but a user can define their own threshold with the `binning_threshold` argument. 

In [11]:
cat_var_stats(df, binning_threshold=4)  # Let's run the function again with a user defined threshold

Column: sex
Number of unique values: 2
Frequency of values:
male: 64.76%
female: 35.24%
------------------------------------


Column: embarked
Number of unique values: 3
Frequency of values:
S: 72.28%
C: 18.86%
Q: 8.64%
nan: 0.22%
------------------------------------


Column: class
Number of unique values: 3
Frequency of values:
Third: 55.11%
First: 24.24%
Second: 20.65%
------------------------------------


Column: who
Number of unique values: 3
Frequency of values:
man: 60.27%
woman: 30.42%
child: 9.32%
------------------------------------


Column: adult_male
Number of unique values: 2
Frequency of values:
True: 60.27%
False: 39.73%
------------------------------------


Column: deck
Number of unique values: 7
Frequency of values:
nan: 77.22%
C: 6.62%
E: 3.59%
G: 0.45%
D: 3.70%
A: 1.68%
B: 5.27%
F: 1.46%
Binning recommendations:
E, G, D, A, F values can be binned into "other" category as they are lower than binning threshold
------------------------------------


Column: embark_t

With the newly defined threshold of 4%, the binning recommendation now includes 'E' and 'D' as well.
```console 
Binning recommendations:
E, G, D, A, F values can be binned into "other" category as they are lower than binning threshold
```
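`cat_var_stats` only recommends the binning; applying it is left to us. A minimal sketch of one way to do so follows. The helper `bin_rare_categories` and the toy `deck` series are hypothetical, and the strict "lower than" comparison is an assumption based on the wording of the recommendation.

```python
import pandas as pd

def bin_rare_categories(values: pd.Series, binning_threshold: float = 2.0) -> pd.Series:
    """Relabel categories rarer than `binning_threshold` percent as "other" (illustrative)."""
    # Percent share of each value, with missing values counted in the total
    shares = values.value_counts(normalize=True, dropna=False) * 100
    rare = shares[shares < binning_threshold].index
    # Keep missing values missing; only relabel rare, non-null categories
    return values.where(~values.isin(rare) | values.isna(), "other")

# Toy column (hypothetical): "G" makes up 1% of 100 rows
deck = pd.Series(["C"] * 60 + ["B"] * 30 + ["G"] * 1 + [None] * 9)

binned = bin_rare_categories(deck, binning_threshold=2.0)
print(binned.value_counts(dropna=False))
```

Missing values are deliberately left untouched here, so that the decision on how to handle them stays separate from the binning step.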

## `cor_eda()`

Calling the correlation function (`cor_eda`) returns a data frame structured as a correlation matrix. Each cell holds the correlation coefficient between the numerical columns named by its row and column, quantifying the strength and direction of the relationship between that pair of features. 
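To illustrate what one entry of such a matrix means, the sketch below builds a correlation matrix with plain `pandas` and recomputes a single coefficient by hand. It assumes the Pearson coefficient, which is the `pandas` default. The toy frame is made up, and this is not the `cor_eda` implementation.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with two numeric columns and one non-numeric column
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0],
    "who": ["woman", "man", "man", "child", "man"],  # excluded below
})

# Pairwise Pearson correlations between the numeric columns only
corr_matrix = toy.select_dtypes(include="number").corr()

# The same coefficient by hand: co-variation divided by the spread of both columns
x, y = toy["pclass"], toy["fare"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

print(corr_matrix)
print(f"pclass vs fare, by hand: {r_manual:.6f}")
```

The sign gives the direction (here negative: a higher class number goes with a lower fare) and the magnitude, between 0 and 1, gives the strength.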

In [12]:
cor_eda(df)

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| survived | 1.0 | -0.359653 | -0.077221 | -0.017358 | 0.093317 | 0.268189 |
| pclass | -0.359653 | 1.0 | -0.369226 | 0.067247 | 0.025683 | -0.554182 |
| age | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| sibsp | -0.017358 | 0.067247 | -0.308247 | 1.0 | 0.38382 | 0.138329 |
| parch | 0.093317 | 0.025683 | -0.189119 | 0.38382 | 1.0 | 0.205119 |
| fare | 0.268189 | -0.554182 | 0.096067 | 0.138329 | 0.205119 | 1.0 |


The next call computes the same matrix but sets `na_handling="mean"`, which replaces NAs with the mean of the column instead of dropping them.

In [13]:
cor_eda(df, na_handling="mean")

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| survived | 1.0 | -0.338481 | -0.069809 | -0.035322 | 0.081629 | 0.257307 |
| pclass | -0.338481 | 1.0 | -0.331339 | 0.083081 | 0.018443 | -0.5495 |
| age | -0.069809 | -0.331339 | 1.0 | -0.232625 | -0.179191 | 0.091566 |
| sibsp | -0.035322 | 0.083081 | -0.232625 | 1.0 | 0.414838 | 0.159651 |
| parch | 0.081629 | 0.018443 | -0.179191 | 0.414838 | 1.0 | 0.216225 |
| fare | 0.257307 | -0.5495 | 0.091566 | 0.159651 | 0.216225 | 1.0 |


Notice that the values of the correlation function are slightly different when the NA handling method is changed. This indicates that our numerical data contained NA values, and the method we choose to handle them will affect the outcome of this function.

The next call sets `na_handling="median"`, which replaces NAs with the median of the column instead of dropping them.

In [14]:
cor_eda(df, na_handling="median")

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| survived | 1.0 | -0.338481 | -0.06491 | -0.035322 | 0.081629 | 0.257307 |
| pclass | -0.338481 | 1.0 | -0.339898 | 0.083081 | 0.018443 | -0.5495 |
| age | -0.06491 | -0.339898 | 1.0 | -0.233296 | -0.172482 | 0.096688 |
| sibsp | -0.035322 | 0.083081 | -0.233296 | 1.0 | 0.414838 | 0.159651 |
| parch | 0.081629 | 0.018443 | -0.172482 | 0.414838 | 1.0 | 0.216225 |
| fare | 0.257307 | -0.5495 | 0.096688 | 0.159651 | 0.216225 | 1.0 |


We can see that, compared to using the mean for NA handling, the correlations involving `age` change slightly, while all others remain the same. This is because `age` is the only numeric column with missing values, and its mean (29.70) and median (28.0) are close.
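The three strategies can be reproduced with plain `pandas` to see why the results differ. The sketch below rests on assumptions: the toy frame is made up, and dropping is done row-wise with `dropna()`, which may not match how `cor_eda` handles NAs internally.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): only `age` has missing values
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan, 4.0],
    "fare": [7.25, 71.28, 8.05, 26.55, 13.00, 16.70],
})

# Strategy 1: drop rows with any missing value before correlating
corr_drop = toy.dropna().corr()

# Strategy 2: replace missing values with each column's mean
corr_mean = toy.fillna(toy.mean()).corr()

# Strategy 3: replace missing values with each column's median
corr_median = toy.fillna(toy.median()).corr()

for name, matrix in [("drop", corr_drop), ("mean", corr_mean), ("median", corr_median)]:
    print(f"{name}: age vs fare = {matrix.loc['age', 'fare']:.4f}")
```

Dropping rows discards the `fare` values of the incomplete rows as well, whereas imputation keeps them and pulls the imputed `age` values toward the centre, so the coefficients differ.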