In [1]:
from eda_mds.describe_outliers import describe_outliers
import pandas as pd

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")

In [3]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## describe_outliers


### Numerical Insights
We'll first use `describe_outliers()` to inspect the distribution of each numeric column in the Titanic dataset. Passing our dataframe, `df`, with no additional arguments is enough.

In [9]:
describe_outliers(df)

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
dtype,int64,int64,float64,int64,int64,float64
Non-null count,891,891,714,891,891,891
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
standard deviation,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min value,0.0,1.0,0.42,0.0,0.0,0.0
25% percentile,0.0,2.0,20.125,0.0,0.0,7.9104
50% (median),0.0,3.0,28.0,0.0,0.0,14.4542
75% percentile,1.0,3.0,38.0,1.0,0.0,31.0
max value,1.0,3.0,80.0,8.0,6.0,512.3292
lower-tail outliers,0,0,0,0,0,0


The output resembles the result of `pandas.DataFrame.describe()`. It additionally includes the data type of each column, along with counts of lower-tail and upper-tail outliers.

Let's focus on the `float64` columns for now. `age`, which has a considerable number of null values (only 714 of 891 entries are non-null), has no lower-tail outliers and 11 upper-tail outliers. Its mean slightly exceeds its median, and its standard deviation is relatively large. This suggests that the ages of the people on the Titanic were quite spread out, with a handful of passengers much older than the majority. `fare` shows the same right-skewed pattern, but more pronounced: it has a considerable number of upper-tail outliers, and its mean (32.20) is more than twice its median (14.45). These distributions could be explored further, including possible correlations.
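
As a quick follow-up to the skew noted above, the mean and median can be compared directly alongside the sample skewness. The sketch below is only illustrative: it runs on a small made-up series rather than the Titanic data, and the helper name `skew_summary` is hypothetical and not part of `eda_mds`.

```python
import pandas as pd


def skew_summary(s: pd.Series) -> dict:
    """Return mean, median, their difference, and sample skewness.

    Hypothetical helper for illustration; nulls are dropped first.
    """
    s = s.dropna()
    return {
        "mean": s.mean(),
        "median": s.median(),
        "mean_minus_median": s.mean() - s.median(),
        "skew": s.skew(),  # positive values indicate a longer right tail
    }


# Made-up, right-skewed values with one missing entry
toy = pd.Series([7, 8, 8, 9, 10, 12, 15, 80, None])
summary = skew_summary(toy)
print(summary)
```

Calling `skew_summary(df["fare"])` or `skew_summary(df["age"])` would apply the same check to the columns discussed here.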


#### Adjusting Outlier Detection

The `threshold` argument tunes the sensitivity of outlier detection. A value above the default of 1.5 decreases sensitivity, so fewer points are flagged. In the example below, raising the threshold to 1.8 reduces the upper-tail outlier count for `age` from 11 to 5.

In [7]:
describe_outliers(df, threshold=1.8)

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
dtype,int64,int64,float64,int64,int64,float64
Non-null count,891,891,714,891,891,891
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
standard deviation,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min value,0.0,1.0,0.42,0.0,0.0,0.0
25% percentile,0.0,2.0,20.125,0.0,0.0,7.9104
50% (median),0.0,3.0,28.0,0.0,0.0,14.4542
75% percentile,1.0,3.0,38.0,1.0,0.0,31.0
max value,1.0,3.0,80.0,8.0,6.0,512.3292
lower-tail outliers,0,0,0,0,0,0
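
The default of 1.5 suggests the familiar interquartile-range (Tukey fence) rule, although the notebook does not spell out the method, so treat that as an assumption. Under that assumption, a point counts as an upper-tail outlier when it exceeds `Q3 + threshold * IQR`. With the `age` quartiles from the table (Q1 = 20.125, Q3 = 38.0), the upper fence moves from 64.8125 at a threshold of 1.5 to 70.175 at 1.8, which is consistent with fewer ages being flagged. The function names below are hypothetical and are not the `eda_mds` implementation.

```python
import pandas as pd


def iqr_fences(s: pd.Series, threshold: float = 1.5):
    """Return (lower, upper) fences under the assumed IQR rule."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


def count_outliers(s: pd.Series, threshold: float = 1.5):
    """Return (lower-tail count, upper-tail count) for a numeric Series."""
    lower, upper = iqr_fences(s, threshold)
    s = s.dropna()
    return int((s < lower).sum()), int((s > upper).sum())


# Fence arithmetic using the age quartiles reported in the table
q1_age, q3_age = 20.125, 38.0
upper_15 = q3_age + 1.5 * (q3_age - q1_age)
upper_18 = q3_age + 1.8 * (q3_age - q1_age)
print(upper_15, upper_18)

# Made-up data: a larger threshold flags fewer points
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(count_outliers(toy, threshold=1.5))
print(count_outliers(toy, threshold=6.0))
```

If the assumption holds, `count_outliers(df["age"], threshold=1.8)` should agree with the counts reported by `describe_outliers`.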


### Options for Categorical Columns

These summary statistics matter mostly for numerical columns, but setting the `numeric` argument to `False` also returns the non-numerical columns.

In [32]:
describe_outliers(df, threshold=1.8, numeric=False)

Unnamed: 0,adult_male,age,alive,alone,class,deck,embark_town,embarked,fare,parch,pclass,sex,sibsp,survived,who
dtype,bool,float64,object,bool,object,object,object,object,float64,int64,int64,object,int64,int64,object
Non-null count,891,714,891,891,891,203,889,889,891,891,891,891,891,891,891
mean,,29.699118,,,,,,,32.204208,0.381594,2.308642,,0.523008,0.383838,
standard deviation,,14.526497,,,,,,,49.693429,0.806057,0.836071,,1.102743,0.486592,
min value,,0.42,,,,,,,0.0,0.0,1.0,,0.0,0.0,
25% percentile,,20.125,,,,,,,7.9104,0.0,2.0,,0.0,0.0,
50% (median),,28.0,,,,,,,14.4542,0.0,3.0,,0.0,0.0,
75% percentile,,38.0,,,,,,,31.0,0.0,3.0,,1.0,1.0,
max value,,80.0,,,,,,,512.3292,6.0,3.0,,8.0,1.0,
lower-tail outliers,,0.0,,,,,,,0.0,0.0,0.0,,0.0,0.0,


This displays all columns in the dataset, sorted alphabetically by column name. Checking the dtypes of both numeric and categorical columns helps verify that each column is encoded correctly and shows where modifications may be needed.

Two observations about the categorical columns stand out: two columns (`adult_male` and `alone`) are encoded as booleans, and the `deck` column consists predominantly of `NaN` values (only 203 of 891 entries are non-null). Categorical columns can be explored further with the `cat_var_stats()` function.
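
The same dtype and missing-value checks can be sketched with plain pandas. The helper below, `non_numeric_overview`, is a hypothetical name used only for illustration; it is not part of `eda_mds` and makes no claim about how `cat_var_stats()` works. It is shown on a small made-up frame whose column names mimic the Titanic data.

```python
import pandas as pd


def non_numeric_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing fraction, and unique count of non-numeric columns."""
    # Boolean columns are not treated as numeric here, so they are kept
    non_num = df.select_dtypes(exclude="number")
    return pd.DataFrame({
        "dtype": non_num.dtypes.astype(str),
        "missing_fraction": non_num.isna().mean(),
        "n_unique": non_num.nunique(),  # NaN is not counted as a value
    })


# Made-up frame: one boolean, one mostly-missing categorical, one numeric column
toy = pd.DataFrame({
    "alone": [True, False, True, False],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.92, 53.1],
})
overview = non_numeric_overview(toy)
print(overview)
```

Running `non_numeric_overview(df)` on the Titanic dataframe should show `deck` with a missing fraction of roughly 0.77, matching the non-null count of 203 out of 891 in the table above.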