In [24]:
from eda_mds.describe_outliers import describe_outliers
import pandas as pd

In [25]:
# Load the Titanic dataset from the seaborn-data repository (pandas is already imported above)
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")

In [28]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Describe Outliers

We'll use `describe_outliers()` to gain more insight into the data, including its spread and outliers.

In [27]:
describe_outliers(df)

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
dtype,int64,int64,float64,int64,int64,float64
Non-null count,891,891,714,891,891,891
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
standard deviation,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min value,0.0,1.0,0.42,0.0,0.0,0.0
25% percentile,0.0,2.0,20.125,0.0,0.0,7.9104
50% (median),0.0,3.0,28.0,0.0,0.0,14.4542
75% percentile,1.0,3.0,38.0,1.0,0.0,31.0
max value,1.0,3.0,80.0,8.0,6.0,512.3292
lower-tail outliers,0,0,0,0,0,0


As seen, this returns output similar to `pandas.DataFrame.describe()` for numeric columns, but additionally provides counts of lower-tail and upper-tail outliers, along with the number of non-null values and the data type of each column.

Notably, `age` has no lower-tail outliers but 11 upper-tail outliers. This suggests that most ages are concentrated at lower values, with a few exceptionally high values compared to the majority. Additionally, `fare` appears to have a considerable number of upper-tail outliers. This may indicate that `fare` is right-skewed, which could be explored further.
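One quick way to explore possible right skew is to compare the mean with the median and compute the sample skewness. The sketch below is illustrative only: it uses a handful of `fare` values copied from the output above (the five rows shown by `df.head()` plus the maximum) rather than the full column, so the numbers it prints do not describe the whole dataset.

```python
import pandas as pd

# Illustrative subset: the five fares shown by df.head() plus the column maximum.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292])

skewness = fares.skew()      # sample skewness; positive values indicate a longer right tail
mean_fare = fares.mean()
median_fare = fares.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_fare:.2f}, median: {median_fare:.2f}")
```

A positive skewness together with a mean well above the median is consistent with a right-skewed distribution. On the full dataset the same check would be `df["fare"].skew()`.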

The sensitivity of the outlier detection can be changed using the `threshold` argument. A value above the default of 1.5 makes the detection less sensitive, so fewer points are flagged. We can see this in the example below, where the number of upper-tail outliers for `age` drops from 11 to 5.

In [31]:
describe_outliers(df, threshold=1.8)

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
dtype,int64,int64,float64,int64,int64,float64
Non-null count,891,891,714,891,891,891
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
standard deviation,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min value,0.0,1.0,0.42,0.0,0.0,0.0
25% percentile,0.0,2.0,20.125,0.0,0.0,7.9104
50% (median),0.0,3.0,28.0,0.0,0.0,14.4542
75% percentile,1.0,3.0,38.0,1.0,0.0,31.0
max value,1.0,3.0,80.0,8.0,6.0,512.3292
lower-tail outliers,0,0,0,0,0,0
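To make the effect of `threshold` concrete, here is a minimal sketch of how such counts could be computed. It assumes the outliers are defined by Tukey's IQR fences, which the default of 1.5 suggests, but the notebook does not confirm that `describe_outliers` works this way internally. The helper `count_iqr_outliers` and the `sample` data are hypothetical and are not part of `eda_mds`.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, threshold: float = 1.5) -> tuple[int, int]:
    """Count (lower-tail, upper-tail) outliers using IQR fences.

    Hypothetical helper for illustration; not the eda_mds implementation.
    """
    values = values.dropna()                      # ignore missing values, as describe() does
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - threshold * iqr            # points below this are lower-tail outliers
    upper_fence = q3 + threshold * iqr            # points above this are upper-tail outliers
    n_lower = int((values < lower_fence).sum())
    n_upper = int((values > upper_fence).sum())
    return n_lower, n_upper


# Made-up sample with two large values in the upper tail.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 50])

print(count_iqr_outliers(sample))                 # (0, 2): both 20 and 50 exceed the upper fence of 16
print(count_iqr_outliers(sample, threshold=3.0))  # (0, 1): only 50 exceeds the wider fence of 23.5
```

Raising the threshold widens the fences, so fewer points fall outside them, which matches the behaviour described above.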


While these summary statistics are mainly relevant for numeric columns, non-numeric columns can be included in the output by setting the `numeric` argument to `False`.

In [32]:
describe_outliers(df, threshold=1.8, numeric=False)

Unnamed: 0,adult_male,age,alive,alone,class,deck,embark_town,embarked,fare,parch,pclass,sex,sibsp,survived,who
dtype,bool,float64,object,bool,object,object,object,object,float64,int64,int64,object,int64,int64,object
Non-null count,891,714,891,891,891,203,889,889,891,891,891,891,891,891,891
mean,,29.699118,,,,,,,32.204208,0.381594,2.308642,,0.523008,0.383838,
standard deviation,,14.526497,,,,,,,49.693429,0.806057,0.836071,,1.102743,0.486592,
min value,,0.42,,,,,,,0.0,0.0,1.0,,0.0,0.0,
25% percentile,,20.125,,,,,,,7.9104,0.0,2.0,,0.0,0.0,
50% (median),,28.0,,,,,,,14.4542,0.0,3.0,,0.0,0.0,
75% percentile,,38.0,,,,,,,31.0,0.0,3.0,,1.0,1.0,
max value,,80.0,,,,,,,512.3292,6.0,3.0,,8.0,1.0,
lower-tail outliers,,0.0,,,,,,,0.0,0.0,0.0,,0.0,0.0,


This shows all columns, sorted alphabetically by column name. Checking the dtypes of both numeric and categorical columns helps confirm they are encoded correctly, in case any changes need to be made.
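If the dtype check reveals a column that should be treated as categorical, it can be re-encoded with plain pandas. The sketch below uses a small made-up frame (`toy`) with a few values taken from the `df.head()` output, not the full dataset. Which columns to convert is a judgment call and an assumption here.

```python
import pandas as pd

# Small illustrative frame; values taken from the first rows shown earlier.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "pclass": [3, 1, 3],
    "fare": [7.25, 71.2833, 7.925],
})

# `sex` is stored as object and `pclass` as int64, but both are really categories.
toy["sex"] = toy["sex"].astype("category")
toy["pclass"] = toy["pclass"].astype("category")

print(toy.dtypes)
```

After the conversion, `pclass` is no longer numeric, so summary statistics such as the mean would not be computed for it.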
