# Example usage

This example demonstrates how to use `outliers` to identify and handle the outliers in a dataset and to plot the dataset's distribution.

## Imports

In [1]:
from outliers.outliers import outlier_identifier
from outliers.trim_outliers import trim_outliers
from outliers.visualize_outliers import visualize_outliers

In [4]:
import pandas as pd
import altair as alt

## Create a dataframe
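First, we create a dataframe to work with. It is a small iris-style sample in which a few values (50, 54, 20, and 5) are far larger than the rest of their columns.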

In [5]:
df = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9, 4.7, 5.5, 5.1, 50, 54, 5.0, 5.2, 5.3, 5.1],
    'SepalWidthCm': [1.4, 1.4, 20, 2.0, 0.7, 1.6, 1.2, 1.4, 1.8, 1.5, 2.1],
    'PetalWidthCm': [0.2, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.4, 0.2, 5],
    'class': ['Iris Setosa', 'Iris Versicolour', 'Iris Virginica', 'Iris Setosa',
              'Iris Versicolour', 'Iris Virginica', 'Iris Virginica', 'Iris Setosa',
              'Iris Versicolour', 'Iris Setosa', 'Iris Versicolour'],
})
df

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalWidthCm,class
0,5.1,1.4,0.2,Iris Setosa
1,4.9,1.4,0.2,Iris Versicolour
2,4.7,20.0,0.2,Iris Virginica
3,5.5,2.0,0.3,Iris Setosa
4,5.1,0.7,0.4,Iris Versicolour
5,50.0,1.6,0.5,Iris Virginica
6,54.0,1.2,0.5,Iris Virginica
7,5.0,1.4,0.6,Iris Setosa
8,5.2,1.8,0.4,Iris Versicolour
9,5.3,1.5,0.2,Iris Setosa


## Identify outliers
We can identify outliers using `outlier_identifier`. This function returns a dataframe summarizing the outliers identified by the chosen method; when `return_df=True`, the returned dataframe includes an additional `outlier` column indicating whether each row contains an outlier.

In [7]:
outlier_identifier(df, columns=['SepalLengthCm', 'SepalWidthCm'], identifier='IQR', return_df=True)

Unnamed: 0,SepalLengthCm,SepalWidthCm,outlier
0,5.1,1.4,False
1,4.9,1.4,False
2,4.7,20.0,True
3,5.5,2.0,False
4,5.1,0.7,False
5,50.0,1.6,True
6,54.0,1.2,True
7,5.0,1.4,False
8,5.2,1.8,False
9,5.3,1.5,False
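
To make the `outlier` column above easier to interpret, here is a minimal sketch of the usual IQR rule (Tukey's fences) in plain Python. It is not the package's implementation: the helper name `iqr_outlier_flags`, the multiplier `k=1.5`, and the linear-interpolation quartiles are assumptions chosen for illustration. A row is treated as an outlier if either checked column is flagged.

```python
from statistics import quantiles


def iqr_outlier_flags(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles (hypothetical helper)."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


sepal_length = [5.1, 4.9, 4.7, 5.5, 5.1, 50, 54, 5.0, 5.2, 5.3, 5.1]
sepal_width = [1.4, 1.4, 20, 2.0, 0.7, 1.6, 1.2, 1.4, 1.8, 1.5, 2.1]

# A row counts as an outlier if any of the checked columns is flagged.
row_flags = [
    a or b
    for a, b in zip(iqr_outlier_flags(sepal_length), iqr_outlier_flags(sepal_width))
]
flagged_rows = [i for i, flag in enumerate(row_flags) if flag]
print(flagged_rows)  # [2, 5, 6]
```

Under these assumptions the sketch flags rows 2, 5, and 6, the same rows marked `True` in the output above.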


## Trim outliers
We can trim outliers using `trim_outliers`. This function returns a dataframe in which the outliers have been processed by the chosen method. In the output below, row 2 has been removed.

In [8]:
trim_outliers(df, columns=['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm'], identifier='Z_score', method='trim')

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalWidthCm,class
0,5.1,1.4,0.2,Iris Setosa
1,4.9,1.4,0.2,Iris Versicolour
3,5.5,2.0,0.3,Iris Setosa
4,5.1,0.7,0.4,Iris Versicolour
5,50.0,1.6,0.5,Iris Virginica
6,54.0,1.2,0.5,Iris Virginica
7,5.0,1.4,0.6,Iris Setosa
8,5.2,1.8,0.4,Iris Versicolour
9,5.3,1.5,0.2,Iris Setosa
10,5.1,2.1,5.0,Iris Versicolour
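
The following plain-Python sketch shows one common way z-score trimming can work. It is an illustration under stated assumptions, not the package's code: the helper name `zscore_outlier_flags`, the threshold of 3, and the use of the sample standard deviation are all assumptions. A row is kept only if none of its values is flagged.

```python
from statistics import mean, stdev


def zscore_outlier_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold (hypothetical helper)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [abs(v - mu) / sigma > threshold for v in values]


columns = {
    "SepalLengthCm": [5.1, 4.9, 4.7, 5.5, 5.1, 50, 54, 5.0, 5.2, 5.3, 5.1],
    "SepalWidthCm": [1.4, 1.4, 20, 2.0, 0.7, 1.6, 1.2, 1.4, 1.8, 1.5, 2.1],
    "PetalWidthCm": [0.2, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.4, 0.2, 5],
}

# Keep a row only if none of its values is flagged in any column.
flags_per_column = [zscore_outlier_flags(col) for col in columns.values()]
kept_rows = [i for i, row in enumerate(zip(*flags_per_column)) if not any(row)]
print(kept_rows)  # [0, 1, 3, 4, 5, 6, 7, 8, 9, 10]
```

Under these assumptions only row 2 is dropped, which matches the output above. The values 50 and 54 survive because together they inflate the mean and standard deviation of `SepalLengthCm`, leaving their z-scores at about 2.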


## Visualize outliers
We can visualize outliers using `visualize_outliers`. This function returns an Altair plot of the data distribution, drawn with the chosen plot type.

In [10]:
visualize_outliers(df, columns=None, type='violin')