# Outlier detection

**What is an outlier?**

An outlier is a data point that deviates significantly from the rest of the (so-called normal) observations. Identifying outliers is important in statistics and data analysis because they can have a significant impact on the results of statistical analyses.

Outliers can shift the mean (average), distort other measures of central tendency, and influence the results of tests of statistical significance.
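To make the effect on the mean concrete, here is a minimal sketch using only the standard library. The numbers are made up for illustration and are not taken from any dataset used in this notebook.

```python
import statistics

# Made-up values for illustration only.
values = [10, 12, 11, 13, 12]
with_outlier = values + [100]  # same data plus one extreme value

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)

print(f'mean:   {mean_before:.2f} -> {mean_after:.2f}')
print(f'median: {median_before} -> {median_after}')
```

In this sample a single extreme value more than doubles the mean, while the median stays where it was.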

**What causes outliers?**
- Measurement errors: Errors in data collection or measurement.
- Natural variability: Inherent variability in certain phenomena.
- Data entry errors: Human errors during data entry.
- Experimental errors: In experimental settings, anomalies may occur due to uncontrolled factors, equipment malfunctions, or unexpected events.
- ...

**How to identify an outlier**

There are many ways to identify outliers, for example:
- **Z-score**: also called the 'standard score'. It measures how many standard deviations a value lies from the mean.
- **IQR** (interquartile range): values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are treated as outliers.
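The rest of this notebook works through the IQR rule. For comparison, the z-score approach can be sketched as follows. The sample values and the threshold of 3 are assumptions chosen for illustration (3 is a common rule of thumb, not a fixed standard), and the population standard deviation is used.

```python
import statistics

# Made-up sample for illustration; 95 is the planted extreme value.
values = [12, 14, 15, 13, 16, 14, 15, 13, 14, 15, 16, 14, 13, 15, 95]

mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation

# z-score: distance from the mean, in units of standard deviation.
z_scores = [(v - mean) / std for v in values]

THRESHOLD = 3  # assumed cut-off; a common rule of thumb
z_outliers = [v for v, z in zip(values, z_scores) if abs(z) > THRESHOLD]
print(z_outliers)
```

Because the outlier itself inflates both the mean and the standard deviation, z-scores can fail to flag extreme values in small samples. That is one reason the quartile-based rule is often preferred.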

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# We load a dataset with example data: length of rivers, in miles.
rios = pd.read_csv('../datasets/rios.csv',index_col=0)
rios

In [None]:
# We show the distribution of the data with a histogram.
sns.displot(data=rios,x='Millas')

In [None]:
# We show the data distribution with a boxplot.
# The box spans Q1 to Q3; the "whiskers" reach the most extreme values
# that lie within 1.5 * IQR of the box.
# The points drawn beyond the whiskers are the candidate outliers.
sns.boxplot(data=rios,x='Millas')

In [None]:
# Show statistics
rios.describe()

In [None]:
Q1 = rios.Millas.quantile(0.25)
Q3 = rios.Millas.quantile(0.75)
IQR = Q3 - Q1
mediana = rios.Millas.median()
minimo = rios.Millas.min()
maximo = rios.Millas.max()

print(f'minimum: {minimo}')
print(f'Q1: {Q1}')
print(f'median: {mediana}')
print(f'Q3: {Q3}')
print(f'maximum: {maximo}')
print(f'Interquartile range: {IQR}')

In [None]:
# Calculate the limits for the "whiskers": Q1 - 1.5*IQR and Q3 + 1.5*IQR
BI = (Q1 - 1.5 * IQR)  # lower limit
BS = (Q3 + 1.5 * IQR)  # upper limit

print(f'Lower whisker limit: {BI}')
print(f'Upper whisker limit: {BS}')

In [None]:
# The whiskers drawn in the boxplot never extend past the minimum and maximum of the data:
# if a computed limit falls beyond the data range, the whisker stops at the most extreme observed value.
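A minimal sketch of the difference between the computed limits and the whiskers that actually get drawn. The `demo` series holds made-up values (not the `rios` data), and the names `lower_fence`, `upper_fence`, `lower_whisker` and `upper_whisker` are hypothetical, introduced only for this example.

```python
import pandas as pd

# Made-up lengths for illustration; this is not the rios dataset.
demo = pd.Series([200, 250, 300, 320, 350, 400, 450, 500, 2000], name='Millas')

q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1

# Computed limits: these can fall outside the range of the data.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Drawn whiskers: the most extreme observations that lie inside the limits.
lower_whisker = demo[demo >= lower_fence].min()
upper_whisker = demo[demo <= upper_fence].max()

print(f'limits:   {lower_fence} to {upper_fence}')
print(f'whiskers: {lower_whisker} to {upper_whisker}')
```

Here the lower limit lies below the smallest value, so the lower whisker stops at the minimum. The upper whisker stops at the largest value that is still inside the upper limit.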

In [None]:
# The outliers are the values that fall outside the interval defined by the whisker limits.
outliers = rios[(rios.Millas < BI) | (rios.Millas > BS)].sort_values('Millas')
outliers

In [None]:
# We create a new DataFrame without the outliers
rios_sen_outliers = rios[(rios.Millas >= BI) & (rios.Millas <= BS)].sort_values('Millas')
rios_sen_outliers

In [None]:
# We draw a new boxplot
sns.boxplot(data=rios_sen_outliers,x='Millas')

In [None]:
# Removing outliers changes the quartiles, so the new boxplot may flag new points.
# We could repeat the process until no outliers remain.
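Repeating the process can be sketched as a small loop. The helper `remove_iqr_outliers`, its parameters `k` and `max_rounds`, and the `demo` values are all hypothetical, introduced only for this example. It applies the same 1.5*IQR rule as above and recomputes the quartiles after every pass.

```python
import pandas as pd

def remove_iqr_outliers(series, k=1.5, max_rounds=10):
    """Repeatedly drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    current = series
    for _ in range(max_rounds):
        q1 = current.quantile(0.25)
        q3 = current.quantile(0.75)
        iqr = q3 - q1
        inside = current.between(q1 - k * iqr, q3 + k * iqr)
        if inside.all():
            break  # nothing left to remove
        current = current[inside]
    return current

# Made-up values for illustration; this is not the rios dataset.
demo = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 18, 60], name='Millas')
cleaned = remove_iqr_outliers(demo)
print(cleaned.tolist())
```

In this sample the first pass removes only 60. Once the quartiles are recomputed, 18 falls outside the new limits and is removed as well. Each pass narrows the spread, so repeated trimming can end up discarding legitimate values; whether to iterate is a judgment call.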