# Extreme Value Analysis for Outliers

In this notebook, we will explore **univariate methods** for detecting outliers in data. Outlier detection is an important data preprocessing step in machine learning and can also serve as an analytical method for identifying anomalies in datasets. Common use cases include:
- Detecting fraud
- Identifying equipment failure
- Monitoring cybersecurity events

We will use the **Tukey method** for detecting outliers, which can be visualized through boxplots or calculated mathematically.

## Step 1: Import necessary libraries

We'll use the standard Python libraries for data manipulation and visualization: `numpy`, `pandas`, and `matplotlib`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams  # pylab is discouraged; import rcParams from matplotlib directly

%matplotlib inline
rcParams['figure.figsize'] = 5,4

## Step 2: Load the dataset

We will use the famous **Iris dataset** for demonstration. This dataset contains measurements of sepal length, sepal width, petal length, petal width, and the species of iris.

<details>
<summary>Hint</summary>
Use `pd.read_csv()` to load the dataset. Assign column names for easier access.
</details>

In [None]:
address = '/workspaces/python-for-data-science-and-machine-learning-essential-training-part-1-3006708/data/iris.data.csv'
df = pd.read_csv(filepath_or_buffer=address, header=None, sep=',')

df.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']

# Split features and target
x = df.iloc[:,0:4].values
y = df.iloc[:,4].values

df.head()
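
As an alternative to assigning `df.columns` after loading, `pd.read_csv()` also accepts a `names` argument. The sketch below is a minimal illustration, assuming a few made-up rows in an in-memory string in place of the CSV file; `sample_csv`, `column_names`, and `sample_df` are names introduced here for the example.

```python
import io

import pandas as pd

# Hypothetical rows in the same five-column layout as the Iris file;
# they stand in for the real CSV so the example runs on its own.
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)

column_names = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']

# header=None says the data has no header row; names supplies the column labels
sample_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # the four measurement columns should be numeric
```

Checking `shape` and `dtypes` right after loading is a quick way to catch a wrong separator or an unexpected header row before any analysis starts.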

## Step 3: Identify outliers using Tukey Boxplot

A **Tukey boxplot** uses the interquartile range (IQR) to identify potential outliers. Points that lie more than 1.5×IQR below the first quartile (Q1) or above the third quartile (Q3) are flagged as potential outliers.

### Visualization:
- The **box** represents the interquartile range (25th to 75th percentile)
- The **whiskers** extend to the most extreme data points that lie within 1.5×IQR of the box edges
- Points beyond whiskers are potential outliers
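
The three elements above can be reproduced numerically. The following is a minimal sketch on a small made-up array (not the Iris data); the variable names are illustrative, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

# Hypothetical measurements, chosen so that two values sit far from the rest
values = np.array([2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4])

# The box: 25th to 75th percentile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that lie inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
fliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, q3)
print("whiskers:", whisker_low, whisker_high)
print("potential outliers:", fliers)
```

Note that the whiskers end at actual data values, so they are usually shorter than the full 1.5×IQR distance.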

<details>
<summary>Hint</summary>
Use `df.boxplot(return_type='dict')` to create the boxplot and `plt.show()` to display it.
</details>

In [None]:
# Only the numeric columns are plotted; 'Species' is skipped
df.boxplot(return_type='dict')
plt.show()

## Step 4: Filter outliers from a specific column

We can use comparison operators to isolate outliers for further inspection. For example, let's check **Sepal Width** for extreme values.

In [None]:
# Upper outliers: Sepal Width values greater than 4 (cutoff read off the boxplot)
Sepal_width = x[:,1]
iris_outliers = (Sepal_width > 4)
df[iris_outliers]

In [None]:
# Lower outliers: Sepal Width values less than 2.05
iris_outliers = (Sepal_width < 2.05)
df[iris_outliers]
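
The two filters can also be combined into a single boolean mask. This is a minimal sketch using a small made-up DataFrame (`toy`, a hypothetical stand-in for `df`) and the same two cutoffs as the cells above.

```python
import pandas as pd

# Hypothetical sepal widths standing in for the notebook's Sepal Width column
toy = pd.DataFrame({'Sepal Width': [3.5, 4.4, 2.0, 3.0, 4.1, 2.9]})

upper_cutoff = 4
lower_cutoff = 2.05

# | is the element-wise OR for boolean Series;
# each comparison needs its own parentheses because | binds tighter than > and <
is_outlier = (toy['Sepal Width'] > upper_cutoff) | (toy['Sepal Width'] < lower_cutoff)

print(toy[is_outlier])
```

Keeping the mask in its own variable makes it easy to invert with `~is_outlier` when the goal is to inspect the remaining rows instead.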

## Step 5: Apply Tukey outlier labeling method

Instead of visually inspecting a boxplot, we can calculate the Tukey outlier thresholds mathematically:
- Compute Q1 (25th percentile) and Q3 (75th percentile)
- Compute IQR = Q3 - Q1
- Outliers are values outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR]

In [None]:
pd.options.display.float_format = '{:.1f}'.format
X_df = pd.DataFrame(x, columns=['Sepal Length','Sepal Width','Petal Length','Petal Width'])
X_df.describe()
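
Rather than reading Q1 and Q3 off the `describe()` table by hand, the thresholds can be computed directly. The sketch below assumes a hypothetical helper, `tukey_fences`, and a small made-up DataFrame; neither is part of the original notebook.

```python
import pandas as pd


def tukey_fences(frame, k=1.5):
    """Return per-column (lower, upper) Tukey fences for a numeric DataFrame."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical data with two numeric columns
toy = pd.DataFrame({
    'Sepal Width': [2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4],
    'Petal Width': [0.2, 0.2, 0.3, 1.3, 1.4, 1.5, 2.0, 2.3],
})

lower, upper = tukey_fences(toy)

# Boolean frame: True where a value falls outside its own column's fences
outlier_mask = (toy < lower) | (toy > upper)

print(lower)
print(upper)
print(outlier_mask.sum())  # number of flagged values per column
```

In the notebook, the same helper could be called as `tukey_fences(X_df)` to get thresholds for all four measurement columns at once.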

## Step 6: Self-check

Compare the boxplot visualization with the mathematical calculation of outliers. Confirm that every point drawn beyond the whiskers falls outside the calculated Tukey thresholds.

<details>
<summary>Solution hint</summary>
Use the `describe()` output to find Q1 and Q3, then compute IQR = Q3 - Q1. Values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are outliers.
</details>
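
One way to automate this comparison is to ask matplotlib, which draws pandas boxplots, for the points it would plot as outliers and compare them with the formula's result. This is a minimal sketch on made-up values, assuming `matplotlib.cbook.boxplot_stats` with its default `whis=1.5`.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical measurements (not the Iris data)
values = np.array([2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4])

# Outliers from the Tukey formula
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
computed = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Outliers as a boxplot would draw them; boxplot_stats returns one dict per dataset
stats = cbook.boxplot_stats(values, whis=1.5)[0]
plotted = stats['fliers']

# The two sets should agree
print(sorted(computed.tolist()) == sorted(plotted.tolist()))
```

If the two results ever disagree on real data, the usual suspects are a different `whis` setting or a different percentile interpolation method.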