# Exercise: Extreme Value Analysis for Outliers

In this exercise, you will perform **univariate outlier detection** using the **Tukey method**.

You will:
1. Load the Iris dataset.
2. Visualize numeric variables using boxplots.
3. Identify potential outliers in Sepal Width.
4. Apply the Tukey outlier labeling method.

Use the hints if you get stuck, and check your work using the solution section at the end.
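
For orientation before you start, the Tukey rule can be sketched on a handful of made-up numbers. This is only an illustration under stated assumptions: the values and the names `values`, `lower_fence`, and `upper_fence` are hypothetical and do not come from the Iris file.

```python
import pandas as pd

# Hypothetical toy data (not taken from the Iris dataset)
values = pd.Series([2.0, 2.9, 3.0, 3.1, 3.2, 3.4, 4.5], name="toy")

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Tukey fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (values < lower_fence) | (values > upper_fence)

print(values[is_outlier])
```

Here only the smallest and largest toy values fall outside the fences. A flagged point is a candidate for inspection, not automatically an error.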

## Step 1: Import libraries
<details>
<summary>Hint</summary>
Use `import numpy as np`, `import pandas as pd`, and `import matplotlib.pyplot as plt`. Also import `rcParams` from `pylab` and run the `%matplotlib inline` magic so that plots render inside the notebook.
</details>

In [None]:
# YOUR CODE HERE

## Step 2: Load the Iris dataset
<details>
<summary>Hint</summary>
Use `pd.read_csv()` with `header=None` to load the dataset from the file path, since the file has no header row. Assign the column names `['Sepal Length','Sepal Width','Petal Length','Petal Width','Species']`. Store the first four columns as `x` and the last column as `y`.
</details>
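
The loading pattern from the hint can be tried on a tiny in-memory file first. This is a minimal sketch, assuming a headerless CSV; the three rows, the column names, and the variable names `raw`, `toy`, `features`, and `labels` are all hypothetical.

```python
import io
import pandas as pd

# Hypothetical three-row CSV with no header line (made-up values, not the Iris file)
raw = io.StringIO("1.0,2.0,a\n3.0,4.0,b\n5.0,6.0,a\n")

# header=None tells pandas that the first line is data, not column names
toy = pd.read_csv(raw, header=None, sep=",")
toy.columns = ["Feature 1", "Feature 2", "Label"]

# .iloc slices by position; .values returns the underlying array
features = toy.iloc[:, 0:2].values
labels = toy.iloc[:, 2].values

print(features.shape)
print(list(labels))
```

Without `header=None`, pandas would treat the first data row as column names and silently drop one observation.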

In [None]:
# YOUR CODE HERE

## Step 3: Create a Tukey boxplot for numeric variables
<details>
<summary>Hint</summary>
Use `df.boxplot(return_type='dict')` followed by `plt.show()` to display the boxplot. Points drawn beyond the whiskers are potential outliers.
</details>

In [None]:
# YOUR CODE HERE

## Step 4: Identify outliers in Sepal Width using comparison operators
<details>
<summary>Hint</summary>
Select the `Sepal Width` column from `x`, then create boolean masks for values greater than Q3 + 1.5*IQR and for values less than Q1 - 1.5*IQR. Combine the masks with `|` and use the result to filter the dataframe.
</details>

In [None]:
# YOUR CODE HERE

## Step 5: Apply the Tukey outlier labeling method
<details>
<summary>Hint</summary>
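Build a DataFrame `X_df` from the four numeric columns, then use `X_df.describe()` to read Q1 (the `25%` row) and Q3 (the `75%` row), and compute IQR as Q3 - Q1. Then define outliers as values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`.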
</details>
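
The `describe()` approach works on every column at once because pandas aligns the quartile Series with the DataFrame's column labels. The following is a minimal sketch of that behavior, assuming a hypothetical two-column frame `toy` with made-up values rather than the Iris measurements.

```python
import pandas as pd

# Hypothetical two-column frame (made-up values, not Iris measurements)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 20.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})

desc = toy.describe()
q1 = desc.loc["25%"]  # Series indexed by column name
q3 = desc.loc["75%"]
iqr = q3 - q1

# Comparing a DataFrame with a Series matches the Series index to the column labels,
# so each column is tested against its own fences
flags = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

print(flags.sum())  # number of flagged values per column
```

In this toy frame only the value 20.0 in column `a` is flagged. The result is a boolean DataFrame of the same shape as the input, which is also what the exercise asks you to produce.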

In [None]:
# YOUR CODE HERE

## Step 6: Self-Check (Collapsed Solution)
<details>
<summary>Click to expand solution</summary>

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 5,4

address = '/workspaces/python-for-data-science-and-machine-learning-essential-training-part-1-3006708/data/iris.data.csv'
df = pd.read_csv(address, header=None, sep=',')
df.columns = ['Sepal Length','Sepal Width','Petal Length','Petal Width','Species']
x = df.iloc[:,0:4].values
y = df.iloc[:,4].values

# Boxplot of the numeric columns; points beyond the whiskers are potential outliers
df.boxplot(return_type='dict')
plt.show()

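# Outliers in Sepal Width (column index 1 of x)
# The cutoffs are read off the boxplot and approximate the Tukey fences computed below
sepal_width = x[:,1]
outliers_high = sepal_width > 4
outliers_low = sepal_width < 2.05
# print() is needed because only the last expression of a cell is displayed
print(df[outliers_high | outliers_low])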

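# Tukey method applied to every numeric column at once
X_df = pd.DataFrame(x, columns=['Sepal Length','Sepal Width','Petal Length','Petal Width'])
desc = X_df.describe()
Q1 = desc.loc['25%']  # first quartile of each column
Q3 = desc.loc['75%']  # third quartile of each column
IQR = Q3 - Q1
# True where a value lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column
tukey_outliers = (X_df < (Q1 - 1.5*IQR)) | (X_df > (Q3 + 1.5*IQR))
tukey_outliers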
```
</details>