# Multivariate Analysis for Outlier Detection

In this notebook, we will explore methods to detect outliers in datasets using **multivariate analysis**. Multivariate outliers are data points that appear unusual **only when considering multiple variables together**, rather than individually.

We will use the **Iris dataset** as an example and explore:
- Boxplots for visual detection of outliers
- Scatterplot matrices for multivariate visualization
- Tukey's method for manual outlier detection

In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns

## Setting up plotting parameters

We'll define some global plotting settings for better visualization.
- `figure.figsize` sets the default figure size as (width, height) in inches.
- `sns.set_style('whitegrid')` uses a white background with grid lines, which makes values easier to read off the axes.

In [None]:
%matplotlib inline
rcParams['figure.figsize'] = 5, 4
sns.set_style('whitegrid')

## Loading the Iris dataset

We will load the Iris dataset, add column names, and inspect the first few rows.
- The file has no header row, so we assign the column names ourselves: `Sepal Length`, `Sepal Width`, `Petal Length`, `Petal Width`, `Species`.
- We also separate the feature values (`x`) from the target labels (`y`).

In [None]:
address = '/workspaces/python-for-data-science-and-machine-learning-essential-training-part-1-3006708/data/iris.data.csv'
df = pd.read_csv(filepath_or_buffer=address, header=None, sep=',')

df.columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']
x = df.iloc[:,0:4].values
y = df.iloc[:,4].values
df.head()

## Visual Inspection Using Boxplots

A **boxplot** allows us to see the distribution of data and identify potential outliers. Outliers appear as points beyond the whiskers of the box.

Here we plot the distribution of **Sepal Length** for each **Species**, with one colored box per species.

In [None]:
sns.boxplot(x='Species', y='Sepal Length', data=df, hue='Species', palette='hls', legend=False)

### Observations from Boxplot

- We are plotting two variables in one plot: `Sepal Length` and `Species`.
- Points beyond the whiskers are potential outliers. For example, in the `Virginica` species, any point beyond the whiskers is suspicious and worth investigating.
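
By default, `sns.boxplot` draws whiskers out to the most extreme points within 1.5 * IQR of the quartiles (`whis=1.5`), so the points shown beyond them can also be listed numerically. The sketch below is one way to do that per species. The toy `demo` DataFrame and the helper name `flag_beyond_whiskers` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df; the values are made up for illustration.
demo = pd.DataFrame({
    'Species': ['a'] * 6 + ['b'] * 6,
    'Sepal Length': [5.0, 5.1, 5.2, 5.0, 5.1, 7.9,
                     6.0, 6.1, 6.2, 6.3, 6.1, 6.2],
})

def flag_beyond_whiskers(frame, value_col, group_col, k=1.5):
    """Return rows lying below Q1 - k*IQR or above Q3 + k*IQR within their own group."""
    grouped = frame.groupby(group_col)[value_col]
    # Broadcast each group's quartiles back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    mask = (frame[value_col] < q1 - k * iqr) | (frame[value_col] > q3 + k * iqr)
    return frame[mask]

flagged = flag_beyond_whiskers(demo, 'Sepal Length', 'Species')
print(flagged)
```

With the notebook's data, the equivalent call would be `flag_beyond_whiskers(df, 'Sepal Length', 'Species')`; the rows it returns should broadly correspond to the points drawn beyond the whiskers.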

## Scatterplot Matrix for Multivariate Outlier Detection

A **scatterplot matrix** allows us to visualize pairwise relationships between variables.
- Multivariate outliers may not be visible in a single variable but can appear as unusual points in scatterplots between two variables.
- Here, we use Seaborn's `pairplot` function to plot all features against each other and color by species.

In [None]:
sns.pairplot(df, hue='Species', palette='hls')

### Observations from Scatterplot Matrix

- Check each scatterplot for points that do not fit the general clusters.
- For instance, a red point may sit outside the main cluster in the panels that involve `Sepal Width`.
- This point corresponds to record **41**, which may be a multivariate outlier worth further investigation.

## Tukey Outlier Labeling (Manual Method)

Tukey's method identifies outliers in one variable at a time, based on the **interquartile range (IQR)**.
- Calculate Q1 (25th percentile) and Q3 (75th percentile).
- Compute `IQR = Q3 - Q1`.
- Outliers are values below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR`.

This is useful when you want to detect outliers without relying solely on visualizations.
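
The three steps above can be sketched as a small helper. This is a minimal illustration rather than the notebook's own code: the function name `tukey_outliers` and the sample values are assumptions. `Series.quantile` uses linear interpolation by default, so other quartile conventions may give slightly different bounds.

```python
import pandas as pd

def tukey_outliers(values, k=1.5):
    """Return (lower, upper, outliers) for a pandas Series using Tukey's rule."""
    q1 = values.quantile(0.25)   # 25th percentile
    q3 = values.quantile(0.75)   # 75th percentile
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = values[(values < lower) | (values > upper)]
    return lower, upper, outliers

# Hypothetical sample values, made up for illustration.
sample = pd.Series([2.9, 3.0, 3.1, 3.0, 3.2, 2.8, 3.1, 4.6])
lower, upper, outliers = tukey_outliers(sample)
print(f"lower={lower:.2f}, upper={upper:.2f}")
print(outliers)
```

In this notebook, the same helper could be applied to a single column, for example `tukey_outliers(df['Sepal Width'])`.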

In [None]:
# Set pandas display options for better readability
pd.options.display.float_format = '{:.1f}'.format

# Create a DataFrame of features
X_df = pd.DataFrame(x, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])
print(X_df.describe())

### Example Calculation of Outlier Thresholds

- Suppose `Sepal Width` has Q1 = 2.8 and Q3 = 3.3
- IQR = 3.3 - 2.8 = 0.5
- Lower bound = Q1 - 1.5*IQR = 2.8 - 0.75 = 2.05
- Upper bound = Q3 + 1.5*IQR = 3.3 + 0.75 = 4.05

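Any value outside this range is labeled an outlier. Comparing the `min` and `max` rows of the `describe()` table with these bounds is a quick check: a `min` below the lower bound or a `max` above the upper bound means the column contains at least one outlier.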
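
The arithmetic above can be checked in a few lines. The quartiles are the ones quoted in the example; the list `widths` is a made-up handful of values, used only to show the comparison against the bounds.

```python
# Quartiles quoted in the example above for Sepal Width.
q1, q3 = 2.8, 3.3

iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(f"IQR={iqr:.2f}, lower={lower:.2f}, upper={upper:.2f}")

# Hypothetical values, made up for illustration (not read from the dataset).
widths = [2.0, 2.9, 3.0, 3.4, 4.4]
outside = [w for w in widths if w < lower or w > upper]
print(outside)
```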