# **Lab experience #09 (STUDENTS): Anomaly Detection using proximity-based approaches vs PCA**

This ninth lab session aims **to compare anomaly detection based on proximity-based approaches vs reconstruction-based methods (i.e., PCA)**. It builds on Prof. Stella's lectures "Introduction to anomaly detection", "Nearest neighbor based anomaly detection", and "Clustering Based, Statistical Approaches and Reconstruction Based".

In this lab session, you will **re-use code already developed in the previous labs** and better explore [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

The main task is to identify outliers in _a completely new dataset_ using kNN, LOF, and PCA. Verify the agreement, or the degree of mismatch, among the three methods. Then remove the outliers (using the method you consider the best performing) and discuss the quality of the remaining dataset.

In [None]:
# Useful packages that you might want to use
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform as sf
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, Normalizer

# **Step 1**: Data loading, visual inspection, and pre-processing

In [None]:
# Load the dataset
X = np.load('Dataset_lab09.npy')
N, M = X.shape   # N samples (rows), M features (columns)

Here, you are free to explore and pre-process the dataset, relying on your experience and on any assumption you can make about this specific dataset. Hint: visualize the dataset in 2D, compute the proximity matrix, ...
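As a possible starting point, the sketch below compares the proximity matrix before and after scaling. It runs on a synthetic array `X_demo` (an assumption, standing in for the lab dataset) and uses `StandardScaler` only as one example among the imported scalers; which scaler suits the real data is left for you to decide.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab dataset (assumption: the real X is an N x M float array)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])  # features on very different scales

# Standardize each feature to zero mean and unit variance
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Pairwise Euclidean distances, expanded from condensed form to a square N x N matrix
PM_raw = squareform(pdist(X_demo, metric='euclidean'))
PM_scaled = squareform(pdist(X_demo_scaled, metric='euclidean'))

print('Largest raw distance:   ', PM_raw.max())
print('Largest scaled distance:', PM_scaled.max())
```

Both matrices can then be shown side by side (for instance with `plt.imshow` or `sns.heatmap`) to see how much a single large-scale feature dominates the raw distances.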

In [None]:
# Visualization of data and proximity matrix pre-/post-scaling
#

# **Step 2**: Investigation on outliers - Proximity-based approaches


Useful references:
- [NN](https://scikit-learn.org/stable/modules/neighbors.html) see section 1.6.1.1. "Finding the Nearest Neighbors" and also check previous lab solutions (e.g., Lab07, Lab08).

- [LOF](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor) (see also slide no. 8 and [Wiki](https://en.wikipedia.org/wiki/Local_outlier_factor)). Here, you don't have to implement the algorithm manually; you only need to use the sklearn package correctly to obtain the LOF.

**2.1 Nearest neighbor (NN)**
- Choose the value of the design parameters.
- Apply the algorithm.
- Discover, count, and label the outliers.
- Remove the outliers from the dataset, obtaining a _clean_ dataset (X_clean_NN).
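The four steps above could look roughly as follows. This is a minimal sketch on synthetic data, using the "a-priori percentage" criterion; the names `X_demo`, `k`, and `outlier_fraction` and their values are illustrative assumptions, not recommended settings for the lab dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data (assumption): one Gaussian blob plus three isolated points
rng = np.random.default_rng(1)
blob = rng.normal(0.0, 1.0, size=(100, 2))
far_points = np.array([[10.0, 10.0], [-12.0, 9.0], [11.0, -10.0]])
X_demo = np.vstack([blob, far_points])

k = 5  # illustrative neighborhood order
# Ask for k + 1 neighbors: when the training data is passed to kneighbors,
# each point is returned as its own nearest neighbor at distance 0
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
dk = distances[:, -1]  # distance of each point from its k-th nearest neighbor

# Flag roughly the top 3% of k-th neighbor distances as outliers (illustrative choice)
outlier_fraction = 0.03
dk_th = np.quantile(dk, 1 - outlier_fraction)
outlier_idx = np.where(dk > dk_th)[0]

# Label outliers with -1 and inliers with 1, then drop the outliers
labels_demo = np.ones(len(X_demo), dtype=int)
labels_demo[outlier_idx] = -1
X_clean_demo = X_demo[labels_demo == 1]

print('Outliers found:', len(outlier_idx))
```

The knee-point and fixed-threshold criteria only change how `dk_th` is chosen; the labelling and cleaning lines stay the same.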

In [None]:
# Apply the algorithm
from sklearn.neighbors import NearestNeighbors as knn
neighborhood_order = ...

# Find neighborhood
neighborhood_set   = knn( n_neighbors=neighborhood_order, algorithm='ball_tree').fit(YOURDATA)
distances, indices = neighborhood_set.kneighbors(YOURDATA)

# compute distances from kth nearest neighbors and sort them
dk_sorted     = np.sort(distances[:,-1])
dk_sorted_ind = np.argsort(distances[:,-1])


# Identify the outliers as those points with too high distance from their own kth nearest neighbor. Hint: choose one of the possible solutions of Lab09:
# 1: use the knee point                                 --> knee_x, knee_y
# 2: decide a percentage of outliers a-priori (n%)      --> n [%]
# 3: use a threshold from the above plot, left panel    --> dk_th = k
#
#
#
#

In [None]:
# Verify the outlier detection, count and label the outliers
figKNN = plt.figure('kNN distance values', figsize=(8,5))
# your plot
# add axes labels, title, legend (if needed), grid
plt.show()

# Label=-1 to outliers
KNN_labels = np.ones(N)
KNN_labels[OUTLIERS_INDICES] = -1

# Count the no. of outliers
countKNN = ...
print(countKNN)

In [None]:
# Visualize the solution in a 2D scatterplot or PCA/tSNE plot (use BLACK for OUTLIERS, yellow for inliers)
#

**2.2 Local Outlier Factor (LOF)**

- Choose the value of the design parameters.
- Apply the algorithm.
- Discover, count, and label the outliers.
- Remove the outliers from the dataset, obtaining a _clean_ dataset (X_clean_LOF).
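A minimal sketch of these steps on synthetic data is given below. The values `n_neighbors=20` and `contamination=0.03` are illustrative assumptions; on the lab dataset they are design parameters you have to choose and justify.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumption): one Gaussian blob plus three isolated points
rng = np.random.default_rng(2)
blob = rng.normal(0.0, 1.0, size=(100, 2))
far_points = np.array([[10.0, 10.0], [-12.0, 9.0], [11.0, -10.0]])
X_demo = np.vstack([blob, far_points])

lof_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
lof_labels_demo = lof_demo.fit_predict(X_demo)      # 1 = inlier, -1 = outlier

# sklearn stores the opposite of the LOF: flip the sign to get values
# close to 1 for inliers and clearly larger than 1 for outliers
lof_scores_demo = -lof_demo.negative_outlier_factor_

n_outliers_demo = int(np.sum(lof_labels_demo == -1))
X_clean_demo = X_demo[lof_labels_demo == 1]

print('Outliers found:', n_outliers_demo)
print('Largest LOF value:', lof_scores_demo.max())
```

Plotting the sorted `lof_scores_demo` is one way to check whether the chosen `contamination` cuts the curve at a sensible place.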

In [None]:
# Apply the algorithm
from sklearn.neighbors import LocalOutlierFactor

lof_model  = LocalOutlierFactor(n_neighbors  = ...,
                                algorithm='ball_tree',
                                metric=...,
                                p=...,
                                metric_params = None,
                                contamination = ...)

LOF_labels = lof_model.fit_predict(YOURDATA)     # predict the labels (1 inlier, -1 outlier) of X according to LOF
LOF_values = lof_model.negative_outlier_factor_

In [None]:
# Verify the outlier detection, count and label the outliers
figLOF = plt.figure('LOF values', figsize=(8,5))
# your plot
# add axes labels, title, legend (if needed), grid
plt.show()

# Count the no. of outliers
countLOF = ...
print(countLOF)

In [None]:
# Visualize the solution in a 2D scatterplot or PCA/tSNE plot (use BLACK for OUTLIERS, yellow for inliers)
#
#
#

# **Step 3**: Investigation on outliers - Principal Component Analysis (PCA)

**Steps:**

- Decide on the number of principal components (q)
- Run PCA. You can use [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
- Reconstruct the dataset with (only) q components
- Compute the reconstruction error (RE) for every data point (after mapping it from the q-dimensional space back to the original M-dimensional feature space)
- Identify, count, and label the outliers.
- Remove the outliers from the dataset, obtaining a _clean_ dataset (X_clean_PCA).

_Note: sklearn.decomposition.PCA implements linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD._
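The reconstruction-error route could be sketched as below. The data are synthetic (points close to a 2D plane in 3D, plus one point placed off the plane), and `q = 2` and the 99% quantile threshold are illustrative assumptions. `inverse_transform` maps the q scores back to the original space and adds the feature means back, so no manual centering is needed.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): points near the plane z = x + y, plus one off-plane point
rng = np.random.default_rng(3)
t = rng.normal(size=(200, 2))
z = t[:, 0] + t[:, 1] + 0.05 * rng.normal(size=200)
X_demo = np.column_stack([t[:, 0], t[:, 1], z])
X_demo[0] = [0.0, 0.0, 8.0]   # deliberately far from the plane

q = 2                                        # illustrative number of components
pca_demo = PCA(n_components=q)
scores_demo = pca_demo.fit_transform(X_demo)          # N x q projections
X_hat_demo = pca_demo.inverse_transform(scores_demo)  # N x M reconstruction

# Reconstruction error: Euclidean norm of the residual of each point
RE_demo = np.linalg.norm(X_demo - X_hat_demo, axis=1)

# Flag points whose error exceeds the 99% quantile (illustrative threshold)
RE_th = np.quantile(RE_demo, 0.99)
pca_labels_demo = np.where(RE_demo > RE_th, -1, 1)

print('Index of the largest reconstruction error:', int(np.argmax(RE_demo)))
```

Because the features are centered but not scaled, features with a large numeric range dominate both the components and the error, which is worth keeping in mind when deciding on pre-processing.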

`q` is the same as `NCOMP`.

You can identify the outliers as those points that deviate too much from the main PCs.

You learnt **two possible methods** to implement this idea (see Prof. Stella's slides no. 13-15):

1.   use a **threshold on the reconstruction errors** --> "Objects with large reconstruction errors are anomalies"
2.   use the **Chi-square distribution**  --> "An observation is anomalous if, for a given significance level $\alpha$, the sum of the squared values of its projection on the first q PCs (normalized by the corresponding eigenvalues) is larger than $\chi^2_q(\alpha)$."


Here, you can try either one of the two.
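For the Chi-square option, one possible reading of the rule is sketched below: each projection is squared, divided by the eigenvalue of its component (`explained_variance_` in sklearn), summed over the first q PCs, and compared with the Chi-square quantile with q degrees of freedom. The synthetic data, `q = 2`, and `alpha = 0.01` are assumptions made only for illustration.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA

# Synthetic data (assumption): Gaussian features with different spreads,
# plus one point that is extreme along the highest-variance feature
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X_demo[0] = [20.0, 0.0, 0.0, 0.0]

q = 2                                   # illustrative number of components
pca_demo = PCA(n_components=q).fit(X_demo)
proj_demo = pca_demo.transform(X_demo)  # projections on the first q PCs

# Sum of squared projections, each normalized by its eigenvalue
stat_demo = np.sum(proj_demo**2 / pca_demo.explained_variance_, axis=1)

# Critical value: upper-alpha quantile of the Chi-square distribution with q d.o.f.
alpha = 0.01
chi2_th = chi2.ppf(1 - alpha, df=q)
chi2_labels_demo = np.where(stat_demo > chi2_th, -1, 1)

print('Threshold:', chi2_th)
print('Outliers found:', int(np.sum(chi2_labels_demo == -1)))
```

Note that this statistic measures how extreme a point is inside the subspace of the first q PCs, whereas the reconstruction error measures how far it lies from that subspace, so the two methods need not flag the same points.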

In [None]:
from sklearn.decomposition import PCA

# Design parameters
NCOMP = ...    # number of components

# Apply the algo
pca = PCA(n_components=NCOMP)
pca_result = pca.fit_transform(YOURDATA)
print('PCA: explained variance ratio per principal component: {}'.format(pca.explained_variance_ratio_.round(2)))

# Compute the reconstruction error for every data point
#
#

In [None]:
# Identify the outliers as those points that deviate too much from the main PCs. Hint: see Prof. Stella's slides no. 13-15
# 1: use a threshold                  --> "Objects with large reconstruction errors are anomalies"
# 2: use the Chi-square distribution  --> "An observation is anomalous if, for a given significance level alpha, the sum of the squared values of its projection on the first q PCs
#                                         (normalized over the corresponding eigenvalues) is greater than the value of the Chi-square distribution in alpha."
#
#
#
#

In [None]:
# Verify the outlier detection, count and label the outliers
figPCA = plt.figure('PCA decomposition', figsize=(8,5))
# your plot
# add axes labels, title, legend (if needed), grid
plt.show()

In [None]:
# Labelling
PCA_labels = ...

In [None]:
# Count the no. of outliers
countPCA = ...
print(countPCA)

In [None]:
# Visualize the solution in a 2D scatterplot or PCA/tSNE plot (use BLACK for OUTLIERS, yellow for inliers)
#
#
#

# **Step 4**: Compare the above solutions
Hint: verify agreement or mismatches among the different algorithms (e.g., visual inspection, Rand index on labels, ...).
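A pairwise comparison could be sketched as follows. The three label vectors are hand-made toy examples (assumptions), standing in for `KNN_labels`, `LOF_labels`, and `PCA_labels`; the sketch reports the fraction of identical labels, the Rand index from `sklearn.metrics.rand_score`, and the number of points flagged by both methods.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import rand_score

# Toy label vectors (assumption): 1 = inlier, -1 = outlier
toy_labels = {
    'kNN': np.array([1, 1, 1, -1, -1, 1, 1, 1]),
    'LOF': np.array([1, 1, 1, -1, 1, 1, 1, 1]),
    'PCA': np.array([1, -1, 1, -1, 1, 1, 1, 1]),
}

comparison = {}
for (name_a, a), (name_b, b) in combinations(toy_labels.items(), 2):
    comparison[(name_a, name_b)] = {
        'agreement': float(np.mean(a == b)),               # fraction of identical labels
        'rand_index': float(rand_score(a, b)),             # pair-counting agreement
        'common_outliers': int(np.sum((a == -1) & (b == -1))),
    }

for pair, result in comparison.items():
    print(pair, result)
```

When outliers are rare, both the agreement and the Rand index are dominated by the inliers, so the count of common outliers is often the more informative number.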


# **Step 5**: Validate your solutions using the TRUE labels
Hint: use visual inspection, Rand index on labels, confusion matrix.
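A minimal sketch of the supervised check is shown below, assuming the true labels use the same convention as the detectors (1 for inliers, -1 for outliers); if the real label file uses a different encoding, it has to be mapped first. Both vectors here are toy examples.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, rand_score

# Toy vectors (assumption): 1 = inlier, -1 = outlier
y_true_demo = np.array([1, 1, 1, -1, -1, 1, 1, 1])
y_pred_demo = np.array([1, 1, 1, -1, 1, 1, -1, 1])

# Rows = true class, columns = predicted class, both ordered as [-1, 1]
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[-1, 1])
hit, missed, false_alarm, correct_inlier = cm_demo.ravel()

# Treat "outlier" as the positive class
precision_demo = hit / (hit + false_alarm)   # flagged points that are true outliers
recall_demo = hit / (hit + missed)           # true outliers that were flagged

print(cm_demo)
print('Precision:', precision_demo, 'Recall:', recall_demo)
print('Rand index:', rand_score(y_true_demo, y_pred_demo))
```

The same lines can be repeated for each of the three detectors to rank them against the ground truth.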

In [None]:
# Load TRUE labels
#

In [None]:
# Supervised validation
#

# _This is the end of Lab session #09_ ✅


## Utility functions

Retrieve utility functions from previous labs, if needed.