# Anomaly Detection Notebook: LOF Original Response
### Date Started:   12 June 2024
### Latest Update: 17 July 2024
### Status: Rebuilt on top of the SVM Notebook to accumulate results 
#### Note: Added a plot of the metrics. The confusion matrix plot is only produced for the first model, and its quadrants are out of standard order

### Notebook Developed Cumulatively to Accumulate Results as Further Models are Added

#### A note on evaluation metrics
##### Data scientists working on skewed-data problems need to keep two things in mind: a) defining and capturing the right metrics, and b) interpreting those metrics

Defining & capturing the right metrics: In the custom classification metrics function, we have intentionally captured the macro-averaged versions of the Precision, Recall, F1 and Average Precision scores. Macro averaging computes each metric per class and then takes the unweighted mean, so the rare outlier class counts as much as the inlier class despite the skew in the data.
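To make the effect of macro averaging concrete, here is a minimal sketch on made-up labels. The toy arrays below are illustrative assumptions and are not taken from the Thyroid data. The sketch compares the macro average with the support-weighted average that sklearn also offers.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels (illustrative only): 1 = outlier (rare class), 0 = inlier
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Macro: per-class scores averaged with equal weight per class
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Weighted: per-class scores averaged by class frequency,
# so the majority (inlier) class dominates the result
weighted_recall = recall_score(y_true, y_pred, average="weighted")

print(f"macro recall:    {macro_recall:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

On skewed data the weighted figure looks more flattering because the well-predicted inlier class carries most of the weight, which is the reason for preferring the macro version here.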

Interpreting metrics: The correct interpretation depends on the use case, and there is no one-size-fits-all answer. In an outlier detection problem such as Thyroid detection, we would typically want as many of the positive instances (in this case the outlier class) as possible to be captured, at the expense of having more False Positives. This is because a detailed screening can clear out False Positives, but if a patient's case is not detected at all (i.e. a False Negative), the patient is sent home and lost to follow-up evaluation.
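The trade-off described above can be sketched with two hypothetical detectors scored against the same made-up labels. The arrays and the helper name `summarize` are assumptions for illustration only.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (illustrative only): 1 = outlier, 0 = inlier
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
strict = [0, 0, 0, 0, 0, 0, 1, 0]   # conservative detector: misses one outlier
lenient = [0, 0, 0, 0, 1, 1, 1, 1]  # liberal detector: flags two inliers as well


def summarize(y_true, y_pred):
    """Return the confusion-matrix counts and the recall on the outlier class."""
    # With labels=[0, 1], ravel() yields the counts in the order TN, FP, FN, TP
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp),
        "recall": float(recall_score(y_true, y_pred)),
    }


print("strict: ", summarize(y_true, strict))
print("lenient:", summarize(y_true, lenient))
```

The lenient detector pays for its perfect recall with two False Positives, which is the preferable error in a screening setting.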

### Chapter 2: One-Class SVM
#### Introduction to semi-supervised approaches
The main problem solved by OneClassSVM is novelty detection using a semi-supervised approach.

As Leo Tolstoy said, ‘All happy families are alike; each unhappy family is unhappy in its own way.’ The implication for our use case is that OneClassSVM excels in situations where the inliers are all alike but the outliers are all special snowflakes. Datasets with this property rule out traditional two-class binary classification, so instead we treat the task as a 'unary' classification problem in which only the statistics of normal operation are known from the inliers.

We will be using OneClassSVM in a similar manner as unsupervised clustering. The approach processes the data as a static distribution, pinpoints the most remote points, and flags them as potential outliers.

The main assumption about the data is that the outliers are actually separated from the inliers, and hence the OCSVM algorithm can indeed separate them out.

The main thing to note is that deleting the minority class from the training data improves performance when the model later sees outliers mixed in with normal inliers. This counts as semi-supervised because the labels are used, not to fit the model itself, but to remove the minority class samples before fitting.
#### Important hyperparameters: 
- kernel type: e.g. linear, poly, rbf, sigmoid (see chapter solution)
- gamma: kernel coefficient for rbf, poly and sigmoid
- degree: used only by polynomial kernel functions
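As a minimal sketch of the semi-supervised workflow, the snippet below fits OneClassSVM on inliers only and then scores new points. The synthetic Gaussian data, the `nu` value and the variable names are assumptions made for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for "inliers only" training data (assumption)
rng = np.random.default_rng(0)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# One point at the centre of the training cloud and one far away from it
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])

# kernel and gamma are the hyperparameters listed above;
# nu (an upper bound on the fraction of training errors) is an illustrative choice
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_inliers)

pred = ocsvm.predict(X_new)               # +1 = inlier, -1 = outlier
decision = ocsvm.decision_function(X_new)  # negative values fall outside the learned boundary
print(pred, decision)
```

The `degree` parameter would only come into play if `kernel="poly"` were chosen instead.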

### Chapter 3: Robust Covariance for Outlier Detection
EllipticEnvelope assumes the data is Gaussian and learns an ellipse to fit the inliers. The ellipse corresponds to what a 2D Gaussian distribution looks like when viewed from the top. It's called 'Robust' because learning the ellipse is not influenced by the outliers present in the data. The distance of an observation to the mode of the distribution obtained from this estimate is used to derive a measure of 'outlyingness.' 

Caveats: This method will not give good results when the number of samples does not exceed the square of the number of features. In our case, we have 6 features and 1000s of samples, so we won't run into this limitation. Also, if your data is very non-Gaussian, the method is unlikely to work well.
#### Important hyperparameter: 
contamination (float) - the proportion of outliers in the dataset
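A minimal sketch of how `contamination` is used in practice is shown below. The synthetic Gaussian data with three injected outliers and the chosen contamination value are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic Gaussian inliers plus three obvious outliers (assumption)
rng = np.random.default_rng(0)
X_gauss = rng.normal(size=(300, 2))
X_injected = np.array([[6.0, 6.0], [-7.0, 5.0], [6.0, -6.0]])
X_all = np.vstack([X_gauss, X_injected])

# contamination is the expected proportion of outliers in the data
envelope = EllipticEnvelope(contamination=0.01, random_state=0)
labels = envelope.fit_predict(X_all)   # +1 = inlier, -1 = outlier

# Squared Mahalanobis distances to the robust location estimate:
# larger values mean the point is further outside the fitted ellipse
distances = envelope.mahalanobis(X_all)
print("labels of injected points:", labels[-3:])
```

Setting `contamination` too high would push the threshold inwards and start flagging genuine inliers, so it is worth tying it to the observed outlier fraction where labels are available.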

### Chapter 4: Using Isolation Forest for Novelty Detection
Isolation Forest is a decision-tree based unsupervised learning algorithm. Each tree is built on a random subsample of the input data, and each split picks a feature at random and a random threshold for it. The key idea is that outliers are much easier to separate from the rest of the data with such random splits, and hence outliers end up on shallow branches. On the other hand, the samples that sit on much 'deeper' branches are inliers. Averaging the path lengths across the entire forest gives confidence that the shorter paths represent the outliers and the longer paths represent the inliers. 

Isolation Forest has several advantages, including:
1) linear time complexity with a low constant and a low memory requirement,
2) the ability to handle high-dimensional problems which have a large number of irrelevant attributes, and 
3) applicability in situations where the training set does not contain any anomalies.

#### What hyperparameters can be used to train the forest?
n_estimators and max_samples are the ones typically used. In this milestone we are going to focus on n_estimators.
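The path-length idea and the two hyperparameters can be sketched as follows. The synthetic data and the specific values of `n_estimators` and `max_samples` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic training data (assumption): a single Gaussian cloud
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))

# One point at the centre of the cloud and one far away from it
X_new = np.array([[0.0, 0.0], [9.0, 9.0]])

# n_estimators = number of trees, max_samples = subsample size per tree
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X_train)

# score_samples is derived from the average path length over all trees:
# the lower the score, the shorter the paths and the more anomalous the point
scores = forest.score_samples(X_new)
pred = forest.predict(X_new)   # +1 = inlier, -1 = outlier
print(scores, pred)
```

More trees generally give a more stable average path length at the cost of longer fitting time, which is why `n_estimators` is a natural first parameter to vary.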

### Chapter 5: Local Outlier Factor (LOF) for Novelty Detection
LOF compares the local density of the sample with densities of its neighbors (as defined by the n_neighbors param when instantiating the LOF). Outliers are those that have 'substantially' lower local density compared to the neighbors. The score gives an indication of how 'isolated' the given sample is when compared to the neighborhood.

By default, LOF uses the Minkowski distance metric (with p=2, which is the Euclidean distance) to measure how far samples are from their neighbors. Further details are in the sklearn documentation available here https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html and the PyOD documentation available here https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.lof
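A minimal sketch of the density comparison on a hand-made dataset follows. The five points are an illustrative assumption: four form a tight square and the fifth sits far away from it.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four tightly packed points and one isolated point (illustrative data)
X_toy = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [0.1, 0.1],
    [5.0, 5.0],
])

# metric="minkowski" with p=2 is the sklearn default, spelled out here for clarity
lof_toy = LocalOutlierFactor(n_neighbors=2, metric="minkowski", p=2)
toy_labels = lof_toy.fit_predict(X_toy)        # +1 = inlier, -1 = outlier

# negative_outlier_factor_ is the negated LOF: values near -1 mean the local
# density matches that of the neighbors; strongly negative values mean isolation
toy_scores = lof_toy.negative_outlier_factor_
print(toy_labels)
print(toy_scores)
```

The four clustered points have the same density as their neighbors, while the isolated point has a far lower local density than the points it is compared against.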

#### Syntax changes for Novelty Detection

When novelty is set to True, one can use predict, decision_function and score_samples on new unseen data. Otherwise, only fit_predict() is available, and it labels the training data itself.
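The difference between the two modes can be sketched as below. The synthetic data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic training data (assumption) and two new, unseen points
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(200, 2))
X_unseen = np.array([[0.0, 0.0], [6.0, 6.0]])

# novelty=False (the default): outlier detection on the training data itself
lof_outlier = LocalOutlierFactor(n_neighbors=20)
train_labels = lof_outlier.fit_predict(X_fit)

# novelty=True: fit once, then score data the model has not seen
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_fit)
new_labels = lof_novelty.predict(X_unseen)               # +1 = inlier, -1 = outlier
new_scores = lof_novelty.score_samples(X_unseen)         # negated LOF of each new point
new_decisions = lof_novelty.decision_function(X_unseen)  # score_samples shifted by offset_

print(new_labels, new_scores, new_decisions)
```

Note that with novelty=True the scores are only meaningful for new data; calling predict on the training samples themselves would give misleading results because each sample would be counted among its own neighbors.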

## 1. Split Dataframe into Train and Test 

In [2]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

from sklearn.metrics import precision_score, recall_score, f1_score,  average_precision_score, confusion_matrix

import scipy
import matplotlib
import matplotlib.pyplot as plt
import seaborn

In [3]:
# Import
data_directory = "../01-Data/"
file_name = "thyroid.csv"      # Cleaned data exported from Week 1 notebook 
dfThyroid = pd.read_csv(data_directory  + file_name)

# Check
dfThyroid.info()
dfThyroid

In [4]:
# Drop first column
dfThyroid = dfThyroid.drop(dfThyroid.columns[0], axis=1) 
dfThyroid

Unnamed: 0,0,1,2,3,4,5,0.1
0,0.774194,0.001132,0.137571,0.275701,0.295775,0.236066,0.0
1,0.247312,0.000472,0.279886,0.329439,0.535211,0.173770,0.0
2,0.494624,0.003585,0.222960,0.233645,0.525822,0.124590,0.0
3,0.677419,0.001698,0.156546,0.175234,0.333333,0.136066,0.0
4,0.236559,0.000472,0.241935,0.320093,0.333333,0.247541,0.0
...,...,...,...,...,...,...,...
3767,0.817204,0.000113,0.190702,0.287383,0.413146,0.188525,0.0
3768,0.430108,0.002453,0.232448,0.287383,0.446009,0.175410,0.0
3769,0.935484,0.024528,0.160342,0.282710,0.375587,0.200000,0.0
3770,0.677419,0.001472,0.190702,0.242991,0.323944,0.195082,0.0


In [5]:
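# Cast the first column ("0") from float to integer.
# Note: astype(int) truncates toward zero, so fractional values such as 0.774194 become 0.
# The positional form below did not work, so the column is addressed by name instead:
# dfThyroid.iloc[:, 0] = dfThyroid.iloc[:, 0].astype(int)
dfThyroid["0"] = dfThyroid["0"].astype(int)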

In [6]:
dfThyroid.info()
dfThyroid.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3772 non-null   int64  
 1   1       3772 non-null   float64
 2   2       3772 non-null   float64
 3   3       3772 non-null   float64
 4   4       3772 non-null   float64
 5   5       3772 non-null   float64
 6   0.1     3772 non-null   float64
dtypes: float64(6), int64(1)
memory usage: 206.4 KB


Unnamed: 0,0,1,2,3,4,5,0.1
count,3772.0,3772.0,3772.0,3772.0,3772.0,3772.0,3772.0
mean,0.00053,0.008983,0.186826,0.248332,0.376941,0.177301,0.024655
std,0.023024,0.043978,0.070405,0.080579,0.087382,0.054907,0.155093
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.001132,0.156546,0.203271,0.328638,0.14918,0.0
50%,0.0,0.003019,0.190702,0.241822,0.375587,0.17377,0.0
75%,0.0,0.004528,0.213472,0.28271,0.413146,0.196721,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [7]:
# Split the dataset into train and test sets after random shuffling (not suitable for time series)
train_dfThyroid, test_dfThyroid = train_test_split(dfThyroid, test_size=0.2, random_state=42)

In [8]:
# Check 
print("Training dataset: \n")
train_dfThyroid.info()
train_dfThyroid.describe()

Training dataset: 

<class 'pandas.core.frame.DataFrame'>
Index: 3017 entries, 2661 to 3174
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3017 non-null   int64  
 1   1       3017 non-null   float64
 2   2       3017 non-null   float64
 3   3       3017 non-null   float64
 4   4       3017 non-null   float64
 5   5       3017 non-null   float64
 6   0.1     3017 non-null   float64
dtypes: float64(6), int64(1)
memory usage: 188.6 KB


Unnamed: 0,0,1,2,3,4,5,0.1
count,3017.0,3017.0,3017.0,3017.0,3017.0,3017.0,3017.0
mean,0.000331,0.009024,0.187627,0.249195,0.377464,0.177772,0.024528
std,0.018206,0.044331,0.07117,0.081646,0.087264,0.056181,0.154706
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.001132,0.156546,0.203271,0.328638,0.14918,0.0
50%,0.0,0.00283,0.190702,0.240654,0.375587,0.17377,0.0
75%,0.0,0.004528,0.213472,0.28271,0.413146,0.196721,0.0
max,1.0,0.901887,1.0,1.0,0.896714,1.0,1.0


In [9]:
print("\n")
test_dfThyroid.info()
test_dfThyroid.describe()



<class 'pandas.core.frame.DataFrame'>
Index: 755 entries, 270 to 543
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       755 non-null    int64  
 1   1       755 non-null    float64
 2   2       755 non-null    float64
 3   3       755 non-null    float64
 4   4       755 non-null    float64
 5   5       755 non-null    float64
 6   0.1     755 non-null    float64
dtypes: float64(6), int64(1)
memory usage: 47.2 KB


Unnamed: 0,0,1,2,3,4,5,0.1
count,755.0,755.0,755.0,755.0,755.0,755.0,755.0
mean,0.001325,0.008821,0.183626,0.244886,0.374853,0.175417,0.025166
std,0.036394,0.042565,0.067215,0.07612,0.08788,0.049478,0.156732
min,0.0,0.0,0.004744,0.006542,0.056338,0.005574,0.0
25%,0.0,0.001132,0.147059,0.200935,0.323944,0.14918,0.0
50%,0.0,0.003208,0.190702,0.242991,0.375587,0.17377,0.0
75%,0.0,0.004717,0.203985,0.281542,0.399061,0.196721,0.0
max,1.0,1.0,0.66888,0.60514,1.0,0.508197,1.0


## 2. Quantify the number of outliers

In [10]:
# Count the number of outliers
outlier_count = dfThyroid[dfThyroid.iloc[:, 0] == 1].shape[0]

print(f"Number of outliers in entire dataset: {outlier_count}")

Number of outliers in entire dataset: 2


In [11]:
# Count the number of outliers in Train and Test
train_outlier_count = train_dfThyroid[train_dfThyroid.iloc[:, 0] == 1].shape[0]
print(f"Number of outliers in training dataset: {train_outlier_count}")

test_outlier_count = test_dfThyroid[test_dfThyroid.iloc[:, 0] == 1].shape[0]
print(f"Number of outliers in test dataset: {test_outlier_count}")

Number of outliers in training dataset: 1
Number of outliers in test dataset: 1


## 3. Separate out samples corresponding to the inliers

In [12]:
# Separate training inliers and outliers records -
train_inliers = train_dfThyroid[train_dfThyroid.iloc[:, 0] == 0]
train_outliers = train_dfThyroid[train_dfThyroid.iloc[:, 0] == 1]

outliers_fraction = train_outliers.shape[0] / (train_inliers.shape[0] + train_outliers.shape[0])

# Display the shape of inliers and outliers DataFrames
print(f"Number of training inliers: {train_inliers.shape[0]}")
print(f"Number of training outliers: {train_outliers.shape[0]}")
print(f"Outlier fraction is: {outliers_fraction}")

Number of training inliers: 3016
Number of training outliers: 1
Outlier fraction is: 0.00033145508783559825


In [13]:
# Separate testing inliers and outliers
test_inliers = test_dfThyroid[test_dfThyroid.iloc[:, 0] == 0]
test_outliers = test_dfThyroid[test_dfThyroid.iloc[:, 0] == 1]

# Display the shape of inliers and outliers DataFrames
print(f"Number of inliers: {test_inliers.shape[0]}")
print(f"Number of outliers: {test_outliers.shape[0]}")

Number of inliers: 754
Number of outliers: 1


## 4. Instantiate four LOF Models

In [14]:
# Instantiate the LocalOutlierFactor objects
lof_3 = LocalOutlierFactor(n_neighbors=3, novelty=True)
lof_10 = LocalOutlierFactor(n_neighbors=10, novelty=True)
lof_20 = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_50 = LocalOutlierFactor(n_neighbors=50, novelty=True)

# Display the instantiated objects
print("LocalOutlierFactor with n_neighbors=3:", lof_3)
print("LocalOutlierFactor with n_neighbors=10:", lof_10)
print("LocalOutlierFactor with n_neighbors=20:", lof_20)
print("LocalOutlierFactor with n_neighbors=50:", lof_50)

LocalOutlierFactor with n_neighbors=3: LocalOutlierFactor(n_neighbors=3, novelty=True)
LocalOutlierFactor with n_neighbors=10: LocalOutlierFactor(n_neighbors=10, novelty=True)
LocalOutlierFactor with n_neighbors=20: LocalOutlierFactor(novelty=True)
LocalOutlierFactor with n_neighbors=50: LocalOutlierFactor(n_neighbors=50, novelty=True)


In [15]:
# Set up df with metrics

# Initialize the DataFrame with the required columns
metrics_df = pd.DataFrame(columns=[
    'Model', 
    'Precision', 
    'Recall', 
    'F1 Score', 
    'Average Precision', 
    'TN', 
    'TP', 
    'FN', 
    'FP'
])

# Display the initialized DataFrame
print(metrics_df)

Empty DataFrame
Columns: [Model, Precision, Recall, F1 Score, Average Precision, TN, TP, FN, FP]
Index: []


## 5. Fit each of the four models

In [16]:
# Drop the target (first) column when assigning X_inliers (training features)
X_inliers = train_inliers.iloc[:, 1:]

# Fit the LOF models to the inlier data
lof_3.fit(X_inliers)
lof_10.fit(X_inliers)
lof_20.fit(X_inliers)
lof_50.fit(X_inliers)

print("LOF models have been fitted successfully.")

LOF models have been fitted successfully.


## 6. Evaluate on the Test Features 

In [17]:
# Separate the test features and labels
X_test = test_dfThyroid.iloc[:, 1:]
y_test = test_dfThyroid.iloc[:, 0]

In [18]:
# List of models
models = [
    ('LOF 3 Neighbours', lof_3),
    ('LOF 10 Neighbours', lof_10),
    ('LOF 20 Neighbours', lof_20),
    ('LOF 50 Neighbours', lof_50)
]


In [19]:

# Evaluate each model
# Dictionary to hold the raw outputs of each model
results = {}

for model_name, model in models:
    # Store the predicted labels (+1 = inlier, -1 = outlier), the decision
    # function values and the raw sample scores for the test features
    results[model_name] = {
        'predictions': model.predict(X_test),
        'decision_scores': model.decision_function(X_test),
        'sample_scores': model.score_samples(X_test)
    }

# Display the results and compute the metrics for each model
for model_name, result in results.items():
    print(f"\nResults for {model_name}:")
    print("Predictions:", result['predictions'])
    print("Decision Function Scores:", result['decision_scores'])
    print("Sample Scores:", result['sample_scores'])

    # Map the LOF labels to the dataset's convention (1 = outlier, 0 = inlier)
    y_pred = (result['predictions'] == -1).astype(int)
    # Negate the decision scores so that higher values mean "more outlying"
    outlier_scores = -result['decision_scores']

    # Assumed reconstruction of the missing new_row: macro-averaged metrics,
    # as described in the note on evaluation metrics, plus the confusion matrix counts
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    new_row = pd.DataFrame([{
        'Model': model_name,
        'Precision': precision_score(y_test, y_pred, average='macro', zero_division=0),
        'Recall': recall_score(y_test, y_pred, average='macro', zero_division=0),
        'F1 Score': f1_score(y_test, y_pred, average='macro', zero_division=0),
        'Average Precision': average_precision_score(y_test, outlier_scores),
        'TN': tn, 'TP': tp, 'FN': fn, 'FP': fp
    }])
    # Append the new row to the metrics DataFrame
    metrics_df = pd.concat([metrics_df, new_row], ignore_index=True)

# Display the metrics DataFrame
print(metrics_df)

# Copy to a separate dataframe
dfMetrics_LOF = metrics_df.copy()
print(dfMetrics_LOF)




Results for LOF 3 Neighbours:
Predictions: [ 1  1 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 -1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1
 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1 -1  1  1  1 -1  1  1  1 -1  1  1  1  1  1
  1 -1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1 -1  1  1  1  1  1 -1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1 

## 7. Inspect Metrics and Compare Performance 

In [None]:
# Plot Precision, Recall, F1 Score and Average Precision
metrics_df.set_index('Model')[['Precision', 'Recall', 'F1 Score', 'Average Precision']].plot(kind='bar', figsize=(10, 6))
plt.title('Comparison of Classification Metrics')
plt.ylabel('Score')
plt.xticks(rotation=0)
plt.legend(loc='best')
plt.show()

# Plot confusion matrix components (TN, TP, FN, FP)
metrics_df.set_index('Model')[['TN', 'TP', 'FN', 'FP']].plot(kind='bar', figsize=(10, 6))
plt.title('Comparison of Confusion Matrix Components')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(loc='best')
plt.show()
