## Lab 4, Exercise 2

In [2]:
import numpy as np
import pandas as pd

## Load data 


In [3]:
# Load the data in the following two CSVs:
# data/exercise2/lab4_normal_data.csv
# data/exercise2/lab4_malicious_data.csv
# The first consists completely of normal data, while the second consists completely of malicious data
# Note: Both sets of data contain the same features used in Exercise 1; the data has already been preprocessed
# (i.e., you can keep all the features and there are no labels in the CSVs)

# CODE HERE
df_normal = pd.read_csv('data/exercise2/lab4_normal_data.csv')
df_malicious = pd.read_csv('data/exercise2/lab4_malicious_data.csv')

In [15]:
# Create 15 datasets, where the ith dataset consists of:
# - all normal data
# - only the ith malicious datapoint

# CODE HERE
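datasets = []
for i in range(len(df_malicious)):
    # iloc[[i]] (double brackets) keeps row i as a one-row DataFrame,
    # so concat appends it as the last row, after all the normal data
    datasets.append(pd.concat([df_normal, df_malicious.iloc[[i]]]))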

56001


## Anomaly detection

In [19]:
# For each dataset, run isolation forests
#
# Use the following evaluation metric:
# - rank the anomalousness of each datapoint using the isolation forest
# - record the list index of each attack datapoint when sorting from most to least unusual
#     - e.g., if the attack datapoint is at index 0 in the list, we want to record the value 0
#
# Note: don't worry about ties in ranking
# Hint: What is the difference between isolation forest's 'decision_function' and 'predict' methods? 

# CODE HERE
from sklearn.ensemble import IsolationForest

anomalousness = []

for i, df in enumerate(datasets):
    clf = IsolationForest()
    clf.fit(df)
    # decision_function returns a continuous score (lower = more anomalous);
    # predict only returns the thresholded label (-1 outlier, +1 inlier)
    scores = clf.decision_function(df)
    # the attack datapoint is the last row, at position len(df_normal)
    anomalousness.append((scores[len(df_normal)], i))

# sort by score, from most anomalous (lowest) to least anomalous (highest)
anomalousness.sort(key=lambda x: x[0])
for entry in anomalousness:
    print(entry)

(-0.12306769278176743, 4)
(-0.11853696621166065, 11)
(-0.10831943336082428, 14)
(-0.0959996446517859, 10)
(-0.08994705272120196, 0)
(-0.08873643044853996, 12)
(-0.07216347589197358, 2)
(-0.0686318205034333, 8)
(-0.05795377909685617, 13)
(-0.02323794487947306, 3)
(0.08432400194851936, 6)
(0.09011002809410179, 9)
(0.09456909324692842, 1)
(0.10300653794818693, 7)
(0.11566603824571631, 5)
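
The cell above records each attack datapoint's raw `decision_function` value, while the metric described in the cell comments is the attack point's list index when all points are sorted from most to least unusual. A minimal sketch of that conversion follows, assuming (as in the cell above) that lower scores mean more anomalous. `attack_rank` is a hypothetical helper name, not part of scikit-learn, and the toy scores are made up for illustration.

```python
def attack_rank(scores, attack_pos):
    """Return the attack point's index when scores are sorted ascending.

    Lower decision_function values are more anomalous, so the index equals
    the number of points scored strictly lower than the attack point.
    Ties are ignored, as the exercise allows.
    """
    attack_score = scores[attack_pos]
    return sum(1 for s in scores if s < attack_score)


# Toy scores for illustration only (not taken from the lab data).
toy_scores = [0.10, 0.05, -0.02, 0.08, -0.15]
print(attack_rank(toy_scores, attack_pos=4))  # the last point has the lowest score
print(attack_rank(toy_scores, attack_pos=3))  # three points score lower than 0.08
```

Inside the loop above, `attack_rank(scores, len(df_normal))` would give the list index for dataset `i`. On a NumPy array, `int((scores < scores[pos]).sum())` computes the same value faster.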


## Questions:
1) Why is there no separate training and test set?

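The exercise evaluates the model's ability to detect anomalies rather than training it to recognize specific attacks. Isolation forest is unsupervised: it is fit without labels, so there is no labeled target to learn and nothing to hold out. The evaluation asks how highly the model ranks the single known attack datapoint among the data it was fit on, which mirrors how an anomaly detector is used in practice (score the data at hand and inspect the most unusual points).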

2) What is the metric measuring?  What would be a perfect score?  Bonus: What is the expected performance of an outlier detector that assigns a random score to each datapoint?

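The metric measures where the attack datapoint lands when all datapoints are sorted from most to least anomalous, i.e., how many datapoints the model considers more unusual than the attack. A perfect score is 0: the attack datapoint is ranked as the most anomalous point in the dataset.
A detector that assigns a random score to each datapoint would place the attack point at a uniformly random position, so its expected index is (N - 1) / 2 for a dataset of N points. Assuming the 56001 printed above is the row count of each dataset, that is an expected index of 28000.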
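
For the bonus part, the expected list index of a random detector can be checked numerically. The sketch below assumes each dataset has 56001 rows (the value printed after the dataset-creation cell) and that a random detector places the attack point at a uniformly random position. `expected_random_rank` and `simulate_random_rank` are hypothetical helper names.

```python
import random


def expected_random_rank(n_points):
    """Expected 0-based index of one point placed uniformly at random among n_points."""
    return (n_points - 1) / 2


def simulate_random_rank(n_points, n_trials, seed=0):
    """Average attack index over n_trials when every point gets a random score."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        # With i.i.d. random scores, the attack point's index is uniform on 0..n_points-1.
        total += rng.randrange(n_points)
    return total / n_trials


n = 56001  # assumed rows per dataset
print(expected_random_rank(n))
print(simulate_random_rank(n, n_trials=10_000))
```

The simulated average should land close to the closed-form value, with some run-to-run spread.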

3) How well does the isolation forest perform compared to a perfect score? Bonus: How well does the isolation forest perform compared to a random detector?

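Judged by the scores printed above, the isolation forest gives 10 of the 15 attack datapoints a negative decision_function value, which is the side of the threshold that predict labels as an outlier. The other 5 (datasets 6, 9, 1, 7, and 5) score between 0.08 and 0.12 and would be labeled normal. A perfect detector would put every attack datapoint at index 0, so the isolation forest falls short of perfect on at least those 5. A random detector places the attack point in the middle of the ranking on average, so flagging two thirds of the attack points as outliers suggests the isolation forest does better than chance. The exact margin depends on each attack point's list index, which the printed scores do not show.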

4) What are some issues that would prevent this model from being practically deployed?

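The main obstacle is false positives. Each dataset contains roughly 56000 normal datapoints and a single attack, so even a 1% false positive rate would produce about 560 benign alerts for every real one, and the system would have trouble running on benign data. The 5 attack datapoints with positive scores would also be missed entirely at the default threshold. In addition, a deployed system has no labels to confirm which flagged points are real attacks, and the model here is refit for every dataset without a fixed random_state, so results vary between runs.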

5) What might happen if we inject five attack datapoints at a time?  What might happen if we inject 100 attack datapoints at a time?

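Isolation forests rely on anomalies being few and different. If we inject five attack datapoints at a time, detection probably changes little when the attack points differ from each other. If they are similar, they start to form a small cluster that takes more splits to isolate, so each one looks slightly less anomalous (masking). If we inject 100 attack datapoints at a time, the effect is stronger: a dense group of 100 similar points can look like a small normal cluster, so the scores move toward the normal range and the attack points likely become harder to detect, not easier.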
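
One way to explore this question is to inject attack clusters of different sizes into synthetic data and compare the mean score of the injected points. The sketch below is an illustration under stated assumptions, not a result from the lab data: the normal data is standard Gaussian, the attack points are a tight hypothetical cluster far from it, and `mean_attack_score` is a made-up helper name.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def mean_attack_score(n_attacks, seed=0):
    """Fit an isolation forest on synthetic data and return the mean
    decision_function value of the injected attack points."""
    rng = np.random.default_rng(seed)
    normal = rng.normal(loc=0.0, scale=1.0, size=(2000, 5))
    # Hypothetical attack cluster: similar points far from the normal data.
    attacks = rng.normal(loc=6.0, scale=0.1, size=(n_attacks, 5))
    X = np.vstack([normal, attacks])
    clf = IsolationForest(random_state=seed).fit(X)
    scores = clf.decision_function(X)
    # Attack rows were stacked last, so they start at position len(normal).
    return float(scores[len(normal):].mean())


results = {k: mean_attack_score(k) for k in (1, 5, 100)}
for k, score in results.items():
    print(k, round(score, 3))
```

On data like this, the mean attack score would be expected to move toward the normal range as the cluster grows, though the exact numbers depend on the data and the seed.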

6) What is the effect of the parameters max_features and max_samples?  What other parameters could you adjust to change performance?

max_features: the number (or fraction) of features drawn to build each tree. Fewer features means faster training, but possibly lower accuracy, because a tree cannot detect anomalies that only show up in features it did not draw.
max_samples: the number (or fraction) of samples drawn to build each tree. Fewer samples means faster training and shallower trees; small subsamples can also make outliers easier to isolate because fewer similar points surround them.

Other parameters:
contamination: the expected proportion of outliers in the dataset. It sets the threshold used by predict and the offset of decision_function, so it shifts the scores without changing the ranking.
n_estimators: the number of trees in the ensemble. More estimators means slower training, but more stable and possibly more accurate scores.
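
The parameters above, and the difference between `decision_function` and `predict` raised in the earlier hint, can be seen in a small sketch. The data here is synthetic and the parameter values are arbitrary choices for illustration, not recommended settings for the lab data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # synthetic stand-in for the lab data

clf = IsolationForest(
    n_estimators=50,       # number of trees in the ensemble
    max_samples=128,       # rows drawn to build each tree
    max_features=2,        # columns drawn to build each tree
    contamination="auto",  # default threshold; offset_ is fixed at -0.5
    random_state=0,        # makes the run reproducible
).fit(X)

raw = clf.score_samples(X)          # lower means more anomalous
shifted = clf.decision_function(X)  # score_samples minus offset_
labels = clf.predict(X)             # -1 for outliers, +1 for inliers

print(clf.offset_)
print(np.allclose(shifted, raw - clf.offset_))
print(np.array_equal(labels == -1, shifted < 0))
```

Because `predict` only thresholds `decision_function` at zero, ranking the datapoints requires the continuous scores rather than the labels.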

Bonus: What are some alternative anomaly detection models one could use instead of an isolation forest? Try one of these alternatives and compare performance.

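One alternative is a one-class SVM (sklearn.svm.OneClassSVM), which is tried below. Other options in scikit-learn include the local outlier factor (LocalOutlierFactor) and the elliptic envelope (EllipticEnvelope).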

In [20]:
# one-class SVM
from sklearn.svm import OneClassSVM

anomalousness = []

for i, df in enumerate(datasets):
    clf = OneClassSVM()
    clf.fit(df)
    # signed distance to the separating hyperplane:
    # negative = outlier side, positive = inlier side
    scores = clf.decision_function(df)
    # the attack datapoint is the last row, at position len(df_normal)
    anomalousness.append((scores[len(df_normal)], i))

# sort by score, from most anomalous (lowest) to least anomalous (highest)
anomalousness.sort(key=lambda x: x[0])
for entry in anomalousness:
    print(entry)

(-3577.25244812906, 2)
(-2347.6318164347886, 10)
(-1805.0377441662731, 8)
(-1525.0443273691872, 4)
(-755.0639751976723, 0)
(-662.9233153680761, 11)
(-662.9233153680761, 12)
(5.282121258025654, 9)
(6.907916265347012, 6)
(7.988416205967951, 1)
(101.50959173818774, 3)
(427.9006711644415, 14)
(1756.2576656214605, 5)
(2127.2760778520624, 7)
(2188.0793434177667, 13)
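
In the output above, 7 of the 15 attack datapoints receive a negative score from the one-class SVM, against 10 of 15 for the isolation forest. The raw score magnitudes are not comparable between the two models, so the list index of the attack point is the fairer basis for comparison. A minimal sketch of such a side-by-side comparison follows, using synthetic data and default model settings as assumptions; the attack row is placed last, as in the lab datasets.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(size=(300, 3))     # synthetic normal data
attack = np.array([[6.0, 6.0, 6.0]])   # one hypothetical far-away attack point
X = np.vstack([normal, attack])        # attack row is last

models = {
    "isolation_forest": IsolationForest(random_state=0),
    "one_class_svm": OneClassSVM(),
}

ranks = {}
for name, model in models.items():
    scores = model.fit(X).decision_function(X)
    # Count the points scored as more anomalous (lower) than the attack row.
    ranks[name] = int((scores < scores[-1]).sum())

print(ranks)
```

Running the same loop over the 15 lab datasets would give one list index per model per dataset, which can then be compared directly.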
