# Lab 4, Exercise 4

In [1]:
import numpy as np
import pandas as pd

## Load data 


In [2]:
# Load the data in the following two CSVs:
# data/exercise4/lab4_normal_data.csv
# data/exercise4/lab4_malicious_data.csv
# The first consists completely of normal data, while the second consists completely of malicious data
# Note: Both sets of data contain the same features used in Exercise 1; the data has already been preprocessed
# (i.e., you can keep all the features and there are no labels in the CSVs)

# CODE HERE

norm_df = pd.read_csv('data/exercise4/lab4_normal_data.csv', delimiter=' *, *', engine='python')
mal_df = pd.read_csv('data/exercise4/lab4_malicious_data.csv', delimiter=' *, *', engine='python')

norm_df

Unnamed: 0,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,dload,sloss,...,ct_dst_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports
0,6,4,258,172,74.087490,252,254,14158.942380,8495.365234,0,...,1,1,1,1,0,0,0,1,1,0
1,14,38,734,42014,78.473372,62,252,8395.112305,503571.312500,2,...,1,1,1,2,0,0,0,1,6,0
2,8,16,364,13186,14.170161,62,252,1572.271851,60929.230470,1,...,2,1,1,3,0,0,0,2,6,0
3,12,12,628,770,13.677108,62,252,2740.178955,3358.622070,1,...,2,1,1,3,1,1,0,2,1,0
4,10,6,534,268,33.373826,254,252,8561.499023,3987.059814,2,...,2,2,1,40,0,0,0,2,39,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55995,1,0,46,0,0.000000,0,0,0.000000,0.000000,0,...,2,2,2,2,0,0,0,2,2,1
55996,2,0,92,0,0.016668,0,0,6.133765,0.000000,0,...,2,2,2,2,0,0,0,2,2,1
55997,2,0,92,0,0.016668,0,0,6.133765,0.000000,0,...,2,2,2,2,0,0,0,2,2,1
55998,2,0,92,0,0.016668,0,0,6.133765,0.000000,0,...,2,2,2,2,0,0,0,2,2,1


In [3]:
# Create 15 datasets, where the ith dataset consists of:
# - all normal data
# - only the ith malicious datapoint

# CODE HERE

datasets = []

for i in range(mal_df.shape[0]):
    dataset = pd.concat([norm_df, mal_df.iloc[i:i+1, :]], ignore_index=True)
    datasets.append(dataset)


## Anomaly detection

In [4]:
# For each dataset, run isolation forests
#
# Use the following evaluation metric:
# - rank the anomalousness of each datapoint using the isolation forest
# - record the list index of each attack datapoint when sorting from most to least unusual
#     - e.g., if the attack datapoint is at index 0 in the list, we want to record the value 0
#
# Note: don't worry about ties in ranking
# Hint: What is the difference between isolation forest's 'decision_function' and 'predict' methods? 

# CODE HERE

from sklearn.ensemble import IsolationForest

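for dataset in datasets:
    model = IsolationForest(random_state=0, contamination=0.1).fit(dataset)
    # decision_function returns a continuous score (lower = more anomalous);
    # predict only returns +1/-1 labels, which cannot be ranked
    anomaly_scores = model.decision_function(dataset)
    # Sort datapoint indices from most anomalous (lowest score) to least anomalous
    idx_score_list = sorted(enumerate(anomaly_scores), key=lambda x: x[1])
    ranks = [i[0] for i in idx_score_list]
    # The attack datapoint is the last row of each dataset; print its position in the ranking
    print(ranks.index(len(dataset) - 1))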


71
23379
753
470
5
29018
20343
28675
235
23613
71
7
7
453
80


## Questions:
1) Why is there no separate training and test set?

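In unsupervised machine learning, the data isn't labeled, so there is no labeled outcome to hold out and evaluate against. The isolation forest is fit on the whole dataset and then scores those same datapoints, because the goal is to find which of the points already in hand are anomalies (outliers), not to predict labels for unseen data.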

2) What is the metric measuring?  What would be a perfect score?  Bonus: What is the expected performance of an outlier detector that assigns a random score to each datapoint?

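The metric is the position (list index) of the attack datapoint when all datapoints are sorted from most to least anomalous by the isolation forest's anomaly score. For decision_function, the more negative the score, the more abnormal the datapoint (outlier); the more positive, the more normal (inlier). A perfect score is 0, meaning the attack datapoint is ranked as the single most anomalous point. A detector that assigns a random score to each datapoint would place the attack at a uniformly random position, so its expected index is (N - 1) / 2 for a dataset of N datapoints, i.e., the middle of the list.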
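
As a rough illustration of the metric, the sketch below ranks a toy list of scores the same way the notebook cell does and then estimates the expected rank under random scoring. The helper names `attack_rank` and `mean_random_rank` and all the toy numbers are assumptions made for illustration; they are not part of the lab.

```python
import random

def attack_rank(scores, attack_idx):
    """Return the position of the attack datapoint when datapoints are
    sorted from most anomalous (lowest score) to least anomalous."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order.index(attack_idx)

# Toy example: the attack datapoint is last and has the lowest score.
toy_scores = [0.12, 0.05, 0.20, -0.30]
print(attack_rank(toy_scores, attack_idx=3))  # 0, the perfect score

def mean_random_rank(n, trials, seed=0):
    """Average rank of the last datapoint when every datapoint gets an
    independent random score (a random detector)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        scores = [rng.random() for _ in range(n)]
        total += attack_rank(scores, attack_idx=n - 1)
    return total / trials

n = 101
print(mean_random_rank(n, trials=2000))  # should be close to (n - 1) / 2 = 50
```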

3) How well does the isolation forest perform compared to a perfect score? Bonus: How well does the isolation forest perform compared to a random detector?

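The isolation forest doesn't perform well compared to a perfect score. Of the 15 attack datapoints, only three rank in the top 10 most anomalous (indices 5, 7, and 7), and ten rank in the top 1,000. The other five land between roughly 20,000 and 29,000, which is near the middle of a list of roughly 56,000 datapoints and therefore about what a random detector would be expected to produce. So the forest clearly beats a random detector on about two thirds of the attacks and is no better than random on the rest.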
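
The comparison can be tallied directly from the printed ranks. In the sketch below the `ranks` list is copied from the cell output above, while `n_points` is an assumption inferred from the last row label (55998) in the normal-data preview, so the random baseline is approximate.

```python
# Ranks printed by the isolation forest cell, one per attack datapoint.
ranks = [71, 23379, 753, 470, 5, 29018, 20343, 28675,
         235, 23613, 71, 7, 7, 453, 80]

# Assumption: each dataset holds about 56,000 datapoints
# (normal rows plus one attack row).
n_points = 56_000
random_expected = (n_points - 1) / 2  # expected rank under random scoring

top_10 = sum(r < 10 for r in ranks)
top_1000 = sum(r < 1000 for r in ranks)
better_than_random = sum(r < random_expected for r in ranks)

print(f"in top 10: {top_10} of {len(ranks)}")
print(f"in top 1,000: {top_1000} of {len(ranks)}")
print(f"better than random's expected rank: {better_than_random} of {len(ranks)}")
```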

4) What are some issues that would prevent this model from being practically deployed?

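This model could have a high false positive rate, which is a common issue with unsupervised machine learning. With contamination=0.1, predict would flag about 10% of each dataset as anomalous even though only one datapoint is an attack, and even the best-ranked attacks still have normal datapoints ranked above them. It could be costly to analyze each anomaly if a high percentage of these anomalies are false positives.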

5) What might happen if we inject five attack datapoints at a time?  What might happen if we inject 100 attack datapoints at a time?

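Adding more attack datapoints at a time may give them more positive anomaly scores, i.e., make them look more "normal." If the attack datapoints resemble each other, each one has more similar neighbors, so it takes more random splits to isolate it and the anomaly becomes harder to detect. With five attack datapoints the effect may be small, but with 100 the attacks could form a dense group of their own that the forest treats as inliers.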
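
One way to check this would be to inject k attack rows at a time and look at their ranks. The sketch below does that on synthetic data only: the arrays `normal` and `attack_pool`, the helper `attack_ranks`, and the choice of `n_estimators=50` are all assumptions for illustration, and whether the attacks get harder to find as k grows will depend on the real data and parameters.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def attack_ranks(normal, attacks, seed=0):
    """Fit an isolation forest on normal + attack rows and return the
    rank index of every attack row (0 = most anomalous)."""
    data = np.vstack([normal, attacks])
    model = IsolationForest(n_estimators=50, random_state=seed).fit(data)
    scores = model.decision_function(data)  # lower = more anomalous
    order = np.argsort(scores, kind="stable")
    position = np.empty(len(data), dtype=int)
    position[order] = np.arange(len(data))  # position[i] = rank of row i
    return position[len(normal):].tolist()

rng = np.random.default_rng(0)
# Synthetic stand-ins: normal rows near the origin, and a tight cluster
# of near-identical attack rows some distance away.
normal = rng.normal(0.0, 1.0, size=(2000, 5))
attack_pool = rng.normal(4.0, 0.05, size=(100, 5))

for k in (1, 5, 100):
    ranks = attack_ranks(normal, attack_pool[:k])
    print(k, "attack(s): worst rank among them =", max(ranks))
```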

6) What is the effect of the parameters max_features and max_samples?  What other parameters could you adjust to change performance?

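max_features sets how many features are drawn to train each tree. Setting it to a smaller number makes the model perform worse, as the attack datapoint ends up even further away from the perfect score.
max_samples sets how many datapoints are drawn to train each tree. Setting it to a larger number or to the number of datapoints seems to make the model perform better.
Other parameters include n_estimators and contamination. Note that contamination only shifts the threshold used by predict; it does not change the ranking produced by decision_function.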
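
A small parameter sweep makes this kind of comparison repeatable. The sketch below runs on synthetic data rather than the lab CSVs; the helper `rank_of_attack`, the grid values, and the made-up `normal` and `attack` arrays are assumptions, so the ranks it prints say nothing about the lab's results.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for one lab dataset: 2,000 normal rows plus one
# attack row appended last (all values here are made up).
normal = rng.normal(0.0, 1.0, size=(2000, 8))
attack = np.full((1, 8), 3.0)
data = np.vstack([normal, attack])
attack_idx = len(data) - 1

def rank_of_attack(**params):
    """Rank index of the attack row for the given forest parameters."""
    model = IsolationForest(random_state=0, **params).fit(data)
    scores = model.decision_function(data)  # lower = more anomalous
    order = np.argsort(scores, kind="stable")
    return int(np.where(order == attack_idx)[0][0])

results = {}
for max_samples in (64, 256, len(data)):
    for max_features in (1, 4, 8):
        key = (max_samples, max_features)
        results[key] = rank_of_attack(
            n_estimators=50, max_samples=max_samples, max_features=max_features
        )
        print(key, results[key])
```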

Optional: What are some alternative anomaly detection models one could use instead of an isolation forest? Bonus: Try one of these alternatives and compare performance.