# DATA 3461 - Lab 6 (SA Dataset/6-8)
### Darlene Eligado 
### ID 1001889134

### This file only has exercises regarding the SA dataset (exercises 6-8). 

---

# Lab 6

Scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (n-classes)

* Regression Datasets: 
    * Boston house prices 
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image: 
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition
    * Forest covertypes

* NLP:
    * News group
    * Reuters Corpus Volume I 

* Other:
    * Kddcup 99- Intrusion Detection

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers. 
    * Separate the data into train, validation, and test. 
    * Use accuracy as the metric for assessing performance. 
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1? 

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use the feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using fewer features?

## Quick look at the data

In [2]:
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [20]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [16]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [17]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [21]:
len(np.unique(D["target"]))

23

In [13]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

-----

# **Load in data & Preprocess again**

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

In [2]:
# loading in data
# specify getting SA dataset & make it into a dataframe 
D = fetch_kddcup99(subset="SA", percent10=True, as_frame=True)
df = pd.DataFrame(D.data, columns=D.feature_names)

In [3]:
# eda
df.head()

| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_count | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | b'tcp' | b'http' | b'SF' | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 9 | 1.0 | 0.0 | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | b'tcp' | b'http' | b'SF' | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | b'tcp' | b'http' | b'SF' | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0 | b'tcp' | b'http' | b'SF' | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | b'tcp' | b'http' | b'SF' | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 49 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [4]:
# add target column 
df["target"]=D.target
df.head()

| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | b'tcp' | b'http' | b'SF' | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'normal.' |
| 1 | 0 | b'tcp' | b'http' | b'SF' | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'normal.' |
| 2 | 0 | b'tcp' | b'http' | b'SF' | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'normal.' |
| 3 | 0 | b'tcp' | b'http' | b'SF' | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'normal.' |
| 4 | 0 | b'tcp' | b'http' | b'SF' | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | b'normal.' |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100655 entries, 0 to 100654
Data columns (total 42 columns):
 #   Column                       Non-Null Count   Dtype 
---  ------                       --------------   ----- 
 0   duration                     100655 non-null  object
 1   protocol_type                100655 non-null  object
 2   service                      100655 non-null  object
 3   flag                         100655 non-null  object
 4   src_bytes                    100655 non-null  object
 5   dst_bytes                    100655 non-null  object
 6   land                         100655 non-null  object
 7   wrong_fragment               100655 non-null  object
 8   urgent                       100655 non-null  object
 9   hot                          100655 non-null  object
 10  num_failed_logins            100655 non-null  object
 11  logged_in                    100655 non-null  object
 12  num_compromised              100655 non-null  object
 13  root_shell    

In [6]:
# will be doing the same preprocessing as for the full KDD dataset in exercises 1-5

# list of continuous columns
continuous_columns = [
    'duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot', 
    'num_failed_logins', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 
    'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 
    'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 
    'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 
    'dst_host_rerror_rate', 'dst_host_srv_rerror_rate'
]

# convert continuous columns to numeric
for column in continuous_columns:
    df[column] = pd.to_numeric(df[column], errors='coerce')

In [7]:
# checking if turned into numeric
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100655 entries, 0 to 100654
Data columns (total 42 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   duration                     100655 non-null  int64  
 1   protocol_type                100655 non-null  object 
 2   service                      100655 non-null  object 
 3   flag                         100655 non-null  object 
 4   src_bytes                    100655 non-null  int64  
 5   dst_bytes                    100655 non-null  int64  
 6   land                         100655 non-null  object 
 7   wrong_fragment               100655 non-null  int64  
 8   urgent                       100655 non-null  int64  
 9   hot                          100655 non-null  int64  
 10  num_failed_logins            100655 non-null  int64  
 11  logged_in                    100655 non-null  object 
 12  num_compromised              100655 non-null  int64  
 13 

In [8]:
# dealing w/ target column now
# print the distribution of the target column
df['target'].value_counts()

target
b'normal.'         97278
b'smurf.'           2389
b'neptune.'          917
b'back.'              19
b'satan.'             17
b'ipsweep.'           11
b'warezclient.'        9
b'teardrop.'           7
b'portsweep.'          7
b'pod.'                1
Name: count, dtype: int64

In [9]:
# encode target column by unique values

df["target"]=df["target"].str.decode("utf-8")

# mapping based on the unique values
label_mapping = {
    'normal.': 0,
    'smurf.': 1,
    'neptune.': 2,
    'back.': 3,
    'satan.': 4,
    'ipsweep.': 5,
    'warezclient.': 6,
    'teardrop.': 7,
    'portsweep.': 8,
    'pod.': 9,

}


# apply mapping to target col
df["target"]=df["target"].replace(label_mapping)

In [10]:
# print the distribution of the target column again
df['target'].value_counts()

target
0    97278
1     2389
2      917
3       19
4       17
5       11
6        9
7        7
8        7
9        1
Name: count, dtype: int64

In [11]:
# now deal w/ other columns

# categorical
object_columns = df.select_dtypes(include='object').columns
print(object_columns)

# now display unique values for each object column
for column in object_columns:
    print(f"Value counts for '{column}':\n{df[column].value_counts()}\n")

Index(['protocol_type', 'service', 'flag', 'land', 'logged_in',
       'is_host_login', 'is_guest_login'],
      dtype='object')
Value counts for 'protocol_type':
protocol_type
b'tcp'     77780
b'udp'     19186
b'icmp'     3689
Name: count, dtype: int64

Value counts for 'service':
service
b'http'           61906
b'smtp'            9598
b'private'         8236
b'domain_u'        5862
b'other'           5649
b'ftp_data'        3808
b'ecr_i'           2735
b'urp_i'            537
b'finger'           471
b'eco_i'            400
b'ntp_u'            380
b'ftp'              374
b'telnet'           222
b'auth'             221
b'pop_3'             81
b'time'              53
b'IRC'               42
b'urh_i'             14
b'X11'                9
b'Z39_50'             4
b'netbios_ns'         4
b'domain'             4
b'sql_net'            3
b'iso_tsap'           2
b'gopher'             2
b'systat'             2
b'login'              2
b'efs'                2
b'echo'               2
b'nnsp'      

In [12]:
# i will now try to encode all of this >.<

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# protocol_type
df['protocol_type'] = label_encoder.fit_transform(df['protocol_type'])

# view the mapping of the original values to their encoded values
protocol_type_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print(protocol_type_mapping)

# service
df['service'] = label_encoder.fit_transform(df['service'])

# view the mapping of the original values to their encoded values
service_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("\n",service_mapping)

# flag
df['flag'] = label_encoder.fit_transform(df['flag'])

# view the mapping of the original values to their encoded values
flag_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("\n",flag_mapping)

{b'icmp': 0, b'tcp': 1, b'udp': 2}

 {b'IRC': 0, b'X11': 1, b'Z39_50': 2, b'auth': 3, b'bgp': 4, b'csnet_ns': 5, b'ctf': 6, b'daytime': 7, b'domain': 8, b'domain_u': 9, b'echo': 10, b'eco_i': 11, b'ecr_i': 12, b'efs': 13, b'finger': 14, b'ftp': 15, b'ftp_data': 16, b'gopher': 17, b'http': 18, b'http_443': 19, b'imap4': 20, b'iso_tsap': 21, b'klogin': 22, b'login': 23, b'mtp': 24, b'netbios_dgm': 25, b'netbios_ns': 26, b'netbios_ssn': 27, b'nnsp': 28, b'nntp': 29, b'ntp_u': 30, b'other': 31, b'pop_3': 32, b'printer': 33, b'private': 34, b'red_i': 35, b'rje': 36, b'shell': 37, b'smtp': 38, b'sql_net': 39, b'ssh': 40, b'sunrpc': 41, b'systat': 42, b'telnet': 43, b'tftp_u': 44, b'tim_i': 45, b'time': 46, b'urh_i': 47, b'urp_i': 48, b'uucp': 49, b'uucp_path': 50}

 {b'OTH': 0, b'REJ': 1, b'RSTO': 2, b'RSTR': 3, b'S0': 4, b'S1': 5, b'S2': 6, b'S3': 7, b'SF': 8}


In [13]:
# now encoding the binary columns to integers

binary_columns = ['land', 'logged_in', 'is_host_login', 'is_guest_login']
df[binary_columns] = df[binary_columns].astype(int)

for column in binary_columns:
    print(f"'{column}': {df[column].unique()}")

'land': [0 1]
'logged_in': [1 0]
'is_host_login': [0]
'is_guest_login': [0 1]


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100655 entries, 0 to 100654
Data columns (total 42 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   duration                     100655 non-null  int64  
 1   protocol_type                100655 non-null  int64  
 2   service                      100655 non-null  int64  
 3   flag                         100655 non-null  int64  
 4   src_bytes                    100655 non-null  int64  
 5   dst_bytes                    100655 non-null  int64  
 6   land                         100655 non-null  int64  
 7   wrong_fragment               100655 non-null  int64  
 8   urgent                       100655 non-null  int64  
 9   hot                          100655 non-null  int64  
 10  num_failed_logins            100655 non-null  int64  
 11  logged_in                    100655 non-null  int64  
 12  num_compromised              100655 non-null  int64  
 13 

---

# Exercise 6

Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

Anomaly detection algorithms used:
- Isolation Forest
- One-Class SVM
- Elliptic Envelope
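
As a rough illustration of how these three detectors are called, here is a minimal sketch on synthetic data (not the SA dataset). The names `X_demo`, `y_demo`, `detectors`, and `demo_preds` are hypothetical, and the synthetic clusters are an assumption made only to keep the example self-contained. All three estimators return `+1` for inliers and `-1` for outliers from `predict()`, which is why the output is remapped to `0`/`1`.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-in for the SA data: 950 inliers near the origin, 50 far-away outliers.
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(950, 2)),
    rng.normal(8.0, 0.5, size=(50, 2)),
])
y_demo = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]  # 1 = anomaly

detectors = {
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=42),
    "Elliptic Envelope": EllipticEnvelope(contamination=0.05, random_state=42),
    "One-Class SVM": OneClassSVM(nu=0.05, kernel="rbf", gamma="scale"),
}

demo_preds = {}
for name, detector in detectors.items():
    raw = detector.fit(X_demo).predict(X_demo)    # +1 = inlier, -1 = outlier
    demo_preds[name] = np.where(raw == -1, 1, 0)  # 1 = anomaly, 0 = normal
    print(name, "flagged", int(demo_preds[name].sum()), "of", len(X_demo))
```

`contamination` (or `nu` for the One-Class SVM) sets roughly what fraction of points gets flagged, so it is worth choosing it close to the expected anomaly rate of the data.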

In [15]:
X = df.drop(columns=['target'])
y = df['target']

# standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [16]:
# models
from sklearn.svm import OneClassSVM # will be trying SVM since LOF did not work 

isolation_forest = IsolationForest(contamination=0.05, random_state=42)
elliptic_envelope = EllipticEnvelope(contamination=0.05, random_state=42)
one_class_svm = OneClassSVM(nu=0.05, kernel="rbf", gamma="auto")
# local_outlier_factor = LocalOutlierFactor(n_neighbors=20, contamination=0.05, novelty=True) 

In [17]:
# isolation forest
isolation_forest.fit(X_scaled)
y_pred_if = isolation_forest.predict(X_scaled)
y_pred_if = np.where(y_pred_if == 1, 0, 1)  # convert to binary (0 = normal) & (1 = anomaly)

# elliptic envelope
elliptic_envelope.fit(X_scaled)
y_pred_ee = elliptic_envelope.predict(X_scaled)
y_pred_ee = np.where(y_pred_ee == 1, 0, 1)

# one-class svm
one_class_svm.fit(X_scaled)
y_pred_ocsvm = one_class_svm.predict(X_scaled)
y_pred_ocsvm = np.where(y_pred_ocsvm == 1, 0, 1)

'''
# lof
local_outlier_factor.fit(X_scaled)
y_pred_lof = local_outlier_factor.predict(X_scaled)
y_pred_lof = np.where(y_pred_lof == 1, 0, 1)
'''




In [19]:
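# evaluate models
# NOTE: y still holds the 10 encoded classes (0-9), while the predictions are binary (0 = normal, 1 = anomaly)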
# isolation forest
print("Isolation Forest:\n", classification_report(y, y_pred_if))

# elliptic envelope
print("Elliptic Envelope:\n", classification_report(y, y_pred_ee))

# one class svm
print("One-class SVM:\n", classification_report(y, y_pred_ocsvm))

'''
# lof
print("Local Outlier Factor:\n", classification_report(y, y_pred_lof))
'''



Isolation Forest:
               precision    recall  f1-score   support

           0       0.97      0.96      0.97     97278
           1       0.00      0.00      0.00      2389
           2       0.00      0.00      0.00       917
           3       0.00      0.00      0.00        19
           4       0.00      0.00      0.00        17
           5       0.00      0.00      0.00        11
           6       0.00      0.00      0.00         9
           7       0.00      0.00      0.00         7
           8       0.00      0.00      0.00         7
           9       0.00      0.00      0.00         1

    accuracy                           0.93    100655
   macro avg       0.10      0.10      0.10    100655
weighted avg       0.94      0.93      0.93    100655

Elliptic Envelope:
               precision    recall  f1-score   support

           0       0.97      0.95      0.96     97278
           1       0.00      0.00      0.00      2389
           2       0.00      0.00      


## Results:

**Best Model**: **One-class SVM** generally performed the best out of the 3 models.
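
The reports above score binary predictions (0 = normal, 1 = anomaly) against the ten encoded target classes, so the rows for classes 2 to 9 can never register a match. One possible way to get a like-for-like comparison is to collapse every attack type into a single positive class first. The sketch below uses small hypothetical arrays (`y_multiclass`, `y_pred_binary`) in place of the notebook's `y` and `y_pred_*` variables; it is an illustration of the idea, not a rerun of the experiment.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical stand-ins: encoded multiclass labels (0 = 'normal.') and binary detector output.
y_multiclass = np.array([0, 0, 0, 1, 2, 0, 3, 0, 0, 9])
y_pred_binary = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 1])  # 1 = flagged as anomaly

# Collapse every attack type into a single positive class (1 = anomaly).
y_binary = (y_multiclass != 0).astype(int)

print(classification_report(
    y_binary, y_pred_binary,
    target_names=["normal", "anomaly"], zero_division=0,
))
```

With the notebook's variables, the equivalent step would be something like `y_binary = (df["target"] != 0).astype(int)`, since `label_mapping` assigns 0 to `'normal.'`.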

---

# Exercise 7

Create a subsample of 250 datapoints, redo exercise 6, using Leave-one-out as the method of evaluation.
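
For reference, leave-one-out evaluation of an unsupervised detector can be sketched as follows: fit on all points but one, predict the held-out point, and repeat. This is a minimal example on synthetic data; `X_small`, `y_small`, `template`, and `loo_pred` are hypothetical names, and wrapping the detector in a `StandardScaler` pipeline is an assumption of this sketch (it keeps the scaling inside each fold), not something the cells below do.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic data: 55 inliers and 5 shifted outliers.
X_small = np.vstack([rng.normal(0, 1, size=(55, 3)), rng.normal(6, 1, size=(5, 3))])
y_small = np.r_[np.zeros(55, dtype=int), np.ones(5, dtype=int)]  # 1 = anomaly

template = make_pipeline(StandardScaler(),
                         IsolationForest(n_estimators=25, random_state=42))

loo_pred = np.empty(len(X_small), dtype=int)
for train_idx, test_idx in LeaveOneOut().split(X_small):
    model = clone(template)                 # fresh, unfitted copy for each fold
    model.fit(X_small[train_idx])           # unsupervised: labels are not used to fit
    raw = model.predict(X_small[test_idx])  # +1 = inlier, -1 = outlier
    loo_pred[test_idx] = np.where(raw == -1, 1, 0)

print("LOO accuracy:", (loo_pred == y_small).mean())
```

Because each fold holds out a single point, per-fold metrics are meaningless; the predictions are collected across all folds and scored once at the end.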

In [21]:
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import precision_score, recall_score, f1_score

In [24]:
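# making a subsample of 250 points
sampled_df = df.sample(n=250, random_state=42)
X_sampled = sampled_df.drop("target", axis=1)
# NOTE: with label_mapping, 0 = 'normal.' and 1 = 'smurf.', so this marks every
# non-smurf row (normal traffic included) as 1; use `!= 0` to mark anomalies as 1
y_sampled = (sampled_df["target"] != 1).astype(int)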

In [25]:
# models
isolation_forest = IsolationForest(random_state=42)
elliptic_envelope = EllipticEnvelope()
one_class_svm = OneClassSVM(nu=0.05, kernel="rbf", gamma="auto")

In [26]:
# leave-one-out cv
loo = LeaveOneOut()

def evaluate_model(model, X, y):
    y_true = []
    y_pred = []
    
    for train_index, test_index in loo.split(X):
        # split data
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model.fit(X_train)
        
        # all three detectors return +1 (inlier) / -1 (outlier) from predict()
        y_pred_single = model.predict(X_test)
        y_pred_single = np.where(y_pred_single == -1, 1, 0)  # convert -1 to 1 (anomaly) and 1 to 0 (normal)
        
        y_true.append(y_test.values[0])
        y_pred.append(y_pred_single[0])

    return y_true, y_pred

In [28]:
# isolation forest
y_true_iso, y_pred_iso = evaluate_model(isolation_forest, X_sampled, y_sampled)
print("Isolation Forest Classification Report:")
print(classification_report(y_true_iso, y_pred_iso))  

# elliptic envelope 
y_true_ell, y_pred_ell = evaluate_model(elliptic_envelope, X_sampled, y_sampled)
print("\nElliptic Envelope Classification Report:")
print(classification_report(y_true_ell, y_pred_ell))

# one-class svm
y_true_ocsvm, y_pred_ocsvm = evaluate_model(one_class_svm, X_sampled, y_sampled)
print("\nOne-Class SVM Classification Report:")
print(classification_report(y_true_ocsvm, y_pred_ocsvm))

Isolation Forest Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         6
           1       0.75      0.07      0.13       244

    accuracy                           0.07       250
   macro avg       0.38      0.04      0.07       250
weighted avg       0.73      0.07      0.13       250






Elliptic Envelope Classification Report:
              precision    recall  f1-score   support

           0       0.01      0.50      0.03         6
           1       0.90      0.11      0.19       244

    accuracy                           0.12       250
   macro avg       0.46      0.30      0.11       250
weighted avg       0.88      0.12      0.19       250


One-Class SVM Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         6
           1       0.98      0.98      0.98       244

    accuracy                           0.95       250
   macro avg       0.49      0.49      0.49       250
weighted avg       0.95      0.95      0.95       250



In [18]:
'''
# lof 
y_true_lof, y_pred_lof = evaluate_model(local_outlier_factor, X_sampled, y_sampled, use_predict=False)
print("\nLocal Outlier Factor Classification Report:")
print(classification_report(y_true_lof, y_pred_lof))  
'''


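## Results:
**One-Class SVM scored highest overall** (accuracy 0.95, versus 0.07 for Isolation Forest and 0.12 for Elliptic Envelope), although it did not identify any of the 6 class-0 instances. Note that `y_sampled` is defined as `target != 1`, so class 1 in these reports covers every non-smurf row (mostly normal traffic) rather than the anomalies, and the scores should be read with that in mind.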

# Exercise 8

Use the feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using less features?
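
Permutation importance is one way to rank features for an unsupervised detector, provided the scorer understands the detector's output. The built-in `"accuracy"` scorer compares `predict()` output directly with `y`, and anomaly detectors return `+1`/`-1` rather than `0`/`1`, so a small custom scorer can do the remapping. The sketch below runs on synthetic data; `anomaly_accuracy`, `X_feat`, `y_feat`, and `top_k` are hypothetical names introduced for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score


def anomaly_accuracy(estimator, X, y):
    """Scorer: map predict() output (+1/-1) to 0/1 before comparing with y (1 = anomaly)."""
    pred = np.where(estimator.predict(X) == -1, 1, 0)
    return accuracy_score(y, pred)


rng = np.random.default_rng(2)
n_in, n_out = 190, 10
X_feat = rng.normal(0, 1, size=(n_in + n_out, 6))
X_feat[n_in:, 0] += 8.0  # only feature 0 separates the synthetic anomalies
y_feat = np.r_[np.zeros(n_in, dtype=int), np.ones(n_out, dtype=int)]

forest = IsolationForest(contamination=0.05, random_state=42).fit(X_feat)
result = permutation_importance(forest, X_feat, y_feat, scoring=anomaly_accuracy,
                                n_repeats=5, random_state=42)

top_k = 3
top_idx = np.argsort(result.importances_mean)[::-1][:top_k]  # most important first
print("Top feature indices:", top_idx)
```

`permutation_importance` accepts any callable with the signature `scorer(estimator, X, y)`, so the same scorer can be reused for `EllipticEnvelope` and `OneClassSVM`.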

In [41]:
from sklearn.inspection import permutation_importance
from sklearn.metrics import classification_report


# select top 5 features & evaluate model w/ permutation importance
def select_top_features_permutation(model, X, y):
    # Fit the model
    model.fit(X)
    y_pred = model.predict(X)
    y_pred = np.where(y_pred == -1, 1, 0)  # convert -1 to 1 (anomaly), 1 to 0 (normal)

    # calc permutation importance
    # NOTE: scoring="accuracy" compares the raw predict() output (+1/-1) with the 0/1 labels in y
    perm_importance = permutation_importance(model, X, y, n_repeats=10, random_state=42, scoring="accuracy")
    sorted_idx = perm_importance.importances_mean.argsort()[-5:]  # indices of top 5 features

    top_features = X.columns[sorted_idx]
    print("Top 5 features:", top_features)

    # evaluate model w/ selected features
    X_top_features = X.iloc[:, sorted_idx]
    model.fit(X_top_features)
    y_pred_top = model.predict(X_top_features)
    y_pred_top = np.where(y_pred_top == -1, 1, 0)

    print(classification_report(y, y_pred_top))
    return top_features


X_scaled = pd.DataFrame(scaler.fit_transform(X_sampled), columns=X_sampled.columns)


print("Isolation Forest:")
top_features_iso = select_top_features_permutation(IsolationForest(random_state=42, contamination=0.05), X_scaled, y_sampled)


print("\nElliptic Envelope:")
top_features_ell = select_top_features_permutation(EllipticEnvelope(contamination=0.05), X_scaled, y_sampled)


print("\nOne-Class SVM:")
# NOTE: gamma is ignored by the linear kernel; exercises 6 and 7 used kernel="rbf" with nu=0.05
top_features_ocsvm = select_top_features_permutation(OneClassSVM(kernel="linear", gamma="auto"), X_scaled, y_sampled)

Isolation Forest:
Top 5 features: Index(['dst_host_count', 'dst_host_same_srv_rate', 'count', 'dst_bytes',
       'srv_count'],
      dtype='object')
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         6
           1       0.54      0.03      0.05       244

    accuracy                           0.03       250
   macro avg       0.27      0.01      0.03       250
weighted avg       0.53      0.03      0.05       250


Elliptic Envelope:




Top 5 features: Index(['hot', 'dst_host_srv_serror_rate', 'dst_host_srv_count',
       'dst_host_srv_rerror_rate', 'dst_host_rerror_rate'],
      dtype='object')
              precision    recall  f1-score   support

           0       0.03      1.00      0.05         6
           1       1.00      0.05      0.10       244

    accuracy                           0.08       250
   macro avg       0.51      0.53      0.08       250
weighted avg       0.98      0.08      0.10       250


One-Class SVM:




Top 5 features: Index(['rerror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate',
       'logged_in', 'protocol_type'],
      dtype='object')
              precision    recall  f1-score   support

           0       0.02      1.00      0.05         6
           1       1.00      0.01      0.02       244

    accuracy                           0.04       250
   macro avg       0.51      0.51      0.04       250
weighted avg       0.98      0.04      0.02       250

