---

# AERO 5 - Hands on Machine Learning for cybersecurity (2023/2024)


# 4 – Malware threat detection
---

In this lab session, we will discuss how machine learning is used for malware and outlier detection.

“An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.”

Supervised Anomaly Detection:

    a. Labels are available for both normal data and anomalies.
    b. Similar to rare class mining / imbalanced classification.
    
Unsupervised Anomaly Detection (Outlier Detection):
    
    c. No labels; the training set mixes normal and abnormal data. Assumption: anomalies are very rare.
    
Semi-supervised Anomaly Detection (Novelty Detection):
    
    d. Only normal data is available for training.
    e. The algorithm learns on normal data only.
    
The `scikit-learn` documentation is comprehensive and should be consulted whenever necessary. For this lab, the following pages are particularly relevant:

https://scikit-learn.org/stable/modules/tree.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

## 1. Classification tree

In this exercise, we will extract malware artifacts from the dataset `MalwareArtifacts.csv` and use the `AddressOfEntryPoint` and `DllCharacteristics` fields as potentially distinctive features for detecting suspect samples.

1. Start by importing the relevant libraries and the dataset.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

2. Extract the `AddressOfEntryPoint` field as the samples and the `DllCharacteristics` field as the targets.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

3. Using the `scikit-learn` package, split the samples and targets into `training_samples`, `testing_samples`, `training_targets` and `testing_targets`.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================
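As a rough illustration of the split requested in question 3, the sketch below uses `train_test_split` on synthetic arrays. The data, the 20% test fraction and the `random_state` value are assumptions made for the example; they are not taken from `MalwareArtifacts.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the lab data (hypothetical values, not the real CSV).
rng = np.random.default_rng(0)
samples = rng.integers(0, 10_000, size=(100, 1))  # one feature column
targets = rng.integers(0, 2, size=100)            # binary labels

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
training_samples, testing_samples, training_targets, testing_targets = train_test_split(
    samples, targets, test_size=0.2, random_state=0
)

print(training_samples.shape, testing_samples.shape)
```

Fixing `random_state` is only a convenience for reproducibility; any test fraction can be used as long as the test set stays unseen during training.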

4. Fit a classification tree on your training dataset.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

5. Use this model to make predictions on the testing samples, then compute its accuracy score. What do you conclude?

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================
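The fit, predict and score steps of questions 4 and 5 can be sketched as follows. This is a minimal example on synthetic two-feature data with an assumed labelling rule, so the accuracy it prints says nothing about the malware dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): the label depends on the sign of the feature sum.
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 2))
targets = (samples[:, 0] + samples[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    samples, targets, test_size=0.2, random_state=0
)

# Fit the classification tree on the training part only.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Predict on unseen samples and compare with the true labels.
predictions = tree.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy: {accuracy:.2f}")
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` (from `sklearn.ensemble`) leaves the rest of the code unchanged, since both follow the same `fit`/`predict` interface.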

6. Answer the previous two questions again using a random forest model.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

## 2. Classification tree, random forest and isolation forest

Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. 
The intrusion detector learning task is to build a predictive model (i.e., a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.

In this exercise, we consider the KDD Cup 1999 dataset (KDD stands for Knowledge Discovery and Data Mining), a standard set of data to be audited that includes a wide variety of intrusions simulated in a military network environment.

1. Start by importing the relevant packages for the exercise.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

2. The following code loads the KDD data:

In [None]:
from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99(subset=None, shuffle=True, percent10=True)

From this dataset, build $X$ as the data and $Y$ as the target.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

The KDD dataset has several features, whose names are listed in the following statement:

In [None]:
feature_cols = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']

3. Using `pandas.DataFrame`, structure the data so that it takes the previous `feature_cols` as columns, and put the target in the right form using `pandas.Series`.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================
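The structuring step of question 3 might look like the sketch below. It uses three made-up rows and a hypothetical three-column subset (`feature_subset`) instead of the full `feature_cols`, and it assumes the raw labels are byte strings that need decoding.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking a few KDD-style columns (hypothetical values).
feature_subset = ["duration", "protocol_type", "src_bytes"]
raw_data = np.array(
    [[0, b"tcp", 181], [0, b"udp", 105], [2, b"tcp", 0]], dtype=object
)
raw_target = np.array([b"normal.", b"normal.", b"smurf."], dtype=object)

# One named column per feature.
X = pd.DataFrame(raw_data, columns=feature_subset)

# Decode the byte-string labels so they can be compared with ordinary strings.
y = pd.Series([label.decode("utf-8") for label in raw_target], name="label")

print(X.shape)
print(y.tolist())
```

With the real dataset, the same two constructor calls would take the full data array, the complete `feature_cols` list and the full target array.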

4. Print the data and visualize it.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

5. For efficient processing, convert the numeric columns into floats. Then convert the categorical columns into dummy (indicator) variables.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================
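One possible way to carry out the conversions of question 5 is sketched below on a toy frame. Treating `protocol_type` as the only categorical column is an assumption made for this example; the real dataset has more.

```python
import pandas as pd

# Toy frame (hypothetical values): numeric columns stored as strings.
X = pd.DataFrame(
    {
        "duration": ["0", "0", "2"],
        "protocol_type": ["tcp", "udp", "tcp"],
        "src_bytes": ["181", "105", "0"],
    }
)

categorical = ["protocol_type"]  # assumed categorical column for this toy example
numeric = [col for col in X.columns if col not in categorical]

# Cast every numeric column to float in one call.
X = X.astype({col: float for col in numeric})

# Replace each categorical column by one indicator column per category.
X = pd.get_dummies(X, columns=categorical)

print(X.columns.tolist())
```

`pd.get_dummies` names the new columns `<column>_<category>`, which keeps the link with the original field visible.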

6. Compute the counts (for instance, the number of samples per target class) and visualize them.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

7. Fit a classification tree with `max_depth=7` on all data. Then create the corresponding Graphviz file to visualize the results of the decision tree with a graph.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================
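For question 7, a depth-limited tree and its Graphviz export can be sketched as follows. The data come from `make_classification`, and the feature and class names are placeholders chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic classification problem (assumption; not the KDD data).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# max_depth=7 caps the number of successive splits from root to leaf.
tree = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=[f"f{i}" for i in range(X.shape[1])],  # placeholder names
    class_names=["class_0", "class_1"],                  # placeholder names
    filled=True,
)

print(dot_source[:60])
```

Passing a file name as `out_file` writes the DOT source to disk instead; assuming the Graphviz command-line tools are installed, `dot -Tpng tree.dot -o tree.png` then renders it as an image.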

8. The following statement gives the feature importances:

In [None]:
# Assumes `pd` is pandas and `rf` is your fitted estimator (rename it to match your own variable).
pd.DataFrame({'feature':X.columns,'importance':rf.feature_importances_}).sort_values('importance', ascending=False).head(10)

Print the output. What do you think about it?

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

9. How about a random forest model now?

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

10. Evaluate the performance of both previous estimators using cross-validation from `scikit-learn` (see https://scikit-learn.org/stable/modules/cross_validation.html). Conclude.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================
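A minimal sketch of the cross-validated comparison in question 10 is given below. It assumes 5 folds, the default accuracy scoring and a synthetic dataset; the hyperparameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption; replace with the prepared KDD features and target).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

estimators = {
    "tree": DecisionTreeClassifier(max_depth=7, random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# cross_val_score refits a clone of each estimator on every fold
# and returns one score per fold.
scores = {name: cross_val_score(est, X, y, cv=5) for name, est in estimators.items()}

for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between the two models exceeds the fold-to-fold variation.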

The Isolation Forest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
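To make the random-split idea concrete, here is a simplified one-dimensional sketch in pure Python. It is not the scikit-learn implementation: it only counts how many random splits are needed to isolate one value, and the helper names (`isolation_depth`, `mean_depth`) and the planted outlier at 8.0 are assumptions made for the illustration.

```python
import random


def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed to isolate `target` among `values` (1-D)."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # remaining values are identical and cannot be separated
        split = rng.uniform(lo, hi)  # random split between the min and the max
        # Keep only the side of the split that still contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [8.0]  # 8.0 is a planted outlier
inlier = sorted(data)[len(data) // 2]                      # a typical central value


def mean_depth(target, trials=200):
    """Average isolation depth of `target` over several random trials."""
    return sum(isolation_depth(data, target, rng) for _ in range(trials)) / trials


outlier_depth = mean_depth(8.0)
inlier_depth = mean_depth(inlier)
print(f"outlier: {outlier_depth:.1f} splits, inlier: {inlier_depth:.1f} splits")
```

The outlier is separated after fewer splits on average than the central value, and this short path length is what an isolation forest turns into an anomaly score.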

In this exercise, we use binary labels in which 1 (`True`) represents a "not-normal" connection, i.e. an attack:

In [None]:
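# Note: fetch_kddcup99 returns byte-string labels; if y was not decoded to str, compare with b'normal.' instead.
y_binary = y != 'normal.'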

11. Divide the data into train and test sets.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

12. Fit an isolation forest model to the training data. Then make predictions on the test set.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

13. Evaluate the performance of the model using the `accuracy_score`. Conclude.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================
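Questions 11 to 13 can be sketched as below on synthetic data. One detail matters for the evaluation: `IsolationForest.predict` returns +1 for inliers and -1 for outliers, so the output has to be mapped to the binary convention before calling `accuracy_score`. The data, the 5% attack proportion and the `contamination` value are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 950 "normal" points and 50 far-away "attacks".
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(950, 2))
attacks = rng.uniform(5.0, 12.0, size=(50, 2))
X = np.vstack([normal, attacks])
y_binary = np.concatenate([np.zeros(950, dtype=bool), np.ones(50, dtype=bool)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.3, random_state=0, stratify=y_binary
)

# The model is unsupervised: the labels are not passed to fit().
iso = IsolationForest(contamination=0.05, random_state=0)
iso.fit(X_train)

raw_predictions = iso.predict(X_test)  # +1 = inlier, -1 = outlier
y_pred = raw_predictions == -1         # True means "attack", matching y_binary

accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
```

On a dataset with few attacks, accuracy alone can look high even when most attacks are missed, so inspecting a confusion matrix as well is a reasonable complement.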

## 3. K-means clustering

In this exercise, you will apply the unsupervised K-means clustering to the previous dataset of artifacts `MalwareArtifacts.csv`. The features that we select as distinctive criteria of the possible malware correspond to the `MajorLinkerVersion`, `MajorImageVersion`, `MajorOperatingSystemVersion`, and `DllCharacteristics` fields. 

1. Start by importing the relevant Python libraries and the dataset. Then extract the sample fields from the artifacts and define the targets.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

2. Instantiate the K-Means clustering, fixing the number of clusters to 2 and the maximum number of iterations that the algorithm can execute to 300.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================

3. Now evaluate the results obtained by the algorithm. You can, for instance, use the `silhouette_score`, which is commonly used to assess the quality of a clustering.\
More information is available at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================
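The clustering and evaluation steps of questions 2 and 3 can be sketched as follows. The four-feature synthetic blobs stand in for the selected fields of `MalwareArtifacts.csv`, and the `n_init` and `random_state` values are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two synthetic, well-separated groups with four features each (assumption).
rng = np.random.default_rng(0)
samples = np.vstack(
    [
        rng.normal(0.0, 0.5, size=(100, 4)),
        rng.normal(5.0, 0.5, size=(100, 4)),
    ]
)

# Two clusters, at most 300 iterations per run.
k_means = KMeans(n_clusters=2, max_iter=300, n_init=10, random_state=0)
cluster_labels = k_means.fit_predict(samples)

# The silhouette score lies in [-1, 1]; values near 1 mean compact,
# well-separated clusters.
score = silhouette_score(samples, cluster_labels)
print(f"silhouette score: {score:.3f}")
```

The silhouette score only measures the geometry of the clusters; it does not tell whether the clusters match the malware/legitimate labels, which requires a comparison with the targets.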

Conclude.

In [None]:
# EDIT THIS CELL

## 4. Method of your choice

With the ML method of your choice, reconsider the first proposed dataset, then fit a model to extract malware artifacts.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================


# =======================================================