# Cybersecurity detection

In this practical session, we apply Machine Learning (ML) to cybersecurity in Supervisory Control and Data Acquisition (SCADA) networks. The goal is to build a model that detects attacks in SCADA network traffic. Along the way, you will get hands-on experience with a workflow used to help protect critical infrastructure.

## Understanding SCADA Systems

SCADA systems are vital for the operation and monitoring of industrial processes. These systems are used to control and supervise equipment and processes in various sectors, including energy, water treatment, and transportation. For instance, in the energy sector, SCADA systems manage the distribution of electricity in power grids, ensuring that the supply meets demand. In water treatment plants, they monitor and control the flow and treatment of water, ensuring safety and efficiency.

SCADA networks are critical because they support the seamless operation of our infrastructure, making them prime targets for cyber-attacks. The consequences of a successful attack can range from operational disruptions to significant financial losses and potential threats to public safety.

## Dataset Overview

The dataset you will be working with was collected from a SCADA system testbed designed to emulate real-world industrial systems. This testbed was subjected to various cyber-attacks, in particular reconnaissance attacks, in which attackers scan the network to identify vulnerabilities that could be exploited in future attacks.

Here are the key columns in the dataset:

**Source Port (Sport)**: This represents the port number of the source in the network transaction, crucial for understanding the communication patterns and potential unauthorized access points.

**Total Packets (TotPkts)**: This column shows the total number of packets exchanged in a transaction, which can indicate the volume and intensity of network communication.

**Total Bytes (TotBytes)**: This column reflects the total amount of data transferred during the transaction, which is important for detecting large or unusual data transfers.

**Source Packets (SrcPkts) and Destination Packets (DstPkts)**: These columns provide counts of packets sent from the source to the destination and vice versa, useful for identifying asymmetric traffic flows often seen in cyber-attacks.

**Source Bytes (SrcBytes)**: This column indicates the amount of data sent from the source, helping to identify potential data exfiltration attempts.

**Label (Target)**: Each record is labeled as either part of a network attack or normal activity. This labeling is crucial for training our ML model to differentiate between benign and malicious network behaviors.


Source of the dataset: https://www.cse.wustl.edu/~jain/iiot/index.html

You can download the dataset here: https://drive.google.com/file/d/12JNZjMbYucd-TfcFoz9AfNhv6dg7WAzk/view?usp=sharing


1. Read the data with pandas (the dataset is about 200 MB)

Number of samples: 7_037_983

In [None]:
# FIXME
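One possible way to approach the loading step is sketched below. It is only an illustration: the CSV content is a tiny made-up stand-in, the file name `scada.csv` mentioned in the comment is hypothetical, and the column names and dtypes are assumptions based on the column overview above, so check them against the real header.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV (all values are made up).
# With the downloaded file, pass its path instead of the StringIO object,
# e.g. pd.read_csv("scada.csv", dtype=dtypes), where "scada.csv" is a
# hypothetical file name.
csv_text = """Sport,TotPkts,TotBytes,SrcPkts,DstPkts,SrcBytes,Target
502,10,1200,6,4,700,0
49152,2,120,2,0,120,1
502,8,960,4,4,480,0
"""

# Assumed dtypes: declaring compact types up front reduces memory usage.
# Sport is read as a string because a port is an identifier, not a quantity
# (a modelling choice, not a property of the dataset).
# If a column contains missing values, a plain integer dtype fails;
# a nullable dtype such as "Int32" can be used instead.
dtypes = {
    "Sport": "string",
    "TotPkts": "int32",
    "TotBytes": "int64",
    "SrcPkts": "int32",
    "DstPkts": "int32",
    "SrcBytes": "int64",
    "Target": "int8",
}

df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(df.shape)
df.info(memory_usage="deep")
```

On the full file, comparing `df.shape[0]` with the number of samples given above is a quick sanity check that the load is complete.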

2. Identify categorical and numerical features for your ML model

In [None]:
# FIXME
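A minimal sketch of one way to separate the two kinds of features, assuming the label column is named `Target` and that `Sport` was loaded as text. The toy frame below is made up. Note that a port number stored as an integer would be classified as numerical by this rule even though it behaves like a category, so the result should be reviewed by hand.

```python
import pandas as pd

# Made-up toy frame; Sport is stored as text on purpose.
df = pd.DataFrame(
    {
        "Sport": ["502", "49152", "502", "80"],
        "TotPkts": [10, 2, 8, 3],
        "TotBytes": [1200, 120, 960, 200],
        "Target": [0, 1, 0, 1],
    }
)

label_col = "Target"  # assumed name of the label column
features = df.drop(columns=[label_col])

# Columns with a numeric dtype are treated as numerical, the rest as categorical.
numerical_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```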

3. Compute some statistics on the dataset:

- Label proportion
- Proportion of NULL values
- Count the number of modalities for each categorical feature
- Check the label proportion for some specific modality of a categorical feature
- Choose a column and compute some statistics such as mean, std, ...
- Display the distribution of a continuous feature (bucketize the feature and count the samples per bucket)
- Display the label rate per bucket for all the numerical features. What can we understand from those plots?

In [None]:
# Label proportion
# FIXME
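As a hedged illustration, the label proportion can be read from normalized value counts. The series below is made up, and the encoding (1 for attack, 0 for normal) is an assumption.

```python
import pandas as pd

# Made-up labels; 1 is assumed to mean "attack" and 0 "normal".
y = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="Target")

# Share of each label value among all records.
label_share = y.value_counts(normalize=True)
print(label_share)
```

A strongly imbalanced result matters later: accuracy alone becomes a weak metric when one class dominates.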

In [None]:
# Proportion of NULL values
# FIXME
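One way to get the proportion of missing values per column is to average the boolean mask returned by `isna()`. The frame below is a made-up example.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values.
df = pd.DataFrame(
    {
        "TotPkts": [10, np.nan, 8, 3],
        "SrcBytes": [700, 120, np.nan, np.nan],
        "Target": [0, 1, 0, 1],
    }
)

# isna() gives True/False per cell; the column mean is the share of NULLs.
null_share = df.isna().mean().sort_values(ascending=False)
print(null_share)
```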

In [None]:
# Count the number of modalities for each categorical feature
# FIXME
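A small sketch of counting modalities, assuming `Sport` is the only categorical column kept. The values are made up.

```python
import pandas as pd

# Made-up categorical column.
df = pd.DataFrame({"Sport": ["502", "49152", "502", "80", "502"]})
categorical_cols = ["Sport"]  # assumed list of categorical features

# Number of distinct values (modalities) per categorical column.
n_modalities = df[categorical_cols].nunique()

# The most frequent modalities often tell more than the raw count.
top_ports = df["Sport"].value_counts().head(3)

print(n_modalities)
print(top_ports)
```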

In [None]:
# Check label proportion for some specific modality of a categorical feature
# FIXME
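The sketch below shows one way to compare the label rate across modalities, on made-up data. The port values and the 0/1 label encoding are assumptions.

```python
import pandas as pd

# Made-up records: source port and label (1 is assumed to mean "attack").
df = pd.DataFrame(
    {
        "Sport": ["502", "502", "502", "80", "80", "49152"],
        "Target": [0, 0, 1, 1, 1, 0],
    }
)

# Label rate and support for every modality.
rate_by_port = df.groupby("Sport")["Target"].agg(rate="mean", n="count")
print(rate_by_port)

# Label rate for one specific modality.
rate_502 = df.loc[df["Sport"] == "502", "Target"].mean()
print(rate_502)
```

Keeping the count next to the rate helps avoid over-reading a modality that appears only a handful of times.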

In [None]:
# Choose a column and compute some statistics such as mean, std, ...
# FIXME
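For a single column, a handful of summary statistics can be computed in one call. This is a sketch on made-up values; `TotBytes` is just one possible choice of column.

```python
import pandas as pd

# Made-up values for one numerical column.
tot_bytes = pd.Series([100, 200, 300, 400], name="TotBytes")

# Several statistics at once; std is the sample standard deviation (ddof=1).
stats = tot_bytes.agg(["mean", "std", "min", "median", "max"])
print(stats)
```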

In [None]:
# Display the distribution of a continuous feature (bucketization of the feature and count)
# FIXME
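One possible way to bucketize a continuous feature is `pd.cut` with explicit edges. The values and the edges below are made up; powers of ten are used on the assumption that traffic volumes are heavy-tailed, which should be checked on the real data.

```python
import pandas as pd

# Made-up byte counts.
tot_bytes = pd.Series([60, 120, 120, 500, 1500, 9000, 64000, 250], name="TotBytes")

# Assumed bucket edges; each bucket is right-closed, e.g. (100, 1000].
edges = [0, 100, 1_000, 10_000, 100_000]
buckets = pd.cut(tot_bytes, bins=edges)

# Number of samples per bucket, in bucket order.
counts = buckets.value_counts().sort_index()
print(counts)
```

With matplotlib installed, `counts.plot.bar()` turns the counts into a histogram-like bar chart.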

In [84]:
# Display the label rate per bucket for all the numerical features. What can we understand from those plots?
# FIXME
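The label rate per bucket can be sketched with a small helper applied to each numerical feature in turn. The helper name `label_rate_per_bucket`, the data, and the bucket edges are all hypothetical.

```python
import pandas as pd


def label_rate_per_bucket(frame, feature, label, edges):
    """Return the label rate and the sample count for each bucket of a feature."""
    buckets = pd.cut(frame[feature], bins=edges)
    # observed=False keeps empty buckets in the result (their rate is NaN).
    return frame.groupby(buckets, observed=False)[label].agg(rate="mean", n="count")


# Made-up data: packet counts and labels (1 is assumed to mean "attack").
df = pd.DataFrame(
    {
        "TotPkts": [1, 2, 2, 3, 10, 12, 40, 50],
        "Target": [1, 1, 1, 0, 0, 0, 0, 1],
    }
)

result = label_rate_per_bucket(df, "TotPkts", "Target", edges=[0, 2, 10, 100])
print(result)
```

A rate that changes sharply from one bucket to the next suggests the feature carries signal; a flat rate suggests it does not help much on its own.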

What can we understand from the data?

In [None]:
# FIXME

### Split the dataset

Split the dataset into training and test sets.

You can drop the categorical features for now if you don't know how to process categorical features with a huge number of categories.

After the split, check that each set contains enough samples of each label.

In [None]:
# FIXME
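A minimal sketch of a stratified split with scikit-learn, on synthetic data. The 80/20 ratio, the seed, and the 10% attack rate are arbitrary assumptions. Stratifying on the label is one way to make sure both sets keep a similar share of attacks.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 3 numerical features, 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (np.arange(1000) % 10 == 0).astype(int)

# stratify=y keeps the label proportion close to identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Check that both sets contain enough positives.
print("train:", len(y_train), "positives:", int(y_train.sum()))
print("test: ", len(y_test), "positives:", int(y_test.sum()))
```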

### It's time to fit our models

Build 3 different models:
- Logistic regression
- Decision tree
- Gaussian Naive Bayes

We are going to compare the performance of our 3 models.

In [None]:
# FIXME
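The three models can be fitted in a loop, as sketched below on synthetic data. The hyperparameters (`max_iter=1000`, `max_depth=5`) and the scaling step before the logistic regression are assumptions chosen for the illustration, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: positives are shifted so the classes are separable.
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.2).astype(int)
X = rng.normal(size=(n, 3)) + 1.5 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    # Scaling helps the logistic regression solver converge.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "gaussian_nb": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the test set
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy is used here only as a first check; with imbalanced labels, the metrics in the next section give a more faithful comparison.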

### Compute metrics from scratch

Accuracy / confusion matrix / precision / recall / f1 score

Check that your implementation matches the results of the sklearn implementation.


In [None]:
# FIXME
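The metrics listed above can be derived from the four confusion-matrix counts. The sketch below uses made-up predictions and compares each hand-computed value with scikit-learn; it assumes binary labels with 1 as the positive class.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels and predictions (1 = positive class).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts from scratch.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# sklearn lays the binary confusion matrix out as [[TN, FP], [FN, TP]].
assert confusion_matrix(y_true, y_pred).tolist() == [[tn, fp], [fn, tp]]
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))

print(accuracy, precision, recall, f1)
```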

Now compute the ROC AUC.

In [None]:
# FIXME
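One way to compute the ROC AUC by hand uses its probabilistic reading: it is the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half. The sketch below uses made-up scores and checks the result against scikit-learn. The pairwise comparison costs memory proportional to the number of positives times the number of negatives, so it is only practical on a sample of the full dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

assert np.isclose(auc, roc_auc_score(y_true, scores))
print(auc)
```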

### Find the optimal threshold to maximize the F1 score

Choose the best model and compute the threshold that maximizes its F1 score.

In [None]:
# FIXME
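A possible approach is to evaluate the F1 score at every candidate threshold returned by `precision_recall_curve` and keep the best one. The probabilities below are made up; in practice they would come from `predict_proba` of the chosen model.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, proba)

# The last (precision, recall) pair has no associated threshold: drop it.
p, r = precision[:-1], recall[:-1]

# F1 at each threshold, defined as 0 where precision + recall is 0.
f1 = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
best_f1 = f1[best]

# A sample is predicted positive when its score is >= the threshold.
y_pred = (proba >= best_threshold).astype(int)
assert np.isclose(best_f1, f1_score(y_true, y_pred))

print(best_threshold, best_f1)
```

To avoid an optimistic estimate, the threshold is better selected on a validation set held out from the training data, with the test set kept for the final evaluation.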