# Master's Project
# Intrusion Detection System - CICIDS2017 Analysis

### Step 1: Import Libraries

This markdown cell adds a heading for the next block, where all required Python libraries are loaded.

```markdown
## Import libraries
```

### Step 2: Suppress Warnings

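Suppresses warning messages so they do not clutter the notebook output. Note that `filterwarnings("ignore")` hides every warning, including deprecation notices from the libraries imported in the next step.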

In [1]:
import warnings
warnings.filterwarnings("ignore")
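
If silencing every warning is broader than needed, suppression can be limited to one category and one block of code. The following is a minimal sketch of that alternative, not what the notebook does; `legacy_call` is a hypothetical stand-in for any library call that emits a warning.

```python
import warnings


def legacy_call():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this API is deprecated", FutureWarning)
    return "result"


# Suppress only FutureWarning, and only inside this block.
# The previous warning filters are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    value = legacy_call()

print(value)
```

The trade-off is a little more code in exchange for still seeing warnings that may point at real problems elsewhere in the notebook.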

### Step 3: Import Required Libraries

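Loads the libraries used in the rest of the notebook:
- Data handling: `pandas`, `numpy`
- Visualization: `matplotlib`, `seaborn`
- Machine learning: `sklearn` (label encoding, train/test splitting, evaluation metrics, tree-based classifiers) and `xgboost`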

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
from xgboost import plot_importance

### Step 4: Markdown - Dataset Introduction

This markdown cell explains the dataset's source and how it is used:
- The data comes from the public CICIDS2017 dataset
- A sampled subset is used so the notebook runs faster
- The same code can be reused on other datasets with a similar format

### Step 5: Read Dataset

Reads the sampled CICIDS2017 dataset into a pandas DataFrame.

In [3]:
df = pd.read_csv('./data/CICIDS2017_sample.csv')
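
A few sanity checks right after loading can catch a wrong path or a malformed file early. The sketch below is an illustration only: it reads a three-row stand-in built from values shown in this notebook's dataset preview (via `io.StringIO`) so that it runs without the CSV file, and the variable names are assumptions.

```python
import io

import pandas as pd

# Stand-in for './data/CICIDS2017_sample.csv': three columns and three rows
# taken from the dataset preview, so the sketch runs without the file.
csv_text = """Flow Duration,Total Fwd Packets,Label
4,2,BENIGN
142377,46,BENIGN
11507694,5,DoS
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Basic sanity checks before any preprocessing.
assert "Label" in sample.columns, "expected a 'Label' column with the class names"
n_rows, n_cols = sample.shape
n_missing = int(sample.isna().sum().sum())
print(f"{n_rows} rows, {n_cols} columns, {n_missing} missing values")
```

With the real file, the same checks can be applied to `df` directly.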

### Step 6: View Dataset

Displays the loaded dataset for inspection.

In [4]:
df

|  | Flow Duration | Total Fwd Packets | Total Backward Packets | Total Length of Fwd Packets | Total Length of Bwd Packets | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | Bwd Packet Length Max | ... | min_seg_size_forward | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 2 | 0 | 37 | 0 | 31 | 6 | 18.500000 | 17.677670 | 0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 1 | 142377 | 46 | 62 | 1325 | 105855 | 570 | 0 | 28.804348 | 111.407285 | 4344 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 2 | 118873 | 23 | 28 | 1169 | 45025 | 570 | 0 | 50.826087 | 156.137367 | 2896 | ... | 32 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 3 | 143577 | 43 | 55 | 1301 | 107289 | 570 | 0 | 30.255814 | 115.178969 | 4344 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 4 | 143745 | 49 | 59 | 1331 | 110185 | 570 | 0 | 27.163265 | 108.067176 | 4344 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56656 | 234 | 2 | 2 | 64 | 232 | 32 | 32 | 32.000000 | 0.000000 | 116 | ... | 32 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 56657 | 133288 | 2 | 2 | 94 | 482 | 47 | 47 | 47.000000 | 0.000000 | 241 | ... | 32 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 56658 | 11507694 | 5 | 4 | 450 | 3525 | 450 | 0 | 90.000000 | 201.246118 | 3525 | ... | 32 | 893.0 | 0.0 | 893 | 893 | 6503640.0 | 0.0 | 6503640 | 6503640 | DoS |
| 56659 | 11507707 | 8 | 6 | 416 | 11632 | 416 | 0 | 52.000000 | 147.078211 | 5792 | ... | 32 | 897.0 | 0.0 | 897 | 897 | 6503122.0 | 0.0 | 6503122 | 6503122 | DoS |


### Step 7: Check Class Distribution

Shows how many samples belong to each class. The classes are heavily imbalanced: BENIGN and DoS dominate, while Infiltration has only 36 samples.

In [5]:
df.Label.value_counts()

Label
BENIGN          22731
DoS             19035
PortScan         7946
BruteForce       2767
WebAttack        2180
Bot              1966
Infiltration       36
Name: count, dtype: int64
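
To put the imbalance into numbers, the counts above can be turned into percentages and a largest-to-smallest class ratio. This is a small sketch that re-enters the printed counts by hand so it runs on its own; in the notebook itself, `df.Label.value_counts(normalize=True)` would give the class shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "BENIGN": 22731,
        "DoS": 19035,
        "PortScan": 7946,
        "BruteForce": 2767,
        "WebAttack": 2180,
        "Bot": 1966,
        "Infiltration": 36,
    },
    name="count",
)

share = (counts / counts.sum() * 100).round(2)  # percentage of samples per class
imbalance_ratio = counts.max() / counts.min()   # largest class vs. smallest class

print(share)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio this large suggests looking at per-class precision, recall, and F1 rather than accuracy alone when the models are evaluated.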

### Step 8: Preprocessing Heading

This markdown cell introduces the preprocessing phase: normalization and handling of missing values.

### Step 9: Normalize + Fill Missing Values

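Applies two steps:
- **Z-score normalization** of every numeric column: the column mean is subtracted from each value and the result is divided by the column standard deviation
- **Fills NaNs** with `0`, which also covers constant columns, where a zero standard deviation makes the division yield NaN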

In [6]:
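# Z-score normalization of the numeric columns (the string Label column is skipped)
features = df.select_dtypes(include='number').columns
df[features] = df[features].apply(
    lambda x: (x - x.mean()) / x.std())

# Replace missing values with 0, including NaNs from zero-variance columns
df = df.fillna(0)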
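
The effect of the two steps is easiest to see on a tiny example. The sketch below uses a made-up three-row frame (the column names `duration` and `flag` are illustrative, not from the dataset) with one varying column and one constant column.

```python
import pandas as pd

# Toy data: 'duration' varies, 'flag' is constant, 'Label' is non-numeric.
toy = pd.DataFrame(
    {
        "duration": [1.0, 2.0, 3.0],
        "flag": [5.0, 5.0, 5.0],
        "Label": ["BENIGN", "DoS", "BENIGN"],
    }
)

numeric_cols = toy.select_dtypes(include="number").columns

# Z-score: subtract the column mean, divide by the sample standard deviation.
z = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std()

# The constant column has std 0, so 0 / 0 gives NaN for every row.
nan_before_fill = bool(z["flag"].isna().all())

# Filling with 0 turns those NaNs into a harmless all-zero column.
toy[numeric_cols] = z.fillna(0)
print(toy)
```

Here `duration` has mean 2 and sample standard deviation 1, so it becomes -1, 0, 1, while `flag` ends up as all zeros after the fill.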