Data Exploration Summary (UNSW-NB15)

In this project, we performed an exploratory data analysis (EDA) of the UNSW-NB15 dataset to understand its structure, labels, and feature types, and to assess its suitability for an intrusion detection system (IDS) based on Class-Incremental Learning (CIL).

Dataset Structure

The dataset is provided with predefined training and testing splits:

Training set: 175,341 samples

Testing set: 82,332 samples

Total columns: 45

A comparison of column names confirmed that the training and test sets contain exactly the same columns, so the same preprocessing pipeline can be applied to both.
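The set-difference checks in this notebook compare column names only. As a minimal sketch of a slightly stricter check, the helper below also compares column order and dtypes. The function name `same_schema` and the toy frames are hypothetical and stand in for the real CSV files.

```python
import pandas as pd


def same_schema(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True when both frames have the same column names, order, and dtypes."""
    return list(a.columns) == list(b.columns) and a.dtypes.equals(b.dtypes)


# Toy frames standing in for the real train/test CSVs (hypothetical values).
toy_train = pd.DataFrame({"dur": [0.12, 0.65], "proto": ["tcp", "udp"], "label": [0, 1]})
toy_test = pd.DataFrame({"dur": [1.62], "proto": ["tcp"], "label": [0]})

print(same_schema(toy_train, toy_test))                             # same names, order, dtypes
print(same_schema(toy_train, toy_test[["proto", "dur", "label"]]))  # reordered columns
```

On the real data, the equivalent call would be `same_schema(train, test)` after both CSV files are loaded.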

In [None]:
import pandas as pd

train_path = "UNSW_NB15_training-set.csv"
test_path  = "UNSW_NB15_testing-set.csv"

train = pd.read_csv(train_path)
test  = pd.read_csv(test_path)

print("Train shape:", train.shape)
print("Test shape:", test.shape)

display(train.head())


Train shape: (175341, 45)
Test shape: (82332, 45)


,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,0.121478,tcp,-,FIN,6,4,258,172,74.08749,...,1,1,0,0,0,1,1,0,Normal,0
1,2,0.649902,tcp,-,FIN,14,38,734,42014,78.473372,...,1,2,0,0,0,1,6,0,Normal,0
2,3,1.623129,tcp,-,FIN,8,16,364,13186,14.170161,...,1,3,0,0,0,2,6,0,Normal,0
3,4,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,...,1,3,1,1,0,2,1,0,Normal,0
4,5,0.449454,tcp,-,FIN,10,6,534,268,33.373826,...,1,40,0,0,0,2,39,0,Normal,0


In [None]:
print(train.columns)


Index(['id', 'dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes',
       'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss',
       'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin',
       'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth',
       'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm',
       'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm',
       'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm',
       'ct_srv_dst', 'is_sm_ips_ports', 'attack_cat', 'label'],
      dtype='object')


In [None]:
train[['label', 'attack_cat']].head()


,label,attack_cat
0,0,Normal
1,0,Normal
2,0,Normal
3,0,Normal
4,0,Normal


In [None]:
print(train['label'].value_counts())
print(train['attack_cat'].value_counts())


label
1    119341
0     56000
Name: count, dtype: int64
attack_cat
Normal            56000
Generic           40000
Exploits          33393
Fuzzers           18184
DoS               12264
Reconnaissance    10491
Analysis           2000
Backdoor           1746
Shellcode          1133
Worms               130
Name: count, dtype: int64


In [None]:
set(train.columns) - set(test.columns)


set()

In [None]:
set(test.columns) - set(train.columns)


set()

In [None]:
missing = train.isnull().sum()
missing[missing > 0]


Series([], dtype: int64)


In [None]:
import numpy as np

# Select every numeric dtype (not only the default int/float widths) before testing for +/-inf.
inf_counts = np.isinf(train.select_dtypes(include="number")).sum()
inf_counts[inf_counts > 0]


Series([], dtype: int64)


In [None]:
constant_cols = [c for c in train.columns if train[c].nunique() == 1]
constant_cols


[]

In [None]:
train.dtypes


column,dtype
id,int64
dur,float64
proto,object
service,object
state,object
spkts,int64
dpkts,int64
sbytes,int64
dbytes,int64
rate,float64


In [None]:
categorical_cols = train.select_dtypes(include=['object']).columns.tolist()
numerical_cols   = train.select_dtypes(include=['int64', 'float64']).columns.tolist()

print("Categorical columns:", categorical_cols)
print("Number of categorical:", len(categorical_cols))
print("Number of numerical:", len(numerical_cols))


Categorical columns: ['proto', 'service', 'state', 'attack_cat']
Number of categorical: 4
Number of numerical: 41


Labels

The dataset contains two label-related columns:

label: a binary label where 0 represents normal traffic and 1 represents attack traffic.

attack_cat: a multi-class label indicating the specific attack category.

Because the project focuses on Class-Incremental Learning, the attack_cat column is used as the target variable: it lets the model learn new attack types over time. The label column is not used for training but is retained for reference and potential auxiliary analysis.

The dataset includes 9 attack categories plus normal traffic, and the class distribution is highly imbalanced. In the training set, Normal has 56,000 samples while Worms has only 130, and Analysis, Backdoor, and Shellcode each have 2,000 or fewer. This combination of many classes and rare ones makes the dataset well suited for studying catastrophic forgetting and incremental learning behavior.
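To make the class-incremental setting concrete, the sketch below splits the ten attack_cat classes into consecutive, disjoint tasks. The class counts are copied from the training-set value_counts() output earlier in this notebook. The ordering (descending frequency), the choice of two classes per task, and the helper name `make_tasks` are assumptions for illustration, not the configuration used in the project.

```python
from typing import Dict, List

# Training-set class counts, copied from the value_counts() output in this notebook.
class_counts: Dict[str, int] = {
    "Normal": 56000, "Generic": 40000, "Exploits": 33393, "Fuzzers": 18184,
    "DoS": 12264, "Reconnaissance": 10491, "Analysis": 2000,
    "Backdoor": 1746, "Shellcode": 1133, "Worms": 130,
}


def make_tasks(classes: List[str], classes_per_task: int) -> List[List[str]]:
    """Split an ordered class list into consecutive, disjoint tasks."""
    return [classes[i:i + classes_per_task]
            for i in range(0, len(classes), classes_per_task)]


# Assumption: order classes by descending frequency and add two classes per task.
ordered = sorted(class_counts, key=class_counts.get, reverse=True)
tasks = make_tasks(ordered, classes_per_task=2)

for step, task in enumerate(tasks):
    print(f"task {step}: {task}")

# Ratio between the largest and smallest class, as a rough imbalance indicator.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```

With this ordering, the rarest classes (Shellcode and Worms) arrive in the last task. A different class order would change which classes are most exposed to forgetting, so the order is worth treating as an experimental variable.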

Data Quality

The checks above, run on the training set, showed that:

There are no missing values in the dataset.

There are no infinite or invalid numerical values.

There are no constant columns.

As a result, no imputation or numerical sanitization is required before preprocessing.
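The three checks can also be bundled into one helper so they are easy to rerun on the test split. This is a minimal sketch: the function name `quality_report` is hypothetical, and the toy frame contains deliberately injected problems to show what a non-clean result looks like.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Count missing values, infinite values, and constant columns in df."""
    numeric = df.select_dtypes(include="number")
    return {
        "missing": int(df.isnull().sum().sum()),
        "infinite": int(np.isinf(numeric.to_numpy(dtype=float)).sum()),
        "constant_cols": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
    }


# Toy frame with one inf, one missing value, and one constant column (hypothetical data).
toy = pd.DataFrame({
    "dur": [0.1, np.inf, 0.3],
    "spkts": [6, 14, 8],
    "flag": [1, 1, 1],
    "service": ["-", None, "ftp"],
})

report = quality_report(toy)
print(report)
```

On a clean frame, the report would show zero missing values, zero infinite values, and an empty list of constant columns.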

Feature Types

Among the 45 columns:

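41 columns are numerical. Two of them, id and label, are not input features, which leaves 39 numerical features representing flow statistics, packet counts, byte counts, and other network behavior indicators.

4 columns are categorical. Three of them are used as input features: proto, service, and state. The fourth, attack_cat, is the target.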

The attack_cat and label columns are treated as labels and excluded from the feature set.
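The column roles described here can be written down explicitly. The sketch below derives the numerical and categorical feature lists by excluding the identifier and label columns. The constant names and the helper `split_feature_columns` are hypothetical, and the toy frame uses only a handful of the real column names.

```python
import pandas as pd

LABEL_COLS = ["attack_cat", "label"]               # targets, never used as inputs
ID_COLS = ["id"]                                   # row identifier, dropped
CATEGORICAL_FEATURES = ["proto", "service", "state"]


def split_feature_columns(df: pd.DataFrame):
    """Return (numerical_features, categorical_features) for a UNSW-NB15-style frame."""
    excluded = set(LABEL_COLS + ID_COLS)
    feature_cols = [c for c in df.columns if c not in excluded]
    categorical = [c for c in feature_cols if c in CATEGORICAL_FEATURES]
    numerical = [c for c in feature_cols if c not in CATEGORICAL_FEATURES]
    return numerical, categorical


# Toy frame with a subset of the real columns (values taken from the first row shown above).
toy = pd.DataFrame({
    "id": [1], "dur": [0.121478], "proto": ["tcp"], "service": ["-"],
    "state": ["FIN"], "spkts": [6], "attack_cat": ["Normal"], "label": [0],
})

numerical, categorical = split_feature_columns(toy)
print(numerical)
print(categorical)
```

Applied to the full 45-column training frame, this split should yield 39 numerical and 3 categorical input features, given the counts reported by the dtype cell.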

Preprocessing Decisions

Based on the exploration:

The id column is removed, as it has no semantic meaning for learning.

Categorical features (proto, service, state) are encoded using one-hot encoding.

Numerical features are normalized using feature scaling.

The attack_cat column is used as the training target for the CIL scenario.

These preprocessing decisions are fixed and applied consistently across all experiments.
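The decisions above can be sketched as a single scikit-learn transformer that is fitted on the training split and reused on the test split. Several details here are assumptions, because the text does not specify them: StandardScaler as the scaling method, `handle_unknown="ignore"` for categories that appear only at test time, and the toy rows used in place of the real CSV files.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["proto", "service", "state"]
numerical = ["dur", "spkts"]  # toy subset; the real list holds all numerical features

# Toy rows standing in for the real splits (hypothetical values).
toy_train = pd.DataFrame({
    "id": [1, 2, 3],
    "dur": [0.12, 0.65, 1.62],
    "spkts": [6, 14, 8],
    "proto": ["tcp", "tcp", "udp"],
    "service": ["-", "ftp", "-"],
    "state": ["FIN", "FIN", "INT"],
    "attack_cat": ["Normal", "Normal", "Generic"],
    "label": [0, 0, 1],
})
toy_test = pd.DataFrame({
    "id": [1], "dur": [0.45], "spkts": [10],
    "proto": ["icmp"],  # category not seen during fitting
    "service": ["-"], "state": ["FIN"],
    "attack_cat": ["Normal"], "label": [0],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Assumption: unseen categories are encoded as all zeros instead of raising.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        # Assumption: standardization is the chosen form of feature scaling.
        ("num", StandardScaler(), numerical),
    ],
    remainder="drop",      # drops id, label, and attack_cat from the inputs
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted transformer on the test split.
X_train = preprocessor.fit_transform(toy_train)
X_test = preprocessor.transform(toy_test)
y_train = toy_train["attack_cat"]  # CIL target

print(X_train.shape, X_test.shape)
```

Fitting the encoder and scaler on the training split alone keeps test-set statistics out of the features and guarantees that both splits end up with the same column layout.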

Conclusion

The UNSW-NB15 dataset is clean, well structured, and provides clearly defined multi-class attack labels, making it particularly suitable for class-incremental intrusion detection experiments. Its inherent class imbalance and rare attack types further support its use for evaluating catastrophic forgetting and continual learning strategies.