# DATA SAMPLING 


### Step 1 - Suppress Warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")

Suppresses all Python warnings so they do not clutter the output. This is convenient in notebooks where warnings are expected (e.g., deprecation notices) and not critical, but it also hides warnings that may point to real problems.
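If a blanket filter is too broad, the suppression can be narrowed. The following is a minimal sketch, not part of the original notebook: the chosen warning categories are an assumption, and `noisy_step` is a hypothetical stand-in for a library call that emits a warning.

```python
import warnings

# Silence only deprecation-style warnings; other categories still surface.
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)


def noisy_step():
    """Hypothetical helper that emits a warning, standing in for a library call."""
    warnings.warn("this API will change", FutureWarning)
    return 42


# Alternatively, suppress warnings only around a single call; the previous
# filters are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()
```

Scoping the filter this way keeps unexpected warnings visible in the rest of the notebook.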

### Step 2 - Import Libraries

In [None]:
import numpy as np
import pandas as pd

numpy: Efficient numerical computation (used indirectly for sampling/randomization).

pandas: Essential for loading, manipulating, and saving tabular datasets.

### Step 3 - Load the CICIDS2017 Dataset
The CICIDS2017 dataset is publicly available at: https://www.unb.ca/cic/datasets/ids-2017.html.

Due to the large size of this dataset, sampled subsets of CICIDS2017 are used in this project. These subsets are located in the "data" folder.
If you wish to apply this code to other datasets (e.g., the CAN-intrusion dataset), simply update the dataset path and follow the same steps. The models implemented in this code are generic and can be applied to any intrusion detection or network traffic dataset.

In [None]:
df = pd.read_csv('./data/CICIDS2017.csv')

Loads the full CICIDS2017 dataset from a specified path into a DataFrame.

Note: This relative path (./data/...) is the one used in Google Colab. On local systems, adjust it to wherever the file is stored.
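The later steps refer to the label column as `' Label'`, with a leading space. CSV headers can carry stray whitespace like this, so it may help to look the column up by its stripped name. The sketch below is illustrative only: `find_column` is a hypothetical helper, and the in-memory CSV is made-up data standing in for the real file.

```python
import io

import pandas as pd


def find_column(df: pd.DataFrame, name: str) -> str:
    """Return the actual column whose name equals `name` after stripping whitespace."""
    for col in df.columns:
        if str(col).strip() == name:
            return col
    raise KeyError(f"no column matching {name!r}")


# Tiny in-memory stand-in for the real CSV (values are made up for illustration).
csv_text = " Destination Port, Flow Duration, Label\n80,3,BENIGN\n53,109,BENIGN\n"
demo = pd.read_csv(io.StringIO(csv_text))

# Resolves to the real column name, whitespace included.
label_col = find_column(demo, "Label")
labels = list(demo[label_col])
```

Using the resolved name avoids a `KeyError` when the header spacing differs between copies of the dataset.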

### Step 4 - Display the Data

In [None]:
df

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,54865,3,2,0,12,0,6,6,6.0,0.00000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
1,55054,109,1,1,6,6,6,6,6.0,0.00000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2,55055,52,1,1,6,6,6,6,6.0,0.00000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
3,46236,34,1,1,6,6,6,6,6.0,0.00000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,54863,3,2,0,12,0,6,6,6.0,0.00000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2830738,53,32215,4,2,112,152,28,28,28.0,0.00000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2830739,53,324,2,2,84,362,42,42,42.0,0.00000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2830740,58030,82,2,1,31,6,31,0,15.5,21.92031,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2830741,53,1048635,6,2,192,256,32,32,32.0,0.00000,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


Displays a truncated preview of the DataFrame (the first and last rows, with the middle rows and columns elided). Useful to verify that the dataset is loaded correctly.
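For a dataset of this size, a few targeted checks can be quicker to read than the rendered table. This is a small sketch on a made-up toy frame; the column names are only examples.

```python
import pandas as pd

# Toy frame standing in for the loaded dataset (values are illustrative).
demo = pd.DataFrame(
    {"Flow Duration": [3, 109, 52], "Label": ["BENIGN", "BENIGN", "BENIGN"]}
)

n_rows, n_cols = demo.shape   # overall size
preview = demo.head(2)        # first rows only
missing = demo.isna().sum()   # missing values per column
```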

### Step 5 - Check Class Distribution

In [None]:
df[' Label'].value_counts()

Label,count
BENIGN,2273097
DoS Hulk,231073
PortScan,158930
DDoS,128027
DoS GoldenEye,10293
FTP-Patator,7938
SSH-Patator,5897
DoS slowloris,5796
DoS Slowhttptest,5499
Bot,1966


Shows how many records exist for each traffic label (e.g., BENIGN, DoS Hulk, PortScan).

Helps in identifying class imbalance, a common issue in intrusion detection datasets.
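The degree of imbalance can be quantified from the counts listed above. The sketch below copies those ten counts into a Series by hand; the shares are therefore relative to the listed labels only.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "BENIGN": 2273097,
        "DoS Hulk": 231073,
        "PortScan": 158930,
        "DDoS": 128027,
        "DoS GoldenEye": 10293,
        "FTP-Patator": 7938,
        "SSH-Patator": 5897,
        "DoS slowloris": 5796,
        "DoS Slowhttptest": 5499,
        "Bot": 1966,
    }
)

shares = counts / counts.sum()                  # fraction of records per label
imbalance_ratio = counts.max() / counts.min()   # largest class vs. smallest listed class
```

On a live DataFrame, `value_counts(normalize=True)` yields the same kind of shares directly.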

### Step 6 - Sample Data for Each Class
The dataset is highly imbalanced, so this step creates a smaller, more balanced subset by downsampling the majority classes.

In [None]:
# Note: 'DoS' and 'BruteForce' assume labels already merged into coarse categories;
# with the raw labels shown in Step 5, these exact-match filters select no rows.
# Keep the rare classes in full
df_minor = df[df[' Label'].isin(['WebAttack', 'Bot', 'Infiltration'])]
# Randomly downsample each majority class (frac = fraction of rows kept)
df_BENIGN = df[df[' Label'] == 'BENIGN'].sample(frac=0.01)
df_DoS = df[df[' Label'] == 'DoS'].sample(frac=0.05)
df_PortScan = df[df[' Label'] == 'PortScan'].sample(frac=0.05)
df_BruteForce = df[df[' Label'] == 'BruteForce'].sample(frac=0.2)

Sampling strategies:

Rare classes (WebAttack, Bot, Infiltration) are kept as-is (df_minor).

Majority classes (BENIGN, DoS, PortScan, BruteForce) are downsampled using random sampling:
BENIGN → 1%,
DoS → 5%,
PortScan → 5%,
BruteForce → 20%.

Sampling is done using pandas.DataFrame.sample(frac=...). Because random_state is left unset, each run draws a different sample.
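The Step 6 filters match coarse category names such as DoS and BruteForce, while the Step 5 output lists finer-grained labels such as DoS Hulk and FTP-Patator. One way to reconcile the two is to merge the raw labels first and then sample each class with a fixed seed. The sketch below is an assumption-laden illustration, not the notebook's method: `LABEL_MAP`, `FRACTIONS`, and `downsample` are hypothetical names, the grouping of raw labels is a guess, and the toy data is made up.

```python
import pandas as pd

# Hypothetical mapping from raw labels (as listed in Step 5) to coarse categories.
LABEL_MAP = {
    "DoS Hulk": "DoS",
    "DoS GoldenEye": "DoS",
    "DoS slowloris": "DoS",
    "DoS Slowhttptest": "DoS",
    "DDoS": "DoS",
    "FTP-Patator": "BruteForce",
    "SSH-Patator": "BruteForce",
}

# Fractions taken from the text; classes not listed here are kept in full.
FRACTIONS = {"BENIGN": 0.01, "DoS": 0.05, "PortScan": 0.05, "BruteForce": 0.2}


def downsample(df, label_col, fractions, seed=0):
    """Sample each class at its own fraction, with a fixed seed for repeatability."""
    parts = []
    for label, group in df.groupby(label_col, sort=False):
        frac = fractions.get(label, 1.0)
        parts.append(group.sample(frac=frac, random_state=seed))
    return pd.concat(parts).sort_index()


# Toy data standing in for the real dataset.
demo = pd.DataFrame(
    {"Label": ["BENIGN"] * 1000 + ["DoS Hulk"] * 100 + ["DDoS"] * 100 + ["Bot"] * 5}
)
# Merge raw labels; labels missing from the map are left unchanged.
demo["Label"] = demo["Label"].map(lambda x: LABEL_MAP.get(x, x))

sampled = downsample(demo, "Label", FRACTIONS)
sampled_counts = sampled["Label"].value_counts()
```

Passing `random_state` makes the subset reproducible across runs, which is useful when results need to be compared.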

### Step 7 - Concatenate All Sampled Data

In [None]:

df_s = pd.concat([df_BENIGN, df_DoS, df_PortScan, df_BruteForce, df_minor])

Combines all the sampled subsets into a single dataset for training/testing. This results in a significantly smaller, but more balanced, dataset.

### Step 8 - Sort Index

In [None]:
df_s = df_s.sort_index()

Reorders the rows by their original index from the main dataset, since pd.concat leaves them grouped by class.

Maintains logical/chronological order (useful for time-series or traceability).
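The effect of concatenating and then sorting can be seen on a toy frame. This is a minimal sketch with made-up data; it also checks that the combined index has no duplicates, which should hold when the subsets are disjoint.

```python
import pandas as pd

# Toy frame standing in for the full dataset.
full = pd.DataFrame({"Label": ["BENIGN", "Bot", "BENIGN", "PortScan", "Bot"]})

part_a = full[full["Label"] == "BENIGN"]
part_b = full[full["Label"] != "BENIGN"]

combined = pd.concat([part_a, part_b])   # rows grouped by subset, original index kept
restored = combined.sort_index()         # rows back in their original order

no_overlap = combined.index.is_unique    # True when no row was picked twice
```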

### Step 9 - Save the Sampled Dataset

In [None]:
# Save the sampled dataset without the index column
df_s.to_csv('./data/CICIDS2017_sample.csv', index=False)

Saves the final sampled dataset as a .csv file.

index=False means the DataFrame index column will not be saved, which is often preferred for clean CSVs.
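The effect of leaving the index out can be checked with a round trip through an in-memory buffer. The sketch below uses a made-up toy frame; when the index is written, it comes back on reload as an extra column named "Unnamed: 0".

```python
import io

import pandas as pd

# Toy frame with a non-default index, as left behind by sampling.
demo = pd.DataFrame(
    {"Flow Duration": [3, 109], "Label": ["BENIGN", "Bot"]}, index=[7, 42]
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# Index omitted: only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
```

Note that omitting the index also discards the original row positions, so keep it if traceability back to the full dataset matters.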