# CICIDS 2017 – Dataset Understanding

This notebook focuses on understanding the structure, composition, and
limitations of the CICIDS 2017 dataset before any cleaning or modeling.

The dataset is treated as a SOC analyst’s data source, not a machine learning benchmark.


## Operational Context

Each row in the CICIDS 2017 dataset represents aggregated network flow behavior
rather than individual packets.
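
To make the flow-versus-packet distinction concrete, the sketch below collapses a toy packet log into one row per flow. Every value, the column names, and the five-tuple `flow_key` are assumptions chosen for illustration. This is not the tool that produced CICIDS 2017, only a minimal picture of what "aggregated flow behavior" means.

```python
import pandas as pd

# Toy packet log; every value here is invented for illustration.
packets = pd.DataFrame({
    "src_ip":       ["10.0.0.5", "10.0.0.5", "10.0.0.5", "10.0.0.9"],
    "dst_ip":       ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.1"],
    "src_port":     [50000, 50000, 50000, 50001],
    "dst_port":     [80, 80, 80, 443],
    "protocol":     [6, 6, 6, 6],
    "timestamp_us": [0, 1_500, 4_000, 2_000],
    "length":       [60, 1500, 40, 52],
})

# Assumed flow definition: packets sharing the same five-tuple form one flow.
flow_key = ["src_ip", "dst_ip", "src_port", "dst_port", "protocol"]

flows = (
    packets.groupby(flow_key)
    .agg(
        total_packets=("length", "size"),
        total_bytes=("length", "sum"),
        first_seen_us=("timestamp_us", "min"),
        last_seen_us=("timestamp_us", "max"),
    )
    .reset_index()
)
flows["flow_duration_us"] = flows["last_seen_us"] - flows["first_seen_us"]

print(flows[["dst_port", "total_packets", "total_bytes", "flow_duration_us"]])
```

Four packets become two flow rows. The individual packet sizes and timings are no longer visible, only their per-flow summaries, which is the level of detail each dataset row provides.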

The data simulates enterprise network traffic containing both benign activity
and labeled attack scenarios generated in a controlled environment.

This dataset is suitable for studying traffic behavior patterns and detection challenges,
but not for claiming real-world attack prevalence.


In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Apply the seaborn dark grid theme to all subsequent plots
sns.set_theme(style='darkgrid')

In [5]:
# One CSV per capture day/scenario; the raw strings keep the Windows
# backslashes literal. Adjust the directory to the local copy of the dataset.
data1 = pd.read_csv(r"D:\MachineLearningCVE\Monday-WorkingHours.pcap_ISCX.csv")
data2 = pd.read_csv(r"D:\MachineLearningCVE\Tuesday-WorkingHours.pcap_ISCX.csv")
data3 = pd.read_csv(r"D:\MachineLearningCVE\Wednesday-workingHours.pcap_ISCX.csv")
data4 = pd.read_csv(r"D:\MachineLearningCVE\Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv")
data5 = pd.read_csv(r"D:\MachineLearningCVE\Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv")
data6 = pd.read_csv(r"D:\MachineLearningCVE\Friday-WorkingHours-Morning.pcap_ISCX.csv")
data7 = pd.read_csv(r"D:\MachineLearningCVE\Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv")
data8 = pd.read_csv(r"D:\MachineLearningCVE\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv")

In [6]:
data_list = [data1, data2, data3, data4, data5, data6, data7, data8]

print("Individual file dimensions:")
for i, df in enumerate(data_list, start=1):
    print(f"Data{i}: {df.shape[0]} rows, {df.shape[1]} columns")


Individual file dimensions:
Data1: 529918 rows, 79 columns
Data2: 445909 rows, 79 columns
Data3: 692703 rows, 79 columns
Data4: 170366 rows, 79 columns
Data5: 288602 rows, 79 columns
Data6: 191033 rows, 79 columns
Data7: 286467 rows, 79 columns
Data8: 225745 rows, 79 columns


In [7]:
data = pd.concat(data_list, ignore_index=True)

rows, cols = data.shape
print("Combined dataset dimensions:")
print(f"Rows: {rows}")
print(f"Columns: {cols}")
print(f"Total cells: {rows * cols}")


Combined dataset dimensions:
Rows: 2830743
Columns: 79
Total cells: 223628697
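
A quick way to confirm that a concatenation lost or duplicated nothing is to compare it against its parts. The sketch below uses a hypothetical helper, `check_concat`, on two toy frames, and then repeats the row arithmetic with the per-file counts printed above.

```python
import pandas as pd


def check_concat(frames, combined):
    """Return True if all parts share one schema and their rows add up."""
    same_columns = all(list(f.columns) == list(frames[0].columns) for f in frames)
    same_rows = sum(len(f) for f in frames) == len(combined)
    return same_columns and same_rows


# Toy frames standing in for the per-day files
a = pd.DataFrame({" Flow Duration": [1], " Label": ["BENIGN"]})
b = pd.DataFrame({" Flow Duration": [2, 3], " Label": ["DDoS", "BENIGN"]})
combined = pd.concat([a, b], ignore_index=True)
print(check_concat([a, b], combined))

# Row counts reported for the eight files in the output above
file_rows = [529918, 445909, 692703, 170366, 288602, 191033, 286467, 225745]
print(sum(file_rows) == 2830743)
```

Both checks print `True`: the eight per-file row counts sum exactly to the 2,830,743 rows of the combined frame.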


## Why the Dataset Is Combined

CICIDS 2017 is distributed across multiple CSV files representing
different days and attack scenarios.

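Concatenating these files gives a single table for dataset-wide analysis while
keeping every scenario's rows. The combined frame has the same 79 columns as each
file, so the file of origin is not recorded unless it is added as an extra column.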
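
If the day or scenario of origin matters later, one option is to tag each row while loading. The sketch below is an assumption-laden alternative to the cells above, not the notebook's method: `load_with_source` and the `source_file` column are hypothetical names, and the demo runs on two tiny stand-in CSVs so it does not need the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_with_source(paths):
    """Read each CSV and tag its rows with the file they came from."""
    frames = []
    for path in paths:
        frame = pd.read_csv(path)
        # 'source_file' is a hypothetical column name, not part of CICIDS 2017
        frame["source_file"] = Path(path).name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Demo on two tiny stand-in files so the sketch runs without the real CSVs
with tempfile.TemporaryDirectory() as tmp:
    monday = Path(tmp) / "monday.csv"
    friday = Path(tmp) / "friday.csv"
    monday.write_text(" Flow Duration, Label\n10,BENIGN\n")
    friday.write_text(" Flow Duration, Label\n20,DDoS\n30,BENIGN\n")
    combined = load_with_source([monday, friday])

print(combined.groupby("source_file")[" Label"].value_counts())
```

With the extra column in place, label counts can be broken down per file, which keeps the scenario structure visible after concatenation.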


In [8]:
data.columns


Index([' Destination Port', ' Flow Duration', ' Total Fwd Packets',
       ' Total Backward Packets', 'Total Length of Fwd Packets',
       ' Total Length of Bwd Packets', ' Fwd Packet Length Max',
       ' Fwd Packet Length Min', ' Fwd Packet Length Mean',
       ' Fwd Packet Length Std', 'Bwd Packet Length Max',
       ' Bwd Packet Length Min', ' Bwd Packet Length Mean',
       ' Bwd Packet Length Std', 'Flow Bytes/s', ' Flow Packets/s',
       ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max', ' Flow IAT Min',
       'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std', ' Fwd IAT Max',
       ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean', ' Bwd IAT Std',
       ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags', ' Bwd PSH Flags',
       ' Fwd URG Flags', ' Bwd URG Flags', ' Fwd Header Length',
       ' Bwd Header Length', 'Fwd Packets/s', ' Bwd Packets/s',
       ' Min Packet Length', ' Max Packet Length', ' Packet Length Mean',
       ' Packet Length Std', ' Packet Length Variance', '

In [9]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2830743 entries, 0 to 2830742
Data columns (total 79 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0    Destination Port             int64  
 1    Flow Duration                int64  
 2    Total Fwd Packets            int64  
 3    Total Backward Packets       int64  
 4   Total Length of Fwd Packets   int64  
 5    Total Length of Bwd Packets  int64  
 6    Fwd Packet Length Max        int64  
 7    Fwd Packet Length Min        int64  
 8    Fwd Packet Length Mean       float64
 9    Fwd Packet Length Std        float64
 10  Bwd Packet Length Max         int64  
 11   Bwd Packet Length Min        int64  
 12   Bwd Packet Length Mean       float64
 13   Bwd Packet Length Std        float64
 14  Flow Bytes/s                  float64
 15   Flow Packets/s               float64
 16   Flow IAT Mean                float64
 17   Flow IAT Std                 float64
 18   Flow IAT Max         

In [12]:
# Identify the label column by its stripped name, because the raw header
# carries leading whitespace (' Label')
label_col = next(col for col in data.columns if col.strip() == 'Label')
label_col


' Label'

## Label Column Naming Observation

The label column name in CICIDS 2017 carries leading whitespace (' Label'), as do
most of the feature column names listed above.

This does not affect analysis as long as columns are looked up by their exact raw
name or the names are stripped once after loading, but it shows why raw datasets
should be inspected before applying transformations or modeling.
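
One way to remove the whitespace once, rather than working around it in every lookup, is sketched below on an empty toy frame that reuses four of the raw header names. Whether to rename at all is a design choice; the notebook itself keeps the raw names.

```python
import pandas as pd

# Empty toy frame reusing a few raw header names from the dataset
raw = pd.DataFrame(columns=[" Destination Port", " Flow Duration", "Flow Bytes/s", " Label"])

# Strip surrounding whitespace from every column name
clean = raw.rename(columns=str.strip)
print(list(clean.columns))

# Stripping can make two previously distinct names collide, so check for it
print(clean.columns.duplicated().any())
```

After such a rename the label column would be addressed as 'Label', so any code that stored the raw name ' Label' would need to be re-run.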

In [13]:
data[label_col].value_counts()


 Label
BENIGN                        2273097
DoS Hulk                       231073
PortScan                       158930
DDoS                           128027
DoS GoldenEye                   10293
FTP-Patator                      7938
SSH-Patator                      5897
DoS slowloris                    5796
DoS Slowhttptest                 5499
Bot                              1966
Web Attack – Brute Force         1507
Web Attack – XSS                  652
Infiltration                       36
Web Attack – Sql Injection         21
Heartbleed                         11
Name: count, dtype: int64

In [14]:
data[label_col].value_counts(normalize=True) * 100


 Label
BENIGN                        80.300366
DoS Hulk                       8.162981
PortScan                       5.614427
DDoS                           4.522735
DoS GoldenEye                  0.363615
FTP-Patator                    0.280421
SSH-Patator                    0.208320
DoS slowloris                  0.204752
DoS Slowhttptest               0.194260
Bot                            0.069452
Web Attack – Brute Force       0.053237
Web Attack – XSS               0.023033
Infiltration                   0.001272
Web Attack – Sql Injection     0.000742
Heartbleed                     0.000389
Name: proportion, dtype: float64
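
The three web-attack labels contain a non-ASCII separator character, which can be rendered as a dash or as a replacement character depending on how the file is decoded. The sketch below shows one possible normalization to plain ASCII on a toy Series. The replacement character `\ufffd` is used as the stand-in input, and the choice of "-" as the substitute is an assumption.

```python
import pandas as pd

# Toy labels; \ufffd stands in for the undecodable separator character
labels = pd.Series([
    "BENIGN",
    "Web Attack \ufffd Brute Force",
    "Web Attack \ufffd XSS",
    "Web Attack \ufffd Sql Injection",
])

clean = (
    labels
    .str.replace(r"[^\x20-\x7E]+", "-", regex=True)  # any non-printable-ASCII run becomes "-"
    .str.replace(r"\s+", " ", regex=True)            # collapse repeated whitespace
    .str.strip()
)
print(clean.tolist())
```

Normalizing before grouping or filtering avoids string comparisons that silently fail because of an invisible or mis-decoded character.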

## Class Imbalance Considerations

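The dataset is highly imbalanced: BENIGN flows make up about 80.3% of records,
while the three rarest classes (Infiltration, SQL injection, and Heartbleed) have
fewer than 40 rows each.

This mirrors the direction of the imbalance in real SOC environments, where malicious
activity is rare but critical, although the roughly 19.7% attack share here is a
product of the simulation and not a real-world prevalence.

As a result:
- Accuracy alone is misleading: a model that labels every flow BENIGN would still score about 80.3%
- Rare attack patterns can be statistically drowned out by the high-volume classes
- Distribution-based reasoning (per-class counts and proportions rather than a single aggregate metric) is essential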
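
The points above can be checked with simple arithmetic on the label counts reported earlier. The sketch below scores a hypothetical "always predict BENIGN" baseline. No model is trained, and the web-attack label names are written with an ASCII hyphen for convenience.

```python
# Label counts copied from the value_counts() output above
counts = {
    "BENIGN": 2273097, "DoS Hulk": 231073, "PortScan": 158930, "DDoS": 128027,
    "DoS GoldenEye": 10293, "FTP-Patator": 7938, "SSH-Patator": 5897,
    "DoS slowloris": 5796, "DoS Slowhttptest": 5499, "Bot": 1966,
    "Web Attack - Brute Force": 1507, "Web Attack - XSS": 652,
    "Infiltration": 36, "Web Attack - Sql Injection": 21, "Heartbleed": 11,
}
total = sum(counts.values())

# Hypothetical baseline: predict BENIGN for every flow
accuracy = counts["BENIGN"] / total

# Recall per class: 1.0 for BENIGN, 0.0 for every attack class
recall = {label: 1.0 if label == "BENIGN" else 0.0 for label in counts}
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy:     {accuracy:.4f}")      # 0.8030
print(f"macro recall: {macro_recall:.4f}")  # 0.0667
print(f"attacks detected: {sum(n for l, n in counts.items() if recall[l] > 0 and l != 'BENIGN')}")  # 0
```

The baseline reaches about 80% accuracy while detecting none of the 557,646 attack flows, which is why per-class metrics are needed alongside any aggregate score.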

## Dataset Caveats

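CICIDS 2017 is a labeled, simulated dataset.

Limitations include:
- Structured attack generation: attacks come from scripted scenarios tied to specific days, as the per-day file names show
- Assumed ground-truth labels: the provided labels are taken as correct and are not independently verified here
- Limited behavioral drift: the capture covers a single working week (Monday to Friday)

These constraints are acknowledged in all subsequent analysis phases.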