# CICIDS 2017 – Data Cleaning

This notebook focuses on improving data quality while preserving
rare but meaningful security events.

Cleaning decisions are conservative and explicitly justified.


In [1]:
import numpy as np
import pandas as pd


In [2]:
data1 = pd.read_csv(r"D:\MachineLearningCVE\Monday-WorkingHours.pcap_ISCX.csv")
data2 = pd.read_csv(r"D:\MachineLearningCVE\Tuesday-WorkingHours.pcap_ISCX.csv")
data3 = pd.read_csv(r"D:\MachineLearningCVE\Wednesday-workingHours.pcap_ISCX.csv")
data4 = pd.read_csv(r"D:\MachineLearningCVE\Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv")
data5 = pd.read_csv(r"D:\MachineLearningCVE\Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv")
data6 = pd.read_csv(r"D:\MachineLearningCVE\Friday-WorkingHours-Morning.pcap_ISCX.csv")
data7 = pd.read_csv(r"D:\MachineLearningCVE\Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv")
data8 = pd.read_csv(r"D:\MachineLearningCVE\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv")

data = pd.concat(
    [data1, data2, data3, data4, data5, data6, data7, data8],
    ignore_index=True
)


## Cleaning Philosophy

Security datasets often contain rare, extreme, or noisy values that
may represent attacks rather than errors.

Therefore:
- Only clearly invalid values are corrected
- Rare events are preserved
- Every modification is explicitly justified

In [4]:
# Remove leading/trailing whitespace from column names
data.columns = data.columns.str.strip()

## Column Name Normalization

Some CICIDS 2017 columns contain leading or trailing whitespace.
This step improves reliability and prevents accidental KeyErrors
without altering the underlying data.
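
As a minimal sketch of why this matters, the toy frame below imitates headers with a leading space. The column names and values are assumptions for illustration, not rows from the dataset. It also checks that stripping did not collapse two distinct names into one.

```python
import pandas as pd

# Toy frame with leading spaces in the headers (illustrative values only).
raw = pd.DataFrame({" Destination Port": [80, 443], " Label": ["BENIGN", "BENIGN"]})

# Before stripping, the intuitive column name is not found.
missing_before = "Label" not in raw.columns

# Strip whitespace on a copy, then confirm no two names became identical.
cleaned = raw.copy()
cleaned.columns = cleaned.columns.str.strip()
has_collisions = cleaned.columns.duplicated().any()

print(missing_before, list(cleaned.columns), has_collisions)
```

If `has_collisions` were ever true, the affected columns would need a manual rename before any further selection by name.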

In [5]:
# Replace infinite values with NaN
data.replace([np.inf, -np.inf], np.nan, inplace=True)

## Infinite Values Handling

Infinite values break or distort many numerical operations, such as
means, scaling, and model fitting.
They are treated as missing values and handled conservatively
in the next step.
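
The sketch below shows one way to count infinities per column before replacing them, so the number of affected rows is known in advance. It assumes that infinite rates come from dividing by a zero flow duration. That cause, the `Total Bytes` column, and the values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up flows; the zero-duration row is an assumed source of an infinite rate.
flows = pd.DataFrame({
    "Flow Duration": [0.0, 2.0, 4.0],
    "Total Bytes": [100.0, 300.0, 0.0],  # hypothetical helper column
})
flows["Flow Bytes/s"] = flows["Total Bytes"] / flows["Flow Duration"]

# Count +inf/-inf per numeric column before touching the data.
numeric = flows.select_dtypes(include=[np.number])
inf_counts = np.isinf(numeric).sum()

# Same replacement as in the cell above, applied to a copy.
flows_clean = flows.replace([np.inf, -np.inf], np.nan)
nan_counts = flows_clean.isna().sum()

print(inf_counts[inf_counts > 0])
```

Comparing `inf_counts` with `nan_counts` separates values that were infinite from values that were already missing in the source files.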


In [6]:
missing_counts = data.isna().sum()
missing_counts[missing_counts > 0]

Flow Bytes/s      2867
Flow Packets/s    2867
dtype: int64

## Missing Value Strategy

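Missing values are handled cautiously.

After the infinity replacement, only `Flow Bytes/s` and `Flow Packets/s`
contain missing values (2,867 each). Rows with missing values in these
features are removed rather than imputed, as imputation could introduce
artificial patterns in security data.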

In [7]:
data_before = data.shape[0]
data.dropna(inplace=True)
data_after = data.shape[0]

print(f"Rows before: {data_before}")
print(f"Rows after: {data_after}")

Rows before: 2830743
Rows after: 2827876


> Note: Row removal due to missing values affects 2,867 of 2,830,743 rows
> (about 0.1% of the dataset) and was chosen over imputation to avoid
> introducing artificial patterns in security-sensitive features.
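
A small overall fraction can still hide a large loss in one rare class. The sketch below shows one way to check this before dropping rows, by counting the affected rows per class. The `Label` column name, the class names, and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; labels and values are illustrative only.
df = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.nan, 5.0, np.nan, 7.0],
    "Label": ["BENIGN", "BENIGN", "DDoS", "DDoS", "BENIGN"],
})

# Rows that dropna() would remove.
has_missing = df.isna().any(axis=1)

# Per-class count of those rows, and the share of each class that would be lost.
dropped_per_class = df.loc[has_missing, "Label"].value_counts()
total_per_class = df["Label"].value_counts()
share_lost = (dropped_per_class / total_per_class).fillna(0.0)

print(share_lost.sort_values(ascending=False))
```

A class that loses a large share of its rows would be a reason to revisit the removal decision for that class.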


## Duplicate Records

Exact duplicate rows do not add analytical value and may bias
distribution-based analysis.

Only exact duplicates are removed: a row is dropped only if it matches
an earlier row in every column. Near-duplicates are kept.

In [8]:
dup_before = data.shape[0]
data.drop_duplicates(inplace=True)
dup_after = data.shape[0]

print(f"Rows before duplicate removal: {dup_before}")
print(f"Rows after duplicate removal: {dup_after}")

Rows before duplicate removal: 2827876
Rows after duplicate removal: 2520798
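
Because deduplication removes many more rows than the missing-value step, a per-class comparison helps confirm that rare classes survive it. The following is a minimal sketch on toy data. The `Label` column name and the class names are assumptions for illustration.

```python
import pandas as pd

# Toy data with one repeated row per class (illustrative values only).
df = pd.DataFrame({
    "Destination Port": [80, 80, 80, 22, 22],
    "Flow Duration": [5, 5, 9, 3, 3],
    "Label": ["BENIGN", "BENIGN", "BENIGN", "SSH-Patator", "SSH-Patator"],
})

# Class counts before and after removing exact duplicates.
before = df["Label"].value_counts()
after = df.drop_duplicates()["Label"].value_counts()

# Align both counts by class; a class removed entirely would show 0 after.
summary = pd.DataFrame({"before": before, "after": after}).fillna(0).astype(int)
summary["removed"] = summary["before"] - summary["after"]

print(summary)
```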


In [9]:
print("Final cleaned dataset shape:", data.shape)

Final cleaned dataset shape: (2520798, 79)


## Cleaning Summary

The following steps were applied:
- Normalized column names (stripped leading/trailing whitespace)
- Replaced infinite values with NaN
- Removed rows with missing values (2,867 rows)
- Removed exact duplicate records (307,078 rows)

No feature engineering or scaling was performed.
Rare events were preserved wherever possible.
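
For reuse, the four steps could be collected into a single function, sketched below. The name `clean_flows` is hypothetical, and the final index reset is an addition that the cells above do not perform. The toy input is illustrative only.

```python
import numpy as np
import pandas as pd


def clean_flows(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps in order and return a new frame."""
    out = df.copy()
    out.columns = out.columns.str.strip()          # normalize column names
    out = out.replace([np.inf, -np.inf], np.nan)   # infinities become NaN
    out = out.dropna()                             # drop rows with missing values
    out = out.drop_duplicates()                    # drop exact duplicates
    # Not part of the original cells: make the index contiguous again.
    return out.reset_index(drop=True)


# Toy input: one infinite value, one missing value, one exact duplicate.
toy = pd.DataFrame({
    " Flow Bytes/s": [1.0, np.inf, 1.0, np.nan],
    " Label": ["BENIGN", "BENIGN", "BENIGN", "DDoS"],
})
result = clean_flows(toy)
print(result)
```

Working on a copy leaves the input frame untouched, which makes before/after comparisons straightforward, at the cost of extra memory on a dataset of this size.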
