<a href="https://colab.research.google.com/github/funway/nid-imbalance-study/blob/main/preprocessing/pre_process_separated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSE-CIC-IDS2018 Dataset Preprocessing

## Dataset Download
Use the AWS CLI to download the dataset from cloud storage: `aws s3 sync --no-sign-request --region us-east-2 "s3://cse-cic-ids2018/Processed Traffic Data for ML Algorithms/" ./`

## Dataset Contents
### Features
Most files have 80 columns (this count includes `Label` and `Timestamp`).
```
['ACK Flag Cnt', 'Active Max', 'Active Mean', 'Active Min', 'Active Std', 'Bwd Blk Rate Avg', 'Bwd Byts/b Avg', 'Bwd Header Len', 'Bwd IAT Max', 'Bwd IAT Mean', 'Bwd IAT Min', 'Bwd IAT Std', 'Bwd IAT Tot', 'Bwd PSH Flags', 'Bwd Pkt Len Max', 'Bwd Pkt Len Mean', 'Bwd Pkt Len Min', 'Bwd Pkt Len Std', 'Bwd Pkts/b Avg', 'Bwd Pkts/s', 'Bwd Seg Size Avg', 'Bwd URG Flags', 'CWE Flag Count', 'Down/Up Ratio', 'Dst Port', 'ECE Flag Cnt', 'FIN Flag Cnt', 'Flow Byts/s', 'Flow Duration', 'Flow IAT Max', 'Flow IAT Mean', 'Flow IAT Min', 'Flow IAT Std', 'Flow Pkts/s', 'Fwd Act Data Pkts', 'Fwd Blk Rate Avg', 'Fwd Byts/b Avg', 'Fwd Header Len', 'Fwd IAT Max', 'Fwd IAT Mean', 'Fwd IAT Min', 'Fwd IAT Std', 'Fwd IAT Tot', 'Fwd PSH Flags', 'Fwd Pkt Len Max', 'Fwd Pkt Len Mean', 'Fwd Pkt Len Min', 'Fwd Pkt Len Std', 'Fwd Pkts/b Avg', 'Fwd Pkts/s', 'Fwd Seg Size Avg', 'Fwd Seg Size Min', 'Fwd URG Flags', 'Idle Max', 'Idle Mean', 'Idle Min', 'Idle Std', 'Init Bwd Win Byts', 'Init Fwd Win Byts', 'Label', 'PSH Flag Cnt', 'Pkt Len Max', 'Pkt Len Mean', 'Pkt Len Min', 'Pkt Len Std', 'Pkt Len Var', 'Pkt Size Avg', 'Protocol', 'RST Flag Cnt', 'SYN Flag Cnt', 'Subflow Bwd Byts', 'Subflow Bwd Pkts', 'Subflow Fwd Byts', 'Subflow Fwd Pkts', 'Timestamp', 'Tot Bwd Pkts', 'Tot Fwd Pkts', 'TotLen Bwd Pkts', 'TotLen Fwd Pkts', 'URG Flag Cnt']
```
Only the Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv file has four additional features: `['Dst IP', 'Src Port', 'Flow ID', 'Src IP']`.
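
One way to check which files carry extra columns without loading them fully is to read only the header row. The following is a minimal sketch, not part of the original notebook: `extra_columns` is a hypothetical helper, and the two in-memory CSVs are made-up stand-ins for the real files.

```python
import io
import pandas as pd

def extra_columns(csv_sources, reference_columns):
    """Return {name: sorted list of columns that are not in reference_columns}."""
    reference = set(reference_columns)
    extras = {}
    for name, source in csv_sources.items():
        # nrows=0 parses only the header row, so large files stay cheap to inspect
        header = pd.read_csv(source, nrows=0).columns
        extras[name] = sorted(set(header) - reference)
    return extras

# Hypothetical stand-ins for two CSV files with different headers
sources = {
    'regular.csv': io.StringIO('Dst Port,Protocol,Label\n80,6,Benign\n'),
    'with_ids.csv': io.StringIO('Flow ID,Src IP,Dst Port,Protocol,Label\nx,1.2.3.4,80,6,Benign\n'),
}
result = extra_columns(sources, ['Dst Port', 'Protocol', 'Label'])
print(result)
```

With real data, the dictionary values could be replaced by file paths, since `pd.read_csv` accepts both.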

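### String Features
Only the `Label` feature is a categorical string; it indicates which type of attack (or benign traffic) the row represents.
The remaining features are numeric (`Timestamp` is a date string, but it is dropped during preprocessing).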

### Label Values
['Benign', 'Bot', 'Brute Force -Web', 'Brute Force -XSS', 'DDOS attack-HOIC',
 'DDOS attack-LOIC-UDP', 'DDoS attacks-LOIC-HTTP', 'DoS attacks-GoldenEye',
 'DoS attacks-Hulk', 'DoS attacks-SlowHTTPTest', 'DoS attacks-Slowloris',
 'FTP-BruteForce', 'Infilteration', 'SQL Injection', 'SSH-Bruteforce']
 15 values in total.


## Google Colab Env

In [None]:
### Mount Google Drive ###

import os
from google.colab import drive

if not os.path.exists('/content/drive/MyDrive'):
    # Colab runs in a virtual machine; /content is the default working directory
    drive.mount('/content/drive')

# List the datasets directory
!ls -thl /content/drive/MyDrive/NYIT/870/datasets/CSE-CIC-IDS2018

ls: cannot access '/content/drive/MyDrive/NYIT/870/datasets/CSE-CIC-IDS2018': No such file or directory


## Modules import & Globals setup

In [None]:
### Modules ###

from pathlib import Path
from datetime import datetime
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler


### Globals ###

## Dataset directories
dataset = 'CSE-CIC-IDS2018'
dataset_folder = f'/content/drive/MyDrive/NYIT/870/datasets/original/{dataset}/'
preprocessed_folder = f'/content/drive/MyDrive/NYIT/870/datasets/preprocessed/{dataset}/'
balanced_folder = f'/content/drive/MyDrive/NYIT/870/datasets/balanced/{dataset}/'

## CSV file matching
# Change this glob pattern to match a single file only
# csv_reg = '*01-03-2018*.csv'
csv_reg = '*.csv'
csv_files = list(Path(dataset_folder).rglob(csv_reg))
for csv in csv_files:
    print(f'csv file: {csv}')

## CSV file to process
csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Friday-02-03-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Friday-16-02-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Friday-23-02-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv'
# csv_file = '/content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv'

## Feature columns that are not useful
cols_to_drop = ['Flow ID', 'Src IP', 'Dst IP', 'Src Port', 'Timestamp']

## All possible values of the Label column
# unique_labels = []
unique_labels = ['Benign', 'Bot', 'Brute Force -Web', 'Brute Force -XSS', 'DDOS attack-HOIC', 'DDOS attack-LOIC-UDP', 'DDoS attacks-LOIC-HTTP', 'DoS attacks-GoldenEye', 'DoS attacks-Hulk', 'DoS attacks-SlowHTTPTest', 'DoS attacks-Slowloris', 'FTP-BruteForce', 'Infilteration', 'SQL Injection', 'SSH-Bruteforce']

## Choose the feature scaling methods (standardization, normalization)
# Supported: standard, minmax, robust, l1pstandard, l1pminmax
# scaling_method = 'standard'
# scaling_method = 'minmax'
scaling_methods = ['l1pminmax', 'l1pstandard', 'robust']


csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv
csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv
csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv
csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Friday-02-03-2018_TrafficForML_CICFlowMeter.csv
csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Friday-23-02-2018_TrafficForML_CICFlowMeter.csv
csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv
csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv
csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Thursday-01-03-2018_TrafficForML_CICFlowMete

## All Unique Values of the Label Column

In [None]:
## Collect all possible values of the Label column ##

# unique_labels = []
if not unique_labels:
    all_labels_count = {}  # label counts accumulated over all CSV files

    # Iterate over every CSV file
    csv_files = list(Path(dataset_folder).rglob('*.csv'))
    for csv in csv_files:
        print(f'Reading csv file: {Path(csv).name}')

        csv_labels_count = {}  # label counts for the current CSV file

        # Read in chunks to limit memory use
        chunk_size = 100000
        for chunk in pd.read_csv(csv, usecols=['Label'], chunksize=chunk_size):
            # Count the labels in this chunk, e.g. {'Benign': 100, 'Bot': 99}
            chunk_labels_count = chunk['Label'].value_counts().to_dict()

            for label, count in chunk_labels_count.items():
                # Update the counts for the current CSV file
                csv_labels_count[label] = csv_labels_count.get(label, 0) + count

                # Update the counts across all CSV files
                all_labels_count[label] = all_labels_count.get(label, 0) + count

        # Print the unique labels of the current CSV file
        print(f'  unique labels: [{len(csv_labels_count)}], {dict(sorted(csv_labels_count.items()))}\n')

    # Print the unique labels across all files
    print(f'All unique labels count: [{len(all_labels_count)}] \n{dict(sorted(all_labels_count.items()))}\n')

    # Remove the stray 'Label' entry (repeated header rows) if present
    all_labels_count.pop('Label', None)

    # Convert to a sorted list
    unique_labels = sorted(all_labels_count.keys())
else:
    print('unique_labels has been set')

# Print all unique Label values
print(f"[{datetime.now().strftime('%x %X')}] All unique labels list: [{len(unique_labels)}] (removed 'Label')\n{unique_labels}")

unique_labels has been set
[04/20/25 03:59:05] All unique labels list: [15] (removed 'Label')
['Benign', 'Bot', 'Brute Force -Web', 'Brute Force -XSS', 'DDOS attack-HOIC', 'DDOS attack-LOIC-UDP', 'DDoS attacks-LOIC-HTTP', 'DoS attacks-GoldenEye', 'DoS attacks-Hulk', 'DoS attacks-SlowHTTPTest', 'DoS attacks-Slowloris', 'FTP-BruteForce', 'Infilteration', 'SQL Injection', 'SSH-Bruteforce']


### Unique Labels Output:
```
Reading csv file: Friday-02-03-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [2], {'Benign': 762384, 'Bot': 286191}
Reading csv file: Friday-16-02-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [4], {'Benign': 446772, 'DoS attacks-Hulk': 461912, 'DoS attacks-SlowHTTPTest': 139890, 'Label': 1}
Reading csv file: Friday-23-02-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [4], {'Benign': 1048009, 'Brute Force -Web': 362, 'Brute Force -XSS': 151, 'SQL Injection': 53}
Reading csv file: Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [2], {'Benign': 7372557, 'DDoS attacks-LOIC-HTTP': 576191}
Reading csv file: Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [3], {'Benign': 238037, 'Infilteration': 93063, 'Label': 25}
Reading csv file: Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [3], {'Benign': 996077, 'DoS attacks-GoldenEye': 41508, 'DoS attacks-Slowloris': 10990}
Reading csv file: Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [4], {'Benign': 1048213, 'Brute Force -Web': 249, 'Brute Force -XSS': 79, 'SQL Injection': 34}
Reading csv file: Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [3], {'Benign': 667626, 'FTP-BruteForce': 193360, 'SSH-Bruteforce': 187589}
Reading csv file: Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [3], {'Benign': 360833, 'DDOS attack-HOIC': 686012, 'DDOS attack-LOIC-UDP': 1730}
Reading csv file: Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv
  unique labels: [3], {'Benign': 544200, 'Infilteration': 68871, 'Label': 33}

All unique labels count: [16]
{'Benign': 13484708, 'Bot': 286191, 'Brute Force -Web': 611, 'Brute Force -XSS': 230, 'DDOS attack-HOIC': 686012, 'DDOS attack-LOIC-UDP': 1730, 'DDoS attacks-LOIC-HTTP': 576191, 'DoS attacks-GoldenEye': 41508, 'DoS attacks-Hulk': 461912, 'DoS attacks-SlowHTTPTest': 139890, 'DoS attacks-Slowloris': 10990, 'FTP-BruteForce': 193360, 'Infilteration': 161934, 'Label': 59, 'SQL Injection': 87, 'SSH-Bruteforce': 187589}

All unique labels list: [15] (removed 'Label')
['Benign', 'Bot', 'Brute Force -Web', 'Brute Force -XSS', 'DDOS attack-HOIC', 'DDOS attack-LOIC-UDP', 'DDoS attacks-LOIC-HTTP', 'DoS attacks-GoldenEye', 'DoS attacks-Hulk', 'DoS attacks-SlowHTTPTest', 'DoS attacks-Slowloris', 'FTP-BruteForce', 'Infilteration', 'SQL Injection', 'SSH-Bruteforce']
```
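
To put the imbalance in numbers, the counts printed above can be turned into class shares. This is a small illustrative sketch: the dictionary is copied from the output above (without the stray `'Label'` entry), and the variable names are hypothetical.

```python
# Label counts copied from the output above (stray 'Label' entry removed)
label_counts = {
    'Benign': 13484708, 'Bot': 286191, 'Brute Force -Web': 611,
    'Brute Force -XSS': 230, 'DDOS attack-HOIC': 686012,
    'DDOS attack-LOIC-UDP': 1730, 'DDoS attacks-LOIC-HTTP': 576191,
    'DoS attacks-GoldenEye': 41508, 'DoS attacks-Hulk': 461912,
    'DoS attacks-SlowHTTPTest': 139890, 'DoS attacks-Slowloris': 10990,
    'FTP-BruteForce': 193360, 'Infilteration': 161934,
    'SQL Injection': 87, 'SSH-Bruteforce': 187589,
}

total = sum(label_counts.values())
benign_share = label_counts['Benign'] / total

# The rarest class and how many Benign rows exist per row of that class
rarest = min(label_counts, key=label_counts.get)
imbalance_ratio = label_counts['Benign'] / label_counts[rarest]

print(f'Benign share: {benign_share:.1%}')
print(f'Rarest class: {rarest}, Benign-to-rarest ratio: {imbalance_ratio:,.0f}')
```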
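### Questions❓
* Benign traffic clearly dominates the dataset.
* By contrast, `Brute Force -Web` (611), `Brute Force -XSS` (230), `DDOS attack-LOIC-UDP` (1730), and `SQL Injection` (87) have very few samples.

So:
1. When handling the class imbalance, can these rare labels be augmented specifically?
2. For the 3 GB file, can we extract all non-Benign rows and then add only enough Benign rows to reach the roughly 300 MB average file size? A file that large crashes Colab.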
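
The second idea could be sketched as a chunked pass that keeps every attack row and only a random fraction of the Benign rows. This is an assumption-laden illustration rather than the notebook's method: `downsample_benign` and its `benign_fraction` parameter are hypothetical, the fraction is fixed instead of being derived from a 300 MB target, and the input is a synthetic in-memory CSV.

```python
import io
import numpy as np
import pandas as pd

def downsample_benign(source, benign_fraction, chunk_size=100_000, seed=0):
    """Stream a CSV in chunks, keeping every non-Benign row and roughly
    `benign_fraction` of the Benign rows."""
    rng = np.random.default_rng(seed)
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunk_size):
        is_benign = (chunk['Label'] == 'Benign').to_numpy()
        # Each Benign row survives independently with probability benign_fraction
        keep = ~is_benign | (rng.random(len(chunk)) < benign_fraction)
        kept.append(chunk[keep])
    return pd.concat(kept, ignore_index=True)

# Synthetic stand-in for a large file: 1000 Benign rows and 50 attack rows
rows = ['Flow Duration,Label']
rows += [f'{i},Benign' for i in range(1000)]
rows += [f'{i},DDoS attacks-LOIC-HTTP' for i in range(50)]
sample_csv = io.StringIO('\n'.join(rows))

reduced = downsample_benign(sample_csv, benign_fraction=0.1, chunk_size=200)
print(reduced['Label'].value_counts())
```

Because only one chunk is held in memory at a time (plus the kept rows), peak memory depends on the reduced size rather than on the size of the original file.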


## Load a CSV File

In [None]:
## Load the CSV into a pandas.DataFrame ##
print(f"[{datetime.now().strftime('%x %X')}] Loading csv file: {csv_file}")

## Read everything at once ##
nrows = None  # number of rows to read; None reads the whole file
df = pd.read_csv(csv_file, nrows=nrows)
####################

## Read in chunks ####
# chunk_size = 100000  # Adjust based on memory capacity
# df_list = []  # List to store chunks

# # Read CSV file in chunks
# for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
#     print(f"Processing chunk with shape: {chunk.shape}")
#     df_list.append(chunk)  # Store each chunk

# # Combine all chunks into a single DataFrame
# df = pd.concat(df_list, ignore_index=True)
####################

print(f'  Raw data has [{len(df.columns)}] feature columns: {sorted(df.columns.tolist())}')
print(f"  Label value counts: {df['Label'].value_counts()}")

# Drop rows whose Label value is the literal string 'Label'
# (a few files repeat the header row in the middle of the data)
if 'Label' in df.columns and ('Label' in df['Label'].values):
    df = df[df['Label'] != 'Label']
    print(f'[{datetime.now().strftime("%x %X")}] Dropped rows whose Label value is "Label"')
    print(f"  Label value counts: {df['Label'].value_counts()}")

print('\n===== DataFrame Info =====')
df.info()
print('===== DataFrame Info =====')

[04/20/25 03:59:05] Loading csv file: /content/drive/MyDrive/NYIT/870/datasets/original/CSE-CIC-IDS2018/Friday-02-03-2018_TrafficForML_CICFlowMeter.csv
  Raw data has [80] feature columns: ['ACK Flag Cnt', 'Active Max', 'Active Mean', 'Active Min', 'Active Std', 'Bwd Blk Rate Avg', 'Bwd Byts/b Avg', 'Bwd Header Len', 'Bwd IAT Max', 'Bwd IAT Mean', 'Bwd IAT Min', 'Bwd IAT Std', 'Bwd IAT Tot', 'Bwd PSH Flags', 'Bwd Pkt Len Max', 'Bwd Pkt Len Mean', 'Bwd Pkt Len Min', 'Bwd Pkt Len Std', 'Bwd Pkts/b Avg', 'Bwd Pkts/s', 'Bwd Seg Size Avg', 'Bwd URG Flags', 'CWE Flag Count', 'Down/Up Ratio', 'Dst Port', 'ECE Flag Cnt', 'FIN Flag Cnt', 'Flow Byts/s', 'Flow Duration', 'Flow IAT Max', 'Flow IAT Mean', 'Flow IAT Min', 'Flow IAT Std', 'Flow Pkts/s', 'Fwd Act Data Pkts', 'Fwd Blk Rate Avg', 'Fwd Byts/b Avg', 'Fwd Header Len', 'Fwd IAT Max', 'Fwd IAT Mean', 'Fwd IAT Min', 'Fwd IAT Std', 'Fwd IAT Tot', 'Fwd PSH Flags', 'Fwd Pkt Len Max', 'Fwd Pkt Len Mean', 'Fwd Pkt Len Min', 'Fwd Pkt Len Std', 'Fwd Pkts/b Avg', 'Fwd P
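
The repeated-header cleanup in the cell above can be reproduced on a tiny in-memory CSV. This sketch uses made-up rows and hypothetical variable names; it shows why a stray header row turns numeric columns into text and how filtering on `Label` restores them.

```python
import io
import pandas as pd

raw = io.StringIO(
    'Dst Port,Flow Duration,Label\n'
    '80,1200,Benign\n'
    'Dst Port,Flow Duration,Label\n'  # header row repeated mid-file
    '443,3400,Bot\n'
)
toy = pd.read_csv(raw)

# The stray header row forces the numeric columns to a non-numeric dtype
numeric_before = pd.api.types.is_numeric_dtype(toy['Dst Port'])

# Drop the repeated header row, then convert the column back to numbers
toy = toy[toy['Label'] != 'Label'].reset_index(drop=True)
toy['Dst Port'] = pd.to_numeric(toy['Dst Port'])
numeric_after = pd.api.types.is_numeric_dtype(toy['Dst Port'])

print(numeric_before, numeric_after)
print(toy)
```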

## Drop some columns

In [None]:
## Drop feature columns that are not useful ##
cols_to_drop_exist = [col for col in cols_to_drop if col in df.columns]
df = df.drop(cols_to_drop_exist, axis=1)  # axis=1 drops columns

print('\n===== DataFrame Info =====')
df.info()
print('===== DataFrame Info =====')


===== DataFrame Info =====
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 79 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Dst Port           1048575 non-null  int64  
 1   Protocol           1048575 non-null  int64  
 2   Flow Duration      1048575 non-null  int64  
 3   Tot Fwd Pkts       1048575 non-null  int64  
 4   Tot Bwd Pkts       1048575 non-null  int64  
 5   TotLen Fwd Pkts    1048575 non-null  int64  
 6   TotLen Bwd Pkts    1048575 non-null  float64
 7   Fwd Pkt Len Max    1048575 non-null  int64  
 8   Fwd Pkt Len Min    1048575 non-null  int64  
 9   Fwd Pkt Len Mean   1048575 non-null  float64
 10  Fwd Pkt Len Std    1048575 non-null  float64
 11  Bwd Pkt Len Max    1048575 non-null  int64  
 12  Bwd Pkt Len Min    1048575 non-null  int64  
 13  Bwd Pkt Len Mean   1048575 non-null  float64
 14  Bwd Pkt Len Std    1048575 non-null  float64
 15  Flow

## Handle Inf & NaN

In [None]:
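## Make sure numeric columns have a numeric dtype rather than object ##

# Select the numeric feature columns, i.e. everything except 'Label'
numeric_features = df.drop(columns=['Label'])

# Convert them to numeric; values that cannot be parsed become NaN
df[numeric_features.columns] = numeric_features.apply(pd.to_numeric, errors='coerce')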

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 79 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Dst Port           1048575 non-null  int64  
 1   Protocol           1048575 non-null  int64  
 2   Flow Duration      1048575 non-null  int64  
 3   Tot Fwd Pkts       1048575 non-null  int64  
 4   Tot Bwd Pkts       1048575 non-null  int64  
 5   TotLen Fwd Pkts    1048575 non-null  int64  
 6   TotLen Bwd Pkts    1048575 non-null  float64
 7   Fwd Pkt Len Max    1048575 non-null  int64  
 8   Fwd Pkt Len Min    1048575 non-null  int64  
 9   Fwd Pkt Len Mean   1048575 non-null  float64
 10  Fwd Pkt Len Std    1048575 non-null  float64
 11  Bwd Pkt Len Max    1048575 non-null  int64  
 12  Bwd Pkt Len Min    1048575 non-null  int64  
 13  Bwd Pkt Len Mean   1048575 non-null  float64
 14  Bwd Pkt Len Std    1048575 non-null  float64
 15  Flow Byts/s        1046017 non-n
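
The effect of `errors='coerce'` is easiest to see on a toy frame. The values below are invented for illustration; the point is that unparseable strings become NaN instead of raising, which is why a NaN cleanup step is still needed afterwards.

```python
import pandas as pd

# Hypothetical raw values as they might look when a column was read as text
features = pd.DataFrame({
    'Flow Byts/s': ['1500.5', 'oops', '42'],
    'Tot Fwd Pkts': ['3', '7', '2'],
})

converted = features.apply(pd.to_numeric, errors='coerce')

print(converted.dtypes)
print(converted)
```

The unparseable `'oops'` entry ends up as NaN in `Flow Byts/s`, while `Tot Fwd Pkts` converts cleanly to integers.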

In [None]:
## Handle Inf values ##
print("Positive infinity (+Inf) count:", (df == np.inf).sum().sum())
print("Negative infinity (-Inf) count:", (df == -np.inf).sum().sum())

# Option 1: drop the affected rows
df = df[~df.isin([np.inf, -np.inf]).any(axis=1)]
df.reset_index(drop=True, inplace=True)

# Option 2: replace with the column's max/min value
# max_value = df.replace([np.inf, -np.inf], np.nan).max()
# min_value = df.replace([np.inf, -np.inf], np.nan).min()
# df.replace(np.inf, max_value, inplace=True)
# df.replace(-np.inf, min_value, inplace=True)

# Option 3: replace with NaN
# df = df.replace([np.inf, -np.inf], np.nan)

print(df.shape)
print("Positive infinity (+Inf) count:", (df == np.inf).sum().sum())
print("Negative infinity (-Inf) count:", (df == -np.inf).sum().sum())

Positive infinity (+Inf) count: 5542
Negative infinity (-Inf) count: 0
(1044525, 79)
Positive infinity (+Inf) count: 0
Negative infinity (-Inf) count: 0
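
Note that the log reports 5542 Inf values while the row count only drops from 1048575 to 1044525, that is by 4050 rows. One plausible explanation is that the count is per cell and some rows contain Inf in more than one column. The toy frame below (invented values, hypothetical names) illustrates the difference between counting cells and counting rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Flow Byts/s': [10.0, np.inf, np.nan, 5.0],
    'Flow Pkts/s': [1.0, np.inf, np.inf, 2.0],
    'Label': ['Benign', 'Bot', 'Bot', 'Benign'],
})

numeric = toy.select_dtypes(include='number')
is_inf = np.isinf(numeric)

inf_cells = int(is_inf.sum().sum())       # counts individual Inf cells
inf_rows = int(is_inf.any(axis=1).sum())  # counts rows with at least one Inf

# Dropping the affected rows also removes the NaN that shared a row with an Inf
cleaned = toy[~is_inf.any(axis=1)].reset_index(drop=True)
remaining_nan = int(cleaned.isna().sum().sum())

print(inf_cells, inf_rows, cleaned.shape, remaining_nan)
```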


In [None]:
## Handle NaN values ##
print(f'NaN count: {df.isna().sum().sum()}')

# Option 1: drop the affected rows
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)

# Option 2: fill the missing values
# df['Label'] = df['Label'].fillna('Benign')  # fill the Label column
# df = df.fillna(0)  # fill the other columns

print(df.shape)

NaN count: 0
(1044525, 79)


## Scaling of Numerical Features

In [None]:
## Scale the numeric features (standardization or normalization) ##

def scaling(X, scaling_method, output_file):
    """Scale X with the chosen method, save it as a float32 .npy file, and return it."""
    print(f"[{datetime.now().strftime('%x %X')}] 📊 Scaling method: {scaling_method}")

    if scaling_method == 'standard':
        # Standardize to zero mean and unit variance
        X_scaled = StandardScaler().fit_transform(X)
    elif scaling_method == 'minmax':
        # Normalize to the [0, 1] range
        X_scaled = MinMaxScaler().fit_transform(X)
    elif scaling_method == 'l1pstandard':
        # Replace -1 with -0.5 (log1p(-1) is -inf), compress extreme values
        # with log1p, then standardize
        X_log = np.log1p(X.replace(-1, -0.5))
        X_scaled = StandardScaler().fit_transform(X_log)
    elif scaling_method == 'l1pminmax':
        # Same -1 replacement and log1p step, followed by min-max normalization
        X_log = np.log1p(X.replace(-1, -0.5))
        X_scaled = MinMaxScaler().fit_transform(X_log)
    elif scaling_method == 'robust':
        # Center on the median and scale by the interquartile range
        X_scaled = RobustScaler().fit_transform(X)
    else:
        raise ValueError(f'Unknown scaling method: {scaling_method}')

    # Use float32 to reduce the file size
    X_scaled = X_scaled.astype(np.float32)

    # Make sure the output directory exists before saving
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    np.save(output_file, X_scaled)  # save as a .npy file
    print(f"[{datetime.now().strftime('%x %X')}] ✅ Saved to {output_file}\n")
    return X_scaled

df_numeric = df.drop(columns=['Label'])  # drop the Label column, keep the numeric columns
filename = Path(csv_file).stem  # original file name without the extension

for scaling_method in scaling_methods:
    output_file = Path(preprocessed_folder) / f'separated/{filename}_X_{scaling_method}.npy'
    X_scaled = scaling(df_numeric, scaling_method, output_file)


[04/20/25 03:59:31] 📊 Scaling method: l1pminmax
[04/20/25 03:59:35] ✅ Saved to /content/drive/MyDrive/NYIT/870/datasets/preprocessed/CSE-CIC-IDS2018/separated/Friday-02-03-2018_TrafficForML_CICFlowMeter_X_l1pminmax.npy

[04/20/25 03:59:35] 📊 Scaling method: l1pstandard
[04/20/25 03:59:40] ✅ Saved to /content/drive/MyDrive/NYIT/870/datasets/preprocessed/CSE-CIC-IDS2018/separated/Friday-02-03-2018_TrafficForML_CICFlowMeter_X_l1pstandard.npy

[04/20/25 03:59:40] 📊 Scaling method: robust
[04/20/25 03:59:46] ✅ Saved to /content/drive/MyDrive/NYIT/870/datasets/preprocessed/CSE-CIC-IDS2018/separated/Friday-02-03-2018_TrafficForML_CICFlowMeter_X_robust.npy
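
The `l1pminmax` path replaces -1 with -0.5 before applying `log1p`, because `log1p(-1)` evaluates to negative infinity. The sketch below walks through that step on invented values; it assumes -1 is the only negative value present, and it computes the min-max step by hand (equivalent in intent to `MinMaxScaler` with its default `[0, 1]` range) so that it needs only NumPy and pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical feature values; -1 stands in for a "not available" sentinel
X = pd.DataFrame({
    'Init Fwd Win Byts': [-1.0, 0.0, 8192.0, 65535.0],
    'Flow Duration': [10.0, 1_000.0, 100_000.0, 100_000_000.0],
})

# Without the replacement, log1p(-1) is -inf
with np.errstate(divide='ignore'):
    raw_log = np.log1p(-1.0)

# Replace the sentinel, then compress the heavy right tail with log1p
X_log = np.log1p(X.replace(-1, -0.5))

# Min-max normalization to [0, 1], column by column
X_scaled = (X_log - X_log.min()) / (X_log.max() - X_log.min())

print(raw_log)
print(X_scaled)
```

After the log transform, values spanning several orders of magnitude end up spread across the `[0, 1]` range instead of being squeezed near zero by a single large outlier.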



## Label Encoding (Numericalization & One-hot)

In [None]:
print(f'unique labels: {unique_labels}\n')


## Encode the Label column as integers ##
print(f"[{datetime.now().strftime('%x %X')}] Numericalization Encoding...")

label_mapping = {label: idx for idx, label in enumerate(unique_labels)}
for key, value in label_mapping.items():
    print(f'{value:2}: {key}')

def encode_label(label):
    if label in label_mapping:
        return label_mapping[label]
    else:
        raise ValueError(f"Unknown label '{label}' encountered during encoding.")

df['Label_encoded'] = df['Label'].apply(encode_label)
print(df['Label_encoded'].shape)

output_file = Path(preprocessed_folder) / f'separated/{filename}_label.npy'
np.save(output_file, df['Label_encoded'].to_numpy())
print(f"[{datetime.now().strftime('%x %X')}] ✅ Saved to {output_file}")

unique labels: ['Benign', 'Bot', 'Brute Force -Web', 'Brute Force -XSS', 'DDOS attack-HOIC', 'DDOS attack-LOIC-UDP', 'DDoS attacks-LOIC-HTTP', 'DoS attacks-GoldenEye', 'DoS attacks-Hulk', 'DoS attacks-SlowHTTPTest', 'DoS attacks-Slowloris', 'FTP-BruteForce', 'Infilteration', 'SQL Injection', 'SSH-Bruteforce']

[04/20/25 03:59:46] Numericalization Encoding...
 0: Benign
 1: Bot
 2: Brute Force -Web
 3: Brute Force -XSS
 4: DDOS attack-HOIC
 5: DDOS attack-LOIC-UDP
 6: DDoS attacks-LOIC-HTTP
 7: DoS attacks-GoldenEye
 8: DoS attacks-Hulk
 9: DoS attacks-SlowHTTPTest
10: DoS attacks-Slowloris
11: FTP-BruteForce
12: Infilteration
13: SQL Injection
14: SSH-Bruteforce
(1044525,)
[04/20/25 03:59:47] ✅ Saved to /content/drive/MyDrive/NYIT/870/datasets/preprocessed/CSE-CIC-IDS2018/separated/Friday-02-03-2018_TrafficForML_CICFlowMeter_label.npy


In [None]:
## One-hot encode the Label column ##
print(f"[{datetime.now().strftime('%x %X')}] One-hot Encoding...")

do_onehot = False
if do_onehot:
    # Mapping from each label to its one-hot vector
    label_mapping = {lbl: [1 if lbl == label else 0 for label in unique_labels] for lbl in unique_labels}
    for key, value in label_mapping.items():
        print(f'{value}: {key}')

    # One-hot encoding
    df_onehot = pd.DataFrame(df['Label'].map(label_mapping).to_list(), columns=unique_labels)
    print(df_onehot.shape)

    # Save as a .npy file
    # output_file = Path(preprocessed_folder) / f'separated/{filename}_label_onehot.npy'
    # np.save(output_file, df_onehot.to_numpy())

    # Convert to a sparse matrix and save as a .npz file
    import scipy.sparse
    onehot_sparse = scipy.sparse.csr_matrix(df_onehot.to_numpy())
    output_file = Path(preprocessed_folder) / f'separated/{filename}_label_onehot_sparse.npz'
    scipy.sparse.save_npz(output_file, onehot_sparse)

    # Load the sparse matrix
    # labels_sparse = scipy.sparse.load_npz(output_file)
    # labels_onehot = labels_sparse.toarray()
    # print(labels_onehot.shape)

    print(f'\nSaved to {output_file}')

else:
    print(f"[{datetime.now().strftime('%x %X')}] Skipped! The one-hot encoding of Label can be derived directly from the integer encoding, so there is no need to precompute it.")

[04/20/25 03:59:47] One-hot Encoding...
[04/20/25 03:59:47] Skipped! The one-hot encoding of Label can be derived directly from the integer encoding, so there is no need to precompute it.
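
As the skip message notes, a one-hot matrix can be derived from the integer labels whenever it is needed. A minimal sketch with NumPy follows; the label array is made up, and `num_classes = 15` matches the length of `unique_labels` in this notebook.

```python
import numpy as np

num_classes = 15              # length of unique_labels
y = np.array([0, 1, 1, 14])   # hypothetical integer-encoded labels

# Row i of the identity matrix is the one-hot vector for class i
onehot = np.eye(num_classes, dtype=np.float32)[y]

# argmax recovers the integer encoding from the one-hot rows
restored = onehot.argmax(axis=1)

print(onehot.shape)
print(restored)
```

Storing only the integer labels keeps the saved file small, and the indexing step above is cheap enough to run at training time.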
