<a href="https://colab.research.google.com/github/funway/nid-imbalance-study/blob/main/preprocessing/features_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

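# Feature Selection (AKA Feature Dimensionality Reduction)
* From the original N feature columns, select and keep the n "more important" features.
* This reduces the computational cost of downstream processing and can improve model performance.
* ***Optional***: the reduced data can be compared against the original data without dimensionality reduction.
* ❓ However, the three methods below select very different feature sets...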
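
To put a number on how much the three methods disagree, one option is to compare the selected index sets directly. The sketch below copies the three index lists printed by the methods in this notebook and reports pairwise overlap and Jaccard similarity. The helper name `jaccard` and the `selected` dictionary are introduced here for illustration only.

```python
from itertools import combinations

# Selected feature indices, copied from the outputs printed in this notebook
selected = {
    'rf':   [22, 20, 57, 6, 2, 38, 9, 41, 46, 5, 52, 17, 4, 15, 54, 36, 7, 47, 34, 56,
             3, 16, 33, 30, 58, 35, 42, 59, 61, 0],
    'shap': [21, 14, 19, 9, 17, 50, 27, 67, 16, 46, 2, 64, 36, 35, 37, 6, 25, 33, 5, 3,
             42, 63, 47, 32, 30, 20, 59, 58, 61, 0],
    'mi':   [39, 41, 40, 51, 64, 7, 65, 10, 8, 37, 9, 52, 13, 53, 38, 62, 33, 54, 3, 36,
             50, 34, 59, 56, 58, 4, 46, 45, 0, 61],
}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two index collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


overlap = {}
for m1, m2 in combinations(selected, 2):
    shared = set(selected[m1]) & set(selected[m2])
    overlap[(m1, m2)] = len(shared)
    print(f'{m1} vs {m2}: {len(shared)} shared, '
          f'Jaccard = {jaccard(selected[m1], selected[m2]):.2f}')

# Features that all three methods agree on
common = sorted(set.intersection(*(set(v) for v in selected.values())))
print(f'Selected by all three methods ({len(common)}): {common}')
```

As a reference point, two independent random picks of 30 out of 70 features would share 30 × 30 / 70 ≈ 12.9 indices on average, so an overlap near that value is roughly what chance alone would give.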

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
### Modules ###
from pathlib import Path
import numpy as np


### Globals ###

## Data files
dataset = 'CSE-CIC-IDS2018'
dataset_folder = Path(f'/content/drive/MyDrive/NYIT/870/datasets/original/{dataset}/')
preprocessed_folder = Path(f'/content/drive/MyDrive/NYIT/870/datasets/preprocessed/{dataset}/')

features_file = preprocessed_folder / 'integrated/trimed_X_standard.npy'
label_file = preprocessed_folder / 'integrated/trimed_label_standard.npy'

# Load the data
X = np.load(features_file)
y = np.load(label_file)
print(f'X shape: {X.shape}')
print(f'y shape: {y.shape}')

# Number of features to keep
n_features = 30

X shape: (4746934, 70)
y shape: (4746934,)
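
Because some later steps in this notebook run close to the Colab memory limit, memory-mapping the feature file instead of loading it eagerly may be worth trying. This is a minimal sketch of `np.load(..., mmap_mode='r')` on a small temporary file. The file name `demo_X.npy` and the toy array are made up for the demo, and whether this helps on a mounted Drive folder is untested here.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'demo_X.npy'  # hypothetical stand-in for the real feature file
    np.save(demo_file, np.arange(40, dtype=np.float32).reshape(10, 4))

    # mmap_mode='r' maps the file read-only; rows are read from disk when accessed
    X_mm = np.load(demo_file, mmap_mode='r')
    print(type(X_mm).__name__, X_mm.shape, X_mm.dtype)

    # Copying a slice gives an ordinary in-memory array for one batch
    batch = np.array(X_mm[2:5])
    print(batch.shape)

    del X_mm  # release the mapping before the temporary folder is removed
```

Models that need the whole array at once (such as `rf.fit(X, y)`) will still pull everything into memory, so this mainly helps the batched steps.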


## Method 1: RandomForest

* sklearn has no GPU acceleration, so training the RF model is very slow... 😫 about 80 minutes...


Feature indices selected by RF (30 total): [22 20 57  6  2 38  9 41 46  5 52 17  4 15 54 36  7 47 34 56  3 16 33 30
 58 35 42 59 61  0]

In [None]:
from sklearn.ensemble import RandomForestClassifier

method = 'rf'

# n_estimators sets the number of decision trees; random_state is the random seed
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

rf_feature_importance = rf.feature_importances_
print(f'RF importance of each feature: {rf_feature_importance}')

# Select the n_features most important features (np.argsort sorts ascending, so the largest come last)
rf_selected_features = np.argsort(rf_feature_importance)[-n_features:]
print(f'Feature indices selected by RF ({n_features} total): {rf_selected_features}')

X_selected = X[:, rf_selected_features]
print(f'X_selected shape: {X_selected.shape}')

# Save as a .npy file
output_file = features_file.parent / 'features_selected' / f'{features_file.stem}_{method}.npy'
output_file.parent.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
np.save(output_file, X_selected)
print(f'Saved to {output_file}')

RF importance of each feature: [0.06257114 0.00980145 0.01447311 0.02618789 0.01854108 0.0160365
 0.01418659 0.02421358 0.00666681 0.01526107 0.00345572 0.00575456
 0.0037087  0.00729465 0.0048733  0.02037794 0.02663771 0.01710525
 0.00875671 0.00961826 0.01313671 0.00848918 0.01214772 0.00959306
 0.01016774 0.0091588  0.00312852 0.00395448 0.00296763 0.00364512
 0.02749571 0.01096083 0.00250583 0.02737163 0.02538535 0.0315363
 0.02292193 0.00612436 0.01458359 0.01159652 0.0074784  0.01527984
 0.03485277 0.01130918 0.00256481 0.00724307 0.01592534 0.02508882
 0.00288324 0.00347549 0.0038423  0.01141619 0.01664762 0.00955277
 0.02072298 0.01182248 0.02542779 0.01361611 0.03063729 0.04292391
 0.00923159 0.05025387 0.00715961 0.00831657 0.00673282 0.01029202
 0.00383366 0.01176986 0.00506291 0.00224365]
Feature indices selected by RF (30 total): [22 20 57  6  2 38  9 41 46  5 52 17  4 15 54 36  7 47 34 56  3 16 33 30
 58 35 42 59 61  0]
X_selected shape: (4746934, 30)
Saved to /content/drive/MyDrive/NYIT/870/datasets/preproces
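
The `np.argsort(...)[-n_features:]` pattern returns the top indices in ascending order of importance, so the columns of `X_selected` end up permuted relative to `X`. A small sketch on made-up scores (the `importance` values and `X_toy` are toy data, not taken from the dataset) shows the three orderings one might want:

```python
import numpy as np

importance = np.array([0.06, 0.01, 0.30, 0.02, 0.25, 0.36])  # toy scores
k = 3

top_ascending = np.argsort(importance)[-k:]   # weakest of the top k first
top_descending = top_ascending[::-1]          # strongest first
top_column_order = np.sort(top_ascending)     # original column order preserved

X_toy = np.arange(12).reshape(2, 6)
print(X_toy[:, top_ascending])      # columns permuted by importance rank
print(X_toy[:, top_column_order])   # columns in their original left-to-right order
```

Whichever ordering is used, saving the index array next to the `.npy` file makes it possible to map the reduced columns back to the original feature names later.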

## Method 2: XGBoost + SHAP


* XGBoost supports GPU computation ⏩
* With a GPU, `model.fit()` takes only about two minutes
* Without a GPU, `model.fit()` takes more than 20 minutes (ಥ﹏ಥ)



Feature indices selected by SHAP (30 total): [21 14 19  9 17 50 27 67 16 46  2 64 36 35 37  6 25 33  5  3 42 63 47 32
 30 20 59 58 61  0]

In [None]:
import shap
import gc
import sys
import xgboost as xgb
import tensorflow as tf

print(f"Size of X: {sys.getsizeof(X) / (1024 ** 2):.2f} MB")

method = 'xgb'
model_file = preprocessed_folder / f'models/features_selection_{method}.json'

if not model_file.exists():
    # Check whether a GPU is available
    gpu_available = len(tf.config.list_physical_devices('GPU')) > 0
    print(f'GPU available: {gpu_available}')

    # Enable XGBoost GPU acceleration when a GPU is available
    device = 'cuda' if gpu_available else 'cpu'

    # Train the XGBoost model
    model = xgb.XGBClassifier(tree_method='hist', device=device, random_state=42)
    model.fit(X, y)
    print('XGBoost model training finished')

    # Save the model
    model_file.parent.mkdir(parents=True, exist_ok=True)  # make sure the models folder exists
    model.save_model(model_file)
    print(f'XGBoost model saved to {model_file}')
else:
    print(f'XGBoost model file exists, loading model {model_file}')
    # Load the model
    model = xgb.XGBClassifier()
    model.load_model(model_file)


Size of X: 1267.57 MB
GPU available: True
XGBoost model training finished
XGBoost model saved to /content/drive/MyDrive/NYIT/870/datasets/preprocessed/CSE-CIC-IDS2018/models/features_selection_xgb.json
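
Importing TensorFlow only to detect a GPU is fairly heavy. A lighter alternative, sketched below, is to ask `nvidia-smi -L` whether it lists any device. The helper name `cuda_gpu_visible` is hypothetical, and the approach assumes the NVIDIA driver tools are on the `PATH`, which may not hold on every runtime.

```python
import shutil
import subprocess


def cuda_gpu_visible():
    """Return True if `nvidia-smi -L` lists at least one GPU."""
    if shutil.which('nvidia-smi') is None:
        return False  # driver tools not installed, assume CPU only
    try:
        result = subprocess.run(['nvidia-smi', '-L'],
                                capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and 'GPU' in result.stdout


device = 'cuda' if cuda_gpu_visible() else 'cpu'
print(f'device: {device}')
```

The resulting `device` string can be passed to `xgb.XGBClassifier(device=...)` in the same way as in the cell above.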


In [None]:
method = 'shap'

# Explain the model with SHAP
explainer = shap.TreeExplainer(model)
# todo!
# explainer = shap.TreeExplainer(model, data=x_train.sample(1000))
# Some people write it this way, passing an extra data argument.
# How should data be drawn from X, though? Randomly, or stratified by the proportions in y? This looks like yet another thing to tune. 😓

# Compute SHAP values in batches
batch_size = 500000
n_batches = -(-len(X) // batch_size)  # ceiling division, correct even when len(X) is a multiple of batch_size
total_samples = 0
running_sum = np.zeros((X.shape[1], ))
print(f'running_sum.shape: {running_sum.shape}')


for i in range(0, len(X), batch_size):
    print(f'Processing batch {i // batch_size + 1}/{n_batches}')
    batch = X[i:i + batch_size]
    print(f'  batch.shape: {batch.shape}')

    shap_values_batch = explainer.shap_values(batch)
    print(f'  shap_values_batch.shape: {shap_values_batch.shape}')
    # The SHAP values are 3-D: (n_samples, n_features, n_classes),
    # which is why they take so much memory!!

    # Feature importance for this batch
    batch_feature_importance = np.abs(shap_values_batch).mean(axis=(0, 2))  # mean over the sample axis and the class axis
    # todo!
    # This is not quite right: both reductions are plain arithmetic means.
    # An arithmetic mean over the sample axis makes sense,
    # but an arithmetic mean over the class axis ignores the label imbalance across classes. How should that be handled? What weighting?

    # Weight by the number of samples in the batch
    running_sum += batch_feature_importance * batch.shape[0]
    total_samples += batch.shape[0]

    print(f'Batch {i // batch_size + 1}/{n_batches} done')

# Compute the overall feature importance
feature_importance = running_sum / total_samples
print(f'feature_importance {feature_importance.shape}: {feature_importance}')

# Select the top n features
selected_features = np.argsort(feature_importance)[-n_features:]
print(f'Feature indices selected by SHAP ({n_features} total): {selected_features}')

X_selected = X[:, selected_features]
print(f'X_selected shape: {X_selected.shape}')

# Save as a .npy file
output_file = features_file.parent / 'features_selected' / f'{features_file.stem}_{method}.npy'
output_file.parent.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
np.save(output_file, X_selected)
print(f'Saved to {output_file}')

running_sum.shape: (70,)


AttributeError: 'numpy.ndarray' object has no attribute 'value_counts'
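
On the open question of how to average over the class axis: one possible approach is to keep the per-class mean |SHAP| separate and combine it with explicit class weights. The sketch below is only an illustration on random toy data; `class_weighted_importance` is a hypothetical helper, and which weighting (uniform, class frequency, inverse frequency) suits an imbalanced intrusion dataset is a choice that still has to be validated.

```python
import numpy as np


def class_weighted_importance(shap_values, class_weights=None):
    """Reduce |SHAP| of shape (n_samples, n_features, n_classes) to one score per feature."""
    per_class = np.abs(shap_values).mean(axis=0)   # (n_features, n_classes)
    n_classes = per_class.shape[1]
    if class_weights is None:
        class_weights = np.ones(n_classes)         # uniform: plain mean over classes
    class_weights = np.asarray(class_weights, dtype=float)
    class_weights = class_weights / class_weights.sum()  # normalize so weights sum to 1
    return per_class @ class_weights               # (n_features,)


rng = np.random.default_rng(0)
toy_shap = rng.normal(size=(200, 5, 3))            # toy stand-in for shap_values_batch
toy_y = np.repeat([0, 1, 2], [180, 16, 4])         # toy imbalanced labels

freq = np.bincount(toy_y, minlength=3) / len(toy_y)
uniform = class_weighted_importance(toy_shap)
by_frequency = class_weighted_importance(toy_shap, freq)
by_inverse_frequency = class_weighted_importance(toy_shap, 1.0 / freq)

print(uniform, by_frequency, by_inverse_frequency, sep='\n')
```

With uniform weights this reproduces `mean(axis=(0, 2))`. In a batched loop, the same idea would mean accumulating `np.abs(shap_values_batch).sum(axis=0)` into an `(n_features, n_classes)` running sum and applying the class weights once at the end.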

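
For the other open question, how to draw a background sample for `shap.TreeExplainer(model, data=...)`, a stratified draw is one option: it keeps rare classes represented where a purely random draw of 1000 rows might miss them. A minimal NumPy sketch follows; the function name, the `min_per_class` parameter, and the toy labels are assumptions made for illustration.

```python
import numpy as np


def stratified_background_indices(y, n_background, min_per_class=1, seed=0):
    """Pick row indices roughly proportional to class frequency, keeping every class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    # Proportional quota per class, at least min_per_class, never more than available
    quotas = np.round(n_background * counts / counts.sum()).astype(int)
    quotas = np.minimum(np.maximum(quotas, min_per_class), counts)
    picked = [rng.choice(np.flatnonzero(y == c), size=q, replace=False)
              for c, q in zip(classes, quotas)]
    return np.sort(np.concatenate(picked))


y_toy = np.repeat([0, 1, 2], [950, 45, 5])   # toy imbalanced labels
idx = stratified_background_indices(y_toy, n_background=100)
print(len(idx), np.bincount(y_toy[idx]))
```

The rows `X[idx]` could then be passed as the `data` argument. Because of rounding and the per-class minimum, the returned sample size can differ slightly from `n_background`.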
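## Method 3: Mutual Info Selection (feature selection based on mutual information)
This one cannot use GPU acceleration either (´･_･`)

* Calling fit(X, y) directly on the full dataset exhausts the 12 GB of memory.
* Even when computed in batches, it takes over 60 minutes because sklearn cannot use a GPU...

Feature indices selected by MI (30 total): [39 41 40 51 64  7 65 10  8 37  9 52 13 53 38 62 33 54  3 36 50 34 59 56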
 58  4 46 45  0 61]



In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from tqdm.notebook import tqdm

method = 'mi'
batch_size = 500000

# Compute mutual information in batches
def compute_mutual_info(X, y, batch_size):
    n_rows, n_columns = X.shape
    scores = np.zeros(n_columns)  # initialize the per-feature scores

    for i in tqdm(range(0, n_rows, batch_size), desc="Computing MI"):
        X_batch = X[i:i+batch_size]
        y_batch = y[i:i+batch_size]
        batch_scores = mutual_info_classif(X_batch, y_batch, discrete_features='auto')
        scores += batch_scores  # accumulate the scores

    # Normalize. todo: n_rows / batch_size is fractional when the last batch is partial,
    # and every batch counts equally regardless of its size (the ranking is unaffected by the constant).
    return scores / (n_rows / batch_size)


# Mutual information selection
mi_scores = compute_mutual_info(X, y, batch_size)
print(f'MI score of each feature: {mi_scores}')

mi_selected_features = np.argsort(mi_scores)[-n_features:]  # take the n_features highest-scoring features
print(f'Feature indices selected by MI ({n_features} total): {mi_selected_features}')

X_selected = X[:, mi_selected_features]
print(f'X_selected shape: {X_selected.shape}')

# Save as a .npy file
output_file = features_file.parent / 'features_selected' / f'{features_file.stem}_{method}.npy'
output_file.parent.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
np.save(output_file, X_selected)
print(f'Saved to {output_file}')


Computing MI:   0%|          | 0/10 [00:00<?, ?it/s]

MI score of each feature: [0.39467743 0.36399015 0.35162157 0.37668707 0.39108356 0.35336246
 0.35925176 0.36760243 0.36877455 0.36966063 0.36866435 0.36332722
 0.35711717 0.36986237 0.36321645 0.35035122 0.36174409 0.35557549
 0.3548249  0.35725014 0.35911897 0.35592346 0.35790976 0.35660028
 0.360194   0.36024976 0.34480358 0.35114199 0.34219003 0.34891858
 0.35280678 0.3546686  0.11147108 0.37585817 0.38778328 0.36169763
 0.37812503 0.36902266 0.37097913 0.36493591 0.36558213 0.36519635
 0.34903659 0.35470439 0.36086469 0.39436991 0.39370265 0.35615636
 0.11153143 0.36089614 0.38356591 0.36625822 0.36967515 0.36996453
 0.37668138 0.35337201 0.39103255 0.35923027 0.39106762 0.3883033
 0.35312196 0.40428788 0.37187487 0.35947869 0.36729059 0.36838061
 0.35452924 0.35222781 0.35431205 0.35566636]
Feature indices selected by MI (30 total): [39 41 40 51 64  7 65 10  8 37  9 52 13 53 38 62 33 54  3 36 50 34 59 56
 58  4 46 45  0 61]
X_selected shape: (4746934, 30)
Saved to /content/drive/MyDrive/NYIT/870/datasets/preproces
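
When per-batch scores are combined, weighting each batch by its row count (as the SHAP loop does) keeps a short final batch from counting as much as a full one. The sketch below shows that pattern in isolation. `batched_weighted_mean` and the stand-in scorer `column_mean_score` are hypothetical names, and the toy scorer is used only because its exact global value is known.

```python
import numpy as np


def batched_weighted_mean(score_fn, X, y, batch_size):
    """Average per-batch feature scores, weighting each batch by its number of rows."""
    n_rows, n_columns = X.shape
    weighted_sum = np.zeros(n_columns)
    for start in range(0, n_rows, batch_size):
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        weighted_sum += score_fn(X_batch, y_batch) * len(X_batch)
    return weighted_sum / n_rows


def column_mean_score(X_batch, y_batch):
    """Stand-in scorer: the column means, whose global value is easy to verify."""
    return X_batch.mean(axis=0)


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1050, 4))           # 1050 rows: the last batch holds only 50
y_toy = rng.integers(0, 2, size=1050)

weighted = batched_weighted_mean(column_mean_score, X_toy, y_toy, batch_size=500)
print(np.allclose(weighted, X_toy.mean(axis=0)))
```

`mutual_info_classif` could be passed as `score_fn` in the same way. Since mutual information is not linear in the data, a batch average of MI estimates is still only an approximation of the estimate on the full dataset.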