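## Model selection and training approach

[Overall flow]

Initial plan: seven candidate models
- Tree-based models: LightGBM, XGBoost, RandomForest, CatBoost
- Non-tree models: MLP, SVC, KNN

- Each model was run under three setups: single model (no resampling) / SMOTE / SMOTE + undersampling
	- Train/validation split: both 8:2 and 7:3
	- random_state = 156
	- k_neighbors = 3 (when SMOTE is applied)
	- Feature importance visualization (tree-based models)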

SMOTE setup
- Tree-based models clearly outperformed the non-tree models
	- LightGBM leaderboard score: 0.7623
	- XGBoost leaderboard score: 0.6781
	- RandomForest leaderboard score: 0.6804
	- CatBoost leaderboard score: 0.6334
	- MLP leaderboard score: 0.5548
	- SVC leaderboard score: 0.1172
	- KNN leaderboard score: 0.4669
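To put a number on the gap between the two model families, the leaderboard scores listed above can be averaged per group. The snippet only restates the figures from the list; the grouping into tree-based and non-tree models follows the initial plan.

```python
from statistics import mean

# Leaderboard scores under the SMOTE setup, copied from the list above.
scores = {
    "LightGBM": 0.7623, "XGBoost": 0.6781, "RandomForest": 0.6804,
    "CatBoost": 0.6334, "MLP": 0.5548, "SVC": 0.1172, "KNN": 0.4669,
}
tree_models = ["LightGBM", "XGBoost", "RandomForest", "CatBoost"]
non_tree_models = ["MLP", "SVC", "KNN"]

tree_mean = mean(scores[m] for m in tree_models)
non_tree_mean = mean(scores[m] for m in non_tree_models)
best_model = max(scores, key=scores.get)

print(f"tree-based mean: {tree_mean:.4f}")
print(f"non-tree mean:   {non_tree_mean:.4f}")
print(f"best model:      {best_model}")
```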

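Takeaways after checking the leaderboard
- For this task (predicting the cyber-attack type), tree-based models performed better
	- Continued with tree-based models only, split into single model, SMOTE, SMOTE + undersampling, Voting, and Stacking runs
		- LightGBM (LGBM) stood out from the other models -> decided to focus on LGBM
	- The 8:2 train/validation split outperformed 7:3 -> settled on 8:2
	- Judging by the leaderboard scores, trying more models helps, but more careful feature engineering looked more promising -> decided to focus on creating derived features and on feature engineering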
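The Voting and Stacking runs mentioned above are not shown in this notebook. As a rough sketch of what such ensembles look like in scikit-learn, the example below combines three tree-based estimators on synthetic data. The base models here are scikit-learn stand-ins (assumptions for illustration), not the LightGBM / XGBoost / CatBoost models from the experiments, and the `LogisticRegression` meta-learner is likewise an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier,
    StackingClassifier, VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (a stand-in for the real training set).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3,
    weights=[0.80, 0.15, 0.05], random_state=156,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=156, stratify=y
)

def base_models():
    """Return fresh (name, estimator) pairs for each ensemble."""
    return [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=156)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=156)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=156)),
    ]

# Soft voting averages the predicted class probabilities.
voting = VotingClassifier(estimators=base_models(), voting="soft")
# Stacking trains a meta-learner on out-of-fold base-model predictions.
stacking = StackingClassifier(
    estimators=base_models(),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

results = {}
for name, model in [("voting", voting), ("stacking", stacking)]:
    model.fit(X_train, y_train)
    results[name] = f1_score(y_val, model.predict(X_val), average="macro")

print(results)
```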

## Code (use as needed)

### LightGBM+SMOTE - test_size = 0.2

**Load required packages**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Train/validation split**

In [None]:
# Separate the target from the features
X = train.drop(['attack_type'], axis=1)
y = train['attack_type']

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=156, stratify=y)

**Apply SMOTE**
- Preprocessing step to reduce class imbalance

In [None]:
smote = SMOTE(random_state=156, k_neighbors=3)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

**Set per-class weights**

In [None]:
from sklearn.utils.class_weight import compute_class_weight
# 'balanced' gives each class the weight n_samples / (n_classes * class_count)
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
class_weight_dict = dict(zip(classes, weights))
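For reference, the `'balanced'` heuristic is `n_samples / (n_classes * count_per_class)`. The short check below reproduces it by hand on a toy label vector (made up for the example) and compares it with scikit-learn's result.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 90 / 8 / 2 rows for classes 0 / 1 / 2.
y = np.array([0] * 90 + [1] * 8 + [2] * 2)
classes = np.unique(y)

# Manual formula: rarer classes receive proportionally larger weights.
manual = len(y) / (len(classes) * np.bincount(y))
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(dict(zip(classes.tolist(), manual.round(4).tolist())))
```

With these weights every class contributes the same total weight to the loss, regardless of how many rows it has.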

In [None]:
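# Apply the class weights to LightGBM
# Note: this cell fits on the original X_train / y_train, not on
# X_train_smote / y_train_smote; the log below (9599 training rows) reflects that.
lgbm_clf = lgb.LGBMClassifier(class_weight=class_weight_dict, random_state=156)
lgbm_clf.fit(X_train, y_train)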

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003007 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2259
[LightGBM] [Info] Number of data points in the train set: 9599, number of used features: 16
[LightGBM] [Info] Start training from score -2.487750
[LightGBM] [Info] Start training from score -2.469430
[LightGBM] [Info] Start training from score -2.487248
[LightGBM] [Info] Start training from score -2.503865
[LightGBM] [Info] Start training from score -2.481700
[LightGBM] [Info] Start training from score -2.487924
[LightGBM] [Info] Start training from score -2.488409
[LightGBM] [Info] Start training from score -2.487779
[LightGBM] [Info] Start training from score -2.495159
[LightGBM] [Info] Start training from score -2.478209
[LightGBM] [Info] Start training from score -2.505797
[LightGBM] [Info] Start training from score -2.446957
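A reading of the log above, offered as an interpretation rather than something the notes state: the twelve "Start training from score" values appear to be the log of the weighted class priors. With balanced class weights every class carries roughly the same total weight, so each prior is close to 1/12. The check below uses only the numbers printed in the log.

```python
import math

# Initial scores copied from the LightGBM log above, one per class.
logged_scores = [
    -2.487750, -2.469430, -2.487248, -2.503865, -2.481700, -2.487924,
    -2.488409, -2.487779, -2.495159, -2.478209, -2.505797, -2.446957,
]

uniform_log_prior = math.log(1 / len(logged_scores))
max_gap = max(abs(s - uniform_log_prior) for s in logged_scores)
prior_sum = sum(math.exp(s) for s in logged_scores)

print(f"log(1/12)            = {uniform_log_prior:.6f}")
print(f"largest deviation    = {max_gap:.6f}")
print(f"sum of exp(scores)   = {prior_sum:.6f}")
```

The small deviations from log(1/12) are consistent with the weights having been computed on the full `y` while the model was fit on the `y_train` subset.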


In [None]:
# Predict on the validation set and evaluate
y_pred = lgbm_clf.predict(X_val)
print(classification_report(y_val, y_pred))
print("Macro F1 Score:", f1_score(y_val, y_pred, average='macro'))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1758
           1       1.00      0.80      0.89         5
           2       1.00      0.99      0.99        94
           3       0.58      0.70      0.64        10
           4       0.70      0.88      0.78         8
           5       0.97      0.99      0.98       344
           6       0.99      0.99      0.99       159
           7       1.00      1.00      1.00         6
           8       0.75      0.43      0.55         7
           9       1.00      0.60      0.75         5
          10       0.67      0.67      0.67         3
          11       0.50      1.00      0.67         1

    accuracy                           0.99      2400
   macro avg       0.85      0.84      0.82      2400
weighted avg       0.99      0.99      0.99      2400

Macro F1 Score: 0.8241940970598775
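The gap between accuracy (0.99) and macro F1 (0.82) comes from how macro averaging works: every class counts equally, so the rare classes with weak F1 pull the score down even though they cover very few rows. The snippet below recomputes both averages from the rounded per-class values in the report above.

```python
# Per-class F1 and support, copied from the classification report above.
f1_per_class = [0.99, 0.89, 0.99, 0.64, 0.78, 0.98, 0.99, 1.00, 0.55, 0.75, 0.67, 0.67]
support = [1758, 5, 94, 10, 8, 344, 159, 6, 7, 5, 3, 1]

# Macro average: plain mean over classes, ignoring class size.
macro_f1 = sum(f1_per_class) / len(f1_per_class)

# Weighted average: each class weighted by its number of rows.
weighted_f1 = sum(f * n for f, n in zip(f1_per_class, support)) / sum(support)

print(f"macro F1 (from rounded values):    {macro_f1:.3f}")
print(f"weighted F1 (from rounded values): {weighted_f1:.3f}")
```

Because the inputs are rounded to two decimals, the recomputed macro F1 differs slightly from the exact 0.8242 printed by `f1_score`.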


**Predict on the test data**

In [None]:
test_pred = lgbm_clf.predict(test)

Leaderboard score: 0.7623

- The XGBoost experiment is included below in case it is useful. (It may be better to leave it out, since it can make the overall flow harder to follow.)

### XGBoost + Optuna + StratifiedKFold + SMOTE applied at retraining

**Apply Optuna + StratifiedKFold**

In [None]:
X = train.drop(['attack_type'], axis=1)
y = train['attack_type']

In [None]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.4.0-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.16.4-py3-none-any.whl.metadata (7.3 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Downloading optuna-4.4.0-py3-none-any.whl (395 kB)
Downloading alembic-1.16.4-py3-none-any.whl (247 kB)
Downloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, alembic, optuna
Successfully installed alembic-1.16.4 colorlog-6.9.0 optuna-4.4.0


In [None]:
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import f1_score, make_scorer
import numpy as np

# Data used for tuning
X_tune = X
y_tune = y

# Evaluation metric
f1_macro = make_scorer(f1_score, average='macro')

# Optuna objective function
def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 5.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 5.0),
        'random_state': 156,
        'use_label_encoder': False,
        'eval_metric': 'mlogloss',
        'verbosity': 0
    }

    model = XGBClassifier(**params)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=156)
    scores = cross_val_score(model, X_tune, y_tune, scoring=f1_macro, cv=skf)

    return np.mean(scores)

# Run the Optuna search
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, timeout=1800)

# Best parameters and score
print("Best trial:")
print(f"  F1-macro: {study.best_value}")
print(f"  Params: {study.best_params}")

[I 2025-07-22 17:47:54,336] A new study created in memory with name: no-name-a7604e4b-c2c6-40f4-b77c-e123731ce1bd
[I 2025-07-22 17:48:42,997] Trial 0 finished with value: 0.782831123382574 and parameters: {'max_depth': 4, 'learning_rate': 0.18614843778981663, 'n_estimators': 434, 'subsample': 0.910394841500616, 'colsample_bytree': 0.6129493888508529, 'reg_alpha': 0.060476252028701105, 'reg_lambda': 1.176920780747931}. Best is trial 0 with value: 0.782831123382574.
[I 2025-07-22 17:49:09,777] Trial 1 finished with value: 0.7067299814281316 and parameters: {'max_depth': 9, 'learning_rate': 0.25068836688820295, 'n_estimators': 475, 'subsample': 0.6702895692078107, 'colsample_bytree': 0.6453437281912574, 'reg_alpha': 4.985766949774792, 'reg_lambda': 2.718460293284595}. Best is trial 0 with value: 0.782831123382574.
[I 2025-07-22 17:49:35,883] Trial 2 finished with value: 0.7349321635409585 and parameters: {'max_depth': 9, 'learning_rate': 0.24840196549105045, 'n_estimators': 374, 'subsampl

Best trial:
  F1-macro: 0.7941996765792193
  Params: {'max_depth': 9, 'learning_rate': 0.2194674119335795, 'n_estimators': 452, 'subsample': 0.978298559420077, 'colsample_bytree': 0.8348560506244289, 'reg_alpha': 0.2622686444270609, 'reg_lambda': 1.2832332980959433}


**Train the final model after tuning**

In [None]:
# Apply SMOTE to the full training data
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=3, random_state=156)
X_train_over, y_train_over = smote.fit_resample(X, y)

# Train the model with the best parameters found
best_params = study.best_params
best_params.update({
    'random_state': 156,
    'use_label_encoder': False,
    'eval_metric': 'mlogloss',
    'verbosity': 0
})

final_model = XGBClassifier(**best_params)
final_model.fit(X_train_over, y_train_over)

# Predict on the test set
test_pred = final_model.predict(test)
print("Test set prediction complete!")

Test set prediction complete!


Leaderboard score: 0.7456, a large improvement over the 0.6781 that XGBoost scored in the SMOTE comparison above