### How to use the ensemble
The class is defined in `ensemble.py`.  

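**`config_paths`**  
A list of strings giving the paths to the config.yaml files used during fine-tuning.  
Place each model and its config.yaml inside a folder named `models` at that location.  
The model name in config.yaml's `model_name`, with the publisher prefix removed, must match the name of the saved pt file (this will be revised if it causes problems).  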
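The naming rule above can be sketched as a small helper. This is only an illustration: `expected_checkpoint_path` is a hypothetical function that does not come from `ensemble.py`, and the `.pt` extension and `./models` default are assumptions based on the description.

```python
from pathlib import Path


def expected_checkpoint_path(model_name: str, models_dir: str = "./models") -> Path:
    """Hypothetical helper: where the .pt file is expected under the naming rule."""
    # Drop the publisher prefix: "publisher/model" -> "model".
    base_name = model_name.split("/")[-1]
    # Assumption: the checkpoint sits in the models folder next to config.yaml.
    return Path(models_dir) / f"{base_name}.pt"


checkpoint = expected_checkpoint_path("deliciouscat/kf-deberta-base-cross-sts")
print(checkpoint.name)
```

If the file loaded by `Ensemble` is named differently, the helper would need to change with it.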

**`base_predictions()`**  
Extracts base predictions from the fine-tuned LLMs.  
The result has one row per data sample and one column per model.  
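To make the resulting shape concrete, here is a minimal dependency-free sketch of how per-model predictions line up as columns. `stack_base_predictions` and the toy numbers are hypothetical. The real method runs the fine-tuned models and, judging by the `.shape` calls in the cell below, returns array-like objects instead of lists.

```python
def stack_base_predictions(per_model_preds):
    """Arrange per-model prediction lists into rows of (n_samples, n_models)."""
    n_samples = len(per_model_preds[0])
    if any(len(preds) != n_samples for preds in per_model_preds):
        raise ValueError("every model must predict the same number of samples")
    # zip(*...) transposes: one row per sample, one column per model.
    return [list(row) for row in zip(*per_model_preds)]


model_a = [0.1, 2.5, 4.8]  # toy predictions from a first model
model_b = [0.3, 2.2, 5.0]  # toy predictions from a second model
X_base = stack_base_predictions([model_a, model_b])
print(len(X_base), len(X_base[0]))
```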

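**`stacking`, `kfold_stacking`, `soft_voting`**  
Train the meta model with plain stacking, k-fold stacking, and soft voting, respectively.  
Passing 'linear' as the clf argument uses LinearRegression and 'lgbm' uses a LightGBM model; the fitted model is registered as a member variable of the class.  
To switch clf, call the function again with a different clf argument.  
Strictly speaking, k-fold stacking should train the LLMs themselves on the k-fold splits, but because of time and resource constraints this implementation applies k-fold only to the meta model.  
The value n specifies how many folds to use.  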
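The meta-model-only k-fold idea can be sketched without any dependencies. The base predictions are computed once and only the meta model is refit per fold. Every name here is hypothetical and the least-squares solver is a stand-in. In practice scikit-learn's `KFold` and `LinearRegression` would fill these roles, and the actual `kfold_stacking` code may differ.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) for n_folds contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        valid_idx = list(range(start, stop))
        train_idx = list(range(start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


def fit_linear(X, y):
    """Least-squares weights (intercept last) via the normal equations."""
    rows = [list(x) + [1.0] for x in X]  # append a bias column
    d = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(d)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(d)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(d):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][d] / aug[i][i] for i in range(d)]


def predict_linear(weights, X):
    """Apply the fitted weights; zip stops before the trailing intercept."""
    return [sum(w * v for w, v in zip(weights, x)) + weights[-1] for x in X]


def kfold_meta_predictions(X_base, y, n_folds):
    """Out-of-fold meta predictions; base predictions are reused as-is."""
    oof = [None] * len(X_base)
    for train_idx, valid_idx in kfold_indices(len(X_base), n_folds):
        weights = fit_linear([X_base[i] for i in train_idx],
                             [y[i] for i in train_idx])
        preds = predict_linear(weights, [X_base[i] for i in valid_idx])
        for i, p in zip(valid_idx, preds):
            oof[i] = p
    return oof


# Toy base predictions (two models) and targets, for illustration only.
X_base = [[0, 1], [1, 0], [2, 3], [3, 1], [4, 5], [5, 2]]
y = [0.5 * a + 0.5 * b for a, b in X_base]
oof = kfold_meta_predictions(X_base, y, n_folds=3)
```

Because the base models are not retrained per fold, each fold's held-out score reflects only the meta model's generalization, not that of the underlying LLMs.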

**`inference`**  
Reads sample_submission.csv and saves the ensemble result.  
If is_voting is True, the soft voting result is used; if False, inference uses whichever classifier was fitted most recently by stacking or kfold_stacking.  
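The selection logic described for `is_voting` can be sketched as follows. `soft_vote` and `ensemble_predict` are hypothetical names, and the sketch assumes soft voting means a plain average of the base predictions. The real `inference` method also reads and writes the submission CSV, which is omitted here.

```python
def soft_vote(X_base):
    """Assumed soft voting: average the base-model predictions in each row."""
    return [sum(row) / len(row) for row in X_base]


def ensemble_predict(X_base, is_voting, last_meta_predict=None):
    """Hypothetical dispatch mirroring the described is_voting behaviour."""
    if is_voting:
        return soft_vote(X_base)
    if last_meta_predict is None:
        # No meta model has been fitted yet, so there is nothing to fall back on.
        raise RuntimeError("fit a meta model before inference with is_voting=False")
    # Use whichever meta model was fitted most recently.
    return last_meta_predict(X_base)


voted = ensemble_predict([[1.0, 3.0], [2.0, 4.0]], is_voting=True)
print(voted)
```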

In [2]:
from ensemble import Ensemble

config_paths = ['./models/kf_deberta_cross_sts_config.yaml', './models/kf_deberta_cross_sts_config.yaml']
ensemble = Ensemble(config_paths=config_paths,
                    train_path='../../train_preprocess_v1.csv',
                    valid_path='../../dev_preprocess_v1.csv',
                    test_path='../../test_preprocess_v1.csv')

X_train_base_, y_train_, X_valid_base_, y_valid_, X_test_ = ensemble.base_predictions()
print(X_train_base_.shape, y_train_.shape)
print(X_valid_base_.shape, y_valid_.shape)
print(X_test_.shape)

Right now using "deliciouscat/kf-deberta-base-cross-sts"


tokenization: 100%|██████████| 9324/9324 [00:03<00:00, 2926.44it/s]
tokenization: 100%|██████████| 550/550 [00:00<00:00, 3070.08it/s]
tokenization: 100%|██████████| 1100/1100 [00:00<00:00, 2883.20it/s]
base prediction for train data: 100%|██████████| 583/583 [02:14<00:00,  4.34it/s]
base prediction for valid data: 100%|██████████| 35/35 [00:07<00:00,  4.44it/s]
base prediction for test data: 100%|██████████| 1100/1100 [00:30<00:00, 36.56it/s]




Right now using "deliciouscat/kf-deberta-base-cross-sts"


tokenization: 100%|██████████| 9324/9324 [00:03<00:00, 2942.45it/s]
tokenization: 100%|██████████| 550/550 [00:00<00:00, 3153.90it/s]
tokenization: 100%|██████████| 1100/1100 [00:00<00:00, 2918.78it/s]
base prediction for train data: 100%|██████████| 583/583 [02:13<00:00,  4.37it/s]
base prediction for valid data: 100%|██████████| 35/35 [00:07<00:00,  4.44it/s]
base prediction for test data: 100%|██████████| 1100/1100 [00:30<00:00, 36.18it/s]



(9324, 2) (9324, 1)
(550, 2) (550, 1)
(1100, 2)





In [3]:
ensemble.stacking(clf="linear")
ensemble.stacking(clf="lgbm")
ensemble.soft_voting()
ensemble.kfold_stacking("linear", 3)
ensemble.kfold_stacking("lgbm", 3)

    train pearson sim: [0.99857559]
    valid pearson sim: [0.92838911]


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000215 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 9324, number of used features: 2
[LightGBM] [Info] Start training from score 1.849968
    train pearson sim: [0.99869295]
    valid pearson sim: [0.92851294]


    train pearson sim: [0.99857559]
    valid pearson sim: [0.92838911]


    train pearson sim: [0.99860786]
    k-valid pearson sim: [0.99844497]
    valid pearson sim: [0.99844497]


    train pearson sim: [0.99856262]
    k-valid pearson sim: [0.9986258]
    valid pearson sim: [0.9986258]


    train pearson sim: [0.99854456]
    k-valid pearson sim: [0.99869849]
    valid pearson sim: [0.99869849]


    train pearson sim: [0.9985721]
    k-valid pearson sim: [0.99859489]
    valid pearson sim: [0.998594

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


    train pearson sim: [0.99872998]
    k-valid pearson sim: [0.9984905]
    valid pearson sim: [0.9984905]


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000144 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 7459, number of used features: 2
[LightGBM] [Info] Start training from score 1.839697
    train pearson sim: [0.99869897]
    k-valid pearson sim: [0.99867778]
    valid pearson sim: [0.99867778]


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000069 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 7459, number of used features: 2
[LightGBM] [Info] Start training from score 1.864137


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


    train pearson sim: [0.99866779]
    k-valid pearson sim: [0.99874458]
    valid pearson sim: [0.99874458]


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000104 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 7459, number of used features: 2
[LightGBM] [Info] Start training from score 1.844135
    train pearson sim: [0.99869724]
    k-valid pearson sim: [0.99865659]
    valid pearson sim: [0.99865659]


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000066 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 7460, number of used features: 2
[LightGBM] [Info] Start training from score 1.851059


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


    train pearson sim: [0.99871699]
    k-valid pearson sim: [0.9985408]
    valid pearson sim: [0.9985408]




In [7]:
ensemble.stacking("lgbm")
ensemble.inference(is_voting=False, submission_path='../../sample_submission.csv')

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000129 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 510
[LightGBM] [Info] Number of data points in the train set: 9324, number of used features: 2
[LightGBM] [Info] Start training from score 1.849968
    train pearson sim: [0.99869295]
    valid pearson sim: [0.92851294]




  y = column_or_1d(y, warn=True)
