<a href="https://colab.research.google.com/github/cswcjt/Dacon-Oil/blob/main/imbalance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

-- Imbalanced data problem: 비대칭 데이터 문제

    - 데이터 클래스 비율이 너무 차이가 나면(highly-imbalanced data) 단순히 우세한 클래스를 택하는 모형의 정확도가 높아지므로 모형의 성능판별이 어려워진다. 즉, 정확도(accuracy)가 높아도 데이터 갯수가 적은 클래스의 재현율(recall-rate)이 급격히 작아지는 현상이 발생할 수 있다.

-- Imbalanced-Learn methodology: 해결 방법 

    - 비대칭 데이터는 다수 클래스 데이터에서 일부만 사용하는 언더 샘플링이나 소수 클래스 데이터를 증가시키는 오버 샘플링을 사용하여 데이터 비율을 맞추면 정밀도(precision)가 향상된다. 

    - 오버샘플링(Over-Sampling)

    - 언더샘플링(Under-Sampling)

    - 복합샘플링(Combining Over-and Under-Sampling)

-- under sampling 

    - RandomUnderSampler: 무작위로 데이터를 없애는 단순 샘플링

    - TomekLinks: 
        - 1) 토멕링크(Tomek’s link)란 서로 다른 클래스에 속하는 한 쌍의 데이터 (𝑥+,𝑥−)로 서로에게 더 가까운 다른 데이터가 존재하지 않는 것이다. 
        - 2) 토멕링크를 찾은 다음 그 중에서 다수 클래스에 속하는 데이터를 제외하는 방법

    - CondensedNearestNeighbour(CNN): 
        - CNN(Condensed Nearest Neighbour) 방법은 1-NN 모형으로 분류되지 않는 데이터만 남기는 방법이다. 선텍된 데이터 집합을 𝑆라고 하자.
        - 1) 소수 클래스 데이터를 모두 𝑆에 포함시킨다.
        - 2) 다수 데이터 중에서 하나를 골라서 가장 가까운 데이터가 다수 클래스이면 포함시키지 않고 아니면 𝑆에 포함시킨다.
        - 3) 더이상 선택되는 데이터가 없을 때까지 3를 반복한다.

    - OneSidedSelection: 
        - TomekLinks + CNN 

    - EditedNearestNeighbours(ENN): 
        - 다수 클래스 데이터 중 가장 가까운 k(n_neighbors)개의 데이터가 모두(kind_sel="all") 또는 다수(kind_sel="mode") 다수 클래스가 아니면 삭제하는 방법이다. 소수 클래스 주변의 다수 클래스 데이터는 사라진다.

    - NeighbourhoodCleaningRule: 
        -  CNN + ENN

    
-- over sampling

    - RandomOverSampler
        - Random Over Sampling은 소수 클래스의 데이터를 반복해서 넣는 것(replacement)이다. 가중치를 증가시키는 것과 비슷하다. 

    - ADASYN(Adaptive Synthetic Sampling Approach for Imbalanced Learning)
        - 소수 클래스 데이터와 그 데이터에서 가장 가까운 k개의 소수 클래스 데이터 중 무작위로 선택된 데이터 사이의 직선상에 가상의 소수 클래스 데이터를 만드는 방법이다.

    - SMOTE(Synthetic Minority Over-sampling Technique)
        - ADASYN 방법처럼 데이터를 생성하지만 생성된 데이터를 무조건 소수 클래스라고 하지 않고 분류 모형에 따라 분류한다.

-- 복합 샘플링

    - SMOTE+ENN
        - SMOTE+ENN 방법은 SMOTE(Synthetic Minority Over-sampling Technique) 방법과 ENN(Edited Nearest Neighbours) 방법을 섞은 것이다. 
    - SMOTE+Tomek
        - SMOTE+Tomek 방법은 SMOTE(Synthetic Minority Over-sampling Technique) 방법과 토멕링크 방법을 섞은 것이다.

-- 주의 사항

    - train & test datasets의 포퍼먼스를 주시해야한다. 
        - to gain insight into the impact of the method, it is a good idea to monitor the performance on both train and test datasets after oversampling and compare the results to the same algorithm on the original dataset.

    - 클래스의 분포의 skewness가 높은지 확인이 필요하다. 
        - The increase in the number of examples for the minority class, especially if the class skew was severe, can also result in a marked increase in the computational cost when fitting the model, especially considering the model is seeing the same examples in the training dataset again and again.

    - Pipeline 사용시
        - A traditional scikit-learn Pipeline cannot be used; instead, a Pipeline from the imbalanced-learn library can be used

-- 왜도와 첨도

    - 왜도 (Skewness) : 분포의 비대칭도
        - 정규분포 = 왜도 0
        - 왼쪽으로 치우침 = 왜도 > 0
        - 오른쪽으로 치우침 = 왜도 < 0
    - 첨도 (Kurtosis) : 확률분포의 뾰족한 정도
        - 정규분포 = 첨도 0(Pearson 첨도 = 3)
        - 위로 뾰족함 = 첨도 > 0(Pearson 첨도 >3)
        - 아래로 뾰족함 = 첨도 < 0 (Pearson 첨도 < 3) 


 

Preprocessing에 들어갈 imbalance 함수 만들자

In [1]:
import time
from tqdm import tqdm

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter # 샘플결과 확인
from sklearn.model_selection import train_test_split # 트테트테
from sklearn.decomposition import PCA # 차원축소
from sklearn.ensemble import RandomForestClassifier # 모델선택
from sklearn.metrics import f1_score # 성과지표
from sklearn.metrics import classification_report # 성과지표
from imblearn.under_sampling import * # 임벨런스
from imblearn.over_sampling import * # 임벨런스
from imblearn.combine import * # 임벨런스
from imblearn.pipeline import Pipeline # 파이프라인구축

In [3]:
# classification models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, StackingClassifier, HistGradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from lightgbm.sklearn import LGBMClassifier
from xgboost.sklearn import XGBClassifier

# classification metrics
from sklearn.metrics import confusion_matrix, plot_roc_curve, f1_score

# regression models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor, StackingRegressor, HistGradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from lightgbm.sklearn import LGBMRegressor
from xgboost.sklearn import XGBRegressor

# regression metrics
from sklearn.metrics import mean_absolute_error, r2_score

In [4]:
from google.colab import drive
drive.mount('/content/drive')
base_path = "/content/drive/MyDrive/fastcamp/datas/open/"
train = pd.read_csv(base_path + "train.csv", encoding='cp949')
test = pd.read_csv(base_path + "test.csv", encoding='cp949')
submission = pd.read_csv(base_path + "sample_submission.csv", encoding='cp949')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
train = train.fillna(0)
train.COMPONENT_ARBITRARY = train.COMPONENT_ARBITRARY.map({"COMPONENT1" : 1, "COMPONENT2" : 2, "COMPONENT3" : 3, "COMPONENT4" : 4})
X = train.drop(columns=["ID", "Y_LABEL"], inplace=False)
y = train["Y_LABEL"]

In [6]:
train

Unnamed: 0,ID,COMPONENT_ARBITRARY,ANONYMOUS_1,YEAR,SAMPLE_TRANSFER_DAY,ANONYMOUS_2,AG,AL,B,BA,...,U25,U20,U14,U6,U4,V,V100,V40,ZN,Y_LABEL
0,TRAIN_00000,3,1486,2011,7,200,0,3,93,0,...,0.0,0.0,0.0,0.0,0.0,0,0.0,154.0,75,0
1,TRAIN_00001,2,1350,2021,51,375,0,2,19,0,...,2.0,4.0,6.0,216.0,1454.0,0,0.0,44.0,652,0
2,TRAIN_00002,2,2415,2015,2,200,0,110,1,1,...,0.0,3.0,39.0,11261.0,41081.0,0,0.0,72.6,412,1
3,TRAIN_00003,3,7389,2010,2,200,0,8,3,0,...,0.0,0.0,0.0,0.0,0.0,0,0.0,133.3,7,0
4,TRAIN_00004,3,3954,2015,4,200,0,1,157,0,...,0.0,0.0,0.0,0.0,0.0,0,0.0,133.1,128,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14090,TRAIN_14090,3,1616,2014,8,200,0,2,201,1,...,0.0,0.0,0.0,0.0,0.0,0,0.0,135.4,16,0
14091,TRAIN_14091,1,2784,2013,2,200,0,3,85,0,...,0.0,0.0,0.0,0.0,0.0,0,14.5,117.5,1408,0
14092,TRAIN_14092,3,1788,2008,9,550,0,6,0,1,...,0.0,0.0,0.0,0.0,0.0,0,0.0,54.0,1301,0
14093,TRAIN_14093,2,2498,2009,19,550,0,2,4,0,...,7.0,8.0,100.0,1625.0,18890.0,0,0.0,44.3,652,0


In [7]:
X.isnull().values.any()

False

In [8]:
# from collections import Counter 
# from sklearn.datasets import fetch_mldata 
# from imblearn.under_sampling import CondensedNearestNeighbour 
# pima = fetch_mldata('diabetes_scale') 
# X, y = pima['data'], pima['target'] 
# print('Original dataset shape %s' % Counter(y)) 
# cnn = CondensedNearestNeighbour(random_state=42) 
# X_res, y_res = cnn.fit_resample(X, y) 
# print('Resampled dataset shape %s' % Counter(y_res)) 

In [9]:
#CondensedNearestNeighbour(random_state=0).fit_resample(X, y)

In [10]:
class Preprocessing:
    def __init__(self, **kwargs):
        """
        X: pd.DataFrame=X, y: pd.DataFrame=y, categorical_feature: str="COMPONENT_ARBITRARY", 
        test_size:int=0.1, random_state_: int=42 ,dimensionality: callable= PCA()
        """

        # preprocessing for data set
        self.X = kwargs["X"]
        self.y = kwargs["y"]
        self.categorical_feature = kwargs["categorical_feature"]

        # self.concat_df = pd.concat([X, y], axis=1) 
        self.test_size = kwargs["test_size"]

        # preprocessing for learning model
        self.learners_dict = {
            'classification': {
                'RF': RandomForestClassifier,
                'XGB': XGBClassifier,
                'LGBM': LGBMClassifier
            },
        
            'regression': {
                'RF': RandomForestRegressor,
                'XGB': XGBRegressor,
                'LGBM': LGBMRegressor
            }
        }

        # preprocessing for sampling model
        self.samplers_dict = {
            "under": {
                'RandomUnderSampler': RandomUnderSampler,
                'TomekLinks': TomekLinks,
                'CondensedNearestNeighbour': CondensedNearestNeighbour, 
                'OneSidedSelection': OneSidedSelection,
                'EditedNearestNeighbours': EditedNearestNeighbours,
                'NeighbourhoodCleaningRule': NeighbourhoodCleaningRule
            },

            "over": {
                'RandomOverSampler': RandomOverSampler,
                'ADASYN': ADASYN,
                'NeighbourhoodCleaningRule': NeighbourhoodCleaningRule
            },

            "hybrid": {
                'SMOTEENN': SMOTEENN,
                'SMOTETomek': SMOTETomek
            }
        }

        # preprocessing for dimensionality
        self.dimensionality = kwargs["dimensionality"]

        # create new attrubutes for methods 
        learner = kwargs["learner"]
        sampler = kwargs["sampler"]
        self.my_learner = self.learners_dict[learner[0]][learner[1]]
        self.my_sampler = self.samplers_dict[sampler[0]][sampler[1]]
        self.random_state_ = kwargs["random_state_"]

    def sampling(self, X: pd.DataFrame=X, y: pd.DataFrame=y) -> tuple:
        try: 
            sampler = self.my_sampler(random_state=self.random_state_)
            X2, y2 = sampler.fit_resample(X, y)
            print(f"{sampler} completed resampling X and y" )
            return X2, y2
        
        except ValueError:
            print("categorical value 넣지마세요!")

        except TypeError: 
            print("random_state 없는 샘플러")
            sampler = self.my_sampler()
            X2, y2 = sampler.fit_resample(X, y)
            print(f"{sampler} completed resampling X and y" )
            return X2, y2

    def grouping_df(self, X, y, y_column: str='Y_LABEL') -> dict: 
        """
        divide train_df to make each group df
        return grouped df list 
        """

        # concat X2 and y2 to divide groups 
        categorical_feature = self.categorical_feature
        print(categorical_feature)
        concat_df = pd.concat([X,y], axis=1)
        group_dic = {}

        for criteria in sorted(concat_df[categorical_feature].unique()): 
            print(f"dividing my df on {criteria}")
            temp_df = concat_df.loc[concat_df[categorical_feature] == criteria,].drop(columns=categorical_feature)

            # make grouped X, y
            X3 = temp_df.drop(columns=[y_column])
            y3 = temp_df[y_column]
            group_dic.update({criteria: (X3, y3)})
        
        return group_dic

    def split_X_y_bundle(self, X_y_bundle: tuple or dict) -> dict: 
        """
        split train and validation data set
        return X_train, X_val, y_train, y_val
        """

        if type(X_y_bundle) == tuple: 
            (X, y) = X_y_bundle
            X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state_)
            return X_train, X_val, y_train, y_val

        else: 
            split_dict = {}
            for key, (X, y) in X_y_bundle.items():
                (X_train, X_val, y_train, y_val) = train_test_split(X, y, test_size=self.test_size, random_state=self.random_state_)
                split_dict.update({key : (X_train, X_val, y_train, y_val)})     
            return split_dict

    def feature_importance_for_groups(self, split_dict): # -> 메서드 가지고 오면 피처임포턴스 리턴하는 메서드
        # nan 값 처리 후 사용 가능
        classifier = self.my_learner()
        acc_dict = {}
        feature_importance_dict = {}

        for criteria, (X_train, X_val, y_train, y_val) in split_dict.items(): 
            classifier.fit(X_train, y_train) # Random Forest 학습을 위해 parameter 채우기
            pred = classifier.predict(X_val) # Random Forest 테스트를 위해 parameter 채우기
            acc = f1_score(y_val, pred)  # f1_score 계산
            acc_dict.update({criteria : acc})
            print("f1_score : %.3f" % acc)

            importances = classifier.feature_importances_
            ftr_importances = pd.Series(importances, index = X_train.columns).sort_values(ascending=False)
            feature_importance_dict.update({criteria : ftr_importances})

        return acc_dict, feature_importance_dict

    def chose_drop_features(self, feature_importance_dict, threshold: int=None, draw=True): 
        """
        return list of all feature_importance for each group 
        """

        if draw == True:
            for criteria, feature_importance in feature_importance_dict.items(): 
                plt.figure(figsize=(12, 6))
                plt.title(f'{criteria} Feature Importances')
                sns.barplot(x=feature_importance, y=feature_importance.index)
                plt.show()

        drop_target_list = []
        for criteria, feature_importance in feature_importance_dict.items():
            temp_df = feature_importance.reset_index()
            temp_df.columns = ["name", "value"]

            if threshold == None: 
                drop_target_list.extend(temp_df[temp_df.value == 0].name.to_list())

            elif threshold != None:
                drop_target_list.extend(temp_df[temp_df.value <= threshold].name.to_list())

        return list(set(drop_target_list))

    def print_report(self, split_dict) -> str: 
        try: 
            sampler = self.my_sampler(random_state=self.random_state_)
        except TypeError:  
            sampler = self.my_sampler()
        #print(sampler)
        classifier = self.my_learner()
        #print(classifier)
        dimensionality = self.dimensionality()
        #print(type(dimensionality))
        # sampling method, dimensionality, model
        pipeline = Pipeline([('sampling_method', sampler), ('dimensionality', dimensionality), ('model', classifier)]) 
        
        for criteria, (X_train, X_val, y_train, y_val) in split_dict.items(): 
            pipeline.fit(X_train, y_train) 
            y_hat = pipeline.predict(X_val)
            print(f"{dimensionality} 사용한 pipe line")
            print(classification_report(y_val, y_hat))


In [11]:
sampler_dic = {
    # "under": {
    #     'RandomUnderSampler': RandomUnderSampler,
    #     'TomekLinks': TomekLinks,
    #     'CondensedNearestNeighbour': CondensedNearestNeighbour, 
    #     'OneSidedSelection': OneSidedSelection,
    #     'EditedNearestNeighbours': EditedNearestNeighbours,
    #     'NeighbourhoodCleaningRule': NeighbourhoodCleaningRule
    # },

    "over": {
        'RandomOverSampler': RandomOverSampler,
        # 'ADASYN': ADASYN, -> value Error : "No samples will be generated with the provided ratio settings."
        'NeighbourhoodCleaningRule': NeighbourhoodCleaningRule
    },

    "hybrid": {
        'SMOTEENN': SMOTEENN,
        'SMOTETomek': SMOTETomek
    }
}

In [12]:
# # sampler 하나 
# variable_dict = {
#     "X": X, 
#     "y": y, 
#     "categorical_feature": "COMPONENT_ARBITRARY", 
#     "test_size": 0.1, 
#     "learner": ("classification", "XGB"), 
#     "sampler": ("under", "RandomUnderSampler"), 
#     "random_state_": 42,
#     "dimensionality": PCA
# }

# first_try = Preprocessing(**variable_dict)
# X = first_try.X
# y = first_try.y
# print()

# # 샘플링 그룹핑 스플릿
# X2, y2 = first_try.sampling(X, y)
# grouped_dic = first_try.grouping_df(X2, y2, y_column='Y_LABEL')
# split_X_y_bundle = first_try.split_X_y_bundle(grouped_dic)
# print()

# # 피처임포턴스 확인
# result_ = first_try.feature_importance_for_groups(split_X_y_bundle)
# features = result_[1]
# drop_target_list = first_try.chose_drop_features(features, draw=False)
# print()
# print(drop_target_list)
# print()

# # 파이프라인 결과 확인
# print(first_try.print_report(split_X_y_bundle))

In [13]:
# RF & sampler 전부다 
for key, value in tqdm(sampler_dic.items(), desc="\n첫 번째 반복문"):
    for name, function in tqdm(value.items(), desc="\n두 번째 반복문"):
        variable_dict = {
            "X": X, 
            "y": y, 
            "categorical_feature": "COMPONENT_ARBITRARY", 
            "test_size": 0.1, 
            "learner": ("classification", "RF"), 
            "sampler": (key, name), 
            "random_state_": 42,
            "dimensionality": PCA
        }

        first_try = Preprocessing(**variable_dict)
        X = first_try.X
        y = first_try.y
        print()

        # 샘플링 그룹핑 스플릿
        X2, y2 = first_try.sampling(X, y)
        grouped_dic = first_try.grouping_df(X2, y2, y_column='Y_LABEL')
        split_X_y_bundle = first_try.split_X_y_bundle(grouped_dic)
        print()

        # 피처임포턴스 확인
        result_ = first_try.feature_importance_for_groups(split_X_y_bundle)
        features = result_[1]
        drop_target_list = first_try.chose_drop_features(features, draw=False)
        print()
        print(drop_target_list)
        print()

        # 파이프라인 결과 확인
        print(first_try.print_report(split_X_y_bundle))


첫 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s]

두 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s][A


RandomOverSampler(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.997
f1_score : 0.995
f1_score : 0.995
f1_score : 1.000

['FOXID', 'SOOTPERCENTAGE', 'U20', 'U25', 'FTBN', 'FNOX', 'U100', 'U50', 'U14', 'FSO4', 'FOPTIMETHGLY', 'U4', 'U75', 'V100', 'U6', 'FH2O', 'BE', 'FUEL']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       386
           1       1.00      1.00      1.00       312

    accuracy                           1.00       698
   macro avg       1.00      1.00      1.00       698
weighted avg       1.00      1.00      1.00       698

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       215
           1       1.00      1.00      1.00       217

    accuracy                           1.00       432
   macro avg       1.00   



두 번째 반복문:  50%|█████     | 1/2 [00:25<00:25, 25.85s/it][A

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        77
           1       1.00      1.00      1.00        54

    accuracy                           1.00       131
   macro avg       1.00      1.00      1.00       131
weighted avg       1.00      1.00      1.00       131

None

random_state 없는 샘플러
NeighbourhoodCleaningRule() completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.690
f1_score : 0.733
f1_score : 0.667
f1_score : 0.667

['FOXID', 'FNOX', 'U100', 'FH2O', 'SOOTPERCENTAGE', 'U6', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'V100']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.94      1.00      0.97       282
           1       1.00      0.50      0.67        38

    accuracy                           0.94       



두 번째 반복문: 100%|██████████| 2/2 [00:42<00:00, 20.54s/it][A
두 번째 반복문: 100%|██████████| 2/2 [00:42<00:00, 21.34s/it]

첫 번째 반복문:  50%|█████     | 1/2 [00:42<00:42, 42.69s/it]

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.96      0.99      0.97        68
           1       0.50      0.25      0.33         4

    accuracy                           0.94        72
   macro avg       0.73      0.62      0.65        72
weighted avg       0.93      0.94      0.94        72

None




두 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s][A


SMOTEENN(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.980
f1_score : 0.981
f1_score : 0.972
f1_score : 0.867

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'SOOTPERCENTAGE', 'U6', 'U20', 'FTBN', 'FSO4', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'V100']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       266
           1       0.99      0.98      0.98       299

    accuracy                           0.98       565
   macro avg       0.98      0.98      0.98       565
weighted avg       0.98      0.98      0.98       565

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       139
           1       0.99      0.95      0.97       239

    accuracy                           0.96       378
   macro avg       0.96      0



두 번째 반복문:  50%|█████     | 1/2 [00:41<00:41, 41.71s/it][A

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.96      0.98      0.97        52
           1       0.94      0.88      0.91        17

    accuracy                           0.96        69
   macro avg       0.95      0.93      0.94        69
weighted avg       0.96      0.96      0.96        69

None

SMOTETomek(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.974
f1_score : 0.972
f1_score : 0.967
f1_score : 0.839

['FOXID', 'SOOTPERCENTAGE', 'U25', 'U20', 'FTBN', 'FNOX', 'U100', 'U50', 'U14', 'FSO4', 'FOPTIMETHGLY', 'U4', 'U75', 'V100', 'U6', 'FH2O', 'BE', 'FUEL']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       382
           1       0.97      0.97      0.97       315

    accuracy                           0.98       697
   macro avg       0.98     



두 번째 반복문: 100%|██████████| 2/2 [01:19<00:00, 39.25s/it][A
두 번째 반복문: 100%|██████████| 2/2 [01:19<00:00, 39.62s/it]

첫 번째 반복문: 100%|██████████| 2/2 [02:01<00:00, 64.20s/it]
첫 번째 반복문: 100%|██████████| 2/2 [02:01<00:00, 60.97s/it]

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.93      1.00      0.96        74
           1       1.00      0.67      0.80        18

    accuracy                           0.93        92
   macro avg       0.96      0.83      0.88        92
weighted avg       0.94      0.93      0.93        92

None





In [14]:
# RandomOverSampler(random_state=42)
# f1_score : 0.997
# f1_score : 0.995
# f1_score : 0.995
# f1_score : 1.000

# SMOTEENN(random_state=42) 
# f1_score : 0.975
# f1_score : 0.983
# f1_score : 0.970
# f1_score : 0.800

In [15]:
# XGB & sampler 전부다 
for key, value in tqdm(sampler_dic.items(), desc="\n첫 번째 반복문"):
    for name, function in tqdm(value.items(), desc="\n두 번째 반복문"):
        variable_dict = {
            "X": X, 
            "y": y, 
            "categorical_feature": "COMPONENT_ARBITRARY", 
            "test_size": 0.1, 
            "learner": ("classification", "XGB"), 
            "sampler": (key, name), 
            "random_state_": 42,
            "dimensionality": PCA
        }

        first_try = Preprocessing(**variable_dict)
        X = first_try.X
        y = first_try.y
        print()

        # 샘플링 그룹핑 스플릿
        X2, y2 = first_try.sampling(X, y)
        grouped_dic = first_try.grouping_df(X2, y2, y_column='Y_LABEL')
        split_X_y_bundle = first_try.split_X_y_bundle(grouped_dic)
        print()

        # 피처임포턴스 확인
        result_ = first_try.feature_importance_for_groups(split_X_y_bundle)
        features = result_[1]
        drop_target_list = first_try.chose_drop_features(features, draw=False)
        print()
        print(drop_target_list)
        print()

        # 파이프라인 결과 확인
        print(first_try.print_report(split_X_y_bundle))


첫 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s]

두 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s][A


RandomOverSampler(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.902
f1_score : 0.939
f1_score : 0.844
f1_score : 0.964

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'MO', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'MN', 'U20', 'FTBN', 'SN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'K', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.97      0.91      0.94       386
           1       0.89      0.96      0.93       312

    accuracy                           0.93       698
   macro avg       0.93      0.93      0.93       698
weighted avg       0.93      0.93      0.93       698

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.88      0.99      0.93       215
           1       0.98      0.86      0.92       217

    accur



두 번째 반복문:  50%|█████     | 1/2 [00:11<00:11, 11.70s/it][A

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       1.00      0.99      0.99        77
           1       0.98      1.00      0.99        54

    accuracy                           0.99       131
   macro avg       0.99      0.99      0.99       131
weighted avg       0.99      0.99      0.99       131

None

random_state 없는 샘플러
NeighbourhoodCleaningRule() completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.678
f1_score : 0.690
f1_score : 0.667
f1_score : 0.400

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'SB', 'H2O', 'U6', 'MN', 'U20', 'FTBN', 'SN', 'FSO4', 'FOPTIMETHGLY', 'CD', 'U75', 'BE', 'FUEL', 'U25', 'PB', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.94      1.00      0.97       282
           1       0.95      0.50      



두 번째 반복문: 100%|██████████| 2/2 [00:22<00:00, 11.40s/it][A
두 번째 반복문: 100%|██████████| 2/2 [00:22<00:00, 11.45s/it]

첫 번째 반복문:  50%|█████     | 1/2 [00:22<00:22, 22.91s/it]

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.97      0.99      0.98        68
           1       0.67      0.50      0.57         4

    accuracy                           0.96        72
   macro avg       0.82      0.74      0.77        72
weighted avg       0.95      0.96      0.96        72

None




두 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s][A


SMOTEENN(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.966
f1_score : 0.966
f1_score : 0.942
f1_score : 0.848

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'MO', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'MN', 'U20', 'FTBN', 'NA', 'SN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'PB', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.91      0.91      0.91       266
           1       0.92      0.92      0.92       299

    accuracy                           0.92       565
   macro avg       0.92      0.92      0.92       565
weighted avg       0.92      0.92      0.92       565

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       139
           1       1.00      0.91      0.95       239

    a



두 번째 반복문:  50%|█████     | 1/2 [00:31<00:31, 31.78s/it][A

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.93      0.83      0.88        52
           1       0.61      0.82      0.70        17

    accuracy                           0.83        69
   macro avg       0.77      0.83      0.79        69
weighted avg       0.85      0.83      0.83        69

None

SMOTETomek(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.958
f1_score : 0.954
f1_score : 0.924
f1_score : 0.759

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'MN', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.93      0.94      0.93       382
           1       0.92      0.92      0.92       315

    accuracy          



두 번째 반복문: 100%|██████████| 2/2 [00:57<00:00, 28.24s/it][A
두 번째 반복문: 100%|██████████| 2/2 [00:57<00:00, 28.78s/it]

첫 번째 반복문: 100%|██████████| 2/2 [01:20<00:00, 43.29s/it]
첫 번째 반복문: 100%|██████████| 2/2 [01:20<00:00, 40.24s/it]

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.91      0.99      0.95        74
           1       0.92      0.61      0.73        18

    accuracy                           0.91        92
   macro avg       0.91      0.80      0.84        92
weighted avg       0.91      0.91      0.91        92

None





In [16]:
# RandomOverSampler(random_state=42) 
# f1_score : 0.902
# f1_score : 0.939
# f1_score : 0.844
# f1_score : 0.964

# SMOTEENN(random_state=42) 
# f1_score : 0.966
# f1_score : 0.966
# f1_score : 0.942
# f1_score : 0.848

In [17]:
# LGBM & sampler 전부다 
for key, value in tqdm(sampler_dic.items(), desc="\n첫 번째 반복문"):
    for name, function in tqdm(value.items(), desc="\n두 번째 반복문"):
        variable_dict = {
            "X": X, 
            "y": y, 
            "categorical_feature": "COMPONENT_ARBITRARY", 
            "test_size": 0.1, 
            "learner": ("classification", "LGBM"), 
            "sampler": (key, name), 
            "random_state_": 42,
            "dimensionality": PCA
        }

        first_try = Preprocessing(**variable_dict)
        X = first_try.X
        y = first_try.y
        print()

        # 샘플링 그룹핑 스플릿
        X2, y2 = first_try.sampling(X, y)
        grouped_dic = first_try.grouping_df(X2, y2, y_column='Y_LABEL')
        split_X_y_bundle = first_try.split_X_y_bundle(grouped_dic)
        print()

        # 피처임포턴스 확인
        result_ = first_try.feature_importance_for_groups(split_X_y_bundle)
        features = result_[1]
        drop_target_list = first_try.chose_drop_features(features, draw=False)
        print()
        print(drop_target_list)
        print()

        # 파이프라인 결과 확인
        print(first_try.print_report(split_X_y_bundle))


첫 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s]

두 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s][A


RandomOverSampler(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.997
f1_score : 0.993
f1_score : 0.981
f1_score : 0.991

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       386
           1       0.97      1.00      0.99       312

    accuracy                           0.99       698
   macro avg       0.99      0.99      0.99       698
weighted avg       0.99      0.99      0.99       698

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       215
           1       1.00      1.00      1.00       217

    accuracy                    



두 번째 반복문:  50%|█████     | 1/2 [00:08<00:08,  8.46s/it][A

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.99      1.00      0.99        77
           1       1.00      0.98      0.99        54

    accuracy                           0.99       131
   macro avg       0.99      0.99      0.99       131
weighted avg       0.99      0.99      0.99       131

None

random_state 없는 샘플러
NeighbourhoodCleaningRule() completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.690
f1_score : 0.690
f1_score : 0.695
f1_score : 0.222

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'MN', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.94      1.00      0.97       282
           1       0.95      0.55      0.70        38

  



두 번째 반복문: 100%|██████████| 2/2 [00:21<00:00, 11.25s/it][A
두 번째 반복문: 100%|██████████| 2/2 [00:21<00:00, 10.85s/it]

첫 번째 반복문:  50%|█████     | 1/2 [00:21<00:21, 21.70s/it]

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.97      1.00      0.99        68
           1       1.00      0.50      0.67         4

    accuracy                           0.97        72
   macro avg       0.99      0.75      0.83        72
weighted avg       0.97      0.97      0.97        72

None




두 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s][A


SMOTEENN(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.973
f1_score : 0.990
f1_score : 0.968
f1_score : 0.857

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'U20', 'FTBN', 'SN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.97      0.98       266
           1       0.97      0.99      0.98       299

    accuracy                           0.98       565
   macro avg       0.98      0.98      0.98       565
weighted avg       0.98      0.98      0.98       565

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.94      0.99      0.97       139
           1       1.00      0.96      0.98       239

    accuracy                 



두 번째 반복문:  50%|█████     | 1/2 [00:27<00:27, 27.35s/it][A

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.96      0.88      0.92        52
           1       0.71      0.88      0.79        17

    accuracy                           0.88        69
   macro avg       0.84      0.88      0.85        69
weighted avg       0.90      0.88      0.89        69

None

SMOTETomek(random_state=42) completed resampling X and y
COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.981
f1_score : 0.976
f1_score : 0.973
f1_score : 0.875

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       382
           1       0.96      0.97      0.97       315

    accuracy                



두 번째 반복문: 100%|██████████| 2/2 [00:50<00:00, 24.84s/it][A
두 번째 반복문: 100%|██████████| 2/2 [00:50<00:00, 25.23s/it]

첫 번째 반복문: 100%|██████████| 2/2 [01:12<00:00, 38.63s/it]
첫 번째 반복문: 100%|██████████| 2/2 [01:12<00:00, 36.09s/it]

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        74
           1       1.00      0.78      0.88        18

    accuracy                           0.96        92
   macro avg       0.97      0.89      0.92        92
weighted avg       0.96      0.96      0.95        92

None





In [18]:
# RandomOverSampler(random_state=42)
# f1_score : 0.997
# f1_score : 0.993
# f1_score : 0.981
# f1_score : 0.991

# SMOTETomek(random_state=42) 
# f1_score : 0.981
# f1_score : 0.976
# f1_score : 0.973
# f1_score : 0.875

In [19]:
# 피처 드랍
# X2 = X2.drop(columns=drop_target_list)
# print()

# # 데이터 준비
# sec_grouped_dic = first_try.grouping_df(X2, y2, y_column='Y_LABEL')
# sec_split_X_y_bundle = first_try.split_X_y_bundle(sec_grouped_dic)
# print()

# # 피처임포턴스 확인
# sec_result_ = first_try.feature_importance_for_groups(sec_split_X_y_bundle)
# sec_features = sec_result_[1]
# sec_drop_target_list = first_try.chose_drop_features(sec_features, draw=False)
# print()
# print(sec_drop_target_list)
# print()

# 파이프라인 결과 확인
#print(first_try.print_report(sec_split_X_y_bundle))

In [20]:
# RF & sampler 전부다 
for key, value in tqdm(sampler_dic.items(), desc="\n첫 번째 반복문"):
    for name, function in tqdm(value.items(), desc="\n두 번째 반복문"):
        variable_dict = {
            "X": X, 
            "y": y, 
            "categorical_feature": "COMPONENT_ARBITRARY", 
            "test_size": 0.1, 
            "learner": ("classification", "RF"), 
            "sampler": (key, name), 
            "random_state_": 42,
            "dimensionality": PCA
        }
        # 피처 드랍
        X22 = X2.drop(columns=drop_target_list)
        print()

        # 샘플링 그룹핑 스플릿
        grouped_dic2 = first_try.grouping_df(X22, y2, y_column='Y_LABEL')
        split_X_y_bundle2 = first_try.split_X_y_bundle(grouped_dic)
        print()

        # 피처임포턴스 확인
        result_2 = first_try.feature_importance_for_groups(split_X_y_bundle2)
        features2 = result_2[1]
        drop_target_list2 = first_try.chose_drop_features(features2, draw=False)
        print()
        print(drop_target_list2)
        print()

        # 파이프라인 결과 확인
        print(first_try.print_report(split_X_y_bundle2))


첫 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s]

두 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s][A


COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.981
f1_score : 0.976
f1_score : 0.973
f1_score : 0.875

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       382
           1       0.96      0.97      0.97       315

    accuracy                           0.97       697
   macro avg       0.97      0.97      0.97       697
weighted avg       0.97      0.97      0.97       697

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.96      0.97       211
           1       0.97      0.99      0.98       268

    accuracy                           0.97       479
   macro avg       0.97      0.97   



두 번째 반복문:  50%|█████     | 1/2 [00:14<00:14, 14.45s/it][A

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        74
           1       1.00      0.78      0.88        18

    accuracy                           0.96        92
   macro avg       0.97      0.89      0.92        92
weighted avg       0.96      0.96      0.95        92

None

COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.981
f1_score : 0.976
f1_score : 0.973
f1_score : 0.875

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       382
           1       0.96      0.97      0.97       315

    accuracy                           0.97       697
   macro avg       0.97      0.



두 번째 반복문: 100%|██████████| 2/2 [00:27<00:00, 13.66s/it][A
두 번째 반복문: 100%|██████████| 2/2 [00:27<00:00, 13.78s/it]

첫 번째 반복문:  50%|█████     | 1/2 [00:27<00:27, 27.57s/it]

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        74
           1       1.00      0.78      0.88        18

    accuracy                           0.96        92
   macro avg       0.97      0.89      0.92        92
weighted avg       0.96      0.96      0.95        92

None




두 번째 반복문:   0%|          | 0/2 [00:00<?, ?it/s][A


COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.981
f1_score : 0.976
f1_score : 0.973
f1_score : 0.875

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       382
           1       0.96      0.97      0.97       315

    accuracy                           0.97       697
   macro avg       0.97      0.97      0.97       697
weighted avg       0.97      0.97      0.97       697

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.96      0.97       211
           1       0.97      0.99      0.98       268

    accuracy                           0.97       479
   macro avg       0.97      0.97   



두 번째 반복문:  50%|█████     | 1/2 [00:12<00:12, 12.89s/it][A

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        74
           1       1.00      0.78      0.88        18

    accuracy                           0.96        92
   macro avg       0.97      0.89      0.92        92
weighted avg       0.96      0.96      0.95        92

None

COMPONENT_ARBITRARY
dividing my df on 1
dividing my df on 2
dividing my df on 3
dividing my df on 4

f1_score : 0.981
f1_score : 0.976
f1_score : 0.973
f1_score : 0.875

['FOXID', 'FNOX', 'U100', 'V', 'FH2O', 'LI', 'AG', 'SOOTPERCENTAGE', 'H2O', 'U6', 'U20', 'FTBN', 'FSO4', 'CD', 'FOPTIMETHGLY', 'U75', 'BE', 'FUEL', 'U25', 'U50', 'U14', 'U4', 'CO', 'CR', 'V100', 'NI', 'TI']

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       382
           1       0.96      0.97      0.97       315

    accuracy                           0.97       697
   macro avg       0.97      0.



두 번째 반복문: 100%|██████████| 2/2 [00:25<00:00, 12.93s/it][A
두 번째 반복문: 100%|██████████| 2/2 [00:25<00:00, 12.93s/it]

첫 번째 반복문: 100%|██████████| 2/2 [00:53<00:00, 26.58s/it]
첫 번째 반복문: 100%|██████████| 2/2 [00:53<00:00, 26.73s/it]

PCA() 사용한 pipe line
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        74
           1       1.00      0.78      0.88        18

    accuracy                           0.96        92
   macro avg       0.97      0.89      0.92        92
weighted avg       0.96      0.96      0.95        92

None





In [21]:
X2.columns.values

array(['COMPONENT_ARBITRARY', 'ANONYMOUS_1', 'YEAR',
       'SAMPLE_TRANSFER_DAY', 'ANONYMOUS_2', 'AG', 'AL', 'B', 'BA', 'BE',
       'CA', 'CD', 'CO', 'CR', 'CU', 'FH2O', 'FNOX', 'FOPTIMETHGLY',
       'FOXID', 'FSO4', 'FTBN', 'FE', 'FUEL', 'H2O', 'K', 'LI', 'MG',
       'MN', 'MO', 'NA', 'NI', 'P', 'PB', 'PQINDEX', 'S', 'SB', 'SI',
       'SN', 'SOOTPERCENTAGE', 'TI', 'U100', 'U75', 'U50', 'U25', 'U20',
       'U14', 'U6', 'U4', 'V', 'V100', 'V40', 'ZN'], dtype=object)

In [22]:
test.columns.values

array(['ID', 'COMPONENT_ARBITRARY', 'ANONYMOUS_1', 'YEAR', 'ANONYMOUS_2',
       'AG', 'CO', 'CR', 'CU', 'FE', 'H2O', 'MN', 'MO', 'NI', 'PQINDEX',
       'TI', 'V', 'V40', 'ZN'], dtype=object)

In [23]:
drop_list = list(set(test.columns.values) - set(X2.columns.values))

In [24]:
test = test.drop(columns=drop_list)