<a href="https://colab.research.google.com/github/cswcjt/Fastcampus-ML/blob/main/imbalance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

-- Imbalanced data problem

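    - When the class proportions differ too much (highly imbalanced data), a model that simply always picks the dominant class scores a high accuracy, which makes it hard to judge model performance. In other words, even when accuracy is high, the recall of the class with few samples can drop sharply.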
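
A minimal sketch of that effect, assuming a hypothetical toy label set (95 majority vs 5 minority samples) and a "model" that always predicts the dominant class. The helper names `accuracy` and `recall` are made up for this illustration.

```python
# Hypothetical toy labels: 95 majority (0) and 5 minority (1) samples.
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the dominant class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the actual positives that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


print("accuracy:", accuracy(y_true, y_pred))
print("minority recall:", recall(y_true, y_pred))
```

Accuracy looks excellent here even though not a single minority sample is found, which is why the later cells report F1 rather than accuracy.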

-- Imbalanced-Learn methodology: how to address it

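    - Imbalanced data can be handled with under-sampling, which uses only part of the majority-class data, or over-sampling, which increases the minority-class data. Evening out the class ratio this way typically improves recall on the minority class, often at some cost in precision.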

    - Over-sampling

    - Under-sampling

    - Combined sampling (combining over- and under-sampling)

-- under sampling 

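    - RandomUnderSampler: simple sampling that removes majority-class data at random

    - TomekLinks: 
        - 1) A Tomek link is a pair of samples (𝑥+,𝑥−) from different classes such that no other sample is closer to either of them than they are to each other. 
        - 2) Find the Tomek links, then remove the member of each link that belongs to the majority class.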

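    - CondensedNearestNeighbour(CNN): 
        - The CNN (Condensed Nearest Neighbour) method keeps only the data that a 1-NN model fails to classify correctly. Let 𝑆 be the set of selected data.
        - 1) Put every minority-class sample into 𝑆.
        - 2) Pick one majority-class sample; if its nearest neighbour belongs to the majority class, leave it out, otherwise add it to 𝑆.
        - 3) Repeat step 2 until no more samples are selected.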

    - OneSidedSelection: 
        - TomekLinks + CNN 

    - EditedNearestNeighbours(ENN): 
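        - Removes a majority-class sample when its k (n_neighbors) nearest neighbours are not all (kind_sel="all") or not mostly (kind_sel="mode") from the majority class. As a result, majority-class samples that sit close to the minority class disappear.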

    - NeighbourhoodCleaningRule: 
        -  CNN + ENN
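
To make the neighbour-based rules above concrete, here is a small pure-Python sketch of Tomek-link detection and an ENN-style filter. It follows the descriptions in this section rather than imbalanced-learn's actual implementation; the helper names (`nearest`, `tomek_links`, `enn_remove`) and the toy points are made up for illustration.

```python
import math


def nearest(i, X, k=1):
    """Indices of the k points closest to X[i], excluding i itself."""
    others = [j for j in range(len(X)) if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def tomek_links(X, y):
    """Pairs (i, j) from different classes that are each other's nearest neighbour."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)[0]
        if i < j and y[i] != y[j] and nearest(j, X)[0] == i:
            links.append((i, j))
    return links


def enn_remove(X, y, majority, k=3, kind_sel="all"):
    """Indices of majority-class samples that the ENN-style rule would drop."""
    drop = []
    for i in range(len(X)):
        if y[i] != majority:
            continue
        same = sum(y[j] == majority for j in nearest(i, X, k))
        # "all": every neighbour must be majority; "mode": more than half must be
        keep = same == k if kind_sel == "all" else same > k / 2
        if not keep:
            drop.append(i)
    return drop


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.1), (1.0, 0.0), (1.1, 0.0), (3.0, 0.0), (3.2, 0.1)]
y = [0, 0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)
print("Tomek links:", links)
print("majority members to remove:", [i if y[i] == 0 else j for i, j in links])
print("ENN drops (all): ", enn_remove(X, y, majority=0, kind_sel="all"))
print("ENN drops (mode):", enn_remove(X, y, majority=0, kind_sel="mode"))
```

In this toy set the only majority point touching the minority cluster (index 3) is the one flagged by both the Tomek-link rule and the stricter `kind_sel="all"` setting, while `kind_sel="mode"` keeps it.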

    
-- over sampling

    - RandomOverSampler
        - Random over-sampling repeatedly adds minority-class samples, drawn with replacement. The effect is similar to increasing the weight of those samples. 

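    - ADASYN(Adaptive Synthetic Sampling Approach for Imbalanced Learning)
        - Creates synthetic minority-class samples on the straight line between a minority-class sample and one chosen at random from its k nearest minority-class neighbours. It generates more synthetic samples around minority points that have many majority-class neighbours, i.e. in the regions that are harder to learn.

    - SMOTE(Synthetic Minority Over-sampling Technique)
        - Generates samples by the same interpolation as ADASYN, but spreads them over the minority samples without weighting by local difficulty. The generated samples are always labelled as the minority class.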
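
The two over-sampling ideas above can be sketched in pure Python. This is only an illustration of duplication and of the interpolation step; it omits ADASYN's adaptive weighting and everything else imbalanced-learn does, assumes two classes, and the function names and toy points are hypothetical.

```python
import math
import random


def random_over_sample(X, y, minority, rng):
    """Duplicate minority samples (drawn with replacement) until both classes match."""
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = (len(y) - len(minority_idx)) - len(minority_idx)
    extra = rng.choices(minority_idx, k=n_needed)
    return X + [X[i] for i in extra], y + [minority] * n_needed


def smote_like(X, y, minority, n_new, k, rng):
    """Interpolate between a minority sample and one of its k nearest minority neighbours."""
    pts = [x for x, label in zip(X, y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        b = rng.randrange(len(pts))
        # k nearest minority neighbours of the base point, excluding the point itself
        neighbours = sorted((j for j in range(len(pts)) if j != b),
                            key=lambda j: math.dist(pts[b], pts[j]))[:k]
        o = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(p + gap * (q - p) for p, q in zip(pts[b], pts[o])))
    return synthetic


# Hypothetical toy data: five majority points (0) and three minority points (1).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2), (5, 5), (6, 5), (5, 6)]
y = [0, 0, 0, 0, 0, 1, 1, 1]

X_ros, y_ros = random_over_sample(X, y, minority=1, rng=random.Random(0))
print("after random over-sampling:", y_ros.count(0), "vs", y_ros.count(1))
print("synthetic points:", smote_like(X, y, minority=1, n_new=4, k=2, rng=random.Random(0)))
```

Random over-sampling only repeats existing points, whereas the interpolated points are new but always lie between existing minority samples.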

-- combined sampling

    - SMOTE+ENN
        - The SMOTE+ENN method combines SMOTE (Synthetic Minority Over-sampling Technique) with ENN (Edited Nearest Neighbours). 
    - SMOTE+Tomek
        - The SMOTE+Tomek method combines SMOTE (Synthetic Minority Over-sampling Technique) with the Tomek-link method.

-- Caveats

    - Monitor performance on both the train and test datasets. 
        - To gain insight into the impact of the method, it is a good idea to monitor the performance on both train and test datasets after oversampling and compare the results to the same algorithm on the original dataset.

    - Check whether the class distribution is heavily skewed. 
        - The increase in the number of examples for the minority class, especially if the class skew was severe, can also result in a marked increase in the computational cost when fitting the model, especially considering the model is seeing the same examples in the training dataset again and again.

    - When using a Pipeline
        - A traditional scikit-learn Pipeline cannot be used with a sampler step; instead, a Pipeline from the imbalanced-learn library can be used.
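
One reason to rely on the imbalanced-learn Pipeline is that it applies samplers only while fitting, so the data used for evaluation keeps its original class ratio. The sketch below imitates that behaviour by hand with the standard library only: split first, then over-sample the training part. The data and the helper `stratified_split` are hypothetical.

```python
import random
from collections import Counter

rng = random.Random(42)

# Hypothetical (feature, label) rows: 90 majority (0) and 10 minority (1) samples.
rows = [(rng.gauss(0, 1), 0) for _ in range(90)] + [(rng.gauss(2, 1), 1) for _ in range(10)]


def stratified_split(rows, test_fraction, rng):
    """Split rows into train/test while keeping each class's proportion."""
    train, test = [], []
    for label in sorted({lab for _, lab in rows}):
        group = [r for r in rows if r[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test += group[:n_test]
        train += group[n_test:]
    return train, test


# 1) Split first, so the held-out part is never touched by resampling.
train, test = stratified_split(rows, test_fraction=0.2, rng=rng)

# 2) Over-sample the minority class in the training part only.
minority = [r for r in train if r[1] == 1]
majority = [r for r in train if r[1] == 0]
train_balanced = majority + minority + rng.choices(minority, k=len(majority) - len(minority))

print("train before:    ", Counter(lab for _, lab in train))
print("train after:     ", Counter(lab for _, lab in train_balanced))
print("test (untouched):", Counter(lab for _, lab in test))
```

Resampling before the split would instead place duplicated or synthetic minority samples in the validation data, which tends to make validation scores look better than they are.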

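-- Skewness and kurtosis

    - Skewness: the degree of asymmetry of a distribution
        - Normal distribution = skewness 0
        - Mass concentrated on the left (long right tail) = skewness > 0
        - Mass concentrated on the right (long left tail) = skewness < 0
    - Kurtosis: how sharply peaked a probability distribution is
        - Normal distribution = kurtosis 0 (Pearson kurtosis = 3)
        - More sharply peaked than the normal = kurtosis > 0 (Pearson kurtosis > 3)
        - Flatter than the normal = kurtosis < 0 (Pearson kurtosis < 3) 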
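
As a quick check of the sign conventions above, the following sketch computes skewness and both kurtosis variants from population moments. The function name `skew_kurtosis` and the two toy samples are made up for illustration.

```python
def skew_kurtosis(values):
    """Return (skewness, excess kurtosis, Pearson kurtosis) from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    pearson = m4 / m2 ** 2
    return skewness, pearson - 3, pearson


symmetric = [1, 2, 3, 4, 5]        # evenly spread, no tail
right_tailed = [1, 1, 1, 2, 10]    # mass on the left, long right tail

for name, sample in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    skewness, excess, pearson = skew_kurtosis(sample)
    print(f"{name}: skewness={skewness:.3f}, excess kurtosis={excess:.3f}, Pearson={pearson:.3f}")
```

The symmetric sample gives a skewness of zero and a negative excess kurtosis (it is flatter than a normal distribution), while the right-tailed sample gives a positive skewness.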


 

Let's build the imbalance-handling functions that will go into preprocessing.

In [2]:
# the four standard data-analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from google.colab import drive
drive.mount('/content/drive')
base_path = "/content/drive/MyDrive/fastcamp/datas/open/"
train = pd.read_csv(base_path + "train.csv", encoding='cp949')
test = pd.read_csv(base_path + "test.csv", encoding='cp949')
submission = pd.read_csv(base_path + "sample_submission.csv", encoding='cp949')
print("Train shape : ", train.shape)
print("Test shape : ", test.shape)

Mounted at /content/drive
Train shape :  (14095, 54)
Test shape :  (6041, 19)


In [293]:
from collections import Counter # check the resampling results
from sklearn.model_selection import train_test_split as tts # train/test split
from sklearn.decomposition import PCA # dimensionality reduction
from sklearn.ensemble import RandomForestClassifier as RFC # model
from sklearn.metrics import f1_score # performance metric
from sklearn.metrics import classification_report # performance metric
from imblearn.under_sampling import * # imbalance handling
from imblearn.over_sampling import * # imbalance handling
from imblearn.combine import * # imbalance handling
from imblearn.pipeline import Pipeline # build the pipeline

In [208]:
base_path = "/content/drive/MyDrive/fastcamp/datas/open/"
train = pd.read_csv(base_path + "train.csv", encoding='cp949')
test = pd.read_csv(base_path + "test.csv", encoding='cp949')
submission = pd.read_csv(base_path + "sample_submission.csv", encoding='cp949')

class oil:
    def __init__(self):
        base_path = '/content/drive/MyDrive/fastcamp/datas/open/'
        self.load_path = base_path #+ 'data/'
        self.save_path = base_path #+ 'submission/'
        self.train = pd.read_csv(self.load_path + 'train.csv')
        self.test = pd.read_csv(self.load_path + 'test.csv')
        self.submission = pd.read_csv(self.load_path + 'sample_submission.csv')

        self.X = self.train.drop(columns=['ID', 'Y_LABEL'])  # use the instance's own data, not the global `train`
        #self.y = train.Y_LABEL
        #self.X_test = test.drop(columns=['ID'])
        #self.X2
    
    def reset_data(self, mode: str=None):
        if mode is None:
            self.train = pd.read_csv(self.load_path + 'train.csv')
            self.test = pd.read_csv(self.load_path + 'test.csv')
        elif mode == 'train':
            self.train = pd.read_csv(self.load_path + 'train.csv')
        elif mode == 'test':
            self.test = pd.read_csv(self.load_path + 'test.csv')

In [209]:
# build a DataFrame that holds every imbalance sampler (False pads the shorter columns)
sampling_method_info = pd.DataFrame(
    {"under_sampling" : [RandomUnderSampler(),
                        TomekLinks(), 
                        CondensedNearestNeighbour(), 
                        OneSidedSelection(), 
                        EditedNearestNeighbours(),
                        NeighbourhoodCleaningRule()],

    "over_sampling" : [RandomOverSampler(),
                       ADASYN(),
                       SMOTE(),
                       False,
                       False,
                       False],

    "hybrid_samping" : [SMOTEENN(),
                        SMOTETomek(),
                        False,
                        False,
                        False,
                        False,]
     }
)

In [212]:
# sampling_method_info["under_sampling"]

In [139]:
# splitting this into three separate lists would be more convenient
# under_sampling -> worth comparing against the case with no grouping at all
# try the rest as well

In [177]:
if isinstance(sampling_method_info, pd.DataFrame) : 
    print(1)

1


In [175]:
type(sampling_method_info)

pandas.core.frame.DataFrame

In [307]:
def make_X_y(df) : 
    """
    split a DataFrame into the feature matrix X and the target y (Y_LABEL)
    called when building the train and validation data sets
    return X, y
    """
    X = df.drop(columns = "Y_LABEL") # feature vectors
    y = df.Y_LABEL # target value
    return X, y

def grouping_df(train_df, categorical_feature) : 
    """
    split train_df into one DataFrame per unique value of categorical_feature
    return dict mapping each value (as str) to its group DataFrame
    """
    train_df = train_df.drop(columns = "ID")
    grouped_dic = {}
    for standard in train_df[categorical_feature].unique() : 
        print(f"dividing my df on {standard}")
        # keep the rows of this group and drop the grouping column itself
        grouped_dic[f"{standard}"] = train_df.loc[train_df[categorical_feature] == standard].drop(columns = categorical_feature)
    return grouped_dic

def call_sampling_method(data, sampling_method_info = sampling_method_info, one_of_columns: str = None) :
    """
    use the samplers listed in one column of sampling_method_info
    (one_of_columns must be one of its column names)
    DataFrame input: return (X, y) without resampling
    dict of group DataFrames: return {sampler name: (X_resampled, y_resampled)}
    """
    sampler_list = sampling_method_info[one_of_columns]
    if isinstance(data, pd.DataFrame) : 
        X, y = make_X_y(data)
        return (X, y)

    X_y_dic = {}
    for key, group in data.items() : 
        group = group.fillna(0) # replace NaN with 0 before resampling
        X, y = make_X_y(group)
        X_y_dic[f"{key}"] = (X, y)

    balanced_dic = {}
    for sampler in sampler_list : 
        if not hasattr(sampler, "fit_resample") : # skip the False placeholders
            continue
        for key, (X, y) in X_y_dic.items() : 
            # NOTE: the key is the sampler name only, so each group overwrites the
            # previous one and only the last group's result is kept per sampler
            balanced_dic[type(sampler).__name__] = sampler.fit_resample(X, y)

    return balanced_dic
  
def split_data(data) : 
    """
    split into train and validation data sets (90% / 10%)
    tuple (X, y): return X_train, X_val, y_train, y_val
    dict of (X, y) tuples: return dict of (X_train, X_val, y_train, y_val) tuples
    """
    if isinstance(data, tuple) : 
        X_train, X_val, y_train, y_val = tts(data[0], data[1], test_size=0.1, random_state=42)
        return X_train, X_val, y_train, y_val

    else : 
        tts_dic = {}
        for key, tuple_set in data.items() :
            (X_train, X_val, y_train, y_val) = tts(tuple_set[0], tuple_set[1], test_size=0.1, random_state=42)
            tts_dic[f"{key}"] = (X_train, X_val, y_train, y_val)
        return tts_dic

def random_forest_call(split_dic) :
    """
    fit a random forest on each entry of split_dic and report its validation F1 score
    split_dic maps a key to (X_train, X_val, y_train, y_val)
    return the classifier (as fitted on the last entry) and a dict of F1 scores
    """
    classifier_without_dic = RFC()  # load the Random Forest classifier
    f1_dic = {}
    for key, tuple_dic in split_dic.items() : 
        classifier_without_dic.fit(tuple_dic[0], tuple_dic[2]) # train on X_train, y_train
        pred = classifier_without_dic.predict(tuple_dic[1]) # predict on X_val
        f1 = f1_score(tuple_dic[3], pred)  # compute the F1 score against y_val
        f1_dic[f"{key}"] = f1
        #print(classification_report(tuple_dic[3], pred))
        print("f1_score : %.3f" % f1)

    return classifier_without_dic, f1_dic

def classifier_with_group(classifier, dic) : 
    """
    return a list with the (classifier, F1 dict) pair fitted for each group
    dic maps a group key to (X_train, X_val, y_train, y_val)
    classifier is currently unused: random_forest_call builds its own RandomForestClassifier
    """
    classifier_list = []
    for key, split in dic.items() : 
        # random_forest_call expects a dict of split tuples, so pass one group at a time
        classifier_list.append(random_forest_call({key: split}))
    return classifier_list

  
# def zero_importance_columns(df, feature_list, threshold: int = None) : 
#     """
#     return list of all feature_importance for each group 
#     """
#     zero_list = []
#     if threshold == None : 
#         for my_list in feature_list : 
#             df = my_list.reset_index()
#             df.columns = ["name", "value"]
#             zero_list.extend(df[df.value == 0].name.to_list())
#         #print(zero_list, len(zero_list))
#         return list(set(zero_list))

# def drop_to_total(columns_list) : 
#     updated_group = []
#     for group in group_name :
#         group = group.drop(columns = intersection)
#         updated_group.append(group)
  
#     total_split_dic = {}
#     for num, group in enumerate(updated_group) : 
#         X, y = make_X_y(group)
#         X_train, X_val, y_train, y_val = train_val(df = group, X_df = X, y_df = y)
#         total_split_dic[num] = X_train, X_val, y_train, y_val
#         #total_split_dic.keys()

#     return total_split_dic

In [277]:
# data load
train_df = oil().train
train_df

Unnamed: 0,ID,COMPONENT_ARBITRARY,ANONYMOUS_1,YEAR,SAMPLE_TRANSFER_DAY,ANONYMOUS_2,AG,AL,B,BA,...,U25,U20,U14,U6,U4,V,V100,V40,ZN,Y_LABEL
0,TRAIN_00000,COMPONENT3,1486,2011,7,200,0,3,93,0,...,,,,,,0,,154.0,75,0
1,TRAIN_00001,COMPONENT2,1350,2021,51,375,0,2,19,0,...,2.0,4.0,6.0,216.0,1454.0,0,,44.0,652,0
2,TRAIN_00002,COMPONENT2,2415,2015,2,200,0,110,1,1,...,0.0,3.0,39.0,11261.0,41081.0,0,,72.6,412,1
3,TRAIN_00003,COMPONENT3,7389,2010,2,200,0,8,3,0,...,,,,,,0,,133.3,7,0
4,TRAIN_00004,COMPONENT3,3954,2015,4,200,0,1,157,0,...,,,,,,0,,133.1,128,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14090,TRAIN_14090,COMPONENT3,1616,2014,8,200,0,2,201,1,...,,,,,,0,,135.4,16,0
14091,TRAIN_14091,COMPONENT1,2784,2013,2,200,0,3,85,0,...,,,,,,0,14.5,117.5,1408,0
14092,TRAIN_14092,COMPONENT3,1788,2008,9,550,0,6,0,1,...,,,,,,0,,54.0,1301,0
14093,TRAIN_14093,COMPONENT2,2498,2009,19,550,0,2,4,0,...,7.0,8.0,100.0,1625.0,18890.0,0,,44.3,652,0


In [278]:
# grouping
grouped_dic = grouping_df(train_df, "COMPONENT_ARBITRARY")
grouped_dic["COMPONENT3"]

dividing my df on COMPONENT3
dividing my df on COMPONENT2
dividing my df on COMPONENT1
dividing my df on COMPONENT4


Unnamed: 0,ANONYMOUS_1,YEAR,SAMPLE_TRANSFER_DAY,ANONYMOUS_2,AG,AL,B,BA,BE,CA,...,U25,U20,U14,U6,U4,V,V100,V40,ZN,Y_LABEL
0,1486,2011,7,200,0,3,93,0,0,3059,...,,,,,,0,,154.0,75,0
3,7389,2010,2,200,0,8,3,0,0,1960,...,,,,,,0,,133.3,7,0
4,3954,2015,4,200,0,1,157,0,0,71,...,,,,,,0,,133.1,128,0
5,2061,2008,4,550,0,3,8,0,0,2770,...,,,,,,0,,69.7,1015,0
6,1416,2015,7,616,0,0,21,0,0,130,...,,,,,,0,,148.5,24,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14082,3060,2010,2,200,0,2,3,0,0,2523,...,,,,,,0,,140.0,35,1
14086,1637,2008,16,487,0,11,99,2,0,22,...,,,,,,0,,152.6,21,1
14088,1311,2010,6,511,0,0,20,0,0,227,...,,,,,,0,,128.9,20,0
14090,1616,2014,8,200,0,2,201,1,0,6,...,,,,,,0,,135.4,16,0


In [None]:
# imbalance problem
balanced_grouped_dic = call_sampling_method(grouped_dic, one_of_columns = "hybrid_samping")
balanced_grouped_dic.keys()

In [280]:
# split
# X_train = 0, X_val = 1, y_train = 2, y_val = 3
splited_balanced_grouped_dic = split_data(balanced_grouped_dic)
splited_balanced_grouped_dic.keys()

dict_keys(['SMOTEENN', 'SMOTETomek'])

In [289]:
splited_balanced_grouped_dic["SMOTEENN"][0]
splited_balanced_grouped_dic["SMOTEENN"][1]
splited_balanced_grouped_dic["SMOTEENN"][2]
splited_balanced_grouped_dic["SMOTEENN"][3]

101     0
260     0
1083    1
109     0
649     1
       ..
1192    1
746     1
620     1
275     0
1079    1
Name: Y_LABEL, Length: 121, dtype: int64

In [309]:
a, b = random_forest_call(splited_balanced_grouped_dic)

f1_score : 0.976
f1_score : 0.982


In [308]:
print(a)
print(b)

RandomForestClassifier()
{'SMOTEENN': 'f1_score : 0.991869918699187.3f', 'SMOTETomek': 'f1_score : 0.9824561403508771.3f'}


In [None]:
# pipeline
### it would be handy to convert classification_report from str to a DataFrame
def pipine(sampling_method = None, dimensionality = None, model = None) : 
    # steps: sampling method -> dimensionality reduction -> model
    pipeline = Pipeline([('sampling_method', sampling_method), ('dimensionality', dimensionality), ('model', model)])
    # NOTE: X_samp and y_samp are read from the global scope and must be defined before calling
    X_train, X_test, y_train, y_test = tts(X_samp, y_samp, random_state=42)
    pipeline.fit(X_train, y_train) 
    y_hat = pipeline.predict(X_test)
    report = classification_report(y_test, y_hat)
    print(report)
    return report

In [None]:
pca = PCA()
rfc = RFC()
rmu = RandomUnderSampler()
pipine(rmu, pca, rfc)