# Handling Imbalance in the Target (Class or Label) Variable

- Context: in real-world data, some degree of imbalance in the target is practically unavoidable
- What can happen when a model is trained on imbalanced data?
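One common answer to the question above is that accuracy becomes misleading. The sketch below uses made-up labels (95 negatives, 5 positives, an assumption for illustration only) and a "model" that always predicts the majority class.

```python
from collections import Counter

# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: true positives / actual positives.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_minority = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, minority recall={recall_minority:.2f}")
```

Accuracy looks high even though not a single minority sample is detected, which is why recall, precision, or F1 on the minority class are usually checked as well.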

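# Ways to address target imbalance (techniques for improving generalization)
- Resampling: adjust the number of samples so that the target classes are balanced (the minority class is the focus)
    - OverSampling (adds minority-class samples; duplicated data can cause overfitting) vs UnderSampling (removes majority-class samples; information is lost)
- Using ML algorithms:
    - Bagging algorithms, which train on bootstrap samples (RandomForest)
    - Gradient boosting algorithms (LightGBM, XGBClassifier)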
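The simplest forms of the two resampling strategies can be sketched with the standard library alone. The helper names `random_oversample` and `random_undersample` are hypothetical and the toy data is invented; this is only meant to show the trade-off (duplicates vs. discarded rows), not how any particular library implements it.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement -> duplicates
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

def random_undersample(rows, labels, seed=0):
    """Keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        kept = rng.sample(pool, k=target)  # sampling without replacement -> rows are discarded
        out_rows.extend(kept)
        out_labels.extend([cls] * target)
    return out_rows, out_labels

# Toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

x_over, y_over = random_oversample(rows, labels)
x_under, y_under = random_undersample(rows, labels)
print(Counter(y_over), Counter(y_under))
```

Oversampling ends with 16 rows, 6 of which are copies of the two minority rows; undersampling ends with 4 rows and throws away 6 majority rows.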


## OverSampling
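- SMOTE (Synthetic Minority Over-sampling Technique)
    - https://incodom.kr/SMOTE
- ADASYN (Adaptive Synthetic Sampling)
    - https://givitallugot.github.io/articles/2021-07/Python-imbalanced-sampling-copy
    - SMOTE and ADASYN both rely on the KNN algorithm
        - The difference is that ADASYN weights each minority sample by the number of majority-class samples among its K nearest neighbors, and interpolates more new samples around the points with higher weights.
        - If ADASYN fails with a division-by-zero error, use SMOTE instead.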
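The KNN-based interpolation that SMOTE relies on can be sketched as follows. This is a simplified illustration rather than the imbalanced-learn implementation; `smote_like` and the toy points are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random position in [0, 1) along the segment
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority class with two features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_like(minority, n_new=5)
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies instead of being exact duplicates.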

## UnderSampling
- Tomek Links UnderSampling
    - The key step is finding the Tomek links: pairs of points from opposite classes that are each other's nearest neighbor
    - https://givitallugot.github.io/articles/2021-07/Python-imbalanced-sampling-copy
- ENN(Edited Nearest Neighbors)
    - https://datascienceschool.net/03%20machine%20learning/14.02%20%EB%B9%84%EB%8C%80%EC%B9%AD%20%EB%8D%B0%EC%9D%B4%ED%84%B0%20%EB%AC%B8%EC%A0%9C.html
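Both undersampling rules can be sketched on a one-dimensional toy set. The helper names are hypothetical, and the ENN rule here uses a plain majority vote among the k neighbors, which is a simplifying assumption (library implementations offer other selection rules).

```python
import math
from collections import Counter

def nearest_indices(X, i, k):
    """Indices of the k points closest to X[i], excluding i itself."""
    order = sorted((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    return order[:k]

def tomek_link_majority_indices(X, y, majority):
    """Majority-class indices that form a Tomek link with a minority point."""
    to_drop = set()
    for i in range(len(X)):
        j = nearest_indices(X, i, 1)[0]
        mutual = nearest_indices(X, j, 1)[0] == i
        if mutual and y[i] != y[j]:
            to_drop.add(i if y[i] == majority else j)
    return sorted(to_drop)

def enn_majority_indices(X, y, majority, k=3):
    """Majority-class indices whose k nearest neighbors mostly belong to another class."""
    to_drop = []
    for i in range(len(X)):
        if y[i] != majority:
            continue
        votes = Counter(y[j] for j in nearest_indices(X, i, k))
        if votes.most_common(1)[0][0] != majority:
            to_drop.append(i)
    return to_drop

# Toy data: class 0 is the majority; the point at 2.9 sits right next to the minority cluster.
X = [[0.0], [0.4], [0.8], [1.2], [2.9], [3.0], [3.3], [3.6]]
y = [0, 0, 0, 0, 0, 1, 1, 1]

tomek_drop = tomek_link_majority_indices(X, y, majority=0)
enn_drop = enn_majority_indices(X, y, majority=0, k=3)
print(tomek_drop, enn_drop)
```

In this toy set both rules flag the same majority point (index 4, the value 2.9), since it is the only majority sample sitting on the class boundary.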

# Practice

## Basic Sampling Method

In [30]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

In [31]:
iris = load_iris()
df = pd.DataFrame(data=iris["data"], columns=iris["feature_names"])
df["label"] = iris["target"]
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [32]:
# Randomly draw 10 rows from the dataset
df.sample(n=10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
106,4.9,2.5,4.5,1.7,2
44,5.1,3.8,1.9,0.4,0
18,5.7,3.8,1.7,0.3,0
103,6.3,2.9,5.6,1.8,2
69,5.6,2.5,3.9,1.1,1
27,5.2,3.5,1.5,0.2,0
94,5.6,2.7,4.2,1.3,1
66,5.6,3.0,4.5,1.5,1
86,6.7,3.1,4.7,1.5,1
124,6.7,3.3,5.7,2.1,2


In [33]:
#stratified random sampling
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=47)

In [34]:
train["label"].value_counts()

label
2    40
1    40
0    40
Name: count, dtype: int64

In [35]:
test["label"].value_counts()

label
0    10
1    10
2    10
Name: count, dtype: int64
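What `stratify` does can be approximated in a few lines: split each class separately, then merge. The function `stratified_split` is a hypothetical stand-in written for illustration, and its rounding rule is an assumption; scikit-learn's own allocation logic is more careful.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) where every class is split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Same class layout as iris: 50 samples for each of 3 labels.
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Each class contributes 40 rows to the training set and 10 to the test set, matching the `value_counts()` output above.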

In [36]:
# cluster random sampling
df["petal length (cm)"].describe()

count    150.000000
mean       3.758000
std        1.765298
min        1.000000
25%        1.600000
50%        4.350000
75%        5.100000
max        6.900000
Name: petal length (cm), dtype: float64

In [37]:
# Use the quartile boundaries of petal length as bin edges (describe() is computed once)
stats = df["petal length (cm)"].describe()
num01, num02, num03, num04, num05 = stats[["min", "25%", "50%", "75%", "max"]]

In [38]:
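# pd.cut builds right-closed bins (left, right] by default, so the row equal to the
# minimum (num01) falls outside the first bin and becomes NaN; include_lowest=True would keep it.
df["cluster"] = pd.cut(df["petal length (cm)"],
                       bins=[num01, num02, num03, num04, num05],
                       labels=["cluster1", "cluster2", "cluster3", "cluster4"])

df["cluster"].value_counts()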

cluster
cluster1    43
cluster3    41
cluster4    34
cluster2    31
Name: count, dtype: int64

In [39]:
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
label                0
cluster              1
dtype: int64
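The single missing `cluster` value comes from the binning rule: `pd.cut` uses right-closed intervals by default, so a value equal to the lowest edge belongs to no bin unless `include_lowest=True` is passed. The sketch below imitates that rule with the standard library; `cut_right_closed` is a hypothetical helper, not pandas code, and the edges are the quartiles printed by `describe()` above.

```python
from bisect import bisect_left

def cut_right_closed(value, edges, include_lowest=False):
    """Return the index of the (left, right] bin containing value, or None if it is outside."""
    if include_lowest and value == edges[0]:
        return 0  # the first bin becomes [left, right]
    if value <= edges[0] or value > edges[-1]:
        return None
    # First edge >= value is the right edge of the bin; its index minus 1 is the bin index.
    return bisect_left(edges, value) - 1

# min, 25%, 50%, 75%, max of petal length
edges = [1.0, 1.6, 4.35, 5.1, 6.9]

print(cut_right_closed(1.0, edges))                       # minimum value, default rule
print(cut_right_closed(1.0, edges, include_lowest=True))  # minimum value kept in the first bin
print(cut_right_closed(1.6, edges), cut_right_closed(6.9, edges))
```

With the default rule the minimum value gets no bin (`None` here, `NaN` in pandas), so an alternative to dropping that row is to pass `include_lowest=True` to `pd.cut`.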

In [41]:
# Drop the single row that was left without a cluster
df.dropna(inplace=True)

In [42]:
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
label                0
cluster              0
dtype: int64

In [43]:
cluster = np.random.choice(df["cluster"].unique())
cluster

'cluster2'

In [44]:
sample = df.loc[df["cluster"] == cluster]
sample

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label,cluster
5,5.4,3.9,1.7,0.4,0,cluster2
18,5.7,3.8,1.7,0.3,0,cluster2
20,5.4,3.4,1.7,0.2,0,cluster2
23,5.1,3.3,1.7,0.5,0,cluster2
24,4.8,3.4,1.9,0.2,0,cluster2
44,5.1,3.8,1.9,0.4,0,cluster2
53,5.5,2.3,4.0,1.3,1,cluster2
57,4.9,2.4,3.3,1.0,1,cluster2
59,5.2,2.7,3.9,1.4,1,cluster2
60,5.0,2.0,3.5,1.0,1,cluster2


## OverSampling : SMOTE, ADASYN

In [53]:
df_smote = df.copy()

In [54]:
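# Build an imbalanced binary target: setosa (label 0) stays 0, the other two species become 1
df_smote["target"] = np.where(df_smote["label"] == 0, 0, 1)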

In [55]:
df_smote.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
label                0
cluster              0
target               0
dtype: int64

In [56]:
df_smote.drop(["cluster"], axis=1, inplace=True)

In [57]:
from sklearn.model_selection import train_test_split

X = df_smote.drop(["label", "target"], axis=1)
y = df_smote["target"]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
x_train.shape, x_test.shape

((119, 4), (30, 4))

In [58]:
y_train.value_counts()

target
1    80
0    39
Name: count, dtype: int64

In [61]:
# SMOTE and ADASYN come from the imbalanced-learn package: (terminal) pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

In [62]:
smote = SMOTE(random_state=47, k_neighbors=3)

In [63]:
x_train_over, y_train_over = smote.fit_resample(x_train, y_train)

In [64]:
y_train_over.value_counts()

target
1    80
0    80
Name: count, dtype: int64

In [65]:
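# ADASYN: raises the error below on this data, most likely because class 0 (setosa) is
# well separated, so no minority sample has a majority-class point among its neighbors
adasyn = ADASYN(random_state=47, n_neighbors=3)
x_train_over, y_train_over = adasyn.fit_resample(x_train, y_train)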

RuntimeError: Not any neigbours belong to the majority class. This case will induce a NaN case with a division by zero. ADASYN is not suited for this specific dataset. Use SMOTE instead.
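The error message points at ADASYN's weighting step. In the usual description of the algorithm, each minority point gets a ratio r_i = (majority points among its k nearest neighbors) / k, and the ratios are normalized by their sum to decide how many synthetic samples each point receives. If every r_i is 0, that sum is 0 and the normalization divides by zero. The sketch below reproduces this with hypothetical helpers and toy data; it raises `ZeroDivisionError` for illustration, whereas imbalanced-learn raises the `RuntimeError` shown above.

```python
import math

def adasyn_ratios(X, y, minority, k=3):
    """r_i = (majority points among the k nearest neighbors of minority point i) / k."""
    ratios = []
    for i, label in enumerate(y):
        if label != minority:
            continue
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(X[i], X[j]))
        ratios.append(sum(y[j] != minority for j in order[:k]) / k)
    return ratios

def samples_per_point(ratios, n_new):
    """Distribute n_new synthetic samples in proportion to the normalized ratios."""
    total = sum(ratios)
    if total == 0:
        raise ZeroDivisionError("no minority point has a majority-class neighbor")
    return [round(r / total * n_new) for r in ratios]

# Overlapping classes: minority points (class 1) have majority neighbors.
X_mixed = [[0.0], [0.1], [1.0], [0.9], [1.1], [1.2], [2.0], [2.1]]
y_mixed = [1, 1, 1, 0, 0, 0, 0, 0]
mixed_ratios = adasyn_ratios(X_mixed, y_mixed, minority=1)
mixed_alloc = samples_per_point(mixed_ratios, n_new=5)
print(mixed_ratios, mixed_alloc)

# Well-separated classes: every minority point (class 0) only has minority neighbors.
X_sep = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3], [5.4], [5.5]]
y_sep = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
sep_ratios = adasyn_ratios(X_sep, y_sep, minority=0)
try:
    samples_per_point(sep_ratios, n_new=2)
except ZeroDivisionError as err:
    print("ADASYN-style weighting failed:", err)
```

In the overlapping case the minority point closest to the majority class receives most of the new samples; in the separated case there is nothing to weight, which is the situation where SMOTE remains usable because it does not depend on majority-class neighbors.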