# Handling Imbalance in the Target (Class or Label) Variable

 - Context: in real-world data, some imbalance in the target is almost unavoidable.
 - What can go wrong when a model is trained on imbalanced data?
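
One common symptom can be sketched with made-up numbers: a model that always predicts the majority class looks accurate while never detecting the minority class. The 95/5 split and the always-majority predictor below are illustrative assumptions, not data from this notebook.

```python
# Hypothetical labels: 95 majority (0) and 5 minority (1) samples
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of true 1s that were predicted as 1
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / y_true.count(1)

print(f"accuracy = {accuracy:.2f}")                # 0.95
print(f"minority recall = {minority_recall:.2f}")  # 0.00
```

Accuracy alone hides the failure, which is why the minority class needs separate attention.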

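# Ways to Address Target Imbalance
- Resampling: adjust the number of samples so that the target classes are balanced (the minority class is the class of interest)
   - OverSampling (adds minority-class samples; duplicated data can lead to overfitting) vs. UnderSampling (removes majority-class samples; loses information)
- Use ML algorithms:
   - Bagging algorithms built on bootstrap samples (RandomForest)
   - Gradient boosting algorithms (LightGBM, XGBClassifier)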
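
The trade-off between the two resampling directions can be sketched with the standard library alone. This is a minimal illustration of random over- and undersampling on hypothetical toy rows, not the method used later in this notebook.

```python
import random
from collections import Counter

random.seed(27)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: (feature, label) pairs with an 8:2 imbalance
data = [(i, 0) for i in range(8)] + [(i, 1) for i in range(8, 10)]
majority = [row for row in data if row[1] == 0]
minority = [row for row in data if row[1] == 1]

# Random oversampling: draw minority rows WITH replacement until the
# classes match. The new rows are exact duplicates, hence the overfitting risk.
extra = random.choices(minority, k=len(majority) - len(minority))
oversampled = majority + minority + extra

# Random undersampling: keep a random subset of the majority class WITHOUT
# replacement. The dropped rows are the information that is lost.
undersampled = random.sample(majority, k=len(minority)) + minority

print(Counter(label for _, label in oversampled))   # both classes have 8 rows
print(Counter(label for _, label in undersampled))  # both classes have 2 rows
```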
   

## OverSampling
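 - SMOTE (Synthetic Minority Over-sampling Technique)
    - https://incodom.kr/SMOTE

 - ADASYN (Adaptive Synthetic Sampling)
    - https://givitallugot.github.io/articles/2021-07/Python-imbalanced-sampling-copy

 - SMOTE and ADASYN both use the KNN algorithm to find the neighbors they interpolate between.
    - The difference is that ADASYN weights each minority sample by the number of majority-class samples among its K nearest neighbors and generates more interpolated samples where that weight is higher.
    - If ADASYN fails with a Zero Division Error, use SMOTE instead.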
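
The two ideas above can be sketched in plain Python. This is a simplified illustration under stated assumptions (2-D toy points, Euclidean distance, hypothetical helper names `smote_point` and `adasyn_weights`); it is not the imbalanced-learn implementation.

```python
import math
import random

def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c))[:k]

def smote_point(x, minority, k, rng):
    """SMOTE step: interpolate between x and one of its k minority neighbors."""
    others = [m for m in minority if m != x]
    neighbor = rng.choice(k_nearest(x, others, k))
    lam = rng.random()  # random position on the segment from x to neighbor
    return tuple(a + lam * (b - a) for a, b in zip(x, neighbor))

def adasyn_weights(minority, majority, k):
    """ADASYN step: weight each minority point by the share of majority
    points among its k nearest neighbors, then normalize to sum to 1."""
    everything = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for x in minority:
        others = [(p, lab) for p, lab in everything if p != x]
        nearest = sorted(others, key=lambda pl: math.dist(x, pl[0]))[:k]
        ratios.append(sum(lab == 0 for _, lab in nearest) / k)
    total = sum(ratios)
    # total == 0 means no minority point has a majority neighbor; normalizing
    # would divide by zero, which is the case where SMOTE is the fallback
    if total == 0:
        raise ZeroDivisionError("no majority-class neighbors; use SMOTE instead")
    return [r / total for r in ratios]

# Hypothetical 2-D toy points
minority = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.0)]
majority = [(3.1, 3.2), (2.9, 3.1), (3.3, 2.8), (5.0, 5.0)]

rng = random.Random(27)
new_point = smote_point(minority[0], minority, k=2, rng=rng)
weights = adasyn_weights(minority, majority, k=3)
print(new_point)  # a point on the segment between (1.0, 1.0) and a neighbor
print(weights)    # the point surrounded by majority samples gets the largest weight
```

In this sketch the weights decide how many synthetic points each minority sample receives, so ADASYN concentrates new samples near the class boundary while SMOTE treats all minority samples alike.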

## UnderSampling
- Tomek Links UnderSampling
    - The key step is finding the Tomek links: pairs of samples from different classes that are each other's nearest neighbor
    - https://givitallugot.github.io/articles/2021-07/Python-imbalanced-sampling-copy
- ENN(Edited Nearest Neighbors)
    - https://datascienceschool.net/03%20machine%20learning/14.02%20%EB%B9%84%EB%8C%80%EC%B9%AD%20%EB%8D%B0%EC%9D%B4%ED%84%B0%20%EB%AC%B8%EC%A0%9C.html
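
Finding Tomek links can be sketched as a mutual-nearest-neighbor check. The helper names and the 1-D toy data below are hypothetical, and the brute-force search is only meant to show the definition, not an efficient implementation.

```python
import math

def nearest_index(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_index(i, points)
        if i < j and labels[i] != labels[j] and nearest_index(j, points) == i:
            links.append((i, j))
    return links

# Hypothetical 1-D toy data: label 0 is the majority class
points = [(0.0,), (1.0,), (2.0,), (2.9,), (3.0,), (6.0,)]
labels = [0, 0, 0, 0, 1, 1]

links = tomek_links(points, labels)
print(links)  # [(3, 4)]

# Tomek-link undersampling drops the majority-class member of each link
majority_label = 0
to_drop = {i for pair in links for i in pair if labels[i] == majority_label}
kept = [(p, l) for idx, (p, l) in enumerate(zip(points, labels)) if idx not in to_drop]
print(kept)
```

Only the majority sample sitting right at the class boundary is removed, so far less information is lost than with random undersampling.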

## Practice
### Basic Sampling Methods

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

In [2]:
iris = load_iris()
df = pd.DataFrame(data=iris["data"], columns=iris["feature_names"])
df["label"] = iris["target"]
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [3]:
# Simple random sampling: draw 10 rows at random
df.sample(n=10)

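| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| 20 | 5.4 | 3.4 | 1.7 | 0.2 | 0 |
| 96 | 5.7 | 2.9 | 4.2 | 1.3 | 1 |
| 57 | 4.9 | 2.4 | 3.3 | 1.0 | 1 |
| 36 | 5.5 | 3.5 | 1.3 | 0.2 | 0 |
| 113 | 5.7 | 2.5 | 5.0 | 2.0 | 2 |
| 115 | 6.4 | 3.2 | 5.3 | 2.3 | 2 |
| 71 | 6.1 | 2.8 | 4.0 | 1.3 | 1 |
| 62 | 6.0 | 2.2 | 4.0 | 1.0 | 1 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | 2 |
| 24 | 4.8 | 3.4 | 1.9 | 0.2 | 0 |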


In [4]:
# stratified random sampling

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=27)

In [5]:
train["label"].value_counts()

label
0    40
2    40
1    40
Name: count, dtype: int64

In [6]:
test["label"].value_counts()

label
0    10
2    10
1    10
Name: count, dtype: int64

In [8]:
df["petal length (cm)"].describe()

count    150.000000
mean       3.758000
std        1.765298
min        1.000000
25%        1.600000
50%        4.350000
75%        5.100000
max        6.900000
Name: petal length (cm), dtype: float64

In [9]:
# Use the five-number summary of petal length as bin edges (computed once)
stats = df["petal length (cm)"].describe()
num01, num02, num03, num04, num05 = stats[["min", "25%", "50%", "75%", "max"]]

In [10]:
df["cluster"] = pd.cut(df["petal length (cm)"],
                bins=[num01, num02, num03, num04, num05],
                labels=["cluster1", "cluster2", "cluster3", "cluster4"])

df["cluster"].value_counts()

cluster
cluster1    43
cluster3    41
cluster4    34
cluster2    31
Name: count, dtype: int64

In [11]:
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
label                0
cluster              1
dtype: int64
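
The single missing `cluster` value comes from how `pd.cut` treats bin edges: intervals are right-closed by default, so a value equal to the lowest edge (here the minimum petal length) falls outside the first bin. The sketch below uses made-up values and shows `include_lowest=True` as one way to keep such a row. Applying it to the cell above would change the counts and shapes shown in the rest of this notebook.

```python
import pandas as pd

# Hypothetical values; 1.0 equals the lowest bin edge
values = pd.Series([1.0, 1.5, 2.0, 3.0])
edges = [1.0, 2.0, 3.0]

# Default: bins are (1.0, 2.0] and (2.0, 3.0], so 1.0 is left unbinned (NaN)
default_bins = pd.cut(values, bins=edges, labels=["low", "high"])

# include_lowest=True closes the first interval on the left as well
inclusive_bins = pd.cut(values, bins=edges, labels=["low", "high"], include_lowest=True)

print(default_bins.isna().sum())    # 1
print(inclusive_bins.isna().sum())  # 0
```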

In [13]:
# Drop the one row that received no cluster label
df.dropna(inplace=True)

In [14]:
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
label                0
cluster              0
dtype: int64

In [15]:
# Cluster sampling: pick one cluster at random, then keep every row in it
cluster = np.random.choice(df["cluster"].unique())
cluster

'cluster4'

In [16]:
sample = df.loc[df["cluster"] == cluster]
sample

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label | cluster |
|---|---|---|---|---|---|---|
| 100 | 6.3 | 3.3 | 6.0 | 2.5 | 2 | cluster4 |
| 102 | 7.1 | 3.0 | 5.9 | 2.1 | 2 | cluster4 |
| 103 | 6.3 | 2.9 | 5.6 | 1.8 | 2 | cluster4 |
| 104 | 6.5 | 3.0 | 5.8 | 2.2 | 2 | cluster4 |
| 105 | 7.6 | 3.0 | 6.6 | 2.1 | 2 | cluster4 |
| 107 | 7.3 | 2.9 | 6.3 | 1.8 | 2 | cluster4 |
| 108 | 6.7 | 2.5 | 5.8 | 1.8 | 2 | cluster4 |
| 109 | 7.2 | 3.6 | 6.1 | 2.5 | 2 | cluster4 |
| 111 | 6.4 | 2.7 | 5.3 | 1.9 | 2 | cluster4 |
| 112 | 6.8 | 3.0 | 5.5 | 2.1 | 2 | cluster4 |


### OverSampling: SMOTE, ADASYN

In [17]:
df_smote = df.copy()

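In [20]:
# The cluster column was only needed for the cluster-sampling example
df_smote.drop(["cluster"], axis=1, inplace=True)

In [21]:
df_smote.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |

In [24]:
# Binary target: label 0 vs. the other two labels, which makes the classes imbalanced
df_smote["target"] = np.where(df_smote["label"] == 0, 0, 1)

In [25]:
df_smote.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
label                0
target               0
dtype: int64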


In [26]:
from sklearn.model_selection import train_test_split

X = df_smote.drop(["label", "target"], axis=1)
y = df_smote["target"]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=27)
x_train.shape, x_test.shape

((119, 4), (30, 4))

In [27]:
y_train.value_counts()

target
1    80
0    39
Name: count, dtype: int64

In [30]:
# SMOTE, ADASYN Lib.

from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN


In [31]:
smote = SMOTE(random_state=27, k_neighbors=3)

In [32]:
x_train_over, y_train_over = smote.fit_resample(x_train, y_train)

In [33]:
y_train_over.value_counts()

target
1    80
0    80
Name: count, dtype: int64

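In [1]:
# ADASYN
# The In [1] counter suggests this cell ran after a kernel restart, so the earlier
# import was gone and the call raised "NameError: name 'ADASYN' is not defined".
# Import it here, and re-run the cells above so that x_train and y_train exist.
from imblearn.over_sampling import ADASYN

# If ADASYN fails because no minority sample has majority-class neighbors
# (possible here, since label 0 is well separated), use SMOTE as noted above.
adasyn = ADASYN(random_state=27, n_neighbors=3)
x_train_over, y_train_over = adasyn.fit_resample(x_train, y_train)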