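## Imbalanced Data
Data in which the number of instances belonging to one class (the minority class) is markedly smaller than the number belonging to another class (the majority class).  
Traditional methods that minimize overall error tend to classify most observations into the majority class when applied to such data.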
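
To make the majority-class bias concrete, here is a minimal sketch with hypothetical counts (990 majority and 10 minority cases, not taken from any dataset in this notebook). A rule that always predicts the majority class scores high on accuracy while never detecting a minority case.

```python
# Hypothetical labels: 990 majority cases (0) and 10 minority cases (1)
y_true = [0] * 990 + [1] * 10

# A degenerate "classifier" that always predicts the majority class
y_pred = [0] * len(y_true)

# Overall accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: share of true minority cases that were detected
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.3f}")        # accuracy = 0.990
print(f"minority recall = {recall:.3f}")   # minority recall = 0.000
```

This is why accuracy alone is a poor yardstick for imbalanced problems; recall, precision, or similar class-specific measures are usually reported alongside it.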

<br>

### Methods for handling imbalanced data
- Resampling techniques: over-sampling, under-sampling
- Cost-sensitive learning
- Ensemble methods: bagging or boosting
- Thresholding
- Generative adversarial networks (GANs)
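
Two of the approaches listed above, cost-sensitive learning and thresholding, can be sketched in plain Python. The label counts, the scores, and the helper name `balanced_class_weights` are hypothetical. The weight formula follows the "balanced" heuristic that scikit-learn documents for its `class_weight` option, `n_samples / (n_classes * class_count)`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Cost-sensitive learning: errors on the rare class get a larger weight
y = [0] * 990 + [1] * 10
weights = balanced_class_weights(y)
print(weights[1])  # 50.0

# Thresholding: hypothetical predicted probabilities of the minority class
scores = [0.05, 0.20, 0.35, 0.60]
default_pred = [int(s >= 0.5) for s in scores]  # usual 0.5 cut-off
lowered_pred = [int(s >= 0.3) for s in scores]  # lower cut-off flags more cases
print(default_pred)  # [0, 0, 0, 1]
print(lowered_pred)  # [0, 0, 1, 1]
```

Neither approach changes the data itself, which distinguishes them from the resampling techniques covered next.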

<br>

### Handling imbalanced data with resampling techniques: imblearn
https://imbalanced-learn.org/stable/introduction.html

In [1]:
!pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.10.1-py3-none-any.whl (226 kB)
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.10.1 imblearn-0.0





## Over-sampling

```python
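# XXX is a placeholder for a sampler class, e.g. RandomOverSampler, SMOTE, or ADASYN
from imblearn.over_sampling import XXX
# options stands for keyword arguments such as random_state or sampling_strategy
XX, yy = XXX(options).fit_resample(X, y)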
```

<br>

- X: two-dimensional data
    - 2-D list
    - 2-D numpy.ndarray
    - pandas.DataFrame
- y: one-dimensional data
    - 1-D numpy.ndarray
    - pandas.Series

## Combination of over- and under-sampling
```python
# XXX is a placeholder for a combined sampler class: SMOTEENN or SMOTETomek
from imblearn.combine import XXX
XX, yy = XXX(options).fit_resample(X, y)
```

<br>

### Methods
- SMOTEENN
- SMOTETomek

## Credit Card Fraud Detection
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

- Transactions made with credit cards by European cardholders in September 2013.
- 492 frauds out of 284,807 transactions (about 0.17%), so the classes are highly imbalanced.
- Features V1, V2, ..., V28: the result of a PCA transformation
    - The original features and further background information are withheld for confidentiality reasons.
- Time, Amount: the only features left untransformed
    - Time: the seconds elapsed between each transaction and the first transaction in the dataset
    - Amount: the transaction amount; it can be used, for example, in example-dependent cost-sensitive learning
- Class: the response variable; 1 for fraud and 0 otherwise

In [23]:
import pandas as pd

카드자료 = pd.read_csv('creditcard.csv')
카드자료.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [24]:
카드자료['Class'] = 카드자료['Class'].astype('category')
for 변수명 in list(카드자료.columns[:-1]):
    카드자료[변수명] = 카드자료[변수명].astype('float32')
카드자료.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype   
---  ------  --------------   -----   
 0   Time    284807 non-null  float32 
 1   V1      284807 non-null  float32 
 2   V2      284807 non-null  float32 
 3   V3      284807 non-null  float32 
 4   V4      284807 non-null  float32 
 5   V5      284807 non-null  float32 
 6   V6      284807 non-null  float32 
 7   V7      284807 non-null  float32 
 8   V8      284807 non-null  float32 
 9   V9      284807 non-null  float32 
 10  V10     284807 non-null  float32 
 11  V11     284807 non-null  float32 
 12  V12     284807 non-null  float32 
 13  V13     284807 non-null  float32 
 14  V14     284807 non-null  float32 
 15  V15     284807 non-null  float32 
 16  V16     284807 non-null  float32 
 17  V17     284807 non-null  float32 
 18  V18     284807 non-null  float32 
 19  V19     284807 non-null  float32 
 20  V20     284807 non-null  f

In [25]:
카드자료.to_parquet('creditcard.parquet')
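
The float32 downcast performed before this export can also be written as a single `astype` call with a column-to-dtype mapping. The sketch below uses a hypothetical three-column toy frame rather than the real file, and compares the memory used by the feature columns before and after the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the credit card data
toy = pd.DataFrame({
    "Time": np.arange(5, dtype="float64"),
    "V1": np.linspace(-1.0, 1.0, 5),
    "Class": [0, 0, 0, 0, 1],
})

feature_cols = list(toy.columns[:-1])  # every column except the last (Class)
before = toy[feature_cols].memory_usage(index=False).sum()

# Downcast all feature columns in one call, then mark Class as categorical
toy = toy.astype({col: "float32" for col in feature_cols})
toy["Class"] = toy["Class"].astype("category")

after = toy[feature_cols].memory_usage(index=False).sum()
print(before, after)  # 80 40
```

float32 keeps roughly seven significant decimal digits, so the halved memory footprint comes at the cost of some precision.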

In [26]:
# Frequency of each Class value
카드자료['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [27]:
X = 카드자료.iloc[:, 0:-1]
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float32
 1   V1      284807 non-null  float32
 2   V2      284807 non-null  float32
 3   V3      284807 non-null  float32
 4   V4      284807 non-null  float32
 5   V5      284807 non-null  float32
 6   V6      284807 non-null  float32
 7   V7      284807 non-null  float32
 8   V8      284807 non-null  float32
 9   V9      284807 non-null  float32
 10  V10     284807 non-null  float32
 11  V11     284807 non-null  float32
 12  V12     284807 non-null  float32
 13  V13     284807 non-null  float32
 14  V14     284807 non-null  float32
 15  V15     284807 non-null  float32
 16  V16     284807 non-null  float32
 17  V17     284807 non-null  float32
 18  V18     284807 non-null  float32
 19  V19     284807 non-null  float32
 20  V20     284807 non-null  float32
 21  V21     28

In [28]:
y = 카드자료['Class']
y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 284807 entries, 0 to 284806
Series name: Class
Non-Null Count   Dtype   
--------------   -----   
284807 non-null  category
dtypes: category(1)
memory usage: 278.4 KB


In [29]:
from imblearn.over_sampling import SMOTE
X_o, y_o = SMOTE(random_state=316, sampling_strategy=0.5).fit_resample(X, y)

In [30]:
y_o.value_counts()

0    284315
1    142157
Name: Class, dtype: int64
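
The minority count of 142,157 follows from `sampling_strategy=0.5`, which for over-sampling is the desired ratio of minority to majority rows after resampling. The arithmetic below reproduces the count. Rounding down with `int` is an assumption inferred from the output shown above, not a quotation from the imblearn documentation.

```python
n_majority = 284_315      # class 0 count before resampling
n_minority = 492          # class 1 count before resampling
sampling_strategy = 0.5   # desired minority / majority ratio

# Target minority size, rounded down to a whole number of rows
n_minority_target = int(n_majority * sampling_strategy)

# Number of synthetic rows SMOTE has to create to reach the target
n_synthetic = n_minority_target - n_minority

print(n_minority_target)  # 142157
print(n_synthetic)        # 141665
```

Almost all minority rows in the resampled data are therefore synthetic, which is worth keeping in mind when evaluating a model trained on them.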

In [32]:
# pd.concat(axis=1) works for Series+Series, DataFrame+Series, and DataFrame+DataFrame
카드오버 = pd.concat([X_o, y_o], axis=1)

In [33]:
카드오버['Class'].value_counts()

0    284315
1    142157
Name: Class, dtype: int64

In [34]:
from imblearn.under_sampling import NearMiss
X_u, y_u = NearMiss(version=1).fit_resample(X, y)

In [35]:
카드언더 = pd.concat([X_u, y_u], axis=1)
카드언더['Class'].value_counts()

0    492
1    492
Name: Class, dtype: int64

In [36]:
from imblearn.under_sampling import RandomUnderSampler
X_u, y_u = RandomUnderSampler(random_state=316, sampling_strategy=0.5).fit_resample(X, y)

In [37]:
카드언더 = pd.concat([X_u, y_u], axis = 1)
카드언더['Class'].value_counts()

0    984
1    492
Name: Class, dtype: int64
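
For under-sampling, `sampling_strategy` is again the minority-to-majority ratio after resampling, but here the minority class is left untouched and the majority class shrinks. The helper name `majority_after_undersampling` is hypothetical, and rounding down is an assumption consistent with the counts shown above.

```python
def majority_after_undersampling(n_minority, sampling_strategy):
    """Majority rows kept so that minority / majority equals the ratio (hypothetical helper)."""
    return int(n_minority / sampling_strategy)


n_majority = 284_315
n_minority = 492

kept_half = majority_after_undersampling(n_minority, 0.5)      # ratio used with RandomUnderSampler
kept_balanced = majority_after_undersampling(n_minority, 1.0)  # fully balanced classes

print(kept_half)               # 984
print(kept_balanced)           # 492
print(n_majority - kept_half)  # 283331 majority rows discarded
```

Under-sampling discards the vast majority of legitimate transactions here, so it trades information for balance and speed.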