# Preprocessing part 2

+ In this section, I handle the class imbalance in the data
+ Method: SMOTETomek

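+ Why SMOTETomek? SMOTE is a widely used oversampling method that generates new positive (minority-class) samples,
    + and Tomek links, an undersampling method, are then used to remove noisy samples.
    + My dataset is imbalanced, so the positive and negative classes need to be balanced first. (Preliminary tests showed that every method I tried performed poorly on the original imbalanced data. LR achieved a fairly good ROC AUC, but its F1 score was still poor.)
    + The usual ways to balance classes are undersampling and oversampling. Undersampling selects a small subset of the majority class and combines it with the minority class. This balances the classes, but it uses only a small part of the full dataset, so the information in the overall data is estimated inaccurately. Oversampling either repeatedly resamples the minority class or generates new minority samples based on KNN. This also balances the classes, but it tends to introduce additional error and lead to overfitting.
    + In this work I combine the two: oversampling first generates new data to balance the classes, and undersampling then removes outliers and noise.
    + Python's imblearn provides two methods that combine oversampling and undersampling: SMOTEENN and SMOTETomek. Both use SMOTE for oversampling. The difference is that SMOTEENN uses EditedNearestNeighbours to predict the labels of the oversampled data and removes the misclassified samples, whereas SMOTETomek undersamples by removing Tomek links.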
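
To make the two building blocks concrete, here is a minimal NumPy sketch of the ideas described above. It is a simplified illustration, not imblearn's implementation: `smote_sample` and `tomek_links` are hypothetical helper names, and the toy data is invented for the example. SMOTE creates a synthetic point by interpolating between a minority sample and one of its minority-class neighbors, and a Tomek link is a pair of mutual nearest neighbors that carry different labels.

```python
import numpy as np

def smote_sample(x, neighbor, gap):
    """Interpolate between a minority sample and one of its minority neighbors."""
    return x + gap * (neighbor - x)

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    # pairwise Euclidean distances; the diagonal is set to inf so that
    # a point is never reported as its own nearest neighbor
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        # i < j avoids reporting the same pair twice
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# toy 1-D data (hypothetical values): class 0 on the left, class 1 on the right,
# with one class-1 point sitting right next to a class-0 point
X = np.array([[0.0], [1.0], [2.0], [2.2], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

new_point = smote_sample(X[4], X[5], gap=0.5)  # halfway between two minority points
links = tomek_links(X, y)
print(new_point, links)
```

In this toy set the only Tomek link is the pair at 2.0 and 2.2, which straddles the class boundary. Removing such pairs is what cleans up the boundary after oversampling.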

In [2]:
import numpy as np
import pandas as pd
from imblearn.combine import SMOTEENN, SMOTETomek

---

## Load the data

In [4]:
training_data = pd.read_csv("data/602_train_data_distance.csv")
training_data.drop(columns=['Unnamed: 0'], inplace=True)
print(training_data.shape)
training_data.head(3)

(32950, 197)


Unnamed: 0,job_1,job_10,job_11,job_12,job_2,job_3,job_4,job_5,job_6,job_7,...,nr_employed_prep_10,nr_employed_prep_11,nr_employed_prep_2,nr_employed_prep_3,nr_employed_prep_4,nr_employed_prep_5,nr_employed_prep_6,nr_employed_prep_7,nr_employed_prep_8,nr_employed_prep_9
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [5]:
training_label = pd.read_csv("data/602_train_label.csv")
training_label.drop(columns=['Unnamed: 0'], inplace=True)
print(training_label.shape)
training_label.head(3)

(32950, 1)


Unnamed: 0,Final_Y
0,0
1,0
2,0


In [6]:
testing_data = pd.read_csv("data/602_test_data_distance.csv")
testing_data.drop(columns=['Unnamed: 0'], inplace=True)
print(testing_data.shape)
testing_data.head(3)

(8238, 197)


Unnamed: 0,job_1,job_10,job_11,job_12,job_2,job_3,job_4,job_5,job_6,job_7,...,nr_employed_prep_10,nr_employed_prep_11,nr_employed_prep_2,nr_employed_prep_3,nr_employed_prep_4,nr_employed_prep_5,nr_employed_prep_6,nr_employed_prep_7,nr_employed_prep_8,nr_employed_prep_9
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


In [7]:
# count the samples in each class
print(training_label.groupby('Final_Y').size())

Final_Y
0    29254
1     3696
dtype: int64
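
The degree of imbalance can be read directly off these counts. The short calculation below is a sketch that hard-codes the two numbers printed above. The variable names are illustrative only.

```python
# class counts copied from the groupby output above
counts = {0: 29254, 1: 3696}

total = sum(counts.values())
minority_share = counts[1] / total        # fraction of positive samples
imbalance_ratio = counts[0] / counts[1]   # negatives per positive

print(f"total={total}, minority share={minority_share:.1%}, ratio={imbalance_ratio:.1f}:1")
```

Positives make up roughly 11% of the training set, about one positive for every eight negatives.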


---
## SMOTETomek

In [20]:
# convert all int values to floats, to enable distance calculation
training_data_imb = training_data.astype(float)
testing_data_imb = np.array(testing_data.astype(float))
training_label_imb = np.ravel(training_label)
training_data_imb.head()

Unnamed: 0,job_1,job_10,job_11,job_12,job_2,job_3,job_4,job_5,job_6,job_7,...,nr_employed_prep_10,nr_employed_prep_11,nr_employed_prep_2,nr_employed_prep_3,nr_employed_prep_4,nr_employed_prep_5,nr_employed_prep_6,nr_employed_prep_7,nr_employed_prep_8,nr_employed_prep_9
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [13]:
# build the resampler (fixed random_state for reproducibility)
smt = SMOTETomek(random_state=0)

In [14]:
# resample the training set only; the test set is left untouched
training_data_smt, training_label_smt = smt.fit_resample(training_data_imb, training_label_imb)

In [22]:
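# convert back to DataFrames, keeping the original column names
training_data = pd.DataFrame(training_data_smt, columns=training_data.columns)
training_label = pd.DataFrame(training_label_smt, columns=['Final_Y'])
testing_data = pd.DataFrame(testing_data_imb, columns=testing_data.columns)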
print(training_data.shape, training_label.shape, testing_data.shape)

(58508, 197) (58508, 1) (8238, 197)
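
As a sanity check, the resampled row count can be compared with what SMOTE alone would produce. The sketch below assumes SMOTE's default behavior of raising the minority class to the majority count. The numbers are copied from the outputs in this notebook and the variable names are illustrative.

```python
# class counts before resampling and row count after, copied from the outputs above
majority_before, minority_before = 29254, 3696
rows_after = 58508

# assumption: SMOTE with default settings balances the minority up to the majority count
expected_after_smote = 2 * majority_before
synthetic_added = majority_before - minority_before

# any shortfall relative to the SMOTE-only size would be rows removed as Tomek links
removed_by_tomek = expected_after_smote - rows_after

print(expected_after_smote, synthetic_added, removed_by_tomek)
```

Under that assumption, SMOTE adds 25558 synthetic positives, and the final size equals the SMOTE-only size. This suggests that the Tomek step removed no rows in this run.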


In [26]:
training_data.to_csv("data/609_training_data.csv")
testing_data.to_csv("data/609_testing_data.csv")
training_label.to_csv("data/609_training_label.csv")
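
Because `to_csv` is called without `index=False`, the row index is written as an unnamed first column. That is why the loading cells in this notebook drop `'Unnamed: 0'` after `read_csv`. The small round trip below illustrates the behavior on a toy frame, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

df = pd.DataFrame({"Final_Y": [0, 1, 0]})

# default: the index is written as an unnamed first column
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
columns_with_index = list(pd.read_csv(buffer).columns)

# index=False: only the data columns are written
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
columns_without_index = list(pd.read_csv(buffer).columns)

print(columns_with_index)     # ['Unnamed: 0', 'Final_Y']
print(columns_without_index)  # ['Final_Y']
```

Either convention works as long as the code that reads the files matches it. Any later step that drops `'Unnamed: 0'` depends on the index having been written.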