# Data Augmentation Box

Project for Data Augmentation System

## Data Augmentation Order

STEP 1 - Domain Data Preparation
1. Domain data labeling check
2. Dimensionality Reduction
3. Regression analysis


STEP 2 - Data Augmentation
01. Domain data check
02. Public Data Supplement
03. Data filtering (1st)
04. Dimensionality Reduction
05. Label Spreading (semi-supervised learning based)
06. Data Filtering (2nd)
07. Regression analysis
08. Data Filtering (3rd)
09. Data Augmentation
10. Model Generation

- - -

In [None]:
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.metrics as metrics

In [None]:
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.preprocessing import MinMaxScaler
from sklearn import decomposition
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer

scaler = MinMaxScaler() #set the scaler

## 01. Domain Data Check

First, load the domain data and check its shape, variables, and labels.

### 1-1) image dataset

### 1-2) numerical dataset

In [None]:
### HRV numerical dataset
domain = pd.read_csv('E:/RESEARCH/Datasets/HRV/HRV_REV_all.csv', sep=',')

In [None]:
### data shape, variables check
print("The shape of the domain dataset is:",domain.shape)
# print(domain.columns)
domain.head()

* Create a new labeling variable, IndexH, based on the HAMD score

In [None]:
### assigning labels to the data from the HAMD score
domain.loc[domain['HAMD']<=7, 'IndexH'] =0
domain.loc[(domain['HAMD']>7) & (domain['HAMD']<=16), 'IndexH'] = 1
domain.loc[domain['HAMD']>16, 'IndexH'] = 2
domain_y = domain.loc[:,'IndexH']
# domain_y = domain.loc[:,'disorder']

In [None]:
domain['IndexH'].value_counts()
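
The HAMD thresholds used above (at most 7, 8 to 16, above 16) can also be written as a single binning step. The following is a minimal sketch on made-up scores, assuming `pd.cut` with right-inclusive bins; the `toy` DataFrame is hypothetical and only illustrates the rule, it is not the HRV data.

```python
import numpy as np
import pandas as pd

# Toy HAMD scores (made up for illustration; not from the study data)
toy = pd.DataFrame({'HAMD': [3, 7, 8, 16, 17, 25]})

# Right-inclusive bins: (-inf, 7] -> 0, (7, 16] -> 1, (16, inf) -> 2
toy['IndexH'] = pd.cut(
    toy['HAMD'],
    bins=[-np.inf, 7, 16, np.inf],
    labels=[0, 1, 2],
).astype(int)

print(toy)
print(toy['IndexH'].value_counts().sort_index())
```

Writing the thresholds as one list of bin edges keeps the boundaries in a single place, which may make them easier to adjust later.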

* Then drop the variables that will not be used (only the HRV-related variables are kept).

In [None]:
### deleting unnecessary data columns
domain = domain.drop(['sub','age','gender','VISIT','disorder','HAMD', 'HAMA','PDSS','ASI','APPQ','PSWQ','SPI','PSS','BIS','SSI'], axis=1)

In [None]:
### check the domain data columns again
print(domain.columns)
print(domain.shape)

- - -

* Domain data variable selection for the right task
> Split the data into its baseline, stress, and rest phases.

In [None]:
domain_b1 = domain.loc[:, ['b1RMSSD', 'b1HR', 'b1PNN50', 'b1VLF', 'b1LF', 'b1HF', 'b1LF/HF']]
domain_b2 = domain.loc[:, ['b2RMSSD', 'b2HR', 'b2PNN50', 'b2VLF', 'b2LF', 'b2HF', 'b2LF/HF']]
domain_b3 = domain.loc[:, ['b3RMSSD', 'b3HR', 'b3PNN50', 'b3VLF', 'b3LF', 'b3HF', 'b3LF/HF']]

In [None]:
domain_b1.columns = ['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']
domain_b2.columns = ['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']
domain_b3.columns = ['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']

* domain_s holds the stress-phase variables

In [None]:
domain_s = domain.loc[:, ['sRMSSD','sHR', 'sPNN50', 'sVLF', 'sLF', 'sHF', 'sLF/HF']]
domain_s_index = domain.loc[:, ['sRMSSD','sHR', 'sPNN50', 'sVLF', 'sLF', 'sHF', 'sLF/HF', 'IndexH']]

In [None]:
domain_s.columns = ['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']

In [None]:
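### Min-max scaling to the [0, 1] range (MinMaxScaler normalizes; it does not standardize to zero mean and unit variance)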
domain_b1[:] = scaler.fit_transform(domain_b1[:])
domain_b2[:] = scaler.fit_transform(domain_b2[:])
domain_b3[:] = scaler.fit_transform(domain_b3[:])
domain_s[:] = scaler.fit_transform(domain_s[:])

Later you can select the dataset that you want to analyze. 

For example, to augment the stress-phase dataset, choose `domain_s`.

- - -


## 02. Public Data Supplement

We crawl public data to supplement the limited domain data.

### 2-1) Crawling Image data 

In [None]:
# import urllib.request
# import time
# from urllib.parse import quote_plus
# from bs4 import BeautifulSoup
# from selenium import webdriver
# from icrawler.builtin import GoogleImageCrawler

In [None]:
# google_crawler = GoogleImageCrawler(parser_threads=2, downloader_threads=4,
#                                     storage={'root_dir': 'E:/RESEARCH/Datasets/HRV/crawl_test'})

# google_crawler.crawl(keyword='car crash', max_num=500,
# #                      date_min=None, date_max=None,
#                      min_size=(200,200), max_size=None)

In [None]:
# ### image crawling from google with GoogleImageCrawler
# google_crawler = GoogleImageCrawler(
#     feeder_threads=1,
#     parser_threads=1,
#     downloader_threads=4,
#     storage={'root_dir': 'E:/RESEARCH/Datasets/VC/classic/violin'})
# #     storage={'root_dir': 'E:/RESEARCH/Datasets/image/CIFAR_PUB/truck'}) #set the storage root

# filters = dict(
# #     type='photo',
#     #type=photo,face,clipart,linedrawing,animated
#     size='medium',
#     #size=large, medium, icon, or larger than a given size e.g.">640x480" or exactly giving size"=1024x768
# #     color='orange',
#     #color=blackandwhite, red, orange, yellow, green, teal, blue, purple, pink, white, gray, black, brown
# #     license='commercial,modify',
#     #license=noncommercial, commercial, noncommercial,modify , commercial,modify
#     date=((2000, 1, 1), (2021, 12, 30)))

# # type the keyword of the image that you want to crawl from google
# google_crawler.crawl(keyword= 'violin orchestra', filters=filters, offset=0, max_num=1000,
#                      min_size=(200,200), max_size=None, file_idx_offset=0)

In [None]:
# ### image crawling from google with GoogleImageCrawler
# google_crawler = GoogleImageCrawler(
#     feeder_threads=1,
#     parser_threads=1,
#     downloader_threads=4,
#     storage={'root_dir': 'E:/RESEARCH/Datasets/HRV/crawl_test'}) #set the storage root

# filters = dict(
#     #type=photo,face,clipart,linedrawing,animated
#     size='large',
#     #size=large, medium, icon, or larger than a given size e.g.">640x480" or exactly giving size"=1024x768
#     color='blackandwhite',
#     #color=blackandwhite, red, orange, yellow, green, teal, blue, purple, pink, white, gray, black, brown
#     license='commercial,modify',
#     #license=noncommercial, commercial, noncommercial,modify , commercial,modify
#     date=((2021, 1, 1), (2021, 12, 30)))

# # type the keyword of the image that you want to crawl from google
# google_crawler.crawl(keyword='lung ct', filters=filters, offset=0, max_num=1000,
#                      min_size=(200,200), max_size=None, file_idx_offset=0)

### 2-2) Getting Numerical Data

Numerical data can come from sources such as Kaggle, Google, or the UCI Machine Learning Repository.

In [None]:
### In our medical case, we adopt the HRV dataset from the SWELL HRV research
### Public data must be used with care; researchers should use it only to supplement the training data.

public = pd.read_csv('E:/RESEARCH/Datasets/HRV/HRV_Public/SWELL_hrv/data/final/train.csv', sep=',')

In [None]:
### data shape, variables check
print("The shape of the public SWELL dataset is:",public.shape)
# print(public.columns)
public.head()

- - -

* Preprocess the public data to match the domain data
> The domain data actually used (Samsung Hospital) has three phases, but the public data is split only into baseline and stress.

In [None]:
### set the variables same as domain dataset
public_b = public[public['condition'] == 'no stress']
public_s1 = public[public['condition'] == 'interruption']
public_s2 = public[public['condition'] == 'time pressure']

* Check how many samples each subset contains

In [None]:
### check the number of each phase dataset
print(public_b.shape)
print(public_s1.shape)
print(public_s2.shape)

In [None]:
### now keep only the variables shared with the domain data
public = public.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]
public_b = public_b.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]
public_s1 = public_s1.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]
public_s2 = public_s2.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]

* Likewise, apply the scaler (min-max scaling) to the public data

In [None]:
### min-max scaling on the supplemented dataset
public_b[:] = scaler.fit_transform(public_b[:])
public_s1[:] = scaler.fit_transform(public_s1[:])
public_s2[:] = scaler.fit_transform(public_s2[:])

In [None]:
### round the variable values to five decimal places
public_b = public_b.round(decimals=5)
public_s1 = public_s1.round(decimals=5)
public_s2 = public_s2.round(decimals=5)

- - -

## 03. Data Filtering (1st)

### 3-1) Data Merging

In [None]:
### First select the data phase (may not be necessary for some datasets)
### Then, check the number of samples in the domain and public datasets
### Here we use the stress phase (domain_s and public_s1)

print("Shape of the domain dataset for the training is", domain_s.shape)
print("Shape of the public dataset for the training is", public_s1.shape)

In [None]:
### select the proper amount of dataset for each
domain_resized = domain_s.sample(frac=1)  # no subsampling needed: frac=1 keeps every row (shuffled)
public_resized = public_s1.sample(n=700)
print(domain_resized.shape)
print(public_resized.shape)

In [None]:
# public_resized.head()

In [None]:
# domain_resized.head()

* Merge the two datasets under the name `training`

In [None]:
training = pd.concat((domain_resized, public_resized))

In [None]:
### check the finalized first augmented dataset size/shape
print("Shape of the firstly augmented dataset for the training is", training.shape)

In [None]:
training.head()
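
After merging, the rows no longer show which dataset they came from. If that information is needed later (for example, for the filtering steps), one option is to pass `keys` to `pd.concat`. This is a small sketch on made-up frames; `dom`, `pub`, and the level names `source` and `row` are hypothetical and not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for the scaled domain and public tables (values are made up)
dom = pd.DataFrame({'RMSSD': [0.1, 0.2], 'HR': [0.5, 0.6]})
pub = pd.DataFrame({'RMSSD': [0.3, 0.4, 0.9], 'HR': [0.7, 0.8, 0.2]})

# keys= adds an outer index level recording the origin of each row
merged = pd.concat([dom, pub], keys=['domain', 'public'], names=['source', 'row'])

# Count rows per source, and pull one source back out
counts = merged.groupby(level='source').size()
only_public = merged.xs('public', level='source')

print(merged)
print(counts)
```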

## 04. Dimensionality Reduction

* The domain and public datasets currently share 7 common variables.
* Check that reducing the dimensionality for clustering does not reduce how well each dataset is explained.

* First, the domain dataset

In [None]:
### To put the labels on domain dataset and use them for labeling, index must be included
### 3 component dimensionality reduction on the domain dataset
dom_pca_3 = decomposition.PCA(n_components=3)
dom_pca_3_result = dom_pca_3.fit_transform(domain_resized)
dom_3 = dom_pca_3.explained_variance_ratio_.sum()*100 #explained ratio

### check the representativeness of the reduced dimension by PCA
print('Explained variation per principal component: {}'.format(dom_pca_3.explained_variance_ratio_))
print('Cumulative variance explained by 3 principal components: {:.2%}'.format(np.sum(dom_pca_3.explained_variance_ratio_)))

In [None]:
dom_pca_3_result.shape ##reduced dimension
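
The number of components above is fixed at 3. One way to check whether that is enough is to look at the cumulative explained variance over all components. The sketch below uses synthetic data with 7 features as a stand-in for the HRV variables; the 90% `threshold` is a hypothetical choice, not a value from this project.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a table with 7 features (not the HRV data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # make two features correlated

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90  # hypothetical cut-off
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3))
print('components needed:', n_keep)
```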

In [None]:
domain_resized

In [None]:
RDATA_reduced = pd.DataFrame(dom_pca_3_result)

In [None]:
# RDATA_reduced

* Plot the result to check it visually

In [None]:
# dom_result3 = pd.DataFrame(dom_pca_3.transform(domain_resized), columns = ['PCA%i' % i for i in range(3)], index = domain_resized.index)

In [None]:
# # Plot initialisation
# fig = plt.figure(figsize=(8,5))
# ax = fig.add_subplot(111, projection='3d')
# plt.title('PCA 3 result from Domain Dataset', fontsize=11, fontweight='bold')
# ax.scatter(dom_result3['PCA0'], dom_result3['PCA1'], dom_result3['PCA2'], s=60)
# # plt.savefig('pca_result.png')

* Use silhouette score analysis to see how many clusters are reasonable

In [None]:
# candidate values for our number of cluster
parameters = [2, 3, 4, 5, 6]

# instantiating ParameterGrid, pass number of clusters as input
parameter_grid = ParameterGrid({'n_clusters': parameters})
best_score = -1
kmeans_model = KMeans()     # instantiating KMeans model
silhouette_scores = []

# evaluation based on silhouette_score
for p in parameter_grid:
    kmeans_model.set_params(**p)  # set current hyper parameter
    kmeans_model.fit(domain_resized)     # fit model on dataset, this will find clusters based on parameter p
    ss = metrics.silhouette_score(domain_resized, kmeans_model.labels_)   # calculate silhouette_score
    silhouette_scores += [ss]       # store all the scores
    print('Parameter:', p, 'Score', ss)
    # check p which has the best score
    if ss > best_score:
        best_score = ss
        best_grid = p
        
# plotting silhouette score
plt.bar(range(len(silhouette_scores)), list(silhouette_scores), align='center', color='#849ef7', width=0.5)
plt.xticks(range(len(silhouette_scores)), list(parameters))
plt.title('Domain Dataset silhouette score')
plt.xlabel('Number of Clusters')
plt.show()
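
The silhouette search above can be checked on data where the answer is known. The following sketch runs the same idea on three well-separated synthetic blobs, so the best-scoring number of clusters should be 3. The blob centers, `n_init`, and `random_state` values are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three clearly separated synthetic clusters (not the HRV data)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate number of clusters
scores = {}
for k in [2, 3, 4, 5, 6]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate with the highest score
best_k = max(scores, key=scores.get)
print(scores)
print('best number of clusters:', best_k)
```

Fixing `random_state` makes the comparison repeatable, since KMeans initialization is otherwise random from run to run.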

* Next, the public dataset

In [None]:
### 3 component dimensionality reduction on the public dataset
pub_pca_3 = decomposition.PCA(n_components=3)
pub_pca_3_result = pub_pca_3.fit_transform(public_resized)
pub_3 = pub_pca_3.explained_variance_ratio_.sum()*100

### check the representativeness of the reduced dimension by PCA
print('Explained variation per principal component: {}'.format(pub_pca_3.explained_variance_ratio_))
print('Cumulative variance explained by 3 principal components: {:.2%}'.format(np.sum(pub_pca_3.explained_variance_ratio_)))

* Likewise, plot the result to check it visually

In [None]:
# pub_result3 = pd.DataFrame(pub_pca_3.transform(public_resized), columns = ['PCA%i' % i for i in range(3)], index = public_resized.index)

In [None]:
# # Plot initialisation
# fig = plt.figure(figsize=(8,5))
# ax = fig.add_subplot(111, projection='3d')
# plt.title('PCA 3 result from Public Dataset', fontsize=11, fontweight='bold')
# ax.scatter(pub_result3['PCA0'], pub_result3['PCA1'], pub_result3['PCA2'], s=60)
# # plt.savefig('pca_result.png')

In [None]:
# candidate values for our number of cluster
parameters = [2, 3, 4, 5, 6]

# instantiating ParameterGrid, pass number of clusters as input
parameter_grid = ParameterGrid({'n_clusters': parameters})
best_score = -1
kmeans_model = KMeans()     # instantiating KMeans model
silhouette_scores = []

# evaluation based on silhouette_score
for p in parameter_grid:
    kmeans_model.set_params(**p)  # set current hyper parameter
    kmeans_model.fit(public_resized)     # fit model on dataset, this will find clusters based on parameter p
    ss = metrics.silhouette_score(public_resized, kmeans_model.labels_)   # calculate silhouette_score
    silhouette_scores += [ss]       # store all the scores
    print('Parameter:', p, 'Score', ss)
    # check p which has the best score
    if ss > best_score:
        best_score = ss
        best_grid = p
        
# plotting silhouette score
plt.bar(range(len(silhouette_scores)), list(silhouette_scores), align='center', color='#849ef7', width=0.5)
plt.xticks(range(len(silhouette_scores)), list(parameters))
plt.title('Public Dataset silhouette score')
plt.xlabel('Number of Clusters')
plt.show()

## 05. Data Clustering (SSL based)

## 06. Unlabeled data labeling

* Here RDATA is the real (domain) dataset and PDATA is the public dataset used for augmentation

In [None]:
RDATA = domain_s
PDATA = public_s1.sample(n=700)
label = domain_y

* PDATA is unlabeled at this point, so set its label value to -1

In [None]:
PDATA['y'] = -1

In [None]:
PDATA.info()

* Split the data into train and test sets to run the regression

In [None]:
# Labeled datapoints and their corresponding labels.
train_x, test_x, train_y, test_y = train_test_split(RDATA, label, test_size = 0.2, random_state = 710674)

In [None]:
print("The shape of training dataset x is:", train_x.shape)
print("The shape of test dataset x is:", test_x.shape)

In [None]:
# Unlabeled datapoints and their placeholder labels (-1).
train_x2 = PDATA.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]
train_y2 = PDATA['y']

In [None]:
print("The shape of public training dataset x is:", train_x2.shape)
print("The shape of public training labels y is:", train_y2.shape)

In [None]:
# Concatenate
x = np.concatenate((train_x, train_x2))
y = np.concatenate((train_y, train_y2))

In [None]:
print("The shape of total training dataset x is:", x.shape)
print("The shape of total training labels y is:", y.shape)
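
The concatenated arrays follow the scikit-learn convention for semi-supervised learning: labeled rows keep their class and unlabeled rows carry -1. The sketch below shows how `LabelSpreading` consumes such input and returns inferred labels through `transduction_`. It uses synthetic blobs rather than the HRV data, and the `knn` kernel with `n_neighbors=7` is an assumption, not a tuned setting.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Two synthetic clusters standing in for labeled + unlabeled data
X, y_true = make_blobs(
    n_samples=200, centers=[[0, 0], [6, 6]], cluster_std=1.0, random_state=0
)

# Mark everything as unlabeled (-1), then reveal 10 labels per class
y = np.full(len(y_true), -1)
for c in (0, 1):
    idx = np.flatnonzero(y_true == c)[:10]
    y[idx] = c

# Fit on labeled and unlabeled points together
model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X, y)

# transduction_ holds one inferred label for every input row
inferred = model.transduction_
print('unlabeled rows:', int((y == -1).sum()))
print('agreement with true labels:', float((inferred == y_true).mean()))
```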

* Run logistic regression to check the relationships between the variables and the fitted function

In [None]:
index = ['Analysis Method', 'ROC AUC']
results = pd.DataFrame(columns = index)  # create an empty DataFrame named results to collect the scores

In [None]:
logreg = LogisticRegression(random_state = 710674, C = 0.00001, max_iter = 20000)
logreg.fit(train_x, train_y)
# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
results = pd.concat(
    [results,
     pd.DataFrame([['Logistic Regression', roc_auc_score(test_y, logreg.predict_proba(test_x), multi_class='ovr')]],
                  columns=index)], ignore_index=True)

results

In [None]:
# logreg.predict_proba(test_x)

In [None]:
y_pred = logreg.predict(test_x)
acc_score = accuracy_score(test_y, y_pred)

In [None]:
acc_score
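
For a multi-class target, `roc_auc_score` needs the full matrix of class probabilities together with `multi_class='ovr'`, as in the cell above. The sketch below reproduces that evaluation pattern on the iris dataset bundled with scikit-learn, used here only as a stand-in three-class problem; the split and `max_iter` values are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Three-class toy problem (iris), standing in for the IndexH labels
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns one column per class; each row sums to 1
proba = clf.predict_proba(X_test)

# One-vs-rest ROC AUC averaged over the classes, plus plain accuracy
auc = roc_auc_score(y_test, proba, multi_class='ovr')
acc = accuracy_score(y_test, clf.predict(X_test))

print('ROC AUC (ovr):', round(auc, 3))
print('accuracy:', round(acc, 3))
```

Stratifying the split keeps every class present in the test set, which the one-vs-rest AUC requires.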

* Check the coefficient of each variable for each label

In [None]:
logreg.coef_

* Label propagation (generating a probabilistic transition matrix for unlabeled datapoints)

In [None]:
def label_prop_test(kernel, params_list, x_train, x_test, y_train, y_test):
    """Fit LabelPropagation for each value in params_list and plot the test ROC AUC."""
    roc_scores = []

    for p in params_list:
        # pass only the hyperparameter that the chosen kernel actually uses
        if kernel == 'rbf':
            lp = LabelPropagation(kernel=kernel, gamma=p, max_iter=10000, tol=0.001)
        elif kernel == 'knn':
            lp = LabelPropagation(kernel=kernel, n_neighbors=p, max_iter=10000, tol=0.001)
        else:
            raise ValueError("kernel must be 'rbf' or 'knn'")
        lp.fit(x_train, y_train)
        roc_scores.append(roc_auc_score(y_test, lp.predict_proba(x_test), multi_class='ovr'))

    # one figure only; the score is plotted against the tested parameter values
    plt.figure(figsize=(16,8))
    plt.plot(params_list, roc_scores)
    plt.title('Label Propagation ROC AUC with ' + kernel + ' kernel')
    plt.show()
    
    print('Best metrics value is at {}'.format(params_list[np.argmax(roc_scores)]))