# Data Augmentation Box

Project for Data Augmentation System

## Data Augmentation Order

STEP 1 - Domain Data Preparation
1. Domain data labeling check
2. Dimensionality Reduction
3. Regression analysis


STEP 2 - Data Augmentation
01. Domain Data Check
02. Public Data Supplement
03. Data Filtering (1st)
04. Dimensionality Reduction
05. Label Spreading (semi-supervised learning based)
06. Data Filtering (2nd)
07. Regression Analysis
08. Data Filtering (3rd)
09. Data Augmentation
10. Model Generation

- - -

In [1]:
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler() #set the scaler

## 01. Domain Data Check

First, inspect the domain data: its shape, its variables, and its labels.

### 1-1) image dataset

Here we use the CIFAR-10 dataset for the image experiments.

In [None]:
domain =

### 1-2) numerical dataset

In [3]:
### HRV numerical dataset
domain = pd.read_csv('E:/RESEARCH/Datasets/HRV/HRV_REV_all.csv', sep=',')

In [4]:
### data shape, variables check
print(domain.shape)
print(domain.columns)
domain.head()

(479, 93)
Index(['sub', 'VISIT', 'disorder', 'age', 'gender', 'HAMD', 'HAMA', 'PDSS',
       'ASI', 'APPQ', 'PSWQ', 'SPI', 'PSS', 'BIS', 'SSI', 'b1SDNN', 'b1NN50',
       'b1PNN50', 'b1RMSSD', 'b1VLF', 'b1LF', 'b1HF', 'b1LF/HF', 'b1POWER',
       'b1HR', 'b1RESP', 'b1SC', 'b1TEMP', 'sSDNN', 'sNN50', 'sPNN50',
       'sRMSSD', 'sVLF', 'sLF', 'sHF', 'sLF/HF', 'sPOWER', 'sHR', 'sRESP',
       'sSC', 'sTEMP', 'b2SDNN', 'b2NN50', 'b2PNN50', 'b2RMSSD', 'b2VLF',
       'b2LF', 'b2HF', 'b2LF/HF', 'b2POWER', 'b2HR', 'b2RESP', 'b2SC',
       'b2TEMP', 'rSDNN', 'rNN50', 'rPNN50', 'rRMSSD', 'rVLF', 'rLF', 'rHF',
       'rLF/HF', 'rPOWER', 'rHR', 'rRESP', 'rSC', 'rTEMP', 'b3SDNN', 'b3NN50',
       'b3PNN50', 'b3RMSSD', 'b3VLF', 'b3LF', 'b3HF', 'b3LF/HF', 'b3POWER',
       'b3HR', 'b3RESP', 'b3SC', 'b3TEMP', 'cSDNN', 'cNN50', 'cPNN50',
       'cRMSSD', 'cVLF', 'cLF', 'cHF', 'cLF/HF', 'cPOWER', 'cHR', 'cRESP',
       'cSC', 'cTEMP'],
      dtype='object')


| | sub | VISIT | disorder | age | gender | HAMD | HAMA | PDSS | ASI | APPQ | ... | cRMSSD | cVLF | cLF | cHF | cLF/HF | cPOWER | cHR | cRESP | cSC | cTEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E001 | 4 | 2 | 23 | 1 | 2 | 2 | 1 | 12 | 22 | ... | 41.544667 | 190.107 | 298.508333 | 206.862333 | 1.284 | 695.477333 | 65.707 | 14.054333 | 3.911333 | 34.998 |
| 1 | E001 | 5 | 2 | 23 | 1 | 12 | 7 | 0 | 12 | 24 | ... | 39.825333 | 143.756667 | 115.695333 | 202.602667 | 0.585 | 462.054667 | 69.04 | 14.117333 | 6.255 | 35.544333 |
| 2 | E002 | 1 | 2 | 38 | 1 | 14 | 17 | 14 | 31 | 122 | ... | 20.052 | 22.006 | 50.182 | 32.529333 | 2.499333 | 104.717 | 92.789333 | 11.013333 | 0.945667 | 35.086 |
| 3 | E002 | 2 | 2 | 38 | 1 | 13 | 36 | 16 | 32 | 139 | ... | 20.201667 | 55.579 | 84.441 | 18.754 | 5.803 | 158.774333 | 85.980667 | 12.608333 | 0.785667 | 36.141 |
| 4 | E002 | 3 | 2 | 38 | 1 | 7 | 10 | 11 | 23 | 70 | ... | 74.788 | 182.229 | 530.565667 | 546.574 | 1.685667 | 1259.368667 | 84.368667 | 14.285667 | 0.648 | 35.879 |


In [5]:
### derive class labels from the HAMD score: 0 (HAMD <= 7), 1 (7 < HAMD <= 16), 2 (HAMD > 16)
domain.loc[domain['HAMD']<=7, 'IndexH'] =0
domain.loc[(domain['HAMD']>7) & (domain['HAMD']<=16), 'IndexH'] = 1
domain.loc[domain['HAMD']>16, 'IndexH'] = 2
domain_y = domain.loc[:,'IndexH']
# domain_y = domain.loc[:,'disorder']
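
The three threshold assignments above can also be written as a single binning step. The following is a minimal sketch, assuming the same cut points (7 and 16) and using made-up HAMD scores rather than the real data; `hamd` and `index_h` are illustrative names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical HAMD scores, used only to illustrate the binning.
hamd = pd.Series([0, 7, 8, 16, 17, 30], name="HAMD")

# Right-closed bins: (-inf, 7] -> 0, (7, 16] -> 1, (16, inf) -> 2
index_h = pd.cut(hamd, bins=[-np.inf, 7, 16, np.inf], labels=[0, 1, 2]).astype(int)

print(index_h.tolist())
```

The final `astype(int)` assumes there are no missing HAMD values; rows with a missing score would have to be dropped or handled before the cast.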

In [6]:
### drop identifier, demographic, questionnaire, and label columns so that only the physiological features remain
domain = domain.drop(['sub','age','gender','VISIT','disorder','HAMD', 'HAMA','PDSS','ASI','APPQ','PSWQ','SPI','PSS','BIS','SSI','IndexH'], axis=1)

In [7]:
### check the domain data columns again
print(domain.columns)
print(domain.shape)

Index(['b1SDNN', 'b1NN50', 'b1PNN50', 'b1RMSSD', 'b1VLF', 'b1LF', 'b1HF',
       'b1LF/HF', 'b1POWER', 'b1HR', 'b1RESP', 'b1SC', 'b1TEMP', 'sSDNN',
       'sNN50', 'sPNN50', 'sRMSSD', 'sVLF', 'sLF', 'sHF', 'sLF/HF', 'sPOWER',
       'sHR', 'sRESP', 'sSC', 'sTEMP', 'b2SDNN', 'b2NN50', 'b2PNN50',
       'b2RMSSD', 'b2VLF', 'b2LF', 'b2HF', 'b2LF/HF', 'b2POWER', 'b2HR',
       'b2RESP', 'b2SC', 'b2TEMP', 'rSDNN', 'rNN50', 'rPNN50', 'rRMSSD',
       'rVLF', 'rLF', 'rHF', 'rLF/HF', 'rPOWER', 'rHR', 'rRESP', 'rSC',
       'rTEMP', 'b3SDNN', 'b3NN50', 'b3PNN50', 'b3RMSSD', 'b3VLF', 'b3LF',
       'b3HF', 'b3LF/HF', 'b3POWER', 'b3HR', 'b3RESP', 'b3SC', 'b3TEMP',
       'cSDNN', 'cNN50', 'cPNN50', 'cRMSSD', 'cVLF', 'cLF', 'cHF', 'cLF/HF',
       'cPOWER', 'cHR', 'cRESP', 'cSC', 'cTEMP'],
      dtype='object')
(479, 78)


- - -

* Select the domain data variables needed for the task, one data frame per measurement phase

In [8]:
domain_b1 = domain.loc[:, ['b1RMSSD', 'b1HR', 'b1PNN50', 'b1VLF', 'b1LF', 'b1HF', 'b1LF/HF']]
domain_b2 = domain.loc[:, ['b2RMSSD', 'b2HR', 'b2PNN50', 'b2VLF', 'b2LF', 'b2HF', 'b2LF/HF']]
domain_b3 = domain.loc[:, ['b3RMSSD', 'b3HR', 'b3PNN50', 'b3VLF', 'b3LF', 'b3HF', 'b3LF/HF']]

In [9]:
domain_b1.columns = ['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']
domain_b2.columns = ['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']
domain_b3.columns = ['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']

In [10]:
domain_s = domain.loc[:, ['sRMSSD','sHR', 'sPNN50', 'sVLF', 'sLF', 'sHF', 'sLF/HF']]

In [11]:
domain_s.columns = ['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']
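
The per-phase selection and renaming above repeats the same column list for every phase. One possible way to factor it out is sketched below; `FEATURES`, `select_phase`, and the toy frame are hypothetical names introduced here, and the sketch assumes that every phase uses the same prefix-plus-feature naming seen in the column index.

```python
import pandas as pd

# Feature names as they appear after the phase prefix in the domain data.
FEATURES = ["RMSSD", "HR", "PNN50", "VLF", "LF", "HF", "LF/HF"]

def select_phase(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Return the shared HRV features of one phase with phase-neutral names."""
    cols = [prefix + name for name in FEATURES]
    out = df.loc[:, cols].copy()
    # 'LF/HF' becomes 'LF_HF' to match the naming used in the public dataset.
    out.columns = [name.replace("/", "_") for name in FEATURES]
    return out

# Tiny made-up frame with two phases, only to show the call.
toy = pd.DataFrame({p + f: [1.0, 2.0] for p in ("b1", "s") for f in FEATURES})
toy_s = select_phase(toy, "s")
print(toy_s.columns.tolist())
```

With such a helper, a call like `select_phase(domain, 'b1')` would stand in for the separate selection and renaming cells.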

In [26]:
### Min-max normalization to [0, 1] (MinMaxScaler); the scaler is refit on each data frame
domain_b1[:] = scaler.fit_transform(domain_b1[:])
domain_b2[:] = scaler.fit_transform(domain_b2[:])
domain_b3[:] = scaler.fit_transform(domain_b3[:])
domain_s[:] = scaler.fit_transform(domain_s[:])

Later, select the data frame for the phase you want to analyze.
For example, to augment the stress-phase data, choose domain_s.

- - -

## 02. Public Data Supplement

When the domain data alone is insufficient, we crawl public data to supplement it.

### 2-1) Crawling Image data 

In [12]:
import urllib.request
import time
from urllib.parse import quote_plus
from bs4 import BeautifulSoup
from selenium import webdriver
from icrawler.builtin import GoogleImageCrawler

In [None]:
# google_crawler = GoogleImageCrawler(parser_threads=2, downloader_threads=4,
#                                     storage={'root_dir': 'E:/RESEARCH/Datasets/HRV/crawl_test'})

# google_crawler.crawl(keyword='car crash', max_num=500,
# #                      date_min=None, date_max=None,
#                      min_size=(200,200), max_size=None)

In [40]:
### image crawling from google with GoogleImageCrawler
google_crawler = GoogleImageCrawler(
    feeder_threads=1,
    parser_threads=1,
    downloader_threads=4,
    storage={'root_dir': 'E:/RESEARCH/Datasets/image/CIFAR_PUB/truck'}) #set the storage root

filters = dict(
    type='photo',
    # type: photo, face, clipart, linedrawing, animated
    size='medium',
    # size: large, medium, icon, larger than a given size (e.g. ">640x480"), or an exact size (e.g. "=1024x768")
#     color='orange',
    # color: blackandwhite, red, orange, yellow, green, teal, blue, purple, pink, white, gray, black, brown
    license='commercial,modify',
    # license: "noncommercial", "commercial", "noncommercial,modify", "commercial,modify"
    date=((2019, 1, 1), (2021, 12, 30)))

# type the keyword of the image that you want to crawl from google
google_crawler.crawl(keyword='truck', filters=filters, offset=0, max_num=1000,
                     min_size=(200,200), max_size=None, file_idx_offset=0)

2022-03-14 16:49:31,050 - INFO - icrawler.crawler - start crawling...
2022-03-14 16:49:31,051 - INFO - icrawler.crawler - starting 1 feeder threads...
2022-03-14 16:49:31,052 - INFO - icrawler.crawler - starting 1 parser threads...
2022-03-14 16:49:31,052 - INFO - icrawler.crawler - starting 4 downloader threads...
2022-03-14 16:49:32,344 - INFO - parser - parsing result page https://www.google.com/search?q=truck&ijn=0&start=0&tbs=itp%3Aphoto%2Cisz%3Am%2Csur%3Afmc%2Ccdr%3A1%2Ccd_min%3A01%2F01%2F2019%2Ccd_max%3A12%2F30%2F2021&tbm=isch
2022-03-14 16:49:32,715 - INFO - downloader - image #1	https://cdn.pixabay.com/photo/2020/07/30/07/24/truck-5449602_960_720.png
2022-03-14 16:49:33,067 - INFO - downloader - image #2	https://freesvg.org/img/1528405729.png
2022-03-14 16:49:33,379 - INFO - downloader - image #3	https://cdn.pixabay.com/photo/2018/07/27/15/35/pickup-truck-3566292_960_720.jpg
2022-03-14 16:49:33,520 - INFO - downloader - image #4	https://c0.wallpaperflare.com/preview/1011/771/7

In [None]:
### image crawling from google with GoogleImageCrawler
google_crawler = GoogleImageCrawler(
    feeder_threads=1,
    parser_threads=1,
    downloader_threads=4,
    storage={'root_dir': 'E:/RESEARCH/Datasets/HRV/crawl_test'}) #set the storage root

filters = dict(
    # type: photo, face, clipart, linedrawing, animated
    size='large',
    # size: large, medium, icon, larger than a given size (e.g. ">640x480"), or an exact size (e.g. "=1024x768")
    color='blackandwhite',
    # color: blackandwhite, red, orange, yellow, green, teal, blue, purple, pink, white, gray, black, brown
    license='commercial,modify',
    # license: "noncommercial", "commercial", "noncommercial,modify", "commercial,modify"
    date=((2021, 1, 1), (2021, 12, 30)))

# type the keyword of the image that you want to crawl from google
google_crawler.crawl(keyword='lung ct', filters=filters, offset=0, max_num=1000,
                     min_size=(200,200), max_size=None, file_idx_offset=0)

### 2-2) Getting Numerical Data

Numerical data can be obtained from sources such as Kaggle, Google, or the UCI Machine Learning Repository.

In [13]:
### In our medical case, we adopt the HRV dataset from the SWELL HRV research
### Public data must be used with care: it should only supplement the training data.

public = pd.read_csv('E:/RESEARCH/Datasets/HRV/HRV_Public/SWELL_hrv/data/final/train.csv', sep=',')

In [14]:
### data shape, variables check
print(public.shape)
print(public.columns)
public.head()

(369289, 36)
Index(['MEAN_RR', 'MEDIAN_RR', 'SDRR', 'RMSSD', 'SDSD', 'SDRR_RMSSD', 'HR',
       'pNN25', 'PNN50', 'SD1', 'SD2', 'KURT', 'SKEW', 'MEAN_REL_RR',
       'MEDIAN_REL_RR', 'SDRR_REL_RR', 'RMSSD_REL_RR', 'SDSD_REL_RR',
       'SDRR_RMSSD_REL_RR', 'KURT_REL_RR', 'SKEW_REL_RR', 'VLF', 'VLF_PCT',
       'LF', 'LF_PCT', 'LF_NU', 'HF', 'HF_PCT', 'HF_NU', 'TP', 'LF_HF',
       'HF_LF', 'sampen', 'higuci', 'datasetId', 'condition'],
      dtype='object')


| | MEAN_RR | MEDIAN_RR | SDRR | RMSSD | SDSD | SDRR_RMSSD | HR | pNN25 | PNN50 | SD1 | ... | HF | HF_PCT | HF_NU | TP | LF_HF | HF_LF | sampen | higuci | datasetId | condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 885.157845 | 853.76373 | 140.972741 | 15.554505 | 15.553371 | 9.063146 | 69.499952 | 11.133333 | 0.533333 | 11.001565 | ... | 15.522602 | 0.421047 | 1.514737 | 3686.666157 | 65.018055 | 0.01538 | 2.139754 | 1.163485 | 2 | no stress |
| 1 | 939.425371 | 948.357865 | 81.317742 | 12.964439 | 12.964195 | 6.272369 | 64.36315 | 5.6 | 0.0 | 9.170129 | ... | 2.108525 | 0.070133 | 0.304603 | 3006.487251 | 327.296635 | 0.003055 | 2.174499 | 1.084711 | 2 | interruption |
| 2 | 898.186047 | 907.00686 | 84.497236 | 16.305279 | 16.305274 | 5.182201 | 67.450066 | 13.066667 | 0.2 | 11.533417 | ... | 13.769729 | 0.512671 | 1.049528 | 2685.879461 | 94.28091 | 0.010607 | 2.13535 | 1.176315 | 2 | interruption |
| 3 | 881.757864 | 893.46003 | 90.370537 | 15.720468 | 15.720068 | 5.748591 | 68.809562 | 11.8 | 0.133333 | 11.119476 | ... | 18.181913 | 0.529387 | 1.775294 | 3434.52098 | 55.328701 | 0.018074 | 2.178341 | 1.179688 | 2 | no stress |
| 4 | 809.625331 | 811.184865 | 62.766242 | 19.213819 | 19.213657 | 3.266724 | 74.565728 | 20.2 | 0.2 | 13.590641 | ... | 48.215822 | 1.839473 | 3.279993 | 2621.175204 | 29.487873 | 0.033912 | 2.221121 | 1.249612 | 2 | no stress |


- - -

* Preprocess the public data so that it matches the domain data

In [15]:
### split the public data by condition: baseline ('no stress') and two stress conditions
public_b = public[public['condition'] == 'no stress']
public_s1 = public[public['condition'] == 'interruption']
public_s2 = public[public['condition'] == 'time pressure']

In [16]:
### check the number of each phase dataset
print(public_b.shape)
print(public_s1.shape)
print(public_s2.shape)

(200082, 36)
(105150, 36)
(64057, 36)


In [17]:
### keep only the variables that the public data shares with the domain data
public = public.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]
public_b = public_b.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]
public_s1 = public_s1.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]
public_s2 = public_s2.loc[:,['RMSSD', 'HR', 'PNN50', 'VLF', 'LF', 'HF', 'LF_HF']]
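
The shared variable list above is typed by hand. As a hedged alternative, it could be derived from the two frames directly, assuming the domain columns have already been renamed to the phase-neutral names. `common_features` and the toy frames below are illustrative only and are not part of the notebook.

```python
import pandas as pd

def common_features(domain_df: pd.DataFrame, public_df: pd.DataFrame) -> list:
    """Columns present in both frames, in the domain frame's column order."""
    return [c for c in domain_df.columns if c in public_df.columns]

# Empty toy frames: only the column names matter for this check.
toy_domain = pd.DataFrame(columns=["RMSSD", "HR", "PNN50", "LF_HF"])
toy_public = pd.DataFrame(columns=["MEAN_RR", "HR", "RMSSD", "LF_HF", "condition"])

shared = common_features(toy_domain, toy_public)
print(shared)
```

Matching by name says nothing about units or how each variable was computed, so the resulting list still needs a manual check before the frames are merged.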

In [23]:
### min-max normalization of the supplemented (public) dataset; the scaler is refit on each data frame
public_b[:] = scaler.fit_transform(public_b[:])
public_s1[:] = scaler.fit_transform(public_s1[:])
public_s2[:] = scaler.fit_transform(public_s2[:])
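
Because `fit_transform` refits the scaler on every call, each frame here (and each domain frame earlier) is mapped to [0, 1] using its own minimum and maximum, so the same raw value can end up at different scaled values in different frames. The sketch below illustrates the effect on made-up heart-rate values and shows one possible alternative, fitting once on the domain data and reusing that fit. Whether a shared scale is appropriate for this study is an assumption that depends on how comparable the two data sources are.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up heart-rate values; both frames contain the raw value 80.
toy_domain = pd.DataFrame({"HR": [60.0, 80.0, 100.0]})
toy_public = pd.DataFrame({"HR": [50.0, 80.0, 150.0]})

# Separate fits: each frame is scaled with its own min and max.
sep_domain = MinMaxScaler().fit_transform(toy_domain)
sep_public = MinMaxScaler().fit_transform(toy_public)

# Shared fit: learn min and max from the domain data, reuse them for the public data.
shared_scaler = MinMaxScaler().fit(toy_domain)
shared_domain = shared_scaler.transform(toy_domain)
shared_public = shared_scaler.transform(toy_public)

print(sep_domain[1, 0], sep_public[1, 0])        # HR = 80 under separate fits
print(shared_domain[1, 0], shared_public[1, 0])  # HR = 80 under the shared fit
```

With a shared fit, public values outside the domain range fall outside [0, 1], which may itself be a useful signal for the later filtering steps.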

In [24]:
### round the variable values to five decimal places
public_b = public_b.round(decimals=5)
public_s1 = public_s1.round(decimals=5)
public_s2 = public_s2.round(decimals=5)

- - -

## 03. Data Filtering (1st)

### 3-1) Data Merging

In [46]:
### First select the data phase (this may not be necessary for some datasets)
### Then check the number of rows in the domain and public datasets
### Here we use the baseline phase

print("Shape of the domain dataset for the training is", domain_b1.shape)
print("Shape of the public dataset for the training is", public_b.shape)

Shape of the domain dataset for the training is (479, 7)
Shape of the public dataset for the training is (200082, 7)


In [48]:
### choose how many rows to take from each dataset
### (frac=1 keeps every domain row and only shuffles them)
domain_resized = domain_b1.sample(frac=1)
public_resized = public_b.sample(n=1000)
print(domain_resized.shape)
print(public_resized.shape)

(479, 7)
(1000, 7)


In [49]:
training = pd.concat((domain_resized, public_resized))

In [50]:
### check the size/shape of the merged (first-stage augmented) dataset
print("Shape of the firstly augmented dataset for the training is", training.shape)

Shape of the firstly augmented dataset for the training is (1479, 7)
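
After the concatenation, domain and public rows can no longer be told apart, and their original indices may overlap. A minimal sketch of one way to keep track of both is given below, assuming that public rows are treated as unlabeled and marked with -1 (the convention scikit-learn's semi-supervised estimators use). The `source` and `label` columns and all toy values are hypothetical.

```python
import pandas as pd

toy_domain = pd.DataFrame({"RMSSD": [0.1, 0.2, 0.3], "HR": [0.3, 0.4, 0.5]})
toy_domain_y = pd.Series([0, 2, 1])  # labels, sharing toy_domain's index
toy_public = pd.DataFrame({"RMSSD": [0.5, 0.6, 0.7], "HR": [0.8, 0.9, 1.0]})

# Shuffle the domain rows; assigning a Series aligns on the index,
# so every row keeps its own label after the shuffle.
shuffled = toy_domain.sample(frac=1, random_state=0)
domain_part = shuffled.assign(source="domain", label=toy_domain_y)

# Public rows have no label yet: mark them with -1.
public_part = toy_public.assign(source="public", label=-1)

# ignore_index=True gives the merged frame a fresh, unique index.
merged = pd.concat([domain_part, public_part], ignore_index=True)
print(merged)
```

Fixing `random_state` in `sample` also makes the random subset reproducible across runs.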


In [52]:
training.head()

| | RMSSD | HR | PNN50 | VLF | LF | HF | LF_HF |
|---|---|---|---|---|---|---|---|
| 144 | 0.01222 | 0.310762 | 0.0 | 0.000114 | 4e-05 | 0.0002 | 0.012848 |
| 288 | 0.021936 | 0.204645 | 0.0 | 0.000616 | 0.000152 | 0.000558 | 0.017987 |
| 458 | 0.022482 | 0.199006 | 0.053191 | 0.000789 | 0.001259 | 0.000372 | 0.195978 |
| 275 | 0.037211 | 0.278394 | 0.079433 | 0.00058 | 0.000174 | 0.000806 | 0.012837 |
| 70 | 0.027051 | 0.339655 | 0.055319 | 0.001858 | 0.000939 | 0.001053 | 0.058309 |


## 04. Dimensionality Reduction

## 05. Data Clustering (SSL based)

## 06. Unlabeled data labeling
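
This section is not filled in yet. As a rough illustration of the label spreading step named in the outline, the sketch below uses scikit-learn's `LabelSpreading` on synthetic two-cluster data, not on the HRV data. The kernel choice, the `gamma` value, and the blob data are assumptions made for the example and are not settings taken from this project.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Synthetic stand-in for a merged feature matrix: two well-separated blobs.
blob0 = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
blob1 = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
X = np.vstack([blob0, blob1])
true_y = np.array([0] * 20 + [1] * 20)

# Only three points per blob keep their label; -1 marks unlabeled rows.
y = np.full(40, -1)
y[:3] = 0
y[20:23] = 1

model = LabelSpreading(kernel="rbf", gamma=1.0)
model.fit(X, y)

# transduction_ holds the label inferred for every row, labeled or not.
spread_y = model.transduction_
print((spread_y == true_y).mean())
```

On real data the clusters are unlikely to be this cleanly separated, so the inferred labels would need to be checked, for example in the second filtering step.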