# Final Project - Variational Auto Encoder

### Made by: Muhammad Salman Khan and Syed Bilal Rizwan

# 1. Introduction

Machine learning algorithms often struggle with unbalanced datasets, in which some classes have significantly more observations than others. One solution is to synthesize additional samples for the underrepresented classes in order to balance the dataset. In this project, we experimented with variational autoencoders (VAEs) as a method for generating tabular data to balance such datasets. We compared the effectiveness of VAEs against traditional class-balancing techniques such as SMOTE and Gaussian Mixture Models. Our findings provide insight into the potential of VAEs as a technique for generating synthetic samples to balance unbalanced datasets.

# 2. About the Datasets

The **Credit Card Fraud Dataset** is the primary dataset for this experiment. Because fraud events are rare compared with regular transactions, fraud detection is a good example of an unbalanced problem. We can use a VAE to generate new records of fraudulent activity and balance the dataset. We evaluate the potential of VAEs as a tool for generating synthetic samples to balance unbalanced datasets, such as the credit card fraud dataset, by comparing their performance with conventional class-balancing techniques such as SMOTE and Gaussian Mixture Models.

The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
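To make the imbalance concrete, the share of the positive class follows directly from the two counts quoted above. The snippet below is a minimal sketch; the variable names are illustrative and not part of the notebook's pipeline.

```python
# Counts quoted in the dataset description above.
n_fraud = 492
n_total = 284_807

# Fraction of transactions that are fraudulent.
fraud_share = n_fraud / n_total

# Number of legitimate transactions for every fraudulent one.
imbalance_ratio = (n_total - n_fraud) / n_fraud

print(f"Fraud share: {fraud_share:.2%}")
print(f"About {imbalance_ratio:.0f} legitimate transactions per fraud")
```

A ratio of this size is why accuracy alone is uninformative here: a model that never predicts fraud would still be right almost all of the time.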

The other two datasets that we balance along the way are:

**Marketing Campaign Dataset**: This is also a naturally imbalanced dataset, since the number of people who respond positively to a marketing campaign is typically much lower than the number who do not. The dataset used in this experiment has 2216 records in total, of which only 8% are positive.

**Heart Disease Prediction Dataset**: According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. It consists of 319,795 rows and 279 columns, which are reduced to 20 columns. According to this dataset, only 14% of the patients had a heart-related disease.

# 3. Background

## 3.1 Variational Auto Encoder

A variational autoencoder (VAE) is a generative model that can be used to generate new data. It consists of an encoder network that maps input data to a distribution over a latent code, and a decoder that maps values in the latent space back to the original data space.
During training, a VAE learns to reconstruct the dataset by minimizing the reconstruction loss between the input data and the decoder's output, together with a KL-divergence term that keeps the latent distribution close to a standard normal prior. The encoder learns to map the input into a compact latent space, and the decoder learns to map that latent space back to something as similar to the original input as possible.
For generation, a sample is drawn from the latent space and passed through the decoder to produce new data. Because the VAE has learned a compact representation of the data in the latent space, it can produce new records that resemble the original dataset even though they were not present in the training data.
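In symbols, the training objective can be sketched as follows. This assumes the setup used later in Section 5.2.6: a diagonal Gaussian encoder with mean $\mu$ and log-variance $\log \sigma^2$ over $d$ latent dimensions, a standard normal prior, and a mean-squared-error reconstruction term scaled by the number of input features $D$. The symbols are notation introduced here for illustration.

$$
\mathcal{L}(x) = D \cdot \mathrm{MSE}(x, \hat{x}) \;-\; \frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
$$

The first term rewards faithful reconstruction of the input $x$ by the decoder output $\hat{x}$. The second term is the KL divergence between the encoder's distribution and the prior, which is zero when $\mu_j = 0$ and $\sigma_j^2 = 1$ for every latent dimension.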

## 3.2 Other Techniques

### 3.2.1 SMOTE
SMOTE, the "Synthetic Minority Oversampling Technique", addresses class imbalance by creating new minority-class examples. It locates the nearest neighbours of each minority-class example in the feature space and then creates new synthetic examples along the line segments that connect the original example to those neighbours.
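The interpolation idea can be illustrated with a few lines of NumPy. This is a simplified sketch of the core step only, not the `imblearn` implementation used later in this notebook; the function name, the toy data and the choice of `k` are assumptions made for illustration.

```python
import numpy as np

def smote_like_sample(X_min, k=5, rng=None):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(X_min))                      # pick a base sample
    dists = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to every minority sample
    neighbours = np.argsort(dists)[1:k + 1]           # closest k, skipping the base sample itself
    j = rng.choice(neighbours)                        # pick one neighbour at random
    gap = rng.random()                                # position along the connecting segment, in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])

# Toy minority class with four 2-D points (made-up values).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_point = smote_like_sample(X_min, k=2, rng=np.random.default_rng(0))
print(new_point)
```

Because every synthetic point lies on a segment between two existing minority samples, the generated data never leaves the region already spanned by the minority class.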

### 3.2.2 Gaussian Mixture Models (GMM)

A GMM models the data as a weighted mixture of several Gaussian distributions. Once fitted, it can be used to generate new synthetic samples that are similar to the original data by sampling from the mixture, that is, by drawing new data points from the Gaussian distributions it contains.
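Sampling from a mixture takes two steps: choose a component according to its weight, then draw from that component's Gaussian. The sketch below shows this with hand-picked, hypothetical parameters; in the notebook itself the parameters are fitted with scikit-learn's `GaussianMixture` and samples are drawn with its `sample` method.

```python
import numpy as np

def sample_from_gmm(weights, means, covariances, n_samples, rng=None):
    """Draw n_samples points from a Gaussian mixture with the given parameters."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: choose a mixture component for every sample according to the weights.
    components = rng.choice(len(weights), size=n_samples, p=weights)
    # Step 2: draw each sample from the Gaussian of its chosen component.
    return np.array([
        rng.multivariate_normal(means[c], covariances[c]) for c in components
    ])

# Hypothetical two-component mixture in 2-D (values chosen for illustration).
weights = np.array([0.7, 0.3])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covariances = np.array([np.eye(2) * 0.1, np.eye(2) * 0.2])

samples = sample_from_gmm(weights, means, covariances, n_samples=1000,
                          rng=np.random.default_rng(0))
print(samples.shape)
```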

### 3.2.3 Vanilla AE

This is the most basic type of autoencoder, which is trained to reconstruct the input it is given. Its primary purpose is dimensionality reduction rather than data generation, but it is used here as a generative method to compare its performance with our main technique, the variational autoencoder. However, unlike the techniques mentioned above, it cannot be used to fully balance the classes in a dataset, because it needs an input to reconstruct; in this experiment it produces one reconstructed record per existing minority sample.

# 4. Methodology

For the experiment, the following methodology will be followed:
1. The dataset will be loaded.
2. The dataset will be pre-processed by removing missing values, one-hot encoding, and scaling using a MinMax scaler.
3. Machine learning models will be trained and tested on the imbalanced dataset to establish a baseline performance.
4. The dataset will be balanced by generating new samples of the underrepresented class using a variational autoencoder.
5. ML classification models will now be trained and tested on the balanced dataset generated using the VAE.
6. Traditional data balancing techniques, SMOTE and Gaussian Mixture Model, will be implemented to balance the original dataset, and the models will be trained and tested on the datasets balanced by each of them.
7. The performances of the balanced datasets generated by the VAE, SMOTE, and Gaussian Mixture Model will be compared to conclude the findings of the experiment.

# 5. Implementation

### 5.1 Loading Libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import  MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import time
import numpy as np
from sklearn.preprocessing import LabelEncoder


from imblearn.over_sampling import SMOTE

## AutoEncoder
from keras.layers import Input, Dense, BatchNormalization, Lambda, Layer, Add, Multiply
from keras.models import Model, Sequential
from keras.losses import mse, binary_crossentropy
from keras import backend as K
from keras.callbacks import EarlyStopping
from tensorflow.random import set_seed
import tensorflow as tf

from sklearn.mixture import GaussianMixture

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix

import warnings
warnings.filterwarnings("ignore")

### 5.2 Credit Card Fraud Dataset

#### 5.2.1 Loading Dataset

In [2]:
full_credit_df = pd.read_csv('CreditCardUCI.csv')  #Credit Card Fraud Dataset
print('Shape of credit card dataframe is: ', full_credit_df.shape)
print(full_credit_df.isnull().sum().sum()) #Check missing values
full_credit_df.head()  #Bird's eye view of dataset

Shape of credit card dataframe is:  (284807, 31)
0


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


#### 5.2.2 Pre-Processing

In [3]:
full_credit_df = full_credit_df.drop(columns = ['Time'])  #Dropping the unused 'Time' column
credit_classes = full_credit_df[['Class']]  #Target labels
credit_df = full_credit_df.drop(columns = ['Class'])  #Feature matrix without the target

#Scaling features to [0, 1] (all columns are numeric, so no one-hot encoding is needed here)
Scaler = MinMaxScaler()
credit_df = pd.DataFrame(Scaler.fit_transform(credit_df), columns = credit_df.columns)
print('Shape of df now is: ', credit_df.shape)
credit_df.head()


Shape of df now is:  (284807, 29)


Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0.935192,0.76649,0.881365,0.313023,0.763439,0.267669,0.266815,0.786444,0.475312,0.5106,...,0.582942,0.561184,0.522992,0.663793,0.391253,0.585122,0.394557,0.418976,0.312697,0.005824
1,0.978542,0.770067,0.840298,0.271796,0.76612,0.262192,0.264875,0.786298,0.453981,0.505267,...,0.57953,0.55784,0.480237,0.666938,0.33644,0.58729,0.446013,0.416345,0.313423,0.000105
2,0.935217,0.753118,0.868141,0.268766,0.762329,0.281122,0.270177,0.788042,0.410603,0.513018,...,0.585855,0.565477,0.54603,0.678939,0.289354,0.559515,0.402727,0.415489,0.311911,0.014739
3,0.941878,0.765304,0.868484,0.213661,0.765647,0.275559,0.266803,0.789434,0.414999,0.507585,...,0.57805,0.559734,0.510277,0.662607,0.223826,0.614245,0.389197,0.417669,0.314371,0.004807
4,0.938617,0.77652,0.864251,0.269796,0.762975,0.263984,0.268968,0.782484,0.49095,0.524303,...,0.584615,0.561327,0.547271,0.663392,0.40127,0.566343,0.507497,0.420561,0.31749,0.002724
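The scaling step above rescales every feature to the range [0, 1] using `(x - min) / (max - min)` per column. A minimal NumPy sketch of that formula is shown below; the function name and toy array are made up for illustration, and the sketch maps constant columns to 0 to avoid dividing by zero.

```python
import numpy as np

def min_max_scale(X):
    """Rescale every column of X to [0, 1] via (x - min) / (max - min)."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # constant columns map to 0 instead of 0/0
    return (X - col_min) / col_range

# Toy feature matrix with two columns on very different scales.
X = np.array([[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

Bounding every feature in [0, 1] also matches the sigmoid output layer of the VAE decoder defined later, which can only produce values in that range.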


In [4]:
credit_classes.value_counts()

Class
0        284315
1           492
dtype: int64

In [5]:
X_train, X_test, y_train, y_test = train_test_split(credit_df, credit_classes, test_size=0.30, random_state =0)  

X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)['Class']

X_test = X_test.reset_index(drop=True)
y_test  = y_test.reset_index(drop=True)['Class']
y_test.value_counts()

0    85296
1      147
Name: Class, dtype: int64

#### 5.2.3 Machine Learning Pipeline

In [6]:
## Train and test each classifier on one version of the dataset and collect the metrics

def train_test_ML2(dataform, X_train, y_train, X_test, y_test):
    """Fit six classifiers on (X_train, y_train) and score them on (X_test, y_test).

    `dataform` is a label for the dataset version (e.g. 'Original', 'SMOTE').
    Returns a DataFrame with one row of metrics per model.
    """
    temp_df = pd.DataFrame(columns=['Data Form', 'Model', 'Accuracy', 'F1 Score', 'Recall', 'Precision', 'True_Positive', 'False_Negative', 'Time Taken'])
    models = [LogisticRegression(), KNeighborsClassifier(n_jobs=-1), SVC(), DecisionTreeClassifier(), RandomForestClassifier(n_jobs=-1), GradientBoostingClassifier()]
    for model in models:
        start_time = time.time()
        fitted = model.fit(X_train, y_train)
        y_pred = fitted.predict(X_test)
        accuracy = np.round(accuracy_score(y_test, y_pred), 2)
        f1 = np.round(f1_score(y_test, y_pred), 2)
        recall = np.round(recall_score(y_test, y_pred), 2)
        precision = np.round(precision_score(y_test, y_pred), 2)
        end_time = time.time()
        # Time taken covers fitting, prediction and the four metric calls above
        time_taken = np.round((end_time - start_time), 2)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
        # type(model).__name__ gives the bare class name, without constructor arguments
        temp_df.loc[len(temp_df)] = [dataform, type(model).__name__, accuracy, f1, recall, precision, tp, fn, time_taken]
        print(model, 'is done in time: ', time_taken)
    return temp_df

#### 5.2.4 Machine Learning on Original Dataset

##### Running ML Pipeline

In [7]:
orig_df = train_test_ML2('Original', X_train, y_train, X_test, y_test)
orig_df.to_excel('credit_Original.xlsx', sheet_name = 'Original')
orig_df

LogisticRegression() is done in time:  0.82
KNeighborsClassifier(n_jobs=-1) is done in time:  367.78
SVC() is done in time:  7.37
DecisionTreeClassifier() is done in time:  8.33
RandomForestClassifier(n_jobs=-1) is done in time:  14.67
GradientBoostingClassifier() is done in time:  164.46


Unnamed: 0,Data Form,Model,Accuracy,F1 Score,Recall,Precision,True_Positive,False_Negative,Time Taken
0,Original,LogisticRegression,1.0,0.65,0.52,0.87,77,70,0.82
1,Original,KNeighborsClassifier,1.0,0.82,0.73,0.94,107,40,367.78
2,Original,SVC,1.0,0.82,0.8,0.84,117,30,7.37
3,Original,DecisionTreeClassifier,1.0,0.8,0.78,0.83,114,33,8.33
4,Original,RandomForestClassifier,1.0,0.84,0.76,0.94,112,35,14.67
5,Original,GradientBoostingClassifier,1.0,0.76,0.71,0.83,104,43,164.46
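The Recall and F1 Score columns can be cross-checked from the other columns of the table above. The sketch below does this for the logistic regression row; the helper names are illustrative, and because the reported precision is already rounded to two decimals, the recomputed F1 is only approximate.

```python
def recall_from_counts(tp, fn):
    """Recall = true positives / all actual positives."""
    return tp / (tp + fn)

def f1_from_precision_recall(precision, recall):
    """F1 = harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Logistic regression row of the table above:
# 77 true positives, 70 false negatives, reported precision 0.87.
recall = recall_from_counts(tp=77, fn=70)
f1 = f1_from_precision_recall(precision=0.87, recall=recall)
print(round(recall, 2), round(f1, 2))
```

Both values agree with the table (recall 0.52, F1 0.65). True positives and false negatives always sum to 147, the number of fraud cases in the test split.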


#### 5.2.5 Machine Learning on Dataset balanced by Variational Auto Encoder

#### 5.2.6 VAE Generation Pipeline

In [22]:
## Reparameterization trick: draw z = mean + sigma * epsilon, with epsilon sampled from a standard normal distribution

def sampling(args):
    mean, log_var = args
    batch = K.shape(mean)[0]
    dim = K.int_shape(mean)[1]
    epsilon = K.random_normal(shape=(batch, dim))
    return mean + K.exp(0.5 * log_var) * epsilon
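The same sampling step can be sketched in plain NumPy to see what it does outside of Keras. The function name and test values below are assumptions for illustration. Since the network outputs the log-variance, `exp(0.5 * log_var)` recovers the standard deviation.

```python
import numpy as np

def sample_latent(mean, log_var, rng=None):
    """z = mean + sigma * epsilon, with sigma = exp(0.5 * log_var) and epsilon ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    epsilon = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * epsilon

# Two latent vectors of dimension 2 (made-up values).
mean = np.array([[0.0, 1.0], [2.0, -1.0]])
log_var = np.full_like(mean, -20.0)   # tiny variance, so samples should sit near the mean
z = sample_latent(mean, log_var, rng=np.random.default_rng(0))
print(z)
```

Writing the sample as a deterministic function of `mean`, `log_var` and an independent noise term is what lets gradients flow back into the encoder during training.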

In [23]:
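## Variational Autoencoder (relies on the globals hidden_dim, latent_dim, optimizer, epochs and batch_size being set before the call)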

def var_ae(X_train):

    org_dim = X_train.shape[1]
    inp_shape = (org_dim,)
    
    ## Encoder
    inputs = Input(shape=inp_shape, name='Input_to_Encoder')
    x = Dense(hidden_dim, activation='relu')(inputs)
    mean = Dense(latent_dim, name='mean')(x)
    log_var = Dense(latent_dim, name='log_var')(x)
    z = Lambda(sampling, output_shape=(latent_dim, ), name='z')([mean, log_var])
    encoder = Model(inputs, [mean, log_var, z], name='encoder')
    
    ## Decoder
    latent_inputs = Input(shape=(latent_dim,), name='sampling')
    x = Dense(hidden_dim, activation='relu')(latent_inputs)
    outputs = Dense(org_dim, activation='sigmoid')(x)
    decoder = Model(latent_inputs, outputs, name='decoder')
    
    ## VAE
    outputs = decoder(encoder(inputs)[2])
    vae = Model(inputs, outputs, name='vae')
    
    # Loss
    reconstruction_loss = mse(inputs, outputs)
    reconstruction_loss *= org_dim

    kl_loss = 1 + log_var - K.square(mean) - K.exp(log_var)
    kl_loss = K.sum(kl_loss, axis=-1)
    kl_loss *= -0.5

    vae_loss = K.mean(reconstruction_loss + kl_loss)
    vae.add_loss(vae_loss)

    vae.compile(optimizer=optimizer)
    vae.fit(X_train, epochs=epochs, batch_size=batch_size)

    return encoder, decoder, vae
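To make the loss in `var_ae` easier to inspect, the two terms can be reproduced in NumPy on plain arrays. This is a sketch that mirrors the structure of the Keras code above (feature-count-scaled MSE plus the KL term, averaged over the batch); the function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, I) ) per sample, summed over latent dimensions."""
    return -0.5 * np.sum(1 + log_var - mean**2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_hat, mean, log_var):
    """Batch-averaged loss: MSE scaled by the number of features, plus the KL term."""
    n_features = x.shape[1]
    reconstruction = n_features * np.mean((x - x_hat) ** 2, axis=-1)
    return np.mean(reconstruction + kl_to_standard_normal(mean, log_var))

# The KL term vanishes when the encoder output equals the prior (mean 0, variance 1).
kl_at_prior = kl_to_standard_normal(np.zeros((3, 2)), np.zeros((3, 2)))

# Shifting one latent mean to 1.0 costs 0.5 * 1.0**2 = 0.5.
kl_shifted = kl_to_standard_normal(np.array([[1.0, 0.0]]), np.zeros((1, 2)))

# A perfect reconstruction with prior-matching latents gives zero total loss.
x = np.array([[0.2, 0.4, 0.6]])
loss_perfect = vae_loss(x, x, np.zeros((1, 2)), np.zeros((1, 2)))
print(kl_at_prior, kl_shifted, loss_perfect)
```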

In [24]:
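def generate(X_train, y_train):
    # Relies on the globals `dif` (number of rows to create), `latent_dim` and the trained `decoder`

    # Sample latent vectors from the standard normal prior and decode them into synthetic minority rows
    latent_sample = np.random.normal(0, 1, (dif, latent_dim))

    X_gen = decoder.predict(latent_sample)
    y_gen = np.ones(dif)  # every generated row is labelled as the minority class (1)
    X_new = np.concatenate((X_train, X_gen))
    y_new = np.concatenate((y_train, y_gen))

    return X_new, y_new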

##### Generating Data

In [25]:
epochs = 100
batch_size = 64
hidden_dim = 32
latent_dim = 2
optimizer = 'adam'

X_min = X_train[y_train == 1]
encoder, decoder, vae = var_ae(X_min)

Epoch 1/100
...
Epoch 78

In [26]:
majo = len(X_train[y_train == 0])
mino = len(X_train[y_train == 1])
dif = majo - mino
dif

198674
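The value above is the gap between the two class counts in the training split. A small helper makes the calculation explicit. This is a sketch: the function name is hypothetical, and the label array is rebuilt from the counts reported in this notebook (199,019 non-fraud rows and 345 fraud rows).

```python
import numpy as np

def synthetic_samples_needed(y):
    """Number of minority samples to add so that both classes have equal counts."""
    counts = np.bincount(np.asarray(y, dtype=int), minlength=2)
    return int(counts.max() - counts.min())

# Labels with the same class counts as the training split in this notebook.
y_counts_like_train = np.concatenate([np.zeros(199_019), np.ones(345)])
n_needed = synthetic_samples_needed(y_counts_like_train)
print(n_needed)  # 198674
```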

In [27]:
X_train_vae, y_train_vae = generate(X_train, y_train)

##### Running ML Pipeline

In [28]:
X_train_vae = pd.DataFrame(X_train_vae, columns=X_train.columns)
y_train_vae = pd.DataFrame(y_train_vae, columns=credit_classes.columns)

vae_training_set = pd.concat([X_train_vae,y_train_vae ], axis = 1).sample(frac =1)
y_train_vae_shuffled = vae_training_set[['Class']]
X_train_vae_shuffled = vae_training_set.drop(columns = ['Class'])

y_train_vae_shuffled.value_counts()


Class
0.0      199019
1.0      199019
dtype: int64

In [29]:
vae_df = train_test_ML2('Variational AE', X_train_vae_shuffled, y_train_vae_shuffled, X_test, y_test)
vae_df.to_excel('credit_VariationalAE.xlsx', sheet_name = 'Variational AE')
vae_df

LogisticRegression() is done in time:  1.85
KNeighborsClassifier(n_jobs=-1) is done in time:  637.91
SVC() is done in time:  15.67
DecisionTreeClassifier() is done in time:  13.09
RandomForestClassifier(n_jobs=-1) is done in time:  21.77
GradientBoostingClassifier() is done in time:  328.42


Unnamed: 0,Data Form,Model,Accuracy,F1 Score,Recall,Precision,True_Positive,False_Negative,Time Taken
0,Variational AE,LogisticRegression,1.0,0.81,0.76,0.88,112,35,1.85
1,Variational AE,KNeighborsClassifier,1.0,0.82,0.73,0.94,108,39,637.91
2,Variational AE,SVC,1.0,0.82,0.8,0.84,117,30,15.67
3,Variational AE,DecisionTreeClassifier,1.0,0.76,0.76,0.77,111,36,13.09
4,Variational AE,RandomForestClassifier,1.0,0.85,0.77,0.94,113,34,21.77
5,Variational AE,GradientBoostingClassifier,1.0,0.82,0.76,0.88,112,35,328.42


#### 5.2.7 Machine Learning on SMOTE Balanced Dataset

##### Balancing Data

In [8]:
smote = SMOTE(random_state=2)
x_smote, y_smote = smote.fit_resample(X_train, y_train)
y_smote.value_counts()

0    199019
1    199019
Name: Class, dtype: int64

##### Running ML Pipeline

In [9]:
smote_df = train_test_ML2('SMOTE', x_smote, y_smote, X_test, y_test)
smote_df.to_excel('credit_smote.xlsx', sheet_name = 'SMOTE')
smote_df

LogisticRegression() is done in time:  2.98
KNeighborsClassifier(n_jobs=-1) is done in time:  576.17
SVC() is done in time:  1615.93
DecisionTreeClassifier() is done in time:  19.51
RandomForestClassifier(n_jobs=-1) is done in time:  25.23
GradientBoostingClassifier() is done in time:  315.58


Unnamed: 0,Data Form,Model,Accuracy,F1 Score,Recall,Precision,True_Positive,False_Negative,Time Taken
0,SMOTE,LogisticRegression,0.98,0.12,0.91,0.06,134,13,2.98
1,SMOTE,KNeighborsClassifier,1.0,0.64,0.84,0.52,124,23,576.17
2,SMOTE,SVC,0.99,0.19,0.88,0.11,130,17,1615.93
3,SMOTE,DecisionTreeClassifier,1.0,0.53,0.79,0.4,116,31,19.51
4,SMOTE,RandomForestClassifier,1.0,0.85,0.82,0.89,120,27,25.23
5,SMOTE,GradientBoostingClassifier,0.99,0.25,0.87,0.14,128,19,315.58


#### 5.2.8 Machine Learning on dataset balanced by GMM

##### Generating Data

In [10]:
training_set = pd.concat([X_train, y_train], axis = 1)
gmm_fraud_set =  X_train.iloc[np.where(training_set['Class'] == 1)[0]]
gmm_fraud_set

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
126,0.917902,0.772751,0.697154,0.606801,0.783494,0.229021,0.253571,0.791417,0.355852,0.445594,...,0.573507,0.563870,0.543670,0.704956,0.313698,0.555361,0.521434,0.425746,0.324321,0.000039
405,0.981714,0.780863,0.750697,0.314516,0.768696,0.249683,0.256746,0.788051,0.423943,0.441850,...,0.582673,0.567454,0.511348,0.663271,0.371313,0.607201,0.546239,0.423694,0.316911,0.000030
695,0.966072,0.781754,0.828941,0.376289,0.777682,0.264030,0.273245,0.783928,0.404362,0.528481,...,0.577447,0.563182,0.527339,0.666193,0.436339,0.537224,0.392308,0.418211,0.315051,0.000000
3205,0.979722,0.796030,0.743232,0.488770,0.775458,0.248516,0.259881,0.786706,0.368099,0.442465,...,0.583602,0.564850,0.490161,0.663140,0.379698,0.613395,0.456962,0.425153,0.318816,0.000062
3522,0.964077,0.788522,0.789286,0.420914,0.762608,0.251437,0.253159,0.791271,0.382032,0.445066,...,0.583745,0.566312,0.482821,0.662646,0.337867,0.619014,0.434936,0.426721,0.319594,0.000068
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
196428,0.913073,0.829052,0.704787,0.534739,0.765308,0.231535,0.246448,0.804463,0.330803,0.359930,...,0.584515,0.573324,0.489696,0.659720,0.343694,0.634484,0.492943,0.421684,0.319292,0.000039
197334,0.951162,0.755061,0.801153,0.375464,0.770673,0.261338,0.275103,0.784126,0.435326,0.505951,...,0.593845,0.569228,0.534920,0.684512,0.451783,0.544062,0.375623,0.416117,0.315744,0.017565
198453,0.860543,0.773353,0.728022,0.480642,0.754283,0.236042,0.217018,0.810846,0.356918,0.360892,...,0.580452,0.580467,0.518210,0.647447,0.322016,0.555144,0.398366,0.444299,0.320419,0.000039
198731,0.777435,0.829705,0.635171,0.642353,0.712025,0.226958,0.182400,0.854340,0.208950,0.242071,...,0.579774,0.602941,0.519679,0.640665,0.457481,0.576341,0.483315,0.421556,0.310021,0.000000


In [11]:
gm = GaussianMixture(n_components = 30, covariance_type = 'full', random_state = 43)  #Initialize the GaussianMixture model

gm.fit(gmm_fraud_set)  #Fit the mixture to the fraud rows of the training set
#Sample synthetic rows from the fitted mixture; sample() returns (samples, component labels).
#n_samples is hard-coded slightly above the 198674 gap, so class 1 ends up with 199025 rows.
new_gm_sample = gm.sample(n_samples = 198680)[0]
GMM_sampled_df = pd.DataFrame(new_gm_sample, columns = gmm_fraud_set.columns)
GMM_sampled_df['Class'] = 1  #All generated rows are labelled as fraud

##### Running ML Pipeline

In [12]:
gmm_training_set = pd.concat([training_set, GMM_sampled_df], axis = 0)
gmm_training_set = gmm_training_set.sample(frac = 1)
y_train_gmm = gmm_training_set[['Class']]
X_train_gmm = gmm_training_set.drop(columns = ['Class'])

y_train_gmm.value_counts()

Class
1        199025
0        199019
dtype: int64

In [13]:
gmm_df = train_test_ML2('GMM', X_train_gmm, y_train_gmm, X_test, y_test)
gmm_df.to_excel('credit_gmm.xlsx', sheet_name = 'GMM')
gmm_df

LogisticRegression() is done in time:  2.82
KNeighborsClassifier(n_jobs=-1) is done in time:  628.35
SVC() is done in time:  1601.5
DecisionTreeClassifier() is done in time:  26.89
RandomForestClassifier(n_jobs=-1) is done in time:  40.87
GradientBoostingClassifier() is done in time:  327.12


Unnamed: 0,Data Form,Model,Accuracy,F1 Score,Recall,Precision,True_Positive,False_Negative,Time Taken
0,GMM,LogisticRegression,0.98,0.13,0.9,0.07,133,14,2.82
1,GMM,KNeighborsClassifier,1.0,0.61,0.85,0.47,125,22,628.35
2,GMM,SVC,0.99,0.21,0.89,0.12,131,16,1601.5
3,GMM,DecisionTreeClassifier,0.99,0.22,0.86,0.13,127,20,26.89
4,GMM,RandomForestClassifier,1.0,0.62,0.86,0.48,126,21,40.87
5,GMM,GradientBoostingClassifier,0.99,0.28,0.87,0.16,128,19,327.12


#### 5.2.9 Machine Learning on dataset balanced by Vanilla Auto Encoder

##### Generating Data

In [14]:
training_set = pd.concat([X_train, y_train], axis = 1)
ae_fraud_set =  X_train.iloc[np.where(training_set['Class'] == 1)[0]]
ae_fraud_set

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
126,0.917902,0.772751,0.697154,0.606801,0.783494,0.229021,0.253571,0.791417,0.355852,0.445594,...,0.573507,0.563870,0.543670,0.704956,0.313698,0.555361,0.521434,0.425746,0.324321,0.000039
405,0.981714,0.780863,0.750697,0.314516,0.768696,0.249683,0.256746,0.788051,0.423943,0.441850,...,0.582673,0.567454,0.511348,0.663271,0.371313,0.607201,0.546239,0.423694,0.316911,0.000030
695,0.966072,0.781754,0.828941,0.376289,0.777682,0.264030,0.273245,0.783928,0.404362,0.528481,...,0.577447,0.563182,0.527339,0.666193,0.436339,0.537224,0.392308,0.418211,0.315051,0.000000
3205,0.979722,0.796030,0.743232,0.488770,0.775458,0.248516,0.259881,0.786706,0.368099,0.442465,...,0.583602,0.564850,0.490161,0.663140,0.379698,0.613395,0.456962,0.425153,0.318816,0.000062
3522,0.964077,0.788522,0.789286,0.420914,0.762608,0.251437,0.253159,0.791271,0.382032,0.445066,...,0.583745,0.566312,0.482821,0.662646,0.337867,0.619014,0.434936,0.426721,0.319594,0.000068
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
196428,0.913073,0.829052,0.704787,0.534739,0.765308,0.231535,0.246448,0.804463,0.330803,0.359930,...,0.584515,0.573324,0.489696,0.659720,0.343694,0.634484,0.492943,0.421684,0.319292,0.000039
197334,0.951162,0.755061,0.801153,0.375464,0.770673,0.261338,0.275103,0.784126,0.435326,0.505951,...,0.593845,0.569228,0.534920,0.684512,0.451783,0.544062,0.375623,0.416117,0.315744,0.017565
198453,0.860543,0.773353,0.728022,0.480642,0.754283,0.236042,0.217018,0.810846,0.356918,0.360892,...,0.580452,0.580467,0.518210,0.647447,0.322016,0.555144,0.398366,0.444299,0.320419,0.000039
198731,0.777435,0.829705,0.635171,0.642353,0.712025,0.226958,0.182400,0.854340,0.208950,0.242071,...,0.579774,0.602941,0.519679,0.640665,0.457481,0.576341,0.483315,0.421556,0.310021,0.000000


In [15]:
latent_size = 1
batch_size = 32
hidden_layer_nodes = 16

input_layer_encoder = Input(shape=(29,), name="Input_Layer_Encoder")
batch_normalize_input = BatchNormalization()(input_layer_encoder)
hidden_layer_encoder = Dense(hidden_layer_nodes, activation="relu", name="Hidden_Layer_Encoder")(batch_normalize_input)
batch_normalize_hidden_encoder = BatchNormalization()(hidden_layer_encoder)
code_layer = Dense(latent_size, name="Code")(batch_normalize_hidden_encoder)

encoder_model = Model(input_layer_encoder, code_layer)

In [16]:
input_layer_decoder = Input(shape=(latent_size,), name="Input_layer_Decoder")
batch_normalize_input_decoder = BatchNormalization()(input_layer_decoder)
hidden_layer_decoder = Dense(hidden_layer_nodes, activation="relu", name="Hidden_layer_Decoding")(batch_normalize_input_decoder)
batch_normalize_hidden_decoder = BatchNormalization()(hidden_layer_decoder)
output_layer = Dense(29, activation="linear", name="Output_Layer")(batch_normalize_hidden_decoder)

decoder_model = Model(input_layer_decoder, output_layer, name="Decoder")

In [17]:
encoder_decoder_model = decoder_model(encoder_model(input_layer_encoder))

autoencoder = Model(input_layer_encoder, encoder_decoder_model)
autoencoder.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Input_Layer_Encoder (InputL  [(None, 29)]             0         
 ayer)                                                           
                                                                 
 model (Functional)          (None, 1)                 677       
                                                                 
 Decoder (Functional)        (None, 29)                593       
                                                                 
Total params: 1,270
Trainable params: 1,146
Non-trainable params: 124
_________________________________________________________________


In [18]:
set_seed(1996)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
autoencoder.compile(loss="mean_squared_error", optimizer="adam")
history = autoencoder.fit(
    ae_fraud_set, ae_fraud_set,  # the autoencoder is trained to reproduce its own input
    shuffle=True, epochs=400, batch_size=32,
    validation_split=0.2, verbose=2,
    callbacks=[callback],
).history

Epoch 1/400
9/9 - 1s - loss: 0.8923 - val_loss: 0.3147 - 677ms/epoch - 75ms/step
Epoch 2/400
9/9 - 0s - loss: 0.7306 - val_loss: 0.2938 - 22ms/epoch - 2ms/step
Epoch 3/400
9/9 - 0s - loss: 0.6012 - val_loss: 0.2744 - 23ms/epoch - 3ms/step
Epoch 4/400
9/9 - 0s - loss: 0.4929 - val_loss: 0.2566 - 23ms/epoch - 3ms/step
Epoch 5/400
9/9 - 0s - loss: 0.4111 - val_loss: 0.2404 - 23ms/epoch - 3ms/step
Epoch 6/400
9/9 - 0s - loss: 0.3466 - val_loss: 0.2251 - 22ms/epoch - 2ms/step
Epoch 7/400
9/9 - 0s - loss: 0.2925 - val_loss: 0.2108 - 22ms/epoch - 2ms/step
Epoch 8/400
9/9 - 0s - loss: 0.2538 - val_loss: 0.1968 - 22ms/epoch - 2ms/step
Epoch 9/400
9/9 - 0s - loss: 0.2251 - val_loss: 0.1830 - 22ms/epoch - 2ms/step
Epoch 10/400
9/9 - 0s - loss: 0.2020 - val_loss: 0.1700 - 22ms/epoch - 2ms/step
Epoch 11/400
9/9 - 0s - loss: 0.1840 - val_loss: 0.1578 - 25ms/epoch - 3ms/step
Epoch 12/400
9/9 - 0s - loss: 0.1688 - val_loss: 0.1469 - 38ms/epoch - 4ms/step
Epoch 13/400
9/9 - 0s - loss: 0.1548 - val_loss

In [19]:
ae_fraud_output = pd.DataFrame(autoencoder.predict(ae_fraud_set), columns = ae_fraud_set.columns)
ae_fraud_output['Class'] = 1

##### Running ML Pipeline

In [20]:
ae_training_set = pd.concat([training_set, ae_fraud_output], axis = 0).sample(frac = 1)
y_train_ae = ae_training_set[['Class']]
X_train_ae = ae_training_set.drop(columns = ['Class'])

y_train_ae.value_counts()

Class
0        199019
1           690
dtype: int64

In [21]:
ae_df = train_test_ML2('Vanilla AE', X_train_ae, y_train_ae, X_test, y_test)
ae_df.to_excel('credit_vanillaAE.xlsx', sheet_name = 'Vanilla AE')
ae_df

LogisticRegression() is done in time:  1.05
KNeighborsClassifier(n_jobs=-1) is done in time:  340.12
SVC() is done in time:  7.24
DecisionTreeClassifier() is done in time:  7.85
RandomForestClassifier(n_jobs=-1) is done in time:  14.05
GradientBoostingClassifier() is done in time:  153.92


Unnamed: 0,Data Form,Model,Accuracy,F1 Score,Recall,Precision,True_Positive,False_Negative,Time Taken
0,Vanilla AE,LogisticRegression,1.0,0.69,0.57,0.88,84,63,1.05
1,Vanilla AE,KNeighborsClassifier,1.0,0.82,0.73,0.94,107,40,340.12
2,Vanilla AE,SVC,1.0,0.82,0.8,0.84,117,30,7.24
3,Vanilla AE,DecisionTreeClassifier,1.0,0.76,0.75,0.77,110,37,7.85
4,Vanilla AE,RandomForestClassifier,1.0,0.85,0.76,0.95,112,35,14.05
5,Vanilla AE,GradientBoostingClassifier,1.0,0.75,0.64,0.9,94,53,153.92


#### 5.2.10 Analysis:

In this experiment, five versions of the credit card fraud data were evaluated in a machine learning pipeline, and the following results were achieved.
1. **Original dataset:** First, we applied the machine learning models to the original imbalanced dataset to establish a baseline. Random forest, k-nearest neighbours, and SVM performed best, with F1 scores in the range of 0.82-0.84. Logistic regression performed worst, with an F1 score of 0.65. Because the dataset is very imbalanced, the accuracy of all models on all versions of the dataset is near perfect, so we leave accuracy out of the analysis.
2. **Dataset balanced by Variational Autoencoder:** When machine learning was applied to the dataset balanced by the variational autoencoder, the performances of all models fell in a close range, with an increase in average performance. Of the six models tested, five had F1 scores in the range of 0.81 to 0.85; the decision tree was the exception, with an F1 score of 0.76. A possible explanation for this boost is that the VAE generates synthetic samples by learning a compact representation of the data distribution and sampling from this latent space. Compared with training on the original unbalanced dataset, this may yield synthetic samples that reflect the minority-class distribution well enough to improve the performance of machine learning models.
3. **Dataset balanced by SMOTE:** When machine learning was applied to the dataset balanced using SMOTE, the F1 score decreased significantly for every model except random forest, which reached 0.85. The other five models scored between 0.12 and 0.64, below their baselines on the original imbalanced dataset. Recall increased for all models (0.79 to 0.91), but precision dropped sharply (as low as 0.06 for logistic regression), meaning that many legitimate transactions were flagged as fraud. SMOTE generates synthetic samples by oversampling the minority class and interpolating between existing samples, which can produce synthetic samples that are not representative of the true data distribution. This could be one reason for the decrease in performance.
4. **Dataset balanced by GMM:** We observed a considerable decline in the F1 score of all models when machine learning was applied to the dataset balanced using Gaussian Mixture Models, with the average performance being the worst of all the experiments. The highest F1 score, 0.62, came from the random forest model. This leads us to believe that a Gaussian Mixture Model may not be effective for generating synthetic samples that are representative of the true data distribution, particularly if the data has a complex or non-Gaussian distribution.
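As a rough cross-check of the comparisons above, the F1 scores reported in the result tables can be averaged per dataset version. This is only a summary sketch: it takes an unweighted mean over the six models, with values copied from the tables in Section 5.2.

```python
# F1 scores per dataset version, in table order:
# LogisticRegression, KNeighbors, SVC, DecisionTree, RandomForest, GradientBoosting.
f1_scores = {
    "Original":       [0.65, 0.82, 0.82, 0.80, 0.84, 0.76],
    "Variational AE": [0.81, 0.82, 0.82, 0.76, 0.85, 0.82],
    "SMOTE":          [0.12, 0.64, 0.19, 0.53, 0.85, 0.25],
    "GMM":            [0.13, 0.61, 0.21, 0.22, 0.62, 0.28],
    "Vanilla AE":     [0.69, 0.82, 0.82, 0.76, 0.85, 0.75],
}

# Unweighted mean F1 per dataset version.
mean_f1 = {name: sum(scores) / len(scores) for name, scores in f1_scores.items()}

# Print from best to worst average.
for name, value in sorted(mean_f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} {value:.3f}")
```

On these numbers the VAE-balanced data has the highest average F1 and the GMM-balanced data the lowest, which is consistent with the ranking described above.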

## 5.3 Marketing Dataset

#### 5.3.1 Loading Dataset

In [30]:
marketing_df = pd.read_csv('MarketingDataUCI.csv', sep='\t')  #Marketing Campaign Dataset
marketing_df.head()  #Bird's eye view of dataset

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0


#### 5.3.2 Pre-Processing

In [31]:
#Removing missing values
marketing_df.dropna(inplace = True)

marketing_classes = marketing_df[['Response']]  #Target column
marketing_df.drop(columns = ['Response', 'ID', 'Dt_Customer'], inplace = True)  #Dropping the target and unnecessary columns

#Dummy-encoding (one-hot encoding) the categorical variables
marketing_df = pd.get_dummies(marketing_df, drop_first = True)

#Scaling all features to the [0, 1] range
Scaler = MinMaxScaler()
marketing_df = pd.DataFrame(Scaler.fit_transform(marketing_df), columns = marketing_df.columns)
print('Shape of df now is: ', marketing_df.shape)
marketing_df.head()

Shape of df now is:  (2216, 35)


Unnamed: 0,Year_Birth,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,...,Education_Graduation,Education_Master,Education_PhD,Marital_Status_Alone,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single,Marital_Status_Together,Marital_Status_Widow,Marital_Status_YOLO
0,0.621359,0.084832,0.0,0.0,0.585859,0.425318,0.442211,0.316522,0.664093,0.335878,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.592233,0.067095,0.5,0.5,0.383838,0.007368,0.005025,0.003478,0.007722,0.003817,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.699029,0.105097,0.0,0.0,0.262626,0.285332,0.246231,0.073623,0.428571,0.080153,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.883495,0.037471,0.5,0.0,0.262626,0.007368,0.020101,0.011594,0.03861,0.01145,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.854369,0.085065,0.5,0.0,0.949495,0.115874,0.21608,0.068406,0.177606,0.103053,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [32]:
X_train, X_test, y_train, y_test = train_test_split(marketing_df, marketing_classes, test_size=0.30, random_state =43)  

X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)['Response']

X_test = X_test.reset_index(drop=True)
y_test  = y_test.reset_index(drop=True)['Response']
y_test.value_counts()

0    568
1     97
Name: Response, dtype: int64

#### 5.3.3 Machine Learning on Original Dataset

In [33]:
orig_df = train_test_ML2('Original', X_train, y_train, X_test, y_test)
orig_df.to_excel('marketing_Original.xlsx', sheet_name = 'Original')

orig_df

LogisticRegression() is done in time:  0.01
KNeighborsClassifier(n_jobs=-1) is done in time:  0.05
SVC() is done in time:  0.1
DecisionTreeClassifier() is done in time:  0.02
RandomForestClassifier(n_jobs=-1) is done in time:  0.16
GradientBoostingClassifier() is done in time:  0.27


| | Data Form | Model | Accuracy | F1 Score | Recall | Precision | True_Positive | False_Negative | Time Taken |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Original | LogisticRegression | 0.89 | 0.49 | 0.37 | 0.71 | 36 | 61 | 0.01 |
| 1 | Original | KNeighborsClassifier(n_jobs=- | 0.86 | 0.27 | 0.18 | 0.59 | 17 | 80 | 0.05 |
| 2 | Original | SVC | 0.87 | 0.29 | 0.19 | 0.64 | 18 | 79 | 0.1 |
| 3 | Original | DecisionTreeClassifier | 0.84 | 0.46 | 0.47 | 0.44 | 46 | 51 | 0.02 |
| 4 | Original | RandomForestClassifier(n_jobs=- | 0.88 | 0.43 | 0.3 | 0.76 | 29 | 68 | 0.16 |
| 5 | Original | GradientBoostingClassifier | 0.89 | 0.49 | 0.37 | 0.72 | 36 | 61 | 0.27 |
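
As a quick sanity check on how the columns of this table relate, the sketch below recomputes recall and F1 for the LogisticRegression row from its true positives, false negatives and precision. The helper name `recall_and_f1` is an assumption introduced for illustration; it is not part of `train_test_ML2`.

```python
def recall_and_f1(true_positive, false_negative, precision):
    """Recall = TP / (TP + FN); F1 is the harmonic mean of precision and recall."""
    recall = true_positive / (true_positive + false_negative)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1

# LogisticRegression on the original marketing data:
# 36 true positives, 61 false negatives, precision 0.71 (values from the table above).
recall, f1 = recall_and_f1(36, 61, 0.71)
print(round(recall, 2), round(f1, 2))
```

The rounded values agree with the Recall (0.37) and F1 Score (0.49) columns. The high accuracy next to a low F1 score reflects the class imbalance: only 97 of the 665 test samples are positive.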


#### 5.3.4 Machine Learning on Dataset balanced by Variational Auto Encoder

##### Generating Data

In [34]:
epochs = 200
batch_size = 64
hidden_dim = 32
latent_dim = 2
optimizer = 'adam'

X_min = X_train[y_train == 1]
encoder, decoder, vae = var_ae(X_min)

Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78

In [35]:
majo = len(X_train[y_train == 0])
mino = len(X_train[y_train == 1])
dif = majo - mino
dif

1079

In [36]:
X_train_vae, y_train_vae = generate(X_train, y_train)


##### Running ML Pipeline

In [37]:
X_train_vae = pd.DataFrame(X_train_vae, columns=X_train.columns)
y_train_vae = pd.DataFrame(y_train_vae, columns=marketing_classes.columns)

vae_training_set = pd.concat([X_train_vae,y_train_vae ], axis = 1).sample(frac =1)
y_train_vae_shuffled = vae_training_set[['Response']]
X_train_vae_shuffled = vae_training_set.drop(columns = ['Response'])

y_train_vae_shuffled.value_counts()

Response
0.0         1315
1.0         1315
dtype: int64

In [38]:
vae_df = train_test_ML2('Variational AE', X_train_vae_shuffled, y_train_vae_shuffled, X_test, y_test)
vae_df.to_excel('marketing_VariationalAE.xlsx', sheet_name = 'Variational AE')
vae_df

LogisticRegression() is done in time:  0.02
KNeighborsClassifier(n_jobs=-1) is done in time:  0.06
SVC() is done in time:  0.16
DecisionTreeClassifier() is done in time:  0.02
RandomForestClassifier(n_jobs=-1) is done in time:  0.14
GradientBoostingClassifier() is done in time:  0.95


| | Data Form | Model | Accuracy | F1 Score | Recall | Precision | True_Positive | False_Negative | Time Taken |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Variational AE | LogisticRegression | 0.86 | 0.55 | 0.6 | 0.51 | 58 | 39 | 0.02 |
| 1 | Variational AE | KNeighborsClassifier(n_jobs=- | 0.86 | 0.27 | 0.18 | 0.59 | 17 | 80 | 0.06 |
| 2 | Variational AE | SVC | 0.86 | 0.27 | 0.18 | 0.63 | 17 | 80 | 0.16 |
| 3 | Variational AE | DecisionTreeClassifier | 0.84 | 0.47 | 0.48 | 0.46 | 47 | 50 | 0.02 |
| 4 | Variational AE | RandomForestClassifier(n_jobs=- | 0.88 | 0.42 | 0.3 | 0.72 | 29 | 68 | 0.14 |
| 5 | Variational AE | GradientBoostingClassifier | 0.89 | 0.49 | 0.37 | 0.72 | 36 | 61 | 0.95 |


## 5.4 Heart Disease Dataset

#### 5.4.1 Loading Dataset

In [39]:
heart_df = pd.read_csv('HeartDataUCI.csv')   #Heart disease Dataset
print('Shape of heart disease dataframe is: ', heart_df.shape)
heart_df.head()  #Bird's eye view of dataset

Shape of heart disease dataframe is:  (319795, 18)


Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


In [40]:
heart_df['HeartDisease'].value_counts(normalize=True)['Yes']   #Checking class distribution (share of positive cases)

0.08559545959130067

#### 5.4.2 Pre-Processing

In [41]:
heart_classes = heart_df[['HeartDisease']]
heart_df.drop(columns = ['HeartDisease'], inplace = True)  #Separating the target column from the features

#Scaling and One hot Encoding
Scaler = StandardScaler()
heart_df = pd.get_dummies(heart_df)
heart_df = pd.DataFrame(Scaler.fit_transform(heart_df), columns = heart_df.columns)
print('Shape of df now is: ', heart_df.shape)
heart_df.head()

Shape of df now is:  (319795, 50)


Unnamed: 0,BMI,PhysicalHealth,MentalHealth,SleepTime,Smoking_No,Smoking_Yes,AlcoholDrinking_No,AlcoholDrinking_Yes,Stroke_No,Stroke_Yes,...,GenHealth_Fair,GenHealth_Good,GenHealth_Poor,GenHealth_Very good,Asthma_No,Asthma_Yes,KidneyDisease_No,KidneyDisease_Yes,SkinCancer_No,SkinCancer_Yes
0,-1.84475,-0.046751,3.281069,-1.460354,-1.193474,1.193474,0.27032,-0.27032,0.19804,-0.19804,...,-0.348745,-0.640987,-0.191292,1.344886,-2.541515,2.541515,0.195554,-0.195554,-3.118419,3.118419
1,-1.256338,-0.42407,-0.490039,-0.067601,0.83789,-0.83789,0.27032,-0.27032,-5.049478,5.049478,...,-0.348745,-0.640987,-0.191292,1.344886,0.393466,-0.393466,0.195554,-0.195554,0.320675,-0.320675
2,-0.274603,2.091388,3.281069,0.628776,-1.193474,1.193474,0.27032,-0.27032,0.19804,-0.19804,...,2.867422,-0.640987,-0.191292,-0.743558,-2.541515,2.541515,0.195554,-0.195554,0.320675,-0.320675
3,-0.647473,-0.42407,-0.490039,-0.763977,0.83789,-0.83789,0.27032,-0.27032,0.19804,-0.19804,...,-0.348745,1.560094,-0.191292,-0.743558,0.393466,-0.393466,0.195554,-0.195554,-3.118419,3.118419
4,-0.726138,3.097572,-0.490039,0.628776,0.83789,-0.83789,0.27032,-0.27032,0.19804,-0.19804,...,-0.348745,-0.640987,-0.191292,1.344886,0.393466,-0.393466,0.195554,-0.195554,0.320675,-0.320675


In [42]:
#Encoding the target labels: 'Yes' -> 1, 'No' -> 0
heart_classes = heart_classes.assign(
    HeartDisease = heart_classes['HeartDisease'].map({'Yes': 1, 'No': 0})).astype('int')

In [43]:
X_train, X_test, y_train, y_test = train_test_split(heart_df, heart_classes, test_size=0.30, random_state =43)  

X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)['HeartDisease']

X_test = X_test.reset_index(drop=True)
y_test  = y_test.reset_index(drop=True)['HeartDisease']
y_test.value_counts()

0    87846
1     8093
Name: HeartDisease, dtype: int64

#### 5.4.3 Machine Learning on Original Dataset

In [44]:
orig_df = train_test_ML2('Original', X_train, y_train, X_test, y_test)
orig_df.to_excel('heart_Original.xlsx', sheet_name = 'Original')
orig_df

LogisticRegression() is done in time:  0.6
KNeighborsClassifier(n_jobs=-1) is done in time:  438.87
SVC() is done in time:  10824.45
DecisionTreeClassifier() is done in time:  1.51
RandomForestClassifier(n_jobs=-1) is done in time:  4.66
GradientBoostingClassifier() is done in time:  24.39


| | Data Form | Model | Accuracy | F1 Score | Recall | Precision | True_Positive | False_Negative | Time Taken |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Original | LogisticRegression | 0.92 | 0.18 | 0.11 | 0.56 | 864 | 7229 | 0.6 |
| 1 | Original | KNeighborsClassifier(n_jobs=- | 0.91 | 0.21 | 0.15 | 0.36 | 1183 | 6910 | 438.87 |
| 2 | Original | SVC | 0.92 | 0.11 | 0.06 | 0.58 | 489 | 7604 | 10824.45 |
| 3 | Original | DecisionTreeClassifier | 0.87 | 0.25 | 0.27 | 0.24 | 2173 | 5920 | 1.51 |
| 4 | Original | RandomForestClassifier(n_jobs=- | 0.9 | 0.19 | 0.13 | 0.34 | 1087 | 7006 | 4.66 |
| 5 | Original | GradientBoostingClassifier | 0.92 | 0.16 | 0.09 | 0.57 | 756 | 7337 | 24.39 |


#### 5.4.4 Machine Learning on Dataset balanced by Variational Auto Encoder

##### Generating Data

In [45]:
epochs = 100
batch_size = 64
hidden_dim = 32
latent_dim = 2
optimizer = 'adam'

X_min = X_train[y_train == 1]
encoder, decoder, vae = var_ae(X_min)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

In [46]:
majo = len(X_train[y_train == 0])
mino = len(X_train[y_train == 1])
dif = majo - mino
dif

185296

In [47]:
X_train_vae, y_train_vae = generate(X_train, y_train)

##### Running ML Pipeline

In [48]:
X_train_vae = pd.DataFrame(X_train_vae, columns=X_train.columns)
y_train_vae = pd.DataFrame(y_train_vae, columns=heart_classes.columns)

vae_training_set = pd.concat([X_train_vae,y_train_vae ], axis = 1).sample(frac =1)
y_train_vae_shuffled = vae_training_set[['HeartDisease']]
X_train_vae_shuffled = vae_training_set.drop(columns = ['HeartDisease'])

y_train_vae_shuffled.value_counts()

HeartDisease
0.0             204576
1.0             204576
dtype: int64

In [49]:
vae_df = train_test_ML2('Variational AE', X_train_vae_shuffled, y_train_vae_shuffled, X_test, y_test)
vae_df.to_excel('heart_VariationalAE.xlsx', sheet_name = 'Variational AE')
vae_df

LogisticRegression() is done in time:  3.06
KNeighborsClassifier(n_jobs=-1) is done in time:  694.36
SVC() is done in time:  14388.38
DecisionTreeClassifier() is done in time:  3.4
RandomForestClassifier(n_jobs=-1) is done in time:  8.68
GradientBoostingClassifier() is done in time:  259.58


| | Data Form | Model | Accuracy | F1 Score | Recall | Precision | True_Positive | False_Negative | Time Taken |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Variational AE | LogisticRegression | 0.92 | 0.18 | 0.11 | 0.56 | 865 | 7228 | 3.06 |
| 1 | Variational AE | KNeighborsClassifier(n_jobs=- | 0.91 | 0.21 | 0.15 | 0.36 | 1183 | 6910 | 694.36 |
| 2 | Variational AE | SVC | 0.92 | 0.12 | 0.07 | 0.54 | 558 | 7535 | 14388.38 |
| 3 | Variational AE | DecisionTreeClassifier | 0.87 | 0.25 | 0.27 | 0.24 | 2171 | 5922 | 3.4 |
| 4 | Variational AE | RandomForestClassifier(n_jobs=- | 0.9 | 0.19 | 0.14 | 0.33 | 1112 | 6981 | 8.68 |
| 5 | Variational AE | GradientBoostingClassifier | 0.92 | 0.17 | 0.1 | 0.57 | 790 | 7303 | 259.58 |


# 6. Critical Analysis
In this experiment, we used a variational autoencoder to balance three imbalanced datasets before training machine learning models on them. Although the models performed consistently and similarly to one another on each of the VAE-balanced datasets, this was not a clear improvement over the original imbalanced datasets: on the marketing and heart disease datasets, the F1 score of most models changed by 0.01 or less.
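
The size of the change can be checked directly from the F1 Score columns of the result tables in Sections 5.3 and 5.4. The sketch below copies those values by hand and computes the per-model difference between the VAE-balanced and original runs. The dictionary layout and the shortened model names are choices made for this illustration only.

```python
models = ["LogisticRegression", "KNeighbors", "SVC",
          "DecisionTree", "RandomForest", "GradientBoosting"]

# F1 scores transcribed from the result tables in Sections 5.3 and 5.4.
f1_scores = {
    "Marketing": {
        "Original":       [0.49, 0.27, 0.29, 0.46, 0.43, 0.49],
        "Variational AE": [0.55, 0.27, 0.27, 0.47, 0.42, 0.49],
    },
    "Heart Disease": {
        "Original":       [0.18, 0.21, 0.11, 0.25, 0.19, 0.16],
        "Variational AE": [0.18, 0.21, 0.12, 0.25, 0.19, 0.17],
    },
}

mean_change = {}
for dataset, scores in f1_scores.items():
    # Positive values mean the VAE-balanced run scored higher.
    deltas = [round(vae - orig, 2)
              for orig, vae in zip(scores["Original"], scores["Variational AE"])]
    mean_change[dataset] = round(sum(deltas) / len(deltas), 3)
    print(dataset, dict(zip(models, deltas)), "mean change:", mean_change[dataset])
```

The only change larger than 0.02 is logistic regression on the marketing dataset (0.49 to 0.55). No statistical test is applied here, so the sketch describes the magnitude of the differences and not their significance.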

There can be the following reasons for this lack of improvement:

1. The original imbalanced datasets may already have contained enough minority-class samples for the machine learning models to learn from. If the models can already learn what the existing minority-class examples have to offer, synthesizing more samples may not bring a meaningful improvement.
2. The VAE-generated synthetic samples may not be as representative of the true data distribution as the original samples, which could also explain the reduced model performance seen in some instances. The VAE creates synthetic samples by learning a condensed representation of the data distribution and sampling from this latent space. If the VAE has not captured the underlying data distribution adequately, the synthetic samples will not be representative of the real data, and the performance of the model may suffer.
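
The sampling step described in the second point can be sketched as follows. This is a simplified illustration and not the notebook's own `generate` function: `sample_from_decoder` and `toy_decoder` are hypothetical names, and the toy decoder is a fixed linear map with a sigmoid standing in for a trained network. Only the latent dimension (2) and the number of samples (1079, the class gap in the marketing training set) come from the notebook.

```python
import math
import random

def sample_from_decoder(decoder, n_samples, latent_dim, rng):
    """Draw latent vectors z ~ N(0, I) and decode each one into a synthetic row."""
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(decoder(z))
    return samples

def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: a fixed linear map followed
    by a sigmoid, producing three features in the open interval (0, 1)."""
    weights = [[0.5, -0.2], [0.1, 0.8], [-0.6, 0.3]]
    return [1.0 / (1.0 + math.exp(-sum(w * zi for w, zi in zip(row, z))))
            for row in weights]

rng = random.Random(43)
synthetic_minority = sample_from_decoder(toy_decoder, n_samples=1079,
                                         latent_dim=2, rng=rng)
print(len(synthetic_minority), synthetic_minority[0])
```

Every synthetic row is a deterministic function of a two-dimensional latent vector, so the quality of the generated minority samples depends entirely on how well the decoder has learned the minority-class distribution.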

These findings imply that while VAEs can be a useful tool for creating synthetic samples to balance imbalanced datasets, they do not always lead to a significant improvement in model performance. Several factors may affect how well they work: the complexity and distribution of the data, the particular machine learning model being used, and the performance on the initial imbalanced dataset. Understanding the circumstances in which VAEs are most useful for this purpose will require further study.