# Initial investigation and exploration of tabular GAN data evaluation techniques


### First, install the Synthetic Data Vault (SDV) library
SDV includes data evaluation code that we can use for our project.

**References:**
1. https://hub.gke2.mybinder.org/user/sdv-dev-sdv-uudgqste/notebooks/tutorials/evaluation/Evaluating_Synthetic_Data.ipynb
2. https://pypi.org/project/sdv/

In [33]:
#pip install sdv


In [35]:
#pip install pomegranate

In [1]:
from sdv.evaluation import evaluate
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# do not show warnings in jupyter notebook
import warnings
warnings.filterwarnings('ignore')

In [2]:
df_real_allcols = pd.read_csv('../data/data_3D_pasthistories.csv')
df_syn_CTGAN = pd.read_csv('../data/CTGAN_patientHist.csv')

In [3]:
# removing the "data" column which is not needed
df_syn_CTGAN = df_syn_CTGAN.drop(columns=['data'])
df_real = df_real_allcols[df_syn_CTGAN.columns]

In [9]:
evaluate(df_syn_CTGAN, df_real, metrics=['CSTest', 'KSTest','LogisticDetection','DiscreteKLDivergence','ContinuousKLDivergence'], aggregate=False)

| | metric | name | raw_score | normalized_score | min_value | max_value | goal | error |
|---|---|---|---|---|---|---|---|---|
| 0 | CSTest | Chi-Squared | 0.711462 | 0.711462 | 0.0 | 1.0 | MAXIMIZE | |
| 1 | KSTest | Inverted Kolmogorov-Smirnov D statistic | 0.859303 | 0.859303 | 0.0 | 1.0 | MAXIMIZE | |
| 2 | LogisticDetection | LogisticRegression Detection | 0.000312 | 0.000312 | 0.0 | 1.0 | MAXIMIZE | |
| 3 | DiscreteKLDivergence | Discrete Kullback–Leibler Divergence | 0.73237 | 0.73237 | 0.0 | 1.0 | MAXIMIZE | |
| 4 | ContinuousKLDivergence | Continuous Kullback–Leibler Divergence | 0.890274 | 0.890274 | 0.0 | 1.0 | MAXIMIZE | |


In [10]:
evaluate(df_syn_CTGAN, df_real, metrics=['CSTest', 'KSTest','LogisticDetection','DiscreteKLDivergence','ContinuousKLDivergence'], aggregate=True)

0.6387516042339805
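The aggregate score appears to be the plain mean of the per-metric normalized scores. As a quick check (a sketch of our own, not SDV's code), averaging the five `normalized_score` values from the `aggregate=False` table gives nearly the same number. The small remaining gap is presumably because the two `evaluate` calls were separate runs and metrics such as `LogisticDetection` are refit each time.

```python
from statistics import mean

# normalized_score values copied from the aggregate=False output above
normalized_scores = {
    "CSTest": 0.711462,
    "KSTest": 0.859303,
    "LogisticDetection": 0.000312,
    "DiscreteKLDivergence": 0.73237,
    "ContinuousKLDivergence": 0.890274,
}

# Unweighted mean across metrics
aggregate_estimate = mean(normalized_scores.values())
print(round(aggregate_estimate, 6))  # 0.638744
```

Note how the near-zero `LogisticDetection` score pulls the mean down even though the other four metrics are all above 0.7.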

sdv.metrics.tabular.**GMLogLikelihood**: This metric fits multiple GaussianMixture models to the real data and then evaluates the average log likelihood of the synthetic data on them.
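To make the idea concrete, here is a minimal sketch of the same technique using scikit-learn directly. The helper name `average_gm_log_likelihood`, the component counts `(1, 2, 3)`, and the toy data are our own assumptions for illustration; the component counts and aggregation used inside SDV may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def average_gm_log_likelihood(real, synthetic, component_counts=(1, 2, 3), seed=0):
    """Fit one GaussianMixture per component count on `real` and return the
    mean, across models, of the average log likelihood of `synthetic`."""
    scores = []
    for n_components in component_counts:
        model = GaussianMixture(n_components=n_components, random_state=seed)
        model.fit(real)
        # score() returns the per-sample average log likelihood
        scores.append(model.score(synthetic))
    return float(np.mean(scores))


# Hypothetical toy data: one numeric column, roughly age-like
rng = np.random.default_rng(0)
real = rng.normal(loc=60, scale=10, size=(500, 1))
close = rng.normal(loc=60, scale=10, size=(200, 1))  # same distribution as real
far = rng.normal(loc=20, scale=10, size=(200, 1))    # shifted distribution

ll_close = average_gm_log_likelihood(real, close)
ll_far = average_gm_log_likelihood(real, far)
print("close:", ll_close)
print("far:  ", ll_far)
```

Data drawn from the same distribution as the real data should score a higher (less negative) log likelihood than the shifted data, which is the comparison the metric relies on.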

In [79]:
from sdv.metrics.tabular import GMLogLikelihood
raw_GMLL = GMLogLikelihood.compute(df_real, df_syn_CTGAN)
print("GaussianMixture Log Likelihood for CTGAN generated data: ")
print(raw_GMLL)
GMLogLikelihood.normalize(raw_GMLL)

GaussianMixture Log Likelihood for CTGAN generated data: 
-243.07341179150185


2.719935225491991e-106

In [78]:
from sklearn.model_selection import train_test_split

real_train, real_test = train_test_split(df_real, test_size=0.2, random_state=42)
raw_GMLL = GMLogLikelihood.compute(real_train, real_test)
print("GaussianMixture Log Likelihood for a train test split of real data: ")
print(raw_GMLL)
GMLogLikelihood.normalize(raw_GMLL)

GaussianMixture Log Likelihood for a train test split of real data: 
-86.52337116182395


2.650802361115944e-38
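The normalized values printed by `GMLogLikelihood.normalize` in the two cells above are consistent with a logistic sigmoid, 1 / (1 + exp(-x)), applied to the raw log likelihood. The sketch below reproduces them from the raw scores. `sigmoid` is our own helper, and the mapping is inferred from the printed outputs rather than quoted from the library source.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Raw scores copied from the cell outputs above
raw_scores = {
    "real vs. CTGAN synthetic": -243.07341179150185,
    "real train vs. real test": -86.52337116182395,
}

normalized = {label: sigmoid(raw) for label, raw in raw_scores.items()}
for label, value in normalized.items():
    print(f"{label}: {value:.6e}")
```

Because the sigmoid squashes large negative log likelihoods toward zero, the raw scores are easier to compare across datasets than the normalized ones.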

## Evaluate our GAN-generated age_unittype data against the real data

In [22]:
# get the real data
ages_unit_np = np.load("../data/eICU_age_unittype.npy", allow_pickle=True)
print('length: ', len(ages_unit_np))
print(ages_unit_np[0:5])

ages_np = np.asarray(ages_unit_np[:,0].flatten().tolist()).flatten()
print('ages length: ', len(ages_np))
#print(ages_np[0:5])

unit_np = np.asarray(ages_unit_np[:,1].flatten().tolist()).flatten()
print('unit length: ', len(unit_np))
#print(unit_np[0:5])

df_ages = pd.DataFrame(zip(ages_np, unit_np), columns=['age','unit'])
print(df_ages.shape)
print(df_ages.groupby('unit').count())

length:  250
[[(59,) ('CTICU',)]
 [(55,) ('CTICU',)]
 [(72,) ('Cardiac ICU',)]
 [(49,) ('CTICU',)]
 [(49,) ('CTICU',)]]
ages length:  250
unit length:  250
(250, 2)
             age
unit            
CSICU         65
CTICU         52
Cardiac ICU  133


In [23]:
# get the synthetic data
df_ages_ourGAN = pd.read_csv('../data/ourGAN_ages_ageunittype.csv')
print(df_ages_ourGAN.shape)
print(df_ages_ourGAN.groupby('unit').count())


#### Create a train test split of the real data for comparison of evaluation metrics

In [82]:
from sklearn.model_selection import train_test_split

ages_train, ages_test = train_test_split(df_ages, test_size=0.2, random_state=42)

In [87]:
print("Overall evaluation score for ourGAN generated data: ")
print(evaluate(df_ages_ourGAN, df_ages, metrics=['CSTest', 'KSTest','LogisticDetection','DiscreteKLDivergence','ContinuousKLDivergence'], aggregate=True))
print(" ")
print("Individual evaluation scores for ourGAN generated data: ")
evaluate(df_ages_ourGAN, df_ages, metrics=['CSTest', 'KSTest','LogisticDetection'], aggregate=False)

Overall evaluation score for ourGAN generated data: 
0.7920447511211434
 
Individual evaluation scores for ourGAN generated data: 


| | metric | name | raw_score | normalized_score | min_value | max_value | goal | error |
|---|---|---|---|---|---|---|---|---|
| 0 | CSTest | Chi-Squared | 0.780798 | 0.780798 | 0.0 | 1.0 | MAXIMIZE | |
| 1 | KSTest | Inverted Kolmogorov-Smirnov D statistic | 0.932 | 0.932 | 0.0 | 1.0 | MAXIMIZE | |
| 2 | LogisticDetection | LogisticRegression Detection | 0.624875 | 0.624875 | 0.0 | 1.0 | MAXIMIZE | |


In [90]:
print("Overall evaluation score for real data train / test split: ")
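# This call is neither wrapped in print() nor the last expression in the cell,
# so its return value is not displayed in the output.
evaluate(ages_test, ages_train, metrics=['CSTest', 'KSTest','LogisticDetection','DiscreteKLDivergence','ContinuousKLDivergence'], aggregate=True)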

print(" ")
print("Individual evaluation scores for real data train / test split: ")
evaluate(ages_test, ages_train, metrics=['CSTest', 'KSTest','LogisticDetection'], aggregate=False)

Overall evaluation score for real data train / test split: 
 
Individual evaluation scores for real data train / test split: 


| | metric | name | raw_score | normalized_score | min_value | max_value | goal | error |
|---|---|---|---|---|---|---|---|---|
| 0 | CSTest | Chi-Squared | 0.986597 | 0.986597 | 0.0 | 1.0 | MAXIMIZE | |
| 1 | KSTest | Inverted Kolmogorov-Smirnov D statistic | 0.85 | 0.85 | 0.0 | 1.0 | MAXIMIZE | |
| 2 | LogisticDetection | LogisticRegression Detection | 0.848221 | 0.848221 | 0.0 | 1.0 | MAXIMIZE | |


In [80]:
from sdv.metrics.tabular import GMLogLikelihood
# real data first, synthetic data second
raw_GMLL = GMLogLikelihood.compute(df_ages, df_ages_ourGAN)
print("GaussianMixture Log Likelihood for ourGAN generated data: ")
print(raw_GMLL)
GMLogLikelihood.normalize(raw_GMLL)

GaussianMixture Log Likelihood for ourGAN generated data: 
-5.661454959040057


0.0034654028851047436

In [81]:


raw_GMLL = GMLogLikelihood.compute(ages_train, ages_test)
print("GaussianMixture Log Likelihood for train/test split of real data: ")
print(raw_GMLL)
GMLogLikelihood.normalize(raw_GMLL)

GaussianMixture Log Likelihood for train/test split of real data: 
-2.8306316256209727


0.055691171625415016

In [72]:
from sdv.metrics.tabular import MulticlassDecisionTreeClassifier, MulticlassMLPClassifier
from sklearn.model_selection import train_test_split

ages_train, ages_test = train_test_split(df_ages, test_size=0.2, random_state=42)

print("=="*20)
print("MulticlassDecisionTreeClassifier Accuracy:")
print("train/test split of real data: ",MulticlassDecisionTreeClassifier.compute(ages_test, ages_train, target='unit'))
print("real vs. our synthetic: ", MulticlassDecisionTreeClassifier.compute(df_ages, df_ages_ourGAN, target='unit'))

print("=="*20)
print("MulticlassMLPClassifier Accuracy:")
print("train/test split of real data: ",MulticlassMLPClassifier.compute(ages_test, ages_train, target='unit'))
print("real vs. our synthetic: ", MulticlassMLPClassifier.compute(df_ages, df_ages_ourGAN, target='unit'))




BinaryDecisionTreeClassifier Accuracy:
train/test split of real data:  0.5104377104377105
real vs. our synthetic:  0.26141977598101035
BinaryAdaBoostClassifier Accuracy:
train/test split of real data:  0.2222222222222222
real vs. our synthetic:  0.1375661375661376


### Demonstration of Privacy Evaluations

Privacy metrics are of limited use for our age data: with only two columns there is little for an attacker to infer, and generating an age that duplicates a real one is not a privacy problem. In fact, duplicates are expected.
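To illustrate why duplicate ages are unavoidable, here is a small sketch that measures how often a synthetic value also occurs in the real data. The helper `exact_match_rate` and the toy ages are hypothetical and are not taken from the eICU data.

```python
def exact_match_rate(real_values, synthetic_values):
    """Fraction of synthetic values that also occur among the real values."""
    real_set = set(real_values)
    matches = sum(1 for value in synthetic_values if value in real_set)
    return matches / len(synthetic_values)


# Hypothetical toy ages for illustration only
real_ages = [34, 41, 41, 58, 67, 70, 85]
synthetic_ages = [41, 52, 58, 70, 93]

rate = exact_match_rate(real_ages, synthetic_ages)
print(rate)  # 3 of 5 synthetic ages (41, 58, 70) occur in the real data -> 0.6
```

With integer ages in a narrow range, a high match rate says little about whether any individual record was memorized.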

In [73]:
from sdv.metrics.tabular import NumericalLR, NumericalMLP, CategoricalEnsemble

print(NumericalLR.compute(
    df_ages,
    df_ages_ourGAN,
    key_fields=['age'],
    sensitive_fields=['age']))

print(NumericalMLP.compute(
    df_ages,
    df_ages_ourGAN,
    key_fields=['age'],
    sensitive_fields=['age']))

print(CategoricalEnsemble.compute(
    df_ages,
    df_ages_ourGAN,
    key_fields=['unit'],
    sensitive_fields=['unit']))

0.0
7.442545319557037e-05
No attackers specified.
