<center>
  
# Synthetic Data Evaluation

</center>

In this notebook, we evaluate the synthetic data generated by the TabDDPM and TabSyn algorithms against several metrics. The notebook is organized as follows:

1. [Imports and Setup]()


2. [Density Estimation of Single Column and Pair-wise Correlation]()
    
    
3. [$\alpha$-Precision and $\beta$-Recall ]()

    
4. [Machine Learning Efficiency]()


5. [Privacy Protection: Distance to Closest Record (DCR)]()


6. [Detection: Classifier Two Sample Tests (C2ST)]()

# Imports and Setup

First, we will import functions to evaluate the data using various metrics. Then, we will define the paths to the real train and test data, as well as the TabDDPM and TabSyn generated data.

In [1]:
import json
from pprint import pprint

from scripts.eval.eval_density import eval_density
from scripts.eval.eval_quality import eval_quality
from scripts.eval.eval_mle import eval_mle
from scripts.eval.eval_dcr import eval_dcr
from scripts.eval.eval_detection import eval_detection


dataname = "default"

DATA_DIR = "data/tabular"

TRAIN_DATA_PATH = f"{DATA_DIR}/processed_data/{dataname}/train.csv"
TEST_DATA_PATH = f"{DATA_DIR}/processed_data/{dataname}/test.csv"
TABDDPM_DATA_PATH = f"{DATA_DIR}/synthetic_data/{dataname}/tabddpm.csv"
TABSYN_DATA_PATH = f"{DATA_DIR}/synthetic_data/{dataname}/tabsyn.csv"
INFO_PATH = f"{DATA_DIR}/processed_data/{dataname}/info.json"

# Density Estimation of Single Column and Pair-wise Correlation

Two metrics are computed in this section: column shapes and column pair trends.
We explain each in the following subsections.

## Single Column Similarity Score

Each column in a table represents a single feature whose values are of a certain type, e.g. numerical, categorical, datetime or boolean. The feature can be treated as a random variable whose samples are the values listed in the column. The distribution of each column can therefore be estimated and compared between the real data and the synthetic data.

<p align="center">
<img src="figures/tabsyn_column_shapes_2.png" width="1000"/>
</p>

The better the distributions match each other, the better the quality of synthetic data.

The similarity of the two distributions is measured with different metrics for different data types: **KSComplement (Kolmogorov-Smirnov Complement)** is used for numerical features and **TVComplement (Total Variation Distance Complement)** for categorical features. The overall Column Shape Score for the whole table is the average of the column shape similarity scores of all columns.

**KST (Kolmogorov-Smirnov Test)**
computes the cumulative distribution function (CDF) of a numerical random variable for the real and the synthetic data. It then finds the maximum difference $M$ between the two CDFs. Finally, the *KSComplement* score is defined as $1 - M$, so that the higher the score, the more similar the distributions.

<p align="center">
<img src="figures/tabsyn_kst.png" width="1000"/>
</p>
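To make the definition concrete, here is a minimal NumPy sketch of the KSComplement computation described above. It is an illustration only, not the code that `eval_density` runs; the function name `ks_complement` and the toy Gaussian columns are assumptions made for this example.

```python
import numpy as np


def ks_complement(real, synthetic):
    """Return 1 - M, where M is the largest gap between the two empirical CDFs."""
    real = np.sort(np.asarray(real, dtype=float))
    synthetic = np.sort(np.asarray(synthetic, dtype=float))
    # Both CDFs are step functions, so the largest gap occurs at an observed value.
    grid = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_syn = np.searchsorted(synthetic, grid, side="right") / synthetic.size
    return 1.0 - float(np.max(np.abs(cdf_real - cdf_syn)))


# Toy columns (hypothetical data, not the notebook's dataset).
rng = np.random.default_rng(0)
real_col = rng.normal(0.0, 1.0, size=2000)
good_syn = rng.normal(0.0, 1.0, size=2000)  # same distribution as the real column
bad_syn = rng.normal(2.0, 1.0, size=2000)   # shifted distribution

print("matching:", ks_complement(real_col, good_syn))
print("shifted: ", ks_complement(real_col, bad_syn))
```

The matching column should score close to $1$, while the shifted column scores much lower because its CDF lags behind the real one.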

**TVD (Total Variation Distance)**
computes the frequency of each category in a column and treats it as that category's probability. It then takes half the sum of the absolute differences between the real and synthetic probabilities:

$\delta(R, S) = \frac{1}{2} \sum_{\omega \in \Omega}  | R_\omega - S_\omega |$

Here, $\omega$ ranges over the set $\Omega$ of all possible categories in a column, and $R_\omega$ and $S_\omega$ are the real and synthetic frequencies of category $\omega$. The *TVComplement* returns $1 - \delta(R, S)$ so that a higher score means higher quality.

<p align="center">
<img src="figures/tabsyn_tvd.png" width="1000"/>
</p>
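The TVComplement formula can be checked by hand on a tiny example. The sketch below uses only the standard library; the function name `tv_complement` and the two toy columns are assumptions for illustration, not part of the notebook's evaluation scripts.

```python
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 - TVD between the category frequencies of two columns."""
    real_freq = Counter(real)
    syn_freq = Counter(synthetic)
    # Omega: every category seen in either column (Counter returns 0 if missing).
    categories = set(real_freq) | set(syn_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - syn_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


real_col = ["a", "a", "b", "c"]  # R = {a: 0.50, b: 0.25, c: 0.25}
syn_col = ["a", "b", "b", "c"]   # S = {a: 0.25, b: 0.50, c: 0.25}

# TVD = 0.5 * (0.25 + 0.25 + 0.0) = 0.25, so the complement is 0.75.
print(tv_complement(real_col, syn_col))
```

Two columns with no category in common give a score of $0$, and identical frequency tables give $1$.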

## Pair-wise Correlation Score

The correlation between two random variables describes how they vary in relation to each other. The Column Pair Trends score compares these relationships between the real and the synthetic data: the higher the score, the more alike the trends.

<p align="center">
<img src="figures/tabsyn_column_pair_trends.png" width="900"/>
</p>

Here, we use different metrics to compute the correlation between different pairs of data types:

| Column Type | Metric |
| ----------- | ------ |
| numerical & numerical | [correlation similarity](https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/correlationsimilarity) |
| categorical & categorical | [contingency similarity](https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/contingencysimilarity) |
| numerical & categorical | discretize the numerical columns into bins, then apply contingency similarity. |

This yields a score between every pair of columns. The **Column Pair Trends** score is the average of all the scores.
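As a rough illustration of the numerical case, the sketch below scores one pair of columns as $1 - |r_{syn} - r_{real}| / 2$, where $r$ is the Pearson correlation. This follows the correlation similarity definition as we understand it from the SDMetrics glossary linked above; the function name and the toy data are assumptions, and the real report also handles categorical pairs, which are omitted here.

```python
import numpy as np


def correlation_similarity(real_a, real_b, syn_a, syn_b):
    """Score in [0, 1]: 1 minus half the gap between the two Pearson correlations."""
    r_real = np.corrcoef(real_a, real_b)[0, 1]
    r_syn = np.corrcoef(syn_a, syn_b)[0, 1]
    return 1.0 - abs(r_syn - r_real) / 2.0


# Toy data (hypothetical): a strongly correlated real pair of columns.
rng = np.random.default_rng(0)
real_x = rng.normal(size=1000)
real_y = real_x + 0.5 * rng.normal(size=1000)

syn_x = rng.normal(size=1000)
syn_y_good = syn_x + 0.5 * rng.normal(size=1000)  # trend preserved
syn_y_bad = rng.normal(size=1000)                 # trend lost

good = correlation_similarity(real_x, real_y, syn_x, syn_y_good)
bad = correlation_similarity(real_x, real_y, syn_x, syn_y_bad)
print("trend preserved:", good)
print("trend lost:     ", bad)
```

Because correlations lie in $[-1, 1]$, dividing the gap by $2$ keeps the score in $[0, 1]$. Averaging such scores over every pair of columns gives the table-level number.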

In [7]:
shape, trend = eval_density(TABDDPM_DATA_PATH, TRAIN_DATA_PATH, INFO_PATH)
print("Shape:", shape)
print("Trend:", trend)

syn_path='data/tabular/synthetic_data/default/tabddpm.csv'
Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 24/24 [00:00<00:00, 51.17it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 276/276 [00:12<00:00, 21.36it/s]

Overall Score: 90.23%

Properties:
- Column Shapes: 93.51%
- Column Pair Trends: 86.96%
Shape: 0.9350632716049384
Trend: 0.8696141571360197


In [None]:
shape, trend = eval_density(TABSYN_DATA_PATH, TRAIN_DATA_PATH, INFO_PATH)
print("Shape:", shape)
print("Trend:", trend)

Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 24/24 [00:00<00:00, 42.47it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 276/276 [00:14<00:00, 18.62it/s]

Overall Score: 97.1%

Properties:
- Column Shapes: 98.68%
- Column Pair Trends: 95.52%
Shape: 0.9868132716049383
Trend: 0.9551999968185801


# $\alpha$-Precision and $\beta$-Recall 

$\alpha$-Precision and $\beta$-Recall are generalizations of the Precision and Recall metrics proposed by [Sajjadi et al.](https://proceedings.neurips.cc/paper/2018/hash/f7696a9b362ac5a51c3dc8f098b73923-Abstract.html) in 2018. Both metrics range over $[0, 1]$, and the closer they are to $1$ the better.


Precision measures the fidelity or quality of synthetic data. More precisely, it computes the proportion of synthetic datapoints that are *close* to real datapoints. 
Recall, on the other hand, measures the diversity of synthetic data; i.e. the extent to which these samples cover the full variability of real samples. More precisely, recall computes the proportion of real datapoints that are *close* to synthetic datapoints.


If we denote the distribution of real datapoints by $P(X)$ and the distribution of synthetic datapoints by $Q(Y)$, precision is the portion of $Q(Y)$ that can be generated by $P(X)$, while recall is the portion of $P(X)$ that can be generated by $Q(Y)$.

To better understand these concepts, let's assume that the real/generated datapoints are samples from an underlying real/generated manifold.
Precision measures the proportion of generated datapoints that fall on the real manifold, while recall measures the proportion of real datapoints that fall on the generated manifold.

<p align="center">
<img src="figures/tabsyn_precision.png" width="300"/>
</p>

<p align="center">
<img src="figures/tabsyn_recall.png" width="300"/>
</p>

The Precision-Recall metrics are very sensitive to outliers since even a few outliers can greatly change the shape of the underlying manifold.
To address this limitation, $\pmb{\alpha}$**-Precision** and $\pmb{\beta}$**-Recall** are defined by assuming that a fraction $1−\alpha$ (or $1−\beta$) of the real (and synthetic) data are “outliers”, and $\alpha$ (or $\beta$) are “typical”. 
$\alpha$-Precision is the fraction of synthetic samples that resemble the “most typical” fraction $\alpha$ of real samples, whereas $\beta$-Recall is the fraction of real samples covered by the most typical fraction $\beta$ of synthetic samples.
The two metrics are evaluated for all $\alpha, \beta \in [0, 1]$, providing entire precision and recall curves instead of single numbers.
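The following is a deliberately simplified sketch of the idea, not the algorithm used by `eval_quality`. It assumes that the "typical" region of a dataset is a ball around its mean that contains a fraction $\alpha$ (or $\beta$) of its points, and it evaluates a single level instead of the full curve. The helper name `fraction_in_typical_ball` and the toy Gaussian data are hypothetical.

```python
import numpy as np


def fraction_in_typical_ball(reference, candidates, level):
    """Fraction of `candidates` inside the ball around the mean of `reference`
    that contains a fraction `level` of the `reference` points."""
    center = reference.mean(axis=0)
    radius = np.quantile(np.linalg.norm(reference - center, axis=1), level)
    return float(np.mean(np.linalg.norm(candidates - center, axis=1) <= radius))


# Toy data (hypothetical): synthetic points are realistic but too concentrated.
rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 2))
synthetic = rng.normal(scale=0.5, size=(5000, 2))

alpha = beta = 0.9
# Simplified alpha-precision: synthetic points inside the typical real region.
alpha_precision = fraction_in_typical_ball(real, synthetic, alpha)
# Simplified beta-recall: real points inside the typical synthetic region.
beta_recall = fraction_in_typical_ball(synthetic, real, beta)

print("alpha-precision:", alpha_precision)
print("beta-recall:    ", beta_recall)
```

In this toy setup the synthetic points all look plausible, so precision is high, but they cover only the center of the real distribution, so recall is low.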


To illustrate, consider the image below. Blue and red points are real and generated datapoints, respectively. The large blue and red spheres show the underlying manifolds that were estimated from the real and generated datapoints.
Good quality generated datapoints should fall within the blue sphere, like *(c)*. They should not lie far from the blue sphere like *(a)*. Moreover, they should not be placed too close to (or *copied* from) a real datapoint like *(b)*.
Image *(d)* shows an outlier among the real datapoints, which is left outside of the manifold thanks to the application of $\alpha$ and $\beta$.
If we used vanilla Precision and Recall, the blue sphere's radius would have to increase to include the outlier, which would also lead it to include noisy synthetic datapoints like *(a)*.

<p align="center">
<img src="figures/tabsyn_alpha_precision_beta_recall.png" width="500"/>
</p>


In [None]:
alpha_precision_all, beta_recall_all = eval_quality(
    TABDDPM_DATA_PATH, TRAIN_DATA_PATH, INFO_PATH
)
print("Alpha precision:", alpha_precision_all)
print("Beta recall:", beta_recall_all)

In [9]:
alpha_precision_all, beta_recall_all = eval_quality(
    TABSYN_DATA_PATH, TRAIN_DATA_PATH, INFO_PATH
)
print("Alpha precision:", alpha_precision_all)
print("Beta recall:", beta_recall_all)

data/tabular/synthetic_data/default/tabsyn.csv
Data shape:  (27000, 93)
Alpha precision: 0.9901046402724564
Beta recall: 0.4702987654320988


# Machine Learning Efficiency

This method trains a machine learning model (in our case, an XGBoost model) on the synthetic data and evaluates it on the real test data.
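The train-on-synthetic, test-on-real protocol can be sketched in a few lines. To keep the example dependency-free, it swaps XGBoost for a simple nearest-centroid classifier and uses toy Gaussian tables; all names below (`fit_centroids`, `predict`, `make_table`) are hypothetical and unrelated to `eval_mle`.

```python
import numpy as np


def fit_centroids(X, y):
    """'Train' a nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])


def predict(model, X):
    """Assign each row to the class whose centroid is closest."""
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def make_table(rng, n, shift):
    """Toy binary-classification table; `shift` separates the two classes."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 3)) + shift * y[:, None]
    return X, y


rng = np.random.default_rng(0)
X_train, y_train = make_table(rng, 2000, shift=2.0)  # real training data
X_test, y_test = make_table(rng, 500, shift=2.0)     # real held-out data
X_syn, y_syn = make_table(rng, 2000, shift=1.8)      # stand-in for generated data

# Train on synthetic rows, evaluate on real test rows.
acc_syn = float(np.mean(predict(fit_centroids(X_syn, y_syn), X_test) == y_test))
# Baseline: train on real rows, evaluate on the same real test rows.
acc_real = float(np.mean(predict(fit_centroids(X_train, y_train), X_test) == y_test))

print("trained on synthetic:", acc_syn)
print("trained on real:     ", acc_real)
```

The closer the first number is to the second, the more useful the synthetic data is as a replacement for the real training set.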

In [10]:
# eval_mle runs a grid search over the given hyperparameters and reports all scores
# for the best configuration under each selection criterion (accuracy, AUROC, F1, average).
overall_score = eval_mle(TABSYN_DATA_PATH, TEST_DATA_PATH, INFO_PATH)
print("TABSYN - Overall score:")
pprint(overall_score)

100%|██████████| 36/36 [00:17<00:00,  2.10it/s]


TABSYN - Overall score:
{'best_acc_scores': {'XGBClassifier': {'accuracy': 0.81,
                                       'binary_f1': 0.45297504798464494,
                                       'roc_auc': 0.7877443582482191,
                                       'weighted_f1': 0.554220929899447}},
 'best_auroc_scores': {'XGBClassifier': {'accuracy': 0.814,
                                         'binary_f1': 0.4571984435797665,
                                         'roc_auc': 0.7888965197353714,
                                         'weighted_f1': 0.5580960679415622}},
 'best_avg_scores': {'XGBClassifier': {'accuracy': 0.81,
                                       'binary_f1': 0.45297504798464494,
                                       'roc_auc': 0.7877443582482191,
                                       'weighted_f1': 0.554220929899447}},
 'best_f1_scores': {'XGBClassifier': {'accuracy': 0.81,
                                      'binary_f1': 0.45297504798464494,
              

In [None]:
overall_score = eval_mle(TABDDPM_DATA_PATH, TEST_DATA_PATH, INFO_PATH)
print("TABDDPM - Overall score:")
pprint(overall_score)

100%|██████████| 36/36 [00:35<00:00,  1.00it/s]


TABDDPM - Overall score:
{'best_acc_scores': {'XGBClassifier': {'accuracy': 0.8146666666666667,
                                       'binary_f1': 0.46124031007751937,
                                       'roc_auc': 0.7850053040919847,
                                       'weighted_f1': 0.5612639528642225}},
 'best_auroc_scores': {'XGBClassifier': {'accuracy': 0.8116666666666666,
                                         'binary_f1': 0.4684854186265287,
                                         'roc_auc': 0.7883066601188635,
                                         'weighted_f1': 0.5662194341712794}},
 'best_avg_scores': {'XGBClassifier': {'accuracy': 0.8133333333333334,
                                       'binary_f1': 0.47368421052631576,
                                       'roc_auc': 0.7745027065422088,
                                       'weighted_f1': 0.5704319144701299}},
 'best_f1_scores': {'XGBClassifier': {'accuracy': 0.8133333333333334,
                            

As a baseline, we also train the same kind of model (i.e. XGBoost) on the real training data and evaluate it on the real test data.

In [None]:
overall_score = eval_mle(TRAIN_DATA_PATH, TEST_DATA_PATH, INFO_PATH)
print("BASELINE - Overall score:")
pprint(overall_score)

100%|██████████| 36/36 [00:31<00:00,  1.14it/s]


BASELINE - Overall score:
{'best_acc_scores': {'XGBClassifier': {'accuracy': 0.81,
                                       'binary_f1': 0.46629213483146065,
                                       'roc_auc': 0.7825501876094182,
                                       'weighted_f1': 0.5642753583567985}},
 'best_auroc_scores': {'XGBClassifier': {'accuracy': 0.814,
                                         'binary_f1': 0.46449136276391556,
                                         'roc_auc': 0.7882964420782628,
                                         'weighted_f1': 0.5636057524278798}},
 'best_avg_scores': {'XGBClassifier': {'accuracy': 0.81,
                                       'binary_f1': 0.46629213483146065,
                                       'roc_auc': 0.7825501876094182,
                                       'weighted_f1': 0.5642753583567985}},
 'best_f1_scores': {'XGBClassifier': {'accuracy': 0.81,
                                      'binary_f1': 0.46629213483146065,
         

# Privacy Protection: Distance to Closest Record (DCR)

One of the applications of synthetically generated data is protecting sensitive information while creating similar substitute data that could be used to train machine learning models or published on public platforms.
For this purpose, we must ensure that the synthetic datapoints are far enough from any real datapoints to prevent leaking of real sensitive information.

One metric that is used for this purpose is **Distance to Closest Record (DCR)**.
DCR is the Euclidean distance between a synthetic datapoint and its nearest real datapoint.
DCR equal to zero means that the synthetic datapoint will leak the real information, while higher DCR values mean less risk of privacy leakage.

`eval_dcr` computes the DCR of each synthetic datapoint to real datapoints in two different sets: training and test. Then, it returns the proportion of synthetic datapoints that are closer to the training dataset than the test dataset.
If the training and test datasets are the same size, this score should ideally be $0.5$, indicating that the model has not overfit to the training data and that the synthetic datapoints are not memorized copies of training data.
If the training and test datasets differ in size, the ideal value for this score is #Train / (#Train + #Test).
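The score described above can be sketched with brute-force nearest-neighbour search in NumPy. This is a minimal illustration under the assumption that all columns are already numeric and comparably scaled; the function names and the toy data are hypothetical, and `eval_dcr` may preprocess the data differently.

```python
import numpy as np


def min_distances(queries, records):
    """Euclidean distance from each query row to its closest row in `records`."""
    diffs = queries[:, None, :] - records[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)


def dcr_score(synthetic, train, test):
    """Proportion of synthetic rows that are closer to the training set
    than to the test set."""
    closer_to_train = min_distances(synthetic, train) < min_distances(synthetic, test)
    return float(np.mean(closer_to_train))


# Toy data (hypothetical) with a 90/10 train/test split.
rng = np.random.default_rng(0)
train = rng.normal(size=(900, 4))
test = rng.normal(size=(100, 4))
fresh = rng.normal(size=(500, 4))  # new samples from the same distribution
copied = train[:500] + 1e-6        # near-copies of training rows

print("ideal:", len(train) / (len(train) + len(test)))
print("fresh samples: ", dcr_score(fresh, train, test))
print("memorized rows:", dcr_score(copied, train, test))
```

Fresh samples land near the ideal ratio because any of the 1000 real rows is equally likely to be the nearest one, whereas memorized rows are always closest to the training set and push the score to $1$.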

In [11]:
# review json file and its contents
with open(INFO_PATH, "r") as file:
    data_info = json.load(file)
pprint(data_info)

{'cat_col_idx': [1, 2, 3, 5, 6, 7, 8, 9, 10],
 'column_info': {'0': {},
                 '1': {},
                 '10': {},
                 '11': {},
                 '12': {},
                 '13': {},
                 '14': {},
                 '15': {},
                 '16': {},
                 '17': {},
                 '18': {},
                 '19': {},
                 '2': {},
                 '20': {},
                 '21': {},
                 '22': {},
                 '23': {},
                 '3': {},
                 '4': {},
                 '5': {},
                 '6': {},
                 '7': {},
                 '8': {},
                 '9': {},
                 'categorizes': [0, 1],
                 'max': 528666.0,
                 'min': 0.0,
                 'type': 'categorical'},
 'column_names': ['LIMIT_BAL',
                  'SEX',
                  'EDUCATION',
                  'MARRIAGE',
                  'AGE',
                  'PAY_0',
   

In [12]:
ideal_dcr = data_info["train_num"] / (data_info["train_num"] + data_info["test_num"])

dcr_score = eval_dcr(TABDDPM_DATA_PATH, TRAIN_DATA_PATH, TEST_DATA_PATH, INFO_PATH)
print(f"DCR Score, a value closer to {ideal_dcr} is better")
print("Distance to Closest Record:", dcr_score)

DCR Score, a value closer to 0.9 is better
Distance to Closest Record: 0.9066666666666666


In [None]:
ideal_dcr = data_info["train_num"] / (data_info["train_num"] + data_info["test_num"])

dcr_score = eval_dcr(TABSYN_DATA_PATH, TRAIN_DATA_PATH, TEST_DATA_PATH, INFO_PATH)
print(f"DCR Score, a value closer to {ideal_dcr} is better")
print("Distance to Closest Record:", dcr_score)

DCR Score, a value closer to 0.9 is better
Distance to Closest Record: 0.8962222222222223


# Detection: Classifier Two Sample Tests (C2ST)

This metric evaluates whether a machine learning model can distinguish the synthetic data from the real data, hence measuring how difficult it is to tell the two apart. A logistic regression model is used in `eval_detection`.

This score is computed through the steps below:
1. Create a single, augmented table that has all the rows of real data and all the rows of synthetic data. Add an extra column to keep track of whether each original row is real or synthetic.
2. Split the augmented data to create training and validation sets.
3. Choose a machine learning model. Train the model on the training split. The model will predict whether each row is real or synthetic (i.e. predict the extra column we created in step #1).
4. Validate the model on the validation set.
5. Repeat steps #2-4 multiple times.

The final score is based on the average ROC-AUC score across all the cross-validation splits:

$\text{score} = 1 - (\max(\overline{\text{ROC-AUC}}, 0.5) \times 2 - 1)$

This score ranges over $[0, 1]$, with $0$ being the lowest (meaning that the machine learning model can perfectly tell the synthetic data apart from the real data), and $1$ being the highest (meaning that the machine learning model cannot tell the synthetic data apart from the real data).
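The mapping from ROC-AUC to the final score is easy to check numerically. The sketch below implements only the formula above; the function name `detection_score` and the example AUC values are assumptions, and the classifier training itself is left out.

```python
def detection_score(roc_auc_scores):
    """Map the mean ROC-AUC across validation splits to a score in [0, 1]."""
    mean_auc = sum(roc_auc_scores) / len(roc_auc_scores)
    # AUC values below 0.5 are clipped, so a worse-than-random classifier
    # is treated the same as a random one.
    return 1 - (max(mean_auc, 0.5) * 2 - 1)


print(detection_score([0.5, 0.5, 0.5]))  # 1.0: the classifier is guessing
print(detection_score([0.75, 0.75]))     # 0.5
print(detection_score([1.0, 1.0, 1.0]))  # 0.0: synthetic rows are fully detectable
print(detection_score([0.4, 0.45]))      # 1.0: mean AUC below 0.5 is clipped
```

Read in reverse, a detection score of about $0.975$ under this formula corresponds to a mean ROC-AUC of roughly $0.51$, i.e. a classifier that is barely better than chance.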



In [13]:
detection_score = eval_detection(TABSYN_DATA_PATH, TRAIN_DATA_PATH, INFO_PATH, dataname, model="tabsyn")
print("TABSYN - Detection score:", detection_score)

TABSYN - Detection score: 0.9749056954732511


In [None]:
detection_score = eval_detection(TABDDPM_DATA_PATH, TRAIN_DATA_PATH, INFO_PATH, dataname, model="tabsyn")
print("TABDDPM - Detection score:", detection_score)

TABDDPM - Detection score: 0.9630604403292181


In [None]:
detection_score = eval_detection(TRAIN_DATA_PATH, TRAIN_DATA_PATH, INFO_PATH, dataname, model="tabsyn")
print("BASELINE - Detection score:", detection_score)

BASELINE - Detection score: 1.0


# Missing Value Imputation for the Target Column

In [23]:
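import importlib

from scripts import impute  # must be imported before it can be reloaded

importlib.reload(impute)  # pick up local edits to scripts/impute.py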

<module 'scripts.impute' from '/fs01/home/yaspar/Documents/GitHub/diffusion_model_bootcamp/reference_implementations/tabular_reference_impelementation/single_table_synthesis/scripts/impute.py'>

In [24]:
# impute missing data
from scripts import impute
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
impute.main(dataname="default", device=device)

device=device(type='cuda')
{'loss_params': {'lambd': 0.7, 'max_beta': 0.01, 'min_beta': 1e-05},
 'model_params': {'d_token': 4, 'factor': 32, 'n_head': 1, 'num_layers': 2},
 'task_type': 'binclass',
 'train': {'diffusion': {'batch_size': 4096,
                         'num_dataset_workers': 4,
                         'num_epochs': 9},
           'optim': {'diffusion': {'factor': 0.9,
                                   'lr': 0.001,
                                   'patience': 20,
                                   'weight_decay': 0},
                     'vae': {'factor': 0.95,
                             'lr': 0.001,
                             'patience': 10,
                             'weight_decay': 0}},
           'vae': {'batch_size': 4096,
                   'num_dataset_workers': 4,
                   'num_epochs': 10}},
 'transforms': {'cat_encoding': None,
                'cat_min_frequency': None,
                'cat_nan_policy': None,
                'normalization':


In [25]:
from scripts.eval import eval_impute
eval_impute.main(dataname="default")

FileNotFoundError: [Errno 2] No such file or directory: 'impute/tabsyn/default/11.csv'