This code is for the paper titled "Topology for preserving feature correlation in tabular synthetic data".
Summary: From reviewing the literature, we found that GAN-based models for generating synthetic tabular data outperform other models such as the Variational Autoencoder. However, we identified that tabular synthetic data generated by a GAN does not preserve the characteristics of the original data (feature correlation, manifold, temporal correlation). In this paper, we analyze why a GAN (CTGAN) cannot preserve feature correlations in synthetic data.
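As a rough illustration of what "preserving feature correlation" means, the sketch below compares the pairwise Pearson correlations of an original table and a synthetic one. It is not taken from the paper's code: the function names (`pearson`, `correlation_matrix`, `correlation_gap`) and the toy rows are hypothetical, and the mean absolute difference is just one simple way to summarize the gap.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (sx * sy)


def correlation_matrix(rows):
    """Correlation matrix of a table given as a list of numeric rows."""
    cols = list(zip(*rows))
    return [[pearson(a, b) for b in cols] for a in cols]


def correlation_gap(real_rows, synth_rows):
    """Mean absolute difference over all feature pairs (needs >= 2 features)."""
    r = correlation_matrix(real_rows)
    s = correlation_matrix(synth_rows)
    d = len(r)
    diffs = [abs(r[i][j] - s[i][j]) for i in range(d) for j in range(i + 1, d)]
    return sum(diffs) / len(diffs)


# Hypothetical toy data: the synthetic table reverses the sign of the correlation.
real = [(1, 2), (2, 4), (3, 6), (4, 8)]
synth = [(1, 8), (2, 6), (3, 4), (4, 2)]
gap = correlation_gap(real, synth)
```

A gap near 0 means the synthetic table reproduces the pairwise correlations of the original; larger values indicate that correlations were lost or distorted.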
Abstract: Tabular synthetic data generating models based on Generative Adversarial Network (GAN) contribute significantly to enhancing the performance of deep learning models by providing a sufficient amount of training data. However, the existing GAN-based models cannot preserve the feature correlations in synthetic data during the data synthesis process. Therefore, the synthetic data become unrealistic and create a problem for certain applications like correlation-based feature weighting. In this short theoretical paper, we showed a promising approach based on the topology of datasets to preserve correlation in synthetic data. We formulated our hypothesis for preserving correlation in synthetic data and used persistent homology to show that the topological spaces of the original and synthetic data differ in their topological features, especially in
The paper can be accessed at the following link: https://ieeexplore.ieee.org/document/9970505
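The abstract mentions using persistent homology to compare the original and synthetic data. The minimal sketch below shows the idea in dimension 0 only, where it can be computed with the standard library: in a Vietoris-Rips filtration every connected component is born at scale 0, and a component dies at the length of the edge that merges it into another one (the edges of a minimum spanning tree). This is not the paper's implementation. The function names and toy points are hypothetical, an edge is assumed to enter the filtration at the full Euclidean distance between its endpoints (some libraries use half of that), and `death_time_gap` is a crude summary rather than a standard diagram distance.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Finite H0 death times of a Vietoris-Rips filtration, in increasing order."""
    n = len(points)
    parent = list(range(n))  # union-find forest over the points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (Kruskal's algorithm).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components, so one of them dies
            parent[ri] = rj
            deaths.append(length)
    return deaths  # n - 1 values; the last surviving component never dies


def death_time_gap(real_points, synth_points):
    """Largest difference between sorted death times (equal sample sizes assumed)."""
    a, b = h0_death_times(real_points), h0_death_times(synth_points)
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy data: rows of a table treated as points in feature space.
real_points = [(0, 0), (1, 0), (5, 0)]
synth_points = [(0, 0), (2, 0), (4, 0)]
real_deaths = h0_death_times(real_points)
gap = death_time_gap(real_points, synth_points)
```

Higher-dimensional features (loops, voids) need a full persistent homology library; this sketch only covers connected components.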