Privacy-Preserving Synthetic Tabular Data Generation
TabularForge is a unified, production-ready Python library for generating high-quality synthetic tabular data with built-in privacy guarantees. It combines multiple state-of-the-art approaches (GANs, VAEs, Copulas) into a simple, one-line API.
Organizations have valuable tabular data (patient records, financial transactions, customer data) but often can't share it due to:
- Privacy regulations (GDPR, HIPAA, CCPA)
- Competitive sensitivity
- Data scarcity for ML development
Synthetic data solves this by generating realistic, statistically similar data that protects individual privacy while preserving analytical utility.
| Feature | Description |
|---|---|
| Multiple Generators | CTGAN, TVAE, Gaussian Copula, and more |
| Differential Privacy | Formal privacy guarantees with configurable epsilon |
| Quality Metrics | Statistical similarity, ML utility, privacy leakage tests |
| Auto Preprocessing | Handles mixed types, missing values, imbalanced data |
| One-Line API | Generate synthetic data in a single line of code |
| Benchmarking | Compare generators on your specific data |
```bash
# Install from PyPI
pip install tabularforge-sgk

# Or install directly from GitHub
pip install git+https://github.com/ganeshreddy28/tabularforge.git

# Or install from source
git clone https://github.com/ganeshreddy28/tabularforge.git
cd tabularforge
pip install -e .
```

```python
from tabularforge import TabularForge
import pandas as pd

# Load your real data
real_data = pd.read_csv("your_data.csv")

# Generate synthetic data in ONE line!
forge = TabularForge(real_data)
synthetic_data = forge.generate(n_samples=1000)

# That's it! synthetic_data is a pandas DataFrame
print(synthetic_data.head())
```

```python
from tabularforge import TabularForge

# Generate with differential privacy (epsilon=1.0)
forge = TabularForge(real_data, privacy_epsilon=1.0)
private_synthetic = forge.generate(n_samples=1000)

# Check privacy metrics
privacy_report = forge.evaluate_privacy()
print(privacy_report)
```

```python
from tabularforge import TabularForge

# Benchmark all available generators
forge = TabularForge(real_data)
benchmark_results = forge.benchmark(generators=['ctgan', 'tvae', 'copula'])

# See which generator works best for your data
print(benchmark_results)
```

TabularForge supports multiple synthetic data generators:
| Generator | Best For | Speed | Quality |
|---|---|---|---|
| `copula` | Simple distributions, fast generation | ⚡⚡⚡ | ⭐⭐⭐ |
| `ctgan` | Complex relationships, mixed types | ⚡⚡ | ⭐⭐⭐⭐ |
| `tvae` | High-dimensional data | ⚡⚡ | ⭐⭐⭐⭐ |
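
To give a feel for what the `copula` row refers to, here is a minimal, dependency-free sketch of Gaussian copula sampling for two numeric columns. It illustrates the general technique only and is not TabularForge's implementation; the function names (`fit_gaussian_copula`, `sample_gaussian_copula`), the toy data, and the two-column restriction are assumptions made for brevity.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_scores(values):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n lies strictly inside (0, 1), so inv_cdf is finite
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mean_a, mean_b = fmean(a), fmean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def fit_gaussian_copula(x, y):
    """Store the empirical marginals and the correlation in normal-score space."""
    rho = pearson(normal_scores(x), normal_scores(y))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    return {"x_sorted": sorted(x), "y_sorted": sorted(y), "rho": rho}


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the observed value at quantile u in [0, 1]."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def sample_gaussian_copula(model, n_samples, seed=None):
    """Draw correlated normals, map them to uniforms, then to the marginals."""
    rng = random.Random(seed)
    rho = model["rho"]
    rows = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        rows.append((
            empirical_quantile(model["x_sorted"], STD_NORMAL.cdf(z1)),
            empirical_quantile(model["y_sorted"], STD_NORMAL.cdf(z2)),
        ))
    return rows


# Toy "real" data (made up for this example): income rises with age, plus noise
data_rng = random.Random(0)
ages = [data_rng.uniform(20, 70) for _ in range(300)]
incomes = [1000 * age + data_rng.gauss(0, 5000) for age in ages]

model = fit_gaussian_copula(ages, incomes)
synthetic_rows = sample_gaussian_copula(model, n_samples=200, seed=1)
print(f"fitted correlation: {model['rho']:.2f}")
print(synthetic_rows[:3])
```

Because the model is just the sorted marginals plus one correlation value, fitting and sampling are cheap. The single correlation per column pair is also the limitation: non-linear or multi-modal dependencies are not captured, which is where neural generators tend to do better.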
```python
# Specify a generator
forge = TabularForge(real_data, generator='ctgan')
synthetic = forge.generate(n_samples=500)
```

TabularForge automatically detects and handles:
- Numerical columns (continuous and discrete)
- Categorical columns (including high-cardinality)
- DateTime columns
- Missing values
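
Detection of this kind usually comes down to a few per-column heuristics. The sketch below is a dependency-free illustration that works on plain Python lists rather than DataFrames; `infer_column_type`, `missing_fraction`, and the returned labels are hypothetical names and are not part of TabularForge's API.

```python
from datetime import date, datetime
from numbers import Real


def infer_column_type(values):
    """Classify a column by inspecting its non-missing values.

    None is treated as a missing value and skipped.
    """
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"  # every value is missing, nothing to inspect
    if all(isinstance(v, (date, datetime)) for v in present):
        return "datetime"
    # bool is a subclass of int in Python, so rule it out explicitly
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in present):
        if all(isinstance(v, int) for v in present):
            return "numerical_discrete"
        return "numerical_continuous"
    return "categorical"


def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


# Made-up columns, each with one missing entry
columns = {
    "age": [34, 51, None, 29],
    "income": [52000.0, 61000.5, 48000.0, None],
    "country": ["UK", "DE", "UK", None],
    "signup_date": [date(2023, 1, 5), None, date(2023, 3, 9), date(2024, 7, 1)],
}

inferred = {name: infer_column_type(vals) for name, vals in columns.items()}
for name, kind in inferred.items():
    print(f"{name}: {kind}, missing={missing_fraction(columns[name]):.0%}")
```

Heuristics like these can misfire, for example on integer-coded categories, which is one reason an explicit override is useful.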
```python
# Explicit column type specification (optional)
forge = TabularForge(
    real_data,
    categorical_columns=['gender', 'country', 'product_type'],
    numerical_columns=['age', 'income', 'score'],
    datetime_columns=['signup_date', 'last_purchase']
)
```

```python
from tabularforge import TabularForge

forge = TabularForge(real_data)
synthetic = forge.generate(n_samples=1000)

# Get comprehensive quality report
quality_report = forge.evaluate_quality(synthetic)
print(quality_report)
# Output:
# {
#     'statistical_similarity': 0.92,
#     'column_correlations': 0.89,
#     'distribution_match': 0.94,
#     'ml_utility': 0.87
# }
```

Generate data satisfying specific conditions:
```python
# Generate only high-income customers
synthetic = forge.generate(
    n_samples=500,
    conditions={'income': '>100000', 'country': 'UK'}
)
```

TabularForge implements differential privacy to provide formal privacy guarantees, controlled by the `privacy_epsilon` parameter.
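
For background, differential privacy is typically obtained by adding noise calibrated to a privacy budget ε. The sketch below applies the classic Laplace mechanism to a simple count query to show why a smaller ε means more noise. It is a generic textbook illustration and says nothing about TabularForge's internal mechanism; `laplace_noise`, `private_count`, `mean_abs_error`, and the toy income data are invented for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Epsilon-DP count query.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def is_high_income(income):
    return income > 100_000


rng = random.Random(42)
incomes = [rng.uniform(20_000, 150_000) for _ in range(500)]
exact = sum(1 for x in incomes if is_high_income(x))


def mean_abs_error(epsilon, trials=400):
    """Average distance between the noisy and the exact count."""
    errors = [
        abs(private_count(incomes, is_high_income, epsilon, rng) - exact)
        for _ in range(trials)
    ]
    return sum(errors) / trials


for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(eps):.2f}")
```

For Laplace noise the expected absolute error equals the scale, 1/ε, so the error shrinks roughly tenfold with each step from ε = 0.1 to 1.0 to 10.0. In TabularForge, the budget is set through the `privacy_epsilon` argument: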
```python
# Lower epsilon = stronger privacy (but lower utility)
# Higher epsilon = weaker privacy (but higher utility)
forge = TabularForge(real_data, privacy_epsilon=0.1)   # Strong privacy
forge = TabularForge(real_data, privacy_epsilon=1.0)   # Balanced
forge = TabularForge(real_data, privacy_epsilon=10.0)  # Weak privacy
```

You can also test your synthetic data against common privacy attacks, such as membership inference.
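
As background on what such a test measures: in a membership inference attack, the adversary tries to tell whether a given record was part of the training data, for example by checking how close it lies to some synthetic record. The following is a simplified, distance-based sketch on one-dimensional toy data. The function names, the threshold, and the two toy "generators" are assumptions for illustration and do not reflect how `simulate_attack` is implemented.

```python
import random


def nearest_distance(record, synthetic):
    """Distance from a record to its closest synthetic record."""
    return min(abs(record - s) for s in synthetic)


def attack_success_rate(members, non_members, synthetic, threshold):
    """Guess 'member' when a record lies within `threshold` of a synthetic one.

    With equally many members and non-members, 0.5 is a random guess.
    """
    correct = sum(nearest_distance(r, synthetic) <= threshold for r in members)
    correct += sum(nearest_distance(r, synthetic) > threshold for r in non_members)
    return correct / (len(members) + len(non_members))


rng = random.Random(0)
members = [rng.gauss(50, 10) for _ in range(200)]      # used for "training"
non_members = [rng.gauss(50, 10) for _ in range(200)]  # same population, held out

# A leaky generator that copies training records almost verbatim
leaky = [m + rng.gauss(0, 0.001) for m in members]
# A safer generator that only reproduces the overall distribution
safe = [rng.gauss(50, 10) for _ in range(200)]

leaky_rate = attack_success_rate(members, non_members, leaky, threshold=0.01)
safe_rate = attack_success_rate(members, non_members, safe, threshold=0.01)
print(f"leaky generator: {leaky_rate:.2%}")
print(f"safe generator:  {safe_rate:.2%}")
```

The leaky generator lets the attacker identify training records far more often than chance, while the safer one leaves the attacker near a coin flip. TabularForge exposes a comparable check through `simulate_attack`: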
```python
# Simulate membership inference attack
attack_results = forge.simulate_attack(
    attack_type='membership_inference',
    synthetic_data=synthetic
)
print(f"Attack success rate: {attack_results['success_rate']:.2%}")
# A good synthetic dataset should have ~50% (random guess)
```

```python
# Generate synthetic patient cohorts for research
patient_data = pd.read_csv("patient_records.csv")
forge = TabularForge(patient_data, privacy_epsilon=1.0)
synthetic_patients = forge.generate(n_samples=10000)
# Share with researchers without exposing real patients
```

```python
# Create synthetic transactions for fraud detection R&D
transactions = pd.read_csv("transactions.csv")
forge = TabularForge(transactions)
synthetic_transactions = forge.generate(n_samples=50000)
# Develop ML models without sensitive financial data
```

```python
# Augment small datasets
small_dataset = pd.read_csv("rare_events.csv")  # Only 100 samples
forge = TabularForge(small_dataset)
augmented = forge.generate(n_samples=10000)
# Now you have enough data to train robust models
```

```
tabularforge/
├── __init__.py           # Main API exports
├── forge.py              # TabularForge main class
├── generators/           # Synthetic data generators
│   ├── base.py           # Abstract base generator
│   ├── copula.py         # Gaussian Copula generator
│   ├── ctgan.py          # CTGAN generator
│   └── tvae.py           # TVAE generator
├── preprocessing/        # Data preprocessing
│   ├── encoder.py        # Column encoding/decoding
│   └── transformer.py    # Data transformations
├── privacy/              # Privacy mechanisms
│   ├── differential.py   # Differential privacy
│   └── attacks.py        # Privacy attack simulations
├── metrics/              # Quality & privacy metrics
│   ├── statistical.py    # Statistical similarity
│   ├── utility.py        # ML utility metrics
│   └── privacy.py        # Privacy metrics
└── utils/                # Utilities
    ├── config.py         # Configuration management
    └── logging.py        # Logging utilities
```
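
The layout suggests that each generator in `generators/` derives from a shared abstract base. How `base.py` is actually written is not shown here, so the following is only a hypothetical sketch of what such an interface could look like. The class names `BaseGenerator` and `IndependentColumnsGenerator`, the `fit`/`sample` methods, and the toy rows are all assumptions chosen for illustration.

```python
import random
from abc import ABC, abstractmethod


class BaseGenerator(ABC):
    """Hypothetical common interface: learn from rows, then sample new rows."""

    def __init__(self):
        self._fitted = False

    @abstractmethod
    def _fit(self, rows):
        """Learn whatever the concrete generator needs from the real rows."""

    @abstractmethod
    def _sample(self, n_samples, rng):
        """Return a list of n_samples synthetic rows."""

    def fit(self, rows):
        self._fit(rows)
        self._fitted = True
        return self  # allows generator.fit(rows).sample(n)

    def sample(self, n_samples, seed=None):
        if not self._fitted:
            raise RuntimeError("call fit() before sample()")
        return self._sample(n_samples, random.Random(seed))


class IndependentColumnsGenerator(BaseGenerator):
    """Toy generator: resamples every column independently of the others."""

    def _fit(self, rows):
        self._columns = {key: [row[key] for row in rows] for key in rows[0]}

    def _sample(self, n_samples, rng):
        return [
            {key: rng.choice(vals) for key, vals in self._columns.items()}
            for _ in range(n_samples)
        ]


real_rows = [
    {"age": 34, "country": "UK"},
    {"age": 51, "country": "DE"},
    {"age": 29, "country": "UK"},
]
synthetic_rows = IndependentColumnsGenerator().fit(real_rows).sample(5, seed=0)
print(synthetic_rows)
```

Keeping the fitted-state check in the base class means concrete generators only implement `_fit` and `_sample`, so a caller could swap one generator for another without changing its own code.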
```bash
# Clone the repository
git clone https://github.com/ganeshreddy28/tabularforge.git
cd tabularforge

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run linting
flake8 tabularforge/
black tabularforge/
```

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=tabularforge --cov-report=html

# Run specific test file
pytest tests/test_generators.py -v
```

Contributions are welcome! Please see our Contributing Guide for details.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- SDV for inspiration on synthetic data APIs
- CTGAN Paper for the CTGAN architecture
- The differential privacy research community
- Author: Sai Ganesh Kolan
- Email: aiganesh1299@gmail.com
- LinkedIn: https://linkedin.com/in/saiganeshkolan/
Made with ❤️ for the data science community
⭐ Star us on GitHub if you find this useful!
