rsdv — The R Synthetic Data Vault

Synthetic data generation in R (Gaussian copula-based, extensible to deep generative models)

rsdv is an R implementation of Python’s Synthetic Data Vault (SDV) framework (Patki, Wedge, and Veeramachaneni 2016). It generates synthetic tabular data using Gaussian copula models, with built-in quality and privacy evaluation.
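For orientation, the Gaussian copula idea can be sketched in a few lines of base R. This is an illustrative sketch only, not rsdv's internal implementation: the toy data, the helper `to_normal()`, and the choice of rank-based normal scores with empirical quantiles are all assumptions made for the example, and it covers numerical columns only.

```r
# Minimal Gaussian copula sketch in base R (illustrative; not rsdv's internals).
set.seed(1)

# Toy "real" data: two dependent, non-normal numeric columns.
n    <- 500
x    <- rexp(n, rate = 0.5)
real <- data.frame(x = x, y = x^2 + rnorm(n))

# 1. Map each column to normal scores through its empirical CDF (ranks).
to_normal <- function(v) qnorm(rank(v, ties.method = "average") / (length(v) + 1))
z <- sapply(real, to_normal)

# 2. Estimate the copula correlation matrix on the normal scale.
rho <- cor(z)

# 3. Draw correlated standard normals using the Cholesky factor of rho.
m     <- 500
z_new <- matrix(rnorm(m * ncol(z)), nrow = m) %*% chol(rho)

# 4. Map back to each column's own scale through its empirical quantiles.
synthetic <- as.data.frame(
  mapply(function(col, zcol) quantile(col, probs = pnorm(zcol), names = FALSE, type = 8),
         real, as.data.frame(z_new))
)
names(synthetic) <- names(real)

head(synthetic)
```

Each synthetic column follows the marginal distribution of its real counterpart, while the dependence between columns comes from the correlation matrix estimated in step 2.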

Installation

# Development version (requires the remotes package)
# install.packages("remotes")
remotes::install_github("kvenkita/rsdv")

Quick start

library(rsdv)
#> 
#> Attaching package: 'rsdv'
#> The following object is masked from 'package:base':
#> 
#>     sample

set.seed(42)

# Describe column types
meta <- metadata(adult_income) |>
  set_column_type("age",        "numerical") |>
  set_column_type("occupation", "categorical") |>
  set_column_type("income",     "categorical") |>
  set_primary_key("id")

# Fit a GaussianCopula synthesizer
syn       <- gaussian_copula_synthesizer(meta)
syn       <- fit(syn, adult_income)

# Generate 500 synthetic rows
synth_data <- sample(syn, n = 500)

# Evaluate quality
qr <- quality_report(real = adult_income, synthetic = synth_data,
                     metadata = meta)
print(qr)
#> == rsdv Quality Report ==
#> 
#> Column Similarity (KS, numerical):
#>   id                   0.958
#>   age                  0.948
#>   fnlwgt               0.950
#>   education_num        0.780
#>   capital_gain         0.468
#>   capital_loss         0.470
#>   hours_per_week       0.738
#> 
#> Column Similarity (TVD, categorical):
#>   workclass            0.961
#>   education            0.944
#>   marital_status       0.952
#>   occupation           0.951
#>   relationship         0.978
#>   race                 0.990
#>   sex                  0.992
#>   native_country       0.976
#>   income               0.980
#> 
#> Property scores:
#>   Column Shapes        0.877
#>   Column Pair Trends   0.903
#>     (correlation 0.967, contingency 0.865)
#> 
#> Overall Score:               0.890

quality_report() aggregates metrics into the two-property hierarchy used by SDMetrics: Column Shapes (per-column marginal fidelity) and Column Pair Trends (correlation similarity for numerical pairs, contingency similarity for categorical pairs). The overall score is the mean of the two properties; in the report above, (0.877 + 0.903) / 2 = 0.890.
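The per-column scores in the report can be reproduced in spirit with base R. The sketch below assumes the KS and TVD scores are the usual complements (1 minus the two-sample Kolmogorov-Smirnov statistic, and 1 minus the total variation distance), as in SDMetrics; the helper names `ks_complement()` and `tvd_complement()` are hypothetical and are not part of rsdv. The aggregation at the end uses the rounded figures printed above, which are consistent with Column Shapes being a plain mean over the 16 per-column scores.

```r
# KS complement for a numerical column: 1 minus the two-sample KS statistic.
ks_complement <- function(real, synth) {
  grid <- sort(unique(c(real, synth)))
  1 - max(abs(ecdf(real)(grid) - ecdf(synth)(grid)))
}

# TVD complement for a categorical column: 1 minus the total variation distance.
tvd_complement <- function(real, synth) {
  lev <- union(unique(real), unique(synth))
  p <- as.numeric(table(factor(real,  levels = lev))) / length(real)
  q <- as.numeric(table(factor(synth, levels = lev))) / length(synth)
  1 - 0.5 * sum(abs(p - q))
}

ks_same  <- ks_complement(c(1, 2, 3, 4), c(1, 2, 3, 4))                  # identical samples: 1
tvd_demo <- tvd_complement(c("a", "a", "b", "b"), c("a", "a", "a", "b")) # 0.75

# Aggregate the rounded per-column scores printed in the report above.
numerical   <- c(0.958, 0.948, 0.950, 0.780, 0.468, 0.470, 0.738)
categorical <- c(0.961, 0.944, 0.952, 0.951, 0.978, 0.990, 0.992, 0.976, 0.980)

column_shapes <- mean(c(numerical, categorical))  # 0.87725, printed as 0.877
overall       <- mean(c(0.877, 0.903))            # 0.890
```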

diagnostic_report() complements it with structural-validity checks (value ranges, category adherence, key uniqueness), and sample_conditions() generates rows in which the given categorical values are held fixed:

# Validity checks
diagnostic_report(adult_income, synth_data, meta)

# Conditional generation
sample_conditions(syn, data.frame(income = ">50K", .n = 20))
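To make the three structural checks concrete, here is a base R sketch on a toy pair of data frames. It is an assumption about what such checks typically measure, not a description of how diagnostic_report() computes them; the toy data and the variable names are invented for the example.

```r
# Toy real and synthetic tables (invented for illustration).
real  <- data.frame(id = 1:4, age = c(23, 35, 51, 64),
                    income = c("<=50K", ">50K", "<=50K", ">50K"))
synth <- data.frame(id = c(1, 2, 2, 4), age = c(30, 70, 41, 25),
                    income = c("<=50K", ">50K", "high", "<=50K"))

# Value ranges: share of synthetic values inside the real min/max.
in_range <- mean(synth$age >= min(real$age) & synth$age <= max(real$age))

# Category adherence: share of synthetic values that are categories seen in the real data.
adherence <- mean(synth$income %in% unique(real$income))

# Key uniqueness: share of distinct primary key values.
key_unique <- length(unique(synth$id)) / nrow(synth)

c(in_range = in_range, adherence = adherence, key_unique = key_unique)
```

Each check flags one deliberately planted defect: an age of 70 outside the real range, an unseen category "high", and a duplicated id, so all three scores come out at 0.75.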

Related work

Python SDV: sdv-dev/SDV
Synthetic Data Vault paper: Patki et al., IEEE DSAA 2016
CTGAN: Xu et al., NeurIPS 2019 (implemented in companion package rsdv.torch)
