DataFrameSampler generates synthetic mixed-type tabular rows by learning a fully numerical latent representation with supervised per-column categorical embeddings, then applying a local mutual-neighbor displacement operator in that latent space.
The package is an experimental tabular sampler. It is not a privacy mechanism: the model is fit directly on the input dataframe and decoded values can match values observed during training.
DataFrameSampler treats every non-continuous column as categorical. This
includes object/string columns, booleans, and binary numeric columns such as
0/1. A column is treated as numeric only if it has a numeric dtype and more
than two distinct non-missing values.
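A rough sketch of this typing rule, written with plain pandas. The helper name `split_columns` is hypothetical and is not part of the package; the sketch also assumes that "more than two" refers to distinct non-missing values.

```python
import pandas as pd


def split_columns(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (numeric, categorical) column names under the rule described above."""
    numeric, categorical = [], []
    for name in df.columns:
        col = df[name]
        # pandas reports bool as a numeric dtype, so booleans are excluded explicitly.
        is_number = pd.api.types.is_numeric_dtype(col) and not pd.api.types.is_bool_dtype(col)
        # Assumption: a numeric column needs more than two distinct non-missing values.
        if is_number and col.nunique(dropna=True) > 2:
            numeric.append(name)
        else:
            categorical.append(name)
    return numeric, categorical


df = pd.DataFrame(
    {
        "age": [21, 35, 48, 64],
        "city": ["Porto", "London", "Paris", "Rome"],
        "member": [0, 1, 0, 1],          # binary numeric, treated as categorical
        "active": [True, False, True, True],
    }
)
numeric_cols, categorical_cols = split_columns(df)
```

Here only `age` ends up numeric; `city`, `member`, and `active` are handled as categorical columns.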
During fit:
- Non-binary numeric columns are median-imputed and standardized.
- Every categorical column is one-hot encoded for context.
- For each categorical column `C_j`, the sampler removes that column's own current block from the context and fits a `sklearn.neighbors.NeighborhoodComponentsAnalysis` projection to predict `C_j`. For larger dataframes, this NCA fit can be estimated on a deterministic row sample, sized by `nca_fit_sample_size`, and the fitted projection is then applied to all rows.
- The one-hot block for `C_j` is replaced by its learned latent block.
- The process repeats for `n_iterations`. If `n_iterations=0`, the initial one-hot categorical blocks are kept as the latent categorical blocks.
- A `RandomForestClassifier` decoder is fit for each categorical column from the final latent context excluding that target column's own block. Optional probability calibration can be enabled when calibrated decoder probabilities are needed.
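The NCA step for a single categorical column can be sketched as follows. This is an illustration built directly on scikit-learn and pandas, not the package's internal code: the toy data, the `init="pca"` choice, and the variable names are assumptions, and median imputation and row sampling are omitted.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NeighborhoodComponentsAnalysis

df = pd.DataFrame(
    {
        "age": [21, 22, 35, 36, 48, 49, 63, 64],
        "city": ["Porto", "Porto", "London", "London", "Paris", "Paris", "Rome", "Rome"],
        "member": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)

# Standardized numeric context (no missing values in this toy frame).
age = df[["age"]].to_numpy(dtype=float)
age = (age - age.mean(axis=0)) / age.std(axis=0)

# One-hot block for the other categorical column.
member_block = pd.get_dummies(df["member"].astype(str)).to_numpy(dtype=float)

# Context for the target column "city": everything except city's own block.
context = np.hstack([age, member_block])

# Learn a 1-d projection of the context that separates the city labels.
nca = NeighborhoodComponentsAnalysis(n_components=1, init="pca", random_state=0)
city_block = nca.fit(context, df["city"]).transform(context)

# The learned block takes the place of city's one-hot block in the latent matrix.
Z = np.hstack([age, city_block, member_block])
```

With `n_components=1`, the four one-hot city columns are replaced by a single learned coordinate per row.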
The final latent matrix is:
Z = [standardized numeric columns | categorical block 1 | ...]
Categorical blocks are learned NCA blocks when n_iterations > 0, and one-hot
blocks when n_iterations = 0.
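To make the block layout concrete, the sketch below assembles a toy `Z` with NumPy and records which columns belong to each categorical block, so that a target column's own block can be dropped from its context. The block widths, the filler values, and the helper `context_without` are illustrative assumptions.

```python
import numpy as np

n_rows = 4
numeric = np.zeros((n_rows, 2))               # two standardized numeric columns
blocks = {
    "city": np.full((n_rows, 2), 1.0),        # latent block of width 2
    "member": np.full((n_rows, 1), 2.0),      # latent block of width 1
}

# Z = [numeric columns | categorical block 1 | categorical block 2]
Z = np.hstack([numeric, *blocks.values()])

# Remember the column range occupied by each categorical block.
slices = {}
start = numeric.shape[1]
for name, block in blocks.items():
    slices[name] = slice(start, start + block.shape[1])
    start += block.shape[1]


def context_without(Z: np.ndarray, name: str) -> np.ndarray:
    """Return Z with the named categorical block removed."""
    keep = np.ones(Z.shape[1], dtype=bool)
    keep[slices[name]] = False
    return Z[:, keep]


city_context = context_without(Z, "city")     # numeric columns plus the member block
```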
Recommended empirical setups used by the paper:
| Setup | n_components | nca_fit_sample_size | lambda_ | n_iterations | quantile_guard | max_constraint_retries | calibrate_decoders | Use |
|---|---|---|---|---|---|---|---|---|
| Fast | 1 | 0.5 | 0.25 | 0 | 0.1 | 0 | False | Smoke tests, previews, and cheap notebook checks. |
| Default | 1 | 0.5 | 0.25 | 0 | 0.1 | 5 | False | General example-data workflows. |
| Accurate | 1 | 0.5 | 0.25 | 2 | 0.1 | 20 | True | Slower diagnostic runs where calibrated probabilities matter. |
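One convenient way to carry these setups around is a dictionary of keyword arguments. The names `BASE` and `SETUPS` are hypothetical; the keys are the constructor arguments and the values are copied from the table.

```python
# Values shared by all three setups in the table.
BASE = {
    "n_components": 1,
    "nca_fit_sample_size": 0.5,
    "lambda_": 0.25,
    "quantile_guard": 0.1,
}

# Per-setup overrides.
SETUPS = {
    "fast": {**BASE, "n_iterations": 0, "max_constraint_retries": 0, "calibrate_decoders": False},
    "default": {**BASE, "n_iterations": 0, "max_constraint_retries": 5, "calibrate_decoders": False},
    "accurate": {**BASE, "n_iterations": 2, "max_constraint_retries": 20, "calibrate_decoders": True},
}

# Intended use, assuming the package is installed:
# sampler = DataFrameSampler(**SETUPS["fast"], random_state=42)
```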
Generation uses the fitted latent matrix. For each synthetic row it picks an
anchor row A, a mutual neighbor B of A, and a mutual neighbor C of B, then
creates:
A' = A + lambda_ * (C - B)
A' is decoded back to a dataframe row. Numeric columns are inverse-scaled.
Categorical columns are sampled from the decoder's predicted class
probabilities.
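The displacement step can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the package's implementation: two rows are taken to be mutual neighbors when each appears in the other's k-nearest-neighbor list, distances are Euclidean and computed by brute force, and C is allowed to coincide with A. The function names are hypothetical.

```python
import numpy as np


def mutual_neighbors(Z: np.ndarray, k: int) -> list[np.ndarray]:
    """For each row i, return the rows j such that i and j are in each other's k-NN lists."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # a row is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]
    is_nn = np.zeros(d.shape, dtype=bool)
    is_nn[np.arange(len(Z))[:, None], knn] = True
    mutual = is_nn & is_nn.T                         # keep only symmetric relations
    return [np.flatnonzero(row) for row in mutual]


def transport_candidate(Z, k=3, lambda_=0.25, rng=None):
    """Draw one latent candidate A' = A + lambda_ * (C - B)."""
    rng = np.random.default_rng(rng)
    mutual = mutual_neighbors(Z, k)
    anchors = [i for i, m in enumerate(mutual) if len(m) > 0]
    a = rng.choice(anchors)                          # anchor row A
    b = rng.choice(mutual[a])                        # mutual neighbor B of A
    c = rng.choice(mutual[b])                        # mutual neighbor C of B (may equal A)
    return Z[a] + lambda_ * (Z[c] - Z[b])


rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))
a_prime = transport_candidate(Z, k=3, lambda_=0.25, rng=rng)
```

Setting `lambda_=0` returns the anchor row unchanged, which is a quick sanity check on the operator.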
Generated latent candidates are checked against fitted numeric quantiles by
default. With quantile_guard=0.1, a candidate is retried when any guarded
numeric latent coordinate falls outside the fitted 10th to 90th percentile
interval. Each retry starts from a newly sampled anchor row. If the retry budget
is exhausted, the final candidate is accepted as-is rather than clipped.
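The retry logic can be sketched as below. The function `guarded_sample` and its `propose` callback are hypothetical names; the sketch assumes that each call to `propose` starts from a fresh anchor and returns only the guarded numeric coordinates of one candidate.

```python
import numpy as np


def guarded_sample(propose, Z_numeric, quantile_guard=0.1, max_retries=5):
    """Draw candidates until one lies inside the guard or the retry budget runs out."""
    lo = np.quantile(Z_numeric, quantile_guard, axis=0)
    hi = np.quantile(Z_numeric, 1.0 - quantile_guard, axis=0)
    candidate = propose()
    for _ in range(max_retries):
        if np.all((candidate >= lo) & (candidate <= hi)):
            break
        candidate = propose()        # each retry draws a new candidate
    return candidate                 # accepted as-is, never clipped


# Fitted numeric latent column with 10th/90th percentiles at -0.8 and 0.8.
Z_numeric = np.linspace(-1.0, 1.0, 101).reshape(-1, 1)

# The first proposal violates the guard, the second one passes.
proposals = iter([np.array([10.0]), np.array([0.0])])
accepted = guarded_sample(lambda: next(proposals), Z_numeric)
```

In this sketch, `max_retries=0` accepts the first candidate without any check, which is one plausible reading of the Fast setup.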
This neighbor transport step is a heuristic. It assumes the learned latent space is locally linear enough for transferred displacements to stay near the empirical manifold.
To install from a source checkout:

    pip install .

For test dependencies:

    pip install ".[test]"
    pytest

Parquet data input/output is optional:

    pip install ".[parquet]"

Optional approximate-nearest-neighbor backends are installed only when needed:

    pip install ".[pynndescent]"
    pip install ".[hnswlib]"
    pip install ".[annoy]"
    pip install ".[ann]"

Optional high-capacity reference experiments use SDV:

    pip install ".[deep-baselines]"

Basic Python usage:

    import pandas as pd
    from dataframe_sampler import DataFrameSampler

    df = pd.DataFrame(
        {
            "age": [21, 22, 35, 36, 48, 49, 63, 64],
            "city": ["Porto", "Porto", "London", "London", "Paris", "Paris", "Rome", "Rome"],
            "spend": [120, 130, 240, 260, 310, 330, 410, 430],
            "member": [0, 1, 0, 1, 0, 1, 0, 1],
        }
    )

    sampler = DataFrameSampler(
        n_components=1,
        n_iterations=0,
        nca_fit_sample_size=0.5,
        n_neighbours=3,
        lambda_=0.25,
        knn_backend="sklearn",
        random_state=42,
    )
    sampler.fit(df)
    latent = sampler.transform(df)
    reconstructed = sampler.inverse_transform(latent, sample=False)
    generated = sampler.generate(n_samples=5)

Public methods:
- `fit(X, y=None)`: fit on a non-empty pandas DataFrame. `y` is ignored and exists only for sklearn compatibility.
- `transform(X)`: transform a pandas DataFrame with the fitted columns into a NumPy latent matrix.
- `inverse_transform(Z, sample=True)`: decode a NumPy latent matrix back into a pandas DataFrame.
- `generate(n_samples=None)`: generate synthetic rows from the fitted latent matrix. If `n_samples=None`, the fitted row count is used.
- `save(filename)` and `load(filename)`: pickle model persistence helpers.
Constructor arguments:
- `n_components`: integer categorical latent width, or a dictionary keyed by categorical column. Defaults to `2`.
- `n_iterations`: number of NCA refinement rounds. Use `0` to keep the initial one-hot categorical blocks. Defaults to `1`.
- `n_neighbours`: nearest-neighbor count used for mutual-neighbor generation.
- `lambda_`: multiplier for the transferred neighbor displacement.
- `knn_backend`: one of `exact`, `sklearn`, `pynndescent`, `hnswlib`, or `annoy`. Defaults to `sklearn`.
- `knn_backend_kwargs`: optional backend-specific options.
- `random_state`: optional seed.
- `nca_kwargs`: optional keyword arguments for `NeighborhoodComponentsAnalysis`.
- `nca_fit_sample_size`: optional row cap for fitting each NCA block. An integer uses at most that many rows; a float in `(0, 1]` is interpreted as a fraction of fitted rows, so `0.1` means 10%. The learned NCA projections are still applied to all rows.
- `decoder_kwargs`: optional keyword arguments for `RandomForestClassifier`. Decoders default to at least `n_estimators=100` and `n_jobs=-1` to use all available cores.
- `calibrate_decoders`: whether to wrap categorical decoders with `CalibratedClassifierCV` when feasible. Defaults to `False`.
- `calibration_kwargs`: optional keyword arguments for `CalibratedClassifierCV`.
- `quantile_guard`: optional fitted quantile guard for generated numeric latent coordinates. A value of `0.1` accepts candidates inside the fitted 10th to 90th percentile interval; `0.0` gives a min/max guard; `None` disables rejection. Defaults to `0.1`.
- `max_constraint_retries`: number of neighbour-chain retries before accepting a quantile-violating candidate as-is. Defaults to `5`.
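The integer-versus-fraction reading of `nca_fit_sample_size` can be illustrated with a small helper. The function name `resolve_nca_rows` is hypothetical, and the rounding (fractions round up, at least one row is kept) is an assumption rather than documented behavior.

```python
import math


def resolve_nca_rows(n_rows: int, sample_size=None) -> int:
    """Translate an nca_fit_sample_size value into a number of rows used for the NCA fit."""
    if sample_size is None:
        return n_rows                                  # no cap: fit on every row
    if isinstance(sample_size, float) and 0.0 < sample_size <= 1.0:
        # Float in (0, 1]: a fraction of the fitted rows (rounding up is an assumption).
        return max(1, min(n_rows, math.ceil(sample_size * n_rows)))
    # Otherwise: an absolute row cap, never more than the rows available.
    return max(1, min(n_rows, int(sample_size)))


half = resolve_nca_rows(1000, 0.5)      # fraction of 1000 rows
capped = resolve_nca_rows(1000, 200)    # absolute cap
small = resolve_nca_rows(50, 200)       # cap larger than the data
```

Note the difference between the float `1.0`, which means all rows, and the integer `1`, which means a single row.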
High-cardinality categorical columns are not dropped automatically. DataFrameSampler warns and proceeds, assuming such columns have been preprocessed deliberately.
Command-line usage:

    Usage: dataframe-sampler [OPTIONS]

      Generate a dataframe file similar to the input CSV or Parquet file.

    Options:
      -i, --input_filename PATH       Path to input CSV or Parquet file.
      -o, --output_filename PATH      Path to CSV or Parquet file to generate.
      -m, --input_model_filename PATH
                                      Path to fit model.
      -d, --output_model_filename PATH
                                      Path to model to save.
      -n, --n_samples INTEGER RANGE   Number of samples to generate. If 0 then
                                      generate the same number of samples as input.
      --n_components INTEGER RANGE    NCA latent dimensions per categorical column.
      --n_iterations INTEGER RANGE    Number of iterative categorical NCA
                                      refinement rounds. Use 0 to keep one-hot
                                      categorical blocks.
      --nca_fit_sample_size FLOAT     Optional NCA fit row cap. Values in (0, 1]
                                      are fractions of fitted rows; values greater
                                      than 1 are interpreted as row counts.
      --n_neighbours INTEGER RANGE    Number of neighbours.
      --lambda FLOAT                  Latent neighbour displacement multiplier.
      --random_state INTEGER          Optional random seed for reproducible output.
      --knn_backend [exact|sklearn|pynndescent|hnswlib|annoy]
                                      KNN backend used for neighbour search.
      --knn_backend_kwargs_filename PATH
                                      Path to backend-specific KNN options
                                      serialized in YAML.
      --calibrate_decoders / --no_calibrate_decoders
                                      Calibrate categorical decoder probabilities
                                      when feasible.
      --quantile_guard FLOAT RANGE    Reject and retry generated candidates whose
                                      numeric latent coordinates fall outside [q,
                                      1-q] fitted quantiles. Use 0 for min/max.
      --max_constraint_retries INTEGER RANGE
                                      Retries per generated row before accepting an
                                      out-of-quantile latent candidate.
      -h, --help                      Show this message and exit.
Example:

    dataframe-sampler \
      --input_filename input.csv \
      --output_filename generated.csv \
      --n_samples 100 \
      --n_components 1 \
      --n_iterations 0 \
      --nca_fit_sample_size 0.5 \
      --lambda 0.25 \
      --n_neighbours 5 \
      --knn_backend sklearn \
      --random_state 42

The legacy source-checkout script path still works:

    python dataframe_sampler.py --help

Repository layout:

- `src/dataframe_sampler/sampler.py`: `DataFrameSampler`, iterative NCA latent learning, probabilistic decoding, and neighbor transport generation.
- `src/dataframe_sampler/knn.py` and `src/dataframe_sampler/neighbours.py`: exact, sklearn, and optional approximate nearest-neighbor backends.
- `src/dataframe_sampler/cli.py`: CSV/Parquet file workflow.
- `experiments/`: reusable experiment, metric, baseline, plotting, and notebook helpers.
- `publication/`: paper draft and generated tables.
Tests and CLI checks:

    python -m pytest tests/test_dataframe_sampler.py
    python -m pytest tests/test_numeric_projection.py
    python -m pytest tests/test_experiment_workflow.py tests/test_experiment_baselines.py tests/test_experiment_predictive.py
    dataframe-sampler --help
    python dataframe_sampler.py --help