Collaborator

@mhauru mhauru commented Mar 2, 2023

Adds functionality for the user to specify statistical descriptors to be computed from the source data using OpenDP. These can then be used in place of hard-coded constants when parametrising generators. Also includes various other small changes that I happened to make in the process.

Closes #24

@mhauru mhauru changed the base branch from main to test-chdir March 2, 2023 14:34
@mhauru mhauru changed the base branch from test-chdir to main March 2, 2023 14:34
@mhauru mhauru marked this pull request as ready for review March 2, 2023 16:43
@mhauru mhauru requested a review from Iain-S March 2, 2023 16:43
Collaborator

Iain-S commented Mar 3, 2023

config

Since the config file is getting quite complex now, should we add a JSON schema to show what can go in a config file (and, if we like, to verify that the config provided is in that format)?
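To make the idea concrete, here is a rough sketch of what schema-based checking could look like. Everything in it is an assumption for illustration: the key names (`src-stats`, `name`, `query`, `epsilon`) are hypothetical and not taken from the actual config format, and the `validate` helper is a hand-rolled checker covering only a small subset of JSON Schema keywords (`type`, `required`, `properties`, `items`). In practice we would more likely rely on a dedicated JSON Schema validator library.

```python
# Hypothetical schema: key names are illustrative, not the real config format.
CONFIG_SCHEMA = {
    "type": "object",
    "required": ["src-stats"],
    "properties": {
        "src-stats": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "query", "epsilon"],
                "properties": {
                    "name": {"type": "string"},
                    "query": {"type": "string"},
                    "epsilon": {"type": "number"},
                },
            },
        },
    },
}

# Map the JSON Schema type names used above to Python types.
_TYPES = {"object": dict, "array": list, "string": str, "number": (int, float)}


def validate(instance, schema, path="config"):
    """Return a list of error strings (empty if the instance is valid).

    Supports only the 'type', 'required', 'properties' and 'items' keywords.
    """
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, _TYPES[expected])
        # bool is a subclass of int in Python, but not a JSON number.
        if expected == "number" and isinstance(instance, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]

    errors = []
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required key '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, f"{path}.{key}"))
    if isinstance(instance, list) and "items" in schema:
        for index, item in enumerate(instance):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors


good_config = {
    "src-stats": [
        {"name": "count_people", "query": "SELECT COUNT(*) FROM person", "epsilon": 0.1}
    ]
}
bad_config = {"src-stats": [{"name": "count_people", "epsilon": "high"}]}

good_errors = validate(good_config, CONFIG_SCHEMA)
bad_errors = validate(bad_config, CONFIG_SCHEMA)
```

The same schema dictionary could double as documentation of the format, which is the main appeal: one source of truth for both the docs and the check.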

docs

Shall we add a worked example and explanation to the docs? I'm happy to do this as it will give me a good excuse to play around with make-stats and smartnoise. If that sounds good, shall I make a PR that can be pulled into this branch (or would you rather just add it to this branch directly)?

warning

I get this warning when I run make-stats with the test example config

/Users/istenson/code/turing/sqlsynthgen/.venv/lib/python3.9/site-packages/snsql/sql/private_reader.py:121: UserWarning: Dimension censoring is enabled, with Mechanism.discrete_laplace as the thresholding mechanism.
This is an unsafe floating point mechanism. Counts used for censoring will be revealed in any queries that request COUNT DISTINCT(person),
leading to potential privacy leaks. If your query workload needs to reveal distinct counts of individuals, consider doing the dimension
censoring as a preprocessing step. See the documentation for more information.
warnings.warn(

Do you think we need to address it, silence it, or intercept it and make it more readable?
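For reference, the last two options can be sketched with Python's standard `warnings` module. This is only a sketch under assumptions: `noisy_query` is a hypothetical stand-in for whatever call ends up triggering the warning, not a real smartnoise-sql function, and the helper names are made up for illustration.

```python
import warnings


def noisy_query():
    """Hypothetical stand-in for a library call that emits a UserWarning."""
    warnings.warn(
        "Dimension censoring is enabled, with Mechanism.discrete_laplace "
        "as the thresholding mechanism.\nThis is an unsafe floating point mechanism.",
        UserWarning,
    )
    return 42


def run_silenced(func):
    """Run func while ignoring only the dimension-censoring warning."""
    with warnings.catch_warnings():
        # 'message' is a regex matched against the start of the warning text.
        warnings.filterwarnings(
            "ignore", message="Dimension censoring", category=UserWarning
        )
        return func()


def run_intercepted(func):
    """Run func, capturing warnings and returning shortened messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func()
    # Keep only the first line of each warning to make it more readable.
    messages = [
        f"{w.category.__name__}: {str(w.message).splitlines()[0]}" for w in caught
    ]
    return result, messages


silenced_result = run_silenced(noisy_query)
result, messages = run_intercepted(noisy_query)
```

Filtering on the message text keeps other `UserWarning`s visible, which seems safer than ignoring the whole category.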

mhauru and others added 3 commits March 6, 2023 10:15
Co-authored-by: Iain <25081046+Iain-S@users.noreply.github.com>
Co-authored-by: Iain <25081046+Iain-S@users.noreply.github.com>
Collaborator Author

mhauru commented Mar 6, 2023

Re: config schema
We should definitely at least document what the format is. I should read up on JSON schemas to understand what it would take to specify an actual schema and validate against it.

Re: docs
That would be great. Either way is good, whatever you find easier.

Re: warning
I need to read more to understand why that warning is even raised.

Collaborator Author

mhauru commented Mar 8, 2023

I silenced the smartnoise-sql warning by setting a configuration option differently for the tests.

As far as I can see, what remains is to document the flow of using make-stats in the README, and to document the config file format. I'll leave those for a future PR.

@mhauru mhauru merged commit ba54ada into main Mar 8, 2023
@Iain-S Iain-S deleted the issue24 branch March 8, 2023 10:18
tim-band added a commit to tim-band/sqlsynthgen that referenced this pull request Jul 16, 2025
Development

Successfully merging this pull request may close these issues.

Add option to derive column value from src dataset (marginals)

3 participants