
Data Simulator

Uses a statistical profile of ODBC source data to parameterize a generic probability model, then generates a structurally equivalent random data set by sampling from that model.

Privacy considerations

Simulated data does not contain patient data. Simulation is one-way: there is no way to reproduce the source data from the simulated data.

Requirements

ODBC-compatible source data (e.g. MS SQL Server, Denodo)

Statistical Profile

input:

  • connection string: ODBC-compliant DSN of the source system
  • parameter file specifying the list of tables to be copied as-is from source to destination

output:

  • rerunnable DDL scripts to create empty structures
  • rerunnable DML scripts to populate non-private data
  • collection of C++ objects representing source schema objects (sketched below)
    • database
    • schema
    • table
    • column
      • parameterized multinomial probability model (i.e. the collection of distinct column values (aka outcome values) and their respective probabilities)
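The object model might look something like the following minimal sketch; the class and member names here are illustrative assumptions, not the repository's actual identifiers.

```cpp
// Illustrative sketch of the schema object model (names are assumptions).
#include <string>
#include <vector>

// One outcome of the multinomial model: a distinct source value and its probability.
struct Outcome {
    std::string value;
    double probability;   // estimated as count(value) / table row count
};

struct Column {
    std::string name;
    std::string sqlType;
    bool isPrimaryKey = false;
    std::vector<Outcome> multinomialModel;  // empty for simulated primary key columns
};

struct Table {
    std::string name;
    long long rowCount = 0;
    std::vector<Column> columns;
};

struct Schema {
    std::string name;
    std::vector<Table> tables;
};

struct Database {
    std::string name;
    std::vector<Schema> schemas;
};
```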

Logical Flow

  1. Extract metadata from the source schema.
     a. Generate rerunnable DDL scripts.
  2. Instantiate the C++ object model of the schema from the metadata.
  3. Decorate the C++ column class to add ODBC connectivity for querying the source system.
  4. Generate and execute queries against the source data that determine pairwise functional dependencies between columns within each table (see the query sketch after this list).
  5. Functional dependency hierarchies are modelled as a tree of column values.
     a. Only leaf-level columns are simulated.
     b. Values of parent columns are determined from the functional dependency tree.
  6. Generate and execute queries against the source data that return table row counts, column distinct counts, and column value histograms (see the query sketch after this list).
  7. Assume columns and rows are pairwise statistically independent (within and between tables).
     a. As a consequence, FK constraints are not preserved.
  8. Simulate primary key columns with an increasing sequence of values.
  9. Model non-unique columns as multinomial probability distributions and estimate their parameters from the column value histograms.
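The queries in steps 4 and 6 could be generated along the lines of this sketch; the helper functions and SQL shapes are assumptions for illustration, not the repository's actual code.

```cpp
// Illustrative query generation for steps 4 and 6 (names and SQL are assumptions).
#include <string>

// Step 4: column A functionally determines column B within a table when no value
// of A maps to more than one distinct value of B. If this query returns no rows,
// the dependency A -> B holds.
std::string functionalDependencyQuery(const std::string& table,
                                      const std::string& a,
                                      const std::string& b) {
    return "SELECT " + a + " FROM " + table +
           " GROUP BY " + a +
           " HAVING COUNT(DISTINCT " + b + ") > 1";
}

// Step 6: per-column value histogram used to parameterize the multinomial model.
std::string histogramQuery(const std::string& table, const std::string& column) {
    return "SELECT " + column + ", COUNT(*) AS freq FROM " + table +
           " GROUP BY " + column;
}
```

Dividing each histogram count by the table row count gives the outcome probabilities used to parameterize that column's multinomial model (step 9).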

Data Simulation

input:

  • output of the Statistical Profile stage (DDL/DML scripts and the column probability models)

output:

  • Fake (simulated) data in a structurally equivalent copy of the source schema

Logical Flow

  1. Execute the DDL creation scripts.
  2. Execute the DML insertion scripts.
  3. Populate the destination tables by sampling from the column probability models (see the sampling sketch below).
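Step 3, together with the sequential primary-key rule from the profiling flow, could look like this minimal sketch, reusing the illustrative Outcome structure from above; all names are assumptions.

```cpp
// Illustrative sampling from a column's multinomial model (names are assumptions).
#include <random>
#include <string>
#include <vector>

struct Outcome { std::string value; double probability; };

// Draw one simulated value for a non-unique column by sampling its multinomial model.
std::string sampleColumnValue(const std::vector<Outcome>& model, std::mt19937& rng) {
    std::vector<double> weights;
    weights.reserve(model.size());
    for (const auto& o : model) weights.push_back(o.probability);
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return model[dist(rng)].value;
}

// Primary key columns are simulated as an increasing sequence of values.
long long nextPrimaryKey(long long& counter) {
    return ++counter;
}
```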