Synthetic Data Generation

A machine learning solution for synthetic data generation. It performs exploratory data analysis on an existing production database, derives the database's statistical characteristics, and generates synthetic data that has similar statistical properties and maintains the same referential integrity and correlation coefficients.
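To illustrate the idea of carrying statistical properties over from real data to synthetic data, here is a minimal sketch using NumPy. It is not the project's own algorithm: it assumes purely numeric columns that are roughly Gaussian, and the function name generate_synthetic and the toy data are hypothetical. It fits a mean vector and covariance matrix to the input and samples new rows from them, so the correlation coefficients of the synthetic rows should land close to those of the input. Referential integrity and categorical columns are out of scope for this sketch.

```python
import numpy as np


def generate_synthetic(real: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Sample rows from a Gaussian fitted to the mean and covariance of `real`.

    Hypothetical helper for illustration only; assumes numeric columns.
    """
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)           # per-column mean
    cov = np.cov(real, rowvar=False)   # column-by-column covariance matrix
    return rng.multivariate_normal(mean, cov, size=n_samples)


# Toy stand-in for a numeric table extracted from a production database.
rng = np.random.default_rng(42)
x = rng.normal(50.0, 10.0, size=2000)
y = 0.8 * x + rng.normal(0.0, 5.0, size=2000)
real = np.column_stack([x, y])

synthetic = generate_synthetic(real, n_samples=2000)

# Compare the correlation matrices of the real and synthetic data.
real_corr = np.corrcoef(real, rowvar=False)
synth_corr = np.corrcoef(synthetic, rowvar=False)
print(np.round(real_corr, 2))
print(np.round(synth_corr, 2))
```

The two printed matrices should be close to each other. Data with skewed distributions or class imbalance would call for a different technique, such as the re-sampling tools listed under Built With.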

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Requires Python >= 3.6.2. All dependent Python packages are listed in requirements.txt.
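If it helps, the version requirement can be checked before installing anything. The snippet below is a small sketch and not part of the project; the names MIN_PYTHON and python_is_supported are hypothetical.

```python
import sys

MIN_PYTHON = (3, 6, 2)  # minimum version stated in the prerequisites


def python_is_supported(version=None) -> bool:
    """Return True if the given (or the running) interpreter meets MIN_PYTHON."""
    current = tuple(version) if version is not None else tuple(sys.version_info[:3])
    return current >= MIN_PYTHON


if python_is_supported():
    print("Python version is supported")
else:
    print("Python >= 3.6.2 is required")
```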

Build and Run

Clone or download to your desired location

git clone https://github.com/aayush-jain18/synthetic-data-generation.git

cd to the installation directory synthetic-data-master and create a virtualenv to isolate project requirements

virtualenv testenv

Activate the virtualenv (the command below is for Windows; on Linux or macOS use source testenv/bin/activate)

testenv\Scripts\activate

Install the framework's requirements in your virtualenv

pip install -r requirements.txt

Change the input and output paths for results as required in config.yaml.
Run the framework from the parent directory using the command below

python synthetic_data_generation -c config.yaml

Deactivate the virtualenv once the test is completed.

deactivate

Configuring Input

All input and output parameters for the tool are configured in config.yaml. The config file is passed to the process through main.py with the -c option, for example:

python synthetic_data_generation -c tests\config.yaml

Please use config.yaml as a template for creating new configs.
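For readers who want to see how a -c option of this kind is typically wired up, here is a minimal argparse sketch. It is an assumption about the shape of the command line, not the project's actual main.py: only the -c flag comes from the commands above, while the long name --config and the function build_parser are hypothetical.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the path to a YAML config file."""
    parser = argparse.ArgumentParser(prog="synthetic_data_generation")
    parser.add_argument(
        "-c", "--config",   # "--config" is an assumed long form of -c
        type=Path,
        required=True,
        help="path to the YAML configuration file",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["-c", "tests/config.yaml"])
print(args.config)
```

Converting the argument to a Path lets the same config location work on Windows and on Linux or macOS without hand-written separators.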

Reports

All results are stored in reports, a location relative to the input_path provided in config.yaml. The following files are generated as output:

log.out (process details and events log)
db_metadata.xlsx (if input source is database, generated database metadata)
Input data source stats and clusters representation
1. cluster.png
2. heatmap.png
3. pair_plot.png
4. summary.xlsx
Synthetic output results, stats and clusters representation
1. synth_cluster.png
2. synth_heatmap.png
3. synth_pair_plot.png
4. synth_summary.xlsx
5. synth_results.xlsx
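After a run, the list above can be used to check that every expected report was written. The sketch below is not part of the project: the helper missing_reports is hypothetical, and it assumes all files sit directly inside one reports directory.

```python
from pathlib import Path

# File names taken from the list of generated reports.
EXPECTED_REPORTS = [
    "log.out",
    "cluster.png",
    "heatmap.png",
    "pair_plot.png",
    "summary.xlsx",
    "synth_cluster.png",
    "synth_heatmap.png",
    "synth_pair_plot.png",
    "synth_summary.xlsx",
    "synth_results.xlsx",
]


def missing_reports(reports_dir, from_database=False):
    """Return the expected report files that are absent from reports_dir."""
    expected = list(EXPECTED_REPORTS)
    if from_database:
        # db_metadata.xlsx is only generated when the input source is a database.
        expected.append("db_metadata.xlsx")
    root = Path(reports_dir)
    return [name for name in expected if not (root / name).is_file()]


# "reports" is an assumed directory name; adjust it to match your input_path.
print(missing_reports("reports"))
```

An empty list means every expected file is present.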

[Figure: input data cluster representation]

[Figure: synthetic data output cluster representation]

Built With

Pandas - data structures and data analysis tools for Python
NumPy - N-dimensional arrays and numerical computing for Python
scikit-learn - Data mining and data analysis
imbalanced-learn - re-sampling tools for datasets showing strong between-class imbalance.
matplotlib - 2D plotting tools
seaborn - statistical data visualization

Authors

Aayush Jain - Author

License

This project is licensed under the _______ License - see the LICENSE.md file for details
