aayush-jain18/oversampling-data

Synthetic Data Generation

A Machine Learning solution for synthetic data generation. It carries out exploratory data analysis of an existing production database, derives the statistical characteristics of that data, and then generates synthetic data that has similar statistical properties and maintains the same referential integrity and correlation coefficients.
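The README does not spell out the generation algorithm, so the following is only a toy sketch of the idea of producing new rows that keep the correlation of the original data. It assumes simple linear interpolation between randomly chosen pairs of existing rows, loosely in the spirit of oversampling methods such as SMOTE (which imbalanced-learn provides), but without nearest-neighbor selection. The function names `pearson` and `interpolate_rows` are hypothetical and are not taken from this project.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def interpolate_rows(rows, n_new, rng):
    """Create n_new synthetic rows, each on the segment between two real rows.

    Assumption: plain random pairing; a real oversampler would usually
    pick the second row among the nearest neighbors of the first.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rows, 2)   # two distinct original rows
        t = rng.random()             # position along the segment, in [0, 1)
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


# Made-up two-column data set: the second column follows the first, plus noise.
rng = random.Random(0)
original = [(float(x), 2.0 * x + rng.gauss(0.0, 1.0)) for x in range(20)]
synthetic = interpolate_rows(original, 50, rng)

print("original correlation :", round(pearson(*zip(*original)), 3))
print("synthetic correlation:", round(pearson(*zip(*synthetic)), 3))
```

Comparing the two printed coefficients is the same kind of check that the heatmap and summary reports described later make possible at full scale.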

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Requires Python >= 3.6.2. All Python package dependencies are listed in requirements.txt.

Build and Run

  • Clone or download to your desired location
git clone https://github.com/aayush-jain18/synthetic-data-generation.git
  • cd into the installation directory synthetic-data-master and create a virtualenv to isolate the project requirements
virtualenv testenv
  • Activate the virtualenv (the command below is for Windows; on Linux or macOS use source testenv/bin/activate)
testenv\Scripts\activate
  • Install all the required packages in your virtualenv
pip install -r requirements.txt
  • Change the input and output paths in config.yaml as required
  • Run the framework from the parent directory using the command below
python synthetic_data_generation -c config.yaml 
  • Deactivate the virtualenv once testing is complete
deactivate

Configuring Input

All input and output parameters for the tool are configured in config.yaml. The config file is passed to the process through main.py using the -c option:

python synthetic_data_generation -c tests\config.yaml

Note: Please use config.yaml as a template for creating new configs.
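As an illustration of how a `-c` option like the one above can be read in Python, here is a minimal sketch using the standard-library `argparse` module. It is not the project's actual main.py: the function name `parse_args`, the long form `--config`, and the help text are assumptions made for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(
        prog="synthetic_data_generation",
        description="Generate synthetic data from a YAML configuration.",
    )
    # Only -c appears in the README; the --config alias is hypothetical.
    parser.add_argument(
        "-c", "--config",
        required=True,
        help="path to the YAML configuration file",
    )
    return parser.parse_args(argv)


# Passing the arguments explicitly keeps the example independent of sys.argv.
args = parse_args(["-c", "config.yaml"])
print("Using config file:", args.config)
```

Loading the YAML file itself would need a YAML parser, which is left out here because the README does not say which one the project uses.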

Reports

All results are stored under reports, a location relative to the input_path provided in config.yaml. The following files are generated as output:

  1. log.out (process details and events log)
  2. db_metadata.xlsx (generated database metadata, only if the input source is a database)
  3. Input data source stats and clusters representation
    1. cluster.png
    2. heatmap.png
    3. pair_plot.png
    4. summary.xlsx
  4. Synthetic output results, stats and clusters representation
    1. synth_cluster.png
    2. synth_heatmap.png
    3. synth_pair_plot.png
    4. synth_summary.xlsx
    5. synth_results.xlsx
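After a run, the list above can be turned into a quick completeness check. The sketch below uses the file names listed in this section; the helper `missing_reports`, its parameters, and the `reports` directory argument are hypothetical and not part of the project.

```python
from pathlib import Path

# File names as listed in the Reports section.
INPUT_REPORTS = ["cluster.png", "heatmap.png", "pair_plot.png", "summary.xlsx"]
SYNTH_REPORTS = [
    "synth_cluster.png",
    "synth_heatmap.png",
    "synth_pair_plot.png",
    "synth_summary.xlsx",
    "synth_results.xlsx",
]


def missing_reports(reports_dir, from_database=False):
    """Return the expected report files that are absent from reports_dir."""
    expected = ["log.out"] + INPUT_REPORTS + SYNTH_REPORTS
    if from_database:
        # db_metadata.xlsx is only produced when the input source is a database.
        expected.append("db_metadata.xlsx")
    root = Path(reports_dir)
    return [name for name in expected if not (root / name).is_file()]


# Example: report anything missing from a directory named "reports".
for name in missing_reports("reports"):
    print("missing:", name)
```

An empty result means every expected file is present; the directory passed in should be the reports location that corresponds to your input_path.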

Input data cluster representation

[Image: Original Cluster]

Synthetic data output cluster representation

[Image: Synthetic Cluster]

Built With

  • Pandas - Data structures and data analysis tools for Python
  • NumPy - N-dimensional arrays and numerical computing for Python
  • scikit-learn - Data mining and data analysis
  • imbalanced-learn - Re-sampling tools for datasets showing strong between-class imbalance
  • matplotlib - 2D plotting tools
  • seaborn - statistical data visualization

Authors

  • Aayush Jain - Author

License

This project is licensed under the _______ License - see the LICENSE.md file for details
