

<div align="center">
<p align="center">


# 🚀 Synthetic Data Generator

</p>
</div>
The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data.

Synthetic data is designed to contain no sensitive information while retaining the essential statistical characteristics of the original data, which can place it outside the scope of privacy regulations such as GDPR and ADPPA.

High-quality synthetic data can be used safely in many areas, including data sharing, model training and debugging, and system development and testing.

In [1]:
# install from git
!pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting git+https://github.com/hitsz-ids/synthetic-data-generator.git
  Cloning https://github.com/hitsz-ids/synthetic-data-generator.git to /private/var/folders/hm/__2c0fzn4_s21ssm_xg8g0h80000gn/T/pip-req-build-lsn_db_7
  Running command git clone --filter=blob:none --quiet https://github.com/hitsz-ids/synthetic-data-generator.git /private/var/folders/hm/__2c0fzn4_s21ssm_xg8g0h80000gn/T/pip-req-build-lsn_db_7
  Resolved https://github.com/hitsz-ids/synthetic-data-generator.git to commit 861f5b6f2206923dc382b9f8258eb2a69c907a32
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
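
After installation, it can help to confirm that the package is visible to the current Python environment before running the remaining cells. The following is a minimal sketch using only the standard library; the helper `installed_version` is hypothetical, and `sdgx` is assumed to be the distribution name (it matches the import name used in this notebook, but pip may report a different one).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the distribution is named "sdgx"; adjust if pip reports another name.
sdgx_version = installed_version("sdgx")
print(f"sdgx version: {sdgx_version}" if sdgx_version else "sdgx is not installed")
```

If the check reports that the package is missing, the install cell probably ran against a different interpreter than the notebook kernel.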


We demonstrate with a single-table data synthesis example.

In [2]:
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata

# 1. Load data and understand the data

The demo dataset is a risk-control dataset used to predict whether an individual will default on a loan. It contains the following features:

| Column name | Meaning |
|-----------------------|-----------------------|
| loan_id | Loan ID |
| user_id | User ID |
| total_loan | Total loan amount |
| year_of_loan | Loan period |
...

This code shows the process of loading real data:

In [3]:
# In the future, this functionality will be integrated into `sdgx.processor`

import requests


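def download_file(url, path):
    """Download `url` and save the response body to `path`."""
    # A timeout keeps the call from hanging indefinitely on a stalled connection.
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        with open(path, 'wb') as file:
            file.write(response.content)
        print(f"File downloaded successfully to {path}")
    else:
        print(f"Failed to download file from {url} (HTTP {response.status_code})")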

# Download the dataset from GitHub
# In the future, this dataset will be downloadable through sdgx.utils

dataset_url = "https://raw.githubusercontent.com/aialgorithm/Blog/master/projects/一文梳理风控建模全流程/train_internet.csv"
download_file(dataset_url, "train_internet.csv")


File downloaded successfully to train_internet.csv
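
Before handing the file to a connector, a quick look at the header and first few rows confirms that the download produced a usable CSV. The sketch below uses only the standard library; `peek_csv` is a hypothetical helper, and UTF-8 encoding is an assumption about the file.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_csv(path, n_rows=5):
    """Return the header and the first n_rows data rows of a CSV file."""
    # Assumption: the file is UTF-8 encoded with a header on the first line.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


csv_path = Path("train_internet.csv")
if csv_path.exists():
    header, rows = peek_csv(csv_path)
    print(f"{len(header)} columns: {header}")
    for row in rows:
        print(row)
```

Reading only the first rows keeps the check cheap even for large files.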


In [4]:
from pathlib import Path
file_path = './train_internet.csv'
path_obj = Path(file_path)

# Create a data connector and data loader for csv data
data_connector = CsvConnector(path=path_obj)
data_loader = DataLoader(data_connector)

In [5]:
loan_metadata = Metadata.from_dataloader(data_loader)
# Automatically infer discrete columns
loan_metadata.discrete_columns

2024-03-22 18:35:56.105 | INFO     | sdgx.data_models.metadata:from_dataloader:280 - Inspecting metadata...
2024-03-22 18:36:01.334 | INFO     | sdgx.data_models.metadata:update_primary_key:482 - Primary Key updated: {'loan_id', 'user_id'}.


{'class',
 'earlies_credit_mon',
 'employer_type',
 'industry',
 'issue_date',
 'sub_class',
 'work_type',
 'work_year'}
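
To sanity-check the inferred set, a simple heuristic can be run directly on the CSV: flag every column that holds at least one non-empty value that does not parse as a number. This is an illustrative sketch only, not SDG's inference logic; `non_numeric_columns` and its `sample_rows` parameter are hypothetical, and UTF-8 encoding is assumed.

```python
import csv
from pathlib import Path


def non_numeric_columns(path, sample_rows=1000):
    """Return names of columns with at least one non-empty, non-numeric value."""

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    found = set()
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        for index, row in enumerate(reader):
            if index >= sample_rows:
                break
            for name, value in row.items():
                # DictReader stores surplus fields under the key None; skip them.
                if name is None:
                    continue
                if value and not is_number(value):
                    found.add(name)
    return found


csv_path = Path("train_internet.csv")
if csv_path.exists():
    print(sorted(non_numeric_columns(csv_path)))
```

The result can be compared with `loan_metadata.discrete_columns`. Differences are expected, because this heuristic treats any numeric-looking column as continuous, including integer-coded categories.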