

<div align="center">
<p align="center">


# 🚀 Synthetic Data Generator

</p>
</div>

The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data. It incorporates a wide range of single-table and multi-table data synthesis algorithms, as well as LLM-based synthetic data generation models.

Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.

In [1]:
# install dependencies
!pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git
!pip install table_evaluator
!pip install joblib==1.3.2

We demonstrate SDG with a single-table data synthesis example.

In [1]:
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata



# 1. Load data and understand the data

The dataset used in this demonstration is a risk-control dataset for predicting whether an individual will default on a loan. It contains the following features:

| Column name | Meaning |
|-----------------------|-----------------------|
| loan_id | loan ID |
| user_id | user ID |
| total_loan | Total loan amount |
| year_of_loan | Loan period |
...


In [2]:
# download the example dataset
import os 
import requests

def download_file(url, path):
    """Download `url` and save the response body to `path`."""
    # A timeout keeps the cell from hanging forever on a stalled connection
    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        with open(path, 'wb') as file:
            file.write(response.content)
        print(f"File downloaded successfully to {path}")
    else:
        print(f"Failed to download file from {url} (HTTP {response.status_code})")


dataset_url = "https://raw.githubusercontent.com/aialgorithm/Blog/master/projects/一文梳理风控建模全流程/train_internet.csv"

if not os.path.exists("train_internet.csv"):
    download_file(dataset_url, "train_internet.csv")

This code shows the process of loading real data:

In [3]:
from pathlib import Path
file_path = './train_internet.csv'
path_obj = Path(file_path)

# Create a data connector and data loader for csv data
data_connector = CsvConnector(path=path_obj)
data_loader = DataLoader(data_connector)

# 2. Create synthetic data generation workflow

Below we use SDG to create a data synthesis workflow. The workflow includes steps such as automated metadata identification, which helps generate high-quality synthetic data.

First, we need to create a metadata object. SDG provides automated metadata identification that extracts key information from an existing data source, such as data types, value ranges, and distributions.
For more information, see the metadata notebook. The metadata can also specify which encoder to use for categorical columns; this example uses the default one-hot encoder.
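To make the encoder choice concrete, here is a minimal sketch of what one-hot encoding does to a categorical column. It is plain Python written for illustration and is not SDG's encoder; the `one_hot_encode` helper and the toy `grades` column are hypothetical.

```python
from typing import Dict, List, Sequence


def one_hot_encode(values: Sequence[str]) -> Dict[str, List[int]]:
    """Return one 0/1 indicator column per distinct category in `values`."""
    # Sort the categories so the output column order is deterministic.
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical toy column; the values are illustrative only.
grades = ["A", "B", "A", "C"]
encoded = one_hot_encode(grades)
for name, column in encoded.items():
    print(name, column)
```

Each row ends up with exactly one indicator set to 1, which is why the number of model inputs grows with the number of distinct categories in a column.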

In [4]:
loan_metadata = Metadata.from_dataloader(data_loader)
# Automatically infer discrete columns
loan_metadata.discrete_columns

2024-12-17 20:23:20.471 | INFO     | sdgx.data_models.metadata:from_dataloader:333 - Inspecting metadata...
2024-12-17 20:23:22.492 | INFO     | sdgx.data_models.metadata:update_primary_key:564 - Primary Key updated: {'user_id', 'loan_id'}.


{'class',
 'earlies_credit_mon',
 'employer_type',
 'industry',
 'issue_date',
 'sub_class',
 'work_type',
 'work_year'}

With the metadata in place, we can initialize a synthesizer that combines the metadata, the model, and the data connector.

In [7]:
# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    metadata=loan_metadata,
    model=CTGANSynthesizerModel(epochs=8),
    data_connector=data_connector,
)

2024-12-17 20:25:58.475 | INFO     | sdgx.synthesizer:__init__:109 - Using data processors: ['specificcombinationtransformer', 'fixedcombinationtransformer', 'nonvaluetransformer', 'outliertransformer', 'emailgenerator', 'chnpiigenerator', 'intvalueformatter', 'datetimeformatter', 'constvaluetransformer', 'positivenegativefilter', 'emptytransformer', 'columnordertransformer']


# 3. Train a model
CTGAN (Conditional Tabular GAN) is a generative neural network model for synthesizing tabular data.

It trains two networks against each other: a generator that produces synthetic rows and a discriminator that tries to tell them apart from real rows.

The main parameters of CTGAN include:

* embedding_dim: The size of the random noise vector passed to the generator.
* generator_dim: The hidden layer dimensions of the generator network.
* discriminator_dim: The hidden layer dimensions of the discriminator network.
* generator_lr: The learning rate of the generator network.
* discriminator_lr: The learning rate of the discriminator network.
* batch_size: The number of samples in each training batch.
* epochs: The number of passes over the training data.

These parameters control the capacity of the model and the stability of training. A larger embedding dimension gives the generator a richer noise input, while larger hidden layer dimensions increase the capacity of the networks. The learning rates and batch size affect the convergence speed and stability of training, while the number of epochs determines how long training runs.
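To show how `batch_size` and `epochs` together determine the amount of training work, here is a small sketch. `CTGANConfig` is a hypothetical container written for this example and is not an SDG class; its default values are illustrative placeholders rather than SDG's defaults (only `epochs=8` mirrors this notebook), the row count is made up, and rounding the last partial batch up is an assumption that a given implementation may not share.

```python
import math
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class CTGANConfig:
    """Hypothetical container for the parameters listed above (not an SDG class)."""

    embedding_dim: int = 128
    generator_dim: Tuple[int, ...] = (256, 256)
    discriminator_dim: Tuple[int, ...] = (256, 256)
    generator_lr: float = 2e-4
    discriminator_lr: float = 2e-4
    batch_size: int = 500
    epochs: int = 8

    def batches_per_epoch(self, n_rows: int) -> int:
        # Assumption: the final partial batch is still processed (round up).
        return math.ceil(n_rows / self.batch_size)

    def total_batches(self, n_rows: int) -> int:
        # Total number of batches processed over the whole training run.
        return self.epochs * self.batches_per_epoch(n_rows)


n_rows = 5000  # hypothetical training-set size
config = CTGANConfig()
larger_batches = replace(config, batch_size=1000)

print(config.total_batches(n_rows))
print(larger_batches.total_batches(n_rows))
```

Doubling the batch size halves the number of batches per epoch, so each epoch finishes sooner but the networks receive fewer gradient updates for the same number of epochs.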

In [8]:
# Fit the model
synthesizer.fit()

2024-12-17 20:25:59.706 | INFO     | sdgx.synthesizer:fit:298 - Fitting data processors...
2024-12-17 20:25:59.809 | INFO     | sdgx.data_processors.transformers.specific_combination:fit:70 - Fit data using SpecificCombinationTransformer(No specified)... Finished (No action).
2024-12-17 20:25:59.810 | INFO     | sdgx.data_processors.transformers.nan:fit:81 - NonValueTransformer Fitted.
2024-12-17 20:25:59.812 | INFO     | sdgx.data_processors.transformers.nan:fit:97 - NonValueTransformer get int columns: {'region', 'early_return_amount_3mon', 'post_code', 'f0', 'title', 'is_default', 'f3', 'marriage', 'early_return_amount', 'year_of_loan', 'f4', 'house_loan_status', 'house_exist', 'offsprings', 'scoring_low', 'censor_status', 'early_return', 'recircle_b', 'initial_list_status', 'f2', '

Preparing data:  59%|#####8    | 24/41 [00:03<00:02,  7.63it/s]

2024-12-17 20:27:15.261 | INFO     | sdgx.models.ml.single_table.ctgan:_pre_fit:243 - Transforming data...
2024-12-17 20:27:15.281 | INFO     | sdgx.data_processors.transformers.specific_combination:convert:98 - Converting data using SpecificCombinationTransformer(No specified)... Finished (No action).
2024-12-17 20:27:15.282 | INFO     | sdgx.data_processors.transformers.fixed_combination:convert:162 - Converting data using FixedCombinationTransformer... Finished (No action).
2024-12-17 20:27:15.283 | INFO     | sdgx.data_processors.transformers.nan:convert:120 - Converting data using NonValueTransformer...
2024-12-17 20:27:15.296 | INFO     | sdgx.data_processors.transformers.nan:convert:140 - Converting data using NonValueTransform

Fitting batches:  73%|#######2  | 8/11 [00:03<00:01,  2.42it/s]

2024-12-17 20:27:21.216 | INFO     | sdgx.models.ml.single_table.ctgan:_fit:369 - Epoch 1, Loss G:  3.3102, Loss D: -0.9413, Time:  4.4964


Fitting batches:  73%|#######2  | 8/11 [00:03<00:01,  2.52it/s]

2024-12-17 20:27:25.587 | INFO     | sdgx.models.ml.single_table.ctgan:_fit:369 - Epoch 2, Loss G:  2.6172, Loss D: -0.7650, Time:  4.3705


Fitting batches:  73%|#######2  | 8/11 [00:03<00:01,  2.51it/s]

2024-12-17 20:27:30.043 | INFO     | sdgx.models.ml.single_table.ctgan:_fit:369 - Epoch 3, Loss G:  2.0373, Loss D: -0.1943, Time:  4.4555


Fitting batches:  73%|#######2  | 8/11 [00:03<00:01,  2.45it/s]

2024-12-17 20:27:34.516 | INFO     | sdgx.models.ml.single_table.ctgan:_fit:369 - Epoch 4, Loss G:  2.0157, Loss D:  0.0814, Time:  4.4720


Fitting batches:  73%|#######2  | 8/11 [00:03<00:01,  2.45it/s]

2024-12-17 20:27:38.980 | INFO     | sdgx.models.ml.single_table.ctgan:_fit:369 - Epoch 5, Loss G:  1.7904, Loss D:  0.4252, Time:  4.4627


Fitting batches:  73%|#######2  | 8/11 [00:03<00:01,  2.52it/s]

2024-12-17 20:27:43.350 | INFO     | sdgx.models.ml.single_table.ctgan:_fit:369 - Epoch 6, Loss G:  2.3208, Loss D: -0.0949, Time:  4.3685


Fitting batches:  73%|#######2  | 8/11 [00:03<00:01,  2.50it/s]

2024-12-17 20:27:47.745 | INFO     | sdgx.models.ml.single_table.ctgan:_fit:369 - Epoch 7, Loss G:  2.0995, Loss D:  0.2305, Time:  4.3950


Fitting batches:  73%|#######2  | 8/11 [00:03<00:01,  2.50it/s]

2024-12-17 20:27:52.213 | INFO     | sdgx.models.ml.single_table.ctgan:_fit:369 - Epoch 8, Loss G:  1.9914, Loss D:  0.1677, Time:  4.4672
2024-12-17 20:27:52.214 | INFO     | sdgx.models.ml.single_table.ctgan:fit:226 - CTGAN training finished.
2024-12-17 20:27:52.216 | INFO     | sdgx.synthesizer:fit:328 - Model fit... Finished


# 4. Generate synthetic data

In [9]:
# Sample
real_data = data_loader.load_all()
sampled_data = synthesizer.sample(100)

print(sampled_data)

2024-12-17 20:27:52.236 | INFO     | sdgx.synthesizer:sample:352 - Sampling...


Sampling:   0%|          | 0/100 [00:00<?, ?it/s]

Sampling in batch:  54%|#####3    | 268/500 [00:03<00:02, 88.82it/s]

Sampling batches: 100%|##########| 1/1 [00:05<00:00,  5.90s/it]

2024-12-17 20:27:58.145 | INFO     | sdgx.models.ml.single_table.ctgan:_sample:433 - CTGAN Generated 500 raw samples.
2024-12-17 20:27:58.196 | INFO     | sdgx.data_processors.transformers.specific_combination:reverse_convert:117 - Reverse converting data using SpecificCombinationTransformer(No specified)... Finished (No action).
2024-12-17 20:27:58.197 | INFO     | sdgx.data_processors.transformers.fixed_combination:reverse_convert:230 - Reverse converting data using FixedCombinationTransformer...
2024-12-17 20:28:12.552 | INFO     | sdgx.data_processors.transformers.fixed_combination:reverse_convert:256 - Reverse converting data using FixedCombinationTransformer... Finished.
2024-12-17 20:28:12.552 | INFO     | sdgx.data_processors.transformers.nan:

    loan_id  user_id  total_loan  year_of_loan  interest  monthly_payment  \
0    488535     1050       10500             3     14.98           255.29   
1    488535     1050       10500             3     14.98           255.29   
2    488535     1050       10500             5      7.49           255.29   
3    488535     1050       10500             3     14.98           255.29   
4    488535     1050       10500             3     14.98           255.29   
..      ...      ...         ...           ...       ...              ...   
95   488535     1050       10500             3      9.99           255.29   
96   636495      290        4000             3      9.99           128.94   
97   636495      290        4000             3      8.46           128.94   
98   488535     1050       10500             3      8.46           255.29   
99   636495      290        4000             3      7.26           128.94   

   class sub_class work_type employer_type  ... earlies_credit_mon title  \

In [10]:
sampled_data

# real_data

Unnamed: 0,loan_id,user_id,total_loan,year_of_loan,interest,monthly_payment,class,sub_class,work_type,employer_type,...,earlies_credit_mon,title,policy_code,f0,f1,f2,f3,f4,f5,is_default
0,488535,1050,10500,3,14.98,255.29,A,G2,公务员,政府机构,...,Dec-2003,0,1.0,10,0,24,13,37,15,1
1,488535,1050,10500,3,14.98,255.29,D,B2,职员,政府机构,...,Nov-2007,0,1.0,10,0,24,13,37,15,1
2,488535,1050,10500,5,7.49,255.29,A,C3,工人,世界五百强,...,Jan-1996,0,1.0,13,0,24,13,38,15,1
3,488535,1050,10500,3,14.98,255.29,B,B5,其他,普通企业,...,Jul-2002,0,1.0,10,0,24,13,37,15,1
4,488535,1050,10500,3,14.98,255.29,A,B5,其他,政府机构,...,May-1989,0,1.0,10,0,24,13,37,15,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,488535,1050,10500,3,9.99,255.29,B,C4,公务员,普通企业,...,Dec-1972,0,1.0,17,0,24,13,41,15,1
96,636495,290,4000,3,9.99,128.94,C,C3,其他,普通企业,...,Feb-2003,10,1.0,17,0,47,7,41,13,1
97,636495,290,4000,3,8.46,128.94,B,C3,其他,政府机构,...,Jun-2008,10,1.0,16,0,47,7,44,13,1
98,488535,1050,10500,3,8.46,255.29,B,C2,职员,幼教与中小学校,...,Apr-2001,0,1.0,16,0,24,13,44,15,1


# 5. Data quality assessment on synthetic data

When evaluating synthetic data, comparing the mean and variance of each feature between the real data and the synthetic data is an important first check:

1. The mean and variance of the real data summarize how each feature is distributed in the real world. If the synthetic data reproduces them closely for every feature, that supports its accuracy and credibility.

2. The mean describes the central tendency of a feature and the variance describes its dispersion. Comparing both shows whether the synthetic data is centered in the right place and also spreads out to the same degree as the real data.
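The comparison described above can be sketched numerically in a few lines. This is an illustrative stand-alone example using only the standard library, not what `table_evaluator` computes internally; the `compare_mean_std` helper and the toy values are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Dict, List


def compare_mean_std(
    real: Dict[str, List[float]], synthetic: Dict[str, List[float]]
) -> Dict[str, Dict[str, float]]:
    """Return the absolute gap in mean and standard deviation for each column."""
    report = {}
    for column, real_values in real.items():
        synthetic_values = synthetic[column]
        report[column] = {
            "mean_diff": abs(fmean(real_values) - fmean(synthetic_values)),
            "std_diff": abs(pstdev(real_values) - pstdev(synthetic_values)),
        }
    return report


# Hypothetical toy data: same center, but the synthetic values are less spread out.
real_toy = {"total_loan": [1000.0, 2000.0, 3000.0]}
synthetic_toy = {"total_loan": [1500.0, 2000.0, 2500.0]}

report = compare_mean_std(real_toy, synthetic_toy)
print(report)
```

In this toy case the means match exactly while the standard deviations differ, which is the situation the second point warns about: a matching center alone does not show that the dispersion is reproduced.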

In [None]:

import matplotlib.pyplot as plt

# Use a font with CJK glyphs so Chinese labels render in plots (assumes the font is installed)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

# use table_evaluator for evaluation
from table_evaluator import TableEvaluator

# Compare only the non-discrete columns, in the same order for both tables
non_discrete_columns = [
    col for col in real_data.columns if col not in loan_metadata.discrete_columns
]
table_evaluator = TableEvaluator(
    real_data[non_discrete_columns],
    sampled_data[non_discrete_columns],
)
table_evaluator.plot_mean_std()

The dataset contains 42 columns. We select some important features for **statistical distribution assessment**.

Analyzing these features gives a more comprehensive picture of the borrower's professional status, work experience, industry background, and repayment behavior, so that loan risk and credit reliability can be assessed more accurately:

1. work_type (work type): Indicates the borrower’s work type, such as clerk, worker, other, etc. This characteristic is important in understanding a borrower's professional identity and stability, as different job types may have an impact on a borrower's ability to repay.

2. work_year (working years): Indicates the borrower’s working years in the current workplace. This feature can reflect the borrower’s work experience and stability. Borrowers who have worked for the same unit for a long time may have greater repayment ability and credit reliability.

3. industry (industry): Indicates the industry in which the borrower works, such as mining, information transmission, finance, etc. Understanding the borrower's industry can help us evaluate the borrower's career stability and industry prospects, and thus determine his or her repayment ability.

4. monthly_payment (monthly payment amount): Indicates the loan amount that the borrower needs to repay every month. This feature is an important indicator for evaluating the borrower's repayment ability. A higher monthly payment may mean that the borrower has heavier debts and requires higher repayment ability.

5. post_code (postal code): Indicates the postal code of the borrower's location. Postal codes can provide geographic location information about the borrower's location, which helps us understand the economic conditions and risk profile of the borrower's location.

6. recircle_b (revolving line usage rate): Indicates the proportion of the total revolving line that the borrower has used. A revolving line is a credit line, such as a credit card, that can be drawn on repeatedly. The borrower's utilization rate reflects his or her credit usage and repayment habits, which is important for assessing the borrower's credit status and repayment ability.
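For categorical features such as work_type or industry, one simple numeric complement to visual inspection is the total variation distance between the category frequencies of the real and synthetic columns. The sketch below is an illustration in plain Python and is not part of SDG or `table_evaluator`; the helper names and the toy values are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def category_frequencies(values: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Return the relative frequency of each category in `values`."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return {category: count / len(values) for category, count in counts.items()}


def total_variation_distance(
    real: Sequence[Hashable], synthetic: Sequence[Hashable]
) -> float:
    """Half the sum of absolute frequency gaps: 0.0 means identical, 1.0 means disjoint."""
    p = category_frequencies(real)
    q = category_frequencies(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical toy columns; the values are illustrative only.
real_work_type = ["clerk", "clerk", "worker", "other"]
synthetic_work_type = ["clerk", "worker", "worker", "other"]

tvd = total_variation_distance(real_work_type, synthetic_work_type)
print(f"total variation distance: {tvd:.2f}")
```

A value near 0 suggests that the synthetic column reproduces the real category proportions, while a value near 1 suggests that the two columns share almost no probability mass.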

In [None]:
target_cols = ['work_type', 'work_year', 'industry', 'monthly_payment', 'post_code', 'recircle_b']

table_evaluator = TableEvaluator(
    real_data.loc[:, target_cols],
    sampled_data.loc[:, target_cols],
    cat_cols=['work_year', 'work_type', 'industry'],
)
table_evaluator.plot_distributions(nr_cols=3)