## Run Code to Test That Components Are Working

In [1]:
import pandas as pd

# Load the train and control datasets
train_df = pd.read_csv('test_data/adults_train-test.csv')
control_df = pd.read_csv('test_data/adults_control-test.csv')

## Preprocessing - CTGAN

In [3]:
from data_synthesis import prep_metadata, prep_bin_data
# Detect metadata from the training data.
# The train and control sets are expected to share the same metadata.
metadata = prep_metadata(train_df)
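
Because the metadata is detected from the training data only, it can help to confirm that the control data really has the same columns and dtypes. The following is a minimal sketch using plain pandas; `same_schema` and the toy frames are hypothetical names for illustration and are not part of `data_synthesis`.

```python
import pandas as pd

def same_schema(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True when both frames have the same column names (in order) and dtypes."""
    return list(a.columns) == list(b.columns) and list(a.dtypes) == list(b.dtypes)

# Toy frames standing in for train_df and control_df (hypothetical values).
train_demo = pd.DataFrame({"age": [39, 50], "occupation": ["Sales", "Tech-support"]})
control_demo = pd.DataFrame({"age": [28, 61], "occupation": ["Sales", "Craft-repair"]})

print(same_schema(train_demo, control_demo))
```

In the notebook itself, the same check would be `same_schema(train_df, control_df)` before reusing `metadata` for the control set.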


## Data Synthesis - CTGAN

In [5]:
from data_synthesis import DataSynthesis
# Note: this import sometimes fails with a "missing OpenDP" error even though OpenDP is already installed.
# Workaround: run `pip install opendp` again each time this happens.

synthesizer = DataSynthesis(metadata)
approaches = synthesizer.get_approaches()
print(approaches)
print("Selecting Approach[0]")
params = synthesizer.get_default_params(approaches[0])
print(params)
params['sample_size'] = 75
params['epochs'] = 10
params['save_synthesizer'] = True
params['save_filepath'] = 'test_data/dataset_adults_train_ctgan_synthesizer.pkl'
print(params)
synth_df = synthesizer.synth_data(data=train_df, approach=approaches[0], parameters=params)
print("Synthesis completed. You can view the resultant data in Jupyter:Variables if you are on VS Code.")
synth_df.to_parquet('test_data/adults_syn_ctgan.parquet')

['ctgan', 'dpctgan']
Selecting Approach[0]
{'sample_size': 1000, 'enforce_rounding': False, 'epochs': 500, 'verbose': True, 'save_synthesizer': False, 'save_filepath': ''}
{'sample_size': 75, 'enforce_rounding': False, 'epochs': 10, 'verbose': True, 'save_synthesizer': True, 'save_filepath': 'test_data/dataset_adults_train_ctgan_synthesizer.pkl'}




Epoch 1, Loss G:  1.9354,Loss D: -0.0001
Epoch 2, Loss G:  1.9032,Loss D: -0.0105
Epoch 3, Loss G:  1.8343,Loss D: -0.0299
Epoch 4, Loss G:  1.9397,Loss D: -0.0870
Epoch 5, Loss G:  1.9355,Loss D: -0.1288
Epoch 6, Loss G:  1.9641,Loss D: -0.1193
Epoch 7, Loss G:  1.8743,Loss D: -0.1719
Epoch 8, Loss G:  1.9071,Loss D: -0.2038
Epoch 9, Loss G:  1.9138,Loss D: -0.2713
Epoch 10, Loss G:  1.8267,Loss D: -0.3127
Synthesis completed. You can view the resultant data in Jupyter:Variables if you are on VS Code.
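
The recurring "missing OpenDP" error mentioned in the cell comments may indicate that `pip` installed the package into a different environment than the one the notebook kernel runs in. That is only a guess, not something this notebook confirms. A minimal sketch for checking what the running kernel can actually import, using only the standard library (`module_available` is a hypothetical helper name):

```python
import importlib.util
import sys

def module_available(name: str) -> bool:
    """Return True if the running interpreter can locate the top-level module `name`."""
    return importlib.util.find_spec(name) is not None

# Show which interpreter the kernel is using and whether it can see opendp.
print("Kernel interpreter:", sys.executable)
print("opendp importable:", module_available("opendp"))
```

If the second line prints `False`, installing from inside the notebook with `%pip install opendp` targets the kernel's own environment.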




## Data Analysis - CTGAN

In [6]:
synthesizer.run_data_diagnosis(train_df, synth_df)
synthesizer.run_column_diagnosis(train_df, synth_df, 'age')
synthesizer.run_column_diagnosis(train_df, synth_df, 'occupation')

=== Quality Report ===
Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 15/15 [00:00<00:00, 345.41it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 105/105 [00:01<00:00, 59.13it/s]

Overall Quality Score: 54.12%

Properties:
- Column Shapes: 61.25%
- Column Pair Trends: 46.98%
=== Diagnostic Report ===
Generating report ...
(1/3) Evaluating Coverage: : 100%|██████████| 15/15 [00:00<00:00, 1139.05it/s]
(2/3) Evaluating Boundary: : 100%|██████████| 15/15 [00:00<00:00, 2292.80it/s]
(3/3) Evaluating Synthesis: : 100%|██████████| 1/1 [00:00<00:00,  1.14it/s]

Diagnostic Results:

SUCCESS:
✓ The synthetic data covers over 90% of the categories present in the real data
✓ The synthetic data covers over 90% of the numerical ranges present in the real data
✓ The synthetic data follows over 90% of the min/max boundaries set by the real data
✓ Over 90% of the synthetic rows are not copies of the real data


## Preprocessing - DP-CTGAN

In [7]:
from data_synthesis import prep_metadata, prep_bin_data
# DP-CTGAN requires continuous columns to be binned first

bin_size = 50
columns = [
    'age', 
    'fnlwgt', 
    'education_num', 
    'capital_gain', 
    'capital_loss', 
    'hr_per_week'
]
# Distribute each numerical column into 50 bins labeled 1 to 50
train_df = prep_bin_data(train_df, columns, bin_size)

# Do the same for the control data
control_df = prep_bin_data(control_df, columns, bin_size)

# Then detect metadata from the binned training data.
# The train and control sets are expected to share the same metadata.
metadata = prep_metadata(train_df)
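
For readers without access to `data_synthesis`, the binning step can be illustrated with plain pandas. This is a sketch under assumptions: it uses equal-width bins via `pd.cut`, which may differ from what `prep_bin_data` actually does, and `bin_columns` and the toy data are hypothetical.

```python
import pandas as pd

def bin_columns(df: pd.DataFrame, columns: list, bin_size: int) -> pd.DataFrame:
    """Replace each listed numeric column with equal-width bin labels 1..bin_size."""
    out = df.copy()
    labels = list(range(1, bin_size + 1))
    for col in columns:
        # pd.cut splits the observed value range into `bin_size` equal-width intervals.
        out[col] = pd.cut(out[col], bins=bin_size, labels=labels).astype(int)
    return out

# Toy data standing in for the adults dataset (hypothetical values).
demo = pd.DataFrame({"age": [17, 30, 45, 90], "occupation": ["a", "b", "c", "d"]})
binned = bin_columns(demo, ["age"], 50)
print(binned)
```

Note that applying this separately to two datasets yields different bin edges whenever their value ranges differ, which matters if the train and control bins need to be comparable.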

## Data Synthesis - DP-CTGAN

In [8]:
from data_synthesis import DataSynthesis
# Note: this import sometimes fails with a "missing OpenDP" error even though OpenDP is already installed.
# Workaround: run `pip install opendp` again each time this happens.

synthesizer = DataSynthesis(metadata)
approaches = synthesizer.get_approaches()
print(approaches)
print("Selecting Approach[1]")
params = synthesizer.get_default_params(approaches[1])
print(params)
params['sample_size'] = 75
params['epochs'] = 10
synth_df = synthesizer.synth_data(data=train_df, approach=approaches[1], parameters=params)
print("Synthesis completed. You can view the resultant data in Jupyter:Variables if you are on VS Code.")


['ctgan', 'dpctgan']
Selecting Approach[1]
{'sample_size': 1000, 'generator_decay': 1e-05, 'discriminator_decay': 0.001, 'batch_size': 64, 'epochs': 100, 'epsilon': 32, 'verbose': True, 'preprocessor_eps': 1.0}



The sample rate will be defined from ``batch_size`` and ``sample_size``.The returned privacy budget will be incorrect.


Secure RNG turned off. This is perfectly fine for experimentation as it allows for much faster training performance, but remember to turn it on and retrain one last time before production with ``secure_rng`` turned on.


Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.



Epoch 1, Loss G: 0.6849, Loss D: 1.3893
epsilon is 0.050758636885846496, alpha is 63.0
Epoch 2, Loss G: 0.6822, Loss D: 1.3929
epsilon is 0.41099203158650366, alpha is 22.0
Epoch 3, Loss G: 0.6799, Loss D: 1.3905
epsilon is 0.5895550915085035, alpha is 17.0
Epoch 4, Loss G: 0.6812, Loss D: 1.3916
epsilon is 0.7314390555226387, alpha is 15.0
Epoch 5, Loss G: 0.6759, Loss D: 1.3907
epsilon is 0.85295344754532, alpha is 13.0
Epoch 6, Loss G: 0.6770, Loss D: 1.3928
epsilon is 0.9621828864060221, alpha is 12.0
Epoch 7, Loss G: 0.6774, Loss D: 1.3887
epsilon is 1.0628032345112661, alpha is 10.9
Epoch 8, Loss G: 0.6793, Loss D: 1.4059
epsilon is 1.1555857820715685, alpha is 10.5
Epoch 9, Loss G: 0.6768, Loss D: 1.3892
epsilon is 1.2433065858958203, alpha is 10.0
Epoch 10, Loss G: 0.6774, Loss D: 1.4016
epsilon is 1.326490144829143, alpha is 9.5
Synthesis completed. You can view the resultant data in Jupyter:Variables if you are on VS Code.
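
The training log above reports the cumulative privacy spend after each epoch. A small sketch for extracting those values from captured log text follows; it assumes the `epsilon is ..., alpha is ...` line format shown above and that the `epsilon` entry in the default parameters is the total budget. `parse_epsilons` is a hypothetical helper.

```python
import re

# Excerpt copied from the training output above.
log_text = """\
Epoch 9, Loss G: 0.6768, Loss D: 1.3892
epsilon is 1.2433065858958203, alpha is 10.0
Epoch 10, Loss G: 0.6774, Loss D: 1.4016
epsilon is 1.326490144829143, alpha is 9.5
"""

EPSILON_LINE = re.compile(r"epsilon is ([0-9.]+), alpha is ([0-9.]+)")

def parse_epsilons(text: str) -> list:
    """Return the epsilon value from each 'epsilon is ...' line, in log order."""
    return [float(match.group(1)) for match in EPSILON_LINE.finditer(text)]

epsilons = parse_epsilons(log_text)
budget = 32  # the 'epsilon' value in the default parameters printed above
print(f"Final epsilon {epsilons[-1]:.3f} out of an assumed budget of {budget}")
```

After 10 epochs the reported spend is well below the configured value, so the run here stopped because of the epoch limit rather than the privacy budget, if that assumption about `epsilon` holds.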


## Data Analysis - DP-CTGAN

In [9]:
synthesizer.run_data_diagnosis(train_df, synth_df)
synthesizer.run_column_diagnosis(train_df, synth_df, 'age')
synthesizer.run_column_diagnosis(train_df, synth_df, 'occupation')

=== Quality Report ===
Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 15/15 [00:00<00:00, 981.87it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 105/105 [00:02<00:00, 51.51it/s]

Overall Quality Score: 29.66%

Properties:
- Column Shapes: 41.64%
- Column Pair Trends: 17.69%
=== Diagnostic Report ===
Generating report ...
(1/3) Evaluating Coverage: : 100%|██████████| 15/15 [00:00<00:00, 733.60it/s]
(2/3) Evaluating Boundary: : 100%|██████████| 15/15 [00:00<00:00, 6628.17it/s]
(3/3) Evaluating Synthesis: : 100%|██████████| 1/1 [00:00<00:00,  1.43it/s]

Diagnostic Results:

SUCCESS:
✓ Over 90% of the synthetic rows are not copies of the real data

! The synthetic data is missing more than 10% of the categories present in the real data
