# Generate high-quality synthetic data with Dedomena AI

### Install Nucleus Edge library

Nucleus trains synthesizers that can later generate unlimited production-like copies of synthetic data. Synthesizers can be created on Dedomena's platform (Cloud) or, through the Nucleus Edge component, in the data source environment or in other on-premise and private cloud environments.

In [None]:
!pip install /content/nucleus-edge-3.2.1-python-310-x86_64-linux-gnu.tar.gz
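After installing, it can be worth confirming that the package is importable before starting a long training run. The sketch below assumes the archive installs a top-level module named `nucleus` (the name used by the imports later in this notebook); `is_installed` is a hypothetical helper, not part of Nucleus Edge.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# 'nucleus' is the module name the examples in this notebook import from.
print("nucleus importable:", is_installed("nucleus"))
```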

# Synthetic data algorithm examples

Dedomena offers 4 synthesization algorithms:

**Generic:** Learns from structured tabular data from any industry where recurrent patterns or other complex date-based patterns are not the main characteristic that the synthetic data must preserve. This algorithm also supports fine-tuning pre-trained synthesizers.

**Transactional:** Learns from event- or transaction-based data, that is, irregular time series where the interval between events or transactions is not constant. It was designed for transactional datasets, so certain essential columns are required for it to work correctly. It offers extra parameters specific to bank transactional data, a special type of transactional data, which give superior performance when generating bank transactions with text descriptions, among other capabilities.

**Timeseries:** As the name suggests, it learns from data that follows a time series structure with equally spaced timestamps. Variables such as the user/product id and the date of the series are mandatory to guarantee excellent global and entity-type results. It’s flexible enough to learn monthly, weekly, daily or hourly time series. Other variables related to the event/transaction can be learned too.
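The difference between Transactional (irregular intervals) and Timeseries (equally spaced timestamps) can be checked on the data before choosing an algorithm. The following is a minimal sketch using only the standard library; `is_evenly_spaced` is a hypothetical helper and not part of Nucleus Edge.

```python
from datetime import datetime, timedelta


def is_evenly_spaced(timestamps, step):
    """Return True if consecutive sorted timestamps are exactly `step` apart."""
    ordered = sorted(timestamps)
    return all(later - earlier == step for earlier, later in zip(ordered, ordered[1:]))


daily = [datetime(2024, 1, day) for day in (1, 2, 3, 4)]
irregular = [datetime(2024, 1, day) for day in (1, 2, 5, 9)]

print(is_evenly_spaced(daily, timedelta(days=1)))      # True
print(is_evenly_spaced(irregular, timedelta(days=1)))  # False
```

A fixed `timedelta` suits hourly, daily or weekly series. Monthly series have months of different lengths, so they would need a calendar-aware comparison instead.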

**Relational:** Designed to synthesize multi-table datasets. The tables must meet the following conditions:

- All tables should be connected in some way. Disconnected tables can be synthesized separately.
- There should be no missing references (also known as orphan rows). If table A references table B, every reference must be found in table B; otherwise, the row is removed and not generated. It's okay if a parent row doesn't have children.
- There cannot be cyclic references. A table cannot reference itself, and if table A references table B, table B cannot reference table A.
- Every foreign key must be a primary key of the table it references.
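The conditions above can be checked before training. The sketch below assumes a configuration shaped like the `datasets_config` dictionary in the Relational example (a `primary_key` per table and an optional `foreign_key` mapping from column to parent table). `validate_schema` and `orphan_values` are hypothetical helpers, not part of Nucleus Edge.

```python
def validate_schema(config):
    """Return a list of human-readable problems found in a multi-table config."""
    problems = []
    parents = {t: set(spec.get("foreign_key", {}).values()) for t, spec in config.items()}

    # Every referenced table must exist and declare a primary key.
    for table, refs in parents.items():
        for ref in sorted(refs):
            if ref not in config:
                problems.append(f"{table}: references unknown table '{ref}'")
            elif not config[ref].get("primary_key"):
                problems.append(f"{table}: parent '{ref}' has no primary key")

    # Cycle detection with depth-first search (a self-reference counts as a cycle).
    state = {}  # table -> "visiting" or "done"

    def visit(table):
        if state.get(table) == "done":
            return False
        if state.get(table) == "visiting":
            return True
        state[table] = "visiting"
        cyclic = any(visit(p) for p in parents.get(table, ()) if p in config)
        state[table] = "done"
        return cyclic

    if any(visit(t) for t in config):
        problems.append("cyclic reference detected")

    # Connectivity: treat references as undirected edges and flood-fill.
    neighbours = {t: set() for t in config}
    for table, refs in parents.items():
        for ref in refs:
            if ref in config:
                neighbours[table].add(ref)
                neighbours[ref].add(table)
    if config:
        start = next(iter(config))
        seen, stack = {start}, [start]
        while stack:
            for nxt in neighbours[stack.pop()] - seen:
                seen.add(nxt)
                stack.append(nxt)
        for table in sorted(set(config) - seen):
            problems.append(f"{table}: not connected to '{start}'")
    return problems


def orphan_values(child_keys, parent_keys):
    """Return child foreign-key values that have no matching parent key."""
    return sorted(set(child_keys) - set(parent_keys))


example = {
    "account": {"primary_key": "account_id"},
    "loan": {"primary_key": "loan_id", "foreign_key": {"account_id": "account"}},
    "notes": {"primary_key": "note_id"},  # deliberately disconnected
}
print(validate_schema(example))             # ["notes: not connected to 'account'"]
print(orphan_values([1, 2, 7], [1, 2, 3]))  # [7]
```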

Below you will find examples of how to train each type of algorithm using Nucleus Edge.

## Generic

In [None]:
from nucleus import synthesizer

float_columns = ["MonthlyCharges","TotalCharges"]
date_columns=[]
integer_columns=['MonthAge']
categorical_columns=['Gender', 'MaritalStatus', 'FixedLine', 'InternetService','OnlineSecurity', 'OnlineBackup', 
                     'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 
                     'PaymentMethod', 'MonthlyCharges', 'Churn']

token = '1234' # Token provided by Dedomena in the Nucleus Edge section. It is only available with a subscription or a contracted project.

synthesizer(data_dir='data/telco-customer-churn.csv', # https://www.kaggle.com/datasets/blastchar/telco-customer-churn
            token=token,
            algorithm='generic',
            batch_size=128,
            epochs=150,
            synthesizer_name="Dedomena AI Generic Algorithm Example",
            synthesizer_description="Synthesizer trained on Kaggle's telecom churn dataset.",
            categorical_columns=categorical_columns,
            date_columns=date_columns,
            integer_columns=integer_columns,
            boolean_columns=[],
            float_columns=float_columns,
            output_dir='my_synthesizers/generic_example',
            transform_descriptions=False,
            cuda=False,
            constraints = [],
            target='Churn',
            sensitive={},
            amplify="quality"
           )



INFO:root:Initializing generic algorithm



    ███╗   ██╗██╗   ██╗ ██████╗██╗     ███████╗██╗   ██╗███████╗    ███████╗██████╗  ██████╗ ███████╗
    ████╗  ██║██║   ██║██╔════╝██║     ██╔════╝██║   ██║██╔════╝    ██╔════╝██╔══██╗██╔════╝ ██╔════╝
    ██╔██╗ ██║██║   ██║██║     ██║     █████╗  ██║   ██║███████╗    █████╗  ██║  ██║██║  ███╗█████╗  
    ██║╚██╗██║██║   ██║██║     ██║     ██╔══╝  ██║   ██║╚════██║    ██╔══╝  ██║  ██║██║   ██║██╔══╝  
    ██║ ╚████║╚██████╔╝╚██████╗███████╗███████╗╚██████╔╝███████║    ███████╗██████╔╝╚██████╔╝███████╗
    ╚═╝  ╚═══╝ ╚═════╝  ╚═════╝╚══════╝╚══════╝ ╚═════╝ ╚══════╝    ╚══════╝╚═════╝  ╚═════╝ ╚══════╝  Version: 3.2.1                                                  
                    


INFO:root:Final data structure:
INFO:root:- Categorical columns: ['MonthlyCharges', 'MaritalStatus', 'Contract', 'Gender', 'PaymentMethod', 'InternetService']
INFO:root:- Boolean columns: ['FixedLine', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'Churn']
INFO:root:- Date columns: []
INFO:root:- Integer columns: ['MonthAge']
INFO:root:- Float columns: ['TotalCharges']
INFO:root:Starting data transformation
INFO:root:* Missing values imputed
INFO:root:Starting training



Starting training
2024-10-14 13:52:30.241547


  0%|          | 0/150 [00:00<?, ?it/s]INFO:root:Epoch 1 - 2024-10-14 13:52:31.982621
  1%|          | 1/150 [00:00<00:21,  6.86it/s]INFO:root:Epoch 2 - 2024-10-14 13:52:32.092426
  1%|▏         | 2/150 [00:00<00:18,  8.06it/s]INFO:root:Epoch 3 - 2024-10-14 13:52:32.188588
INFO:root:Epoch 4 - 2024-10-14 13:52:32.277242
  3%|▎         | 4/150 [00:00<00:15,  9.62it/s]INFO:root:Epoch 5 - 2024-10-14 13:52:32.378399
  3%|▎         | 5/150 [00:00<00:14,  9.68it/s]INFO:root:Epoch 6 - 2024-10-14 13:52:32.469938
INFO:root:Epoch 7 - 2024-10-14 13:52:32.568650
  5%|▍         | 7/150 [00:00<00:14, 10.07it/s]INFO:root:Epoch 8 - 2024-10-14 13:52:32.669655
  5%|▌         | 8/150 [00:00<00:14, 10.03it/s]INFO:root:Epoch 9 - 2024-10-14 13:52:32.764918
INFO:root:Epoch 10 - 2024-10-14 13:52:32.911242
  7%|▋         | 10/150 [00:01<00:15,  9.20it/s]INFO:root:Epoch 11 - 2024-10-14 13:52:33.053208
  7%|▋         | 11/150 [00:01<00:16,  8.60it/s]INFO:root:Epoch 12 - 2024-10-14 13:52:33.193716
  8%|▊         |


Synthesizer training completed

Starting model evaluation


100%|██████████| 2/2 [00:00<00:00, 14.62it/s]
INFO:root:Starting model evaluation
INFO:root:Privacy score: 97.57
INFO:root:                                   Metrics    Values
0     Distance to the Closest Record (DCR)  0.215900
1                  Exact Match Score (EMS)  0.000000
2  Nearest Neighbour Distance Ratio (NNDR)  0.919684
3         Attribute Inference Attack (AIA)  0.988300





INFO:root:Quality score: 99.14
INFO:root:                                  Metrics    Values
0            Mean Correlation Score (MCS)  0.003103
1             Cramer's V MSE Score (CVMS)  0.000029
2            MSE Correlation Score (MSCS)  0.073520
3  Jensen-Shannon Divergence Score (JSDC)  0.007345





INFO:root:Utility score: 90.2
INFO:root:             TRTR    TRTS    TSTR    TSTS
Metrics                                  
F1         0.6890  0.8049  0.8230  0.6771
Recall     0.6871  0.8006  0.8249  0.6724
Precision  0.6910  0.8096  0.8211  0.6832
Accuracy   0.7627  0.8452  0.8627  0.7475





INFO:root:Saving synthesizer and metadata



Model evaluation completed

Saving synthesizer and metadata


INFO:root:All done!


Successfully serialized model and metadata

All done!
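In the Generic example, `'MonthlyCharges'` is listed in both `float_columns` and `categorical_columns`, and the training log reports it as categorical. A small pre-flight check can surface such overlaps before training. This is a sketch only; `find_overlaps` is a hypothetical helper, not part of Nucleus Edge, and the column lists are shortened for illustration.

```python
from itertools import combinations


def find_overlaps(column_groups):
    """Map each pair of group names to the columns they share (if any)."""
    overlaps = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(column_groups.items(), 2):
        shared = sorted(set(cols_a) & set(cols_b))
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


groups = {
    "float": ["MonthlyCharges", "TotalCharges"],
    "integer": ["MonthAge"],
    "categorical": ["Gender", "Contract", "MonthlyCharges", "Churn"],
}
print(find_overlaps(groups))  # {('float', 'categorical'): ['MonthlyCharges']}
```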


## Transactional

In [None]:
from nucleus import synthesizer

float_columns = ['amount', 'balance']
date_columns=['txn_date']
integer_columns=[]
categorical_columns=['bank_id', 'descriptions', 'mcc_code', 'cat_id', 'account_id']

token = '1234' # Token provided by Dedomena in the Nucleus Edge section. It is only available with a subscription or a contracted project.

synthesizer(data_dir='data/banking_data.csv',
            token=token,
            algorithm='transactional',
            batch_size=512,
            epochs=150,
            synthesizer_name='dedomena',
            synthesizer_description='Generate transactional banking data',
            categorical_columns=categorical_columns,
            date_columns=date_columns,
            integer_columns=integer_columns,
            boolean_columns=[],
            float_columns=float_columns,
            output_dir='output',
            transform_descriptions='level2',
            amplify='quality',
            cuda=True,
            constraints = ['descriptions<->mcc_code<->cat_id', 'account_id<->bank_id'],
            target='cat_id',
            sensitive={'cat_id':'name'},
            columns_mapping={'concept':'descriptions', 'user_id':'account_id'}
           )
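Because the Transactional call refers to column names in several places (`constraints`, `columns_mapping`, `target`), a quick consistency check can catch misspelled names early. The sketch below rests on two assumptions inferred from the example rather than from documentation: that `<->` separates column names inside a constraint string, and that the values of `columns_mapping` are dataset columns. `undeclared_columns` is a hypothetical helper, not part of Nucleus Edge.

```python
def undeclared_columns(declared, constraints=(), columns_mapping=None, target=None):
    """Return referenced column names that are missing from `declared`."""
    referenced = set()
    for constraint in constraints:
        # Assumption: '<->' separates the column names in a constraint string.
        referenced.update(part.strip() for part in constraint.split("<->"))
    # Assumption: mapping values are the dataset's own column names.
    referenced.update((columns_mapping or {}).values())
    if target is not None:
        referenced.add(target)
    return sorted(referenced - set(declared))


declared = ["amount", "balance", "txn_date",
            "bank_id", "descriptions", "mcc_code", "cat_id", "account_id"]
missing = undeclared_columns(
    declared,
    constraints=["descriptions<->mcc_code<->cat_id", "account_id<->bank_id"],
    columns_mapping={"concept": "descriptions", "user_id": "account_id"},
    target="cat_id",
)
print(missing)  # []
```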

## Timeseries

In [None]:
from nucleus import synthesizer

float_columns=['amount']
date_columns=['txn_date']
categorical_columns=['cat_id', 'user_id']

token = '1234' # Token provided by Dedomena in the Nucleus Edge section. It is only available with a subscription or a contracted project.

synthesizer(data_dir='train_series.parquet',
            token=token,
            algorithm='timeseries',
            batch_size=128,
            epochs=100,
            categorical_columns=categorical_columns,
            date_columns=date_columns,
            integer_columns=[],
            boolean_columns=[],
            float_columns=float_columns,
            cuda=False,
            time_step='MS',
            output_dir='my_synthesizers/timeseries_example',
            )

## Relational

In [None]:
from nucleus import synthesizer

datasets_config = {
                    'loan': {
                               'data_dir': 'loan.csv',
                               'data_format':'CSV', # Optional
                               'categorical_columns': ['loan_id', 'account_id', 'status'],
                               'integer_columns': ['amount', 'duration'],
                               'date_columns': {'date':'%y%m%d'},
                               'float_columns': ['payments'],
                               'foreign_key': {'account_id':'account'},
                               'primary_key':'loan_id',
                             },
                    'order': {
                               'data_dir': 'order.csv',
                               'categorical_columns': ['order_id', 'account_id', 'bank_to', 'account_to', 'k_symbol'],
                               'float_columns': ['amount'],
                               'foreign_key': {'account_id':'account'},
                               'primary_key':'order_id',
                              },
                    'transactions': {
                                       'algorithm': 'transactional',
                                       'data_dir': 'trans_desc_1k.csv',
                                       'categorical_columns': ['type', 'operation', 'k_symbol', 'bank', 'account', 'title'],
                                       'date_columns': {'date':'%y%m%d'},
                                       'float_columns': ['amount', 'balance'],
                                       'foreign_key': {'account_id':'account'},
                                       'primary_key':'trans_id',
                                       'constraints':[
                                                      'type<->operation',
                                                      'bank<->account'
                                                     ],
                                       'columns_mapping': {'user_id': 'account_id',
                                                           'txn_date': 'date',
                                                           'concept': 'title'},
                                       'transform_descriptions': 'level2',
                                     },
                    'account': {
                                   'data_dir': 'account.csv',
                                   'categorical_columns': ['account_id', 'district_id', 'frequency'],
                                   'date_columns': {'date':'%y%m%d'},
                                   'foreign_key': {'district_id':'demograph'},
                                   'primary_key':'account_id',
                                 },
                    'cards': {
                                   'data_dir': 'card.csv',
                                   'categorical_columns': ['card_id', 'disp_id', 'type'],
                                   'date_columns': {'issued':'%y%m%d'},
                                   'foreign_key': {'disp_id':'disposition'},
                                   'primary_key':'card_id',
                              },
                    'disposition': {
                                   'data_dir': 'disp.csv',
                                   'categorical_columns': ['disp_id', 'client_id', 'account_id', 'type'],
                                   'foreign_key': {'client_id':'clients',
                                                   'account_id':'account'},
                                   'primary_key':'disp_id',
                                   'constraints':[
                                                  'account_id<->type',
                                                  ],
                                   },
                    'demograph': {
                                   'data_dir': 'district.csv',
                                   'categorical_columns': ['district_id', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A11',
                                                           'A14','A15','A16'],
                                   'float_columns': ['A10', 'A12', 'A13'],
                                   'primary_key':'district_id',
                                 },
                    'clients': {
                                   'data_dir': 'client.csv',
                                   'categorical_columns': ['client_id', 'birth_number', 'district_id'],
                                   'foreign_key': {'district_id':'demograph'},
                                   'primary_key':'client_id',
                                },
                }

token = '1234' # Token provided by Dedomena in the Nucleus Edge section. It is only available with a subscription or a contracted project.

synthesizer(token=token,
            algorithm='relational',
            batch_size=128,
            epochs=150,
            datasets_config=datasets_config,
            output_dir='loan_data/synthesizers')
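In the relational configuration, `date_columns` maps each date column to a format string such as `'%y%m%d'`. These look like standard Python `strptime` directives. Assuming Nucleus Edge interprets them that way, the sketch below shows how a value would be parsed; the sample value `'930705'` is illustrative and not taken from the dataset.

```python
from datetime import datetime


def parse_compact_date(value, fmt="%y%m%d"):
    """Parse a date string using a strptime-style format and return a date."""
    return datetime.strptime(value, fmt).date()


# '%y%m%d' = two-digit year, month and day with no separators.
print(parse_compact_date("930705"))  # 1993-07-05
```

With `%y`, Python maps two-digit years 69 to 99 onto 1969 to 1999 and 00 to 68 onto 2000 to 2068, which matters for older records.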

Once training is complete and the synthesizer has been created, users need to upload the encrypted file to Dedomena's platform through the SYNTHESIZERS section. The ADD SYNTHESIZER button lets users upload the encrypted file to register the synthesizer with its associated information, metrics and report, and then generate synthetic data.

Dedomena AI provides a REST API for consuming synthetic data from user synthesizers, enabling a variety of integrations and powering real-time use cases. Each call can generate 5000 synthetic rows. The API only works with synthesizers trained with the latest available version of Nucleus Edge.
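Given the per-call figure of 5000 rows, larger datasets need several calls. The sketch below only plans the batch sizes and makes no request: the API's endpoints and parameters are not described here, so none are assumed. It does assume that a call may request fewer than 5000 rows; `plan_calls` is a hypothetical helper.

```python
ROWS_PER_CALL = 5000  # per-call figure stated in this notebook


def plan_calls(total_rows, rows_per_call=ROWS_PER_CALL):
    """Return the row count to request in each call to reach `total_rows`."""
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    full_calls, remainder = divmod(total_rows, rows_per_call)
    return [rows_per_call] * full_calls + ([remainder] if remainder else [])


plan = plan_calls(12_000)
print(len(plan), plan)  # 3 [5000, 5000, 2000]
```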

Work with confidence: data quality and privacy are assured.