<center>
  
# TABSYN: Tabular Data Synthesis with Diffusion Models

</center>

Two challenges arise when extending diffusion models to tabular data:
1. **Diverse data types:** a single table can contain columns of different types, including numerical, categorical, and text data.
2. **Varied distributions:** the distribution of values can vary widely from column to column within a single table.

**TabSyn** addresses these challenges by introducing a latent space in which the data from all columns are jointly represented. It then trains a diffusion model on these latent representations.
This approach allows TabSyn to:
1. Train a single diffusion model for all data types in the dataset (i.e. Generality).
2. Optimize the distribution of latent embeddings to facilitate training of the subsequent diffusion model, thus generating higher quality synthetic data (i.e. Quality).
3. Require far fewer reverse steps when sampling from the diffusion model, and thus synthesize data faster (i.e. Speed).

In this notebook, we review and implement the TabSyn model. The notebook is organized as follows:

1. [Imports and Setup]()


2. [Default Dataset]()
    
    
3. [TabSyn Algorithm]()
    
    3.1. [Load Config]()
    
    3.2. [Make Dataset]()
    
    3.3. [Instantiate Model]()
    
    3.4. [Train Model]()
        
    3.5. [Load Pretrained Model]()
    
    3.6. [Sample Data]()
    
    3.7. [Review Synthetic Data]()


# Imports and Setup

In this section, we import the libraries and modules required to set up the environment.
Most of the libraries needed to implement TabSyn are the same as those used for TabDDPM.
We also specify `NAME_URL_DICT_UCI`, `DATA_NAME`, `DATA_DIR` and other paths as in TabDDPM's implementation.


In [11]:
import os
import shutil
import src
import json
import pandas as pd
from pprint import pprint
import numpy as np
import torch
from torch.utils.data import DataLoader

from scripts.download_dataset import download_from_uci
from scripts.process_dataset import process_data

from src.data import preprocess, TabularDataset
from src.baselines.tabsyn.pipeline import TabSyn

from src.util import visualize_default

In [12]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:

os.getcwd()

'/fs01/home/ws_aabboud/diffusion_model_bootcamp/deloitte_team/single_table_synthesis'

In [14]:
# os.chdir('/fs01/home/ws_aabboud/diffusion_model_bootcamp/deloitte_team/single_table_synthesis/')

In [15]:

from scripts.process_dataset import process_data

In [16]:
import pandas as pd

df = pd.read_csv('data/raw_data/IBM_AML_FeatureEngineered_FS1_raw.csv')
df = df.dropna()
# Drop the first two columns (Pattern ID and Source)
df = df.iloc[:, 2:]
# Convert the 16 Destination columns to float
df.iloc[:, :16] = df.iloc[:, :16].astype(float)   # Destination Col
# Convert the 13 currency columns to 'Y'/'N' flags
df.iloc[:, 16:29] = df.iloc[:, 16:29].astype(bool).astype(str).replace({'True': 'Y', 'False': 'N'})
# Convert the last column (Is FanOut) to a 'Y'/'N' flag
df.iloc[:, -1] = df.iloc[:, -1].astype(bool).astype(str).replace({'True': 'Y', 'False': 'N'})
df.rename(columns={'Is FanOut': 'Is_FanOut'}, inplace=True)

# df=df.iloc[:, ~df.columns.isin(df.columns[16:29])]  # Exclude currency columns for test purposes
# df=df.iloc[:, ~df.columns.isin(df.columns[17:20])]  # Exclude erroneous columns for test purposes



In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6794 entries, 1 to 7393
Data columns (total 37 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Destination_1                           6794 non-null   float64
 1   Destination_2                           6794 non-null   float64
 2   Destination_3                           6794 non-null   float64
 3   Destination_4                           6794 non-null   float64
 4   Destination_5                           6794 non-null   float64
 5   Destination_6                           6794 non-null   float64
 6   Destination_7                           6794 non-null   float64
 7   Destination_8                           6794 non-null   float64
 8   Destination_9                           6794 non-null   float64
 9   Destination_10                          6794 non-null   float64
 10  Destination_11                          6794 non-null   floa

# Feature Selection

In [18]:

from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import logging

def select_important_columns(df, target_column, threshold, random_state=42):
    """
    Selects the most important columns for binary classification based on mutual information.
    Handles categorical features by encoding them and prints feature importance in sorted order.

    Parameters:
    df (pandas.DataFrame): The input DataFrame containing features and target variable.
    target_column (str): The name of the target variable column in the DataFrame.
    threshold (float): The minimum mutual information score for a feature to be considered important.
    random_state (int, optional): Random state for reproducibility. Defaults to 42.

    Returns:
    list: A list of column names that are deemed important for classification.
    """

    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(__name__)

    X = df.drop(columns=[target_column])
    y = df[target_column]

    unique_values = y.unique()
    if len(unique_values) != 2:
        logger.warning(f"Target column '{target_column}' is not binary. It has {len(unique_values)} unique values.")
        return []

    logger.info(f"Unique values in target column: {unique_values}")

    # Apply label encoding to the target variable
    le = LabelEncoder()
    y_encoded = le.fit_transform(y)

    # Preprocess features
    X_processed = X.copy()
    for column in X_processed.columns:
        if X_processed[column].dtype == 'object':
            # For categorical columns, use label encoding
            X_processed[column] = LabelEncoder().fit_transform(X_processed[column].astype(str))
        else:
            # For numeric columns, fill NaN values with the mean
            imputer = SimpleImputer(strategy='mean')
            X_processed[column] = imputer.fit_transform(X_processed[[column]])

    # Calculate mutual information scores
    mi_scores = mutual_info_classif(X_processed, y_encoded, random_state=random_state)

    # Create a dictionary of feature names and their mutual information scores
    feature_scores = dict(zip(X.columns, mi_scores))

    # Sort features by importance
    sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

    # Print feature importance in sorted order
    print("\nFeature Importance (sorted):")
    for feature, score in sorted_features:
        print(f"{feature}: {score:.4f}")

    important_features = [feature for feature, score in feature_scores.items() if score > threshold]

    logger.info(f"Number of columns before selection: {len(X.columns)}")
    logger.info(f"Number of columns after selection: {len(important_features)}")

    if not important_features:
        logger.warning("No features met the threshold criteria. Consider lowering the threshold.")

    return important_features





# Example usage:

important_cols = select_important_columns(df=df, target_column='Is_FanOut', threshold=0.05)
print("Important columns:", important_cols)


2024-10-01 14:38:35,709 - INFO - Unique values in target column: ['Y' 'N']
2024-10-01 14:38:38,466 - INFO - Number of columns before selection: 36
2024-10-01 14:38:38,468 - INFO - Number of columns after selection: 26



Feature Importance (sorted):
transaction_frequency_variance_in_days: 0.6891
Destination_1: 0.4032
avg_transaction_frequency_in_days: 0.3777
Destination_2: 0.3687
min_day_to_max_day_range: 0.3426
avg_transaction_value_in_cad: 0.3171
Destination_3: 0.2376
number_transactions_above_9k_cad: 0.1812
Destination_4: 0.1666
Destination_14: 0.1375
Destination_15: 0.1340
Destination_12: 0.1323
Destination_5: 0.1266
Destination_16: 0.1252
Destination_11: 0.1251
Destination_13: 0.1215
Destination_10: 0.1192
Destination_9: 0.1038
Euro: 0.1022
Destination_6: 0.1002
Destination_8: 0.0932
variance_from_10k_cad: 0.0925
Destination_7: 0.0893
US Dollar: 0.0882
transaction_value_variance_in_cad: 0.0673
Yuan: 0.0583
Bitcoin: 0.0169
Shekel: 0.0162
Yen: 0.0132
Rupee: 0.0082
UK Pound: 0.0071
Ruble: 0.0071
Canadian Dollar: 0.0000
Australian Dollar: 0.0000
Brazil Real: 0.0000
Swiss Franc: 0.0000
Important columns: ['Destination_1', 'Destination_2', 'Destination_3', 'Destination_4', 'Destination_5', 'Destination

In [19]:
# Select only important columns
df= df[important_cols + ['Is_FanOut']]

In [21]:
df.head()

| | Destination_1 | Destination_2 | Destination_3 | Destination_4 | Destination_5 | Destination_6 | Destination_7 | Destination_8 | Destination_9 | Destination_10 | ... | Euro | Yuan | number_transactions_above_9k_cad | avg_transaction_value_in_cad | transaction_value_variance_in_cad | variance_from_10k_cad | min_day_to_max_day_range | avg_transaction_frequency_in_days | transaction_frequency_variance_in_days | Is_FanOut |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 12748.0 | 11218.0 | 11161.0 | 6452.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Y | N | 3 | 10395.23 | 7449777.0 | 22974160.0 | 13 | 3.25 | 7.849218 | Y |
| 2 | 238562565.0 | 11577.0 | 8583.0 | 2453.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Y | Y | 2 | 59646300.0 | 1.422713e+16 | 5.690733e+16 | 12 | 3.0 | 7.849218 | Y |
| 3 | 23108.0 | 13930.0 | 7568.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Y | Y | 2 | 14868.91 | 61033910.0 | 193186700.0 | 3 | 1.0 | 7.849218 | Y |
| 4 | 8692.0 | 4644.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Y | N | 0 | 6668.545 | 8196547.0 | 30393730.0 | 10 | 5.0 | 7.849218 | Y |
| 5 | 26771.0 | 19889.0 | 5768.0 | 1954.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | Y | N | 2 | 13595.92 | 136665500.0 | 461719000.0 | 13 | 3.25 | 7.849218 | Y |


## Export Dataset to CSV

In [24]:
# Export Dataset
df.to_csv('data/raw_data/IBM_AML_FeatureEngineered_FS1/IBM_AML_FeatureEngineered_FS1.csv', index=False)

In [23]:
def get_column_info(df):
    column_info = {col: str(df[col].dtype).replace('float64', 'float').replace('int64', 'int').replace('object', 'str') for col in df.columns}
    return column_info

# Get column info
column_info = get_column_info(df)

# Print in desired format
print({"column_info": column_info})

{'column_info': {'Destination_1': 'float', 'Destination_2': 'float', 'Destination_3': 'float', 'Destination_4': 'float', 'Destination_5': 'float', 'Destination_6': 'float', 'Destination_7': 'float', 'Destination_8': 'float', 'Destination_9': 'float', 'Destination_10': 'float', 'Destination_11': 'float', 'Destination_12': 'float', 'Destination_13': 'float', 'Destination_14': 'float', 'Destination_15': 'float', 'Destination_16': 'float', 'US Dollar': 'str', 'Euro': 'str', 'Yuan': 'str', 'number_transactions_above_9k_cad': 'int', 'avg_transaction_value_in_cad': 'float', 'transaction_value_variance_in_cad': 'float', 'variance_from_10k_cad': 'float', 'min_day_to_max_day_range': 'int', 'avg_transaction_frequency_in_days': 'float', 'transaction_frequency_variance_in_days': 'float', 'Is_FanOut': 'str'}}


## Prepare the Dataset Config File

In [25]:
import pandas as pd
import json

def generate_json_config(df):
    # Get column names
    column_names = df.columns.tolist()
    
    # Identify the target column index (last column)
    target_col_idx = [len(column_names) - 1]
    
    # Initialize lists for numerical and categorical column indices
    num_col_idx = []
    cat_col_idx = []
    
    # Initialize dictionary for column info
    column_info = {}
    
    # Determine column types and indices
    for idx, col in enumerate(df.columns):
        col_type = df[col].dtype
        
        # Fill in column_info with appropriate type
        if pd.api.types.is_integer_dtype(col_type):
            column_info[col] = "int"
            if idx != target_col_idx[0]:  # Exclude target column from num_col_idx
                num_col_idx.append(idx)
        elif pd.api.types.is_float_dtype(col_type):
            column_info[col] = "float"
            if idx != target_col_idx[0]:
                num_col_idx.append(idx)
        elif pd.api.types.is_bool_dtype(col_type):
            column_info[col] = "bool"
            if idx != target_col_idx[0]:
                cat_col_idx.append(idx)
        else:
            column_info[col] = "str"
            if idx != target_col_idx[0]:  # Exclude target column from cat_col_idx
                cat_col_idx.append(idx)
    
    # Calculate train_num and test_num
    total_rows = len(df)
    train_num = int(total_rows * 1.0)
    test_num = total_rows - train_num
    
    # Construct JSON configuration object
    config = {
        "name": "IBM_AML_FeatureEngineered_FS1",
        "task_type": "binclass",
        "header": "infer",
        "column_names": column_names,
        "num_col_idx": num_col_idx,
        "cat_col_idx": cat_col_idx,
        "target_col_idx": target_col_idx,
        "file_type": "csv",
        "data_path": "data/raw_data/IBM_AML_FeatureEngineered_FS1/IBM_AML_FeatureEngineered_FS1.csv",
        "test_path":"data/test_data/IBM_AML_FeatureEngineered_FS1/test.csv", 
        "column_info": column_info,
        "train_num": train_num,
        "test_num": test_num
    }
    
    # Print JSON object
    print(json.dumps(config, indent=4))

In [26]:
generate_json_config(df)

{
    "name": "IBM_AML_FeatureEngineered_FS1",
    "task_type": "binclass",
    "header": "infer",
    "column_names": [
        "Destination_1",
        "Destination_2",
        "Destination_3",
        "Destination_4",
        "Destination_5",
        "Destination_6",
        "Destination_7",
        "Destination_8",
        "Destination_9",
        "Destination_10",
        "Destination_11",
        "Destination_12",
        "Destination_13",
        "Destination_14",
        "Destination_15",
        "Destination_16",
        "US Dollar",
        "Euro",
        "Yuan",
        "number_transactions_above_9k_cad",
        "avg_transaction_value_in_cad",
        "transaction_value_variance_in_cad",
        "variance_from_10k_cad",
        "min_day_to_max_day_range",
        "avg_transaction_frequency_in_days",
        "transaction_frequency_variance_in_days",
        "Is_FanOut"
    ],
    "num_col_idx": [
        0,
        1,
        2,
        3,
        4,
        5,
        6,
 

# AML Dataset

For a more detailed explanation of the steps in this section, please refer to TabDDPM's notebook.

In [27]:
DATA_DIR = "data/"
RAW_DATA_DIR = os.path.join(DATA_DIR, "raw_data")
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, "processed_data")
SYNTH_DATA_DIR = os.path.join(DATA_DIR, "synthetic_data")
DATA_NAME = "IBM_AML_FeatureEngineered_FS1"

MODEL_PATH = "models/tabsyn"
# process data
INFO_DIR = "data_info"


In [33]:
# import pandas as pd
# from sklearn.model_selection import train_test_split

# def split_and_save_dataset(df, train_size=0.9, random_state=42):
#     # Split the dataset
#     train_df, test_df = train_test_split(df, train_size=train_size, random_state=random_state)
    
#     # Save to CSV files
#     train_df.to_csv('data/raw_data/IBM_AML_FeatureEngineered_FS1/train.csv', index=False)
#     test_df.to_csv('data/test_data/IBM_AML_FeatureEngineered_FS1/test.csv', index=False)
    
#     # Print the shapes of the resulting datasets
#     print(f"Original dataset shape: {df.shape}")
#     print(f"Train dataset shape: {train_df.shape}")
#     print(f"Test dataset shape: {test_df.shape}")

# # Assuming you have a DataFrame called 'df'
# split_and_save_dataset(df)

Original dataset shape: (90, 37)
Train dataset shape: (81, 37)
Test dataset shape: (9, 37)


In [134]:
# NAME_URL_DICT_UCI = {
#     "adult": "https://archive.ics.uci.edu/static/public/2/adult.zip",
#     "default": "https://archive.ics.uci.edu/static/public/350/default+of+credit+card+clients.zip",
#     "magic": "https://archive.ics.uci.edu/static/public/159/magic+gamma+telescope.zip",
#     "shoppers": "https://archive.ics.uci.edu/static/public/468/online+shoppers+purchasing+intention+dataset.zip",
#     "beijing": "https://archive.ics.uci.edu/static/public/381/beijing+pm2+5+data.zip",
#     "news": "https://archive.ics.uci.edu/static/public/332/online+news+popularity.zip",
# }

# # For shared directory you can change it to "/projects/diffusion_bootcamp/data/tabular"
# DATA_DIR = "data/"
# RAW_DATA_DIR = os.path.join(DATA_DIR, "raw_data")
# PROCESSED_DATA_DIR = os.path.join(DATA_DIR, "processed_data")
# SYNTH_DATA_DIR = os.path.join(DATA_DIR, "synthetic_data")
# # DATA_NAME = "adult"

# MODEL_PATH = "models/tabsyn"

In [135]:
# DATA_NAME = "IBM_AML_FeatureEngineered_FS1"

In [28]:
# download data
# download_from_uci(DATA_NAME, RAW_DATA_DIR, NAME_URL_DICT_UCI)

# process data
INFO_DIR = "data_info"
process_data(DATA_NAME, INFO_DIR, DATA_DIR)

# review data
df = pd.read_csv(os.path.join(PROCESSED_DATA_DIR, DATA_NAME, "train.csv"))
# visualize_default(df).head(10)

XXXXXXXXXXXXXXXXXX(6794, 27)
num Col index:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 19, 20, 21, 22, 23, 24, 25], Cat col index: [16, 17, 18]
Processing and Saving IBM_AML_FeatureEngineered_FS1 Successfully!
Dataset Name: IBM_AML_FeatureEngineered_FS1
Total Size: 6794
Train Size: 6114
Test Size: 680
Number of Numerical Columns: 23
Number of Categorical Columns: 4


In [30]:
# review json file and its contents
with open(f"{PROCESSED_DATA_DIR}/{DATA_NAME}/info.json", "r") as file:
    data_info = json.load(file)
pprint(data_info)

{'cat_col_idx': [16, 17, 18],
 'column_info': {'0': {},
                 '1': {},
                 '10': {},
                 '11': {},
                 '12': {},
                 '13': {},
                 '14': {},
                 '15': {},
                 '16': {},
                 '17': {},
                 '18': {},
                 '19': {},
                 '2': {},
                 '20': {},
                 '21': {},
                 '22': {},
                 '23': {},
                 '24': {},
                 '25': {},
                 '26': {},
                 '3': {},
                 '4': {},
                 '5': {},
                 '6': {},
                 '7': {},
                 '8': {},
                 '9': {},
                 'categorizes': ['Y', 'N'],
                 'max': 7.849218007657018,
                 'min': 1.6115712240084643,
                 'type': 'categorical'},
 'column_names': ['Destination_1',
                  'Destination_2',
         

# TabSyn Algorithm

In this section, we describe the design of TabSyn and the main hyperparameters, loaded from the config, that affect the model's effectiveness.

**TabSyn** consists of two parts:
1. A *variational auto-encoder (VAE)* which learns a joint representation space for the given tabular data.
2. A *Diffusion model* which learns the distribution of data in the joint representation space.

The figure below shows a diagram of the TabSyn model.

<p align="center">
<img src="figures/tabsyn.jpg" width="1000"/>
</p>

**VAE**

The left side of the figure shows the VAE, which operates in the original data space. The VAE consists of an encoder and a decoder, together with the corresponding tokenizer and detokenizer.
Each row of the input table ($\pmb{x}$) is tokenized and then embedded by a transformer encoder. A second transformer decodes the embeddings, and a detokenizer reconstructs the row ($\pmb{\tilde{x}}$). The VAE is trained by minimizing the reconstruction loss between $\pmb{x}$ and $\pmb{\tilde{x}}$, together with a KL divergence term weighted by $\beta$ (see `loss_params` under Load Config).

After the VAE is fully trained, the entire dataset ($\pmb{x}$) is tokenized and embedded. The embedding of each row is flattened into a 1-dimensional vector $\pmb{z}$.
These 1-dimensional embeddings for all rows are stored on disk and are later used to train the diffusion model.
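
The flattening step can be illustrated with a small NumPy sketch. The sizes below (5 rows, 27 tokens per row, token dimension 4) and the random stand-in for the encoder output are assumptions chosen for illustration; they are not taken from the `TabSyn` implementation.

```python
import numpy as np

# Hypothetical sizes: 5 rows, 27 tokens per row, token dimension 4.
num_rows, num_tokens, d_token = 5, 27, 4

# Random stand-in for the encoder output; a trained encoder would produce these values.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(num_rows, num_tokens, d_token))

# Flatten each row's token embeddings into a single 1-dimensional latent vector z.
z = token_embeddings.reshape(num_rows, num_tokens * d_token)

print(z.shape)  # (5, 108)
```

Each row of `z` is what the diffusion model would treat as one data point under these assumptions.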

**Diffusion**

The right side of the figure shows the diffusion model, which operates in the latent representation space; in other words, it only *sees* the embeddings obtained by the VAE, not the original tabular data.
Like the VAE, the diffusion model can be divided into two parts: a forward process and a reverse process.

The forward process receives the embedded data points. A single data point is denoted by $\pmb{z_0}$ in the figure. Gaussian noise is added to the embeddings gradually over many small steps. The number of steps is denoted by $T$ in the figure. $T$ should be large enough that the distribution of embeddings at step $t=T$ is essentially a standard Gaussian distribution; in other words, the signal-to-noise ratio is practically zero.
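
As a rough illustration of how noise drowns out the signal, the sketch below adds Gaussian noise of increasing scale to a batch of made-up embeddings and reports the signal-to-noise ratio. It assumes noise is added directly as $\pmb{z_t} = \pmb{z_0} + \sigma \pmb{\epsilon}$; the noise levels, the array sizes, and the helper `add_noise` are hypothetical and not taken from the `TabSyn` code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent embeddings z_0: 10,000 rows with 8 dimensions each.
z0 = rng.normal(loc=2.0, scale=0.5, size=(10_000, 8))

def add_noise(z0, sigma, rng):
    """Return z_t = z_0 + sigma * eps, with eps drawn from N(0, I)."""
    eps = rng.standard_normal(z0.shape)
    return z0 + sigma * eps

# Signal-to-noise ratio (mean signal power over noise variance) per noise level.
signal_power = np.mean(z0**2)
snr_by_sigma = {}
for sigma in (0.1, 1.0, 10.0, 80.0):
    zt = add_noise(z0, sigma, rng)
    snr_by_sigma[sigma] = signal_power / sigma**2
    print(f"sigma={sigma:5.1f}  std(z_t)={zt.std():7.3f}  SNR={snr_by_sigma[sigma]:.5f}")
```

At the largest noise level the ratio is close to zero, which is the regime the last forward step is meant to reach.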

The reverse process, on the other hand, learns to *predict* an earlier-step embedding (e.g. $\pmb{z_{t-\Delta t}}$) from a later-step embedding (e.g. $\pmb{z_t}$) via a neural network.

After the diffusion model is fully trained, the reverse process can estimate the data distribution at step $t=0$ if it receives a standard Gaussian distribution at step $t=T$. New data points can be synthesized by sampling from this estimated distribution.
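
The reverse process can be sketched on a toy problem where the ideal denoiser is known in closed form. The example below is an assumption-laden illustration, not the `TabSyn` sampler: it assumes clean data distributed as $N(0, 1)$, a decreasing noise schedule of the kind used by EDM-style samplers (with hypothetical values `sigma_min=0.002`, `sigma_max=80`, `rho=7`), plain Euler steps, and starting noise scaled by the largest noise level. The function `toy_denoiser` stands in for the trained neural network.

```python
import numpy as np

def sigma_schedule(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels decreasing from sigma_max to sigma_min."""
    ramp = np.linspace(0.0, 1.0, num_steps)
    inv_rho_max = sigma_max ** (1.0 / rho)
    inv_rho_min = sigma_min ** (1.0 / rho)
    return (inv_rho_max + ramp * (inv_rho_min - inv_rho_max)) ** rho

def toy_denoiser(z, sigma, data_std=1.0):
    """Exact denoiser when clean data is N(0, data_std^2); stands in for the network."""
    return z * data_std**2 / (data_std**2 + sigma**2)

def sample(num_samples, dim, num_steps=100, seed=0):
    """Start from pure noise and take Euler steps toward lower noise levels."""
    rng = np.random.default_rng(seed)
    sigmas = sigma_schedule(num_steps)
    z = rng.standard_normal((num_samples, dim)) * sigmas[0]
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Direction pointing from the denoised estimate toward the noisy point.
        direction = (z - toy_denoiser(z, sigma_cur)) / sigma_cur
        z = z + (sigma_next - sigma_cur) * direction
    return z

sigmas = sigma_schedule(100)
samples = sample(20_000, 4)
print(f"std of generated samples: {samples.std():.3f}")
```

Because the toy data has unit standard deviation, the generated samples should end up with a standard deviation close to 1, up to discretization and sampling error.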


## Load Config

In this section, we will load the configuration file that contains the hyperparameters for the TabSyn model. 

In [31]:
config_path = os.path.join("src/baselines/tabsyn/configs", f"{DATA_NAME}.toml")
raw_config = src.load_config(config_path)

pprint(raw_config)

{'impute': {'N': 20,
            'SIGMA_MAX': 80,
            'SIGMA_MIN': 0.002,
            'S_churn': 1,
            'S_max': inf,
            'S_min': 0,
            'S_noise': 1,
            'num_steps': 30,
            'num_trials': 30,
            'rho': 7},
 'loss_params': {'lambd': 0.7, 'max_beta': 0.01, 'min_beta': 1e-05},
 'model_params': {'d_token': 4, 'factor': 32, 'n_head': 1, 'num_layers': 2},
 'task_type': 'binclass',
 'train': {'diffusion': {'batch_size': 4096,
                         'num_dataset_workers': 4,
                         'num_epochs': 9},
           'optim': {'diffusion': {'factor': 0.9,
                                   'lr': 0.001,
                                   'patience': 20,
                                   'weight_decay': 0},
                     'vae': {'factor': 0.95,
                             'lr': 0.001,
                             'patience': 10,
                             'weight_decay': 0}},
           'vae': {'batch_size': 4096

The configuration file is a TOML file that contains the following hyperparameters:

1. **model_params:** specifies the structure of the transformers (both encoder and decoder) in the VAE model, including the number of transformer layers, the number of self-attention heads, and the token dimension.

2. **transforms:** specifies the transformations and preprocessing of the data before tokenization, such as cleaning, normalization, and encoding.
    - For numerical features, we apply the Gaussian quantile transformation and replace NaN values with the mean of the corresponding column.
    - For categorical features, we use one-hot encoding. NaN values are left unchanged by default, but can optionally be replaced. Values that appear less often than a given minimum frequency in a column can optionally be dropped. An extra encoding step for categorical features can also be added during tokenization.

3. **train.vae:** specifies training parameters of the VAE, including batch size, number of epochs, and number of dataset workers.

4. **train.diffusion:** specifies the same training parameters as above for the diffusion model.

5. **train.optim.vae:** specifies the parameters of the *Adam* optimizer and the `ReduceLROnPlateau` learning rate scheduler used to train the VAE. Optimizer parameters include the initial learning rate and weight decay. LR scheduler parameters include `factor` and `patience`.

6. **train.optim.diffusion:** specifies the same parameters as above for the diffusion model.

7. **loss_params:** specifies parameters of the loss function used to train the VAE including `max_beta`, `min_beta` and `lambd`.

$\beta$ is the coefficient of the KL divergence term in the VAE loss,

$\mathcal{L}_{vae} = \mathcal{L}_{mse} + \mathcal{L}_{ce} + \beta \mathcal{L}_{kl}$.

The parameters `max_beta` and `min_beta` determine the range of $\beta$, and `lambd` ($\lambda$) is its decay factor. $\beta$ is initially set to `max_beta`. If the loss stops decreasing for a certain number of epochs (e.g. $10$ epochs), then at the end of each subsequent epoch (e.g. epoch $11$, $12$, etc.) $\beta$ is multiplied by `lambd`,
$\beta_{new} = \lambda \beta_{curr}$,
until it reaches `min_beta`.
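
The $\beta$ schedule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the `TabSyn` training loop: the function `beta_schedule`, the `patience` argument, the clamping at `min_beta`, and the exact epoch at which the first decay happens are all choices made for this example. The default values mirror the `loss_params` printed above.

```python
def beta_schedule(losses, max_beta=0.01, min_beta=1e-5, lambd=0.7, patience=10):
    """Return the beta used in each epoch, given the loss observed at each epoch."""
    beta = max_beta
    best_loss = float("inf")
    stalled_epochs = 0
    betas = []
    for loss in losses:
        betas.append(beta)  # beta in effect during this epoch
        if loss < best_loss:
            best_loss = loss
            stalled_epochs = 0
        else:
            stalled_epochs += 1
        # Once the loss has stalled long enough, decay beta at the end of every epoch.
        if stalled_epochs >= patience:
            beta = max(lambd * beta, min_beta)
    return betas

# Hypothetical loss curve: improves for 5 epochs, then plateaus for 15 epochs.
losses = [1.0 / (epoch + 1) for epoch in range(5)] + [0.2] * 15
betas = beta_schedule(losses)
for epoch, beta in enumerate(betas):
    print(f"epoch {epoch:2d}: beta = {beta:.6f}")
```

In this toy run $\beta$ stays at `max_beta` while the loss improves and during the first ten stalled epochs, then shrinks by a factor of `lambd` after every further stalled epoch.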


## Make Dataset

In this section, we pre-process the data and make a dataset object.

First, we determine the transformations needed for the dataset, such as normalization and cleaning, in `transforms`. Next, using the `preprocess` function, we load the data from disk into arrays that contain both training and test data (`X_num` and `X_cat`), along with the number of categories for each categorical feature (`categories`) and the number of numerical features (`d_numerical`).

We then separate the train and test data into different arrays and convert them to PyTorch tensors.
We create a dataset object (`TabularDataset`) with the train data. `TabularDataset` is a simple module that returns the tokens of a single row at a time. Each row constitutes a single data sample in TabSyn. Afterwards, we create a `DataLoader` for the train data using the `batch_size` and `num_workers` specified in the config.

In contrast, we keep the test data as tensors (`X_test_num` and `X_test_cat`). If a GPU is available, we move these tensors to the GPU so that the model can access them later on.
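
To give an intuition for the Gaussian quantile transformation applied to numerical features, here is a minimal rank-based sketch. It is an illustration only: the helper `gaussian_quantile_transform` is hypothetical, ties are broken by position, and the `preprocess` function may rely on a library implementation (for example scikit-learn's `QuantileTransformer` with `output_distribution="normal"`) rather than this code.

```python
import numpy as np
from statistics import NormalDist

def gaussian_quantile_transform(column):
    """Map a 1-D numeric column to roughly N(0, 1) using its empirical ranks."""
    column = np.asarray(column, dtype=float)
    n = column.size
    # Rank of each value (0 .. n-1); ties are broken by position in this sketch.
    ranks = np.argsort(np.argsort(column))
    # Convert ranks to probabilities strictly inside (0, 1).
    probs = (ranks + 0.5) / n
    # Map probabilities through the inverse CDF of the standard normal.
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in probs])

# Hypothetical heavily skewed column, similar in spirit to transaction amounts.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=9.0, sigma=2.0, size=1000)
transformed = gaussian_quantile_transform(values)

print(f"before: mean={values.mean():.1f}, std={values.std():.1f}")
print(f"after:  mean={transformed.mean():.3f}, std={transformed.std():.3f}")
```

The transformation keeps the ordering of the values but reshapes a heavily skewed column into an approximately standard normal one, which is easier for the VAE to reconstruct.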

In [32]:
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
# preprocess data
X_num, X_cat, categories, d_numerical = preprocess(os.path.join(PROCESSED_DATA_DIR, DATA_NAME),
                                                   transforms = raw_config["transforms"],
                                                   task_type = raw_config["task_type"])

# separate train and test data
X_train_num, X_test_num = X_num
X_train_cat, X_test_cat = X_cat

# convert to float tensor
X_train_num, X_test_num = torch.tensor(X_train_num).float(), torch.tensor(X_test_num).float()
X_train_cat, X_test_cat =  torch.tensor(X_train_cat), torch.tensor(X_test_cat)

# create dataset module
train_data = TabularDataset(X_train_num.float(), X_train_cat)

# move test data to gpu if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X_test_num = X_test_num.float().to(device)
X_test_cat = X_test_cat.to(device)

# create train dataloader
train_loader = DataLoader(
    train_data,
    batch_size = raw_config["train"]["vae"]["batch_size"],
    shuffle = True,
    num_workers = raw_config["train"]["vae"]["num_dataset_workers"],
)

No NaNs in numerical features, skipping




## Instantiate Model

Next, we instantiate the model using the `TabSyn` class. The `TabSyn` class takes the following arguments:

1. `train_loader`: dataloader for train data.
2. `X_test_num`: numerical features of the test data.
3. `X_test_cat`: categorical features of the test data.
4. `num_numerical_features`: number of numerical features in the dataset.
5. `num_classes`: number of classes (i.e. categories) of each categorical feature in the dataset.
6. `device`: the device on which the model and data exist, either "cpu" or "cuda".

In [33]:
tabsyn = TabSyn(train_loader,
                X_test_num, X_test_cat,
                num_numerical_features = d_numerical,
                num_classes = categories,
                device = device)

The `TabSyn` class provides methods to instantiate the VAE and diffusion models, train both, and sample from the trained diffusion model.
We demonstrate how to use these methods in the following sections.

## Train Model


The VAE and the diffusion model are trained independently. The following subsections explain each training process.


### A. Train VAE

First, we need to instantiate the VAE using the `instantiate_vae` method. This method takes the VAE model hyperparameters, optimizer and lr scheduler parameters from config, and instantiates them.

In [34]:
# instantiate VAE model for training
tabsyn.instantiate_vae(**raw_config["model_params"], optim_params = raw_config["train"]["optim"]["vae"])

Successfully instantiated VAE model.


Now that the VAE is instantiated, we can train it using the `train_vae` method.
This method receives the loss hyperparameters and the number of epochs from the config.
It also receives `save_path`, the directory where trained model checkpoints are saved.

In [35]:
# os.makedirs(f"{MODEL_PATH}/{DATA_NAME}/vae")
# Define your paths
directory_path = f"{MODEL_PATH}/{DATA_NAME}/vae"

# Check if directory exists
if os.path.exists(directory_path):
    # Remove the existing directory and its contents
    shutil.rmtree(directory_path)

# Create the directory
os.makedirs(directory_path)

In [36]:


tabsyn.train_vae(**raw_config["loss_params"],
                 num_epochs = raw_config["train"]["vae"]["num_epochs"],
                 save_path = os.path.join(MODEL_PATH, DATA_NAME, "vae"))

Epoch 1/10:   0%|          | 0/2 [00:00<?, ?it/s]
Epoch 1/10: 100%|██████████| 2/2 [00:02<00:00,  1.32s/it]
epoch: 0, beta = 0.010000, Train MSE: 13.445180, Train CE:1.044495, Train KL:1.004395, Val MSE:12.604588, Val CE:1.001523, Train ACC:0.487116, Val ACC:0.482353

Epoch 2/10: 100%|██████████| 2/2 [00:00<00:00,  3.38it/s]
epoch: 1, beta = 0.010000, Train MSE: 12.738208, Train CE:0.992967, Train KL:0.940369, Val MSE:12.074593, Val CE:0.960849, Train ACC:0.459985, Val ACC:0.467279

Epoch 3/10: 100%|██████████| 2/2 [00:00<00:00,  3.52it/s]
epoch: 2, beta = 0.010000, Train MSE: 12.173989, Train CE:0.957056, Train KL:0.926207, Val MSE:11.531956, Val CE:0.957460, Train ACC:0.457136, Val ACC:0.444118

Epoch 4/10: 100%|██████████| 2/2 [00:00<00:00,  3.50it/s]
epoch: 3, beta = 0.010000, Train MSE: 11.703219, Train CE:0.954443, Train KL:0.950759, Val MSE:11.155245, Val CE:0.934099, Train ACC:0.433598, Val ACC:0.445588

Epoch 5/10: 100%|██████████| 2/2 [00:00<00:00,  3.41it/s]
epoch: 4, beta = 0.010000, Train MSE: 11.284771, Train CE:0.951732, Train KL:1.000865, Val MSE:10.822515, Val CE:0.940759, Train ACC:0.428271, Val ACC:0.442279

Epoch 6/10: 100%|██████████| 2/2 [00:00<00:00,  3.50it/s]
epoch: 5, beta = 0.010000, Train MSE: 10.939717, Train CE:0.944744, Train KL:1.063959, Val MSE:10.521458, Val CE:0.931749, Train ACC:0.431739, Val ACC:0.448529

Epoch 7/10: 100%|██████████| 2/2 [00:00<00:00,  3.37it/s]
epoch: 6, beta = 0.010000, Train MSE: 10.629172, Train CE:0.930393, Train KL:1.129864, Val MSE:10.205504, Val CE:0.911699, Train ACC:0.433598, Val ACC:0.454779

Epoch 8/10: 100%|██████████| 2/2 [00:00<00:00,  3.46it/s]
epoch: 7, beta = 0.010000, Train MSE: 10.347938, Train CE:0.917326, Train KL:1.193349, Val MSE:9.910804, Val CE:0.888954, Train ACC:0.442517, Val ACC:0.454044

Epoch 9/10: 100%|██████████| 2/2 [00:00<00:00,  3.56it/s]
epoch: 8, beta = 0.010000, Train MSE: 10.040994, Train CE:0.899176, Train KL:1.251430, Val MSE:9.653733, Val CE:0.875695, Train ACC:0.445738, Val ACC:0.455515

Epoch 10/10: 100%|██████████| 2/2 [00:00<00:00,  3.47it/s]
epoch: 9, beta = 0.010000, Train MSE: 9.760778, Train CE:0.883426, Train KL:1.304872, Val MSE:9.344542, Val CE:0.872610, Train ACC:0.446110, Val ACC:0.464706
Training time: 0.1392 mins
Successfully trained and saved the VAE model!

After training the VAE, we embed the training data with the trained encoder and store the embeddings in the directory specified by `vae_ckpt_dir`.

In [37]:
# embed all inputs in the latent space
tabsyn.save_vae_embeddings(X_train_num, X_train_cat,
                           vae_ckpt_dir = os.path.join(MODEL_PATH, DATA_NAME, "vae"))

Successfully saved pretrained embeddings on disk!


### B. Train Diffusion Model

Now that we have stored the training data embeddings, we need to load and prepare them for the diffusion model.
We load the embeddings using `load_latent_embeddings`. We then standardize them by subtracting the per-dimension mean and dividing by the per-dimension standard deviation. Finally, we create a `DataLoader` with the `batch_size` and `num_dataset_workers` values from the config.

In [38]:
# load latent space embeddings
train_z, _ = tabsyn.load_latent_embeddings(os.path.join(MODEL_PATH, DATA_NAME, "vae"))  # train_z dim: B x in_dim

# normalize embeddings
mean, std = train_z.mean(0), train_z.std(0)
train_z = (train_z - mean) / std
latent_train_data = train_z

# create data loader
latent_train_loader = DataLoader(
    latent_train_data,
    batch_size = raw_config["train"]["diffusion"]["batch_size"],
    shuffle = True,
    num_workers = raw_config["train"]["diffusion"]["num_dataset_workers"],
)
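
As a rough, torch-free illustration of the preprocessing above, the sketch below standardizes a toy embedding matrix column by column and computes how many batches one epoch contains. The helper `standardize_columns` and the toy values are hypothetical. `statistics.stdev` is the sample standard deviation, which matches the default behaviour of `torch.Tensor.std`. The batch count uses the 6114 embeddings reported later in this notebook and the `batch_size` of 4096 from the config, and assumes the `DataLoader` default of keeping the last partial batch.

```python
import math
import statistics


def standardize_columns(rows):
    """Scale every column to zero mean and unit sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # sample std (n - 1)
    scaled = [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds


# Toy stand-in for train_z: 4 rows, 2 latent dimensions (hypothetical values).
toy_embeddings = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0], [6.0, 30.0]]
scaled, means, stds = standardize_columns(toy_embeddings)

# Batches per epoch when the last partial batch is kept.
num_rows, batch_size = 6114, 4096
batches_per_epoch = math.ceil(num_rows / batch_size)
print(batches_per_epoch)
```

The result of two batches per epoch is consistent with the `2/2` counter in the training progress bars.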

Now that the data is ready, we instantiate the diffusion model with `instantiate_diffusion`. We pass the embedding dimension (`train_z.shape[1]`) as both the input dimension and the hidden dimension.
The optimizer and learning-rate scheduler are instantiated from the hyperparameters in the config.

In [39]:
# instantiate diffusion model for training
tabsyn.instantiate_diffusion(in_dim = train_z.shape[1], hid_dim = train_z.shape[1], optim_params = raw_config["train"]["optim"]["diffusion"])

MLPDiffusion(
  (proj): Linear(in_features=108, out_features=1024, bias=True)
  (mlp): Sequential(
    (0): Linear(in_features=1024, out_features=2048, bias=True)
    (1): SiLU()
    (2): Linear(in_features=2048, out_features=2048, bias=True)
    (3): SiLU()
    (4): Linear(in_features=2048, out_features=1024, bias=True)
    (5): SiLU()
    (6): Linear(in_features=1024, out_features=108, bias=True)
  )
  (map_noise): PositionalEmbedding()
  (time_embed): Sequential(
    (0): Linear(in_features=1024, out_features=1024, bias=True)
    (1): SiLU()
    (2): Linear(in_features=1024, out_features=1024, bias=True)
  )
)
The number of parameters: 10715244
Successfully instantiated diffusion model.
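
The reported parameter count can be checked by hand from the layer shapes in the printed module. The sketch below does that arithmetic in plain Python. The shapes are copied from the output above, and it assumes that `PositionalEmbedding` has no trainable parameters; the helper name `linear_params` is illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# (in_features, out_features) pairs copied from the printed MLPDiffusion module.
layers = {
    "proj": [(108, 1024)],
    "mlp": [(1024, 2048), (2048, 2048), (2048, 1024), (1024, 108)],
    "time_embed": [(1024, 1024), (1024, 1024)],
}

per_block = {
    name: sum(linear_params(i, o) for i, o in shapes)
    for name, shapes in layers.items()
}
total_params = sum(per_block.values())
print(per_block)
print(total_params)
```

The total agrees with the `10715244` printed by `instantiate_diffusion`, with most of the weights sitting in the four-layer `mlp` block.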


We train the diffusion model with the `train_diffusion` function.
This function takes the following arguments:
1. `latent_train_loader`: dataloader for the latent representations which are used to train the diffusion model.
2. `num_epochs`: number of training epochs.
3. `ckpt_path`: directory where the model checkpoints will be stored.

In [40]:
# os.makedirs(f"{MODEL_PATH}/{DATA_NAME}")
# train diffusion model
tabsyn.train_diffusion(latent_train_loader,
                       num_epochs = raw_config["train"]["diffusion"]["num_epochs"],
                       ckpt_path = os.path.join(MODEL_PATH, DATA_NAME))

Epoch 1/9: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.90it/s, Loss=1.9]
Epoch 2/9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.78it/s, Loss=1.67]
Epoch 3/9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.99it/s, Loss=1.79]
Epoch 4/9: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.67it/s, Loss=1.69]
Epoch 5/9: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████

Time:  5.429706335067749


## Load Pretrained Model

Instead of training the model from scratch, we can load the weights of a pre-trained model from a checkpoint with the `load_model_state` function.
If the VAE and diffusion models have not been instantiated yet, we first instantiate them with the `instantiate_vae` and `instantiate_diffusion` methods.

In [41]:
# instantiate VAE model
tabsyn.instantiate_vae(**raw_config["model_params"], optim_params = None)


Successfully instantiated VAE model.


In [42]:
os.getcwd()

'/fs01/home/ws_aabboud/diffusion_model_bootcamp/deloitte_team/single_table_synthesis'

In [43]:
# latent_embeddings_path = "/projects/diffusion_bootcamp/models/tabular/tabsyn/default/vae"
latent_embeddings_path = os.path.join(MODEL_PATH, DATA_NAME, "vae")
# load latent embeddings of input data
train_z, token_dim = tabsyn.load_latent_embeddings(latent_embeddings_path)

In [44]:
# instantiate diffusion model
tabsyn.instantiate_diffusion(in_dim = train_z.shape[1], hid_dim = train_z.shape[1], optim_params = None)

MLPDiffusion(
  (proj): Linear(in_features=108, out_features=1024, bias=True)
  (mlp): Sequential(
    (0): Linear(in_features=1024, out_features=2048, bias=True)
    (1): SiLU()
    (2): Linear(in_features=2048, out_features=2048, bias=True)
    (3): SiLU()
    (4): Linear(in_features=2048, out_features=1024, bias=True)
    (5): SiLU()
    (6): Linear(in_features=1024, out_features=108, bias=True)
  )
  (map_noise): PositionalEmbedding()
  (time_embed): Sequential(
    (0): Linear(in_features=1024, out_features=1024, bias=True)
    (1): SiLU()
    (2): Linear(in_features=1024, out_features=1024, bias=True)
  )
)
The number of parameters: 10715244
Successfully instantiated diffusion model.


In [45]:
os.path.join(MODEL_PATH, DATA_NAME) 

'models/tabsyn/IBM_AML_FeatureEngineered_FS1'

In [46]:

# For the shared checkpoint, use "/projects/diffusion_bootcamp/models/tabular/tabsyn/default" instead
pretrained_model_path = os.path.join(MODEL_PATH, DATA_NAME)
# load state from checkpoint
tabsyn.load_model_state(ckpt_dir = pretrained_model_path,
                        dif_ckpt_name = "model.pt")

Loaded model state from models/tabsyn/IBM_AML_FeatureEngineered_FS1


## Sample Data

Now that the model is trained, we can use the `sample` function to generate synthetic data starting from complete noise. The function takes the following inputs:

1. `num_samples`: number of synthetic rows to generate; here, the number of training embeddings (`train_z.shape[0]`).
2. `in_dim`: dimension of the latent embeddings (`train_z.shape[1]`).
3. `mean_input_emb`: mean of the training latent embeddings (`train_z.mean(0)`).
4. `info`: info about the data from the json file we reviewed at the beginning of this notebook.
5. `num_inverse`: detokenizer for numerical features.
6. `cat_inverse`: detokenizer for categorical features.
7. `save_path`: file path where the synthetic table will be saved.

In [47]:
# load data info file
with open(os.path.join(PROCESSED_DATA_DIR, DATA_NAME, "info.json"), "r") as file:
    data_info = json.load(file)
data_info["token_dim"] = token_dim

# get inverse tokenizers
_, _, categories, d_numerical, num_inverse, cat_inverse = preprocess(os.path.join(PROCESSED_DATA_DIR, DATA_NAME),
                                                                     transforms = raw_config["transforms"],
                                                                     task_type = raw_config["task_type"],
                                                                     inverse = True)



No NaNs in numerical features, skipping


In [48]:
# sample data
num_samples = train_z.shape[0]
in_dim = train_z.shape[1] 
mean_input_emb = train_z.mean(0)
tabsyn.sample(num_samples,
              in_dim,
              mean_input_emb,
              info = data_info,
              num_inverse = num_inverse,
              cat_inverse = cat_inverse,
              save_path = os.path.join(SYNTH_DATA_DIR, DATA_NAME, "tabsyn.csv"))

(6114, 4)
Time: 3.793548345565796
Saving sampled data to data/synthetic_data/IBM_AML_FeatureEngineered_FS1/tabsyn.csv


In [49]:
train_z.shape[0],train_z.shape[1] 

(6114, 108)

## Review Synthetic Data

Finally, we review the synthesized data. In the `evaluate_synthetic_data.ipynb` notebook that follows, we evaluate this synthesized data against various metrics.

In [50]:
df = pd.read_csv(os.path.join(SYNTH_DATA_DIR, DATA_NAME, "tabsyn.csv"))
df.head()


Unnamed: 0,Destination_1,Destination_2,Destination_3,Destination_4,Destination_5,Destination_6,Destination_7,Destination_8,Destination_9,Destination_10,...,Euro,Yuan,number_transactions_above_9k_cad,avg_transaction_value_in_cad,transaction_value_variance_in_cad,variance_from_10k_cad,min_day_to_max_day_range,avg_transaction_frequency_in_days,transaction_frequency_variance_in_days,Is_FanOut
0,13908490.0,8217370.0,6055.899,8086.7256,0.0,16482.225,698.387,0.0,0.0,0.0,...,N,Y,3.0,2908.7935,8801453000.0,560004700.0,7.0,3.0,1.611571,N
1,24396.17,49619.93,49703.95,4790.0186,1204.3182,6830.724,0.0,0.0,0.0,0.0,...,N,Y,0.0,624.10065,1005576000.0,364897400.0,7.0,2.333333,1.611571,N
2,16527.71,24442.8,160.51692,16035.83,5811.115,795.93964,0.0,0.0,0.0,0.0,...,N,Y,0.0,14006.7705,37583780.0,236394500000000.0,7.0,1.249369,1.611571,N
3,26023.15,1585.963,17285.348,0.0,0.0,0.0,0.0,945.8458,0.0,0.0,...,N,Y,0.0,5265.311,76443110.0,21117550000.0,3.0,2.449506,7.849218,Y
4,23614.75,19190.21,108474.41,2573.207,7965.3657,176386.66,0.0,0.0,0.0,0.0,...,N,Y,0.0,1426.7645,36326200000.0,429846400.0,7.0,3.5,1.611571,Y


# Missing Value Imputation for the Target Column

In [51]:

from pprint import pprint
from scripts.impute import impute
from scripts.eval.eval_impute import eval_impute
from scripts.eval.eval_density import eval_density
from scripts.eval.eval_quality import eval_quality
from scripts.eval.eval_mle import eval_mle
from scripts.eval.eval_dcr import eval_dcr
from scripts.eval.eval_detection import eval_detection



In [52]:
dataname = "IBM_AML_FeatureEngineered_FS1"

# For shared directory you can change it to "/projects/diffusion_bootcamp/data/tabular"
DATA_DIR = "data"
PROCESSED_DATA_DIR = os.path.join(DATA_DIR, "processed_data")

TRAIN_DATA_PATH = f"{DATA_DIR}/processed_data/{dataname}/train.csv"
TEST_DATA_PATH = f"{DATA_DIR}/test_data/{dataname}/test.csv"
# TABDDPM_DATA_PATH = f"{DATA_DIR}/synthetic_data/{dataname}/tabddpm.csv"
TABSYN_DATA_PATH = f"{DATA_DIR}/synthetic_data/{dataname}/tabsyn.csv"
INFO_PATH = f"{DATA_DIR}/processed_data/{dataname}/info.json"
# Change this path to your local model path for imputation
MODEL_PATH = os.path.join('models','tabsyn') #"/deloitte_team/single_table_synthesis/models/tabsyn/IBM_AML_FeatureEngineered_FS1"
IMPUTE_PATH = "impute/tabsyn"

In [53]:
os.getcwd()

'/fs01/home/ws_aabboud/diffusion_model_bootcamp/deloitte_team/single_table_synthesis'

In [54]:


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
impute(dataname, PROCESSED_DATA_DIR, INFO_PATH, MODEL_PATH, IMPUTE_PATH, device)

Trial 0 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 1 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 2 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 3 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 4 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 5 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 6 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 7 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 8 started!
No NaNs in numerical features, skipping
No NaNs in numerical features, skipping
(680, 4)
Trial 9 started!
No NaNs in numerical features

In [55]:
raw_config

{'task_type': 'binclass',
 'model_params': {'n_head': 1, 'factor': 32, 'num_layers': 2, 'd_token': 4},
 'transforms': {'normalization': 'quantile',
  'num_nan_policy': 'mean',
  'cat_nan_policy': None,
  'cat_min_frequency': None,
  'cat_encoding': None,
  'y_policy': 'default'},
 'train': {'vae': {'num_epochs': 10,
   'batch_size': 4096,
   'num_dataset_workers': 4},
  'diffusion': {'num_epochs': 9, 'batch_size': 4096, 'num_dataset_workers': 4},
  'optim': {'vae': {'lr': 0.001,
    'weight_decay': 0,
    'factor': 0.95,
    'patience': 10},
   'diffusion': {'lr': 0.001,
    'weight_decay': 0,
    'factor': 0.9,
    'patience': 20}}},
 'loss_params': {'max_beta': 0.01, 'min_beta': 1e-05, 'lambd': 0.7},
 'impute': {'num_trials': 30,
  'SIGMA_MIN': 0.002,
  'SIGMA_MAX': 80,
  'rho': 7,
  'S_churn': 1,
  'S_min': 0,
  'S_max': inf,
  'S_noise': 1,
  'num_steps': 30,
  'N': 20}}
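
The `SIGMA_MIN`, `SIGMA_MAX`, `rho` and `num_steps` entries under `impute` look like the parameters of the noise schedule from the EDM sampler of Karras et al. Assuming that is how they are used (the notebook does not show the internals of `impute`), the sketch below computes that schedule in plain Python. The function name `karras_sigmas` is hypothetical.

```python
def karras_sigmas(sigma_min, sigma_max, rho, num_steps):
    """Noise levels interpolated linearly in sigma**(1/rho), from sigma_max down to sigma_min."""
    inv_rho = 1.0 / rho
    high = sigma_max ** inv_rho
    low = sigma_min ** inv_rho
    return [
        (high + i / (num_steps - 1) * (low - high)) ** rho
        for i in range(num_steps)
    ]


# Values taken from the 'impute' block of raw_config above.
sigmas = karras_sigmas(sigma_min=0.002, sigma_max=80.0, rho=7.0, num_steps=30)
print(sigmas[0], sigmas[-1])
```

With `rho = 7`, the steps are concentrated at low noise levels, so more of the 30 steps are spent refining fine detail than removing coarse noise.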

In [56]:
# Uncomment below line to evaluate pre-imputed data
# IMPUTE_PATH = "/projects/diffusion_bootcamp/data/tabular/impute_data/tabsyn"

eval_impute(dataname, PROCESSED_DATA_DIR, IMPUTE_PATH)

Micro-F1: 0.8632352941176471
AUC: 0.9175778546712803

Confusion Matrix:
          Predicted N  Predicted Y
Actual N          347           28
Actual Y           65          240
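
The reported Micro-F1 can be reproduced from the confusion matrix. For single-label classification, micro-averaged F1 equals plain accuracy, so the sketch below recomputes it from the four counts printed above and also derives precision and recall for the `Y` class. The variable names are illustrative and are not part of `eval_impute`.

```python
# Confusion-matrix counts copied from the eval_impute output above.
tn, fp = 347, 28   # actual N: predicted N, predicted Y
fn, tp = 65, 240   # actual Y: predicted N, predicted Y

total = tn + fp + fn + tp

# Micro-averaged F1 reduces to accuracy when every row has exactly one label.
micro_f1 = (tn + tp) / total

# Per-class view for the positive class Y.
precision_y = tp / (tp + fp)
recall_y = tp / (tp + fn)
f1_y = 2 * precision_y * recall_y / (precision_y + recall_y)

print(f"rows: {total}")
print(f"micro-F1: {micro_f1:.6f}")
print(f"Y precision: {precision_y:.4f}, recall: {recall_y:.4f}, F1: {f1_y:.4f}")
```

The total of 680 rows matches the `(680, 4)` shape printed during imputation. Recall on `Y` is lower than precision: the imputed labels miss more true positives (65) than they raise false alarms (28).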


In [None]:
# Random
# Micro-F1: 0.2222222222222222
# AUC: 0.25925925925925924

# Confusion Matrix:
#           Predicted N  Predicted Y
# Actual N            2            0
# Actual Y            7            0


## References

**Zhang, Hengrui, et al.** "Mixed-type tabular data synthesis with score-based diffusion in latent space." *International Conference on Learning Representations (ICLR)* (2024).

**GitHub Repository:** [Amazon Science - Tabsyn](https://github.com/amazon-science/tabsyn)