<center>
  
# TABDDPM: Modelling Tabular Data with Diffusion Models

</center>

Directly applying diffusion models to general tabular problems can be challenging because data points are typically represented by vectors of heterogeneous features. The inherent heterogeneity of tabular data complicates accurate modeling, as individual features can vary widely in nature; some may be continuous, while others are discrete. In this notebook, we explore **TabDDPM** — a diffusion model that can be universally applied to tabular datasets and effectively handles both categorical and numerical features.

Our primary focus in this work is synthetic data generation, which is in high demand for many tabular tasks. Firstly, tabular datasets are often limited in size, unlike vision or NLP problems where large amounts of additional data are readily available online. Secondly, properly generated synthetic datasets do not contain actual user data, thus avoiding GDPR-like regulations and allowing for public sharing without compromising anonymity.

In the following sections, we will delve deeper into the implementation of this method.

# Imports and Setup

In this section, we import the libraries and modules needed to set up the environment. These include basic libraries such as numpy and pandas, as well as modules for downloading and processing datasets, loading data, and creating the model. The available datasets and their download URLs are listed in `NAME_URL_DICT_UCI`; any of them can be used in the rest of this notebook. Based on `DATA_DIR`, we also define the paths to the raw and processed data in `RAW_DATA_DIR` and `PROCESSED_DATA_DIR`, which are used to download the data and process it into the desired format. Here we focus on the `"adult"` dataset, so we set `DATA_NAME` accordingly.

In [17]:
import os
import src
import json
import numpy as np
import pandas as pd
import torch

from scripts.download_dataset import download_from_uci
from scripts.process_dataset import process_data

from src.data import make_dataset
from src.baselines.tabddpm.pipeline import TabDDPM


NAME_URL_DICT_UCI = {
    "adult": "https://archive.ics.uci.edu/static/public/2/adult.zip",
    "default": "https://archive.ics.uci.edu/static/public/350/default+of+credit+card+clients.zip",
    "magic": "https://archive.ics.uci.edu/static/public/159/magic+gamma+telescope.zip",
    "shoppers": "https://archive.ics.uci.edu/static/public/468/online+shoppers+purchasing+intention+dataset.zip",
    "beijing": "https://archive.ics.uci.edu/static/public/381/beijing+pm2+5+data.zip",
    "news": "https://archive.ics.uci.edu/static/public/332/online+news+popularity.zip",
}

DATA_DIR = "/projects/aieng/diffusion_bootcamp/data/tabular"
RAW_DATA_DIR = f"{DATA_DIR}/raw_data"
PROCESSED_DATA_DIR = f"{DATA_DIR}/processed_data"
SYNTH_DATA_DIR = f"{DATA_DIR}/synthetic_data"
DATA_NAME = "adult"

MODEL_PATH = "/projects/aieng/diffusion_bootcamp/models/tabular/tabddpm"

# Adult Dataset

In this section, we will download the Adult dataset from the UCI repository and load it into a pandas DataFrame. The Adult dataset contains demographic information about individuals, such as age, education, and occupation, and is commonly used for classification tasks. We will use this dataset to demonstrate the TabDDPM method.

## Download Data

We can download the Adult dataset to the directory specified in `RAW_DATA_DIR` using the `download_from_uci` function. This function takes the dataset name, the download path, and the dictionary mapping dataset names to download URLs, and retrieves the data from the UCI repository.

In [18]:
download_from_uci(DATA_NAME, RAW_DATA_DIR, NAME_URL_DICT_UCI)

Start processing dataset adult from UCI.
Aready downloaded.


## Process Data

Now that we have downloaded the dataset, we need to process it into the desired CSV format using the `process_data` function. To do this, we provide the dataset name, the directory containing the information required for preprocessing, and the original data directory. The `INFO_DIR` contains a JSON file for each dataset, specifying the following:

1. `task_type`: Required. Either binclass (binary classification) or regression, depending on the type of task for each dataset. For the Adult dataset, the task type is binclass.
2. `column_names`: Optional. The names of the columns.
3. `num_col_idx`: Required. The indices of the columns with numerical values.
4. `cat_col_idx`: Required. The indices of the columns with categorical values.
5. `target_col_idx`: Required. The index of the column containing the target value for the regression or classification task.
6. `file_type`: Set to "csv" by default, as we want to preprocess the files as CSV.
7. `data_path`: Optional.
8. `test_path`: Optional.
9. `column_info`: Optional.
10. `train_num`: Optional.
11. `test_num`: Optional.
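As an illustration, the required fields for the Adult dataset might look like the sketch below. The index lists and column names are copied from the processed `info.json` printed later in this notebook; the `check_info` helper is hypothetical (it is not part of `process_data`) and only verifies that the three index lists cover every column exactly once.

```python
# Field values copied from the processed info.json shown later in this notebook.
adult_info = {
    "task_type": "binclass",
    "column_names": [
        "age", "workclass", "fnlwgt", "education", "education.num",
        "marital.status", "occupation", "relationship", "race", "sex",
        "capital.gain", "capital.loss", "hours.per.week", "native.country",
        "income",
    ],
    "num_col_idx": [0, 2, 4, 10, 11, 12],
    "cat_col_idx": [1, 3, 5, 6, 7, 8, 9, 13],
    "target_col_idx": [14],
    "file_type": "csv",
}


def check_info(info):
    """Hypothetical helper: True if the index lists cover every column exactly once."""
    idx = info["num_col_idx"] + info["cat_col_idx"] + info["target_col_idx"]
    return sorted(idx) == list(range(len(info["column_names"])))


print(check_info(adult_info))
```

A check like this catches a column that is listed twice or left out before any preprocessing runs.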

The `process_data` function divides the raw data into training and test splits, saving them in `PROCESSED_DATA_DIR`. It also saves the processed information of the data as a JSON file in the same directory. Finally, it prints out general information about the training and test data after preprocessing, including:

1. Size of the training, validation, and test tables
2. Size of the numerical values in the training table
3. Size of the categorical values in the training table

In [19]:
INFO_DIR = "data_info"
process_data(DATA_NAME, INFO_DIR, DATA_DIR)

adult (32561, 15) (16281, 15) (32561, 15)
Numerical (32561, 6)
Categorical (32561, 8)
Processing and Saving adult Successfully!
adult
Total 48842
Train 32561
Test 16281
Num 6
Cat 9


## Review Data

After preprocessing, we review the training data.

The Adult dataset consists of 15 columns in total: 6 are numerical and 9 are categorical (8 categorical features plus the target). The target column is `income`, which defines a binary classification task. The training split contains 32561 rows. The first rows of the dataset are shown below.

In [20]:
df = pd.read_csv(f"{PROCESSED_DATA_DIR}/{DATA_NAME}/train.csv")

# Display the first few rows of the DataFrame
df.head(20)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,39.0,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
2,38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
3,53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
4,28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K
5,37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,<=50K
6,49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,<=50K
7,52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,>50K
8,31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,>50K
9,42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178.0,0.0,40.0,United-States,>50K


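The dataset also contains missing values, which are represented as `?`. A common way to handle them is to replace each one with the most frequent value in its column. The cell below only checks whether the marker is present; note that the search string `" ?"` includes a leading space.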

In [21]:
value = " ?"
if value in df.values:
    print(f"{value} exists in the DataFrame.")
else:
    print(f"{value} does not exist in the DataFrame.")

 ? exists in the DataFrame.
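The most-frequent-value replacement mentioned above could be sketched as follows. This is a minimal illustration on a toy DataFrame, not the preprocessing used by `process_data`; the `fill_with_mode` helper and the toy values are assumptions made for the example.

```python
import pandas as pd

# Toy frame; the " ?" marker (with a leading space) mirrors the check above.
toy = pd.DataFrame(
    {
        "workclass": [" Private", " ?", " Private", " State-gov"],
        "occupation": [" Sales", " Sales", " ?", " Tech-support"],
    }
)


def fill_with_mode(frame, marker=" ?"):
    """Replace `marker` in each column with that column's most frequent other value."""
    out = frame.copy()
    for col in out.columns:
        is_missing = out[col] == marker
        if is_missing.any():
            # Compute the mode over observed values only, so the marker cannot win.
            mode = out.loc[~is_missing, col].mode().iloc[0]
            out.loc[is_missing, col] = mode
    return out


filled = fill_with_mode(toy)
print(filled)
```

Computing the mode over the observed values only matters for columns where the marker itself is the most common entry.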


In [22]:
# Open the JSON file and read its contents
with open(f"{PROCESSED_DATA_DIR}/{DATA_NAME}/info.json", "r") as file:
    data_info = json.load(file)

# Display the JSON data
print(data_info)

{'name': 'adult', 'task_type': 'binclass', 'header': None, 'column_names': ['age', 'workclass', 'fnlwgt', 'education', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income'], 'num_col_idx': [0, 2, 4, 10, 11, 12], 'cat_col_idx': [1, 3, 5, 6, 7, 8, 9, 13], 'target_col_idx': [14], 'file_type': 'csv', 'data_path': '/projects/aieng/diffusion_bootcamp/data/tabular/raw_data/adult/adult.data', 'test_path': '/projects/aieng/diffusion_bootcamp/data/tabular/raw_data/adult/adult.test', 'column_info': {'0': {}, 'type': 'categorical', 'max': 99.0, 'min': 1.0, '2': {}, '4': {}, '10': {}, '11': {}, '12': {}, '1': {}, 'categorizes': [' <=50K', ' >50K'], '3': {}, '5': {}, '6': {}, '7': {}, '8': {}, '9': {}, '13': {}, '14': {}}, 'train_num': 32561, 'test_num': 16281, 'idx_mapping': {'0': 0, '1': 6, '2': 1, '3': 7, '4': 2, '5': 8, '6': 9, '7': 10, '8': 11, '9': 12, '10': 3, '11': 4, '12': 5, '13': 13, '1

# TabDDPM Algorithm

In this section, we describe the design of TabDDPM and its main hyperparameters, loaded from a config file, which affect the model's effectiveness.

**TabDDPM** uses multinomial diffusion to model the categorical and binary features, and Gaussian diffusion to model the numerical ones. Both are Markov chains defined over a fixed number of discrete timesteps. In more detail, for a tabular data sample that consists of $N$ numerical features and $C$ categorical features with $K_i$ categories each, TabDDPM takes one-hot encoded versions of the categorical features and normalized numerical features as input. The figure below illustrates the TabDDPM model for classification problems; $t$, $y$ and $l$ denote a diffusion timestep, a class label, and logits, respectively.

<p align="center">
<img src="figures/tabddpm.png" width="1000"/>
</p>

**Diffusion models** are likelihood-based generative models that handle the data through forward and reverse Markov processes. The forward process gradually adds noise to an initial sample $x_0$ from the data distribution $q(x_0)$, sampling noise from the predefined distributions $q(x_t \mid x_{t-1})$ with variances $\{\beta_1, \dots, \beta_T\}$.

<p align="center">
<img src="figures/forward.png" width="300"/>
</p>

The reverse diffusion process gradually denoises a latent variable $x_T \sim q(x_T)$ and allows generating new data samples from $q(x_0)$. The distributions $p(x_{t-1} \mid x_t)$ are usually unknown and are approximated by a neural network with parameters $\theta$.

<p align="center">
<img src="figures/backward.png" width="280"/>
</p>

**Gaussian diffusion models** operate in continuous spaces, where the forward and reverse processes are characterized by Gaussian distributions:

<p align="center">
<img src="figures/gaussian.png" width="440"/>
</p>

While in general the parameters $\theta$ are learned from the data by optimizing a variational lower bound, in practice for Gaussian modeling this objective can be simplified to the sum of mean-squared errors between $\epsilon_\theta(x_t, t)$ and $\epsilon$ over all timesteps $t$:

<p align="center">
<img src="figures/gaussian_loss.png" width="330"/>
</p>
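The Gaussian forward process and the simplified loss can be sketched in NumPy as below. This assumes the standard DDPM closed form $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t} (1 - \beta_s)$; the linear variance schedule, the batch size, and the zero-noise stand-in for the network are assumptions for illustration, not values taken from the TabDDPM config.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear variance schedule
alphas_bar = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_s)


def q_sample(x0, t, eps):
    """Draw x_t from q(x_t | x_0) in closed form for 0-indexed integer timesteps t."""
    a = alphas_bar[t][:, None]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps


def simple_loss(eps_pred, eps):
    """Mean-squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))


x0 = rng.standard_normal((4, 6))       # 4 rows, 6 normalized numerical features
t = rng.integers(0, T, size=4)         # one random timestep per row
eps = rng.standard_normal(x0.shape)
xt = q_sample(x0, t, eps)

# Stand-in "network" that always predicts zero noise; a trained model replaces this.
eps_pred = np.zeros_like(eps)
print(xt.shape, simple_loss(eps_pred, eps))
```

Because $x_t$ can be drawn directly from $x_0$, training can sample a random timestep per row instead of simulating the whole chain.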

**Multinomial diffusion models** are designed to generate categorical data, where each sample is a one-hot encoded categorical variable with $K$ values. The multinomial forward diffusion process defines $q(x_t \mid x_{t-1})$ as a categorical distribution that corrupts the data with uniform noise over the $K$ classes:

<p align="center">
<img src="figures/multinomial.png" width="440"/>
</p>

The reverse distribution $p_\theta(x_{t-1} \mid x_t)$ is parameterized as $q(x_{t-1} \mid x_t, \hat{x}_0(x_t, t))$, where $\hat{x}_0$ is predicted by a neural network.
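A minimal NumPy sketch of one multinomial forward step and of the posterior $q(x_{t-1} \mid x_t, x_0)$ is given below. It assumes the usual multinomial diffusion form, where the forward step keeps the current category with probability $1 - \beta_t$ and otherwise resamples uniformly, and where the posterior is proportional to the product of two categorical terms. The values of `K`, `beta_t`, `alpha_t` and `alpha_bar_prev` are illustrative, and the true `x0` is used where the model would plug in its prediction $\hat{x}_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of categories for one feature (illustrative)


def forward_probs(x_prev, beta_t):
    """Parameters of q(x_t | x_{t-1}): keep the category w.p. 1 - beta_t, else uniform."""
    return (1.0 - beta_t) * x_prev + beta_t / K


def posterior_probs(x_t, x0, alpha_t, alpha_bar_prev):
    """q(x_{t-1} | x_t, x_0), the normalized product of two categorical terms."""
    term_t = alpha_t * x_t + (1.0 - alpha_t) / K
    term_0 = alpha_bar_prev * x0 + (1.0 - alpha_bar_prev) / K
    unnorm = term_t * term_0
    return unnorm / unnorm.sum(axis=-1, keepdims=True)


x0 = np.eye(K)[2]                        # one-hot vector for category 2
probs = forward_probs(x0, beta_t=0.1)
x1 = np.eye(K)[rng.choice(K, p=probs)]   # sampled, possibly corrupted, one-hot x_1

# At the first step alpha_bar_prev is 1, so the posterior collapses onto x0.
post = posterior_probs(x1, x0, alpha_t=0.9, alpha_bar_prev=1.0)
print(probs, post)
```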

## Load Config

In this section, we will load the configuration file that contains the hyperparameters for the TabDDPM model. 

In [None]:
config_path = f"../code/src/baselines/tabddpm/configs/{DATA_NAME}.toml"
# TabDDPM is trained unconditionally in this repo, but conditionally in the original paper
raw_config = src.load_config(config_path)

print(raw_config)

The configuration file is a TOML file that contains the following hyperparameters:

1. **model_type:** specifies the type of backbone model used to learn the denoising process. For the Adult dataset, this is set to "mlp". The reverse diffusion step in TabDDPM is modelled by a multi-layer neural network whose output has the same dimensionality as $x_0$: the first $N$ coordinates are the predictions of $\epsilon$ for the Gaussian diffusion, and the rest are the predictions of $x_{ohe}$ for the multinomial diffusions. The model takes the corrupted data $x_t$ and the timestep $t$ as input, and its outputs are used to obtain the denoised sample $x_{t-1}$, as follows:

<p align="center">
<img src="figures/architecture.png" width="440"/>
</p>

2. **model_params:** contains the hyperparameters for the backbone model. For the Adult dataset, we use an MLP model with the following hyperparameters:
    - `is_y_cond`: Whether the model is conditioned on the target value. The original paper trains a class-conditional model (True); in this notebook, the printed model parameters show it set to False.
    - `d_layers`: The dimensions of the layers in the MLP model.
    - `dropout`: The dropout rate.


    
3. **task_type:** specifies the task type that training can be conditioned on; it is either binclass (binary classification) or regression, depending on the dataset. For the Adult dataset, the task type is binclass. In the original paper, classification datasets use a class-conditional model, i.e. $p_\theta(x_{t-1} \mid x_t, y)$ is learned, while for regression datasets the target value is treated as an additional numerical feature and the joint distribution is learned.

4. **diffusion_params:** contains the total number of diffusion steps, the diffusion step size, and the type of loss used to train the noise prediction in the Gaussian diffusion.

5. **train.main:** contains the basic hyperparameters for training the model, such as the number of training steps (steps), the batch size (batch_size), the learning rate (lr), and the rate of weight decay (weight_decay).

6. **train.T:** contains the transformations applied to the training data, such as normalization, standardization, and one-hot encoding.
    - For numerical features, we use the Gaussian quantile transformation and replace NaN values with the mean of each column.
    - For categorical features, we use one-hot encoding. Each categorical feature is handled by a separate forward diffusion process, i.e., the noise components for all features are sampled independently.

7. **sample:** contains the hyperparameters for sampling the data from the trained model. It includes the number of samples to generate, the batch size of sampling, and the seed used for random noise initialization.

8. **eval.T:** contains the defined transformations on evaluation data such as normalization, standardization, and one-hot encoding. 
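The transformations described under `train.T` can be illustrated with the simplified sketch below. It is a stand-in for the library transforms that the notebook calls through `src.Transformations`, not their actual implementation: the Gaussian quantile transform here is a plain rank-based mapping through the inverse normal CDF, ties are broken by sort order, and all function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def fill_nan_with_column_mean(x):
    """Replace NaNs in each column with that column's mean over observed values."""
    col_mean = np.nanmean(x, axis=0)
    return np.where(np.isnan(x), col_mean, x)


def gaussian_quantile(x):
    """Rank-based map of each column to an approximately standard normal shape."""
    n = x.shape[0]
    ranks = x.argsort(axis=0).argsort(axis=0)   # ranks 0..n-1 within each column
    u = (ranks + 0.5) / n                       # strictly inside (0, 1)
    return np.vectorize(NormalDist().inv_cdf)(u)


def one_hot(codes, n_categories):
    """One-hot encode a vector of integer category codes."""
    return np.eye(n_categories)[codes]


x_num = np.array([[39.0, 2174.0], [50.0, np.nan], [38.0, 0.0], [53.0, 0.0]])
x_num = gaussian_quantile(fill_nan_with_column_mean(x_num))
x_cat = one_hot(np.array([0, 2, 1, 1]), n_categories=3)
print(x_num.round(2))
print(x_cat)
```

The quantile map makes heavy-tailed columns such as `capital.gain` look closer to the Gaussian noise the diffusion process adds, which is the usual motivation for this kind of normalization.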


## Make Dataset

Now that we have processed the data, we can create a dataset object. First, we instantiate the transformations needed for the dataset, such as normalization, standardization, and one-hot encoding, in `T`. Next, we call the `make_dataset` function, which takes the directory containing the processed data, the transformations, and the task type as input. It also takes a boolean argument `change_val`; if set to true, the validation data is taken from a split of the training data rather than from the test data. The function returns a dataset object that contains the training and test data, as well as the column names, numerical column indices, and categorical column indices.

In [24]:
T = src.Transformations(**raw_config["train"]["T"])

dataset = make_dataset(
    f"{PROCESSED_DATA_DIR}/{DATA_NAME}",
    T,
    task_type=raw_config["task_type"],
    change_val=False,
)

data_path /projects/aieng/diffusion_bootcamp/data/tabular/processed_data/adult
No NaNs in numerical features, skipping


In [25]:
print("train dataset num shape: ", dataset.X_num["train"].shape)
print("test dataset num shape: ", dataset.X_num["test"].shape)

print("train dataset cat shape: ", dataset.X_cat["train"].shape)
print("test dataset cat shape: ", dataset.X_cat["test"].shape)

print(dataset.y_info)
print(dataset.task_type)
print(dataset.n_classes)

train dataset num shape:  (32561, 6)
test dataset num shape:  (16281, 6)
train dataset cat shape:  (32561, 9)
test dataset cat shape:  (16281, 9)
{'policy': 'default'}
binclass
None


## Instantiate Model

Next, we instantiate the TabDDPM model. To do so, we first need to adjust the loaded config based on the dataset: we set the number of numerical features and the size of each one-hot encoded categorical feature.

In [26]:
dim_categorical_features = np.array(dataset.get_category_sizes("train"))
print(dim_categorical_features)
if len(dim_categorical_features) == 0 or raw_config["train"]["T"]["cat_encoding"] == "one-hot":
    dim_categorical_features = np.array([0])

num_numerical_features = (
    dataset.X_num["train"].shape[1] if dataset.X_num is not None else 0
)

[ 2  9 16  7 15  6  5  2 42]
[ 2  9 16  7 15  6  5  2 42]
110


We also set the input size of the model to the sum of the categorical feature sizes plus the number of numerical features.

In [None]:
dim_input = np.sum(dim_categorical_features) + num_numerical_features
raw_config["model_params"]["d_in"] = dim_input
print(dim_input)
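As a quick arithmetic check, the input size of 110 reported in the outputs of this section can be reproduced from the printed category sizes and the six numerical features. The numbers below are copied from those outputs rather than recomputed from the dataset.

```python
import numpy as np

# Values copied from the cell outputs in this section.
category_sizes = np.array([2, 9, 16, 7, 15, 6, 5, 2, 42])
num_numerical = 6

dim_in = int(category_sizes.sum()) + num_numerical
print(int(category_sizes.sum()), dim_in)  # 104 110
```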

We also set the device to "cuda" if it is available; otherwise we use "cpu".

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

Finally, we instantiate the model using the `TabDDPM` class, which takes the following arguments:

1. `dataset`: The dataset object containing the training and test data, as well as the column names, numerical column indices, and categorical column indices.
2. `num_numerical_features`: The number of numerical features in the dataset.
3. `num_classes`: The sizes of the one-hot encoded categorical features (`dim_categorical_features` above).
4. `model_type`: The type of backbone model used to learn the denoising process.
5. `model_params`: The hyperparameters for the backbone model.
6. Diffusion parameters: The hyperparameters for the diffusion process, unpacked from `raw_config["diffusion_params"]`.
7. `device`: The device to use for training the model.

In [27]:
tabddpm = TabDDPM(
    dataset=dataset,
    num_numerical_features=num_numerical_features,
    num_classes=dim_categorical_features,
    model_type=raw_config["model_type"],
    model_params=raw_config["model_params"],
    device=device,
    **raw_config["diffusion_params"],
    )

{'num_classes': 2, 'is_y_cond': False, 'rtdl_params': {'d_layers': [1024, 2048, 2048, 1024], 'dropout': 0.0}, 'd_in': 110}
mlp
the number of parameters 11768942


The `TabDDPM` class first instantiates the backbone model, a multi-layer neural network that models the reverse diffusion process, using the `get_model` function. It then passes this model to the `GaussianMultinomialDiffusion` class to initialize the diffusion process.
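To make the role of the backbone concrete, here is a simplified NumPy sketch of an MLP denoiser that takes $(x_t, t)$ and returns a vector of the same size as its input. It is not the model built by `get_model`: the `TinyDenoiser` class, the layer widths, the plain sinusoidal timestep embedding, and the random untrained weights are all assumptions for illustration. Only the input size of 110 and the split into 6 numerical and 104 categorical coordinates come from the notebook outputs.

```python
import numpy as np

rng = np.random.default_rng(0)


def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps, shape (batch, dim)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=1)


def linear(d_in, d_out):
    """Randomly initialised weight matrix and zero bias (untrained placeholder)."""
    return rng.standard_normal((d_in, d_out)) / np.sqrt(d_in), np.zeros(d_out)


class TinyDenoiser:
    """Hypothetical stand-in for the MLP backbone: (x_t, t) -> vector of size d_in."""

    def __init__(self, d_in, d_emb=16, d_layers=(32, 32)):
        self.d_emb = d_emb
        self.proj = linear(d_in, d_emb)
        dims = [d_emb, *d_layers]
        self.hidden = [linear(a, b) for a, b in zip(dims[:-1], dims[1:])]
        self.head = linear(dims[-1], d_in)

    def __call__(self, x_t, t):
        w, b = self.proj
        # Project the input and add the timestep embedding.
        h = x_t @ w + b + timestep_embedding(t, self.d_emb)
        for w, b in self.hidden:
            h = np.maximum(h @ w + b, 0.0)  # ReLU
        w, b = self.head
        return h @ w + b


model = TinyDenoiser(d_in=110)
out = model(rng.standard_normal((4, 110)), np.array([0, 10, 500, 999]))
# First N = 6 coordinates: noise prediction; the rest: categorical predictions.
eps_pred, cat_logits = out[:, :6], out[:, 6:]
print(out.shape, eps_pred.shape, cat_logits.shape)
```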

## Train Model

Now that we have instantiated the model, we can train it using the `train` method. It takes the training arguments from the config file and the path where the trained model is saved, and trains the model on the training data using the `Trainer` class. It returns the trained model and the training loss. At each step, the log reports the multinomial loss (`MLoss`), the Gaussian loss (`GLoss`), and their sum.

In [28]:
tabddpm.train(
    **raw_config["train"]["main"],
    model_save_path=f"{MODEL_PATH}/{DATA_NAME}",
)

Steps:  100000
Step 1/100000 MLoss: 2.1247 GLoss: 1.2061 Sum: 3.3308
Time:  0.20956921577453613
Step 2/100000 MLoss: 46.4899 GLoss: 80031.0 Sum: 80077.4899
Time:  0.11751985549926758
Step 3/100000 MLoss: 1.6049 GLoss: 2.4237 Sum: 4.0286
Time:  0.09335041046142578
Step 4/100000 MLoss: 1.6683 GLoss: 1.6124 Sum: 3.2807
Time:  0.15784645080566406
Step 5/100000 MLoss: 1.7956 GLoss: 1.0445 Sum: 2.8401
Time:  0.13161206245422363
Step 6/100000 MLoss: 1.3912 GLoss: 1.0906 Sum: 2.4818
Time:  0.1342473030090332
Step 7/100000 MLoss: 1.7725 GLoss: 1.3167 Sum: 3.0892
Time:  0.0675048828125


KeyboardInterrupt: 

## Load Pretrained Model

## Sample Data

In [30]:
model_name = "model_100000.pt"

tabddpm.sample(
    info_path=f"{PROCESSED_DATA_DIR}/{DATA_NAME}/info.json",
    num_samples=raw_config["sample"]["num_samples"],
    batch_size=raw_config["sample"]["batch_size"],
    disbalance=raw_config["sample"].get("disbalance", None),
    ckpt_path=f"{MODEL_PATH}/{DATA_NAME}/{model_name}",
    sample_save_path=f"{SYNTH_DATA_DIR}/{DATA_NAME}/tabddpm.csv",
    ddim=True,
    steps=1000,
)

Loaded model from /projects/aieng/diffusion_bootcamp/models/tabular/tabddpm/adult/model_100000.pt
Sample using DDIM.
Sample timestep  999
Sample timestep  999
Sample timestep  999
Sample timestep  999
Shape torch.Size([32561, 15])
(32561, 9)
Sampling time: 232.29829478263855


# Synthetic Data

In [31]:
df = pd.read_csv(f"{SYNTH_DATA_DIR}/{DATA_NAME}/tabddpm.csv")

# Display the first few rows of the DataFrame
df.head(20)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,41.0,Never-worked,98944.16,9th,7.0,Never-married,Farming-fishing,Husband,White,Male,0.0,0.0,40.0,United-States,<=50K
1,20.0,Private,187110.6,Some-college,10.0,Married-civ-spouse,Prof-specialty,Not-in-family,White,Female,0.0,0.0,50.0,United-States,<=50K
2,25.0,Private,346842.62,Bachelors,13.0,Married-civ-spouse,Sales,Husband,White,Male,0.0,0.0,40.0,United-States,<=50K
3,63.0,Self-emp-not-inc,230567.36,HS-grad,9.0,Divorced,Exec-managerial,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K
4,32.0,Private,123303.305,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
5,27.0,Private,201552.97,Some-college,10.0,Never-married,Exec-managerial,Husband,White,Male,4386.0,0.0,40.0,United-States,>50K
6,53.0,Self-emp-not-inc,175929.92,Masters,14.0,Never-married,Prof-specialty,Husband,White,Male,0.0,0.0,25.0,United-States,<=50K
7,20.0,?,100824.17,HS-grad,9.0,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,0.0,60.0,United-States,<=50K
8,49.0,?,204162.78,HS-grad,9.0,Divorced,Machine-op-inspct,Husband,White,Male,0.0,0.0,40.0,Italy,<=50K
9,42.0,?,183482.92,Assoc-voc,13.0,Never-married,Exec-managerial,Husband,White,Male,0.0,0.0,50.0,United-States,>50K
