# Membership Inference over Diffusion-models-based Synthetic Tabular Data (MIDST) Challenge @ SaTML 2025.

## White Box Multi Table Competition
Welcome to the MIDST challenge!

The MIDST challenge is a multi-track competition aiming to quantitatively evaluate the privacy of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs).

This competition focuses on White Box MIA on tabular diffusion models trained on a multi-table dataset (Berka). There are 8 tables in total, which are related as follows:

![image](https://github.com/user-attachments/assets/cae8fc1a-52e4-49d3-bd3b-98d56488b226)

In particular, MIA will be explored over a state-of-the-art multi-relational tabular diffusion model, [ClavaDDPM](https://arxiv.org/abs/2405.17724). A collection of ClavaDDPM models will be trained on random subsets of the transaction dataset. Given the synthesis algorithm, model checkpoints, and output for all the tables in the Berka dataset, you are expected to perform MIA on challenge points selected from the Transaction table. You can use the other tables as auxiliary information in your attack development, or ignore them. The `final` set includes 20 models, each with its own set of challenge points (i.e. train and holdout data), on which solutions are evaluated. To facilitate designing an attack, 30 `train` models are provided with comprehensive information about the model, training data, and output synthetic data. Additionally, 20 `dev` models are provided to assist in evaluating the effectiveness of attacks prior to making a final submission to the `final` set. A high-level summary of the competition is below:
![white_box_diagram](https://github.com/user-attachments/assets/c6683d1a-0461-42c1-9cdc-3645c76d474d)

This notebook will walk you through the process of creating and packaging a submission to the white box multi-table challenge.

## Package Imports and Environment Setup

Ensure that you have installed the proper dependencies to run the notebook. The environment installation instructions are available [here](https://github.com/VectorInstitute/MIDSTModels/tree/main/starter_kits). Now that we have verified that the proper packages are installed, let's import them and define global variables:

In [1]:
import csv
import os
import random
import zipfile

from pathlib import Path
from functools import partial
from typing import Callable, Any

import numpy as np
import torch

from tqdm.notebook import tqdm
from midst.data import get_challenge_points
from midst.metrics import get_tpr_at_fpr

In [2]:
CLAVADDPM_DATA_DIR = "clavaddpm_white_box"

## Data

Next, let's download and extract the data for the competition:

In [None]:
!gdown 1yiqQVTi4iXbdnbKi6vm7qNucihQkCtHH
!unzip -qq -o clavaddpm_white_box.zip

**Note:** If there is an issue with the download (e.g. throttling from downloading too many files with gdown), you can download the zip manually from this [link](https://drive.google.com/file/d/1WzgDhNFySjX3RMTxmDVX5BnXnq9q_sGU/view?usp=drive_link) and extract it in the same directory as this notebook.

### Contents
The archive extracted under `clavaddpm_white_box` contains 3 subdirectories:

- `train`: Comprehensive information (i.e. model weights and architecture, training data, output synthetic data, etc.) about the set of shadow models. Use these to develop your attacks without having to train your own models.
- `dev`: Set of challenge points. Membership predictions for these challenges will be used to evaluate submissions during the competition and update the live scoreboard in CodaBench.
- `final`: Set of challenge points. Membership predictions for these challenges will be used to evaluate submissions when the competition closes and to determine the final ranking.

The `train`, `dev` and `final` subdirectories of `clavaddpm_white_box` contain the following files:
<table>
  <thead>
    <tr>
      <th>Model - Stage</th>
      <th>File Name</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <!-- White-Box Models - Train -->
    <tr>
      <td rowspan="36"><strong>White-Box Models - Train</strong></td>
      <!-- Train data with IDs -->
      <td>account.csv</td>
      <td>Account samples used to train the model</td>
    </tr>
    <!-- Remaining Train data files -->
    <tr>
      <td>card.csv</td>
      <td>Credit card samples used to train the model</td>
    </tr>
    <tr>
      <td>client.csv</td>
      <td>Client samples used to train the model</td>
    </tr>
    <tr>
      <td>disp.csv</td>
      <td>Disposition samples used to train the model</td>
    </tr>
    <tr>
      <td>district.csv</td>
      <td>District samples used to train the model</td>
    </tr>
    <tr>
      <td>loan.csv</td>
      <td>Loan samples used to train the model</td>
    </tr>
    <tr>
      <td>order.csv</td>
      <td>Order samples used to train the model</td>
    </tr>
    <tr>
      <td>trans.csv</td>
      <td>Transaction samples used to train the model</td>
    </tr>
    <!-- Data domain files -->
    <tr>
      <td>account_domain.json</td>
      <td>Account data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>card_domain.json</td>
      <td>Credit card data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>client_domain.json</td>
      <td>Client data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>disp_domain.json</td>
      <td>Disposition data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>district_domain.json</td>
      <td>District data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>loan_domain.json</td>
      <td>Loan data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>order_domain.json</td>
      <td>Order data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>trans_domain.json</td>
      <td>Transaction data domain file indicating the domain information for each column</td>
    </tr>
    <!-- Challenge data and labels -->
    <tr>
      <td>challenge_with_id.csv</td>
      <td>Challenge points sampled from transaction train data and holdout data</td>
    </tr>
    <tr>
      <td>challenge_label.csv</td>
      <td>The labels for the set of challenge points</td>
    </tr>
    <!-- Label encoders -->
    <tr>
      <td>account_label_encoders.pkl</td>
      <td>Pickled label encoders used in account data preprocessing</td>
    </tr>
    <tr>
      <td>card_label_encoders.pkl</td>
      <td>Pickled label encoders used in credit card data preprocessing</td>
    </tr>
    <tr>
      <td>client_label_encoders.pkl</td>
      <td>Pickled label encoders used in client data preprocessing</td>
    </tr>
    <tr>
      <td>disp_label_encoders.pkl</td>
      <td>Pickled label encoders used in disposition data preprocessing</td>
    </tr>
    <tr>
      <td>district_label_encoders.pkl</td>
      <td>Pickled label encoders used in district data preprocessing</td>
    </tr>
    <tr>
      <td>loan_label_encoders.pkl</td>
      <td>Pickled label encoders used in loan data preprocessing</td>
    </tr>
    <tr>
      <td>order_label_encoders.pkl</td>
      <td>Pickled label encoders used in order data preprocessing</td>
    </tr>
    <tr>
      <td>trans_label_encoders.pkl</td>
      <td>Pickled label encoders used in transaction data preprocessing</td>
    </tr>
    <tr>
      <td>workspace/train_1/cluster_ckpt.pkl</td>
      <td>Pickled cluster checkpoint used in ClavaDDPM relation-aware clustering</td>
    </tr>
    <tr>
      <td>workspace/train_1/models/*</td>
      <td>The trained model checkpoints for all the tables used in training</td>
    </tr>
    <!-- Synthetic data -->
    <tr>
      <td>workspace/train_1/account/_final/account_synthetic.csv</td>
      <td>Synthetic account data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/card/_final/card_synthetic.csv</td>
      <td>Synthetic credit card data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/client/_final/client_synthetic.csv</td>
      <td>Synthetic client data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/disp/_final/disp_synthetic.csv</td>
      <td>Synthetic disposition data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/district/_final/district_synthetic.csv</td>
      <td>Synthetic district data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/loan/_final/loan_synthetic.csv</td>
      <td>Synthetic loan data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/order/_final/order_synthetic.csv</td>
      <td>Synthetic order data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/trans/_final/trans_synthetic.csv</td>
      <td>Synthetic transaction data generated using the corresponding trained model</td>
    </tr>
    <!-- White-Box Models - Dev -->
    <tr>
      <td rowspan="27"><strong>White-Box Models - Dev</strong></td>
      <td>account_domain.json</td>
      <td>Account data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>card_domain.json</td>
      <td>Credit card data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>client_domain.json</td>
      <td>Client data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>disp_domain.json</td>
      <td>Disposition data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>district_domain.json</td>
      <td>District data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>loan_domain.json</td>
      <td>Loan data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>order_domain.json</td>
      <td>Order data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>trans_domain.json</td>
      <td>Transaction data domain file indicating the domain information for each column</td>
    </tr>
    <!-- Challenge data -->
    <tr>
      <td>challenge_with_id.csv</td>
      <td>Challenge points sampled from transaction train data and holdout data</td>
    </tr>
    <!-- Model checkpoints and other artifacts -->
    <tr>
      <td>account_label_encoders.pkl</td>
      <td>Pickled label encoders used in account data preprocessing</td>
    </tr>
    <tr>
      <td>card_label_encoders.pkl</td>
      <td>Pickled label encoders used in credit card data preprocessing</td>
    </tr>
    <tr>
      <td>client_label_encoders.pkl</td>
      <td>Pickled label encoders used in client data preprocessing</td>
    </tr>
    <tr>
      <td>disp_label_encoders.pkl</td>
      <td>Pickled label encoders used in disposition data preprocessing</td>
    </tr>
    <tr>
      <td>district_label_encoders.pkl</td>
      <td>Pickled label encoders used in district data preprocessing</td>
    </tr>
    <tr>
      <td>loan_label_encoders.pkl</td>
      <td>Pickled label encoders used in loan data preprocessing</td>
    </tr>
    <tr>
      <td>order_label_encoders.pkl</td>
      <td>Pickled label encoders used in order data preprocessing</td>
    </tr>
    <tr>
      <td>trans_label_encoders.pkl</td>
      <td>Pickled label encoders used in transaction data preprocessing</td>
    </tr>
    <tr>
      <td>workspace/train_1/cluster_ckpt.pkl</td>
      <td>Pickled cluster checkpoint used in ClavaDDPM relation-aware clustering</td>
    </tr>
    <tr>
      <td>workspace/train_1/models/*</td>
      <td>The trained model checkpoints for all the tables used in training</td>
    </tr>
    <!-- Synthetic data -->
    <tr>
      <td>workspace/train_1/account/_final/account_synthetic.csv</td>
      <td>Synthetic account data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/card/_final/card_synthetic.csv</td>
      <td>Synthetic credit card data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/client/_final/client_synthetic.csv</td>
      <td>Synthetic client data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/disp/_final/disp_synthetic.csv</td>
      <td>Synthetic disposition data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/district/_final/district_synthetic.csv</td>
      <td>Synthetic district data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/loan/_final/loan_synthetic.csv</td>
      <td>Synthetic loan data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/order/_final/order_synthetic.csv</td>
      <td>Synthetic order data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/trans/_final/trans_synthetic.csv</td>
      <td>Synthetic transaction data generated using the corresponding trained model</td>
    </tr>
    <!-- White-Box Models - Eval -->
    <tr>
      <td rowspan="27"><strong>White-Box Models - Final</strong></td>
      <!-- Data domain files -->
      <td>account_domain.json</td>
      <td>Account data domain file indicating the domain information for each column</td>
    </tr>
    <!-- Remaining Eval data files -->
    <tr>
      <td>card_domain.json</td>
      <td>Credit card data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>client_domain.json</td>
      <td>Client data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>disp_domain.json</td>
      <td>Disposition data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>district_domain.json</td>
      <td>District data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>loan_domain.json</td>
      <td>Loan data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>order_domain.json</td>
      <td>Order data domain file indicating the domain information for each column</td>
    </tr>
    <tr>
      <td>trans_domain.json</td>
      <td>Transaction data domain file indicating the domain information for each column</td>
    </tr>
    <!-- Challenge data -->
    <tr>
      <td>challenge_with_id.csv</td>
      <td>Challenge points sampled from transaction train data and holdout data</td>
    </tr>
    <!-- Model checkpoints and other artifacts -->
    <tr>
      <td>account_label_encoders.pkl</td>
      <td>Pickled label encoders used in account data preprocessing</td>
    </tr>
    <tr>
      <td>card_label_encoders.pkl</td>
      <td>Pickled label encoders used in credit card data preprocessing</td>
    </tr>
    <tr>
      <td>client_label_encoders.pkl</td>
      <td>Pickled label encoders used in client data preprocessing</td>
    </tr>
    <tr>
      <td>disp_label_encoders.pkl</td>
      <td>Pickled label encoders used in disposition data preprocessing</td>
    </tr>
    <tr>
      <td>district_label_encoders.pkl</td>
      <td>Pickled label encoders used in district data preprocessing</td>
    </tr>
    <tr>
      <td>loan_label_encoders.pkl</td>
      <td>Pickled label encoders used in loan data preprocessing</td>
    </tr>
    <tr>
      <td>order_label_encoders.pkl</td>
      <td>Pickled label encoders used in order data preprocessing</td>
    </tr>
    <tr>
      <td>trans_label_encoders.pkl</td>
      <td>Pickled label encoders used in transaction data preprocessing</td>
    </tr>
    <tr>
      <td>workspace/train_1/cluster_ckpt.pkl</td>
      <td>Pickled cluster checkpoint used in ClavaDDPM relation-aware clustering</td>
    </tr>
    <tr>
      <td>workspace/train_1/models/*</td>
      <td>The trained model checkpoints for all the tables used in training</td>
    </tr>
    <!-- Synthetic data -->
    <tr>
      <td>workspace/train_1/account/_final/account_synthetic.csv</td>
      <td>Synthetic account data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/card/_final/card_synthetic.csv</td>
      <td>Synthetic credit card data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/client/_final/client_synthetic.csv</td>
      <td>Synthetic client data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/disp/_final/disp_synthetic.csv</td>
      <td>Synthetic disposition data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/district/_final/district_synthetic.csv</td>
      <td>Synthetic district data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/loan/_final/loan_synthetic.csv</td>
      <td>Synthetic loan data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/order/_final/order_synthetic.csv</td>
      <td>Synthetic order data generated using the corresponding trained model</td>
    </tr>
    <tr>
      <td>workspace/train_1/trans/_final/trans_synthetic.csv</td>
      <td>Synthetic transaction data generated using the corresponding trained model</td>
    </tr>
  </tbody>
</table>
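
As a rough sketch of how the challenge files fit together, the snippet below reads the challenge points and, where present, their labels for one model folder using only the standard library. The helper name `load_challenge` is hypothetical and not part of the `midst` package. It assumes that `challenge_with_id.csv` starts with a row of column names and that `challenge_label.csv` has one header row followed by one label per line, which matches how the scoring cell in this notebook reads the label file.

```python
import csv
from pathlib import Path


def load_challenge(model_dir):
    """Return (header, rows, labels) for one model folder.

    `labels` is None when `challenge_label.csv` is absent (dev and final).
    Hypothetical helper; not part of the `midst` package.
    """
    model_dir = Path(model_dir)
    with open(model_dir / "challenge_with_id.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumed: first row holds the column names
        rows = list(reader)

    labels = None
    label_path = model_dir / "challenge_label.csv"
    if label_path.exists():
        with open(label_path, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # assumed: a single header row
            labels = [int(float(row[0])) for row in reader if row]
    return header, rows, labels
```

A call such as `load_challenge("clavaddpm_white_box/train/clavaddpm_1")` would then return the labels for a `train` model and `None` for a `dev` or `final` model. The folder name `clavaddpm_1` is assumed from the `clavaddpm_#` naming used in the submission layout.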


## Task

Your task as a competitor is to produce, for each model in `dev` and `final`, a CSV file listing your confidence scores (values between 0 and 1) for the membership of the challenge examples. You must save these scores in a `prediction.csv` file and place it in the same folder as the corresponding model. A submission to the challenge is an archive containing just these `prediction.csv` files.

**You must submit predictions for both `dev` and `final` when you submit to CodaBench.**

In the following, we will show you how to compute predictions from a basic membership inference attack and package them as a submission archive. To start, let's create a baseline attack model `clavaddpm_attack_model` based on its respective shadow models:

In [8]:
def get_attack_model(base_train_path: Path) -> Callable[[Any], float]:
    return lambda x : random.uniform(0, 1)

base_clavaddpm_train_path = os.path.join(CLAVADDPM_DATA_DIR, "train")
clavaddpm_attack_model = get_attack_model(base_clavaddpm_train_path)
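
The baseline above ignores the data entirely. As one illustration of a scorer that does use the released artifacts, the sketch below implements a distance-to-closest-record heuristic: a challenge point that lies close to some synthetic record receives a higher membership score. This is not the organizers' method and makes no claim about what works on this task. The names `make_dcr_attack` and `scale` are hypothetical, and the code assumes the rows have already been converted to numeric vectors of equal length on comparable scales, which the raw Berka columns are not.

```python
import math


def make_dcr_attack(synthetic_rows, scale=1.0):
    """Build a scorer from numeric synthetic rows (equal-length float sequences).

    Hypothetical helper; `scale` controls how quickly the score decays.
    """
    if not synthetic_rows:
        raise ValueError("need at least one synthetic row")

    def score(point):
        # Euclidean distance to the closest synthetic record.
        nearest = min(math.dist(point, row) for row in synthetic_rows)
        # Map a distance in [0, inf) to a score in [0, 1]; closer means higher.
        return math.exp(-nearest / scale)

    return score
```

Because every model has its own synthetic data, a scorer like this would be built once per model folder rather than once from `train`, unlike the single callable returned by `get_attack_model`.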

Using the attack model, we can obtain predictions for each point in the challenge point set for train, dev and final:

In [9]:
phases = ["train", "dev", "final"]

for base_dir, attack_model in zip([CLAVADDPM_DATA_DIR], [clavaddpm_attack_model]):
    for phase in phases:
        root = os.path.join(base_dir, phase)
        for model_folder in sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])):
            path = os.path.join(root, model_folder)
    
            challenge_points = get_challenge_points(path)
    
            predictions = torch.Tensor([attack_model(cp) for cp in challenge_points])
           
            assert torch.all((0 <= predictions) & (predictions <= 1))
            with open(os.path.join(path, "prediction.csv"), mode="w", newline="") as file:
                writer = csv.writer(file)
    
                # Write each value in a separate row
                for value in list(predictions.numpy().squeeze()):
                    writer.writerow([value])
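
Before moving on, it can be worth sanity-checking each written file. The following sketch re-reads a `prediction.csv` and confirms that it holds one score per challenge point, each within [0, 1]. The helper name `check_predictions` is hypothetical, and the expectation of exactly one headerless row per challenge point is inferred from the writing loop above rather than from a published CodaBench specification.

```python
import csv


def check_predictions(prediction_path, expected_rows):
    """Return the scores, or raise ValueError if the file looks malformed.

    Hypothetical helper; assumes one score per row and no header.
    """
    with open(prediction_path, newline="") as f:
        scores = [float(row[0]) for row in csv.reader(f) if row]

    if len(scores) != expected_rows:
        raise ValueError(
            f"{prediction_path}: expected {expected_rows} rows, found {len(scores)}"
        )
    # NaN fails the chained comparison, so it is reported as out of range too.
    bad = [s for s in scores if not 0.0 <= s <= 1.0]
    if bad:
        raise ValueError(f"{prediction_path}: {len(bad)} score(s) outside [0, 1]")
    return scores
```

In the loop above, this could be called with `expected_rows=len(challenge_points)` right after the file is closed.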

## Scoring

Let's see how the attack does on `train`, for which we have the ground truth.
When preparing a submission, you can use part of `train` to develop an attack and a held-out part to evaluate your attack.
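
The held-out split mentioned above could be set up along these lines. The sketch shuffles the model folder names with a fixed seed and reserves a fraction of them for evaluation. The helper name `split_model_folders`, the 20% default, and the seed are illustrative choices, not part of the starter kit.

```python
import random


def split_model_folders(folders, holdout_fraction=0.2, seed=0):
    """Split folder names reproducibly into (develop, holdout) lists.

    Hypothetical helper; the fraction and seed are arbitrary defaults.
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = sorted(folders)  # fixed starting order, independent of os.listdir
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

Splitting by model folder, rather than by individual challenge point, keeps all points from one shadow model on the same side of the split.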

In [10]:
tpr_at_fpr_list = []
for base_dir in [CLAVADDPM_DATA_DIR]:
    predictions = []
    solutions  = []
    root = os.path.join(base_dir, "train")
    for model_folder in sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])):
        path = os.path.join(root, model_folder)
        predictions.append(np.loadtxt(os.path.join(path, "prediction.csv")))
        solutions.append(np.loadtxt(os.path.join(path, "challenge_label.csv"), skiprows=1))
    
    predictions = np.concatenate(predictions)
    solutions = np.concatenate(solutions)
    
    tpr_at_fpr = get_tpr_at_fpr(solutions, predictions)
    tpr_at_fpr_list.append(tpr_at_fpr)
    
    print(f"{base_dir.split('_')[0]} Train Attack TPR at FPR==10%: {tpr_at_fpr}")

final_tpr_at_fpr = max(tpr_at_fpr_list)
print(f"Final Train Attack TPR at FPR==10%: {final_tpr_at_fpr}")

clavaddpm Train Attack TPR at FPR==10%: 0.097
Final Train Attack TPR at FPR==10%: 0.097
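
To make the metric concrete, here is a minimal pure-Python sketch of a true positive rate at a fixed false positive rate. It sweeps the decision threshold from the highest score downwards and reports the best TPR reached while the FPR stays at or below the target. The function name `tpr_at_fpr` is hypothetical, and `midst.metrics.get_tpr_at_fpr` may handle ties or interpolate between thresholds differently.

```python
def tpr_at_fpr(labels, scores, max_fpr=0.1):
    """Highest TPR of any score threshold whose FPR is <= max_fpr.

    Hypothetical helper; labels are 1 for members and 0 for non-members.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one member and one non-member")

    best, tp, fp, i = 0.0, 0, 0, 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Points sharing a score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold drops
        best = tp / n_pos
    return best
```

For scores that carry no information about membership, the TPR is expected to be close to the FPR, which is consistent with the random baseline landing near 0.10 above.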


## Packaging the submission

Now we can store the predictions into a zip file, which you can submit to CodaBench. Importantly, we create a single zip file for dev and final. The structure of the submission is as follows:
```
└── root_folder
    ├── clavaddpm_white_box
       ├── dev
       │   └── clavaddpm_#
       │       └── prediction.csv
       └── final
           └── clavaddpm_#
                └── prediction.csv
```

**Note:** The `root_folder` can have any name but it is important all of the subdirectories follow the above structure and naming conventions. 

In [12]:
with zipfile.ZipFile("white_box_multi_table_submission.zip", 'w') as zipf:
    for phase in ["dev", "final"]:
        for base_dir in [CLAVADDPM_DATA_DIR]:
            root = os.path.join(base_dir, phase)
            for model_folder in sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])):
                path = os.path.join(root, model_folder)
                if not os.path.isdir(path): 
                    continue

                file = os.path.join(path, "prediction.csv")
                if os.path.exists(file):
                    arcname = os.path.join(
                        f"white_box_multi_table_submission/{CLAVADDPM_DATA_DIR}",
                        phase,  
                        model_folder,  
                        os.path.basename(file)
                    )
                    zipf.write(file, arcname=arcname)
                else:
                    raise FileNotFoundError(f"`prediction.csv` not found in {path}.")

The generated `white_box_multi_table_submission.zip` can be submitted directly to the dev phase in the CodaBench UI. Although this submission contains your predictions for both the dev and final sets, you will only receive feedback on your predictions for the dev phase. The predictions for the final phase will be evaluated once the competition ends, using the most recent submission to the dev phase.
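
Before uploading, the archive layout can be checked locally. The sketch below lists every `prediction.csv` in a zip file and raises an error if its path does not follow the `root_folder/clavaddpm_white_box/<phase>/<model_folder>/prediction.csv` layout described earlier. The helper name `list_submission_entries` is hypothetical, and the check reflects only the structure shown in this notebook, not any additional validation CodaBench may perform.

```python
import zipfile


def list_submission_entries(zip_path):
    """Return (phase, model_folder) for each prediction.csv in the archive.

    Hypothetical helper; raises ValueError on an unexpected path.
    """
    entries = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")  # zip entries use forward slashes
            if parts[-1] != "prediction.csv":
                continue
            well_formed = (
                len(parts) == 5
                and parts[1] == "clavaddpm_white_box"
                and parts[2] in ("dev", "final")
            )
            if not well_formed:
                raise ValueError(f"unexpected path in archive: {name}")
            entries.append((parts[2], parts[3]))
    return entries
```

Counting the returned pairs per phase gives a quick way to confirm that every `dev` and `final` model has a prediction file in the archive.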