# Membership Inference over Diffusion-models-based Synthetic Tabular Data (MIDST) Challenge @ SaTML 2025

## Black Box Single Table Competition
Welcome to the MIDST challenge!

The MIDST challenge is a series of competitions aiming to quantitatively evaluate the privacy of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs).

This competition focuses on black-box MIAs against tabular diffusion models trained on a single-table transaction dataset. In particular, attacks target two state-of-the-art methods: [TabSyn](https://arxiv.org/pdf/2310.09656) and [TabDDPM](https://arxiv.org/pdf/2209.15421). A collection of TabSyn and TabDDPM models is trained on random subsets of the transaction dataset. The goal is to design an attack (MIA) that, given only the synthetic data output by a model, distinguishes samples used to train that model (train data) from other samples drawn at random from the transaction dataset (holdout data). The `eval` set includes 10 models, each with its own set of challenge points (i.e. train and holdout data), on which solutions are evaluated. To facilitate designing an attack, 30 `train` models are provided with comprehensive information about the model, training data and output synthetic data. Additionally, 10 `dev` models are provided to help you gauge the effectiveness of an attack before making a final submission to the `eval` set.

This notebook will walk you through the process of creating and packaging a submission to the black box single table challenge.

## Package Imports and Environment Setup

To start, let's import the required packages and define global variables:

In [1]:
import csv
import os
import random
import zipfile

from pathlib import Path
from functools import partial
from typing import Callable, Any

import numpy as np
import torch

from tqdm.notebook import tqdm
from midst.data import get_challenge_points
from midst.metrics import get_tpr_at_fpr

In [2]:
TABDDPM_DATA_DIR = "tabddpm_black_box"
TABSYN_DATA_DIR = "tabsyn_black_box"

## Data

Next, let's download and extract the data for the competition:

In [3]:
# Download and unzip tabddpm data
!gdown 1lritJPfRYIMo8-rZgV48gpzgtjZ0yZpm
!unzip -qq -o tabddpm_black_box.zip

Downloading...
From (original): https://drive.google.com/uc?id=1lritJPfRYIMo8-rZgV48gpzgtjZ0yZpm
From (redirected): https://drive.google.com/uc?id=1lritJPfRYIMo8-rZgV48gpzgtjZ0yZpm&confirm=t&uuid=8b4bde80-9529-4280-902a-9c27eb7d2903
To: /Users/johnjewell/Desktop/github/MIDST/starter_kits/tabddpm_black_box.zip
100%|████████████████████████████████████████| 555M/555M [00:14<00:00, 39.2MB/s]


In [4]:
# Download and unzip tabsyn data
!gdown 1zJDAmFYfsHMP7PBhak11yc1PK0vAZGR8
!unzip -qq -o tabsyn_black_box.zip

Downloading...
From (original): https://drive.google.com/uc?id=1zJDAmFYfsHMP7PBhak11yc1PK0vAZGR8
From (redirected): https://drive.google.com/uc?id=1zJDAmFYfsHMP7PBhak11yc1PK0vAZGR8&confirm=t&uuid=d4aa0948-eb93-4d22-bc0b-be6bb89a3cf0
To: /Users/johnjewell/Desktop/github/MIDST/starter_kits/tabsyn_black_box.zip
100%|██████████████████████████████████████| 1.28G/1.28G [00:37<00:00, 34.5MB/s]


### Contents
The archives extract to the `tabddpm_black_box` and `tabsyn_black_box` directories, each containing 3 subdirectories:

- `train`: Comprehensive information (i.e. model weights and architecture, training data, output synthetic data, etc.) about the set of shadow models. Use these to develop your attacks without having to train your own models.
- `dev`: Set of challenge points. Membership predictions for these challenges will be used to evaluate submissions during the competition and update the live scoreboard in CodaBench.
- `eval`: Set of challenge points. Membership predictions for these challenges will be used to evaluate submissions when the competition closes and to determine the final ranking.

The contents of the `train` subdirectory differ slightly between `tabddpm_black_box` and `tabsyn_black_box` because each approach produces its own set of training artifacts. Below we outline the contents of `train` for both TabSyn and TabDDPM, along with `dev` and `eval`, which contain the same file types for each method.

<table>
  <tr> <th>Model - Phase</th> <th>File Name</th> <th>Description</th> </tr>
  <!-- TabDDPM - Train -->
  <tr> <td rowspan="8"><strong>TabDDPM - Train</strong></td> <td>train_with_id.csv</td> <td>Samples used to train the model</td> </tr>
  <tr> <td>trans_domain.json</td> <td>Data domain file</td> </tr>
  <tr> <td>challenge_with_id.csv</td> <td>Challenge points sampled from train data and holdout data</td> </tr>
  <tr> <td>challenge_label.csv</td> <td>The labels for the set of challenge points</td> </tr>
  <tr> <td>trans_label_encoders.pkl</td> <td>Pickled label encoder</td> </tr>
  <tr> <td>cluster_ckpt.pkl</td> <td>Pickled cluster model</td> </tr>
  <tr> <td>None_trans_ckpt.pkl</td> <td>Pickled checkpoint of trained model</td> </tr>
  <tr> <td>trans_synthetic.csv</td> <td>Synthetic data output from the trained model</td> </tr>
  <!-- TabSyn - Train -->
  <tr> <td rowspan="6"><strong>TabSyn - Train</strong></td> <td>train_with_id.csv</td> <td>Samples used to train the model</td> </tr>
  <tr> <td>challenge_with_id.csv</td> <td>Challenge points sampled from train data and holdout data</td> </tr>
  <tr> <td>challenge_label.csv</td> <td>The labels for the set of challenge points</td> </tr>
  <tr> <td>model.pt</td> <td>Pickled checkpoint of trained model</td> </tr>
  <tr> <td>vae/</td> <td>Model artifacts for trained VAE model</td> </tr>
  <tr> <td>trans_synthetic.csv</td> <td>Synthetic data output from the trained model</td> </tr>
  <!-- TabDDPM/TabSyn - Dev -->
  <tr> <td rowspan="2"><strong>TabDDPM/TabSyn - Dev</strong></td> <td>challenge_with_id.csv</td> <td>Challenge points sampled from train data and holdout data</td> </tr>
  <tr> <td>trans_synthetic.csv</td> <td>Synthetic data output from the trained model</td> </tr>
  <!-- TabDDPM/TabSyn - Eval -->
  <tr> <td rowspan="2"><strong>TabDDPM/TabSyn - Eval</strong></td> <td>challenge_with_id.csv</td> <td>Challenge points sampled from train data and holdout data</td> </tr>
  <tr> <td>trans_synthetic.csv</td> <td>Synthetic data output from the trained model</td> </tr>
</table>




## Task

Your task as a competitor is to produce, for each model in `dev` and `eval` in `tabddpm_black_box` and `tabsyn_black_box`, a CSV file listing your confidence scores (values between 0 and 1) for the membership of the challenge examples. You must save these scores in a `prediction.csv` file and place it in the same folder as the corresponding model. A submission to the challenge is an archive containing just these `prediction.csv` files.

**You must submit predictions for both `dev` and `eval` when you submit to CodaBench.**
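
Before packaging, it can help to sanity-check each `prediction.csv`. The sketch below is one possible check, not part of the `midst` package: `validate_predictions` is a hypothetical helper, and it assumes one score per row with no header, which matches how this notebook writes the file further down.

```python
import csv
import os
import tempfile


def validate_predictions(path: str, expected_rows: int) -> list[float]:
    """Read a one-column prediction file and check every score lies in [0, 1].

    Hypothetical helper: assumes one score per row and no header line.
    """
    with open(path, newline="") as f:
        scores = [float(row[0]) for row in csv.reader(f) if row]
    if len(scores) != expected_rows:
        raise ValueError(f"expected {expected_rows} scores, found {len(scores)}")
    # NaN fails this comparison too, so it is reported as out of range.
    bad = [s for s in scores if not 0.0 <= s <= 1.0]
    if bad:
        raise ValueError(f"scores outside [0, 1]: {bad}")
    return scores


# Toy usage on a temporary file with made-up scores.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "prediction.csv")
    with open(demo_path, mode="w", newline="") as f:
        csv.writer(f).writerows([[0.1], [0.5], [0.9]])
    checked = validate_predictions(demo_path, expected_rows=3)
```

In practice, `expected_rows` would be the number of challenge points in the corresponding model folder.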

In the following, we will show you how to compute predictions from a basic membership inference attack and package them as a submission archive. To start, let's create baseline attack models `tabddpm_attack_model` and `tabsyn_attack_model` based on their respective shadow models:

In [5]:
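def get_attack_model(base_train_path: str) -> Callable[[Any], float]:
    # Placeholder baseline: ignores the shadow models under `base_train_path`
    # and returns a uniformly random membership score for every challenge point.
    return lambda x: random.uniform(0, 1)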

base_tabddpm_train_path = os.path.join(TABDDPM_DATA_DIR, "train")
base_tabsyn_train_path = os.path.join(TABSYN_DATA_DIR, "train")
tabddpm_attack_model = get_attack_model(base_tabddpm_train_path)
tabsyn_attack_model = get_attack_model(base_tabsyn_train_path)
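
The `get_attack_model` baseline above scores points at random. One common black-box heuristic is to score each challenge point by its distance to the closest synthetic record, on the intuition that a model may generate points near its training data. The sketch below illustrates that idea only; it is not the organisers' baseline. `dcr_scores` is a hypothetical name, and the code assumes all columns are numeric and in the same order in both arrays.

```python
import numpy as np


def dcr_scores(challenge: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Score each challenge row by closeness to its nearest synthetic row.

    Hypothetical helper: assumes numeric columns in the same order.
    Returns values in (0, 1]; higher means "more likely a training member".
    """
    # Standardise with the synthetic data's statistics so no column dominates.
    mean = synthetic.mean(axis=0)
    std = synthetic.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    c = (challenge - mean) / std
    s = (synthetic - mean) / std

    # Pairwise Euclidean distances, shape (n_challenge, n_synthetic).
    dists = np.linalg.norm(c[:, None, :] - s[None, :, :], axis=2)
    nearest = dists.min(axis=1)

    # Map distance 0 to score 1 and large distances to scores near 0.
    return 1.0 / (1.0 + nearest)


# Toy usage with made-up numbers.
synthetic = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
challenge = np.array([[1.0, 1.0], [10.0, -10.0]])
scores = dcr_scores(challenge, synthetic)
```

The pairwise distance matrix grows with the product of the two table sizes, so on real data you would likely process the challenge points in batches. The shadow models in `train` can be used to check whether such a score separates members from non-members at all.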

Using the attack model, we can obtain predictions for each point in the challenge point set for train, dev and eval:

In [6]:
phases = ["train", "dev", "eval"]

for base_dir, attack_model in zip([TABDDPM_DATA_DIR, TABSYN_DATA_DIR], [tabddpm_attack_model, tabsyn_attack_model]):
    for phase in tqdm(phases, desc="phase"):
        root = os.path.join(base_dir, phase)
        for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
            path = os.path.join(root, model_folder)
    
            challenge_points = get_challenge_points(path)
    
            predictions = torch.Tensor([attack_model(cp) for cp in challenge_points])
           
            assert torch.all((0 <= predictions) & (predictions <= 1))
            with open(os.path.join(path, "prediction.csv"), mode="w", newline="") as file:
                writer = csv.writer(file)
    
                # Write each value in a separate row
                for value in list(predictions.numpy().squeeze()):
                    writer.writerow([value])

phase:   0%|          | 0/3 [00:00<?, ?it/s]

model:   0%|          | 0/30 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

phase:   0%|          | 0/3 [00:00<?, ?it/s]

model:   0%|          | 0/30 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

## Scoring

Let's see how the attack does on `train`, for which we have the ground truth.
When preparing a submission, you can use part of `train` to develop an attack and a held-out part to evaluate your attack.

In [7]:
tpr_at_fpr_list = []
for base_dir in [TABDDPM_DATA_DIR, TABSYN_DATA_DIR]:
    predictions = []
    solutions  = []
    root = os.path.join(base_dir, "train")
    for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
        path = os.path.join(root, model_folder)
        predictions.append(np.loadtxt(os.path.join(path, "prediction.csv")))
        solutions.append(np.loadtxt(os.path.join(path, "challenge_label.csv"), skiprows=1))
    
    predictions = np.concatenate(predictions)
    solutions = np.concatenate(solutions)
    
    tpr_at_fpr = get_tpr_at_fpr(solutions, predictions)
    tpr_at_fpr_list.append(tpr_at_fpr)
    
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{base_dir.split('_')[0]} Train Attack TPR at FPR==10%: {tpr_at_fpr}")

final_tpr_at_fpr = max(tpr_at_fpr_list)
print(f"Final Train Attack TPR at FPR==10%: {final_tpr_at_fpr}")

model:   0%|          | 0/30 [00:00<?, ?it/s]

tabddpm Train Attack TPR at FPR==10%: 0.094


model:   0%|          | 0/30 [00:00<?, ?it/s]

tabsyn Train Attack TPR at FPR==10%: 0.09333333333333334
Final Train Attack TPR at FPR==10%: 0.094
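
The metric reported above is the true positive rate (TPR) at a false positive rate (FPR) of 10%. As a rough illustration of what that measures, the sketch below computes the best TPR among all score thresholds whose FPR does not exceed the limit. It is not necessarily how `midst.metrics.get_tpr_at_fpr` is implemented; `tpr_at_fpr` is a hypothetical name and the labels and scores are made up.

```python
import numpy as np


def tpr_at_fpr(labels: np.ndarray, scores: np.ndarray, max_fpr: float = 0.1) -> float:
    """Best TPR over all score thresholds whose FPR is at most `max_fpr`."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = labels.sum()
    n_neg = (~labels).sum()

    best = 0.0
    # Predict "member" when score >= threshold; try every distinct score.
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        fpr = (predicted & ~labels).sum() / n_neg
        tpr = (predicted & labels).sum() / n_pos
        if fpr <= max_fpr:
            best = max(best, float(tpr))
    return best


# Toy data: 4 members (label 1) and 10 non-members (label 0).
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.75, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
result = tpr_at_fpr(labels, scores, max_fpr=0.1)  # 0.75
```

An attack that guesses at random is expected to reach a TPR roughly equal to the FPR, which is consistent with the values of about 0.09 printed above for the random baseline.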


## Packaging the submission

Now we can store the predictions in zip files, which you can submit to CodaBench. We create separate zip files for dev and eval.

In [8]:
for phase in ["dev", "eval"]:
    with zipfile.ZipFile(f"black_box_{phase}_submission.zip", 'w') as zipf:
        for base_dir in [TABDDPM_DATA_DIR, TABSYN_DATA_DIR]:
            root = os.path.join(base_dir, phase)
            for model_folder in sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])):
                path = os.path.join(root, model_folder)
                if not os.path.isdir(path): continue

                file = os.path.join(path, "prediction.csv")
                if os.path.exists(file):
                    # Use `arcname` to store the path relative to the parent of `base_dir`, so the
                    # archive keeps the `<base_dir>/<phase>/<model_folder>/prediction.csv` layout
                    arcname = os.path.relpath(file, os.path.dirname(base_dir))
                    zipf.write(file, arcname=arcname)
                else:
                    raise FileNotFoundError(f"`prediction.csv` not found in {path}.")

The generated `black_box_dev_submission.zip` and `black_box_eval_submission.zip` can be submitted directly to the respective phases in the CodaBench UI.
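
Before uploading, you may want to confirm that an archive contains only `prediction.csv` files. The following is a minimal sketch of such a check. `list_predictions` is a hypothetical helper, and the model folder name `tabddpm_1` in the demo is made up for illustration.

```python
import os
import tempfile
import zipfile


def list_predictions(zip_path: str) -> list[str]:
    """Return the archive members, raising if any is not a prediction.csv file."""
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())
    unexpected = [n for n in names if not n.endswith("prediction.csv")]
    if unexpected:
        raise ValueError(f"Unexpected files in archive: {unexpected}")
    return names


# Toy usage on a temporary archive (the folder name below is hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo_submission.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("tabddpm_black_box/dev/tabddpm_1/prediction.csv", "0.5\n")
    members = list_predictions(demo_zip)
```

Running the same helper on a real submission archive and printing the returned list gives a quick view of which model folders are covered.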