# Membership Inference over Diffusion-models-based Synthetic Tabular Data (MIDST) Challenge @ SaTML 2025.

## White Box Single Table Competition
Welcome to the MIDST challenge!

The MIDST challenge is a series of competitions aiming to quantitatively evaluate the privacy of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs).

This particular competition focuses on white-box MIAs against a single-table transaction dataset.

This notebook walks you through creating and packaging a submission to the white-box single-table challenge.

## Package Imports and Environment Setup

To start, let's import the required packages and define global variables:

In [1]:
import csv
import os
import random

from pathlib import Path
from functools import partial
from typing import Callable, Any

import numpy as np
import torch

from tqdm.notebook import tqdm
from midst.data import get_challenge_points
from midst.metrics import get_tpr_at_fpr

In [2]:
BASE_DATA_DIR = "whitebox_single_table_tabddpm"

## Data

Next, let's download and extract the data for the competition:

In [3]:
!gdown 1BnwygJ8SVKGUCJ6OsexIVWDOdMCIy-DW
!unzip -qq -o whitebox_single_table_tabddpm.zip

Downloading...
From (original): https://drive.google.com/uc?id=1BnwygJ8SVKGUCJ6OsexIVWDOdMCIy-DW
From (redirected): https://drive.google.com/uc?id=1BnwygJ8SVKGUCJ6OsexIVWDOdMCIy-DW&confirm=t&uuid=8ca21d26-54af-49dc-be0a-6f7d8687c088
To: /Users/johnjewell/Desktop/github/MIDST/starter_kits/whitebox_single_table_tabddpm.zip
100%|████████████████████████████████████████| 958M/958M [00:44<00:00, 21.5MB/s]
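
If the `unzip` command is not available in your environment, the archive can also be extracted from Python. The following is a minimal sketch using only the standard library; the helper name `extract_archive` is hypothetical and not part of the `midst` package.

```python
import zipfile


def extract_archive(zip_path, dest_dir="."):
    """Extract `zip_path` into `dest_dir` and return its top-level entry names.

    Hypothetical helper for illustration; it overwrites existing files,
    much like `unzip -o` does.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        # Archive member names always use "/" as the separator.
        top_level = {name.split("/")[0] for name in archive.namelist()}
    return sorted(top_level)
```

Calling `extract_archive("whitebox_single_table_tabddpm.zip")` after the download should leave the data in the folder named by `BASE_DATA_DIR`.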


### Contents

The archive is extracted into the `whitebox_single_table_tabddpm` folder, which contains three subdirectories:

- `train`: Models with metadata that lets you reconstruct their full training datasets. Use these to develop your attacks without having to train your own models.
- `dev`: Models with metadata that lets you reconstruct only the set of challenge examples. Membership predictions for these challenges are used to evaluate submissions during the competition and to update the live scoreboard in CodaBench.
- `eval`: Models with metadata that lets you reconstruct only the set of challenge examples. Membership predictions for these challenges are used to evaluate submissions when the competition closes and to determine the final ranking.
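
As a quick sanity check on the extracted layout, you can count the model folders in each subdirectory. This is a minimal sketch; the helper name `count_model_folders` is hypothetical, and it only assumes that each phase directory holds one subfolder per model.

```python
from pathlib import Path


def count_model_folders(base_dir, phases=("train", "dev", "eval")):
    """Return {phase: number of subfolders}, with 0 for a missing phase."""
    counts = {}
    for phase in phases:
        phase_dir = Path(base_dir) / phase
        if phase_dir.is_dir():
            # Count only directories; stray files are ignored.
            counts[phase] = sum(1 for entry in phase_dir.iterdir() if entry.is_dir())
        else:
            counts[phase] = 0
    return counts
```

For example, `count_model_folders(BASE_DATA_DIR)` returns a dictionary keyed by `train`, `dev`, and `eval`; a count of 0 suggests the archive was not extracted where expected.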

## Task

Your task as a competitor is to produce, for each model in `dev` and `eval`, a CSV file listing your confidence scores (values between 0 and 1) for the membership of the challenge examples. Save these scores in a `prediction.csv` file placed in the same folder as the corresponding model. A submission to the challenge is an archive containing just these `prediction.csv` files.

**You must submit predictions for both `dev` and `eval` when you submit to CodaBench.**

In the following, we will show you how to compute predictions from a basic membership inference attack and package them as a submission archive. To start, let's create a baseline attack model using the provided training data:

In [4]:
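def get_attack_model(base_train_path: str) -> Callable[[Any], float]:
    # Placeholder baseline: ignores the training data and returns a uniformly
    # random confidence in [0, 1] for any challenge point.
    return lambda x: random.uniform(0, 1)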

base_train_path = os.path.join(BASE_DATA_DIR, "train")
attack_model = get_attack_model(base_train_path)
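
A real attack usually produces an unbounded signal per challenge point rather than a value in [0, 1]. The sketch below shows one way to rescale such signals into the required range with min-max scaling. The function name `scores_to_confidence` and the input `raw_scores` are hypothetical, and the sketch assumes that larger raw values mean "more likely a member". Because the scaling is monotone, it keeps the ordering of the points, which is what a threshold-based metric depends on.

```python
import numpy as np


def scores_to_confidence(raw_scores):
    """Min-max scale raw attack signals into [0, 1], preserving their order.

    Hypothetical helper: assumes higher raw score = more likely a member.
    """
    raw = np.asarray(raw_scores, dtype=float)
    if raw.size == 0:
        return raw
    low, high = raw.min(), raw.max()
    if high == low:
        # All signals are identical, so there is no ranking information.
        return np.full(raw.shape, 0.5)
    return (raw - low) / (high - low)
```

Scaling per model changes how scores from different models compare once they are pooled, so whether to scale per model or across all models is a design choice worth checking on `train`.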

Using the attack model, we can obtain predictions for each point in the challenge point set for `train`, `dev`, and `eval`:

In [5]:
phases = ["train", "dev", "eval"]

for phase in tqdm(phases, desc="phase"):
    root = os.path.join(BASE_DATA_DIR, phase)
    for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
        path = os.path.join(root, model_folder)

        challenge_points = get_challenge_points(path)

        predictions = torch.Tensor([attack_model(cp) for cp in challenge_points])
       
        assert torch.all((0 <= predictions) & (predictions <= 1))
        with open(os.path.join(path, "prediction.csv"), mode="w", newline="") as file:
            writer = csv.writer(file)

            # Write each value in a separate row
            for value in list(predictions.numpy().squeeze()):
                writer.writerow([value])

phase:   0%|          | 0/3 [00:00<?, ?it/s]

model:   0%|          | 0/30 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]
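
Before packaging, it can help to read a written `prediction.csv` back and confirm that it has the expected shape. This is a minimal sketch under the assumptions stated in the task description (one score per row, each between 0 and 1); the helper name `read_and_check_predictions` and the optional `expected_rows` argument are hypothetical.

```python
import csv


def read_and_check_predictions(csv_path, expected_rows=None):
    """Read one score per row and verify each lies in [0, 1]."""
    with open(csv_path, newline="") as file:
        rows = [row for row in csv.reader(file) if row]  # skip blank lines

    scores = []
    for index, row in enumerate(rows):
        if len(row) != 1:
            raise ValueError(f"row {index} has {len(row)} columns, expected 1")
        value = float(row[0])
        # NaN fails this comparison too, so it is rejected as well.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"row {index} has out-of-range score {value}")
        scores.append(value)

    if expected_rows is not None and len(scores) != expected_rows:
        raise ValueError(f"found {len(scores)} scores, expected {expected_rows}")
    return scores
```

Passing `expected_rows=len(challenge_points)` for the matching model folder also catches files that are missing predictions.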

## Scoring

Let's see how the attack does on `train`, for which we have the ground truth.
When preparing a submission, you can use part of `train` to develop an attack and a held-out part to evaluate your attack.

In [6]:
predictions = []
solutions = []

root = os.path.join(BASE_DATA_DIR, "train")
for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
    path = os.path.join(root, model_folder)
    predictions.append(np.loadtxt(os.path.join(path, "prediction.csv")))
    solutions.append(np.loadtxt(os.path.join(path, "challenge_label.csv"), skiprows=1))

predictions = np.concatenate(predictions)
solutions = np.concatenate(solutions)

tpr_at_fpr = get_tpr_at_fpr(solutions, predictions)

print(f"Train TPR at FPR==10%: {tpr_at_fpr}")

model:   0%|          | 0/30 [00:00<?, ?it/s]

Train TPR at FPR==10%: 0.09266666666666666
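
For intuition about the metric, the sketch below computes a true positive rate at a bounded false positive rate with plain NumPy. It is an illustration only and is not the implementation of `get_tpr_at_fpr` in `midst.metrics`; the function name `tpr_at_fpr_sketch`, the default `max_fpr=0.1`, and the use of `<=` at the FPR boundary are assumptions.

```python
import numpy as np


def tpr_at_fpr_sketch(labels, scores, max_fpr=0.1):
    """Best TPR over all score thresholds whose FPR is at most `max_fpr`.

    A point is predicted "member" when its score is >= the threshold.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one member and one non-member")

    best_tpr = 0.0
    for threshold in np.unique(scores):
        predicted_member = scores >= threshold
        fpr = (predicted_member & ~labels).sum() / n_neg
        tpr = (predicted_member & labels).sum() / n_pos
        if fpr <= max_fpr:
            best_tpr = max(best_tpr, float(tpr))
    return best_tpr
```

Under this definition, an attack that guesses at random is expected to score close to the FPR budget itself, which is consistent with the value of roughly 0.093 printed for the random baseline.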


## Packaging the submission

Now we can store the predictions into a zip file, which you can submit to CodaBench.

In [7]:
import zipfile

phases = ['dev', 'eval']

with zipfile.ZipFile("predictions_whitebox_single_table_tabddpm.zip", 'w') as zipf:
    for phase in tqdm(phases, desc="phase"):
        root = os.path.join(BASE_DATA_DIR, phase)
        for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
            path = os.path.join(root, model_folder)
            file = os.path.join(path, "prediction.csv")
            if os.path.exists(file):
                zipf.write(file)
            else:
                raise FileNotFoundError(f"`prediction.csv` not found in {path}. You need to provide predictions for all challenges")

phase:   0%|          | 0/2 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]
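
Before uploading, you may want to confirm that the archive contains only `prediction.csv` files, as the task description requires. The following is a minimal sketch using the standard library; the helper name `list_prediction_files` is hypothetical.

```python
import zipfile


def list_prediction_files(zip_path):
    """Return the sorted file names in the archive.

    Raises ValueError if any file is not named `prediction.csv`.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Skip explicit directory entries, which end with "/".
        names = [name for name in archive.namelist() if not name.endswith("/")]

    unexpected = [name for name in names if name.rsplit("/", 1)[-1] != "prediction.csv"]
    if unexpected:
        raise ValueError(f"unexpected files in archive: {unexpected}")
    return sorted(names)
```

Running `list_prediction_files("predictions_whitebox_single_table_tabddpm.zip")` should list one entry per model folder in `dev` and `eval`.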