# Membership Inference over Diffusion-models-based Synthetic Tabular Data (MIDST) Challenge @ SaTML 2025

## White Box Single Table Competition
Welcome to the MIDST challenge!

The MIDST challenge is a series of competitions aiming to quantitatively evaluate the privacy of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs).

This particular competition focuses on white-box MIA against a single-table transaction dataset.

This notebook will walk you through the process of creating and packaging a submission to the white box single table challenge. To start, let's download and extract the competition archive.

In [2]:
!gdown 1BnwygJ8SVKGUCJ6OsexIVWDOdMCIy-DW
!unzip -qq -o whitebox_single_table_tabddpm.zip

Downloading...
From (original): https://drive.google.com/uc?id=1BnwygJ8SVKGUCJ6OsexIVWDOdMCIy-DW
From (redirected): https://drive.google.com/uc?id=1BnwygJ8SVKGUCJ6OsexIVWDOdMCIy-DW&confirm=t&uuid=4248a6af-a5af-4b9d-9ab2-28d3dfb3eea5
To: /Users/johnjewell/Desktop/github/MIDST/starter_kits/whitebox_single_table_tabddpm.zip
100%|████████████████████████████████████████| 958M/958M [00:54<00:00, 17.4MB/s]


## Contents

The archive, extracted under the `whitebox_single_table_tabddpm` folder, contains 3 subdirectories:

- `train`: Models with metadata that lets you reconstruct their full training datasets. Use these to develop your attacks without having to train your own models.
- `dev`: Models with metadata that lets you reconstruct only the set of challenge examples. Membership predictions for these challenges are used to evaluate submissions during the competition and to update the live scoreboard in CodaLab.
- `final`: Models with metadata that lets you reconstruct only the set of challenge examples. Membership predictions for these challenges are used to evaluate submissions when the competition closes and to determine the final ranking.
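
To get a feel for this layout, the sketch below lists the model folders inside one of the subdirectories. It is only an illustration: `list_model_folders` is a hypothetical helper that is not part of the `midst` package, and it assumes model folders are named `<prefix>_<number>`, which is inferred from the sort key used in the cells of this notebook.

```python
import os
from typing import List


def list_model_folders(base_dir: str, phase: str) -> List[str]:
    """Return model folder names under base_dir/phase, sorted by numeric suffix.

    Assumption: model folders are named `<prefix>_<number>`. Anything else
    (plain files, hidden files, folders without a numeric suffix) is skipped.
    """
    root = os.path.join(base_dir, phase)
    folders = []
    for name in os.listdir(root):
        if not os.path.isdir(os.path.join(root, name)):
            continue  # skip stray files such as .DS_Store
        suffix = name.rsplit("_", 1)[-1]
        if "_" in name and suffix.isdigit():
            folders.append(name)
    # Sort numerically so that e.g. `x_2` comes before `x_10`
    return sorted(folders, key=lambda name: int(name.rsplit("_", 1)[-1]))
```

For example, `list_model_folders("whitebox_single_table_tabddpm", "train")` would return the model folders available for developing an attack.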

## Task

Your task as a competitor is to produce, for each model in `dev` and `final`, a CSV file listing your confidence scores (values between 0 and 1) for the membership of the challenge examples. You must save these scores in a `prediction.csv` file and place it in the same folder as the corresponding model. A submission to the challenge is an archive containing just these `prediction.csv` files.

**You must submit predictions for both `dev` and `final` when you submit to CodaLab.**
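
Before packaging, it can help to sanity-check each `prediction.csv`. The sketch below is one possible check, not an official validator: `validate_prediction_file` is a hypothetical helper, and it assumes the file holds one score per row with no header, which mirrors how this notebook writes the file.

```python
import csv
import math
from typing import List, Optional


def validate_prediction_file(path: str, expected_rows: Optional[int] = None) -> List[float]:
    """Read a one-column CSV of scores and check that each lies in [0, 1].

    Assumption: one score per row, no header row.
    """
    scores = []
    with open(path, newline="") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            if len(row) != 1:
                raise ValueError(f"row {row_number}: expected 1 column, got {len(row)}")
            value = float(row[0])
            if math.isnan(value) or not (0.0 <= value <= 1.0):
                raise ValueError(f"row {row_number}: score {value} is outside [0, 1]")
            scores.append(value)
    # Optionally check that there is one score per challenge example
    if expected_rows is not None and len(scores) != expected_rows:
        raise ValueError(f"expected {expected_rows} scores, found {len(scores)}")
    return scores
```

Passing the number of challenge examples as `expected_rows` also catches files that are truncated or that contain extra rows.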

In the following, we will show you how to compute predictions from a basic membership inference attack and package them as a submission archive.

In [3]:
import numpy as np
import torch
import os
import csv

from tqdm.notebook import tqdm
from midst.data import get_features_and_labels

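def get_predictions(labels: torch.Tensor) -> torch.Tensor:
    """
    Placeholder attack: returns a uniformly random score in [0, 1) for each
    challenge example. Replace this with your own membership inference attack.
    """
    return torch.rand(size=labels.size())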

In [4]:
BASE_DATA_DIR = "whitebox_single_table_tabddpm"
phases = ["train", "dev", "eval"]  # must match the subdirectory names in the extracted archive

for phase in tqdm(phases, desc="phase"):
    root = os.path.join(BASE_DATA_DIR, phase)
    for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
        path = os.path.join(root, model_folder)

        features, labels = get_features_and_labels(path)

        predictions = get_predictions(labels)
       
        assert torch.all((0 <= predictions) & (predictions <= 1))
        with open(os.path.join(path, "prediction.csv"), mode="w", newline="") as file:
            writer = csv.writer(file)

            # Write one score per row, with no header
            for value in predictions.numpy().reshape(-1):
                writer.writerow([value])

phase:   0%|          | 0/3 [00:00<?, ?it/s]

model:   0%|          | 0/30 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

## Scoring

Let's see how the attack does on `train`, for which we have the ground truth.
When preparing a submission, you can use part of `train` to develop an attack and a held-out part to evaluate your attack.

In [5]:
from midst.metrics import get_tpr_at_fpr

FPR_THRESHOLD = 0.1

all_scores = {}
phases = ['train']

for phase in tqdm(phases, desc="phase"):
    predictions = []
    solutions = []

    root = os.path.join(BASE_DATA_DIR, phase)
    for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
        path = os.path.join(root, model_folder)
        predictions.append(np.loadtxt(os.path.join(path, "prediction.csv")))
        solutions.append(np.loadtxt(os.path.join(path, "challenge_label.csv"), skiprows=1))

    predictions = np.concatenate(predictions)
    solutions = np.concatenate(solutions)

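    # Score the challenge points of all models in this phase together
    tpr_at_fpr = get_tpr_at_fpr(solutions, predictions)
    all_scores[phase] = tpr_at_fpr
    print(f"{phase}: TPR at FPR = {tpr_at_fpr}")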

phase:   0%|          | 0/1 [00:00<?, ?it/s]

model:   0%|          | 0/30 [00:00<?, ?it/s]
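
For intuition about the metric, here is a minimal sketch of the true positive rate (TPR) at a fixed false positive rate (FPR). `tpr_at_fpr_sketch` is a hypothetical helper, not the implementation in `midst.metrics`: it scans each distinct score as a threshold without interpolating the ROC curve, so its value may differ slightly from what `get_tpr_at_fpr` returns. The default `max_fpr=0.1` is chosen to mirror `FPR_THRESHOLD` above.

```python
import numpy as np


def tpr_at_fpr_sketch(labels, scores, max_fpr=0.1):
    """Best TPR over all score thresholds whose FPR does not exceed max_fpr.

    Illustrative only: no ROC interpolation is performed.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_members = labels.sum()
    n_non_members = (~labels).sum()
    if n_members == 0 or n_non_members == 0:
        raise ValueError("need at least one member and one non-member")

    best_tpr = 0.0
    # Predict "member" whenever score >= threshold, for each distinct score
    for threshold in np.unique(scores):
        predicted_member = scores >= threshold
        fpr = (predicted_member & ~labels).sum() / n_non_members
        tpr = (predicted_member & labels).sum() / n_members
        if fpr <= max_fpr:
            best_tpr = max(best_tpr, float(tpr))
    return best_tpr
```

With the random placeholder attack, the score should sit close to the FPR budget itself, since random scores flag members and non-members at about the same rate.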

## Packaging the submission

Now we can store the predictions into a zip file, which you can submit to CodaLab.

In [6]:
import zipfile

phases = ['dev', 'eval']

with zipfile.ZipFile("predictions_whitebox_single_table_tabddpm.zip", 'w') as zipf:
    for phase in tqdm(phases, desc="phase"):
        root = os.path.join(BASE_DATA_DIR, phase)
        for model_folder in tqdm(sorted(os.listdir(root), key=lambda d: int(d.split('_')[1])), desc="model"):
            path = os.path.join(root, model_folder)
            file = os.path.join(path, "prediction.csv")
            if os.path.exists(file):
                zipf.write(file)
            else:
                raise FileNotFoundError(f"`prediction.csv` not found in {path}. You need to provide predictions for all challenges")

phase:   0%|          | 0/2 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]

model:   0%|          | 0/10 [00:00<?, ?it/s]
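
As a final check before uploading, you may want to confirm what actually ended up in the archive. The sketch below is one way to do that: `check_submission_archive` is a hypothetical helper, and it assumes each phase name appears as a folder component in the archived paths, as it does when files are added with `zipf.write(file)` above.

```python
import zipfile
from typing import Dict, Sequence


def check_submission_archive(zip_path: str, phases: Sequence[str]) -> Dict[str, int]:
    """Count `prediction.csv` entries per phase and reject any other file.

    Assumption: each archived path contains its phase name as a folder component.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Ignore explicit directory entries, which end with "/"
        names = [name for name in archive.namelist() if not name.endswith("/")]

    unexpected = [name for name in names if name.rsplit("/", 1)[-1] != "prediction.csv"]
    if unexpected:
        raise ValueError(f"unexpected files in archive: {unexpected}")

    counts = {phase: 0 for phase in phases}
    for name in names:
        folders = name.split("/")[:-1]
        for phase in phases:
            if phase in folders:
                counts[phase] += 1

    missing = [phase for phase, count in counts.items() if count == 0]
    if missing:
        raise ValueError(f"no predictions found for phases: {missing}")
    return counts
```

Calling `check_submission_archive("predictions_whitebox_single_table_tabddpm.zip", phases)` with the `phases` list from the packaging cell returns the number of prediction files found for each phase, which should match the number of models in the corresponding folder.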