Skip to content

Conversation

emersodb
Copy link
Collaborator

PR Type

Feature

Short Description

Clickup Ticket(s): Link(s) if applicable.

This PR integrates the last two metrics associated with the integration of SynthEval metrics into the library. These are the Nearest Neighbor Distance Ratio and Epsilon Identifiability privacy metrics.

  • NNDR considers the average ratio between the closest and next closet point in a real data set for each point in the synthetic dataset. (closer to 1 is better).
  • EIR computes the percentage of points in a real dataset that are closer to a synthetic data point than another real data point in the same dataset. (closer to 0 is better).

NOTE: The SynthEval implementation is flawed. Rather than computing the NNDR for synthetic data points to real data points, it computes the NNDR of real data points to synthetic datapoints. This is not what you want to do in order to measure privacy (see https://arxiv.org/pdf/2501.03941). The second computation is correct if you want to do membership inference (i.e. as Sara and Fatemeh are working on).

Tests Added

Added a fair number of tests to verify the computations.

@emersodb emersodb changed the base branch from main to dbe/add_f1_dff_hitting_rate September 23, 2025 14:28
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


@overload
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this to it's own module, as it is useful for computing NNDR as well.

``target_data.``
"""
if batch_size is None:
# If batch size isn't specified, do it all at once.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small bug (didn't make code wrong, just bad batch size estimate).

numerical_column_idx = numerical_column_idx + target_col_idx
else:
categorical_column_idx = categorical_column_idx + target_col_idx
if "target_col_idx" in meta_info:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loosening this preprocessing to admit settings where no target column exists.

@emersodb emersodb marked this pull request as ready for review September 23, 2025 14:56
def test_get_categorical_columns() -> None:
# Low threshold
categorical_columns = get_categorical_columns(TEST_DATAFRAME, 2)
# Note that this does not include the date time column, as it isn't a categorical, as the detection algorithm
Copy link
Collaborator Author

@emersodb emersodb Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_categorical columns functionality here is a convenience function and it's unclear how a user would want to treat data time objects. So we're sort of punting here. Ideally a user would have done "something" to make their datatime objects numerical or categorical (where date time objects can general exists on a spectrum of these).

Copy link
Collaborator

@bzamanlooy bzamanlooy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a few comments that might help improve the readability of the documentation. The code itself looks good to me :)

Copy link
Collaborator

@lotif lotif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with small comments.

some way. This can be done via the ``preprocess`` function beforehand or it can be done within compute if
``do_preprocess`` is True and ``meta_info`` has been provided.
some way. This can be done via the ``preprocess`` function in ``distance_preprocess.py`` beforehand or it can
be done within compute if ``do_preprocess`` is True and ``meta_info`` has been provided.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compute is a function, right? It should be wrapped in "``" if so.



@overload
def preprocess(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can those functions have better names? preprocess is super generic, maybe something like preprocess_for_distance_calulation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fair. Will change

holdout set represents real data that was NOT.
NOTE: Columns are not uniformly weighted. They are weighted by their inverse column entropy to provide
greater attention to rare data points. This is formally defined in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's missing something here... maybe a : or This is formally defined by the paper below:.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

filtered_real_data = real_data[self.numerical_columns]
filtered_synthetic_data = synthetic_data[self.numerical_columns]
filtered_holdout_data = holdout_data[self.numerical_columns] if holdout_data is not None else None
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit dangerous because if we ever add another EpsilonIdentifiabilityNorm element it will silently fall in here if we forget to modify this. Changing this to elif self.norm == EpsilonIdentifiabilityNorm.GOWER will be better.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Changing now.

from midst_toolkit.evaluation.privacy.distance_utils import NormType, compute_top_k_distances


DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not have this as a module level variable? This variable is set every time this is imported by other modules and it's also set by other modules under the same name. Can we have this variable be defined in an utils module, or maybe be retruned by function, and then be reused here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that's worthwhile. Will add a function I think.

norm: Determines what norm the distances are computed in. Defaults to NormType.L2.
batch_size: Batch size used to compute the NNDR iteratively. Just needed to manage memory. Defaults to
1000.
device: What device the tensors should be sent to in order to perform the calculations. Defaults to DEVICE.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add some more context here: Defaults to "cuda" if CUDA is available, "cpu" othwerwise.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

If None, then no preprocessing is expected to be done. Defaults to None.
do_preprocess: Whether or not to preprocess the dataframes before performing the NNDR calculations.
Preprocessing is performed with the ``preprocess`` function of ``distance_preprocess.py``. Defaults to
False.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mention here that meta_info should be provided if this is set to True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Base automatically changed from dbe/add_f1_dff_hitting_rate to main September 30, 2025 14:31
@emersodb emersodb merged commit d231a4a into main Sep 30, 2025
5 checks passed
@emersodb emersodb deleted the dbe/add_nndr_and_eir branch September 30, 2025 14:39
bzamanlooy pushed a commit that referenced this pull request Oct 10, 2025
04d817f Trainer: Changing all literals to enums (#48)
33f9306 Adding ignore for pip 25.2 vulnerability, removing stale ones (#50)
c8d30ca Remove mkdocs build dir, add to gitignore (#49)
c01121c End-to-end Evaluation Script Example (#45)
d231a4a Add Nearest Neighbor Distance Ratio and Epsilon Identifiability Privacy Metrics (#42)
53e423b Adding Mean F1 Score Difference and Hitting Rate Metrics (#39)
91150fd Adding in Hellinger and pMSE metrics (#38)
746644e Tightening Ruff Configuration (#46)
580d55f Adding data_split_ratios to both the diffusion config and the classifier config (#47)
72863be Refactoring core.logger into common.logger and removing it (#41)
bef53bf Train code split, Part 4: moving some of the model.py code into dataset.py (#40)
80b0154 Upgrading pip to latest version to solve security issue (#44)
83beba6 New mypy flow and fixes to typing issues that were discovered (#43)
7e77f37 Train code split, Part 3: moving some of the model.py code into sampler.py (#9)

git-subtree-dir: deps/midst-toolkit
git-subtree-split: 04d817f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants