Add Nearest Neighbor Distance Ratio and Epsilon Identifiability Privacy Metrics #42

emersodb · 2025-09-23T14:28:25Z

PR Type

Feature

Short Description

Clickup Ticket(s): Link(s) if applicable.

This PR adds the last two metrics from the SynthEval integration to the library: the Nearest Neighbor Distance Ratio (NNDR) and Epsilon Identifiability Risk (EIR) privacy metrics.

NNDR is the average, over points in the synthetic dataset, of the ratio between the distance to the closest point and the distance to the next-closest point in the real dataset (closer to 1 is better).
EIR is the percentage of points in a real dataset that are closer to a synthetic data point than to any other real data point in the same dataset (closer to 0 is better).
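
A minimal sketch of the EIR idea described above, assuming plain unweighted Euclidean distance and tiny in-memory lists of tuples. The function name and the toy data are hypothetical and are not the toolkit's implementation.

```python
import math


def epsilon_identifiability_risk(real, synthetic):
    """Fraction of real points whose nearest synthetic point is closer than their nearest other real point."""
    at_risk = 0
    for i, point in enumerate(real):
        # Distance to the closest *other* real point (the point itself is excluded).
        nearest_real = min(math.dist(point, other) for j, other in enumerate(real) if j != i)
        # Distance to the closest synthetic point.
        nearest_synthetic = min(math.dist(point, s) for s in synthetic)
        if nearest_synthetic < nearest_real:
            at_risk += 1
    return at_risk / len(real)


real = [(0.0, 0.0), (3.0, 0.0), (3.0, 1.0)]
synthetic = [(0.1, 0.0), (10.0, 10.0)]
# Only the first real point has a synthetic neighbor closer than any other real point.
risk = epsilon_identifiability_risk(real, synthetic)
```

In this toy case one of the three real points is flagged, so the risk is one third.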

NOTE: The SynthEval implementation is flawed. Rather than computing the NNDR from synthetic data points to real data points, it computes the NNDR from real data points to synthetic data points. This is not what you want when measuring privacy (see https://arxiv.org/pdf/2501.03941). The second computation is correct if you want to do membership inference (i.e., as Sara and Fatemeh are working on).
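
To make the direction issue concrete, here is a rough sketch, assuming unweighted Euclidean distance and at least two reference points. The helper name is hypothetical, and treating a zero second-closest distance as a ratio of 1.0 is an arbitrary choice made only to avoid dividing by zero.

```python
import math


def nearest_neighbor_distance_ratio(queries, references):
    """Mean ratio of closest to second-closest reference distance, taken over the query points."""
    ratios = []
    for query in queries:
        distances = sorted(math.dist(query, ref) for ref in references)
        closest, second_closest = distances[0], distances[1]
        # Assumption: a zero second-closest distance is treated as a ratio of 1.0.
        ratios.append(closest / second_closest if second_closest > 0 else 1.0)
    return sum(ratios) / len(ratios)


real = [(0.0, 0.0), (4.0, 0.0)]
synthetic = [(1.0, 0.0), (2.0, 0.0)]

# Privacy direction: each synthetic point queries the real dataset.
privacy_nndr = nearest_neighbor_distance_ratio(synthetic, real)
# Reversed direction (real points query the synthetic dataset), as in the SynthEval computation.
reversed_nndr = nearest_neighbor_distance_ratio(real, synthetic)
```

The two calls differ only in which dataset supplies the query points, yet they give different values on the same data (2/3 versus 7/12 here), which is why the direction matters.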

Tests Added

Added a fair number of tests to verify the computations.

emersodb · 2025-09-23T14:30:01Z

src/midst_toolkit/evaluation/privacy/distance_closest_record.py

 DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


-@overload


Moving this to its own module, as it is useful for computing NNDR as well.

emersodb · 2025-09-23T14:34:54Z

src/midst_toolkit/evaluation/privacy/distance_utils.py

        ``target_data.``
    """
    if batch_size is None:
        # If batch size isn't specified, do it all at once.


Small bug (didn't make code wrong, just bad batch size estimate).
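
For context, the batching pattern in this snippet can be sketched as follows. This is a simplified stand-in with a hypothetical helper name, not the actual code in ``distance_utils.py``; it uses plain Python lists rather than tensors.

```python
import math


def batched_min_distances(queries, targets, batch_size=None):
    """Distance from each query to its nearest target, computed one batch of queries at a time."""
    if batch_size is None:
        # If batch size isn't specified, do it all at once (at least 1 so range() has a valid step).
        batch_size = max(len(queries), 1)
    results = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start : start + batch_size]
        results.extend(min(math.dist(q, t) for t in targets) for q in batch)
    return results


queries = [(0.0, 0.0), (3.0, 4.0)]
targets = [(0.0, 1.0), (3.0, 0.0)]
all_at_once = batched_min_distances(queries, targets)
one_by_one = batched_min_distances(queries, targets, batch_size=1)
```

The batch size only affects how much is held in memory at a time; the results are the same either way.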

emersodb · 2025-09-23T14:48:11Z

src/midst_toolkit/evaluation/utils.py

-        numerical_column_idx = numerical_column_idx + target_col_idx
-    else:
-        categorical_column_idx = categorical_column_idx + target_col_idx
+    if "target_col_idx" in meta_info:


Loosening this preprocessing to admit settings where no target column exists.
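
A rough sketch of what the loosened check allows. Only the ``target_col_idx`` key comes from the diff above; the function name and the ``num_col_idx``, ``cat_col_idx`` and ``task_type`` keys are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple


def split_column_indices(meta_info: Dict[str, Any]) -> Tuple[List[int], List[int]]:
    """Return (numerical, categorical) column indices, adding the target column only if one exists."""
    numerical = list(meta_info["num_col_idx"])  # assumed key name
    categorical = list(meta_info["cat_col_idx"])  # assumed key name
    if "target_col_idx" in meta_info:
        target = list(meta_info["target_col_idx"])
        # Assumption: a regression target is numerical, anything else is categorical.
        if meta_info.get("task_type") == "regression":
            numerical += target
        else:
            categorical += target
    return numerical, categorical


# No target column at all: the indices pass through unchanged.
no_target = split_column_indices({"num_col_idx": [0, 1], "cat_col_idx": [2]})
```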

emersodb · 2025-09-24T16:19:31Z

tests/unit/data_processing/test_utils.py

 def test_get_categorical_columns() -> None:
    # Low threshold
    categorical_columns = get_categorical_columns(TEST_DATAFRAME, 2)
+    # Note that this does not include the date time column, as it isn't a categorical, as the detection algorithm


The get_categorical_columns functionality here is a convenience function, and it's unclear how a user would want to treat datetime objects. So we're sort of punting here. Ideally a user would have done "something" to make their datetime objects numerical or categorical (datetime objects can generally exist on a spectrum between the two).

bzamanlooy

I just have a few comments that might help improve the readability of the documentation. The code itself looks good to me :)

src/midst_toolkit/evaluation/privacy/distance_utils.py

src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py

src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py

lotif

Approved with small comments.

lotif · 2025-09-29T19:26:16Z

src/midst_toolkit/evaluation/privacy/distance_closest_record.py

-        some way. This can be done via the ``preprocess`` function beforehand or it can be done within compute if
-        ``do_preprocess`` is True and ``meta_info`` has been provided.
+        some way. This can be done via the ``preprocess`` function in ``distance_preprocess.py`` beforehand or it can
+        be done within compute if ``do_preprocess`` is True and ``meta_info`` has been provided.


compute is a function, right? It should be wrapped in "``" if so.

lotif · 2025-09-29T19:28:39Z

src/midst_toolkit/evaluation/privacy/distance_preprocess.py

+
+
+@overload
+def preprocess(


Can those functions have better names? preprocess is super generic, maybe something like preprocess_for_distance_calculation?

Yeah that's fair. Will change

lotif · 2025-09-29T19:42:21Z

src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py

+        holdout set represents real data that was NOT.
+
+        NOTE: Columns are not uniformly weighted. They are weighted by their inverse column entropy to provide
+        greater attention to rare data points. This is formally defined in


Looks like it's missing something here... maybe a ":" or "This is formally defined by the paper below:".

lotif · 2025-09-29T19:45:32Z

src/midst_toolkit/evaluation/privacy/epsilon_identifiability_risk.py

+            filtered_real_data = real_data[self.numerical_columns]
+            filtered_synthetic_data = synthetic_data[self.numerical_columns]
+            filtered_holdout_data = holdout_data[self.numerical_columns] if holdout_data is not None else None
+        else:


This is a bit dangerous because if we ever add another EpsilonIdentifiabilityNorm element it will silently fall in here if we forget to modify this. Changing this to elif self.norm == EpsilonIdentifiabilityNorm.GOWER will be better.
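
A small sketch of the suggested pattern. ``EpsilonIdentifiabilityNorm`` and its ``GOWER`` member appear in this thread; the ``L2`` member, the enum values and the function name are assumptions for illustration.

```python
from enum import Enum
from typing import List


class EpsilonIdentifiabilityNorm(Enum):
    L2 = "l2"  # assumed member name
    GOWER = "gower"


def select_columns(norm, numerical_columns: List[str], all_columns: List[str]) -> List[str]:
    """Pick the columns to use for a norm, failing loudly on anything unrecognized."""
    if norm == EpsilonIdentifiabilityNorm.L2:
        # Assumption: the numerical-only branch in the diff corresponds to L2.
        return list(numerical_columns)
    elif norm == EpsilonIdentifiabilityNorm.GOWER:
        return list(all_columns)
    else:
        # A newly added enum member lands here instead of silently reusing the Gower branch.
        raise ValueError(f"Unsupported norm: {norm}")


columns = select_columns(EpsilonIdentifiabilityNorm.GOWER, ["age"], ["age", "city"])
```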

Good point. Changing now.

lotif · 2025-09-29T19:54:52Z

src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py

+from midst_toolkit.evaluation.privacy.distance_utils import NormType, compute_top_k_distances
+
+
+DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")


Can we not have this as a module-level variable? This variable is set every time this is imported by other modules, and it's also set by other modules under the same name. Can we have this variable be defined in a utils module, or maybe be returned by a function, and then be reused here?

Yeah I think that's worthwhile. Will add a function I think.
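
One possible shape for such a helper, as a sketch only. The function name is hypothetical, it returns a device name string rather than a ``torch.device``, and the fallback to "cpu" when torch is not installed is an addition made so the snippet runs anywhere.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_default_device_name() -> str:
    """Return "cuda" if CUDA is available, "cpu" otherwise. The result is computed once and cached."""
    try:
        import torch
    except ImportError:
        # Assumption: without torch there is no GPU path, so fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device_name = get_default_device_name()
```

Callers could then build the device with ``torch.device(get_default_device_name())`` instead of each module defining its own ``DEVICE`` constant.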

lotif · 2025-09-29T19:57:59Z

src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py

+            norm: Determines what norm the distances are computed in. Defaults to NormType.L2.
+            batch_size: Batch size used to compute the NNDR iteratively. Just needed to manage memory. Defaults to
+                1000.
+            device: What device the tensors should be sent to in order to perform the calculations. Defaults to DEVICE.


I'd add some more context here: Defaults to "cuda" if CUDA is available, "cpu" otherwise.

lotif · 2025-09-29T19:59:20Z

src/midst_toolkit/evaluation/privacy/nearest_neighbor_distance_ratio.py

+                If None, then no preprocessing is expected to be done. Defaults to None.
+            do_preprocess: Whether or not to preprocess the dataframes before performing the NNDR calculations.
+                Preprocessing is performed with the ``preprocess`` function of ``distance_preprocess.py``. Defaults to
+                False.


We should mention here that meta_info should be provided if this is set to True.
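
The dependency between the two arguments could also be enforced at runtime. A minimal sketch, with a hypothetical function name and error message:

```python
from typing import Any, Dict, Optional


def check_preprocess_arguments(do_preprocess: bool, meta_info: Optional[Dict[str, Any]] = None) -> None:
    """Raise early if preprocessing is requested without the metadata it needs."""
    if do_preprocess and meta_info is None:
        raise ValueError("meta_info must be provided when do_preprocess is True.")


# Fine: no preprocessing requested, so meta_info may be omitted.
check_preprocess_arguments(do_preprocess=False)
```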

04d817f Trainer: Changing all literals to enums (#48)
33f9306 Adding ignore for pip 25.2 vulnerability, removing stale ones (#50)
c8d30ca Remove mkdocs build dir, add to gitignore (#49)
c01121c End-to-end Evaluation Script Example (#45)
d231a4a Add Nearest Neighbor Distance Ratio and Epsilon Identifiability Privacy Metrics (#42)
53e423b Adding Mean F1 Score Difference and Hitting Rate Metrics (#39)
91150fd Adding in Hellinger and pMSE metrics (#38)
746644e Tightening Ruff Configuration (#46)
580d55f Adding data_split_ratios to both the diffusion config and the classifier config (#47)
72863be Refactoring core.logger into common.logger and removing it (#41)
bef53bf Train code split, Part 4: moving some of the model.py code into dataset.py (#40)
80b0154 Upgrading pip to latest version to solve security issue (#44)
83beba6 New mypy flow and fixes to typing issues that were discovered (#43)
7e77f37 Train code split, Part 3: moving some of the model.py code into sampler.py (#9)
git-subtree-dir: deps/midst-toolkit
git-subtree-split: 04d817f

emersodb and others added 11 commits September 17, 2025 09:27

First checkin of hellinger and pmse implementations

adae16e

Fix typing issue

a790991

Adding in Hitting Rate and Mean F1 Difference implementations. Also f…

43834a4

…ixing a preprocessing bug and changing the name of the syntheval metric base class.

Removing hard coding

bec6418

Some CR fixes from Marcelo's review

fb3c48a

Merge branch 'main' into dbe/add_hellinger_pmse

1705d50

Train code split, Part 2: moving some of the model.py code into clust…

c286906

…ering.py (#8) Moving the clustering code into its own clustering.py module and adding docstrings. Also, moving some common parameter type definitions to a params.py module.

Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate

072a09a

NNDR module and tests

81a40ba

Fixing small bug

c5a0682

Adding in the epsilon identifiability risk metric

704e48f

emersodb requested review from amrit110, bzamanlooy, fatemetkl, lotif, masi-sh and sarakodeiri September 23, 2025 14:28

emersodb changed the base branch from main to dbe/add_f1_dff_hitting_rate September 23, 2025 14:28

emersodb commented Sep 23, 2025

Small code fixes and documentation improvements

884c582

emersodb marked this pull request as ready for review September 23, 2025 14:56

emersodb added 7 commits September 24, 2025 11:09

New mypy flow and fixes to typing issues that were discovered

59ea7f4

Merge branch 'dbe/fixing_mypy' into dbe/add_hellinger_pmse

660729f

Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate

4023d21

Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir

b985246

Some small updates

1b51d23

Merge branch 'main' into dbe/add_hellinger_pmse

5bd360c

Adding in a bit more revealing testing for the categorical column det…

810f3d6

…ection

emersodb commented Sep 24, 2025

emersodb added 11 commits September 25, 2025 15:09

Merge branch 'main' into dbe/add_hellinger_pmse

8328547

Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate

2ccdfa6

Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir

0147eb1

Merge branch 'main' into dbe/add_hellinger_pmse

2de4692

Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate

c78973f

Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir

1aa4e5a

Merge branch 'main' into dbe/add_hellinger_pmse

8918c35

Merge branch 'main' into dbe/add_hellinger_pmse

b2c5de5

Addressing some PR comments

3defbf2

Merge branch 'dbe/add_hellinger_pmse' into dbe/add_f1_dff_hitting_rate

217c7a7

Merge branch 'dbe/add_f1_dff_hitting_rate' into dbe/add_nndr_and_eir

789dd7b

bzamanlooy reviewed Sep 29, 2025

Addressing some PR comments.

500ce2b

lotif approved these changes Sep 29, 2025

PR Comment changes

231265c

Base automatically changed from dbe/add_f1_dff_hitting_rate to main September 30, 2025 14:31

emersodb added 2 commits September 30, 2025 10:34

Merge branch 'main' into dbe/add_nndr_and_eir

e41bcc7

Dropping unused variable

6b9d639

emersodb merged commit d231a4a into main Sep 30, 2025
5 checks passed

emersodb deleted the dbe/add_nndr_and_eir branch September 30, 2025 14:39


3 participants
