Skip to content

Ensemble fix: decouple ID extraction from drop operation#124

Merged
fatemetkl merged 8 commits intomainfrom
ft/ensemble_exp_rmia_abl
Feb 4, 2026
Merged

Ensemble fix: decouple ID extraction from drop operation#124
fatemetkl merged 8 commits intomainfrom
ft/ensemble_exp_rmia_abl

Conversation

@fatemetkl
Copy link
Collaborator

@fatemetkl fatemetkl commented Feb 4, 2026

PR Type

Fix

Short Description

The Berka dataset has two id columns, "account_id" and "trans_id".
transaction IDs should be extracted and passed to blending_attacker.predict. Therefore, we should first extract the main ID column, then remove all ID columns. Previously, the code was performing the extraction and drop operation just for the main ID column.


# Dataset specific information used for processing in this example
data_processing_config:
midst_data_path: /projects/midst-experiments/all_tabddpms/ # Used to collect the data (input)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One step towards config modularization.

@fatemetkl fatemetkl changed the title error fix: seperated extraction and drop id Fix: Decouple ID extraction from drop operation Feb 4, 2026
@fatemetkl fatemetkl changed the title Fix: Decouple ID extraction from drop operation Ensemble fix: Decouple ID extraction from drop operation Feb 4, 2026
@fatemetkl fatemetkl changed the title Ensemble fix: Decouple ID extraction from drop operation Ensemble fix: decouple ID extraction from drop operation Feb 4, 2026
@fatemetkl fatemetkl requested a review from sarakodeiri February 4, 2026 18:07
@fatemetkl fatemetkl marked this pull request as ready for review February 4, 2026 18:15
@fatemetkl fatemetkl requested a review from emersodb February 4, 2026 18:17
@coderabbitai
Copy link

coderabbitai bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

This pull request consolidates configuration restructuring with API and function signature updates. The midst_data_path configuration is moved from the top-level data_paths section into data_processing_config, along with new dataset metadata including split specifications and file naming configurations. The data collection call in run_attack.py is updated to reference the new config path and passes population_splits and challenge_splits parameters. The test_attack_model.py file refactors the ID extraction function—renamed from extract_and_drop_id_column to extract_the_main_id_column with a changed return type from tuple to Series—and updates all usage sites accordingly. Additionally, the dataset partitioning for shadow model training is adjusted by changing split_folders from ["test"] to ["final"]. The TprAtFpr API parameter names are updated across two files, replacing max_fpr with fpr_threshold and predictions with predicted_membership.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description check ❓ Inconclusive The description covers the PR type (Fix) and provides a clear short description explaining the rationale (two ID columns in Berka dataset, need to extract main ID before removing all IDs), but the template requires a Tests Added section which is missing. Add a Tests Added section describing what tests were added or modified to validate the ID extraction and drop decoupling changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: decoupling ID extraction from the drop operation, which aligns with the primary objective of handling multiple ID columns separately.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch ft/ensemble_exp_rmia_abl

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/ensemble_attack/test_attack_model.py (1)

55-84: ⚠️ Potential issue | 🟡 Minor

Docstring describes wrong return type.

The docstring (lines 69-72) still describes the old return type as a tuple, but the function now returns only a pd.Series. This inconsistency could confuse future maintainers.

📝 Proposed fix for the docstring
     """
-    Extracts and returns the main IDs from the dataframe. The main ID column is identified based on
-    the data types JSON file with "id_column_name" key.
-    Main IDs are not repeated in the dataset.
-    For example, in the Berka dataset, "trans_id" is the main ID column, and "account_id" is not the main ID column.
+    Extracts and returns the main ID column values from the dataframe. The main ID column is
+    identified based on the data types JSON file with "id_column_name" key.
+    Main IDs are not repeated in the dataset.
+    For example, in the Berka dataset, "trans_id" is the main ID column, and "account_id" is not
+    the main ID column.
 
     Args:
         data_frame: Input dataframe.
         data_types_file_path: Path to the data types JSON file.
 
     Returns:
-        A tuple containing:
-            - The modified dataframe with ID columns dropped.
-            - A Series containing the extracted data of ID columns.
+        A Series containing the main ID column values.
     """



def extract_and_drop_id_column(
def extract_the_main_id_column(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor, but maybe "primary_id" instead of "main_id"? Primary keys are the unique keys in databases. So it might be more suggestive to a reader 🙂

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I knew there was a better name, but I couldn't remember at the time lol. Thanks!

Copy link
Collaborator

@sarakodeiri sarakodeiri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!



def extract_and_drop_id_column(
def extract_the_main_id_column(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suuuper nit-picky, but can we rename the function name to extract_main_id_column (drop the "the")? Sounds better

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

@fatemetkl fatemetkl merged commit 65d1ac8 into main Feb 4, 2026
6 checks passed
@fatemetkl fatemetkl deleted the ft/ensemble_exp_rmia_abl branch February 4, 2026 19:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants