Added example for the attack #108
Conversation
📝 Walkthrough

This PR refactors the Tartan–Federer membership inference attack by externalizing data preparation logic from the core attack module to the example runner script.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Pre-merge checks: ❌ Failed checks (1 warning, 1 inconclusive) · ✅ Passed checks (1 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
examples/ensemble_attack/README.md (1)
**24-42: README mixes Ensemble Attack docs with Tartan–Federer run command; make entrypoints consistent.**

This README is titled/structured for Ensemble Attack (config paths and `./examples/ensemble_attack/run.sh`), but Line 29 now instructs running `examples.tartan_federer_attack.run_attack` while Line 41 still uses `examples.ensemble_attack.run_attack`. Pick one and align the surrounding text/config references accordingly to avoid sending users down the wrong pipeline.

Suggested fix (if this README is meant for Ensemble Attack):

```diff
-python -m examples.tartan_federer_attack.run_attack
+python -m examples.ensemble_attack.run_attack
```

If instead the intent is to document the Tartan–Federer example here, I'd strongly recommend renaming/splitting the README and updating *all* references (`config.yaml` path, `run.sh`, terminology) to the Tartan–Federer locations to keep the doc coherent.

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (2)

**470-496: Remove debug print statements.**

The model indexing logic correctly handles optional validation, but Lines 478-479 contain debug print statements that should be removed.

Apply this diff:

```diff
     if model_number in train_indices:
-        print("Preparing training dataframe...")
-        print(f"Model dir: {model_dir}")
         population_df_for_training = prepare_dataframe(
```
**532-563: Remove debug print statements.**

The validation scoring logic is correct, but Lines 533 and 563 contain debug print statements that should be removed.

Apply this diff:

```diff
     elif val_indices is not None and model_number in val_indices:
-        print("Getting validation scores...")
         batch_size = sample_per_val_model * 2
```

And:

```diff
         val_count += 1
-        print("Val count:", val_count)
         fitted_regression_model = fit_model(
```
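If some of that progress output is still wanted, here is a minimal sketch of a logging-based alternative, assuming Python's standard `logging` module is acceptable here; the `log_progress` helper and logger setup are hypothetical illustrations, not toolkit code:

```python
import logging

logger = logging.getLogger(__name__)  # hypothetical module-level logger, not part of the toolkit


def log_progress(model_dir: str, val_count: int) -> None:
    # Debug-level messages are silenced by default and can be enabled through logging
    # configuration, unlike the print statements flagged above.
    logger.debug("Preparing training dataframe for model dir %s", model_dir)
    logger.debug("Validation models scored so far: %d", val_count)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log_progress("tabddpm_2", 3)
```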
🧹 Nitpick comments (1)
src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (1)
**378-388: Refactor error messages to follow best practices.**

The data sufficiency checks are essential for preventing runtime errors, but the error messages should be simplified per best practices (TRY003).

Apply this diff:

```diff
     if df_exclusive.shape[0] < samples_per_model:
         raise ValueError(
-            f"Not enough data to sample non-members from. Requested {samples_per_model} but only "
-            f"{df_exclusive.shape[0]} available."
+            f"Insufficient non-member samples: need {samples_per_model}, have {df_exclusive.shape[0]}"
         )
     if raw_data.shape[0] < samples_per_model:
         raise ValueError(
-            f"Not enough data to sample members members from. Requested {samples_per_model} but only "
-            f"{raw_data.shape[0]} available."
+            f"Insufficient member samples: need {samples_per_model}, have {raw_data.shape[0]}"
         )
```

As per static analysis hints.
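As an alternative to shortening the strings, TRY003 can also be satisfied by moving the message into a dedicated exception class. A rough sketch under that assumption; `InsufficientSamplesError` and `check_sample_counts` are hypothetical names, not part of the toolkit:

```python
class InsufficientSamplesError(ValueError):
    """Raised when a population dataframe cannot supply the requested number of samples."""

    def __init__(self, kind: str, requested: int, available: int) -> None:
        # Building the message inside the exception class keeps the raise sites short,
        # which is the style TRY003 nudges toward.
        super().__init__(f"Insufficient {kind} samples: need {requested}, have {available}")


def check_sample_counts(n_exclusive: int, n_raw: int, samples_per_model: int) -> None:
    # n_exclusive and n_raw stand in for df_exclusive.shape[0] and raw_data.shape[0] above.
    if n_exclusive < samples_per_model:
        raise InsufficientSamplesError("non-member", samples_per_model, n_exclusive)
    if n_raw < samples_per_model:
        raise InsufficientSamplesError("member", samples_per_model, n_raw)
```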
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `tests/integration/attacks/tartan_federer/assets/population_data/population_dataset_for_training_attack.csv` is excluded by `!**/*.csv`
- `tests/integration/attacks/tartan_federer/assets/population_data/population_dataset_for_validating_attack.csv` is excluded by `!**/*.csv`
- `tests/integration/attacks/tartan_federer/assets/tabddpm_models/tabddpm_2/data_for_validating_MIA.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (7)
- `examples/ensemble_attack/README.md` (1 hunks)
- `examples/tartan_federer_attack/README.md` (1 hunks)
- `examples/tartan_federer_attack/configs/experiment_config.yaml` (1 hunks)
- `examples/tartan_federer_attack/run_attack.py` (1 hunks)
- `src/midst_toolkit/attacks/tartan_federer/data_utils.py` (0 hunks)
- `src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py` (15 hunks)
- `tests/integration/attacks/tartan_federer/test_tartan_federer_attack.py` (2 hunks)
💤 Files with no reviewable changes (1)
- src/midst_toolkit/attacks/tartan_federer/data_utils.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-11T16:08:49.024Z
Learnt from: lotif
Repo: VectorInstitute/midst-toolkit PR: 107
File: examples/gan/synthesize.py:1-47
Timestamp: 2025-12-11T16:08:49.024Z
Learning: When using SDV (version >= 1.18.0), prefer loading a saved CTGANSynthesizer with CTGANSynthesizer.load(filepath) instead of sdv.utils.load_synthesizer(). This applies to Python code across the repo (e.g., any script that loads a CTGANSynthesizer). Ensure the SDV version is >= 1.18.0 before using CTGANSynthesizer.load, and fall back to sdv.utils.load_synthesizer() only if a compatible alternative is required.
Applied to files:
- `tests/integration/attacks/tartan_federer/test_tartan_federer_attack.py`
- `examples/tartan_federer_attack/run_attack.py`
- `src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py`
🧬 Code graph analysis (4)
examples/ensemble_attack/README.md (2)
- examples/ensemble_attack/run_attack.py (1)
  - `main` (48-87)
- examples/ept_attack/run_ept_attack.py (1)
  - `main` (80-102)

tests/integration/attacks/tartan_federer/test_tartan_federer_attack.py (2)
- src/midst_toolkit/common/random.py (2)
  - `set_all_random_seeds` (11-55)
  - `unset_all_random_seeds` (58-67)
- src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (1)
  - `tartan_federer_attack` (577-725)

examples/tartan_federer_attack/run_attack.py (2)
- src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (1)
  - `tartan_federer_attack` (577-725)
- src/midst_toolkit/common/random.py (2)
  - `set_all_random_seeds` (11-55)
  - `unset_all_random_seeds` (58-67)

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (1)
- src/midst_toolkit/attacks/tartan_federer/data_utils.py (1)
  - `evaluate_attack_performance` (160-209)
🪛 Ruff (0.14.8)
examples/tartan_federer_attack/run_attack.py
43-43: Avoid specifying long messages outside the exception class
(TRY003)
63-63: Avoid specifying long messages outside the exception class
(TRY003)
75-75: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
117-117: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
122-122: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
142-142: Unpacked variable mia_performance_train is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
142-142: Unpacked variable mia_performance_val is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
142-142: Unpacked variable mia_performance_test is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py
379-382: Avoid specifying long messages outside the exception class
(TRY003)
385-388: Avoid specifying long messages outside the exception class
(TRY003)
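For reference, the RUF059 findings above are usually silenced by underscoring the unpacked values that are deliberately ignored; a tiny sketch with a hypothetical stand-in for the attack call (the real `run_attack.py` may reasonably keep the names as-is for readability, as noted later in the review):

```python
from __future__ import annotations

from typing import Any


def attack_stub() -> tuple[dict[str, Any], dict[str, Any] | None, dict[str, Any]]:
    # Hypothetical stand-in returning train/validation/test MIA metrics.
    return {"auc": 0.5}, None, {"auc": 0.5}


# Underscore prefixes tell Ruff these unpacked values are intentionally unused.
_mia_performance_train, _mia_performance_val, _mia_performance_test = attack_stub()
```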
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: integration-tests
- GitHub Check: unit-tests
- GitHub Check: run-code-check
🔇 Additional comments (13)
examples/tartan_federer_attack/README.md (1)
**1-33: LGTM: Clear documentation with minor typo in config comment.**

The README provides clear guidance on the data processing flow and execution. The structure is well-organized with distinct sections for data processing and running the attack.

Note: Line 2 of the YAML config (experiment_config.yaml) contains a typo: `tets_attack_model.py` should be `test_attack_model.py`.

tests/integration/attacks/tartan_federer/test_tartan_federer_attack.py (2)

**24-24: LGTM: Population data directory added.**

The addition of `population_data_dir` aligns with the refactored data loading approach.

**179-180: LGTM: Test execution added.**

The new tests are properly registered in the main execution block.

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (6)

**400-416: LGTM: Optional validation parameter added.**

The function signature correctly supports optional validation by accepting `val_indices: list[int] | None`.

**447-448: LGTM: CSV-based population loading implemented.**

The refactor to load population datasets directly from CSV files simplifies the data flow and aligns with the PR objectives.

**458-463: LGTM: Conditional validation arrays.**

The conditional creation of validation arrays properly handles the optional validation flow.

**577-598: LGTM: Function signature updated.**

The `tartan_federer_attack` function signature correctly includes the new `population_data_dir` parameter and optional `val_indices`.

**660-663: LGTM: Conditional model indexing.**

The logic correctly constructs `model_folders_indices` based on whether validation is enabled.

**705-710: LGTM: Conditional validation evaluation.**

The evaluation logic properly handles the case when `val_indices is None` by setting `mia_performance_val = None`.

examples/tartan_federer_attack/run_attack.py (4)

**19-71: LGTM: Data preparation logic moved to example.**

The `prepare_population_dataset_for_attack` function is well-implemented with proper validation. Moving this from the toolkit to the example is a good separation of concerns.

Note: The error messages on Lines 43 and 63 could be simplified per TRY003, but this is a minor style consideration.

**73-112: LGTM: Data processing orchestration.**

The `run_data_processing` function cleanly orchestrates population dataset preparation and CSV output.

**114-167: LGTM: Hydra-based attack runner.**

The `run_attack` function properly uses Hydra for configuration management, sets deterministic seeds, conditionally runs data processing, and invokes the attack with correct parameters.

Note: The unpacked variables at Line 142 are not used, but this is acceptable for an entry point where you may want to observe all return values for logging/debugging purposes.

**169-170: LGTM: Entry point defined.**

The `__main__` block correctly invokes the Hydra-decorated `run_attack` function.
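For readers unfamiliar with the pattern being praised here, a minimal sketch of a Hydra-decorated entry point, assuming hydra-core is installed and the YAML sits at `configs/experiment_config.yaml` next to the script (which mirrors this example's layout). The body below is illustrative, not the actual `run_attack` implementation:

```python
import hydra
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path="configs", config_name="experiment_config")
def run_attack(cfg: DictConfig) -> None:
    # Hydra resolves configs/experiment_config.yaml into cfg before this body executes.
    attack_cfg = cfg["attack_config"]
    classifier_cfg = cfg["classifier_config"]
    # Placeholder for seeding, optional data processing, and the attack call itself.
    print(attack_cfg, classifier_cfg)


if __name__ == "__main__":
    run_attack()
```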
On branch bz/tf_example Your branch is ahead of 'origin/bz/tf_example' by 2 commits. (use "git push" to publish your local commits) Changes to be committed: modified: src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py
fatemetkl
left a comment
I have left some comments and questions, but most of them are minor.
Looks good overall!
```python
        models_base_dir=model_data_dir,
        columns_for_deduplication=columns_for_deduplication,
    )
    population_df_for_training = pd.read_csv(population_data_dir / "population_dataset_for_training_attack.csv")
```
Maybe we can log here that the training dataset is loaded successfully. Also display an error message in case the file is not found under population_data_dir, prompting the user to place the file at this address.
I think I used to have this, but David mentioned that pd.read_csv will handle throwing an error if the file does not exist. I am still happy to add it :)
That's right, its default error message can be enough.
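If the friendlier message is ever added back, a small sketch of how it could look, assuming the standard `logging` module; `load_population_csv` is a hypothetical helper for illustration, not existing toolkit code:

```python
import logging
from pathlib import Path

import pandas as pd

logger = logging.getLogger(__name__)


def load_population_csv(population_data_dir: Path, filename: str) -> pd.DataFrame:
    csv_path = population_data_dir / filename
    if not csv_path.exists():
        # More actionable than pandas' default FileNotFoundError: tells the user where to place the file.
        raise FileNotFoundError(
            f"{filename} not found. Place it under {population_data_dir} or rerun the data processing step."
        )
    dataframe = pd.read_csv(csv_path)
    logger.info("Loaded %s with %d rows.", filename, len(dataframe))
    return dataframe


# Example usage (hypothetical directory):
# load_population_csv(Path("population_data"), "population_dataset_for_training_attack.csv")
```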
```python
        columns_for_deduplication=columns_for_deduplication,
    )
    population_df_for_training = pd.read_csv(population_data_dir / "population_dataset_for_training_attack.csv")
    population_df_for_validation = pd.read_csv(population_data_dir / "population_dataset_for_validating_attack.csv")
```
Same as the above comment.
```python
    attack_cfg = cfg["attack_config"]
    classifier_cfg = cfg["classifier_config"]

    mia_performance_train, mia_performance_val, mia_performance_test = tartan_federer_attack(
```
Just curious, I am missing where the user can define the type of the attack (white-box or black-box). Do we give this option to the user?
This example seems to be black-box, but I might be wrong. Adding this information to the README would also be great!
So there is really not a difference. If we set the target_model_subdir to be the one containing the shadow model trained using synthetic data, the attack is essentially black-box. We currently do not have those trained using the toolkit, so perhaps I can add that as a todo for Elahe as well and update the code then.
Got it! So it depends on what is in the config.target_shadow_model_subdir directory. If there are trained shadow models there, it would be black-box, but in this config, it is set to target_shadow_model_subdir: "." So it will be white-box. It wasn't entirely clear to me at first, but I suppose users will already be aware of this.
Or you can optionally note that in a comment in experiment_config.yaml.
```diff
@@ -1,18 +1,16 @@
 # Ensemble experiment configuration
```
This should be Tartan-Federer instead 🙂
fatemetkl
left a comment
LGTM! 🚀
Added a few minor and optional comments.
```python
    attack_cfg = cfg["attack_config"]
    classifier_cfg = cfg["classifier_config"]

    mia_performance_train, mia_performance_val, mia_performance_test = tartan_federer_attack(
```
Nitpicking comment (sorry): it might be a bit more readable if the inputs followed the function signature order.
PR Type
Other: Example
Short Description
Modified the attack so that population preparation is done separately, which lightens the tests and makes it possible to add an example.
Tests Added
Added a test to make sure the attack can run without validation.