Conversation

stefanradev93 (Contributor) commented on Apr 4, 2025

Adds a classifier two-sample test (C2ST) metric to the bayesflow.diagnostics.metrics module. Example usage:

from bayesflow.diagnostics.metrics import classifier_two_sample_test

samples = ...
reference_samples = ...

# Using default arguments
c2st = classifier_two_sample_test(samples, reference_samples)

# Customizing arguments: return the full classification report and the classifier,
# and use a smaller validation split
c2st_dict = classifier_two_sample_test(
    samples, reference_samples, return_metric_only=False, validation_split=0.2
)
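
For reference, the C2ST statistic is the held-out accuracy of a classifier trained to distinguish the two sample sets: values near 0.5 mean the sets are statistically indistinguishable, while values approaching 1.0 indicate a clear mismatch. A minimal interpretation sketch, assuming the default call returns this accuracy as a float (the 0.55 cutoff below is purely illustrative):

# Accuracy near chance level (0.5) suggests the samples match the reference;
# the 0.55 threshold is arbitrary and only for illustration.
if c2st < 0.55:
    print(f"Sample sets are hard to distinguish (accuracy = {c2st:.2f}).")
else:
    print(f"Classifier separates the sample sets (accuracy = {c2st:.2f}).")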

What we need to discuss is the conceptual separation of metrics: those intended to capture the performance of an ensemble of posteriors (e.g., SBC, RMSE), which depend on the prior, versus those intended to capture the quality of individual posteriors (e.g., MMD, C2ST). Should these two classes of metrics reside in different submodules? @paul-buerkner @vpratz @LarsKue

Note also that I removed the internal cross-validation in favor of a simple train-test split. Since the C2ST will generally be computed on different source-target posterior samples, any epistemic uncertainty will average out. However, I could support K-fold as well with some custom helpers; I mainly wanted to avoid an explicit dependence on sklearn.
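
As a minimal sketch (not the PR's implementation) of what a sklearn-free K-fold helper could look like: it assumes a hypothetical fit_and_score callable that trains the classifier on the training indices and returns its accuracy on the held-out indices; all names are illustrative.

import numpy as np

def kfold_c2st(samples, reference_samples, fit_and_score, k=5, seed=42):
    # Label the two sample sets and pool them into one classification dataset.
    x = np.concatenate([samples, reference_samples], axis=0)
    y = np.concatenate([np.zeros(len(samples)), np.ones(len(reference_samples))])

    # Shuffle once, then split the indices into k roughly equal folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)

    # Average the held-out classification accuracy over the k folds.
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(fit_and_score(x[train_idx], y[train_idx], x[test_idx], y[test_idx]))
    return float(np.mean(scores))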

vpratz (Collaborator) commented on Apr 7, 2025

Thanks for the PR :)

> Should these two classes of metrics reside in different submodules?

I like the idea, but lack good names that we could use for the respective submodules. Do you have an idea on how to distinguish them with two words that could serve as submodule names?

> Note also that I removed the internal cross-validation in favor of a simple train-test split.

As far as I can tell, using cross-validation gives us a more powerful test for the same number of samples, so I would opt against removing this in the default setting. How are other implementations handling this? Is there some default setup we might want to implement?
We use sklearn in expected_calibration_error and mc_confusion_matrix as well. Is the plan to refactor those as well, or are we going to keep the dependency anyway?

paul-buerkner (Contributor) commented

I think the PR looks good. Would you perhaps have time to add some tests as well?

I would keep them in the same module for now, even if their inputs are different. We can clearly document the difference in inputs, and I am not sure how we would name them if we went for two modules.

stefanradev93 (Contributor, Author) commented

@vpratz I removed all dependencies on sklearn across the library. @paul-buerkner I added tests for the sensitivity of the C2ST.
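
A sketch of what such a sensitivity test could look like (hypothetical, not the actual test added in this PR; it assumes the default call returns the classifier accuracy as a float, and the tolerances are illustrative):

import numpy as np
from bayesflow.diagnostics.metrics import classifier_two_sample_test

def test_c2st_sensitivity():
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(500, 2))
    same = rng.normal(size=(500, 2))              # drawn from the same distribution
    shifted = rng.normal(loc=3.0, size=(500, 2))  # clearly shifted distribution

    # Matching distributions: accuracy should stay close to chance level.
    assert abs(classifier_two_sample_test(same, reference) - 0.5) < 0.1

    # Shifted distributions: the classifier should separate them easily.
    assert classifier_two_sample_test(shifted, reference) > 0.8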

LarsKue assigned stefanradev93 and unassigned paul-buerkner and vpratz on Apr 8, 2025
LarsKue requested a review from Copilot on April 8, 2025, 17:58

Copilot AI left a comment

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

bayesflow/diagnostics/metrics/classifier_two_sample_test.py:21

  • There is an inconsistency between the documented default value for mlp_widths (described as (256, 256)) and the actual default in the function signature ((64, 64)). Please update the docstring or the default value to be consistent.
mlp_widths: Sequence = (64, 64),

LarsKue merged commit 387abe8 into dev on Apr 8, 2025 (15 checks passed)
LarsKue deleted the c2st branch on April 8, 2025, 18:25