
A profile's average precision is 1 if it contains any NaNs #59

Closed
jessica-ewald opened this issue Mar 1, 2024 · 3 comments · Fixed by #60

Comments

@jessica-ewald
Contributor

The problem

All results come from this notebook

If a profile contains even a single NaN, every pairwise similarity involving it will also be NaN. Here is an example:
[screenshot: pairwise similarities are all NaN when one profile contains a NaN]
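To see why, here is a minimal sketch (not copairs' actual implementation) of how a single NaN poisons any dot-product-based similarity such as cosine:

```python
import numpy as np

# Two profiles; profile `a` has a single NaN feature.
a = np.array([0.5, np.nan, 1.0])
b = np.array([0.2, 0.3, 0.9])

# Any similarity built on dot products inherits the NaN:
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isnan(cosine))  # True: one NaN makes the whole similarity NaN
```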

Here is an example where the profile we are computing AP for has no NaNs, but some of its pairs do:
[screenshot: some pairs have NaN similarities even though the query profile has none]

When all pairs (positive and negative) have NaN similarities, copairs returns an AP of 1. As a result, any profile with even a single NaN gets an AP of 1 (see the ZMYND10_Arg340Gln and UBQLN2_Pro506Thr rows):
[screenshot: AP results, with ZMYND10_Arg340Gln and UBQLN2_Pro506Thr showing AP = 1]

What to do?

As @alxndrkalinin pointed out, we probably don't want to enforce any NaN-handling strategy inside the pairwise similarity calculations, because this can get complex.

But maybe this also shouldn't happen without any warning? Users could easily have a small number of NaNs in their data and never notice, because the mAP values would look normal (between 0 and 1) but would be biased upwards whenever one or more of the individual AP values is 1.

I suggest we add one simple check for NaNs in the `feats` input in the `validate_pipeline_input` function, give users a warning describing this behavior, and let them know they should resolve the NaNs. Another (more complicated) solution would be to change the tie strategy when computing average precision from the ranked list, so that the average precision is NaN instead of 1 when every similarity is NaN and therefore tied.
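A sketch of what the warning-based variant of that check could look like (the function and argument names here follow this discussion, not the actual copairs code):

```python
import warnings

import numpy as np
import pandas as pd


def check_nan_feats(feats: pd.DataFrame) -> None:
    """Warn if the feature matrix contains NaNs (hypothetical helper)."""
    n_nan = int(feats.isna().sum().sum())
    if n_nan > 0:
        warnings.warn(
            f"feats contains {n_nan} NaN value(s); any profile with a NaN "
            "will get an average precision of 1. Resolve NaNs before "
            "computing mAP."
        )


# Example: one NaN in a two-profile feature matrix triggers the warning.
check_nan_feats(pd.DataFrame({"f1": [0.1, np.nan], "f2": [0.4, 0.5]}))
```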

@shntnu
Member

shntnu commented Mar 1, 2024

> But maybe this also shouldn't happen without any warning?

100% agree

> Users could easily have a small number of NaNs in their data and never notice, because the mAP values would look normal (between 0 and 1) but would be biased upwards whenever one or more of the individual AP values is 1.

Yep

> I suggest we add one simple check for NaNs in the `feats` input in the `validate_pipeline_input` function, give users a warning describing this behavior, and let them know they should resolve the NaNs.

agreed

```python
# Check for NaN values in the feats DataFrame
if feats.isna().any(axis=None):
    raise ValueError("Feature columns should not have NaN values.")
```

> Another (more complicated) solution would be to change the tie strategy when computing average precision from the ranked list, so that the average precision is NaN instead of 1 when every similarity is NaN and therefore tied.

This is certainly more complicated. I think we should take a conservative and simple approach, albeit not the most user-friendly one.

@johnarevalo
Member

Thanks for catching this, @jessica-ewald! I agree on raising a ValueError for any NaN. It'd be good to check whether np.inf also leads to unexpected behavior.
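If infs turn out to be a problem too, a single `np.isfinite` check would cover both cases. A sketch (not the actual fix):

```python
import numpy as np
import pandas as pd

# np.isfinite is False for NaN, +inf, and -inf, so one check covers both.
feats = pd.DataFrame({"f1": [0.1, np.inf], "f2": [np.nan, 0.3]})

has_bad_values = not np.isfinite(feats.to_numpy()).all()
print(has_bad_values)  # True: this frame would fail the stricter check
```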

Please let me know if you want to author the PR for this change.

@jessica-ewald
Contributor Author

I've just submitted a pull request to check for NaNs. I haven't looked into the behavior around infs yet, but I'll be working with mAP more this week and can add it to the list!
