Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[7.x] [ML] Avoid very low p-values if the term is only a tiny fraction of the foreground set (#76764) #76772

Merged
merged 4 commits into from
Aug 23, 2021

Conversation

benwtrent
Copy link
Member

Backports the following commits to 7.x:

…he foreground set (elastic#76764)

Whilst testing the p_value scoring heuristic for significant terms introduced
in elastic#75313 it became clear we can assign arbitrarily low p-values if the overall
counts are high enough for terms which constitute a very small fraction of the
foreground set. Even if the difference in their frequency on the foreground and
background set is statistically significant they don't explain the majority of the
foreground cases and so are not of significant interest (certainly not in the use
cases we have for this aggregation).

We already have some mitigation for the cases that 1. the term frequency is
small on both the foreground and background set, 2. the term frequencies are
very similar. These offset the actual term counts by a fixed small fraction of
the background counts and make the foreground and background frequencies
more similar by a small relative amount, respectively. This change simply applies
offsets to the term counts before making frequencies more similar. For frequencies
much less than the offset we therefore get equal frequencies on the foreground
and background sets and p-value tends to 1. This retains the advantage of being
a smooth correction to the p-value so we get no strange discontinuities in the
vicinity of the small absolute and difference thresholds for the frequency.
@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Aug 20, 2021
@benwtrent
Copy link
Member Author

run elasticsearch-ci/bwc

@benwtrent
Copy link
Member Author

@elasticmachine update branch

@benwtrent
Copy link
Member Author

@elasticmachine update branch

@benwtrent
Copy link
Member Author

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit ac6c1b3 into elastic:7.x Aug 23, 2021
@benwtrent benwtrent deleted the backport/7.x/pr-76764 branch August 23, 2021 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport v7.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants