[ML] adding new p_value scoring heuristic to significant terms aggregation #75313

benwtrent · 2021-07-13T18:16:03Z

This commit adds a new p_value score heuristic to significant terms.

The p_value is calculated assuming that the foreground set and the background set are independent Bernoulli trials, with the null hypothesis that the probabilities are the same.
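
For readers who want a concrete picture of that null hypothesis, here is a minimal sketch of a likelihood-ratio test on two binomial samples. It is an illustration only and is not the `PValueScore` code in this PR: the class and method names are hypothetical, the two sets are assumed to be disjoint, and the `erfc` helper is a textbook approximation (Abramowitz and Stegun 7.1.26), so very small p-values are only accurate to about 1e-7 in absolute terms.

```java
// Hypothetical sketch: likelihood-ratio (G) test of "both sets share one Bernoulli rate".
// Not the implementation from this PR.
final class BernoulliPValueSketch {

    // x * ln(x / y), using the convention 0 * ln(0) = 0.
    static double xLogXOverY(double x, double y) {
        return x == 0.0 ? 0.0 : x * Math.log(x / y);
    }

    // Complementary error function for x >= 0 (Abramowitz & Stegun 7.1.26,
    // absolute error about 1.5e-7).
    static double erfc(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
            + t * (-1.453152027 + t * 1.061405429))));
        return poly * Math.exp(-x * x);
    }

    // fgHits of fgTotal foreground docs contain the term; bgHits of bgTotal
    // background docs contain it. The background is assumed to exclude the foreground.
    static double pValue(long fgHits, long fgTotal, long bgHits, long bgTotal) {
        double[][] observed = {
            {fgHits, fgTotal - fgHits},
            {bgHits, bgTotal - bgHits}
        };
        double total = (double) fgTotal + bgTotal;
        double pooledHits = (double) fgHits + bgHits;
        double[] rowTotals = {fgTotal, bgTotal};
        double[] colTotals = {pooledHits, total - pooledHits};

        // G = 2 * sum(observed * ln(observed / expected)), where the expected
        // counts come from the pooled rate (the null hypothesis).
        double g = 0.0;
        for (int r = 0; r < 2; r++) {
            for (int c = 0; c < 2; c++) {
                double expected = rowTotals[r] * colTotals[c] / total;
                g += 2.0 * xLogXOverY(observed[r][c], expected);
            }
        }
        // Under the null, G is approximately chi-squared with 1 degree of freedom,
        // whose survival function is erfc(sqrt(g / 2)).
        return erfc(Math.sqrt(Math.max(g, 0.0) / 2.0));
    }
}
```

With equal rates (say 10 of 100 vs. 100 of 1000) the statistic is zero and the p-value is close to 1; with clearly different rates (50 of 100 vs. 100 of 1000) it is vanishingly small.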

Example usage:

This calculates the p_value score for the terms of user_agent.version, given the foreground set "ended in failure" vs. the background set "NOT ended in failure".

NOTE: "background_is_superset": false indicates that the background set does not contain the counts of the foreground set, as we filter them out.

{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "event.outcome": "failure"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2021-02-01",
              "lt": "2021-02-04"
            }
          }
        },
        {
          "term": {
            "service.name": {
              "value": "frontend-node"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "failure_p_value": {
      "significant_terms": {
        "field": "user_agent.version",
        "background_filter": {
          "bool": {
            "must_not": [
              {
                "term": {
                  "event.outcome": "failure"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "@timestamp": {
                    "gte": "2021-02-01",
                    "lt": "2021-02-04"
                  }
                }
              },
              {
                "term": {
                  "service.name": {
                    "value": "frontend-node"
                  }
                }
              }
            ]
          }
        },
        "p_value": {"background_is_superset": false}
      }
    }
  }
}
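
To make the role of background_is_superset concrete, the sketch below shows one way the flag could be interpreted when preparing counts. The helper name and signature are hypothetical and the subtraction for the superset case is an assumption about the intended semantics, not code from this PR: when the background includes the foreground, the foreground counts are removed first so that the two samples are disjoint.

```java
// Hypothetical helper, for illustration only.
final class BackgroundCountsSketch {

    // Returns {docsWithTerm, totalDocs} for the part of the background
    // that lies outside the foreground set.
    static long[] disjointBackground(long fgHits, long fgTotal,
                                     long bgHits, long bgTotal,
                                     boolean backgroundIsSuperset) {
        if (backgroundIsSuperset) {
            // Assumption: the background counts include the foreground, so subtract it.
            return new long[] {bgHits - fgHits, bgTotal - fgTotal};
        }
        // The background was already filtered to exclude the foreground
        // (as with the must_not clause in the request above).
        return new long[] {bgHits, bgTotal};
    }
}
```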

elasticmachine · 2021-07-13T18:16:06Z

Pinging @elastic/ml-core (Team:ML)

benwtrent · 2021-07-13T18:17:09Z

@elastic/ml-docs Hola, I am not sure where to put the docs for this.

It's a new function in the significant terms aggregation that the ML plugin provides. Right now, there is no independent page for significance functions, nor a place to put plugin ones.

benwtrent · 2021-07-13T18:18:05Z

@not-napoleon related to: #75264

I needed to move some files so that the ML plugin could access them. I failed to do that in the previous PR. The changes are rather small (mostly a package move).

benwtrent · 2021-07-13T18:18:57Z

...gin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/heuristic/LongBinomialDistribution.java

+ *
+ * It expands its usage to allow `long` values instead of restricting to `int`
+ */
+public class LongBinomialDistribution {


This code is mostly a copy-paste from Apache Commons Math3. The only difference is that the parameter types are now long instead of int.

benwtrent · 2021-07-13T18:20:18Z

...gin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/heuristic/MlChiSquaredDistribution.java

+    public double survivalFunction(double x) {
+        return x <= 0 ?
+            0 :
+            Gamma.regularizedGammaQ(gamma.getShape(), x / gamma.getScale());


It is more accurate to use regularizedGammaQ directly instead of computing 1 - regularizedGammaP, as the subtraction could over/underflow quite easily on smaller values.
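
A small, self-contained illustration of the precision concern, using a case with a closed form rather than the Gamma functions themselves: for a chi-squared distribution with 2 degrees of freedom the CDF is 1 - exp(-x/2) and the survival function is exp(-x/2). The class and method names are hypothetical; the sketch only shows that going through the complement discards a small tail probability that the direct form keeps.

```java
// Hypothetical sketch: chi-squared with 2 degrees of freedom, where the
// upper regularized gamma function reduces to exp(-x / 2).
final class SurvivalPrecisionSketch {

    // CDF: P(X <= x).
    static double cdf(double x) {
        return x <= 0 ? 0.0 : 1.0 - Math.exp(-x / 2.0);
    }

    // Survival function computed as 1 - CDF. Once the tail is smaller than
    // about 1e-16, the CDF rounds to exactly 1.0 and the result is 0.0.
    static double survivalViaComplement(double x) {
        return 1.0 - cdf(x);
    }

    // Survival function computed directly: P(X > x) = exp(-x / 2) for x > 0.
    static double survivalDirect(double x) {
        return x <= 0 ? 1.0 : Math.exp(-x / 2.0);
    }
}
```

At x = 100 the direct form returns roughly 1.9e-22, while the complement form returns exactly 0.0, which matters when scores are derived from very small p-values.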

tveasey

Nice work! Everything looks good to me (except for one correction, which was a mistake in the prototype). I do think it would be good to allow key constants to be supplied as parameters, and I also made some suggestions for extra testing.

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScore.java

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScoreTests.java

tveasey

I spotted one further simplification, since you pulled in the condition that the term frequency must be higher on the subset. Also, on seeing the actual p-values, I realised that my suggestions for testing the case that the fraction is within 5% were off by a factor.

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScore.java

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScoreTests.java

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScore.java

benwtrent · 2021-07-19T12:17:19Z

Factored out the required agg test changes to this PR: #75452

benwtrent · 2021-07-21T13:25:39Z

run elasticsearch-ci/part-1

davidkyle

LGTM

davidkyle · 2021-07-21T13:41:33Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScore.java

+
+    @Override
+    public void writeTo(StreamOutput out) throws IOException {
+        out.writeBoolean(backgroundIsSuperset);


I would be tempted to use the super method for writeTo and the super StreamInput ctor, in case something changes in the base class. This class can't be constructed with includeNegatives == false, so that constraint is preserved.
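
A rough sketch of the delegation being suggested. All class names here are hypothetical stand-ins, and `DataInputStream`/`DataOutputStream` play the role of Elasticsearch's `StreamInput`/`StreamOutput` so the example runs on its own; the assumed base-class wire format (two booleans) is for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical base class that owns the wire format.
class BaseHeuristicSketch {
    protected final boolean includeNegatives;
    protected final boolean backgroundIsSuperset;

    BaseHeuristicSketch(boolean includeNegatives, boolean backgroundIsSuperset) {
        this.includeNegatives = includeNegatives;
        this.backgroundIsSuperset = backgroundIsSuperset;
    }

    // Reads fields in the same order writeTo emits them.
    BaseHeuristicSketch(DataInputStream in) throws IOException {
        this(in.readBoolean(), in.readBoolean());
    }

    void writeTo(DataOutputStream out) throws IOException {
        out.writeBoolean(includeNegatives);
        out.writeBoolean(backgroundIsSuperset);
    }
}

// Hypothetical subclass: it never overrides writeTo and reads via super(in),
// so a change to the base-class format is picked up automatically.
final class PValueHeuristicSketch extends BaseHeuristicSketch {
    PValueHeuristicSketch(boolean backgroundIsSuperset) {
        super(true, backgroundIsSuperset); // includeNegatives is always true here
    }

    PValueHeuristicSketch(DataInputStream in) throws IOException {
        super(in);
    }

    // Serializes and deserializes an instance through the inherited methods.
    static PValueHeuristicSketch roundTrip(PValueHeuristicSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.writeTo(out);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return new PValueHeuristicSketch(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the subclass only ever passes `true` for includeNegatives, a round trip through the inherited methods cannot produce an instance that violates that constraint.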

benwtrent · 2021-07-21T14:03:56Z

run elasticsearch-ci/part-1

…aggregation (#75313) (#75597) * [ML] adding new p_value scoring heuristic to significant terms aggregation (#75313) This commit adds a new p_value score heuristic to significant terms. The p_value is calculated assuming that the foreground set and the background set are independent Bernoulli trials, with the null hypothesis that the probabilities are the same. * adjusting for backport

…he foreground set (#76764)

Whilst testing the p_value scoring heuristic for significant terms introduced in #75313, it became clear that, if the overall counts are high enough, we can assign arbitrarily low p-values to terms which constitute a very small fraction of the foreground set. Even if the difference in their frequency on the foreground and background sets is statistically significant, they don't explain the majority of the foreground cases and so are not of significant interest (certainly not in the use cases we have for this aggregation).

We already have some mitigation for the cases that

1. the term frequency is small on both the foreground and background set,
2. the term frequencies are very similar.

These offset the actual term counts by a fixed small fraction of the background counts and make the foreground and background frequencies more similar by a small relative amount, respectively.

This change simply applies offsets to the term counts before making frequencies more similar. For frequencies much less than the offset, we therefore get equal frequencies on the foreground and background sets, and the p-value tends to 1. This retains the advantage of being a smooth correction to the p-value, so we get no strange discontinuities in the vicinity of the small absolute and difference thresholds for the frequency.

…he foreground set (#76764) (#76773) Co-authored-by: Tom Veasey <tveasey@users.noreply.github.com>

…he foreground set (#76764) (#76772) Co-authored-by: Tom Veasey <tveasey@users.noreply.github.com> Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

benwtrent added 7 commits July 1, 2021 15:49

[ML] add new p_value significant terms heuristic

418ca1f

Merge branch 'master' into feature/ml-p_value-sig-terms-heuristic

4464c10

fixing p-value scaling

85db3a1

correcting p-value calculation

7453977

Merge branch 'master' into feature/ml-p_value-sig-terms-heuristic

f8643a3

adding tests

19d7eaf

removing blank lines

d93c0f7

benwtrent added >enhancement :ml Machine learning v8.0.0 v7.15.0 labels Jul 13, 2021

benwtrent requested review from not-napoleon and tveasey July 13, 2021 18:16

elasticmachine added the Team:ML label Jul 13, 2021

benwtrent commented Jul 13, 2021

benwtrent added 2 commits July 14, 2021 07:26

fixing style

1196d26

Merge remote-tracking branch 'upstream/master' into feature/ml-p_valu…

e7efc9b

…e-sig-terms-heuristic

tveasey reviewed Jul 14, 2021

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScoreTests.java Outdated

benwtrent added 2 commits July 14, 2021 14:08

addressing PR comments

8adc331

Merge remote-tracking branch 'upstream/master' into feature/ml-p_valu…

c77f062

…e-sig-terms-heuristic

benwtrent requested a review from tveasey July 14, 2021 18:13

fixing tests and formatting

36a18ff

tveasey approved these changes Jul 15, 2021

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScore.java

x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScoreTests.java Outdated

davidkyle reviewed Jul 15, 2021

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/aggs/heuristic/PValueScore.java Outdated

addressing PR comments

5e8d102

fixing test

997094c

szabosteve mentioned this pull request Jul 15, 2021

[DOCS] Adds p-value heuristic to significant terms aggregation #75369

Merged

benwtrent removed the request for review from not-napoleon July 19, 2021 12:16

benwtrent added 2 commits July 21, 2021 09:08

Merge branch 'master' into feature/ml-p_value-sig-terms-heuristic

c85781c

fixing post merge

e895d34

davidkyle approved these changes Jul 21, 2021

benwtrent merged commit 79c176c into elastic:master Jul 21, 2021

benwtrent deleted the feature/ml-p_value-sig-terms-heuristic branch July 21, 2021 15:39

benwtrent mentioned this pull request Jul 21, 2021

[7.x] [ML] adding new p_value scoring heuristic to significant terms aggregation (#75313) #75597

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

tveasey mentioned this pull request Aug 20, 2021

[ML] Avoid very low p-values if the term is only a tiny fraction of the foreground set #76764

Merged
