# New Matrix Stats Aggregation module #18300

Merged
merged 3 commits into from Jun 1, 2016

## Conversation

Projects
None yet
8 participants
Member

### nknize commented May 12, 2016 • edited

 This PR adds a new aggs-matrix-stats module. The module introduces a new class of aggregations called `Matrix Aggregations`. Matrix aggregations work on multiple fields and produce a matrix as output. The matrix aggregation provided by this module is the `matrix_stats` aggregation which computes the following descriptive statistics over a set of fields: Covariance Correlation For completeness (and interpretation purposes) the following per-field statistics are also provided: sample count population mean population variance population skewness population kurtosis Example request: ```"aggs": { "matrixstats": { "matrix_stats": { "field": ["poverty", "income"] } } }``` Example response: ```{ "aggregations": { "matrixstats": { "fields": [{ "name": "income", "count": 50, "mean": 51985.1, "variance": 7.383377037755103E7, "skewness": 0.5595114003506483, "kurtosis": 2.5692365287787124, "covariance": { "income": 7.383377037755103E7, "poverty": -21093.65836734694 }, "correlation": { "income": 1.0, "poverty": -0.8352655256272504 } }, { "name": "poverty", "count": 50, "mean": 12.732000000000001, "variance": 8.637730612244896, "skewness": 0.4516049811903419, "kurtosis": 2.8615929677997767, "covariance": { "income": -21093.65836734694, "poverty": 8.637730612244896 }, "correlation": { "income": -0.8352655256272504, "poverty": 1.0 } }] } } }``` This PR replaces #16826. The differences include: renames aggregation from `multifield_stats` to `matrix_stats` adds aggregation code as a module instead of directly to core refactors Integration Tests as REST Tests

Closed

Closed

Member Author

### nknize commented May 13, 2016 • edited

 This PR has been updated to include MultiValuesSource classes (per comment in #18285) @colings86, @jpountz could one (or both) of you have a look if you get a chance?

### colings86 reviewed May 16, 2016

core/src/main/java/org/elasticsearch/search/aggregations/InternalAggregation.java Outdated
 @@ -263,6 +265,7 @@ public final void readFrom(StreamInput in) throws IOException { public static final String FROM_AS_STRING = "from_as_string"; public static final String TO = "to"; public static final String TO_AS_STRING = "to_as_string"; public static final ParseField VALUE_TYPE = new ParseField("value_type");

#### colings86 May 16, 2016

Member

I think this should actually be in ValuesSourceAggregatorBuilder since this `CommonFields` interface is for the response side of the aggregations but `value_type` is for the request side, so it doesn't really belong here

#### nknize May 16, 2016

Author Member

Ah, so is `InternalAggregation` specifically for the response side? I'm not as intimate with the aggs framework so I thought it was just an internal framework class (request / response agnostic).

So then what do you think about having a `CommonFields` static class to `AggregatorBuilder` for common fields used on the request side?

#### colings86 May 17, 2016

Member

+1 to having `CommonFields` class in `AggregatorBuilder`

### colings86 reviewed May 16, 2016

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
 return new InternalMatrixStats(name, 0, null, MatrixStatsResults.EMPTY(), pipelineAggregators(), getMetaData()); } RunningStats runningStats = ((InternalMatrixStats) aggregations.get(0)).stats;

#### colings86 May 16, 2016 • edited

Member

We had a bug in aggregations a while ago which was caused by using the first aggregation as the base for the reduce. Instead I think we should create a new empty matrixStats object here and then merge all of the shard results into it.

Contributor

+1

### colings86 reviewed May 16, 2016

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
 if (valuesCount <= 0) { return null; } // get the field value (multi-value is the average of all the values)

#### colings86 May 16, 2016

Member

is this a valid way to deal with multi-value fields (from a stats perspective)? If multi-value fields don't really work for this agg then maybe we should just stop it working on multi-value fields? (there are other aggs that do this)

Contributor

+1

#### nknize May 16, 2016

Author Member

Not really. In the end multi-value fields are treated as a single observation which is technically incorrect. The correct way is to factor multi-value fields into the weighted result. This is a todo (along with handling pair-wise missing values). But I'm not sure what you mean by "stop it working"? Do you mean skip the multi-value document? Or throw some kind of Unsupported exception?

#### jpountz May 16, 2016

Contributor

I think thorowing an exception would be a good first step and then in a follow-up we could consider accepting a `mode` option like sorting does so that users can configure that we should take the min/max/median/average in case of multiple values

Member

+1

### colings86 reviewed May 16, 2016

...ats/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsResults.java Outdated
 protected MatrixStatsResults(StreamInput in) { try { results = new RunningStats(in); this.readFrom(in);

#### colings86 May 16, 2016

Member

We don't really need this constructor and the readFrom() method. Can we get rid of the readFrom method and move the reading logic here? Also can we move the writeTo method so it's straight after this constructor so the read and write code is together? Also I think this constructor should throw IOException (I believe this is what other aggs do)

#### nknize May 16, 2016 • edited

Author Member

Can't remove `readFrom()` since its part of the Streamable Interface. Do you want me to just noop?

#### nknize May 16, 2016

Author Member

nvm. I'll convert from a `Streamable` to a `Writeable`. I just noticed the difference between the two.

Member

### colings86 commented May 16, 2016

 @nknize I left some comments on the aggregation type stuff but I don't know enough about the actual statistics we are calculating here to review those bits. Maybe @brwe or @polyfractal would be able to review those bits?

### jpountz reviewed May 16, 2016

...tats/src/main/java/org/elasticsearch/search/aggregations/matrix/MatrixAggregationPlugin.java Outdated
 public class MatrixAggregationPlugin extends Plugin { static { InternalMatrixStats.registerStreams();

#### jpountz May 16, 2016

Contributor

should we rather do it in onModule? cc @colings86

Member

+1

### jpountz reviewed May 16, 2016

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
 public Double getMean(String field) { if (results == null) { return null; }

#### jpountz May 16, 2016

Contributor

to be consistent with the stats agg I think we should return NaN rather than null here (but still write null in toXContent since json does not support nan)

#### jpountz May 16, 2016

Contributor

(and similarly for other below)

#### nknize May 16, 2016

Author Member

+1 Good catch. The toXContent checks for null and returns NaN. Makes way more sense to return NaN here.

### jpountz reviewed May 16, 2016

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
 @Override public Map> getCorrelation() { return results.getCorrelations(); }

#### jpountz May 16, 2016

Contributor

in terms of API I think we should choose which one of the two above methods we want to expose rather than exposing both

Author Member

+1

### jpountz reviewed May 16, 2016

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
 final long count = in.readVLong(); if (count > 0) { results = new MatrixStatsResults(in); }

#### jpountz May 16, 2016

Contributor

I am not happy that if stats or results is null on one side of the wire, then it becomes an empty instance on the other side of the wire. I would like it better if we either make sure that stats/results is never null or if we transport null so that null on one side is still null on the other side (I like the former better, but the latter would work for me too if keeping nulls is simpler)

#### nknize May 16, 2016

Author Member

+1. I'm going to switch over to `Writeable` from `Streamable` and use the built-in `readOptionalWriteable` and `writeOptionalWriteable`. Looks like this should ensure consistency on both sides of the wire?

### jpountz reviewed May 16, 2016

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
 @Override public InternalAggregation doReduce(List aggregations, ReduceContext reduceContext) { // merge stats across all shards aggregations.removeIf(p -> ((InternalMatrixStats)p).stats == null);

#### jpountz May 16, 2016

Contributor

I think we should avoid modifying the list in place. Maybe make a copy first?

### jpountz reviewed May 16, 2016

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
 boolean needsScores = false; if (valuesSources != null) { for (Map.Entry valueSource : valuesSources.entrySet()) { needsScores |= valueSource.getValue().needsScores();

#### jpountz May 16, 2016

Contributor

if the key is not needed, let's just iterate over valuesSources.values()?

### jpountz reviewed May 16, 2016

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
 private Map getFields(int doc) { // get fieldNames to use as hash keys ArrayList fieldNames = new ArrayList<>(values.keySet()); HashMap fields = new HashMap<>(fieldNames.size());

#### jpountz May 16, 2016

Contributor

this is called in `collect` which is typically called in a tight loop so we generally try to avoid having object creation done under collect(). Maybe we could use ordinal to identify field names so that the `HashMap<String, Double>` can become a simple `double[]` and then reuse the `double[]` across calls to collect?

### polyfractal reviewed May 16, 2016

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
 fieldValue += doubleValues.valueAt(i); } } fields.put(fieldName, fieldValue / valuesCount);

#### polyfractal May 16, 2016

Member

Unsure if this matters, but if `valueCount` is a positive, non-zero integer but all the values are `NaN`, the calculation will be 0. Is that the intended behavior? Since we're treating `NaN` as if it's a missing value and skip them, perhaps we should count how many values were collected instead of using `valuesCount`?

### polyfractal reviewed May 16, 2016

 nA2 = nA * nA; // doc A num samples squared nB2 = nB * nB; // doc B num samples squared // variance variances.put(fieldName, varA + varB + d2 * nA * other.docCount / docCount);

#### polyfractal May 16, 2016

Member

This caught my eye at first because I wasn't sure how the order of operations was falling out, then I was curious. Is the usage of delta squared mean here equivalent to a Pooled Variance like http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html ?

I was expecting something like:

``````(
nA * (Math.pow(varA,2) + Math.pow(meanA - combinedMean, 2))
+
nB * (Math.pow(varB,2) + Math.pow(meanB - combinedMean, 2))
) / docCount
``````

But my math skills are super rusty, so heaps of salt etc etc :)

#### polyfractal May 16, 2016

Member

Oh, just saw the PDF in the comment for streaming formulas. I assume it's one of those, so you can probably ignore this :)

Member

### polyfractal commented May 16, 2016

 I tried, and largely failed, to grok the stats calculations. My math skills are pretty rusty and the streaming formulas look different from what I'm used to. I'd try to get a second opinion from @brwe :)

Member Author

### nknize commented May 23, 2016 • edited

 Great suggestions! I incorporated all feedback and the PR is ready for another review. Here are the main changes: Added `MultiValueMode` support for handling multi value fields (it was easier adding it here instead of postponing to a later PR) Added a `MultiValuesSource` class to encapsulate field names and ValuesSource arrays, this is now used to retrieve document field values instead of creating a Map inside of `collect` Updated `doReduce` to merge all shard results in an empty `RunningStats` object. Added Rest testing for `MultiValueMode` support

### jpountz reviewed May 25, 2016

.../java/org/elasticsearch/search/aggregations/support/MultiValuesSourceAggregationBuilder.java Outdated
 private ValueType valueType = null; private String format = null; private Object missing = null; private DateTimeZone timeZone = null;

#### jpountz May 25, 2016

Contributor

do we make use of things like missing and timeZone already? If we don't I think we should remove them for now?

### jpountz reviewed May 25, 2016

 /** kurtosis values (fourth moment) */ protected HashMap kurtosis; /** covariance values */ protected HashMap> covariances;

#### jpountz May 25, 2016

Contributor

to make this class more efficient, we might want to give an ordinal to every field and then use field ordinals as keys rather than field names? (I don't mind it being done as a follow-up, I'm just suggesting)

#### nknize May 26, 2016

Author Member

+1 for followup PR. I'll open an issue

Member Author

### nknize commented May 26, 2016

 Thanks for the latest round of review @jpountz. I went ahead and made the minor code changes, added support for missing values, and updated documentation. Let me know what you think.

### jpountz reviewed May 31, 2016

docs/reference/modules/aggregations/matrix/stats.asciidoc Outdated
 } -------------------------------------------------- The aggregation type is `matrix_stats` and the `field` setting defines the set of fields (as an array) for computing

Contributor

s/field/fields/

### jpountz reviewed May 31, 2016

docs/reference/modules/aggregations/matrix/stats.asciidoc Outdated
 `min`:: Pick the lowest value. `max`:: Pick the highest value. `sum`:: Use the sum of all values as sort value. `median`:: Use the median of all values as sort value.

#### jpountz May 31, 2016

Contributor

maybe s/as sort values// since there is no sorting happening?

### jpountz reviewed May 31, 2016

docs/reference/modules/aggregations/matrix/stats.asciidoc Outdated
 "matrixstats": { "matrix_stats": { "fields": ["poverty", "income"], "missing": [{"income" : 50000}] <1>

#### jpountz May 31, 2016

Contributor

should it rather be `"missing": {"income" : 50000}`, ie. do we need the array?

### jpountz reviewed May 31, 2016

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
 final NumericDoubleValues doubleValues = values[i]; final double value = doubleValues.get(doc); // skip if value is missing // todo: we could add option to specify a missing value?

#### jpountz May 31, 2016

Contributor

the doc states there is such an option?

Contributor

### jpountz commented May 31, 2016

 I'd like @colings86 to have a final look, at least for the core refactorings that you did. But otherwise this looks good to me for a first iteration!

### colings86 reviewed May 31, 2016

 @@ -76,6 +76,21 @@ public boolean valid() { return this; } public ValuesSourceConfig format(final DocValueFormat format) {

#### colings86 May 31, 2016

Member

Should we make all the fields of this class private now we have setters/getters for them so there is only one way of accessing/modifying the values? (maybe in a followup PR?)

#### nknize May 31, 2016

Author Member

I went ahead and did it in this PR but will keep it as a separate commit when I merge.

Member

### colings86 commented May 31, 2016

 @jpountz @nknize I took a look at the core changes and they LGTM. I also took a brief look at the module, there are unit and REST tests but we are missing integration tests for this. Is there a reason we have not done an integration tests for this aggregation? All other aggregations have integration tests.
Member Author

### nknize commented May 31, 2016

 @colings86 I had asked @rjernst about integration testing in modules. I could be wrong but I think to avoid the overhead of starting/stopping integration test clusters in plugins and modules the `ESIntegTestCase` tests should be replaced by simple Unit and Rest tests? It was a bit tricky but what I did was refactor the integration tests (defined by `AbstractNumericTestCase`) from the original PR (#16826) to the yaml Rest definitions included in this PR. I also split out testing of `RunningStats` and `MatrixStatsResults` into a separate `RunningStatsTests` unit test class to ensure the RunningStats (and corresponding results) are computed as expected.

Member

### rjernst commented Jun 1, 2016

 there are unit and REST tests but we are missing integration tests for this REST tests are real integration tests. No need for fantasy land tests.

### nknize added some commits May 12, 2016

``` New Matrix Stats Aggregation module ```
```This commit adds a new aggs-matrix-stats module. The module presents a new class of aggregations called Matrix Aggregations. Matrix aggregations work on multiple fields and produce a matrix as output. The first matrix aggregation provided by the module is matrix_stats aggregation. This aggregation computes the following statistics over a set of fields:

* Covariance
* Correlation

For completeness (and interpretation purposes) the following per-field statistics are also provided:

* sample count
* population mean
* population variance
* population skewness
* population kurtosis```
``` f449666 ```
``` Adding MultiValuesSource support classes and documentation to matrix … ```
`…stats agg module`
``` 90b8f5d ```
``` set ValuesSourceConfig fields as private ```
``` 54575e5 ```

### nknize merged commit `54575e5` into elastic:master Jun 1, 2016 1 check passed

#### 1 check passed

CLA Commit author is a member of Elasticsearch
Details

Open

Closed

Closed

Merged

Open