Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Matrix Stats Aggregation module #18300

Merged
merged 3 commits into from Jun 1, 2016

Conversation

Projects
None yet
8 participants
@nknize
Copy link
Member

commented May 12, 2016

This PR adds a new aggs-matrix-stats module. The module introduces a new class of aggregations called Matrix Aggregations. Matrix aggregations work on multiple fields and produce a matrix as output. The matrix aggregation provided by this module is the matrix_stats aggregation which computes the following descriptive statistics over a set of fields:

  • Covariance
  • Correlation

For completeness (and interpretation purposes) the following per-field statistics are also provided:

  • sample count
  • population mean
  • population variance
  • population skewness
  • population kurtosis

Example request:

"aggs": {
    "matrixstats": {
        "matrix_stats": {
            "field": ["poverty", "income"]
        }
     }
}

Example response:

{
    "aggregations": {
        "matrixstats": {
            "fields": [{
                "name": "income",
                "count": 50,
                "mean": 51985.1,
                "variance": 7.383377037755103E7,
                "skewness": 0.5595114003506483,
                "kurtosis": 2.5692365287787124,
                "covariance": {
                    "income": 7.383377037755103E7,
                    "poverty": -21093.65836734694
                },
                "correlation": {
                    "income": 1.0,
                    "poverty": -0.8352655256272504
                }
            }, {
                "name": "poverty",
                "count": 50,
                "mean": 12.732000000000001,
                "variance": 8.637730612244896,
                "skewness": 0.4516049811903419,
                "kurtosis": 2.8615929677997767,
                "covariance": {
                    "income": -21093.65836734694,
                    "poverty": 8.637730612244896
                },
                "correlation": {
                    "income": -0.8352655256272504,
                    "poverty": 1.0
                }
            }]
        }
    }
}

This PR replaces #16826. The differences include:

  • renames aggregation from multifield_stats to matrix_stats
  • adds aggregation code as a module instead of directly to core
  • refactors Integration Tests as REST Tests
@nknize

This comment has been minimized.

Copy link
Member Author

commented May 13, 2016

This PR has been updated to include MultiValuesSource classes (per comment in #18285)

@colings86, @jpountz could one (or both) of you have a look if you get a chance?

@colings86

View changes

core/src/main/java/org/elasticsearch/search/aggregations/InternalAggregation.java Outdated
@@ -263,6 +265,7 @@ public final void readFrom(StreamInput in) throws IOException {
public static final String FROM_AS_STRING = "from_as_string";
public static final String TO = "to";
public static final String TO_AS_STRING = "to_as_string";
public static final ParseField VALUE_TYPE = new ParseField("value_type");

This comment has been minimized.

Copy link
@colings86

colings86 May 16, 2016

Member

I think this should actually be in ValuesSourceAggregatorBuilder since this CommonFields interface is for the response side of the aggregations but value_type is for the request side, so it doesn't really belong here

This comment has been minimized.

Copy link
@nknize

nknize May 16, 2016

Author Member

Ah, so is InternalAggregation specifically for the response side? I'm not as intimate with the aggs framework so I thought it was just an internal framework class (request / response agnostic).

So then what do you think about having a CommonFields static class to AggregatorBuilder for common fields used on the request side?

This comment has been minimized.

Copy link
@colings86

colings86 May 17, 2016

Member

+1 to having CommonFields class in AggregatorBuilder

@colings86

View changes

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
return new InternalMatrixStats(name, 0, null, MatrixStatsResults.EMPTY(), pipelineAggregators(), getMetaData());
}

RunningStats runningStats = ((InternalMatrixStats) aggregations.get(0)).stats;

This comment has been minimized.

Copy link
@colings86

colings86 May 16, 2016

Member

We had a bug in aggregations a while ago which was caused by using the first aggregation as the base for the reduce. Instead I think we should create a new empty matrixStats object here and then merge all of the shard results into it.

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

+1

@colings86

View changes

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
if (valuesCount <= 0) {
return null;
}
// get the field value (multi-value is the average of all the values)

This comment has been minimized.

Copy link
@colings86

colings86 May 16, 2016

Member

is this a valid way to deal with multi-value fields (from a stats perspective)? If multi-value fields don't really work for this agg then maybe we should just stop it working on multi-value fields? (there are other aggs that do this)

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

+1

This comment has been minimized.

Copy link
@nknize

nknize May 16, 2016

Author Member

Not really. In the end multi-value fields are treated as a single observation which is technically incorrect. The correct way is to factor multi-value fields into the weighted result. This is a todo (along with handling pair-wise missing values). But I'm not sure what you mean by "stop it working"? Do you mean skip the multi-value document? Or throw some kind of Unsupported exception?

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

I think thorowing an exception would be a good first step and then in a follow-up we could consider accepting a mode option like sorting does so that users can configure that we should take the min/max/median/average in case of multiple values

This comment has been minimized.

Copy link
@colings86
@colings86

View changes

...ats/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsResults.java Outdated
protected MatrixStatsResults(StreamInput in) {
try {
results = new RunningStats(in);
this.readFrom(in);

This comment has been minimized.

Copy link
@colings86

colings86 May 16, 2016

Member

We don't really need this constructor and the readFrom() method. Can we get rid of the readFrom method and move the reading logic here? Also can we move the writeTo method so it's straight after this constructor so the read and write code is together? Also I think this constructor should throw IOException (I believe this is what other aggs do)

This comment has been minimized.

Copy link
@nknize

nknize May 16, 2016

Author Member

Can't remove readFrom() since its part of the Streamable Interface. Do you want me to just noop?

This comment has been minimized.

Copy link
@nknize

nknize May 16, 2016

Author Member

nvm. I'll convert from a Streamable to a Writeable. I just noticed the difference between the two.

@colings86

This comment has been minimized.

Copy link
Member

commented May 16, 2016

@nknize I left some comments on the aggregation type stuff but I don't know enough about the actual statistics we are calculating here to review those bits. Maybe @brwe or @polyfractal would be able to review those bits?

@jpountz

View changes

...tats/src/main/java/org/elasticsearch/search/aggregations/matrix/MatrixAggregationPlugin.java Outdated

public class MatrixAggregationPlugin extends Plugin {
static {
InternalMatrixStats.registerStreams();

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

should we rather do it in onModule? cc @colings86

This comment has been minimized.

Copy link
@colings86
@jpountz

View changes

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
public Double getMean(String field) {
if (results == null) {
return null;
}

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

to be consistent with the stats agg I think we should return NaN rather than null here (but still write null in toXContent since json does not support nan)

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

(and similarly for other below)

This comment has been minimized.

Copy link
@nknize

nknize May 16, 2016

Author Member

+1 Good catch. The toXContent checks for null and returns NaN. Makes way more sense to return NaN here.

@jpountz

View changes

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
@Override
public Map<String, HashMap<String, Double>> getCorrelation() {
return results.getCorrelations();
}

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

in terms of API I think we should choose which one of the two above methods we want to expose rather than exposing both

This comment has been minimized.

Copy link
@nknize

nknize May 16, 2016

Author Member

+1

@jpountz

View changes

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
final long count = in.readVLong();
if (count > 0) {
results = new MatrixStatsResults(in);
}

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

I am not happy that if stats or results is null on one side of the wire, then it becomes an empty instance on the other side of the wire. I would like it better if we either make sure that stats/results is never null or if we transport null so that null on one side is still null on the other side (I like the former better, but the latter would work for me too if keeping nulls is simpler)

This comment has been minimized.

Copy link
@nknize

nknize May 16, 2016

Author Member

+1. I'm going to switch over to Writeable from Streamable and use the built-in readOptionalWriteable and writeOptionalWriteable. Looks like this should ensure consistency on both sides of the wire?

@jpountz

View changes

...ts/src/main/java/org/elasticsearch/search/aggregations/matrix/stats/InternalMatrixStats.java Outdated
@Override
public InternalAggregation doReduce(List<InternalAggregation> aggregations, ReduceContext reduceContext) {
// merge stats across all shards
aggregations.removeIf(p -> ((InternalMatrixStats)p).stats == null);

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

I think we should avoid modifying the list in place. Maybe make a copy first?

@jpountz

View changes

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
boolean needsScores = false;
if (valuesSources != null) {
for (Map.Entry<String, ValuesSource.Numeric> valueSource : valuesSources.entrySet()) {
needsScores |= valueSource.getValue().needsScores();

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

if the key is not needed, let's just iterate over valuesSources.values()?

@jpountz

View changes

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
private Map<String, Double> getFields(int doc) {
// get fieldNames to use as hash keys
ArrayList<String> fieldNames = new ArrayList<>(values.keySet());
HashMap<String, Double> fields = new HashMap<>(fieldNames.size());

This comment has been minimized.

Copy link
@jpountz

jpountz May 16, 2016

Contributor

this is called in collect which is typically called in a tight loop so we generally try to avoid having object creation done under collect(). Maybe we could use ordinal to identify field names so that the HashMap<String, Double> can become a simple double[] and then reuse the double[] across calls to collect?

@polyfractal

View changes

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
fieldValue += doubleValues.valueAt(i);
}
}
fields.put(fieldName, fieldValue / valuesCount);

This comment has been minimized.

Copy link
@polyfractal

polyfractal May 16, 2016

Member

Unsure if this matters, but if valueCount is a positive, non-zero integer but all the values are NaN, the calculation will be 0. Is that the intended behavior? Since we're treating NaN as if it's a missing value and skip them, perhaps we should count how many values were collected instead of using valuesCount?

nA2 = nA * nA; // doc A num samples squared
nB2 = nB * nB; // doc B num samples squared
// variance
variances.put(fieldName, varA + varB + d2 * nA * other.docCount / docCount);

This comment has been minimized.

Copy link
@polyfractal

polyfractal May 16, 2016

Member

This caught my eye at first because I wasn't sure how the order of operations was falling out, then I was curious. Is the usage of delta squared mean here equivalent to a Pooled Variance like http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html ?

I was expecting something like:

(
  nA * (Math.pow(varA,2) + Math.pow(meanA - combinedMean, 2))
  +
  nB * (Math.pow(varB,2) + Math.pow(meanB - combinedMean, 2))
) / docCount

But my math skills are super rusty, so heaps of salt etc etc :)

This comment has been minimized.

Copy link
@polyfractal

polyfractal May 16, 2016

Member

Oh, just saw the PDF in the comment for streaming formulas. I assume it's one of those, so you can probably ignore this :)

@polyfractal

This comment has been minimized.

Copy link
Member

commented May 16, 2016

I tried, and largely failed, to grok the stats calculations. My math skills are pretty rusty and the streaming formulas look different from what I'm used to. I'd try to get a second opinion from @brwe :)

@nknize nknize force-pushed the nknize:feature/MatrixStats-Module branch May 23, 2016

@nknize

This comment has been minimized.

Copy link
Member Author

commented May 23, 2016

Great suggestions! I incorporated all feedback and the PR is ready for another review. Here are the main changes:

  • Added MultiValueMode support for handling multi value fields (it was easier adding it here instead of postponing to a later PR)
  • Added a MultiValuesSource class to encapsulate field names and ValuesSource arrays, this is now used to retrieve document field values instead of creating a Map inside of collect
  • Updated doReduce to merge all shard results in an empty RunningStats object.
  • Added Rest testing for MultiValueMode support

/cc @colings86, @jpountz

@nknize nknize added v5.0.0-alpha3 and removed discuss WIP labels May 24, 2016

@jpountz

View changes

.../java/org/elasticsearch/search/aggregations/support/MultiValuesSourceAggregationBuilder.java Outdated
private ValueType valueType = null;
private String format = null;
private Object missing = null;
private DateTimeZone timeZone = null;

This comment has been minimized.

Copy link
@jpountz

jpountz May 25, 2016

Contributor

do we make use of things like missing and timeZone already? If we don't I think we should remove them for now?

/** kurtosis values (fourth moment) */
protected HashMap<String, Double> kurtosis;
/** covariance values */
protected HashMap<String, HashMap<String, Double>> covariances;

This comment has been minimized.

Copy link
@jpountz

jpountz May 25, 2016

Contributor

to make this class more efficient, we might want to give an ordinal to every field and then use field ordinals as keys rather than field names? (I don't mind it being done as a follow-up, I'm just suggesting)

This comment has been minimized.

Copy link
@nknize

nknize May 26, 2016

Author Member

+1 for followup PR. I'll open an issue

@nknize

This comment has been minimized.

Copy link
Member Author

commented May 26, 2016

Thanks for the latest round of review @jpountz. I went ahead and made the minor code changes, added support for missing values, and updated documentation. Let me know what you think.

@nknize nknize force-pushed the nknize:feature/MatrixStats-Module branch May 27, 2016

@jpountz

View changes

docs/reference/modules/aggregations/matrix/stats.asciidoc Outdated
}
--------------------------------------------------

The aggregation type is `matrix_stats` and the `field` setting defines the set of fields (as an array) for computing

This comment has been minimized.

Copy link
@jpountz

jpountz May 31, 2016

Contributor

s/field/fields/

@jpountz

View changes

docs/reference/modules/aggregations/matrix/stats.asciidoc Outdated
`min`:: Pick the lowest value.
`max`:: Pick the highest value.
`sum`:: Use the sum of all values as sort value.
`median`:: Use the median of all values as sort value.

This comment has been minimized.

Copy link
@jpountz

jpountz May 31, 2016

Contributor

maybe s/as sort values// since there is no sorting happening?

@jpountz

View changes

docs/reference/modules/aggregations/matrix/stats.asciidoc Outdated
"matrixstats": {
"matrix_stats": {
"fields": ["poverty", "income"],
"missing": [{"income" : 50000}] <1>

This comment has been minimized.

Copy link
@jpountz

jpountz May 31, 2016

Contributor

should it rather be "missing": {"income" : 50000}, ie. do we need the array?

@jpountz

View changes

.../src/main/java/org/elasticsearch/search/aggregations/matrix/stats/MatrixStatsAggregator.java Outdated
final NumericDoubleValues doubleValues = values[i];
final double value = doubleValues.get(doc);
// skip if value is missing
// todo: we could add option to specify a missing value?

This comment has been minimized.

Copy link
@jpountz

jpountz May 31, 2016

Contributor

the doc states there is such an option?

@jpountz

This comment has been minimized.

Copy link
Contributor

commented May 31, 2016

I'd like @colings86 to have a final look, at least for the core refactorings that you did. But otherwise this looks good to me for a first iteration!

@@ -76,6 +76,21 @@ public boolean valid() {
return this;
}

public ValuesSourceConfig<VS> format(final DocValueFormat format) {

This comment has been minimized.

Copy link
@colings86

colings86 May 31, 2016

Member

Should we make all the fields of this class private now we have setters/getters for them so there is only one way of accessing/modifying the values? (maybe in a followup PR?)

This comment has been minimized.

Copy link
@nknize

nknize May 31, 2016

Author Member

I went ahead and did it in this PR but will keep it as a separate commit when I merge.

@colings86

This comment has been minimized.

Copy link
Member

commented May 31, 2016

@jpountz @nknize I took a look at the core changes and they LGTM. I also took a brief look at the module, there are unit and REST tests but we are missing integration tests for this. Is there a reason we have not done an integration tests for this aggregation? All other aggregations have integration tests.

@nknize

This comment has been minimized.

Copy link
Member Author

commented May 31, 2016

@colings86 I had asked @rjernst about integration testing in modules. I could be wrong but I think to avoid the overhead of starting/stopping integration test clusters in plugins and modules the ESIntegTestCase tests should be replaced by simple Unit and Rest tests? It was a bit tricky but what I did was refactor the integration tests (defined by AbstractNumericTestCase) from the original PR (#16826) to the yaml Rest definitions included in this PR. I also split out testing of RunningStats and MatrixStatsResults into a separate RunningStatsTests unit test class to ensure the RunningStats (and corresponding results) are computed as expected.

@nknize nknize force-pushed the nknize:feature/MatrixStats-Module branch May 31, 2016

@rjernst

This comment has been minimized.

Copy link
Member

commented Jun 1, 2016

there are unit and REST tests but we are missing integration tests for this

REST tests are real integration tests. No need for fantasy land tests.

@nknize nknize force-pushed the nknize:feature/MatrixStats-Module branch 4 times, most recently Jun 1, 2016

@nknize nknize removed the review label Jun 1, 2016

nknize added some commits May 12, 2016

New Matrix Stats Aggregation module
This commit adds a new aggs-matrix-stats module. The module presents a new class of aggregations called Matrix Aggregations. Matrix aggregations work on multiple fields and produce a matrix as output. The first matrix aggregation provided by the module is matrix_stats aggregation. This aggregation computes the following statistics over a set of fields:

* Covariance
* Correlation

For completeness (and interpretation purposes) the following per-field statistics are also provided:

* sample count
* population mean
* population variance
* population skewness
* population kurtosis

@nknize nknize force-pushed the nknize:feature/MatrixStats-Module branch to 54575e5 Jun 1, 2016

@nknize nknize merged commit 54575e5 into elastic:master Jun 1, 2016

1 check passed

CLA Commit author is a member of Elasticsearch
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.