[ML] verify that there are no duplicate leaf fields in aggs #41895


benwtrent
Member

@benwtrent benwtrent commented May 7, 2019

This PR adds two validations for the data frame pivot config:

  • That there are no duplicate fields in the group_by or the aggs definitions
  • That there are no fields declared as both an object and a non-object field, e.g. both foo.bar.baz and foo.bar

Before this PR, the best case was that we could determine the mapped type automatically and prevent the transform from even being started. However, if we relied on dynamic mapping, index mapping failures would spam the logs until the task eventually failed because of the indexing failures.

@elasticmachine
Collaborator

Pinging @elastic/ml-core

Contributor

@hendrikmuhs hendrikmuhs left a comment

added some comments

// TODO this will need changed once we allow multi-bucket aggs + field merging
aggregationConfig.getAggregatorFactories().forEach(agg -> addAggNames(agg, usedNames));
aggregationConfig.getPipelineAggregatorFactories().forEach(agg -> addAggNames(agg, usedNames));
usedNames.addAll(groups.getGroups().keySet());
Contributor

I might be missing something, but wouldn't it be simpler to sort the names and then compare adjacent pairs?
(Of course, you would need logic to handle the dots.)
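
For illustration, the sort-and-compare idea could look roughly like the sketch below. It is not the code from this PR: the class name, method names, and the duplicate-field message are hypothetical, and it assumes names are split on dots and ordered token by token so that a parent object always sorts directly before its descendants. A plain string sort would not guarantee that, since `foo-bar` sorts between `foo` and `foo.bar`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class SortedNameCheck {

    // Hypothetical helper: returns one message per duplicate or object/field conflict found.
    static List<String> findConflicts(Collection<String> usedNames) {
        List<String[]> paths = new ArrayList<>();
        for (String name : usedNames) {
            paths.add(name.split("\\."));
        }
        // Token-wise lexicographic order: a shorter path sorts before any path it prefixes.
        paths.sort((a, b) -> Arrays.compare(a, b));

        List<String> failures = new ArrayList<>();
        for (int i = 1; i < paths.size(); i++) {
            String[] prev = paths.get(i - 1);
            String[] curr = paths.get(i);
            if (Arrays.equals(prev, curr)) {
                // Message text is an assumption, not taken from the PR.
                failures.add("duplicate field [" + String.join(".", curr) + "] detected");
            } else if (isPrefix(prev, curr)) {
                failures.add("field [" + String.join(".", prev) + "] cannot be both an object and a field");
            }
        }
        return failures;
    }

    // True when parent is a strict token-wise prefix of child.
    static boolean isPrefix(String[] parent, String[] child) {
        return parent.length < child.length
            && Arrays.equals(parent, Arrays.copyOf(child, parent.length));
    }
}
```

Comparing only adjacent pairs reports each conflicting parent at least once, which is enough to reject the config, though it does not list every affected child.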

}

for (String fullName : usedNames) {
String[] tokens = fullName.split("\\.");
Contributor

you omit the dots, so what if I have foo.bar.baz and foobar? If I get it right, this would fail validation.

Member Author

You are correct. I need to fix that
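
A small sketch of the problem being discussed, assuming (as the comment describes) that prefixes were rebuilt from the split tokens without re-inserting the dots. The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

final class PrefixJoinDemo {

    // Joins the first `count` tokens using the given separator.
    static String prefix(String[] tokens, int count, String separator) {
        return String.join(separator, Arrays.copyOfRange(tokens, 0, count));
    }

    public static void main(String[] args) {
        String[] tokens = "foo.bar.baz".split("\\.");
        // Without the separator the prefix collides with an unrelated field named "foobar".
        System.out.println(prefix(tokens, 2, ""));
        // With the separator the prefix is the actual parent object "foo.bar".
        System.out.println(prefix(tokens, 2, "."));
    }
}
```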

}


private static void addAggNames(AggregationBuilder aggregationBuilder, List<String> names) {
Contributor

nit: could be just Collection<String> ?

aggregationBuilder.getPipelineAggregations().forEach(agg -> addAggNames(agg, names));
}

private static void addAggNames(PipelineAggregationBuilder pipelineAggregationBuilder, List<String> names) {
Contributor

nit: as above, could be just Collection ?
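
To illustrate the nit: widening the parameter to `Collection<String>` lets the same recursive collector fill a list or a set. The sketch below uses a hypothetical `Agg` stand-in (a name plus sub-aggregations) instead of the real `AggregationBuilder`, and it ignores how nested names are actually composed in the PR.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class AggNameCollector {

    /** Hypothetical stand-in for an aggregation builder: a name plus sub-aggregations. */
    static final class Agg {
        final String name;
        final List<Agg> subAggs;

        Agg(String name, Agg... subAggs) {
            this.name = name;
            this.subAggs = List.of(subAggs);
        }
    }

    // Accepting Collection<String> means callers choose the concrete container.
    static void addAggNames(Agg agg, Collection<String> names) {
        names.add(agg.name);
        agg.subAggs.forEach(sub -> addAggNames(sub, names));
    }

    public static void main(String[] args) {
        Agg root = new Agg("a", new Agg("b"), new Agg("b"));

        List<String> asList = new ArrayList<>();
        addAggNames(root, asList);   // keeps duplicates, useful for duplicate detection

        Set<String> asSet = new LinkedHashSet<>();
        addAggNames(root, asSet);    // collapses duplicates

        System.out.println(asList.size() + " names in the list, " + asSet.size() + " in the set");
    }
}
```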

@@ -136,6 +139,74 @@ public void testDoubleAggs() throws IOException {
expectThrows(IllegalArgumentException.class, () -> createPivotConfigFromString(pivot, false));
}

public void testAggNameValidations() throws IOException {
Contributor

The tests are good, but I wonder whether aggFieldValidation(...) could be refactored so that it can be tested at the unit-test level with less boilerplate and more coverage?


List<String> validationFailures = new ArrayList<>();
List<String> usedNames = new ArrayList<>();
// TODO this will need changed once we allow multi-bucket aggs + field merging
Contributor

"need changed" -> "need to be changed"?


for (String fullName : usedNames) {
String[] tokens = fullName.split("\\.");
for (int i = tokens.length - 1; i > 0; i--) {
Contributor

Could you explain why you iterate backwards and create a separate StringBuilder in each iteration?
I would go for something like:
for (String fullName : usedNames) {
    String[] tokens = fullName.split("\\.");
    StringBuilder prefix = new StringBuilder();
    for each token:
        prefix.append(token)
        check in "leafNames"
}

I believe this code will be both more performant and easier to read. Please LMK if I'm missing something here.
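
A runnable version of that suggestion might look like the following. This is only a sketch under assumptions: `leafNames` is taken to be the set of all used names, the class and method names are hypothetical, and the dot separator is re-inserted while the prefix grows so that `foo.bar.baz` is never compared against `foobar`.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class PrefixCheck {

    // Hypothetical helper: reports every name that is used both as a leaf and as a parent object.
    static List<String> findObjectFieldConflicts(Collection<String> usedNames) {
        Set<String> leafNames = new HashSet<>(usedNames);
        // LinkedHashSet keeps first-seen order and avoids repeating a message per child.
        Set<String> failures = new LinkedHashSet<>();
        for (String fullName : usedNames) {
            String[] tokens = fullName.split("\\.");
            StringBuilder prefix = new StringBuilder();
            // Stop before the last token: the full name is a leaf by definition.
            for (int i = 0; i < tokens.length - 1; i++) {
                if (i > 0) {
                    prefix.append('.');
                }
                prefix.append(tokens[i]);
                if (leafNames.contains(prefix.toString())) {
                    failures.add("field [" + prefix + "] cannot be both an object and a field");
                }
            }
        }
        return new ArrayList<>(failures);
    }
}
```

One StringBuilder is reused per name, so each prefix costs an append and a set lookup rather than a rebuild from the first token.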

assertFalse(fieldValidation.isEmpty());
assertThat(fieldValidation.get(0), equalTo("field [user] cannot be both an object and a field"));

pivotAggs = "{"
Contributor

Usually it is a good idea to split such a long test method with independent tests into a few (here, 4) shorter methods. This makes the tests more "unit", thus increasing readability.

@hendrikmuhs
Contributor

run elasticsearch-ci/1

Contributor

@hendrikmuhs hendrikmuhs left a comment

LGTM

Contributor

@przemekwitek przemekwitek left a comment

LGTM

@benwtrent benwtrent merged commit 0531987 into elastic:master May 9, 2019
@benwtrent benwtrent deleted the feature/ml-df-disallow-duplicate-leaf-fields branch May 9, 2019 15:51
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request May 9, 2019
…41895)

* [ML] verify that there are no duplicate leaf fields in aggs

* addressing pr comments

* addressing PR comments

* optmizing duplication check
benwtrent added a commit that referenced this pull request May 9, 2019
…42025)

* [ML] verify that there are no duplicate leaf fields in aggs

* addressing pr comments

* addressing PR comments

* optmizing duplication check
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
…41895)

* [ML] verify that there are no duplicate leaf fields in aggs

* addressing pr comments

* addressing PR comments

* optmizing duplication check