[ML] Explain data frame analytics API #49455

Merged

@dimitris-athanasiou dimitris-athanasiou commented Nov 21, 2019

This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.

The API consolidates information that is useful before
creating a data frame analytics job.

It includes:

  • memory estimation
  • field selection explanation

Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.

Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.
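
To make the field selection explanation concrete, here is a minimal sketch of how one entry could be modelled and rendered. It assumes a simplified stand-in class (`FieldSelectionSketch`, a hypothetical name) that mirrors only some of the constructor arguments shown in the review below. The field names, mapping types and reason text are made up for illustration and are not taken from the API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical, simplified stand-in for a field selection entry; not the class from this PR.
final class FieldSelectionSketch {
    final String name;
    final Set<String> mappingTypes;
    final boolean isIncluded;
    final boolean isRequired;
    final String reason; // may be null, e.g. when the field is included

    FieldSelectionSketch(String name, Set<String> mappingTypes, boolean isIncluded,
                         boolean isRequired, String reason) {
        this.name = Objects.requireNonNull(name);
        this.mappingTypes = Collections.unmodifiableSet(mappingTypes);
        this.isIncluded = isIncluded;
        this.isRequired = isRequired;
        this.reason = reason;
    }

    // One human-readable line per field: included or excluded, and why.
    String explain() {
        if (isIncluded) {
            return name + ": included" + (isRequired ? " (required)" : "");
        }
        return name + ": excluded (" + reason + ")";
    }

    public static void main(String[] args) {
        // Field names, mapping types and the reason text are illustrative assumptions.
        List<FieldSelectionSketch> selections = Arrays.asList(
            new FieldSelectionSketch("price", Collections.singleton("double"), true, true, null),
            new FieldSelectionSketch("comment", Collections.singleton("text"), false, false,
                "illustrative reason"));
        for (FieldSelectionSketch selection : selections) {
            System.out.println(selection.explain());
        }
    }
}
```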

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

----
{
"field_selection": [

Contributor Author

I'll come back to add this

Contributor Author

I have now added this in

@dimitris-athanasiou
Contributor Author

@alvarezmelissa87 pinging you as this is the PR that adds field selection explanation

@dimitris-athanasiou
Contributor Author

dimitris-athanasiou commented Nov 21, 2019

@szabosteve I'm updating documentation too in this PR, could you please take a look at the docs changes?

Member

@benwtrent benwtrent left a comment

Great stuff! Some minor comments.

Also, I think you might need to blacklist the failure-focused YAML tests in the ML-with-security test suite.

Contributor

@szabosteve szabosteve left a comment

Thank you @dimitris-athanasiou for documenting this! I left a couple of minor comments, but it LGTM.

@przemekwitek przemekwitek self-requested a review November 22, 2019 09:24
@@ -100,9 +100,9 @@ private MemoryUsageEstimationResult runJob(String jobId,
 } finally {
 process.consumeAndCloseOutputStream();
 try {
-LOGGER.info("[{}] Closing process", jobId);
+LOGGER.debug("[{}] Closing process", jobId);
Contributor

Should these be kept in sync with log-levels in AnalyticsProcessManager.java (which are currently info)?

Contributor Author

As this API is not part of the job's life-cycle, I'd rather it stays quiet at the info level unless something goes wrong.
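
The effect of the level change can be sketched in isolation. The example below uses `java.util.logging` instead of the logger in the PR (an assumption made so that the snippet is self-contained), with `Level.FINE` standing in for debug. The class name and log messages are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch: shows which messages get through at a given logger level.
class QuietLoggingSketch {
    static List<String> capture(Level threshold) {
        Logger logger = Logger.getAnonymousLogger();
        logger.setUseParentHandlers(false); // keep output out of the console handler
        logger.setLevel(threshold);

        List<String> seen = new ArrayList<>();
        Handler handler = new Handler() {
            @Override public void publish(LogRecord record) {
                seen.add(record.getLevel() + ": " + record.getMessage());
            }
            @Override public void flush() { }
            @Override public void close() { }
        };
        handler.setLevel(Level.ALL);
        logger.addHandler(handler);

        logger.fine("Closing process");   // debug-like: dropped when the logger is at INFO
        logger.warning("Process failed"); // still reported when something goes wrong
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("At INFO: " + capture(Level.INFO));
        System.out.println("At FINE: " + capture(Level.FINE));
    }
}
```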

FieldSelection(String name, Set<String> mappingTypes, boolean isIncluded, boolean isRequired, @Nullable FeatureType featureType,
@Nullable String reason) {
this.name = Objects.requireNonNull(name);
this.mappingTypes = Collections.unmodifiableSet(mappingTypes);
Contributor

Objects.requireNonNull(mappingTypes)?

Contributor Author

Collections.unmodifiableSet throws if it's passed null
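
This is easy to check with a small standalone program. The class name below is hypothetical, and the sketch relies only on the JDK's own `Collections.unmodifiableSet`.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical check: does Collections.unmodifiableSet reject a null argument?
class UnmodifiableSetNullCheck {
    static boolean throwsOnNull() {
        try {
            Collections.unmodifiableSet((Set<String>) null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("Throws NullPointerException on null: " + throwsOnNull());
    }
}
```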

.stream().collect(Collectors.toSet());
FieldSelection.FeatureType featureType = randomBoolean() ? null : randomFrom(FieldSelection.FeatureType.values());
String reason = randomBoolean() ? null : randomAlphaOfLength(20);
return new FieldSelection(randomAlphaOfLength(10),
Contributor

Suggested change
-return new FieldSelection(randomAlphaOfLength(10),
+return new FieldSelection(
+randomAlphaOfLength(10),

FieldSelection(String name, Set<String> mappingTypes, boolean isIncluded, boolean isRequired, @Nullable FeatureType featureType,
@Nullable String reason) {
this.name = Objects.requireNonNull(name);
this.mappingTypes = Collections.unmodifiableSet(mappingTypes);
Contributor

Objects.requireNonNull?
Also, should we use ExceptionsHelper.requireNonNull instead?

Contributor Author

ExceptionsHelper.requireNonNull makes sense for user-facing code, so we use it when we parse request objects. In this case we create the object ourselves, so Objects.requireNonNull should be enough.
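
The distinction can be sketched as follows. The helper `requireNonNullUserFacing` is a hypothetical stand-in written for this example. The thread does not show what `ExceptionsHelper.requireNonNull` actually throws, so the exception type and message here are assumptions.

```java
import java.util.Objects;

// Hypothetical sketch contrasting an internal null check with a user-facing one.
class NullCheckStyles {
    // Stand-in for a user-facing helper: names the offending parameter in the message.
    // The exception type and wording are assumptions, not the real helper's behavior.
    static <T> T requireNonNullUserFacing(T value, String paramName) {
        if (value == null) {
            throw new IllegalArgumentException("[" + paramName + "] must not be null");
        }
        return value;
    }

    // Describes what each style of check does when given null.
    static String describe(boolean userFacing) {
        try {
            if (userFacing) {
                requireNonNullUserFacing(null, "name");
            } else {
                Objects.requireNonNull(null);
            }
            return "no exception";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("Internal check:    " + describe(false));
        System.out.println("User-facing check: " + describe(true));
    }
}
```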

@dimitris-athanasiou dimitris-athanasiou changed the title [ML] Data frame analytics info API [ML] Explain data frame analytics API Nov 22, 2019
This commit replaces the _estimate_memory_usage API with
a new API, the _info API.

The API consolidates information that is useful before
creating a data frame analytics job.

It includes:

- memory estimation
- field selection explanation

Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.

Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.
…documentation/MlClientDocumentationIT.java

Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com>
…documentation/MlClientDocumentationIT.java

Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com>
…frame/extractor/ExtractedFieldsDetector.java

Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com>
…/dataframe/RestDataFrameAnalyticsInfoAction.java

Co-Authored-By: Benjamin Trent <ben.w.trent@gmail.com>
Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-Authored-By: István Zoltán Szabó <istvan.szabo@elastic.co>
@dimitris-athanasiou
Contributor Author

run elasticsearch-ci/1

delvedor added a commit to elastic/elasticsearch-js that referenced this pull request Nov 22, 2019
@dimitris-athanasiou dimitris-athanasiou merged commit 0390ec3 into elastic:master Nov 22, 2019
@dimitris-athanasiou dimitris-athanasiou deleted the df-analytics-info-api branch November 22, 2019 18:08
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Nov 22, 2019
Backport of elastic#49455
dimitris-athanasiou added a commit that referenced this pull request Nov 22, 2019
Backport of #49455
Contributor

@przemekwitek przemekwitek left a comment

LGTM
