
[ML] Parse and report memory usage for DF Analytics #52778

Conversation

dimitris-athanasiou (Contributor):
Adds reporting of memory usage for data frame analytics jobs.
This commit introduces a new index pattern `.ml-stats-*` whose
first concrete index will be `.ml-stats-000001`. This index serves
to store instrumentation information for those jobs.

    createIndexResponse -> listener.onResponse(true),
    error -> {
        if (ExceptionsHelper.unwrapCause(error) instanceof ResourceAlreadyExistsException) {
            listener.onResponse(true);
Contributor:
There's an inconsistency here: if the index and alias already existed when the method was called, the response is false; but if the method is called twice concurrently, so that one call creates the index and the other gets a ResourceAlreadyExistsException, then both return true.

If the boolean is intended to mean "did this call create the index and alias", then this case should return false.

But it's not actually documented what the returned boolean is supposed to mean. Documenting that would be good too.
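A minimal in-memory sketch of the suggested contract, with hypothetical names standing in for the real Elasticsearch code: the boolean means "did this call create the index", so both a pre-existing index and a lost creation race return false.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for the real index-creation path. The returned boolean is
// documented as "did THIS call create the index": a pre-existing index and
// a concurrent loser (the in-memory analogue of catching
// ResourceAlreadyExistsException) both yield false.
public class CreateIfNecessarySketch {
    private final Set<String> indices = ConcurrentHashMap.newKeySet();

    /** Returns true only if this call created the index. */
    public boolean createIndexIfNecessary(String indexName) {
        if (indices.contains(indexName)) {
            return false; // already existed before the call
        }
        // Set.add returns false if another thread won the race.
        return indices.add(indexName);
    }
}
```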

    CreateIndexRequest createIndexRequest = client.admin()
        .indices()
        .prepareCreate(TEMPLATE_NAME + "-000001")
        .addAlias(new Alias(writeAlias()).writeIndex(true))
Contributor:
This doesn't cover the edge case where the index exists but the write alias doesn't (presumably because a user accidentally deleted it, but maybe also due to a bug in ILM).

I think this method should cover that case like AnomalyDetectorsIndex.createStateIndexAndAliasIfNecessary() does. It will avoid support cases if the system can be self-healing in this situation. At present it will return true, giving the impression that everything is good, when the post-conditions are that the index exists but not the alias.

Contributor (Author):
Good point. I thought the main reason we had that code for anomaly detection was because originally we were not using aliases. But I can see how being able to self-heal would help.
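An illustrative in-memory sketch (hypothetical names, not the actual AnomalyDetectorsIndex code) of the self-healing behaviour being discussed: create whichever of the index and the write alias is missing, so an accidentally deleted alias is restored rather than silently ignored.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy cluster model: index name -> set of aliases pointing at it.
public class SelfHealingSketch {
    final Map<String, Set<String>> cluster = new HashMap<>();

    /** Simulates an index that exists without its write alias. */
    public void createIndexOnly(String index) {
        cluster.computeIfAbsent(index, k -> new HashSet<>());
    }

    /** Creates the index and/or the alias, whichever is missing. */
    public void createIndexAndAliasIfNecessary(String index, String alias) {
        Set<String> aliases = cluster.computeIfAbsent(index, k -> new HashSet<>());
        // If the index existed but the alias was deleted, this re-adds it.
        aliases.add(alias);
    }
}
```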

    parser.declareString((bucket, s) -> {}, TYPE);
    parser.declareString(ConstructingObjectParser.constructorArg(), JOB_ID);
    parser.declareField(ConstructingObjectParser.constructorArg(),
        p -> TimeUtils.parseTimeFieldToInstant(p, TIMESTAMP.getPreferredName()),
Contributor:
I don't think it's good that we're propagating into new code the old Prelert time parsing behaviour, which is completely non-standard in the Elastic stack:

            if (date.trim().length() <= 10) { // seconds
                return epoch * 1000;
            } else {
                return epoch;
            }

It would have been best if we'd removed this years ago.

Maybe now is a good opportunity to rename `TimeUtils.parseTimeField()` to `TimeUtils.parseTimeFieldDeprecated()` and `TimeUtils.parseTimeFieldToInstant()` to `TimeUtils.parseTimeFieldToInstantDeprecated()`, annotate both with `@Deprecated`, and introduce a new method `TimeUtils.parseTimeFieldToInstant()` that can be used here, replacing `return Instant.ofEpochMilli(dateStringToEpoch(parser.text()));` with `return Instant.from(DateFieldMapper.DEFAULT_DATE_TIME_FORMATTER.parse(parser.text()));`.
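A self-contained sketch of the proposed split, using the method names from the comment. The java.time ISO parser here stands in for `DateFieldMapper.DEFAULT_DATE_TIME_FORMATTER` (which accepts `strict_date_optional_time` or epoch milliseconds); the deprecated method reproduces the Prelert "10 digits means seconds" heuristic so the behavioural difference is visible.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

public class TimeParsingSketch {

    /** Legacy Prelert behaviour: <= 10 characters means seconds, otherwise millis. */
    @Deprecated
    public static long dateStringToEpochDeprecated(String date) {
        long epoch = Long.parseLong(date.trim());
        return date.trim().length() <= 10 ? epoch * 1000 : epoch;
    }

    /** Stack-standard parsing: epoch milliseconds or ISO-8601 date-time. */
    public static Instant parseTimeFieldToInstant(String text) {
        try {
            return Instant.ofEpochMilli(Long.parseLong(text)); // epoch_millis
        } catch (NumberFormatException e) {
            TemporalAccessor ta = DateTimeFormatter.ISO_INSTANT.parse(text);
            return Instant.from(ta);
        }
    }
}
```

Note that the two methods disagree on a 10-digit number: the legacy method multiplies it by 1000, the standard one treats it as milliseconds, which is exactly why a deprecation boundary between them is useful.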

        builder.field(TYPE.getPreferredName(), TYPE_VALUE);
        builder.field(JOB_ID.getPreferredName(), jobId);
    }
    builder.timeField(TIMESTAMP.getPreferredName(), TIMESTAMP.getPreferredName() + "_string", timestamp.toEpochMilli());
Contributor:
It would be good to add a comment explaining that we round to millisecond accuracy because the XContent representation does, and debugging is harder when the internal accuracy is greater than what is persisted.
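A small illustration of the point, using plain java.time (hypothetical helper name): `Instant` carries nanosecond precision, but `toEpochMilli()` (what the timestamp field persists) drops everything below the millisecond, so keeping sub-millisecond precision in memory means stored and in-memory values diverge.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class MillisRoundingSketch {
    // Truncate to the precision that actually survives serialization,
    // so the in-memory value equals the round-tripped value.
    public static Instant truncateToStoredPrecision(Instant t) {
        return t.truncatedTo(ChronoUnit.MILLIS);
    }
}
```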

@benwtrent self-requested a review on February 26, 2020 at 19:50.
    ConstructingObjectParser<MemoryUsage, Void> parser = new ConstructingObjectParser<>(TYPE_VALUE,
        ignoreUnknownFields, a -> new MemoryUsage((String) a[0], (Instant) a[1], (long) a[2]));

    parser.declareString((bucket, s) -> {}, TYPE);
Member:
Why is this necessary? Presumably, the only time it is parsing TYPE is when it is reading from the index. In that case, it should ignore unknown fields.

Contributor:

I don't think it's that bad to include it, even though it's technically redundant. It partly serves as documentation: somebody reading the parser definition to find out which fields the document contains in the current product version can see that we expect this field to exist.

Contributor (Author):
Good point!


    @Override
    protected ToXContent.Params getToXContentParams() {
        return new ToXContent.MapParams(Collections.singletonMap(ToXContentParams.FOR_INTERNAL_STORAGE, "true"));
Member:

Suggested change:

    - return new ToXContent.MapParams(Collections.singletonMap(ToXContentParams.FOR_INTERNAL_STORAGE, "true"));
    + return new ToXContent.MapParams(Collections.singletonMap(ToXContentParams.FOR_INTERNAL_STORAGE, Boolean.toString(lenient)));

I think this will work, so that the empty parsing declaration for TYPE can go away.

Contributor (Author):
The problem is that we also only write out `job_id` for internal storage, so then we can't test for equality. Given that, I think I'd rather keep parsing TYPE.

));
AtomicInteger counter = new AtomicInteger(stoppedTasksIds.size());
AtomicArray<Stats> jobStats = new AtomicArray<>(stoppedTasksIds.size());
for (int i = 0; i < stoppedTasksIds.size(); i++) {
Member:
I wonder if we will hit scaling issues if there are 100s of stopped tasks.

Seems like we could be making 100s of unbatched search requests.

Contributor (Author):

This is why I changed the code to make a multi-search per job. If we batch everything up, then we have jobs * stats_fields searches in a single multi-search, and that has its own problems. We do it this way for anomaly detection jobs too. Not sure there's a better way.
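A toy illustration of the trade-off (hypothetical names, strings standing in for real search requests): batching per job caps each multi-search at the number of stats fields, instead of issuing one giant multi-search of jobs * stats_fields searches.

```java
import java.util.ArrayList;
import java.util.List;

public class PerJobBatchingSketch {
    /**
     * Returns one batch per job, each containing one entry per stats field.
     * Each inner list stands in for a MultiSearchRequest; each string for
     * one SearchRequest within it.
     */
    public static List<List<String>> perJobBatches(List<String> jobIds, List<String> statsFields) {
        List<List<String>> batches = new ArrayList<>();
        for (String jobId : jobIds) {
            List<String> batch = new ArrayList<>();
            for (String field : statsFields) {
                batch.add(jobId + "/" + field);
            }
            batches.add(batch);
        }
        return batches;
    }
}
```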

import java.io.InputStream;
import java.util.function.BiFunction;

public final class MlParserUtils {
Member:
❤️

@droberts195 (Contributor) left a comment:
LGTM

@dimitris-athanasiou dimitris-athanasiou merged commit dd33193 into elastic:master Feb 28, 2020
@dimitris-athanasiou dimitris-athanasiou deleted the df-analytics-memory-usage branch February 28, 2020 15:35
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Feb 28, 2020
dimitris-athanasiou added a commit that referenced this pull request Feb 29, 2020
Backport of #52778 and #52958
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Feb 29, 2020