Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APM] Offer users upgrade to multi-metric job #119980

Merged
merged 16 commits into from
Dec 7, 2021

Conversation

dgieselaar
Copy link
Member

@dgieselaar dgieselaar commented Nov 30, 2021

Closes #112502

CleanShot 2021-12-01 at 10 41 07@2x

CleanShot 2021-12-01 at 10 41 21@2x

CleanShot 2021-12-01 at 10 42 25@2x

CleanShot 2021-12-01 at 10 43 25@2x

CleanShot 2021-12-01 at 10 43 59@2x

@dgieselaar dgieselaar added release_note:enhancement Team:APM All issues that need APM UI Team support v8.0.0 auto-backport Deprecated - use backport:version if exact versions are needed v8.1.0 labels Nov 30, 2021
@dgieselaar dgieselaar marked this pull request as ready for review December 1, 2021 09:45
@dgieselaar dgieselaar requested review from a team as code owners December 1, 2021 09:45
@elasticmachine
Copy link
Contributor

Pinging @elastic/apm-ui (Team:apm)

@dgieselaar
Copy link
Member Author

@formgeist I did not implement the popover, partly due to time constraints but also because I'm wondering if it adds any value. It looks like it's mostly just an extended version of whatever is in the tooltip, and the user still has to navigate to the settings UI to be able to execute the upgrade. Or am I missing something here?

@formgeist
Copy link
Contributor

@formgeist I did not implement the popover, partly due to time constraints but also because I'm wondering if it adds any value. It looks like it's mostly just an extended version of whatever is in the tooltip, and the user still has to navigate to the settings UI to be able to execute the upgrade. Or am I missing something here?

I'm sorry, but I'm not following which popover you're referring to?

@dgieselaar
Copy link
Member Author

@formgeist ah, I thought this was a popover, I was fooled by the highlighting 😄 :
image

I'll have a look at that one.

@formgeist
Copy link
Contributor

@formgeist ah, I thought this was a popover, I was fooled by the highlighting 😄
I'll have a look at that one.

My spotlight screenshot feature is pretty convincing 😆

Comment on lines +7 to +11
// eslint-disable-next-line @kbn/eslint/no-restricted-paths
import { FETCH_STATUS } from '../../public/hooks/use_fetcher';
// eslint-disable-next-line @kbn/eslint/no-restricted-paths
import type { APIReturnType } from '../../public/services/rest/createCallApmApi';
import { ENVIRONMENT_ALL } from '../environment_filter_values';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this file in /common if it's only used by on the client? Looks like those eslint warnings would go away if this was moved to /public

datafeedState === DATAFEED_STATE.STARTED ||
datafeedState === DATAFEED_STATE.STARTING;

const isClosed =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
const isClosed =
const jobIsClosed =

)}
</EuiBadge>
);
} else if (!isClosed) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: flat if's are a bit easier to parse when there are many of them

Suggested change
} else if (!isClosed) {
}
if (!isClosed) {

Comment on lines 42 to 44
.catch(() => {
setLoading(false);
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we show a warning toast if the job creation fails? And shouldn't setLoading(false); be called regardless if it failed or not?

Suggested change
.catch(() => {
setLoading(false);
});
.catch(() => {
core.notification.toasts.addWarning({...})
})
.finally(() => {
setLoading(false);
})

Comment on lines +45 to +47
if (!isAuthorized) {
return;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this fix #118711 ?

Copy link
Member

@sorenlouv sorenlouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

We've had some problems with duplicate jobs for the same environment being created. Is that something you think will be solved with this PR?

What happens if you create two jobs for the same environment in two different spaces? Is that supported? Should it be?

Comment on lines 34 to 36
if (!isActivePlatinumLicense(context.licensing.license)) {
throw Boom.forbidden(ML_ERRORS.INVALID_LICENSE);
}
Copy link
Member

@sorenlouv sorenlouv Dec 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we still need to check the license or if the check can be removed (I assume ML does this already)

const [jobs, allJobStats, allDatafeedStats] = await Promise.all([
anomalyDetectors
.jobs(APM_ML_JOB_GROUP)
.then((response) => response.jobs),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't 404's be caught here as well?

Suggested change
.then((response) => response.jobs),
.then((response) => response.jobs)
.catch(catch404),

Comment on lines +64 to 66
} catch (e) {
return catch404(e) as ApmMlJob[];
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add the catch above I think this can be removed

Suggested change
} catch (e) {
return catch404(e) as ApmMlJob[];
}
} catch (e) {
return catch404(e) as ApmMlJob[];
}

Comment on lines +38 to +42
if (!setup.ml) {
throw Boom.forbidden(ML_ERRORS.ML_NOT_AVAILABLE);
}

const jobs = await getMlJobsWithAPMGroup(setup.ml?.anomalyDetectors);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a helper getAnomalyDetectionJobs. All it does is calling through to getMlJobsWithAPMGroup if setup.ml is defined, else it'll throw like here.

Question: should we use the helper here? If not, perhaps the helper is not so useful and we can delete it.

Copy link
Contributor

@formgeist formgeist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I suggested a couple of changes and I stumbled on the tooltip on the header link where the tooltip gets stuck on the next page.

CleanShot 2021-12-03 at 11 37 39

'xpack.apm.settings.anomalyDetection.jobList.warningStatusLabel',
{
defaultMessage:
'Job might not be running correctly. Go to the job management page in the ML app to get more information.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'Job might not be running correctly. Go to the job management page in the ML app to get more information.',
'Job might be experiencing problems. Click the Manage Jobs link to learn more.',

Just felt the copy could use an update

{i18n.translate(
'xpack.apm.settings.anomalyDetection.jobList.manageMlJobsButtonText',
{
defaultMessage: 'Manage ML Jobs',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defaultMessage: 'Manage ML Jobs',
defaultMessage: 'Manage jobs',

Just want to get rid of the ML label

Comment on lines +180 to +191
<EuiSwitch
checked={showLegacyJobs}
onChange={(e) => {
setShowLegacyJobs(e.target.checked);
}}
label={i18n.translate(
'xpack.apm.settings.anomalyDetection.jobList.showLegacyJobsCheckboxText',
{
defaultMessage: 'Show legacy jobs',
}
)}
/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the switch toggled on by default to show legacy jobs. I found that when I see the update callout and go to the page, I don't see my legacy jobs that needs to be updated.

CleanShot 2021-12-03 at 14 15 20@2x

If we instead toggle it on by default, they'll show up as expected. The user can choose to toggle it off once they're done updating and make the table more manageable.

CleanShot 2021-12-03 at 14 15 26@2x

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It now shows legacy jobs if there are no other jobs. That seems like a reasonable compromise, without having to implement some setting mechanism backed by localStorage.

'xpack.apm.settings.anomalyDetection.jobsList.updateAvailableDescription',
{
defaultMessage:
'We have updated the anomaly detection jobs that provide insights into degraded performance and added detectors for throughput and failed transaction rate. If you choose to upgrade, we will create the new jobs and close the existing legacy jobs. The data shown in the APM app will automatically switch to the new.',
Copy link
Member

@sorenlouv sorenlouv Dec 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super clear why the user should upgrade, or what the consequences of not upgrading are.
Perhaps we should mention transaction metrics directly:

We have detected that you are using transaction metrics (preaggregated transactions), which are not supported by your current ML jobs. Please upgrade to ensure your ML jobs work correctly. If you choose to upgrade, we will create the new jobs and close the existing legacy jobs. Your data from the legacy jobs will still be available to view in the ML app.

(wording can be improved)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, I think the upgrade message clearly describes why they should upgrade and what happens after. We intentionally made the message more about the benefits of upgrading rather than trying to explain the technical aspects. I think your proposal will only make it more confusing to understand the why.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's better to keep the implementation details away from the end user. Still, when reading the current message it feels like an optional upgrade, and not something we strongly urge users to do. ML team said that even if all transactions are sampled, they still wouldn't recommend showing transaction-based ML results on top of metric based data (which is what will happen until people upgrade afaik).

Either way, this will be easy to change down the road so we can release it as-is and adapt if we get feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ML team said that even if all transactions are sampled, they still wouldn't recommend showing transaction-based ML results on top of metric based data (which is what will happen until people upgrade afaik).

That information wasn't available at the time when we made the choice to design the migration this way. If that's the case, I'd recommend us looking into a more strict migration in the next iteration. That could still just involve us changing the messaging and displays to say that we're not going to show the legacy job results in the app but only rely on the new jobs results. Do you think it's worth opening a new issue to track this?

Copy link
Member

@sorenlouv sorenlouv Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back to see where ML said that we shouldn't mix metric events and transaction anomalies and found this from @sophiec20 :

think we would be safer with transaction charts showing transaction anomalies, and metric powered charts showing metric anomalies. The anomalies would probably be OK, but the expected bounds might be very different.

It's then followed up with an edit (that I had missed) saying:

expected bounds are for mean transaction duration, therefore it should be ok to plot providing sampling is representative

The way I read this we are okay to mix as long as end users sampling is representative.

Do you think it's worth opening a new issue to track this

No, let's marked this as resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgieselaar quick question - is the ML job version tracked in our current telemetry? Just wondering if we can employ telemetry to figure out if our migration rate is too slow and then try another strategy to get users to convert. I also think once we will display expected bounds for all metrics, it'll make more sense to get jobs upgraded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not, we would have to add it, which would require us to migrate to telemetry v2 first

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so we should create a new issue to implement it even if we do it in another iteration. I think the telemetry would be nice to have 👍

@dgieselaar
Copy link
Member Author

Created a follow-up ticket to address remaining feedback: #120503

* 2.0; you may not use this file except in compliance with the Elastic License
* 2.0.
*/
import { DATAFEED_STATE, JOB_STATE } from '../../../ml/common/constants/states';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the changed export I guess this can be imported from '../../../ml/common' now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cheers, updated!

Copy link
Contributor

@walterra walterra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ML changes LGTM

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
apm 1170 1172 +2

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
ml 279 281 +2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 2.7MB 2.8MB +5.6KB
ml 3.5MB 3.5MB -1.5KB
total +4.1KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
apm 45 46 +1

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
ml 37.7KB 38.3KB +621.0B
Unknown metric groups

API count

id before after diff
ml 283 285 +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@dgieselaar dgieselaar merged commit 0ed5cb9 into elastic:main Dec 7, 2021
@dgieselaar dgieselaar deleted the ml-job-upgrade branch December 7, 2021 13:24
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 7, 2021
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
8.0

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Dec 7, 2021
Co-authored-by: Dario Gieselaar <dario.gieselaar@elastic.co>
Heenawter pushed a commit to Heenawter/kibana that referenced this pull request Dec 8, 2021
TinLe pushed a commit to TinLe/kibana that referenced this pull request Dec 22, 2021
@MiriamAparicio MiriamAparicio added the apm:test-plan-done Pull request that was successfully tested during the test plan label Dec 28, 2021
gbamparop pushed a commit to gbamparop/kibana that referenced this pull request Jan 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
apm:test-plan-8.0.0 apm:test-plan-done Pull request that was successfully tested during the test plan auto-backport Deprecated - use backport:version if exact versions are needed release_note:enhancement Team:APM All issues that need APM UI Team support v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[APM] Offer users upgrade to metrics-based ML job
8 participants