-
Notifications
You must be signed in to change notification settings - Fork 8.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[APM] Offer users upgrade to multi-metric job #119980
Conversation
Pinging @elastic/apm-ui (Team:apm) |
@formgeist I did not implement the popover, partly due to time constraints but also because I'm wondering if it adds any value. It looks like it's mostly just an extended version of whatever is in the tooltip, and the user still has to navigate to the settings UI to be able to execute the upgrade. Or am I missing something here? |
8025301
to
daeaef5
Compare
I'm sorry, but I'm not following which popover you're referring to? |
@formgeist ah, I thought this was a popover, I was fooled by the highlighting 😄 : I'll have a look at that one. |
My spotlight screenshot feature is pretty convincing 😆 |
// eslint-disable-next-line @kbn/eslint/no-restricted-paths | ||
import { FETCH_STATUS } from '../../public/hooks/use_fetcher'; | ||
// eslint-disable-next-line @kbn/eslint/no-restricted-paths | ||
import type { APIReturnType } from '../../public/services/rest/createCallApmApi'; | ||
import { ENVIRONMENT_ALL } from '../environment_filter_values'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this file in /common
if it's only used by on the client? Looks like those eslint warnings would go away if this was moved to /public
datafeedState === DATAFEED_STATE.STARTED || | ||
datafeedState === DATAFEED_STATE.STARTING; | ||
|
||
const isClosed = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
const isClosed = | |
const jobIsClosed = |
)} | ||
</EuiBadge> | ||
); | ||
} else if (!isClosed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: flat if
's are a bit easier to parse when there are many of them
} else if (!isClosed) { | |
} | |
if (!isClosed) { |
.catch(() => { | ||
setLoading(false); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we show a warning toast if the job creation fails? And shouldn't setLoading(false);
be called regardless if it failed or not?
.catch(() => { | |
setLoading(false); | |
}); | |
.catch(() => { | |
core.notification.toasts.addWarning({...}) | |
}) | |
.finally(() => { | |
setLoading(false); | |
}) |
if (!isAuthorized) { | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this fix #118711 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
We've had some problems with duplicate jobs for the same environment being created. Is that something you think will be solved with this PR?
What happens if you create two jobs for the same environment in two different spaces? Is that supported? Should it be?
if (!isActivePlatinumLicense(context.licensing.license)) { | ||
throw Boom.forbidden(ML_ERRORS.INVALID_LICENSE); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we still need to check the license or if the check can be removed (I assume ML does this already)
const [jobs, allJobStats, allDatafeedStats] = await Promise.all([ | ||
anomalyDetectors | ||
.jobs(APM_ML_JOB_GROUP) | ||
.then((response) => response.jobs), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't 404's be caught here as well?
.then((response) => response.jobs), | |
.then((response) => response.jobs) | |
.catch(catch404), |
} catch (e) { | ||
return catch404(e) as ApmMlJob[]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you add the catch
above I think this can be removed
} catch (e) { | |
return catch404(e) as ApmMlJob[]; | |
} | |
} catch (e) { | |
return catch404(e) as ApmMlJob[]; | |
} |
if (!setup.ml) { | ||
throw Boom.forbidden(ML_ERRORS.ML_NOT_AVAILABLE); | ||
} | ||
|
||
const jobs = await getMlJobsWithAPMGroup(setup.ml?.anomalyDetectors); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a helper getAnomalyDetectionJobs
. All it does is calling through to getMlJobsWithAPMGroup
if setup.ml
is defined, else it'll throw like here.
Question: should we use the helper here? If not, perhaps the helper is not so useful and we can delete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'xpack.apm.settings.anomalyDetection.jobList.warningStatusLabel', | ||
{ | ||
defaultMessage: | ||
'Job might not be running correctly. Go to the job management page in the ML app to get more information.', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Job might not be running correctly. Go to the job management page in the ML app to get more information.', | |
'Job might be experiencing problems. Click the Manage Jobs link to learn more.', |
Just felt the copy could use an update
{i18n.translate( | ||
'xpack.apm.settings.anomalyDetection.jobList.manageMlJobsButtonText', | ||
{ | ||
defaultMessage: 'Manage ML Jobs', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defaultMessage: 'Manage ML Jobs', | |
defaultMessage: 'Manage jobs', |
Just want to get rid of the ML
label
<EuiSwitch | ||
checked={showLegacyJobs} | ||
onChange={(e) => { | ||
setShowLegacyJobs(e.target.checked); | ||
}} | ||
label={i18n.translate( | ||
'xpack.apm.settings.anomalyDetection.jobList.showLegacyJobsCheckboxText', | ||
{ | ||
defaultMessage: 'Show legacy jobs', | ||
} | ||
)} | ||
/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make the switch toggled on by default to show legacy jobs. I found that when I see the update callout and go to the page, I don't see my legacy jobs that needs to be updated.
If we instead toggle it on by default, they'll show up as expected. The user can choose to toggle it off once they're done updating and make the table more manageable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It now shows legacy jobs if there are no other jobs. That seems like a reasonable compromise, without having to implement some setting mechanism backed by localStorage.
'xpack.apm.settings.anomalyDetection.jobsList.updateAvailableDescription', | ||
{ | ||
defaultMessage: | ||
'We have updated the anomaly detection jobs that provide insights into degraded performance and added detectors for throughput and failed transaction rate. If you choose to upgrade, we will create the new jobs and close the existing legacy jobs. The data shown in the APM app will automatically switch to the new.', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not super clear why the user should upgrade, or what the consequences of not upgrading are.
Perhaps we should mention transaction metrics directly:
We have detected that you are using transaction metrics (preaggregated transactions), which are not supported by your current ML jobs. Please upgrade to ensure your ML jobs work correctly. If you choose to upgrade, we will create the new jobs and close the existing legacy jobs. Your data from the legacy jobs will still be available to view in the ML app.
(wording can be improved)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, I think the upgrade message clearly describes why they should upgrade and what happens after. We intentionally made the message more about the benefits of upgrading rather than trying to explain the technical aspects. I think your proposal will only make it more confusing to understand the why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's better to keep the implementation details away from the end user. Still, when reading the current message it feels like an optional upgrade, and not something we strongly urge users to do. ML team said that even if all transactions are sampled, they still wouldn't recommend showing transaction-based ML results on top of metric based data (which is what will happen until people upgrade afaik).
Either way, this will be easy to change down the road so we can release it as-is and adapt if we get feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ML team said that even if all transactions are sampled, they still wouldn't recommend showing transaction-based ML results on top of metric based data (which is what will happen until people upgrade afaik).
That information wasn't available at the time when we made the choice to design the migration this way. If that's the case, I'd recommend us looking into a more strict migration in the next iteration. That could still just involve us changing the messaging and displays to say that we're not going to show the legacy job results in the app but only rely on the new jobs results. Do you think it's worth opening a new issue to track this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back to see where ML said that we shouldn't mix metric events and transaction anomalies and found this from @sophiec20 :
think we would be safer with transaction charts showing transaction anomalies, and metric powered charts showing metric anomalies. The anomalies would probably be OK, but the expected bounds might be very different.
It's then followed up with an edit (that I had missed) saying:
expected bounds are for mean transaction duration, therefore it should be ok to plot providing sampling is representative
The way I read this we are okay to mix as long as end users sampling is representative.
Do you think it's worth opening a new issue to track this
No, let's marked this as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dgieselaar quick question - is the ML job version tracked in our current telemetry? Just wondering if we can employ telemetry to figure out if our migration rate is too slow and then try another strategy to get users to convert. I also think once we will display expected bounds for all metrics, it'll make more sense to get jobs upgraded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not, we would have to add it, which would require us to migrate to telemetry v2 first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so we should create a new issue to implement it even if we do it in another iteration. I think the telemetry would be nice to have 👍
Created a follow-up ticket to address remaining feedback: #120503 |
* 2.0; you may not use this file except in compliance with the Elastic License | ||
* 2.0. | ||
*/ | ||
import { DATAFEED_STATE, JOB_STATE } from '../../../ml/common/constants/states'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changed export I guess this can be imported from '../../../ml/common'
now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cheers, updated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ML changes LGTM
💚 Build Succeeded
Metrics [docs]Module Count
Public APIs missing comments
Async chunks
Public APIs missing exports
Page load bundle
History
To update your PR or re-run it, just comment with: |
💚 Backport successful
This backport PR will be merged automatically after passing CI. |
Closes #112502