Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APM] Migrate tx latency chart and group stats to rollups/service metrics #153162

Merged

Conversation

dgieselaar
Copy link
Member

Migrates the transaction latency chart and the transaction group statistics table to use service overview metrics/rollups when appropriate.

Part of #152693 and #146683 (the work is split up to prevent a huge PR).

@dgieselaar dgieselaar added release_note:enhancement Team:APM All issues that need APM UI Team support v8.8.0 labels Mar 12, 2023
@dgieselaar dgieselaar requested review from a team as code owners March 12, 2023 15:30
@elasticmachine
Copy link
Contributor

Pinging @elastic/apm-ui (Team:APM)

@@ -15,6 +15,7 @@ export const isAlertDetailsEnabledPerApp = (
alert: TopAlert | null,
config: ConfigSchema | null
): boolean => {
console.log(alert, config);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only file change that we own, so after removing it there is no need for review from AO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! missed this

Comment on lines +21 to +23
export type ApmTransactionDocumentType =
| ApmDocumentType.TransactionMetric
| ApmDocumentType.TransactionEvent;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about service transaction metric?

Suggested change
export type ApmTransactionDocumentType =
| ApmDocumentType.TransactionMetric
| ApmDocumentType.TransactionEvent;
export type ApmTransactionDocumentType =
| ApmDocumentType.ServiceTransactionMetric
| ApmDocumentType.TransactionMetric
| ApmDocumentType.TransactionEvent;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case we add ApmDocumentType.ServiceTransactionMetric to this one, it becomes in ApmServiceTransactionDocumentType

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it's about making sure that the metric contains certain data. ie, service-instance specific data (like host) will not be available on transaction metrics.

import {
LatencyAggregationType,
latencyAggregationTypeRt,
} from '../../../common/latency_aggregation_types';
import { getApmEventClient } from '../../lib/helpers/get_apm_event_client';
import { getSearchTransactionsEvents } from '../../lib/helpers/transactions';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR but afaict getSearchTransactionsEvents should be renamed getSearchAggregatedTransactions. It was a little confusing debugging this when that function is named the exact opposite of what it actually does.

Suggested change
import { getSearchTransactionsEvents } from '../../lib/helpers/transactions';
import { getSearchAggregatedTransactions } from '../../lib/helpers/transactions';

Btw. I don't think we should use the term "aggregated transactions" anymore. I'd prefer to use the terms "transaction metrics" and "service transaction metrics" instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll eventually remove it anyway.

@@ -65,6 +69,9 @@ export function LatencyChart({ height, kuery }: Props) {
useTransactionLatencyChartsFetcher({
kuery,
environment,
transactionName:
'transactionName' in query ? query.transactionName : null,
latencyAggregationType: getLatencyAggregationType(latencyAggregationType),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use io-ts to validate that latencyAggregationType is indeed of type LatencyAggregationType?

Similar to what we do for sortDirection:

sortDirection: t.union([t.literal('asc'), t.literal('desc')]),

Suggested change
latencyAggregationType: getLatencyAggregationType(latencyAggregationType),
latencyAggregationType,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getLatencyAggregationType also returns a default value if it's undefined.

serviceName={serviceName}
start={historicalRange.start}
end={historicalRange.end}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this fix related to this this PR and the move to service transaction metrics?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a cosmetic change, it doesn't change the behaviour of the component.


const response = await apmEventClient.search(
'get_service_transaction_groups',
{
apm: {
events: [
getProcessorEventForTransactions(searchAggregatedTransactions),
sources: [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should eventually replace events with sources everywhere - even for the queries that don't support rollups or service metrics. It adds complexity to have to separate ways to define this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed! I'm adding some additional support in a follow-up PR.

@@ -99,9 +101,6 @@ export async function getServiceTransactionGroups({
],
},
},
...getDocumentTypeFilterForTransactions(
searchAggregatedTransactions
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed because you apply the filter in APMEventClient, right? We should aim to do this consistently everywhere. I'm sure someone will use events and then forget to add the filter explicitly because they expect it to be added automatically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

start,
end,
kuery,
numBuckets: 20,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaict numBuckets is used to calculate bucketSizeInSeconds - but this is not needed for the table. Can it be omitted?

Suggested change
numBuckets: 20,
numBuckets: 20,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transaction group table has spark plots.

Copy link
Member

@sorenlouv sorenlouv Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is for the main statistics (which doesn't include spark plots afaik)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point. I think it should still be there though, 1 bucket might lead to a bigger error boundary (e.g. selecting the 1 hour bucket for 1.5 hours will lead to skewed statistics).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. It still seems odd to specify number of buckets when the expected return value is a single metric number (as opposed to a timeseries) but no biggie.

return (
<TimeRangeMetadataContextProvider
uiSettings={uiSettings}
useSpanName={isOperationView}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something still feels odd about this option (useSpanName). I don't understand why we have to special-case service destination metrics like this - isn't it just another source similar to transaction metrics and transaction events?

I think the response of GET /internal/apm/time_range_metadata shows the inconsistency:

{
  "isUsingServiceDestinationMetrics": true,
  "sources": [
    {
      "documentType": "serviceTransactionMetric",
      "rollupInterval": "1m",
      "hasDocs": true
    },
    {
      "documentType": "serviceTransactionMetric",
      "rollupInterval": "10m",
      "hasDocs": true
    },
    {
      "documentType": "serviceTransactionMetric",
      "rollupInterval": "60m",
      "hasDocs": true
    },
    {
      "documentType": "transactionMetric",
      "rollupInterval": "1m",
      "hasDocs": true
    },
    {
      "documentType": "transactionMetric",
      "rollupInterval": "10m",
      "hasDocs": true
    },
    {
      "documentType": "transactionMetric",
      "rollupInterval": "60m",
      "hasDocs": true
    },
    {
      "documentType": "transactionEvent",
      "rollupInterval": "none",
      "hasDocs": true
    }
  ]
}

Shouldn't the service destination metrics be part of sources, like:

{
  "sources": [
    {
      "documentType": "serviceDestinationMetrics",
      "rollupInterval": "none",
      "hasDocs": true
    },
    {
      "documentType": "spanEvents",
      "rollupInterval": "none",
      "hasDocs": true
    },
    // ...
  ]
}

Apart from making things more consistent, we're now able to have views where one component uses serviceDestinationMetrics and another uses spanEvents similar to what we do with transactions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In earlier versions span.name was not available on span metrics. That was a fairly recent change. But we have a couple of views that were added after, and require span.name to be present. If span.name is not available on span metrics, we fall back to span events.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then what about something like this?

{
  "sources": [
    {
      "documentType": "spanMetricsWithName",
      "rollupInterval": "none",
      "hasDocs": false
    },
    {
      "documentType": "spanMetrics",
      "rollupInterval": "none",
      "hasDocs": true
    },
    {
      "documentType": "spanEvents",
      "rollupInterval": "none",
      "hasDocs": true
    },
    // ...
  ]
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've considered that, also in the context of transaction metrics with transaction.duration.summary, but I don't think different document types for schema changes will scale. I would rather return some metadata with the document type, e.g. a version or a meta property. But I will tackle that when I implement support for transaction.duration.summary on tx metrics (which might not include changes to service destination metrics as it makes sense to tackle that separately).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather return some metadata with the document type, e.g. a version or a meta property.

That makes sense. Not a blocker but it would be cleaner with a "unified" model for this. The useSpanName seems like an afterthought.

@sorenlouv
Copy link
Member

sorenlouv commented Mar 21, 2023

GET /internal/apm/time_range_metadata doesn't take serviceName as a param. This means that if any service has service transaction metrics, it is expected that all other services will have it too.
This is probably a correct assumption in most cases but we should double check that we are not causing too much disruption with this change (eg. showing empty states for services without service transaction metrics). We could add a serviceName param to ensure it re-runs for every service which wouldn't have too big of a perf impact.

@@ -18,7 +18,7 @@ export function RouterProvider({
}: {
router: Router<RouteMap>;
history: History;
children: React.ReactNode;
children?: React.ReactNode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intended?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, no need for it to be required.

Comment on lines +21 to +23
export type ApmTransactionDocumentType =
| ApmDocumentType.TransactionMetric
| ApmDocumentType.TransactionEvent;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case we add ApmDocumentType.ServiceTransactionMetric to this one, it becomes in ApmServiceTransactionDocumentType

dgieselaar and others added 2 commits March 22, 2023 17:13
…/index.tsx

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
…ex.tsx

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
@dgieselaar
Copy link
Member Author

@sqren:

GET /internal/apm/time_range_metadata doesn't take serviceName as a param. This means that if any service has service transaction metrics, it is expected that all other services will have it too.
This is probably a correct assumption in most cases but we should double check that we are not causing too much disruption with this change (eg. showing empty states for services without service transaction metrics). We could add a serviceName param to ensure it re-runs for every service which wouldn't have too big of a perf impact.

This has always been the behaviour, it's not changing.

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 3.8MB 3.8MB +1.9KB
Unknown metric groups

ESLint disabled line counts

id before after diff
securitySolution 433 436 +3

Total ESLint disabled count

id before after diff
securitySolution 513 516 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@dgieselaar dgieselaar merged commit 97bca50 into elastic:main Mar 23, 2023
@dgieselaar dgieselaar deleted the service-overview-rollups-latency-group-stats branch March 23, 2023 10:30
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Mar 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:enhancement Team:APM All issues that need APM UI Team support v8.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants