[APM] Service map request fails when searching outside of the boost window #179229

Closed
ablnk opened this issue Mar 22, 2024 · 33 comments
Labels: apm:serverless, bug, needs-refinement, Team:obs-ux-infra_services

Comments

@ablnk

ablnk commented Mar 22, 2024

Version:
Serverless project v 8.14.0

Description:
GET internal/apm/service-map request fails with status code 502 and returns backend closed connection when filtering data outside of the boost window.

Preconditions:
I reproduced the issue with ~154 dependencies and 1011 services.

Steps to reproduce:

  1. Go to Applications - Services.
  2. Navigate to Service Map tab and filter data outside of the boost window.

Expected behavior:
Service map data is returned for the selected time interval.

Logs:

Error decoding JSON: json: cannot unmarshal string into Go value of type map[string]interface {}
Error decoding JSON: unexpected EOF

Log messages look somewhat fragmentary:

[screenshot: fragmented log messages]

ablnk added the bug and Team:APM labels on Mar 22, 2024
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:APM)

smith added the Team:obs-ux-infra_services label and removed the Team:APM label on Mar 27, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

smith added the needs-refinement label on Mar 29, 2024
@chrisdistasio

does this always fail outside the boost window? for example, does it consistently fail one day after the boost window or does it fail 12 days after the boost window? does the amount of data/complexity of the service map matter?

@ablnk
Author

ablnk commented Apr 9, 2024

@chrisdistasio with the data I specified in the preconditions, it was always failing when searching outside the boost window. The amount of data matters, especially the number of dependencies.

@chrisdistasio

@smith can we have someone assigned to this to do some additional investigation to determine whether this is directly related to some issue searching outside boost window?

is it possible to quantify the number of services and dependencies that appear to be a threshold for causing the issue?

trying to get a better understanding of the severity as it relates to boost window.

for example, i have trouble getting a response in qa for 15 mins - https://keepserverless-qa-oblt-ba6694.kb.eu-west-1.aws.qa.elastic.cloud/app/apm/service-map?comparisonEnabled=false&environment=ENVIRONMENT_ALL&kuery=&offset=1d&rangeFrom=now-15m&rangeTo=now&serviceGroup=

@smith
Contributor

smith commented Apr 26, 2024

@chrisdistasio this is prioritized in our refining queue on our internal backlog, so the first steps are determining under which conditions this happens.

crespocarlos self-assigned this on Apr 30, 2024
@crespocarlos
Contributor

@ablnk Is it possible to access the environment where the error is happening? I'd like to have a closer look at the logs.

@crespocarlos
Contributor

I'm getting the 502 in QA deployment with ~1800 services and ~170 dependencies, within 15m range, which I believe is still within the 7d boost window.

[screenshot]

The map is somewhat complex (5m range)

[screenshot]

I managed to crash the server when I used a 1d range.

Looks like it's related to the amount of data on serverless. Searching outside the boost window could increase the likelihood of this error happening.

@crespocarlos
Contributor

crespocarlos commented May 2, 2024

The solution for this might be related to https://github.com/elastic/observability-serverless/issues/18 and #181790

crespocarlos removed their assignment on May 8, 2024
neptunian self-assigned this on May 13, 2024
@neptunian
Contributor

neptunian commented May 13, 2024

@chrisdistasio @paulb-elastic Are we aiming for the short-term or the long-term solution here? The short term would be keeping the scripted metric aggregation but making it faster. The tradeoff is the service map's accuracy or completeness, as we could lose some connections and discovered services in the time range, though we already do some sampling of the trace ids we pass in, so this exists to an extent already, I believe.

The long term would be removing the scripted aggregation and moving this processing somewhere else; I'm not sure where.

@paulb-elastic
Contributor

The steer I've been given is to focus on the shorter term solution for now (although happy to hear @chrisdistasio's opinion too).

@neptunian
Contributor

neptunian commented May 15, 2024

Thanks @paulb-elastic. I've been having some conversations in another issue and will get started on optimizing the scripted metric agg and running some tests. Then I'll figure out what the limit number should be, which is the more challenging bit, as Dario mentioned.

@neptunian
Contributor

I had a chat with Dario the week before last and we aligned on a strategy to avoid OOMs.

optimization strategy

  • create a new Kibana APM setting, e.g. xpack.apm.ui.serviceMapMaxEvents, that controls the max number of events or documents we will process to create the service map
  • in the reduce phase of the script aggregation, keep a counter, increment it for every event we process, and terminate early once the limit is reached (see the sketch after this list)
  • in the reduce phase of the script aggregation, sort the docs by trace id before processing them. With early termination we may stop partway through the events, and without sorting we'd be returning many incomplete traces.
  • indicate in the response that we terminated early or that some trace is incomplete (not sure if we need two flags)
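For illustration only, a minimal Painless sketch of that reduce-phase guard, assuming each shard state carries an eventsById map like the real script does; params.serviceMapMaxEvents (the proposed setting) and the terminatedEarly flag are names I made up, and the trace-id sorting described above is omitted:

def eventsById = new HashMap();
def processedCount = 0;
def terminatedEarly = false;

for (state in states) {
  if (state == null || state.eventsById == null) {
    continue;
  }
  for (entry in state.eventsById.entrySet()) {
    if (processedCount >= params.serviceMapMaxEvents) {
      terminatedEarly = true;
      break;
    }
    eventsById.put(entry.getKey(), entry.getValue());
    processedCount++;
  }
  if (terminatedEarly) {
    break;
  }
}

// the real reduce script would now build service paths from eventsById
def response = new HashMap();
response.eventsById = eventsById;
response.terminatedEarly = terminatedEarly;
return response;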

tradeoffs

  • Sorting the events by trace id first means we'd have to deal with one incomplete trace, which we could mark as such and skip. Having to sort, and creating another data structure capable of doing so, like an array (the service map uses hashes), means using more memory that we'll now have to compensate for.
  • we will process fewer traces for some users, so the service map may have paths that don't get picked up. We already limit the number of traces we get events for through the serviceMapMaxTraces config, which defaults to 1000. We also already limit the trace ids in the query through xpack.apm.serviceMapTraceIdGlobalBucketSize and xpack.apm.serviceMapTraceIdBucketSize.

issues to resolve / dependencies

  • what should the default number be for the new setting that limits events to be processed? will need support from ES.

@paulb-elastic
Contributor

Thanks for the update @neptunian

what should the default number be for the new setting that limits events to be processed? will need support from ES

Are you ok proceeding with this, and not blocked?

@neptunian
Contributor

@paulb-elastic Yes, not blocked. I'll be focusing on it this week.

@neptunian
Contributor

neptunian commented Jun 5, 2024

I had a chat with @nik9000 about the strategy above. We came to the conclusion that limiting in the reduce phase likely won't resolve the error. Once a large number of documents has arrived at the reduce phase there are already memory issues, whether or not we limit adding them to the eventsById hash and processing them. With the current allowance of 100,000 documents per shard and an unknown number of shards, the total volume of data being returned can be extremely large. Unfortunately there is no easy way to know how many shards a cluster has. I could use the search shards API and count them, but that could be a very large response. I believe this error I get on serverless is due to the amount of docs:
Screenshot 2024-06-05 at 11 00 55 AM

Instead, I think we should tweak the limits with the levers we already have in place, such as decreasing terminate_after from 100k per shard and allowing fewer max traces (currently 1,000).

One thing I found out after talking to @crespocarlos about his PR #183802 is that the synthtrace data in the serverless environment where we are testing is not realistic. One reason things are so slow and we might be getting OOM errors is that we are actually running the synthtrace scenario designed to trigger OOMs:

Here is a response I get on the edge cluster; some of the docs are from the OOM scenario:

Screenshot 2024-06-05 at 2 23 40 PM

This scenario creates very long traces (not common or realistic) in order to trigger an OOM. Sometimes when I can get it to load I have a neverending long trace. This is the service map below. You cannot see the services because I've zoomed out to show how long this trace can be.

Screenshot 2024-06-05 at 2 19 51 PM

Given that there is a recursive function in the reduce phase, we can end up with an OOM there because of that, or a stack overflow due to the depth of the trace.

I think we should verify the data we are using to test with is realistic and not broken before pulling any levers. It's also very difficult to test otherwise.

@dgieselaar
Member

@neptunian fwiw, you can count the number of documents by executing a request that only counts the documents, but does not execute the aggregation. For the second request, the filter cache will kick in and it will run the aggregation on the collected documents. Unfortunately that is a hard stop, and you cannot partially process traces with that strategy. You could chunk up the documents differently until you get a number of documents that make sense, but it's probably more sensible to nudge the user to tweak these settings if possible (?).
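As a rough sketch of that two-step idea (not the actual implementation), the first request could be a count-only search; the index pattern, filters, and preference value here are placeholders:

POST /traces-*/_search?preference=service-map
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "trace.id": [ "<sampled trace ids>" ] } },
        { "range": { "@timestamp": { "gte": "now-15m", "lte": "now" } } }
      ]
    }
  }
}

If hits.total.value comes back within budget, the follow-up request adds the scripted metric aggregation to the same filter with the same preference value, so it should hit the same shard copies and benefit from caching; otherwise we skip the aggregation and report that the range has too much data.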

Re:

This scenario creates very long traces (not common or realistic) in order to trigger an OOM.

this is the whole point no? It's not about realistic data, it's about making sure that no data at all can trigger an OOM.

@neptunian
Contributor

neptunian commented Jun 5, 2024

but it's probably more sensible to nudge the user to tweak these settings if possible

@dgieselaar Do you mean the settings for serviceMapTerminateAfter, serviceMapMaxTracesPerRequest, serviceMapMaxTraces?

this is the whole point no? It's not about realistic data, it's about making sure that no data at all can trigger an OOM.

I guess I'm assuming that if we were trying to make sure no data could trigger an OOM, we wouldn't have recursive functions in the service map and waterfall that have no safeguards for a trace being that long. Unless that is just an oversight. When I run the service_map_oom scenario I typically get a runtime error in the Kibana logs caused by stack_overflow_error. I thought these scenarios were more geared towards integration testing of edge cases. I'm just not sure we should be optimizing for all of the cases. Even if we add a safeguard for these very long traces it can make the queries much slower than they would otherwise be. I do think it's useful to have it all in there, though, but I think we and those testing on these clusters for performance need to be aware that there's an expectation things could be slower.

@dgieselaar
Member

@dgieselaar Do you mean the settings for serviceMapTerminateAfter, serviceMapMaxTracesPerRequest, serviceMapMaxTraces?

yes (or any other settings we add). I think the fewer settings we have the better, but not sure what is practical.

I guess I'm assuming that if we were trying to make sure no data could trigger an OOM we wouldn't have recursive functions in the service map and waterfall that don't have any safeguards to account for a trace being that long. Unless that is just an oversight.

When you say "service map" and "waterfall" are you talking about browser code? If so, it is something we have not optimized for. This data (spans or descendants that refer to each other as parents) is corrupted, and we can't guard against every imaginable form of data corruption. However, an OOM for an Elasticsearch node has much more devastating consequences (like paging Nik at night).

When I run the service_map_oom scenario I typically get a runtime error in the Kibana logs caused by stack_overflow_error.

Yes, stack_overflow_error is not what we are trying to prevent here. I think that is just a guardrail for the scripted metric agg/painless doing its work.

I thought these scenarios were more geared towards integration testing of edge cases. I'm just not sure we should be optimizing for all of the cases. Even if we add a safeguard for these very long traces it can make the queries much slower than they would otherwise be. I do think it's useful to have it all in there, though, but I think we and those testing on these clusters for performance need to be aware that there's an expectation things could be slower.

If I understand what you are saying correctly, is that we run the risk of making things slower for the common case? I think that is an acceptable trade-off. Ideally we gain some ground elsewhere but that is out of scope. This is just us making sure we don't take down an Elasticsearch node.

@crespocarlos
Contributor

Perhaps the fact that synthtrace is currently generating unrealistic data (with spans having their parent spans as children) makes the effort of optimizing the service map harder than it should be. The fact that Sandy is getting a stack_overflow_error could be one symptom of that.

The non-unique transaction ids could also lead to non-deterministic data because, at any given time, a transaction from a service created by one scenario could become either a parent or a child of a span from a service created by another scenario.

For instance, service-8 can be created by both the other_bucket_group and services_map_oom scenarios. However, I believe none of them is configured to create a service map like the one below:

[screenshot]

In the worst-case scenario, this could lead to a cyclic dependency.

@dgieselaar
Member

dgieselaar commented Jun 6, 2024

Let's separate a few things:

  • The goal of this issue is to prevent OOMs. Specifically, an OOM because too many documents are being processed.
  • Synthtrace is generating data where ids can collide, leading to cyclical references in traces. This should be fixed with [APM] Trace sample performance improvements #183802 (thanks Carlos!).
  • Cyclical references cause a stack_overflow_error which is not an OOM and does not take down an Elasticsearch node. It only fails the request (AFAIK). Not great, but not fatal and not what we're addressing here.
  • The point of the Service Map OOM scenario is to create data that could cause an OOM. Its intent is not to generate data with cyclical dependencies, or realistic data. It simply tries to generate enough documents that are to be processed by the service map scripted metric agg to generate an OOM. You can verify if this causes an OOM by running a small Elasticsearch instance locally (run it with a very small amount of memory). If it doesn't do this, let's fix the scenario.
  • Given that we are trying to prevent an OOM in all scenarios, let's ignore whether data is realistic or corrupt. Even if it is unrealistic (e.g. 500 services deep) or corrupt (e.g. cyclical references) it should not under any circumstance cause an Elasticsearch node to go down.
  • Ideally we can do this without sacrificing performance but it is not the most important thing and I think some performance degradation is expected. Good to measure, but getting a document count is relatively cheap, and we can probably leverage the filter cache in the follow-up request because we're using preference in the search request (let's verify this though, I'm making assumptions here).

@crespocarlos
Contributor

it should not under any circumstance cause an Elasticsearch node to go down

100% agree. Just thinking here that if my PR gets merged and the oblt clusters' data is erased, we might lose the circumstances that are causing the OOMs.

I'd also suggest refactoring the recursive function in the get_service_paths_from_trace_ids query and skipping processed ids to prevent cyclic referencing from causing infinite loops.
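Roughly what that could look like in the Painless script, with made-up names (eventsById, rootEventId); the point is just the processedIds check, which turns a cyclic reference into a skip instead of an infinite loop:

def processedIds = new HashSet();
def stack = new ArrayList();
stack.add(rootEventId);                    // hypothetical starting event id

while (!stack.isEmpty()) {
  def id = stack.remove(stack.size() - 1);
  if (processedIds.contains(id)) {
    continue;                              // already visited: skip instead of looping forever
  }
  processedIds.add(id);
  def event = eventsById.get(id);          // hypothetical map of events keyed by span/transaction id
  if (event == null) {
    continue;
  }
  // ... add the event to the current service path ...
  if (event.parentId != null) {
    stack.add(event.parentId);
  }
}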

@dgieselaar
Member

100% agree. Just thinking here that if my PR gets merged and the oblt clusters' data is erased, we might lose the circumstances that are causing the OOMs.

We shouldn't - if you run this script in isolation (so no chance of collision because of different synthtrace processes) it should successfully generate an OOM. But let's verify.

I'd also suggest refactoring the recursive function in the get_service_paths_from_trace_ids query and skipping processed ids to prevent cyclic referencing from causing infinite loops.

Ah, I don't think it's recursive. It's basically mapping over the services in reverse order (e.g. the deepest service first), and then creating a downstream span that has the upstream span as a child. I think... :)

@dgieselaar
Member

Also we should exclude the Service Map OOM scenario from the oblt clusters 😅

@neptunian
Contributor

The point of the Service Map OOM scenario is to create data that could cause an OOM. Its intent is not to generate data with cyclical dependencies, or realistic data. It simply tries to generate enough documents that are to be processed by the service map scripted metric agg to generate an OOM. You can verify if this causes an OOM by running a small Elasticsearch instance locally (run it with a very small amount of memory). If it doesn't do this, let's fix the scenario

Thanks @dgieselaar. Unfortunately @crespocarlos' changes didn't resolve it, but if I just remove the recursive call (processAndReturnEvent) inside the scripted metric agg I do get the ES OOM locally. I'll fix this scenario so we have a reproducible ES OOM, but this gives me something to work with now.

For reference, this is the kibana stack overflow error:


ES_JAVA_OPTS='-Xms236m -Xmx236m' yarn es snapshot
node scripts/synthtrace.js service_map_oom --from=now-15m --to=now --clean

[2024-06-06T08:01:21.729-04:00][ERROR][plugins.apm] Error: search_phase_execution_exception
	Caused by:
		script_exception: runtime error
    at unwrap_es_response.ts:45:13
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at fetchServicePathsFromTraceIds (fetch_service_paths_from_trace_ids.ts:225:48)
    at getServiceMapFromTraceIds (get_service_map_from_trace_ids.ts:58:48)
    at async Promise.all (index 1)
    at get_service_map.ts:76:30
    at async Promise.all (index 0)
    at get_service_map.ts:124:55
    at wrappedHandler (register_apm_server_routes.ts:112:35)
    at Router.handle (router.ts:268:30)
    at handler (router.ts:196:13)
    at exports.Manager.execute (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at Object.internals.handler (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)
    at exports.execute (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)
    at Request._lifecycle (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/Users/sandy/dev/elastic/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)
Response: {
  error: {
    root_cause: [],
    type: 'search_phase_execution_exception',
    reason: '',
    phase: 'rank-feature',
    grouped: true,
    failed_shards: [],
    caused_by: {
      type: 'script_exception',
      reason: 'runtime error',
      script_stack: [
        'if (context.processedEvents[eventId] != null) {\n                  ',
        '           ^---- HERE'
      ],
      script: ' ...',
      lang: 'painless',
      position: { offset: 490, start: 479, end: 545 },
      caused_by: { type: 'stack_overflow_error', reason: null }
    }
  },
  status: 400
}

@neptunian
Contributor

neptunian commented Jun 12, 2024

fwiw, you can count the number of documents by executing a request that only counts the documents, but does not execute the aggregation. For the second request, the filter cache will kick in and it will run the aggregation on the collected documents. Unfortunately that is a hard stop, and you cannot partially process traces with that strategy. You could chunk up the documents differently until you get a number of documents that make sense, but it's probably more sensible to nudge the user to tweak these settings if possible (?).

@dgieselaar Similar to this idea, since the query returns the shards it ran on, I thought I could try getting the number of shards in one query ("_shards":{"total":1,"successful":1,"skipped":0,"failed":0}) and then dividing the max total docs by that. @nik9000 thought this could be a viable approach. After looking at some debug logs I noticed that the reduce phase seemed to receive more "shards" (from states) than what the query responds with. After talking with @nik9000, he looked into it and found that in modern versions of Elasticsearch aggregation collection is parallelized: aggregations get broken up across many threads to improve performance. So for example I have 1 shard but I get 5 datasets, all returning many docs:
[ 52988, 53249, 53291, 146440, 53432 ]. @nik9000 opened a PR to turn this off for the scripted metric agg: elastic/elasticsearch#109597.

I was able to disable parallelization on Elasticsearch using an undocumented internal setting and verified that the states.count was the same as the shard count. Now I get a single dataset, but the amount of docs can be quite high. So instead of [ 52988, 53249, 53291, 146440, 53432 ] I get [359000], which is almost exactly the same number of docs. This is the case even if I set terminate_after to something as small as 1000. terminate_after is "best effort": depending on various factors like cardinality and your aggregation, it won't necessarily adhere to anything close to that.

Instead of trying to depend on terminate_after, we can keep a counter in the map phase and not return docs after they've hit the max limit. This should ease the load on the coordinating node but the data nodes will need to map through however many docs. There doesn't seem to be a way around that. I think disabling parallelization should help here because we could get an accurate shard count and divide that by our max total doc count.
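A minimal sketch of that map-phase counter, assuming a hypothetical params.maxDocsPerShard that Kibana would compute up front (the real service map script collects event fields rather than a bare count, and the value here is just an example):

POST /traces-*/_search
{
  "size": 0,
  "aggs": {
    "service_map": {
      "scripted_metric": {
        "params": { "maxDocsPerShard": 100000 },
        "init_script": "state.docCount = 0; state.eventsById = new HashMap();",
        "map_script": """
          // stop collecting once this shard has used its share of the doc budget
          if (state.docCount < params.maxDocsPerShard) {
            state.docCount += 1;
            // ... copy the fields needed for the service map into state.eventsById ...
          }
        """,
        "combine_script": "return state;",
        "reduce_script": "return states;"
      }
    }
  }
}

The data nodes still run map_script over every matching doc, but nothing beyond the per-shard budget is accumulated into state and carried to the combine and reduce phases.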

@dgieselaar
Member

Great finds! Very interesting stuff. Re: terminate_after, I think it's fine that for a small number it's inaccurate, but if terminate_after is 1000 and the aggregation still sees 360k documents that's useless. What happens if you e.g. set terminate_after to 1m? I'm surprised because we e.g. use terminate_after for search requests where we only care if there's a single hit for a query. I have never seen anything to suggest that it's actually collecting over 300k documents. If you experiment with increasing/decreasing the number, what kind of variance do you see?

We need to be careful with dividing the doc count by the number of shards, because data is not guaranteed to be evenly distributed, e.g. one shard might contain 99% of the data. As data streams are partitioned by service this is not an unlikely scenario.

@neptunian
Contributor

neptunian commented Jun 14, 2024

I see no variation for terminate_after in the tests I've done, whether parallelization is on or off. It seems ignored. But if I set size: 1 it does respect terminate_after:

scripted metric agg count with size: 0, terminate_after: 100
POST /traces-*/_search
{
  "terminate_after": 100,
  "size":0,
  "aggs": {
    "doc_count_metric": {
      "scripted_metric": {
        "init_script": "state.count = 0;",
        "map_script": """
          state.count += 1;
        """,
        "combine_script": "return state;",
        "reduce_script": """
          long totalDocCount = 0;
          for (state in states) {
            totalDocCount += state.count;  // Sum the document counts from each shard
          }
          return totalDocCount;
        """
      }
    }
  }
}

Response:

{
  "took": 8,
  "timed_out": false,
  "terminated_early": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "doc_count_metric": {
      "value": 359400
    }
  }
}
scripted metric agg count with size: 1, terminate_after: 100
POST /traces-*/_search
{
  "terminate_after": 100,
  "size":1,
  "aggs": {
    "doc_count_metric": {
      "scripted_metric": {
        "init_script": "state.count = 0;",
        "map_script": """
          state.count += 1;
        """,
        "combine_script": "return state;",
        "reduce_script": """
          long totalDocCount = 0;
          for (state in states) {
            totalDocCount += state.count;  // Sum the document counts from each shard
          }
          return totalDocCount;
        """
      }
    }
  }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "terminated_early": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": ".ds-traces-apm-default-2024.06.13-000001",
        "_id": "QqpVEpABgTLW8ZcmNPqx",
        "_score": 1,
        "_source": {
          "observer": {
            "type": "synthtrace",
            "version": "8.15.0-preview-1716438434",
            "version_major": 8
          },
          "agent": {
            "name": "Synthtrace: service_map_oom"
          },
          "trace": {
            "id": "41835000000000000000000000708015"
          },
          "@timestamp": "2024-06-13T15:59:10.763Z",
          "service": {
            "node": {
              "name": "service-instance"
            },
            "environment": "prod",
            "name": "service-0"
          },
          "host": {
            "name": "service-instance"
          },
          "event": {
            "outcome": "unknown"
          },
          "processor": {
            "name": "transaction",
            "event": "transaction"
          },
          "transaction": {
            "duration": {
              "us": 1000000
            },
            "name": "GET /my/function",
            "id": "4183500000708014",
            "type": "request",
            "sampled": true
          },
          "timestamp": {
            "us": 1718294350763000
          },
          "tags": [
            "_geoip_database_unavailable_GeoLite2-City.mmdb"
          ]
        }
      }
    ]
  },
  "aggregations": {
    "doc_count_metric": {
      "value": 100
    }
  }
}

So it seems terminate_after works so long as we have something in size (which we don't currently). Getting the existing terminate_after to kick in would be an improvement alone.

We need to be careful with dividing no of shards by doc count, because data is not guaranteed to be unevenly distributed, e.g. one shard might contain 99% of the data. As data streams are partitioned by service this is not an unlikely scenario.

Yea, or the user could have partitioned their data streams via various namespaces. I think this is a tradeoff. I'm thinking in a real world scenario there probably won't be a huge amount of matching docs from 50 traces(?) so it would be unlikely to go over the max per shard limit unless it was a very large number of shards. It does seem like a safe way to avoid the possibility of too many docs being sent to the coordinating node, though.

Side note: I said previously, "I think disabling parallelization should help here because we could get an accurate shard count and divide that by our max total doc count." I don't think this is true, since whether it's on or off it returns the same docs, so I'm not sure it would matter if it was off. I spoke with @nik9000 about this and he'd rather have it off to be safe, accepting some performance loss we might have, so that PR was merged.

@dgieselaar
Member

So it seems terminate_after works so long as we have something in size (which we don't currently). Getting the existing terminate_after to kick in would be an improvement alone.

TIL! I think setting size:1 invalidates the shard request cache but that probably isn't very useful anyway.

Yea, or the user could have partitioned their data streams via various namespaces. I think this is a tradeoff. I'm thinking in a real world scenario there probably won't be a huge amount of matching docs from 50 traces(?) so it would be unlikely to go over the max per shard limit unless it was a very large number of shards. It does seem like a safe way to avoid the possibility of too many docs being sent to the coordinating node, though.

Oh, I'm not sure what you mean then. Can you give some examples with numbers?

@neptunian
Contributor

neptunian commented Jun 20, 2024

I opened this PR a couple days ago #186417. I added some testing directions to test locally with smaller amounts of data.
To recap:

This strategy involves dividing the document budget by the number of shards to limit the total number of documents that can be processed. Depending on how evenly the user's data is distributed, we may not be able to match as many docs (for example, if a user has mostly empty shards or unevenly distributed data), which means a potentially less accurate service map. This is a tradeoff.

I wanted to go over some issues still to be resolved / noted:

  • @nik9000 and I were thinking to have a static number for the max memory size, otherwise having it dynamic depending on the cluster is complex. @chrisdistasio and @paulb-elastic If we are wanting it to be dynamic depending on the cluster I think we need to continue discussions on this with @nik9000 .
  • If I understand correctly the parent circuit breaker monitors the total memory usage of all operations combined, not just a single query or aggregation. When it reports that the memory usage exceeds the limit during the reduce_aggs phase, it is considering the cumulative memory usage of all operations running on the node, not just this one. My change aims to keep it below the request circuit breaker only, which means we may continue to see this error due to other processes happening in the background or the queries that run before the scripted metric agg.
  • My testing has been local only, which I think should suffice, but it would be good to test on a serverless environment with a lot of data. The QA environment, as noted above, is not stable and is constantly throwing parent circuit breaker errors, so it isn't working for us.
  • Not directly related to this issue, but we should either remove the recursive function or set a max trace depth. I've set it to 20, but I'm not sure what that number should be. Should we have a user-configured setting for this? @chrisdistasio Your input might be helpful here. Very long traces are unlikely, but we do have them in our test data and perhaps some corrupted data could have them. @crespocarlos also suggested rewriting the code to remove the recursive call. I think we should handle this in this PR because the problem can surface after we've successfully avoided the circuit breaker / out-of-memory error, and you can then experience very slow loading or potentially timeouts.
  • Should this be serverless only? I understand there can be issues in any environment and this could be an improvement for everyone. However, I think it's riskier to do it everywhere, as we could not tweak the default setting as quickly. There might be more edge cases I'm not aware of, though it would simplify the code not having to do this conditionally for serverless.
  • UI messaging. We already had existing circuit breakers in place (terminate_after: 100k, max trace ids 1000), so I'm not sure whether we need to introduce messaging about that now, as this change is an extension of that. @chrisdistasio @paulb-elastic

@dgieselaar
Member

@neptunian what is the point of setting a max trace depth?

@neptunian
Contributor

@neptunian what is the point of setting a max trace depth?

@dgieselaar I don't think it would be as much of an issue now, since we are limiting the docs. But I was getting stack overflow errors and crashing Kibana using synthtrace data if I didn't add some limitation. I don't know if this is the right way to go about handling that scenario though. I've asked @crespocarlos to look into that and he said there might be some better optimizations.

neptunian added a commit that referenced this issue Jul 2, 2024
…6417)

## Summary

#179229

This PR addresses the need to limit the amount of data that the scripted
metric aggregation in the service map processes in one request which can
lead to timeouts and OOMs, often resulting in the user seeing [parent
circuit
breaker](https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#parent-circuit-breaker)
errors and no service map visualization. This query can fire up to 20
times max depending on how many trace ids are fetched in subsequent
query, contributing more to exceeding the total allowable memory.

These changes will not remove the possibility of OOMs or circuit breaker
errors. It doesn't control for multiple users or other processes
happening in kibana, rather we are removing the current state of
querying for an unknown number of documents by providing a hard limit
and a way to easily tweak that limit.

## Changes
- Make get_service_paths_from_trace_ids "shard aware" by adding an
initial query, `get_trace_ids_shard_data` without the aggregations and
only the trace id filter and other filters in order to see how many
shards were searched
- Use a baseline of 2_576_980_377 bytes max from new config
`serverlessServiceMapMaxAvailableBytes`, for all
get_service_paths_from_trace_ids queries when hitting the
`/internal/apm/service-map`
- Calculate how many docs we should retrieve per shard and set that to
`terminateAfter` and also as part of the map phase to ensure we never
send more than this number to reduce
- Calculation is: ((serverlessServiceMapMaxAvailableBytes / average
document size) / totalRequests) / numberOfShards
Eg: 2_576_980_377 / 495 avg doc size = 5,206,020 total docs
 5,206,020 total docs / 10 requests = 520,602 docs per query
520,602 docs per query / 3 shards = **173,534 docs per shard**
Since 173,534 is greater than the default setting
`serviceMapTerminateAfter`, docs per shard is 100k
- Ensure that `map_script` phase won't process duplicate events
- Refactor the `processAndReturnEvent` function to replace recursion
with a loop to mitigate risks of stack overflow and excessive memory
consumption when processing deep trees


## Testing

### Testing that the scripted metric agg query does not exceed the
request circuit breaker
- start elasticsearch with default settings
- on `main`, without these changes, update the request circuit breaker
limit to be 2mb:
```
 PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "2mb"
  }
}
```
- run synthtrace `node scripts/synthtrace.js service_map_oom
--from=now-15m --to=now --clean`
- Go to the service map, and you should see this error:
<img width="305" alt="Screenshot 2024-06-20 at 2 41 18 PM"
src="https://github.com/elastic/kibana/assets/1676003/517709e5-f5c0-46bf-a06f-5817458fe292">

- checkout this PR
- set the apm kibana setting to 2mb(binary):
`xpack.apm.serverlessServiceMapMaxAvailableBytes: 2097152`. this
represents the available space for the [request circuit
breaker](https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html#request-circuit-breaker),
since we aren't grabbing that dynamically.
- navigate to the service map and you should not get this error and the
service map should appear

---------

Co-authored-by: Carlos Crespo <carloshenrique.leonelcrespo@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@paulb-elastic
Contributor

Closing - #186417 merged and #187707 opened
