Memory leak when returning files in Spring Webflux #2557

Closed
LukasSmrcka opened this issue Feb 21, 2023 · 45 comments

Labels: Platform: Java, Type: Bug (Something isn't working)

Comments

@LukasSmrcka

Integration

sentry-spring-boot-starter

Java Version

11

Version

6.13.1

Steps to Reproduce

Hi,
we are running a Kotlin-based application on Spring WebFlux and Kotlin coroutines.
Versions:
Spring Boot: 2.7.0
Spring WebFlux: 5.3.20

The application is basically a cache. 99% of all HTTP requests hitting the app get a file from disk in the response. Everything is streamed. The application currently handles about 10 HTTP requests per second.

The first issue we had was with Sentry breadcrumbs, which we had to disable because after a couple of hours they could exhaust the 1GB heap limit we have. After disabling breadcrumbs the issue is still there; I am not sure if it is something different or if we just reduced the size of the leaking objects by disabling breadcrumbs. Either way, I can observe in memory that the count of io.sentry.Scope instances and all the objects they reference (ConcurrentHashMaps, ...) keeps rising. After 24 hours there are more than 50K instances of io.sentry.Scope.

When the Sentry Spring Boot package is removed, memory is stable. It looks like it is somehow related to those file streams, because there are no issues with the other WebFlux-based applications we have.
Any idea what could be the issue? How could we help to fix it?

Best regards,
Lukas

Expected Result

No memory leak when returning files on http requests

Actual Result

[screenshot: heap dump]

Those strings/byte arrays on top are part of the Scope aggregates

@LukasSmrcka LukasSmrcka added the Platform: Java and Type: Bug (Something isn't working) labels Feb 21, 2023
@LukasSmrcka LukasSmrcka changed the title from "Memory leak whenreturning files in Spring Webflux" to "Memory leak when returning files in Spring Webflux" Feb 21, 2023
@adinauer
Member

Hello @LukasSmrcka, did you also notice this behaviour in older versions of the SDK?

I'll try to reproduce and report back.

@adinauer adinauer self-assigned this Feb 22, 2023
@adinauer
Member

@LukasSmrcka I tried reproducing this but wasn't able to do so yet.
After hitting an API endpoint returning a ClassPathResource a few million times over a few minutes I don't see the number of Scope objects increase in the heap dumps.

@adinauer
Member

Also not able to reproduce with FileSystemResource. Can you please share more details on how to reproduce this? Ideally a controller method that helps reproduce the issue.

@LukasSmrcka
Author

I understand. I never managed to reproduce it in a controlled way on our test environments using tools that flood the server with requests. It just always happens in production.

One more clue I can add: when we don't use an input stream and instead convert the file stream to a byte array just before returning it from the REST controller, there is no leak; it just works. But when streams are used, it somehow leaks.

Another thing worth pointing out is that for the first few hours after the service starts, everything looks good. The heap allocation graph has a "saw" shape and GC is very stable. After a few hours something suddenly changes, the heap graph looks weird, and it slowly starts to leak.

I know this info is not of much use. I am just mentioning these things because someone might get an idea of what could be the cause. I will try to bring more info in the upcoming hours or days and will try to dig deeper into the memory dumps; hopefully I will find something helpful.
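For illustration, a minimal sketch of the workaround described above (names, paths and the controller shape are hypothetical, not our actual code, and it assumes kotlinx-coroutines-reactor is on the classpath for suspend handlers): the file is read fully into a byte array before returning, so no open stream is handed to WebFlux.

    import org.springframework.http.MediaType
    import org.springframework.http.ResponseEntity
    import org.springframework.web.bind.annotation.GetMapping
    import org.springframework.web.bind.annotation.PathVariable
    import org.springframework.web.bind.annotation.RestController
    import java.io.File

    @RestController
    class CacheController {

        // Workaround variant: materialize the whole file in memory before returning it,
        // so WebFlux never receives an open stream. Path and names are made up.
        @GetMapping("/files/{name}", produces = [MediaType.APPLICATION_OCTET_STREAM_VALUE])
        suspend fun file(@PathVariable name: String): ResponseEntity<ByteArray> {
            val bytes = File("/data/cache/$name").readBytes()
            return ResponseEntity.ok(bytes)
        }
    }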

@romtsn
Member

romtsn commented Feb 22, 2023

@LukasSmrcka just to make sure, you are not using our File I/O instrumentation, right?

@LukasSmrcka
Author

No, we don't use it. We use the Spring Boot auto-configuration, with breadcrumbs disabled in application.yml.

@LukasSmrcka
Author

LukasSmrcka commented Feb 22, 2023

@romtsn By the way, the settings we used after the issues with breadcrumbs are:

sentry:
  dsn: https://..........
  traces-sample-rate: 1.0
  enable-user-interaction-breadcrumbs: off
  max-breadcrumbs: 1

But I still see on the heap the same count of breadcrumb instances as Sentry Scope instances. Is it possible to turn them off? It looks like only "max-breadcrumbs" is applied but not "enable-user-interaction-breadcrumbs: off". Or am I misunderstanding the config?

@romtsn
Member

romtsn commented Feb 22, 2023

enable-user-interaction-breadcrumbs: off should have no effect on the backend at all. What do you mean by "same count of breadcrumb instances"? Do you mean CircularFifoQueue instances?

@adinauer
Member

@LukasSmrcka you could try an older version of the Sentry SDK to see if that still has the problem. Other than that you could check if pushScope is only called in SentryWebFilter or also in other places. Maybe there's something else.

An alternative would be to wait for #2546 but that is only for Spring Boot 3. Not sure you can upgrade easily.

Could the problem be caused by streams not being consumed completely or something similar?

@LukasSmrcka
Author

Sorry, the counts are different, but almost the same. These are the instance counts of each class in memory:
[screenshot: instance counts per class in the heap dump]

@LukasSmrcka
Author

Ok, I will check your suggestions. We had the same issue with Sentry libs version 6.4.1; I upgraded to the latest because I noticed some leak fixes in the changelog.

I am currently running an instance without Sentry to be 100% sure the Sentry leak is not just a secondary effect of something else.

Streams not being consumed completely? That could actually be an option; I can check if there are such cases. Could you please elaborate? How could this cause the issue?

@adinauer
Member

It's just a thought. Maybe something isn't cleaned up until the stream is consumed or timeouts are too high. I'm not very familiar with WebFlux. Maybe there's some docs for this case. I don't know. May also have to do with backpressure.

@LukasSmrcka
Author

Ok. Tomorrow morning I should have some conclusions from the latest version running on prod without Sentry. Then I will take a closer look at this.

@LukasSmrcka
Author

The prod server memory does not show any signs of a leak after 20 hours, so Sentry is definitely somehow involved when streams are returned. Once more I want to point out that when the stream is converted to a byte array in the controller before it is returned to the calling client, there is also no leak. I am going to take a closer look at your suggestions now.

@LukasSmrcka
Author

LukasSmrcka commented Feb 23, 2023

I have also checked the heap dump for stream instances. There are literally zero instances of the input streams we are using, so it does not look like anything is left open. Do you have any ideas what diagnostics I should add to get more info?

@adinauer
Member

Can you filter your heap dump objects by io.sentry and check if there are as many Hubs as there are Scopes and Breadcrumbs? I suspect that somehow hubs are kept alive. If there are many, maybe you can find something that keeps a reference to the hubs and thus prevents them from being garbage collected.

If you could share some more details on what you're doing inside your controllers, I can play around locally some more: what classes are you using, where are you reading from, are you wrapping things in Reactor Mono/Flux, etc.?

@LukasSmrcka
Author

The analyzed heap dump has 8 Hub instances, 1 HubAdapter, 1 NoOpHub, ... It looks like hub.popScope() is not called by the web filter, or maybe hub.pushScope() is called multiple times. I am currently investigating what hot-publisher behavior would do to the WebFilter when there are multiple subscribers. I believe there is such a case in our app, but the count of unreleased stack items does not match the cases where this scenario might happen, so I guess it is not the issue.

I guess there is no chance that popScope() would fail, right?

@LukasSmrcka
Author

By the way, it is a Kotlin app, so requests run in a coroutine scope. 99% of requests are just files from disk, loaded as a stream, wrapped in an InputStreamResource and returned by the controller.
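A rough sketch of that pattern (names and paths are illustrative, not our actual code; it assumes kotlinx-coroutines-reactor for suspend handlers):

    import org.springframework.core.io.InputStreamResource
    import org.springframework.http.MediaType
    import org.springframework.http.ResponseEntity
    import org.springframework.web.bind.annotation.GetMapping
    import org.springframework.web.bind.annotation.PathVariable
    import org.springframework.web.bind.annotation.RestController
    import java.io.File

    @RestController
    class FileController {

        // The stream is opened here but only consumed later, while WebFlux writes
        // the response; nothing in the controller closes it explicitly.
        @GetMapping("/cache/{name}", produces = [MediaType.APPLICATION_OCTET_STREAM_VALUE])
        suspend fun download(@PathVariable name: String): ResponseEntity<InputStreamResource> {
            val stream = File("/data/cache/$name").inputStream()
            return ResponseEntity.ok(InputStreamResource(stream))
        }
    }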

@adinauer
Member

I guess there is no chance that popScope() would fail, right?

Never say never. I'll try some more stuff and see if I can reproduce.

@LukasSmrcka
Author

I am trying to get more info about the "doFinally" block that is used in SentryWebFilter. I am also running a new test on the beta environment: I am flooding the server with 10 download requests per second, and every download is aborted in the middle by the client app. Based on some info I found, doFinally might not be called in situations like this.

@LukasSmrcka
Author

I deployed to production a very simple WebFilter monitoring the count of simultaneous requests. After some time it will show whether the doFirst and doFinally behavior is correct in all cases; then we should know if the problem is in Reactor or in the Sentry library.

    override fun filter(serverWebExchange: ServerWebExchange, webFilterChain: WebFilterChain): Mono<Void> {
        return webFilterChain
                .filter(serverWebExchange)
                .doFinally(
                        Consumer {
                            requestCounter.decrementAndGet()
                        })
                .doFirst(
                        Runnable {
                            requestCounter.incrementAndGet()
                        })
    }

@adinauer
Member

@LukasSmrcka thanks. That's a great idea. Hopefully we can learn something from it.

@LukasSmrcka
Author

So I have some results. These methods (doFirst, doFinally) look reliable: after 85K requests, for every doFirst there was always a doFinally. However, there are 38K io.sentry.Stack.StackItem instances in memory, and of course all the related instances like Scope, FifoQueue and so on. So it really looks like there is a situation where the "pop" from the stack fails. The Spring Boot starter configures only the WebFilter and nothing else, right?

Is there a way to really disable those breadcrumbs? Currently I only have the limit set to "1" ... would 0 work?

@LukasSmrcka
Author

Oh, and one more question: is Sentry for WebFlux internally working with the Reactor Context?

@adinauer
Member

adinauer commented Mar 1, 2023

Is there a way to really disable those breadcrumbs? Currently I only have the limit set to "1" ... would 0 work?

Setting it to 0 doesn't work in the current version.

This should allow you to disable breadcrumbs until we can find a fix:

  @Bean
  SentryOptions.BeforeBreadcrumbCallback beforeBreadcrumbCallback() {
    return new SentryOptions.BeforeBreadcrumbCallback() {
      @Override
      public @Nullable Breadcrumb execute(@NotNull Breadcrumb breadcrumb, @NotNull Hint hint) {
        return null;
      }
    };
  }
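A Kotlin equivalent of this bean should look roughly like this (a sketch relying on SAM conversion of the callback interface; the configuration class name is arbitrary):

    import io.sentry.Breadcrumb
    import io.sentry.Hint
    import io.sentry.SentryOptions
    import org.springframework.context.annotation.Bean
    import org.springframework.context.annotation.Configuration

    @Configuration
    class SentryBreadcrumbConfig {

        // Returning null from the callback drops every breadcrumb before it is stored.
        @Bean
        fun beforeBreadcrumbCallback(): SentryOptions.BeforeBreadcrumbCallback =
            SentryOptions.BeforeBreadcrumbCallback { _: Breadcrumb, _: Hint -> null }
    }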

The Spring Boot starter configures only the WebFilter and nothing else, right?

There's more. There's also SentryScheduleHook, SentryRequestResolver and SentryWebExceptionHandler.

Is Sentry for WebFlux internally working with the Reactor Context?

In our integration for Spring Boot 2, no. We couldn't get it to work in a way that's easily usable for developers. We're about to merge something that'll use it for Spring Boot 3.0.3+. Do you have a suspicion there?

I'm still unable to reproduce. Will try a few more things.

@adinauer
Member

adinauer commented Mar 1, 2023

Think I'm on to something. I can reproduce now even without coroutines etc. Seems like SentryWebFilter and SentryScheduleHook don't play well together. We call .pushScope() on one hub and .popScope() on a different one (that is a grandchild of the hub we call .pushScope() on). Working on a fix.

Using this code in a controller to reproduce:

  @GetMapping("{id}")
  Mono<Person> person(@PathVariable Long id) {
//    LOGGER.info("Loading person with id={}", id);
//    throw new IllegalArgumentException("Something went wrong [id=" + id + "]");
    return personService.create(new Person("fir", "la")).flatMap(it -> Mono.error(new RuntimeException("thrown on purpose for webflux")));
  }

And then blast it with some requests using wrk -t2 -c100 -d30s http://user:password@localhost:8080/person/91

Will update here once I know more.

@LukasSmrcka
Author

Regarding the Reactor context: I have found a place in our code where, in my opinion, the coroutine context is not passed correctly, so I was just wondering if this could somehow break your code.

I am also going to check out the ScheduleHook, as there are some schedulers running very often in our app.

And I will try to run a new version without breadcrumbs, using your suggested config.

@adinauer
Member

adinauer commented Mar 1, 2023

Are you already using SentryContext

    public class SentryContext : ThreadContextElement<IHub>, AbstractCoroutineContextElement(Key)

via the sentry-kotlin-extensions dependency?

@LukasSmrcka
Author

No, we don't use this. Judging from the documentation, I guess we should. But is it just for covering coroutine threads with Sentry, or do you think it might actually be causing the issue?

@adinauer
Member

adinauer commented Mar 1, 2023

I don't think it's responsible for the memory issue you're facing but I've got a PR in the works to improve the situation for WebFlux. Need to do some more testing before pushing it. The sentry-kotlin-extensions dependency should allow you to have correct hub propagation in coroutines, meaning e.g. breadcrumbs end up in the right error. Without it I guess all interactions inside coroutines would leak into other requests etc.
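For reference, a minimal sketch of how SentryContext is typically added to a coroutine launch (the surrounding function and scope are illustrative, not part of Sentry's docs):

    import io.sentry.Sentry
    import io.sentry.kotlin.SentryContext
    import kotlinx.coroutines.CoroutineScope
    import kotlinx.coroutines.launch

    // Adding SentryContext() to the coroutine context lets the current hub follow
    // the coroutine across thread switches, so breadcrumbs and scope data end up
    // on the request that actually produced them.
    fun startWork(scope: CoroutineScope) {
        scope.launch(SentryContext()) {
            Sentry.addBreadcrumb("runs on the propagated hub") // attached to the right scope
        }
    }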

@LukasSmrcka
Author

Yes I understand. It looks good. I am gonna use it.

@LukasSmrcka
Author

I tried the config above to disable breadcrumbs and the behavior is the same. I think the bug you are solving might be the cause, as the scheduler is used really heavily in the app. Do you need anything from me now, or can I just wait for the fix?

@adinauer
Member

adinauer commented Mar 2, 2023

Hmm I would've expected it to at least reduce the amount of memory that is consumed. The amount of Stacks, StackItems and Scopes should remain the same. We just merged the fix and are planning a release soon.

@adinauer
Member

adinauer commented Mar 2, 2023

The fix doesn't entirely get rid of the problem but should improve the situation by a lot. I'd like to further improve this in a later release by freeing hubs once we think they're no longer needed. At the moment the ThreadLocals keep a reference to the Hub and prevent it from being garbage collected. Currently there may still be one hub per thread, including its Stack, StackItems, Scope etc., which stays alive until something else comes along and replaces the hub, finally allowing it to be GCd.
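As a purely conceptual sketch of that retention pattern (not Sentry's actual code; the holder object and the byte array just stand in for the hub and its scope data):

    import java.util.concurrent.Executors

    // Conceptual illustration only: a value stored in a ThreadLocal stays strongly
    // reachable from the worker thread, so it cannot be garbage collected until the
    // same thread overwrites it or the thread dies.
    object CurrentHubHolder {
        private val current = ThreadLocal<Any?>() // stands in for the per-thread hub
        fun set(hub: Any?) = current.set(hub)
        fun get(): Any? = current.get()
    }

    fun main() {
        val pool = Executors.newFixedThreadPool(1)
        // Simulate one request: the "hub" (here just a large array) is set but never cleared.
        pool.execute { CurrentHubHolder.set(ByteArray(10_000_000)) }
        // The 10 MB array now stays alive with the pool thread, even though no request
        // references it any more; it is only freed once another task overwrites it.
        pool.shutdown()
    }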

@LukasSmrcka
Author

Well ... there are no longer breadcrumbs in memory, so the leak is smaller in terms of the amount of memory. There seems to be some leak reduction, but I can't tell exactly: 28K StackItems after 17 hours and 145K requests. It looks better for sure and seems to progress more slowly, but that could just be coincidence, so nothing relevant for any conclusions.

Ok, I will try the fix when it is ready. Shouldn't ThreadLocals be avoided? Or are you using them somehow together with the Reactor context?

@adinauer
Member

adinauer commented Mar 2, 2023

We're heavily relying on ThreadLocal to offer the static Sentry.capture... API. For Spring Boot 3.0.3 we'll be able to leverage ThreadLocalAccessor for hub propagation, which works better than what we had before (also included in the next release).

@adinauer
Member

adinauer commented Mar 2, 2023

@LukasSmrcka 6.15.0 of the Java SDK has just been released. Can you please give it a try and let us know whether the memory issue has been improved?

@LukasSmrcka
Author

Sure. On it

@LukasSmrcka
Author

It is running now. We will see some results tomorrow

@LukasSmrcka
Author

@adinauer After 18 hours of running and 155K requests there is no Sentry leak. Good job, I think you have fixed it.

@adinauer
Member

adinauer commented Mar 3, 2023

That's great to hear. Apologies for the troubles.

@LukasSmrcka
Author

Thanks for fixing that. Later I will also give breadcrumbs a try; for now they are disabled.

@adinauer
Member

adinauer commented Mar 6, 2023

Played around with my idea for only referencing an ID instead of the whole hub but wasn't able to quickly make it work in all cases (e.g. ThreadLocalAccessor).

Closing this now as the issue seems resolved. Please feel free to reopen in case you still encounter problems.

@adinauer adinauer closed this as completed Mar 6, 2023
@LukasSmrcka
Author

@adinauer After half a million requests all is good. Now I have deployed a new version with breadcrumbs enabled. I will let you know if anything goes wrong.

@LukasSmrcka
Author

Looks like all is good even with breadcrumbs. No signs of a leak. Thank you once more.
