Memory leak when returning files in Spring Webflux #2557

Closed
LukasSmrcka opened this issue Feb 21, 2023 · 45 comments

Labels: Platform: Java, Type: Bug (Something isn't working)

Comments

@LukasSmrcka

Integration

sentry-spring-boot-starter

Java Version

11

Version

6.13.1

Steps to Reproduce

Hi,
we are running a Kotlin-based application on Spring WebFlux and Kotlin coroutines.
Versions:
Spring Boot: 2.7.0
Spring WebFlux: 5.3.20

The application is basically a cache. 99% of all HTTP requests hitting the app get a file from disk in the response. Everything is streamed. The application currently handles about 10 HTTP requests per second.

The first issue we had was with Sentry breadcrumbs, which we had to disable because after a couple of hours they could exhaust the 1GB heap limit we have. After disabling breadcrumbs the issue is still there; I am not sure if it is something different or if we just reduced the size of the leaking objects by disabling breadcrumbs. Either way, I can observe in memory that the count of io.sentry.Scope instances and all the objects they reference (ConcurrentHashMaps, ...) keeps rising. After 24 hours there are more than 50K instances of io.sentry.Scope.

When the Sentry Spring Boot package is removed, memory is stable. It looks like it is somehow related to those file streams, because there are no issues with the other WebFlux-based applications we have.
Any idea what could be the issue? How could we help to fix it?

Best regards,
Lukas

Expected Result

No memory leak when returning files on http requests

Actual Result

[screenshot: heap dump]

Those strings/byte arrays on top are part of the Scope aggregates

@LukasSmrcka LukasSmrcka added the Platform: Java and Type: Bug (Something isn't working) labels Feb 21, 2023
@LukasSmrcka LukasSmrcka changed the title from "Memory leak whenreturning files in Spring Webflux" to "Memory leak when returning files in Spring Webflux" Feb 21, 2023
@adinauer
Member

Hello @LukasSmrcka, did you also notice this behaviour in older versions of the SDK?

I'll try to reproduce and report back.

@adinauer adinauer self-assigned this Feb 22, 2023
@adinauer
Member

@LukasSmrcka I tried reproducing this but wasn't able to do so yet.
After hitting an API endpoint returning a ClassPathResource a few million times over a few minutes I don't see the number of Scope objects increase in the heap dumps.

@adinauer
Member

Also not able to reproduce with FileSystemResource. Can you please share more details on how to reproduce this? Ideally a controller method that helps reproduce the issue.

@LukasSmrcka
Author

I understand. I never managed to reproduce it in a controlled way on our test environments using tools that flood the server with requests. It just always happens in production.

One more clue I can add: when we don't use an input stream and instead convert the file stream to a byte array just before returning it from the REST controller, there is no leak; it just works. But when streams are used, it somehow leaks.

Another thing worth pointing out is that for the first few hours after the service starts, everything looks good. The heap allocation graph has a "saw" shape and GC is very stable. After a few hours something suddenly changes, the heap graph looks weird, and it slowly starts to leak.

I know this info is not of much use. I am just mentioning these things because someone might get an idea of what could be the cause. I will try to bring more info in the upcoming hours or days and will try to dig deeper into the memory dumps; hopefully I will find something helpful.
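For illustration, a minimal sketch of the workaround described above (names, paths and the controller shape are hypothetical, not our actual code, and it assumes kotlinx-coroutines-reactor is on the classpath for suspend handlers): the file is read fully into a byte array before returning, so no open stream is handed to WebFlux.

    import org.springframework.http.MediaType
    import org.springframework.http.ResponseEntity
    import org.springframework.web.bind.annotation.GetMapping
    import org.springframework.web.bind.annotation.PathVariable
    import org.springframework.web.bind.annotation.RestController
    import java.io.File

    @RestController
    class CacheController {

        // Workaround variant: materialize the whole file in memory before returning it,
        // so WebFlux never receives an open stream. Path and names are made up.
        @GetMapping("/files/{name}", produces = [MediaType.APPLICATION_OCTET_STREAM_VALUE])
        suspend fun file(@PathVariable name: String): ResponseEntity<ByteArray> {
            val bytes = File("/data/cache/$name").readBytes()
            return ResponseEntity.ok(bytes)
        }
    }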

@romtsn
Member

romtsn commented Feb 22, 2023

@LukasSmrcka just to make sure, you are not using our File I/O instrumentation, right?

@LukasSmrcka
Author

No, we don't use it. We use the Spring Boot auto-configuration, with breadcrumbs disabled in application.yml.

@LukasSmrcka
Author

LukasSmrcka commented Feb 22, 2023

@romtsn By the way, the settings we used after the issues with breadcrumbs are:

sentry:
  dsn: https://..........
  traces-sample-rate: 1.0
  enable-user-interaction-breadcrumbs: off
  max-breadcrumbs: 1

But I still see on the heap the same count of breadcrumb instances as Sentry Scope instances. Is it possible to turn them off? It looks like only "max-breadcrumbs" is applied but not "enable-user-interaction-breadcrumbs: off". Or am I misunderstanding the config?

@romtsn
Member

romtsn commented Feb 22, 2023

enable-user-interaction-breadcrumbs: off should have no effect on the backend at all. What do you mean by "same count of breadcrumb instances"? Do you mean CircularFifoQueue instances?

@adinauer
Member

@LukasSmrcka you could try an older version of the Sentry SDK to see if that still has the problem. Other than that you could check if pushScope is only called in SentryWebFilter or also in other places. Maybe there's something else.

An alternative would be to wait for #2546 but that is only for Spring Boot 3. Not sure you can upgrade easily.

Could the problem be caused by streams not being consumed completely or something similar?

@LukasSmrcka
Author

Sorry, the counts are different, but almost the same. These are the instance counts of each class in memory:
[screenshot: instance counts per class in the heap dump]

@LukasSmrcka
Author

Ok, I will check your suggestions. We had the same issue with Sentry libs version 6.4.1; I upgraded to the latest because I noticed some leak fixes in the changelog.

I am currently running an instance without Sentry to be 100% sure the Sentry leak is not just a secondary effect of something else.

Streams not being consumed completely? That could actually be an option; I can check if there are such cases. Could you please elaborate? How could this cause the issue?

@adinauer
Member

It's just a thought. Maybe something isn't cleaned up until the stream is consumed or timeouts are too high. I'm not very familiar with WebFlux. Maybe there's some docs for this case. I don't know. May also have to do with backpressure.

@LukasSmrcka
Author

Ok. Tomorrow morning I should have some conclusions from the latest version running on prod without Sentry. Then I will take a closer look at this.

@LukasSmrcka
Author

The prod server memory does not show any signs of a leak after 20 hours, so Sentry is definitely somehow involved when streams are returned. Once more I want to point out that when the stream is converted to a byte array in the controller before it is returned to the calling client, there is also no leak. I am going to take a closer look at your suggestions now.

@LukasSmrcka
Author

LukasSmrcka commented Feb 23, 2023

I have also checked the heap dump for stream instances. There are literally zero instances of the input streams we are using, so it does not look like anything is left open. Do you have any ideas what diagnostics I should add to get more info?

@adinauer
Member

Can you filter your heap dump objects by io.sentry and check if there are as many Hubs as there are Scopes and Breadcrumbs? I suspect that somehow hubs are kept alive. If there are many, maybe you can find something that keeps a reference to the hubs and thus prevents them from being garbage collected.

If you could share some more details on what you're doing inside your controllers, I can play around locally some more: what classes are you using, where are you reading from, are you wrapping things in Reactor Mono/Flux, etc.?

@LukasSmrcka
Author

The analyzed heap dump has 8 Hub instances, 1 HubAdapter, 1 NoOpHub, ... It looks like hub.popScope() is not called by the web filter, or maybe hub.pushScope() is called multiple times. I am currently investigating what hot-publisher behavior would do to the WebFilter when there are multiple subscribers. I believe there is such a case in our app, but the count of unreleased stack items does not match the cases where this scenario might happen, so I guess it is not the issue.

I guess there is no chance that popScope() would fail, right?

@LukasSmrcka
Author

By the way, it is a Kotlin app, so requests run in a coroutine scope. 99% of requests are just files from disk, loaded as a stream, wrapped in an InputStreamResource and returned by the controller.
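A rough sketch of that pattern (names and paths are illustrative, not our actual code; it assumes kotlinx-coroutines-reactor for suspend handlers):

    import org.springframework.core.io.InputStreamResource
    import org.springframework.http.MediaType
    import org.springframework.http.ResponseEntity
    import org.springframework.web.bind.annotation.GetMapping
    import org.springframework.web.bind.annotation.PathVariable
    import org.springframework.web.bind.annotation.RestController
    import java.io.File

    @RestController
    class FileController {

        // The stream is opened here but only consumed later, while WebFlux writes
        // the response; nothing in the controller closes it explicitly.
        @GetMapping("/cache/{name}", produces = [MediaType.APPLICATION_OCTET_STREAM_VALUE])
        suspend fun download(@PathVariable name: String): ResponseEntity<InputStreamResource> {
            val stream = File("/data/cache/$name").inputStream()
            return ResponseEntity.ok(InputStreamResource(stream))
        }
    }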

@adinauer
Member

I guess there is no chance that popScope() would fail, right?

Never say never. I'll try some more stuff and see if I can reproduce.

@LukasSmrcka
Author

I am trying to get more info about the "doFinally" block that is used in SentryWebFilter. I am also running a new test on the beta environment: I am flooding the server with 10 download requests per second, and every download is aborted in the middle by the client app. Based on some info I found, doFinally might not be called in situations like this.

@LukasSmrcka
Author

I deployed to production a very simple WebFilter monitoring the count of simultaneous requests. After some time it will show whether the doFirst and doFinally behavior is correct in all cases; then we should know if the problem is in Reactor or in the Sentry library.

    override fun filter(serverWebExchange: ServerWebExchange, webFilterChain: WebFilterChain): Mono<Void> {
        return webFilterChain
                .filter(serverWebExchange)
                .doFinally(
                        Consumer {
                            requestCounter.decrementAndGet()
                        })
                .doFirst(
                        Runnable {
                            requestCounter.incrementAndGet()
                        })
    }

@adinauer
Member

@LukasSmrcka thanks. That's a great idea. Hopefully we can learn something from it.

@LukasSmrcka
Author

So I have some results. These methods (doFirst, doFinally) look reliable: after 85K requests, for every doFirst there was always a doFinally. However, there are 38K io.sentry.Stack.StackItem instances in memory, and of course all the related instances like Scope, FifoQueue and so on. So it really looks like there is a situation where the "pop" from the stack fails. The Spring Boot starter configures only the WebFilter and nothing else, right?

Is there a way to really disable those breadcrumbs? Currently I only have the limit set to "1" ... would 0 work?

@LukasSmrcka
Author

Oh, and one more question: is Sentry for WebFlux internally working with the Reactor Context?

@adinauer
Member

adinauer commented Mar 1, 2023

Is there a way to really disable those breadcrumbs? Currently I only have the limit set to "1" ... would 0 work?

Setting it to 0 doesn't work in the current version.

This should allow you to disable breadcrumbs until we can find a fix:

  @Bean
  SentryOptions.BeforeBreadcrumbCallback beforeBreadcrumbCallback() {
    return new SentryOptions.BeforeBreadcrumbCallback() {
      @Override
      public @Nullable Breadcrumb execute(@NotNull Breadcrumb breadcrumb, @NotNull Hint hint) {
        return null;
      }
    };
  }
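A Kotlin equivalent of this bean should look roughly like this (a sketch relying on SAM conversion of the callback interface; the configuration class name is arbitrary):

    import io.sentry.Breadcrumb
    import io.sentry.Hint
    import io.sentry.SentryOptions
    import org.springframework.context.annotation.Bean
    import org.springframework.context.annotation.Configuration

    @Configuration
    class SentryBreadcrumbConfig {

        // Returning null from the callback drops every breadcrumb before it is stored.
        @Bean
        fun beforeBreadcrumbCallback(): SentryOptions.BeforeBreadcrumbCallback =
            SentryOptions.BeforeBreadcrumbCallback { _: Breadcrumb, _: Hint -> null }
    }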

The Spring Boot starter configures only the WebFilter and nothing else, right?

There's more. There's also SentryScheduleHook, SentryRequestResolver and SentryWebExceptionHandler.

Is Sentry for WebFlux internally working with the Reactor Context?

In our integration for Spring Boot 2, no. We couldn't get it to work in a way that's easily usable for developers. We're about to merge something that'll use it for Spring Boot 3.0.3+. Do you have a suspicion there?

I'm still unable to reproduce. Will try a few more things.

@adinauer
Member

adinauer commented Mar 1, 2023

Think I'm on to something. I can reproduce now even without coroutines etc. Seems like SentryWebFilter and SentryScheduleHook don't play well together. We call .pushScope() on one hub and .popScope() on a different one (that is a grandchild of the hub we call .pushScope() on). Working on a fix.

Using this code in a controller to reproduce:

  @GetMapping("{id}")
  Mono<Person> person(@PathVariable Long id) {
//    LOGGER.info("Loading person with id={}", id);
//    throw new IllegalArgumentException("Something went wrong [id=" + id + "]");
    return personService.create(new Person("fir", "la")).flatMap(it -> Mono.error(new RuntimeException("thrown on purpose for webflux")));
  }

And then blast it with some requests using wrk -t2 -c100 -d30s http://user:password@localhost:8080/person/91

Will update here once I know more.

@LukasSmrcka
Author

Regarding the Reactor context: I have found a place in our code where, in my opinion, the coroutine context is not passed correctly, so I was just wondering if this could somehow break your code.

I am also going to check out the ScheduleHook, as there are some schedulers running very often in our app.

And I will try to run a new version without breadcrumbs, using your suggested config.

@adinauer
Member

adinauer commented Mar 1, 2023

Are you already using SentryContext

    public class SentryContext : ThreadContextElement<IHub>, AbstractCoroutineContextElement(Key)

via the sentry-kotlin-extensions dependency?

@LukasSmrcka
Author

No, we don't use this. Judging from the documentation, I guess we should. But is it just for covering coroutine threads with Sentry, or do you think it might actually be causing the issue?

@adinauer
Member

adinauer commented Mar 1, 2023

I don't think it's responsible for the memory issue you're facing but I've got a PR in the works to improve the situation for WebFlux. Need to do some more testing before pushing it. The sentry-kotlin-extensions dependency should allow you to have correct hub propagation in coroutines, meaning e.g. breadcrumbs end up in the right error. Without it I guess all interactions inside coroutines would leak into other requests etc.
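For reference, a minimal sketch of how SentryContext is typically added to a coroutine launch (the surrounding function and scope are illustrative, not part of Sentry's docs):

    import io.sentry.Sentry
    import io.sentry.kotlin.SentryContext
    import kotlinx.coroutines.CoroutineScope
    import kotlinx.coroutines.launch

    // Adding SentryContext() to the coroutine context lets the current hub follow
    // the coroutine across thread switches, so breadcrumbs and scope data end up
    // on the request that actually produced them.
    fun startWork(scope: CoroutineScope) {
        scope.launch(SentryContext()) {
            Sentry.addBreadcrumb("runs on the propagated hub") // attached to the right scope
        }
    }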

@LukasSmrcka
Author

Yes I understand. It looks good. I am gonna use it.

@LukasSmrcka
Author

I tried the config above to disable breadcrumbs and the behavior is the same. I think the bug you are solving might be the cause, as the scheduler is used really heavily in the app. Do you need anything from me now, or can I just wait for the fix?

@adinauer
Member

adinauer commented Mar 2, 2023

Hmm I would've expected it to at least reduce the amount of memory that is consumed. The amount of Stacks, StackItems and Scopes should remain the same. We just merged the fix and are planning a release soon.

@adinauer
Member

adinauer commented Mar 2, 2023

The fix doesn't entirely get rid of the problem but should improve the situation by a lot. I'd like to further improve this in a later release by freeing hubs once we think they're no longer needed. At the moment the ThreadLocals keep a reference to the Hub and prevent it from being garbage collected. Currently there may still be one hub per thread, including its Stack, StackItems, Scope etc., which stays alive until something else comes along and replaces the hub, finally allowing it to be GCd.
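As a purely conceptual sketch of that retention pattern (not Sentry's actual code; the holder object and the byte array just stand in for the hub and its scope data):

    import java.util.concurrent.Executors

    // Conceptual illustration only: a value stored in a ThreadLocal stays strongly
    // reachable from the worker thread, so it cannot be garbage collected until the
    // same thread overwrites it or the thread dies.
    object CurrentHubHolder {
        private val current = ThreadLocal<Any?>() // stands in for the per-thread hub
        fun set(hub: Any?) = current.set(hub)
        fun get(): Any? = current.get()
    }

    fun main() {
        val pool = Executors.newFixedThreadPool(1)
        // Simulate one request: the "hub" (here just a large array) is set but never cleared.
        pool.execute { CurrentHubHolder.set(ByteArray(10_000_000)) }
        // The 10 MB array now stays alive with the pool thread, even though no request
        // references it any more; it is only freed once another task overwrites it.
        pool.shutdown()
    }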

@LukasSmrcka
Author

Well ... there are no longer breadcrumbs in memory, so the leak is smaller in terms of the amount of memory. There seems to be some leak reduction, but I can't tell exactly: 28K StackItems after 17 hours and 145K requests. It looks better for sure and seems to progress more slowly, but that could just be coincidence, so nothing relevant for any conclusions.

Ok, I will try the fix when it is ready. Shouldn't ThreadLocals be avoided? Or are you using them somehow together with the Reactor context?

@adinauer
Member

adinauer commented Mar 2, 2023

We're heavily relying on ThreadLocal to offer the static Sentry.capture... API. For Spring Boot 3.0.3 we'll be able to leverage ThreadLocalAccessor for hub propagation, which works better than what we had before (also included in the next release).

@adinauer
Member

adinauer commented Mar 2, 2023

@LukasSmrcka 6.15.0 of the Java SDK has just been released. Can you please give it a try and let us know whether the memory issue has been improved?

@LukasSmrcka
Author

Sure. On it

@LukasSmrcka
Author

It is running now. We will see some results tomorrow

@LukasSmrcka
Author

@adinauer After 18 hours of running and 155K requests there is no Sentry leak. Good job, I think you have fixed it.

@adinauer
Member

adinauer commented Mar 3, 2023

That's great to hear. Apologies for the troubles.

@LukasSmrcka
Author

Thanks for fixing that. Later I will also give breadcrumbs a try; for now they are disabled.

@adinauer
Member

adinauer commented Mar 6, 2023

Played around with my idea for only referencing an ID instead of the whole hub but wasn't able to quickly make it work in all cases (e.g. ThreadLocalAccessor).

Closing this now as the issue seems resolved. Please feel free to reopen in case you still encounter problems.

@adinauer adinauer closed this as completed Mar 6, 2023
@LukasSmrcka
Author

@adinauer After half a million requests all is good. Now I have deployed a new version with breadcrumbs enabled. I will let you know if anything goes wrong.

@LukasSmrcka
Author

Looks like all is good even with breadcrumbs. No signs of a leak. Thank you once more.
