dagster-webserver memory leak #18997
I don't see any notable commits in the changes between 1.5.12 and 1.5.13. How exactly did you do the revert? Can you report the Python environments in the two containers?
We changed the helm chart version. We literally just reverted the Renovate bot commit.
Thanks for following up, not much interesting in the dependency changes. I spent some time looking into it. Do you have anything like automated recurring queries against the webserver?
Well, only the usual ones. Turns out we actually still observe the same behaviour after rolling back to 1.5.12.
I've had luck using this tool to get a memory profile of a running process: https://github.com/facebookarchive/memory-analyzer, and this one for interactive poking around at the active process: https://github.com/kmaork/madbg. I believe both need to be able to attach to the running process. Given it's a webserver, it's also susceptible to the "type 3" leaks described here, https://blog.nelhage.com/post/three-kinds-of-leaks/ (Python allocator arena fragmentation), but the very smooth gradient of your graphs makes me skeptical that's the cause without some sort of recurring large query causing the fragmentation.
@aaaaahaaaaa did you find any reason why the memory started growing? We have a similar issue, and switching between versions hasn't helped yet (we tried going from 1.5.14 back to 1.5.12). The memory increase is quite noticeable, showing up even at daily granularity. The issue seems to be isolated to the webserver component; both the daemon and code servers exhibit stable memory usage. We run these as three separate containers on AWS ECS. We have only one scheduled job active, no sensors, and no auto-materialization so far. Assets are loaded from dbt.
@jvyoralek No, I didn't find the source of the problem, and the issue is still occurring for us as well. Unfortunately I haven't had time to investigate further. I think there's clearly something up beyond our workload; we're not doing anything special either, aside from deploying the Helm chart.
@alangenfeld found a memory leak that could be the cause of this. I'll let him comment, but here is the PR that attempts to fix it: #19298
#19298 is a fix for a problem that manifests as very rapid unbounded memory growth resulting in process termination. I don't believe it's related to this slower memory growth.
@noam-jacobson what version were you upgrading from?
I was on version 1.5.10.
@noam-jacobson We're having the same issue on ECS/Fargate on 1.5.7.
We are also having the same issue on 1.6.0, also on ECS/Fargate.
Same here in our k8s deployment cluster. Any clue?
We think we might have solved it on our end: we didn't have a strict retention policy set in our dagster.yaml, and once we set one, our memory stopped growing.
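For context, this is roughly what tick retention settings look like in dagster.yaml. The numbers below are illustrative, not the exact values from that deployment; the reply that follows suggests the original settings retained ticks for up to 365 days.

```yaml
# dagster.yaml (instance config) -- illustrative tick retention settings,
# not the exact values from the comment above
retention:
  schedule:
    purge_after_days: 365        # drop schedule ticks older than a year
  sensor:
    purge_after_days:
      skipped: 7                 # per-status retention for sensor ticks
      failure: 90
      success: -1                # -1 keeps ticks indefinitely
```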
How did that impact your memory usage? Technically you'll still retain ticks for up to 365 days, thus you should not see a change in behavior in just a few days. Or did I miss something? I've applied a similar setting on my deployment as well (way stricter than yours, for testing) and my memory is still going up, same as before.
Same problem here on OpenShift with nearly the same packages (dagster 1.6.5), also PostgreSQL and slim-buster images on both the daemon and dagster-webserver (separate pods).
EDIT: We found out that the suggestion we originally posted here is actually not working. The initial indication might have just been a fluke.
Has anyone had success with the solution recommended by @stasharrofi? We have made the changes, but it appears that the memory usage is still increasing. I see anyio 4.3 in the logs.
@jvyoralek It hasn't worked for me. I deployed the newest Dagster version, 1.6.6, with anyio 4.3.0.
@jvyoralek: No, we found out that it's not working for us either. The initial indication that it was working was probably just a fluke.
My team experienced this issue in an OSS ECS deployment after an upgrade from 1.5.9 to 1.6.8. It impacted the dagit/webserver and daemon services, but not the independent grpc/code location services. It presented as a slow leak that increased memory utilization over a week or so until hitting critical thresholds and crashing the service, with 1 GB of memory allocated to each service.

We "resolved" the issue in our environments by downgrading and pinning the grpcio Python package to 1.57.0. In incremental tests we downgraded our Docker image base to the image version/SHA we used for our 1.5.9 deployment, reverted the dagster packages from 1.6.8 back to 1.5.9, and updated Python from 3.10 to 3.11. None of these changes resolved the memory leak.

Sharing this context because it supports the root cause being an unpinned package dependency rather than an issue with the core dagster packages. It also ruled out an interaction with OS libraries/OS version causing the leak. We selected grpcio 1.57.0 because it was the version that was solved for at the time we originally deployed 1.5.9. It's possible a more recent version would work as well.
Thank you, @jobicarter, for the effective workaround. We deployed it yesterday, and although it's only been a short time, we're already seeing promising changes with grpcio pinned to 1.57.0.
I can confirm that downgrading grpcio to 1.57.0 stops the leak.
We also tried upgrading it to 1.62.1, but that didn't seem to work.
Thanks for the solution. I think this upstream grpc issue could be related to the Dagster one: grpc/grpc#36117
We are running into the same issue on our Kubernetes cluster, having installed Dagster via the Helm chart. Is the solution to downgrade grpcio ourselves? I don't understand why Dagster hasn't pinned the grpcio version themselves to prevent this issue from happening; it seems a little strange that they expect users to either live with the memory leak or manually fix the dependencies themselves.
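For Helm deployments, a minimal sketch of how the pin could be applied, assuming you build custom webserver and daemon images on top of the official ones with grpcio==1.57.0 installed. The image names below are placeholders, and the value keys follow the chart's dagsterWebserver.image / dagsterDaemon.image structure; verify them against the values.yaml of your chart version.

```yaml
# values.yaml excerpt -- point the chart at hypothetical custom images that
# were built FROM the official dagster images with grpcio==1.57.0 pinned
dagsterWebserver:
  image:
    repository: my-registry/dagster-webserver-grpcio-pin   # placeholder
    tag: "1.6.8"
    pullPolicy: Always

dagsterDaemon:
  image:
    repository: my-registry/dagster-daemon-grpcio-pin      # placeholder
    tag: "1.6.8"
    pullPolicy: Always
```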
Just to add my 2 cents: we're running dagster 1.7.16/dbt/dagster-webserver all in one k8s pod. I admit it's somewhat inconclusive, since some memory increase (but also a kind of garbage collection releasing much of the extra memory at one point) was visible before the last restart while using grpcio 1.57.0. Still, overall it looks way better than with grpcio 1.60. It seems to be a workaround for now, but it has drawbacks beyond simply running an outdated component.
We started noticing memory leaks in certain code locations after upgrading to Dagster 1.8. Could grpcio potentially be contributing to these leaks? We're still investigating, but I'd like to rule out this possibility.
Dagster version
1.5.13
What's the issue?
dagster-webserver 1.5.13 seems to have some kind of memory leak. Since we updated to that version, we can observe a steady increase in memory usage over the last couple of weeks. Rolling back to 1.5.12 resolves the issue.
What did you expect to happen?
No response
How to reproduce?
No response
Deployment type
Dagster Helm chart
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.