PC2 on GPU may crash, leaking GPU RAM and the worker #1326
Comments
It sounds like tree-building is leaking GPU RAM. This was much worse before and was believed fixed. I think I know where to look and generally what the fix should be, if I can confirm. If you were to make a minimal issue in the
It only seems to happen if the prover crashes, though. I haven't seen a leak during normal operation.
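That crash-only pattern would fit device buffers being released manually rather than by a guard: any panic between allocation and cleanup strands them. A minimal, hypothetical sketch (not rust-fil-proofs code) of the RAII-style fix:

```rust
// Sketch of why a crash, but not a clean run, can leak device memory:
// if buffers are freed manually, a panic between allocation and release
// skips the cleanup. A Drop guard frees them during unwinding too.

struct DeviceBuffer {
    handle: usize, // stands in for a raw cl_mem / CUdeviceptr
}

impl DeviceBuffer {
    fn alloc(bytes: usize) -> DeviceBuffer {
        println!("allocated {} bytes on the GPU", bytes);
        DeviceBuffer { handle: bytes }
    }
}

// With Drop, the buffer is released even when a panic unwinds through
// tree building, so a prover crash no longer strands the allocation.
impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        println!("released buffer {}", self.handle);
    }
}

fn build_tree() {
    let _buf = DeviceBuffer::alloc(1 << 30);
    // If this panics, `_buf` is still dropped during unwinding.
    panic!("simulated prover crash");
}

fn main() {
    let _ = std::panic::catch_unwind(build_tree);
    println!("worker keeps running; buffer was reclaimed");
}
```

The caveat is that destructors only run if the panic actually unwinds; a panic that aborts, or that crosses the FFI boundary into the Go caller, skips them, which would leak exactly as described above.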
Can you estimate how often it happens? Is it regular?
Another data point: my PC2 just crashed with a bad file descriptor error (this seems to be a separate bug; if I restart my miner during PC2, it fails three times on PC2 and then reverts to PC1 with this error):
The memory leak can be observed in this case too. My node restarted PC2, but the 1.6GB of GPU RAM was not cleared up; instead, an additional 1.6GB was allocated.
Here's the relevant issue: #1327
@porcuquine How often? Three times in the last 12 hours.
Thanks, that's all helpful. |
When this happens on lotus-worker again, can you try killing it with SIGQUIT (which makes the Go runtime dump all goroutine stacks) to see whether the goroutine calling into Rust is stuck in the call?
No idea what's happening above; PC2 generally takes between 30 and 60 minutes.
8 hours later and my GPU is already running with leaked memory:
I.e. no PC2 or C2 has been running for hours now, but the GPU still has memory allocated, which will build up and choke future PC2s.
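To separate "a live job is holding memory" from "the memory is orphaned", something like the following can be polled between runs. This is a sketch assuming the nvml-wrapper crate; nvidia-smi shows the same information interactively.

```rust
// Monitoring sketch, assuming the nvml-wrapper crate (it reads the
// same counters nvidia-smi reports).
use nvml_wrapper::Nvml;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let nvml = Nvml::init()?;
    let device = nvml.device_by_index(0)?;

    // Leaked allocations show up as `used` staying high while no
    // prover job is running.
    let mem = device.memory_info()?;
    println!(
        "GPU memory: {} / {} MiB used",
        mem.used / (1024 * 1024),
        mem.total / (1024 * 1024)
    );

    // If `used` is high but no process is listed here, the memory is
    // held by a defunct context rather than a live PC2/C2 job.
    for p in device.running_compute_processes()? {
        println!("pid {} holds {:?}", p.pid, p.used_gpu_memory);
    }
    Ok(())
}
```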
Typically the sector logs look like this:
I'm not sure whether this would be expected to affect the version you are using, but we are aware of a GPU memory leak that should be fixed in the next release.
Tree building has been rewritten, and issues like this haven't come up in a while. I'm closing this now, but please report any new issues.
I've configured my node to run both C2 and PC2 on the GPU.
I've also configured my node to run multiple PC2s concurrently.
An annoying issue I've been hitting is that, from time to time, more jobs are scheduled on the GPU than it has RAM for. In the case of C2, if there's not enough available GPU RAM, it will print an error and continue on the CPU:
However, if the GPU RAM limit is hit on PC2, the process crashes:
The big issues here are:
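For illustration, the behavior being asked for here, PC2 degrading to the CPU on a device OOM the way C2 already does, might look like the following sketch. All names are hypothetical stand-ins, not actual rust-fil-proofs APIs.

```rust
// Hypothetical sketch of the requested fallback; `build_trees_gpu`,
// `build_trees_cpu`, and the types below are stand-ins.

#[derive(Debug)]
enum ProofError {
    GpuOutOfMemory,
}

struct Trees; // placeholder for the real PC2 output

fn build_trees_gpu(_data: &[u8]) -> Result<Trees, ProofError> {
    // Simulate the device allocation failing under concurrent jobs.
    Err(ProofError::GpuOutOfMemory)
}

fn build_trees_cpu(_data: &[u8]) -> Result<Trees, ProofError> {
    Ok(Trees)
}

// PC2 entry point with the degrade-to-CPU behavior C2 already has:
// a device OOM becomes a warning plus a CPU run instead of a crash.
fn build_trees(data: &[u8]) -> Result<Trees, ProofError> {
    match build_trees_gpu(data) {
        Err(ProofError::GpuOutOfMemory) => {
            eprintln!("GPU out of memory; falling back to CPU for PC2");
            build_trees_cpu(data)
        }
        other => other,
    }
}

fn main() {
    build_trees(&[0u8; 32]).expect("CPU fallback should succeed");
}
```

Whether such a fallback is acceptable for PC2 depends on how much slower CPU tree building is, which is presumably why it currently crashes rather than silently degrading.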