Process got killed when generating superpoints, probably a memory leak in the parallel_cut_pursuit package #76

Closed
MEIXuYan opened this issue Mar 1, 2024 · 8 comments

Comments

@MEIXuYan

MEIXuYan commented Mar 1, 2024

When I run the superpoint generation process using parallel_cut_pursuit, I notice that memory usage keeps going up while processing multiple files; when it reaches the maximum memory capacity, the Python program gets killed by the system (although I can restart the program and resume the superpoint generation).
Does parallel_cut_pursuit have a memory leak? How can I fix it?
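
A minimal sketch of how such per-file growth can be made visible (psutil is assumed here and is not confirmed to be used by the project; `generate_superpoints` and `files` are hypothetical placeholders for the actual preprocessing loop):

```python
# Sketch only: log the resident set size (RSS) after each processed file to
# see whether memory grows monotonically across files. `generate_superpoints`
# and `files` are placeholders; psutil is an assumed extra dependency.
import gc
import os

import psutil

proc = psutil.Process(os.getpid())

for i, path in enumerate(files):
    generate_superpoints(path)   # per-file superpoint computation (placeholder)
    gc.collect()                 # rule out lazily collected Python objects
    rss_mb = proc.memory_info().rss / 1024 ** 2
    print(f"[{i}] {path}: RSS = {rss_mb:.1f} MB")
```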

@drprojects
Owner

Hi, it seems your problem is cut-pursuit specific. I invite you to file an issue on the associated repo: https://gitlab.com/1a7r0ch3/parallel-cut-pursuit

@1a7r0ch3
Collaborator

1a7r0ch3 commented Mar 12, 2024 via email

@MEIXuYan
Author

MEIXuYan commented Mar 13, 2024

Hi!

Thank you for your reply!

As I mentioned in another issue, when I convert SSP into an end-to-end semantic segmentation network, the leak disappears. So I think the leak must lie in the SSP loss-computation code, whether in the C++ or the Python part.

I checked the newest updates of grid_graph and parallel-cut-pursuit; so the leak is caused by the interface between C++ and Python?

To check the effectiveness of the updates, I retrained SSP (the parallel-cut-pursuit version):
first, I only updated the grid_graph part, and the leak was indeed reduced a lot (very happy that I can now train SSP for more epochs in one run). Then I updated parallel-cut-pursuit (the cp_d0_dist function), but the leak did not shrink much, so I think the parallel-cut-pursuit part still has a RAM leak? (Note that for SSP, the point embedding features are computed by the network, not by the geof lib.)

When I retrained the official version of SSP, I found that the non-parallel version of cut-pursuit also had a leak; it was only after I changed the backend from cut-pursuit to parallel-cut-pursuit (with grid_graph) that RAM usage increased fast enough for me to notice this issue.

With the latest update of grid_graph, I think the leak may be down to the same level as the previous cut-pursuit code base.

I also tried garbage-collection calls like gc.collect() on the Python side, but the leak did not change.

Since this update concerns the API between Python and C++ rather than the C++ code itself, maybe the leak happens because the loss-computation procedure triggers some PyTorch bug, so the RAM used by the dataloader cannot be reclaimed? That would also explain why the leak cannot be found when debugging the code separately. Just a guess...
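
One way to narrow this down, sketched under the assumption that psutil and tracemalloc are available (neither is mentioned in the original report): tracemalloc only tracks allocations made through Python's allocator, so if the process RSS keeps climbing while the Python-tracked total stays flat, the growth is in native (C++) allocations or allocator caching rather than in Python objects such as dataloader batches.

```python
# Sketch only: compare Python-tracked allocations against the process RSS
# around one partition / loss-computation step.
import os
import tracemalloc

import psutil

tracemalloc.start()
proc = psutil.Process(os.getpid())

def report(tag):
    current, _peak = tracemalloc.get_traced_memory()  # bytes tracked by Python
    rss = proc.memory_info().rss                      # total process RSS, bytes
    print(f"{tag}: python-tracked = {current / 1024 ** 2:.1f} MB, "
          f"RSS = {rss / 1024 ** 2:.1f} MB")

report("before partition step")
# ... run one superpoint partition / loss-computation step here ...
report("after partition step")
```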

Thank you again for your effort on this issue; what you are doing really inspires me a lot!

@1a7r0ch3
Collaborator

1a7r0ch3 commented Mar 13, 2024 via email

@MEIXuYan
Author

Sorry, I forgot to include the training log. After I updated grid_graph and parallel-cut-pursuit, the log shows that RAM usage keeps going up as training progresses, but the leak has been reduced a lot.

...
...
...
epoch 1
0.0075494519360548495
 50%|███████████████████████████████▌                               | 31/62 [00:30<00:31,  1.02s/it]
current step memory usage (before collate) 1558.3125 MB
current step memory usage (before train data load) 1558.3125 MB
current step memory usage (before fstar) 1558.3125 MB
current step memory usage (after fstar) 1558.3125 MB
current step memory usage (before cutp) 1558.3125 MB
current step memory usage (after cutp) 1559.38671875 MB

...
...
...
epoch 19
 98%|█████████████████████████████████████████████████████████████▉ | 61/62 [03:21<00:03,  3.33s/it]
current step memory usage (before collate) 12610.2890625 MB
current step memory usage (before train data load) 12610.2890625 MB
current step memory usage (before fstar) 12610.2890625 MB
current step memory usage (after fstar) 12610.2890625 MB
current step memory usage (before cutp) 12610.2890625 MB
current step memory usage (after cutp) 12611.2265625 MB
...
...
...
epoch 20
 98%|█████████████████████████████████████████████████████████████▋ | 49/50 [02:18<00:02,  2.80s/it]
current step memory usage (before collate) 14615.55859375 MB
current step memory usage (before fstar) 14615.55859375 MB
current step memory usage (after fstar) 14615.55859375 MB
current step memory usage (before cutp) 14615.55859375 MB
current step memory usage (after cutp) 14615.66796875 MB
Saving model to model_epoch_20.pth.tar
...

@loicland
Collaborator

loicland commented Mar 13, 2024

Hi,

thanks for the detailed report. I wrote the "original" cp code, and I remember having investigated this memory leak back in 2019. If I remember correctly, it was not a true memory leak, but an issue in how often the memory allocation is refreshed. What you may see is the memory increasing linearly during training / preprocessing, and then suddenly dropping.

I have no idea if this can be fixed, but this should not kill your process. If it does, let us know!
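
If the growth really is freed-but-cached allocator memory rather than a true leak, one way to check (a sketch for Linux/glibc only, not part of this repo) is to force the allocator to return its free pages between files and see whether RSS drops:

```python
# Sketch only (Linux/glibc): malloc_trim(0) asks glibc to release free heap
# pages back to the OS. If RSS drops after this call, the growth was cached
# free memory; if it does not drop, a genuine leak is more likely.
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)
```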

@MEIXuYan
Author

> Some details for clarity. grid-graph is only a simple utility for manipulating the graph data structure (here, it is used to convert a graph adjacency list into the "forward-star", aka "sparse matrix row", representation). It DID have a leak (due to Python internal ref-count handling), which is now solved. parallel-cut-pursuit is a solver for the "minimal partition problem" on which the superpoint hierarchy here is based. It has significant memory needs that might double or triple the memory needed for the graph representation during a run.
>
> If you suspect that there is a memory leak, either in the internal C++ source code or, most probably, in the Python interface, please try to set up a minimal example (independent of PyTorch and the like) to put this into evidence, and I'll be able to fix it.
>
> By the way, more time and effort have been put into parallel-cut-pursuit, which should be more efficient than previous versions in terms of both memory and computational resources. Of course, possible leaks in the computation of the loss should be investigated separately.


Yes, I understand that I should set up a minimal example independent of PyTorch to track down the leak.

Since I just reimplemented the cp_d0_dist call from src/transform/partition.py of the SPT repo, and the only difference between SSP and SPT is that the SSP embeddings are computed by a network (normalized) while the SPT features are computed by geof, I think that if the leak disappears in SPT but stays in SSP, it may be caused by a PyTorch limitation when handling non-end-to-end training. (I had converted SSP to an end-to-end semantic segmentation network with a BCE loss and the leak disappeared; the rest of the SSP loss computation is purely Python/NumPy-based, and I delete every variable with Python's del.)
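
For reference, such a PyTorch-free reproducer could look roughly like the sketch below. The real cp_d0_dist arguments are deliberately not reproduced here (see the python/ examples in the parallel-cut-pursuit repo for the actual call), and psutil is an assumed extra dependency; the point is only the structure: one fixed synthetic input, repeated identical calls, and an RSS reading per iteration.

```python
# Sketch of a minimal, PyTorch-free leak check. Replace the comment inside the
# loop with a real cp_d0_dist(...) call; its arguments are intentionally not
# shown here because they are not confirmed in this thread.
import gc
import os

import numpy as np
import psutil

proc = psutil.Process(os.getpid())

# One fixed synthetic input built outside the loop, so every iteration
# processes exactly the same data.
features = np.random.rand(100_000, 6).astype(np.float32)

for it in range(200):
    # Call cp_d0_dist(...) here on `features` and a fixed graph, discarding
    # the result.
    gc.collect()
    rss_mb = proc.memory_info().rss / 1024 ** 2
    print(f"iteration {it}: RSS = {rss_mb:.1f} MB")

# If RSS grows roughly linearly with the iteration count on identical inputs,
# the leak sits in the binding or the solver, independently of PyTorch.
```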

@MEIXuYan
Author

MEIXuYan commented Mar 13, 2024

> Hi,
>
> thanks for the detailed report. I wrote the "original" cp code, and I remember having investigated this memory leak back in 2019. If I remember correctly, it was not a true memory leak, but an issue in how often the memory allocation is refreshed. What you may see is the memory increasing linearly during training / preprocessing, and then suddenly dropping.
>
> I have no idea if this can be fixed, but this should not kill your process. If it does, let us know!

Thank you for your reply, professor.

Yes, I have only seen the process killed after I replaced cutp with parallel_cutp, and that turned out to be a true leak caused by grid_graph.

When I trained SSP today with the latest version of grid_graph, I only saw memory usage increase from 1 GB to 14 GB; since I set max_epoch=20, training stopped before the RAM ran out and the system killed the process. (My server has 64 GB of RAM, so maybe the usage was not high enough to trigger the sudden memory drop you describe?)

I will test the code with a bigger max_epoch and a larger dataset. Thank you again for your reply.
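
One optional safeguard for those longer runs (a Linux-only sketch, not something from this thread): capping the address space so that a runaway allocation raises MemoryError inside Python instead of the OOM killer terminating the process.

```python
# Sketch only (Linux): limit the virtual address space of this process. The
# 48 GB value is arbitrary; pick something below the 64 GB of physical RAM
# mentioned above so the run fails with MemoryError before the OOM killer acts.
import resource

limit_bytes = 48 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```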
