Process got killed when generating superpoints, probably a memory leak in the parallel_cut_pursuit package #76

Closed
MEIXuYan opened this issue Mar 1, 2024 · 8 comments

Comments

@MEIXuYan

MEIXuYan commented Mar 1, 2024

When I run the superpoint generation process using parallel_cut_pursuit, I notice that memory usage keeps going up while processing multiple files; when it reaches the maximum memory capacity, the Python program gets killed by the system (although I can restart the program and resume the superpoint generation).
Does parallel_cut_pursuit have a memory leak? How can I fix it?
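
A minimal sketch of how such per-file growth can be made visible (psutil is assumed here and is not confirmed to be used by the project; `generate_superpoints` and `files` are hypothetical placeholders for the actual preprocessing loop):

```python
# Sketch only: log the resident set size (RSS) after each processed file to
# see whether memory grows monotonically across files. `generate_superpoints`
# and `files` are placeholders; psutil is an assumed extra dependency.
import gc
import os

import psutil

proc = psutil.Process(os.getpid())

for i, path in enumerate(files):
    generate_superpoints(path)   # per-file superpoint computation (placeholder)
    gc.collect()                 # rule out lazily collected Python objects
    rss_mb = proc.memory_info().rss / 1024 ** 2
    print(f"[{i}] {path}: RSS = {rss_mb:.1f} MB")
```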

@drprojects
Owner

Hi, it seems your problem is cut-pursuit specific. I invite you to file an issue on the associated repo: https://gitlab.com/1a7r0ch3/parallel-cut-pursuit

@1a7r0ch3
Collaborator

1a7r0ch3 commented Mar 12, 2024 via email

@MEIXuYan
Author

MEIXuYan commented Mar 13, 2024

Hi!

Thank you for your reply!

As I mentioned in another issue, when I convert SSP into an end-to-end semantic segmentation network, the leak disappears. So I think the leak must lie in the SSP loss-computation code, whether in the C++ or the Python part.

I checked the newest updates of grid_graph and parallel-cut-pursuit; so the leak is caused by the interface between C++ and Python?

To check the effectiveness of the updates, I retrained SSP (the parallel-cut-pursuit version):
first, I only updated the grid_graph part, and the leak was indeed reduced a lot (very happy that I can now train SSP for more epochs in one run). Then I updated parallel-cut-pursuit (the cp_d0_dist function), but the leak did not shrink much, so I think the parallel-cut-pursuit part still has a RAM leak? (Note that for SSP, the point embedding features are computed by the network, not by the geof lib.)

When I retrained the official version of SSP, I found that the non-parallel version of cut-pursuit also had a leak; it was only after I changed the backend from cut-pursuit to parallel-cut-pursuit (with grid_graph) that RAM usage increased fast enough for me to notice this issue.

With the latest update of grid_graph, I think the leak may be down to the same level as the previous cut-pursuit code base.

I also tried garbage-collection calls like gc.collect() on the Python side, but the leak did not change.

Since this update concerns the API between Python and C++ rather than the C++ code itself, maybe the leak happens because the loss-computation procedure triggers some PyTorch bug, so the RAM used by the dataloader cannot be reclaimed? That would also explain why the leak cannot be found when debugging the code separately. Just a guess...
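
One way to narrow this down, sketched under the assumption that psutil and tracemalloc are available (neither is mentioned in the original report): tracemalloc only tracks allocations made through Python's allocator, so if the process RSS keeps climbing while the Python-tracked total stays flat, the growth is in native (C++) allocations or allocator caching rather than in Python objects such as dataloader batches.

```python
# Sketch only: compare Python-tracked allocations against the process RSS
# around one partition / loss-computation step.
import os
import tracemalloc

import psutil

tracemalloc.start()
proc = psutil.Process(os.getpid())

def report(tag):
    current, _peak = tracemalloc.get_traced_memory()  # bytes tracked by Python
    rss = proc.memory_info().rss                      # total process RSS, bytes
    print(f"{tag}: python-tracked = {current / 1024 ** 2:.1f} MB, "
          f"RSS = {rss / 1024 ** 2:.1f} MB")

report("before partition step")
# ... run one superpoint partition / loss-computation step here ...
report("after partition step")
```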

Thank you again for your effort on this issue; what you are doing really inspires me a lot!

@1a7r0ch3
Collaborator

1a7r0ch3 commented Mar 13, 2024 via email

@MEIXuYan
Author

Sorry, I forgot to include the training log. After I updated grid_graph and parallel-cut-pursuit, the log shows that RAM usage keeps going up as training progresses, but the leak has been reduced a lot.

...
...
...
epoch 1
0.0075494519360548495
 50%|███████████████████████████████▌                               | 31/62 [00:30<00:31,  1.02s/it]
current step memory usage (before collate) 1558.3125 MB
current step memory usage (before train data load) 1558.3125 MB
current step memory usage (before fstar) 1558.3125 MB
current step memory usage (after fstar) 1558.3125 MB
current step memory usage (before cutp) 1558.3125 MB
current step memory usage (after cutp) 1559.38671875 MB

...
...
...
epoch 19
 98%|█████████████████████████████████████████████████████████████▉ | 61/62 [03:21<00:03,  3.33s/it]
current step memory usage (before collate) 12610.2890625 MB
current step memory usage (before train data load) 12610.2890625 MB
current step memory usage (before fstar) 12610.2890625 MB
current step memory usage (after fstar) 12610.2890625 MB
current step memory usage (before cutp) 12610.2890625 MB
current step memory usage (after cutp) 12611.2265625 MB
...
...
...
epoch 20
 98%|█████████████████████████████████████████████████████████████▋ | 49/50 [02:18<00:02,  2.80s/it]
current step memory usage (before collate) 14615.55859375 MB
current step memory usage (before fstar) 14615.55859375 MB
current step memory usage (after fstar) 14615.55859375 MB
current step memory usage (before cutp) 14615.55859375 MB
current step memory usage (after cutp) 14615.66796875 MB
Saving model to model_epoch_20.pth.tar
...

@loicland
Collaborator

loicland commented Mar 13, 2024

Hi,

thanks for the detailed report. I wrote the "original" cp code, and I remember having investigated this memory leak back in 2019. If I remember correctly, it was not a true memory leak, but an issue in how often the memory allocation is refreshed. What you may see is the memory increasing linearly during training / preprocessing, and then suddenly dropping.

I have no idea if this can be fixed, but this should not kill your process. If it does, let us know!
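
If the growth really is freed-but-cached allocator memory rather than a true leak, one way to check (a sketch for Linux/glibc only, not part of this repo) is to force the allocator to return its free pages between files and see whether RSS drops:

```python
# Sketch only (Linux/glibc): malloc_trim(0) asks glibc to release free heap
# pages back to the OS. If RSS drops after this call, the growth was cached
# free memory; if it does not drop, a genuine leak is more likely.
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)
```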

@MEIXuYan
Author

> Some details for clarity. grid-graph is only a simple utility for manipulating the graph data structure (here, it is used to convert a graph adjacency list into the "forward-star", aka "sparse matrix row", representation). It DID have a leak (due to Python internal ref-count handling), which is now solved. parallel-cut-pursuit is a solver for the "minimal partition problem" on which the superpoint hierarchy here is based. It has significant memory needs that might double or triple the memory needed for the graph representation during a run.
>
> If you suspect that there is a memory leak, either in the internal C++ source code or, most probably, in the Python interface, please try to set up a minimal example (independent of PyTorch and the like) to put this into evidence, and I'll be able to fix it.
>
> By the way, more time and effort have been put into parallel-cut-pursuit, which should be more efficient than previous versions in terms of both memory and computational resources. Of course, possible leaks in the computation of the loss should be investigated separately.


Yes, I understand that I should set up a minimal example independent of PyTorch to track down the leak.

Since I just reimplemented the cp_d0_dist call from src/transform/partition.py of the SPT repo, and the only difference between SSP and SPT is that the SSP embeddings are computed by a network (normalized) while the SPT features are computed by geof, I think that if the leak disappears in SPT but stays in SSP, it may be caused by a PyTorch limitation when handling non-end-to-end training. (I had converted SSP to an end-to-end semantic segmentation network with a BCE loss and the leak disappeared; the rest of the SSP loss computation is purely Python/NumPy-based, and I delete every variable with Python's del.)
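
For reference, such a PyTorch-free reproducer could look roughly like the sketch below. The real cp_d0_dist arguments are deliberately not reproduced here (see the python/ examples in the parallel-cut-pursuit repo for the actual call), and psutil is an assumed extra dependency; the point is only the structure: one fixed synthetic input, repeated identical calls, and an RSS reading per iteration.

```python
# Sketch of a minimal, PyTorch-free leak check. Replace the comment inside the
# loop with a real cp_d0_dist(...) call; its arguments are intentionally not
# shown here because they are not confirmed in this thread.
import gc
import os

import numpy as np
import psutil

proc = psutil.Process(os.getpid())

# One fixed synthetic input built outside the loop, so every iteration
# processes exactly the same data.
features = np.random.rand(100_000, 6).astype(np.float32)

for it in range(200):
    # Call cp_d0_dist(...) here on `features` and a fixed graph, discarding
    # the result.
    gc.collect()
    rss_mb = proc.memory_info().rss / 1024 ** 2
    print(f"iteration {it}: RSS = {rss_mb:.1f} MB")

# If RSS grows roughly linearly with the iteration count on identical inputs,
# the leak sits in the binding or the solver, independently of PyTorch.
```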

@MEIXuYan
Author

MEIXuYan commented Mar 13, 2024

> Hi,
>
> thanks for the detailed report. I wrote the "original" cp code, and I remember having investigated this memory leak back in 2019. If I remember correctly, it was not a true memory leak, but an issue in how often the memory allocation is refreshed. What you may see is the memory increasing linearly during training / preprocessing, and then suddenly dropping.
>
> I have no idea if this can be fixed, but this should not kill your process. If it does, let us know!

Thank you for your reply, professor.

Yes, I have only seen the process killed after I replaced cutp with parallel_cutp, and that turned out to be a true leak caused by grid_graph.

When I trained SSP today with the latest version of grid_graph, I only saw memory usage increase from 1 GB to 14 GB; since I set max_epoch=20, training stopped before the RAM ran out and the system killed the process. (My server has 64 GB of RAM, so maybe the usage was not high enough to trigger the sudden memory drop you describe?)

I will test the code with a bigger max_epoch and a larger dataset. Thank you again for your reply.
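
One optional safeguard for those longer runs (a Linux-only sketch, not something from this thread): capping the address space so that a runaway allocation raises MemoryError inside Python instead of the OOM killer terminating the process.

```python
# Sketch only (Linux): limit the virtual address space of this process. The
# 48 GB value is arbitrary; pick something below the 64 GB of physical RAM
# mentioned above so the run fails with MemoryError before the OOM killer acts.
import resource

limit_bytes = 48 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```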
