Process gets killed when generating superpoints, probably a memory leak in the parallel_cut_pursuit package #76
Comments
Hi, it seems your problem is cut-pursuit-specific. I invite you to file an issue on the associated repo: https://gitlab.com/1a7r0ch3/parallel-cut-pursuit
Hello,
Thanks for the feedback. I corrected the memory leak in `edge_list_to_forward_star` (it was due to mismanagement of reference counts, an all-too-frequent issue in Python extensions). I invite you to update the grid-graph repo (which is actually independent of parallel-cut-pursuit) and recompile the C extension module.
08/03/2024 07:56, Pierrick Bournez:
Hello,
I've been investigating it a bit and I think there are two memory leaks:
- a huge one in partition_py, with `edge_list_to_forward_star` (the script just below proves it);
- a small one with pgeof.
Here is the script that demonstrates the leak:
```python
import time
import tracemalloc

import numpy as np
from grid_graph import edge_list_to_forward_star

tracemalloc.start()
snapshot = tracemalloc.take_snapshot()
marqueur = 0

num_iter = int(1e5)
start = time.time()
for k in range(num_iter):
    V = int(5e3)
    edges = np.random.randint(0, V, (num_iter, 2))
    # the C extension expects a C-contiguous int32 array
    edges = np.ascontiguousarray(edges, dtype=np.int32)
    first_edge, adj_vertices, reindex = edge_list_to_forward_star(V, edges)
    marqueur += 1
    if marqueur > 1e2:
        # every 100 iterations, compare memory against the initial snapshot
        marqueur = 0
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot, 'lineno')
        print("TOP 10 differences")
        for stat in top_stats[:10]:
            print(stat)
        current, peak = tracemalloc.get_traced_memory()
        print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")

# final summary (assumes num_iter > 100, so snapshot2 exists)
top_stats = snapshot2.statistics('traceback')
stat = top_stats[0]
print("%s memory blocks: %.1f in MiB:" % (stat.count, stat.size / 1024**2))
for line in stat.traceback.format():
    print(line)
```
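Complementing the script above: since the fix mentioned earlier concerns reference mismanagement, a cheap pure-Python probe for that class of bug is `sys.getrefcount` on the input array. This is a sketch, assuming `grid_graph` is installed; it only catches leaked references to the inputs, not internally allocated blocks:

```python
import sys

import numpy as np
from grid_graph import edge_list_to_forward_star

V = int(5e3)
edges = np.ascontiguousarray(np.random.randint(0, V, (1000, 2)), dtype=np.int32)

before = sys.getrefcount(edges)
for _ in range(100):
    # outputs are discarded on purpose; only the input's refcount is probed
    edge_list_to_forward_star(V, edges)
after = sys.getrefcount(edges)
# with a missing Py_DECREF on the input, `after` grows by one per call
print(f"refcount before: {before}, after 100 calls: {after}")
```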
Hi! Thank you for your reply!
As I mentioned in [another issue](1a7r0ch3/parallel-cut-pursuit#10), when I transfer the SSP to semantic segmentation networks, the leakage disappears. So I think the leakage must be in the [loss computing](https://github.com/loicland/superpoint_graph/blob/ssp%2Bspg/supervized_partition/losses.py) code of SSP, whether in the cpp or the python part. I checked the newest updates of [grid_graph](https://gitlab.com/1a7r0ch3/grid-graph) and [parallel-cut-pursuit](https://gitlab.com/1a7r0ch3/parallel-cut-pursuit); so is the leakage caused by the interface between cpp and python?
To check the effectiveness of the update, I retrained the SSP (parallel-cut-pursuit version): first, I updated only the grid_graph part, and the leakage was really reduced a lot (very happy that I can now train SSP for more epochs in one round). Then I updated parallel-cut-pursuit (the `cp_d0_dist` function), but the leakage did not reduce much, so I think the parallel-cut-pursuit part still has a RAM leakage issue?
When I retrained the official version of SSP, I found that the non-parallel version of [cut-pursuit](https://github.com/loicland/cut-pursuit) also had a leakage issue; when I changed the backend from cut-pursuit to parallel-cut-pursuit (with grid_graph), the RAM usage increased much faster, which is how I found this issue. With the latest update of grid_graph, I think the leakage may be down to the same level as the previous cut-pursuit code base. (Note that for SSP, the point embedding feature is computed by the network, not by the `geof` lib.)
I also tried some RAM collection functions like `gc.collect` on the python side, but the leakage issue did not change. Since this update concerns the API between python and cpp, not the cpp code itself, maybe the leakage is because the loss computing procedure triggers some bug in `PyTorch`, so the RAM used by the `dataloader` cannot be recycled, and we cannot find the leak when debugging the pieces separately? Just a guess...
Thank you again for your effort on this issue; what you guys do really inspires me a lot!
Some details for clarity.
`grid-graph` is only a small utility for manipulating the graph data structure (here, it is used to convert a graph adjacency list into the "forward-star", aka "sparse matrix row", representation). It DID have a leak (due to mishandling of Python internal reference counts), which is now fixed.
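To make that representation concrete, here is a hypothetical pure-NumPy rendering of the conversion, for illustration only; the real `edge_list_to_forward_star` is a compiled C extension and its exact output conventions may differ:

```python
import numpy as np

def edge_list_to_forward_star_py(V, edges):
    """Pure-NumPy sketch of the forward-star (CSR-like) conversion."""
    order = np.argsort(edges[:, 0], kind="stable")  # group edges by source
    sources = edges[order, 0]
    adj_vertices = edges[order, 1]                  # edge targets, grouped by source
    # first_edge[v] .. first_edge[v + 1] indexes the edges leaving vertex v
    first_edge = np.searchsorted(sources, np.arange(V + 1))
    return first_edge, adj_vertices, order

edges = np.array([[0, 1], [2, 0], [0, 2]], dtype=np.int32)
first_edge, adj_vertices, reindex = edge_list_to_forward_star_py(3, edges)
print(first_edge)    # [0 2 2 3]: vertex 0 has edges 0..1, vertex 1 none, vertex 2 edge 2
print(adj_vertices)  # [1 2 0]
```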
`parallel-cut-pursuit` is a solver for the "minimal partition problem", on which the superpoint hierarchy here is based. It has significant memory needs, which might double or triple the memory required for the graph representation while it runs. If you suspect a memory leak, either in the internal cpp source code or, most probably, in the python interface, please try to set up a minimal example (independent of PyTorch and the like) that demonstrates it, and I'll be able to fix it (a generic harness for this is sketched below).
By the way, more time and effort have been put into `parallel-cut-pursuit`, which should be more efficient than previous versions in terms of both memory and computational resources.
Of course, possible leakage in the computation of the loss should be investigated separately.
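For reference, such a minimal example could be wrapped in a generic leak-check harness like the following sketch; `psutil` and the iteration counts are assumptions, and the callable would be a fixed-input invocation of the suspected routine (e.g. `cp_d0_dist`):

```python
import gc
import os

import psutil  # assumption: psutil is available to measure resident memory

def rss_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 2**20

def check_leak(fn, n_iter=200, report_every=20):
    """Run `fn` repeatedly and watch resident memory.

    RSS that keeps growing after gc.collect() across many iterations hints
    at a leak on the C side, since Python-level garbage would be reclaimed.
    """
    for i in range(n_iter):
        fn()
        if i % report_every == 0:
            gc.collect()
            print(f"iter {i:4d}: RSS = {rss_mb():.1f} MB")

# usage sketch: check_leak(lambda: cp_d0_dist(<fixed inputs>))
```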
Sorry, I forgot to include the training log. After I updated the ...
```
...
epoch 1
0.0075494519360548495
50%|███████████████████████████████▌ | 31/62 [00:30<00:31, 1.02s/it]
current step memory usage (before collate) 1558.3125 MB
current step memory usage (before train data load) 1558.3125 MB
current step memory usage (before fstar) 1558.3125 MB
current step memory usage (after fstar) 1558.3125 MB
current step memory usage (before cutp) 1558.3125 MB
current step memory usage (after cutp) 1559.38671875 MB
...
epoch 19
98%|█████████████████████████████████████████████████████████████▉ | 61/62 [03:21<00:03, 3.33s/it]
current step memory usage (before collate) 12610.2890625 MB
current step memory usage (before train data load) 12610.2890625 MB
current step memory usage (before fstar) 12610.2890625 MB
current step memory usage (after fstar) 12610.2890625 MB
current step memory usage (before cutp) 12610.2890625 MB
current step memory usage (after cutp) 12611.2265625 MB
...
epoch 20
98%|█████████████████████████████████████████████████████████████▋ | 49/50 [02:18<00:02, 2.80s/it]
current step memory usage (before collate) 14615.55859375 MB
current step memory usage (before fstar) 14615.55859375 MB
current step memory usage (after fstar) 14615.55859375 MB
current step memory usage (before cutp) 14615.55859375 MB
current step memory usage (after cutp) 14615.66796875 MB
Saving model to model_epoch_20.pth.tar
...
```
Hi, thanks for the detailed report. I wrote the "original" cp code, and I remember investigating this memory leak back in 2019. If I remember correctly, it was not a true memory leak, but an issue with how often the memory allocation is refreshed. What you may see is the memory increasing linearly during training/preprocessing, and then suddenly dropping. I have no idea whether this can be fixed, but it should not kill your process. If it does, let us know!
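A quick way to test that allocator-refresh hypothesis (a sketch, not from this thread; Linux/glibc only, `psutil` assumed available): force glibc to return freed arenas to the OS and check whether resident memory drops. If it does, the growth was allocator caching rather than a true leak.

```python
import ctypes
import os

import psutil  # assumption: psutil is available to read resident memory

def rss_mb():
    return psutil.Process(os.getpid()).memory_info().rss / 2**20

print(f"RSS before trim: {rss_mb():.1f} MB")
# glibc-specific: release freed heap pages back to the OS
ctypes.CDLL("libc.so.6").malloc_trim(0)
print(f"RSS after trim:  {rss_mb():.1f} MB")
```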
Yes, I understand that I should set up a minimal example, independent of PyTorch, to tackle the leak. Since I just implemented the code of ...
Thank you for your reply, professor. Yes, I have only seen an ... As I trained the SSP today with the latest version of ..., I will test the code with bigger ...
When I run the superpoint generation process using parallel_cut_pursuit, I notice that the memory usage keeps going up while processing multiple files; when it hits the maximum memory capacity, the python program gets killed by the system (although I can restart the program and resume the superpoint generation process).
Does parallel_cut_pursuit have a memory leak issue? How can I fix it?
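For what it's worth, a generic mitigation for leaky native extensions during batch preprocessing (not specific to this code base; `process_one_file` and the file list are hypothetical placeholders) is to isolate each file in a short-lived worker process, so the OS reclaims all memory when the worker is recycled:

```python
from multiprocessing import get_context

def process_one_file(path):
    # placeholder: load the point cloud, build the graph,
    # run cut-pursuit, and save the resulting superpoints
    ...

if __name__ == "__main__":
    files = ["scan_000.ply", "scan_001.ply"]  # hypothetical file list
    ctx = get_context("spawn")
    # maxtasksperchild=1 restarts the worker after every file,
    # so leaked memory never accumulates across files
    with ctx.Pool(processes=4, maxtasksperchild=1) as pool:
        pool.map(process_one_file, files)
```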