PooledMemory bugs? #317
Comments
I will look into this later. |
Can you paste code to reproduce this? |
Sorry, the code causing the problem is part of my research, so I can't paste it. |
On cython,
and run your program. |
Hmm, I cannot reproduce.
import cupy
cupy.cuda.set_allocator(cupy.cuda.MemoryPool().malloc)
x = cupy.array([1,2,3,4], dtype=cupy.float64)

import cupy
cupy.cuda.set_allocator(cupy.cuda.MemoryPool().malloc)
x = cupy.array([1,2,3,4], dtype=cupy.float64)
y = cupy.array([1,2,3,4], dtype=cupy.float64)
z = float(cupy.sum((x - y)**2, dtype=cupy.float64))
|
Let me take a note. Line 471 in 5062f61
Memory.__dealloc__ is called)
|
Thank you for your rapid replies. I will try to create PoC code from my framework and do I realized that:
|
The traceback line
tells the Line 358 in 5062f61
347 def __dealloc__(self):
348     if self.ptr != 0:
349         self.free()
350
351 cpdef free(self):
352     """Frees the memory buffer and returns it to the memory pool.
353
354     This function actually does not free the buffer. It just returns the
355     buffer to the memory pool for reuse.
356
357     """
358     pool = self.pool()
359     if pool and self.ptr != 0:
360         pool.free(self.ptr, self.size)
361     self.ptr = 0
362     self.size = 0
363     self.device = None
it looks the This looks like a potential bug which also exists in cupy 1.0.1. |
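For readers unfamiliar with the pattern in the listing above, here is a minimal pure-Python sketch of a buffer handle that holds only a weak reference to its pool. `ToyPool` and `ToyPooledMemory` are hypothetical names invented for illustration; this is not CuPy's implementation, and it uses plain integers instead of GPU pointers.

```python
import weakref


class ToyPool:
    """Hypothetical stand-in for a memory pool; tracks returned buffers by size."""

    def __init__(self):
        self.free_list = {}  # size -> list of pointers available for reuse

    def free(self, ptr, size):
        self.free_list.setdefault(size, []).append(ptr)


class ToyPooledMemory:
    """Hypothetical buffer handle that returns its pointer to the pool when freed."""

    def __init__(self, pool, ptr, size):
        self.pool = weakref.ref(pool)  # weak reference: does not keep the pool alive
        self.ptr = ptr
        self.size = size

    def free(self):
        pool = self.pool()  # None if the pool has already been garbage collected
        if pool is not None and self.ptr != 0:
            pool.free(self.ptr, self.size)
        # Reset the fields so that a second free() (or __del__) does nothing.
        self.ptr = 0
        self.size = 0

    def __del__(self):
        if self.ptr != 0:
            self.free()


pool = ToyPool()
mem = ToyPooledMemory(pool, ptr=4096, size=512)
mem.free()
mem.free()  # no-op: ptr was reset to 0 by the first call
returned = pool.free_list[512]

# The temporary ToyPool() is only weakly referenced, so it may already be gone.
orphan = ToyPooledMemory(ToyPool(), ptr=8192, size=512)
orphan.free()  # must not raise even if the pool no longer exists
```

The point of the sketch is that `free` has to tolerate two situations: being called more than once, and being called after the pool itself has disappeared.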
BTW: When I put
I got following error messages which are similar to issued error messages
Now, I am wondering how to fix this problem because cython's cdef classes do not have |
@hiro4bbh I wrote an experimental patch master...sonots:fix_317. This patch is to use Build as:
|
Thank you for your patch. I applied the patch as you described, and then got the following error message many times:
The number of times this error occurs changes from run to run... Are some free lists corrupted by some chunk operations? I think there are no multithreaded operations... I will inspect the details while preparing PoC code. |
Thank you for trying. Hmm, I will investigate.
|
It looks |
When I implement Adam with Maybe, this error is caused by the free-list manipulation performed during memory allocation and deallocation, so it would be difficult to write stable PoC code succinctly (some parts of my framework may have an effect). I couldn't create PoC code yet, but I will keep trying and inspect the implementation. |
As trying to create stable PoC code, I realized that In some cases, when |
Let me make sure. Do you mean you still get |
Yes. I got |
Okay, thanks. |
@hiro4bbh could you do me a favor? I added debug print > sonots@6a6732a (this commit is pushed in Could you run your program with this and paste the result? Please note that the output may be huge, so pasting it in a separate gist would be better. If the log is too large to paste, it is okay to filter it down to only the "malloc" and "free" lines. |
Thanks for your patch. |
I tried @sonots patch on a CUDA machine. I got the exceptions (fix_317_failed_stdout.txt) and I think that we can ignore |
Thanks! But, it seems the last line of fix_317_failed_malloc_free.txt is broken like:
Were you able to paste the entire log up to the last line, where the error occurred? |
One more thing. I changed the log line of malloc as:
Could you pull |
Thank you for your email!! |
With logs you've sent via email, I could not see
Did you get the |
This is just my progress. I tried to reproduce by generating Python code like the below from the logs:
import re
print('import cupy')
print('pool = cupy.cuda.MemoryPool()')
for line in open('fix_317_failed_all.txt', 'r'):
    # fix_317 malloc(size=512) ptr=81719733760 PooledMemory=<cupy.cuda.memory.PooledMemory object at 0x0000020435305BE0>
    if line.startswith('fix_317 malloc'):
        line = line.replace('fix_317 malloc(', '')
        line = line.replace(')', '')
        line = re.sub(r'PooledMemory object.*$', '', line)
        items = line.split(' ')
        d = {}
        for item in items:
            k, v = item.split('=')
            d[k] = v
        print('m{} = pool.malloc({})'.format(d['ptr'], d['size']))
    # fix_317 free(ptr=81723916288, size=24064)
    elif line.startswith('fix_317 free'):
        line = line.replace('fix_317 free(', '')
        line = line.replace(')', '')
        items = line.split(', ')
        d = {}
        for item in items:
            k, v = item.split('=')
            d[k] = v
        print('del m{}'.format(d['ptr']))
The generated code is:
But I still cannot reproduce it. |
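One way to inspect such a log without replaying it on a GPU is to scan it for frees of pointers that are not currently allocated. The sketch below assumes the two log line formats quoted in the comments of the script above; `find_suspicious_frees` is a hypothetical helper written for this note, not part of CuPy or of the fix_317 branch.

```python
import re

# Patterns for the two log line shapes quoted in the thread (an assumption
# about the format; adjust them if the real log differs).
MALLOC_RE = re.compile(r'^fix_317 malloc\(size=(\d+)\) ptr=(\d+)')
FREE_RE = re.compile(r'^fix_317 free\(ptr=(\d+), size=(\d+)\)')


def find_suspicious_frees(lines):
    """Return (line_number, ptr) for every free of a pointer that is not live."""
    live = set()
    suspicious = []
    for lineno, line in enumerate(lines, start=1):
        m = MALLOC_RE.match(line)
        if m:
            live.add(int(m.group(2)))
            continue
        m = FREE_RE.match(line)
        if m:
            ptr = int(m.group(1))
            if ptr in live:
                live.remove(ptr)
            else:
                suspicious.append((lineno, ptr))
    return suspicious


sample = [
    'fix_317 malloc(size=512) ptr=81719733760 PooledMemory=<...>',
    'fix_317 free(ptr=81719733760, size=512)',
    'fix_317 free(ptr=81719733760, size=512)',  # same pointer freed again
]
result = find_suspicious_frees(sample)
```

Only pointer liveness is checked here. Sizes are ignored on purpose, because the size logged at malloc time and the size logged at free time may not be directly comparable.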
Sorry, I think I extracted the log from a run of the non-problematic code. I will extract the log from the problematic one. Please wait a moment... |
Thank you for new logs. |
Hmm, unfortunately, I could not reproduce it from the replay. I will investigate more. |
This log tells I tried to reproduce this on my environment, but I still can not reproduce. |
@hiro4bbh could you tell me the Python and Cython versions you used? |
Do you actually run in multiple threads? I found weird logs like the ones below:
where
are consecutive logs. |
I added code to print thread IDs on the fix_317 branch. |
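A minimal sketch of this kind of instrumentation, assuming a plain `print`-based debug log: `log_event` is a hypothetical helper, and the message format only imitates the log lines quoted earlier in this thread. It is not the code on the fix_317 branch.

```python
import threading


def log_event(message):
    """Print and return a debug line prefixed with the calling thread's ID."""
    line = 'fix_317 thread_id={} {}'.format(threading.get_ident(), message)
    print(line)
    return line


# Logged from the main thread.
main_line = log_event('malloc(size=512)')

# Logged from a second thread; its ID differs from the main thread's.
worker_lines = []
worker = threading.Thread(
    target=lambda: worker_lines.append(log_event('free(size=512)')))
worker.start()
worker.join()
```

With the thread ID on every line, consecutive log entries that come from different threads become easy to spot.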
Also, I got logs from hiro4bbh-san. It seems only one thread was running when the error occurred. The latter part of the logs shows a different thread in use, but that is probably not related to the error. |
I tried with the same Python and Cython versions, but I could not reproduce it. I now suspect the Windows environment, but I do not have a Windows environment... |
I am not sure whether this helps, but I made a thread-safe implementation at master...sonots:fix_317. Can you try this? |
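To illustrate what thread safety means for a pool's free list, here is a minimal pure-Python sketch that guards every free-list update with a lock. `LockedToyPool` is a hypothetical class written for this note; it is not the code in the fix_317 branch, and it hands out fake integer pointers rather than GPU memory.

```python
import threading


class LockedToyPool:
    """Hypothetical pool whose free list is only touched while holding a lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self._free = {}        # size -> list of reusable fake pointers
        self._next_ptr = 512   # counter used to invent new fake pointers
        self.in_use = set()

    def malloc(self, size):
        with self._lock:
            bucket = self._free.get(size)
            if bucket:
                ptr = bucket.pop()       # reuse a previously freed pointer
            else:
                ptr = self._next_ptr     # otherwise invent a new one
                self._next_ptr += size
            self.in_use.add(ptr)
            return ptr

    def free(self, ptr, size):
        with self._lock:
            if ptr not in self.in_use:
                raise RuntimeError('free of a pointer that is not in use')
            self.in_use.remove(ptr)
            self._free.setdefault(size, []).append(ptr)


pool = LockedToyPool()


def worker():
    for _ in range(1000):
        p = pool.malloc(512)
        pool.free(p, 512)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
leftover = len(pool.in_use)
```

Because the check, the removal from `in_use`, and the append to the free list all happen under the same lock, no two threads can ever be handed the same pointer at the same time.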
I tried several cases to reproduce the bug with your previous patch, but I couldn't reproduce it. I will try your latest patch. If there is no problem, I will use that version. I will report how your latest patch works. Thank you for your patch! |
I confirmed that your latest patch didn't fail. I couldn't reproduce the bug. Thank you! |
Using cupy (commit: 5062f61065caecb8b3910c452f51b1307f5d8121 on Windows 10) in my program, I got many error messages like the following:
These errors only happen when I enable the memory pool with the following code:
I think the failure point (cupy.sum) is irrelevant to the cause of these errors.
I have no details about these errors.
Is there any way to debug these errors more deeply?