QUESTION about memory management #1709
Description
The following loop should succeed on my NVIDIA GPU with 4 GB of memory, because each step of the loop uses less than 800 MB. However, the loop aborts after a few iterations with CUDA Error (2): out of memory:
array res = constant(0, 10000, 20000, f32);
for (int i = 0; i < 20; ++i) {
    array R = randu(10000, 20000, f32);
    res += R;
}
I added a printMemInfo() call at each step, and this is what it shows:
starting step 0
---------------------------------------------------------
| POINTER | SIZE | AF LOCK | USER LOCK |
---------------------------------------------------------
---------------------------------------------------------
starting step 1
---------------------------------------------------------
| POINTER | SIZE | AF LOCK | USER LOCK |
---------------------------------------------------------
| 0x1302f40000 | 762.9 MB | Yes | No |
---------------------------------------------------------
starting step 2
---------------------------------------------------------
| POINTER | SIZE | AF LOCK | USER LOCK |
---------------------------------------------------------
| 0x1332a40000 | 762.9 MB | Yes | No |
| 0x1302f40000 | 762.9 MB | Yes | No |
---------------------------------------------------------
starting step 3
---------------------------------------------------------
| POINTER | SIZE | AF LOCK | USER LOCK |
---------------------------------------------------------
| 0x1362540000 | 762.9 MB | Yes | No |
| 0x1332a40000 | 762.9 MB | Yes | No |
| 0x1302f40000 | 762.9 MB | Yes | No |
---------------------------------------------------------
starting step 4
---------------------------------------------------------
| POINTER | SIZE | AF LOCK | USER LOCK |
---------------------------------------------------------
| 0x1392040000 | 762.9 MB | Yes | No |
| 0x1362540000 | 762.9 MB | Yes | No |
| 0x1332a40000 | 762.9 MB | Yes | No |
| 0x1302f40000 | 762.9 MB | Yes | No |
---------------------------------------------------------
starting step 5
---------------------------------------------------------
| POINTER | SIZE | AF LOCK | USER LOCK |
---------------------------------------------------------
| 0x13c1b40000 | 762.9 MB | Yes | No |
| 0x1302f40000 | 762.9 MB | Yes | No |
| 0x1332a40000 | 762.9 MB | Yes | No |
| 0x1362540000 | 762.9 MB | Yes | No |
| 0x1392040000 | 762.9 MB | Yes | No |
---------------------------------------------------------
terminate called after throwing an instance of 'af::exception'
what(): ArrayFire Exception (Device out of memory:101):
In function virtual void* cuda::MemoryManager::nativeAlloc(size_t)
In file src/backend/cuda/memory.cpp:97
CUDA Error (2): out of memory
In function af::array af::randu(const af::dim4&, af::dtype)
In file src/api/cpp/random.cpp:96
I have tried inserting a deviceGC() call at each step of the loop, but the result is the same. I have also tried calling sync() (both with and without deviceGC()) after each step, still with the same result.
If I understand this correctly, even though the R array goes out of scope, for some reason ArrayFire still keeps its lock on the buffer and does not release the memory back to the pool. These locked buffers keep accumulating until the device runs out of memory.
Is this a bug or expected behavior? If it's expected, what can I do to make such a loop work?
This is of course an artificially constructed example to demonstrate the problem. In my real code, I have a function that uses a lot of GPU memory. I call this function many times; the first few calls finish properly, but eventually one fails with the same out-of-memory error as above. And printMemInfo() shows the same pattern: temporary arrays that are allocated and used within the function still show up as locked memory even after the function has returned.