[VM][PooledAllocator] try reallocation once when OOM #8285

Merged: 1 commit, Jun 21, 2021
18 changes: 13 additions & 5 deletions src/runtime/vm/pooled_allocator.h
```diff
@@ -45,7 +45,7 @@ class PooledAllocator final : public Allocator {
   ~PooledAllocator() { ReleaseAll(); }

   Buffer Alloc(size_t nbytes, size_t alignment, DLDataType type_hint) override {
-    std::lock_guard<std::mutex> lock(mu_);
+    std::lock_guard<std::recursive_mutex> lock(mu_);
     size_t size = ((nbytes + page_size_ - 1) / page_size_) * page_size_;
     auto&& it = memory_pool_.find(size);
     if (it != memory_pool_.end() && !it->second.empty()) {
```
```diff
@@ -57,14 +57,22 @@ class PooledAllocator final : public Allocator {
     Buffer buf;
     buf.device = device_;
     buf.size = size;
-    buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, size, alignment, type_hint);
+    try {
+      buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, size, alignment, type_hint);
+    } catch (InternalError& err) {
+      LOG(WARNING) << "PooledAllocator got InternalError during allocation: " << err.message();
+      LOG(WARNING) << "Trying to release all unused memory and reallocate...";
+      ReleaseAll();
+      buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, size, alignment, type_hint);
```
**Contributor:** What should we expect if it still fails here?

**Contributor Author:** If it still fails, an InternalError will be thrown, causing a TVMError regarding OOM on the Python end.

**Contributor:** Hmm, would it be better to let ReleaseAll return the size it released and check whether that is larger than the requested size? Then we could throw a message including both sizes directly, without calling alloc again.

**Contributor Author (@ganler, Jun 20, 2021):** Thanks for the suggestion, but IMHO this is not robust enough.

Say we have 8 GB of GPU memory, the PooledAllocator has cached 4 GB, and we want to allocate 6 GB:

  • With that approach, ReleaseAll() returns 4 GB, which is less than 6 GB, so the allocation is reported as failed.
  • Instead, if we release the unused memory and re-allocate, the 6 GB allocation is very likely to succeed.

The big picture behind that idea becomes practical once we have APIs like "total_system_memory" and "available_system_memory", which would require introducing a series of runtime/driver libraries, e.g., cudaMemGetInfo from the CUDA runtime (user space) or NVML (if the required system privileges are available).

**Contributor:** Fair enough. I don't have other comments then.
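The release-and-retry behavior under discussion can be sketched outside of TVM. `MiniPool` and its `raw_alloc`/`raw_free` callbacks below are hypothetical stand-ins for `PooledAllocator` and `DeviceAPI::AllocDataSpace`/`FreeDataSpace`, with `std::runtime_error` playing the role of TVM's `InternalError`; this is a minimal sketch, not the actual TVM implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the PR's retry-once-on-OOM logic.
class MiniPool {
 public:
  MiniPool(std::function<void*(size_t)> raw_alloc,
           std::function<void(void*, size_t)> raw_free)
      : raw_alloc_(std::move(raw_alloc)), raw_free_(std::move(raw_free)) {}

  void* Alloc(size_t size) {
    // Reuse a cached block of the exact size if one is available.
    auto it = pool_.find(size);
    if (it != pool_.end() && !it->second.empty()) {
      void* p = it->second.back();
      it->second.pop_back();
      return p;
    }
    try {
      return raw_alloc_(size);
    } catch (const std::runtime_error&) {
      // OOM: hand every cached block back to the device, then retry once.
      ReleaseAll();
      return raw_alloc_(size);  // a second failure propagates to the caller
    }
  }

  void Free(void* p, size_t size) { pool_[size].push_back(p); }

  // Returns how many bytes were handed back, matching the alternative
  // discussed in the review thread.
  size_t ReleaseAll() {
    size_t released = 0;
    for (auto& kv : pool_) {
      for (void* p : kv.second) {
        raw_free_(p, kv.first);
        released += kv.first;
      }
      kv.second.clear();
    }
    return released;
  }

 private:
  std::function<void*(size_t)> raw_alloc_;
  std::function<void(void*, size_t)> raw_free_;
  std::unordered_map<size_t, std::vector<void*>> pool_;
};
```

Replaying the 8 GB / 4 GB / 6 GB scenario from the thread: the first raw allocation of 6 throws, `ReleaseAll()` returns the cached 4 to the device, and the retry succeeds, whereas comparing the released size (4) against the request (6) would have reported a failure.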

```diff
+    }

     used_memory_.fetch_add(size, std::memory_order_relaxed);
     DLOG(INFO) << "allocate " << size << " B, used memory " << used_memory_ << " B";
     return buf;
   }

   void Free(const Buffer& buffer) override {
-    std::lock_guard<std::mutex> lock(mu_);
+    std::lock_guard<std::recursive_mutex> lock(mu_);
     if (memory_pool_.find(buffer.size) == memory_pool_.end()) {
       memory_pool_.emplace(buffer.size, std::vector<Buffer>{});
     }
```
```diff
@@ -76,7 +84,7 @@ class PooledAllocator final : public Allocator {

  private:
  void ReleaseAll() {
-    std::lock_guard<std::mutex> lock(mu_);
+    std::lock_guard<std::recursive_mutex> lock(mu_);
    for (auto const& it : memory_pool_) {
      auto const& pool = it.second;
      for (auto const& buf : pool) {
```
```diff
@@ -92,7 +100,7 @@ class PooledAllocator final : public Allocator {
   size_t page_size_;
   std::atomic<size_t> used_memory_;
   std::unordered_map<size_t, std::vector<Buffer> > memory_pool_;
-  std::mutex mu_;
+  std::recursive_mutex mu_;
   Device device_;
 };
```
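The switch from `std::mutex` to `std::recursive_mutex` throughout the diff is needed because `Alloc()` now calls `ReleaseAll()` while already holding the lock; relocking a plain `std::mutex` from the owning thread is undefined behavior and typically deadlocks. A minimal sketch of the re-entrant pattern, with hypothetical names (`ReentrantPool` is not a TVM class):

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Illustration of why the PR needs std::recursive_mutex: Alloc() takes the
// lock, and its OOM fallback calls ReleaseAll(), which takes the same lock
// again on the same thread.
class ReentrantPool {
 public:
  int Alloc() {
    std::lock_guard<std::recursive_mutex> lock(mu_);  // first acquisition
    ReleaseAll();  // re-enters mu_; a plain std::mutex would deadlock here
    return ++allocations_;
  }

  void ReleaseAll() {
    std::lock_guard<std::recursive_mutex> lock(mu_);  // second acquisition, same thread
    cache_.clear();
  }

 private:
  std::recursive_mutex mu_;
  std::vector<int> cache_;
  int allocations_ = 0;
};
```

Each `lock_guard` acquisition bumps the recursive mutex's ownership count, and the mutex is actually released only when the count drops back to zero, so the nested `ReleaseAll()` completes instead of blocking.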
