Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-530: [C++/Python] Provide subpools for better memory allocation … #2057

Closed
wants to merge 9 commits into from

Conversation

rok
Copy link
Member

@rok rok commented May 17, 2018

…tracking

Currently we can only track the amount of bytes allocated by the main memory pool or the alternative jemalloc implementation. To better understand certain situation, we should provide a MemoryPool proxy implementation that tracks only the amount of memory that was made through its direct calls but delegates the actual allocation to an underlying pool.

Authors: Rok Mihevc rok@mihevc.org and Alex Hagerman alex@unexpectedeof.net

Usage example:

import pyarrow as pa

def report(pool, proxy_pool):
print("Total: ", pa.total_allocated_bytes())
print("Default pool: ", pool.bytes_allocated())
print("Proxy allocated: ", proxy_pool.proxy_bytes_allocated())

pool = pa.default_memory_pool()
proxy_pool = pa.ProxyMemoryPool(pool)

report(pool, proxy_pool)
a1 = pa.array([0]*1000, memory_pool=pool)
report(pool, proxy_pool)
a2 = pa.array([0]*1, memory_pool=proxy_pool)
report(pool, proxy_pool)
a3 = pa.array([0]*1000, memory_pool=proxy_pool)
report(pool, proxy_pool)
a3 = pa.array([0]*1000, memory_pool=proxy_pool)
report(pool, proxy_pool)

Result:

Total: 0
Default pool: 0
bytes_allocated: 0
Proxy allocated: 0
Total: 8128
Default pool: 8128
bytes_allocated: 0
Proxy allocated: 0
Allocate: size = 64
Allocate: size = 256
Reallocate: old_size = 256 - new_size = 64 - proxy_allocated - 128
Total: 8256
Default pool: 8256
bytes_allocated: 128
Proxy allocated: 128
Allocate: size = 128
Allocate: size = 8192
Reallocate: old_size = 8192 - new_size = 8000 - proxy_allocated - 8256
Total: 16384
Default pool: 16384
bytes_allocated: 8256
Proxy allocated: 8256
Allocate: size = 128
Allocate: size = 8192
Reallocate: old_size = 8192 - new_size = 8000 - proxy_allocated - 16384
Free: size = 128
Free: size = 8000
Total: 16384
Default pool: 16384
bytes_allocated: 8256
Proxy allocated: 8256

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation-wise this looks good. Also for me it seems as this pool will already tracks only the allocations made for it. To complete this PR, please add some C++ unit tests that check the implementation.

Also have a look at Travis which reports some linting errors.

Status ProxyMemoryPool::Allocate(int64_t size, uint8_t** out) {
proxy_bytes_allocated_ += size;
Status s = pool_->Allocate(size, out);
std::cout << "Allocate: size = " << size << std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the calls to std::cout. If the user wants this, he should explicitly chose the LoggingMemoryPool.

@xhochy
Copy link
Member

xhochy commented May 19, 2018

@rok Code and tests looks good. Could you fix the linting errors mentioned in Travis?

@rok
Copy link
Member Author

rok commented May 19, 2018

@xhochy - will do, traveling at the moment, better availability from Monday on.

@codecov-io
Copy link

codecov-io commented May 20, 2018

Codecov Report

Merging #2057 into master will decrease coverage by 1.08%.
The diff coverage is 68.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2057      +/-   ##
==========================================
- Coverage   87.44%   86.35%   -1.09%     
==========================================
  Files         189      230      +41     
  Lines       29376    40513   +11137     
==========================================
+ Hits        25687    34987    +9300     
- Misses       3689     5526    +1837
Impacted Files Coverage Δ
python/pyarrow/__init__.py 57.14% <ø> (ø)
python/pyarrow/memory.pxi 34.48% <0%> (ø)
cpp/src/arrow/memory_pool.h 100% <100%> (ø) ⬆️
cpp/src/arrow/memory_pool-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/memory_pool.cc 65.09% <60.86%> (-1.18%) ⬇️
cpp/src/arrow/status.cc 44.64% <0%> (-1.79%) ⬇️
cpp/src/arrow/util/bit-util.h 98.01% <0%> (-1.23%) ⬇️
cpp/src/arrow/io/hdfs-internal.cc 33.62% <0%> (-0.45%) ⬇️
cpp/src/arrow/compare.cc 89.97% <0%> (-0.2%) ⬇️
cpp/src/arrow/ipc/json-internal.cc 89.79% <0%> (-0.12%) ⬇️
... and 103 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df20683...eabd8a4. Read the comment docs.

@rok rok changed the title ARROW-530: C++/Python: Provide subpools for better memory allocation … ARROW-530 [C++/Python]: Provide subpools for better memory allocation … May 20, 2018
@rok rok changed the title ARROW-530 [C++/Python]: Provide subpools for better memory allocation … ARROW-530: [C++/Python] Provide subpools for better memory allocation … May 20, 2018
proxy_bytes_allocated_ -= size;
}

int64_t ProxyMemoryPool::bytes_allocated() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realised that bytes_allocated refers to the parent Pool. I would have expected that it would return only the allocation of the currrent scope (i.e. proxy_bytes_allocated_).

@rok
Copy link
Member Author

rok commented May 22, 2018

@xhochy - agreed that is a more intuitive interface. Now the ProxyMemoryPool object should return bytes allocated through it with bytes_allocated() method.
By the way how do you feel about the naming here? ProxyMemoryPool might not be the best name.


int64_t ProxyMemoryPool::bytes_allocated() const {
int64_t nb_bytes = bytes_allocated_;
return nb_bytes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use num or number here instead of nb


int64_t ProxyMemoryPool::max_memory() const {
int64_t mem = pool_->max_memory();
return mem;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return pool_->max_memory(); ?


Status ProxyMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) {
Status s = pool_->Reallocate(old_size, new_size, ptr);
bytes_allocated_ += new_size - old_size;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pool_->Reallocate fails, bytes_allocated_ will be incorrect


Status ProxyMemoryPool::Allocate(int64_t size, uint8_t** out) {
Status s = pool_->Allocate(size, out);
bytes_allocated_ += size;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pool_->Allocate fails, bytes_allocated_ will be incorrect

@@ -86,6 +86,25 @@ class ARROW_EXPORT LoggingMemoryPool : public MemoryPool {
MemoryPool* pool_;
};

class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we consider some alternative names than "Proxy"? Can you add a doxygen comment describing what this class is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about SubMemoryPool? MemorySubPool? PartialMemoryPool?

class ARROW_EXPORT ProxyMemoryPool : public MemoryPool {
public:
explicit ProxyMemoryPool(MemoryPool* pool);
~ProxyMemoryPool() override = default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dtor isn't needed


private:
MemoryPool* pool_;
int64_t bytes_allocated_ = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Atomic?

@@ -44,6 +44,15 @@ cdef class LoggingMemoryPool(MemoryPool):
self.init(self.logging_pool.get())


cdef class ProxyMemoryPool(MemoryPool):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other classes are not documented in memory.pxi. Is there another location for documentation I'm missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is the main location where we should document things. We were not so good with adding docstrings in the past. I'll probably make a pass over the old classes soon and add some documentation but we should start adding documentation to new classes once we add them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Would be happy to help with the old classes as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be happy to help with the old classes as well.

That would be very much appreciated!

@@ -58,6 +67,8 @@ def set_memory_pool(MemoryPool pool):
cdef MemoryPool _default_memory_pool = default_memory_pool()
cdef LoggingMemoryPool _logging_memory_pool = (
LoggingMemoryPool(_default_memory_pool))
cdef ProxyMemoryPool _proxy_memory_pool = (
ProxyMemoryPool(_default_memory_pool))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this?

@xhochy
Copy link
Member

xhochy commented Jun 3, 2018

By the way how do you feel about the naming here? ProxyMemoryPool might not be the best name.

Cannot think of anything better at the moment, I think it's ok.

@xhochy
Copy link
Member

xhochy commented Jun 4, 2018

Looks good except for a small formatting issue: https://travis-ci.org/apache/arrow/jobs/387368172#L1227

Would merge on a green build.

@rok
Copy link
Member Author

rok commented Jun 6, 2018

It seems tests in Travis failed due to timeouts: https://travis-ci.org/apache/arrow/jobs/388699104#L4333, https://travis-ci.org/apache/arrow/jobs/388699106#L523.
Is there a way to rerun them manually?

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

Travis tests passed after I retriggered them.

@xhochy xhochy closed this in e82a34a Jun 7, 2018
@xhochy
Copy link
Member

xhochy commented Jun 7, 2018

Thanks @rok . I assigned the jira to rokm, is that your user on https://issues.apache.org/jira ?

@rok
Copy link
Member Author

rok commented Jun 8, 2018

@xhochy - yeah, that's me, thanks! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants