Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++: Implement zero-copy array slicing #15401

Closed
asfimport opened this issue Mar 2, 2016 · 14 comments
Closed

C++: Implement zero-copy array slicing #15401

asfimport opened this issue Mar 2, 2016 · 14 comments

Comments

@asfimport
Copy link

asfimport commented Mar 2, 2016

Sliced array results will retain a reference to the parent's data buffer

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-33. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Feng Guangyuan / @fengguangyuan:
Hi, there.
I’d like to work on this, if convenient. Thanks.

I have a draft implementation that adds a function to Array class like below:
code
Array slice(int32_t start, int32_t length);
code
It will return an Array with the reference to the parent Array’s raw data buffer and index scope information.
The function won’t copy the real data, and it just contains the concerned data buffer’s location and length.

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
Hi [~qq272101359@163.com],

it would be best if you push your code and open a pull request on github. Then someone can review it and give you feedback on it.

@asfimport
Copy link
Author

Feng Guangyuan / @fengguangyuan:
Issue resolved by pull request 56.
#56

However, there are some confusions.
1.If it's necessary to add the slicing methods into ListArray, UnionArray etc.
2.Before slice(...) returns an Array, which is better the Array contains a base class Buffer
or a derived Buffer.
3.null_bitmap also to be sliced concurrently may be better.
Please feel free to give me any advice.

Thanks. Uwe L. Korn.
: )

@asfimport
Copy link
Author

Wes McKinney / @wesm:
From a strictly C++ implementation perspective, the most robust way to support array slicing will be to add an offset member to the base Array class. Otherwise, the bitmap must be copied. @xhochy what do you think about this? This is most likely what we'll end up doing in pandas 2.0, for example.

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
Fine with me. I wouldn't have considered it zero-copy if we copied the whole bitmap ;)

@asfimport
Copy link
Author

Wes McKinney / @wesm:
@xhochy I'd like to do this soon. The trouble is that anywhere where we access the raw buffers in the Array will be affected – this includes in libparquet_arrow. It will also result in some API changes because we'll need additional constructors that take an offset parameter. And anywhere we touch bitmaps will need to be more careful to account for the offset (and in IPC, we'll have to do bitmap copies when the offset is nonzero, e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/adapter.cc#L65)

@asfimport
Copy link
Author

Wes McKinney / @wesm:
On this note, I don't think it will be too difficult to implement this, but we should make a list of all the code we need to review for "naked" buffer access. Let me know your thoughts

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
This sounds like a good thing to do before the 0.2 release. In parquet-cpp I often already use an offset for the bitmaps, partly it's combined with the array offset which we need to separate.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
I started looking at this. I forgot about the null count – with the current null count requirement, adding an offset member would require inspecting the bitmap on calling Array::Slice to determine the number of nulls in the sliced range.

One possible workaround is to make Array::null_count virtual, and set the null_count_ member to 1 when it is not known, so for sliced arrays it would need to be computed and cached the first time you call {{arr>null_count()}}.

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
We should not need to make null_count virtual as the handling of valid bitmaps is the same for all arrays. Just checking in the main implementation if it is <0 and then recompute should be enough. Given that we use the null_count mostly for checking if the number of nulls is 0, and if not we go into the more expensive code path, I think the solution to compute on first null_count() is the most reasonable.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Agreed, that was also my conclusion after more thought. Will put up a patch once I write enough tests to cover areas of concern

@asfimport
Copy link
Author

Wes McKinney / @wesm:
As an idle thought: one consequence of this functionality is the arrays with offset != 0 (or that were constructed from foreign memory spaces / no guarantees about how the memory was allocated) in generally may not have padded / aligned buffers. So when we go to implement SIMD integration / hardware intrinsics, we'll need to be careful about this

@asfimport
Copy link
Author

Wes McKinney / @wesm:
PR in progress here: #322

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Issue resolved by pull request 322
#322

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants