Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add methods pub fn offsets_buffer, pub fn types_ids_bufferand pub fn data_buffer for ArrayDataBuilder #1640

Closed
HaoYang670 opened this issue May 3, 2022 · 3 comments
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@HaoYang670
Copy link
Contributor

HaoYang670 commented May 3, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, when users what to create an array by using ArrayDataBuilder, they always write code like:

let data_buffer = ...;
let offsets_buffer = ...;
let validity_buffer = ...;
let array_data = ArrayData::builder(datatype)
    .len(num_of_element)
    .null_bit_buffer(validity_buffer)
    .add_buffer(offsets_buffer)
    .add_buffer(data_buffer)
    .offset(number)
    .build()
    .unwrap();

The code is not user-friendly and easy to cause some runtime errors. Because:

  1. Users have to remember the order and type of each buffer: https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
  2. The array can have at most 3 buffers, but by using the method add_buffer or buffers, users can create arrayData with > 3 buffers, which is harmful.
  3. Users may get a wrong answer if they confuse offset buffer and data buffer. For example:
// build a list array [[0, 1], [2], [4]]
let offsets_buffer = [0, 2, 3, 4];
let data_buffer = [0, 1, 2, 4];

// currect way
let array_data = ArrayData::builder(ListArray)
    .len(3)
    .buffers(vec![offsets_buffer, data_buffer])
    ...
    .build();

// what if we reverse the order of buffers?
// wrong way
let array_data = ArrayData::builder(ListArray)
    .len(3)
    .buffers(vec![data_buffer, offsets_buffer])
    ...
    .build();

// This will get an array: [[0], [2], [3, 4]]

ArrayData::try_new can not detect such errors. So users have to debug to find it.

Describe the solution you'd like

  1. delete the methods add_buffer and buffers
  2. add methods offsets_buffer, type_ids_buffer and data_buffer
  3. rename null_bit_buffer and validity_buffer (because it is the name in Apache Arrow Format)
  4. Update the struct of ArrayDataBuilder like:
pub struct ArrayDataBuilder {
    data_type: DataType,
    len: usize,
    null_count: Option<usize>,
    validity_buffer: Option<Buffer>,
    offsets_buffer: Option<Buffer>,
    type_ids_buffer: Option<Buffer>,
    data_buffer: Option<Buffer>,
    offset: usize,
    child_data: Vec<ArrayData>,
}

we can add some cheap checking in the build method. For example

match (data_type, validity_buffer, type_ids_buffer, offsets_buffer, data_buffer) {
    (Binary, Some(buf0), None, Some(buf1), Some(buf2)) => ArrayData::try_new(...),
    (DenseUnion, None, Some(buf0), Some(buf1), None) => ArrayData::try_new(...),
    ...,
    _ => Err("wrong layout"),
}

This will introduce some public API changes. So I want more code maintainers to review this issue. And I will implement this if we could reach consensus.

@HaoYang670 HaoYang670 added the enhancement Any new improvement worthy of a entry in the changelog label May 3, 2022
@tustvold
Copy link
Contributor

tustvold commented May 4, 2022

The only major thing that comes to mind is how this would interact with IPC which doesn't make the same distinction.

FWIW my 2 cents is that users shouldn't be interacting with ArrayData at all, and arrow-rs should instead provide builders, from_iter, etc... that abstract this away. These can potentially also ensure they construct valid data without a potentially expensive validation step. I could definitely see a world where we eventually make ArrayData an implementation detail, and people only interact with strongly typed Arrays a 'la arrow2.

In the case of your example, there is a GenericListArray::from_iter_primitive, but perhaps we should add a more general purpose try_new constructor, like UnionArray::try_new, or possibly a GenericListArrayBuilder?

@HaoYang670
Copy link
Contributor Author

Thank you very much for your explanation, @tustvold ! I have seen some code which buffers() is the best choice:

        let mut builder = ArrayData::builder(data_type.clone())
            .len(field_node.length() as usize)
            .buffers(buffers[1..2].to_vec())
            .offset(0)
            .child_data(vec![child_array.data().clone()]);

@HaoYang670
Copy link
Contributor Author

I will close this issue, because there is no need to modify the ArrayDataBuilder

@alamb alamb added the arrow Changes to the arrow crate label May 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants