Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11037: [Rust] Optimized creation of string array from iterator. #9016

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
33 changes: 18 additions & 15 deletions rust/arrow/src/array/array_string.rs
Original file line number Diff line number Diff line change
Expand Up @@ -123,25 +123,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}

pub(crate) fn from_vec(v: Vec<&str>) -> Self {
let mut offsets = Vec::with_capacity(v.len() + 1);
let mut values = Vec::new();
let mut offsets =
MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
let mut values = MutableBuffer::new(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be interesting to test here whether it is cheaper to calculate offsets + total size first for values buffer based on final length_so_far so that it doesn't require extra allocations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So something like

for s in &v {
            length_so_far = length_so_far + OffsetSize::from_usize(s.len()).unwrap();
            offsets.extend_from_slice(length_so_far.to_byte_slice());
        }
        
let mut values = MutableBuffer::new(length_so_far);
for s in &v {
    values.extend_from_slice(s.as_bytes());
}

Copy link
Contributor

@Dandandan Dandandan Dec 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the offset buffer is also faster after converting it to a slice using typed_data_mut::<OffsetSize>() and iterating it with iter_mut, this way it can skip the capacity checks, and also current conversion to a byte slice may add some overhead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both great ideas. If it is ok for you, I will leave them to future work, as I am focusing my work on the MutableBuffer atm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. the general direction that I am pushing towards is to stop using Vec and replace it by MutableBuffer to avoid extra allocations. Once that is in place, we can replace these by iter_mut.

also current conversion to a byte slice may add some overhead?

I believe so, as to_byte_slice returns a slice of unknown size, both for &T and &[T]. Either the compiler optimizes it, or there is an extra cost. I benched a 10% diff in building buffers when introduced a method that was not doing bound checks on these. IMO we should be using to_le_bytes, and have a method MutableBuffer::push<ToByteSlice>. I think we need the crate byteorder because AFAIK std's to_le_bytes does not have a trait (which we need).


let mut length_so_far = OffsetSize::zero();
offsets.push(length_so_far);
offsets.extend_from_slice(length_so_far.to_byte_slice());

for s in &v {
length_so_far += OffsetSize::from_usize(s.len()).unwrap();
offsets.push(length_so_far);
offsets.extend_from_slice(length_so_far.to_byte_slice());
values.extend_from_slice(s.as_bytes());
}
let array_data = ArrayData::builder(OffsetSize::DATA_TYPE)
.len(v.len())
.add_buffer(Buffer::from(offsets.to_byte_slice()))
.add_buffer(Buffer::from(&values[..]))
.add_buffer(offsets.into())
.add_buffer(values.into())
.build();
Self::from(array_data)
}

pub(crate) fn from_opt_vec(v: Vec<Option<&str>>) -> Self {
GenericStringArray::from_iter(v.into_iter())
v.into_iter().collect()
}
}

Expand All @@ -155,32 +158,32 @@ where
let (_, data_len) = iter.size_hint();
let data_len = data_len.expect("Iterator must be sized"); // panic if no upper bound.

let mut offsets = Vec::with_capacity(data_len + 1);
let mut values = Vec::new();
let mut offsets =
MutableBuffer::new((data_len + 1) * std::mem::size_of::<OffsetSize>());
let mut values = MutableBuffer::new(0);
let mut null_buf = MutableBuffer::new_null(data_len);
let null_slice = null_buf.as_slice_mut();
let mut length_so_far = OffsetSize::zero();
offsets.push(length_so_far);
offsets.extend_from_slice(length_so_far.to_byte_slice());

for (i, s) in iter.enumerate() {
if let Some(s) = s {
let s = s.as_ref();
// set null bit
let null_slice = null_buf.as_slice_mut();
bit_util::set_bit(null_slice, i);

length_so_far += OffsetSize::from_usize(s.len()).unwrap();
offsets.push(length_so_far);
values.extend_from_slice(s.as_bytes());
} else {
offsets.push(length_so_far);
values.extend_from_slice(b"");
}
offsets.extend_from_slice(length_so_far.to_byte_slice());
}

let array_data = ArrayData::builder(OffsetSize::DATA_TYPE)
.len(data_len)
.add_buffer(Buffer::from(offsets.to_byte_slice()))
.add_buffer(Buffer::from(&values[..]))
.add_buffer(offsets.into())
.add_buffer(values.into())
.null_bit_buffer(null_buf.into())
.build();
Self::from(array_data)
Expand Down