Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make slice work for nested types #389

Merged
merged 1 commit into from
Jul 14, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
7 changes: 1 addition & 6 deletions arrow/src/array/array_struct.rs
Original file line number Diff line number Diff line change
Expand Up @@ -84,12 +84,7 @@ impl From<ArrayData> for StructArray {
fn from(data: ArrayData) -> Self {
let mut boxed_fields = vec![];
for cd in data.child_data() {
let child_data = if data.offset() != 0 || data.len() != cd.len() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhorstmann @jorgecarleitao the problem was here. We take data which has children and their own offsets, then slice that data correctly to populate boxed_fields, but then we still use the data with the child arrays that aren't offset.

The result's that the struct is correct when looking at it through boxed_fields (e.g. the print utility uses this), but once you need to do anything with data, it's as if the child values were never sliced.

The lightbulb turned on while i was trying to figure out why a test for #491 was failing.

There's a change that @bjchambers authored (9f96561) which addressed the child offsets, but my tests were failing because I needed to revert what they did.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to make sense -- my main interest was in getting the tests I added (concat on sliced struct arrays) to pass because I had run into problems with that. It seems like the tests are still here and passing, and it sounds like this may be more robust as well (addressing other issues I've hit when slicing StructArray such as the mentioned equality issues).

For my own curiosity -- If I understand correctly, slicing the child data doesn't actually copy the data, but just creates a reference to the same data with different offsets, so this shouldn't affect performance significantly, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to make sense -- my main interest was in getting the tests I added (concat on sliced struct arrays) to pass because I had run into problems with that. It seems like the tests are still here and passing, and it sounds like this may be more robust as well (addressing other issues I've hit when slicing StructArray such as the mentioned equality issues).

Yes, that's correct. I relied on your tests to verify that my changes make sense.

For my own curiosity -- If I understand correctly, slicing the child data doesn't actually copy the data, but just creates a reference to the same data with different offsets, so this shouldn't affect performance significantly, correct?

Yes. The cost should be negligible as we're cloning a bunch of arcs individually instead of all at once

cd.slice(data.offset(), data.len())
} else {
cd.clone()
};
boxed_fields.push(make_array(child_data));
boxed_fields.push(make_array(cd.clone()));
}
Self { data, boxed_fields }
}
Expand Down
39 changes: 30 additions & 9 deletions arrow/src/array/data.rs
Original file line number Diff line number Diff line change
Expand Up @@ -383,15 +383,36 @@ impl ArrayData {
pub fn slice(&self, offset: usize, length: usize) -> ArrayData {
assert!((offset + length) <= self.len());

let mut new_data = self.clone();

new_data.len = length;
new_data.offset = offset + self.offset;

new_data.null_count =
count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);

new_data
if let DataType::Struct(_) = self.data_type() {
// Slice into children
let new_offset = self.offset + offset;
let new_data = ArrayData {
data_type: self.data_type().clone(),
len: length,
null_count: count_nulls(self.null_buffer(), new_offset, length),
offset: new_offset,
buffers: self.buffers.clone(),
// Slice child data, to propagate offsets down to them
child_data: self
.child_data()
.iter()
.map(|data| data.slice(offset, length))
.collect(),
null_bitmap: self.null_bitmap().clone(),
};

new_data
} else {
let mut new_data = self.clone();

new_data.len = length;
new_data.offset = offset + self.offset;

new_data.null_count =
count_nulls(new_data.null_buffer(), new_data.offset, new_data.len);

new_data
}
}

/// Returns the `buffer` as a slice of type `T` starting at self.offset
Expand Down
13 changes: 5 additions & 8 deletions arrow/src/array/transform/structure.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,13 +26,10 @@ pub(super) fn build_extend(array: &ArrayData) -> Extend {
index: usize,
start: usize,
len: usize| {
mutable.child_data.iter_mut().for_each(|child| {
child.extend(
index,
array.offset() + start,
array.offset() + start + len,
)
})
mutable
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reverses 9f96561

.child_data
.iter_mut()
.for_each(|child| child.extend(index, start, start + len))
},
)
} else {
Expand All @@ -41,7 +38,7 @@ pub(super) fn build_extend(array: &ArrayData) -> Extend {
index: usize,
start: usize,
len: usize| {
(array.offset() + start..array.offset() + start + len).for_each(|i| {
(start..start + len).for_each(|i| {
if array.is_valid(i) {
mutable
.child_data
Expand Down