ARROW-6868: [Go] Fix slicing struct arrays #5642
except for the panics in some of the tests :)
I had to update the tests and fix the Before the fix, calling From the documentation:
This should be good for review now. @sbinet @stuartcarnie @alexandreyc
Apologies. I am a bit off the grid till middle of next week.
thanks.
waiting a bit for @stuartcarnie and/or @alexandreyc to chime in?
go/arrow/array/struct.go
// with a nullBitmapBytes adjusted according on the parent struct nullBitmapBytes.
// From the docs:
// "When reading the struct array the parent validity bitmap takes priority."
func newStructFieldWithParentValidityMask(a *Struct, fieldIndex int) Interface {
perhaps this `newStructFieldWithParentValidityMask` function should actually be a method of `Struct`?
and, actually, shouldn't this be factored into the current `func (*Struct) Field(i int) Interface` method?
(this would probably mean to have an additional (caching) slice of `Interface` fields to not require a matching `Release` on the client...)
I thought about adding this to `Field(i)`, but I decided against it because it would require unnecessary caching logic, and I thought it was more in line with the way the other nested logical types implemented slicing.
Lines 47 to 71 in a4a11bf
func (a *List) String() string {
	o := new(strings.Builder)
	o.WriteString("[")
	for i := 0; i < a.Len(); i++ {
		if i > 0 {
			o.WriteString(" ")
		}
		if !a.IsValid(i) {
			o.WriteString("(null)")
			continue
		}
		sub := a.newListValue(i)
		fmt.Fprintf(o, "%v", sub)
		sub.Release()
	}
	o.WriteString("]")
	return o.String()
}

func (a *List) newListValue(i int) Interface {
	j := i + a.array.data.offset
	beg := int64(a.offsets[j])
	end := int64(a.offsets[j+1])
	return NewSlice(a.values, beg, end)
}
arrow/go/arrow/array/fixed_size_list.go
Lines 47 to 73 in a4a11bf
func (a *FixedSizeList) String() string {
	o := new(strings.Builder)
	o.WriteString("[")
	for i := 0; i < a.Len(); i++ {
		if i > 0 {
			o.WriteString(" ")
		}
		if !a.IsValid(i) {
			o.WriteString("(null)")
			continue
		}
		sub := a.newListValue(i)
		fmt.Fprintf(o, "%v", sub)
		sub.Release()
	}
	o.WriteString("]")
	return o.String()
}

func (a *FixedSizeList) newListValue(i int) Interface {
	n := int64(a.n)
	off := int64(a.array.data.offset)
	beg := (off + int64(i)) * n
	end := (off + int64(i+1)) * n
	sli := NewSlice(a.values, beg, end)
	return sli
}
I moved it to a method on struct. 4cfef3c
Ahh. I think I see what you're referring to with `Field(i)`. I guess we need to decide what the behavior of `Field(i)` should be. Should calling `Field(i)` return the masked field with an updated null bitmap, or the actual underlying field, which could potentially have a different null bitmap?
I'm not sure if this is the desired behavior, but for what it's worth the Python version does not apply the struct null bitmap either.
python
Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import numpy as np
>>> pa.__version__
'0.15.0'
>>> np.__version__
'1.17.3'
>>> ty = pa.struct([('x', pa.int16()),('y', pa.bool_())])
>>> xs = pa.array([5, None, 7], type=pa.int16())
>>> ys = pa.array([False, True, True])
>>> mask = pa.array(['a', 'b', 'c'], mask=np.array([True, False, False]))
>>> mask
<pyarrow.lib.StringArray object at 0x11995e108>
[
null,
"b",
"c"
]
>>> arr = pa.StructArray.from_buffers(ty, 3, mask.buffers()[0:1], children=(xs, ys))
>>> arr
<pyarrow.lib.StructArray object at 0x11995e168>
-- is_valid:
[
false,
true,
true
]
-- child 0 type: int16
[
5,
null,
7
]
-- child 1 type: bool
[
false,
true,
true
]
>>> arr.field(0)
<pyarrow.lib.Int16Array object at 0x11995e1c8>
[
5,
null,
7
]
>>> arr.field(0).null_count
1
>>>
The C++ implementation doesn't appear to try to apply the nulls on a call to `field()`: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L612
My vote would be to add a new method to `Struct` where the developer can choose to select a field with the nulls applied, and leave the existing `Field` method as is.
Merging since this has been sitting here approved for a while. Can open a followup issue if anyone finds anything objectionable.
Sorry for the late review; didn't see any issues scanning through
This PR fixes the issue in: https://issues.apache.org/jira/browse/ARROW-6868
Slicing the struct array should result in "{[1.3 1.4] [(null) 4]}"; however, the entire struct array is returned on a call to `arr.String()`: "{[1.1 (null) 1.3 1.4] [1 2 (null) 4]}".
The fix was to adjust `Data` to account for the new offset/length:
Result of the test before the fix: