Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6868: [Go] Fix slicing struct arrays #5642

Closed

Conversation

@nickpoorman
Copy link
Contributor

nickpoorman commented Oct 13, 2019

This PR fixes the issue in: https://issues.apache.org/jira/browse/ARROW-6868

Slicing the struct array should result in "{[1.3 1.4] [(null) 4]}", however the entire struct array is returned on a call to arr.String(): "{[1.1 (null) 1.3 1.4] [1 2 (null) 4]}". The fix was to adjust Data to account for the new offset/length:

sub := NewSliceData(child, int64(data.offset), int64(data.offset+data.length))

Result of the test before the fix:

=== RUN   TestStructArraySlice
--- FAIL: TestStructArraySlice (0.00s)
   /arrow/go/arrow/array/struct_test.go:358: invalid string representation:
        got = "{[1.1 (null) 1.3 1.4] [1 2 (null) 4]}"
        want= "{[1.3 1.4] [(null) 4]}"
FAIL
FAIL	github.com/apache/arrow/go/arrow/array	0.043s
Error: Tests failed.
nickpoorman added 2 commits Oct 13, 2019
@github-actions

This comment has been minimized.

@sbinet
sbinet approved these changes Oct 14, 2019
Copy link
Contributor

sbinet left a comment

LGTM.

what do others think? @stuartcarnie @alexandreyc ?

@sbinet sbinet self-requested a review Oct 14, 2019
@sbinet

This comment has been minimized.

Copy link
Contributor

sbinet commented Oct 14, 2019

except for the panics in some of the tests :)

@nickpoorman

This comment has been minimized.

Copy link
Contributor Author

nickpoorman commented Oct 19, 2019

I had to update the tests and fix the String() method on the Struct type to make the tests work properly.

Before the fix, calling String() on Struct would not take into account the validity bitmap of the struct when printing the child fields.

From the documentation:

While a struct does not have physical storage for each of its semantic slots (i.e. each scalar C-like struct), an entire struct slot can be set to null via the validity bitmap. Any of the child field arrays can have null values according to their respective independent validity bitmaps. This implies that for a particular struct slot the validity bitmap for the struct array might indicate a null slot when one or more of its child arrays has a non-null value in their corresponding slot. When reading the struct array the parent validity bitmap takes priority.

@nickpoorman

This comment has been minimized.

Copy link
Contributor Author

nickpoorman commented Oct 21, 2019

This should be good for review now. @sbinet @stuartcarnie @alexandreyc

@sbinet

This comment has been minimized.

Copy link
Contributor

sbinet commented Oct 22, 2019

Apologies. I am a bit off the grid till middle of next week.

@sbinet sbinet added the lang-go label Oct 30, 2019
@sbinet
sbinet approved these changes Oct 31, 2019
Copy link
Contributor

sbinet left a comment

thanks.

waiting a bit for @stuartcarnie and/or @alexandreyc to chime in?

go/arrow/array/struct.go Outdated Show resolved Hide resolved
// with a nullBitmapBytes adjusted according on the parent struct nullBitmapBytes.
// From the docs:
// "When reading the struct array the parent validity bitmap takes priority."
func newStructFieldWithParentValidityMask(a *Struct, fieldIndex int) Interface {

This comment has been minimized.

Copy link
@sbinet

sbinet Oct 31, 2019

Contributor

perhaps this newStructFieldWithParentValidityMask function should actually be a method of Struct?
and, actually, shouldn't this be factored into the current func (*Struct) Field(i int) Interface method?
(this would probably mean to have an additional (caching) slice of Interface fields to not require a matching Release on the client...)

This comment has been minimized.

Copy link
@nickpoorman

nickpoorman Oct 31, 2019

Author Contributor

I thought about adding this to Field(i) but I decided against it because it would require unnecessary caching logic and I thought it was more in line with the way the other nested logical types implemented slicing.

func (a *List) String() string {
o := new(strings.Builder)
o.WriteString("[")
for i := 0; i < a.Len(); i++ {
if i > 0 {
o.WriteString(" ")
}
if !a.IsValid(i) {
o.WriteString("(null)")
continue
}
sub := a.newListValue(i)
fmt.Fprintf(o, "%v", sub)
sub.Release()
}
o.WriteString("]")
return o.String()
}
func (a *List) newListValue(i int) Interface {
j := i + a.array.data.offset
beg := int64(a.offsets[j])
end := int64(a.offsets[j+1])
return NewSlice(a.values, beg, end)
}

func (a *FixedSizeList) String() string {
o := new(strings.Builder)
o.WriteString("[")
for i := 0; i < a.Len(); i++ {
if i > 0 {
o.WriteString(" ")
}
if !a.IsValid(i) {
o.WriteString("(null)")
continue
}
sub := a.newListValue(i)
fmt.Fprintf(o, "%v", sub)
sub.Release()
}
o.WriteString("]")
return o.String()
}
func (a *FixedSizeList) newListValue(i int) Interface {
n := int64(a.n)
off := int64(a.array.data.offset)
beg := (off + int64(i)) * n
end := (off + int64(i+1)) * n
sli := NewSlice(a.values, beg, end)
return sli
}

This comment has been minimized.

Copy link
@nickpoorman

nickpoorman Oct 31, 2019

Author Contributor

I moved it to a method on struct. 4cfef3c

This comment has been minimized.

Copy link
@nickpoorman

nickpoorman Oct 31, 2019

Author Contributor

Ahh. I think I see what you're referring to with Field(i). I guess we need to decide what the behavior of Field(i) should be. Should calling Field(i) return the masked field with an updated null bitmap or the actual underling field, which could potentially have a different null bitmap?

This comment has been minimized.

Copy link
@nickpoorman

nickpoorman Nov 2, 2019

Author Contributor

I'm not sure if this is the desired behavior, but for what it's worth the Python version does not apply the struct null bitmap either.

python
Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 14:38:56) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import numpy as np
>>> pa.__version__
'0.15.0'
>>> np.__version__
'1.17.3'
>>> ty = pa.struct([('x', pa.int16()),('y', pa.bool_())])
>>> xs = pa.array([5, None, 7], type=pa.int16())
>>> ys = pa.array([False, True, True])
>>> mask = pa.array(['a', 'b', 'c'], mask=np.array([True, False, False]))
>>> mask
<pyarrow.lib.StringArray object at 0x11995e108>
[
  null,
  "b",
  "c"
]
>>> arr = pa.StructArray.from_buffers(ty, 3, mask.buffers()[0:1], children=(xs, ys))
>>> arr
<pyarrow.lib.StructArray object at 0x11995e168>
-- is_valid:
  [
    false,
    true,
    true
  ]
-- child 0 type: int16
  [
    5,
    null,
    7
  ]
-- child 1 type: bool
  [
    false,
    true,
    true
  ]
>>> arr.field(0)
<pyarrow.lib.Int16Array object at 0x11995e1c8>
[
  5,
  null,
  7
]
>>> arr.field(0).null_count
1
>>> 

This comment has been minimized.

Copy link
@nickpoorman

nickpoorman Nov 2, 2019

Author Contributor

The C++ implementation doesn't appear to try to apply the nulls on a call to field(): https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L612

My vote would be to add a new method to Struct where the developer can choose to select a field with the nulls applied and leave the existing Field method as is.

nickpoorman added 2 commits Oct 31, 2019
@wesm

This comment has been minimized.

Copy link
Member

wesm commented Nov 8, 2019

@nealrichardson

This comment has been minimized.

Copy link
Contributor

nealrichardson commented Nov 15, 2019

Merging since this has been sitting here approved for a while. Can open a followup issue if anyone finds anything objectionable.

Copy link
Contributor

stuartcarnie left a comment

Sorry for the late review; didn't see any issues scanning through 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.