Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6868: [Go] Fix slicing struct arrays #5642

Closed

Conversation

nickpoorman
Copy link
Contributor

@nickpoorman nickpoorman commented Oct 13, 2019

This PR fixes the issue in: https://issues.apache.org/jira/browse/ARROW-6868

Slicing the struct array should result in "{[1.3 1.4] [(null) 4]}", however the entire struct array is returned on a call to arr.String(): "{[1.1 (null) 1.3 1.4] [1 2 (null) 4]}". The fix was to adjust Data to account for the new offset/length:

sub := NewSliceData(child, int64(data.offset), int64(data.offset+data.length))

Result of the test before the fix:

=== RUN   TestStructArraySlice
--- FAIL: TestStructArraySlice (0.00s)
   /arrow/go/arrow/array/struct_test.go:358: invalid string representation:
        got = "{[1.1 (null) 1.3 1.4] [1 2 (null) 4]}"
        want= "{[1.3 1.4] [(null) 4]}"
FAIL
FAIL	github.com/apache/arrow/go/arrow/array	0.043s
Error: Tests failed.

@github-actions
Copy link

@github-actions github-actions bot commented Oct 13, 2019

sbinet
sbinet approved these changes Oct 14, 2019
Copy link
Contributor

@sbinet sbinet left a comment

LGTM.

what do others think? @stuartcarnie @alexandreyc ?

@sbinet sbinet self-requested a review Oct 14, 2019
@sbinet
Copy link
Contributor

@sbinet sbinet commented Oct 14, 2019

except for the panics in some of the tests :)

@nickpoorman
Copy link
Contributor Author

@nickpoorman nickpoorman commented Oct 19, 2019

I had to update the tests and fix the String() method on the Struct type to make the tests work properly.

Before the fix, calling String() on Struct would not take into account the validity bitmap of the struct when printing the child fields.

From the documentation:

While a struct does not have physical storage for each of its semantic slots (i.e. each scalar C-like struct), an entire struct slot can be set to null via the validity bitmap. Any of the child field arrays can have null values according to their respective independent validity bitmaps. This implies that for a particular struct slot the validity bitmap for the struct array might indicate a null slot when one or more of its child arrays has a non-null value in their corresponding slot. When reading the struct array the parent validity bitmap takes priority.

@nickpoorman
Copy link
Contributor Author

@nickpoorman nickpoorman commented Oct 21, 2019

This should be good for review now. @sbinet @stuartcarnie @alexandreyc

@sbinet
Copy link
Contributor

@sbinet sbinet commented Oct 22, 2019

Apologies. I am a bit off the grid till middle of next week.

@sbinet sbinet added the lang-go label Oct 30, 2019
sbinet
sbinet approved these changes Oct 31, 2019
Copy link
Contributor

@sbinet sbinet left a comment

thanks.

waiting a bit for @stuartcarnie and/or @alexandreyc to chime in?

go/arrow/array/struct.go Outdated Show resolved Hide resolved
// with a nullBitmapBytes adjusted according on the parent struct nullBitmapBytes.
// From the docs:
// "When reading the struct array the parent validity bitmap takes priority."
func newStructFieldWithParentValidityMask(a *Struct, fieldIndex int) Interface {
Copy link
Contributor

@sbinet sbinet Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps this newStructFieldWithParentValidityMask function should actually be a method of Struct?
and, actually, shouldn't this be factored into the current func (*Struct) Field(i int) Interface method?
(this would probably mean to have an additional (caching) slice of Interface fields to not require a matching Release on the client...)

Copy link
Contributor Author

@nickpoorman nickpoorman Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about adding this to Field(i) but I decided against it because it would require unnecessary caching logic and I thought it was more in line with the way the other nested logical types implemented slicing.

func (a *List) String() string {
o := new(strings.Builder)
o.WriteString("[")
for i := 0; i < a.Len(); i++ {
if i > 0 {
o.WriteString(" ")
}
if !a.IsValid(i) {
o.WriteString("(null)")
continue
}
sub := a.newListValue(i)
fmt.Fprintf(o, "%v", sub)
sub.Release()
}
o.WriteString("]")
return o.String()
}
func (a *List) newListValue(i int) Interface {
j := i + a.array.data.offset
beg := int64(a.offsets[j])
end := int64(a.offsets[j+1])
return NewSlice(a.values, beg, end)
}

func (a *FixedSizeList) String() string {
o := new(strings.Builder)
o.WriteString("[")
for i := 0; i < a.Len(); i++ {
if i > 0 {
o.WriteString(" ")
}
if !a.IsValid(i) {
o.WriteString("(null)")
continue
}
sub := a.newListValue(i)
fmt.Fprintf(o, "%v", sub)
sub.Release()
}
o.WriteString("]")
return o.String()
}
func (a *FixedSizeList) newListValue(i int) Interface {
n := int64(a.n)
off := int64(a.array.data.offset)
beg := (off + int64(i)) * n
end := (off + int64(i+1)) * n
sli := NewSlice(a.values, beg, end)
return sli
}

Copy link
Contributor Author

@nickpoorman nickpoorman Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to a method on struct. 4cfef3c

Copy link
Contributor Author

@nickpoorman nickpoorman Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh. I think I see what you're referring to with Field(i). I guess we need to decide what the behavior of Field(i) should be. Should calling Field(i) return the masked field with an updated null bitmap or the actual underling field, which could potentially have a different null bitmap?

Copy link
Contributor Author

@nickpoorman nickpoorman Nov 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the desired behavior, but for what it's worth the Python version does not apply the struct null bitmap either.

python
Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 14:38:56) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import numpy as np
>>> pa.__version__
'0.15.0'
>>> np.__version__
'1.17.3'
>>> ty = pa.struct([('x', pa.int16()),('y', pa.bool_())])
>>> xs = pa.array([5, None, 7], type=pa.int16())
>>> ys = pa.array([False, True, True])
>>> mask = pa.array(['a', 'b', 'c'], mask=np.array([True, False, False]))
>>> mask
<pyarrow.lib.StringArray object at 0x11995e108>
[
  null,
  "b",
  "c"
]
>>> arr = pa.StructArray.from_buffers(ty, 3, mask.buffers()[0:1], children=(xs, ys))
>>> arr
<pyarrow.lib.StructArray object at 0x11995e168>
-- is_valid:
  [
    false,
    true,
    true
  ]
-- child 0 type: int16
  [
    5,
    null,
    7
  ]
-- child 1 type: bool
  [
    false,
    true,
    true
  ]
>>> arr.field(0)
<pyarrow.lib.Int16Array object at 0x11995e1c8>
[
  5,
  null,
  7
]
>>> arr.field(0).null_count
1
>>> 

Copy link
Contributor Author

@nickpoorman nickpoorman Nov 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ implementation doesn't appear to try to apply the nulls on a call to field(): https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L612

My vote would be to add a new method to Struct where the developer can choose to select a field with the nulls applied and leave the existing Field method as is.

@wesm
Copy link
Member

@wesm wesm commented Nov 8, 2019

@stuartcarnie @alexandreyc ping

@nealrichardson
Copy link
Contributor

@nealrichardson nealrichardson commented Nov 15, 2019

Merging since this has been sitting here approved for a while. Can open a followup issue if anyone finds anything objectionable.

Copy link
Contributor

@stuartcarnie stuartcarnie left a comment

Sorry for the late review; didn't see any issues scanning through 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang-go
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants