Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go] ValueOffset of array.Binary panics on last index #38458

Closed
frbvianna opened this issue Oct 25, 2023 · 9 comments · Fixed by #39242
Closed

[Go] ValueOffset of array.Binary panics on last index #38458

frbvianna opened this issue Oct 25, 2023 · 9 comments · Fixed by #39242
Assignees
Milestone

Comments

@frbvianna
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

As detailed in #37976 (comment), we are using ValueOffset64 method of binary-like variable width types (string, binary) to calculate the offset differences and determine all individual value byte widths.

For an Arrow Record of e.g. 10 rows with an array.Binary field, a panic will occur on ValueOffset(10) (last index, matching the data slice length):

if i < 0 || i >= a.array.data.length {
panic("arrow/array: index out of range")
}

This seems to be due to use of >= instead of >. Apparently the same applies for array.LargeBinary fields.

For comparison, array.String fields do not panic when the index input matches the data length:

if i < 0 || i > a.array.data.length {
panic("arrow/array: index out of range")
}

Component(s)

Go

@polestar1988
Copy link

@frbvianna did you get answer regards to this issue? I have the same error.

@frbvianna
Copy link
Author

@polestar1988 No answers yet, but validated in local code that the solution is indeed to change >= for >.

@polestar1988
Copy link

I also handled it in my code instead of using func (a *Binary) ValueOffset(i int) :

if recordNumber < 0 || recordNumber > colBinary.Data().Len() {
	slog.Error("index out of range", "calculateSize", "Error")
	return 0
}
end := int(colBinary.ValueOffsets()[colBinary.Data().Offset()+recordNumber+1])
start := int(colBinary.ValueOffsets()[colBinary.Data().Offset()+recordNumber])
elementSize = end - start
....

@frbvianna
Copy link
Author

Thanks @polestar1988, that is one alternative.

In my case (#37976 (comment)), I was trying to make use of array.BinaryLike interface to simplify this handling, but that interface is constrained to ValueOffset64(int) int64 so I would not be able to access the raw ValueOffsets from it.
I also find it unsettling that ValueOffset/ValueOffset64 take in int row index parameters whereas slicing methods take in int64.
@zeroshade can you please confirm if these are intended?

@zeroshade
Copy link
Member

So, at minimum this should be fixed to be consistent between the Binary and String types, it was not intended to be inconsistent.

That all said: I believe the intent of the ValueOffset method is to provide the starting offset of the value at index i, which means that the Binary type is correct in having the >= there, because there's no value at an index equal to the length of the slice/array to get the starting offset for. Instead you'd use ValueLen, or ValueOffsets and grab the last value or something like that.

If you're trying to get the element size, there's already a ValueLen method which does that for you. If you're trying to get the actual value, the Value method returns a slice with the value without copying.

So, I believe the optimal solution here would be to expose the ValueLen method in the BinaryLike interface so that you do not need to try to calculate these byte widths yourself manually and can instead just call that method.

What do you both think?

I also find it unsettling that ValueOffset/ValueOffset64 take in int row index parameters whereas slicing methods take in int64

This was done for consistency as most of the existing row index params in the array types use int rather than an explicit int64 (likely for convenience). In theory this would only affect someone if they were running on a 32-bit machine but had enough rows that it would not fit in an int32.

@frbvianna
Copy link
Author

frbvianna commented Oct 30, 2023

So, I believe the optimal solution here would be to expose the ValueLen method in the BinaryLike interface so that you do not need to try to calculate these byte widths yourself manually and can instead just call that method.

As long as it is ensured to work for the last possible offset / record index, this seems like a reasonable solution to me. It would simplify things for us as well.

@frbvianna
Copy link
Author

If you're trying to get the element size, there's already a ValueLen method which does that for you. If you're trying to get the actual value, the Value method returns a slice with the value without copying.

Just to note here, using ValueLen directly on the array.Binary also panics on the last row index, it also checks for >= on data slice. So exposing the ValueLen in the BinaryLike interface alone does not seem to be a solution, the boundaries need to be fixed in addition to that as well, what was suggested in #38458 (comment) seems to be the only way to get the value length for the last element in a binary column right now.

@zeroshade
Copy link
Member

Sorry for the delay here, I've been at a conference all last week, and an off-site for the week before that.

Just to note here, using ValueLen directly on the array.Binary also panics on the last row index, it also checks for >= on data slice.

This doesn't quite make sense to me. ValueLen should use >= just like Value does, because the indices are 0-based. ValueLen as it currently is implemented will correctly return the length for the last element in a binary column, and this is verified in the unit tests in binary_test.go via TestBinaryValueLen. You should never be calling ValueLen(col.Len()) because it's 0-based, you would call ValueLen(col.Len()-1) to get the length of the last element, which would pass the check for >= a.data.length.

Am I missing something? Can you provide some sample code that calls ValueLen and panics on the last row index that I could test?

Otherwise, I'm gonna put together a PR to add ValueLen to the interface

@frbvianna
Copy link
Author

@zeroshade, sorry for responding so late, I just checked again that I was mistaken and probably tried to use the ValueLen starting from 1 instead of 0. So just adding it to the interface like you mentioned should help. Thank you for looking into this.

zeroshade added a commit to zeroshade/arrow that referenced this issue Dec 15, 2023
zeroshade added a commit that referenced this issue Jan 3, 2024
### Rationale for this change
Adding `ValueLen` to the `BinaryLike` interface for easy convenience of determining the length of an individual value for a Binary/String like array.

### Are these changes tested?
yes

* Closes: #38458

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
@zeroshade zeroshade added this to the 15.0.0 milestone Jan 3, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
### Rationale for this change
Adding `ValueLen` to the `BinaryLike` interface for easy convenience of determining the length of an individual value for a Binary/String like array.

### Are these changes tested?
yes

* Closes: apache#38458

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
### Rationale for this change
Adding `ValueLen` to the `BinaryLike` interface for easy convenience of determining the length of an individual value for a Binary/String like array.

### Are these changes tested?
yes

* Closes: apache#38458

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
### Rationale for this change
Adding `ValueLen` to the `BinaryLike` interface for easy convenience of determining the length of an individual value for a Binary/String like array.

### Are these changes tested?
yes

* Closes: apache#38458

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants