ARROW-17275: [Go][Integration] Handle Large offset types in IPC read/write #13770

Merged — 13 commits merged into apache:master from large-offsets-ipc on Aug 3, 2022

Conversation

zeroshade (Member)

No description provided.


pitrou (Member) left a comment:

Thanks for doing this! I'm not a Go developer, but I left a bunch of comments which I hope are not too far off...

@@ -77,7 +77,7 @@ Data Types
+-------------------+-------+-------+-------+------------+-------+-------+-------+
| List              |   ✓   |   ✓   |   ✓   |     ✓      |   ✓   |   ✓   |   ✓   |
+-------------------+-------+-------+-------+------------+-------+-------+-------+
| Large List        |   ✓   |   ✓   |       |            |       |   ✓   |   ✓   |
pitrou (Member):

Is the PR title incorrect? This is updating the compatibility matrix for Large List support, not Large String and Large Binary which are already ticked for Go.

zeroshade (Member, Author):

You're absolutely right that I need to rename the PR (I ended up tying the IPC fixes to my other changes adding the LargeList type, so I just incorporated them). Also, I honestly don't know why Large String and Large Binary were ticked; they weren't supported until this change.

	}
	return bldr.NewArray()
case arrow.LARGE_LIST:
	valuesSize := size * 8
pitrou (Member):

Hmm... so, the way I understand this code, size is the logical length of the list array and valuesSize the logical length of the child list array, right?

Meaning that the multiplier here is arbitrary and you could have e.g. valuesSize := size * 13 or something?

zeroshade (Member, Author):

Yeah, I'm doing * 8 here since it's expected to be 64-bit integers for the test.

pitrou (Member):

Well, I understand that cts.largeOffsets generates size 64-bit offsets between 0 and valuesSize (inclusive).
But why should the number of values in the child array be 8 times the number of list offsets?

zeroshade (Member, Author):

size is the number of elements in the list array (which is why cts.largeOffsets outputs a slice with size+1 elements, since there is one more offset than the number of elements). I multiplied by 8 here just to have a larger child array than I was using with the initial List cases, so the lengths of the list elements will be larger on average than in the previous case.

TL;DR: size is the number of list elements (and size+1 is the number of offsets). valuesSize is the total length of the child array, which gets sliced up into those list elements randomly.
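
For readers following along, here is a minimal sketch of what a generator like cts.largeOffsets might look like, given the invariants described above (size+1 non-decreasing int64 offsets ending at valuesSize). The function body is an assumption for illustration, not the test's actual code:

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// largeOffsets sketches the invariant discussed above: it returns
// size+1 non-decreasing int64 offsets in [0, valuesSize], so slicing
// the child array at consecutive offsets yields `size` randomly sized
// list elements. (Illustrative only; not the test's actual code.)
func largeOffsets(size, valuesSize int64) []int64 {
	offsets := make([]int64, size+1) // offsets[0] stays 0
	for i := int64(1); i < size; i++ {
		offsets[i] = rand.Int63n(valuesSize + 1)
	}
	offsets[size] = valuesSize // the last offset is the child array length
	// Sorting guarantees every list element has a non-negative length.
	sort.Slice(offsets, func(i, j int) bool { return offsets[i] < offsets[j] })
	return offsets
}

func main() {
	size := int64(4)
	valuesSize := size * 8 // a child array 8x the list length, as in the test
	fmt.Println(largeOffsets(size, valuesSize))
}
```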

-	offsets *Int32Builder
+	offsets Builder

 	dt arrow.DataType
pitrou (Member):

Is it the actual list type or the offset type? (perhaps add a comment?)

zeroshade (Member, Author):

dt is the actual list type; there's a separate etype member which is the element type of the list.

The offset type (int32 vs int64) doesn't matter for the base listBuilder object and is hidden behind the Builder interface, cast when necessary to either an Int32Builder or Int64Builder as appropriate. I'll add a comment.
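
To make that concrete, here is a rough sketch of the shape being described; field names besides offsets and dt are assumptions, and the snippet glosses over the rest of the builder:

```go
package array // sketch: written as if inside the arrow/array package

import "github.com/apache/arrow/go/v9/arrow" // module path varies by version

// listBuilder (sketch) holds its offsets builder behind the Builder
// interface, so the base type is independent of offset width.
type listBuilder struct {
	offsets Builder        // concretely an *Int32Builder or *Int64Builder
	dt      arrow.DataType // the list type itself (List or LargeList)
	etype   arrow.DataType // element type of the list
	values  Builder        // builder for the child (values) array
}

// appendOffsetVal recovers the concrete offset builder with a type
// assertion, so the rest of the code never needs to know the width.
func (b *listBuilder) appendOffsetVal(v int) {
	switch o := b.offsets.(type) {
	case *Int32Builder:
		o.Append(int32(v))
	case *Int64Builder:
		o.Append(int64(v))
	}
}
```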

Resolved review thread (outdated): go/arrow/array/list.go
-func (b *ListBuilder) appendNextOffset() {
-	b.offsets.Append(int32(b.values.Len()))
+func (b *listBuilder) appendNextOffset() {
+	b.appendOffsetVal(b.values.Len())
pitrou (Member):

So this is going through a function pointer and/or vtable, I assume? I don't know how optimized you expect builders to be, but is some inlining automatically done by the compiler here when the derived listBuilder type is known?

zeroshade (Member, Author):

Yeah, right now it's going through a function pointer. I don't know offhand whether the Go compiler would be able to inline through it (though I can take a look at the generated assembly and see); my initial assumption is that it won't, and will go through the function pointer each time.

It was a decision I made to allow the code reuse at the expense of some performance in the builder.

There are a lot of cases where I would prefer to use Go's generics (like this one), but we're trying to maintain compatibility for Go versions latest-1 (i.e. go1.17+), so I don't want to introduce usages of generics until Go 1.19 is released.
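
As a purely hypothetical illustration of the generics alternative mentioned here (not code from this PR), the offset width could become a type parameter, letting the compiler specialize and inline the append:

```go
package main

import "fmt"

// Offset constrains the two supported offset widths.
type Offset interface{ int32 | int64 }

// offsetBuilder is a hypothetical generic builder: the offset width is
// fixed at compile time, so there is no function-pointer dispatch.
type offsetBuilder[T Offset] struct {
	offsets   []T
	valuesLen int // stand-in for b.values.Len() on the child builder
}

func (b *offsetBuilder[T]) appendNextOffset() {
	// Direct append; the compiler can inline this at each instantiation.
	b.offsets = append(b.offsets, T(b.valuesLen))
}

func main() {
	var small offsetBuilder[int32]
	var large offsetBuilder[int64]
	small.appendNextOffset()
	large.appendNextOffset()
	fmt.Println(small.offsets, large.offsets) // [0] [0]
}
```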

Resolved review threads: go/arrow/cdata/cdata.go, go/arrow/ipc/writer.go
@@ -16,6 +16,10 @@

 package arrow

+type OffsetTraits interface {
+	BytesRequired(int) int
pitrou (Member):

Should this get a docstring?

pitrou (Member):

Also, does this need to take and return int64 instead?

zeroshade (Member, Author):

Currently all of the *Traits objects (defined in the type_traits_*.go files) take an int and return an int. On 64-bit machines this is a 64-bit integer, and on 32-bit machines a 32-bit integer.

I didn't want to change those objects (and risk breaking anyone who was using things like arrow.TimestampTraits.BytesRequired(n)). Instead I created an interface which matches those objects so the data type objects can return them. I'll add a docstring comment, but I don't want to change the signature to explicitly use int64, as that would be a significant, potentially breaking, change.
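
For illustration, a docstring along these lines would document the interface while keeping the existing int signature; the exact wording here is an assumption, not necessarily what landed in the PR:

```go
// OffsetTraits is a convenience interface satisfied by the existing
// offset-width traits objects (e.g. arrow.Int32Traits, arrow.Int64Traits),
// letting binary-like and list-like data types expose their offset
// traits uniformly.
type OffsetTraits interface {
	// BytesRequired returns the number of bytes required to store the
	// given number of offsets of this trait's width. It takes and
	// returns int to match the existing *Traits objects.
	BytesRequired(int) int
}
```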

zeroshade changed the title from "ARROW-17275: [Go][Integration] Handle LargeString/LargeBinary in IPC and integration tests" to "ARROW-17275: [Go][Integration] Handle Large offset types in IPC read/write" on Aug 2, 2022
zeroshade (Member, Author):

Thanks for the thorough review @pitrou! I've pushed a bunch of refactoring and simplifying changes based on your suggestions, leveraging some new interfaces. A side benefit is that the new interfaces will also make it easier for consumers to use the new Array types interchangeably.

zeroshade (Member, Author):

Also, the failed CI seems to be an issue with Python linting... does anyone know if there's already a Jira issue to address that?

pitrou (Member) commented Aug 2, 2022:

> Also, the failed CI seems to be an issue with Python linting... does anyone know if there's already a Jira issue to address that?

cc @raulcd

zeroshade (Member, Author):

The linting issue is being fixed by #13778.

pitrou (Member) left a comment:

LGTM. I don't know if a Go developer wants to review? @raceordie690 @wolfeidau

zeroshade (Member, Author):

Thanks @pitrou!

I'd love it if another Go developer gave a review. I'll let this sit for another day to give people a chance, and if there are no comments I'll merge it.

wolfeidau (Contributor) left a comment:

I had a read over the code; I'm not an expert on the internals of Arrow, but nothing struck me as out of place.

👍🏻 nice work

Resolved review thread (outdated): go/arrow/array/list.go
zeroshade (Member, Author) commented Aug 3, 2022:

@raulcd

The Python failure here doesn't appear related to this change at all:

=================================== FAILURES ===================================
___________________ [doctest] pyarrow.parquet.read_metadata ____________________
3406 
3407     Examples
3408     --------
3409     >>> import pyarrow as pa
3410     >>> import pyarrow.parquet as pq
3411     >>> table = pa.table({'n_legs': [4, 5, 100],
3412     ...                   'animal': ["Dog", "Brittle stars", "Centipede"]})
3413     >>> pq.write_table(table, 'example.parquet')
3414 
3415     >>> pq.read_metadata('example.parquet')
Differences (unified diff with -expected +actual):
    @@ -1,7 +1,7 @@
    -<pyarrow._parquet.FileMetaData object at ...>
    -  created_by: parquet-cpp-arrow version ...
    +<pyarrow._parquet.FileMetaData object at 0x7f3c48eb4750>
    +  created_by: parquet-cpp-arrow version 10.0.0-SNAPSHOT
       num_columns: 2
       num_rows: 3
       num_row_groups: 1
       format_version: 2.6
    -  serialized_size: 561
    +  serialized_size: 562

zeroshade (Member, Author):

Thanks for the look over, @wolfeidau!

zeroshade merged commit db6c099 into apache:master on Aug 3, 2022
zeroshade deleted the large-offsets-ipc branch on August 3, 2022 at 19:43
ursabot commented Aug 3, 2022:

Benchmark runs are scheduled for baseline = 3b987d9 and contender = db6c099. db6c099 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.34% ⬆️0.0%] test-mac-arm
[Finished ⬇️6.25% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] db6c099b ec2-t3-xlarge-us-east-2
[Failed] db6c099b test-mac-arm
[Finished] db6c099b ursa-i9-9960x
[Finished] db6c099b ursa-thinkcentre-m75q
[Failed] 3b987d92 ec2-t3-xlarge-us-east-2
[Finished] 3b987d92 test-mac-arm
[Finished] 3b987d92 ursa-i9-9960x
[Finished] 3b987d92 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented Aug 3, 2022:

['Python', 'R'] benchmarks have a high level of regressions: ursa-i9-9960x
