Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12045: [Go][Parquet] Initial Chunk of Parquet port to Go #9671

Closed
wants to merge 17 commits into from

Conversation

zeroshade
Copy link
Member

Based on the c++ implementation but tuned and optimized for Go, I spent the first couple months this year creating a Go implementation for Parquet with the goal of native/easy integration with the Arrow library while still being highly performant and at minimum reaching feature parity with the C++ implementation.

Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This is the initial chunk of code comprising of an internal utils directory that is analogous to the cpp/arrow/utils/ bit readers/writers/bit run readers/etc. which were ultimately used by the go implementation, while using c2goasm to reach the performance necessary for certain areas.

This is part 1 of the implementation as I chunk it up and push it out. I'll wait for each chunk to get merged before pushing the next PR in order to make sure that everything stays in sync.

CC: @emkornfield @wesm @sbinet @nickpoorman

@github-actions
Copy link

@emkornfield
Copy link
Contributor

@zeroshade would it be possible to create a new JIRA and PR for the bump in version requirement. That should likely be separate from the core parquet work?

@zeroshade
Copy link
Member Author

@emkornfield sure, that's easy. i didn't realize i was gonna have to do it until the actions were failing haha.

@zeroshade
Copy link
Member Author

@emkornfield filed as #9675

sbinet pushed a commit that referenced this pull request Mar 12, 2021
@emkornfield

spun off from #9671

Closes #9675 from zeroshade/arrow-11931

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Sebastien Binet <binet@cern.ch>
@zeroshade zeroshade force-pushed the arrow-7905-p1 branch 2 times, most recently from f898486 to d4cb9cc Compare March 12, 2021 14:52
@zeroshade
Copy link
Member Author

zeroshade commented Mar 12, 2021

@sbinet @emkornfield I've rebased this to include the changes to bump to go1.15 so now this is ready for reviews. After this gets merged, i'll push the next chunk of code. Sorry for the size of this one but it was the smallest i could easily make it as a self contained piece that doesn't rely on other chunks. It's a foundational piece while the rest of the chunks will be a lot smaller.

PERL_FIXUP_ROTATE=perl -i -pe 's/(ro[rl]\s+\w{2,3})$$/\1, 1/'
C2GOASM=c2goasm -a -f
CC=clang
C_FLAGS=-masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this limit what architectures/OS this can work for? For instance in c++ we are careful to allow compilation that don't have AVX2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the actual Go code, if the architecture doesn't have AVX2 it just falls back to a pure go implementation. More explained in the reply to your main comment.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for breaking this up into smaller chunks. It is still quite large, so I will try to take a more detailed pass but some high level things stood out to me:

  1. Documentation is pretty sparse. If this was an issue with original code, it would be good to break this anti-pattern.
  2. From a build perspective, is there a way to automatically generate the ASM files so they don't need to be checked in?
  3. From a code organization, would it make sense to separate out all the bit helpers into another package? A lot of the bit util code (not all of it) in C++ arrow is used in multiple places. This might be premature so I'm okay holding off on it.

go/parquet/internal/bmi/_lib/bitmap_bmi2.c Outdated Show resolved Hide resolved
go/parquet/internal/bmi/_lib/bitmap_bmi2.s Show resolved Hide resolved
go/parquet/internal/bmi/bitmap_bmi2.go Outdated Show resolved Hide resolved
go/parquet/internal/bmi/bitmap_bmi2.go Outdated Show resolved Hide resolved
go/parquet/internal/bmi/bmi_init.go Outdated Show resolved Hide resolved
"golang.org/x/exp/rand"
)

func RandomNonNull(dt arrow.DataType, size int) array.Interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on the name of the files, Arrow might be the better package here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so these are actually test utilities that i end up using for generating random arrow arrays for use with the unit tests for writing arrow directly to parquet inside of a pqarrow package that isn't included here. As mentioned above, by putting it in the internal package path, these cannot be imported outside of the parquet packages. If i put this into the Arrow package, then I'd need to make them exported publicly in the arrow package which i was trying to avoid doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems OK for now. It seems like the best solution would be having a "test utility package" that both Arrow and Parquet could share for tests.

bldr.Append(buf)
}
return bldr.NewArray()
// case arrow.DECIMAL:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why commented out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #9671 (comment) but long story short: these rely on a function that was exported in a later chunk of this change, so rather than delete them and then re-add it later, I just commented them out until i add that chunk. If you'd prefer these to be deleted for now and then added as part of that chunk, I can do that.

go/parquet/internal/testutils/random_arrow.go Show resolved Hide resolved
go/parquet/internal/utils/math.go Show resolved Hide resolved
go/parquet/internal/utils/rle.go Outdated Show resolved Hide resolved
@emkornfield
Copy link
Contributor

Regarding assembly, didn't realize Go was already doing this, if it is consistent with existing patterns makes sense to keep it.

@zeroshade
Copy link
Member Author

rebased branch with master, hopefully that fixes the failing check :)

Copy link
Contributor

@WilliamWhispell WilliamWhispell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About half way through

go/parquet/internal/bmi/bmi_noasm.go Show resolved Hide resolved
go/parquet/internal/testutils/random.go Show resolved Hide resolved
go/parquet/internal/testutils/random.go Outdated Show resolved Hide resolved
go/parquet/internal/utils/bit_reader.go Show resolved Hide resolved
go/parquet/internal/utils/bit_reader.go Show resolved Hide resolved
go/parquet/internal/utils/bit_reader.go Show resolved Hide resolved
go/parquet/internal/utils/bit_reader.go Show resolved Hide resolved
go/parquet/internal/utils/bit_reader.go Show resolved Hide resolved
}
}

// func TestBitRleOverflow(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why commented out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is an artifact of an earlier impl and a test i copied from the C++ impl.

Given the way I implemented the RLE by taking in an io.WriterAt, it's not the same as the c++ version which always expects a specifically sized buffer with a maximum length that it tracks. It just writes to the io.WriterAt and returns whether or not that was successful, and should fail if it can't write. So this test ends up being not worthwhile or relevant in this implementation, I had intended to potentially come back and work this is but never did. So i'll just delete it for now rather than leaving it commented out here.

// BitRunReader is an interface that is usable by multiple callers to provide
// multiple types of bit run readers such as a reverse reader and so on.
//
// It's a convenience interface for counting contiguous set/unset bits in a bitmap.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be worth commenting that in places where BitBlockCounter can be used that is apreferred method since it will be faster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i'll add that to the comment

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more questions. I mostly skimmed the logic as this is a port of the C++ code. Overall, I think it looks ok. I'll wait a couple of days to see if @sbinet or @nickpoorman have any go specific comments.

@emkornfield
Copy link
Contributor

One more small nit. Could you create a new sub-jira issue for this PR? We generally like to maintain 1-1 mapping from JIRA to PRs

@zeroshade
Copy link
Member Author

@emkornfield When you say "sub-jira issue" do you mean a sub-issue on the linked issue? Or just a new JIRA issue in general? If it's a new JIRA issue altogether, should it be under the "PARQUET" jira instead of the ARROW one?

@zeroshade zeroshade changed the title ARROW-7905: [Go][Parquet] Initial Chunk of Parquet port to Go ARROW-12045: [Go][Parquet] Initial Chunk of Parquet port to Go Mar 22, 2021
@emkornfield
Copy link
Contributor

@emkornfield When you say "sub-jira issue" do you mean a sub-issue on the linked issue?

Yes, I think this will be the cleanest (it should give you a new ID).

@zeroshade
Copy link
Member Author

@emkornfield done and updated the title of this PR. :) Also pushed changes based on your suggestions

@github-actions
Copy link

@zeroshade
Copy link
Member Author

@emkornfield Just checking in, if there are no comments by @sbinet or @nickpoorman would we be able to get this merged either today or tomorrow? I have the next chunk ready to go, it's also much smaller than this one haha.

@emkornfield
Copy link
Contributor

Yes, I'll merge tomorrow if no further comments.

emkornfield pushed a commit that referenced this pull request Apr 16, 2021
Following up from #9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling.

Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read.

Closes #9817 from zeroshade/arrow-12104

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
@emkornfield

spun off from apache#9671

Closes apache#9675 from zeroshade/arrow-11931

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Sebastien Binet <binet@cern.ch>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Based on the c++ implementation but tuned and optimized for Go, I spent the first couple months this year creating a Go implementation for Parquet with the goal of native/easy integration with the Arrow library while still being highly performant and at minimum reaching feature parity with the C++ implementation.

Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This is the initial chunk of code comprising of an internal utils directory that is analogous to the cpp/arrow/utils/ bit readers/writers/bit run readers/etc. which were ultimately used by the go implementation, while using c2goasm to reach the performance necessary for certain areas.

This is part 1 of the implementation as I chunk it up and push it out. I'll wait for each chunk to get merged before pushing the next PR in order to make sure that everything stays in sync.

CC: @emkornfield @wesm @sbinet @nickpoorman

Closes apache#9671 from zeroshade/arrow-7905-p1

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling.

Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read.

Closes apache#9817 from zeroshade/arrow-12104

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
Based on the c++ implementation but tuned and optimized for Go, I spent the first couple months this year creating a Go implementation for Parquet with the goal of native/easy integration with the Arrow library while still being highly performant and at minimum reaching feature parity with the C++ implementation.

Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This is the initial chunk of code comprising of an internal utils directory that is analogous to the cpp/arrow/utils/ bit readers/writers/bit run readers/etc. which were ultimately used by the go implementation, while using c2goasm to reach the performance necessary for certain areas.

This is part 1 of the implementation as I chunk it up and push it out. I'll wait for each chunk to get merged before pushing the next PR in order to make sure that everything stays in sync.

CC: @emkornfield @wesm @sbinet @nickpoorman

Closes apache#9671 from zeroshade/arrow-7905-p1

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling.

Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read.

Closes apache#9817 from zeroshade/arrow-12104

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
@emkornfield

spun off from apache#9671

Closes apache#9675 from zeroshade/arrow-11931

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Sebastien Binet <binet@cern.ch>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Based on the c++ implementation but tuned and optimized for Go, I spent the first couple months this year creating a Go implementation for Parquet with the goal of native/easy integration with the Arrow library while still being highly performant and at minimum reaching feature parity with the C++ implementation.

Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This is the initial chunk of code comprising of an internal utils directory that is analogous to the cpp/arrow/utils/ bit readers/writers/bit run readers/etc. which were ultimately used by the go implementation, while using c2goasm to reach the performance necessary for certain areas.

This is part 1 of the implementation as I chunk it up and push it out. I'll wait for each chunk to get merged before pushing the next PR in order to make sure that everything stays in sync.

CC: @emkornfield @wesm @sbinet @nickpoorman

Closes apache#9671 from zeroshade/arrow-7905-p1

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling.

Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read.

Closes apache#9817 from zeroshade/arrow-12104

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
@zeroshade zeroshade deleted the arrow-7905-p1 branch September 12, 2021 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants