-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12045: [Go][Parquet] Initial Chunk of Parquet port to Go #9671
Conversation
@zeroshade would it be possible to create a new JIRA and PR for the bump in version requirement. That should likely be separate from the core parquet work? |
@emkornfield sure, that's easy. i didn't realize i was gonna have to do it until the actions were failing haha. |
@emkornfield filed as #9675 |
@emkornfield spun off from #9671 Closes #9675 from zeroshade/arrow-11931 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Sebastien Binet <binet@cern.ch>
f898486
to
d4cb9cc
Compare
@sbinet @emkornfield I've rebased this to include the changes to bump to go1.15 so now this is ready for reviews. After this gets merged, i'll push the next chunk of code. Sorry for the size of this one but it was the smallest i could easily make it as a self contained piece that doesn't rely on other chunks. It's a foundational piece while the rest of the chunks will be a lot smaller. |
PERL_FIXUP_ROTATE=perl -i -pe 's/(ro[rl]\s+\w{2,3})$$/\1, 1/' | ||
C2GOASM=c2goasm -a -f | ||
CC=clang | ||
C_FLAGS=-masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this limit what architectures/OS this can work for? For instance in c++ we are careful to allow compilation that don't have AVX2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the actual Go code, if the architecture doesn't have AVX2 it just falls back to a pure go implementation. More explained in the reply to your main comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for breaking this up into smaller chunks. It is still quite large, so I will try to take a more detailed pass but some high level things stood out to me:
- Documentation is pretty sparse. If this was an issue with original code, it would be good to break this anti-pattern.
- From a build perspective, is there a way to automatically generate the ASM files so they don't need to be checked in?
- From a code organization, would it make sense to separate out all the bit helpers into another package? A lot of the bit util code (not all of it) in C++ arrow is used in multiple places. This might be premature so I'm okay holding off on it.
"golang.org/x/exp/rand" | ||
) | ||
|
||
func RandomNonNull(dt arrow.DataType, size int) array.Interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on the name of the files, Arrow might be the better package here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so these are actually test utilities that i end up using for generating random arrow arrays for use with the unit tests for writing arrow directly to parquet inside of a pqarrow
package that isn't included here. As mentioned above, by putting it in the internal
package path, these cannot be imported outside of the parquet packages. If i put this into the Arrow package, then I'd need to make them exported publicly in the arrow package which i was trying to avoid doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems OK for now. It seems like the best solution would be having a "test utility package" that both Arrow and Parquet could share for tests.
bldr.Append(buf) | ||
} | ||
return bldr.NewArray() | ||
// case arrow.DECIMAL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #9671 (comment) but long story short: these rely on a function that was exported in a later chunk of this change, so rather than delete them and then re-add it later, I just commented them out until i add that chunk. If you'd prefer these to be deleted for now and then added as part of that chunk, I can do that.
Regarding assembly, didn't realize Go was already doing this, if it is consistent with existing patterns makes sense to keep it. |
df05d6d
to
545f7a8
Compare
rebased branch with master, hopefully that fixes the failing check :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About half way through
} | ||
} | ||
|
||
// func TestBitRleOverflow(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is an artifact of an earlier impl and a test i copied from the C++ impl.
Given the way I implemented the RLE by taking in an io.WriterAt, it's not the same as the c++ version which always expects a specifically sized buffer with a maximum length that it tracks. It just writes to the io.WriterAt
and returns whether or not that was successful, and should fail if it can't write. So this test ends up being not worthwhile or relevant in this implementation, I had intended to potentially come back and work this is but never did. So i'll just delete it for now rather than leaving it commented out here.
// BitRunReader is an interface that is usable by multiple callers to provide | ||
// multiple types of bit run readers such as a reverse reader and so on. | ||
// | ||
// It's a convenience interface for counting contiguous set/unset bits in a bitmap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be worth commenting that in places where BitBlockCounter can be used that is apreferred method since it will be faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, i'll add that to the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few more questions. I mostly skimmed the logic as this is a port of the C++ code. Overall, I think it looks ok. I'll wait a couple of days to see if @sbinet or @nickpoorman have any go specific comments.
One more small nit. Could you create a new sub-jira issue for this PR? We generally like to maintain 1-1 mapping from JIRA to PRs |
@emkornfield When you say "sub-jira issue" do you mean a sub-issue on the linked issue? Or just a new JIRA issue in general? If it's a new JIRA issue altogether, should it be under the "PARQUET" jira instead of the ARROW one? |
Yes, I think this will be the cleanest (it should give you a new ID). |
@emkornfield done and updated the title of this PR. :) Also pushed changes based on your suggestions |
@emkornfield Just checking in, if there are no comments by @sbinet or @nickpoorman would we be able to get this merged either today or tomorrow? I have the next chunk ready to go, it's also much smaller than this one haha. |
Yes, I'll merge tomorrow if no further comments. |
Following up from #9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling. Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read. Closes #9817 from zeroshade/arrow-12104 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
@emkornfield spun off from apache#9671 Closes apache#9675 from zeroshade/arrow-11931 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Sebastien Binet <binet@cern.ch>
Based on the c++ implementation but tuned and optimized for Go, I spent the first couple months this year creating a Go implementation for Parquet with the goal of native/easy integration with the Arrow library while still being highly performant and at minimum reaching feature parity with the C++ implementation. Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This is the initial chunk of code comprising of an internal utils directory that is analogous to the cpp/arrow/utils/ bit readers/writers/bit run readers/etc. which were ultimately used by the go implementation, while using c2goasm to reach the performance necessary for certain areas. This is part 1 of the implementation as I chunk it up and push it out. I'll wait for each chunk to get merged before pushing the next PR in order to make sure that everything stays in sync. CC: @emkornfield @wesm @sbinet @nickpoorman Closes apache#9671 from zeroshade/arrow-7905-p1 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling. Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read. Closes apache#9817 from zeroshade/arrow-12104 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Based on the c++ implementation but tuned and optimized for Go, I spent the first couple months this year creating a Go implementation for Parquet with the goal of native/easy integration with the Arrow library while still being highly performant and at minimum reaching feature parity with the C++ implementation. Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This is the initial chunk of code comprising of an internal utils directory that is analogous to the cpp/arrow/utils/ bit readers/writers/bit run readers/etc. which were ultimately used by the go implementation, while using c2goasm to reach the performance necessary for certain areas. This is part 1 of the implementation as I chunk it up and push it out. I'll wait for each chunk to get merged before pushing the next PR in order to make sure that everything stays in sync. CC: @emkornfield @wesm @sbinet @nickpoorman Closes apache#9671 from zeroshade/arrow-7905-p1 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling. Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read. Closes apache#9817 from zeroshade/arrow-12104 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
@emkornfield spun off from apache#9671 Closes apache#9675 from zeroshade/arrow-11931 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Sebastien Binet <binet@cern.ch>
Based on the c++ implementation but tuned and optimized for Go, I spent the first couple months this year creating a Go implementation for Parquet with the goal of native/easy integration with the Arrow library while still being highly performant and at minimum reaching feature parity with the C++ implementation. Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This is the initial chunk of code comprising of an internal utils directory that is analogous to the cpp/arrow/utils/ bit readers/writers/bit run readers/etc. which were ultimately used by the go implementation, while using c2goasm to reach the performance necessary for certain areas. This is part 1 of the implementation as I chunk it up and push it out. I'll wait for each chunk to get merged before pushing the next PR in order to make sure that everything stays in sync. CC: @emkornfield @wesm @sbinet @nickpoorman Closes apache#9671 from zeroshade/arrow-7905-p1 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Following up from apache#9671 this is the next chunk of ported code consisting of the generated Thrift Code and the utilities for supporting Encryption, Compression and Reader/Writer Property handling. Thankfully this is much smaller than the previous chunk, and so should be much easier to review and read. Closes apache#9817 from zeroshade/arrow-12104 Authored-by: Matthew Topol <mtopol@factset.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>
Based on the c++ implementation but tuned and optimized for Go, I spent the first couple months this year creating a Go implementation for Parquet with the goal of native/easy integration with the Arrow library while still being highly performant and at minimum reaching feature parity with the C++ implementation.
Based on the conversations on the JIRA card, rather than dumping a huge code bomb (there's a ton), I've chunked it up. This is the initial chunk of code comprising of an internal utils directory that is analogous to the cpp/arrow/utils/ bit readers/writers/bit run readers/etc. which were ultimately used by the go implementation, while using c2goasm to reach the performance necessary for certain areas.
This is part 1 of the implementation as I chunk it up and push it out. I'll wait for each chunk to get merged before pushing the next PR in order to make sure that everything stays in sync.
CC: @emkornfield @wesm @sbinet @nickpoorman