Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13646: [Go][Parquet] adding the parquet metadata package #10951

Closed
wants to merge 7 commits into from

Conversation

zeroshade
Copy link
Member

Here's the next chunk of code following the merging of #10716

Thankfully the metadata package is actually relatively small compared to everything else so far.

@emkornfield @sbinet @nickpoorman for visibility

@github-actions
Copy link

Copy link
Contributor

@nickpoorman nickpoorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job. Left a few low priority comments.

go/parquet/metadata/column_chunk.go Outdated Show resolved Hide resolved
return nil
}

// WriteTo will always return 0 as the int64 since the thrift writer library
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wonder if we should keep track of the bytes or explicit return a -1. Is there precedent for this elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in internal/utils/write_utils.go I do have what i called a TellWrapper which wrapped an io.Writer in order to keep track of the bytes written so that it could add a Tell() function to tell the current position which I use elsewhere.

The main reason why I didn't do something like that here was that I didn't want every call to WriteTo to create an object wrapping the io.Writer in order to count the bytes written. It's just unfortunate that the current thrift interface for a TStruct has a Write function but doesn't return the number of bytes it wrote.

I don't remember there being anywhere else that I do this so you might be right that it makes sense to explicitly return a -1 here instead of a 0. Personally as long as the comment documents what is being returned, I don't much mind whether it's a 0 or a -1 as long as we're explicit about it.

go/parquet/metadata/file.go Outdated Show resolved Hide resolved
@zeroshade
Copy link
Member Author

Bump @emkornfield when you get a chance.

Thanks!

Also, @nickpoorman I've implemented the suggested comment, removed the old comment, and replied to your question when you have a chance. I'm curious your thoughts on my response. Thanks!

@zeroshade
Copy link
Member Author

Bump @emkornfield @nickpoorman Just trying to get eyes on this so i can get it merged and finally get the File package up here :)

@emkornfield
Copy link
Contributor

Sorry for the delay this is on my todo list for this week.

// LessThan compares the app versions and returns true if this version
// is "less than" the passed version.
//
// If the apps don't match, this always returns false. Otherwise it compares
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be ternary? the contract seems strange?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's effectively a < operator, not a <=.

What seems strange about the contract?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comparison of apps, I would think we would want to sort those also just so ordering is well defined in all scenarios.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the idea was to make it easier to compare the versions themselves rather than just full regular ordering. it's intended that the ordering is only well defined when the apps are the same i think.

//
// Reference: parquet-cpp/src/parquet/metadata.cc
//
// PARQUET-686 has more discussion on statistics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't done this elsewhere but it isn't the only statistics bug from he C++ perspective. Decimal logical types comparisons for C++ where broken until recently and null statistics for list types are still broken.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added check for parquet-cpp-arrow version 4.0.0 to check decimal stats etc.


thriftEncodings = append(thriftEncodings, format.Encoding(parquet.Encodings.RLE))
// Only PLAIN encoding is supported for fallback in V1
// TODO(zeroshade): Use user specified encoding for V2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: #11031 recently deprecated "V2" as a standalone concept and now reference minor versions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking at that PR, it looks like the actual Parquet Format still only has the two values for version being 1 and 2 (with 2 now referring to "most recent version 2") and there's no way to actually write the minor version to the parquet file? Do I have that right? So the "fine grained" version handling is only for specific feature selection of writing, but they all still just write "2" to the parquet file metadata, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is my understanding.

createdBy := f.props.CreatedBy()
f.metadata.CreatedBy = &createdBy

// Users cannot set the `ColumnOrder` since we donot not have user defined sort order
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Users cannot set the `ColumnOrder` since we donot not have user defined sort order
// Users cannot set the `ColumnOrder` since we do not not have user defined sort order

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment copied from somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

fileEncProps := f.props.FileEncryptionProperties()
if fileEncProps != nil && !fileEncProps.EncryptedFooter() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to allow nil fileEncProps (can defaults always indicate no encryption?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, Go doesn't allow default values for function params like C++. The standard pattern would be to have a separate function for the case with encryption or not or to use the options pattern. I'll take a look and see which pattern i like better for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, i misunderstood the question here.

The defaults are that the FileEncryptionProperties() are nil which indicates no encryption.

fileEncProps := f.props.FileEncryptionProperties()
if fileEncProps != nil && !fileEncProps.EncryptedFooter() {
var signingAlgo parquet.Algorithm
algo := fileEncProps.Algorithm()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i find fileEncProps a little too short because there is both encoding and encryption in parquet

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to encryptProps


out := &FileMetaData{
FileMetaData: f.metadata,
version: NewAppVersion(f.metadata.GetCreatedBy()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just want to make sure this references go someplace by default?

Copy link
Member Author

@zeroshade zeroshade Sep 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the top of go/parquet/writer_properties.go there is a constant DefaultCreatedBy which references parquet-go which is the default value for the CreatedBy string in the metadata unless a user overrides it in their writer properties.


out.RowGroups = make([]*format.RowGroup, 0, len(rowGroups))
for _, selected := range rowGroups {
out.RowGroups = append(out.RowGroups, f.RowGroups[selected])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it pay to make this a map for upstream consumers (I don't think it is done in other implementations just wondering if it is important).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a map instead of a slice? Not sure what benefit a map would provide here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would keep the indices of the RowGroups selected.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha. So, in this particular case, it's creating a new FileMetaData with the selected subset of RowGroups, I think that it would be surprising to a consumer if the RowGroups in the subset weren't in the order that was provided as the argument (allowing for the rowgroups in the subset to be in a different order than the parent). Thus RowGroup(0) of the new FileMetaData that is returned, should be the row group at index rowGroups[0] of the parent File Metadata. It might make sense to keep a map in addition to the slice, but right now I don't believe it keeps any reference to the parent FileMetaData that it is a subset of, so I'm fine with seeing if that actually gets requested as a feature, rather than doing it now.

panic("decryption not set propertly, cannot verify signature")
}

serializer := thrift.NewThriftSerializer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggershinsky not sure if you want to look at cryptographic related functions here at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A massive one, but overall looks well structured and in line with the Java and C++ implementations. The only way to verify the crypto code is to try to read encrypted files produced by the Java implementation, and to make the latter read encrypted files produced by this implementation.
(C++ version might be also ok for this purpose, but the Java one is more actively maintained).
Thanks @zeroshade and @emkornfield .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggershinsky I used the files here: https://github.com/apache/parquet-testing to test the encryption stuff the same way that I believe the C++ does its testing of the encrypted reading/writing. After this gets merged and I add the file module which has all of the direct file handling, I can add back the tests which did this verification.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, makes sense. these files were written by C++ encryption code, but were tested against the Java version too, so should be fine.

assert.NoError(t, err)
assertStatsSet(t, rg1Col1)
assertStatsSet(t, rg1Col2)
assert.Equal(t, statsInt.Min, assertStats(t, rg1Col1).EncodeMin())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its a little hard to grok/review if all of these tests are correct. I'm glad there are tests, but I'm wondering if there might be a better way to express the tests so that is it easier to verify the assertions are what we would expect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually copied these test cases from the C++ implementation (though that was back in january). I'll see if i can make the tests more obvious as to what they are testing.

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, I'd like to verify some old bugs aren't being propagated here.

@zeroshade
Copy link
Member Author

@emkornfield Okay, i've either addressed or asked for clarification on pretty much all your comments I think. Let me know if I missed anything :)

@zeroshade
Copy link
Member Author

bump @emkornfield

@emkornfield
Copy link
Contributor

I think a few more doc update and we can merge and come back to any remaining issues.

@zeroshade
Copy link
Member Author

updated comments as discussed.

@asfgit asfgit closed this in f2cb977 Sep 13, 2021
@zeroshade zeroshade deleted the parquet-metadata-package branch September 27, 2021 17:13
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Here's the next chunk of code following the merging of apache#10716

Thankfully the metadata package is actually relatively small compared to everything else so far.

@emkornfield @sbinet @nickpoorman for visibility

Closes apache#10951 from zeroshade/parquet-metadata-package

Authored-by: Matthew Topol <mtopol@factset.com>
Signed-off-by: Matthew Topol <mtopol@factset.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants