PARQUET-686: Allow for Unsigned Statistics in Binary Type #362
Conversation
Is the consequence of this change that MAX and MIN are now correct? Can you add unit tests that illustrate the case that did not work before? Please create a Parquet JIRA for this (https://issues.apache.org/jira/browse/PARQUET). The bug seems more related to comparing unsigned bytes than to actual UTF-8-based comparison (which would have to take into account multi-byte characters), so possibly just change the comparison of all binary types. However, we need to be careful about backward compatibility here. Another possibility is to create a separate stats field, but it seems the current min/max are wrong in some situations (bytes > 127).
If you assume binary types are strings of nonnegative bytes then this is the right comparison. I can go ahead and add unit tests, but I would be interested in knowing if this is the direction we want to take.
for (int i = 0; i < min_length; i++) {
  int value1 = array1.get(i + offset1) & 0xFF;
  int value2 = array2.get(i + offset2) & 0xFF;
  if (value1 < value2) {
You could simplify this:

if (value1 != value2) {
  return value1 - value2;
}

P.S. Looking at the Avro and Spark comparisons, the return is flipped from what you have (they return a positive value if value1 > value2).
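For reference, a minimal sketch of the full helper with that simplification applied. This is illustrative only, not the parquet-mr source; ByteBuffer stands in for the buffer type used in the diff above.

import java.nio.ByteBuffer;

final class UnsignedComparison {
  // Unsigned lexicographic comparison over two byte ranges, with the
  // suggested simplification applied.
  static int compare(ByteBuffer array1, int offset1,
                     ByteBuffer array2, int offset2, int minLength) {
    for (int i = 0; i < minLength; i++) {
      int value1 = array1.get(i + offset1) & 0xFF; // mask signed byte to [0, 255]
      int value2 = array2.get(i + offset2) & 0xFF;
      if (value1 != value2) {
        return value1 - value2; // no overflow: both operands are in [0, 255]
      }
    }
    return 0; // equal over the compared prefix; caller breaks ties on length
  }
}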
Yeah, I actually was trying to make this logic as close to the existing logic as possible.
The sign is flipped because BinaryStatistics uses compareTo opposite of the way most Java code does, though strictly speaking nothing's wrong as long as everything is consistent, which it appears to be in this case.
If I simplify these statements, should I go ahead and make that change to all the compare* methods in this project?
*file, not project
Yeah this is confusing. Seems like BinaryStatistics is behaving like normal Java code:
public void updateStats(Binary min_value, Binary max_value) {
  if (min.compareTo(min_value) > 0) { min = min_value.copy(); }
  if (max.compareTo(max_value) < 0) { max = max_value.copy(); }
}
Comparable's contract is to return "a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified object."
Seems like the flip of lhs and rhs happens when you call binary1.compareTo(binary2): the code there does other.compareTo(value, 0, value.length).
So yeah, I think you've got the sign set up correctly.
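A minimal sketch of the flip being described, assuming the delegation pattern quoted above (illustrative only, not the actual Binary source):

// compareTo(Binary) delegates to the *other* instance's byte-level
// comparison, so "this" and "other" swap sides and the sign of the
// result is inverted relative to a direct comparison.
abstract class Binary implements Comparable<Binary> {
  protected byte[] value;

  @Override
  public int compareTo(Binary other) {
    // Arguments reversed: other compares itself against this.value.
    return other.compareTo(value, 0, value.length);
  }

  abstract int compareTo(byte[] bytes, int offset, int length);
}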
I think the simplification looks cleaner, so I'd be in favor of updating the methods in the file.
Ah okay, yes that is the source of the flipping. I can simplify the comparison methods to reflect your suggestion.
Should I also go and flip the comparisons so that they return the idiomatic convention (negative = less, positive = greater)?
I think we can leave that part as is. binary1.compareTo(binary2) ends up returning the correct value so it seems to be respecting the Comparable contract. Might be useful to add a comment above your method (and the couple of others which were following that behavior) to call out that it is inverted for a reason. That'll help folks reading the code in the future.
Actually, the flip is sort of necessary due to the use of polymorphic Binary, so I'll just leave it as is and apply the simplifications.
I have a few questions about this. In general I am not sure this is a safe thing to change. While Parquet has a Binary type, and a sub-type of Binary for strings, whether a Binary was originally a String is probably not carefully tracked throughout the codebase. I don't think that instances of Binary should have different sort orders based on how they were constructed. For example, if one part of the code grabs a chunk of bytes from a buffer that happens to be a string, but at a layer that isn't paying attention to that, it will get the normal binary sorting, not the string sorting. Additionally, if we change how strings are sorted, the min and max values in files already written by Parquet would become invalid. That seems like a very serious breaking change. I think if we want lexicographic sorting of strings in the Parquet statistics, that should be handled in the layer that creates the statistics, and should be done via a new field in the parquet-format thrift schema. Does that make sense?
@isnotinvain @andreweduffy @piyushnarang: Looking at the code, this seems related to Java having signed bytes rather than to differences in comparing strings vs. bytes. The comparison differs only if you have bytes that are > 127, which does not happen with plain English text. It would be great if we had examples of strings where the behavior is different, possibly in the form of unit tests. If this is correct, then this seems like a bug, and existing min/max for anything other than plain 7-bit ASCII will be wrong anyway. There is the separate issue that this is still not proper Unicode ordering, since it doesn't take into account multi-byte characters (which I think we want to discuss separately to keep this PR focused).
I would need to check, but I think UTF-8 is sortable byte by byte if you compare the bytes as unsigned. That said, parquet-format doesn't have a notion of strings, nor a specified sort order for them. We should be very careful changing anything that invalidates old statistics. So I think we need to: specify the ordering in parquet-format, and preserve compatibility with files written using the old ordering.
As long as the sorting was consistent (though not lexicographic), the existing stats were still self-consistent.
This is actually currently broken with Spark: Spark sorts strings by unsigned bytes, so there's a mismatch when using the statistics for queries, where you actually have to read more row groups than necessary because min/max are off relative to the way Spark sorted the data. A simple way to replicate this behavior is to have a Parquet row group with one column and two rows, one with "é" and one with "b". Here min will be "é" and max will be "b", but it should actually be the other way around based on how Spark sorted, so you miss the chance to skip this row group for queries for "a", etc. I can't speak for Hive or Impala, but I know that in Spark there is a distinction between a "binary" type and a "string" type, where the binary type is sorted using normal signed Java bytes while strings are sorted by unsigned byte comparison.
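A quick demonstration of that mismatch, assuming UTF-8 encoding (a hedged sketch, not parquet-mr code): "é" encodes to the bytes 0xC3 0xA9, and 0xC3 is -61 as a signed Java byte but 195 unsigned.

import java.nio.charset.StandardCharsets;

public class SignedVsUnsignedDemo {
  public static void main(String[] args) {
    byte[] eAcute = "é".getBytes(StandardCharsets.UTF_8); // {0xC3, 0xA9}
    byte[] b = "b".getBytes(StandardCharsets.UTF_8);      // {0x62}

    // Signed comparison (Java's default byte semantics): -61 < 98, so
    // "é" sorts before "b" and the stats record min = "é", max = "b".
    System.out.println(eAcute[0] + " < " + b[0]);

    // Unsigned comparison (how Spark sorts strings): 195 > 98, so
    // "b" sorts before "é" and the correct stats are min = "b", max = "é".
    System.out.println((eAcute[0] & 0xFF) + " > " + (b[0] & 0xFF));
  }
}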
Also, as an update: I realize this does affect the correctness of UTF-8 string comparisons; see https://issues.apache.org/jira/browse/SPARK-17213.
I don't think we're currently specifying the sort rules in parquet-format: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L198. I guess one of the issues this change could cause, even though it's just on the write path, is that engines like Spark today are returning a certain set of results based on this ordering (which is incorrect, like you pointed out). If we flip it and someone tries to read the same data written in the new format, they'll end up seeing a different set of results compared to the old files, which isn't ideal. How about adding a couple of optional min/max fields to the statistics object that have this unsigned ordering?
So I was chatting with @isnotinvain and a few points came up:
So currently there is no statistics writing in parquet-cpp; it's been in progress for a few months now (apache/parquet-cpp#129). However, their ByteArray type represents bytes as uint8_t, so they assume all binary-typed data is unsigned (https://github.com/apache/parquet-cpp/blob/f97042d9bab10fffdb5e532fcc21a9ccc27f4f1c/src/parquet/types.h#L118).
Thanks for looking into this @piyushnarang and @andreweduffy. Given that the sort order is not specified in parquet-format, and parquet-mr is the only implementation that currently writes these statistics, I feel fairly strongly that we should update parquet-format to say that the sorting order for binary types is the one parquet-mr currently uses. Separately from that, I think it would be a good idea to propose some new fields in the parquet-format statistics object, something like: signedMin, signedMax, unsignedMin, unsignedMax.
This way, when a reader sees any of unsignedMin, unsignedMax, signedMin, signedMax set, it knows it can use those safely. When it sees only min/max set, it knows that those are implicitly signed. We should document all of this in the parquet-format spec as well.
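To make the shape of that proposal concrete, a rough sketch using the field names from the comment above. This is illustrative only; the real change would be to the parquet-format thrift schema, and these are not actual schema definitions.

// Hypothetical shape of the extended statistics object.
class StatisticsSketch {
  byte[] min;         // existing field: implicitly signed ordering
  byte[] max;
  byte[] signedMin;   // explicit signed byte ordering
  byte[] signedMax;
  byte[] unsignedMin; // unsigned (byte-wise lexicographic) ordering
  byte[] unsignedMax;
}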
After looking through a few other binary formats (Thrift, Protobufs) and some databases (Postgres' BINARY format), I'm actually of the opinion that it makes the most sense for binary data to be interpreted as a string of unsigned bytes. However, I mainly just want Parquet statistics to work correctly with other systems, so your recommendation is fine by me; I'm just putting out there that unsigned seems to be the more widely accepted interpretation. The signed interpretation is a relic of Java's lack of unsigned types.
I 100% agree that unsigned bytes is a better sort order. I don't think that needs any debate. The problem is, in some sense we have already picked a different sort order, and we cannot change that, even to a strictly better sort order, without maintaining backwards compatibility. So it's really not about which is better, it's about whether we can change what we have been using so far. I think the best bet is to add this better sort order, but we need to continue to support files written with the old sort order.
And the only way to do that is probably by adding new stats fields. We will also need the filter pushdown layer to be explicit about which kind of sort order a particular query wants.
Got it, I'm in agreement that preserving compatibility is important. I can make the change to Statistics here; the change to parquet-format will just be a doc-comment update in the thrift file, which I'll do separately.
@andreweduffy - Thanks. When you update the format we should also ping folks on the cpp PR to let them know. Would be nice to have that code also be compatible.
Yep, I pinged Wes about it on the CPP ticket I linked above; he says he'll want to make a separate ticket for that and will follow up later.
Just pushed a commit. It'll fail to build because it depends on an updated parquet-format that I built against a local snapshot, but it builds and passes all tests on my laptop. I allow users to pass in a map of ColumnPath -> Boolean, where the Boolean indicates whether to use unsigned statistics and comparison operators for pushdown; it defaults to the behavior that existed before.
Hmm, I'm not sure that's the approach we'll want to take. Users would have to set this up on the write path and then very carefully mirror it on the read path. Why don't we instead just store both in the statistics? That seems like it has much less room for users to make a mistake, and it also avoids the situation where, for a given Parquet file, you don't even know what sort order was used for a particular column.
Oh I see, I misread -- the settings are just for the read path, right? In that case I think it might make more sense to make the filter API a little more explicit: instead of configuring things per column, let's just update the filter API, e.g. allow users to do:
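A purely hypothetical sketch of what such an ordering-aware filter API could look like; none of these names exist in parquet-mr, they are invented here for illustration.

import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.io.api.Binary;

// The caller declares which byte ordering the pushdown predicate
// assumes, so the reader can skip min/max stats that were written
// under the other ordering instead of returning wrong results.
enum SortOrder { SIGNED, UNSIGNED }

interface OrderedFilterApi {
  FilterPredicate lt(String columnPath, Binary value, SortOrder order);
}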
Yep, the sorting specification is only needed for the read path, since all statistics are written in both orderings anyway.
I think the bigger thing is to make sure we get some buy-in on the parquet-format change. If anyone has proposed changes to that, they'll have to get reflected here, so maybe we should send an email to the dev list with the parquet-format PR first. It would probably help to give some background info there as well.
Sounds good, I'll go ahead and send one |
@@ -73,7 +73,7 @@
     <hadoop1.version>1.1.0</hadoop1.version>
     <cascading.version>2.5.3</cascading.version>
     <cascading3.version>3.0.3</cascading3.version>
-    <parquet.format.version>2.3.1</parquet.format.version>
+    <parquet.format.version>2.3.2-SNAPSHOT</parquet.format.version>
TODO: remember to update this once the format PR is merged.
The Travis build won't pass until then.
Thanks for the feedback @piyushnarang and @isnotinvain, will push a followup soon.
@piyushnarang @isnotinvain: made the changes. In particular, initializeStats and updateStats are now broken into signed/unsigned variants, and the comparison methods are renamed to include the signed/unsigned distinction.
@andreweduffy, I'll take another look. I was not able to see your updates as I was out over the weekend.
}

@Override
public boolean isSmallerThan(long size) {
-  return !hasNonNullValue() || ((min.length() + max.length()) < size);
+  return !hasNonNullValue() || (((minSigned.length() + maxSigned.length()) < size) && ((minUnsigned.length() + maxUnsigned.length()) < size));
Seems like minSigned.length() + maxSigned.length() should be the same as minUnsigned.length() + maxUnsigned.length(), right? We could think about dropping one pair from this comparison.
I think there are cases where that might not be true. Take the following example column:
é
ello
a
minSigned = é
maxSigned = ello
minUnsigned = a
maxUnsigned = é
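A quick worked check of that example, assuming UTF-8 encoding (so "é" is two bytes):

import java.nio.charset.StandardCharsets;

class LengthCheck {
  public static void main(String[] args) {
    byte[] eAcute = "é".getBytes(StandardCharsets.UTF_8);  // 2 bytes
    byte[] ello = "ello".getBytes(StandardCharsets.UTF_8); // 4 bytes
    byte[] a = "a".getBytes(StandardCharsets.UTF_8);       // 1 byte

    System.out.println(eAcute.length + ello.length); // signed pair: 2 + 4 = 6
    System.out.println(a.length + eAcute.length);    // unsigned pair: 1 + 2 = 3
    // The totals differ, so dropping one pair from isSmallerThan
    // would change its behavior.
  }
}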
👍, looks good to me
-checkNotNull(pred, "pred");
-checkNotNull(columns, "columns");
+checkNotNull(pred, "predicate cannot be null");
+checkNotNull(columns, "columns cannot be null");
The second parameter is just the name of the parameter:
https://github.com/apache/parquet-mr/blob/6dad1e3bd0e277f5b5e5e2a0720f474271c1648d/parquet-common/src/main/java/org/apache/parquet/Preconditions.java#L38
If renaming pred, you should rename it in the method signature as well.
I'll just change this back then; these are separate from the root of the PR.
I just realized this isn't even Guava Preconditions, it's your own checkNotNull. This is definitely the wrong thing to do and will be reverted.
Looks good to me.
Hi everyone, sorry I'm late to the party; evidently I've been missing notifications. I don't understand why the approach is to keep the signed ordering around and add API support for it. Is anyone going to opt for signed ordering? UTF-8 lexicographical ordering corresponds to the byte ordering using unsigned comparison. As far as I can tell, the signed ordering isn't standard or used anywhere else. That makes me more inclined to treat it as a bug instead of adding it to the API.

I completely agree that backward compatibility is important. I think we should address this problem like PARQUET-251, where we considered binary min and max corrupt and didn't return them when converting the footer. Then we can fix the ordering to be unsigned and not need to change the API. Since the ordering is the same for 7-bit ASCII data, I think we should also add a property to use the min and max anyway if users want to override the fix.

(There's also the equality predicate case, where it doesn't matter what the ordering is as long as you're consistent, but I don't think it is worth adding special handling for this. A property to use min/max should work for this case as well.)
In the interest of getting a Parquet MR release done, I've posted an incomplete fix: PR #367. That patch prevents stats from being returned by the metadata converter, like the fix for PARQUET-251. It detects whether the logical type should use signed or unsigned order, so it addresses this bug for unsigned integer types as well as UTF8 and decimals. It also makes as few changes as possible so we can decide both how to implement different sort orders and how to store those in the format in future releases.
Gotcha. You can probably ignore my comments on your PR then; I made those before I saw your post. In either case, it would be great to continue the conversation around whether we're going to go with a format change or just force binaries over to unsigned statistics.
I think we need a format change. We need to know the sort order that was used for a column in order to use min and max moving forward. While we could use the writer version to detect it, I think the better option is to store the ordering that was used. That way we can support locale-specific sort orders in the future (the writer provides a name and a comparator for the ordering it used).
+1 I'm on board with that
I would prefer we don't invalidate all the old statistics. We (and presumably others) have a lot of data with those stats in them. And while unsigned byte ordering is certainly a better choice, the signed ordering wasn't explicitly a bug; it seems the ordering was never actually specified in parquet-format. Why not keep support for both orderings explicitly? We can set the defaults to the unsigned ordering. It also allows us to avoid switching on the version number when looking at stats, which is not super reliable given that a lot of our (and probably others') files were written from internal forks with non-standard version numbers. I think adding unsigned_min and unsigned_max fields makes it explicit and allows the reader to determine whether what they have is signed or not.
Actually, I keep forgetting that this still works for equality and inequality with either ordering. It might be OK to update parquet-format specifying that the only supported ordering is unsigned, add a marker field (or even a version number field?) to parquet-format as well, and then in the read path we can determine which kind of stats are available and not use them for range queries when they are the wrong kind.
@isnotinvain, sorry, I should have been more clear about what I think the path forward is.

To get 1.9.0 out, I think we should disable signed stats by type -- for unsigned ints, UTF8, and decimals. But this also includes a flag to continue using those stats for UTF8, because we have a ton of data with these stats as well. (Though keep in mind that this doesn't affect dictionary filtering, which is coming in 1.9.0.) This is a stop-gap to fix correctness only.

Then, I think we should add what ordering was used to the format. For older files, we default that to "unsigned", and then Parquet will have full access to the old stats as you recommend, because they can still be used for equality/inequality. This also avoids needing to check version strings, because we can assume the unsigned ordering when it isn't set.

The problem I have with the current proposal is that it exposes the ordering bug to users through the API, when I think it should be internal, but with a flag to signal that you're okay with the correctness issues because you only wrote ASCII strings.
I agree with most of that; my main issue is treating this as a correctness bug. As is, Parquet is self-consistent and correct: the pushdown filters match the stats in the files. Changing the ordering Parquet uses from signed to unsigned is a good feature to add, but I'm not sure we should treat the current behavior as a bug. It's more of an "in retrospect, a better idea would have been" -- unless I'm missing something and the parquet-format spec specifies an ordering for binary columns?

If we expose the ordering to the query writer (ideally with defaults so most users don't even notice), then instead of a flag that flips you into "there are bugs here" mode, users can push down filters and signal which ordering they expect the filter to use. At least that way we can fail when someone pushes down an unsigned filter to a file that doesn't have unsigned stats (or not fail, but skip the stats-based filtering).

I do get the value of not polluting the API with more options, but I prefer that we expose this complexity to users in a way that allows us to be correct all the time, instead of adding a flag that is "use at your own risk". If all our filter pushdowns from users are tagged with a requested ordering, then we can safely use or not use the available stats for them. We can also make the default ordering unsigned, so it shouldn't pollute the API too much.
I definitely think this is a bug. The sort order when you store a string doesn't match the sort order of java.lang.String. Sure, we're consistent with the order and it is useful sometimes, but the min and max we report for Strings are wrong.
We can take advantage of it being correct for ASCII characters, but I don't think we should make the problem go away by making it an accidental feature, nor should we perpetuate an order that doesn't make sense in our API. This would only exist in Parquet.

How is exposing the signed ordering in the API any more powerful than a setting for when it is safe to use the signed-ordering min and max? That's just detecting whether to use the signed min and max at query evaluation time or fall back. When would you not do this globally anyway? Would users specify two sets of predicates?

The drawback to exposing the signed ordering in the API is that you have to change the current min and max API's behavior anyway. If we used the current API for the signed min and max, then we'd end up with downstream apps continuing to use the wrong (meaning unexpected) min and max. If we use the current API for the unsigned min and max, then we can't take advantage of existing signed min and max... unless we added a property to do so, but then we'd have a larger API with features identical to what I'm proposing. And if we removed the current API entirely, then downstream apps would all have to be rewritten to use stats, which is obviously a problem.

I don't see value in exposing this bug to downstream users.
@isnotinvain, can you take another look at this? I'd like to get a release out and I think it depends on whether or not we think we need to address the min/max problem as a correctness issue. Thanks!
I agree with @rdblue that this is a bug (a spec bug, but a bug). The PR #367 that @rdblue proposed is compatible with the unsigned_min/max extension proposed on parquet-format (apache/parquet-format#42) and will unblock the situation for releasing 1.9.0. It will hide the stats when they are not in the order the user expects and will prevent inconsistent pushdowns. Regarding exposing a setting to the user, I think @isnotinvain is right that we should not pollute the API with too many workaround settings. Possibly we should not document it, since this is a workaround that will be used by a very small subset of users (people on this thread only?).
Looks like PARQUET-686 was eventually resolved. Seeing as this PR's one-year anniversary is coming up, I'm going to go ahead and close it.
Merge master?
Currently, ordering of Binary in Parquet is based on signed byte-by-byte comparison. This doesn't match the standard method of lexicographic sorting of Unicode strings; you can see an example of this in the associated JIRA (https://issues.apache.org/jira/browse/PARQUET-686). This overrides comparison on FromStringBinary to implement the correct sort order. @julienledem