
PARQUET-686: Allow for Unsigned Statistics in Binary Type #362

Closed
wants to merge 3 commits

Conversation

@a10y a10y commented Aug 23, 2016

Currently, ordering of Binary in Parquet is based on byte-by-byte comparison of signed bytes. This doesn't match the standard lexicographic sorting of Unicode strings; you can see an example of this in PARQUET-686.

This PR overrides comparison on FromStringBinary to implement the correct sort order.
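For reference, here is a minimal sketch of the unsigned byte-by-byte comparison the patch is after (names are illustrative, not the exact code in this PR):

// Compare two byte arrays as sequences of unsigned bytes. This is the
// lexicographic order that UTF-8 encoded strings sort in, and the order
// engines like Spark use.
static int compareUnsigned(byte[] a, byte[] b) {
  int minLength = Math.min(a.length, b.length);
  for (int i = 0; i < minLength; i++) {
    int v1 = a[i] & 0xFF; // mask so the signed Java byte reads as 0..255
    int v2 = b[i] & 0xFF;
    if (v1 != v2) {
      return v1 - v2;
    }
  }
  // Equal shared prefix: the shorter value sorts first.
  return a.length - b.length;
}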

@julienledem

@a10y a10y changed the title Add lexicographic ordering for FromStringBinary PARQUET-686: Add lexicographic ordering for FromStringBinary Aug 23, 2016
@julienledem
Member

julienledem commented Aug 23, 2016

Is the consequence of this change that MAX and MIN are now correct? Can you add unit tests that illustrate the case that did not work before?

Please create a Parquet jira for this. (https://issues.apache.org/jira/browse/PARQUET)
and prefix your PR title with it.

The bug seems more related to comparing unsigned bytes rather than actual UTF-8 based comparison (which would have to take into account multi-byte characters). So possibly just change the comparison of all binary types.

However, we need to be careful about backward compatibility here. Another possibility is to create a separate Stats field, but it seems the current min/max are wrong in some situations (bytes > 127).
@isnotinvain @rdblue what do you think?

@a10y
Author

a10y commented Aug 23, 2016

If you assume binary types are strings of nonnegative bytes then this is more correct. Whether that applies to all Binary types I'm not certain, but I at least know it affects the correctness of UTF8 string comparisons.

I can go ahead and add unit tests, but I'd be interested to know if this should be changed all around. This isn't a format-breaking change, nor should it have to be as far as I can tell, but it changes the sort order, so statistics calculated for binary columns will now be different.


for (int i = 0; i < min_length; i++) {
  int value1 = array1.get(i + offset1) & 0xFF;
  int value2 = array2.get(i + offset2) & 0xFF;
  if (value1 < value2) {


could simplify this by:

if (value1 != value2) {
  return value1 - value2;
}

P.S. Looking at the Avro and Spark comparisons, the return value is flipped from what you have (they return positive if value1 > value2).

Author

Yeah, I was actually trying to keep this logic as close to the existing logic as possible.

The sign is flipped because, if you look in BinaryStatistics, it uses compareTo the opposite way from most Java code, though strictly speaking nothing's wrong as long as everything is consistent, which appears to be the case here.

Author

If I simplify these statements, should I go ahead and make that change to all the compare* methods in this project?

Author

*file, not project


Yeah, this is confusing. It seems like BinaryStatistics is behaving like normal Java code:

  public void updateStats(Binary min_value, Binary max_value) {
    if (min.compareTo(min_value) > 0) { min = min_value.copy(); }
    if (max.compareTo(max_value) < 0) { max = max_value.copy(); }
  }

Comparable returns a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified object.

It seems the flip of lhs and rhs happens when you call binary1.compareTo(binary2); the code there does other.compareTo(value, 0, value.length).

So yeah, I think you've got the sign set up correctly.
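To make the double inversion concrete, here's a toy model of the calling pattern (MiniBinary and compareAgainst are made-up names for illustration; the real Binary class is more involved):

// Toy model of Binary's inverted comparison, for illustration only.
class MiniBinary implements Comparable<MiniBinary> {
  final byte[] value;
  MiniBinary(byte[] value) { this.value = value; }

  @Override
  public int compareTo(MiniBinary other) {
    // Delegate with the operands swapped, as Binary does:
    return other.compareAgainst(value);
  }

  // Internally "flipped": returns positive when the argument is the
  // greater one. Combined with the swap above, a.compareTo(b) still
  // honors the Comparable contract (negative when a < b).
  int compareAgainst(byte[] otherValue) {
    int min = Math.min(value.length, otherValue.length);
    for (int i = 0; i < min; i++) {
      int v1 = otherValue[i] & 0xFF;
      int v2 = value[i] & 0xFF;
      if (v1 != v2) return v1 - v2;
    }
    return otherValue.length - value.length;
  }
}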


I think the simplification looks cleaner, so I'd be in favor of updating the methods in the file.

Author

Ah okay, yes that is the source of the flipping. I can simplify the comparison methods to reflect your suggestion.

Author

Should I also go and flip the comparisons so that they return idiomatic -1 = less, 1 = greater?


I think we can leave that part as is. binary1.compareTo(binary2) ends up returning the correct value, so it seems to be respecting the Comparable contract. It might be useful to add a comment above your method (and the couple of others that follow that behavior) to call out that it is inverted for a reason. That'll help folks reading the code in the future.

Author

Actually, the flip is sort of necessary due to the use of polymorphic Binary, so I'll just leave it as is and apply the simplifications.

@isnotinvain
Contributor

I have a few questions about this. In general, I'm not sure this is a safe thing to change.

While Parquet has a Binary type, and a sub-type of Binary for strings, whether a Binary was originally a String is probably not carefully tracked throughout the codebase. I don't think instances of Binary should have different sort orders based on how they were constructed. For example, if one part of the code grabs a chunk of bytes from a buffer that happens to be a string, at a layer that isn't paying attention to that, it will get the normal binary sorting, not the string sorting.

Additionally, if we change how strings are sorted, the min + max values in files already written by parquet would become invalid. That seems like a very serious breaking change. I think if we want lexicographic sorting of strings in the parquet statistics, that should be handled in the layer that creates the statistics, and should be done via a new field in the parquet-format thrift schema.

Does that make sense?

@julienledem
Member

@isnotinvain @andreweduffy @piyushnarang: Looking at the code, this seems related to Java having signed bytes rather than to differences in comparing strings vs. bytes. Typically, when comparing binary data, you would consider the bytes unsigned.

Now, the comparison differs only if you have bytes that are > 127, which doesn't happen with plain English text. It would be great to have examples of strings where the behavior is different, possibly in the form of unit tests.

If this is correct, then this seems like a bug, and existing min/max for anything other than plain 7-bit ASCII will be wrong anyway.

There will be a separate issue that this is still not proper Unicode ordering, since it doesn't take multi-byte characters into account (which I think we should discuss separately to keep this PR focused).
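A sketch of the kind of unit test being asked for (JUnit 4 assumed; the class and test names are made up):

import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;
import org.junit.Test;

public class TestSignedVsUnsignedOrder {
  @Test
  public void signedAndUnsignedOrdersDisagreeBeyondAscii() {
    byte[] accented = "\u00e9".getBytes(StandardCharsets.UTF_8); // "é" -> 0xC3 0xA9
    byte[] plain = "b".getBytes(StandardCharsets.UTF_8);         // 0x62

    // As signed Java bytes, 0xC3 is negative, so "é" sorts before "b".
    assertTrue(accented[0] < plain[0]);

    // As unsigned bytes, 0xC3 (195) > 0x62 (98), so "é" sorts after "b",
    // which matches java.lang.String ordering.
    assertTrue((accented[0] & 0xFF) > (plain[0] & 0xFF));
    assertTrue("\u00e9".compareTo("b") > 0);
  }
}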

@isnotinvain
Contributor

I would need to check, but I think UTF8 is sortable byte by byte if you treat the bytes as unsigned.

That said -- parquet-format doesn't have a notion of strings nor a special case for how they are sorted. If we currently sort all binary data byte by byte as signed bytes, that's not necessarily incorrect unless parquet-format says otherwise?

We should be very careful changing anything that invalidates old statistics, or else any query engine using them will return wrong results. Does parquet-cpp sort binaries in a different way? It seems this should be part of parquet-format.

So I think we need to:

  1. check if parquet-format specifies sorting rules and, if it doesn't, it should
  2. think about backwards compatibility.

As long as the sorting is consistent (though not lexicographic), the statistics are not currently broken, because queries will use the same sort order, right?


@a10y
Author

a10y commented Aug 24, 2016

This is actually currently broken with Spark: Spark sorts by unsigned bytes, so there's a mismatch when using the statistics for queries, where you end up reading more row groups than necessary because min/max are off relative to the way Spark sorted the data. A simple way to replicate this is a Parquet row group with one column and two rows, one with "é" and one with "b": here min will be "é" and max will be "b", but it should actually be the other way around based on how Spark sorted, so you miss the chance to skip this row group for queries for "a", etc.

I can't speak for Hive or Impala, but I know that in Spark there is a distinction between a "binary" type and a "string" type: Binary is sorted like a normal Java byte[], i.e. as signed bytes, while strings are UTF8-encoded byte arrays sorted as in this PR. I can imagine changing the format to specify something like an enum SortStrategy in the ColumnMetaData, either SignedBytes or UnsignedBytes, defaulting to SignedBytes when not specified to stay compatible with engines that use the existing sort order. This would then also be specified on Binary subtypes, and could be asserted on read by the top-level processing engine using Parquet. However, the assert is all this really buys you, and I still don't fully see the need for a format change since this only affects the write path.
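For illustration, a rough Java-side sketch of that hypothetical SortStrategy idea (neither the enum nor the metadata field exists in parquet-format today):

// Hypothetical sketch of the SortStrategy idea described above.
enum SortStrategy {
  SIGNED_BYTES,   // the ordering parquet-mr currently writes
  UNSIGNED_BYTES  // the ordering UTF-8 strings (and Spark) expect
}

class SortStrategyCheck {
  // On read, an engine could assert that the file's ordering matches
  // the ordering it plans to use for pushdown.
  static void assertCompatible(SortStrategy fromFile, SortStrategy engine) {
    if (fromFile != engine) {
      throw new IllegalStateException("statistics were written with "
          + fromFile + " ordering but the engine expects " + engine);
    }
  }
}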

@a10y
Author

a10y commented Aug 24, 2016

Also, as an update: I realize this does affect the correctness of UTF8 string comparisons; see https://issues.apache.org/jira/browse/SPARK-17213

@piyushnarang

I don't think we're currently specifying the sort rules in parquet-format - https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L198.

I guess one issue this change could cause, even though it's just on the write path, is that engines like Spark today return a certain set of results based on this ordering (which is incorrect, as you pointed out). If we flip it and someone tries to read the same data written in the new format, they'll end up seeing a different set of results compared to the old files, which isn't ideal.

How about adding a couple of optional min / max fields to the statistics object that have this unsigned ordering?

@piyushnarang

So I was chatting with @isnotinvain and a few points came up:

  1. Alex's suggestion was to document in parquet-format what the current sort order is (not add fields / enums). That's basically our spec, used by parquet-mr (Java) and parquet-cpp, so we should specify the order there. As part of this, we should verify that parquet-cpp also sorts binaries the same way (signed).
  2. Whichever approach we decide on, we'll also need to ensure these changes aren't restricted to FromStringBinary. I think it's possible for someone to create a Binary object from a String using any of the other Binary classes as well (e.g. directly from a byte buffer).

@a10y
Author

a10y commented Aug 24, 2016

So currently there is no statistics writing in parquet-cpp; it's been in progress for a few months now (apache/parquet-cpp#129).

However their ByteArray type represents bytes as uint8_t, so they assume all binary-typed data is unsigned (https://github.com/apache/parquet-cpp/blob/f97042d9bab10fffdb5e532fcc21a9ccc27f4f1c/src/parquet/types.h#L118).

@isnotinvain
Contributor

Thanks for looking into this @piyushnarang and @andreweduffy.

Given that the sort order is not specified in parquet-format, and parquet-mr is the only implementation that currently writes these statistics, I feel fairly strongly that we should update parquet-format to say that the sorting order for binary types is the one parquet-mr currently uses.

Separately from that, I think it would be a good idea to propose some new fields in the parquet-format statistics object, something like:

  1. Deprecate the current min/max fields in the case of binary columns
  2. Add 4 fields: unsignedMin, unsignedMax, signedMin, signedMax for use in binary columns

This way, when a reader sees any of unsignedMin, unsignedMax, signedMin, signedMax set, it knows it can use those safely. When it sees only min/max set, it knows that those are implicitly signed. We should document all of this in the parquet-format spec as well.
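In parquet-mr terms, the statistics object would then track both orderings side by side. A rough sketch, with hypothetical class and helper names (the comparison bodies are elided; they'd be the signed and unsigned byte comparisons discussed above):

// Hypothetical sketch of statistics that track both orderings at once.
class DualOrderBinaryStatistics {
  private Binary signedMin, signedMax, unsignedMin, unsignedMax;

  void updateStats(Binary value) {
    if (signedMin == null) {
      // The first value seeds all four fields.
      signedMin = signedMax = unsignedMin = unsignedMax = value.copy();
      return;
    }
    if (compareSigned(value, signedMin) < 0)     signedMin   = value.copy();
    if (compareSigned(value, signedMax) > 0)     signedMax   = value.copy();
    if (compareUnsigned(value, unsignedMin) < 0) unsignedMin = value.copy();
    if (compareUnsigned(value, unsignedMax) > 0) unsignedMax = value.copy();
  }

  // These would be the byte-by-byte comparisons discussed earlier in
  // this thread (signed bytes vs. bytes masked with & 0xFF).
  private static int compareSigned(Binary a, Binary b)   { /* ... */ return 0; }
  private static int compareUnsigned(Binary a, Binary b) { /* ... */ return 0; }
}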

@a10y
Author

a10y commented Aug 24, 2016

After looking through a few other binary formats (Thrift, Protobufs) and some databases (Postgres' BINARY format), I'm actually of the opinion that it makes the most sense for binary data to be interpreted as a string of unsigned bytes. However, I mainly just want Parquet statistics to work correctly with other systems, so your recommendation is fine by me; I'm just putting it out there that unsigned seems to be the more widely accepted interpretation. The signed interpretation is a relic of Java's lack of unsigned types.

@isnotinvain
Contributor

I 100% agree that unsigned bytes is a better sort order. I don't think that needs any debate.

The problem is, in some sense we have already picked a different sort order, and we cannot change that, even to a strictly better sort order, without maintaining backwards compatibility. So it's really not about which is better; it's about whether we can change what we have been using so far. I think the best bet is to add this better sort order, but we need to continue to support files written with the old sort order.

@isnotinvain
Contributor

And the only way to do that is probably by adding new stats fields. And we will also need the filter push down layer to be explicit about which kind of sort order a particular query wants.

@a10y
Author

a10y commented Aug 24, 2016

Got it, I'm in agreement that preserving compatibility is important.

I can make the change to Statistics here; the change to parquet-format will just be a doc comment update in the thrift file, which I'll do separately.

@piyushnarang

@andreweduffy - Thanks. When you update the format, we should also ping folks on the cpp PR to let them know. It would be nice to have that code be compatible as well.

@a10y
Author

a10y commented Aug 24, 2016

Yep, I pinged Wes about it on the cpp ticket I linked above; he says he'll want to make a separate ticket for that and will follow up later.

@a10y
Author

a10y commented Aug 25, 2016

Just pushed a commit. It'll fail to build because it depends on an updated parquet-format, which I built against a local snapshot, but it builds and passes all tests on my laptop.

It allows users to pass in a map of ColumnPath -> Boolean, where the boolean indicates whether the user wants unsigned statistics and comparison operators for pushdown; it defaults to the behavior that existed before.
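As a usage sketch (the entry point for passing the map is elided; ColumnPath.get is the existing parquet-mr helper, the column names here are hypothetical):

import java.util.HashMap;
import java.util.Map;
import org.apache.parquet.hadoop.metadata.ColumnPath;

// Request unsigned statistics/comparators for specific columns on read;
// columns not in the map keep the pre-existing (signed) behavior.
Map<ColumnPath, Boolean> useUnsignedStats = new HashMap<>();
useUnsignedStats.put(ColumnPath.get("user", "name"), true);  // a UTF8 column
useUnsignedStats.put(ColumnPath.get("raw_payload"), false);  // keep signed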

@isnotinvain
Contributor

Hmm, I'm not sure that's the approach we'll want to take. Users would have to set this up on the write path and then very carefully mirror it on the read path.

Why don't we instead just store both in the statistics? That seems like it has much less room for users to make a mistake, and it also avoids the situation where for a given parquet file you don't even know what sort order was used for a particular column.

@isnotinvain
Contributor

Oh I see I misread -- the settings are just for the read path right?

I think in that case it might make more sense to just make the filter API a little more explicit. Instead of configuring things per column, let's just update the filter API? E.g. allow users to do:
ltEq(fooColumn, "bar", SortOrder.UNSIGNED)

@a10y
Author

a10y commented Aug 26, 2016

Yep, the sorting specification is only needed for the read path, since all available info is written regardless. Forcing users to specify on the filters would also work; I had some concerns but realize they're moot, so I'll go ahead and patch that. Any other large changes that need to be made here?


@isnotinvain
Contributor

I think the bigger thing is to make sure we get some buy-in on the parquet-format change. If anyone has proposed changes to that, they'll have to get reflected here, so maybe we should send an email to the dev list with the parquet-format PR first. It would probably help to give some background info there as well.

@a10y
Author

a10y commented Aug 26, 2016

Sounds good, I'll go ahead and send one

@@ -73,7 +73,7 @@
 <hadoop1.version>1.1.0</hadoop1.version>
 <cascading.version>2.5.3</cascading.version>
 <cascading3.version>3.0.3</cascading3.version>
-<parquet.format.version>2.3.1</parquet.format.version>
+<parquet.format.version>2.3.2-SNAPSHOT</parquet.format.version>


TODO: remember to update this once the format PR is merged.

Member

The Travis build won't pass until then.

@a10y
Author

a10y commented Sep 2, 2016

Thanks for the feedback @piyushnarang and @isnotinvain; will push a follow-up soon.

@a10y
Author

a10y commented Sep 2, 2016

@piyushnarang @isnotinvain made the changes; in particular, initializeStats and updateStats are now broken into signed/unsigned variants, and the comparison methods are renamed to include the signed/unsigned distinction.

@piyushnarang

piyushnarang commented Sep 6, 2016

@andreweduffy, I'll take another look. I wasn't able to see your updates as I was out over the weekend.

 }

 @Override
 public boolean isSmallerThan(long size) {
-  return !hasNonNullValue() || ((min.length() + max.length()) < size);
+  return !hasNonNullValue() || (((minSigned.length() + maxSigned.length()) < size) && ((minUnsigned.length() + maxUnsigned.length()) < size));


Seems like minSigned.length() + maxSigned.length() should be the same as minUnsigned.length() + maxUnsigned.length(), right? We could think about dropping one pair from this comparison.

@a10y a10y Sep 6, 2016

I think there are cases where that might not be true. Take the following example column:

é
ello
a

minSigned = é
maxSigned = ello

minUnsigned = a
maxUnsigned = é

@piyushnarang

👍, looks good to me

-checkNotNull(pred, "pred");
-checkNotNull(columns, "columns");
+checkNotNull(pred, "predicate cannot be null");
+checkNotNull(columns, "columns cannot be null");
Member

The second parameter is just the name of the parameter:
https://github.com/apache/parquet-mr/blob/6dad1e3bd0e277f5b5e5e2a0720f474271c1648d/parquet-common/src/main/java/org/apache/parquet/Preconditions.java#L38
If renaming pred, you should rename it in the method signature as well.

Author

I'll just change this back then; these are separate from the root of the PR.

Author

I just realized this isn't even Guava's Preconditions; this is your own checkNotNull. This is definitely the wrong thing to do, and will be reverted.

@julienledem
Member

Looks good to me.
Depends on a parquet-format change.

@rdblue
Contributor

rdblue commented Sep 8, 2016

Hi everyone. Sorry I'm late to the party, evidently I've been missing notifications.

I don't understand why the approach is to keep the signed ordering around and add API support for it. Is anyone going to opt for signed ordering? UTF-8 lexicographical ordering corresponds to the byte ordering using unsigned comparison. As far as I can tell, the signed ordering isn't standard or used anywhere else. That makes me more inclined to treat it as a bug instead of adding it to the API.

I completely agree that backward-compatibility is important. I think we should address this problem like PARQUET-251, where we considered binary min and max corrupt and didn't return them when converting the footer. Then we can fix the ordering to be unsigned and not need to change the API. Since the ordering is the same for 7-bit ascii data, I think we should also add a property to use the min and max anyway if users want to override the fix.

(There's also the equality predicate case, where it doesn't matter what the ordering is as long as you're consistent, but I don't think it is worth adding special handling for this. A property to use min/max should work for this case as well.)

@rdblue
Contributor

rdblue commented Sep 9, 2016

In the interest of getting a Parquet MR release done, I've posted an incomplete fix: PR #367. That patch prevents stats from being returned by the metadata converter, like the fix for PARQUET-251. It detects whether the logical type should use signed or unsigned order, so it addresses this bug for unsigned integer types as well as UTF8 and decimals. It also makes as few changes as possible so we can decide both how to implement different sort orders and how to store those in the format in future releases.
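The shape of that check is roughly as follows -- a sketch with assumed names, not the actual patch (see PR #367 for that):

import org.apache.parquet.schema.OriginalType;

class StatsFilter {
  // Sketch: decide whether min/max that were written using signed byte
  // comparison can be trusted for a column's logical type.
  static boolean signedStatsAreUsable(OriginalType type) {
    if (type == null) {
      return true; // no logical type annotation: signed order matches
    }
    switch (type) {
      case UTF8:      // sorts as unsigned bytes
      case DECIMAL:   // sorts by numeric value, not raw bytes
      case UINT_8:
      case UINT_16:
      case UINT_32:
      case UINT_64:   // unsigned ints sort as unsigned
        return false; // signed min/max may be wrong; ignore them
      default:
        return true;
    }
  }
}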

@a10y
Author

a10y commented Sep 9, 2016

Gotcha. You can probably ignore my comments on your PR then; I made those before I saw your post. In either case, it would be great to continue the conversation around whether we're going to go with a format change or just force binaries over to unsigned statistics.

@rdblue
Contributor

rdblue commented Sep 9, 2016

I think we need a format change. We need to know the sort order that was used for a column in order to use min and max moving forward. While we could use the writer version to detect it, I think the better option is to store the ordering that was used. That way we can support locale-specific sort orders in the future (the writer provides a name and a Comparator for the order).

@a10y
Author

a10y commented Sep 9, 2016

+1 I'm on board with that

@isnotinvain
Contributor

I would prefer we don't invalidate all the old statistics. We (and presumably others) have a lot of data with those stats in them. And while unsigned byte ordering is certainly a better choice, the signed ordering wasn't explicitly a bug; the ordering was never actually specified in parquet-format. Why not keep support for both orderings explicitly? We can set the defaults to the unsigned ordering. It also allows us to avoid switching on the version number when looking at stats, which is not super reliable given that a lot of our (and probably others') files were written from internal forks with non-standard version numbers.

I think adding an unsigned_min and unsigned_max field makes it explicit and allows the reader to determine whether what they have is signed or not.

@isnotinvain
Contributor

Actually, I keep forgetting that this still works for equality and inequality with either ordering.

It might be OK to update parquet-format to specify that the only supported ordering is unsigned, add a marker field (or even a version number field?) to parquet-format as well, and then on the read path determine which kind of stats are available and not use them for range queries when they are the wrong kind.

@rdblue
Contributor

rdblue commented Sep 10, 2016

@isnotinvain, sorry I should have been more clear about what I think the path forward is.

To get 1.9.0 out, I think we should disable signed stats by type -- for unsigned ints, UTF8, and decimals. But this also includes a flag to continue using those stats for UTF8. That's because we have a ton of data with these stats as well. (Though keep in mind that this doesn't affect dictionary filtering, which is coming in 1.9.0.) This is a stop-gap to fix correctness only.

Then, I think we should add what ordering was used to the format. For older files, we default that to "unsigned" and then Parquet will have full access to the old stats as you recommend because they can still be used for equality/inequality. This also avoids needing to check version strings because we can assume the unsigned ordering when it isn't set.

The problem I have with the current proposal is that it exposes the ordering bug to users through the API, when I think it should be internal, but with a flag to signal that you're okay with the correctness issues because you only wrote ASCII data strings.

@isnotinvain
Contributor

I agree with most of that; my main issue is treating this as a correctness bug. As is, Parquet is self-consistent and correct: the push-down filters match the stats in the files. Changing the ordering Parquet uses from signed to unsigned is a good feature to add, but I'm not sure we should treat the current behavior as a bug. It's more of a "in retrospect a better idea would have been" -- unless I'm missing something and the parquet-format spec specifies an ordering for binary columns?

If we expose the ordering to the query writer (ideally with defaults so most users don't even notice), then instead of a flag that flips you into "there are bugs here" mode, users can push down filters and signal which ordering they expect the filter to use. At least that way we can fail when someone pushes down an unsigned filter to a file that doesn't have unsigned stats (or not fail, but skip the stats-based filtering).

I do get the value of not polluting the API with more options, but I'd prefer to expose this complexity to users in a way that allows us to be correct all the time, instead of adding a flag that is "use at your own risk". If all filter push-downs from users are tagged with a requested ordering, then we can safely use or not use the available stats for them. We can also make the default ordering unsigned, so it shouldn't pollute the API too much.

@rdblue
Contributor

rdblue commented Sep 10, 2016

Changing the ordering parquet uses from signed to unsigned is a good feature to add, but I'm not sure we should treat the current behavior as a bug

I definitely think this is a bug. The sort order when you store a string doesn't match the sort order of java.lang.String. Sure, we're consistent with the order and it is useful sometimes, but the min and max we report for Strings are wrong.

I prefer that we expose this complexity to users in a way that allows us to be correct all the time

We can take advantage of it being correct for ASCII characters, but I don't think we should make the problem go away by making it an accidental feature, nor should we perpetuate an order that doesn't make sense in our API. This would only exist in Parquet.

How is exposing the signed ordering in the API any more powerful than a setting for when it is safe to use the signed min and max? That's just detecting whether to use the signed min and max at query evaluation time or falling back. When would you not do this globally anyway? Would users specify two sets of predicates?

The drawback to exposing the signed ordering in the API is that you have to change the current min and max API's behavior anyway.

If we used the current API for the signed min and max, then we'd end up with downstream apps continuing to use the wrong (meaning unexpected) min and max. If we use the current API for the unsigned min and max, then we can't take advantage of existing signed min and max... unless we added a property to do so, but then we'd have a larger API with features identical to what I'm proposing. And if we removed the current API entirely, then downstream apps would all have to be rewritten to use stats, which is obviously a problem.

I don't see value in exposing this bug to downstream users.

@rdblue
Contributor

rdblue commented Sep 13, 2016

@isnotinvain, can you take another look at this? I'd like to get a release out and I think it depends on whether or not we think we need to address the min/max problem as a correctness issue. Thanks!

@julienledem
Member

I agree with @rdblue that this is a bug. A spec bug, but a bug.
Consistent is not the same as correct: being inconsistent would be incorrect, but being consistent is not sufficient for being correct. While parquet-mr is currently consistent within itself, it is inconsistent with the sort orders that users want to push down, which is what matters.
In the general case, the signed-bytes sort order currently used for binary types (and strings in particular) should not be used. In specific cases it can be used, for example when all byte values are between 0 and 127 (7-bit ASCII text).

The PR #367 that @rdblue proposed is compatible with the unsigned_min/max extension proposed on parquet-format (apache/parquet-format#42) and will unblock the situation for releasing 1.9.0. It hides the stats when they are not in the order the user expects and prevents inconsistent push-downs.

Regarding exposing a setting to the user, I think @isnotinvain is right that we should not pollute the API with too many workaround settings. Possibly we should not document it, since this is a workaround that will be used by a very small subset of users (people on this thread only?).

@a10y
Author

a10y commented Sep 1, 2017

Looks like PARQUET-686 was eventually resolved. Seeing as this PR's one-year anniversary is coming up, I'm going to go ahead and close this.

@a10y a10y closed this Sep 1, 2017
@xiexingguang xiexingguang left a comment

Merge master?
