PARQUET-686: Allow for Unsigned Statistics in Binary Type #362

a10y · 2016-08-23T16:51:21Z

Currently, ordering of Binary in Parquet is based on byte-by-byte comparison. This doesn't match the standard method of lexicographic sorting of Unicode strings, you can see an example of this in

This overrides comparison on FromStringBinary to implement correct sort order.

@julienledem

julienledem · 2016-08-23T17:09:17Z

Is the consequence of this change that MAX and MIN are now correct? Can you add unit tests that illustrate the case that did not work before?

Please create a Parquet jira for this. (https://issues.apache.org/jira/browse/PARQUET)
and prefix your PR title with it.

The bug seems more related to comparing unsigned bytes rather than actual UTF-8 based comparison (which would have to take into account multi-byte characters). So possibly just change the comparison of all binary types.

However we need to be careful about backward compatibility here. Another possibility is to create a separate Stats field but it seems the current min/max are wrong in some situations (bytes > 127)
@isnotinvain @rdblue what do you think?

a10y · 2016-08-23T17:58:32Z

If you assume binary types are strings of nonnegative bytes then this is
more correct. Whether or not that applies to all Binary types I'm not
certain but I at least know it affects correctness of UTF8 string
comparisons.

I can go ahead and add unit tests but would be interested knowing if this
should be changed all around. This isn't a format breaking change nor
should it have to be as far as I can tell, but it changes the sort order so
statistics calculated will be different now for binary columns

On Tuesday, 23 August 2016, Julien Le Dem notifications@github.com wrote:

Is the consequence of this change that MAX and MIN are now correct? Can
you add unit tests that illustrate the case that did not work before?

Please create a Parquet jira for this. (https://issues.apache.org/
jira/browse/PARQUET)
and prefix your PR title with it.

The bug seems more related to comparing unsigned bytes rather than actual
UTF-8 based comparison (which would have to take into account multi-byte
characters). So possibly just change the comparison of all binary types.

However we need to be careful about backward compatibility here. (another
possibility is to create a separate Stats field but it seems the current
min/max are wrong in some situaoetions)
@isnotinvain https://github.com/isnotinvain @rdblue
https://github.com/rdblue what do you think?

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#362 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAUTJyeqW53jj5c5UzejlfqLU0cxt3MCks5qiylAgaJpZM4JrJX5
.

Andrew Duffy
aduffy.org

piyushnarang · 2016-08-23T18:04:44Z

parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java

+    for (int i = 0; i < min_length; i++) {
+      int value1 = array1.get(i + offset1) & 0xFF;
+      int value2 = array2.get(i + offset2) & 0xFF;
+      if (value1 < value2) {


could simplify this by:

if (value1 != value2) { return value1 - value2; }

P.S. looking at the avro and spark comparisons the return is flipped from what you have (return +ve if val1 > val2)

Yeah I actually was trying to make this logic as close to existing logic as possible.

The sign is flipped because if you look in BinaryStatistics it uses compareTo opposite of the way that most Java code does, though strictly speaking nothing's wrong if everything is consistent which it appears to be in this case.

If I simplify these statements should I go ahead and make that change to all the compare* methods in this project?

*file, not project

Yeah this is confusing. Seems like BinaryStatistics is behaving like normal Java code:

public void updateStats(Binary min_value, Binary max_value) { if (min.compareTo(min_value) > 0) { min = min_value.copy(); } if (max.compareTo(max_value) < 0) { max = max_value.copy(); } }

Comparable returns a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified object.

Seems like the flip of lhs and rhs happens when you call binary1.compareTo(binary2).
The code there does:
other.compareTo(value, 0, value.length).

So yeah I think you've got the sign set up correctly.

I think the simplification looks cleaner, so I'd be in favor of updating the methods in the file.

Ah okay, yes that is the source of the flipping. I can simplify the comparison methods to reflect your suggestion.

Should I also go and flip the comparisons so that they return idiomatic -1 = less, 1 = greater?

I think we can leave that part as is. binary1.compareTo(binary2) ends up returning the correct value so it seems to be respecting the Comparable contract. Might be useful to add a comment above your method (and the couple of others which were following that behavior) to call out that it is inverted for a reason. That'll help folks reading the code in the future.

Actually the flip is sort of necessary due to the use of polymorphic Binary so I'll just leave it as it is and apply the simplifications.

isnotinvain · 2016-08-24T00:24:33Z

I have a few questions about this. In general I am not sure this is a safe thing to change?

While parquet has a Binary type, and a sub-type of Binary for strings, whether a Binary was originally a String or not is probably not carefully tracked throughout the codebase. I don't think that instances of Binary should have different sort orders based on how they were constructed. Fore example, if one part of the code grabs a chunk of bytes from a buffer, that happens to be a string, but this is at a layer not paying attention to that or not, it will get the normal binary sorting, not the string sorting.

Additionally, if we change how strings are sorted, the min + max values in files already written by parquet would become invalid. That seems like a very serious breaking change. I think if we want lexicographic sorting of strings in the parquet statistics, that should be handled in the layer that creates the statistics, and should be done via a new field in the parquet-format thrift schema.

Does that make sense?

julienledem · 2016-08-24T02:33:40Z

@isnotinvain @andreweduffy @piyushnarang: By looking at the code it seems related to java having signed bytes rather than differences in comparing strings vs bytes.
typically when comparing binary data, you would consider the bytes unsigned.

Now the comparison is different only if you have bytes that are > 127 which does not happen if it's just plain english text. It would be great if we had examples of strings where the behavior is different. Possibly under the form of unit tests.

If this is correct then this seems like a bug and existing min/max for others than plain 7bits ascii will be wrong anyway.

There will be another issue that this is still not proper unicode ordering since it doesn't take into account multibyte characters. (Which I think we want to discuss separately to keep this PR focused)

isnotinvain · 2016-08-24T04:33:38Z

I would need to check, but I think UTF8 is sortable byte by byte if you
treat them unsigned.

That said -- parquet format doesn't have a notion of strings nor a special
case for how they are sorted. If we currently sort all binary data byte by
byte as signed bytes that's not necessarily incorrect unless parquet-format
says otherwise?

We should be very careful changing anything that invalidates old
statistics, or else any query engine using them will return wrong results.
Does parquet cpp sort binaries in a different way? It seems that this
should be part of parquet-format.

So I think we need to:

check if parquet-format specifies sorting rules and if it doesn't, it
should
think about backwards compatibility.

As long as the sorting was consistent (tho no lexicographic) then the
statistics are not currently broken, because queries will use the same sort
order right?

On Tuesday, August 23, 2016, Julien Le Dem notifications@github.com wrote:

@isnotinvain https://github.com/isnotinvain @andreweduffy
https://github.com/andreweduffy @piyushnarang
https://github.com/piyushnarang: By looking at the code it seems
related to java having signed bytes rather than differences in comparing
strings vs bytes.
typically when comparing binary data, you would consider the bytes
unsigned.

Now the comparison is different only if you have bytes that are > 127
which does not happen if it's just plain english text. It would be great if
we had examples of strings where the behavior is different. Possibly under
the form of unit tests.

If this is correct then this seems like a bug and existing min/max for
others than plain 7bits ascii will be wrong anyway.

There will be another issue that this is still not proper unicode ordering
since it doesn't take into account multibyte characters. (Which I think we
want to discuss separately to keep this PR focused)

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#362 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAWw6VHqrOAUB6lDEMiRANt5a3tSnlfIks5qi62HgaJpZM4JrJX5
.

a10y · 2016-08-24T10:35:16Z

This is actually currently broken on Spark as spark sorts by unsigned bytes so there's a mismatch when using the statistics for queries where you actually have to read more row groups than necessary because min/max are off relative to the way Spark sorted the data. A simple way to replicate this behavior is to have a simple Parquet row group with one column and two rows, one with "é" and one with "b", here min will be "é" and max will be "b", but should actually be the other way around based on how Spark actually sorted, so you miss the chance to skip this row group for queries for "a", etc.

I can't speak for Hive or Impala but I know that in Spark there is a distinction between a "binary" type and a "string" type, where Binary type is sorted like normal java byte[] i.e. as signed bytes and strings are UTF8 encoded byte arrays and are sorted as in this PR. I can imagine changing the format to specify something like an enum SortStrategy in the ColumnMetadata, which is either SignedBytes or UnsignedBytes, defaulting to SignedBytes when not specified to be compatible with engines that use the existing sort order. This will then be something specified on Binary subytes as well, and can be asserted on read by toplevel processing engine using Parquet. However, the assert is all this really buys you, and I still don't fully see the use for a format change since this only affects the write path.

a10y · 2016-08-24T12:37:44Z

Also as an update I realize this does affect the correctness of UTF8 string comparisons, see https://issues.apache.org/jira/browse/SPARK-17213

piyushnarang · 2016-08-24T18:12:50Z

I don't think we're currently specifying the sort rules in parquet-format - https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L198.

I guess one of the issue that this change could cause even though its just on the write path is engines like Spark today are returning a certain set of results based on this ordering (which is incorrect like you pointed out). If we flip it and someone is trying to read the same data (written in the new format), they'll end up seeing a different set of results compared to the old files which isn't ideal.

How about adding a couple of optional min / max fields to the statistics object that have this unsigned ordering?

piyushnarang · 2016-08-24T18:59:37Z

So I was chatting with @isnotinvain and a few points came up:

Alex's suggestion was to document in parquet-format what the current sort order is (not really add fields / enums). That's basically our spec which is used by parquet-mr (java) and parquet-cpp so we should specify the order there. As part of this we should verify that parquet-cpp also sorts binaries in the same way (signed).
Based on the approach we decide, we'll also need to ensure these changes aren't restricted to FromStringBinary. Think it is possible for someone to create a Binary object from a String using any of the other Binary classes as well (e.g. directly using a byte buffer).

a10y · 2016-08-24T19:17:03Z

So currently there is no Statistics writing for Parquet, it's currently in-process for a few months now (apache/parquet-cpp#129).

However their ByteArray type represents bytes as uint8_t, so they assume all binary-typed data is unsigned (https://github.com/apache/parquet-cpp/blob/f97042d9bab10fffdb5e532fcc21a9ccc27f4f1c/src/parquet/types.h#L118).

isnotinvain · 2016-08-24T19:41:59Z

Thanks for looking into this @piyushnarang and @andreweduffy.

Given that the sort order is not specified in parquet-format, and parquet-mr is the only implementation that currently writes these statistics, I feel fairly strongly that we should update parquet-format to say that the sorting order for binary types is the one parquet-mr currently uses.

Separately from that, I think it would be a good idea to propose some new fields in the parquet-format statistics object, something like:

Deprecate the current min/max fields in the case of binary columns
Add 4 fields: unsignedMin, unsignedMax, signedMin, signedMax for use in binary columns

This way, when a reader sees any of unsignedMin, unsignedMax, signedMin, signedMax set, it knows it can use those safely. When it sees only min/max set, it knows that those are implicitly signed. We should document all of this in the parquet-format spec as well.

a10y · 2016-08-24T20:25:45Z

After looking through a few other binary formats (Thrift, Protobufs) and some databases (Postgres' BINARY format), I'm actually of the opinion that it makes the most sense for binary data to be interpreted as a string of unsigned bytes, however I mainly just want Parquet statistics to work correctly with other systems, so your recommendation is fine by me, but just putting out there that unsigned seems to be a more widely accepted interpretation. Signed interpretation is a relic of the Java's lack of unsigned types.

isnotinvain · 2016-08-24T20:42:02Z

I 100% agree that unsigned bytes is a better sort order. I don't think that needs any debate.

The problem is, in some sense we have already picked a different sort order, and we cannot change that, even to a strictly better sort order, without maintaining backwards compatibility. So it's really not about which is better, it's about whether we can change what we have been using so far. I think the best bet is to add this better sort order, but we need to continue to support files written with the old sort order.

isnotinvain · 2016-08-24T20:43:13Z

And the only way to do that is probably by adding new stats fields. And we will also need the filter push down layer to be explicit about which kind of sort order a particular query wants.

a10y · 2016-08-24T21:03:41Z

Got it, I'm in agreement that preserving compatibility is important.

I can make the change to Statistics here, change to parquet-format will just be doc comment update in the thrift file which I'll do separately.

piyushnarang · 2016-08-24T21:06:39Z

@andreweduffy - Thanks. When you update format we should also ping folks on the cpp PR to let them know. Would be nice to have that code also compatible.

a10y · 2016-08-24T21:11:19Z

Yep, I pinged Wes about it on the CPP ticket I linked above, says he'll want to make a separate ticket for that but should follow up later

a10y · 2016-08-25T18:05:46Z

Just pushed a commit, it'll fail to build because it depends on updated parquet-format which I built against a local snapshot, but it builds and passes all tests on my laptop.

I allow users to pass in a map of ColumnPath -> boolean where the boolean is if the user wants to use unsigned statistics and comparison operators for pushdown, defaults to using default behavior that existed before.

isnotinvain · 2016-08-25T23:08:33Z

Hmm, I'm not sure that's the approach we'll want take. Users would have to set this up on the write path and then very carefully mirror it on the read path.

Why don't we instead just store both in the statistics? That seems like it has much less room for users to make a mistake, and it also avoids the situation where for a given parquet file you don't even know what sort order was used for a particular column.

isnotinvain · 2016-08-25T23:11:17Z

Oh I see I misread -- the settings are just for the read path right?

I think in that case it might make more sense to just make the filter API a little more explicit, instead of configuring things per column, lets just update the filter API? Eg allow users to do:
ltEq(fooColumn, "bar", SortOrder.UNSIGNED)

a10y · 2016-08-26T00:18:42Z

Yep the sorting specification is only needed for the read since all
available info is written regardless. Forcing users to specify on the
filters would also work, had some concerns but realise they're moot so I'll
go ahead and patch that. Any other large changes that need to be made here?

On Friday, 26 August 2016, Alex Levenson notifications@github.com wrote:

Oh I see I misread -- the settings are just for the read path right?

I think in that case it might make more sense to just make the filter API
a little more explicit, instead of configuring things per column, lets just
update the filter API? Eg allow users to do:

ltEq(fooColumn, "bar", SortOrder.UNSIGNED)

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#362 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAUTJy1KFZkR8z-5Pyt5yzIc7cqinqmQks5qjiEXgaJpZM4JrJX5
.

Andrew Duffy
aduffy.org

isnotinvain · 2016-08-26T00:45:25Z

I think the bigger thing is to make sure we get some buy-in on the parquet-format change. If anyone has any proposed changes to that it'll have to get reflected here, so maybe we should send an email to the dev list with the parquet-format PR first. Would probably help to give some background info there as well.

a10y · 2016-08-26T00:51:35Z

Sounds good, I'll go ahead and send one

piyushnarang · 2016-09-02T02:02:40Z

pom.xml

@@ -73,7 +73,7 @@
    <hadoop1.version>1.1.0</hadoop1.version>
    <cascading.version>2.5.3</cascading.version>
    <cascading3.version>3.0.3</cascading3.version>
-    <parquet.format.version>2.3.1</parquet.format.version>
+    <parquet.format.version>2.3.2-SNAPSHOT</parquet.format.version>


todo to remember to update this once the format PR is merged

the travis build won't pass untill then

a10y · 2016-09-02T08:23:59Z

Thanks for the feedback @piyushnarang and @isnotinvain, will push a followup soon

a10y · 2016-09-02T13:58:46Z

@piyushnarang @isnotinvain made the changes, in particular initializeStats and updateStats are now broken into signed/unsigned variants, renamed comparison methods to include signed/unsigned distinction

piyushnarang · 2016-09-06T20:40:29Z

@andreweduffy, shall take another look. Was not able to see your updates as I was out over the weekend.

piyushnarang · 2016-09-06T21:10:50Z

parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java

  }

  @Override
  public boolean isSmallerThan(long size) {
-    return !hasNonNullValue() || ((min.length() + max.length()) < size);
+    return !hasNonNullValue() || (((minSigned.length() + maxSigned.length()) < size) && ((minUnsigned.length() + maxUnsigned.length()) < size));


seems like minSigned.length() + maxSigned.length() should be the same as minUnsigned.length() + maxUnsigned.length() right? we could think about dropping one pair from this comparison

I think there are cases where that might not be true. Take the following example column:

é ello a

minSigned = é
maxSigned = ello

minUnsigned = a
maxUnsigned = é

piyushnarang · 2016-09-06T21:29:24Z

👍 , looks good to me

julienledem · 2016-09-06T21:58:14Z

parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java

-    checkNotNull(pred, "pred");
-    checkNotNull(columns, "columns");
+    checkNotNull(pred, "predicate cannot be null");
+    checkNotNull(columns, "columns cannot be null");


the second parameter is just the name of the parameter.
https://github.com/apache/parquet-mr/blob/6dad1e3bd0e277f5b5e5e2a0720f474271c1648d/parquet-common/src/main/java/org/apache/parquet/Preconditions.java#L38
if renaming pred, you should rename in in the method signature as well.

i'll just change this back then, these are separate from the root of the pr

i just realized this isn't even guava preconditions, this is your own checkNotNull. This is definitely the wrong thing to do, will be reverted

julienledem · 2016-09-06T22:00:51Z

Looks good to me.
depends on a parquet-format change

rdblue · 2016-09-08T22:22:22Z

Hi everyone. Sorry I'm late to the party, evidently I've been missing notifications.

I don't understand why the approach is to keep the signed ordering around and add API support for it. Is anyone going to opt for signed ordering? UTF-8 lexicographical ordering corresponds to the byte ordering using unsigned comparison. As far as I can tell, the signed ordering isn't standard or used anywhere else. That makes me more inclined to treat it as a bug instead of adding it to the API.

I completely agree that backward-compatibility is important. I think we should address this problem like PARQUET-251, where we considered binary min and max corrupt and didn't return them when converting the footer. Then we can fix the ordering to be unsigned and not need to change the API. Since the ordering is the same for 7-bit ascii data, I think we should also add a property to use the min and max anyway if users want to override the fix.

(There's also the equality predicate case, where it doesn't matter what the ordering is as long as you're consistent, but I don't think it is worth adding special handling for this. A property to use min/max should work for this case as well.)

rdblue · 2016-09-09T21:55:46Z

In the interest of getting a Parquet MR release done, I've posted an incomplete fix: PR #367. That patch prevents stats from being returned by the metadata converter, like the fix for PARQUET-251. It detects whether the logical type should use signed or unsigned order, so it addresses this bug for unsigned integer types as well as UTF8 and decimals. It also makes as few changes as possible so we can decide both how to implement different sort orders and how to store those in the format in future releases.

a10y · 2016-09-09T22:04:21Z

Gotcha. You can probably ignore my comments on your PR then, I made those before I saw your post. In either case would be great to continue the conversation around whether or not we're gonna go with a format change or just force binaries over to unsigned statistics.

rdblue · 2016-09-09T22:24:31Z

I think we need a format change. We need to know the sort order that was used for a column in order to use min and max moving forward. While we could use the writer version to detect it, I think the better option is to store the ordering that was used. That way we can support locality-specific sort orders in the future (the writer provides a name and a Comparator for the order).

a10y · 2016-09-09T22:26:07Z

+1 I'm on board with that

isnotinvain · 2016-09-10T00:31:48Z

I would prefer we don't invalidate all the old statistics. We (and presumably others) have a lot of data with those stats in them. And while unsigned byte ordering is certainly a better choice, the signed ordering wasn't explicitly a bug. It seems the ordering was never actually specified in parquet-format. Why not keep support for both orderings explicitly? We can set the defaults to the unsigned ordering. It also allows us to avoid switching on the version number when looking at stats, which is not super reliable given that a lot of our (and probably others) files were written from internal forks with non-standard version numbers.

I think adding an unsigned_min and unsigned_max field makes it explicit and allows the reader to determine whether what they have is signed or not.

isnotinvain · 2016-09-10T00:40:58Z

Actually, I keep forgetting that this still works for equality and inequality w/ either ordering.

It might be OK to update parquet-format specifying that the only supporting ordering is unsigned, add a marker field (or how about even a version number field?) to parquet-format as well, and then in the read path we can determine which kind of stats are available and not use them for range queries when they are the wrong kind.

rdblue · 2016-09-10T00:54:05Z

@isnotinvain, sorry I should have been more clear about what I think the path forward is.

To get 1.9.0 out, I think we should disable signed stats by type -- for unsigned ints, UTF8, and decimals. But this also includes a flag to continue using those stats for UTF8. That's because we have a ton of data with these stats as well. (Though keep in mind that this doesn't affect dictionary filtering, which is coming in 1.9.0.) This is a stop-gap to fix correctness only.

Then, I think we should add what ordering was used to the format. For older files, we default that to "unsigned" and then Parquet will have full access to the old stats as you recommend because they can still be used for equality/inequality. This also avoids needing to check version strings because we can assume the unsigned ordering when it isn't set.

The problem I have with the current proposal is that it exposes the ordering bug to users through the API, when I think it should be internal, but with a flag to signal that you're okay with the correctness issues because you only wrote ASCII data strings.

isnotinvain · 2016-09-10T01:29:38Z

I agree with most of that, my main issue is treating this as a correctness bug. As is, parquet is self consistent and correct. The push down filters match the stats in the files. Changing the ordering parquet uses from signed to unsigned is a good feature to add, but I'm not sure we should treat the current behavior as a bug. It's more of a "in retrospect a better idea would have been" -- unless I'm missing something and the parquet-format spec specifies an ordering for binary columns?

If we expose the ordering to the query writer (ideally with defaults so most users don't even notice) then instead of a flag that flips you into "there are bugs here" mode, users can push down filters and signal which ordering they expect the filter to use. At least that way we can fail when someone pushes down an unsigned filter to a file that doesn't have unsigned stats (or not fail, but skip the stats based filtering).

I do get the value of not polluting the API with more options, but I prefer that we expose this complexity to users in a way that allows us to be correct all the time, instead of adding a flag that is "use at your own risk". If all our filter push downs from users are tagged with a requested ordering, then we can safely use or not use the available stats for them. We can also make the default ordering unsigned ordering, so it shouldn't pollute the API too much.

rdblue · 2016-09-10T02:51:13Z

Changing the ordering parquet uses from signed to unsigned is a good feature to add, but I'm not sure we should treat the current behavior as a bug

I definitely think this is a bug. The sort order when you store a string doesn't match the sort order of java.lang.String. Sure, we're consistent with the order and it is useful sometimes, but the min and max we report for Strings are wrong.

I prefer that we expose this complexity to users in a way that allows us to be correct all the time

We can take advantage of it being correct for ASCII characters, but I don't think we should make the problem go away by making it an accidental feature, nor should we perpetuate an order that doesn't make sense in our API. This would only exist in Parquet.

How is exposing the signed ordering in the API any more powerful than setting when it is safe to use the signed ordering min and max? That's just detecting whether to use the signed min and max at query evaluation time or fall back. When would you not do this globally anyway? Would users specify two sets of predicates?

The drawback to exposing the signed ordering in the API is that you have to change the current min and max API's behavior anyway.

If we used the current API for the signed min and max, then we'd end up with downstream apps continuing to use the wrong (meaning unexpected) min and max. If we use the current API for the unsigned min and max, then we can't take advantage of existing signed min and max... unless we added a property to do so, but then we'd have a larger API with features identical to what I'm proposing. And if we removed the current API entirely then downstream apps would all have to rewrite to use stats, which is obviously a problem.

I don't see value in exposing this bug to downstream users.

rdblue · 2016-09-13T17:29:27Z

@isnotinvain, can you take another look at this? I'd like to get a release out and I think it depends on whether or not we think we need to address the min/max problem as a correctness issue. Thanks!

julienledem · 2016-09-21T19:39:19Z

I agree with @rdblue that this is a bug. A spec bug but a bug.
Consistent is not the same as correct. Being inconsistent would be incorrect but being consistent is not sufficient for being correct. If parquet-mr is currently consistent within itself, it is inconsistent with sort orders that users wants to push down which is what matters.
In the general case, the signed bytes sort order currently used for binary types (and strings in particular) should not be used. In specific cases it can be used. For example: all byte values in between 0 and 127 or 7 bit ASCII text.

The PR #367 that @rdblue proposed is compatible with the unsigned_min/max extension proposed on parquet-format apache/parquet-format#42 and will unblock the situation for releasing 1.9.0. It will hide the stats in the case they are not the order the user expects and will prevent inconsistent push downs.

Regarding exposing a setting to the user, I think @isnotinvain is right that we should not pollute the api with too many workaround settings. Possibly we should not document it since this is a workaround that will be used for a very small subset of users (people on this thread only?)

a10y · 2017-09-01T20:43:27Z

Looks like PARQUET-686 was eventually resolved. See as this PR's 1 yr anniversary is coming up, I'm gonna go ahead and close this.

xiexingguang

merge master ?

a10y changed the title ~~Add lexicographic ordering for FromStringBinary~~ PARQUET-686: Add lexicographic ordering for FromStringBinary Aug 23, 2016

piyushnarang reviewed Aug 23, 2016
View reviewed changes

a10y mentioned this pull request Aug 24, 2016

PARQUET-593: Add API for writing Page statistics apache/parquet-cpp#129

Closed

a10y mentioned this pull request Aug 25, 2016

PARQUET-686: Add optional {signed,unsigned}_{min,max} fields to Statistics apache/parquet-format#42

Closed

PARQUET-686: Allow for Unsigned Statistics in Binary Type

7c1b963

a10y force-pushed the utf8 branch from 290818c to 7c1b963 Compare September 1, 2016 15:48

piyushnarang reviewed Sep 2, 2016
View reviewed changes

Fixes for comments

0b0fe10

piyushnarang reviewed Sep 6, 2016
View reviewed changes

julienledem reviewed Sep 6, 2016
View reviewed changes

Fixup incorrect usages of checkNotNull

8202fea

a10y force-pushed the utf8 branch from 62388c1 to 8202fea Compare September 6, 2016 23:28

julienledem mentioned this pull request Sep 21, 2016

PARQUET-686: Do not return min/max for the wrong order. #367

Closed

a10y closed this Sep 1, 2017

xiexingguang reviewed Jul 17, 2018

View reviewed changes

This was referenced Jun 23, 2024

Allow for Unsigned Statistics in Binary Type apache/parquet-format#314

Closed

[C++][Parquet] Implement PARQUET-686 statistics bug fixes apache/arrow#42562

Closed

PARQUET-686: Allow for Unsigned Statistics in Binary Type #362

PARQUET-686: Allow for Unsigned Statistics in Binary Type #362

Conversation

a10y commented Aug 23, 2016

julienledem commented Aug 23, 2016 • edited Loading

a10y commented Aug 23, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isnotinvain commented Aug 24, 2016

julienledem commented Aug 24, 2016

isnotinvain commented Aug 24, 2016

a10y commented Aug 24, 2016

a10y commented Aug 24, 2016

piyushnarang commented Aug 24, 2016

piyushnarang commented Aug 24, 2016

a10y commented Aug 24, 2016

isnotinvain commented Aug 24, 2016

a10y commented Aug 24, 2016

isnotinvain commented Aug 24, 2016

isnotinvain commented Aug 24, 2016

a10y commented Aug 24, 2016

piyushnarang commented Aug 24, 2016

a10y commented Aug 24, 2016

a10y commented Aug 25, 2016

isnotinvain commented Aug 25, 2016

isnotinvain commented Aug 25, 2016

a10y commented Aug 26, 2016

isnotinvain commented Aug 26, 2016

a10y commented Aug 26, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

a10y commented Sep 2, 2016

a10y commented Sep 2, 2016

piyushnarang commented Sep 6, 2016 • edited Loading

Choose a reason for hiding this comment

a10y Sep 6, 2016 • edited Loading

Choose a reason for hiding this comment

piyushnarang commented Sep 6, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

julienledem commented Sep 6, 2016

rdblue commented Sep 8, 2016 • edited Loading

rdblue commented Sep 9, 2016

a10y commented Sep 9, 2016

rdblue commented Sep 9, 2016

a10y commented Sep 9, 2016

isnotinvain commented Sep 10, 2016

isnotinvain commented Sep 10, 2016

rdblue commented Sep 10, 2016

isnotinvain commented Sep 10, 2016

rdblue commented Sep 10, 2016

rdblue commented Sep 13, 2016

julienledem commented Sep 21, 2016

a10y commented Sep 1, 2017

xiexingguang left a comment

Choose a reason for hiding this comment

julienledem commented Aug 23, 2016 •

edited

Loading

piyushnarang commented Sep 6, 2016 •

edited

Loading

a10y Sep 6, 2016 •

edited

Loading

rdblue commented Sep 8, 2016 •

edited

Loading