Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-372: Do not write stats larger than 4k. #275

Closed

Conversation

@rdblue
Copy link
Contributor

rdblue commented Sep 23, 2015

This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

@rdblue

This comment has been minimized.

Copy link
Contributor Author

rdblue commented Sep 23, 2015

@spena, @julienledem, @isnotinvain what do you guys think of this change? We're hitting cases where the min/max stats are about the size of a page, so we end up with pages 3x larger than they should be.

I think the best solution is to not write stats over some limit, which I've set at 4k here. Scanning would be quick when this is the case because if values are that large there wouldn't be many of them in a page. Also, since the large values must be min and/or max, we don't expect many cases where just one huge value prevents stats from being written.

I had suggested on the ticket that we might want to modify the value, but that gets messy and wouldn't allow us to satisfy queries from stats alone (select max(col) from table).

@rdblue

This comment has been minimized.

Copy link
Contributor Author

rdblue commented Sep 23, 2015

Test failures are again due to flaky memory manager tests, fixed in #263.

@rdblue

This comment has been minimized.

Copy link
Contributor Author

rdblue commented Sep 23, 2015

I looked at stats merging and I think this is safe because statistics are aggregated before they are written to the file. A column chunk will accumulate the min and max values for all pages accurately, but when individual pages are written, the stats may be discarded instead of written to the file.

@rdblue rdblue force-pushed the rdblue:PARQUET-372-fix-min-max-for-long-values branch 2 times, most recently from 8a7224c to 048cc62 Sep 24, 2015
@spena

This comment has been minimized.

Copy link

spena commented Sep 29, 2015

@rdblue The patch looks good.

I just have a question about the limit for the statistics. Why not limiting the size to the current page size used on the row group (or pagesize / 4 or something)? If a user ends setting a big page size, then they won't have the advantage of using the statistics to check whether the row group should be read or not.

@rdblue

This comment has been minimized.

Copy link
Contributor Author

rdblue commented Oct 2, 2015

Thanks for taking a look, @spena. There are a couple of motivations for the 4k static limit. First, I'd like to keep a hard limit on the page size so they don't get unexpectedly large again. The other reason is just to keep the code simple and not need to pass the current page size through unless we know it's going to be beneficial. We can always make it more fancy later.

@rdblue rdblue force-pushed the rdblue:PARQUET-372-fix-min-max-for-long-values branch from 048cc62 to 8c1323e Oct 27, 2015
@rdblue

This comment has been minimized.

Copy link
Contributor Author

rdblue commented Dec 1, 2015

@julienledem, I think this (or something similar) should go into 1.9.0. Could you have a look at it?

@rdblue rdblue force-pushed the rdblue:PARQUET-372-fix-min-max-for-long-values branch from 8c1323e to 3a5207e Dec 18, 2015
@rdblue

This comment has been minimized.

Copy link
Contributor Author

rdblue commented Dec 18, 2015

I've rebased this and added a comment to explain why stats are ignored rather than truncated.

@julienledem could you take a look when you have time? Thanks!

@lw-lin

This comment has been minimized.

Copy link
Contributor

lw-lin commented Jan 30, 2016

hi @rdblue
Although I wonder if we could just stop updating the stats once we encounter large-sized max+min (to save the max/min compassion for upcoming writes), I'm +1 for this. I've looked through the code changes as well as the added unit tests -- LGTM.

// rationale is that some engines may use the minimum value in the page as
// the true minimum for aggregations and there is no way to mark that a
// value has been truncated and is a lower bound and not in the page.
if (!statistics.isEmpty() && statistics.isSmallerThan(MAX_STATS_SIZE)) {

This comment has been minimized.

Copy link
@danielcweeks

danielcweeks Apr 5, 2016

Contributor

Is it possible to have null count without min/max. This is a nit, but engines could use null count even without min/max present.

This comment has been minimized.

Copy link
@rdblue

rdblue Apr 20, 2016

Author Contributor

I just looked into this. There's nothing preventing us from doing this in the format, but the implementation assumes that there will be non-null min and max if numNulls != numRows. I think it makes sense to add this as a follow-up instead of making this patch larger.

* @param size a size in bytes
* @return true iff the min and max values are less than size bytes
*/
abstract public boolean isSmallerThan(long size);

This comment has been minimized.

Copy link
@danielcweeks

danielcweeks Apr 5, 2016

Contributor

size() might be a more intuitive approach and would remove the burden of evaluating this check in the subclasses.

This comment has been minimized.

Copy link
@rdblue

rdblue Apr 20, 2016

Author Contributor

It's been a while, but I think my logic was this:

  • getMinBytes and getMaxBytes allocate byte arrays and generally too much work for this check. We could cache the bytes, but that requires more work.
  • size isn't determined by Statistics because the metadata converter is responsible for actually converting stats into bytes. This only assumes that the entire min and max bytes will be written.

So we delegate to a stats object to avoid work, but can't delegate to stats because it doesn't do serialization. I could change it to getSizeHint, but I don't think that's much better and I'm inclined to leave it as is.

@danielcweeks

This comment has been minimized.

Copy link
Contributor

danielcweeks commented Apr 25, 2016

+1 I agree with all of the follow up to my questions.

rdblue added 2 commits Sep 23, 2015
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.
@rdblue rdblue force-pushed the rdblue:PARQUET-372-fix-min-max-for-long-values branch from 3a5207e to 61e05d9 May 5, 2016
@asfgit asfgit closed this in c3f3830 May 5, 2016
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

Author: Ryan Blue <blue@apache.org>

Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits:

61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values.
fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

Author: Ryan Blue <blue@apache.org>

Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits:

61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values.
fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.

Author: Ryan Blue <blue@apache.org>

Closes apache#275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashes the following commits:

61e05d9 [Ryan Blue] PARQUET-372: Add comment to explain not truncating values.
fbbc1c4 [Ryan Blue] PARQUET-372: Do not write stats larger than 4k.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.