Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to load column stats with data files. #206

Merged
merged 5 commits into from Jun 6, 2019

Conversation

Projects
None yet
4 participants
@rdblue
Copy link
Contributor

commented Jun 5, 2019

This adds an option to BaseTableScan to load column-level stats. Many clients do not need column-level stats when scanning Iceberg tables and removing them before copying data files speeds up query planning.

When filtering files listed in a manifest, stats columns are now always loaded for the InclusiveMetricsEvaluator. When a data file is selected, its contents are either copied using a full copy or a "slim" copy that discards the large maps of column stats.

rdblue added some commits Jun 5, 2019

@rdsr

rdsr approved these changes Jun 6, 2019

Show resolved Hide resolved core/src/main/java/org/apache/iceberg/BaseTableScan.java Outdated
this.upperBounds = SerializableByteBufferMap.wrap(copy(toCopy.upperBounds));
if (fullCopy) {
// TODO: support lazy conversion to/from map
this.columnSizes = copy(toCopy.columnSizes);

This comment has been minimized.

Copy link
@rdsr

rdsr Jun 6, 2019

Contributor

columnSizes was not part of the STATS_COLUMN field in FilteredManifest class

This comment has been minimized.

Copy link
@rdblue

rdblue Jun 6, 2019

Author Contributor

That's because it isn't required to run the filters in FilteredManifest. Those columns are the ones that need to be projected for the stats filter to work, but may not have been requested by the caller.

rdblue added some commits Jun 6, 2019

@danielcweeks

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2019

+1 Looks good

@danielcweeks danielcweeks merged commit cb600c6 into apache:master Jun 6, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@aokolnychyi

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

@rdblue how do we fetch columnSizes now? Is it supposed to be for internal use only?

@rdblue

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

@aokolnychyi, you can load column sizes by calling includeColumnStats. Looks like it also needs to be in the list of stats in FilteredManifest as well. I think that's a bug.

rdblue added a commit to rdblue/iceberg that referenced this pull request Jun 12, 2019

Add option to load column stats with data files. (apache#206)
* Add option to load column stats with data files.

* Fix style problems.

* Rename slimCopy to copyWithoutStats, fix other comments.

* Fix javadoc.

* Fix style.

rdblue added a commit to rdblue/iceberg that referenced this pull request Jun 12, 2019

@aokolnychyi

This comment has been minimized.

Copy link
Contributor

commented Jun 14, 2019

@rdblue, exactly. By calling includeColumnStats we cannot fetch columnSizes as of now.

@rdblue rdblue deleted the rdblue:add-slim-copy branch Jun 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.