Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't group files with different partition specs in BaseRewriteManifests #480

Closed
aokolnychyi opened this issue Sep 17, 2019 · 10 comments · Fixed by #666
Closed

Don't group files with different partition specs in BaseRewriteManifests #480

aokolnychyi opened this issue Sep 17, 2019 · 10 comments · Fixed by #666

Comments

@aokolnychyi
Copy link
Contributor

I think we should not combine files with potentially different partition specs into the same manifest in BaseRewriteManifests as we in do in MergingSnapshotProducer.

@aokolnychyi
Copy link
Contributor Author

@bryanck @rdblue what do you think?

@bryanck
Copy link
Contributor

bryanck commented Sep 17, 2019

The downside is that this could potentially result in many small manifests. For example, we have a bucketed table with 10k buckets, which is additionally partitioned by date and hour. So for a week of data we'd have 10k * 24 * 7 = 1.68 million manifests. What we do with RewriteManifests today is combine files into groups of four buckets and ignore date and hour, which results in 10k/4 = 2.5k manifests for a week of data. Performance with file pruning is good even though the manifests have multiple partition specs.

@aokolnychyi
Copy link
Contributor Author

@bryanck I mean a partitioning strategy in general by partition spec, which seems to be the same in your case. What will happen in case of partition schema evolution? Manifests are supposed to have files written with the same partition spec:

Manifests store data files from any partition, as long as the partition spec is the same for the data files

@aokolnychyi
Copy link
Contributor Author

I also raised a question similar to what you describe in #481.

@bryanck
Copy link
Contributor

bryanck commented Sep 17, 2019

Right, I was thinking partition, not partition spec. In that case, then I agree it makes sense.

@rdblue
Copy link
Contributor

rdblue commented Sep 19, 2019

I agree with this. A manifest can only be written for one partition spec, so this is definitely a bug!

@rdblue rdblue added this to the Java 0.8.0 Release milestone Oct 13, 2019
@manishmalhotrawork
Copy link
Contributor

@aokolnychyi trying to understand the problem and thinking loud for solution.

so, would the fix be to group manifests by partition spec before cluster by clusterFunction.
so for group of each partition spec's manifest files, the cluster by logic will be applied, where currently its applied for every manifests irrespective of the underlying manifests partition spec.

                     ManifestReader.read(ops.io().newInputFile(manifest.path()), ops.current().specsById())) {
                FilteredManifest filteredManifest = reader.select(Arrays.asList("*"));
                filteredManifest.liveEntries().forEach(
                    entry -> appendEntry(entry, clusterByFunc.apply(entry.file()))
                );

Manifest files grouped by partition spec is done in MergingSnapshotProducer at this line. So similar logic can be applied.

One more interesting point,
as per this line ManifestFileWriter is created using the TableOperations.current.spec which will be used to create all ManifestWriters, though ManifestReader is created based on the partitionSpec belongs to the Manifest File, as per this line

Not sure if my understanding is correct or not, So that is also part of the problem ?

@aokolnychyi
Copy link
Contributor Author

@manishmalhotrawork, I think you got the problem correctly. We need to take into account that there might be multiple partition specs in a table. We can get a map of partition specs from Table and use that to correctly initialize ManifestWriter.

@aokolnychyi
Copy link
Contributor Author

@manishmalhotrawork, do you want to submit a PR?

@manishmalhotrawork
Copy link
Contributor

@aokolnychyi yes, let me submit PR. thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants