API for path generation in Iceberg frameworks #55

Closed
mccheah opened this issue Dec 18, 2018 · 8 comments

Comments

@mccheah
Contributor

mccheah commented Dec 18, 2018

In the integrations that Iceberg supports out of the box (Spark, Pig), the frameworks decide how to generate paths for written files. However, some sources would prefer to pick their own paths for new files. Some questions for designing such an API include:

  • Should this be bundled with the FileIO API?
  • Should such a path generation API be concerned about the file's partition values? Partition metadata would be required for the default implementation to maintain existing behavior for Spark.

@mccheah
Contributor Author

mccheah commented Dec 18, 2018

@vinooganesh @yifeih @rdblue

@rdblue
Contributor

rdblue commented Dec 18, 2018

I think this API should pass in the partition tuple. Now that I think about it, the metadata location is determined by the TableOperations implementation, not by FileIO. I think it would make sense to keep those in the same place. So TableOperations would be responsible for determining where to place both metadata and data files, while FileIO would just handle low-level file tasks.
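To make the idea of passing in the partition tuple concrete, here is a minimal Java sketch of a path-generation hook. The names `DataPathGenerator`, `PartitionedPathGenerator`, and `FlatPathGenerator` are hypothetical and are not taken from Iceberg. The `key=value` directory layout under `data/` is an assumed stand-in for the existing Spark behavior, and the partition tuple is simplified to an ordered map of strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical interface (not Iceberg's API): path selection receives the
// partition tuple, here simplified to an ordered map of field name -> value.
interface DataPathGenerator {
    String newDataPath(Map<String, String> partition, String fileName);
}

// Assumed default layout: <tableLocation>/data/<key=value>/.../<fileName>.
final class PartitionedPathGenerator implements DataPathGenerator {
    private final String tableLocation;

    PartitionedPathGenerator(String tableLocation) {
        // Drop one trailing slash so joined paths do not contain "//".
        this.tableLocation = tableLocation.endsWith("/")
                ? tableLocation.substring(0, tableLocation.length() - 1)
                : tableLocation;
    }

    @Override
    public String newDataPath(Map<String, String> partition, String fileName) {
        StringJoiner path = new StringJoiner("/");
        path.add(tableLocation).add("data");
        for (Map.Entry<String, String> field : partition.entrySet()) {
            path.add(field.getKey() + "=" + field.getValue());
        }
        return path.add(fileName).toString();
    }
}

// A source that prefers its own layout can ignore the partition tuple.
final class FlatPathGenerator implements DataPathGenerator {
    private final String root;

    FlatPathGenerator(String root) {
        this.root = root;
    }

    @Override
    public String newDataPath(Map<String, String> partition, String fileName) {
        return root + "/" + fileName;
    }
}

final class PathGeneratorDemo {
    public static void main(String[] args) {
        Map<String, String> partition = new LinkedHashMap<>();
        partition.put("day", "2019-01-09");
        partition.put("bucket", "3");

        DataPathGenerator byPartition = new PartitionedPathGenerator("s3://bucket/tbl/");
        DataPathGenerator flat = new FlatPathGenerator("s3://bucket/custom");

        System.out.println(byPartition.newDataPath(partition, "part-0.parquet"));
        System.out.println(flat.newDataPath(partition, "part-0.parquet"));
    }
}
```

Whichever component ends up owning this hook, passing the partition tuple lets the default keep partition directories while a custom implementation is free to ignore it.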

@mccheah
Contributor Author

mccheah commented Jan 8, 2019

Ok I'll start taking a look at this now!

@mccheah
Contributor Author

mccheah commented Jan 9, 2019

Remark: The paths module has to be a separate entity because once again, we will be instantiating it once on the driver and serializing the instance for executors. I think we can move the metadata file lookup to this module as well.
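As a rough illustration of the serialization constraint, the sketch below builds a path generator once and round-trips it through Java serialization, which is approximately what happens when an instance created on the driver is shipped to executors. `SerializablePathGenerator` and `SerializationRoundTrip` are hypothetical names used only for this example, and plain `java.io` serialization is an assumption about the mechanism.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical path generator that only holds serializable state, so an
// instance built on the driver can be sent to executors as-is.
final class SerializablePathGenerator implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String dataRoot;

    SerializablePathGenerator(String dataRoot) {
        this.dataRoot = dataRoot;
    }

    String newDataPath(String fileName) {
        return dataRoot + "/" + fileName;
    }
}

final class SerializationRoundTrip {
    // Serializes the value to bytes and reads it back, mimicking the
    // driver -> executor hop. Checked exceptions are wrapped for brevity.
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        SerializablePathGenerator onDriver = new SerializablePathGenerator("s3://bucket/tbl/data");
        SerializablePathGenerator onExecutor = roundTrip(onDriver);
        System.out.println(onExecutor.newDataPath("part-0.parquet"));
    }
}
```

Keeping the path logic in a small object with only serializable fields avoids dragging non-serializable state, such as open connections, along with it.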

@rdblue
Contributor

rdblue commented Jan 9, 2019

@mccheah, in that case is this something that we should add to the FileIO interface? It wouldn't be that much of a stretch to do path selection as well. What do you think?

@mccheah
Contributor Author

mccheah commented Jan 9, 2019

Yup, I don't have a strong opinion here, so let's bundle it with the FileIO interface. I'll propose a diff, but I anticipate wanting to fine-tune the exact method signatures as part of the PR review.

@mccheah
Contributor Author

mccheah commented Feb 26, 2019

Looks like this is done! I'll close it.

mccheah closed this as completed Feb 26, 2019

@xabriel
Contributor

xabriel commented Feb 26, 2019

(For completeness, linking this issue to the PR that made it to master: #87)
