
Tracking issues of iceberg-rust v0.3.0 #348

Open · 20 of 72 tasks
Fokko opened this issue Apr 24, 2024 · 9 comments

@Fokko (Contributor) commented Apr 24, 2024

Iceberg-rust 0.3.0

The main objective of 0.3.0 is to have a working read path (this is a non-exhaustive list :)

Blocking issues:

Nice to have (related to the query plan optimizations above):

State of catalog integration:

For the release after that, I think the commit path is going to be important.

Iceberg-rust 0.4.0 and beyond

These would be nice to have for the 0.3.0 release, but are not required. Of course, this is open for debate.

  • Support for Positional Deletes: entails matching the delete files to the data files based on the statistics (see the sketch after this list).
  • Support for Equality Deletes: entails putting the delete files in the right order so that they are applied in the correct sequence.
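
To make the positional-delete case more concrete, here is a minimal Rust sketch. The `PositionalDelete` struct and `apply_positional_deletes` function are hypothetical simplifications for illustration, not iceberg-rust APIs: a positional delete identifies a data file path and a row position, and the reader drops the matching positions while scanning that file.

```rust
use std::collections::HashSet;

/// A positional delete marks one row of one data file as deleted.
/// (Hypothetical simplification of the delete-file row schema.)
struct PositionalDelete {
    file_path: String,
    pos: u64,
}

/// Keep only the row positions of `data_file_path` that are not deleted.
fn apply_positional_deletes(
    data_file_path: &str,
    row_positions: impl Iterator<Item = u64>,
    deletes: &[PositionalDelete],
) -> Vec<u64> {
    // Collect the deleted positions that apply to this data file.
    let deleted: HashSet<u64> = deletes
        .iter()
        .filter(|d| d.file_path == data_file_path)
        .map(|d| d.pos)
        .collect();
    row_positions.filter(|pos| !deleted.contains(pos)).collect()
}

fn main() {
    let deletes = vec![PositionalDelete {
        file_path: "s3://bucket/data/f1.parquet".to_string(),
        pos: 1,
    }];
    let kept = apply_positional_deletes("s3://bucket/data/f1.parquet", 0..4, &deletes);
    println!("{kept:?}"); // [0, 2, 3]
}
```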

Commit path

The commit path entails writing a new metadata JSON.

  • Applying updates to the metadata: updating the metadata is important both for writing a new version of the JSON in the case of a non-REST catalog, and for keeping an up-to-date version in memory. It is strongly recommended to re-use the Updates/Requirements objects provided by the REST catalog protocol (see the sketch after this list).
  • Update table properties: sets properties on the table. Probably the best place to start, since it doesn't require a complicated API.
  • Schema evolution: API to update the schema and produce new metadata.
  • Partition spec evolution: API to update the partition spec and produce new metadata.
  • Sort order evolution: API to update the sort order and produce new metadata.
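
As a sketch of the Updates/Requirements pattern mentioned in the first item, loosely modeled on the REST catalog commit protocol: all of the types and functions below (`TableUpdate`, `TableRequirement`, `TableMetadata`, `check`, `apply`) are hypothetical simplifications for illustration, not the actual iceberg-rust API.

```rust
use std::collections::HashMap;

/// Updates, loosely mirroring the REST catalog protocol's commit body.
enum TableUpdate {
    SetProperties { updates: HashMap<String, String> },
    RemoveProperties { removals: Vec<String> },
}

/// Requirements that must hold against the current metadata for the commit to be valid.
enum TableRequirement {
    AssertTableUuid { uuid: String },
}

/// In-memory table metadata (heavily simplified).
#[derive(Debug, Default)]
struct TableMetadata {
    table_uuid: String,
    properties: HashMap<String, String>,
}

/// Validate a requirement against the current metadata before committing.
fn check(metadata: &TableMetadata, requirement: &TableRequirement) -> Result<(), String> {
    match requirement {
        TableRequirement::AssertTableUuid { uuid } => {
            if &metadata.table_uuid == uuid {
                Ok(())
            } else {
                Err(format!("table UUID mismatch: {} != {}", metadata.table_uuid, uuid))
            }
        }
    }
}

/// Apply an update to produce the next metadata version. The same logic serves
/// both writing a new metadata JSON (non-REST catalogs) and keeping the
/// in-memory copy up to date.
fn apply(mut metadata: TableMetadata, update: TableUpdate) -> TableMetadata {
    match update {
        TableUpdate::SetProperties { updates } => metadata.properties.extend(updates),
        TableUpdate::RemoveProperties { removals } => {
            for key in removals {
                metadata.properties.remove(&key);
            }
        }
    }
    metadata
}

fn main() -> Result<(), String> {
    let mut metadata = TableMetadata {
        table_uuid: "9c12d441-03fe-4693-9a96-a0705ddf69c1".to_string(),
        ..Default::default()
    };
    check(
        &metadata,
        &TableRequirement::AssertTableUuid { uuid: metadata.table_uuid.clone() },
    )?;
    metadata = apply(metadata, TableUpdate::SetProperties {
        updates: HashMap::from([("commit.retry.num-retries".to_string(), "4".to_string())]),
    });
    metadata = apply(metadata, TableUpdate::RemoveProperties {
        removals: vec!["write.wap.enabled".to_string()],
    });
    println!("{:?}", metadata.properties);
    Ok(())
}
```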

Metadata tables

Metadata tables are used to inspect the table. Having these tables also makes it easy to implement maintenance procedures, since you can list all the snapshots and expire the ones that are older than a certain threshold (see the sketch below).
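
For example, a snapshot-expiration routine could be built on top of a "snapshots" metadata table roughly like this; `Snapshot` and `expire_snapshots` are hypothetical simplifications, not the actual API:

```rust
/// Simplified view of a row in the "snapshots" metadata table.
#[derive(Debug)]
struct Snapshot {
    snapshot_id: i64,
    timestamp_ms: i64,
    manifest_list: String,
}

/// Keep snapshots newer than the threshold; never expire the current snapshot.
fn expire_snapshots(snapshots: Vec<Snapshot>, older_than_ms: i64, current_id: i64) -> Vec<Snapshot> {
    snapshots
        .into_iter()
        .filter(|s| s.timestamp_ms >= older_than_ms || s.snapshot_id == current_id)
        .collect()
}

fn main() {
    let snapshots = vec![
        Snapshot { snapshot_id: 1, timestamp_ms: 1_700_000_000_000, manifest_list: "snap-1.avro".into() },
        Snapshot { snapshot_id: 2, timestamp_ms: 1_713_000_000_000, manifest_list: "snap-2.avro".into() },
    ];
    // Expire everything older than the epoch-millis threshold, keeping current snapshot 2.
    let kept = expire_snapshots(snapshots, 1_710_000_000_000, 2);
    println!("{kept:?}");
}
```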

Write support

Most of the work in write support is around generating the correct Iceberg metadata. Some scoping decisions can be made, for example supporting only FastAppends and only V2 metadata at first.

It is common to have multiple snapshots in a single commit to the catalog. For example, an overwrite operation on a partition can be a delete + append operation (see the sketch below). This makes the implementation easier, since you can separate the problems and tackle them one by one. It also makes the roadmap easier, since the operations can be developed in parallel.
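
A sketch of that "overwrite = delete + append" idea, again with hypothetical simplified types rather than the actual iceberg-rust structs:

```rust
/// The snapshot operations relevant to this example
/// (see the full set under snapshot generation below).
#[derive(Debug)]
enum Operation {
    Append,
    Delete,
}

/// Simplified snapshot entry as it would be appended to the metadata JSON.
#[derive(Debug)]
struct Snapshot {
    snapshot_id: i64,
    parent_snapshot_id: Option<i64>,
    operation: Operation,
}

/// A partition overwrite expressed as two snapshots in one catalog commit:
/// first delete the partition's old data files, then append the new ones.
fn overwrite_partition(parent_id: Option<i64>, next_id: i64) -> Vec<Snapshot> {
    let delete = Snapshot {
        snapshot_id: next_id,
        parent_snapshot_id: parent_id,
        operation: Operation::Delete,
    };
    let append = Snapshot {
        snapshot_id: next_id + 1,
        parent_snapshot_id: Some(next_id),
        operation: Operation::Append,
    };
    vec![delete, append]
}

fn main() {
    for snapshot in overwrite_partition(Some(41), 42) {
        println!("{snapshot:?}");
    }
}
```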

  • Commit semantics
    • MergeAppend merges new manifest entries into existing manifest files. This reduces the amount of metadata produced, but takes more time to commit since existing metadata has to be rewritten, and retries are also more costly.
    • FastAppend generates a new manifest per commit, which allows fast commits but generates more metadata in the long run. PR by @ZENOTME in feat: support append data file and add e2e test #349
  • Snapshot generation: manipulation of data within a table is done by appending snapshots to the metadata JSON.
    • APPEND: only data files were added and no files were removed.
    • REPLACE: data and delete files were added and removed without changing table data; i.e., compaction, changing the data file format, or relocating data files.
    • OVERWRITE: data and delete files were added and removed in a logical overwrite operation.
    • DELETE: data files were removed and their contents logically deleted, and/or delete files were added to delete rows.
  • Add files: to add existing Parquet files to a table. Issue in Support to append file on table #345
  • Summary generation: the part of the snapshot that indicates what is in the snapshot.
  • Metrics collection. There are two situations:
    • Collect metrics when writing: as in the Java API, the upper and lower bounds are tracked during writing, and the numbers of null and NaN records are counted (see the sketch after this list).
    • Collect metrics from the footer: when an existing file is added, the footer of the Parquet file is opened to reconstruct all the metrics needed for Iceberg.
  • Deletes: this mainly relies on strict projection to check whether the data files cannot match the predicate.
    • Strict projection needs to be added to the transforms.
    • Strict Metrics Evaluator to determine if the predicate cannot match.
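
As referenced in the metrics-collection item above, here is a minimal sketch of collecting metrics while writing, for a single f64 column. The `ColumnMetrics` struct is a hypothetical simplification; in practice metrics are tracked per field ID and serialized into the manifest entry for the data file.

```rust
/// Running metrics for one column, tracked while rows are written.
/// (Hypothetical simplification; real metrics are keyed by field ID.)
#[derive(Debug, Default)]
struct ColumnMetrics {
    value_count: u64,
    null_count: u64,
    nan_count: u64,
    lower_bound: Option<f64>,
    upper_bound: Option<f64>,
}

impl ColumnMetrics {
    /// Update the running metrics with one value, as a writer would per row.
    fn update(&mut self, value: Option<f64>) {
        self.value_count += 1;
        match value {
            None => self.null_count += 1,
            Some(v) if v.is_nan() => self.nan_count += 1,
            Some(v) => {
                self.lower_bound = Some(self.lower_bound.map_or(v, |lo| lo.min(v)));
                self.upper_bound = Some(self.upper_bound.map_or(v, |hi| hi.max(v)));
            }
        }
    }
}

fn main() {
    let mut metrics = ColumnMetrics::default();
    for value in [Some(3.5), None, Some(f64::NAN), Some(-1.0)] {
        metrics.update(value);
    }
    // These counts and bounds end up in the manifest entry for the data file.
    println!("{metrics:?}");
}
```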

Future topics

  • Python bindings
  • WASM to run Iceberg-rust in the browser

Contribute

If you want to contribute to the upcoming milestone, feel free to comment on this issue. If there is anything unclear or missing, feel free to reach out here as well 👍

Fokko added this to the 0.3.0 Release milestone Apr 24, 2024
Fokko changed the title from "Tracking issues of iceberg-rust v0.2.0" to "Tracking issues of iceberg-rust v0.3.0" Apr 24, 2024
Fokko pinned this issue Apr 24, 2024
@marvinlanhenke (Contributor)

@Fokko thanks for your effort here

@Fokko (Contributor, Author) commented Apr 24, 2024

@marvinlanhenke No problem, thank you for all the work on the project. While compiling this I realized how much work has been done 🚀

@sdd (Contributor) commented Apr 24, 2024

Thanks for putting this together @Fokko! It's great to have this clarity on where we're heading. Let's go! 🙌

@liurenjie1024 (Collaborator) commented Apr 25, 2024

Hi @Fokko, regarding the read projection part: we can currently convert Parquet files into Arrow streams, but there are some limitations. It only supports primitive types, and schema evolution is not supported yet. Our discussion is in issue #244, and here is the first step of projection by @viirya: #245

@liurenjie1024 (Collaborator)

Also, as we discussed in this doc, do you mind adding DataFusion integration, Python bindings, and WASM bindings to the future topics?

@Fokko (Contributor, Author) commented Apr 25, 2024

Hi @Fokko, regarding the read projection part: we can currently convert Parquet files into Arrow streams, but there are some limitations. It only supports primitive types, and schema evolution is not supported yet. Our discussion is in issue #244, and here is the first step of projection by @viirya: #245

Thanks for the context, I've just added this to the list.

About the Glue, Hive, and REST catalogs, I think we already have integrations:

Ah yes, I forgot to check those marks, thanks!

Also, as we discussed in this doc, do you mind adding DataFusion integration, Python bindings, and WASM bindings to the future topics?

Certainly! Great suggestions! I'm less familiar with some of these topics (like DataFusion), so feel free to edit the post if you feel something is missing.

@marvinlanhenke (Contributor)

Certainly! Great suggestions! I'm less familiar with some of these topics (like DataFusion), so feel free to edit the post if you feel something is missing.

...for DataFusion I have provided a basic design proposal and an implementation for some of the DataFusion traits, like the catalog & schema providers; perhaps we can also move forward on this: #324

@liurenjie1024 (Collaborator)

Certainly! Great suggestions! I'm less familiar with some of these topics (like DataFusion), so feel free to edit the post if you feel something is missing.

...for DataFusion I have provided a basic design proposal and an implementation for some of the DataFusion traits, like the catalog & schema providers; perhaps we can also move forward on this: #324

Yeah, I'll take a look later.
