feat: add builder to TableMetadata interface #62

y0psolo · 2023-09-21T17:15:05Z

related issue : #52

Add a builder for TableMetadata interface

fields not mandatory have default values
Formatversion defaults to V2

zeodtr · 2023-09-22T03:02:54Z

Hi,
(I'm not sure this is the right place to comment. if not, sorry.)

I'm unsure whether a generic builder is the best approach for building a TableMetadata.
It seems that this approach somewhat violates the decision of issue #3.
While TableMetadata hides fields and imposes some logic to set the fields, the generic builder allows users to set the fields to whatever value (even an invalid one) they want.
For example, in the test case of this commit, last_column_id and schema are set separately, which can result in an invalid (or unhealthy?) TableMetadata.
(BTW, the Schema struct avoids this problem by manually implementing the SchemaBuilder struct.)

I think that the main usage patterns of TableMetadata are as follows:

CREATE TABLE: build a fresh new TableMetadata with mandatory fields -> mutate it if necessary -> serialize
All other cases: deserialize -> read fields -> clone if necessary -> mutate -> serialize

So, maybe a 'builder' is required only for the CREATE TABLE: build a fresh new TableMetadata with mandatory fields case.

Since feat: Add Catalog API (pull request #54) introduced the TableCreation struct in catalog.rs, perhaps it would be good (and sufficient?) to have a public new function (named new_table or for_a_new_table?) that takes the TableCreation struct as an argument in TableMatadata struct.

Thank you.

liurenjie1024 · 2023-09-25T03:07:07Z

+1 with @zeodtr We should have a validation check in the final phase of build, so an auto generated builder maybe inappropricate. How about this:

struct TableMetadataBuilder(TableMetadata);

impl TableMetadataBuilder {
    ...
    pub fn build(self) {
       // validate 
      self.0
    }
}

zeodtr · 2023-09-25T04:59:29Z

How about a method like this for the 'CREATE TABLE' case?
(For this method to compile, a few modification and addition is needed in other modules, like adding Error conversion, etc.
And this method has not been tested yet.)

impl TableMetadata {
    /// New struct to create a table
    pub fn try_new_to_create_table(table_creation: TableCreation) -> Result<Self, Error> {
        let last_column_id = table_creation.schema.highest_field_id();
        let current_schema_id = table_creation.schema.schema_id();
        let schemas = HashMap::from([(
            table_creation.schema.schema_id(),
            Arc::new(table_creation.schema),
        )]);
        let default_spec_id;
        let partition_specs;
        let last_partition_id;
        match table_creation.partition_spec {
            Some(partition_spec) => {
                default_spec_id = partition_spec.spec_id;
                last_partition_id =
                    max(partition_spec.fields.iter().map(|field| field.field_id)).unwrap_or(-1);
                partition_specs = HashMap::from([(partition_spec.spec_id, partition_spec)]);
            }
            None => {
                default_spec_id = 0;
                last_partition_id = -1;
                partition_specs = HashMap::from([(
                    default_spec_id,
                    PartitionSpec::builder()
                        .with_spec_id(0)
                        .with_fields(vec![])
                        .build()?,
                )]);
            }
        }
        let default_sort_order_id = table_creation.sort_order.order_id;
        let sort_orders = HashMap::from([(default_sort_order_id, table_creation.sort_order)]);
        Ok(Self {
            format_version: FormatVersion::V2,
            table_uuid: Uuid::new_v4(),
            location: table_creation.location,
            last_sequence_number: -1,
            last_updated_ms: since_the_epoch_as_millis()?,
            last_column_id,
            schemas,
            current_schema_id,
            partition_specs,
            default_spec_id,
            last_partition_id,
            properties: table_creation.properties,
            current_snapshot_id: None,
            snapshots: None,
            snapshot_log: vec![],
            metadata_log: vec![],
            sort_orders,
            default_sort_order_id,
            refs: HashMap::new(),
        })
    }
    ...
}

fn since_the_epoch_as_millis() -> Result<i64, Error> {
    let start = SystemTime::now();
    let since_the_epoch = start.duration_since(UNIX_EPOCH)?;
    Ok(since_the_epoch.as_millis().try_into()?)
}

liurenjie1024 · 2023-09-25T05:02:36Z

How about a method like this for the 'CREATE TABLE' case? (For this method to compile, a few modification and addition is needed in other modules, like adding Error conversion, etc. And this method has not been tested yet.)

impl TableMetadata {
    /// New struct to create a table
    pub fn try_new_to_create_table(table_creation: TableCreation) -> Result<Self, Error> {
        let last_column_id = table_creation.schema.highest_field_id();
        let current_schema_id = table_creation.schema.schema_id();
        let schemas = HashMap::from([(
            table_creation.schema.schema_id(),
            Arc::new(table_creation.schema),
        )]);
        let default_spec_id;
        let partition_specs;
        let last_partition_id;
        match table_creation.partition_spec {
            Some(partition_spec) => {
                default_spec_id = partition_spec.spec_id;
                last_partition_id =
                    max(partition_spec.fields.iter().map(|field| field.field_id)).unwrap_or(-1);
                partition_specs = HashMap::from([(partition_spec.spec_id, partition_spec)]);
            }
            None => {
                default_spec_id = 0;
                last_partition_id = -1;
                partition_specs = HashMap::from([(
                    default_spec_id,
                    PartitionSpec::builder()
                        .with_spec_id(0)
                        .with_fields(vec![])
                        .build()?,
                )]);
            }
        }
        let default_sort_order_id = table_creation.sort_order.order_id;
        let sort_orders = HashMap::from([(default_sort_order_id, table_creation.sort_order)]);
        Ok(Self {
            format_version: FormatVersion::V2,
            table_uuid: Uuid::new_v4(),
            location: table_creation.location,
            last_sequence_number: -1,
            last_updated_ms: since_the_epoch_as_millis()?,
            last_column_id,
            schemas,
            current_schema_id,
            partition_specs,
            default_spec_id,
            last_partition_id,
            properties: table_creation.properties,
            current_snapshot_id: None,
            snapshots: None,
            snapshot_log: vec![],
            metadata_log: vec![],
            sort_orders,
            default_sort_order_id,
            refs: HashMap::new(),
        })
    }
    ...
}

fn since_the_epoch_as_millis() -> Result<i64, Error> {
    let start = SystemTime::now();
    let since_the_epoch = start.duration_since(UNIX_EPOCH)?;
    Ok(since_the_epoch.as_millis().try_into()?)
}

This is a builder tailored for CreateTable, why not just put the check in builder?

zeodtr · 2023-09-25T07:28:31Z

This is a builder tailored for CreateTable, why not just put the check in builder?

Maybe the builder approach is ok.
But in my opinion, the 'CreateTable' case is a major case of constructing the TableMetadata struct from scratch. So it would be nice to have a tailored (and guided) method. Any software that supports CREATE TABLE for Iceberg would need that function, and unless the method is provided they must write the function in a very similar manner.
Maybe it's just a matter of preference.

liurenjie1024 · 2023-09-25T07:50:39Z

This is a builder tailored for CreateTable, why not just put the check in builder?

Maybe the builder approach is ok. But in my opinion, the 'CreateTable' case is a major case of constructing the TableMetadata struct from scratch. So it would be nice to have a tailored (and guided) method. Any software that supports CREATE TABLE for Iceberg would need that function, and unless the method is provided they must write the function in a very similar manner. Maybe it's just a matter of preference.

I prefer the builder approach since the transaction actions may also need to modify table metadata, such as add schema, add snapshot.

y0psolo · 2023-09-26T08:25:39Z

Thanks for your feedback :) !

I will add a validation function to the builder, seems logical. From my point of view it should validate the inner field of the TableMetadata but not the inner struct that should be validated when build with their own builder (Schema, ...).
I see also that you prefer to have a valid random generated UUID by default. I asked myself the question, i have now the answer.
I could also see if we can add an additional method to the builder (i think it's possible) to handle TableCreation struct. Something like with_table_creation.

I will update the pull request to have new discussion materials.

Update builder default values Cast Builder Error to Iceberg Error

Strenghten builder validation

y0psolo · 2023-10-05T16:02:15Z

Ok i added some validation steps. Tell me if its ok
i don't know if we need to have all fields visible on the builder. Tell me. We can choose to open and restrict later or do the reverse. But for compatibility reason the later will be easier.

liurenjie1024 · 2023-10-07T02:52:22Z

crates/iceberg/src/spec/table_metadata.rs

+impl TableMetadataBuilder {
+    /// Initialize a TableMetadata with a TableCreation struct
+    /// the Schema, sortOrder and PartitionSpec will be set as current
+    pub fn with_table_creation(&mut self, tc: TableCreation) -> &mut Self {


Suggested change

pub fn with_table_creation(&mut self, tc: TableCreation) -> &mut Self {

pub fn from_table_creation(tc: TableCreation) -> &mut Self {

Since all fields starts with with_, I would suggest to change it to from so that it's more readable.

liurenjie1024 · 2023-10-07T03:35:30Z

crates/iceberg/src/spec/table_metadata.rs

+    /// Add a schema to the TableMetadata
+    /// schema : Schema to be added or replaced
+    /// current : True if the schema is the current one
+    pub fn with_schema(&mut self, schema: Schema, current: bool) -> &mut Self {


We should also update last_updated_ms?

i could update it each time we have modification, but does it should be updated in theory at write time from what i read in spec ?

liurenjie1024 · 2023-10-07T03:36:24Z

crates/iceberg/src/spec/table_metadata.rs

+    /// - map : the map to scan
+    /// - default : default value for the key if any
+    /// - field : map name for throwing a better error
+    fn check_id_in_map<K: std::hash::Hash + Eq + std::fmt::Display + Copy, T>(


Do we really need to check this at last? Why not check it during modification?

if we restrict with only the given list of setter below we should not need this kind of validation as we will compute every value inside the builder. So they should be coherent.

liurenjie1024 · 2023-10-07T03:39:23Z

crates/iceberg/src/spec/table_metadata.rs

+    /// Add a snapshot to the TableMetadata and update last_sequence_number
+    /// snapshot : Snapshot to be added or replaced
+    /// current : True if the snapshot is the current one
+    pub fn with_snapshot(&mut self, snapshot: Snapshot, current: bool) -> &mut Self {


This in fact updates the main branch, we also need to update snapshot ref?

liurenjie1024 · 2023-10-07T03:47:53Z

Hi, @y0psolo Thanks for the effort! But I'm a little worried about the api here, since it's error prone, e.g. it's quite easy to construct an invalid table metadata. My suggestion would be to remove the derived builder, and for public apis of modification we can follow the java implementation: https://github.com/apache/iceberg/blob/abc3f746bfb00fdcd7ea1f170df242f1857ced75/core/src/main/java/org/apache/iceberg/TableMetadata.java#L842

y0psolo · 2023-10-07T19:55:28Z

Hi, @y0psolo Thanks for the effort! But I'm a little worried about the api here, since it's error prone, e.g. it's quite easy to construct an invalid table metadata. My suggestion would be to remove the derived builder, and for public apis of modification we can follow the java implementation: https://github.com/apache/iceberg/blob/abc3f746bfb00fdcd7ea1f170df242f1857ced75/core/src/main/java/org/apache/iceberg/TableMetadata.java#L842

Ok from what i read from the java class, below is the short list of functionalities you want to only provide on the builder :

set UUID
set FormatVersion
add a schema (a single schema not a map)
set current schema
add a PartitonSpec (a single PartitonSpec not a map)
set default PartitionSpec
add a default SortOrder (a single SortOrder not a map)
set the default sort order
add a snapshot (a single snpashot not a map)
set branch snapshot
add a ref (a single ref not a map)

I will leave only this part public and not allow the builder to generate useless method. I don't know yet if we should remove the builder as it seems to provide some internal struct builder and default value at least. We will see with what is left at the end under his responsability.

liurenjie1024 · 2023-10-08T01:39:20Z

I don't know yet if we should remove the builder as it seems to provide some internal struct builder and default value at least. We will see with what is left at the end under his responsability.

The reason why I suggest removing derived builder is that TableMetadata is a quite complex data structure, also a core data structure for iceberg. Derived builder makes it quite difficult for reviewers to identify what's auto generated implicitly. As we reached an agreement that most public apis should follow java's builder implementation, I don't think that it's still useful to leave it.

cc @JanKaul @Xuanwo @ZENOTME How do you think?

do not allow generation by builder for other fields except few

liurenjie1024

Hi, @y0psolo Thanks! I've left some comments

crates/iceberg/src/spec/table_metadata.rs

liurenjie1024 · 2023-10-13T08:09:16Z

crates/iceberg/src/spec/table_metadata.rs

+
+    /// Add or replace a schema
+    /// schema : Schema to be added or replaced
+    pub fn with_schema(&mut self, schema: Schema) -> &mut Self {


Suggested change

pub fn with_schema(&mut self, schema: Schema) -> &mut Self {

pub fn add_schema(&mut self, schema: Schema) -> &mut Self {

Currently we use 0 as default schema id, e.g. the schema id is not set by user. I think we should check this and auto assign a new schema id for it.

I think by default we should fail when id already exists. We can add another method for the insert or update behavior.

the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.
I think it's better to reject schema creation if no id is set if we want to avoid this corner case. For adding Schema the method should be on the TableMetadata struct itself as it may fail in case of duplicate id.
What do you think ?

the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.

Why builder pattern should return Self rather Result? In fact, I don't want to make a normal builder, since it's quite easy to construct an invalid table metadata.

I think it's better to reject schema creation if no id is set if we want to avoid this corner case.

This doesn't work. Thinking about the transaction api, the user wants to change the schema of the table, why he needs to know the scheme id in advance which should be allocated by catalog?

For adding Schema the method should be on the TableMetadata struct itself as it may fails in case of duplicate id.

I prefer to make TableMetadata immutable and leave the states for validity checking in TableMetadataBuilder.

Please be aware that the main use case of TableMetadataBuilder is transaction api, so we should ensure that the result is valid.

the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.

Why builder pattern should return Self rather Result? In fact, I don't want to make a normal builder, since it's quite easy to construct an invalid table metadata.

to avoid code where we need to unwrap and test error at each call but have more idiomatic way to build a new struct.

I think it's better to reject schema creation if no id is set if we want to avoid this corner case.

This doesn't work. Thinking about the transaction api, the user wants to change the schema of the table, why he needs to know the scheme id in advance which should be allocated by catalog?

I think a miss a step here. The builder is here to build a struct from scratch. I f we want to update an existing struct, this is another use case and it should be handled differently. I agree with you if a user update a Schema he doesn't need to change the id and so this option should not be exposed.
If we have a look to Table Interface (https://iceberg.apache.org/javadoc/master/org/apache/iceberg/Table.html) all this functions are on the Table object, not the builder.

For adding Schema the method should be on the TableMetadata struct itself as it may fails in case of duplicate id.

I prefer to make TableMetadata immutable and leave the states for validity checking in TableMetadataBuilder.

Please be aware that the main use case of TableMetadataBuilder is transaction api, so we should ensure that the result is valid.

Thanks for letting me know this information, i don't fully know the iceberg format and internals and try to catch up. Having a TableMetadata immutable is not incompatible to have specific function for handling transaction on the TableMetadata struct itself. We could imagine having a transaction function that would return a new struct encapsulating the existing TableMetadataBuilder (and other struct if needed) and provide only the needed function to update the existing Table (like https://iceberg.apache.org/javadoc/master/org/apache/iceberg/Transaction.html). This object could queue all modification, leaving the encapsulated struct as is and when transaction is committed will deliver a new struct.

We keep immutability of underlying struct, we offer a way to restrict easily possible action on the update operation. We could reuse some part of the validation code between both.

This is just an example. I am trying to illustrate that updating and creating are different use cases and could be addressed differently in order to avoid more complexity, better control on action and better usability.

I am sorry for all these discussions but i really want to understand to best way to end this pull request. I leave below conversation unresolved as the decision on this one will certainly influence their outcome.

liurenjie1024 · 2023-10-13T08:10:10Z

crates/iceberg/src/spec/table_metadata.rs

+    pub fn with_current_schema(&mut self, schema: Schema) -> &mut Self {
+        self.current_schema_id = Some(schema.schema_id());
+        self.last_column_id = Some(schema.highest_field_id());
+        self.with_schema(schema)


Same as problem mentioned above, we need to consider the default schema id problem.

liurenjie1024 · 2023-10-13T08:23:13Z

crates/iceberg/src/spec/table_metadata.rs

+
+    /// Add or replace a partition_spec and update the last_partition_id accordingly
+    /// partition_spec : PartitionSpec to be added or replaced
+    pub fn with_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {


Suggested change

pub fn with_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {

pub fn add_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {

There are same problems as scheme id.

We need to validate spec with schema:

If source_id exists in schema

If the field type supports such transformation.

liurenjie1024 · 2023-10-13T08:23:32Z

crates/iceberg/src/spec/table_metadata.rs

+    /// Add or replace a partition_spec, update the last_partition_id accordingly
+    /// and set the default spec id to the partition_spec id
+    /// partition_spec : PartitionSpec to be added or replaced
+    pub fn with_default_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add builder to TableMetadata interface #62

feat: add builder to TableMetadata interface #62

y0psolo commented Sep 21, 2023

zeodtr commented Sep 22, 2023 •

edited

Loading

liurenjie1024 commented Sep 25, 2023

zeodtr commented Sep 25, 2023

liurenjie1024 commented Sep 25, 2023

zeodtr commented Sep 25, 2023

liurenjie1024 commented Sep 25, 2023

y0psolo commented Sep 26, 2023

y0psolo commented Oct 5, 2023

liurenjie1024 Oct 7, 2023

liurenjie1024 Oct 7, 2023

y0psolo Oct 7, 2023

liurenjie1024 Oct 7, 2023

y0psolo Oct 7, 2023

liurenjie1024 Oct 7, 2023

liurenjie1024 commented Oct 7, 2023

y0psolo commented Oct 7, 2023

liurenjie1024 commented Oct 8, 2023 •

edited

Loading

liurenjie1024 left a comment

liurenjie1024 Oct 13, 2023

y0psolo Oct 18, 2023 •

edited

Loading

liurenjie1024 Oct 18, 2023

y0psolo Oct 20, 2023

liurenjie1024 Oct 13, 2023

liurenjie1024 Oct 13, 2023

liurenjie1024 Oct 13, 2023

liurenjie1024 Oct 13, 2023

liurenjie1024 Oct 13, 2023

liurenjie1024 Oct 13, 2023

liurenjie1024 commented Mar 21, 2024

liurenjie1024 commented Mar 22, 2024

	pub fn with_table_creation(&mut self, tc: TableCreation) -> &mut Self {
	pub fn from_table_creation(tc: TableCreation) -> &mut Self {

	pub fn with_schema(&mut self, schema: Schema) -> &mut Self {
	pub fn add_schema(&mut self, schema: Schema) -> &mut Self {

	pub fn with_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {
	pub fn add_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {

+                  /// Add or replace a sort_order
+                  /// sort_order : SortOrder to be added or replaced
+                  pub fn with_sort_order(&mut self, sort_order: SortOrder) -> &mut Self {

+                  /// Add or replace a snapshot to the main branch, update last_sequence_number
+                  /// snapshot : Snapshot to be added or replaced
+                  pub fn with_snapshot(&mut self, snapshot: Snapshot) -> &mut Self {

	pub fn with_snapshot(&mut self, snapshot: Snapshot) -> &mut Self {
	pub fn add_snapshot(&mut self, snapshot: Snapshot) -> &mut Self {

+                      if self.last_updated_ms < Some(snapshot.timestamp()) {
+                          self.last_updated_ms = Some(snapshot.timestamp());
+                      }
+                      self.with_ref(

feat: add builder to TableMetadata interface #62

feat: add builder to TableMetadata interface #62

Conversation

y0psolo commented Sep 21, 2023

zeodtr commented Sep 22, 2023 • edited Loading

liurenjie1024 commented Sep 25, 2023

zeodtr commented Sep 25, 2023

liurenjie1024 commented Sep 25, 2023

zeodtr commented Sep 25, 2023

liurenjie1024 commented Sep 25, 2023

y0psolo commented Sep 26, 2023

y0psolo commented Oct 5, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liurenjie1024 commented Oct 7, 2023

y0psolo commented Oct 7, 2023

liurenjie1024 commented Oct 8, 2023 • edited Loading

liurenjie1024 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

y0psolo Oct 18, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liurenjie1024 commented Mar 21, 2024

liurenjie1024 commented Mar 22, 2024

zeodtr commented Sep 22, 2023 •

edited

Loading

liurenjie1024 commented Oct 8, 2023 •

edited

Loading

y0psolo Oct 18, 2023 •

edited

Loading