Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add builder to TableMetadata interface #62

Closed
wants to merge 7 commits into from

Conversation

y0psolo
Copy link
Contributor

@y0psolo y0psolo commented Sep 21, 2023

related issue : #52

Add a builder for TableMetadata interface

  • fields not mandatory have default values
  • Formatversion defaults to V2

@zeodtr
Copy link

zeodtr commented Sep 22, 2023

Hi,
(I'm not sure this is the right place to comment. if not, sorry.)

I'm unsure whether a generic builder is the best approach for building a TableMetadata.
It seems that this approach somewhat violates the decision of issue #3.
While TableMetadata hides fields and imposes some logic to set the fields, the generic builder allows users to set the fields to whatever value (even an invalid one) they want.
For example, in the test case of this commit, last_column_id and schema are set separately, which can result in an invalid (or unhealthy?) TableMetadata.
(BTW, the Schema struct avoids this problem by manually implementing the SchemaBuilder struct.)

I think that the main usage patterns of TableMetadata are as follows:

  • CREATE TABLE: build a fresh new TableMetadata with mandatory fields -> mutate it if necessary -> serialize
  • All other cases: deserialize -> read fields -> clone if necessary -> mutate -> serialize

So, maybe a 'builder' is required only for the CREATE TABLE: build a fresh new TableMetadata with mandatory fields case.

Since feat: Add Catalog API (pull request #54) introduced the TableCreation struct in catalog.rs, perhaps it would be good (and sufficient?) to have a public new function (named new_table or for_a_new_table?) that takes the TableCreation struct as an argument in TableMatadata struct.

Thank you.

@liurenjie1024
Copy link
Collaborator

+1 with @zeodtr We should have a validation check in the final phase of build, so an auto generated builder maybe inappropricate. How about this:

struct TableMetadataBuilder(TableMetadata);

impl TableMetadataBuilder {
    ...
    pub fn build(self) {
       // validate 
      self.0
    }
}

@zeodtr
Copy link

zeodtr commented Sep 25, 2023

How about a method like this for the 'CREATE TABLE' case?
(For this method to compile, a few modification and addition is needed in other modules, like adding Error conversion, etc.
And this method has not been tested yet.)

impl TableMetadata {
    /// New struct to create a table
    pub fn try_new_to_create_table(table_creation: TableCreation) -> Result<Self, Error> {
        let last_column_id = table_creation.schema.highest_field_id();
        let current_schema_id = table_creation.schema.schema_id();
        let schemas = HashMap::from([(
            table_creation.schema.schema_id(),
            Arc::new(table_creation.schema),
        )]);
        let default_spec_id;
        let partition_specs;
        let last_partition_id;
        match table_creation.partition_spec {
            Some(partition_spec) => {
                default_spec_id = partition_spec.spec_id;
                last_partition_id =
                    max(partition_spec.fields.iter().map(|field| field.field_id)).unwrap_or(-1);
                partition_specs = HashMap::from([(partition_spec.spec_id, partition_spec)]);
            }
            None => {
                default_spec_id = 0;
                last_partition_id = -1;
                partition_specs = HashMap::from([(
                    default_spec_id,
                    PartitionSpec::builder()
                        .with_spec_id(0)
                        .with_fields(vec![])
                        .build()?,
                )]);
            }
        }
        let default_sort_order_id = table_creation.sort_order.order_id;
        let sort_orders = HashMap::from([(default_sort_order_id, table_creation.sort_order)]);
        Ok(Self {
            format_version: FormatVersion::V2,
            table_uuid: Uuid::new_v4(),
            location: table_creation.location,
            last_sequence_number: -1,
            last_updated_ms: since_the_epoch_as_millis()?,
            last_column_id,
            schemas,
            current_schema_id,
            partition_specs,
            default_spec_id,
            last_partition_id,
            properties: table_creation.properties,
            current_snapshot_id: None,
            snapshots: None,
            snapshot_log: vec![],
            metadata_log: vec![],
            sort_orders,
            default_sort_order_id,
            refs: HashMap::new(),
        })
    }
    ...
}

fn since_the_epoch_as_millis() -> Result<i64, Error> {
    let start = SystemTime::now();
    let since_the_epoch = start.duration_since(UNIX_EPOCH)?;
    Ok(since_the_epoch.as_millis().try_into()?)
}

@liurenjie1024
Copy link
Collaborator

How about a method like this for the 'CREATE TABLE' case? (For this method to compile, a few modification and addition is needed in other modules, like adding Error conversion, etc. And this method has not been tested yet.)

impl TableMetadata {
    /// New struct to create a table
    pub fn try_new_to_create_table(table_creation: TableCreation) -> Result<Self, Error> {
        let last_column_id = table_creation.schema.highest_field_id();
        let current_schema_id = table_creation.schema.schema_id();
        let schemas = HashMap::from([(
            table_creation.schema.schema_id(),
            Arc::new(table_creation.schema),
        )]);
        let default_spec_id;
        let partition_specs;
        let last_partition_id;
        match table_creation.partition_spec {
            Some(partition_spec) => {
                default_spec_id = partition_spec.spec_id;
                last_partition_id =
                    max(partition_spec.fields.iter().map(|field| field.field_id)).unwrap_or(-1);
                partition_specs = HashMap::from([(partition_spec.spec_id, partition_spec)]);
            }
            None => {
                default_spec_id = 0;
                last_partition_id = -1;
                partition_specs = HashMap::from([(
                    default_spec_id,
                    PartitionSpec::builder()
                        .with_spec_id(0)
                        .with_fields(vec![])
                        .build()?,
                )]);
            }
        }
        let default_sort_order_id = table_creation.sort_order.order_id;
        let sort_orders = HashMap::from([(default_sort_order_id, table_creation.sort_order)]);
        Ok(Self {
            format_version: FormatVersion::V2,
            table_uuid: Uuid::new_v4(),
            location: table_creation.location,
            last_sequence_number: -1,
            last_updated_ms: since_the_epoch_as_millis()?,
            last_column_id,
            schemas,
            current_schema_id,
            partition_specs,
            default_spec_id,
            last_partition_id,
            properties: table_creation.properties,
            current_snapshot_id: None,
            snapshots: None,
            snapshot_log: vec![],
            metadata_log: vec![],
            sort_orders,
            default_sort_order_id,
            refs: HashMap::new(),
        })
    }
    ...
}

fn since_the_epoch_as_millis() -> Result<i64, Error> {
    let start = SystemTime::now();
    let since_the_epoch = start.duration_since(UNIX_EPOCH)?;
    Ok(since_the_epoch.as_millis().try_into()?)
}

This is a builder tailored for CreateTable, why not just put the check in builder?

@zeodtr
Copy link

zeodtr commented Sep 25, 2023

This is a builder tailored for CreateTable, why not just put the check in builder?

Maybe the builder approach is ok.
But in my opinion, the 'CreateTable' case is a major case of constructing the TableMetadata struct from scratch. So it would be nice to have a tailored (and guided) method. Any software that supports CREATE TABLE for Iceberg would need that function, and unless the method is provided they must write the function in a very similar manner.
Maybe it's just a matter of preference.

@liurenjie1024
Copy link
Collaborator

This is a builder tailored for CreateTable, why not just put the check in builder?

Maybe the builder approach is ok. But in my opinion, the 'CreateTable' case is a major case of constructing the TableMetadata struct from scratch. So it would be nice to have a tailored (and guided) method. Any software that supports CREATE TABLE for Iceberg would need that function, and unless the method is provided they must write the function in a very similar manner. Maybe it's just a matter of preference.

I prefer the builder approach since the transaction actions may also need to modify table metadata, such as add schema, add snapshot.

@y0psolo
Copy link
Contributor Author

y0psolo commented Sep 26, 2023

Thanks for your feedback :) !

I will add a validation function to the builder, seems logical. From my point of view it should validate the inner field of the TableMetadata but not the inner struct that should be validated when build with their own builder (Schema, ...).
I see also that you prefer to have a valid random generated UUID by default. I asked myself the question, i have now the answer.
I could also see if we can add an additional method to the builder (i think it's possible) to handle TableCreation struct. Something like with_table_creation.

I will update the pull request to have new discussion materials.

Update builder default values
Cast Builder Error to Iceberg Error
Strenghten builder validation
@y0psolo
Copy link
Contributor Author

y0psolo commented Oct 5, 2023

Ok i added some validation steps. Tell me if its ok
i don't know if we need to have all fields visible on the builder. Tell me. We can choose to open and restrict later or do the reverse. But for compatibility reason the later will be easier.

impl TableMetadataBuilder {
/// Initialize a TableMetadata with a TableCreation struct
/// the Schema, sortOrder and PartitionSpec will be set as current
pub fn with_table_creation(&mut self, tc: TableCreation) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn with_table_creation(&mut self, tc: TableCreation) -> &mut Self {
pub fn from_table_creation(tc: TableCreation) -> &mut Self {

Since all fields starts with with_, I would suggest to change it to from so that it's more readable.

/// Add a schema to the TableMetadata
/// schema : Schema to be added or replaced
/// current : True if the schema is the current one
pub fn with_schema(&mut self, schema: Schema, current: bool) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also update last_updated_ms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i could update it each time we have modification, but does it should be updated in theory at write time from what i read in spec ?

/// - map : the map to scan
/// - default : default value for the key if any
/// - field : map name for throwing a better error
fn check_id_in_map<K: std::hash::Hash + Eq + std::fmt::Display + Copy, T>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to check this at last? Why not check it during modification?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we restrict with only the given list of setter below we should not need this kind of validation as we will compute every value inside the builder. So they should be coherent.

/// Add a snapshot to the TableMetadata and update last_sequence_number
/// snapshot : Snapshot to be added or replaced
/// current : True if the snapshot is the current one
pub fn with_snapshot(&mut self, snapshot: Snapshot, current: bool) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This in fact updates the main branch, we also need to update snapshot ref?

@liurenjie1024
Copy link
Collaborator

Hi, @y0psolo Thanks for the effort! But I'm a little worried about the api here, since it's error prone, e.g. it's quite easy to construct an invalid table metadata. My suggestion would be to remove the derived builder, and for public apis of modification we can follow the java implementation: https://github.com/apache/iceberg/blob/abc3f746bfb00fdcd7ea1f170df242f1857ced75/core/src/main/java/org/apache/iceberg/TableMetadata.java#L842

@y0psolo
Copy link
Contributor Author

y0psolo commented Oct 7, 2023

Hi, @y0psolo Thanks for the effort! But I'm a little worried about the api here, since it's error prone, e.g. it's quite easy to construct an invalid table metadata. My suggestion would be to remove the derived builder, and for public apis of modification we can follow the java implementation: https://github.com/apache/iceberg/blob/abc3f746bfb00fdcd7ea1f170df242f1857ced75/core/src/main/java/org/apache/iceberg/TableMetadata.java#L842

Ok from what i read from the java class, below is the short list of functionalities you want to only provide on the builder :

  • set UUID
  • set FormatVersion
  • add a schema (a single schema not a map)
  • set current schema
  • add a PartitonSpec (a single PartitonSpec not a map)
  • set default PartitionSpec
  • add a default SortOrder (a single SortOrder not a map)
  • set the default sort order
  • add a snapshot (a single snpashot not a map)
  • set branch snapshot
  • add a ref (a single ref not a map)

I will leave only this part public and not allow the builder to generate useless method. I don't know yet if we should remove the builder as it seems to provide some internal struct builder and default value at least. We will see with what is left at the end under his responsability.

@liurenjie1024
Copy link
Collaborator

liurenjie1024 commented Oct 8, 2023

I don't know yet if we should remove the builder as it seems to provide some internal struct builder and default value at least. We will see with what is left at the end under his responsability.

The reason why I suggest removing derived builder is that TableMetadata is a quite complex data structure, also a core data structure for iceberg. Derived builder makes it quite difficult for reviewers to identify what's auto generated implicitly. As we reached an agreement that most public apis should follow java's builder implementation, I don't think that it's still useful to leave it.

cc @JanKaul @Xuanwo @ZENOTME How do you think?

do not allow generation by builder for other fields except few
Copy link
Collaborator

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @y0psolo Thanks! I've left some comments

crates/iceberg/src/spec/table_metadata.rs Outdated Show resolved Hide resolved
crates/iceberg/src/spec/table_metadata.rs Outdated Show resolved Hide resolved

/// Add or replace a schema
/// schema : Schema to be added or replaced
pub fn with_schema(&mut self, schema: Schema) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn with_schema(&mut self, schema: Schema) -> &mut Self {
pub fn add_schema(&mut self, schema: Schema) -> &mut Self {
  1. Currently we use 0 as default schema id, e.g. the schema id is not set by user. I think we should check this and auto assign a new schema id for it.
  2. I think by default we should fail when id already exists. We can add another method for the insert or update behavior.

Copy link
Contributor Author

@y0psolo y0psolo Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.
I think it's better to reject schema creation if no id is set if we want to avoid this corner case. For adding Schema the method should be on the TableMetadata struct itself as it may fail in case of duplicate id.
What do you think ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.

Why builder pattern should return Self rather Result? In fact, I don't want to make a normal builder, since it's quite easy to construct an invalid table metadata.

I think it's better to reject schema creation if no id is set if we want to avoid this corner case.

This doesn't work. Thinking about the transaction api, the user wants to change the schema of the table, why he needs to know the scheme id in advance which should be allocated by catalog?

For adding Schema the method should be on the TableMetadata struct itself as it may fails in case of duplicate id.

I prefer to make TableMetadata immutable and leave the states for validity checking in TableMetadataBuilder.

Please be aware that the main use case of TableMetadataBuilder is transaction api, so we should ensure that the result is valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problematic with failing in the builder method is that we will not respect the builder pattern that should return "Self". Instead we will have a Result struct.

Why builder pattern should return Self rather Result? In fact, I don't want to make a normal builder, since it's quite easy to construct an invalid table metadata.

to avoid code where we need to unwrap and test error at each call but have more idiomatic way to build a new struct.

I think it's better to reject schema creation if no id is set if we want to avoid this corner case.

This doesn't work. Thinking about the transaction api, the user wants to change the schema of the table, why he needs to know the scheme id in advance which should be allocated by catalog?

I think a miss a step here. The builder is here to build a struct from scratch. I f we want to update an existing struct, this is another use case and it should be handled differently. I agree with you if a user update a Schema he doesn't need to change the id and so this option should not be exposed.
If we have a look to Table Interface (https://iceberg.apache.org/javadoc/master/org/apache/iceberg/Table.html) all this functions are on the Table object, not the builder.

For adding Schema the method should be on the TableMetadata struct itself as it may fails in case of duplicate id.

I prefer to make TableMetadata immutable and leave the states for validity checking in TableMetadataBuilder.

Please be aware that the main use case of TableMetadataBuilder is transaction api, so we should ensure that the result is valid.

Thanks for letting me know this information, i don't fully know the iceberg format and internals and try to catch up. Having a TableMetadata immutable is not incompatible to have specific function for handling transaction on the TableMetadata struct itself. We could imagine having a transaction function that would return a new struct encapsulating the existing TableMetadataBuilder (and other struct if needed) and provide only the needed function to update the existing Table (like https://iceberg.apache.org/javadoc/master/org/apache/iceberg/Transaction.html). This object could queue all modification, leaving the encapsulated struct as is and when transaction is committed will deliver a new struct.

We keep immutability of underlying struct, we offer a way to restrict easily possible action on the update operation. We could reuse some part of the validation code between both.

This is just an example. I am trying to illustrate that updating and creating are different use cases and could be addressed differently in order to avoid more complexity, better control on action and better usability.

I am sorry for all these discussions but i really want to understand to best way to end this pull request. I leave below conversation unresolved as the decision on this one will certainly influence their outcome.

pub fn with_current_schema(&mut self, schema: Schema) -> &mut Self {
self.current_schema_id = Some(schema.schema_id());
self.last_column_id = Some(schema.highest_field_id());
self.with_schema(schema)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as problem mentioned above, we need to consider the default schema id problem.


/// Add or replace a partition_spec and update the last_partition_id accordingly
/// partition_spec : PartitionSpec to be added or replaced
pub fn with_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn with_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {
pub fn add_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {
  1. There are same problems as scheme id.
  2. We need to validate spec with schema:
  • If source_id exists in schema
  • If the field type supports such transformation.

/// Add or replace a partition_spec, update the last_partition_id accordingly
/// and set the default spec id to the partition_spec id
/// partition_spec : PartitionSpec to be added or replaced
pub fn with_default_partition_spec(&mut self, partition_spec: PartitionSpec) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar problem to schema.


/// Add or replace a sort_order
/// sort_order : SortOrder to be added or replaced
pub fn with_sort_order(&mut self, sort_order: SortOrder) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar problem as partition_spec


/// Add or replace a snapshot to the main branch, update last_sequence_number
/// snapshot : Snapshot to be added or replaced
pub fn with_snapshot(&mut self, snapshot: Snapshot) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn with_snapshot(&mut self, snapshot: Snapshot) -> &mut Self {
pub fn add_snapshot(&mut self, snapshot: Snapshot) -> &mut Self {

if self.last_updated_ms < Some(snapshot.timestamp()) {
self.last_updated_ms = Some(snapshot.timestamp());
}
self.with_ref(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should copy existing snapshot ref's retention policy if it already exists.

@liurenjie1024
Copy link
Collaborator

cc @y0psolo Should we close this now? I think it's resolved by #262

@liurenjie1024
Copy link
Collaborator

cc @y0psolo Should we close this now? I think it's resolved by #262

I'll close this for now, feel free to reopen it if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants