Skip to content

Conversation

@CodingCat
Copy link
Contributor

@CodingCat CodingCat commented Aug 15, 2023

currently, users are not allowed to do the following operation which is valid use case when we want to change a column type and don't care about old data anymore

// tableSchema as below
//        new Schema(
//            required(1, "id", Types.LongType.get()),
//            required(2, "data", Types.StringType.get()),
//            required(3, "data_1", Types.StringType.get()));

// id is deleted and added back with a different type and then we want to move this new column after an existing column "data"
new SchemaUpdate(schema, 3)
    .allowIncompatibleChanges()
    .deleteColumn("id")
    .addRequiredColumn("id", Types.IntegerType.get())
    .moveAfter("id", "data")
    .apply();

(of course we can add/delete columns, commit and then move columns and commit again, but this workaround brings two transactions which leaves a batch data application in the risk of a partially updated schema when it failed in the middle)

when user run the above code, the below line will throw an NoSuchElement
exception

Iterables.find(reordered, field -> field.fieldId() == move.fieldId());
)

more details analysis: code

because List<Types.NestedField> fields only contains undeleted column: in the above example, id (with field 1) is not there. Additionally, when we build the Collection<Move> moves the .moveAfter("id", "data") will refer to field 1 instead of 4 (representing the column with the same name added by addRequiredColumn("id", Types.IntegerType.get()))

the proposal in this PR is essentially to reconstruct moves in the constructor of ApplyChanges by replacing the references to the deleted column with a reference to the newly added column with the same name

@github-actions github-actions bot added the core label Aug 15, 2023
@zinking
Copy link
Contributor

zinking commented Aug 17, 2023

I'm not sure, for me I would do this in two actions. remove the old column, then add the new column.
although the name is same but columnId would be different.
does it work

@CodingCat
Copy link
Contributor Author

I'm not sure, for me I would do this in two actions. remove the old column, then add the new column.

although the name is same but columnId would be different.

does it work

it won't work I think, remove and add will only make the column as the last column in the schema , but I want to move the column to a certain place . Additionally it leaves the table in the risk of a partially updated schema if the job failed after delete and before add

you can remove + add in a single transaction and then move in the other one, but still it has the problem of the partially updated schema

@Fokko
Copy link
Contributor

Fokko commented Sep 19, 2023

@CodingCat Thanks for raising this PR, and sorry for the long wait. As you mentioned in the PR description, this operation should be able to be done in a single commit. I dug into it, and I was able to let the test pass (thank you for writing those!) with the following change:

  private Integer findForMove(String name) {
    Types.NestedField field = findField(name);
    if (field != null) {
      return field.fieldId();
    }
    return addedNameToId.get(name);
  }

To:

  private Integer findForMove(String name) {
    Integer addedId = addedNameToId.get(name);
    if (addedId != null) {
      return addedId;
    }

    Types.NestedField field = findField(name);
    if (field != null) {
      return field.fieldId();
    }

    return null;
  }

This way we keep all the logic of moving fields in the same place. WDYT?

@CodingCat
Copy link
Contributor Author

@CodingCat Thanks for raising this PR, and sorry for the long wait. As you mentioned in the PR description, this operation should be able to be done in a single commit. I dug into it, and I was able to let the test pass (thank you for writing those!) with the following change:

  private Integer findForMove(String name) {
    Types.NestedField field = findField(name);
    if (field != null) {
      return field.fieldId();
    }
    return addedNameToId.get(name);
  }

To:

  private Integer findForMove(String name) {
    Integer addedId = addedNameToId.get(name);
    if (addedId != null) {
      return addedId;
    }

    Types.NestedField field = findField(name);
    if (field != null) {
      return field.fieldId();
    }

    return null;
  }

This way we keep all the logic of moving fields in the same place. WDYT?

Hi, @Fokko thanks for the review and comments, I agree, this is a cleaner fix, I updated the PR

@CodingCat
Copy link
Contributor Author

TestStructuredStreamingRead3 > [catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}] > testReadingStreamFromFutureTimetsamp[catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}] FAILED

this is a flaky test?

@CodingCat
Copy link
Contributor Author

@Fokko ok, all tests passed now, thank you for the review! would you mind taking another look?

@CodingCat
Copy link
Contributor Author

gentle ping @Fokko ? would you mind giving it another review? thank you!

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CodingCat Thanks for pinging me, this looks great! 👍

@Fokko Fokko changed the title allow to move a column with the same name as a deleted column Core: Move a column with the same name as a deleted column Sep 22, 2023
@Fokko Fokko merged commit 9aeea63 into apache:master Sep 22, 2023
@CodingCat CodingCat deleted the update_delete branch September 22, 2023 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants