
[HUDI-2560] introduce id_based schema to support full schema evolution. #3808

Closed
Wants to merge 9 commits.

Conversation

xiarixiaoyao
Contributor

What is the purpose of the pull request

Introduce id_based schema to support full schema evolution.
This PR is split from [RFC-33] [HUDI-2429][WIP] Full schema evolution #3668, since that PR is too large.

Brief change log

  • Introduce an id-based InternalSchema and TableChange APIs for schema evolution

  • Add a file-based internal schema storage manager to persist history schemas under .hoodie/.schema

  • Add a new ALTER_SCHEMA write operation type

Verify this pull request

UT added.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@xiarixiaoyao
Contributor Author

@bvaradar @codope: could you help review this code? Thanks.
This code is split from #3668, and all comments there have been addressed.

Contributor

@bvaradar left a comment

cc @codope

@xiarixiaoyao: Went through the initial few files and added review comments. Will review the rest in a day.

* @param position col position to be added
* @param positionType col position change type. now support three change types: first/after/before
*/
public void addCol(String colName, Schema schema, String doc, String position, TableChange.ColumnPositionChange.ColumnPositionType positionType) {
Contributor

nit: addCol -> addColumns

Contributor

A question for all these schema methods at this class level -> what is the behavior if the same operation is applied twice. For example what is the behavior if the same column is deleted or added ?

Contributor Author

A second delete or add of the same column will fail with an exception; we validate each column operation before actually applying it. I will add a test case, thanks.

Contributor Author

Added repeat-delete and repeat-add checks in the test class TestTableChanges.
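
For illustration, such a check might look like the sketch below (JUnit 5). createTestSchema() is a hypothetical helper, and the exact exception type is an assumption; the PR's final API may differ.

// Hedged sketch: a repeated add of the same column should be rejected
// before the change is applied. createTestSchema() is hypothetical.
InternalSchema schema = createTestSchema();
TableChanges.ColumnAddChange add = TableChanges.ColumnAddChange.get(schema);
add.addColumns("age", Types.IntType.get(), "age column");
// A second add of the same column fails up front (exception type assumed):
assertThrows(IllegalArgumentException.class,
    () -> add.addColumns("age", Types.IntType.get(), "age column"));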

}

public void addCol(String colName, Schema schema) {
addCol(colName, schema, null, null, null);
Contributor

Instead of null column type, Can you create an enum value "NULL_COLUMN" in ColumnPositionType ?

Also, why does the position param have type String instead of int?

Contributor Author

This is because engines such as Hive / Spark / MySQL use a string as the parameter, e.g. ALTER TABLE xxx ADD COLUMNS(name string AFTER/BEFORE id). I think it is better to use String as the param.

Contributor Author

Using the enum value "NO_OPERATION" instead of "NULL_COLUMN" in ColumnPositionType.
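
The resulting enum might look like this sketch; only FIRST/AFTER/BEFORE (from the javadoc earlier in this diff) and NO_OPERATION are named in this thread, so comments are assumptions.

// Sketch of ColumnPositionType with the NO_OPERATION value discussed above.
public enum ColumnPositionType {
  FIRST,         // move/add the column to the front
  AFTER,         // position relative to a named reference column
  BEFORE,
  NO_OPERATION   // no position change requested (replaces the old null)
}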

TableSchemaResolver schemaUtil = new TableSchemaResolver(metaClient);
String historySchemaStr = schemaUtil.getTableHistorySchemaStrFromCommitMetadata().orElse("");
Schema schema = AvroInternalSchemaConverter.convert(newSchema, config.getTableName());
String commitActionType = CommitUtils.getCommitActionType(WriteOperationType.INSERT, metaClient.getTableType());
Contributor

Let's introduce a new operation type for schema changes : "ALTER_SCHEMA"

Contributor Author

ok

Contributor Author

fixed

* @return InternalSchema for this table
*/
public Option<InternalSchema> getTableInternalSchemaFromCommitMetadata() {
HoodieTimeline timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
Contributor

Instead of reading the internal metadata from the commit file, can we read it from the .hoodie/.schema folder (using FileBaseInternalSchemasManager)?

Contributor Author

That works for me. I'm just a little worried about performance: as the history schema gradually grows, reading from the commit file performs better than reading from FileBaseInternalSchemasManager.
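
A minimal sketch of the commit-file read path being defended here, assuming a HoodieTableMetaClient in scope as in the snippet above; the extra-metadata key name is an assumption.

// Hedged sketch: read the latest internal schema string from the last
// completed commit's metadata ("latest_schema" key is assumed).
HoodieTimeline timeline = metaClient.getActiveTimeline()
    .getCommitsTimeline().filterCompletedInstants();
Option<HoodieInstant> last = timeline.lastInstant();
if (last.isPresent()) {
  HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(
      timeline.getInstantDetails(last.get()).get(), HoodieCommitMetadata.class);
  String latestSchema = metadata.getExtraMetadata().get("latest_schema"); // assumed key
  // One small read per lookup, independent of how long the schema history grows.
}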

Member

Save it as an indexed file (HFile) so you can just read the last or first record and be done? Having one source of truth would be good. We can also do this as a follow-up.

@xiarixiaoyao
Contributor Author

@bvaradar Thank you for your review; I will update the code today.

@xiarixiaoyao
Contributor Author

@bvaradar Thanks for your review. I've updated the code.
Only one small doubt: reading the internal metadata from the commit file has better performance than reading from FileBaseInternalSchemasManager. Of course, if you think reading from FileBaseInternalSchemasManager is better, I will update the code.

Contributor

@bvaradar left a comment

@xiarixiaoyao: A few more comments.

}
}

private void cleanOldFiles() {
Contributor

We should do the cleanup as part of archiving commit metadata

Contributor Author

ok

}
});
// clean old files, keep at most ten schema files
if (validateSchemaFiles.size() > 10) {
Contributor

Can you make this value of 10 configurable?

Contributor Author

agree
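
For illustration, the threshold could be exposed via Hudi's ConfigProperty pattern; this is a sketch, and the key name is hypothetical.

// Hedged sketch: replace the hard-coded 10 with a config knob.
public static final ConfigProperty<Integer> SCHEMA_FILES_RETAINED = ConfigProperty
    .key("hoodie.schema.files.retained")    // hypothetical key name
    .defaultValue(10)                       // the previously hard-coded value
    .withDocumentation("Maximum number of history schema files kept under .hoodie/.schema.");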

}
}

private List getValidateCommits() {
Contributor

nit: Rename to getValidInstants

Contributor Author

agree

import java.util.TreeMap;
import java.util.stream.Collectors;

public class FileBaseInternalSchemasManager extends InternalSchemasManager {
Contributor

Rename to FileBasedInternalSchemaStorageManager

Contributor Author

agree

Path saveTempPath = new Path(baseSchemaPath, instantTime + java.util.UUID.randomUUID().toString());
try {
cleanOldFiles();
byte[] writeContent = historySchemaStr.getBytes(StandardCharsets.UTF_8);
Contributor

We had earlier removed the usage of renames when handling state transitions, because renames are not atomic on all storage types. Why do we need to write a temp file here? Can we avoid renames and instead use the same state transitions as commits, e.g. <instant.requested>, <instant.inflight>, and <instant.commit>? This way, we can reuse the HoodieTimeline logic for scanning and constructing valid commits.

Contributor Author

I will redo this logic, thanks.
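
A sketch of the commit-style state transitions being suggested, modeled on Hudi's <instant>.requested / <instant>.inflight / <instant>.commit convention. The save_schema suffixes anticipate the action introduced later in this thread and are assumptions here.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class SchemaCommitSketch {
  // Persist a schema via state-transition files instead of temp file + rename,
  // so non-atomic renames are never needed.
  void persistSchema(FileSystem fs, Path schemaFolder, String instantTime,
                     String historySchemaStr) throws IOException {
    Path requested = new Path(schemaFolder, instantTime + ".save_schema.requested");
    Path inflight = new Path(schemaFolder, instantTime + ".save_schema.inflight");
    Path completed = new Path(schemaFolder, instantTime + ".save_schema");
    fs.create(requested, false).close();   // plan the change
    fs.create(inflight, false).close();    // mark it in progress
    try (FSDataOutputStream out = fs.create(completed, false)) {
      out.write(historySchemaStr.getBytes(StandardCharsets.UTF_8)); // publish
    }
    // Readers trust only files without a .requested/.inflight suffix, so a
    // crashed writer never exposes a partially written schema file.
  }
}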


import org.apache.hudi.common.util.Option;

abstract class InternalSchemasManager {
Contributor

Rename to AbstractInternalSchemaStorageManager

Contributor Author

agree

}
IndexedRecord indexedRecord = (IndexedRecord) oldRecord;
List<Schema.Field> fields = newSchema.getFields();
Map<Integer, Object> helper = new HashMap<>();
Contributor

Rename helper to recordBuffer

Contributor Author

ok

for (int i = 0; i < fields.size(); i++) {
Schema.Field field = fields.get(i);
if (oldSchema.getField(field.name()) != null) {
Schema.Field oldField = oldSchema.getField(field.name());
Contributor

Looks like we are resolving the schema by name. How will rename work here? Is it because you will be addressing it in a subsequent diff?

Contributor Author

Good question.
This function must be used together with InternalSchemaUtils.mergeSchema, which resolves the rename.
See AbstractHoodieLogRecordScanner.processDataBlock in [RFC-33] [HUDI-2429][WIP] Full schema evolution #3668; the test cases there also prove that we can handle the rename operation.
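
For readers following along, here is a condensed sketch of the rewrite loop under discussion. It assumes mergeSchema has already rewritten field names in newSchema to match the file's oldSchema, which is how the name-based lookup survives renames; default-value handling is elided.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.IndexedRecord;

// Hedged sketch: copy every field of the merged (query-facing) schema from
// the old record by name; renamed columns already carry the file-side name.
static IndexedRecord rewrite(IndexedRecord oldRecord, Schema oldSchema, Schema newSchema) {
  GenericData.Record result = new GenericData.Record(newSchema);
  for (Schema.Field field : newSchema.getFields()) {
    Schema.Field oldField = oldSchema.getField(field.name()); // name already merged
    if (oldField != null) {
      result.put(field.pos(), oldRecord.get(oldField.pos()));
    } else {
      result.put(field.pos(), null); // newly added column; defaults elided
    }
  }
  return result;
}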

@xiarixiaoyao
Contributor Author

@bvaradar Thanks for your review. I will try to address all comments in the next few days.

@xiarixiaoyao
Contributor Author

@bvaradar Sorry for not updating the code in time. I've updated the code and addressed all comments:
1) Introduce a new action (save_schema) that uses <instant.requested>, <instant.inflight>, and <instant.commit> to manage the persistence of schema information.
2) Move cleanup into the commit-metadata archiving step, to clean old schema commits.
3) Every time we persist the schema, we try to clean up garbage files.

#3668 will also be updated later, to verify the correctness of this modification.

@xiarixiaoyao
Contributor Author

@hudi-bot run azure

@xiarixiaoyao
Contributor Author

@bvaradar I've already rebased the code and addressed all comments. Could you please help review the code again? Thanks.

@vinothchandar vinothchandar moved this from Ready for Review to Under Discussion PRs in PR Tracker Board Dec 15, 2021
@xiarixiaoyao
Contributor Author

Hello @bvaradar @vinothchandar,
We are currently testing this feature at large scale, and I will fix all the bugs found in testing.
Please bear with us, thanks.

@bvaradar
Contributor

bvaradar commented Dec 17, 2021 via email

@bvaradar
Contributor

@xiarixiaoyao: Any updates on the PR?

@xiarixiaoyao
Contributor Author

@bvaradar @vinothchandar
Sorry to have kept you waiting so long.
We are currently working on Flink and Hive support for this feature.

Current progress:
Spark: completed, and now in large-scale testing.
Hive: expected to finish this weekend.
Flink: to be completed by February 10 at the latest.
Presto: since the community has proposed a new Presto connector, we cannot adapt it until that connector is completed.

Once the Hive support work is finished, I will update the code as soon as possible.

@bvaradar
Contributor

@xiarixiaoyao: Thanks for the update. Were you able to finish the integration with Hive? Can you update and rebase the PR?

Thanks,
Balaji.V

@xiarixiaoyao
Contributor Author

@bvaradar The Hive work has been completed, and I will start updating the code today. Thank you very much for your patience.

@xiarixiaoyao xiarixiaoyao force-pushed the id_schema branch 2 times, most recently from b6c2389 to 40e49cf Compare January 26, 2022 04:05
@xiarixiaoyao
Contributor Author

xiarixiaoyao commented Jan 26, 2022

@bvaradar @leesf @codope @XuQianJin-Stars
Already rebased and updated the code:

  1. Add tests for FileBasedInternalSchemaStorageManager and rebase the code.

  2. Add support for changing column types and fix some test cases. Now supported (see the sketch after this list):

    • int => long/float/double/string
    • long => float/double/string
    • float => double/string
    • double => string/decimal
    • decimal => decimal/string
    • string => date/decimal
    • date => string
  3. Fix some bugs encountered in the production environment and delete useless code.

Looking forward to your comments.
The adaptation for Hive and Spark will be brought up tomorrow.
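
As referenced in item 2, here is a sketch encoding that promotion matrix as a simple check. The type-name strings are illustrative stand-ins for the PR's internal type objects, not its actual API.

// Hedged sketch of the supported primitive-type promotions listed above.
static boolean isPromotionAllowed(String from, String to) {
  if (from.equals(to)) {
    return true;
  }
  switch (from) {
    case "int":     return to.equals("long") || to.equals("float")
                        || to.equals("double") || to.equals("string");
    case "long":    return to.equals("float") || to.equals("double") || to.equals("string");
    case "float":   return to.equals("double") || to.equals("string");
    case "double":  return to.equals("string") || to.equals("decimal");
    case "decimal": return to.equals("decimal") || to.equals("string"); // e.g. wider precision
    case "string":  return to.equals("date") || to.equals("decimal");
    case "date":    return to.equals("string");
    default:        return false;
  }
}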

@xiarixiaoyao xiarixiaoyao force-pushed the id_schema branch 3 times, most recently from c051dfd to a39dba8 Compare January 30, 2022 01:45
@@ -48,6 +48,8 @@
INSERT_OVERWRITE_TABLE("insert_overwrite_table"),
// compact
COMPACT("compact"),
// alter schema
ALTER_SCHEMA("alter_schema"),
// used for old version
Contributor

I'm not sure ALTER SCHEMA belongs among the write operations, or whether the DDL operation should be put in BaseHoodieWriteClient.

Contributor Author

Thanks for your review.
I think it should belong to the write operations: the DDL operations in Spark SQL treat it as a write operation.
Of course, if you think this is inappropriate, I will remove it from WriteOperation.

Contributor

We store Alter Schema commands as separate commits, like other write operations. Hence the need for a separate WriteOperation enum value.

Member

Makes sense

@xiarixiaoyao xiarixiaoyao force-pushed the id_schema branch 3 times, most recently from 156c20b to b395803 Compare February 25, 2022 06:24
@xiarixiaoyao
Contributor Author

@bvaradar We store history schemas in .schema because we want to track the lineage of schema changes.
Later, we can incrementally synchronize the schema to the metatable.

Contributor

@bvaradar left a comment

Nit comments. Overall looks good. Once you are done with the changes, I will approve the diff. Looking into the Spark PR on top of this.

@xiarixiaoyao: We need one diff to document the usage and constraints (e.g., Hoodie columns should not be reordered; nested types) of the Alter commands as a section on the Apache Hudi website.

/**
* Set the max column id for this schema.
*/
public void setMax_column_id(int maxColumnId) {
Contributor

Please make all the method names camelCase.

Contributor Author

Sorry for the nonstandard naming; already fixed.

public class InternalSchemaMerger {
private final InternalSchema fileSchema;
private final InternalSchema querySchema;
// now there exist some bugs when we use spark update/merge api,
Contributor

Can you add some pointers to the GH/Jira issue?

Contributor Author

yes, already added

}

private void cleanResidualFiles() {
List<String> validateCommits = getValidInstants();
Contributor

Rename validateCommits to validInstants

Contributor Author

fixed

if (fs.exists(baseSchemaPath)) {
List<String> candidateSchemaFiles = Arrays.stream(fs.listStatus(baseSchemaPath)).filter(f -> f.isFile())
.map(file -> file.getPath().getName()).collect(Collectors.toList());
List<String> validateSchemaFiles = candidateSchemaFiles.stream().filter(f -> validateCommits.contains(f.split("\\.")[0])).collect(Collectors.toList());
Contributor

nit: validateSchemaFiles => validSchemaFiles

Contributor Author

fixed

@xiarixiaoyao
Contributor Author

Nit comments. Overall looks good. Once you are done with the changes, I will approve the diff. Looking into the Spark PR on top of this.

@xiarixiaoyao: We need one diff to document the usage and constraints (e.g., Hoodie columns should not be reordered; nested types) of the Alter commands as a section on the Apache Hudi website.

@bvaradar Thank you very much for your comments.
Yes, Hoodie internal columns should not be modified; we have already prohibited these situations in our code, and the test code can be found in TestSpark3DDL in PR #4910.

Yes, I will prepare a doc for the Alter commands.
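
A sketch of the kind of guard described here, using Hudi's HoodieRecord meta-column list; the method name is hypothetical.

import org.apache.hudi.common.model.HoodieRecord;

// Hedged sketch: reject any change that targets a Hudi internal meta column.
static void checkNotMetaColumn(String colName) {
  if (HoodieRecord.HOODIE_META_COLUMNS.contains(colName)) {
    throw new IllegalArgumentException("cannot modify hudi meta column: " + colName);
  }
}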

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@vinothchandar
Member

@bvaradar @xiarixiaoyao Please let me know if/when this is ready for review. We plan to code freeze this Friday.

@vinothchandar
Member

cc @xushiyan

@xiarixiaoyao
Contributor Author

@vinothchandar Yes, thank you for your attention.
Could you please also help review this PR in your spare time?

bvaradar has also put forward some modification suggestions in this PR; I'm actively addressing those comments.

@vinothchandar
Member

I have been syncing with @bvaradar actually. Going to take my pass over this PST tomorrow. Stay tuned!

@vinothchandar
Member

vinothchandar commented Mar 30, 2022

@xiarixiaoyao In the meantime, please see if we can make it as safe as possible, in terms of having flags that will safeguard normal code paths.
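
The safeguard being requested could take the form of a feature flag that is off by default. This sketch uses Hudi's ConfigProperty pattern; the key name is an assumption.

// Hedged sketch: keep full schema evolution behind an opt-in flag so that
// normal code paths are untouched unless it is explicitly enabled.
public static final ConfigProperty<Boolean> SCHEMA_EVOLUTION_ENABLE = ConfigProperty
    .key("hoodie.schema.on.read.enable")   // assumed key name
    .defaultValue(false)                   // off by default: normal paths unaffected
    .withDocumentation("Enable id-based full schema evolution for readers and writers.");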

@xiarixiaoyao
Contributor Author

@vinothchandar Thank you very much.
#4910 already includes this code, and bvaradar and I have modified a lot of it, so maybe we can close this PR directly.

Member

@vinothchandar left a comment

Seems like mostly new code; I'll have to review more deeply. Flagged one concern on how we are saving the new instant.

@xiarixiaoyao what testing has been done for this PR?

@@ -550,6 +551,9 @@ public void archive(HoodieEngineContext context, List<HoodieInstant> instants) t
}
}
writeToFile(wrapperSchema, records);
// try to clean old history schema.
FileBasedInternalSchemaStorageManager fss = new FileBasedInternalSchemaStorageManager(metaClient);
fss.cleanOldFiles(instants.stream().map(is -> is.getTimestamp()).collect(Collectors.toList()));
Member

Should this be in the cleaner?

Contributor Author

I am OK with moving this to the cleaner.


@@ -591,6 +591,10 @@ private void revertCompleteToInflight(HoodieInstant completed, HoodieInstant inf
}
}

private Path getInstantFileNamePath(String fileName) {
return new Path(fileName.contains(SAVE_SCHEMA_ACTION) ? metaClient.getSchemaFolderName() : metaClient.getMetaPath(), fileName);
Member

If we treat something as an instant, splitting it apart into separate folders seems a bit odd to me.

Contributor Author

Yes, currently our history schema is saved in the .hoodie/.schema directory. We plan to save the schema through the metatable later; once that's done, we won't need this odd logic anymore.

@xiarixiaoyao
Contributor Author

When a DDL statement happens, a new commit is created and saved, just like what we do when we first create a table using Spark SQL.
UTs cover those functions, and the integration tests with Spark SQL can be found in #4910.

@xiarixiaoyao
Contributor Author

As #4910 has been merged, closing this PR.

PR Tracker Board automation moved this from Under Discussion PRs to Done Apr 2, 2022