[HUDI-2560] introduce id_based schema to support full schema evolution. #3808
Conversation
cc @codope
@xiarixiaoyao : Went through the first few files and added review comments. Will review the rest in a day.
 * @param position col position to be added
 * @param positionType col position change type. now support three change types: first/after/before
 */
public void addCol(String colName, Schema schema, String doc, String position, TableChange.ColumnPositionChange.ColumnPositionType positionType) {
nit: addCol -> addColumns
A question for all these schema methods at this class level: what is the behavior if the same operation is applied twice? For example, what happens if the same column is deleted or added twice?
The second delete or add will fail and throw an exception. Before the actual operation, we check the column operation. I will add a test case, thanks.
Added checks for repeated delete and add in the test class TestTableChanges.
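The duplicate-change guard described in the reply could look like the following minimal sketch. The class and method names here are illustrative, not Hudi's actual implementation: each column touched by a pending change is recorded, and a second add/delete on the same column fails fast.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of rejecting a repeated add/delete of the same
// column before the change is applied (not the PR's actual code).
class ColumnChangeTracker {
    private final Set<String> touched = new HashSet<>();

    // Records a pending change; throws if this column was already changed
    // in the same batch of table changes.
    void recordChange(String colName) {
        if (!touched.add(colName)) {
            throw new IllegalArgumentException(
                "cannot apply a second add/delete to column: " + colName);
        }
    }
}
```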
}

public void addCol(String colName, Schema schema) {
  addCol(colName, schema, null, null, null);
Instead of a null column type, can you create an enum value "NULL_COLUMN" in ColumnPositionType?
Also, why does the param position have type String instead of int?
This is because engines such as Hive/Spark/MySQL use a string as the parameter, e.g. alter table xxx add columns(name string after/before id). I think it is better to use String as the param.
Using the enum value "NO_OPERATION" instead of "NULL_COLUMN" in ColumnPositionType.
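A sketch of the enum shape agreed on above. The first/after/before values come from the javadoc quoted earlier; the fromName helper is a hypothetical convenience for mapping the string parameter, not part of the PR:

```java
// Hypothetical sketch of ColumnPositionType with the NO_OPERATION value
// discussed above, so callers never have to pass null.
enum ColumnPositionType {
    FIRST,        // move the column to the first position
    AFTER,        // place the column after a named column
    BEFORE,       // place the column before a named column
    NO_OPERATION; // no position change requested

    // Maps the string parameter (as used by Hive/Spark/MySQL DDL) to an
    // enum value; null means "no position change".
    static ColumnPositionType fromName(String name) {
        return name == null ? NO_OPERATION : valueOf(name.toUpperCase());
    }
}
```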
TableSchemaResolver schemaUtil = new TableSchemaResolver(metaClient);
String historySchemaStr = schemaUtil.getTableHistorySchemaStrFromCommitMetadata().orElse("");
Schema schema = AvroInternalSchemaConverter.convert(newSchema, config.getTableName());
String commitActionType = CommitUtils.getCommitActionType(WriteOperationType.INSERT, metaClient.getTableType());
Let's introduce a new operation type for schema changes : "ALTER_SCHEMA"
ok
fixed
 * @return InternalSchema for this table
 */
public Option<InternalSchema> getTableInternalSchemaFromCommitMetadata() {
  HoodieTimeline timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
Instead of reading the internal metadata from the commit file, can we read it from the .hoodie/.schema folder (using FileBaseInternalSchemasManager)?
That sounds OK to me. I'm just a little worried about performance: as the history schema gradually grows, reading from the commit file performs better than reading from FileBaseInternalSchemasManager.
Save it as an indexed file (HFile) so you can just read the last record or first record and be done? Having one source of truth would be good. We can also do this as follow up
@bvaradar Thank you for your review, will update the code today.
Force-pushed from 56e3932 to 440a128
@bvaradar thanks for your review. Updated the code.
@xiarixiaoyao : Few more comments.
  }
}

private void cleanOldFiles() {
We should do the cleanup as part of archiving commit metadata
ok
  }
});
// clean old files, keep at most ten schema files
if (validateSchemaFiles.size() > 10) {
Can you make this value 10 configurable?
agree
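One way the hard-coded 10 could be made configurable is sketched below. The property key and class name are assumptions for illustration only, not actual Hudi configuration names:

```java
import java.util.Properties;

// Hypothetical sketch: read the retained-schema-file count from a config
// property instead of hard-coding 10. The key name is an assumption.
class SchemaFileRetention {
    static final String MAX_SCHEMA_FILES_PROP = "hoodie.schema.files.max"; // hypothetical key
    static final int DEFAULT_MAX_SCHEMA_FILES = 10;

    // Returns the configured maximum, falling back to the default of 10.
    static int maxSchemaFiles(Properties props) {
        return Integer.parseInt(
            props.getProperty(MAX_SCHEMA_FILES_PROP, String.valueOf(DEFAULT_MAX_SCHEMA_FILES)));
    }
}
```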
  }
}

private List getValidateCommits() {
nit: Rename to getValidInstants
agree
import java.util.TreeMap;
import java.util.stream.Collectors;

public class FileBaseInternalSchemasManager extends InternalSchemasManager {
Rename to FileBasedInternalSchemaStorageManager
agree
Path saveTempPath = new Path(baseSchemaPath, instantTime + java.util.UUID.randomUUID().toString());
try {
  cleanOldFiles();
  byte[] writeContent = historySchemaStr.getBytes(StandardCharsets.UTF_8);
We had earlier removed usage of renames when handling state transitions because renames are not atomic in all storage types. why do we need to write temp file here ? Can we avoid using renames and instead use same state transitions like commits - e.g <instant.requested> <instant.inflight> and <instant.commit>. This way, we can reuse HoodieTimeline logic for scanning and constructing valid commits.
I will redo this logic, thanks.
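The suggested state-transition approach can be sketched as a tiny state machine. Names here are assumptions, not Hudi's API, and the file I/O is omitted; the point is that each transition creates a new marker file (requested → inflight → completed) rather than renaming a temp file, so no non-atomic rename is needed on object stores:

```java
// Hypothetical sketch of the requested -> inflight -> completed
// transitions the reviewer suggests. Only the legal transitions are
// modeled; the actual creation of marker files is omitted.
enum InstantState { REQUESTED, INFLIGHT, COMPLETED }

class SchemaCommitInstant {
    private InstantState state = InstantState.REQUESTED;

    InstantState state() { return state; }

    // REQUESTED -> INFLIGHT; any other starting state is illegal.
    void transitionToInflight() {
        if (state != InstantState.REQUESTED) {
            throw new IllegalStateException("expected REQUESTED, was " + state);
        }
        state = InstantState.INFLIGHT;
    }

    // INFLIGHT -> COMPLETED; any other starting state is illegal.
    void transitionToCompleted() {
        if (state != InstantState.INFLIGHT) {
            throw new IllegalStateException("expected INFLIGHT, was " + state);
        }
        state = InstantState.COMPLETED;
    }
}
```

Because a reader only trusts instants that reached COMPLETED, partially written schema files are simply ignored, which is the same scanning logic HoodieTimeline already applies to commits.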
import org.apache.hudi.common.util.Option;

abstract class InternalSchemasManager {
Rename to AbstractInternalSchemaStorageManager
agree
}
IndexedRecord indexedRecord = (IndexedRecord) oldRecord;
List<Schema.Field> fields = newSchema.getFields();
Map<Integer, Object> helper = new HashMap<>();
Rename helper to recordBuffer
ok
for (int i = 0; i < fields.size(); i++) {
  Schema.Field field = fields.get(i);
  if (oldSchema.getField(field.name()) != null) {
    Schema.Field oldField = oldSchema.getField(field.name());
Looks like we are resolving the schema by name. How will rename work here? Is it because in subsequent diff, you will be addressing it ?
Good question. This function must be used together with InternalSchemaUtils.mergeSchema, which resolves the rename. See AbstractHoodieLogRecordScanner.processDataBlock in [RFC-33] [HUDI-2429][WIP] Full schema evolution #3668; the test case there also proves that we can handle the rename operation.
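The name-based copy plus rename resolution can be illustrated with plain maps instead of Avro records. This is a simplified sketch, not the PR's code: once mergeSchema has mapped each renamed field back to the name it carries in the file schema, copying by name is safe, and newly added fields default to null.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of the rewrite step, using plain maps in place of
// Avro IndexedRecords. mergedToOldName stands in for the rename mapping
// that InternalSchemaUtils.mergeSchema would have produced.
class RecordRewriter {
    // mergedToOldName: new field name -> name the field has in the old record
    static Map<String, Object> rewrite(Map<String, Object> oldRecord,
                                       List<String> newFields,
                                       Map<String, String> mergedToOldName) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (String field : newFields) {
            String oldName = mergedToOldName.getOrDefault(field, field);
            // fields absent from the old record (newly added columns) become null
            result.put(field, oldRecord.get(oldName));
        }
        return result;
    }
}
```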
@bvaradar thanks for your review. I will try to address all comments in the next few days.
Force-pushed from 440a128 to 6b639ce
Compare
@bvaradar Sorry for not updating the code in time; #3668 will also be updated later to verify the correctness of this modification.
Force-pushed from 6b639ce to dd08f05
Force-pushed from ce54c0b to e7054c5
@hudi-bot run azure
@bvaradar Already rebased the code and addressed all comments. Could you please review the code again? Thanks.
hello @bvaradar @vinothchandar
Thanks a lot. I will do another pass in the interim.
(Quoting xiarixiaoyao, Thursday, December 16, 2021:)
hello @bvaradar @vinothchandar
We are currently testing this feature on a large scale, and I will fix all the bugs found in testing.
Please wait patiently, thanks.
@xiarixiaoyao : Any updates on the PR?
@bvaradar @vinothchandar Current progress: once the work to support Hive is finished, I will update the code as soon as possible.
@xiarixiaoyao : Thanks for the update. Were you able to finish the integration with Hive? Can you update and rebase the PR? Thanks.
@bvaradar The Hive work has been completed and I will start updating the code today. Thank you very much for your patience.
Force-pushed from b6c2389 to 40e49cf
@bvaradar @leesf @codope @XuQianJin-Stars looking forward to your comments.
Force-pushed from c051dfd to a39dba8
Force-pushed from ad1630e to a1b1cdb
@@ -48,6 +48,8 @@
  INSERT_OVERWRITE_TABLE("insert_overwrite_table"),
  // compact
  COMPACT("compact"),
  // alter schema
  ALTER_SCHEMA("alter_schema"),
  // used for old version
I'm not sure whether ALTER_SCHEMA belongs among the write operations, or whether the DDL operation should be put in BaseHoodieWriteClient.
Thanks for your review. I think it belongs with the write operations; Spark SQL treats DDL operations as write operations.
Of course, if you think this is inappropriate, I will remove it from WriteOperation.
We store Alter Schema commands as separate commits like other write operations; hence the need for a separate WriteOperation enum value.
Makes sense
Force-pushed from 156c20b to b395803
@bvaradar We store history schemas in .schema because we want to track the lineage of schema changes.
Nit comments. Overall looks good. Once you are done with the changes, I will approve the diff. Looking into the Spark PR on top of this.
@xiarixiaoyao : We need one diff to document the usage and constraints (e.g. Hoodie columns should not be reordered; nested types) of the Alter commands as a section on the Apache Hudi website.
/**
 * set the version ID for this schema.
 */
public void setMax_column_id(int maxColumnId) {
Please make all the method names camelCase.
Sorry for the nonstandard naming; already fixed.
public class InternalSchemaMerger {
  private final InternalSchema fileSchema;
  private final InternalSchema querySchema;
  // now there exist some bugs when we use spark update/merge api,
Can you add some pointers to the GH/Jira issue?
yes, already added
}

private void cleanResidualFiles() {
  List<String> validateCommits = getValidInstants();
Rename validateCommits to validInstants
fixed
if (fs.exists(baseSchemaPath)) {
  List<String> candidateSchemaFiles = Arrays.stream(fs.listStatus(baseSchemaPath)).filter(f -> f.isFile())
      .map(file -> file.getPath().getName()).collect(Collectors.toList());
  List<String> validateSchemaFiles = candidateSchemaFiles.stream().filter(f -> validateCommits.contains(f.split("\\.")[0])).collect(Collectors.toList());
nit: validateSchemaFiles => validSchemaFiles
fixed
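The filename filtering in the quoted snippet can be shown in isolation. This is a simplified sketch that uses plain lists in place of FileStatus objects: a schema file is kept only when the part of its name before the first dot matches a valid commit timestamp.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified sketch of the valid-schema-file filter from the snippet
// under review, decoupled from the FileSystem API.
class SchemaFileFilter {
    // Keeps files whose name prefix (before the first '.') is a valid commit.
    static List<String> validSchemaFiles(List<String> candidateFiles,
                                         Set<String> validCommits) {
        return candidateFiles.stream()
            .filter(f -> validCommits.contains(f.split("\\.")[0]))
            .collect(Collectors.toList());
    }
}
```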
@bvaradar Thank you very much for your comments. Yes, I will prepare a doc for the alter commands.
@bvaradar @xiarixiaoyao Please let me know if/when this is ready for a review. We plan to code freeze this Friday.
cc @xushiyan
@vinothchandar Yes, thank you for your attention. In this PR, bvaradar has also put forward some modification suggestions, which I'm actively addressing.
I have been syncing with @bvaradar actually. Going to take my pass over this PST tomorrow. Stay tuned!
@xiarixiaoyao In the meantime, please see if we can make it as safe as possible, in terms of having flags that will safeguard normal code paths.
@vinothchandar Thank you very much.
Seems like mostly new code. I'll have to review more deeply. Flagged one concern on how we are saving the new instant.
@xiarixiaoyao what testing has been done for this PR?
@@ -550,6 +551,9 @@ public void archive(HoodieEngineContext context, List<HoodieInstant> instants) t
    }
  }
  writeToFile(wrapperSchema, records);
  // try to clean old history schema.
  FileBasedInternalSchemaStorageManager fss = new FileBasedInternalSchemaStorageManager(metaClient);
  fss.cleanOldFiles(instants.stream().map(is -> is.getTimestamp()).collect(Collectors.toList()));
Should this be in the cleaner?
I am OK with moving this to the cleaner.
@@ -48,6 +48,8 @@
  INSERT_OVERWRITE_TABLE("insert_overwrite_table"),
  // compact
  COMPACT("compact"),
  // alter schema
  ALTER_SCHEMA("alter_schema"),
  // used for old version
Makes sense
 * @return InternalSchema for this table
 */
public Option<InternalSchema> getTableInternalSchemaFromCommitMetadata() {
  HoodieTimeline timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
Save it as an indexed file (HFile) so you can just read the last record or first record and be done? Having one source of truth would be good. We can also do this as follow up
@@ -591,6 +591,10 @@ private void revertCompleteToInflight(HoodieInstant completed, HoodieInstant inf
  }
}

private Path getInstantFileNamePath(String fileName) {
  return new Path(fileName.contains(SAVE_SCHEMA_ACTION) ? metaClient.getSchemaFolderName() : metaClient.getMetaPath(), fileName);
If we treat something as an instant, splitting it apart into separate folders seems a bit odd to me.
Yes, currently our history schema is saved in the .hoodie/.schema directory. We should save the schema through the metadata table later; once that is done, we won't need such odd logic anymore.
When a DDL happens, a new commit will be created and saved, just like what we do when we first create a table using Spark SQL.
as #4910 merged,
What is the purpose of the pull request
Introduce id_based schema to support full schema evolution.
this pr is split from [RFC-33] [HUDI-2429][WIP] Full schema evolution #3668, since that pr is too large.
Brief change log
Verify this pull request
UT added.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.