[HUDI-1453] Throw Exception when input data schema is not equal to th… #2334
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##             master    #2334   +/-   ##
=========================================
  Coverage     50.73%   50.73%
- Complexity     3064     3065    +1
=========================================
  Files           419      419
  Lines        18797    18797
  Branches      1922     1922
=========================================
+ Hits          9536     9537    +1
  Misses        8485     8485
+ Partials       776      775    -1
```
@pengzhiwei2018: sorry, I am not following the problem here. Can you throw some light?
Hi @nsivabalan, when I write a record whose schema is compatible with, but not equal to, the table schema (e.g. an "int" into a "long" field), an exception is thrown, as described in HUDI-1453.
```diff
@@ -38,8 +38,7 @@
 public HoodieBootstrapHandle(HoodieWriteConfig config, String commitTime, HoodieTable<T, I, K, O> hoodieTable,
     String partitionPath, String fileId, TaskContextSupplier taskContextSupplier) {
   super(config, commitTime, hoodieTable, partitionPath, fileId,
-      Pair.of(HoodieAvroUtils.RECORD_KEY_SCHEMA,
-          HoodieAvroUtils.addMetadataFields(HoodieAvroUtils.RECORD_KEY_SCHEMA)), taskContextSupplier);
```
@bvaradar: can you confirm that this change looks good?
```diff
@@ -152,9 +155,9 @@ public void write() {
   final String key = keyIterator.next();
   HoodieRecord<T> record = recordMap.get(key);
   if (useWriterSchema) {
-    write(record, record.getData().getInsertValue(writerSchemaWithMetafields));
+    write(record, record.getData().getInsertValue(inputSchemaWithMetaFields));
```
minor. not exactly related to this patch. But can we rename "useWriterSchema" to something like "isCompactor" so that it's explicit and comprehensible?
Nice suggestion! `isCompactor` would be easier to understand.
I actually disagree completely :) . Leaking such call hierarchy into a lower-level class will lead to more confusion if, say, one more code path uses this code.
```java
 */
protected final Schema inputSchema;
/**
 * The input shema with meta fields.
```
minor. typo. "schema"
Thanks for the correction.
```java
/**
 * The table schema is the schema of the table which used to read/write record from table.
 */
protected final Schema tableSchema;
```
Thanks for this change. Definitely looks more readable now and avoids any confusion.
```diff
-writerSchema = HoodieAvroUtils.createHoodieWriteSchema(config.getSchema());
-tableSchema = HoodieAvroUtils.createHoodieWriteSchema(schemaUtil.getTableAvroSchemaWithoutMetadataFields());
+tableSchema = getTableSchema(config, true);
```
isn't the 2nd arg supposed to be false?
Hi @nsivabalan, here `createHoodieWriteSchema` will add the meta fields to the schema, so the `tableSchema` should also have the meta fields here.
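For context, here is a rough sketch of what "adding the meta fields" means, in plain Avro. The `_hoodie_*` names are the meta columns visible in any Hudi table; everything else is illustrative, not Hudi's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class MetaFieldSketch {
  // Hudi's meta columns, as seen in any Hudi table's _hoodie_* fields.
  private static final String[] META_FIELDS = {
      "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
      "_hoodie_partition_path", "_hoodie_file_name"};

  // Rough, illustrative equivalent of createHoodieWriteSchema: prepend the
  // meta fields to the user schema, so the write-side schema carries them.
  static Schema addMetaFields(Schema base) {
    List<Schema.Field> fields = new ArrayList<>();
    for (String name : META_FIELDS) {
      fields.add(new Schema.Field(name, Schema.create(Schema.Type.STRING), null));
    }
    for (Schema.Field f : base.getFields()) {
      fields.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
    }
    Schema withMeta = Schema.createRecord(base.getName(), base.getDoc(), base.getNamespace(), false);
    withMeta.setFields(fields);
    return withMeta;
  }

  public static void main(String[] args) {
    Schema user = SchemaBuilder.record("Trip").fields().requiredLong("ts").endRecord();
    // 6 fields: 5 meta columns + the single user field.
    System.out.println(addMetaFields(user).getFields().size());
  }
}
```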
```java
this.writerSchemaWithMetafields = writerSchemaIncludingAndExcludingMetadataPair.getValue();
// Here the hoodieTable#config may not equal to the config field, we use the 'config' from the
// construct method
this.tableSchema = schemaOption.orElseGet(() -> hoodieTable.getTableSchema(config, false));
```
Trying to confirm my understanding: only for the bootstrap code path do we need the table schema to be initialized with the incoming schema, since we can't rely on hoodieTable.getTableSchema(). If not, we should not be looking at schemaOption only to set tableSchema (in every other call path).
Yes @nsivabalan, the `schemaOption` is only used for `HoodieBootstrapHandle` to pass its schema.
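To make the bootstrap-only wiring concrete, here is a minimal sketch. It uses `java.util.Optional` for brevity (the actual diff uses Hudi's own `Option` type), and `SchemaLookup` is an illustrative stand-in for `HoodieTable`:

```java
import java.util.Optional;
import org.apache.avro.Schema;

// Simplified wiring of the schemaOption discussed above; not the exact Hudi code.
abstract class WriteHandleSketch {
  interface SchemaLookup {
    Schema getTableSchema();
  }

  protected final Schema tableSchema;

  WriteHandleSketch(Optional<Schema> schemaOption, SchemaLookup table) {
    // Only the bootstrap handle supplies a schema up front; every other
    // call path passes Optional.empty() and resolves it from the table.
    this.tableSchema = schemaOption.orElseGet(table::getTableSchema);
  }
}
```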
```diff
 public Schema getWriterSchema() {
-  return writerSchema;
+  return tableSchema;
```
Feel we could rename this method to `getReaderSchema()` since it is used in only one place, `bootstrapReadSchema = mergeHandle.getWriterSchema();`, and we assign it to a var that stores the read schema anyway.
It seems reasonable!
@pengzhiwei2018 @nsivabalan Is there a different way to resolve this issue? The writer/reader schema is baked into many parts of the code, even on the reader side; see the example here -> https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L70. To be able to rename the concept of writer to input requires more changes in the code, and also requires us to understand what the readerSchema means in this scenario. I'm OK if you want to introduce another transient field called …
Hi @n3nash, thanks for your suggestion. I have updated the description of the changes in more detail in the Brief change log section to make it easier to understand. Please take a look again when you have time.
@pengzhiwei2018 first of all, thanks for these great contributions.
Wrt inputSchema vs writeSchema, I actually feel writeSchema already stands for inputSchema; the input is what is being written, right? We can probably just leave it as is, and introduce the new `tableSchema` variable as you have in the `HoodieWriteHandle` class?
Like someone else pointed out as well, so far we have been using read and write schemas consistently. Love to not introduce a new input schema unless it's absolutely necessary.
Hi @vinothchandar, thanks for your reply on this issue.
So to rephrase the description: this solves the case where the input data has a field (int) that is compatible with the field it is written to in the table (long). Can this issue not be solved at the input record level, by converting the "int" data into a "long" before writing into HUDI? hoodieTable.getTableSchema() always returns the "latest" schema, which is the schema used by the last HoodieWriteClient (saved into commit instants). So when the "int" based RDD is written, the table schema will no longer have a "long" field. When this table schema is used to read an older file in the table (merge during updatehandle), the read should fail, as a long (from parquet) cannot be converted to an int (from schema). This is actually a backward-incompatible schema change and hence is not allowed by HUDI. @pengzhiwei2018 can you add a test to verify my hypothesis? In your existing test in TestCOWDataSource, can you write a long to the table in the next write? Also, can you read all the data using the "int" schema, even the older records which contain a long?
Just to clarify: some schema evolution just for types works fine with hoodie; e.g. integer to double works fine. The problem is that double to int is where the issue is. I am not sure why the schema compatibility check does not fail this evolution.
Oh, I didn't realize the default schema compatibility check is false. I assume that if I enable the schema compatibility check, double to int evolution is likely to fail. So I am not sure what this PR tries to achieve. At least from my local test run, integer to double worked for me.
Hi @nsivabalan, I have taken a look at your test code. First you write an "int" to the table, so the table schema type is "int". Then you write a "double" to the table, so the table schema becomes "double". The table schema changed from "int" to "double"; I think this is more reasonable. In my original idea, the first write schema (e.g. "int") would be the table schema forever, and the incoming records after that should be compatible with the original table schema (e.g. "int"). That is what this PR wanted to solve. I understand more clearly now: the table schema should change to the more generic type (e.g. from "int" to "double"), not always stay the first write schema.
Yeah, I did verify by enabling the schema compatibility check: it will fail if we try to evolve a field from double to int.
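For reference, Avro itself encodes this asymmetry. A standalone check with plain Avro (independent of Hudi) shows that reading int data with a long reader schema is a legal promotion, while the reverse is not:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.SchemaCompatibility;

public class PromotionCheck {
  public static void main(String[] args) {
    Schema withInt  = SchemaBuilder.record("r").fields().requiredInt("f").endRecord();
    Schema withLong = SchemaBuilder.record("r").fields().requiredLong("f").endRecord();

    // Reading int data with a long reader schema is a legal Avro promotion...
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(withLong, withInt).getType());  // COMPATIBLE

    // ...but reading long data with an int reader schema is not.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(withInt, withLong).getType()); // INCOMPATIBLE
  }
}
```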
Tips
What is the purpose of the pull request
Fix the bug where writing a compatible data type to a hudi table throws an exception. For example, when you first write a double to the table and then write an int, an exception is thrown, as ISSUE [HUDI-1453] describes.
Brief change log

Currently, Hudi stores the incoming record schema (we call it `inputSchema` here) in `HoodieWriteConfig`. In the write stage, Hudi uses the `inputSchema` both to parse the incoming records and to read and write records in the hoodie table. In most cases this works well. However, when the incoming schema is compatible with, but not equal to, the table schema (e.g. writing an "int" to a "long"), an exception is thrown in the write stage, as ISSUE [HUDI-1453] describes. I fix this issue by distinguishing the two kinds of schemas (`inputSchema` and `tableSchema`). Here are the major changes (a simplified sketch of the resulting split follows this list):

- Split `writerSchema` into `inputSchema` and `tableSchema` in `HoodieWriteHandle`. The `inputSchema` is used by `getInsertValue` to parse the incoming records; the `tableSchema`, which is the schema of the hoodie table, is used to read data from and write data to the table.
- Change `writerSchema` to `inputSchema` or `tableSchema` according to the usage scenario: when parsing the incoming records, use the `inputSchema`; when reading/writing data from/to the table, use the `tableSchema`.
- Add a `getTableSchema` method to `HoodieTable` so that `HoodieWriteHandle` can read the table schema.
- Add an Option param to the constructor of `HoodieWriteHandle` so that `HoodieBootstrapHandle` can pass its specified schema.
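The sketch below illustrates the split described above. The field names `inputSchema` and `tableSchema` come from the diff hunks; the class name, `PayloadLike`, and `persist()` are illustrative stand-ins, not Hudi's actual classes:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

// Illustrative sketch of the inputSchema/tableSchema split.
abstract class WriteHandleSplitSketch {
  interface PayloadLike {
    GenericRecord getInsertValue(Schema schema) throws IOException;
  }

  protected final Schema inputSchema;  // schema of the incoming records
  protected final Schema tableSchema;  // schema of the hoodie table

  WriteHandleSplitSketch(Schema inputSchema, Schema tableSchema) {
    this.inputSchema = inputSchema;
    this.tableSchema = tableSchema;
  }

  void writeRecord(PayloadLike payload) throws IOException {
    // Parse the incoming record with the schema it was produced with...
    GenericRecord incoming = payload.getInsertValue(inputSchema);
    // ...and persist it with the table's schema, so a compatible type
    // (e.g. an int written into a long field) no longer fails mid-write.
    persist(incoming, tableSchema);
  }

  abstract void persist(GenericRecord record, Schema schema);
}
```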
Verify this pull request

(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.