[HUDI-3982] Comprehensive schema evolution in flink when read/batch/cow/snapshot #5443
Conversation
@danny0405 @bvaradar If you have free time, could you please help review this PR? Thanks very much.
What do you mean by that?
This PR covers the following case:
  <artifactId>spark-hive_${scala.binary.version}</artifactId>
  <scope>test</scope>
</dependency>
<dependency>
Why introduce the Spark dependency in the Flink pom?
To prepare test data. Currently, only the Spark engine provides a way to change the schema and write new data after that.
I think once full support of schema evolution is implemented, we can remove this dependency by rewriting the test in pure Flink.
@@ -447,6 +453,17 @@ private Schema inferSchemaFromDdl() {
    return HoodieAvroUtils.addMetadataFields(schema, conf.getBoolean(FlinkOptions.CHANGELOG_ENABLED));
  }

  private SchemaEvoContext getSchemaEvoContext() {
    if (!conf.getBoolean(FlinkOptions.SCHEMA_EVOLUTION_ENABLED)) {
Return Option<SchemaEvoContext> instead.
Fixed. Removed the enabled field; isPresent means enabled.
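The agreed pattern can be sketched as follows. This is a minimal, dependency-free illustration of replacing a boolean "enabled" flag with Optional presence; the class and parameter names here are hypothetical stand-ins, not the PR's actual API:

```java
import java.util.Optional;

// Sketch: return Optional.empty() when schema evolution is disabled, so that
// Optional.isPresent() replaces a separate "enabled" flag on the context object.
// A String stands in for the real SchemaEvoContext payload here.
final class SchemaEvoContextSketch {
  static Optional<String> getSchemaEvoContext(boolean evolutionEnabled, String querySchema) {
    if (!evolutionEnabled) {
      // Absent value == evolution disabled; callers never see a half-initialized context.
      return Optional.empty();
    }
    return Optional.of(querySchema);
  }
}
```

Callers then branch on `isPresent()` instead of reading a flag, which makes the "disabled" state unrepresentable as a populated-but-inactive context.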
LogicalTypeRoot to = toType.getTypeRoot();
switch (to) {
  case BIGINT: {
    // Integer => Long
What is the philosophy behind these mappings?
Assume the schema evolution DDL

alter table t1 alter column val type bigint

which changes the type of val from int to bigint.
We want to be able to read old data. To do that, we need to cast val from int to long; otherwise, an exception will be thrown:

java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
at org.apache.flink.table.data.GenericRowData.getLong(GenericRowData.java:154)

This class is an analogue of org.apache.hudi.client.utils.SparkInternalSchemaConverter#convertColumnVectorType, which converts Spark's types.
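The idea behind such a mapping can be sketched without Flink's actual type system as a position-to-conversion table applied to values read from old files. This is a simplified illustration, assuming an int-to-bigint evolution; the names (CastMapSketch, add, castIfNeeded) are hypothetical, not the PR's CastMap API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a cast map: field position -> conversion applied to values
// decoded from files written under the old schema.
final class CastMapSketch {
  private final Map<Integer, Function<Object, Object>> casts = new HashMap<>();

  // Register a conversion for the field at the given position,
  // e.g. Integer -> Long when a column evolved from int to bigint.
  void add(int pos, Function<Object, Object> cast) {
    casts.put(pos, cast);
  }

  // Apply the conversion if one is registered; otherwise pass the value through,
  // so columns whose type did not change take the normal path.
  Object castIfNeeded(int pos, Object val) {
    Function<Object, Object> cast = casts.get(pos);
    return cast == null ? val : cast.apply(val);
  }
}
```

With such a map in place, a reader that decoded an Integer from an old file hands back a Long, and GenericRowData.getLong no longer hits the ClassCastException shown above.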
 * Data class to pass schema evolution info from table source to input format.
 */
public final class SchemaEvoContext implements Serializable {
  private final boolean enabled;
Is this class necessary? The enabled flag can be replaced by a non-empty Option<querySchema> instead.
Is this class necessary?
I think yes. The schema evolution methods were moved from CopyOnWriteInputFormat to SchemaEvoContext so they can be reused in MergeOnReadInputFormat.
The enabled flag can be replaced by a non-empty Option<querySchema> instead.
Fixed by Option<SchemaEvoContext>.
}

private static final class ActualFields {
  private final String[] names;
Personally, I don't like the style of introducing too many intermediate POJOs.
POJO removed.
@@ -61,7 +68,10 @@ public static RowDataProjection instance(LogicalType[] types, int[] positions) {
  public RowData project(RowData rowData) {
    GenericRowData genericRowData = new GenericRowData(this.fieldGetters.length);
    for (int i = 0; i < this.fieldGetters.length; i++) {
Can we avoid affecting the normal code path for the non-evolution case? Something like:

public RowData project(RowData rowData, CastMap castMap)
Fair enough. Fixed. I extended RowDataProjection instead of adding project(RowData rowData, CastMap castMap), because it is convenient to keep the CastMap inside the projection.
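The design choice can be sketched in a dependency-free form: a base projection handles the normal path untouched, and a subclass carries the cast map and applies it after projecting. These classes (ProjectionSketch, CastingProjectionSketch) are illustrative stand-ins, not Flink's RowDataProjection:

```java
import java.util.Map;
import java.util.function.Function;

// Normal path: reorder fields by position, no casting involved.
class ProjectionSketch {
  protected final int[] positions;

  ProjectionSketch(int[] positions) {
    this.positions = positions;
  }

  Object[] project(Object[] row) {
    Object[] out = new Object[positions.length];
    for (int i = 0; i < positions.length; i++) {
      out[i] = row[positions[i]];
    }
    return out;
  }
}

// Evolution path: same projection, but the subclass owns the cast map and
// applies registered conversions after the base projection runs.
class CastingProjectionSketch extends ProjectionSketch {
  private final Map<Integer, Function<Object, Object>> castMap;

  CastingProjectionSketch(int[] positions, Map<Integer, Function<Object, Object>> castMap) {
    super(positions);
    this.castMap = castMap;
  }

  @Override
  Object[] project(Object[] row) {
    Object[] out = super.project(row);
    for (Map.Entry<Integer, Function<Object, Object>> e : castMap.entrySet()) {
      out[e.getKey()] = e.getValue().apply(out[e.getKey()]);
    }
    return out;
  }
}
```

Because callers only ever hold the base type, readers for unevolved tables never pay for (or see) the cast map, which is the point of subclassing over an extra method parameter.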
@hudi-bot run azure
I merged all supported modes into one patch and reworked the pull request.
What is the purpose of the pull request
This PR adds support for reading by Flink when comprehensive schema evolution (RFC-33) is enabled and operations such as add column, rename column, change column type, and drop column have been performed.
Supported modes: batch/cow/snapshot

Brief change log
CopyOnWriteInputFormat now calculates the schema of the file during opening; if this schema differs from the queried schema, it creates a cast map. After reading the file, type conversion is performed according to the constructed map.

Verify this pull request
This change added tests and can be verified as follows:
Added TestCastMap to verify that type conversion is correct.
Added ITTestReadWithSchemaEvo to verify that a table with added, renamed, cast, and dropped columns is read as expected. This test uses TestSpark3DDL to prepare data, so it works only with -P scala-2.12,spark3.2, since TestSpark3DDL works only with it.

Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.