[HUDI-2560][RFC-33] Support full Schema evolution for Spark #4910
Conversation
@bvaradar @codope @vinothchandar @xushiyan
force-pushed from cb8c6f4 to fa9cee1 (Compare)
@YannByron If you get a chance, please take a pass also. Thanks.
@hudi-bot run azure
@hudi-bot run azure
@hudi-bot run azure
@bvaradar Sorry to bother you. If you have free time, could you please help me review this PR? Thanks.
this.fields = new Field[fields.size()];
for (int i = 0; i < this.fields.length; i += 1) {
  this.fields[i] = fields.get(i);
}
Can we initialize `nameToFields` and `idToFields` here?
Thank you very much for your advice. But I think these two methods are more appropriate in InternalSchema; having them in the types feels strange.
public InternalSchema(long versionId, List<Field> cols) {
  this.versionId = versionId;
  this.record = RecordType.get(cols);
  if (versionId >= 0) {
I think we don't need to check `versionId >= 0` here. Even if this is a dummy schema, we can still initialize `idToField`, `nameToId`, `idToName`, and `maxColumnId` from cols.
Also, if cols is empty, all member variables are empty, and initial values like `maxColumnId` and `versionId` are -1. So I suggest completing the initialization of the members here.
fixed
  nameToId = idToName.entrySet().stream().collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey));
  return nameToId;
}
nameToId = InternalSchemaUtils.buildNameToId(record);
Both `nameToId` and `idToName` must be empty or non-empty at the same time; here, that may not hold. So initialize both inside the constructor above.
fixed
 * set the version ID for this schema.
 */
public InternalSchema setSchemaId(long versionId) {
  this.versionId = versionId;
When will `setSchemaId` and `setMax_column_id` be called on their own?
1) When we add a new column, the max_column_id is incremented by 1.
2) When we do DDL, before we save the InternalSchema, we call setSchemaId on it.
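For illustration, here is a minimal, self-contained sketch of those two call sites (class and method names are stand-ins inferred from this thread, not the PR's exact code):

```java
class SchemaIdFlowSketch {
  // Stand-in for the real InternalSchema API discussed above.
  static class InternalSchema {
    long versionId = -1;  // -1 marks a dummy schema, per the constructor discussion
    int maxColumnId = -1;
    InternalSchema setSchemaId(long versionId) { this.versionId = versionId; return this; }
    InternalSchema setMaxColumnId(int maxColumnId) { this.maxColumnId = maxColumnId; return this; }
  }

  // 1) ADD COLUMN: the new column consumes one more column id.
  static InternalSchema onAddColumn(InternalSchema schema) {
    return schema.setMaxColumnId(schema.maxColumnId + 1);
  }

  // 2) DDL commit: before persisting, tag the evolved schema with the commit instant.
  static InternalSchema beforeSave(InternalSchema schema, String instantTime) {
    return schema.setSchemaId(Long.parseLong(instantTime));
  }
}
```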
 * Internal schema for hudi table.
 * used to support schema evolution.
 */
public class InternalSchema implements Serializable {
I suggest strictly separating the get methods (like `getXXX` and `findXXX`) from the set methods. The `getXXX` methods should not change or initialize the private member variables.
That's OK with me. I just wanted to have lazy find/get: when the InternalSchema is very large, class initialization would be slow.
Not all fields in InternalSchema will be used, so we use lazy initialization.
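For illustration, a compact sketch of the lazy-initialization pattern being defended here (only `buildNameToId` appears in the snippets above; the guard and the surrounding class are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

class LazySchemaIndex {
  private final Object record = new Object(); // stands in for the real record type
  private Map<String, Integer> nameToId;      // stays null until first use

  // Building the map on first access keeps construction cheap for very large schemas.
  Map<String, Integer> getNameToId() {
    if (nameToId == null) {
      nameToId = buildNameToId(record); // stand-in for InternalSchemaUtils.buildNameToId
    }
    return nameToId;
  }

  private static Map<String, Integer> buildNameToId(Object record) {
    return new HashMap<>(); // placeholder: the real code walks the record's fields
  }
}
```

Note this is exactly the pattern the earlier comment warns about: the getter mutates a private member on first call.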
class ColumnPositionChange {
  public enum ColumnPositionType {
    FIRST,
    BEFORE,
Maybe `FIRST` and `AFTER` are enough; `BEFORE` is not necessary. We can keep this syntax consistent with Spark.
This PR is not only for Spark. Users can also make column changes through the API interface.
 */
private Option<InternalSchema> getTableInternalSchemaFromCommitMetadata(HoodieInstant instant) {
  try {
    HoodieTimeline timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
Just using `metaClient.getActiveTimeline()` may be enough.
fixed
@@ -177,6 +192,70 @@ class DefaultSource extends RelationProvider

  override def shortName(): String = "hudi_v1"

  private def getBaseFileOnlyView(useHoodieFileIndex: Boolean,
This code has been discarded; please adapt to the new `BaseFileOnlyViewRelation`.
Yes, this is an issue I want to discuss with you. BaseFileOnlyViewRelation cannot support vectorized reads / DPP. We ran over 1 TB of TPC-DS, and the performance decreased significantly. I will support BaseFileOnlyViewRelation in another PR.
@xiarixiaoyao : Are we regressing to poor performance with this change? @YannByron mentions that this code has been discarded.
@bvaradar Yes, I will try to paste a simple perf result tomorrow. Adapting to this change is very easy, but I don't want to lose too much performance.
@xiarixiaoyao Since this PR is so huge, please help me sort the implementation out.
Spark 3.3 supports column default values: apache/spark#35690 (comment)
Yann Byron ***@***.***> wrote on Tue, Mar 8, 2022, 11:29:

> @xiarixiaoyao Since this PR is so huge, please help me sort the implementation out.
> 1. Do the SerDeHelper.LATESTSCHEMA attribute of one commit file and the SAVE_SCHEMA_ACTION file save the same thing, or can they be converted into each other?
> 2. If hoodie.schema.evolution.enable is enabled, will every commit persist SerDeHelper.LATESTSCHEMA in the meta file?
> 3. When will the SAVE_SCHEMA_ACTION file be committed? Once the schema is changed?
> 4. How do we make a Hudi table with an old version like 0.10 compatible with this? If hoodie.schema.evolution.enable is enabled on an existing old-version Hudi table, what will happen? Or, if we are not going to make them compatible, how do we refuse it?
> 5. Can this PR work when hoodie.metadata.enable is enabled?
> 6. Why do we need to separate Spark 3.1 and Spark 3.2?
@bvaradar Thanks for the reminder; I've already addressed all comments.
Finished taking a full pass @xiarixiaoyao : One question in HoodieSparkSqlWriter. Otherwise, looks safe to land.
  schema = getLatestTableSchema(fs, basePath, sparkContext, schema)
  schema = lastestSchema
}
schema = {
@xiarixiaoyao : Do we need to do this for all cases? Is it safe to do this only in cases where internalSchema is not empty?
It worked for all cases in Hudi 0.9. To be safe, let me add some logic to fall back to the original behavior. Fixed; see the following code:
if (internalSchemaOpt.isDefined) {
  schema = {
    val newSparkSchema = AvroConversionUtils.convertAvroSchemaToStructType(AvroSchemaEvolutionUtils.canonicalizeColumnNullability(schema, lastestSchema))
    AvroConversionUtils.convertStructTypeToAvroSchema(newSparkSchema, structName, nameSpace)
  }
}
@bvaradar I have addressed all comments.
LGTM @xiarixiaoyao. This is awesome work. Thanks a lot for contributing this feature and waiting patiently for the review.
To help users understand how to use the schema-on-read feature, can you add a section to the Spark Guide in the next PR, which we can land independently?
We still have a few more PRs to go for this feature after 0.11 - Hive, Concurrency, Trino, Flink... Looking forward to reviewing them.
CI report:
Bot commands: @hudi-bot supports the following commands:
Still waiting for 1 job to finish for landing.
 * @param colName col name to be changed. if we want to change a col in a nested field, the full name should be specified
 * @param doc the new comment to set for the column
 */
public void updateColumnComment(String colName, String doc) {
Where are all these functions getting used? I do not see any caller for these @xiarixiaoyao
These are the exposed API interfaces. I recommend doing DDL operations through Spark SQL: #5238
}
// try to find all added columns
if (diffFromOldSchema.size() != 0) {
  throw new UnsupportedOperationException("Cannot evolve schema implicitly, find delete/rename operation");
Just trying to understand, is this being done for old Hudi tables? What is meant by evolve schema implicitly? I guess the error message is not very clear?
extraMeta.put(SerDeHelper.LATEST_SCHEMA, SerDeHelper.toJson(newSchema.setSchemaId(Long.getLong(instantTime))));
// try to save history schemas
FileBasedInternalSchemaStorageManager schemasManager = new FileBasedInternalSchemaStorageManager(metaClient);
schemasManager.persistHistorySchemaStr(instantTime, SerDeHelper.inheritSchemas(newSchema, historySchemaStr));
What is the purpose of storing the history schema here? I guess this is redundant, since we anyway store the evolved schema as the history schema in the saveInternalSchema() method, which gets called from commitStats(). WDYT @xiarixiaoyao?
Also, can you share your Slack ID with me? It will be easier to coordinate with you.
Of course; I already pinged you on Slack.
  metadata.addMetadata(SerDeHelper.LATEST_SCHEMA, newSchemaStr);
  schemasManager.persistHistorySchemaStr(instantTime, SerDeHelper.inheritSchemas(evolvedSchema, historySchemaStr));
}
// update SCHEMA_KEY
Regarding the concern that we start a separate timeline for schemas: is there a possibility to reuse the existing meta files for the internal schema? And do we plan to replace the Avro schema with the internal schema in the future? The Avro schema cannot handle data types like `small int` and `tiny int`.
Answers:
1) I think DDL should be an independent operation and should not intersect with the original commit.
2) Yes, we plan to do that, but before we start we need Flink to support full schema evolution; otherwise, the gap between the Flink module and the other modules will become larger and larger.
So what is the relationship between the DDL schema change and the schema change on write? For schema change on write, we already reuse the schema in the instant metadata file; we should elaborate more to have a uniform abstraction for these two cases.
Hi Danny, are you asking about line 292, i.e. why we use a new timeline to save the history schema?
With this patch, we have an Avro schema for the metadata file and a separate internal schema for DDL operations, and the Avro schema can also handle the schema change on write. These abstractions are not that clear, and we need to elaborate more on the behaviors.
I guess we probably need a blog or proper documentation describing the changes on a high level. WDYT @xiarixiaoyao ?
@@ -70,8 +73,19 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> execute() {
  HoodieCompactionPlan compactionPlan =
      CompactionUtils.getCompactionPlan(table.getMetaClient(), instantTime);

  // try to load internalSchema to support schema Evolution
  HoodieWriteConfig configCopy = config;
  Pair<Option<String>, Option<String>> schemaPair = InternalSchemaCache
What is the purpose of re-assigning the config to `configCopy` and then modifying it directly? I mean, you should either modify `config` directly or copy the whole config to `configCopy`!
The original config will not be modified. When schema evolution happens, we copy the whole config to configCopy and modify the copy; otherwise nothing happens. Maybe I'm missing something.
`configCopy` and `config` reference the same Java object ;)
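To make the aliasing point concrete, here is a runnable toy (using java.util.Properties as a stand-in for HoodieWriteConfig):

```java
import java.util.Properties;

public class AliasingDemo {
  public static void main(String[] args) {
    Properties config = new Properties();
    config.setProperty("schema", "old");

    Properties configCopy = config;                   // NOT a copy: both names point at one object
    configCopy.setProperty("schema", "new");          // mutates the shared object
    System.out.println(config.getProperty("schema")); // prints "new"

    Properties realCopy = new Properties();
    realCopy.putAll(config);                          // an actual copy; later mutations stay local
    realCopy.setProperty("schema", "evolved");
    System.out.println(config.getProperty("schema")); // still prints "new"
  }
}
```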
 * @return whether to allow the column type to be updated.
 */
public static boolean isTypeUpdateAllow(Type src, Type dsr) {
  if (src.isNestedType() || dsr.isNestedType()) {
isTypeUpdateAllow => isTypeUpdateAllowed
Let's open another PR to track and address these new comments.
@xiarixiaoyao As RFC-33 said,
// TODO support bootstrap
if (querySchemaOpt.isPresent() && !baseFile.getBootstrapBaseFile().isPresent()) {
  // check implicitly add columns, and position reorder(spark sql may change cols order)
  InternalSchema querySchema = AvroSchemaEvolutionUtils.evolveSchemaFromNewAvroSchema(readSchema, querySchemaOpt.get(), true);
This line merges the internalSchema with the incoming schema and gives us another internalSchema (querySchema).
- querySchema is the latest table schema, parsed from the commit file, e.g.: a int, b string, c int
- readSchema is the write schema (from the DataFrame) by default, e.g.: a int, b string, c int, d string, e string
- we do evolution to get the final schema, e.g.: a int, b string, c int, d string, e string
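A toy illustration of the implicit-add rule described above (plain Java lists, not Hudi's implementation): columns present in the incoming write schema but missing from the latest table schema are appended.

```java
import java.util.ArrayList;
import java.util.List;

public class ImplicitAddDemo {
  public static void main(String[] args) {
    List<String> tableSchema = List.of("a int", "b string", "c int");
    List<String> incoming = List.of("a int", "b string", "c int", "d string", "e string");

    // Append any incoming column the table schema does not know yet.
    List<String> evolved = new ArrayList<>(tableSchema);
    for (String col : incoming) {
      if (!evolved.contains(col)) {
        evolved.add(col);
      }
    }
    System.out.println(evolved); // [a int, b string, c int, d string, e string]
  }
}
```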
    && writeInternalSchema.findIdByName(f) != -1
    && writeInternalSchema.findType(writeInternalSchema.findIdByName(f)).equals(querySchema.findType(writeInternalSchema.findIdByName(f)))).collect(Collectors.toList());
readSchema = AvroInternalSchemaConverter.convert(new InternalSchemaMerger(writeInternalSchema, querySchema, true, false).mergeSchema(), readSchema.getName());
Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, readSchema.getName());
What is the need for the above two convert calls @xiarixiaoyao? I guess it would be better to add some documentation or comments around them.
Line 112: we need to convert the InternalSchema to an Avro schema and assign it to readSchema; HoodieAvroUtils.rewriteRecordWithNewSchema will use readSchema to get the correct GenericRecord from the Parquet file.
E.g. the old Parquet schema is: a int, b double, and genericRecord1 is data read from the old Parquet file;
but now the incoming schema is: a long, c int, b string, and genericRecord2 is incoming data.
We cannot merge genericRecord1 and genericRecord2, so we need to rewrite genericRecord1 with the new schema: a long, c int, b string.
Line 113 is only used to check the SchemaCompatibilityType.
So let me reframe my question. On line 99, we only take care of the addition of new columns in the incoming schema when combining the latest schema from the commit file (S1) and the incoming schema (S2). After combining them, we populate the combined schema into the variable querySchema (S3). As I understand, the writeInternalSchema (S4) variable contains the same schema as S1.
Now on line 112, we merge S3 and S4 to take care of column type changes and column renames. We finally convert this from InternalSchema to an Avro schema using the `convert` call.
Please correct me if I am wrong in the above explanation. Now I have the questions below:
1. If S4 is the same as S1, why do we even need the variable `writeInternalSchema`? We can simply use S1 throughout the if block.
2. Are we not supporting deletion of columns yet? Can you point me to the lines of code or the method where we take care of deletion?
Good question!
Question 1:
S4 is not the same as S1. S4 is the real schema from the Parquet file; if we do lots of DDL operations on the current table, S4 and S1 may differ greatly.
E.g. tableA: a int, b string, c double, and there exist three files in this table: f1, f2, f3.
Drop column c from tableA and add new column d, then update tableA, but only f2 and f3 are updated; f1 is not touched.
The schema of tableA is now: a int, b string, d long.
S1: a int, b string, d long
S4 from f1 is: a int, b string, c double
Question 2:
No, we do support deletion of columns. Let's use the above example to illustrate. At line 112, we merge S3 and S4 to get the final read schema:
tableA: a int, b string, d long
S3: a int, b string, d long
S4 from f1 is: a int, b string, c double
Merging S3 and S4 gives: a int, b string, d long (column c is dropped).
The values read from Parquet file f1 will be:
a b d
1 'test' null
d is null, since f1 does not contain column d. Column c is dropped, since the current table no longer contains column c.
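A runnable toy of that merge-and-project behavior (plain Java maps, not Hudi's reader): the row from file f1 is projected onto the current table schema, so the never-written column d comes back null and the dropped column c disappears.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeReadDemo {
  public static void main(String[] args) {
    // Table schema S3: a int, b string, d long. File f1's schema S4: a int, b string, c double.
    List<String> tableCols = List.of("a", "b", "d");
    Map<String, Object> rowFromF1 = Map.of("a", 1, "b", "test", "c", 3.14);

    Map<String, Object> projected = new LinkedHashMap<>();
    for (String col : tableCols) {
      projected.put(col, rowFromF1.getOrDefault(col, null)); // d -> null; c is silently dropped
    }
    System.out.println(projected); // {a=1, b=test, d=null}
  }
}
```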
readSchema = AvroInternalSchemaConverter.convert(new InternalSchemaMerger(writeInternalSchema, querySchema, true, false).mergeSchema(), readSchema.getName());
Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, readSchema.getName());
needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size()
    || SchemaCompatibility.checkReaderWriterCompatibility(writeSchemaFromFile, readSchema).getType() == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
Why do we want to rewrite the record if the writer and reader schemas are compatible? What do we miss if we do not do this?
I think this is a bug; I will fix it today, thanks.
Sorry, I missed the message. Wait for HUDI-5148.
primitiveSchema = LogicalTypes.decimal(decimal.precision(), decimal.scale())
    .addToSchema(Schema.createFixed(
        "decimal_" + decimal.precision() + "_" + decimal.scale(),
        null, null, computeMinBytesForPrecision(decimal.precision())));
We have been troubleshooting issue #7284 recently, which is an inconsistent Avro schema namespace exception `Caused by: org.apache.avro.AvroTypeException: Found decimal_25_4, expecting union` during reading of log files. After checking, we found the name `decimal_25_4` is generated here.
May I ask why the name is built with the pattern `"decimal_" + decimal.precision() + "_" + decimal.scale()` here? Could we keep the original name if we add more arguments to the method `visitInternalPrimitiveToBuildAvroPrimitiveType`?
E.g. `private static Schema visitInternalPrimitiveToBuildAvroPrimitiveType(Type.PrimitiveType primitive, String name, String space)`
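For context, a small runnable example of the Avro pattern in question (precision 25 and scale 4 are assumed here to match the exception message; 11 bytes is the minimum fixed size for 25 decimal digits):

```java
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DecimalFixedNameDemo {
  public static void main(String[] args) {
    // The fixed type's name is derived only from precision/scale, so a reader that
    // expects the writer's original record name/namespace may fail to resolve the
    // union branch ("Found decimal_25_4, expecting union").
    Schema fixed = Schema.createFixed(
        "decimal_25_4", // name built as "decimal_" + precision + "_" + scale
        null,           // no doc
        null,           // no namespace
        11);            // bytes needed to hold 25 decimal digits
    Schema decimal = LogicalTypes.decimal(25, 4).addToSchema(fixed);
    System.out.println(decimal.toString(true));
  }
}
```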
Nvm, this issue has been fixed in PR #6358
What is the purpose of the pull request
Support full schema evolution for Hudi:
alter statement:
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
It supports the following type changes:
int => long/float/double/string
long => float/double/string
float => double/string
double => string/decimal
decimal => decimal/string
string => date/decimal
date => string
ALTER TABLE table1 ALTER COLUMN a.b.c SET NOT NULL
ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x
add statement:
ALTER TABLE table1 ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
rename:
ALTER TABLE table1 RENAME COLUMN a.b.c TO x
drop:
ALTER TABLE table1 DROP COLUMN a.b.c
ALTER TABLE table1 DROP COLUMNS a.b.c, x, y
set/unset properties:
ALTER TABLE table SET TBLPROPERTIES ('table_property' = 'property_value');
ALTER TABLE table UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key');
Brief change log
Verify this pull request
(Please pick one of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.