[HUDI-5533] Support spark columns comments #8683
parisni wants to merge 12 commits into apache:master from
Conversation
...datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
Scala Option does not have ofNullable (Java Optional does). BTW
case INT => avroSchema.getLogicalType match {
  case _: Date => SchemaType(DateType, nullable = false)
  case _ => SchemaType(IntegerType, nullable = false)
(avroSchema.getType, Option(avroSchema.getDoc)) match {
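As a quick illustration of the point above (plain Java, not Hudi code): Java's Optional.ofNullable and Scala's Option.apply both map null to an empty value, which is why the diff uses Option(avroSchema.getDoc) rather than a nonexistent Option.ofNullable.

```java
import java.util.Optional;

public class NullableDemo {
    public static void main(String[] args) {
        String raw = null; // e.g. what getDoc() returns when a field has no doc
        // Java's Optional.ofNullable maps null to an empty Optional...
        Optional<String> doc = Optional.ofNullable(raw);
        System.out.println(doc.isPresent()); // false

        // ...and Scala's Option(x) (i.e. Option.apply) does the same
        // mapping to None, so Option(avroSchema.getDoc) is the idiom.
        System.out.println(Optional.ofNullable("a column comment").get()); // a column comment
    }
}
```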
The conversion tool is copied from Spark: https://github.com/apache/spark/blob/dd4db21cb69a9a9c3715360673a76e6f150303d4/connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L58. I just noticed that Spark also does not support keeping comments from Avro fields while doing the conversion.
Spark likely has this limitation too when retrieving the schema from Avro. But Spark doesn't usually infer a Spark schema from Avro; Hudi does, and that's the reason for this patch.
Can you write a test case for it, especially for creating a table?
I kind of feel there is no need to change each match arm for every data type. Can we write another method similar to toSqlTypeHelper which invokes toSqlTypeHelper first and then fixes the comment separately?
Kind of feel there is no need to change each match for every data types
Any column, including nested columns, may have a comment, so I don't see why we shouldn't look through all the Avro content for docs.
which invokes toSqlTypeHelper firstly then fix the comment separately.
This would mean walking through the Avro schema twice, and would also require a complex merge of the results. Am I missing something?
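To illustrate the trade-off, here is a minimal single-pass sketch with toy stand-in types (AvroField and SparkField are hypothetical, not Hudi's classes): attaching the doc while each node is converted carries nested comments along in one walk, whereas a separate fix-up pass would have to traverse the tree again and merge results.

```java
import java.util.ArrayList;
import java.util.List;

public class OnePassDemo {
    // Toy stand-ins for the Avro and Spark field types, only to show the shape
    // of the single-pass conversion discussed above.
    public record AvroField(String name, String doc, List<AvroField> children) {}
    public record SparkField(String name, String comment, List<SparkField> children) {}

    // Single pass: the doc is attached while each node is converted, so
    // nested comments come along for free and no second walk or merge is needed.
    public static SparkField convert(AvroField f) {
        List<SparkField> kids = new ArrayList<>();
        for (AvroField c : f.children()) {
            kids.add(convert(c)); // recursion carries docs at every depth
        }
        return new SparkField(f.name(), f.doc(), kids);
    }

    public static void main(String[] args) {
        AvroField nested = new AvroField("inner", "nested doc", List.of());
        AvroField root = new AvroField("outer", "top doc", List.of(nested));
        SparkField out = convert(root);
        System.out.println(out.children().get(0).comment()); // nested doc
    }
}
```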
…umns # Conflicts: # hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
I have investigated a bit, and here is my current understanding: reading a Hudi table with Spark has two paths:
So right now, using this PR and setting both To fix this we could:
I would go for Thoughts @danny0405 @bhasudha @yihua ?
@danny0405 The PR is ready. The way to activate comments for all engines is as below: To recap:
BTW, I will provide some details about comments in the doc IYW. The reason the Spark datasource should be disabled is that our current table property builder relies on Parquet types, which do not know about comments. It would be painful to modify, and the Spark table properties are useless and lead to errors.
@danny0405 I just learned here why the Spark datasource stuff shall be kept within HMS.
Then there is a need to update the datasource code to be aware of comments within hive sync. @prashantwason can you confirm this part is still applicable, and as a result
@danny0405 implemented the hive sync datasource, so to clarify what this PR brings:
#4960 provided:
#8740 provided:
Not covered yet:
Thanks @parisni, would find some time for the review ~
Map<String, String> serdeProperties = ConfigUtils.toMap(config.getString(ADB_SYNC_SERDE_PROPERTIES));
if (config.getBoolean(ADB_SYNC_SYNC_AS_SPARK_DATA_SOURCE_TABLE)) {
  List<FieldSchema> fromStorage = syncClient.getStorageFieldSchemas();
  Map<String, String> sparkTableProperties = SparkDataSourceTableUtils.getSparkTableProperties(config.getSplitStrings(META_SYNC_PARTITION_FIELDS),
fromStorage -> fieldSchema
stmt = connection.createStatement();
LOG.info("Executing SQL " + s);
stmt.execute(s);
stmt.execute(escapeAntiSlash(s));
Is the escape related to this change?
Yes. Both JDBC and HMS generate SQL, and the backslash should be double-escaped, otherwise it is lost.
The backslash is used to escape double quotes in the comment DDL.
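A sketch of the double-escaping idea; escapeAntiSlash here is a hypothetical reimplementation (the PR's actual body is not quoted in this thread), assuming it simply doubles each backslash so it survives the engine's own unescaping pass.

```java
public class EscapeDemo {
    // Hypothetical sketch of an escapeAntiSlash helper: double every
    // backslash so the generated DDL keeps it after one round of unescaping.
    public static String escapeAntiSlash(String s) {
        return s.replace("\\", "\\\\");
    }

    public static void main(String[] args) {
        // A DDL fragment whose comment contains escaped double quotes;
        // without doubling, the backslashes would be lost in transit.
        String ddl = "COMMENT 'say \\\"hi\\\"'";
        System.out.println(escapeAntiSlash(ddl)); // prints: COMMENT 'say \\"hi\\"'
    }
}
```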
/**
 * SQL statement should be escaped in order to consider anti-slash
 *
Every sentence should end with a period. Every new paragraph should start with <p>. Remove the param comments if there are no descriptions at all.
private static String getComment(String name, List<FieldSchema> fromStorage) {
  return fromStorage.stream()
      .filter(f -> name.equals(f.getName()))
      .filter(f -> f.getComment().isPresent())
So we only match fields by name at the first level.
Yes, that's the current limitation of comment support. It's a flattened list of fields.
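A runnable sketch of that flat lookup, using a simplified stand-in FieldSchema (a name plus an optional comment; not Hudi's actual class), mirroring the getComment helper quoted in the diff:

```java
import java.util.List;
import java.util.Optional;

public class CommentLookup {
    // Simplified stand-in for the real FieldSchema used by hive sync.
    public record FieldSchema(String name, Optional<String> comment) {}

    // Scan the flat field list, match by name, and return the first
    // non-empty comment, or "" if the field has none.
    public static String getComment(String name, List<FieldSchema> fromStorage) {
        return fromStorage.stream()
                .filter(f -> name.equals(f.name()))
                .filter(f -> f.comment().isPresent())
                .map(f -> f.comment().get())
                .findFirst()
                .orElse("");
    }

    public static void main(String[] args) {
        List<FieldSchema> fields = List.of(
                new FieldSchema("id", Optional.of("primary key")),
                new FieldSchema("payload", Optional.empty()));
        System.out.println(getComment("id", fields)); // primary key
        // A nested name like "payload.inner" finds nothing: the list is flat,
        // which is exactly the limitation discussed above.
        System.out.println(getComment("payload.inner", fields).isEmpty()); // true
    }
}
```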
private static String escapeQuote(String s) {
  return s.replaceAll("\"", Matcher.quoteReplacement("\\\""));
}
Not sure why we need an escape.
See the UT. I added a double quote in the column comment, and for some hive sync DDL executors this escaping is needed.
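A self-contained version of the quoted helper, showing why Matcher.quoteReplacement is part of it: String.replaceAll treats backslash and $ specially in its replacement argument, so an unquoted "\\\"" replacement would be reinterpreted and the backslash silently dropped.

```java
import java.util.regex.Matcher;

public class QuoteEscapeDemo {
    // Same shape as the escapeQuote in the diff: replace each double quote
    // with backslash-quote. quoteReplacement makes the replacement literal
    // so replaceAll does not reinterpret the backslash.
    public static String escapeQuote(String s) {
        return s.replaceAll("\"", Matcher.quoteReplacement("\\\""));
    }

    public static void main(String[] args) {
        // COMMENT 'say "hi"' must become COMMENT 'say \"hi\"' in the DDL.
        System.out.println(escapeQuote("say \"hi\"")); // prints: say \"hi\"
    }
}
```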
if (field.getOriginalType() == null) {
  return convertToSparkSchemaJson(field);
  return convertToSparkSchemaJson(field, Arrays.asList());
}
Does this mean composite data types are not supported?
Yes, currently in hive sync we don't have composite data types; List<FieldSchema> is flat.
messageType);
messageType,
// flink does not support comment yet
Arrays.asList());
Collections.emptyList() ?
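For context on the suggestion: Collections.emptyList() returns a shared immutable instance (on OpenJDK-derived runtimes it is the same object every call), while Arrays.asList() builds a fresh fixed-size list each time. A small check:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyListDemo {
    public static void main(String[] args) {
        // Collections.emptyList() hands back a shared immutable instance,
        // so nothing is allocated per call (true on OpenJDK-derived JVMs).
        List<String> a = Collections.emptyList();
        List<String> b = Collections.emptyList();
        System.out.println(a == b); // true

        // Arrays.asList() with no arguments still wraps a fresh array.
        List<String> c = Arrays.asList();
        System.out.println(c.isEmpty()); // true
    }
}
```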
Change Logs
Fixes #7531, i.e. show comments within Spark schemas.
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low medium or high below)
None
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
Contributor's checklist