
[SUPPORT] table comments not fully supported #7531

Open
parisni opened this issue Dec 21, 2022 · 6 comments · May be fixed by #8683
Labels
priority:minor (everything else; usability gaps; questions; feature reqs) · schema-and-data-types · spark (Issues related to spark)

Comments

@parisni
Contributor

parisni commented Dec 21, 2022

Hudi 0.12.1

When upserting a Spark DataFrame whose schema carries comment metadata, the comments are present in the committed Avro schema, and, if Hive sync is enabled, they are also propagated to the HMS. But the Spark datasource apparently omits them while reading, so the comments are hidden when the table is read back from Spark.
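To make the reported mapping concrete — a minimal stdlib-only sketch (not Hudi's actual code): Spark keeps a column comment in `StructField.metadata` under the `"comment"` key, and on write it ends up as the field-level `"doc"` attribute in the committed Avro schema. A committed field with a comment looks like this:

```python
import json

# Hypothetical Avro field JSON, shaped like the one Hudi commits below;
# the Spark column comment surfaces as the Avro "doc" attribute.
avro_field = json.loads(
    '{"name": "uuid", "type": ["null", "int"], "doc": "foo bar", "default": null}'
)
print(avro_field["doc"])  # the comment survives the write path
```

The bug report is that this `doc` value is not mapped back into the DataFrame schema on read.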

@yihua
Contributor

yihua commented Dec 21, 2022

@parisni Thanks for raising this issue. Could you provide more details and reproducible steps? By "spark DF with comments metadata", do you mean that the schema associated with the DataFrame carries the comments?

@yihua added the priority:minor, spark, and schema-and-data-types labels on Dec 21, 2022
@parisni
Contributor Author

parisni commented Dec 21, 2022 via email

@parisni
Contributor Author

parisni commented Dec 22, 2022

@yihua here is a reproducible example:

# add uuid column with comment foo bar
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
data = [
    (1, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b21", "A", "BC", "C"),
    (2, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b22", "A", "BC", "C"),
    (3, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b21", "A", "BC", "C"),
    (4, "f5c2ebfd-f57b-4ff3-ac5c-f30674037b22", "A", "BC", "C"),
]

schema = StructType(
    [
        StructField("uuid", IntegerType(), True, {"comment": "foo bar"}),
        StructField("user_id", StringType(), True),
        StructField("col1", StringType(), True),
        StructField("ts", StringType(), True),
        StructField("part", StringType(), True),
    ]
)
df = spark.createDataFrame(data=data, schema=schema)


tableName = "test_hudi_comment"
basePath = "/tmp/hudi/"

hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.upsert.shuffle.parallelism": 1,
    "hoodie.insert.shuffle.parallelism": 1,
    "hoodie.datasource.hive_sync.enable": "false",
}
(df.write.format("hudi").options(**hudi_options).mode("append").save(basePath))
spark.read.format("hudi").load(basePath).createOrReplaceTempView("foo")
spark.sql("desc extended foo").show()

# the "foo bar" comment is missing on the Hudi read side
+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
| _hoodie_commit_time|   string|   null|
|_hoodie_commit_seqno|   string|   null|
|  _hoodie_record_key|   string|   null|
|_hoodie_partition...|   string|   null|
|   _hoodie_file_name|   string|   null|
|                uuid|      int|   null|
|             user_id|   string|   null|
|                col1|   string|   null|
|                  ts|   string|   null|
|                part|   string|   null|
+--------------------+---------+-------+

# the committed Avro schema carries the "foo bar" doc
  "partitionToWriteStats" : {
    "C" : [ {
      "fileId" : "e90400c1-5311-4fdc-83f2-757326c7560d-0",
      "path" : "C/e90400c1-5311-4fdc-83f2-757326c7560d-0_0-17-36_20221222095833220.parquet",
      "prevCommit" : "null",
      "numWrites" : 4,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 4,
      "totalWriteBytes" : 435614,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "C",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 435614,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"test_hudi_comment_record\",\"namespace\":\"hoodie.test_hudi_comment\",\"fields\":[{\"name\":\"uuid\",\"type\":[\"null\",\"int\"],\"doc\":\"foo bar\",\"default\":null},{\"name\":\"user_id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"col1\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"ts\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"part\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "INSERT"
}
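The discrepancy above can be checked directly with the stdlib — a minimal sketch that parses the `"schema"` string from the commit's `extraMetadata` (copied from the output above) and confirms the comment is present in the committed schema even though Spark's DESCRIBE shows `null`:

```python
import json

# The committed Avro schema string, as it appears in extraMetadata above.
schema_str = (
    '{"type":"record","name":"test_hudi_comment_record",'
    '"namespace":"hoodie.test_hudi_comment","fields":['
    '{"name":"uuid","type":["null","int"],"doc":"foo bar","default":null},'
    '{"name":"user_id","type":["null","string"],"default":null},'
    '{"name":"col1","type":["null","string"],"default":null},'
    '{"name":"ts","type":["null","string"],"default":null},'
    '{"name":"part","type":["null","string"],"default":null}]}'
)
schema = json.loads(schema_str)
# Map each field name to its Avro "doc" (None when absent).
docs = {f["name"]: f.get("doc") for f in schema["fields"]}
print(docs)  # only "uuid" carries a doc, matching the comment set on write
```

So the write path persists the comment; the gap is that the Spark read path never copies `doc` back into the DataFrame schema.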


@xushiyan
Member

xushiyan commented Jan 7, 2023

@jonvex Can you look into this, please? It looks like some config fixes should resolve it.

@jonvex
Contributor

jonvex commented Jan 11, 2023

Verified this issue and created a Jira ticket

@codope
Member

codope commented Mar 29, 2023

Tracked in HUDI-5533

@parisni parisni linked a pull request May 10, 2023 that will close this issue
Projects
Status: 🏁 Triaged