We recently also hit issue #9305, but with a different cause (we are still on Hudi 0.12).
The user manually set the config spark.sql.parquet.enableVectorizedReader to false, then read a Hive table and cached it. Because Spark analyzes the plan up front when it is cached, Spark does not add a ColumnarToRow (C2R) node to the cached plan, since the vectorized reader is disabled. At this point the plan has not been executed, because no action has been triggered.
Then the user reads a MOR read_optimized table and joins it with the cached plan. Since reading the MOR table automatically flips enableVectorizedReader back to true, the Hive table is actually read as columnar batches, but the cached plan contains no C2R node to convert the batches to rows, so the error occurs.
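For reference, a minimal repro sketch of the scenario described above (assuming Spark 3.x with Hudi 0.12; the table name, path, and join key are hypothetical):

```scala
// Minimal sketch of the failure scenario; table name, path, and key are illustrative.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// The plan is analyzed when cache() is called; with the vectorized reader off,
// no ColumnarToRow (C2R) node is inserted into the cached plan.
val hiveDf = spark.table("db.hive_table")
hiveDf.cache() // not materialized yet: no action has run

// Reading a MOR read_optimized table flips enableVectorizedReader back to true globally.
val morDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("/path/to/mor_table")

// The join finally materializes the cached plan: the Hive scan now produces
// ColumnarBatch, but the plan has no C2R to convert batches to rows, and it fails.
hiveDf.join(morDf, "key").show()
```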
I see this code has been modified on master, but I suspect the issue can still occur, since HoodieFileGroupReaderBasedParquetFileFormat also sets it:

```scala
spark.conf.set("spark.sql.parquet.enableVectorizedReader", supportBatchResult)
```

Besides this issue, is it appropriate to set Spark configs globally, regardless of whether the user has set them? I see Hudi sets many Spark-related configs in SparkConf, most of them related to the Parquet reader/writer. This can confuse users and makes it hard for developers to track down the cause.
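To illustrate how these side effects surface, here is a hypothetical diagnostic snippet (not part of Hudi) that snapshots the session conf around a Hudi read; the path is illustrative, and whether the value actually flips depends on the Hudi version:

```scala
// Hypothetical diagnostic: check whether a Hudi read silently mutates session confs.
val key = "spark.sql.parquet.enableVectorizedReader"
spark.conf.set(key, "false")
assert(spark.conf.get(key) == "false")

spark.read.format("hudi").load("/path/to/mor_table").count()

// May now print "true": flipped globally as a side effect of planning the Hudi read.
println(spark.conf.get(key))
```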