
[SUPPORT] Unable to read Hudi table and got an IllegalArgumentException: For input string: "null" #8061

Closed
wolf8334 opened this issue Feb 27, 2023 · 8 comments
Labels: priority:major (degraded perf; unable to move forward; potential bugs), schema-and-data-types, spark-sql, version-compatibility


wolf8334 commented Feb 27, 2023

Describe the problem you faced

I use Java and Spark 3.3 to read a Hudi 0.13.0 table, following the guide on the official website.
The guide says this will work, but I get an IllegalArgumentException: For input string: "null".

To Reproduce

Steps to reproduce the behavior:

1. Generate one Hudi COW table from a MySQL table.
2. Query the COW table through Spark SQL.
3. The IllegalArgumentException: For input string: "null" is thrown.
4. I have already changed the data source and the table structure; the error has no relationship with either.
5. I use this command line and I am sure there is data in my Parquet file:
./hadoop jar ~/parquet-tools-1.9.0.jar cat hdfs://192.168.5.128:9000/user/spark/hudi/1/2.parquet

Expected behavior

The data is shown.

Environment Description

  • Hudi version :
    0.12.2,0.13.0

  • Spark version :
    3.3.2

  • Hive version :
    none

  • Hadoop version :
    3.3.4

  • Storage (HDFS/S3/GCS..) :
    HDFS

  • Running on Docker? (yes/no) :
    no, my local laptop

Additional context
JDK 1.8

Map<String, String> hudiConf = new HashMap<>();
hudiConf.put(HoodieWriteConfig.TBL_NAME.key(), "t_yklc_info");

// Load the Hudi table from HDFS and register it as a temp view.
Dataset<Row> demods = getActiveSession().read().options(hudiConf).format("org.apache.hudi").load("/user/spark/hudi/*/*");

demods.createOrReplaceTempView("lcinfo");
demods.printSchema();

// Log the Parquet-related session configs that the reader depends on.
logger.info(getActiveSession().conf().get(SQLConf.LEGACY_PARQUET_NANOS_AS_LONG().key()).toString());
logger.info(getActiveSession().conf().get(SQLConf.PARQUET_BINARY_AS_STRING().key()).toString());
logger.info(getActiveSession().conf().get(SQLConf.PARQUET_INT96_AS_TIMESTAMP().key()).toString());
logger.info(getActiveSession().conf().get(SQLConf.CASE_SENSITIVE().key()).toString());

// The exception is thrown when this query is executed.
Dataset<Row> ds = getActiveSession().sql("select APP_NO from lcinfo where APP_NO = '1' and STAT_CYCLE = '2'");
ds.printSchema();
ds.show();

Stacktrace
INFO 18:45:03.183 | org.apache.spark.sql.execution.datasources.FileScanRDD | Reading File path: hdfs://192.168.5.128:9000/user/spark/hudi/2/1.parquet, range: 0-3964741, partition values: [empty row]
ERROR 18:45:03.420 | org.apache.spark.executor.Executor | Exception in task 3.0 in stage 1.0 (TID 60)
java.lang.IllegalArgumentException: For input string: "null"
at scala.collection.immutable.StringLike.parseBoolean(StringLike.scala:330) ~[scala-library-2.12.15.jar:?]
at scala.collection.immutable.StringLike.toBoolean(StringLike.scala:289) ~[scala-library-2.12.15.jar:?]
at scala.collection.immutable.StringLike.toBoolean$(StringLike.scala:289) ~[scala-library-2.12.15.jar:?]
at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:33) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.<init>(ParquetSchemaConverter.scala:70) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.sql.execution.datasources.parquet.HoodieParquetFileFormatHelper$.buildImplicitSchemaChangeInfo(HoodieParquetFileFormatHelper.scala:30) ~[hudi-spark3.3-bundle_2.12-0.13.0.jar:3.3.2]
at org.apache.spark.sql.execution.datasources.parquet.Spark32PlusHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32PlusHoodieParquetFileFormat.scala:231) ~[hudi-spark3.3-bundle_2.12-0.13.0.jar:3.3.2]
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:561) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) ~[?:?]
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364) ~[spark-sql_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.scheduler.Task.run(Task.scala:136) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.2.jar:3.3.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_362]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_362]
at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362]

wolf8334 (Author) commented

I added these configs and it works, but I still wonder why.
sc.set("spark.sql.legacy.parquet.nanosAsLong", "false");
sc.set("spark.sql.parquet.binaryAsString", "false");
sc.set("spark.sql.parquet.int96AsTimestamp", "true");
sc.set("spark.sql.caseSensitive", "false");


caokz commented Mar 8, 2023

I also encountered this problem when using Hudi 0.13.0 on Spark 3.3.2, and found that the exception was thrown when querying a MOR table with the merge type "REALTIME_PAYLOAD_COMBINE". The reason is that Spark 3.3.2's ParquetToSparkSchemaConverter class is not compatible with the one in Spark 3.3.1: the constructor of ParquetToSparkSchemaConverter in Spark 3.3.2 requires the "LEGACY_PARQUET_NANOS_AS_LONG" configuration parameter, whereas the buildReaderWithPartitionValues method of Hudi's Spark32PlusHoodieParquetFileFormat class does not initialize the value of this parameter. So my conclusion is that Hudi 0.13.0 is currently not compatible with Spark 3.3.2.
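
To make the failure mode concrete, a small sketch (my own illustration of the mechanism described above, not Hudi or Spark code): if nothing has written the nanos-as-long key into the Hadoop configuration handed to the converter, Configuration.get returns null, and Scala's toBoolean on that missing value is what surfaces as IllegalArgumentException: For input string: "null".

import org.apache.hadoop.conf.Configuration;

public class MissingConfSketch {
    public static void main(String[] args) {
        Configuration hadoopConf = new Configuration();

        // Key name taken from SQLConf.LEGACY_PARQUET_NANOS_AS_LONG; on a fresh
        // configuration nothing has set it, so get(...) returns null.
        String raw = hadoopConf.get("spark.sql.legacy.parquet.nanosAsLong");
        System.out.println(raw); // prints "null"

        // Spark 3.3.2's ParquetToSparkSchemaConverter converts such a value with
        // Scala's toBoolean, which rejects a missing value with
        // java.lang.IllegalArgumentException: For input string: "null".

        // Seeding the key (what the workaround configs effectively do) avoids it:
        hadoopConf.setIfUnset("spark.sql.legacy.parquet.nanosAsLong", "false");
        boolean nanosAsLong = Boolean.parseBoolean(hadoopConf.get("spark.sql.legacy.parquet.nanosAsLong"));
        System.out.println(nanosAsLong); // prints "false"
    }
}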

danny0405 (Contributor) commented Mar 8, 2023

Thanks for the feedback. It looks like there is already a fix in progress: PR #8082.
Let's move the discussion there; it would be great if you can help review it.


cmanning-arcadia commented Apr 10, 2023

If anyone finds this and hits the issue from spark-shell, try adding the parameters as command-line args. Example:

spark-shell --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.2 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false' \
--conf 'spark.hadoop.spark.sql.parquet.binaryAsString=false' \
--conf 'spark.hadoop.spark.sql.parquet.int96AsTimestamp=true' \
--conf 'spark.hadoop.spark.sql.caseSensitive=false'

codope added the priority:major (degraded perf; unable to move forward; potential bugs) label on Apr 12, 2023
codope (Member) commented Apr 12, 2023

Closing as we have the PR and we will follow up there.

codope closed this as completed on Apr 12, 2023
bigdata-spec replied, quoting danny0405's comment above:

Hi, I will try Spark 3.3.2 and Hudi 0.13. Does this mean master can fix this problem?


bigdata-spec commented Apr 21, 2023


@cmanning-arcadia Hi, I have a doubt.
I see you set

--conf 'spark.hadoop.spark.sql.legacy.parquet.nanosAsLong=false' \
--conf 'spark.hadoop.spark.sql.parquet.binaryAsString=false' \
--conf 'spark.hadoop.spark.sql.parquet.int96AsTimestamp=true' \
--conf 'spark.hadoop.spark.sql.caseSensitive=false'

but in the Apache Spark 3.3.2 source, spark.sql.legacy.parquet.nanosAsLong is already false by default, and so on for the other settings.

yihua (Contributor) commented Jul 21, 2023

Hi @bigdata-spec, have you tried the Hudi 0.13.1 release, which is compatible with the Spark 3.3.2 release, without adding the additional Spark configs above?
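
If it helps, a launch sketch mirroring the earlier spark-shell command but on the 0.13.1 bundle and without the extra Parquet configs (assuming the same extension/catalog setup; untested here):

spark-shell --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.1 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'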
