Issue with Reading LZ4-Compressed Parquet File Using Spark 3.5 + Blaze #771
Closed
Describe the bug
Issue with Reading LZ4-Compressed Parquet File Using Spark 3.5 + Blaze
To Reproduce
Steps to reproduce the behavior:
- The LZ4-compressed Parquet file that reproduces the issue is attached, e.g.: part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet.txt (Note: please remove the ".txt" suffix to convert it back to a Parquet file before proceeding.)
- Upload the aforementioned LZ4-compressed Parquet file to HDFS.
- Launch spark-shell with Spark 3.5 + Blaze.
- Enable the Blaze switch and read the Parquet file mentioned above. The query fails with the following error:
scala> spark.conf.set("spark.blaze.enable", true)
scala> val df = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet")
scala> df.show()
...
25/01/17 17:01:31 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (tjtx16-35-27.58os.org executor 2): java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Project] error: Execution error: output_with_sender[ParquetScan] error: Execution error: output_with_sender[ParquetScan]: output() returns error: Arrow error: External error: Arrow: Parquet argument error: External: the offset to copy is not contained in the decompressed buffer
at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:95)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
at org.apache.spark.scheduler.Task.run(Task.scala:143)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:662)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:682)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
- Disable the Blaze switch and read the Parquet file mentioned above. The query succeeds and displays the results, as follows:
scala> spark.conf.set("spark.blaze.enable", false)
scala> val df = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet")
scala> df.show()
...
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
|cp_catalog_page_sk|cp_catalog_page_id|cp_start_date_sk|cp_end_date_sk|cp_department|cp_catalog_number|cp_catalog_page_number| cp_description| cp_type|
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
| 1| AAAAAAAABAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 1|In general basic ...|bi-annual|
| 2| AAAAAAAACAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 2|English areas wil...|bi-annual|
| 3| AAAAAAAADAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 3|Times could not a...|bi-annual|
| 4| AAAAAAAAEAAAAAAA| 2450815| NULL| NULL| 1| NULL| NULL|bi-annual|
| 5| AAAAAAAAFAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 5|Classic buildings...|bi-annual|
| 6| AAAAAAAAGAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 6|Exciting principl...|bi-annual|
| 7| AAAAAAAAHAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 7|National services...|bi-annual|
| 8| AAAAAAAAIAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 8|Areas see early f...|bi-annual|
| 9| AAAAAAAAJAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 9|Intensive, econom...|bi-annual|
| 10| AAAAAAAAKAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 10|Careful, intense ...|bi-annual|
| 11| AAAAAAAALAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 11|At least national...|bi-annual|
| 12| AAAAAAAAMAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 12|Girls indicate so...|bi-annual|
| 13| AAAAAAAANAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 13|Miles see mainly ...|bi-annual|
| 14| AAAAAAAAOAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 14|Rooms would say a...|bi-annual|
| 15| AAAAAAAAPAAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 15|Legal, required e...|bi-annual|
| 16| AAAAAAAAABAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 16|Schools must know...|bi-annual|
| 17| AAAAAAAABBAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 17|More than true ca...|bi-annual|
| 18| AAAAAAAACBAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 18|Shops end problem...|bi-annual|
| 19| AAAAAAAADBAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 19|Poor, hostile gui...|bi-annual|
| 20| AAAAAAAAEBAAAAAA| 2450815| 2450996| DEPARTMENT| 1| 20|Appropriate years...|bi-annual|
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
only showing top 20 rows
Expected behavior
- With the Blaze switch enabled, reading the Parquet file mentioned above should succeed and return the same results as with the Blaze switch disabled, instead of failing with the error above.
Screenshots
Enable the Blaze switch:

Disable the Blaze switch:
Additional context
Spark version: 3.5