
Issue with Reading LZ4-Compressed Parquet File Using Spark 3.5 + Blaze #771

@merrily01

Description

Describe the bug

Issue with Reading LZ4-Compressed Parquet File Using Spark 3.5 + Blaze

To Reproduce
Steps to reproduce the behavior:

  1. The LZ4-compressed Parquet file that reproduces the issue is attached, e.g.:
    part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet.txt

    Note: Please remove the ".txt" suffix to convert it back to a Parquet file before proceeding.

  2. Upload the aforementioned LZ4-compressed Parquet file to HDFS.

  3. Launch spark-shell with Spark 3.5 + Blaze.

  4. Enable the Blaze switch and read the Parquet file mentioned above. The query fails with the following error:

scala> spark.conf.set("spark.blaze.enable", true)
scala> val df = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet")
scala> df.show()
...
25/01/17 17:01:31 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (tjtx16-35-27.58os.org executor 2): java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Project] error: Execution error: output_with_sender[ParquetScan] error: Execution error: output_with_sender[ParquetScan]: output() returns error: Arrow error: External error: Arrow: Parquet argument error: External: the offset to copy is not contained in the decompressed buffer
	at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
	at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:95)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:143)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:662)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:682)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
  5. Disable the Blaze switch and read the Parquet file mentioned above. The query succeeds and displays the results:
scala> spark.conf.set("spark.blaze.enable", false)
scala> val df = spark.read.parquet("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet")
scala> df.show()
...
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
|cp_catalog_page_sk|cp_catalog_page_id|cp_start_date_sk|cp_end_date_sk|cp_department|cp_catalog_number|cp_catalog_page_number|      cp_description|  cp_type|
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
|                 1|  AAAAAAAABAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     1|In general basic ...|bi-annual|
|                 2|  AAAAAAAACAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     2|English areas wil...|bi-annual|
|                 3|  AAAAAAAADAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     3|Times could not a...|bi-annual|
|                 4|  AAAAAAAAEAAAAAAA|         2450815|          NULL|         NULL|                1|                  NULL|                NULL|bi-annual|
|                 5|  AAAAAAAAFAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     5|Classic buildings...|bi-annual|
|                 6|  AAAAAAAAGAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     6|Exciting principl...|bi-annual|
|                 7|  AAAAAAAAHAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     7|National services...|bi-annual|
|                 8|  AAAAAAAAIAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     8|Areas see early f...|bi-annual|
|                 9|  AAAAAAAAJAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                     9|Intensive, econom...|bi-annual|
|                10|  AAAAAAAAKAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    10|Careful, intense ...|bi-annual|
|                11|  AAAAAAAALAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    11|At least national...|bi-annual|
|                12|  AAAAAAAAMAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    12|Girls indicate so...|bi-annual|
|                13|  AAAAAAAANAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    13|Miles see mainly ...|bi-annual|
|                14|  AAAAAAAAOAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    14|Rooms would say a...|bi-annual|
|                15|  AAAAAAAAPAAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    15|Legal, required e...|bi-annual|
|                16|  AAAAAAAAABAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    16|Schools must know...|bi-annual|
|                17|  AAAAAAAABBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    17|More than true ca...|bi-annual|
|                18|  AAAAAAAACBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    18|Shops end problem...|bi-annual|
|                19|  AAAAAAAADBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    19|Poor, hostile gui...|bi-annual|
|                20|  AAAAAAAAEBAAAAAA|         2450815|       2450996|   DEPARTMENT|                1|                    20|Appropriate years...|bi-annual|
+------------------+------------------+----------------+--------------+-------------+-----------------+----------------------+--------------------+---------+
only showing top 20 rows
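For triage, it may help to confirm which compression codec the file's column chunks actually declare in the Parquet footer. A minimal spark-shell sketch (an assumption for illustration: it uses the parquet-hadoop classes bundled with Spark and the same hypothetical HDFS path as above):

```
scala> import org.apache.hadoop.fs.Path
scala> import org.apache.parquet.hadoop.ParquetFileReader
scala> import org.apache.parquet.hadoop.util.HadoopInputFile

scala> val file = HadoopInputFile.fromPath(
     |   new Path("hdfs://path/to/part-00000-7493e343-a159-4a2f-b69d-77cb68ac525f-c000.lz4.parquet"),
     |   spark.sparkContext.hadoopConfiguration)
scala> val reader = ParquetFileReader.open(file)
scala> // Print the codec declared for each column chunk in each row group
scala> reader.getFooter.getBlocks.forEach { block =>
     |   block.getColumns.forEach(col => println(s"${col.getPath}: ${col.getCodec}"))
     | }
scala> reader.close()
```

One possibly relevant distinction: the Parquet format defines both LZ4 (the older Hadoop-framed variant) and LZ4_RAW codecs, which are framed differently; a reader that expects one framing while the file uses the other can fail with out-of-range offsets during decompression. This is a guess for triage, not a confirmed root cause.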

Expected behavior

With the Blaze switch enabled, reading the Parquet file mentioned above should succeed and display the same results as with Blaze disabled (vanilla Spark), instead of failing with the decompression error shown above.

Screenshots

Enable the Blaze switch: (screenshot of the failing query)

Disable the Blaze switch: (screenshot of the successful query)

Additional context

Spark version: 3.5
