
[SUPPORT] Exception on snapshot query on MOR table (hudi 0.6.0) #2285

Closed
zherenyu831 opened this issue Nov 26, 2020 · 18 comments
Labels
priority:minor everything else; usability gaps; questions; feature reqs

Comments

@zherenyu831
Contributor

zherenyu831 commented Nov 26, 2020


To Reproduce

Steps to reproduce the behavior:

  1. Have a table with 100GB of data that is under compaction
  2. Kill the Spark job
  3. Try to read the data with a snapshot query:
val df = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load("s3://path_to_data/*")


Environment Description

  • Hudi version : 0.6.0

  • Spark version : 2.4.4

  • Hive version : not using

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS..) : s3

  • Running on Docker? (yes/no) : no


Stacktrace

Exception: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:257)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4191
	at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary.decodeToDouble(PlainValuesDictionary.java:208)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToDouble(ParquetDictionary.java:46)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:460)
	at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
	at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.get(MutableColumnarRow.java:178)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.$anonfun$createRowWithRequiredSchema$1(HoodieMergeOnReadRDD.scala:239)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.$anonfun$createRowWithRequiredSchema$1$adapted(HoodieMergeOnReadRDD.scala:237)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.createRowWithRequiredSchema(HoodieMergeOnReadRDD.scala:237)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.hasNext(HoodieMergeOnReadRDD.scala:197)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:244)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:242)
	... 9 more
@zherenyu831
Contributor Author

(Screenshots attached: 2020-11-27 11:54:34, 11:54:45, and 11:58:13.)

@zherenyu831
Contributor Author

but the doubleDictionaryContent array only has 3000 elements, which is what caused the problem

@bvaradar
Contributor

bvaradar commented Dec 1, 2020

Wondering if the parquet version has anything to do with this. Can you check whether the hadoop installation has the 1.10.1 parquet bundles?
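
One quick way to check which parquet-column jar the driver actually loaded is to print the code source of a parquet class that appears in the stacktrace; a rough sketch, run from spark-shell:

// Prints the jar that PlainValuesDictionary (seen in the stacktrace above) was loaded from,
// which tells you the parquet-column version on the runtime classpath.
println(
  classOf[org.apache.parquet.column.values.dictionary.PlainValuesDictionary]
    .getProtectionDomain.getCodeSource.getLocation)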

@zherenyu831
Contributor Author

@bvaradar
here are all the parquet jars on my EMR cluster:

/usr/lib/spark/jars/parquet-column-1.10.1-spark-amzn-1.jar
/usr/lib/spark/jars/parquet-common-1.10.1-spark-amzn-1.jar
/usr/lib/spark/jars/parquet-encoding-1.10.1-spark-amzn-1.jar
/usr/lib/spark/jars/parquet-format-2.4.0.jar
/usr/lib/spark/jars/parquet-hadoop-1.10.1-spark-amzn-1.jar
/usr/lib/spark/jars/parquet-hadoop-bundle-1.6.0.jar
/usr/lib/spark/jars/parquet-jackson-1.10.1-spark-amzn-1.jar

we also checked whether there is any difference between these Amazon-built jars and the official ones;
they seem fine...

FYI: by using a read-optimized query, we can read all the values
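
(For reference, the read-optimized query looks roughly like the sketch below; read_optimized is the Hudi 0.6.0 datasource option value, and the path is the same placeholder as in the reproduction steps.)

// Read-optimized query sketch: reads only the compacted base parquet files
// and skips merging the MOR log files, so it avoids the HoodieMergeOnReadRDD path above.
val roDf = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("s3://path_to_data/*")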

@bvaradar
Contributor

bvaradar commented Dec 7, 2020

cc @umehrot2 : Wondering why there is parquet-hadoop-bundle-1.6.0.jar along with parquet-hadoop-1.10.1-spark-amzn-1.jar. Wouldn't they conflict?

@zherenyu831
Contributor Author

@bvaradar
I deleted parquet-hadoop-bundle-1.6.0.jar and tried again, but the error still happens.
Then I replaced all the parquet libs with the official ones, but that didn't work either.

@bvaradar
Contributor

bvaradar commented Dec 9, 2020

@n3nash : Can you look at this ?

@zherenyu831 : As the integration tests for compaction are passing, I suspect this still has to do with a parquet version mismatch. Would it be possible to replicate this using the docker setup: https://hudi.apache.org/docs/docker_demo.html ?

@adaniline-paytm

I have the same sporadic issue, using the standard Spark 2.4.7 distribution and Hudi 0.6:

$ ls -l /opt/spark-2.4.7-bin-without-hadoop/jars/parquet-*
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-column-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-common-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-encoding-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-format-2.4.0.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-hadoop-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-jackson-1.10.1.jar

the only workaround we found is to disable the VectorizedReader:

      rc.spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
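
For completeness, a minimal sketch of the same workaround on a plain SparkSession, followed by the snapshot read from the reproduction steps (the S3 path is the same placeholder as above):

// Disable Spark's vectorized parquet reader for this session, then re-run the snapshot query.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load("s3://path_to_data/*")

The same setting can also be passed at submit time with --conf spark.sql.parquet.enableVectorizedReader=false.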

@nsivabalan
Contributor

nsivabalan commented Jan 24, 2021

@zherenyu831 : can you please respond with any updates on your end?
@n3nash : can you please take a look when you have time? If you were able to narrow down the issue, please file a jira and add the "user-support-issues" label.

@vinothchandar
Member

cc @garyli1019 as well

@zherenyu831
Contributor Author

@bvaradar
Hi @bvaradar, it will be a little difficult to replicate the problem, since it only happens on huge amounts of data.

@nsivabalan
Contributor

@n3nash : would you be able to help here?

@nsivabalan
Contributor

@zherenyu831 : a few quick questions as we triage the issue.

  • Were you running an older version of Hudi and encountered this while trying to upgrade to the latest version?
  • Is this affecting your production? Trying to gauge the severity.
  • Or are you trying out a POC, and this is your first time trying out Hudi?

@nsivabalan added the priority:major (degraded perf; unable to move forward; potential bugs) and priority:critical (production down; pipelines stalled; need help asap) labels and removed the priority:major label on Feb 6, 2021
@zherenyu831
Contributor Author

zherenyu831 commented Feb 9, 2021

@nsivabalan

Were you running an older version of Hudi and encountered this while trying to upgrade to the latest version?

We faced the problem using Hudi 0.6.0; we haven't tried with Hudi 0.7.0.

Is this affecting your production? Trying to gauge the severity.
Or are you trying out a POC, and this is your first time trying out Hudi?

Not really, because it only happens when reading a huge table while it is being compacted.
Compaction doesn't happen all the time, so a retry usually works for us.
We have been using Hudi for about one year; before that we used the read-optimized query (since it was the only query type supported by the Spark datasource for MOR tables before 0.6.0).

@nsivabalan
Contributor

@vinothchandar @n3nash @bvaradar : One of the customers mentioned that disabling the vectorized reader fixed the issue for them. Hopefully that is an acceptable workaround? And do we need to make a note of this in the FAQ or somewhere?

@vinothchandar assigned vinothchandar and unassigned n3nash on Feb 9, 2021
@vinothchandar added the priority:major (degraded perf; unable to move forward; potential bugs) label and removed the priority:critical (production down; pipelines stalled; need help asap) label on Mar 1, 2021
@vinothchandar
Member

I see a lot of general Spark issues reported like this. Making this sev:high for now, as we figure out more.

@vinothchandar changed the title from "[SUPPORT] Exception on snapshot query while compaction (hudi 0.6.0)" to "[SUPPORT] Exception on snapshot query on MOR table (hudi 0.6.0)" on Jun 5, 2021
@nsivabalan added the priority:minor (everything else; usability gaps; questions; feature reqs) label and removed the priority:major (degraded perf; unable to move forward; potential bugs) label on Aug 31, 2021
@nsivabalan
Contributor

@zherenyu831 : We made some fixes to the spillable map that's used in the compaction path, which should help with large datasets. Can you give it a try? Else, feel free to close this out if it's not an issue anymore. Thanks!

@vinothchandar
Member

Closing since the fix has landed.
