
Reading excel files with very large text fields #590

Open
SushantSr opened this issue May 2, 2022 · 5 comments

@SushantSr

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I'm trying to read an Excel file in Databricks that has some very large text fields, and I'm getting a 'RecordFormatException: Tried to allocate an array of length 197,578,186, but the maximum length for this record type is 100,000,000' error when reading the file.

The full error is below:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11) (10.139.64.5 executor 1): shadeio.poi.util.RecordFormatException: Tried to allocate an array of length 197,578,186, but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
at shadeio.poi.util.IOUtils.throwRFE(IOUtils.java:599)
at shadeio.poi.util.IOUtils.checkLength(IOUtils.java:276)
at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:230)
at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:203)
at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:82)
at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:109)
at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:125)
at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.readFile(ExcelPartitionReaderFactory.scala:74)
at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.buildReader(ExcelPartitionReaderFactory.scala:61)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createReader$1(FilePartitionReaderFactory.scala:30)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:99)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:43)
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:94)
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:131)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:826)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1670)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:829)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:684)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2984)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2931)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2925)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2925)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1345)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1345)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1345)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3193)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3134)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3122)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1107)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2628)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:266)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:276)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:81)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:87)
at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:75)
at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:62)
at org.apache.spark.sql.execution.ResultCacheManager.collectResult$1(ResultCacheManager.scala:587)
at org.apache.spark.sql.execution.ResultCacheManager.computeResult(ResultCacheManager.scala:596)
at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:542)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:541)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:438)
at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:417)
at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:422)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3132)
at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3123)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3930)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:209)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:356)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:160)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:958)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:115)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:306)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3928)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3122)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:268)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:102)
at com.databricks.backend.daemon.driver.PythonDriverLocalBase.generateTableResult(PythonDriverLocalBase.scala:587)
at com.databricks.backend.daemon.driver.PythonDriverLocal.computeListResultsItem(PythonDriverLocal.scala:622)
at com.databricks.backend.daemon.driver.PythonDriverLocalBase.genListResults(PythonDriverLocalBase.scala:494)
at com.databricks.backend.daemon.driver.PythonDriverLocal.$anonfun$getResultBufferInternal$1(PythonDriverLocal.scala:677)
at com.databricks.backend.daemon.driver.PythonDriverLocal.withInterpLock(PythonDriverLocal.scala:558)
at com.databricks.backend.daemon.driver.PythonDriverLocal.getResultBufferInternal(PythonDriverLocal.scala:637)
at com.databricks.backend.daemon.driver.DriverLocal.getResultBuffer(DriverLocal.scala:715)
at com.databricks.backend.daemon.driver.PythonDriverLocal.outputSuccess(PythonDriverLocal.scala:600)
at com.databricks.backend.daemon.driver.PythonDriverLocal.$anonfun$repl$6(PythonDriverLocal.scala:222)
at com.databricks.backend.daemon.driver.PythonDriverLocal.withInterpLock(PythonDriverLocal.scala:558)
at com.databricks.backend.daemon.driver.PythonDriverLocal.repl(PythonDriverLocal.scala:209)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$11(DriverLocal.scala:602)
at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:28)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:26)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:205)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:204)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:60)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:240)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:225)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:60)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:579)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:615)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:607)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommandAndGetError(DriverWrapper.scala:526)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:561)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:431)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:374)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:225)
at java.lang.Thread.run(Thread.java:748)
Caused by: shadeio.poi.util.RecordFormatException: Tried to allocate an array of length 197,578,186, but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
at shadeio.poi.util.IOUtils.throwRFE(IOUtils.java:599)
at shadeio.poi.util.IOUtils.checkLength(IOUtils.java:276)
at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:230)
at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:203)
at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:82)
at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98)
at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132)
at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:312)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:97)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.v2.excel.ExcelHelper.getWorkbook(ExcelHelper.scala:109)
at com.crealytics.spark.v2.excel.ExcelHelper.getRows(ExcelHelper.scala:125)
at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.readFile(ExcelPartitionReaderFactory.scala:74)
at com.crealytics.spark.v2.excel.ExcelPartitionReaderFactory.buildReader(ExcelPartitionReaderFactory.scala:61)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.$anonfun$createReader$1(FilePartitionReaderFactory.scala:30)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.getNextReader(FilePartitionReader.scala:99)
at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:43)
at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:94)
at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:131)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:80)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$processFunc$1(Collector.scala:155)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:826)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1670)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:829)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:684)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

Expected Behavior

No response

Steps To Reproduce

[screenshot: code snippet used to read the Excel file]

Attached is a snippet of the code. I can't share the file since it's very large.

Environment

- Spark version:3.2.1
- Spark-Excel version:2.12:3.2.1_0.17.0
- Cluster environment: Databricks

Anything else?

No response

@pjfanning
Collaborator

Try setting IOUtils.setByteArrayMaxOverride() as the error message says.

There are other config settings documented in https://poi.apache.org/components/configuration.html

ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(int thresholdBytes) and ZipPackage.setUseTempFilePackageParts(boolean tempFilePackageParts) are both potentially useful in your case.

Note that spark-excel shades the org.apache.poi classes and changes the package to shadeio.poi instead.
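
For illustration, a minimal sketch of how those calls might look in a Databricks Scala cell (the threshold values and the idea of combining them are mine, not from this thread; the shaded class names follow the note above):

%scala
// Illustrative thresholds only; assumes the spark-excel jar (with POI shaded
// under shadeio.poi) is installed on the cluster.
import shadeio.poi.util.IOUtils
import shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource
import shadeio.poi.openxml4j.opc.ZipPackage

// Raise the per-record byte-array cap above the ~197 MB the error reports.
IOUtils.setByteArrayMaxOverride(250000000)

// Or spill large zip entries and package parts to temp files instead of memory.
ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(100000000)
ZipPackage.setUseTempFilePackageParts(true)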

@rbharathkumar

Thank you for providing the correct solution.
I had the same issue; I added the following line in a test environment and the issue was resolved.
(Please do so at your own risk. I have no idea what it does, how it resolves the issue, or even whether it is correct. It may cause OOM or other issues.)

shadeio.poi.util.IOUtils.setByteArrayMaxOverride(n)

I tried a few values for n; each time the same error occurred, so I repeated the process with a higher value until the issue was resolved.
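
One way to avoid that trial-and-error loop (a sketch; the headroom amount is arbitrary) is to size n from the allocation the exception reports, 197,578,186 bytes in this case:

%scala
// Derive the override from the allocation reported in the stack trace above,
// plus some arbitrary headroom, instead of raising it step by step.
val requiredBytes = 197578186
shadeio.poi.util.IOUtils.setByteArrayMaxOverride(requiredBytes + 16 * 1024 * 1024)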

It would be nice if the README had a small section on how to pass these options as part of the program. That would be very helpful.

@WillHolbrook

WillHolbrook commented Jul 4, 2022

@pjfanning How would I pass this option (and the other config options) through when using PySpark to read the Excel file? The original question is in Python on Databricks.

@samueldemir

For my Databricks friends:

%scala
shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(100000000)

  • Do not set the parameter to 100_00_000; set it to 100000000.
  • Increase the driver memory if you are loading large files.
(A fuller version of that cell is sketched below.)
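
%scala
// A fuller sketch of the cell above, run before the read. The path and read
// options are placeholders, not taken from this thread.
shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(100000000)
shadeio.poi.openxml4j.opc.ZipPackage.setUseTempFilePackageParts(true)

val df = spark.read
  .format("excel")                   // spark-excel V2 short name
  .option("header", "true")          // placeholder option
  .load("/mnt/data/large_file.xlsx") // placeholder path
df.printSchema()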

@pjfanning
Collaborator

@samueldemir that threshold param is probably better supported by #619

Setting it via the static method call means the value is only set on the JVM where you make that call. If you have a Spark cluster, you need to call that method on all the JVMs in the cluster, at least before you run any closures that parse xlsx files.
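
For illustration, a rough way to do that from a notebook (this is a generic Spark trick, not something from this thread, and it is not guaranteed to reach every executor JVM on large or autoscaling clusters):

%scala
import shadeio.poi.util.IOUtils

// Set the override on the driver JVM (e.g. for any driver-side parsing)...
IOUtils.setByteArrayMaxOverride(250000000)

// ...and run a throwaway job so the same static setter executes on the
// executors before any closure parses an xlsx file.
sc.parallelize(0 until sc.defaultParallelism * 4, sc.defaultParallelism * 4)
  .foreachPartition { _ => IOUtils.setByteArrayMaxOverride(250000000) }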
