
Dependency issues with Spark's built-in commons-compress #93

Closed
jwooden1 opened this issue Oct 26, 2018 · 29 comments

Comments

@jwooden1

I can use the library when I run spark on my local windows machine and read excel files on the same machine. However, when I upload the files to WASB on Azure and use HDInsight cluster for running spark jobs (either local or cluster mode), I get the following error:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:180)
  at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
  at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
  at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
  at org.apache.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
  at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
  at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66)
  at scala.Option.fold(Option.scala:158)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66)
  at scala.Option.getOrElse(Option.scala:121)
  at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:64)
  at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:71)
  at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:70)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:264)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:263)
  at scala.Option.getOrElse(Option.scala:121)
  at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:263)
  at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:91)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 53 elided

@nightscape
Collaborator

I had the same problem a few days ago, but haven't found a proper solution.
The problem is that Spark comes bundled with a rather outdated version of commons-compress and POI needs a newer version. In principle it should be possible to override the JARs bundled with Spark with user-provided ones, but I haven't yet managed to successfully do so.
In case you find a solution, please post it here 👍
In the meantime, you could try older versions of spark-excel; maybe the pre-0.10 versions work with the older version of commons-compress.

@jornfranke

jornfranke commented Nov 6, 2018

I had the same issue (but not with spark-excel, with another piece of software). You need to shade the dependency on commons-compress so that your Spark application uses the new version of commons-compress. You can do this in Java with the Maven Shade plugin, or in Scala with SBT's assembly plugin (https://github.com/sbt/sbt-assembly). Then you can define a rule in your build.sbt to shade commons-compress (https://github.com/sbt/sbt-assembly#shading).
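A minimal sketch of such a shading rule in build.sbt, assuming sbt-assembly is already added as a plugin (the `shadeio` prefix and the exact pattern are illustrative, not taken from spark-excel's actual build):

```scala
// build.sbt (fragment) — rename commons-compress packages inside the fat JAR,
// so POI resolves the bundled, newer commons-compress instead of the old copy
// that ships on Spark's classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule
    .rename("org.apache.commons.compress.**" -> "shadeio.commons.compress.@1")
    .inAll
)
```

With this in place, the bytecode of every class in the fat JAR is rewritten to reference `shadeio.commons.compress.*`, leaving Spark's own `org.apache.commons.compress` untouched.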

If you want to use R and Python then maybe @nightscape needs to shade it directly in the spark-excel module that is published on Maven.

The other way, "overriding the JARs bundled with Spark", is not possible in this case, because commons-compress is a core part of Spark's classpath. However, shading is not so bad in this case. I also recommend creating a JIRA issue with the Spark project to update commons-compress (the old version is vulnerable to several attacks).

@nightscape
Collaborator

nightscape commented Nov 9, 2018

I just released 0.10.1 and 0.11.0-beta2 which shade commons-compress and should hopefully fix this problem.
Can you give it a try and tell me if it worked?

@hbenzineb

Hi @nightscape,
I'm using 0.11.0-beta2 and I still have the same error.
When I use a dependency on commons-compress, I get this message:

diagnostics: User class threw exception: java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.

When I don't use the dependency, I get this:

diagnostics: User class threw exception: java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/InputStreamStatistics

As a reminder, I am trying to write the contents of several dataframes to several sheets of the same Excel file.

@jornfranke

@nightscape I think you don't include commons-compress explicitly in the resulting jar of the spark-excel module. In this case the shading rules will not apply. See fat jar: https://github.com/sbt/sbt-assembly.

@nightscape
Collaborator

Just trying another approach. Can someone check 0.11.0-beta3?

@hbenzineb

@nightscape : it's OK :)
Thanks

@nightscape
Collaborator

Ok, then I'll backport this to 0.10 and release 0.11 from the beta version.

@nightscape
Collaborator

Fixed in 0.10.2 and 0.11.0-beta3.

@jwooden1
Author

jwooden1 commented Nov 26, 2018

The fix is working for 0.10.2, but not in 0.11.0-beta3. I get this error in 0.11.0-beta3:

scala.MatchError: Map(treatemptyvaluesasnulls -> false, path -> /unique.xlsx, useheader -> true, endcolumn -> 8, inferschema -> true, startcolumn -> 0, sheetname -> input) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)
  at com.crealytics.spark.excel.DataLocator$.apply(DataLocator.scala:52)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 53 elided

Looking at the code, it looks to me like this is due to making dataAddress a mandatory field? What is it anyway? Also, I think it is creating a side effect, because if I pass null when reading, there is no error on read, but it does not read the specified sheet; it looks like it just reads the first sheet.
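For anyone else hitting the MatchError: in newer spark-excel versions, `dataAddress` appears to fold the old `sheetName`/`startColumn`/`endColumn` options into a single Excel-style range reference. A hedged sketch of the option map (exact option names and the range syntax should be verified against the spark-excel README for your version):

```scala
// Hypothetical option map for spark-excel >= 0.11. "dataAddress" combines the
// old sheetName/startColumn/endColumn options into one Excel-style reference
// of the form 'sheet name'!topLeftCell:bottomRightCell.
val excelOptions = Map(
  "useHeader"   -> "true",
  "inferSchema" -> "true",
  "dataAddress" -> "'input'!A1:I1000" // sheet "input", columns A through I
)

// Usage sketch (requires a SparkSession named `spark`):
// spark.read.format("com.crealytics.spark.excel").options(excelOptions).load("/unique.xlsx")
```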

@abhishek-bhatt3

> fix is working for 0.10.2, but not in 0.11.0-beta3. I get this error in 0.11.0-beta3.
> scala.MatchError: Map(treatemptyvaluesasnulls -> false, path -> /unique.xlsx, useheader -> true, endcolumn -> 8, inferschema -> true, startcolumn -> 0, sheetname -> input) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap) […]

I am facing the same error in 0.11.0. Any update on this?

@jagadeesh427

jagadeesh427 commented May 1, 2019

Exception in thread "main" scala.MatchError: Map(treatemptyvaluesasnulls -> true, location -> hdfs://nameservice1/flatfiles/raw/500a_map_e.xlsx, useheader -> true, inferschema -> true, addcolorcolumns -> false, sheetname -> _500a_map_e) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)

I am facing the above issue.

Dependency used: com.crealytics:spark-excel_2.10:0.8.3

Can anyone help?

@jagadeesh427

Solved the issue: used --packages com.crealytics:spark-excel_2.11:0.10.2 and it worked fine.
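For anyone landing here, the fix amounts to resolving the library through Spark's package mechanism rather than bundling JARs by hand; the coordinates below are the ones reported as working in this thread:

```shell
# Resolve spark-excel (including its shaded commons-compress) from Maven Central
spark-shell --packages com.crealytics:spark-excel_2.11:0.10.2

# The same flag works for batch jobs
spark-submit --packages com.crealytics:spark-excel_2.11:0.10.2 your-job.jar
```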

@nightscape nightscape changed the title error in reading files in azure hdinsight cluster Dependency issues with Spark's built-in commons-compress Jun 27, 2019
@nightscape nightscape reopened this Jun 27, 2019
@nightscape
Collaborator

I can reproduce this locally now. The problem seems to be that, despite shading org.apache.commons.compress, this line seems to call the constructor of the unshaded ZipArchiveInputStream.
Trying to find out what's happening...

@nightscape
Collaborator

Not understanding it...
The exception says the following:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:180)
  org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
  org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
  org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  java.lang.reflect.Method.invoke(Method.java:498)
  org.apache.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
  org.apache.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
  org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
  org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
  com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:42)

on the other hand, when I download and unzip the spark-excel JAR and run

javap -verbose com/crealytics/spark-excel_2.12/0.11.2/org/apache/poi/openxml4j/opc/internal/ZipHelper.class

it clearly shows that the above method is using the shaded classes:

  public static org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream openZipStream(java.io.InputStream) throws java.io.IOException;
    descriptor: (Ljava/io/InputStream;)Lorg/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream;
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=5, locals=2, args_size=1
         0: aload_0
         1: invokestatic  #108                // Method org/apache/poi/poifs/filesystem/FileMagic.prepareToCheckMagic:(Ljava/io/InputStream;)Ljava/io/InputStream;
         4: astore_1
         5: aload_1
         6: invokestatic  #139                // Method verifyZipHeader:(Ljava/io/InputStream;)V
         9: new           #141                // class org/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream
        12: dup
        13: new           #143                // class shadeio/commons/compress/archivers/zip/ZipArchiveInputStream
        16: dup
        17: aload_1
        18: invokespecial #145                // Method shadeio/commons/compress/archivers/zip/ZipArchiveInputStream."<init>":(Ljava/io/InputStream;)V
        21: invokespecial #146                // Method org/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream."<init>":(Ljava/io/InputStream;)V
        24: areturn

@jornfranke

Maybe some of your dependencies have POI as a dependency, and then that dependency does not use the shaded commons-compress.

@nightscape
Collaborator

@jornfranke That was exactly the problem. spark-excel itself still adds POI as a dependency (see hammerlab/sbt-parent#32).
I'm now bundling and shading all dependencies that require commons-compress.

I just released 0.12.0 with this fix (and Scala 2.12 compatibility), it should appear on Maven Central in the next few hours.
Please go ahead and try it.
I'll close this issue until there are reports of the problem occurring again.

@jlscott3

jlscott3 commented Jul 3, 2019

Confirmed 0.12.0 working in AWS Glue now - thanks for the quick response!

@ecv-stan

ecv-stan commented Jul 25, 2019

@jlscott3 hi, do you mind sharing how you got this to work in Glue?
Did you just add spark-excel_2.12-0.12.0.jar to the JAR lib path in the Glue job? Did you need to set anything else?
I tried spark-excel_2.12-0.12.0.jar, spark-excel_2.11-0.12.0.jar, and spark-excel_2.11-0.11.1.jar, but all throw errors...
Thanks in advance.


Update:

Finally I got it working in AWS glue.

Below are the jars I used:
ooxml-schemas-1.4.jar
poi-4.0.0.jar
spark-excel_2.11-0.12.0.jar
xmlbeans-3.1.0.jar

Hope it helps.

@nightscape
Collaborator

It turns out something went wrong while publishing spark-excel_2.12-0.12.0.jar, so that version actually still had this problem.
In case anyone wants to try with Scala 2.12 it should work with spark-excel 0.12.1.

@tochandrashekhar

> @jlscott3 hi, do u mind to share how do u get this to work in glue?
> […]
> Finally I got it working in AWS glue. […]

Do we need to import anything in the Spark code? Can you please provide some sample code?

@xvinosh

xvinosh commented Aug 14, 2020

Did anyone find a solution to this problem? I am facing the same problem with the latest version of spark-excel, 0.13.5:

scala> val file = new File("/Users/vinodsharma/Documents/Spark-Excel/People.xlsx")
file: java.io.File = /Users/vinodsharma/Documents/Spark-Excel/People.xlsx

scala> val fIP = new FileInputStream(file)
fIP: java.io.FileInputStream = java.io.FileInputStream@236ec69

scala> val wb = new XSSFWorkbook(fIP)
java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:65)
  at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:178)
  at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
  at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
  at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:47)
  at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:309)
  ... 51 elided

How do I go about changing the classpath for the commons-compress JAR? In my case, the version is org.apache.commons#commons-compress;1.20.

@nightscape
Collaborator

You might have to manually exclude commons-compress from the dependencies due to this problem which I don't yet know how to fix: hammerlab/sbt-parent#32
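A sketch of what that manual exclusion could look like in an sbt build, assuming spark-excel is pulled in via `libraryDependencies` (the version number is illustrative):

```scala
// build.sbt (fragment) — drop the transitive commons-compress that spark-excel's
// POI dependency pulls in, so only Spark's copy (or a shaded copy) is on the
// classpath and the two versions cannot collide.
libraryDependencies += ("com.crealytics" %% "spark-excel" % "0.13.5")
  .exclude("org.apache.commons", "commons-compress")
```

The Maven equivalent would be an `<exclusions>` block on the spark-excel `<dependency>`.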

@xvinosh

xvinosh commented Aug 17, 2020

@nightscape:
In my case, I tried all the versions from 0.12.1 to 0.13.5; none worked.
I downloaded the latest version of commons-compress manually. When launching spark-shell with the packages option it claimed to have downloaded it, but actually had not (I could not find it anywhere in the Maven repo dir where it said it was downloaded).
Version: 1.20
Then I explicitly put the JAR on the driver's classpath:
$ spark-shell --driver-class-path /home/xvinosh/.m2/repository/org/apache/commons/commons-compress/1.20/commons-compress-1.20.jar

This worked. Hope it helps others.
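When debugging which copy of a class actually wins on the classpath, a small classloader probe can help; this is a generic sketch (not part of spark-excel) that you could paste into a spark-shell:

```scala
// Report which JAR (if any) a class was loaded from. Missing classes
// return None; classes loaded from a JAR return Some(file URL).
object ClassSourceCheck {
  def sourceOf(className: String): Option[String] =
    try {
      val cls = Class.forName(className)
      Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation.toString)
    } catch {
      case _: ClassNotFoundException => None
    }

  def main(args: Array[String]): Unit =
    // In a spark-shell, this shows whether commons-compress resolves to
    // Spark's bundled JAR or the one you supplied.
    println(sourceOf("org.apache.commons.compress.archivers.zip.ZipArchiveInputStream"))
}
```

If the printed location points at Spark's jars directory rather than your own commons-compress JAR, the driver-class-path override has not taken effect.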

@sjahongir

@nightscape hi,
I tried the 0.9.0 version with Spark 2.3.1 (local and cluster mode). It worked, but when I use a large Excel file, Spark cannot process it.

Then I tried higher versions of your library, from 0.10 onwards:

  • Spark can process the large file when I run in local mode
  • the following error occurs when I run Spark in cluster (standalone) mode

Exception in thread "main" java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipFile$1 is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  at org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
  at org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
  at org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
  at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:258)
  at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725)
  at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238)
  at etl.io.XlsxReader.open(XlsxReader.scala:135)
  at etl.io.XlsxReader.<init>(XlsxReader.scala:153)
  at etl.connectors.excel.ExcelConnector.readXlsx(ExcelConnector.scala:194)
  at etl.connectors.excel.ExcelConnector.read(ExcelConnector.scala:119)
  at etl.io.DatasetReader$.read(DatasetReader.scala:47)
  at etl.DatasetResolver$.byModel(DatasetResolver.scala:58)
  at etl.App$.processTask(App.scala:105)
  at etl.App$.main(App.scala:65)
  at etl.App.main(App.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

@nightscape
Collaborator

@sjahongir can you try the recommendation from @xvinosh?

@SwapnaRavi21

SwapnaRavi21 commented Oct 25, 2021

@nightscape I still see issues with spark-excel on Scala 2.12.
Using 0.13.4 I get:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:65)

Using 0.12.0 or 0.12.1 I get useHeader errors as well as the above. Nothing is working out. I tried using commons-compress-1.20.jar along with the other jars in my spark-submit; no use.

Currently we are migrating to Scala 2.12. Could you please suggest a spark-excel version for it without these issues?

@nightscape
Collaborator

Hi @SwapnaRavi21, I would recommend always using the latest version available for your Spark & Scala version.
@quanghgx and I will try to figure out a way to build against multiple versions of Spark.
Unfortunately I'm under quite some deadline pressure at the moment and will probably only get to this the second week of November.
If you have experience with SBT, we'd be happy for any contributions!

@SwapnaRavi21

@nightscape yes, we are on the latest Scala, 2.12. But this fix is available only for 2.11 and not for 2.12, right? Sure, thanks. Meanwhile, is there an alternative to this dependency that we can use on 2.12 until the fix is provided for this version?
