
Bugfix Classpath problems (#696) #697

Merged (2 commits, Dec 20, 2022)

Conversation

@zzeekk commented Dec 20, 2022

Publish the assembly jar (fat jar) with classifier 'assembly'. This way the main artifact is a thin jar, usable as a Maven library dependency, with its dependencies described in the Maven POM, making proper dependency resolution work again. The fat jar can still be downloaded from Maven Central, with classifier 'assembly' in the file name.

Move the log4j2 implementation needed for running Spark >= 3.3.0 into scope test. Remove the slf4j dependency, as it is not used throughout the project.
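For reference, a Maven consumer could then fetch the fat jar explicitly via the classifier. A minimal sketch, using the beta coordinates that appear later in this thread:

```xml
<!-- Pulls the assembly (fat) jar instead of the default thin jar -->
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.12</artifactId>
    <version>3.2.3_0.18.6-beta1</version>
    <classifier>assembly</classifier>
</dependency>
```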

@nightscape (Collaborator) commented:

There's one issue with that: spark-submit and spark-shell cannot handle classifiers, i.e. there's no way to use the assembly jar there.
I think it would have to be the other way round: the assembly version being the default and the thin version having a 'thin' classifier (I assume the thin version will mostly be used from SBT / Maven projects, where classifiers work).
Or the two could get different artifact IDs altogether. It would be good if the existing artifact name were kept for the assembly, so that users don't get something different than they expect.
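For illustration, depending on a classified artifact from sbt could look as follows; the 'thin' classifier name is hypothetical here, following the suggestion above:

```scala
// build.sbt: sketch of depending on a hypothetical thin-classified artifact
libraryDependencies += "com.crealytics" %% "spark-excel" % "3.2.3_0.18.6-beta1" classifier "thin"
```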

@zzeekk (Author) commented Dec 20, 2022

Hi @nightscape, publishing a fat jar as the main artifact, with a POM that includes library dependencies, is against Maven conventions. I strongly advise against this, as it will produce classpath duplicates and problems.
On the other hand, it is possible to use an assembly jar with a classifier in spark-submit via the --jars parameter, which can even take a URL of the corresponding artifact on Maven Central. Actually, assembly jars should always be used with Spark through the --jars option, as they carry no POM / library dependencies that would need to be resolved.
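To illustrate the --jars approach, a sketch of a spark-submit invocation pulling the classified assembly jar straight from Maven Central (the application class and jar names are placeholders):

```shell
# Pass the assembly jar via --jars; spark-submit accepts a direct URL,
# so no local download or Maven dependency resolution is needed.
spark-submit \
  --jars https://repo1.maven.org/maven2/com/crealytics/spark-excel_2.12/3.2.3_0.18.6-beta1/spark-excel_2.12-3.2.3_0.18.6-beta1-assembly.jar \
  --class com.example.MyApp \
  my-application.jar
```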

@nightscape nightscape changed the base branch from main to 696-thin-jar December 20, 2022 20:35
@nightscape (Collaborator) commented:

Hi @zzeekk, I've changed the PR to target an identically named branch in our repo. I will merge this PR into that branch; then I can have CI build a release from it, which you and other people can test.

@nightscape nightscape merged commit 9d26e6a into crealytics:696-thin-jar Dec 20, 2022
@nightscape (Collaborator) commented:

Here we go: https://github.com/crealytics/spark-excel/actions/runs/3743993292
Once the build is finished, please test this in as many environments as you have available and in the best case also motivate other people to try it 😉

@zzeekk (Author) commented Dec 23, 2022

Local unit tests with our Spark framework: OK

Databricks using maven dependency: OK

  • Databricks Runtime Version: 12.0
  • Cluster library config: (screenshot)
  • Test result: (screenshot)

Yarn: OPEN

@zzeekk (Author) commented Jan 17, 2023

Hi @nightscape, I was finally able to test this on YARN / Spark 3.2.0 as well, sorry for the delay:

Using the Maven dependency in a Spark application's pom.xml, creating a fat jar and reading an Excel file -> OK

<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.12</artifactId>
    <version>3.2.3_0.18.6-beta1</version>
</dependency>

Using the assembly jar with spark-shell -> OK

wget https://repo1.maven.org/maven2/com/crealytics/spark-excel_2.12/3.2.3_0.18.6-beta1/spark-excel_2.12-3.2.3_0.18.6-beta1-assembly.jar
bin/spark-shell --jars spark-excel_2.12-3.2.3_0.18.6-beta1-assembly.jar
...
// v1 source
val df = spark.read.format("com.crealytics.spark.excel").option("header","true").load("abc.xlsx")
df.printSchema
df.show
// v2 source
val df = spark.read.format("excel").option("header","true").load("abc.xlsx")
df.printSchema
df.show

As you can see, I'm able to use version 3.2.3_0.18.6-beta1 with Spark 3.2.0, as Spark's API is compatible and dependencies are stable within the same minor release, e.g. 3.2.x. You can also see this with other libraries, like Delta Lake: all Delta Lake 2.2.x versions are compatible with all Spark 3.3.x versions, see also https://docs.delta.io/latest/releases.html.
I would therefore suggest another change: name the version 3.2_0.18.6-beta1 instead of 3.2.3_0.18.6-beta1.
WDYT?

@tof38130 commented May 3, 2023

Hi,
Newer versions of the package are still fat jars on the Maven repo. It seems that this branch has never been merged into the main branch.
Are you planning to merge it into main?

Thanks.
