-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-7050][build] Fix Python Kafka test assembly jar not found issue under Maven build #5632
Conversation
Test build #30746 has started for PR 5632 at commit |
Why not change the SBT build? the Maven build is the 'master' and its output is pretty standard, so we shouldn't change that, but change code that expects this path. |
Test build #30746 has finished for PR 5632 at commit
|
Test FAILed. |
I have no specific inclination, either changing sbt or maven is OK for me, since @tdas what is your opinion? |
Jenkins, retest this please. |
Test build #30757 has started for PR 5632 at commit |
Test build #30757 has finished for PR 5632 at commit
|
Test PASSed. |
Hi @JoshRosen , this issue is related to #4961 , since you helped to review that patch, would you please give some suggestions on this PR, thanks a lot. |
I'm pretty sure we don't want to change the Maven build from its standard output format. Changing SBT to match would make more sense given SBT should follow the Maven build. |
My concern is that sbt assembly deliberately changes the streaming-kafka-assembly jar name as:
But Maven still keeps the default way. So is there any intention or consideration to change the name in sbt, while forget to modify the related part in pom.xml for maven. |
The intent is to make SBT match Maven, no? that's what the comment suggests. |
Actually this comment is misleading, I guess original meaning is to keep the same name both in maven and sbt, but maybe forget to change the maven output name. While looking at the assembly in example module, maven also changed its output file, as well as sbt:
Not only change sbt to keep the same as maven. So I assume here in kafka-assembly module, we should still keep the same way as what previously did. |
Ping @tdas , would you please take a look at this PR, thanks a lot. |
I think this PR needs to modify the SBT build to match Maven's behavior at this point, rather than the other way around, or else this should be closed. |
Still I think we should keep consistent with other modules like example to change the output name. but since no one is explaining the specific reason for this over other, or other over this, so I will close this PR. |
@jerryshao The Maven build is the official/canonical build because, among other reasons, the releases of Spark are produced with the Maven release plugin. That's the typical case for Apache projects, so it's good to use the established, maven-based infrastructure. Since the Maven build is the established build for releases, it is the SBT build that needs to be modified to be consistent with the Maven build. |
I'm OK changing SBT rather than Maven, but my doubt is that, for other module like example, assembly, we also change the Maven output file:
I think by default Maven will not generate the assembly jar in this folder, but SBT will, so here why change Maven rather than SBT? |
I think the destination of the assembly JARs has been overridden in Maven because it needs to include the Scala version in order to not get those mixed up easily, and same for the Hadoop flavor. Ah, are you basically saying this needs to happen for this other kafka-assembly too? that would make sense but it's making Maven consistent with Maven really. I suppose any assembly output should follow the same pattern. SBT's output is kind of unimportant here, although, as much as possible, matching the main Maven build is good. |
Hi @srowen , that's what I originally mean, move kafka-assembly into scala-version folder. |
OK, then this should happen for flume-sink too? And update the titles here. |
I think it should be flume-assembly, not flume-sink, if we want to support Python API in flume ( #6830 ) , we should also keep consistent to change the output dirs like other modules. |
Merged build triggered. |
Merged build started. |
Test build #35080 has finished for PR 5632 at commit
|
Merged build finished. Test FAILed. |
Test build #35083 has finished for PR 5632 at commit
|
Merged build finished. Test PASSed. |
@jerryshao You said you think it should be "flume-assembly"; if you are suggesting a change to the module name, no I don't think that's possible now. There are two assembly outputs that don't customize the output path to include the Scala version, kafka-assembly and flume-sink. I think it's fine to make these consistent with the output of other assemblies. That seems like the right scope for this change. |
Hi @srowen , what I mean is the newly WIP PR #6830 , it adds a Python API for Flume in a similar to Kafka way (by adding a new model I'm not sure flume-sink also requires assembly, from POM and |
I don't understand why that requires a new module? |
You could see the implementation of that PR. I think the reason is to assemble all the flume and Spark Streaming related jars for Python to use, like what Kafka-assembly did. |
Any further comment on this PR? |
Shouldn't this change also happen in |
@srowen I'm not sure |
I brought it up because it also configures the shade plugin, which makes me think somebody wants the shaded (assembly) output. It's a pretty thin config but I don't know why it exists either way. @harishreedharan ? |
Any specific reason why this cannot be merged, #6830 which is done with same way already get merged... |
I have an outstanding comment here. I'd prefer consistency rather than fixing similar problems over and over. I don't know if there is a reason |
I'm not sure about |
That's fine but does something else need to change in Python then? if it's expecting the artifact in one place and now it goes somewhere else? |
No, I think Python already pick the right place, so no need to change now. |
OK maybe I'm confused then, how can it be expecting the file in a location where it's not yet written? Since I don't know this bit, I don't want to overlook something and create more issues. |
Here is the code in python test which used to search the Kafka assembly jar.
Here assumes Kafka assembly jar is in I'm not sure if this could explain your doubt. |
How does it work now then, if this is not where the assembly is found? I just want to make sure we're not overlooking something basic. Or are you saying this doesn't work at all? |
Hi @srowen , if the assembly jar is not found using this pattern |
So some tests have not been running at all? OK well seems to be a good thing to fix no matter what. Isn't that the headline topic here then -- not just about consistency |
Yes, Python streaming Kafka related tests will not be run if this jar is missing. |
Test build #1007 has started for PR 5632 at commit |
Test build #1007 has finished for PR 5632 at commit
|
To fix Spark Streaming unit test with maven build. Previously the name and path of maven generated jar is different from sbt, which will lead to following exception. This fix keep the same behavior with both Maven and sbt build.