
[SPARK-7050][build] Fix Python Kafka test assembly jar not found issue under Maven build #5632

Closed
wants to merge 1 commit

Conversation

jerryshao
Contributor

This fixes the Spark Streaming unit tests under the Maven build. Previously, the name and path of the Maven-generated jar differed from sbt's, which led to the following exception. This fix keeps the behavior the same across both the Maven and sbt builds.

Failed to find Spark Streaming Kafka assembly jar in /home/xyz/spark/external/kafka-assembly
You need to build Spark with  'build/sbt assembly/assembly streaming-kafka-assembly/assembly' or 'build/mvn package' before running this program
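
For reference, here is a minimal sketch of the kind of pom.xml change under discussion: overriding the shade plugin's output in external/kafka-assembly so the assembly jar lands under target/scala-<scala binary version>/, as the examples module already does. The element values below are an assumption modeled on the examples module's configuration, not a quote from this PR's commit:

    <!-- Sketch for external/kafka-assembly/pom.xml: place the shaded (assembly)
         jar where the Python tests glob for it. The outputFile path below is an
         assumption modeled on the examples module, not quoted from commit 74b068d. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <outputFile>${project.build.directory}/scala-${scala.binary.version}/spark-streaming-kafka-assembly-${project.version}.jar</outputFile>
      </configuration>
    </plugin>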

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30746 has started for PR 5632 at commit 74b068d.

@srowen
Member

srowen commented Apr 22, 2015

Why not change the SBT build? The Maven build is the 'master' and its output is pretty standard, so we shouldn't change that; instead, change the code that expects this path.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30746 has finished for PR 5632 at commit 74b068d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30746/
Test FAILed.

@jerryshao
Contributor Author

I have no particular preference; changing either sbt or Maven is fine with me. Since dev/run-tests uses sbt as the default tool, I changed the Maven side here to match sbt.

@tdas what is your opinion?

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30757 has started for PR 5632 at commit 74b068d.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30757 has finished for PR 5632 at commit 74b068d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30757/
Test PASSed.

@jerryshao
Contributor Author

Hi @JoshRosen, this issue is related to #4961. Since you helped review that patch, would you please give some suggestions on this PR? Thanks a lot.

@srowen
Member

srowen commented Apr 23, 2015

I'm pretty sure we don't want to change the Maven build from its standard output format. Changing SBT to match would make more sense given SBT should follow the Maven build.

@jerryshao
Contributor Author

My concern is that the sbt assembly deliberately changes the streaming-kafka-assembly jar name, as follows:

    jarName in assembly <<= (version, moduleName, hadoopVersion) map { (v, mName, hv) =>
      if (mName.contains("streaming-kafka-assembly")) {
        // This must match the same name used in maven (see external/kafka-assembly/pom.xml)
        s"${mName}-${v}.jar"
      } else {
        s"${mName}-${v}-hadoop${hv}.jar"
      }
    },

But Maven still uses its default naming. So was it intentional to change the name in sbt while the related part of pom.xml for Maven was forgotten?

@srowen
Member

srowen commented Apr 24, 2015

The intent is to make SBT match Maven, no? That's what the comment suggests.

@jerryshao
Contributor Author

Actually, this comment is misleading. I guess the original intent was to keep the same name in both Maven and sbt, but the Maven output name was perhaps forgotten.

Looking at the assembly in the examples module, Maven also overrides its output file, just as sbt does:

<outputFile>${project.build.directory}/scala-${scala.binary.version}/spark-examples-${project.version}-hadoop${hadoop.version}.jar</outputFile>

So it is not simply a matter of changing sbt to match Maven. I assume that here in the kafka-assembly module we should keep the same approach as was used there.

@srowen
Member

srowen commented Apr 24, 2015

@davies what was your intent in 0561c45?

@jerryshao
Contributor Author

Ping @tdas, would you please take a look at this PR? Thanks a lot.

@srowen
Member

srowen commented Jun 16, 2015

I think this PR needs to modify the SBT build to match Maven's behavior at this point, rather than the other way around, or else this should be closed.

@jerryshao
Contributor Author

I still think we should stay consistent with other modules, such as examples, and change the output name. But since no one has explained a specific reason to prefer one approach over the other, I will close this PR.

@jerryshao jerryshao closed this Jun 17, 2015
@markhamstra
Contributor

@jerryshao The Maven build is the official/canonical build because, among other reasons, the releases of Spark are produced with the Maven release plugin. That's the typical case for Apache projects, so it's good to use the established, maven-based infrastructure. Since the Maven build is the established build for releases, it is the SBT build that needs to be modified to be consistent with the Maven build.

@jerryshao
Contributor Author

I'm OK with changing SBT rather than Maven, but my doubt is that, for other modules like the examples assembly, we also change the Maven output file:

<outputFile>${project.build.directory}/scala-${scala.binary.version}/spark-examples-${project.version}-hadoop${hadoop.version}.jar</outputFile>

I think by default Maven will not generate the assembly jar in this folder, but SBT will, so why was Maven changed rather than SBT there?

@srowen
Member

srowen commented Jun 17, 2015

I think the destination of the assembly JARs has been overridden in Maven because it needs to include the Scala version in order not to get those mixed up easily, and the same for the Hadoop flavor. Ah, are you basically saying this needs to happen for this other kafka-assembly too? That would make sense, but it's really making Maven consistent with Maven. I suppose any assembly output should follow the same pattern. SBT's output is kind of unimportant here, although matching the main Maven build as much as possible is good.

@jerryshao
Contributor Author

Hi @srowen, that's what I originally meant: move the kafka-assembly output into the scala-version folder.

@srowen
Member

srowen commented Jun 17, 2015

OK, then this should happen for flume-sink too? And update the titles here.

@jerryshao
Contributor Author

I think it should be flume-assembly, not flume-sink. If we want to support a Python API for Flume (#6830), we should also change the output dirs there to stay consistent with other modules.

@jerryshao jerryshao changed the title [SPARK-7050][build] Keep maven build consistent with sbt in kafka-assembly module [SPARK-7050][build] Keep kafka-assembly maven output path be consistent with other modules Jun 17, 2015
@jerryshao jerryshao reopened this Jun 17, 2015
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35080 has finished for PR 5632 at commit 74b068d.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35083 has finished for PR 5632 at commit 74b068d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@srowen
Member

srowen commented Jun 18, 2015

@jerryshao You said you think it should be "flume-assembly"; if you are suggesting a change to the module name, no I don't think that's possible now.

There are two assembly outputs that don't customize the output path to include the Scala version, kafka-assembly and flume-sink. I think it's fine to make these consistent with the output of other assemblies. That seems like the right scope for this change.

@jerryshao
Contributor Author

Hi @srowen, what I mean is the new WIP PR #6830; it adds a Python API for Flume in a way similar to Kafka's (by adding a new module, flume-assembly), so in that PR we should also change the output dir as we do here.

I'm not sure flume-sink also requires an assembly; from its POM and SparkBuild.scala, it seems an assembly is not required.

@srowen
Member

srowen commented Jun 18, 2015

I don't understand why that requires a new module?

@jerryshao
Contributor Author

You can see the implementation in that PR. I think the reason is to assemble all the Flume and Spark Streaming related jars for Python to use, like what kafka-assembly does.

@jerryshao
Contributor Author

Any further comments on this PR?

@srowen
Member

srowen commented Jun 23, 2015

Shouldn't this change also happen in flume-sink?

@jerryshao
Contributor Author

@srowen I'm not sure flume-sink requires such an assembly; from the pom file and SparkBuild.scala, it seems the flume-sink module doesn't require one, so why should this also happen in flume-sink?

@srowen
Member

srowen commented Jun 24, 2015

I brought it up because it also configures the shade plugin, which makes me think somebody wants the shaded (assembly) output. It's a pretty thin config, but I don't know why it exists either way. @harishreedharan?

@jerryshao
Contributor Author

Is there any specific reason why this cannot be merged? #6830, which takes the same approach, has already been merged...

@srowen
Member

srowen commented Jul 2, 2015

I have an outstanding comment here. I'd prefer consistency rather than fixing similar problems over and over. I don't know whether there is a reason flume-sink shouldn't be consistent, so I was asking @harishreedharan. I think this change is sensible, though. Does any other code need to change to use the new location? You're saying this is actually consumed by Python?

@jerryshao
Contributor Author

I'm not sure about flume-sink. I think this change mainly matters for Python, and since currently only Flume and Kafka support a Python API, changing kafka-assembly and flume-assembly is enough.

@srowen
Member

srowen commented Jul 2, 2015

That's fine, but does something else need to change in Python then, if it's expecting the artifact in one place and now it goes somewhere else?

@jerryshao
Contributor Author

No, I think Python already looks in the right place, so nothing needs to change now.

@srowen
Member

srowen commented Jul 2, 2015

OK, maybe I'm confused then: how can it be expecting the file in a location where it's not yet written? Since I don't know this bit, I don't want to overlook something and create more issues.

@jerryshao
Contributor Author

Here is the code in the Python test that is used to search for the Kafka assembly jar:

import glob
import os

def search_kafka_assembly_jar():
    SPARK_HOME = os.environ["SPARK_HOME"]
    kafka_assembly_dir = os.path.join(SPARK_HOME, "external/kafka-assembly")
    # Match the assembly jar under target/scala-*/, where the other assembly
    # modules (and the sbt build) place their output.
    jars = glob.glob(
        os.path.join(kafka_assembly_dir, "target/scala-*/spark-streaming-kafka-assembly-*.jar"))
    if not jars:
        raise Exception(
            ("Failed to find Spark Streaming kafka assembly jar in %s. " % kafka_assembly_dir) +
            "You need to build Spark with "
            "'build/sbt assembly/assembly streaming-kafka-assembly/assembly' or "
            "'build/mvn package' before running this test")
    elif len(jars) > 1:
        raise Exception(("Found multiple Spark Streaming Kafka assembly JARs in %s; please "
                         "remove all but one") % kafka_assembly_dir)
    else:
        return jars[0]

This assumes the Kafka assembly jar is at target/scala-*/spark-streaming-kafka-assembly-*.jar.
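
With that pattern, a Maven build that writes the jar to, for example, external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly-1.4.0-SNAPSHOT.jar (the version numbers here are only illustrative) would be found by this glob.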

I'm not sure whether this addresses your concern.

@srowen
Member

srowen commented Jul 2, 2015

How does it work now then, if this is not where the assembly is found? I just want to make sure we're not overlooking something basic. Or are you saying this doesn't work at all?

@jerryshao
Contributor Author

Hi @srowen, if the assembly jar is not found using the pattern target/scala-*/spark-streaming-kafka-assembly-*.jar, the Python Kafka tests will not run at all.

@srowen
Member

srowen commented Jul 6, 2015

So some tests have not been running at all? OK, that seems like a good thing to fix no matter what. Isn't that the headline topic here then, not just consistency?

@jerryshao
Contributor Author

Yes, the Python streaming Kafka tests will not run if this jar is missing.

@jerryshao jerryshao changed the title [SPARK-7050][build] Keep kafka-assembly maven output path be consistent with other modules [SPARK-7050][build] Fix Python Kafka test assembly jar not found issue under Maven build Jul 6, 2015
@SparkQA

SparkQA commented Jul 8, 2015

Test build #1007 has started for PR 5632 at commit 74b068d.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #1007 has finished for PR 5632 at commit 74b068d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
