
[SPARK-7050][build] Fix Python Kafka test assembly jar not found issue under Maven build #5632

Closed
wants to merge 1 commit

Conversation

jerryshao
Contributor

This fixes the Spark Streaming unit tests under the Maven build. Previously, the name and path of the Maven-generated jar differed from sbt's, which led to the following exception. This fix keeps the behavior the same across both the Maven and sbt builds.

Failed to find Spark Streaming Kafka assembly jar in /home/xyz/spark/external/kafka-assembly
You need to build Spark with  'build/sbt assembly/assembly streaming-kafka-assembly/assembly' or 'build/mvn package' before running this program
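
For reference, here is a minimal sketch of the kind of pom.xml change under discussion: overriding the shade plugin's output in external/kafka-assembly so the assembly jar lands under target/scala-<scala binary version>/, as the examples module already does. The element values below are an assumption modeled on the examples module's configuration, not a quote from this PR's commit:

    <!-- Sketch for external/kafka-assembly/pom.xml: place the shaded (assembly)
         jar where the Python tests glob for it. The outputFile path below is an
         assumption modeled on the examples module, not quoted from commit 74b068d. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <outputFile>${project.build.directory}/scala-${scala.binary.version}/spark-streaming-kafka-assembly-${project.version}.jar</outputFile>
      </configuration>
    </plugin>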

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30746 has started for PR 5632 at commit 74b068d.

@srowen
Member

srowen commented Apr 22, 2015

Why not change the SBT build? The Maven build is the 'master' and its output is pretty standard, so we shouldn't change that; instead, change the code that expects this path.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30746 has finished for PR 5632 at commit 74b068d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30746/
Test FAILed.

@jerryshao
Contributor Author

I have no particular preference; changing either sbt or Maven is fine with me. Since dev/run-tests uses sbt as the default tool, I changed the Maven side here to match sbt.

@tdas what is your opinion?

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30757 has started for PR 5632 at commit 74b068d.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30757 has finished for PR 5632 at commit 74b068d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30757/
Test PASSed.

@jerryshao
Contributor Author

Hi @JoshRosen, this issue is related to #4961. Since you helped review that patch, would you please give some suggestions on this PR? Thanks a lot.

@srowen
Member

srowen commented Apr 23, 2015

I'm pretty sure we don't want to change the Maven build from its standard output format. Changing SBT to match would make more sense given SBT should follow the Maven build.

@jerryshao
Contributor Author

My concern is that the sbt assembly deliberately changes the streaming-kafka-assembly jar name, as follows:

    jarName in assembly <<= (version, moduleName, hadoopVersion) map { (v, mName, hv) =>
      if (mName.contains("streaming-kafka-assembly")) {
        // This must match the same name used in maven (see external/kafka-assembly/pom.xml)
        s"${mName}-${v}.jar"
      } else {
        s"${mName}-${v}-hadoop${hv}.jar"
      }
    },

But Maven still uses its default naming. So was it intentional to change the name in sbt while the related part of pom.xml for Maven was forgotten?

@srowen
Member

srowen commented Apr 24, 2015

The intent is to make SBT match Maven, no? That's what the comment suggests.

@jerryshao
Contributor Author

Actually, this comment is misleading. I guess the original intent was to keep the same name in both Maven and sbt, but the Maven output name was perhaps forgotten.

Looking at the assembly in the examples module, Maven also overrides its output file, just as sbt does:

<outputFile>${project.build.directory}/scala-${scala.binary.version}/spark-examples-${project.version}-hadoop${hadoop.version}.jar</outputFile>

So it is not simply a matter of changing sbt to match Maven. I assume that here in the kafka-assembly module we should keep the same approach as was used there.

@srowen
Member

srowen commented Apr 24, 2015

@davies what was your intent in 0561c45?

@jerryshao
Contributor Author

Ping @tdas, would you please take a look at this PR? Thanks a lot.

@srowen
Member

srowen commented Jun 16, 2015

I think this PR needs to modify the SBT build to match Maven's behavior at this point, rather than the other way around, or else this should be closed.

@jerryshao
Contributor Author

I still think we should stay consistent with other modules, such as examples, and change the output name. But since no one has explained a specific reason to prefer one approach over the other, I will close this PR.

@jerryshao jerryshao closed this Jun 17, 2015
@markhamstra
Contributor

@jerryshao The Maven build is the official/canonical build because, among other reasons, the releases of Spark are produced with the Maven release plugin. That's the typical case for Apache projects, so it's good to use the established, maven-based infrastructure. Since the Maven build is the established build for releases, it is the SBT build that needs to be modified to be consistent with the Maven build.

@jerryshao
Contributor Author

I'm OK with changing SBT rather than Maven, but my doubt is that, for other modules like the examples assembly, we also change the Maven output file:

<outputFile>${project.build.directory}/scala-${scala.binary.version}/spark-examples-${project.version}-hadoop${hadoop.version}.jar</outputFile>

I think by default Maven will not generate the assembly jar in this folder, but SBT will, so why was Maven changed rather than SBT there?

@srowen
Member

srowen commented Jun 17, 2015

I think the destination of the assembly JARs has been overridden in Maven because it needs to include the Scala version in order not to get those mixed up easily, and the same for the Hadoop flavor. Ah, are you basically saying this needs to happen for this other kafka-assembly too? That would make sense, but it's really making Maven consistent with Maven. I suppose any assembly output should follow the same pattern. SBT's output is kind of unimportant here, although matching the main Maven build as much as possible is good.

@jerryshao
Contributor Author

Hi @srowen, that's what I originally meant: move the kafka-assembly output into the scala-version folder.

@srowen
Member

srowen commented Jun 17, 2015

OK, then this should happen for flume-sink too? And update the titles here.

@jerryshao
Contributor Author

I think it should be flume-assembly, not flume-sink. If we want to support a Python API for Flume (#6830), we should also change the output dirs there to stay consistent with other modules.

@jerryshao jerryshao changed the title [SPARK-7050][build] Keep maven build consistent with sbt in kafka-assembly module [SPARK-7050][build] Keep kafka-assembly maven output path be consistent with other modules Jun 17, 2015
@jerryshao jerryshao reopened this Jun 17, 2015
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35080 has finished for PR 5632 at commit 74b068d.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35083 has finished for PR 5632 at commit 74b068d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@srowen
Member

srowen commented Jun 18, 2015

@jerryshao You said you think it should be "flume-assembly"; if you are suggesting a change to the module name, no I don't think that's possible now.

There are two assembly outputs that don't customize the output path to include the Scala version, kafka-assembly and flume-sink. I think it's fine to make these consistent with the output of other assemblies. That seems like the right scope for this change.

@jerryshao
Contributor Author

Hi @srowen, what I mean is the new WIP PR #6830; it adds a Python API for Flume in a way similar to Kafka's (by adding a new module, flume-assembly), so in that PR we should also change the output dir as we do here.

I'm not sure flume-sink also requires an assembly; from its POM and SparkBuild.scala, it seems an assembly is not required.

@srowen
Member

srowen commented Jun 18, 2015

I don't understand why that requires a new module?

@jerryshao
Contributor Author

You can see the implementation in that PR. I think the reason is to assemble all the Flume and Spark Streaming related jars for Python to use, like what kafka-assembly does.

@jerryshao
Contributor Author

Any further comments on this PR?

@srowen
Member

srowen commented Jun 23, 2015

Shouldn't this change also happen in flume-sink?

@jerryshao
Contributor Author

@srowen I'm not sure flume-sink requires such an assembly; from the pom file and SparkBuild.scala, it seems the flume-sink module doesn't require one, so why should this also happen in flume-sink?

@srowen
Member

srowen commented Jun 24, 2015

I brought it up because it also configures the shade plugin, which makes me think somebody wants the shaded (assembly) output. It's a pretty thin config, but I don't know why it exists either way. @harishreedharan?

@jerryshao
Contributor Author

Is there any specific reason why this cannot be merged? #6830, which takes the same approach, has already been merged...

@srowen
Member

srowen commented Jul 2, 2015

I have an outstanding comment here. I'd prefer consistency rather than fixing similar problems over and over. I don't know whether there is a reason flume-sink shouldn't be consistent, so I was asking @harishreedharan. I think this change is sensible, though. Does any other code need to change to use the new location? You're saying this is actually consumed by Python?

@jerryshao
Contributor Author

I'm not sure about flume-sink. I think this change mainly matters for Python, and since currently only Flume and Kafka support a Python API, changing kafka-assembly and flume-assembly is enough.

@srowen
Member

srowen commented Jul 2, 2015

That's fine, but does something else need to change in Python then, if it's expecting the artifact in one place and now it goes somewhere else?

@jerryshao
Contributor Author

No, I think Python already looks in the right place, so nothing needs to change now.

@srowen
Member

srowen commented Jul 2, 2015

OK, maybe I'm confused then: how can it be expecting the file in a location where it's not yet written? Since I don't know this bit, I don't want to overlook something and create more issues.

@jerryshao
Contributor Author

Here is the code in the Python test that is used to search for the Kafka assembly jar:

import glob
import os

def search_kafka_assembly_jar():
    SPARK_HOME = os.environ["SPARK_HOME"]
    kafka_assembly_dir = os.path.join(SPARK_HOME, "external/kafka-assembly")
    # Match the assembly jar under target/scala-*/, where the other assembly
    # modules (and the sbt build) place their output.
    jars = glob.glob(
        os.path.join(kafka_assembly_dir, "target/scala-*/spark-streaming-kafka-assembly-*.jar"))
    if not jars:
        raise Exception(
            ("Failed to find Spark Streaming kafka assembly jar in %s. " % kafka_assembly_dir) +
            "You need to build Spark with "
            "'build/sbt assembly/assembly streaming-kafka-assembly/assembly' or "
            "'build/mvn package' before running this test")
    elif len(jars) > 1:
        raise Exception(("Found multiple Spark Streaming Kafka assembly JARs in %s; please "
                         "remove all but one") % kafka_assembly_dir)
    else:
        return jars[0]

This assumes the Kafka assembly jar is at target/scala-*/spark-streaming-kafka-assembly-*.jar.
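
With that pattern, a Maven build that writes the jar to, for example, external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly-1.4.0-SNAPSHOT.jar (the version numbers here are only illustrative) would be found by this glob.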

I'm not sure whether this addresses your concern.

@srowen
Member

srowen commented Jul 2, 2015

How does it work now then, if this is not where the assembly is found? I just want to make sure we're not overlooking something basic. Or are you saying this doesn't work at all?

@jerryshao
Contributor Author

Hi @srowen, if the assembly jar is not found using the pattern target/scala-*/spark-streaming-kafka-assembly-*.jar, the Python Kafka tests will not run at all.

@srowen
Member

srowen commented Jul 6, 2015

So some tests have not been running at all? OK, that seems like a good thing to fix no matter what. Isn't that the headline topic here then, not just consistency?

@jerryshao
Contributor Author

Yes, the Python streaming Kafka tests will not run if this jar is missing.

@jerryshao jerryshao changed the title [SPARK-7050][build] Keep kafka-assembly maven output path be consistent with other modules [SPARK-7050][build] Fix Python Kafka test assembly jar not found issue under Maven build Jul 6, 2015
@SparkQA

SparkQA commented Jul 8, 2015

Test build #1007 has started for PR 5632 at commit 74b068d.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #1007 has finished for PR 5632 at commit 74b068d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
