SPARK-1584: Upgrade Flume dependency to 1.4.0 #507

tmalaska · 2014-04-23T13:57:29Z

Updated the Flume dependency in the maven pom file and the scala build file.

AmplabJenkins · 2014-04-23T13:57:56Z

Can one of the admins verify this patch?

tmalaska · 2014-04-23T14:09:33Z

There is a build issue. Related to thrift.version. THis pull request should be consider involved.

Researching now.

tmalaska · 2014-04-24T14:07:14Z

OK this should be good now. Please review

pwendell · 2014-04-24T16:33:45Z

project/SparkBuild.scala

@@ -605,7 +606,8 @@ object SparkBuild extends Build {
    name := "spark-streaming-flume",
    previousArtifact := sparkPreviousArtifact("spark-streaming-flume"),
    libraryDependencies ++= Seq(
-      "org.apache.flume" % "flume-ng-sdk" % "1.2.0" % "compile" excludeAll(excludeNetty)
+      "org.apache.flume" % "flume-ng-sdk" % "1.4.0" % "compile" excludeAll(excludeNetty, excludeThrift),
+      "org.apache.thrift" % "libthrift" % "0.8.0" % "compile"


Why does spark need to directly depend on thrift?

See comment below made at 1:41 EST April 14 2014

pwendell · 2014-04-24T17:01:16Z

Hey @tmalaska could you explain a bit more what's going on here? Why does the maven build not need to exclude thrift but the sbt on does? Also, what is the exact conflict that is causing you to have to manually alter thrift in sbt

tmalaska · 2014-04-24T17:41:19Z

@pwendell this is a great question.

The answer there is ether a bug in sbt or I'm missing something in the SparkBuild.scala.

In the Flume 1.4.0 pom.xml there is a dependency on thrift but the version is declared with a property and that property is defined in a profile. I'm not sure if the issue related to the property or the profile or the combination, but sbt does not use the value of the thirft version property and I get the following exception.

sbt.ResolveException: unresolved dependency: org.apache.thrift#libthrift;${thrift.version}: not found

Maven works just fine so I left that as is.

So with my limited understand of sbt and why it was craping out. I decided to exclude the thrift dependency in Flume 1.4.0 and place it in the SparkBuild.scala file.

I'm open to any and all help here. I don't know enough about sbt to know why it is having trouble with this.

Side note, sbt works fine with Flume 1.3.0. This is because in Flume 1.3.0 the thrift version is hard coded in the Flume pom.xml. Flume 1.4.0 introduces the property value.

pwendell · 2014-04-24T18:29:23Z

@tmalaska I noticed in the flume pom they don't actually define thrift.version outside of profiles. Profiles are not a concept that transitive builds can depend on (profiles are totally ignored when bringing in transitive dependencies), so I think they need to put default versions for these things.

This is why thrift.version isn't defined when linking against it.

http://repo1.maven.org/maven2/org/apache/flume/flume-parent/1.4.0/flume-parent-1.4.0.pom

tmalaska · 2014-04-24T18:37:46Z

@pwendell I thought that too then I noticed the following two parts of the Flume 1.4.0 file

//----

hadoop-1.0

!hadoop.profile

//------------

hadoop-2

hadoop.profile
2

//------------

And from what I understand if the property hadoop.profile is equal to 2 then profile hadoop-2 is used otherwise hadoop-1.0 is used.

So I'm still of the belief that this is a sbt bug.

berngp · 2014-04-24T18:40:20Z

quick comment @pwendell, I think it will be a good idea to standardize the pull requests' titles.
e.g. "[SPARK-XYZ]: Small description of the ticket" as well as provide a template for the git-commit message.

tmalaska · 2014-04-24T18:40:32Z

Soo the tags didn't make it through the last one. Here are the parts of the pom.xml again.

{profile}
{id}hadoop-2{/id}
{activation}
{property}
{name}hadoop.profile{/name}
{value}2{/value}
{/property}
{/activation}

//---------

{profile}
  {id}hadoop-1.0{/id}
  {activation}
    {property}
      {name}!hadoop.profile{/name}
    {/property}
  {/activation}

tmalaska · 2014-04-24T18:42:26Z

@berngp I undated pull request name as you recommended.

pwendell · 2014-04-24T18:51:24Z

@berngp We do standardize these:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

tmalaska · 2014-04-24T18:52:30Z

Yes this is my first Spark commit. So I'm going to make some mistakes. :)

pwendell · 2014-04-24T18:56:01Z

@tmalaska Are maven profiles, in general, something that can be activated for a dependency? My understanding was that they just govern a particular local build. For instance I can't set -Phadoop2 when I link against flume and expect it to activate that profile.

I wonder if this is a case where maven allows things that are technically not part of the pom specification and other build tools don't honor them.

That trick in the flume build seems like a way to enable "default" behavior, but again I don't think this will always work transitively for people linking against flume and not just building flume.

tmalaska · 2014-04-24T19:04:33Z

@pwendell to be honest this is a little deeper then I normally go with Pom specification.

I think we have a behavior that maven does and there is a behavior that sbt does and they are not the same.

My goal of this pull request was to make the same outcome for both maven and sbt. In the end thrift will be included because would have Flume pulled it in.

It's not a perfect solution but the only option is to change the pom in Flume 1.4.0 but they have a different requirement of having two thrift options. I'm not even sure how Flume would honor that requirement without profiles.

Also I figured it would be good to have Flume 1.4.0 in Spark 1.0, because Flume 1.4.0 is the most commonly used Flume out there and it has some really cool functionality I would like to add to the FlumeStream like compression and encryption.

srowen · 2014-04-24T19:47:47Z

This is a bit of a weird setup. Thrift has a Hadoop 1 and Hadoop 2 version, but usually you deploy those as separate artifacts. The Hadoop 1 profile is the default, which specifies thrift 0.7.0, but you have to parse profiles to know that. I think it would have been better to set 0.7.0 as the version outside a profile. But I think that's why SBT doesn't get it.

Next I note that flume is telling us it wants to use 0.7.0 with Hadoop 1 and 0.8.0 with Hadoop 2. Ideally we just get away with 0.8.0 for both and not try to version this independently.

You have the SBT build using 0.8.0 but I bet that if you ran mvn dependency:tree you'd find that the Maven build is actually pulling in 0.7.0, given that it's the 'default'.

(But wait, there's more. In Examples, Hive 0.12 wants thrift 0.9.0! But that's irrelevant, it's not a dependency on external.)

Suggest that the Maven build force thrift 0.8.0 too and document it. It can be <scope>runtime</scope>. I think that's pretty tidy, and at least this is contained to particular leaf module.

Once Hadoop 1 and yarn alpha support and such go away, some of this jar hell will too. And maybe when there's one build of reference.

tmalaska · 2014-04-24T19:50:23Z

@srowen yes I agree. Yes I missed that one. The maven will do 7 and sbt will do 8.

I will move the maven to 8 as well.

pwendell · 2014-04-24T20:49:49Z

@srowen if we force thrift 0.8 in maven, is it possible that users compiling against older hadoop versions will get errors because the hadoop code (or some other transitive dependnecy) expects thrift 0.7? I'm not sure whether 0.7 and 0.8 are binary compatible.

tmalaska · 2014-04-24T20:56:36Z

So what should we do? Flume 1.2.0 is even worse at Thrift 6.1.

There are some people still on Hadoop 1 but most are on Hadoop 2 now.

berngp · 2014-04-24T21:13:29Z

Since in the Hadoop ecosystem the projects don't shade Protobufs nor Thrift versions trying to work with every single combination is a hassle to say the least. This is one of the things that Bigtop solves by defining static versions that work together. I am not giving an answer but just additional feedback of the complexity of trying to provide a tool that will work with all versions of the projects in the Hadoop ecosystem.

srowen · 2014-04-24T21:23:33Z

@pwendell In this case, it looks like the thrift versions are there to avoid compatibility problems with HBase:

https://issues.apache.org/jira/browse/FLUME-1974
https://issues.apache.org/jira/browse/FLUME-1975

So there's a theoretical problem with someone using a Flume DStream and HBase 0.92 client in the same app, but I think that should be out of scope to care about.

I will ask flume folk internally to comment if they have a moment.

tmalaska · 2014-04-24T21:30:57Z

Good point @srowen we may be able to just exclude thrift all together. All we need is the avro source stuff.

I will exclude from both and see if it works.

srowen · 2014-04-24T21:33:55Z

It may be that thrift can be entirely excluded for purposes here, but I was just referring to the differing versioning of thrift being caused by HBase versions.

pwendell · 2014-04-24T22:09:50Z

Excluding it, if possible, would be great.

@tmalaska The issue is just that if you hard-code a newer version of a dependency it can break Spark for users of e.g. older hadoop versions - so we need to be careful. We officially support Hadoop 1.X in Spark and many people still use it.

tmalaska · 2014-04-24T23:23:52Z

Uploaded the exclude all and I ran a assembly and test-quick and it worked.

Let me know what I should do next.

Thanks again for the help.

pwendell · 2014-04-25T04:02:42Z

I've merged this - thanks @tmalaska and @srowen for the good work.

Updated the Flume dependency in the maven pom file and the scala build file. Author: tmalaska <ted.malaska@cloudera.com> Closes #507 from tmalaska/master and squashes the following commits: 79492c8 [tmalaska] excluded all thrift 159c3f1 [tmalaska] fixed the flume pom file issues 5bf56a7 [tmalaska] Upgrade flume version (cherry picked from commit d5c6ae6) Signed-off-by: Patrick Wendell <pwendell@gmail.com>

pwendell · 2014-05-02T09:00:18Z

@tmalaska @srowen FYI - I filed this bug with Flume, so hopefully they can update the pom files:

https://issues.apache.org/jira/browse/FLUME-2379

Updated the Flume dependency in the maven pom file and the scala build file. Author: tmalaska <ted.malaska@cloudera.com> Closes apache#507 from tmalaska/master and squashes the following commits: 79492c8 [tmalaska] excluded all thrift 159c3f1 [tmalaska] fixed the flume pom file issues 5bf56a7 [tmalaska] Upgrade flume version

* initial R support without integration tests * finished sparkR integration * case sensitive file names in unix * revert back to previous lower case in dockerfile * addition into the build-push-docker-images

* Support EMNIST AI training for charactors This patch adds an new dataset emnist into AI training scenario. Also need to change some logics in zuul jobs. * fix ci * Update main.yaml * fix face to face

Upgrade flume version

5bf56a7

fixed the flume pom file issues

159c3f1

pwendell reviewed Apr 24, 2014
View reviewed changes

tmalaska changed the title ~~SPARK-1584~~ SPARK-1584 Upgrade Flume dependency to 1.4.0 Apr 24, 2014

tmalaska changed the title ~~SPARK-1584 Upgrade Flume dependency to 1.4.0~~ SPARK-1584: Upgrade Flume dependency to 1.4.0 Apr 24, 2014

excluded all thrift

79492c8

tdas mentioned this pull request Apr 25, 2014

SPARK-1584. Upgrade Flume dependency to 1.4.0 #501

Closed

asfgit closed this in d5c6ae6 Apr 25, 2014

yifeih added a commit to yifeih/spark that referenced this pull request May 8, 2019

SPARK-25299: Add rest of shuffle writer benchmarks (apache#507)

ddc9905

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SPARK-1584: Upgrade Flume dependency to 1.4.0 #507

SPARK-1584: Upgrade Flume dependency to 1.4.0 #507

tmalaska commented Apr 23, 2014

AmplabJenkins commented Apr 23, 2014

tmalaska commented Apr 23, 2014

tmalaska commented Apr 24, 2014

pwendell Apr 24, 2014

tmalaska Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

berngp commented Apr 24, 2014

tmalaska commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

srowen commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

berngp commented Apr 24, 2014

srowen commented Apr 24, 2014

tmalaska commented Apr 24, 2014

srowen commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 25, 2014

pwendell commented May 2, 2014

SPARK-1584: Upgrade Flume dependency to 1.4.0 #507

SPARK-1584: Upgrade Flume dependency to 1.4.0 #507

Conversation

tmalaska commented Apr 23, 2014

AmplabJenkins commented Apr 23, 2014

tmalaska commented Apr 23, 2014

tmalaska commented Apr 24, 2014

pwendell Apr 24, 2014

Choose a reason for hiding this comment

tmalaska Apr 24, 2014

Choose a reason for hiding this comment

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

berngp commented Apr 24, 2014

tmalaska commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

srowen commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

berngp commented Apr 24, 2014

srowen commented Apr 24, 2014

tmalaska commented Apr 24, 2014

srowen commented Apr 24, 2014

pwendell commented Apr 24, 2014

tmalaska commented Apr 24, 2014

pwendell commented Apr 25, 2014

pwendell commented May 2, 2014