SPARK-1584: Upgrade Flume dependency to 1.4.0 #507
Conversation
Can one of the admins verify this patch?
There is a build issue related to thrift.version. This pull request should be considered a work in progress. Researching now.
OK, this should be good now. Please review.
```diff
@@ -605,7 +606,8 @@ object SparkBuild extends Build {
     name := "spark-streaming-flume",
     previousArtifact := sparkPreviousArtifact("spark-streaming-flume"),
     libraryDependencies ++= Seq(
-      "org.apache.flume" % "flume-ng-sdk" % "1.2.0" % "compile" excludeAll(excludeNetty)
+      "org.apache.flume" % "flume-ng-sdk" % "1.4.0" % "compile" excludeAll(excludeNetty, excludeThrift),
+      "org.apache.thrift" % "libthrift" % "0.8.0" % "compile"
```
Why does spark need to directly depend on thrift?
See comment below made at 1:41 EST April 14 2014
Hey @tmalaska, could you explain a bit more what's going on here? Why does the Maven build not need to exclude thrift, but the sbt one does? Also, what is the exact conflict that is causing you to have to manually alter thrift in sbt?
@pwendell this is a great question. The answer is either a bug in sbt or something I'm missing in SparkBuild.scala. In the Flume 1.4.0 pom.xml there is a dependency on thrift, but its version is declared with a property, and that property is defined in a profile. I'm not sure if the issue relates to the property, the profile, or the combination, but sbt does not use the value of the thrift version property, and I get the following exception: sbt.ResolveException: unresolved dependency: org.apache.thrift#libthrift;${thrift.version}: not found. Maven works just fine, so I left that as is. Given my limited understanding of sbt and why it was failing, I decided to exclude the thrift dependency from Flume 1.4.0 and declare it in the SparkBuild.scala file. I'm open to any and all help here; I don't know enough about sbt to know why it is having trouble with this. Side note: sbt works fine with Flume 1.3.0, because in Flume 1.3.0 the thrift version is hard-coded in the Flume pom.xml. Flume 1.4.0 introduces the property value.
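The workaround described above, sketched as it might appear in SparkBuild.scala (the `excludeThrift` helper follows the naming in this PR's diff; treat this as an illustration of the approach, not the exact committed code):

```scala
// sbt cannot resolve Flume 1.4.0's transitive thrift dependency, because its
// version comes from a Maven profile property (${thrift.version}). So: exclude
// thrift from flume-ng-sdk and declare libthrift explicitly.
val excludeNetty  = ExclusionRule(organization = "org.jboss.netty")
val excludeThrift = ExclusionRule(organization = "org.apache.thrift")

libraryDependencies ++= Seq(
  "org.apache.flume" % "flume-ng-sdk" % "1.4.0" % "compile"
    excludeAll(excludeNetty, excludeThrift),
  "org.apache.thrift" % "libthrift" % "0.8.0" % "compile"
)
```

This pins the thrift version Spark gets, rather than relying on sbt's Ivy resolution to interpolate a property it never sees.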
@tmalaska I noticed in the flume pom they don't actually define thrift.version outside of profiles. Profiles are not a concept that transitive builds can depend on (profiles are totally ignored when bringing in transitive dependencies), so I think they need to put default versions for these things. See http://repo1.maven.org/maven2/org/apache/flume/flume-parent/1.4.0/flume-parent-1.4.0.pom
@pwendell I thought that too, then I noticed two parts of the Flume 1.4.0 pom file (the paste was stripped from this comment). From what I understand, if the property hadoop.profile is equal to 2 then profile hadoop-2 is used, otherwise hadoop-1.0 is used. So I'm still of the belief that this is an sbt bug.
Quick comment @pwendell: I think it would be a good idea to standardize the pull requests' titles.
So, the tags didn't make it through in the last comment. Here are the parts of the pom.xml again (the paste was stripped a second time).
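Since the pasted pom sections were stripped twice, here is a paraphrased sketch of the relevant parts of flume-parent-1.4.0.pom — element values are from a reading of the linked pom and the discussion above, and may not be verbatim:

```xml
<!-- Sketch: thrift.version is only ever defined inside profiles. -->
<profiles>
  <profile>
    <id>hadoop-1.0</id>
    <activation>
      <property>
        <!-- active when hadoop.profile is NOT set, i.e. the default -->
        <name>!hadoop.profile</name>
      </property>
    </activation>
    <properties>
      <thrift.version>0.7.0</thrift.version>
    </properties>
  </profile>
  <profile>
    <id>hadoop-2</id>
    <activation>
      <property>
        <name>hadoop.profile</name>
        <value>2</value>
      </property>
    </activation>
    <properties>
      <thrift.version>0.8.0</thrift.version>
    </properties>
  </profile>
</profiles>

<!-- ...and the dependency that references the profile-scoped property: -->
<dependency>
  <groupId>org.apache.thrift</groupId>
  <artifactId>libthrift</artifactId>
  <version>${thrift.version}</version>
</dependency>
```

Maven evaluates profile activation when building Flume itself, but a downstream build pulling flume-ng-sdk transitively never activates these profiles, which is why sbt sees the literal `${thrift.version}`.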
@berngp I updated the pull request name as you recommended.
@berngp We do standardize these:
Yes, this is my first Spark commit. So I'm going to make some mistakes. :)
@tmalaska Are maven profiles, in general, something that can be activated for a dependency? My understanding was that they just govern a particular local build; for instance, I can't set a profile on a dependency declaration. I wonder if this is a case where maven allows things that are technically not part of the pom specification and other build tools don't honor them. That trick in the flume build seems like a way to enable "default" behavior, but again, I don't think this will always work transitively for people linking against flume rather than building flume.
@pwendell to be honest, this is a little deeper than I normally go into the pom specification. I think there is a behavior that maven has and a behavior that sbt has, and they are not the same. My goal with this pull request was to produce the same outcome for both maven and sbt. In the end thrift will be included either way, because Flume would have pulled it in. It's not a perfect solution, but the only other option is to change the pom in Flume 1.4.0, and they have a requirement of supporting two thrift versions; I'm not even sure how Flume could honor that requirement without profiles. Also, I figured it would be good to have Flume 1.4.0 in Spark 1.0, because Flume 1.4.0 is the most commonly used Flume out there, and it has some really cool functionality I would like to add to the FlumeStream, like compression and encryption.
This is a bit of a weird setup. Flume wants a Hadoop 1 and a Hadoop 2 variant of thrift, but usually you deploy those as separate artifacts. The Hadoop 1 profile is the default, which specifies thrift 0.7.0, but you have to parse profiles to know that. I think it would have been better to set 0.7.0 as the version outside a profile, and I think that's why SBT doesn't get it.

Next, I note that flume is telling us it wants thrift 0.7.0 with Hadoop 1 and 0.8.0 with Hadoop 2. Ideally we just get away with 0.8.0 for both and don't try to version this independently. You have the SBT build using 0.8.0, but I bet that if you ran mvn dependency:tree you'd find the Maven build is actually pulling in 0.7.0, given that it's the "default". (But wait, there's more: in examples, Hive 0.12 wants thrift 0.9.0! But that's irrelevant; it's not a dependency of external.)

Suggest that the Maven build force thrift 0.8.0 too, and document it. Once Hadoop 1 and yarn-alpha support and such go away, some of this jar hell will too. And maybe when there's one build of reference.
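The dependency:tree check suggested above can be run against the flume module, for example (the module path is assumed from Spark's source layout at the time):

```shell
# Show which libthrift version the Maven build actually resolves
# for the streaming-flume module.
mvn -pl external/flume dependency:tree -Dincludes=org.apache.thrift:libthrift
```

If the output shows libthrift 0.7.0, the Maven build is silently taking Flume's Hadoop 1 default rather than the 0.8.0 the sbt build pins.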
@srowen yes, I agree; I missed that one. Maven will pull in thrift 0.7 and sbt will use 0.8. I will move the Maven build to 0.8 as well.
@srowen if we force thrift 0.8 in maven, is it possible that users compiling against older hadoop versions will get errors because the hadoop code (or some other transitive dependency) expects thrift 0.7? I'm not sure whether 0.7 and 0.8 are binary compatible.
So what should we do? Flume 1.2.0 is even worse, at Thrift 0.6.1. There are some people still on Hadoop 1, but most are on Hadoop 2 now.
Since projects in the Hadoop ecosystem shade neither Protobuf nor Thrift, trying to work with every single combination is a hassle, to say the least. This is one of the things Bigtop solves by defining static versions that work together. I'm not giving an answer, just additional feedback on the complexity of trying to provide a tool that works with all versions of the projects in the Hadoop ecosystem.
@pwendell In this case, it looks like the thrift versions are there to avoid compatibility problems with HBase: https://issues.apache.org/jira/browse/FLUME-1974 So there's a theoretical problem with someone using a Flume DStream and an HBase 0.92 client in the same app, but I think that should be out of scope to care about. I will ask flume folk internally to comment if they have a moment.
Good point @srowen, we may be able to just exclude thrift altogether. All we need is the avro source stuff. I will exclude it from both builds and see if it works.
It may be that thrift can be entirely excluded for purposes here, but I was just referring to the differing versioning of thrift being caused by HBase versions.
Excluding it, if possible, would be great. @tmalaska The issue is just that if you hard-code a newer version of a dependency, it can break Spark for users of e.g. older hadoop versions, so we need to be careful. We officially support Hadoop 1.X in Spark and many people still use it.
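If thrift can indeed be excluded entirely, the Maven side would look roughly like this in the flume module's pom — a sketch of the approach, not the exact committed change:

```xml
<!-- Exclude thrift transitively rather than pinning a version, so we do not
     force a thrift that could conflict with older Hadoop deployments. -->
<dependency>
  <groupId>org.apache.flume</groupId>
  <artifactId>flume-ng-sdk</artifactId>
  <version>1.4.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.thrift</groupId>
      <artifactId>libthrift</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

The sbt side would similarly keep the `excludeThrift` rule but drop the explicit libthrift dependency, since only the avro source is needed.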
Uploaded the exclude-all change, and I ran an assembly and test-quick and it worked. Let me know what I should do next. Thanks again for the help.
Updated the Flume dependency in the maven pom file and the scala build file.

Author: tmalaska <ted.malaska@cloudera.com>

Closes #507 from tmalaska/master and squashes the following commits:

79492c8 [tmalaska] excluded all thrift
159c3f1 [tmalaska] fixed the flume pom file issues
5bf56a7 [tmalaska] Upgrade flume version

(cherry picked from commit d5c6ae6)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>