
SPARK-1584: Upgrade Flume dependency to 1.4.0 #507

Closed
wants to merge 3 commits into from

Conversation

tmalaska

Updated the Flume dependency in the maven pom file and the scala build file.

@AmplabJenkins

Can one of the admins verify this patch?

@tmalaska
Author

There is a build issue related to thrift.version. This pull request should be considered on hold.

Researching now.

@tmalaska
Author

OK, this should be good now. Please review.

@@ -605,7 +606,8 @@ object SparkBuild extends Build {
       name := "spark-streaming-flume",
       previousArtifact := sparkPreviousArtifact("spark-streaming-flume"),
       libraryDependencies ++= Seq(
-       "org.apache.flume" % "flume-ng-sdk" % "1.2.0" % "compile" excludeAll(excludeNetty)
+       "org.apache.flume" % "flume-ng-sdk" % "1.4.0" % "compile" excludeAll(excludeNetty, excludeThrift),
+       "org.apache.thrift" % "libthrift" % "0.8.0" % "compile"
Contributor

Why does spark need to directly depend on thrift?

Author

See comment below made at 1:41 EST April 14 2014

@pwendell
Contributor

Hey @tmalaska, could you explain a bit more what's going on here? Why does the Maven build not need to exclude thrift but the sbt one does? Also, what is the exact conflict that is causing you to have to manually alter thrift in sbt?

@tmalaska
Author

@pwendell this is a great question.

The answer is either a bug in sbt or something I'm missing in SparkBuild.scala.

In the Flume 1.4.0 pom.xml there is a dependency on thrift, but the version is declared with a property, and that property is defined in a profile. I'm not sure if the issue is related to the property, the profile, or the combination, but sbt does not use the value of the thrift version property and I get the following exception.

sbt.ResolveException: unresolved dependency: org.apache.thrift#libthrift;${thrift.version}: not found

Maven works just fine so I left that as is.

So, with my limited understanding of sbt and why it was failing, I decided to exclude the thrift dependency from Flume 1.4.0 and declare it in the SparkBuild.scala file.

I'm open to any and all help here. I don't know enough about sbt to know why it is having trouble with this.

Side note, sbt works fine with Flume 1.3.0. This is because in Flume 1.3.0 the thrift version is hard coded in the Flume pom.xml. Flume 1.4.0 introduces the property value.
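The workaround described above boils down to an exclusion rule plus an explicitly pinned thrift artifact on the sbt side. A minimal sketch, assuming `excludeNetty` is already defined in SparkBuild.scala; the exact definition of `excludeThrift` is not shown in this thread, so the `ExclusionRule` below is an assumption:

```scala
// Sketch of the sbt workaround (hypothetical excludeThrift definition):
// drop Flume's transitive thrift, whose version resolves to the
// unexpanded ${thrift.version}, then pin libthrift explicitly.
val excludeThrift = ExclusionRule(organization = "org.apache.thrift")

libraryDependencies ++= Seq(
  "org.apache.flume" % "flume-ng-sdk" % "1.4.0" % "compile"
    excludeAll(excludeNetty, excludeThrift),
  "org.apache.thrift" % "libthrift" % "0.8.0" % "compile"
)
```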

@pwendell
Contributor

@tmalaska I noticed in the flume pom they don't actually define thrift.version outside of profiles. Profiles are not a concept that transitive builds can depend on (profiles are totally ignored when bringing in transitive dependencies), so I think they need to put default versions for these things.

This is why thrift.version isn't defined when linking against it.

http://repo1.maven.org/maven2/org/apache/flume/flume-parent/1.4.0/flume-parent-1.4.0.pom
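The fix pwendell is describing would be a default value declared outside any profile, so that transitive resolvers see a concrete version. A hypothetical sketch of what the Flume pom could declare, not what it actually contains:

```xml
<!-- Hypothetical: a top-level default that transitive resolution can see -->
<properties>
  <thrift.version>0.7.0</thrift.version>
</properties>
<profiles>
  <profile>
    <id>hadoop-2</id>
    <properties>
      <!-- overrides the default only when this profile is active locally -->
      <thrift.version>0.8.0</thrift.version>
    </properties>
  </profile>
</profiles>
```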

@tmalaska
Author

@pwendell I thought that too, but then I noticed the following two profiles in the Flume 1.4.0 file:

    <profile>
      <id>hadoop-1.0</id>
      <activation>
        <property>
          <name>!hadoop.profile</name>
        </property>
      </activation>
      ...
    </profile>

    <profile>
      <id>hadoop-2</id>
      <activation>
        <property>
          <name>hadoop.profile</name>
          <value>2</value>
        </property>
      </activation>
      ...
    </profile>
And from what I understand if the property hadoop.profile is equal to 2 then profile hadoop-2 is used otherwise hadoop-1.0 is used.

So I'm still of the belief that this is an sbt bug.

@berngp
Contributor

berngp commented Apr 24, 2014

Quick comment @pwendell, I think it would be a good idea to standardize the pull requests' titles, e.g. "[SPARK-XYZ]: Small description of the ticket", as well as to provide a template for the git-commit message.

@tmalaska
Author

So the tags didn't make it through the last one. Here are the parts of the pom.xml again.

    <profile>
      <id>hadoop-2</id>
      <activation>
        <property>
          <name>hadoop.profile</name>
          <value>2</value>
        </property>
      </activation>
    </profile>

    <profile>
      <id>hadoop-1.0</id>
      <activation>
        <property>
          <name>!hadoop.profile</name>
        </property>
      </activation>
    </profile>

@tmalaska tmalaska changed the title SPARK-1584 SPARK-1584 Upgrade Flume dependency to 1.4.0 Apr 24, 2014
@tmalaska tmalaska changed the title SPARK-1584 Upgrade Flume dependency to 1.4.0 SPARK-1584: Upgrade Flume dependency to 1.4.0 Apr 24, 2014
@tmalaska
Author

@berngp I updated the pull request name as you recommended.

@pwendell
Contributor

@tmalaska
Author

Yes this is my first Spark commit. So I'm going to make some mistakes. :)

@pwendell
Contributor

@tmalaska Are maven profiles, in general, something that can be activated for a dependency? My understanding was that they just govern a particular local build. For instance I can't set -Phadoop2 when I link against flume and expect it to activate that profile.

I wonder if this is a case where maven allows things that are technically not part of the pom specification and other build tools don't honor them.

That trick in the flume build seems like a way to enable "default" behavior, but again I don't think this will always work transitively for people linking against flume and not just building flume.
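The distinction pwendell draws can be stated concretely: a profile flag only affects the build being invoked, never its consumers. A sketch (the `hadoop-2` profile id is taken from the Flume pom quoted elsewhere in this thread):

```shell
# Activates the hadoop-2 profile only for this local Flume build:
mvn -Phadoop-2 package

# A project that merely depends on flume-ng-sdk has no way to pass
# that flag through; profiles are ignored during transitive resolution.
```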

@tmalaska
Author

@pwendell to be honest this is a little deeper than I normally go with the POM specification.

I think there is a behavior that Maven has and a behavior that sbt has, and they are not the same.

My goal for this pull request was to get the same outcome from both Maven and sbt. In the end thrift will be included either way, because Flume would have pulled it in.

It's not a perfect solution, but the only other option is to change the pom in Flume 1.4.0, and they have a requirement of supporting two thrift versions. I'm not even sure how Flume would honor that requirement without profiles.

Also, I figured it would be good to have Flume 1.4.0 in Spark 1.0, because Flume 1.4.0 is the most commonly used Flume release out there and it has some really cool functionality I would like to add to the FlumeStream, like compression and encryption.

@srowen
Member

srowen commented Apr 24, 2014

This is a bit of a weird setup. Thrift has a Hadoop 1 and Hadoop 2 version, but usually you deploy those as separate artifacts. The Hadoop 1 profile is the default, which specifies thrift 0.7.0, but you have to parse profiles to know that. I think it would have been better to set 0.7.0 as the version outside a profile. But I think that's why SBT doesn't get it.

Next I note that flume is telling us it wants to use 0.7.0 with Hadoop 1 and 0.8.0 with Hadoop 2. Ideally we just get away with 0.8.0 for both and not try to version this independently.

You have the SBT build using 0.8.0 but I bet that if you ran mvn dependency:tree you'd find that the Maven build is actually pulling in 0.7.0, given that it's the 'default'.

(But wait, there's more. In Examples, Hive 0.12 wants thrift 0.9.0! But that's irrelevant, it's not a dependency on external.)

Suggest that the Maven build force thrift 0.8.0 too and document it. It can be <scope>runtime</scope>. I think that's pretty tidy, and at least this is contained to a particular leaf module.

Once Hadoop 1 and yarn alpha support and such go away, some of this jar hell will too. And maybe when there's one build of reference.
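The check srowen suggests can be run from the flume streaming module; a sketch of the command (the `-Dincludes` filter is optional and just narrows the output to thrift artifacts):

```shell
# Show which libthrift version Maven actually resolves for this module
mvn dependency:tree -Dincludes=org.apache.thrift:libthrift
```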

@tmalaska
Author

@srowen yes, I agree, and I missed that one. Maven will pull 0.7.0 and sbt will pull 0.8.0.

I will move the Maven build to 0.8.0 as well.
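The change agreed to here, sketched as a pom fragment for the flume streaming module (coordinates taken from the sbt diff above, `<scope>runtime</scope>` per srowen's suggestion; the surrounding pom structure is assumed):

```xml
<!-- Sketch: force thrift 0.8.0 in the Maven build of the leaf module -->
<dependency>
  <groupId>org.apache.thrift</groupId>
  <artifactId>libthrift</artifactId>
  <version>0.8.0</version>
  <scope>runtime</scope>
</dependency>
```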

@pwendell
Contributor

@srowen if we force thrift 0.8 in Maven, is it possible that users compiling against older Hadoop versions will get errors because the Hadoop code (or some other transitive dependency) expects thrift 0.7? I'm not sure whether 0.7 and 0.8 are binary compatible.

@tmalaska
Author

So what should we do? Flume 1.2.0 is even worse, at Thrift 0.6.1.

There are some people still on Hadoop 1 but most are on Hadoop 2 now.

@berngp
Contributor

berngp commented Apr 24, 2014

Since projects in the Hadoop ecosystem don't shade their Protobuf or Thrift versions, trying to work with every single combination is a hassle, to say the least. This is one of the things that Bigtop solves, by defining static versions that work together. I am not giving an answer, just additional feedback on the complexity of trying to provide a tool that works with all versions of the projects in the Hadoop ecosystem.

@srowen
Member

srowen commented Apr 24, 2014

@pwendell In this case, it looks like the thrift versions are there to avoid compatibility problems with HBase:

https://issues.apache.org/jira/browse/FLUME-1974
https://issues.apache.org/jira/browse/FLUME-1975

So there's a theoretical problem with someone using a Flume DStream and HBase 0.92 client in the same app, but I think that should be out of scope to care about.

I will ask flume folk internally to comment if they have a moment.

@tmalaska
Author

Good point @srowen, we may be able to just exclude thrift altogether. All we need is the Avro source support.

I will exclude it from both builds and see if it works.
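Excluding thrift entirely, as proposed here, would look roughly like this on the sbt side (a sketch; the Maven side would carry a matching `<exclusions>` block on the flume-ng-sdk dependency):

```scala
// Sketch: no pinned libthrift at all; only Flume's Avro source path
// is exercised, so the thrift transitive dependency is dropped.
"org.apache.flume" % "flume-ng-sdk" % "1.4.0" % "compile"
  excludeAll(excludeNetty, ExclusionRule(organization = "org.apache.thrift"))
```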

@srowen
Member

srowen commented Apr 24, 2014

It may be that thrift can be entirely excluded for purposes here, but I was just referring to the differing versioning of thrift being caused by HBase versions.

@pwendell
Contributor

Excluding it, if possible, would be great.

@tmalaska The issue is just that if you hard-code a newer version of a dependency it can break Spark for users of e.g. older hadoop versions - so we need to be careful. We officially support Hadoop 1.X in Spark and many people still use it.

@tmalaska
Author

I uploaded the exclude-all change, ran an assembly and test-quick, and it worked.

Let me know what I should do next.

Thanks again for the help.

@pwendell
Contributor

I've merged this - thanks @tmalaska and @srowen for the good work.

asfgit pushed a commit that referenced this pull request Apr 25, 2014
Updated the Flume dependency in the maven pom file and the scala build file.

Author: tmalaska <ted.malaska@cloudera.com>

Closes #507 from tmalaska/master and squashes the following commits:

79492c8 [tmalaska] excluded all thrift
159c3f1 [tmalaska] fixed the flume pom file issues
5bf56a7 [tmalaska] Upgrade flume version
(cherry picked from commit d5c6ae6)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
@pwendell
Contributor

pwendell commented May 2, 2014

@tmalaska @srowen FYI - I filed this bug with Flume, so hopefully they can update the pom files:

https://issues.apache.org/jira/browse/FLUME-2379

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 7, 2017
* initial R support without integration tests

* finished sparkR integration

* case sensitive file names in unix

* revert back to previous lower case in dockerfile

* addition into the build-push-docker-images
yifeih added a commit to yifeih/spark that referenced this pull request May 8, 2019
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
* Support EMNIST AI training for characters

This patch adds a new dataset, EMNIST, to the AI training scenario. It also needs to change some logic in the Zuul jobs.

* fix ci

* Update main.yaml

* fix face to face