Spark 1.4.1 release candidates #142

Closed

tyro89 opened this issue Jul 11, 2015 · 40 comments
@tyro89
Contributor

tyro89 commented Jul 11, 2015

Would love to have the option to launch with Spark 1.4.1 RCs! I looked through the S3 buckets and noticed there are only 1.4.0 builds at the moment.

tyro89 changed the title from "Spark 1.4.1" to "Spark 1.4.1 release candidates" on Jul 11, 2015
@ankurmitujjain

+1. Spark 1.4.1 is now released... I'd really appreciate it if you could include this one quickly...

Thank you

@mkanchwala

Waiting for this release on AWS EMR... it has major bug fixes.

Thanks

@erond

erond commented Jul 16, 2015

Waiting as well for 1.4.1 because of the many bug fixes it contains. Thanks

@MattFlower

+1

@christopherbozeman
Contributor

It's coming...

@ankurmitujjain

Great............

@ankurmitujjain

Is it there?

@mkanchwala

@christopherbozeman Can you please tell me how much time it'll take?

Thanks

@erond

erond commented Jul 23, 2015

Any update on this issue? I'd really appreciate being able to use the latest bug-fixed version. Thanks

@christopherbozeman
Contributor

Spark 1.4.1 is now available as a native application with EMR's new release; see https://forums.aws.amazon.com/ann.jspa?annID=3160.

@PKUKILLA

Hi Chris,
How do I enable dynamic allocation? It requires copying the shuffle jar and adding the following to yarn-site.xml (reference: http://www.slideshare.net/ozax86/spark-on-yarn-with-dynamic-resource-allocation):
 
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>spark_shuffle,mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>
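For reference, a minimal sketch of the job-side settings that usually go with the shuffle service configured above (spark.dynamicAllocation.* and spark.shuffle.service.enabled are standard Spark settings; the class and jar names are placeholders, and the exact shuffle-jar/classpath steps should be taken from the linked slides):

    spark-submit \
      --master yarn-cluster \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=1 \
      --conf spark.dynamicAllocation.maxExecutors=10 \
      --class com.example.MyApp my-app.jar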
       
       

@jkleckner

@christopherbozeman This page needs updating for the dynamic allocation feature because it calls out a fixed instance count, true?

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-launch.html

Create the cluster with the following command:

aws emr create-cluster --name "Spark cluster" --release-label emr-4.0.0 --applications Name=Spark --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --use-default-roles
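A hedged variant of that command, showing how the Spark defaults for dynamic allocation could be supplied through the emr-4.0.0 configurations API; the spark-defaults classification and property names are assumptions based on standard Spark settings, not taken from the linked page:

    aws emr create-cluster --name "Spark cluster" --release-label emr-4.0.0 \
      --applications Name=Spark --ec2-attributes KeyName=myKey \
      --instance-type m3.xlarge --instance-count 3 --use-default-roles \
      --configurations '[{"Classification":"spark-defaults","Properties":{
        "spark.dynamicAllocation.enabled":"true",
        "spark.shuffle.service.enabled":"true"}}]'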

@erond

erond commented Jul 25, 2015

Thanks @christopherbozeman. Do you think you are also going to add 1.4.1 support to the "old" bootstrap action provided by this GitHub project? We use it heavily, and we are not yet ready to move to Hadoop 2.6 and Hive 1.0. It would be great to have both the "automated" way and the "manual" way to install Spark, so that we can test all the pieces step by step before moving a production system to a new set of upgraded frameworks. Also, can you give the community any hints about how long this project (the emr-bootstrap-action to install Spark) will still be maintained and offered? Much appreciated. Thanks as always.

@PKUKILLA

Thanks, it works.

@Sazpaimon

@christopherbozeman Does the EMR 4.0.0 version of Spark contain the patch from christopherbozeman/spark@316b2e0? It doesn't look like it does: when I insert into a Hive table using Spark SQL, it creates temporary files in S3 and then appears to get stuck while moving them to their final location.

@erond

erond commented Aug 13, 2015

Considering also the issues reported for Hive (#154) and Ganglia (#153), is there any possibility of getting Spark 1.4.1 available as a bootstrap action (a.k.a. "the usual way"), so that in the meantime it works on the 3.8.0 AMI (and Hadoop 2.4)? The upgrade of our system is stuck because of this: 1.4.0 has known blocking bugs, so there is no way to move forward from Spark 1.3.1 until you kindly upgrade the emr-bootstrap-actions support as well. I think many people would really appreciate it. Thanks.

@erond

erond commented Aug 18, 2015

Stuck on the upgrade to Spark 1.4.0 using AMI 3.8.0 due to https://issues.apache.org/jira/browse/SPARK-8368, so we can't move forward even to 1.4.0 unless we switch to AMI 4.0.0. PLEASE upgrade the emr-bootstrap-action to support Spark 1.4.1 on AMI 3.8.0; this is really a big issue for many people!

@knowak

knowak commented Aug 19, 2015

Same here, would appreciate getting 1.4.1 integrated here while EMR 4.0 matures.

@erond

erond commented Aug 19, 2015

Furthermore, we actually CAN'T switch to AMI 4.0.0, since we are leveraging DataPipeline, which obviously doesn't currently support that AMI version; see https://forums.aws.amazon.com/thread.jspa?messageID=662004 and https://forums.aws.amazon.com/thread.jspa?messageID=658891 for references.

@ankurmitujjain

+1. I think EMR 4.0.0 is not mature enough to replace all the applications available on AMI 3.8.0.

@christopherbozeman
Contributor

Spark 1.4.1 is now available for the Spark bootstrap action on EMR AMI 3.x and can be requested with version "1.4.1.a".
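A minimal sketch of requesting that build with the bootstrap action; the install-spark script location and the -v argument are assumed from this repository's usual usage, so check the README for the exact invocation:

    aws emr create-cluster --name "Spark 1.4.1.a cluster" --ami-version 3.8.0 \
      --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
      --use-default-roles \
      --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-v,1.4.1.a]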

@Sazpaimon

@christopherbozeman Can you answer my previous question about the EMR 4.0 version of Spark containing christopherbozeman/spark@316b2e0?

@rajatdt

rajatdt commented Sep 8, 2015

Hi christopherbozeman,

Thank you for the update. Could you please provide some feedback on the configuration I am trying to use? My configuration:
- ami-version: 3.3 (which defaults to 3.3.2)
- spark: 1.4.1.a
- etc.
The question is: should I use ami-version 3.3 or the latest ami-version? I want to use the emr-4.0.0 release label since it provides Spark 1.4.1, but it comes with Hadoop 2.6.0, whereas I want to use 2.4.0.

@christopherbozeman
Contributor

@Sazpaimon I dug into your comment on #142 (comment) and determined that christopherbozeman/spark@316b2e0 is a NOOP (the underlying RDD interaction with the Hadoop output format takes care of the S3 direct write). What performs the magic of not creating extra temporary paths when writing to S3 is code that EMR added to Hive, which gets included by the Spark BA when the -h option is supplied. That is what is missing from EMR release 4.0.0. Also, since Spark 1.4 only supports up to Hive 0.13 (https://issues.apache.org/jira/browse/SPARK-8065), the native Spark in EMR release 4.0.0 cannot simply use the Hive 1.0 jars to fix the issue. I'll report this issue internally with the development team so it is resolved in a future EMR release. For now, the ugly workaround would be to take the Hive jars from EMR AMI 3.x with Hive 0.13 that are pruned for Spark (~spark/classpath/hive/*), copy them to the master node of an EMR 4.0.0 cluster, and then append them to the Spark classpath.
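A rough sketch of that workaround under assumed default paths; the hostname, key file, destination directory, and the use of --driver-class-path / spark.executor.extraClassPath are illustrative, not a tested EMR procedure:

    # On the AMI 3.x master: collect the Hive 0.13 jars pruned for Spark.
    tar czf hive013-for-spark.tar.gz -C /home/hadoop/spark/classpath/hive .

    # Copy to the EMR 4.0.0 master (hostname is a placeholder) and unpack.
    scp -i myKey.pem hive013-for-spark.tar.gz hadoop@emr4-master:/home/hadoop/
    ssh -i myKey.pem hadoop@emr4-master \
      'mkdir -p /home/hadoop/hive013-jars && tar xzf hive013-for-spark.tar.gz -C /home/hadoop/hive013-jars'

    # Append the jars when submitting (wildcard classpath entries are a Java 6+ feature;
    # class and application jar are placeholders).
    spark-submit \
      --driver-class-path '/home/hadoop/hive013-jars/*' \
      --conf spark.executor.extraClassPath='/home/hadoop/hive013-jars/*' \
      --class com.example.MyApp my-app.jar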

@christopherbozeman
Contributor

@rajatdt - why are you avoiding Hadoop 2.6.0?

@rajatdt

rajatdt commented Sep 8, 2015

Hi,

Can you please specify the comment that I made on this issue? I think you have the wrong guy here.

Regards

Rajat Dikshit


@christopherbozeman
Contributor

@rajatdt - in reference to #142 (comment): why do you need to use Hadoop 2.4.0?

@rajatdt

rajatdt commented Sep 8, 2015

I was trying to work on a project with outdated instructions, so I started working and lost track of the updated versions.


@Sazpaimon

@christopherbozeman Thanks. I know exactly the piece of code you're talking about (I've had to decompile Amazon's Hive distribution for debugging purposes more times than I'd care to admit), and I'll give your suggestion a shot the next time I need EMR 4.0.

@erond

erond commented Sep 9, 2015

@christopherbozeman: Unfortunately I'm facing issues with 1.4.1.a when trying to run my Spark driver (yarn-cluster mode) on EMR with both AMI 3.7.0 and 3.8.0. In particular, when trying to create a Hive external table backed by S3, I get:

15/09/09 10:09:57 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NoSuchMethodError: org.apache.http.params.HttpConnectionParams.setSoKeepalive(Lorg/apache/http/params/HttpParams;Z)V
java.lang.NoSuchMethodError: org.apache.http.params.HttpConnectionParams.setSoKeepalive(Lorg/apache/http/params/HttpParams;Z)V
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:95)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:198)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:132)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:431)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.createAmazonS3Client(EmrFSProdModule.java:125)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSProdModule.createAmazonS3(EmrFSProdModule.java:165)
at com.amazon.ws.emr.hadoop.fs.guice.EmrFSBaseModule.provideAmazonS3(EmrFSBaseModule.java:81)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.google.inject.internal.ProviderMethod.get(ProviderMethod.java:104)
at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:40)
at com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46)
at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1031)
at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
at com.google.inject.Scopes$1$1.get(Scopes.java:65)
at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:40)
at com.google.inject.internal.SingleFieldInjector.inject(SingleFieldInjector.java:53)
at com.google.inject.internal.MembersInjectorImpl.injectMembers(MembersInjectorImpl.java:110)
at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:94)
at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:254)
at com.google.inject.internal.FactoryProxy.get(FactoryProxy.java:54)
at com.google.inject.internal.InjectorImpl$4$1.call(InjectorImpl.java:978)
at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1024)
at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:974)
at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1009)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:105)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2445)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2479)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2461)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372)
at org.apache.hadoop.hive.common.FileUtils.isLocalFile(FileUtils.java:430)
at org.apache.hadoop.hive.common.FileUtils.isLocalFile(FileUtils.java:414)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:9887)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9180)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:422)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:345)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:326)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:155)
at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:326)
at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:316)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:473)
at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:950)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:950)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:128)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)
at myCompany.myPackage.otherPackage.ReadStuffUsingHive.apply(ReadStuffUsingHive.scala:12)
at myCompany.myPackage.BatchSparkDriver$.main(BatchSparkDriver.scala:200)
at myCompany.myPackage.BatchSparkDriver.main(BatchSparkDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
15/09/09 10:09:57 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NoSuchMethodError: org.apache.http.params.HttpConnectionParams.setSoKeepalive(Lorg/apache/http/params/HttpParams;Z)V)
15/09/09 10:09:57 INFO spark.SparkContext: Invoking stop() from shutdown hook

Please note that the very same app, built for and deployed on Spark 1.3.1 (AMI 3.7.0), always used to work smoothly on EMR. Also, the same app built for Spark 1.4.1 has been successfully run and tested on a private physical cluster (CentOS based, with Hadoop 2.4, Hive 0.13, Java 7, Scala 2.10).

Any hints? Thanks in advance!

@christopherbozeman
Contributor

@erond The error is likely a version conflict or mismatch in dependencies. Can I have your spark-submit arguments?
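As a hedged aside, one quick way to check whether the application jar bundles its own (older) httpclient, which commonly causes this kind of NoSuchMethodError; the jar name is taken from the step below and is only illustrative:

    # List any Apache HttpComponents classes packaged inside the assembly jar.
    unzip -l my-app-1.2.3-SNAPSHOT.jar | grep -i 'org/apache/http/' | head

    # If httpclient is bundled, read its version from the embedded Maven metadata.
    unzip -p my-app-1.2.3-SNAPSHOT.jar \
      META-INF/maven/org.apache.httpcomponents/httpclient/pom.properties 2>/dev/null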

@erond

erond commented Sep 9, 2015

Of course, @christopherbozeman. I launch the Spark driver as an EmrActivity step within a DataPipeline:

"step" : ["
   s3://elasticmapreduce/libs/script-runner/script-runner.jar,
   file:///home/hadoop/spark/bin/spark-submit,
    --class,myCompany.myPackage.BatchSparkDriver,
    --name,\"BatchSparkDriver on DP #{runsOn.@pipelineId}\",
    --files,/home/hadoop/spark/conf/hive-site.xml,
    --driver-class-path,/home/hadoop/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/spark/lib/datanucleus-core-3.2.10.jar:/home/hadoop/spark/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/spark/classpath/emr/mysql-connector-java-5.1.30.jar:hive-site.xml,
    --master,yarn-cluster,
    --driver-memory,512m,
    --num-executors,3,
    --executor-memory,2176m,
    s3://myCompany-bucket/path/to/my-app-1.2.3-SNAPSHOT.jar,
    (then driver's args)
  "]

@PKUKILLA

@christopher,
Is there any way to use Spark 1.5.0 with EMR?


@erond

erond commented Sep 17, 2015

@christopherbozeman to the best of your knowledge, has anyone else experienced the same issue I reported when upgrading to 1.4.1.e? Do you have any advice? Thank you very much.

@njvijay

njvijay commented Sep 18, 2015

When can we expect Spark 1.5.0 on EMR?

@christopherbozeman
Contributor

@erond Please try build 1.4.1.b, which was pushed with #163, to see if it resolves the issue.

@christopherbozeman
Contributor

@njvijay and @PKUKILLA see issue #160 regarding Spark 1.5.

@erond

erond commented Sep 29, 2015

@christopherbozeman thanks for the update. Unfortunately, it still fails with the very same error on both AMI 3.7.0 and 3.8.0.

@dacort
Contributor

dacort commented Apr 28, 2023

Hi there - thanks for your contribution. We're updating this repository to include more relevant and recent information.

As such, we're cleaning up and closing old issues and PRs.

Feel free to open an issue if you still use EMR and would like to see an example of something!

dacort closed this as not planned on Apr 28, 2023