S3 access documentation #1643

Closed
jpdna opened this issue Jul 27, 2017 · 2 comments
@jpdna (Member) commented Jul 27, 2017

Getting Spark to connect to S3 can require a bit of trial and error, so it would be good to have the process documented.

This recipe works for me at the moment on my local machine:

  1. Build ADAM from source at commit c8a2202.
  2. Use the repository's scripts to move the build to Scala 2.11 and Spark 2.x.
  3. Use a local Spark installation, spark-2.2.0-bin-hadoop2.7.

I downloaded these two jars:
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.1

I start adam-shell like this:

../adam/bin/adam-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.1.jar,../adam/adam-assembly/target/adam-assembly-spark2_2.11-0.23.0-SNAPSHOT.jar
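
For a private bucket, credentials presumably also need to be supplied somewhere hadoop-aws can see them. A minimal sketch from inside adam-shell, assuming the fs.s3a.* configuration keys of hadoop-aws 2.7 (the placeholder values below are hypothetical; a public bucket like 1000genomes may not need them):

scala> // Hypothetical credentials, set on the SparkContext's Hadoop configuration.
scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")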

Reading from s3a appears to work:

scala> val x = sc.textFile("s3a://1000genomes/CHANGELOG")
x: org.apache.spark.rdd.RDD[String] = s3a://1000genomes/CHANGELOG MapPartitionsRDD[1] at textFile at <console>:24

scala> x.count
res0: Long = 11987

Attempts that didn't work for me:

  1. Using --packages rather than --jars to add the dependencies resulted in errors (see the sketch after this list).
  2. Adding the Maven coordinates for the two dependencies to pom.xml also resulted in errors.
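
For reference, the --packages attempt looked roughly like the sketch below; it asks Spark to resolve the same two Maven coordinates instead of pointing at locally downloaded jars (and, as noted above, this failed for me):

../adam/bin/adam-shell \
  --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 \
  --jars ../adam/adam-assembly/target/adam-assembly-spark2_2.11-0.23.0-SNAPSHOT.jar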

To do:

  1. It would be good for the dependencies required for S3 access to be in the POM, perhaps activated by a profile.

  2. Test reading BAM/VCF from S3 (untested sketch below).
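
As an untested sketch of item 2: once the jars are on the classpath, the same s3a scheme should in principle work through the ADAMContext load methods (the bucket and file names here are hypothetical):

scala> import org.bdgenomics.adam.rdd.ADAMContext._

scala> // Hypothetical paths; any readable BAM/VCF on S3 would do.
scala> val reads = sc.loadAlignments("s3a://bucket/sample.bam")

scala> val variants = sc.loadVcf("s3a://bucket/sample.vcf")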

@heuermh (Member) commented Jul 27, 2017

For the record, I don't need to provide any extra jars when using AWS Elastic MapReduce (EMR) release emr-5.6.0, i.e. Spark 2.1.1 on Hadoop 2.7.3 YARN, with Ganglia 3.7.2 and Zeppelin 0.7.1. I build ADAM from source with the Hadoop 2.7.3 dependency version, and then:

$ ./bin/adam-submit --master yarn -- --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/bin/spark-submit
17/06/24 22:15:35 INFO ADAMMain: ADAM invoked with args: "--version"

       e         888~-_          e             e    e
      d8b        888   \        d8b           d8b  d8b
     /Y88b       888    |      /Y88b         d888bdY88b
    /  Y88b      888    |     /  Y88b       / Y88Y Y888b
   /____Y88b     888   /     /____Y88b     /   YY   Y888b
  /      Y88b    888_-~     /      Y88b   /          Y888b

ADAM version: 0.23.0-SNAPSHOT
Commit: 0306717cb952511d48e514475b70ae995a468b5a Build: 2017-06-24
Built for: Apache Spark 2.1.0, Scala 2.11.8, and Hadoop 2.7.3

$ ./bin/adam-shell --master yarn --driver-memory 58G --executor-memory 58G
Using SPARK_SHELL=/usr/bin/spark-shell
Spark context Web UI available at http://172.31.1.86:4040
Spark context available as 'sc' (master = yarn, app id = application_1498339040857_0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
        
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val genotypes = sc.loadGenotypes("s3a://bucket/sample.genotypes.adam")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD =
GenotypeRDD(MapPartitionsRDD[1] at map at ADAMContext.scala:345,SequenceDictionary{
...

scala> genotypes.rdd.count()
res1: Long = 23600775
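
Writing back out is untested here, but should in principle be the mirror image; a sketch with a hypothetical output path:

scala> // Hypothetical destination; saveAsParquet writes the genotypes as ADAM Parquet.
scala> genotypes.saveAsParquet("s3a://bucket/sample.genotypes.copy.adam")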

@fnothaft (Member) commented

+1 to documenting, but also +1 to @heuermh's point. I believe most distros (e.g. EMR, Databricks, CDH) build some AWS library in. I know this is true for EMR and Databricks, and I'm pretty sure the same holds for CDH.
