S3 access documentation #1643

Closed
jpdna opened this issue Jul 27, 2017 · 2 comments
@jpdna (Member) commented Jul 27, 2017

Getting Spark to connect to S3 can require a bit of trial and error, so it would be good to have the process documented.

This recipe works for me at the moment on my local machine:

  1. Build ADAM from source at commit c8a2202.
  2. Use the repository's scripts to move the build to Scala 2.11 and Spark 2.x.
  3. Use a local Spark installation, spark-2.2.0-bin-hadoop2.7.

I downloaded these two jars:
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.1

I start adam-shell like this:

../adam/bin/adam-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.1.jar,../adam/adam-assembly/target/adam-assembly-spark2_2.11-0.23.0-SNAPSHOT.jar
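
For a private bucket, credentials presumably also need to be supplied somewhere hadoop-aws can see them. A minimal sketch from inside adam-shell, assuming the fs.s3a.* configuration keys of hadoop-aws 2.7 (the placeholder values below are hypothetical; a public bucket like 1000genomes may not need them):

scala> // Hypothetical credentials, set on the SparkContext's Hadoop configuration.
scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")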

Reading from s3a appears to work:

scala> val x = sc.textFile("s3a://1000genomes/CHANGELOG")
x: org.apache.spark.rdd.RDD[String] = s3a://1000genomes/CHANGELOG MapPartitionsRDD[1] at textFile at <console>:24

scala> x.count
res0: Long = 11987

Attempts that didn't work for me:

  1. Using --packages rather than --jars to add the dependencies resulted in errors (see the sketch after this list).
  2. Adding the Maven coordinates for the two dependencies to pom.xml also resulted in errors.
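
For reference, the --packages attempt looked roughly like the sketch below; it asks Spark to resolve the same two Maven coordinates instead of pointing at locally downloaded jars (and, as noted above, this failed for me):

../adam/bin/adam-shell \
  --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 \
  --jars ../adam/adam-assembly/target/adam-assembly-spark2_2.11-0.23.0-SNAPSHOT.jar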

To do:

  1. It would be good for the dependencies required for S3 access to be in the POM, perhaps activated by a profile.

  2. Test reading BAM/VCF from S3 (untested sketch below).
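
As an untested sketch of item 2: once the jars are on the classpath, the same s3a scheme should in principle work through the ADAMContext load methods (the bucket and file names here are hypothetical):

scala> import org.bdgenomics.adam.rdd.ADAMContext._

scala> // Hypothetical paths; any readable BAM/VCF on S3 would do.
scala> val reads = sc.loadAlignments("s3a://bucket/sample.bam")

scala> val variants = sc.loadVcf("s3a://bucket/sample.vcf")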

@heuermh (Member) commented Jul 27, 2017

For the record, I don't need to provide any extra jars when using AWS Elastic MapReduce (EMR) release emr-5.6.0, i.e. Spark 2.1.1 on Hadoop 2.7.3 YARN, with Ganglia 3.7.2 and Zeppelin 0.7.1. I build ADAM from source with the Hadoop 2.7.3 dependency version, and then:

$ ./bin/adam-submit --master yarn -- --version
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/bin/spark-submit
17/06/24 22:15:35 INFO ADAMMain: ADAM invoked with args: "--version"

       e         888~-_          e             e    e
      d8b        888   \        d8b           d8b  d8b
     /Y88b       888    |      /Y88b         d888bdY88b
    /  Y88b      888    |     /  Y88b       / Y88Y Y888b
   /____Y88b     888   /     /____Y88b     /   YY   Y888b
  /      Y88b    888_-~     /      Y88b   /          Y888b

ADAM version: 0.23.0-SNAPSHOT
Commit: 0306717cb952511d48e514475b70ae995a468b5a Build: 2017-06-24
Built for: Apache Spark 2.1.0, Scala 2.11.8, and Hadoop 2.7.3

$ ./bin/adam-shell --master yarn --driver-memory 58G --executor-memory 58G
Using SPARK_SHELL=/usr/bin/spark-shell
Spark context Web UI available at http://172.31.1.86:4040
Spark context available as 'sc' (master = yarn, app id = application_1498339040857_0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/
        
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val genotypes = sc.loadGenotypes("s3a://bucket/sample.genotypes.adam")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD =
GenotypeRDD(MapPartitionsRDD[1] at map at ADAMContext.scala:345,SequenceDictionary{
...

scala> genotypes.rdd.count()
res1: Long = 23600775
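
Writing back out is untested here, but should in principle be the mirror image; a sketch with a hypothetical output path:

scala> // Hypothetical destination; saveAsParquet writes the genotypes as ADAM Parquet.
scala> genotypes.saveAsParquet("s3a://bucket/sample.genotypes.copy.adam")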

@fnothaft (Member) commented

+1 to documenting, but also +1 to @heuermh's point. I believe most distros (e.g. EMR, Databricks, CDH) build some AWS library in. I know this is true for EMR and Databricks, and I'm pretty sure the same holds for CDH.
