---
layout: global
title: OpenStack Integration
---

* This will become a table of contents (this text will be scraped).
{:toc}


# Accessing OpenStack Swift from Spark

Spark's file interface allows it to process data in OpenStack Swift using the same URI
formats that are supported for Hadoop. You can specify a path in Swift as input through a
URI of the form `swift://container.PROVIDER/path`. You will also need to set your
Swift security credentials, through `core-site.xml` or via
`SparkContext.hadoopConfiguration`.
The OpenStack Swift driver was merged into Hadoop 2.3.0
([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)).
Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually.
The current Swift driver requires Swift to use the Keystone authentication method. There are
recent efforts to support temp auth
([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).

# Configuring Swift

The Swift proxy server should include the `list_endpoints` middleware. More information is
available
[here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

# Dependencies

Spark should be compiled with `hadoop-openstack-2.3.0.jar`, which is distributed with
Hadoop 2.3.0. For Maven builds, the `dependencyManagement` section of Spark's main
`pom.xml` should include:

{% highlight xml %}
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
    <version>2.3.0</version>
  </dependency>
  ...
</dependencyManagement>
{% endhighlight %}

In addition, both the `core` and `yarn` projects should add
`hadoop-openstack` to the `dependencies` section of their
`pom.xml`:

{% highlight xml %}
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
  </dependency>
  ...
</dependencies>
{% endhighlight %}
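With the dependency in place, Spark can then be rebuilt against Hadoop 2.3.0 so that
`hadoop-openstack` is pulled in. As a sketch only, assuming a standard Spark Maven build
(the exact flags and profiles depend on your Spark version and environment):

{% highlight bash %}
# Illustrative build command: compile Spark against Hadoop 2.3.0
# so the hadoop-openstack driver is available at runtime.
mvn -Dhadoop.version=2.3.0 -DskipTests clean package
{% endhighlight %}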
# Configuration Parameters

Create `core-site.xml` and place it inside Spark's `conf` directory.
There are two main categories of parameters that should be configured: the declaration of the
Swift driver, and the parameters required by Keystone.

Configuration of Hadoop to use the Swift file system is achieved via:

<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
  <td><code>fs.swift.impl</code></td>
  <td><code>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</code></td>
</tr>
</table>
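The same declaration can also be made programmatically through
`SparkContext.hadoopConfiguration`, as mentioned above. A minimal Scala sketch, assuming `sc`
is an existing `SparkContext` (e.g. in `spark-shell`):

{% highlight scala %}
// Declare the Swift filesystem driver at runtime instead of in core-site.xml.
sc.hadoopConfiguration.set("fs.swift.impl",
  "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
{% endhighlight %}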
Additional parameters are required by Keystone and should be provided to the Swift driver. These
parameters are used to perform authentication in Keystone to access Swift. The following table
contains a list of the mandatory Keystone parameters. `PROVIDER` can be any name.
<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.url</code></td>
  <td>Keystone Authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.auth.endpoint.prefix</code></td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.tenant</code></td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.username</code></td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.password</code></td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.http.port</code></td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.region</code></td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td><code>fs.swift.service.PROVIDER.public</code></td>
  <td>Indicates whether all URLs are public</td>
  <td>Mandatory</td>
</tr>
</table>
For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password
`testing`, defined for tenant `test`. Then `core-site.xml` should include:

{% highlight xml %}
<configuration>
  <property>
    <name>fs.swift.impl</name>
    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
{% endhighlight %}

Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`, and
`fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in
`core-site.xml` is not always a good approach.
We suggest keeping those parameters in `core-site.xml` only for testing purposes, e.g. when
running Spark via `spark-shell`.
For job submissions they should be provided via `sparkContext.hadoopConfiguration`.
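For instance, with `PROVIDER=SparkTest` as above, the sensitive parameters can be set from
Scala before any Swift path is accessed. A minimal sketch, mirroring the Java example below:

{% highlight scala %}
// Supply the sensitive Keystone parameters at runtime rather than in core-site.xml.
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")
{% endhighlight %}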
# Usage examples

Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens` and Keystone
contains tenant `test` and user `tester` with password `testing`. In our example we define
`PROVIDER=SparkTest`. Assume that Swift contains a container `logs` with an object `data.log`.
To access `data.log` from Spark, the `swift://` scheme should be used.

## Running Spark via spark-shell

Make sure that `core-site.xml` contains `fs.swift.service.SparkTest.tenant`,
`fs.swift.service.SparkTest.username`, and `fs.swift.service.SparkTest.password`. Run Spark
via `spark-shell` and access Swift via the `swift://` scheme:

{% highlight scala %}
val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
sfdata.count()
{% endhighlight %}


## Sample Application

In this case `core-site.xml` need not contain `fs.swift.service.SparkTest.tenant`,
`fs.swift.service.SparkTest.username`, or `fs.swift.service.SparkTest.password`, since they
are set programmatically. Example of Java usage:

{% highlight java %}
/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "swift://logs.SparkTest/data.log";
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Provide the sensitive Keystone parameters programmatically
    // instead of keeping them in core-site.xml.
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");

    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long num = logData.count();

    System.out.println("Total number of lines: " + num);
  }
}
{% endhighlight %}

The directory structure is:

{% highlight bash %}
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
{% endhighlight %}

Maven `pom.xml` should contain:

{% highlight xml %}
<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>
{% endhighlight %}

Compile and execute:

{% highlight bash %}
mvn package
SPARK_HOME/bin/spark-submit --class SimpleApp --master local[4] target/simple-project-1.0.jar
{% endhighlight %}