Commit

Removed sample code.

rxin committed Sep 6, 2014
1 parent e9c3761 commit 0447c9f
Showing 2 changed files with 10 additions and 123 deletions.
2 changes: 1 addition & 1 deletion core/pom.xml
@@ -44,7 +44,7 @@
         </exclusion>
       </exclusions>
     </dependency>
-        <dependency>
+    <dependency>
       <groupId>net.java.dev.jets3t</groupId>
       <artifactId>jets3t</artifactId>
     </dependency>
131 changes: 9 additions & 122 deletions docs/openstack-integration.md
@@ -1,6 +1,6 @@
 ---
 layout: global
-title: OpenStack Integration
+title: OpenStack Swift Integration
 ---
 
 * This will become a table of contents (this text will be scraped).
@@ -9,16 +9,12 @@ title: OpenStack Integration
 
 # Accessing OpenStack Swift from Spark
 
-Spark's file interface allows it to process data in OpenStack Swift using the same URI
-formats that are supported for Hadoop. You can specify a path in Swift as input through a
-URI of the form <code>swift://<container.PROVIDER/path</code>. You will also need to set your
+Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the
+same URI formats as in Hadoop. You can specify a path in Swift as input through a
+URI of the form <code>swift://container.PROVIDER/path</code>. You will also need to set your
 Swift security credentials, through <code>core-sites.xml</code> or via
-<code>SparkContext.hadoopConfiguration</code>.
-Openstack Swift driver was merged in Hadoop version 2.3.0
-([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)).
-Users that wish to use previous Hadoop versions will need to configure Swift driver manually.
-Current Swift driver requires Swift to use Keystone authentication method. There are recent efforts
-to support temp auth [Hadoop-10420](https://issues.apache.org/jira/browse/HADOOP-10420).
+<code>SparkContext.hadoopConfiguration</code>.
+Current Swift driver requires Swift to use Keystone authentication method.
 
 # Configuring Swift
 Proxy server of Swift should include <code>list_endpoints</code> middleware. More information
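A minimal Scala sketch of the <code>SparkContext.hadoopConfiguration</code> route described above, assuming the hypothetical <code>SparkTest</code> provider and the sample credentials used later on this page (the keys follow the <code>fs.swift.service.&lt;PROVIDER&gt;.*</code> convention of Hadoop's <code>hadoop-openstack</code> driver):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell a SparkContext named sc already exists; one is created here
// only to keep the sketch self-contained.
val sc = new SparkContext(new SparkConf().setAppName("SwiftSketch").setMaster("local[2]"))

// Keystone credentials for the assumed provider "SparkTest".
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.auth.url", "http://127.0.0.1:5000/v2.0/tokens")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")

// Read an object from the assumed container "logs" via the swift:// scheme.
val data = sc.textFile("swift://logs.SparkTest/data.log")
println(data.count())
{% endhighlight %}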
@@ -27,9 +23,9 @@ available
 
 # Dependencies
 
-Spark should be compiled with <code>hadoop-openstack-2.3.0.jar</code> that is distributted with
-Hadoop 2.3.0. For the Maven builds, the <code>dependencyManagement</code> section of Spark's main
-<code>pom.xml</code> should include:
+The Spark application should include <code>hadoop-openstack</code> dependency.
+For example, for Maven support, add the following to the <code>pom.xml</code> file:
+
 {% highlight xml %}
 <dependencyManagement>
 ...
@@ -42,19 +38,6 @@ Hadoop 2.3.0. For the Maven builds, the <code>dependencyManagement</code> section
 </dependencyManagement>
 {% endhighlight %}
 
-In addition, both <code>core</code> and <code>yarn</code> projects should add
-<code>hadoop-openstack</code> to the <code>dependencies</code> section of their
-<code>pom.xml</code>:
-{% highlight xml %}
-<dependencies>
-  ...
-  <dependency>
-    <groupId>org.apache.hadoop</groupId>
-    <artifactId>hadoop-openstack</artifactId>
-  </dependency>
-  ...
-</dependencies>
-{% endhighlight %}
 
 # Configuration Parameters
 
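For sbt-based applications, a hypothetical equivalent of the Maven coordinate above (the 2.3.0 version is an assumption, taken from the Hadoop release cited elsewhere on this page):

{% highlight scala %}
// build.sbt, an assumed sbt equivalent of the hadoop-openstack Maven dependency above.
libraryDependencies += "org.apache.hadoop" % "hadoop-openstack" % "2.3.0"
{% endhighlight %}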
@@ -171,99 +154,3 @@ Notice that
 We suggest to keep those parameters in <code>core-sites.xml</code> for testing purposes when running Spark
 via <code>spark-shell</code>.
 For job submissions they should be provided via <code>sparkContext.hadoopConfiguration</code>.
-
-# Usage examples
-
-Assume Keystone's authentication URL is <code>http://127.0.0.1:5000/v2.0/tokens</code> and Keystone contains tenant <code>test</code>, user <code>tester</code> with password <code>testing</code>. In our example we define <code>PROVIDER=SparkTest</code>. Assume that Swift contains container <code>logs</code> with an object <code>data.log</code>. To access <code>data.log</code> from Spark the <code>swift://</code> scheme should be used.
-
-
-## Running Spark via spark-shell
-
-Make sure that <code>core-sites.xml</code> contains <code>fs.swift.service.SparkTest.tenant</code>, <code>fs.swift.service.SparkTest.username</code>,
-<code>fs.swift.service.SparkTest.password</code>. Run Spark via <code>spark-shell</code> and access Swift via <code>swift://</code> scheme.
-
-{% highlight scala %}
-val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
-sfdata.count()
-{% endhighlight %}
-
-
-## Sample Application
-
-In this case <code>core-sites.xml</code> need not contain <code>fs.swift.service.SparkTest.tenant</code>, <code>fs.swift.service.SparkTest.username</code>,
-<code>fs.swift.service.SparkTest.password</code>. Example of Java usage:
-
-{% highlight java %}
-/* SimpleApp.java */
-import org.apache.spark.api.java.*;
-import org.apache.spark.SparkConf;
-import org.apache.spark.api.java.function.Function;
-
-public class SimpleApp {
-  public static void main(String[] args) {
-    String logFile = "swift://logs.SparkTest/data.log";
-    SparkConf conf = new SparkConf().setAppName("Simple Application");
-    JavaSparkContext sc = new JavaSparkContext(conf);
-    // The provider name must match the PROVIDER used in the swift:// URI ("SparkTest").
-    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
-    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");
-    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");
-
-    JavaRDD<String> logData = sc.textFile(logFile).cache();
-    long num = logData.count();
-
-    System.out.println("Total number of lines: " + num);
-  }
-}
-{% endhighlight %}
-
-The directory structure is
-{% highlight bash %}
-./src
-./src/main
-./src/main/java
-./src/main/java/SimpleApp.java
-{% endhighlight %}
-
-Maven pom.xml should contain:
-{% highlight xml %}
-<project>
-  <groupId>edu.berkeley</groupId>
-  <artifactId>simple-project</artifactId>
-  <modelVersion>4.0.0</modelVersion>
-  <name>Simple Project</name>
-  <packaging>jar</packaging>
-  <version>1.0</version>
-  <repositories>
-    <repository>
-      <id>Akka repository</id>
-      <url>http://repo.akka.io/releases</url>
-    </repository>
-  </repositories>
-  <build>
-    <plugins>
-      <plugin>
-        <groupId>org.apache.maven.plugins</groupId>
-        <artifactId>maven-compiler-plugin</artifactId>
-        <version>2.3</version>
-        <configuration>
-          <source>1.6</source>
-          <target>1.6</target>
-        </configuration>
-      </plugin>
-    </plugins>
-  </build>
-  <dependencies>
-    <dependency> <!-- Spark dependency -->
-      <groupId>org.apache.spark</groupId>
-      <artifactId>spark-core_2.10</artifactId>
-      <version>1.0.0</version>
-    </dependency>
-  </dependencies>
-</project>
-{% endhighlight %}
-
-Compile and execute:
-{% highlight bash %}
-mvn package
-$SPARK_HOME/bin/spark-submit --class SimpleApp --master local[4] target/simple-project-1.0.jar
-{% endhighlight %}
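For Scala users, a hypothetical port of the removed Java sample above, under the same assumptions (the <code>SparkTest</code> provider and the sample credentials):

{% highlight scala %}
/* SimpleApp.scala, an assumed Scala equivalent of the removed Java sample. */
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "swift://logs.SparkTest/data.log"
    val sc = new SparkContext(new SparkConf().setAppName("Simple Application"))
    // Credentials are set programmatically, so core-sites.xml may omit them.
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
    sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")

    val logData = sc.textFile(logFile).cache()
    println("Total number of lines: " + logData.count())
  }
}
{% endhighlight %}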
