From f1b88e635c7ac20998e4e1f73ad7234463fff288 Mon Sep 17 00:00:00 2001
From: Constantin Ahlmann-Eltze
Date: Fri, 4 Jul 2014 17:06:00 +0200
Subject: [PATCH] SPARK-2171: Adds Documentation and Examples that explain how
 to use Spark with Groovy

Contributes documentation that explains how one can use Spark with Groovy
without additional wrappers. Also adds some examples that demonstrate the
benefits of using Spark with Groovy.
---
 docs/groovy-guide.md                          | 176 ++++++++++++++++++
 .../spark/examples/GroovyBroadcastTest.groovy |  51 +++++
 .../spark/examples/GroovyGroupByTest.groovy   |  61 ++++++
 .../spark/examples/GroovyHdfsTest.groovy      |  46 +++++
 .../spark/examples/GroovyPageRank.groovy      |  91 +++++++++
 .../spark/examples/GroovySimpleApp.groovy     |  43 +++++
 .../spark/examples/GroovySparkPi.groovy       |  56 ++++++
 .../org/apache/spark/examples/GroovyTC.groovy |  73 ++++++++
 .../spark/examples/GroovyWordCount.groovy     |  57 ++++++
 9 files changed, 654 insertions(+)
 create mode 100644 docs/groovy-guide.md
 create mode 100644 examples/src/main/groovy/org/apache/spark/examples/GroovyBroadcastTest.groovy
 create mode 100644 examples/src/main/groovy/org/apache/spark/examples/GroovyGroupByTest.groovy
 create mode 100644 examples/src/main/groovy/org/apache/spark/examples/GroovyHdfsTest.groovy
 create mode 100644 examples/src/main/groovy/org/apache/spark/examples/GroovyPageRank.groovy
 create mode 100644 examples/src/main/groovy/org/apache/spark/examples/GroovySimpleApp.groovy
 create mode 100644 examples/src/main/groovy/org/apache/spark/examples/GroovySparkPi.groovy
 create mode 100644 examples/src/main/groovy/org/apache/spark/examples/GroovyTC.groovy
 create mode 100644 examples/src/main/groovy/org/apache/spark/examples/GroovyWordCount.groovy

diff --git a/docs/groovy-guide.md b/docs/groovy-guide.md
new file mode 100644
index 0000000000000..207786dcc074b
--- /dev/null
+++ b/docs/groovy-guide.md
@@ -0,0 +1,176 @@
+---
+layout: global
+title: Groovy Programming Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+#Overview
+This page discusses the benefits of using Groovy bindings for Spark and explains how to use them “out of the box”.
+
+#Motivation
+[Groovy](http://groovy.codehaus.org/) is a dynamic JVM language that significantly reduces the code bloat of Java (its original intention was to achieve the simplicity and compactness of Python on the JVM). At the same time, it adds useful concepts such as closures, dynamic typing, and more.
+Compared to Scala, Groovy is much easier to learn due to its more intuitive syntax, which is essentially a “simplified Java” (in fact, Groovy is to a large degree a superset of Java – if in doubt, a programmer can fall back to standard Java syntax). While Java is an industry-approved and robust language, Java programs tend to contain a lot of boilerplate code, especially in the case of functional programming APIs [^1].
+As a third JVM-based language for Spark, Groovy has the potential to combine the expressiveness of Scala with the simplicity and low learning effort of a “Python-like” programming language. In the context of Spark, Groovy closures are particularly helpful: they work like anonymous functions in Scala, and since Groovy version 2.2 they can be [automatically coerced](http://groovy.codehaus.org/Groovy+2.2+release+notes#Groovy2.2releasenotes-Implicitclosurecoercion) to interfaces with a single method. This allows Spark’s native Java API to be used directly from Groovy, making the code more readable and easier to write. Furthermore, thanks to this feature no “adapter code” is needed – apart from a changed build process, Groovy can be used with Spark “out of the box”.
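+For instance, the following minimal sketch (independent of Spark, assuming Groovy 2.2 or later) shows the coercion at work – the closure is implicitly converted to an implementation of the single-method interface `java.lang.Runnable`:
+
+{% highlight groovy %}
+// The closure is implicitly coerced to Runnable, whose single method is run().
+Runnable greeter = { println "Hello from a coerced closure!" }
+greeter.run()
+{% endhighlight %}
+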
+Just compare the following Groovy code to its Java equivalent:
+
+<div class="codetabs">
+<div data-lang="groovy" markdown="1">
+{% highlight groovy %}
+def words = lines.flatMap({ s -> Arrays.asList(SPACE.split(s)) })
+def counts = words.mapToPair({ new Tuple2(it, 1) }).reduceByKey({ a, b -> a + b })
+{% endhighlight %}
+
+</div>
+<div data-lang="java" markdown="1">
+{% highlight java %}
+JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
+  @Override
+  public Iterable<String> call(String s) {
+    return Arrays.asList(SPACE.split(s));
+  }
+});
+
+JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
+  @Override
+  public Tuple2<String, Integer> call(String s) {
+    return new Tuple2<String, Integer>(s, 1);
+  }
+});
+
+JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
+  @Override
+  public Integer call(Integer i1, Integer i2) {
+    return i1 + i2;
+  }
+});
+{% endhighlight %}
+
+</div>
+</div>
+
+#Installation
+In order to use Groovy with Spark, make sure that you have Groovy version 2.2 or higher installed. If it is already installed, set the environment variable GROOVY_HOME to the path of the installation directory (Linux, bash shell):
+
+{% highlight bash %}
+$ export GROOVY_HOME=~/dev/groovy
+# To check whether the variable is set:
+$ echo $GROOVY_HOME
+{% endhighlight %}
+
+If Groovy is not yet installed, installation on Linux is straightforward:
+
+{% highlight bash %}
+# Get the Groovy enVironment Manager
+$ curl -s get.gvmtool.net | bash
+# CAUTION: Before you can use gvm you need to restart your shell
+# Install Groovy
+$ gvm install groovy
+{% endhighlight %}
+
+Once everything is set up, you can run the Groovy examples like the following one (for more examples see the git repository):
+{% highlight bash %}
+$ ./bin/run-groovy-example GroovySparkPi
+{% endhighlight %}
+
+#Standalone Application
+To illustrate how easy it is to create a standalone application with Groovy for Spark, we show a simple program that counts the lines containing “a” and the lines containing “b” in a text file. In order to deploy it, we will build an executable jar file with [Gradle](http://www.gradle.org/), a Groovy-based build/test/deploy automation tool.
+
+###Directory structure
+First we need to create a new directory for our app. Our build tool Gradle requires a certain directory structure, so make sure that yours looks as follows:
+
+{% highlight bash %}
+$ find .
+./build.gradle
+./src
+./src/main
+./src/main/groovy
+./src/main/groovy/SimpleApp.groovy
+{% endhighlight %}
+
+Next we describe the contents of `SimpleApp.groovy` and `build.gradle`.
+
+###SimpleApp.groovy
+The following program demonstrates how to use Groovy with Spark. It reads a file and counts the lines which contain “a” or “b”. For a comparison with the other programming languages supported by Spark, please see [here](quick-start.html#standalone-applications).
+CAUTION: To run this code you have to replace the string YOUR_SPARK_HOME with the location of your Spark installation.
+
+{% highlight groovy %}
+/* SimpleApp.groovy */
+import org.apache.spark.api.java.*
+import org.apache.spark.SparkConf
+
+class SimpleApp {
+  static def main(args) {
+    // Should be some file on your system, for example:
+    def logFile = "YOUR_SPARK_HOME/README.md"
+    def conf = new SparkConf().setAppName("Simple Groovy Application")
+    def sc = new JavaSparkContext(conf)
+    def logData = sc.textFile(logFile).cache()
+
+    def numAs = logData.filter({ it.contains("a") }).count()
+
+    def numBs = logData.filter({ it.contains("b") }).count()
+
+    println("Lines with a: $numAs, lines with b: $numBs")
+  }
+}
+{% endhighlight %}
+
+As you can see, the implementation is pretty straightforward: although we use Groovy, we call Spark’s standard Java API.
+
+###build.gradle
+Spark requires applications to be packaged as jar files when executing on a cluster. In order to build a jar file we will use Gradle. This automation tool can be installed via:
+
+{% highlight bash %}
+# Skip the next line if the Groovy enVironment Manager (gvm) is already installed
+$ curl -s get.gvmtool.net | bash
+# Install Gradle
+$ gvm install gradle
+{% endhighlight %}
+
+The file `build.gradle` specifies which external jars are needed (Maven-style dependency management) and names the resulting jar. Of course, the version numbers can be adjusted to the most recent versions.
+
+{% highlight groovy %}
+apply plugin: 'groovy'
+
+repositories {
+  mavenCentral()
+}
+
+dependencies {
+  compile 'org.codehaus.groovy:groovy-all:2.2.0'
+  compile 'org.apache.spark:spark-core_2.10:1.0.0'
+}
+
+jar.archiveName = "simple_app.jar"
+{% endhighlight %}
+
+
+###Running the Application
+As the last step, we package our app by executing the `gradle build` command and then run the created jar with the `spark-submit` script:
+
+CAUTION: You first need to replace the strings YOUR_GROOVY_VERSION and LOCATION_OF_PROJECT to match your local setup.
+
+{% highlight bash %}
+# Package our app. This will create a new jar in the folder ./build/libs
+$ gradle build
+# Submit our app to Spark
+$ YOUR_SPARK_HOME/bin/spark-submit \
+    --jars $GROOVY_HOME/embeddable/groovy-all-YOUR_GROOVY_VERSION.jar \
+    --class "SimpleApp" \
+    --master local[*] \
+    LOCATION_OF_PROJECT/simple_app/build/libs/simple_app.jar
+{% endhighlight %}
+
+Congratulations, you have executed your first application with Groovy for Spark!
+
+#Feedback
+If you have comments, suggestions for improvements, or any questions, you are encouraged to contact us directly: Constantin Ahlmann-Eltze or Artur Andrzejak at the Parallel and Distributed Systems Group ([PVS](http://pvs.ifi.uni-heidelberg.de/home/)), Heidelberg University.
+
+----
+
+[^1]: This will probably improve with the adoption of [Java 8](http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html)
\ No newline at end of file
diff --git a/examples/src/main/groovy/org/apache/spark/examples/GroovyBroadcastTest.groovy b/examples/src/main/groovy/org/apache/spark/examples/GroovyBroadcastTest.groovy
new file mode 100644
index 0000000000000..8c14160e7b449
--- /dev/null
+++ b/examples/src/main/groovy/org/apache/spark/examples/GroovyBroadcastTest.groovy
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * Author: Constantin Ahlmann-Eltze
+ * Affiliation: Parallel and Distributed Systems Group, Heidelberg
+ * University, http://pvs.ifi.uni-heidelberg.de/home/
+ * Date: July 4th, 2014
+ */
+
+package org.apache.spark.examples
+
+import org.apache.spark.SparkConf
+import org.apache.spark.api.java.JavaSparkContext
+
+class GroovyBroadcastTest {
+
+  static void main(String[] args) {
+
+    def conf = new SparkConf().setAppName("GroovyBroadcastTest")
+    def sc = new JavaSparkContext(conf)
+    // Optional arguments: number of slices and size of the broadcast array
+    def slices = (args.size() > 0) ? args[0].toInteger() : 2
+    def num = (args.size() > 1) ? args[1].toInteger() : 1000000
+
+    def arr1 = 0..num-1
+
+    for (i in 0..3) {
+      println("Iteration " + i)
+      println("===========")
+      def barr1 = sc.broadcast(arr1)
+      sc.parallelize(1..10, slices).foreach({
+        println(barr1.value().size())
+      })
+    }
+    sc.stop()
+  }
+
+}
diff --git a/examples/src/main/groovy/org/apache/spark/examples/GroovyGroupByTest.groovy b/examples/src/main/groovy/org/apache/spark/examples/GroovyGroupByTest.groovy
new file mode 100644
index 0000000000000..0ff3e53bfeae6
--- /dev/null
+++ b/examples/src/main/groovy/org/apache/spark/examples/GroovyGroupByTest.groovy
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * Author: Constantin Ahlmann-Eltze
+ * Affiliation: Parallel and Distributed Systems Group, Heidelberg
+ * University, http://pvs.ifi.uni-heidelberg.de/home/
+ * Date: July 4th, 2014
+ */
+
+package org.apache.spark.examples
+
+import scala.Tuple2
+import org.apache.spark.SparkConf
+import org.apache.spark.api.java.JavaSparkContext
+
+
+class GroovyGroupByTest {
+
+  static def main(args) {
+
+    def numMappers = (args.length > 0) ? args[0].toInteger() : 2
+    def numKVPairs = (args.length > 1) ? args[1].toInteger() : 1000
+    def valSize = (args.length > 2) ? args[2].toInteger() : 1000
+    def numReducers = (args.length > 3) ? args[3].toInteger() : numMappers
+
+    def conf = new SparkConf().setAppName("Groovy GroupBy Test")
+    def sc = new JavaSparkContext(conf)
+
+    def pairs1 = sc.parallelize(0..numMappers-1, numMappers).flatMapToPair({
+      def ranGen = new Random()
+      def arr1 = new Tuple2[numKVPairs]
+      for (i in 0..numKVPairs-1) {
+        // Random.nextBytes expects a primitive byte[], not Byte[]
+        def byteArr = new byte[valSize]
+        ranGen.nextBytes(byteArr)
+        arr1[i] = new Tuple2(ranGen.nextInt(Integer.MAX_VALUE), byteArr)
+      }
+      Arrays.asList(arr1)
+    }).cache()
+
+    // Force evaluation so that everything has been computed and cached
+    pairs1.count()
+
+    println(pairs1.groupByKey(numReducers).count())
+
+    sc.stop()
+  }
+
+}
diff --git a/examples/src/main/groovy/org/apache/spark/examples/GroovyHdfsTest.groovy b/examples/src/main/groovy/org/apache/spark/examples/GroovyHdfsTest.groovy
new file mode 100644
index 0000000000000..dda6a3b9e544e
--- /dev/null
+++ b/examples/src/main/groovy/org/apache/spark/examples/GroovyHdfsTest.groovy
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * Author: Constantin Ahlmann-Eltze
+ * Affiliation: Parallel and Distributed Systems Group, Heidelberg
+ * University, http://pvs.ifi.uni-heidelberg.de/home/
+ * Date: July 4th, 2014
+ */
+
+package org.apache.spark.examples
+
+import org.apache.spark.SparkConf
+import org.apache.spark.api.java.JavaSparkContext
+
+
+class GroovyHdfsTest {
+
+  static void main(String[] args) {
+    def sparkConf = new SparkConf().setAppName("Groovy HdfsTest")
+    def sc = new JavaSparkContext(sparkConf)
+    def file = sc.textFile(args[0])
+    def mapped = file.map({ it.size() }).cache()
+
+    for (iter in 1..10) {
+      def start = System.currentTimeMillis()
+      // Touch every element to measure how fast the cached RDD can be scanned
+      for (x in mapped.collect()) { x + 2 }
+      def end = System.currentTimeMillis()
+      println("Iteration " + iter + " took " + (end - start) + " ms")
+    }
+    sc.stop()
+  }
+
+}
diff --git a/examples/src/main/groovy/org/apache/spark/examples/GroovyPageRank.groovy b/examples/src/main/groovy/org/apache/spark/examples/GroovyPageRank.groovy
new file mode 100644
index 0000000000000..85727f4e77e5b
--- /dev/null
+++ b/examples/src/main/groovy/org/apache/spark/examples/GroovyPageRank.groovy
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * Author: Constantin Ahlmann-Eltze
+ * Affiliation: Parallel and Distributed Systems Group, Heidelberg
+ * University, http://pvs.ifi.uni-heidelberg.de/home/
+ * Date: July 4th, 2014
+ */
+
+package org.apache.spark.examples
+
+import java.util.regex.Pattern
+import scala.Tuple2
+import com.google.common.collect.Iterables
+import org.apache.spark.SparkConf
+import org.apache.spark.api.java.JavaSparkContext
+
+
+/**
+ * Computes the PageRank of URLs from an input file. The input file should
+ * be in the format:
+ *     URL neighbor-URL
+ *     URL neighbor-URL
+ *     URL neighbor-URL
+ *     ...
+ * where a URL and its neighbor are separated by space(s).
+ */
+class GroovyPageRank {
+  private static final Pattern SPACES = Pattern.compile("\\s+")
+
+  static def main(args) {
+    if (args.length < 2) {
+      System.err.println("Usage: GroovyPageRank <file> <number_of_iterations>")
+      System.exit(1)
+    }
+
+    def sparkConf = new SparkConf().setAppName("GroovyPageRank")
+    def sc = new JavaSparkContext(sparkConf)
+
+    // Loads the input file. It should be in the format described above.
+    def lines = sc.textFile(args[0], 1)
+
+    // Loads all URLs from the input file and initializes their neighbors.
+    def links = lines.mapToPair({ s ->
+      String[] parts = SPACES.split(s)
+      new Tuple2(parts[0], parts[1])
+    }).distinct().groupByKey().cache()
+
+    // Loads all URLs that other URL(s) link to from the input file and initializes their ranks to one.
+    def ranks = links.mapValues({ 1 })
+
+    // Calculates and updates URL ranks iteratively using the PageRank algorithm.
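+    // Note: each iteration applies the standard damped PageRank update
+    //     rank(url) = 0.15 + 0.85 * sum(rank(v) / outDegree(v))
+    // over all pages v that link to url, as in Spark's Java and Scala PageRank examples.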
+    for (int current = 0; current < Integer.parseInt(args[1]); current++) {
+      // Calculates URL contributions to the rank of other URLs.
+      def contribs = links.join(ranks).values().flatMapToPair({ s ->
+        def urlCount = Iterables.size(s._1)
+        def results = new ArrayList<Tuple2<String, Double>>()
+        s._1.each({ n -> results.add(new Tuple2(n, s._2() / urlCount)) })
+        return results
+      })
+
+      // Re-calculates URL ranks based on neighbor contributions.
+      ranks = contribs.reduceByKey({ d1, d2 -> d1 + d2 }).mapValues({ sum -> 0.15 + sum * 0.85 })
+    }
+
+    // Collects all URL ranks and dumps them to the console.
+    def output = ranks.collect()
+    output.each({ tuple -> println(tuple._1() + " has rank: " + tuple._2() + ".") })
+
+    sc.stop()
+  }
+
+}
diff --git a/examples/src/main/groovy/org/apache/spark/examples/GroovySimpleApp.groovy b/examples/src/main/groovy/org/apache/spark/examples/GroovySimpleApp.groovy
new file mode 100644
index 0000000000000..17ee97d6514dc
--- /dev/null
+++ b/examples/src/main/groovy/org/apache/spark/examples/GroovySimpleApp.groovy
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * Author: Constantin Ahlmann-Eltze
+ * Affiliation: Parallel and Distributed Systems Group, Heidelberg
+ * University, http://pvs.ifi.uni-heidelberg.de/home/
+ * Date: July 4th, 2014
+ */
+
+package org.apache.spark.examples
+
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.SparkConf
+
+
+class GroovySimpleApp {
+  static def main(args) {
+    // Should be some file on your system
+    def logFile = "README.md"
+    def conf = new SparkConf().setAppName("Simple Groovy Application")
+    def sc = new JavaSparkContext(conf)
+    def logData = sc.textFile(logFile).cache()
+
+    def numAs = logData.filter({ it.contains("a") }).count()
+
+    def numBs = logData.filter({ it.contains("b") }).count()
+
+    println("Lines with a: $numAs, lines with b: $numBs")
+  }
+}
\ No newline at end of file
diff --git a/examples/src/main/groovy/org/apache/spark/examples/GroovySparkPi.groovy b/examples/src/main/groovy/org/apache/spark/examples/GroovySparkPi.groovy
new file mode 100644
index 0000000000000..339a93d1f23c6
--- /dev/null
+++ b/examples/src/main/groovy/org/apache/spark/examples/GroovySparkPi.groovy
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * Author: Constantin Ahlmann-Eltze
+ * Affiliation: Parallel and Distributed Systems Group, Heidelberg
+ * University, http://pvs.ifi.uni-heidelberg.de/home/
+ * Date: July 4th, 2014
+ */
+
+package org.apache.spark.examples
+
+import org.apache.spark.SparkConf
+import org.apache.spark.api.java.JavaSparkContext
+
+
+/**
+ * Computes an approximation to pi.
+ * Usage: GroovySparkPi [slices]
+ */
+class GroovySparkPi {
+
+  static def main(args) {
+
+    def sparkConf = new SparkConf().setAppName("GroovySparkPi")
+    def jsc = new JavaSparkContext(sparkConf)
+
+    def slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2
+    def n = 100000 * slices
+    def l = 0..<n
+    def dataSet = jsc.parallelize(l, slices)
+    // Monte Carlo estimation: throw n random darts at the square [-1,1] x [-1,1];
+    // the fraction that lands inside the unit circle approaches pi/4.
+    def count = dataSet.map({ e ->
+      double x = Math.random() * 2 - 1
+      double y = Math.random() * 2 - 1
+      (x * x + y * y < 1) ? 1 : 0
+    }).reduce({ i1, i2 -> i1 + i2 })
+
+    println("Pi is roughly " + 4.0 * count / n)
+
+    jsc.stop()
+
+  }
+
+}
diff --git a/examples/src/main/groovy/org/apache/spark/examples/GroovyTC.groovy b/examples/src/main/groovy/org/apache/spark/examples/GroovyTC.groovy
new file mode 100644
index 0000000000000..524854d1408b5
--- /dev/null
+++ b/examples/src/main/groovy/org/apache/spark/examples/GroovyTC.groovy
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * Author: Constantin Ahlmann-Eltze
+ * Affiliation: Parallel and Distributed Systems Group, Heidelberg
+ * University, http://pvs.ifi.uni-heidelberg.de/home/
+ * Date: July 4th, 2014
+ */
+
+package org.apache.spark.examples
+
+import scala.Tuple2
+import org.apache.spark.SparkConf
+import org.apache.spark.api.java.JavaSparkContext
+
+
+/**
+ * Transitive closure on a graph, implemented in Groovy.
+ * Usage: GroovyTC [slices]
+ */
+class GroovyTC {
+
+  static def rand = new Random(42)
+  static def numEdges = 20
+  static def numVertices = 100
+
+  static def generateGraph() {
+    def edges = new HashSet(numEdges)
+    while (edges.size() < numEdges) {
+      def from = rand.nextInt(numVertices)
+      def to = rand.nextInt(numVertices)
+      def e = new Tuple2(from, to)
+      if (from != to)
+        edges.add(e)
+    }
+    return edges.toList()
+  }
+
+  // Joining a path (a, b) with a reversed edge (a, c) yields (a, (b, c));
+  // projecting to (c, b) adds the path obtained from edge (c, a) followed by path (a, b).
+  static final def projectFN = { new Tuple2(it._2._2, it._2._1) }
+
+  static def main(args) {
+    def sparkConf = new SparkConf().setAppName("GroovyTC")
+    def sc = new JavaSparkContext(sparkConf)
+    def slices = (args.length > 0) ? args[0].toInteger() : 2
+    def tc = sc.parallelizePairs(generateGraph(), slices).cache()
+
+    // Reversed edges (b, a) for every edge (a, b), used for the join below.
+    def edges = tc.mapToPair({ new Tuple2(it._2, it._1) })
+    def oldCount = 0L
+    def nextCount = tc.count()
+    // Fixed-point iteration: keep joining the current closure with the
+    // original edges until no new paths are discovered.
+    while (nextCount != oldCount) {
+      oldCount = nextCount
+      tc = tc.union(tc.join(edges).mapToPair(projectFN))
+      tc = tc.distinct().cache()
+      nextCount = tc.count()
+    }
+    println("TC has ${tc.count()} edges.")
+    System.exit(0)
+  }
+
+}
diff --git a/examples/src/main/groovy/org/apache/spark/examples/GroovyWordCount.groovy b/examples/src/main/groovy/org/apache/spark/examples/GroovyWordCount.groovy
new file mode 100644
index 0000000000000..716c0639f1526
--- /dev/null
+++ b/examples/src/main/groovy/org/apache/spark/examples/GroovyWordCount.groovy
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * Author: Constantin Ahlmann-Eltze
+ * Affiliation: Parallel and Distributed Systems Group, Heidelberg
+ * University, http://pvs.ifi.uni-heidelberg.de/home/
+ * Date: July 4th, 2014
+ */
+
+package org.apache.spark.examples
+
+import java.util.regex.Pattern
+import scala.Tuple2
+import org.apache.spark.SparkConf
+import org.apache.spark.api.java.JavaSparkContext
+
+
+/**
+ * Counts the words in a file.
+ * Usage: GroovyWordCount <file>
+ */
+class GroovyWordCount {
+
+  private static final def SPACE = Pattern.compile(" ")
+
+  static def main(args) {
+
+    if (args.length < 1) {
+      System.err.println("Usage: GroovyWordCount <file>")
+      System.exit(1)
+    }
+    def sparkConf = new SparkConf().setAppName("GroovyWordCount")
+    def sc = new JavaSparkContext(sparkConf)
+    def lines = sc.textFile(args[0], 1)
+    def words = lines.flatMap({ s -> Arrays.asList(SPACE.split(s)) })
+    def counts = words.mapToPair({ new Tuple2(it, 1) }).reduceByKey({ a, b -> a + b })
+    println("Here are 5 words with their counts in the file")
+    counts.take(5).each({ t -> println(t) })
+    sc.stop()
+
+  }
+
+}