Julia interface to Apache Spark.
See Roadmap for current status.
Spark.jl requires at least Java 7 and Maven to be installed and available on the PATH.
Pkg.clone("https://github.com/dfdx/Spark.jl")
Pkg.build("Spark")
# we also need the latest master of JavaCall.jl
Pkg.checkout("JavaCall")
This will download and build all Julia and Java dependencies. To use Spark.jl, type:
using Spark
Spark.init()
All examples below are runnable from the Julia REPL.
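To quickly check that the JVM bridge works, you can distribute a plain Julia collection; this is a minimal sketch, assuming parallelize is exported for turning local collections into RDDs:

sc = SparkContext(master="local")
rdd = parallelize(sc, 1:5)      # distribute a local collection as an RDD
doubled = map(rdd, x -> 2 * x)  # anonymous function; see the note on serialization below
collect(doubled)                # should return [2, 4, 6, 8, 10]
close(sc)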
Count lines in a text file, running Spark in local mode:
sc = SparkContext(master="local")
path = "file:///var/log/syslog"
txt = text_file(sc, path)
count(txt)
close(sc)
Map and reduce over a file on a standalone Spark cluster:
sc = SparkContext(master="spark://spark-standalone:7077", appname="Say 'Hello!'")
path = "file:///var/log/syslog"
txt = text_file(sc, path)
rdd = map(txt, line -> length(split(line)))
reduce(rdd, +)
close(sc)
NOTE: currently, named Julia functions cannot be fully serialized, so functions passed to executors should either already be defined there (e.g. in a preinstalled library) or be anonymous functions.
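A minimal sketch of the difference (the named-function variant fails for the serialization reason above):

sc = SparkContext(master="local")
txt = text_file(sc, "file:///var/log/syslog")

# works: the anonymous function is serialized and shipped to executors
rdd = map(txt, line -> length(line))

# does NOT work unless the same definition is preinstalled on the executors:
# named functions cannot be fully serialized
linelen(line) = length(line)
rdd = map(txt, linelen)

close(sc)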
Map over partitions, reading a file from HDFS on a Mesos cluster:
sc = SparkContext(master="mesos://mesos-master:5050")
path = "hdfs://namenode:8020/user/hdfs/test.log"
txt = text_file(sc, path)
rdd = map_partitions(txt, it -> filter(line -> contains(line, "a"), it))
collect(rdd)
close(sc)
For the full supported API, see the list of exported functions.
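As a slightly larger sketch of chaining transformations, here is a word count; it assumes flat_map, map_pair and reduce_by_key are among those exported functions:

sc = SparkContext(master="local")
txt = text_file(sc, "file:///var/log/syslog")
words = flat_map(txt, line -> split(line))   # one record per word
pairs = map_pair(words, w -> (w, 1))         # (word, 1) pairs
counts = reduce_by_key(pairs, +)             # sum the ones per word
collect(counts)
close(sc)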
Spark.jl also wraps part of the Spark SQL / DataFrame API. All examples below assume that you have a file people.json with content like this:
{"name": "Alice", "age": 27}
{"name": "Bob", "age": 32}
Read a dataframe from JSON and collect it to the driver:
spark = SparkSession()
df = read_json(spark, "/path/to/people.json")
collect(df)
Read JSON and write Parquet:
spark = SparkSession()
df = read_json(spark, "/path/to/people.json")
write_parquet(df, "/path/to/people.parquet")
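And a hypothetical round trip back from Parquet; read_parquet is an assumption here, chosen as the natural counterpart of write_parquet, so check the list of exported functions before relying on it:

spark = SparkSession()
df = read_parquet(spark, "/path/to/people.parquet")  # read_parquet is assumed, not confirmed
collect(df)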