| | |
| --- | --- |
| Website | http://spark.apache.org/ |
| Supported versions | 2.3.1, 2.3.0, 2.2.1, 2.2.0, 2.1.1, 2.1.0, 2.0.2 and 2.0.1 for Hadoop 2.7+ with OpenJDK 8<br>2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8 or OpenJDK 7<br>1.6.2 and 1.5.1 for Hadoop 2.6 |
| Current responsible(s) | Erika Pauwels @ TenForce -- erika.pauwels@tenforce.com<br>Aad Versteden @ TenForce -- aad.versteden@tenforce.com<br>Gezim Sejdiu @ UBO -- g.sejdiu@gmail.com<br>Ivan Ermilov @ InfAI -- ivan.s.ermilov@gmail.com |
| Docker image(s) | bde2020/spark-master:latest<br>bde2020/spark-worker:latest<br>bde2020/spark-java-template:latest<br>bde2020/spark-python-template:latest |
| More info | http://spark.apache.org/docs/latest/programming-guide.html |

## Short description

Apache Spark is an in-memory data processing engine. It provides APIs in Java, Python and Scala that simplify distributed programming through the abstraction of Resilient Distributed Datasets (RDDs), i.e. logical collections of data partitioned across machines. Applications manipulate RDDs much like local collections of data.
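
As a minimal sketch of what this looks like in practice (assuming the Spark 2.x Java API; the class name is illustrative), the following distributes a small collection as an RDD and aggregates it with the same look-and-feel as local collection code:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        // On a cluster the master URL is normally supplied by spark-submit;
        // "local[*]" is set here only so the sketch runs standalone.
        SparkConf conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Distribute a local list as an RDD, then transform and aggregate it
        // in parallel, much like operating on the list itself.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        int sumOfSquares = numbers.map(x -> x * x).reduce(Integer::sum);

        System.out.println("Sum of squares: " + sumOfSquares);
        sc.stop();
    }
}
```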

On top of its core, Apache Spark provides four libraries:

- Spark SQL - Library that lets Spark work with (semi-)structured data by providing the SchemaRDD/DataFrame abstraction on top of Spark core. The library also provides SQL support and a domain-specific language to manipulate DataFrames (a short sketch follows this list).
- Spark Streaming - Library that adds stream processing to Spark core. Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications by ingesting data in mini-batches. Moreover, application code developed for batch processing can be reused for stream processing in Spark.
- MLlib - Library that provides a machine learning framework on top of Spark core.
- GraphX - Library that provides a distributed graph processing framework on top of Spark core. GraphX comes with a variety of graph algorithms, unifying ETL, exploratory analysis, and iterative graph computation within a single system.
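
To give a feel for the Spark SQL library, here is a sketch (Spark 2.x Java API; the HDFS path and the JSON fields are hypothetical) that registers a DataFrame as a temporary view and queries it with plain SQL:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-example")
                .getOrCreate();

        // Hypothetical input: a JSON-lines file with "name" and "age" fields.
        Dataset<Row> people = spark.read().json("hdfs://namenode:8020/data/people.json");

        // Register the DataFrame as a temporary view and query it with SQL.
        people.createOrReplaceTempView("people");
        Dataset<Row> adults = spark.sql("SELECT name FROM people WHERE age >= 18");
        adults.show();

        spark.stop();
    }
}
```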

## Example usage

Building and running your Spark application on top of the Spark cluster is as simple as extending a template Docker image. Check the template's README for further documentation.
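
As an illustration only, a Dockerfile derived from the Java template might look like the sketch below. The environment variable names follow the template's README at the time of writing, but treat them as assumptions and defer to the README; the jar name, main class and arguments are placeholders for your own application:

```dockerfile
FROM bde2020/spark-java-template:latest

# Placeholders: the name of the jar produced by your build, the main class
# to run, and the arguments passed to it. Check the template's README for
# the authoritative variable names and versioned base image tags.
ENV SPARK_APPLICATION_JAR_NAME my-app-1.0
ENV SPARK_APPLICATION_MAIN_CLASS eu.example.MyApp
ENV SPARK_APPLICATION_ARGS "arg1 arg2"
```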

The repository big-data-europe/demo-spark-sensor-data contains a demo application in Java that also integrates with HDFS.

## Scaling

RDDs are fault-tolerant collections of elements that can be operated on in parallel. As a consequence, Spark applications scale automatically as you add Spark worker nodes to the cluster.
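
With the Docker images listed above, adding workers amounts to scaling a Compose service. This is a minimal sketch under the assumption that the worker image is pointed at the master via a `SPARK_MASTER` variable and the standard standalone ports (7077, 8080); check the images' documentation for the authoritative settings:

```yaml
version: "2"
services:
  spark-master:
    image: bde2020/spark-master:latest
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # master service port
  spark-worker:
    image: bde2020/spark-worker:latest
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
```

Starting the cluster and growing it is then a matter of `docker-compose up -d` followed by, e.g., `docker-compose scale spark-worker=3`; each new worker container registers itself with the master.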
