Website: http://spark.apache.org/

Supported versions:
  • 2.3.1 for Hadoop 2.7+ with OpenJDK 8
  • 2.3.0 for Hadoop 2.7+ with OpenJDK 8
  • 2.2.1 for Hadoop 2.7+ with OpenJDK 8
  • 2.2.0 for Hadoop 2.7+ with OpenJDK 8
  • 2.1.1 for Hadoop 2.7+ with OpenJDK 8
  • 2.1.0 for Hadoop 2.7+ with OpenJDK 8
  • 2.0.2 for Hadoop 2.7+ with OpenJDK 8
  • 2.0.1 for Hadoop 2.7+ with OpenJDK 8
  • 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8
  • 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 7
  • 1.6.2 for Hadoop 2.6
  • 1.5.1 for Hadoop 2.6

Current responsible(s):
  • Erika Pauwels @ TenForce -- erika.pauwels@tenforce.com
  • Aad Versteden @ TenForce -- aad.versteden@tenforce.com
  • Gezim Sejdiu @ UBO -- g.sejdiu@gmail.com
  • Ivan Ermilov @ InfAI -- ivan.s.ermilov@gmail.com

Docker image(s):
  • bde2020/spark-master:latest
  • bde2020/spark-worker:latest
  • bde2020/spark-java-template:latest
  • bde2020/spark-python-template:latest

More info: http://spark.apache.org/docs/latest/programming-guide.html

Short description

Apache Spark is an in-memory data processing engine. It provides APIs in Java, Python and Scala that reduce programming complexity by introducing the abstraction of Resilient Distributed Datasets (RDDs), i.e. logical collections of data partitioned across machines. Applications manipulate RDDs in much the same way as they would manipulate local collections of data.
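
As a rough illustration, here is a minimal PySpark sketch of working with an RDD; the application name and the local[*] master URL are placeholders for illustration only, not values prescribed by this page:

```python
from pyspark import SparkConf, SparkContext

# Placeholder settings: in a real deployment the master URL points at the cluster.
conf = SparkConf().setAppName("rdd-example").setMaster("local[*]")
sc = SparkContext(conf=conf)

numbers = sc.parallelize(range(1, 101))         # distribute a local collection as an RDD
even_squares = numbers.map(lambda x: x * x) \
                      .filter(lambda x: x % 2 == 0)
print(even_squares.reduce(lambda a, b: a + b))  # aggregate across partitions

sc.stop()
```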

On top of its core, Apache Spark provides 4 libraries:

  • Spark SQL - Library that lets Spark work with (semi-)structured data by providing the DataFrame (formerly SchemaRDD) abstraction on top of Spark core. The library also provides SQL support and a domain-specific language to manipulate DataFrames (see the sketch after this list).
  • Spark Streaming - Library that adds stream data processing to Spark core. Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications by ingesting data in mini-batches. Moreover, application code developed for batch processing can be reused for stream processing in Spark.
  • MLlib - Library that provides a machine learning framework on top of Spark core.
  • GraphX - Library that provides a distributed graph processing framework on top of Spark core. GraphX comes with a variety of graph algorithms and unifies ETL, exploratory analysis, and iterative graph computation within a single system.
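
To make the Spark SQL point concrete, here is a hedged sketch of the DataFrame API in Python; the people.json input file and its name/age fields are made-up placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

people = spark.read.json("people.json")    # placeholder input; schema is inferred from the JSON records
people.createOrReplaceTempView("people")   # expose the DataFrame to SQL queries
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()
```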

Example usage

Building and running your Spark application on top of the Spark cluster is as simple as extending a template Docker image. Check the template's README for further documentation.
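
As a rough idea of what such an application can look like (the actual packaging and submission steps are described in the template's README), here is a minimal word-count script in Python; the HDFS URLs and the application name are placeholders, not values taken from the templates:

```python
from pyspark import SparkContext

sc = SparkContext(appName="word-count-demo")  # placeholder application name

lines = sc.textFile("hdfs://namenode:8020/input/text.txt")   # placeholder HDFS path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs://namenode:8020/output/counts")  # placeholder HDFS path

sc.stop()
```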

The repository big-data-europe/demo-spark-sensor-data contains a demo application in Java which also integrates with HDFS.

Scaling

RDDs are fault-tolerant collections of elements that can be operated on in parallel. As a consequence, Spark applications scale automatically as you add Spark worker nodes to the cluster.
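
A small sketch of how this plays out in application code: the partition count below is an arbitrary example value; Spark schedules those partitions as independent tasks over whatever worker cores the cluster currently offers, so the same code simply runs faster on a larger cluster:

```python
from pyspark import SparkContext

sc = SparkContext(appName="scaling-example")  # placeholder application name

# 64 partitions is an arbitrary choice; each partition is an independent task
# that Spark assigns to any free core on any worker in the cluster.
data = sc.parallelize(range(1000000), numSlices=64)
print(data.map(lambda x: x * 2).sum())

sc.stop()
```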
