Supported versions:
- 2.3.1 for Hadoop 2.7+ with OpenJDK 8
- 2.3.0 for Hadoop 2.7+ with OpenJDK 8
- 2.2.1 for Hadoop 2.7+ with OpenJDK 8
- 2.2.0 for Hadoop 2.7+ with OpenJDK 8
- 2.1.1 for Hadoop 2.7+ with OpenJDK 8
- 2.1.0 for Hadoop 2.7+ with OpenJDK 8
- 2.0.2 for Hadoop 2.7+ with OpenJDK 8
- 2.0.1 for Hadoop 2.7+ with OpenJDK 8
- 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8
- 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 7
- 1.6.2 for Hadoop 2.6
- 1.5.1 for Hadoop 2.6

Current responsible(s):
- Erika Pauwels @ TenForce -- firstname.lastname@example.org
- Aad Versteden @ TenForce -- email@example.com
- Gezim Sejdiu @ UBO -- firstname.lastname@example.org
- Ivan Ermilov @ InfAI -- email@example.com
Apache Spark is an in-memory data processing engine. It provides APIs in Java, Python and Scala that reduce programming complexity by introducing the abstraction of Resilient Distributed Datasets (RDDs), i.e. logical collections of data partitioned across machines. Applications manipulate RDDs much like local collections of data.
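For illustration, here is a minimal Java sketch (class name and values are hypothetical; it assumes the Spark core dependency is on the classpath) showing that RDD operations read like operations on a local collection:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        // Local master for illustration only; on the cluster the master URL
        // is normally supplied by spark-submit.
        SparkConf conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // The RDD is partitioned across the cluster, yet the code reads
        // like ordinary collection processing.
        JavaRDD<Integer> rdd = sc.parallelize(numbers);
        long evenCount = rdd.filter(n -> n % 2 == 0).count();

        System.out.println("Even numbers: " + evenCount);
        sc.stop();
    }
}
```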
On top of its core, Apache Spark provides 4 libraries:
- Spark SQL - Library that lets Spark work with (semi-)structured data by providing the SchemaRDD/DataFrame data abstraction on top of Spark core. The library also provides SQL support and a domain-specific language to manipulate SchemaRDDs/DataFrames (see the sketch after this list).
- Spark Streaming - Library that adds stream data processing to Spark core. Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications by ingesting data in mini-batches. Moreover, application code developed for batch processing can be reused for stream processing in Spark.
- MLlib Machine Learning Library - Library that provides a machine learning framework on top of Spark core.
- GraphX - Library that provides a distributed graph processing framework on top of Spark core. GraphX comes with a variety of graph algorithms unifying ETL, exploratory analysis, and iterative graph computation within a single system.
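As a sketch of the Spark SQL DSL and SQL support, the following minimal Java example assumes Spark 2.x; the `people.json` file and its schema are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        // The master URL is typically supplied by spark-submit on the cluster.
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-example")
                .getOrCreate();

        // Hypothetical input file; each line holds a JSON object such as
        // {"name": "Alice", "age": 34}.
        Dataset<Row> people = spark.read().json("people.json");

        // Domain-specific language on the DataFrame.
        people.filter(people.col("age").gt(30))
              .groupBy("age")
              .count()
              .show();

        // Equivalent SQL query against a temporary view.
        people.createOrReplaceTempView("people");
        spark.sql("SELECT age, COUNT(*) FROM people WHERE age > 30 GROUP BY age").show();

        spark.stop();
    }
}
```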
Building and running your Spark application on top of the Spark cluster is as simple as extending a template Docker image. Check the template's README for further documentation.
The repository big-data-europe/demo-spark-sensor-data contains a demo application in Java which also integrates with HDFS.
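For reference, a minimal sketch of how a Java Spark job can read from and write to HDFS; the namenode hostname, port and paths are assumptions and depend on your HDFS setup:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hdfs-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical HDFS locations; adapt the namenode host/port and
        // paths to your own cluster.
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/input.csv");

        // Drop empty lines and write the result back to HDFS.
        JavaRDD<String> nonEmpty = lines.filter(line -> !line.trim().isEmpty());
        nonEmpty.saveAsTextFile("hdfs://namenode:8020/data/output");

        sc.stop();
    }
}
```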
RDDs are fault-tolerant collections of elements that can be operated on in parallel. As a consequence, Spark applications scale automatically as the number of Spark worker nodes in the cluster increases.