<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Databricks Learning" style="width: 600px; height: 240px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Spark Overview

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Short History of Apache Spark
* <a href="https://en.wikipedia.org/wiki/Apache_Spark" target="_blank">Apache Spark</a> started in 2009 as a research project at the University of California, Berkeley AMPLab, led by <a href="https://en.wikipedia.org/wiki/Matei_Zaharia" target="_blank">Matei Zaharia</a>.
* In 2010, the project was open sourced under a BSD license.
* In 2013, the project was
  * donated to the Apache Software Foundation
  * relicensed under the Apache 2.0 license
* In February 2014, Spark became a Top-Level <a href="https://spark.apache.org/" target="_blank">Apache Project</a>.
* Latest stable release: <a href="https://spark.apache.org/downloads.html" target="_blank">CLICK-HERE</a>
* 600,000+ lines of code (75% Scala)
* Built by 1,000+ developers from 250+ organizations

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) What is Apache Spark?

Spark is a unified processing engine that can analyze big data using SQL, machine learning, graph processing or real-time stream analysis:

![Spark Engines](https://www.quantiaconsulting.com/logos/img/spark_4engines.png)
<br/>
<br/>
* At its core is the Spark Engine.
* The DataFrames API provides an abstraction above RDDs while simultaneously improving performance 5-20x over traditional RDDs with its Catalyst Optimizer.
* Spark ML provides high quality and finely tuned machine learning algorithms for processing big data.
* The Graph processing API gives us an easily approachable API for modeling pairwise relationships between people, objects, or nodes in a network.
* The Streaming APIs give us end-to-end fault tolerance with exactly-once semantics, and the possibility of millisecond-scale latency.

And it all works together seamlessly!

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) A Unifying Engine

And as a compute engine, Apache Spark is not tied to a specific environment or data warehouse strategy.

![Unified Engine](https://www.quantiaconsulting.com/logos/img/unified-engine.png)
<br/>
<br/>
* Built upon the Spark Core
* Apache Spark is data and environment agnostic.
* Languages: **Scala, Java, Python, R, SQL**
* Environments: **YARN, Docker, EC2, Mesos, OpenStack, Databricks (our favorite), DigitalOcean, and much more...**
* Data Sources: **Hadoop HDFS, Cassandra, Kafka, Apache Hive, HBase, JDBC (PostgreSQL, MySQL, etc.), CSV, JSON, Azure Blob, Amazon S3, Elasticsearch, Parquet, and much, much more...**

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) RDDs
* The primary data abstraction of the Spark engine is the RDD: Resilient Distributed Dataset
  * Resilient, i.e., fault-tolerant: using the RDD lineage graph, Spark can recompute partitions that are missing or damaged due to node failures.
  * Distributed, i.e., the data resides on multiple nodes in a cluster.
  * Dataset, i.e., a collection of partitioned data holding primitive values or compound values, e.g., tuples or other objects.
* The original paper that gave birth to the concept of RDD is <a href="https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf" target="_blank">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a> by Matei Zaharia et al.
* Since Spark 2.x, RDDs are considered the assembly language of the Spark ecosystem.
* DataFrames, Datasets & SQL provide higher-level abstractions over RDDs.
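
To make the role of lineage concrete, here is a minimal pure-Python sketch of the idea. It does not use Spark at all: `ToyRDD`, `from_list`, `lose_partition`, and the always-on partition cache are hypothetical names and simplifications invented for illustration. Each dataset remembers how to rebuild any of its partitions from its parent, so a lost partition can be recomputed on demand.

```python
class ToyRDD:
    """Toy stand-in for an RDD (NOT the Spark API): partitions plus lineage."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute   # lineage: how to (re)build partition i
        self.parent = parent      # the dataset this one was derived from
        self._cache = {}          # materialized partitions, keyed by index

    @classmethod
    def from_list(cls, data, num_partitions):
        # Split the source data round-robin into num_partitions chunks.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn):
        # Only record the transformation; nothing is computed yet.
        parent = self
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
            parent,
        )

    def partition(self, i):
        # Compute the partition from lineage if it is not materialized.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a node failure that destroys one materialized partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


numbers = ToyRDD.from_list(list(range(8)), num_partitions=4)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(2)   # "node failure" on the derived dataset
numbers.lose_partition(2)   # ...and on its parent
after = squares.collect()   # partition 2 is rebuilt by replaying the lineage
```

Only the lost partition is recomputed; the other three are served from their materialized copies. Unlike this toy, Spark keeps partitions in memory only when asked to (for example via `cache()`).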

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Scala, Python, Java, R & SQL
* Besides being able to run in many environments...
* Apache Spark makes the platform even more approachable by supporting multiple languages:
  * Scala - Apache Spark's primary language.
  * Python - More commonly referred to as PySpark
  * R - <a href="https://spark.apache.org/docs/latest/sparkr.html" target="_blank">SparkR</a> (R on Spark)
  * Java
  * SQL - Closer to ANSI SQL 2003 compliance
    * Since Spark 2.x, runs all 99 TPC-DS queries
    * New standards-compliant parser (with good error messages!)
    * Subqueries (correlated & uncorrelated)
    * Approximate aggregate stats
* With the older RDD API, there are significant differences between the language implementations, namely in performance.
* With the newer DataFrames API, the performance differences between languages are nearly nonexistent (especially for Scala, Java & Python).
* That said, not all languages get the same amount of love - even so, the API gap between languages is closing rapidly, especially from Spark 1.x to 2.x.

![RDD vs DataFrames](https://www.quantiaconsulting.com/logos/img/rdd-vs-dataframes.png)

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) The Cluster: Drivers, Executors, Slots & Tasks
![Spark Physical Cluster, slots](https://www.quantiaconsulting.com/logos/img/spark_cluster_slots.png)

* The **Driver** is the JVM in which our application runs.
* The secret to Spark's awesome performance is parallelism.
  * Scaling vertically is limited to a finite amount of RAM, Threads and CPU speeds.
  * Scaling horizontally means we can simply add new "nodes" to the cluster almost endlessly.
* We parallelize at two levels:
  * The first level of parallelization is the **Executor** - a Java virtual machine running on a node, typically, one instance per node.
  * The second level of parallelization is the **Slot** - the number of which is determined by the number of cores and CPUs of each node.
* Each **Executor** has a number of **Slots** to which parallelized **Tasks** can be assigned by the **Driver**.

![Spark Physical Cluster, tasks](https://files.training.databricks.com/images/105/spark_cluster_tasks.png)
<br/>
<br/>
* The JVM is naturally multithreaded, but a single JVM, such as our **Driver**, has a finite upper limit on how much work it can run in parallel.
* By creating **Tasks**, the **Driver** can assign units of work to **Slots** for parallel execution.
* Additionally, the **Driver** must decide how to partition the data so that it can be distributed for parallel processing (not shown here).
* Consequently, the **Driver** assigns a **Partition** of data to each **Task** - in this way each **Task** knows which piece of data it is to process.
* Once started, each **Task** will fetch from the original data source the **Partition** of data assigned to it.
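
The relationship between Executors, Slots, Partitions, and Tasks can be sketched in plain Python. This is only a simulation under stated assumptions: a thread pool stands in for the cluster's Slots, and `total_slots`, `waves_needed`, `split_into_partitions`, and `run_tasks` are hypothetical helpers, not Spark APIs. "Wave" is used here as informal shorthand for one round of Tasks filling all available Slots.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def total_slots(num_executors, cores_per_executor):
    # Assumption for this sketch: one Slot per core on each Executor.
    return num_executors * cores_per_executor


def waves_needed(num_partitions, slots):
    # With more Partitions than Slots, Tasks run in several rounds.
    return math.ceil(num_partitions / slots)


def split_into_partitions(data, num_partitions):
    # Split data into num_partitions contiguous, nearly equal chunks.
    base, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + base + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def run_tasks(partitions, slots, task_fn):
    # One Task per Partition; at most `slots` Tasks run at the same time.
    with ThreadPoolExecutor(max_workers=slots) as pool:
        return list(pool.map(task_fn, partitions))


slots = total_slots(num_executors=2, cores_per_executor=4)
partitions = split_into_partitions(list(range(100)), num_partitions=16)
waves = waves_needed(len(partitions), slots)

# Each Task sums its own Partition; the "Driver" combines the partial results.
partial_sums = run_tasks(partitions, slots, sum)
total = sum(partial_sums)
```

In this toy setup, 16 Partitions on 8 Slots need two waves, which is one reason the number of Partitions is usually chosen with the number of Slots in mind.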

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Quick Note on Jobs & Stages
* Each parallelized action is referred to as a **Job**.
* The results of each **Job** (parallelized/distributed action) are returned to the **Driver**.
* Depending on the work required, multiple **Jobs** may be needed.
* Each **Job** is broken down into **Stages**. 
* This would be analogous to building a house (the job)
  * The first stage would be to lay the foundation.
  * The second stage would be to erect the walls.
  * The third stage would be to add the roof.
  * Attempting any of these steps out of order just won't make sense, and may be outright impossible.
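
The ordering constraint between Stages can be illustrated with a small pure-Python sketch. It is a simulation, not Spark: `run_stage` and `run_job` are hypothetical helpers, and the two stage names are made up for the example. In real Spark the engine decides where Stage boundaries fall; here they are spelled out by hand purely to show that one Stage must finish on every Partition before the next one starts.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_stage(task_fn, partitions, slots=4):
    # Within a Stage, one Task per Partition runs in parallel.
    with ThreadPoolExecutor(max_workers=slots) as pool:
        return list(pool.map(task_fn, partitions))


def run_job(partitions, stages, slots=4):
    # Stages run strictly in order: each Stage consumes the complete
    # output of the previous one.
    completed = []
    for name, task_fn in stages:
        partitions = run_stage(task_fn, partitions, slots)
        completed.append(name)
    return partitions, completed


# Two Partitions of words, processed by a two-Stage Job.
word_partitions = [["Spark", "spark", "RDD"], ["rdd", "Spark"]]
stages = [
    ("normalize", lambda part: [w.lower() for w in part]),
    ("count", lambda part: Counter(part)),
]

per_partition_counts, completed = run_job(word_partitions, stages)

# The Job's results come back to the "Driver", which merges them.
totals = sum(per_partition_counts, Counter())
```

Swapping the two Stages would still run, but would count words before normalizing their case and give a different (wrong) answer, which mirrors the point of the house analogy.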

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Cluster Management & Local Mode

* At a much lower level, Spark Core employs a **Cluster Manager** that is responsible for provisioning nodes in our cluster.
  * Additional Cluster Managers are available for
    <a href="https://spark.apache.org/docs/latest/running-on-mesos.html" target="_blank">Mesos</a> and
    <a href="https://spark.apache.org/docs/latest/running-on-yarn.html" target="_blank">YARN</a>, as well as from third parties.
  * In addition to this, Spark has a <a href="https://spark.apache.org/docs/latest/spark-standalone.html" target="_blank">Standalone</a> mode in which you manually configure each node.
* In each of these scenarios, the **Driver** is [presumably] running on one node, with the **Executors** running on N different nodes.

One option, commonly used for local development, is to run Spark in **Local Mode**.
* In **Local Mode**, both the **Driver** and one **Executor** share the same JVM.
* This is an ideal scenario for experimentation, prototyping, and learning!
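
As a rough sketch of what starting Spark in Local Mode can look like from Python, the snippet below builds a local master URL and uses it to create a session. It assumes the `pyspark` package is installed; `local_master_url` and `start_local_session` are hypothetical helper names, and the application name is arbitrary. The `local[N]` and `local[*]` master URL forms and the `SparkSession.builder` calls are standard Spark.

```python
def local_master_url(threads=None):
    """Build a Spark master URL for Local Mode.

    "local[*]" asks for as many worker threads as there are logical cores;
    "local[N]" asks for exactly N worker threads.
    """
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return f"local[{threads}]"


def start_local_session(app_name="local-mode-demo", threads=None):
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(threads))  # run everything on this machine
        .appName(app_name)
        .getOrCreate()
    )
```

With PySpark available, `spark = start_local_session(threads=2)` followed by `spark.range(5).count()` should run a small job entirely on the local machine; call `spark.stop()` when finished.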

## ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Latest Release - Spark 3

In June 2020 a new Spark major release (3.0.0) was made public.

The major improvements in the new release are:
* Improved SQL Engine: 2x performance improvement on TPC-DS over Spark 2.4 
    * Adaptive query execution (AQE)
    * Dynamic partition pruning
    * Improved ANSI SQL compliance
    * Join Hints
* pandas API improvements:
    * Python type hints
    * Additional pandas UDF types
    * Improved error handling
* Up to 40x speedups for calling R user-defined functions
* ....

[Official release page](https://spark.apache.org/releases/spark-release-3-0-0.html) with the complete list of improvements.

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.