# Introduction to Apache Spark

## What is Spark?

- an Apache foundation **open source** project; not a product
- enables highly **iterative** analysis on **massive** volumes of data at scale
- an **in-memory computing** engine that works with **distributed** data, not a data store
- **unified environment** for data scientists, developers and data engineers
- radically simplifies the process of developing **intelligent apps** fuelled by data

## Spark Motivation

Today's popular programming models for clusters transform data as it flows from stable storage to stable storage.

Example: MapReduce

<div style="display: inline-block; text-align: left; margin: 15px 0px 15px 0px;"><img src='assets/mapReduceExample.jpg' alt='MapReduce example' width='580'/></div>
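
The MapReduce data flow can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code: the word-count task, the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample input are assumptions chosen for the example, and may differ from what the figure shows.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    # Data flows one way: input -> map -> shuffle -> reduce -> output.
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input standing in for files on stable storage.
lines = ["spark is fast", "mapreduce is stable", "spark is general"]
counts = word_count(lines)
print(counts)
```

In a real cluster the input and the final counts would both live on stable storage, and each phase would run as many parallel tasks on different machines.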

**Benefits of data flow:** the runtime can decide where to run tasks and can automatically recover from failures

- Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:
    - **Iterative** algorithms (many in machine learning)
    - **Interactive** data mining tools (R, Excel, Python)
- Spark makes working sets a first-class concept to efficiently support these apps
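
The cost of re-reading a working set can be illustrated with a toy example in plain Python. This is only a sketch under stated assumptions: `StableStorage` is a hypothetical stand-in for a distributed file system that counts its reads, and the iterative algorithm (estimating a mean by gradient steps) is chosen purely for brevity. None of it is Spark code.

```python
class StableStorage:
    """Hypothetical stand-in for a distributed file system; counts reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def mean_by_gradient_descent(get_data, iterations=20, step=0.5):
    """Estimate the mean of the data by repeatedly stepping towards it."""
    estimate = 0.0
    for _ in range(iterations):
        data = get_data()  # every iteration needs the full data set
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= step * gradient
    return estimate


storage = StableStorage([2.0, 4.0, 6.0, 8.0])

# Acyclic data flow: every iteration starts again from stable storage.
reloaded = mean_by_gradient_descent(storage.load)
reads_without_cache = storage.reads

# Working set: load once, keep it in memory, reuse it in every iteration.
storage.reads = 0
working_set = storage.load()
cached = mean_by_gradient_descent(lambda: working_set)
reads_with_cache = storage.reads

print(reads_without_cache, reads_with_cache)
```

Both runs compute the same estimate, but the first touches storage once per iteration while the second touches it only once in total. On a cluster, where each read means disk and network I/O, that difference tends to dominate the running time of iterative jobs.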

## Spark Goal

- provide distributed memory abstractions for clusters to support apps with working sets
- retain the attractive properties of MapReduce
    1. fault tolerance (for crashes & stragglers)
    2. data locality
    3. scalability

**Solution:** augment data flow model with "resilient distributed datasets" (RDDs)
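
One way to picture how an RDD can keep data in memory and still tolerate failures is that it remembers how each partition was derived (its lineage), so a lost partition can be recomputed rather than restored from a replica. The sketch below imitates that idea in a single Python process. `ToyRDD` and its methods are hypothetical names for illustration, not Spark's API or implementation, and unlike Spark it caches every computed partition by default.

```python
class ToyRDD:
    """A tiny single-process imitation of an RDD (illustrative only)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent       # lineage: the dataset this one derives from
        self.fn = fn               # lineage: how each partition is derived
        self._source = partitions  # only set for the root dataset
        self._cache = {}           # partition index -> in-memory result

    @property
    def num_partitions(self):
        if self.parent is None:
            return len(self._source)
        return self.parent.num_partitions

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is missing."""
        if i not in self._cache:
            if self.parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self.fn(self.parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node crash that drops one in-memory partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = evens_squared.collect()
evens_squared.lose_partition(1)   # pretend the node holding partition 1 died
after = evens_squared.collect()   # partition 1 is rebuilt from its parent

print(before, after)
```

Only the lost partition is recomputed; partition 0 is served from memory. Recomputing per partition is also what makes it possible to rerun a slow (straggling) task elsewhere.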

## Apache Spark is ...

<div style="float:right; margin: 15px 0px 15px 0px;"><img src='assets/logistic-regression.png' alt='Logistic regression in Hadoop and Spark' width='250'/></div>

### 1. Fast
- leverages aggressively cached in-memory distributed computing and JVM threads
- faster than MapReduce (up to 100x on some in-memory, iterative workloads)
- Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine


### 2. Ease of use (for programmers)
- written in Scala, an object-oriented, functional programming language
- Scala, Python, and Java APIs
- Scala and Python interactive shells
- runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud

<div style="float:right; margin: 15px 0px 15px 0px;"><img src='assets/spark-stack.png' alt='Spark stack' width='250'/></div>

### 3. General purpose
- covers a wide range of workloads
- provides SQL, streaming and complex analytics
- powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming
- you can combine these libraries seamlessly in the same application

## Spark Stack

<div style="display: inline-block; text-align: left; margin: 15px 0px 15px 0px;"><img src='assets/spark-complete-stack.png' alt='Spark complete stack' width='580'/></div>