# Table of Contents

- [I. What is Apache Spark?](#What-is-Apache-Spark?)
- [II. Spark Jobs and APIs](#Spark-Jobs-and-APIs)
- [III. RDDs, DataFrames, and Datasets](#RDDs,-DataFrames,-and-Datasets)
- [IV. Catalyst Optimizer](#Catalyst-Optimizer)

  

# What is Apache Spark?

Apache Spark is an open-source, powerful, distributed querying and processing engine.

It provides the flexibility and extensibility of MapReduce, but at significantly higher speeds.

Apache Spark allows the user to:
- read, transform, and aggregate data
- train and deploy sophisticated statistical models


The Spark APIs are accessible in:
- Java
- Scala
- Python
- R 
- SQL

Apache Spark can be used to:
- build applications
- package them up as libraries to be deployed on a cluster
- perform quick analytics interactively through notebooks:
 - Jupyter
 - Spark-Notebook
 - Databricks notebooks
 - Apache Zeppelin
 
Apache Spark exposes a host of libraries familiar to data analysts, data scientists or researchers who have worked with Python's ```pandas``` or R's ```data.frames``` or ```data.tables```.

Note: There are some differences between pandas or data.frames/data.tables and Spark DataFrames.

Apache Spark also ships with several already implemented and tuned algorithms, statistical models, and frameworks: MLlib and ML for machine learning, GraphX and GraphFrames for graph processing, and Spark Streaming (DStreams and Structured Streaming). Spark allows the user to combine these libraries seamlessly in the same application.

Apache Spark can easily run locally on a laptop, yet it can also easily be deployed in standalone mode, on YARN, or on Apache Mesos, either on your local cluster or in the cloud. It can read from and write to diverse data sources including (but not limited to) HDFS, Apache Cassandra, Apache HBase, and S3:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_01.jpg)

*Source: Apache Spark is the smartphone of Big Data http://bit.ly/1QsgaNj*



# Spark Jobs and APIs
[back to top](#Table-of-Contents)


## Execution process

Any Spark application spins off a single driver process (that can contain multiple jobs) on the **master node**, which then directs **executor** processes (that contain multiple tasks) distributed to a number of **worker nodes**.

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_02.jpg)

The driver process determines the number and the composition of the task processes directed to the executor nodes based on the graph generated for the given job.

Note: Any worker node can execute tasks from a number of different jobs.
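The driver/executor split described above can be mimicked on a single machine. The following is a minimal sketch in plain Python, not Spark code: `split_into_partitions`, `run_task`, and `driver` are hypothetical names, and a thread pool merely stands in for the executor processes on worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks (one per task)."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """A task: the unit of work applied to a single partition."""
    return sum(x * x for x in partition)


def driver(data, num_partitions=4, num_executors=2):
    """Plan one task per partition, hand the tasks out, combine the results."""
    partitions = split_into_partitions(data, num_partitions)
    # Assumption: worker threads play the role of executors here.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_sums = list(executors.map(run_task, partitions))
    return sum(partial_sums)


total = driver(list(range(10)))  # sum of squares of 0..9
```

Note that the number of tasks (four partitions) is independent of the number of executors (two), so each executor ends up running more than one task.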


A Spark job is associated with a chain of object dependencies organized in a **directed acyclic graph (DAG)**, such as the following example generated from the Spark UI. Given this, Spark can optimize the scheduling (for example, determine the number of tasks and workers required) and execution of these tasks:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_03.jpg)
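To see why a DAG helps with scheduling, consider the sketch below. It is not how Spark's scheduler is implemented; it only uses the standard-library `graphlib` module (Python 3.9+) on a made-up job, whose step names are hypothetical, to show that the dependency graph alone tells you which steps can run in parallel and which must wait.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each key is a step, each value is the set of steps
# that must finish before it can start.
dependencies = {
    "read_logs": set(),
    "read_users": set(),
    "filter_errors": {"read_logs"},
    "join": {"filter_errors", "read_users"},
    "aggregate": {"join"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Group the steps into "waves": every step in a wave has all of its
# dependencies satisfied, so the steps of one wave could run concurrently.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Here the two read steps have no dependencies and land in the first wave together, while `join` has to wait for both of its inputs.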

# RDDs, DataFrames, and Datasets

## Resilient Distributed Dataset
[back to top](#Table-of-Contents)

Spark is built around a distributed collection of immutable Java Virtual Machine (JVM) objects called **Resilient Distributed Datasets (RDDs)**.

In PySpark, it is important to note that the Python data is stored within these JVM objects, which allow any job to perform calculations very quickly.

RDDs are calculated against, cached, and stored in memory.

At the same time, RDDs expose some coarse-grained transformations such as:
- ```map(...)```
- ```reduce(...)```
- ```filter(...)```

RDDs have two sets of parallel operations:
- **transformations** (which return pointers to new RDDs) and
- **actions** (which return values to the driver after running a computation)


RDD transformations are lazy in the sense that they do not compute their results immediately. They are computed only when an action is executed and the results need to be returned to the driver. This delayed execution lets Spark produce more fine-tuned queries, that is, queries optimized for performance.
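The difference between transformations and actions is easier to see in a toy model. The class below is a plain-Python sketch, not the PySpark API: `LazyCollection` is a hypothetical name, and it only imitates the idea that `map(...)` and `filter(...)` record work while `collect()` and `reduce(...)` actually perform it.

```python
from functools import reduce as fold


class LazyCollection:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of (kind, function) pairs, not yet executed

    # Transformations: return a new object and do no work.
    def map(self, fn):
        return LazyCollection(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyCollection(self._source, self._steps + (("filter", fn),))

    def _run(self):
        # Chain the recorded steps using the built-in lazy map/filter iterators.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the recorded pipeline and return a plain value.
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


calls = []


def square(x):
    calls.append(x)  # record when the function really runs
    return x * x


numbers = LazyCollection(range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # nothing has executed yet
result = pipeline.collect()       # the action triggers the work
```

Because each transformation returns a new object and leaves the original untouched, `numbers` can still be reused for a different pipeline, which loosely mirrors the immutability of RDDs.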

## DataFrames
[back to top](#Table-of-Contents)

DataFrames, like RDDs, are immutable collections of data distributed among the nodes in a cluster. However, unlike in RDDs, the data in DataFrames is organized into named columns.


DataFrames were designed to make processing large data sets even easier. They allow developers to formalize the structure of the data, allowing higher-level abstraction; in that sense DataFrames resemble tables from the relational database world. DataFrames provide a domain-specific language API to manipulate the distributed data and make Spark accessible to a wider audience, beyond specialized data engineers.

One of the major benefits of DataFrames is that the Spark engine initially builds a logical execution plan and then executes generated code based on a physical plan determined by a cost optimizer. Unlike RDDs, which can be significantly slower in Python than in Java or Scala, DataFrames perform comparably across languages, because the same optimized plan is executed regardless of the language used to express the query.


## Datasets

The goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and benefits of the robust Spark SQL execution engine. 



# Catalyst Optimizer
[back to top](#Table-of-Contents)

Spark SQL is one of the most technically involved components of Apache Spark, as it powers both SQL queries and the DataFrame API. At the core of Spark SQL is the Catalyst Optimizer. The optimizer is based on functional programming constructs and was designed with two purposes in mind:
- to ease the addition of new optimization techniques and features to Spark SQL, and
- to allow external developers to extend the optimizer (for example, adding data-source-specific rules, support for new data types, and so on):

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_04.jpg)
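The idea of an optimizer built from small, pluggable rules can be illustrated with a toy example. Catalyst itself is written in Scala and works on far richer trees; the sketch below is plain Python, and every name in it (`Literal`, `Column`, `Add`, `fold_constants`, `drop_zero`, `transform_up`) is hypothetical. It only shows that each rule is a function from tree node to tree node, so extending the optimizer amounts to appending another function to a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(node):
    """Rule: Add(Literal, Literal) becomes a single Literal."""
    if (isinstance(node, Add)
            and isinstance(node.left, Literal)
            and isinstance(node.right, Literal)):
        return Literal(node.left.value + node.right.value)
    return node


def drop_zero(node):
    """Rule: x + 0 becomes x."""
    if isinstance(node, Add) and node.right == Literal(0):
        return node.left
    return node


def transform_up(node, rules):
    """Rewrite the children first, then apply every rule to the rebuilt node."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rules), transform_up(node.right, rules))
    for rule in rules:
        node = rule(node)
    return node


rules = [fold_constants, drop_zero]

# Hypothetical expression: price + (1 + 2)
expression = Add(Column("price"), Add(Literal(1), Literal(2)))
optimized = transform_up(expression, rules)
```

After the rewrite, the inner `1 + 2` has been collapsed, so the expression is `price + 3`; none of the rules needs to know about the others.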