# Table of Contents

- [I. What is Apache Spark?](#What-is-Apache-Spark?)
- [II. Spark Jobs and APIs](#Spark-Jobs-and-APIs)
- [III. RDDs, DataFrames, and Datasets](#RDDs,-DataFrames,-and-Datasets)
- [IV. Catalyst Optimizer](#Catalyst-Optimizer)
- [V. Spark 2.0 architecture](#Spark-2.0-architecture)

  

# What is Apache Spark?

Apache Spark is an open-source, powerful, distributed querying and processing engine.

It provides the flexibility and extensibility of MapReduce, but at significantly higher speeds.

Apache Spark allows the user to:
- read, transform, and aggregate data
- train and deploy sophisticated statistical models


The Spark APIs are accessible in:
- Java
- Scala
- Python
- R 
- SQL

Apache Spark can be used to:
- build applications
- package them up as libraries to be deployed on a cluster
- perform quick analytics interactively through notebooks:
 - Jupyter
 - Spark-Notebook
 - Databricks notebooks
 - Apache Zeppelin
 
Apache Spark exposes a host of libraries familiar to data analysts, data scientists or researchers who have worked with Python's ```pandas``` or R's ```data.frames``` or ```data.tables```.

Note: There are some differences between pandas or data.frames/data.tables and Spark DataFrames.

Apache Spark also ships with several already implemented and tuned algorithms, statistical models, and frameworks: MLlib and ML for machine learning, GraphX and GraphFrames for graph processing, and Spark Streaming (both DStreams and Structured Streaming). Spark allows the user to combine these libraries seamlessly in the same application.

Apache Spark can easily run locally on a laptop, yet can also easily be deployed in standalone mode, over YARN, or on Apache Mesos, either on your local cluster or in the cloud. It can read from and write to a diverse range of data sources including (but not limited to) HDFS, Apache Cassandra, Apache HBase, and S3:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_01.jpg)

*Source: Apache Spark is the smartphone of Big Data http://bit.ly/1QsgaNj*



# Spark Jobs and APIs
[back to top](#Table-of-Contents)


## Execution process

Any Spark application spins off a single driver process (that can contain multiple jobs) on the **master node**. The driver then directs **executor** processes (that contain multiple tasks) distributed across a number of **worker nodes**:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_02.jpg)

The driver process determines the number and the composition of the task processes directed to the executor nodes based on the graph generated for the given job.

Note: Any worker node can execute tasks from a number of different jobs.


A Spark job is associated with a chain of object dependencies organized in a **directed acyclic graph (DAG)**, such as the following example generated from the Spark UI. Given this graph, Spark can optimize the scheduling (for example, determine the number of tasks and workers required) and the execution of these tasks:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_03.jpg)
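
To make the idea of scheduling from a DAG more concrete, here is a minimal pure-Python sketch. It is not Spark's scheduler: the stage names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the real planning logic. It only shows how a dependency graph tells you which pieces of work can run in parallel and which must wait.

```python
from graphlib import TopologicalSorter

# Hypothetical stages of one job: each key depends on the stages in its set.
stage_dependencies = {
    "read_logs": set(),
    "read_users": set(),
    "filter_errors": {"read_logs"},
    "join": {"filter_errors", "read_users"},
    "aggregate": {"join"},
}


def schedule_in_waves(dependencies):
    """Group stages into waves; stages in the same wave do not depend
    on each other, so they could be executed in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as finished, unlocking successors
    return waves


waves = schedule_in_waves(stage_dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {wave}")
```

In this toy graph the two read stages land in the first wave, while the join has to wait for both of its inputs.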

# RDDs, DataFrames, and Datasets

## Resilient Distributed Dataset
[back to top](#Table-of-Contents)

Spark is built around a distributed collection of immutable Java Virtual Machine (JVM) objects called **Resilient Distributed Datasets (RDDs)**.

In PySpark, it is important to note that the Python data is stored within these JVM objects, and these objects allow any job to perform calculations very quickly.

RDDs are:
- calculated against
- cached
- stored in-memory

At the same time, RDDs expose some coarse-grained transformations such as:
- ```map(...)```
- ```reduce(...)```
- ```filter(...)```

RDDs have two sets of parallel operations:
- **transformations** (which return pointers to new RDDs) and
- **actions** (which return values to the driver after running a computation)


RDD transformation operations are lazy in the sense that they do not compute their results immediately. The transformations are only computed when an action is executed and the results need to be returned to the driver. This delayed execution results in more fine-tuned queries, that is, queries that are optimized for performance.
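
The difference between lazy transformations and actions can be sketched with a toy class in plain Python. This is only an illustration of the idea, not how Spark implements RDDs: `LazyCollection` and `square` are hypothetical names, and everything runs in a single process.

```python
from functools import reduce as fold


class LazyCollection:
    """Toy stand-in for an RDD: transformations are only recorded,
    and the recorded steps run when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source  # original data, never modified
        self._steps = steps    # recorded transformations, not yet applied

    # Transformations: return a new LazyCollection and compute nothing.
    def map(self, fn):
        return LazyCollection(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyCollection(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Chain the recorded steps over the source as lazy iterators.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the recorded steps and return a value.
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


calls = []


def square(x):
    calls.append(x)  # record when the function is actually invoked
    return x * x


numbers = LazyCollection(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has been computed

total = evens_squared.reduce(lambda a, b: a + b)  # the action triggers the work
print(calls_before_action, total)
```

Because the work is deferred until `reduce(...)` is called, the whole chain of steps is known before anything runs, which is what gives an engine room to plan the execution.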

## DataFrames
[back to top](#Table-of-Contents)

DataFrames, like RDDs, are immutable collections of data distributed among the nodes in a cluster. However, unlike RDDs, in DataFrames data is organized into named columns.


DataFrames were designed to make the processing of large data sets even easier. They allow developers to formalize the structure of the data, allowing higher-level abstraction; in that sense DataFrames resemble tables from the relational database world. DataFrames provide a domain-specific language API to manipulate the distributed data and make Spark accessible to a wider audience, beyond specialized data engineers.

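One of the major benefits of DataFrames is that the Spark engine initially builds a logical execution plan and executes generated code based on a physical plan determined by a cost optimizer. Unlike RDDs, which can be significantly slower in Python than in Java or Scala, DataFrames perform comparably across languages, because the same optimized plan is executed regardless of the language used to express the query.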


## Datasets

The goal of Spark Datasets is to provide an API that allows users to easily express transformations on domain objects, while also providing the performance and benefits of the robust Spark SQL execution engine. 



# Catalyst Optimizer
[back to top](#Table-of-Contents)

Spark SQL is one of the most technically involved components of Apache Spark, as it powers both SQL queries and the DataFrame API. At the core of Spark SQL is the Catalyst Optimizer. The optimizer is based on functional programming constructs and was designed with two purposes in mind:
- to ease the addition of new optimization techniques and features to Spark SQL and
- to allow external developers to extend the optimizer (for example, adding data-source-specific rules, support for new data types, and so on):

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_04.jpg)
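
The phrase "based on functional programming constructs" can be illustrated with a toy rule-based optimizer. Catalyst itself is written in Scala and is far more elaborate; the node classes and rule names below are hypothetical and only mirror the general idea of immutable expression trees rewritten by small, composable rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def transform_up(node, rule):
    """Apply `rule` to every node, children first, returning a new tree."""
    if isinstance(node, Add):
        node = Add(transform_up(node.left, rule), transform_up(node.right, rule))
    return rule(node)


def fold_constants(node):
    """Rule: Literal + Literal becomes a single Literal."""
    if isinstance(node, Add) and isinstance(node.left, Literal) and isinstance(node.right, Literal):
        return Literal(node.left.value + node.right.value)
    return node


def drop_zero(node):
    """Rule: x + 0 and 0 + x become x."""
    if isinstance(node, Add) and node.right == Literal(0):
        return node.left
    if isinstance(node, Add) and node.left == Literal(0):
        return node.right
    return node


def optimize(tree, rules):
    # Each rule is a plain function, so adding an optimization
    # means appending one more function to the list.
    for rule in rules:
        tree = transform_up(tree, rule)
    return tree


# x + (1 + 2) is rewritten to x + 3
expression = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = optimize(expression, [fold_constants, drop_zero])
print(optimized)
```

Keeping each rule as an independent function over an immutable tree is what makes this style easy to extend: a new rule does not need to know about the existing ones.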

# Spark 2.0 architecture
[back to top](#Table-of-Contents)


## Unifying Datasets and DataFrames

The history of the Spark APIs is shown in the following diagram, which traces the progression from RDD to DataFrame to Dataset:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_06.jpg)

*Source: From Webinar Apache Spark 1.5: What is the difference between a DataFrame and a RDD? http://bit.ly/29JPJSA*

As you can see from the following diagram, DataFrame and Dataset both belong to the new Dataset API introduced as part of Apache Spark 2.0:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_07.jpg)

*Source: A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets http://bit.ly/2accSNA*

## Introducing SparkSession
[back to top](#Table-of-Contents)

In the past, you would work with ```SparkConf```, ```SparkContext```, ```SQLContext```, and ```HiveContext``` for configuration, the Spark context, the SQL context, and the Hive context, respectively. The ```SparkSession``` combines these into a single entry point. Structured Streaming is also reached through it, although DStream-based applications still create a ```StreamingContext```.

The ```SparkSession``` is now the entry point for reading data, working with metadata, configuring the session, and managing the cluster resources.

 
## Structured Streaming
[back to top](#Table-of-Contents)
  
  
As Reynold Xin said during Spark Summit East 2016:
>"The simplest way to perform streaming analytics is not having to reason about streaming."

This is the underlying foundation for building Structured Streaming. While streaming is powerful, one of the key issues is that streaming can be difficult to build and maintain. While companies such as Uber, Netflix, and Pinterest have Spark Streaming applications running in production, they also have dedicated teams to ensure the systems are highly available.

### Spark Streaming: What Is It and Who’s Using It?

![](https://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/11/spark-streaming-datanami-300x169.png)
*Spark Streaming ecosystem: Spark Streaming can consume static and streaming data from various sources, process data using Spark SQL and DataFrames, apply machine learning techniques from MLlib, and finally push out results to external data storage systems.*


Streaming data is likely collected and used in batch jobs when generating daily reports and updating models. This means that a modern stream processing pipeline needs to be built, taking into account not just the real-time aspect, but also the associated pre-processing and post-processing aspects (e.g. model building).

Before Spark Streaming, building complex pipelines that encompass streaming, batch, or even machine learning capabilities with open source software meant dealing with multiple frameworks, each built for a niche purpose, such as Storm for real-time actions, Hadoop MapReduce for batch processing, etc.

Besides the pain of developing with disparate programming models, there was a huge cost of managing multiple frameworks in production. Spark and Spark Streaming, with their unified programming model and processing engine, make all of this much simpler.
 
### Why Spark Streaming is Being Adopted Rapidly
[back to top](#Table-of-Contents)
  
Spark Streaming, added to Apache Spark in 2013, is an extension of the core Spark API that allows data engineers and data scientists to process real-time data from various sources like Kafka, Flume, and Amazon Kinesis. Its key abstraction is a Discretized Stream or, in short, a DStream, which represents a stream of data divided into small batches. DStreams are built on RDDs, Spark’s core data abstraction. This allows Spark Streaming to seamlessly integrate with any other Spark components like MLlib and Spark SQL.

- This unification of disparate data processing capabilities is the key reason behind Spark Streaming’s rapid adoption. 
 - It makes it very easy for developers to use a single framework to satisfy all the processing needs. 
 - They can use MLlib (Spark’s machine learning library) to train models offline and directly use them online for scoring live data in Spark Streaming. In fact, some models perform continuous, online learning, and scoring. 
 - Furthermore, data from streaming sources can be combined with a very large range of static data sources available through Spark SQL. For example, static data from Amazon Redshift can be loaded in memory in Spark and used to enrich the streaming data before pushing to downstream systems.

- Last but not least, all the data collected can be later post-processed for report generation or queried interactively for ad-hoc analysis using Spark. 
 - The code and business logic can be shared and reused between streaming, batch, and interactive processing pipelines. 
 - In short, developers and system administrators can spend less time learning, implementing, and maintaining different frameworks, and focus on developing smarter applications.
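
The DStream idea described above, a stream divided into small batches that are each processed with ordinary batch logic, can be sketched in plain Python. This is a conceptual toy, not the Spark Streaming API: `discretize`, the timestamps, and the event values are all hypothetical, and events are assumed to arrive sorted by time.

```python
from collections import Counter


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Each batch covers `batch_interval` seconds. Intervals without
    events produce empty batches, as a periodic scheduler would.
    """
    batches = []
    current, window_end = [], batch_interval
    for timestamp, value in events:
        # Close every window that ended before this event arrived.
        while timestamp >= window_end:
            batches.append(current)
            current, window_end = [], window_end + batch_interval
        current.append(value)
    batches.append(current)  # flush the last, still open window
    return batches


events = [
    (0.2, "spark"), (0.7, "kafka"),
    (1.1, "spark"), (1.4, "spark"),
    (3.5, "flume"),
]
batches = discretize(events, batch_interval=1.0)

# The same batch-style function is applied to every micro-batch.
word_counts = [dict(Counter(batch)) for batch in batches]
for index, counts in enumerate(word_counts):
    print(f"batch {index}: {counts}")
```

Because each micro-batch is just a small collection, the per-batch function here could be reused unchanged on a full historical dataset, which is the code-sharing benefit the list above refers to.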
 
 
### Streaming Use Cases – From Uber to Pinterest

While each business puts Spark Streaming into action in different ways, depending on their overall objectives and business case, there are four broad ways Spark Streaming is being used today.


![uber_logo](https://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/10/uber_logo.png)

- Streaming ETL – Data is continuously cleaned and aggregated before being pushed into data stores.
- Triggers – Anomalous behavior is detected in real-time and further downstream actions are triggered accordingly. E.g. unusual behavior of sensor devices generating actions.
- Data enrichment – Live data is enriched with more information by joining it with a static dataset allowing for a more complete real-time analysis.
- Complex sessions and continuous learning – Events related to a live session (e.g. user activity after logging into a website or application) are grouped together and analyzed. In some cases, the session information is used to continuously update machine learning models.
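
Two of the patterns listed above, data enrichment and triggers, can be sketched on a single micro-batch in plain Python. This is a toy with hypothetical device names, readings, and threshold; in Spark the same logic would be expressed as a join and a filter on each batch.

```python
# Static reference data (hypothetical device registry).
device_cities = {"sensor-1": "Berlin", "sensor-2": "Lagos"}


def enrich(batch, lookup):
    """Data enrichment: attach the city registered for each event's device."""
    return [
        {**event, "city": lookup.get(event["device"], "unknown")}
        for event in batch
    ]


def find_anomalies(batch, threshold):
    """Trigger: return the events whose reading exceeds the threshold."""
    return [event for event in batch if event["reading"] > threshold]


# One micro-batch of live events (made-up values).
micro_batch = [
    {"device": "sensor-1", "reading": 21.5},
    {"device": "sensor-2", "reading": 98.0},
    {"device": "sensor-9", "reading": 20.1},
]

enriched = enrich(micro_batch, device_cities)
alerts = find_anomalies(enriched, threshold=90.0)
print(alerts)
```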


Looking at specific use cases, Uber collects terabytes of event data every day from its mobile users for real-time telemetry analytics. By building a continuous ETL pipeline using Kafka, Spark Streaming, and HDFS, Uber can convert the raw unstructured event data into structured data as it is collected, making it ready for further complex analytics. Similarly, Pinterest built an ETL data pipeline starting with Kafka, which feeds data into Spark via Spark Streaming to provide immediate insight into how users are engaging with Pins across the globe in real time. This helps Pinterest become a better recommendation engine for showing related Pins as people use the service to plan products to buy, places to go, recipes to cook, and more. Netflix, in turn, receives billions of events per day from various sources and has used Kafka and Spark Streaming to build a real-time engine that provides movie recommendations to its users.

Source: https://www.datanami.com/2015/11/30/spark-streaming-what-is-it-and-whos-using-it/

### Continuous applications
[back to top](#Table-of-Contents)

Spark 2.0 can aggregate data in a stream and then serve it over traditional JDBC/ODBC, change queries at run time, and build and apply ML models across scenarios with a variety of latency requirements:

![](https://www.safaribooksonline.com/library/view/learning-pyspark/9781786463708/graphics/B05793_01_10.jpg)

*Source: Apache Spark Key Terms, Explained https://databricks.com/blog/2016/06/22/apache-spark-key-terms-explained.html.*
