# Introduction to Apache Spark

## What is Spark?

Apache Spark is a general-purpose, lightning-fast cluster computing platform.

- an Apache Software Foundation **open source** project, not a product
- **unified environment** for data scientists, developers and data engineers
- enables highly **iterative** analysis on **massive** volumes of data at scale
- an **in-memory computing** engine that works with **distributed** data, not a data store

Spark is designed to integrate with the wider big data ecosystem; for example, it can access any Hadoop data source.\
It takes the Hadoop MapReduce model to the next level.

### Its relation with Hadoop

The Hadoop framework took hold of the market because it is based on a simple programming model (MapReduce, introduced by Google). Companies use it extensively as a scalable, flexible, fault-tolerant and cost-effective computing solution.

But Hadoop was designed for batch processing => no real-time data streaming

<u>So, is Spark another competing technology to the famous framework?</u>

Actually, **Spark is the natural evolution of Hadoop**, introduced to speed up Hadoop-style computation. Compared with Hadoop MapReduce, it is commonly cited as:

- up to 100 times faster in memory
- up to 10 times faster on disk

This enables iterative queries and, through Spark Streaming, real-time data analytics.

However, Spark is independent of Hadoop => it has its **own cluster management system**.\
When the two are combined, Spark typically uses Hadoop only for storage.

### Why choose Apache Spark?

Before Spark, there was no general-purpose computing engine in the industry; each kind of workload needed its own tool:

1. Batch processing: Hadoop MapReduce.
2. Stream processing: Apache Storm / S4.
3. Interactive processing: Apache Impala / Apache Tez.
4. Graph processing: Neo4j / Apache Giraph.

Hence, there was no single powerful engine that could process data in both real-time and batch mode. On top of that, big data is characterized by its velocity, volume, variety, value, and veracity, so it needs to be processed at high speed.

That is where Spark comes in ★


### Spark Features

**1. Fast Processing:** saves time by reducing the number of read-write operations to disk.

**2. Flexibility:** supports multiple languages, allowing developers to write applications in **Java, Scala, R, or Python.**

**3. In-memory processing:** stores data in RAM => quick access => faster analytics.

**4. Real-time processing:** designed to process real-time streaming data => instant outcomes.

**5. Better analytics:** more than 80 high-level operators that go beyond MapReduce, with support for:

- SQL queries
- Streaming data
- Machine learning algorithms
- Graph algorithms
    
**6. Compatibility with Hadoop:** it can also run on top of Hadoop, so it can read existing Hadoop data.

**7. Easy to manage:** a complete data analysis engine, fully integrated in the same cluster.

- Hadoop provides only the batch-processing engine, so other workloads need separate engines.
- With Spark there is no need to manage a different component for each task.

**8. Fault-tolerant:** Spark RDDs are designed to handle the failure of any worker node in the cluster, keeping data loss at zero.

- designed to cover a wide range of workloads such as batch apps, iterative algorithms, interactive queries and streaming

> Note: the Resilient Distributed Dataset (RDD) is the fundamental, immutable data structure of Spark.


### Spark Motivation

Currently popular programming models for clusters transform data flowing from stable storage to stable storage.

Example: MapReduce

<div style="display: inline-block; text-align: left; margin: 15px 0px 15px 0px;"><img src='../assets/mapReduceExample.jpg' alt='MapReduce example' width='580'/></div>
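As a rough illustration of the model (not a reproduction of the figure above), a MapReduce-style job can be sketched in plain Python as a map, a shuffle and a reduce phase. The word-count task, the function names and the sample input are all assumptions chosen for illustration; a real Hadoop job would distribute these phases across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values collected for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files read from stable storage.
    sample = ["spark extends mapreduce", "mapreduce reads stable storage"]
    print(word_count(sample))
```

Each phase only consumes the output of the previous one, which is what makes the data flow acyclic.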

**Benefits of data flow:** the runtime can decide where to run tasks and can automatically recover from failures

- Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:
    - **Iterative** algorithms (many in machine learning)
    - **Interactive** data mining tools (R, Excel, Python)
- Spark makes working sets a first-class concept to efficiently support these apps
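The cost of reusing a working set under an acyclic data flow can be made concrete with a small plain-Python sketch. `StableStorage` is a hypothetical stand-in for a disk or HDFS dataset that simply counts how often it is read, and the "algorithm" is a toy iterative estimate of the mean; neither comes from Spark itself.

```python
class StableStorage:
    """Hypothetical stand-in for a disk/HDFS dataset that counts its reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def iterate_from_storage(storage, iterations):
    """Acyclic data flow: every iteration re-reads its input from stable storage."""
    estimate = 0.0
    for _ in range(iterations):
        data = storage.load()
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


def iterate_with_working_set(storage, iterations):
    """Working-set style: load once, keep the data in memory, reuse it each iteration."""
    data = storage.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


if __name__ == "__main__":
    reloaded = StableStorage([1, 2, 3])
    cached = StableStorage([1, 2, 3])
    iterate_from_storage(reloaded, 10)
    iterate_with_working_set(cached, 10)
    print("reads when reloading:", reloaded.reads)
    print("reads with working set:", cached.reads)
```

Both functions compute the same result; only the number of reads from storage differs, and that difference grows with the number of iterations.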

### Spark Goal

- provide distributed memory abstractions for clusters to support apps with working sets
- retain the attractive properties of MapReduce
    1. fault tolerance (for crashes & stragglers)
    2. data locality
    3. scalability

**Solution:** augment data flow model with "resilient distributed datasets" (RDDs)
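To give an intuition for how an RDD can keep data in memory and still be fault tolerant, here is a toy, single-process sketch of the idea: each dataset is immutable, split into partitions, and remembers how it was derived (its lineage), so a lost partition can be recomputed rather than restored from a replica. `ToyRDD` and all of its methods are hypothetical names for illustration; this is not Spark's API or implementation.

```python
class ToyRDD:
    """Toy imitation of an RDD: immutable, partitioned, rebuilt from lineage."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute  # function: partition index -> list of records
        self._cache = {}         # partition index -> records kept in memory

    @classmethod
    def from_list(cls, data, num_partitions):
        data = tuple(data)  # the source stands in for stable storage
        return cls(num_partitions, lambda i: list(data[i::num_partitions]))

    def map(self, fn):
        # A transformation returns a NEW ToyRDD whose compute function
        # refers back to this one: that reference is the lineage.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)])

    def filter(self, pred):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if pred(x)])

    def partition(self, i):
        # Compute lazily on first use, then serve the in-memory copy.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a worker failure that wipes one in-memory partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


if __name__ == "__main__":
    numbers = ToyRDD.from_list(range(10), num_partitions=2)
    even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
    before = even_squares.collect()
    even_squares.lose_partition(0)   # "crash" of the node holding partition 0
    after = even_squares.collect()   # partition 0 is recomputed from lineage
    print(before == after)
```

The design choice to note is that recovery needs only the recorded transformations and the parent data, not a second copy of the derived dataset.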

### Apache Spark is ...

<div style="float:right; margin: 15px 0px 15px 0px;"><img src='../assets/logistic-regression.png' alt='Logistic regression in Hadoop and Spark' width='250'/></div>

#### 1. Fast
- leverages aggressively cached in-memory distributed computing and JVM threads
- faster than MapReduce (can run workloads up to 100x faster)
- Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine


#### 2. Ease of use (for programmers)
- written in Scala, an object-oriented, functional programming language
- Scala, Python, and Java APIs
- Scala and Python interactive shells
- runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud

<div style="float:right; margin: 15px 0px 15px 0px;"><img src='../assets/spark-stack.png' alt='Spark stack' width='250'/></div>

#### 3. General purpose
- covers a wide range of workloads
- provides SQL, streaming and complex analytics
- powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming
- you can combine these libraries seamlessly in the same application

### Spark Stack

- Spark Core Engine
- Spark SQL
- Spark Streaming
- MLlib
- GraphX

Spark Core is the base on which the other components are built and run.

Spark SQL is the module for working with structured data.

Spark Streaming is the component that lets Spark process data in real time.

MLlib equips Spark with a machine learning library. It contains a wide range of algorithms.

GraphX is the library Spark uses to work with graphs and run computations on them.

<div style="display: inline-block; text-align: left; margin: 15px 0px 15px 0px;"><img src='../assets/spark-complete-stack.png' alt='Spark complete stack' width='580'/></div>


- real-time data processing
    - thanks to Spark Streaming, data can be processed in real time
    - unlike Hadoop MapReduce, which only processes data already stored in HDFS

