# Internal Details of Spark
#### Agenda
* Understanding Cluster
* Driver & Executors
* Partitions
* Spark Entities - Application, Job, Stages & Task
* Resilient Distributed Datasets (RDDs)
* Spark DataFrames

## 1. Understanding Cluster
<hr>
* In a distributed computing environment, **systems have to be interconnected to exchange information**. Such an interconnected set of systems is known as a **cluster**.
* The systems are typically connected through a network switch.
* A cluster needs to be managed.
* Spark has its own **cluster manager** (standalone mode).
* Other supported cluster managers include Mesos and YARN.
<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/23586e93bfb6846c5bbd0a5c14f2ed4a1e1ec40a/2-Figure1-1.png" width="300px">

## 2. Driver & Executors
<hr>
<img src="https://spark.apache.org/docs/latest/img/cluster-overview.png">
* In production deployments, Spark runs on a cluster.
* Spark follows a master-slave architecture: the driver acts as the master and the executors as the workers.
* Spark applications run as independent sets of processes on a cluster, coordinated by the program running on the **driver node**.
* The **SparkContext** object is part of the driver program and coordinates the Spark processes.
* To control processes across the cluster, SparkContext connects to a cluster manager (Spark's own cluster manager, YARN or Mesos).
* One responsibility of the cluster manager is to allocate resources to executors (processes on worker nodes in which data processing and storage happen).
* Once resources such as CPU and memory are allocated to the **executors**, the application code is sent to them.
* Tasks run on the executors and are scheduled by the driver.
* The driver program listens for connections from its executors throughout its lifetime.

## 3. Data Partitioning
<hr>
* Data is split into partitions.
* Each machine can hold one or more partitions.
* The number of partitions is configurable. Too many or too few can both hurt performance.
* A partition never spans multiple machines.
* Typical partition sizes are 64 MB, 128 MB or 256 MB.
* Increasing the number of partitions allows higher parallelism.
* However, network communication is expensive in distributed computing, and managing too many tasks adds scheduling overhead.
* The right balance of partitions maximizes resource usage.
* There are two common ways of partitioning data:
  - Hash partitioning (`HashPartitioner`): tries to distribute data evenly based on the hash value of the key.
  - Range partitioning (`RangePartitioner`): tuples whose keys fall in the same range end up in the same partition.
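
The two strategies above can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark's actual implementation: the helper names `hash_partition` and `range_partition`, the sample keys and the range boundaries are all made up for illustration.

```python
from bisect import bisect_left


def hash_partition(key, num_partitions):
    """Pick a partition from the key's hash value (hypothetical helper)."""
    return hash(key) % num_partitions


def range_partition(key, upper_bounds):
    """Pick the first range whose upper bound is >= key (hypothetical helper).

    upper_bounds is a sorted list holding (number of partitions - 1) boundaries.
    """
    return bisect_left(upper_bounds, key)


keys = [1, 2, 3, 10, 11, 12, 20, 21, 22]

# Hash partitioning into 3 partitions: neighbouring keys get scattered.
hash_parts = [[] for _ in range(3)]
for k in keys:
    hash_parts[hash_partition(k, 3)].append(k)

# Range partitioning with boundaries 9 and 19: keys <= 9, 10..19, and > 19.
range_parts = [[] for _ in range(3)]
for k in keys:
    range_parts[range_partition(k, [9, 19])].append(k)

print(hash_parts)   # [[3, 12, 21], [1, 10, 22], [2, 11, 20]]
print(range_parts)  # [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
```

Integer keys are used on purpose, because Python's built-in `hash` is deterministic for small integers. Hash partitioning spreads neighbouring keys apart, while range partitioning keeps them together, which helps when the data will later be sorted or scanned by key range.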

## 4. Spark Entities
<hr>
* **Application** - A complete user program built on Spark, consisting of a driver program and executors on the cluster.
* **Job** - A parallel computation triggered by an action (for example `collect()` or `count()`). One application can run many jobs.
* **Stages** - A job is divided into stages. Work within a stage runs in parallel, but stages that depend on each other run sequentially.
* **Task** - The unit of work that a stage runs on a single partition. A stage has one task per partition; how many tasks run at the same time is limited by the cores available on the executors.
<img src="https://cdn-images-1.medium.com/max/1000/1*wiXLNwwMyWdyyBuzZnGrWA.png" width="600px">
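
The relationship between partitions, tasks and cores can be worked through with a little arithmetic. This is a rough sketch under stated assumptions: every task occupies exactly one core, and the stage sizes and core count below are invented for illustration. The function names are hypothetical, not part of Spark.

```python
import math


def tasks_in_job(partitions_per_stage):
    """One task per partition in each stage, so the job's task count is the sum."""
    return sum(partitions_per_stage)


def waves_for_stage(num_partitions, total_cores):
    """Rounds of tasks a stage needs when each core runs one task at a time."""
    return math.ceil(num_partitions / total_cores)


# Hypothetical job: stage 0 reads 8 partitions, stage 1 works on 4 partitions.
stages = [8, 4]
total_cores = 4  # assumed: 2 executors with 2 cores each

total_tasks = tasks_in_job(stages)
waves = [waves_for_stage(p, total_cores) for p in stages]

print(total_tasks)  # 12
print(waves)        # [2, 1]
```

With 8 partitions and only 4 cores, the first stage needs two rounds of tasks, so adding partitions beyond the available cores does not increase the number of tasks running at once.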

## 5. Resilient Distributed Datasets (RDDs)
<hr>
* The fundamental, low-level data structure around which Spark revolves.
* RDDs are immutable.
* An RDD records how its data is split into partitions and where each partition lives.

    `rdd = sc.parallelize([1,2,3,4,5],2)`
    
    `rdd.glom().collect()`
    
<img src="https://github.com/awantik/machine-learning-slides/blob/master/rdd.PNG?raw=true">

## 6. Spark DataFrames
<hr>
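* A distributed, tabular data structure.
* Spark DataFrames scale across a cluster, unlike pandas DataFrames, which live in the memory of a single machine.
* They are immutable.
* RDD-based APIs such as MLlib's have moved into maintenance mode, and DataFrames are the preferred API for new code.
* The Spark engine is responsible for generating optimized RDD operations from DataFrame or Spark SQL code.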

    `df = spark.createDataFrame([(1,2),(2,3),(4,3),(7,8)],['A','B'])`
    
    `display(df)`