#  Spark Architecture and Applications 

![](https://miro.medium.com/max/1400/1*arBqq7O7umskV4O7JjhdrA.jpeg)

# /etc/rc.d/rc.sysinit

In [1]:
import findspark
findspark.init()  # locate the local Spark installation so that pyspark can be imported
import pyspark
conf = pyspark.SparkConf().setAppName('Tap').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
sc

# Spark Variables

![](https://preview.redd.it/jqsehmlbl1i01.jpg?width=640&crop=smart&auto=webp&s=9c9a383364946a3b3fe490420eaca4d74ac09249)

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. 

These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. 

Supporting general, read-write shared variables across tasks would be inefficient.

However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
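The copy semantics described above can be pictured in plain Python, with no Spark involved. In this minimal sketch `pickle` stands in for the serialization Spark performs when it ships a task, and `run_task_remotely` and `add_to_counter` are hypothetical helpers invented for the illustration.

```python
import pickle

def run_task_remotely(func, state, partition):
    """Simulate shipping a task: the state is serialized and the task works on its own copy."""
    remote_state = pickle.loads(pickle.dumps(state))  # the "remote" copy
    for x in partition:
        func(remote_state, x)
    return remote_state

def add_to_counter(state, x):
    state["counter"] += x

driver_state = {"counter": 0}
partitions = [[1, 2], [3, 4]]
remote_copies = [run_task_remotely(add_to_counter, driver_state, p) for p in partitions]

print(driver_state["counter"])                # 0: the driver never sees the updates
print([s["counter"] for s in remote_copies])  # [3, 7]: each task updated only its own copy
```

Each simulated task ends up with its own partial result, while the variable on the driver side is unchanged. This is the gap that broadcast variables and accumulators are meant to fill.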

## Broadcast

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. 

They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. 

Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. 

Spark automatically broadcasts the common data needed by tasks within each stage. 

The data broadcasted this way is cached in serialized form and deserialized before running each task. 

This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

In [3]:
broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value

[1, 2, 3]
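In practice a broadcast variable is usually read inside a function passed to a transformation. The sketch below assumes a small dictionary that every task needs as a lookup table. The helper name `names_for_codes` and the sample data are made up for illustration, while `broadcast`, `.value`, `parallelize`, `map` and `collect` are the standard PySpark calls.

```python
def names_for_codes(sc, codes, code_to_name):
    """Translate codes to names on the executors using a broadcast lookup table."""
    # Cache the dictionary on each executor instead of shipping it with every task.
    lookup = sc.broadcast(code_to_name)
    return (
        sc.parallelize(codes)
        # Tasks read the cached copy through .value and never modify it.
        .map(lambda code: lookup.value.get(code, "unknown"))
        .collect()
    )
```

With the `sc` created at the top of this notebook, `names_for_codes(sc, ["IT", "FR", "XX"], {"IT": "Italy", "FR": "France"})` should return `['Italy', 'France', 'unknown']`.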

## Accumulators

![](https://upload.wikimedia.org/wikipedia/en/4/4e/Electro_%28Max_Dillon%29.png)

Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel. 

They can be used to implement counters (as in MapReduce) or sums. 

Spark natively supports accumulators of numeric types, and programmers can add support for new types.

For accumulator updates performed inside **actions** only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. 

In **transformations**, users should be aware that each task’s update may be applied more than once if tasks or job stages are re-executed.
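One easy way to observe this caveat is to run two actions on the same uncached RDD, which makes Spark recompute the transformation. The function below is a sketch under that assumption. Its name and the helper `tag` are hypothetical, and it expects an existing SparkContext to be passed in as `sc`.

```python
def count_updates_across_two_actions(sc):
    """Return the accumulator value after the first and after the second action on one RDD."""
    seen = sc.accumulator(0)

    def tag(x):
        seen.add(1)  # side effect inside a transformation
        return x

    mapped = sc.parallelize([1, 2, 3, 4]).map(tag)  # lazy: nothing runs yet

    mapped.count()             # first action computes the map
    after_first = seen.value
    mapped.count()             # second action recomputes it, since nothing is cached
    after_second = seen.value
    return after_first, after_second
```

Called with a live context, this should return `(4, 8)`: the four elements are counted once per action. Calling `mapped.cache()` before the first action would typically keep the second value at 4, because the cached partitions are reused instead of recomputed.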

In [4]:
accum = sc.accumulator(0)
accum


Accumulator<id=0, value=0>

In [5]:
data=sc.parallelize([1, 2, 3, 4])
data.foreach(lambda x: accum.add(x))

In [6]:
accum.value

10

Accumulators do not change the lazy evaluation model of Spark. 

If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. 

Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map(). 

In [7]:
accum = sc.accumulator(0) # Reset
accum

Accumulator<id=1, value=0>

In [12]:
data.collect()

[1, 2, 3, 4]

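In [14]:
def g(x):
    accum.add(x)
    return x
data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
accum

Accumulator<id=1, value=0>

In [15]:
data.map(g).foreach(lambda x: accum.add(1))

# What is the value of accum?

In [16]:
accum

Accumulator<id=1, value=14>

The map adds 1 + 2 + 3 + 4 = 10 and the foreach adds 1 for each of the 4 elements, so a clean run from the reset ends at 14. Re-running these cells without resetting accum keeps adding to the previous value.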

# Spark Application

[Spark Documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html)

A Spark application consists of a **driver** program that runs the user’s main function and executes various parallel operations on a cluster. 



![](https://miro.medium.com/max/700/1*B9lbB8uU7a_Xi0a1uDImRw.jpeg)

## Driver

[Anatomy](https://medium.com/@meenakshisundaramsekar/anatomy-of-a-spark-application-in-a-nutshell-2e542d5f334e)
![](images/anatomyofsparkapp.png)

The life of Spark programs starts and ends with the Spark Driver. 

The Spark driver is the process that clients use to submit a Spark program. 

The driver is also responsible for planning and executing the Spark program and for returning status and results to the client.

[Apache Spark Architecture](https://www.dezyre.com/article/apache-spark-architecture-explained-in-detail/338)
![](images/sparkarchiteture.png)

It is the central point and the entry point of the Spark Shell (Scala, Python, and R). 

The driver program runs the main() function of the application and is the place where the Spark Context is created. 

The Spark driver contains various components responsible for translating the user's Spark code into actual Spark jobs executed on the cluster.

![](https://i.kym-cdn.com/photos/images/original/000/933/846/223.jpg)

## DAG Scheduler

[SparkBasic](https://medium.com/@goyalsaurabh66/spark-basics-rdds-stages-tasks-and-dag-8da0f52f0454)

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. 

It transforms a logical execution plan (i.e. RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).

```scala
val input = sc.textFile("log.txt")
val splitedLines = input
  .map(line => line.split(" "))
  .map(words => (words(0), 1))
  .reduceByKey { (a, b) => a + b }
```

![](https://miro.medium.com/max/700/1*1WfneX6c7Lc9fqAaR9MaGA.png)
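To make the idea of stage-oriented scheduling concrete, here is a toy model in plain Python. It is a deliberate simplification and not how DAGScheduler is implemented: the real scheduler walks a graph of RDD dependencies rather than a flat list, and the map-side part of a shuffle operation runs in the earlier stage. The names `WIDE_OPS` and `split_into_stages` are hypothetical.

```python
# Transformations that need a shuffle (wide dependencies); an illustrative subset.
WIDE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}

def split_into_stages(ops):
    """Group a linear chain of operation names into stages, cutting before each wide op."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary: start a new stage
        stages[-1].append(op)
    return stages

# The chain of operations from the Scala snippet above.
lineage = ["textFile", "map", "map", "reduceByKey"]
for i, stage in enumerate(split_into_stages(lineage)):
    print(f"stage {i}: {' -> '.join(stage)}")
```

For the snippet above the toy model yields two stages: `textFile -> map -> map`, which can be pipelined on each partition, and `reduceByKey`, which has to wait for the shuffle.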

## TaskScheduler

The TaskScheduler sends tasks to the cluster, runs them, retries them if there are failures, and mitigates stragglers.
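The retry behaviour can be pictured with a plain-Python loop. This is only a sketch of the idea, not Spark's scheduler code: `run_with_retries` and `flaky` are hypothetical names, and the default of four attempts is chosen to mirror the default of Spark's `spark.task.maxFailures` setting.

```python
def run_with_retries(task, max_failures=4):
    """Run task() until it succeeds, giving up after max_failures failed attempts."""
    last_error = None
    for attempt in range(1, max_failures + 1):
        try:
            return task(), attempt
        except Exception as err:
            last_error = err  # remember the failure and try again
    raise RuntimeError(f"task failed {max_failures} times") from last_error

attempts_seen = []

def flaky():
    """Fail on the first two attempts, then succeed."""
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise IOError("executor lost")
    return "done"

result, attempts = run_with_retries(flaky)
print(result, attempts)  # done 3
```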

## BackendScheduler 

Spark comes with a pluggable backend mechanism called scheduler backend (aka backend scheduler) to support various cluster managers, e.g. Apache Mesos, Hadoop YARN or Spark’s own Spark Standalone and Spark local.

These cluster managers differ in their task scheduling modes and resource offer mechanisms, and Spark’s approach is to abstract the differences behind the SchedulerBackend contract.

## BlockManager 

Spark's storage system is managed by the BlockManager, which runs in both driver and executor instances.

It is a key-value store of blocks of data (block storage), each identified by a block ID.

Among the types of data stored in blocks we can find:
- RDD 
- shuffle: in this category we can distinguish shuffle data, shuffle index and temporary shuffle files (intermediate results)
- broadcast: broadcast data is organized in blocks too
- task results
- stream data
- temp data (including swap)
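The key-value idea can be sketched with a dictionary keyed by block ID. This toy class is not the BlockManager API. `ToyBlockStore` is a hypothetical name, and the ID strings are only modeled on the kind-prefixed naming Spark uses for blocks, so treat their exact format as an assumption.

```python
class ToyBlockStore:
    """In-memory key-value store mapping block IDs to block contents."""

    def __init__(self):
        self._blocks = {}

    def put(self, block_id, data):
        self._blocks[block_id] = data

    def get(self, block_id):
        return self._blocks.get(block_id)  # None if the block is unknown

    def ids_of_kind(self, kind):
        """List the stored block IDs whose prefix matches the given kind."""
        return sorted(b for b in self._blocks if b.startswith(kind + "_"))

store = ToyBlockStore()
store.put("rdd_0_0", [1, 2])                # partition 0 of an RDD (assumed ID format)
store.put("rdd_0_1", [3, 4])                # partition 1 of the same RDD
store.put("broadcast_0", {"IT": "Italy"})   # a broadcast value

print(store.ids_of_kind("rdd"))   # ['rdd_0_0', 'rdd_0_1']
print(store.get("broadcast_0"))   # {'IT': 'Italy'}
```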

# Repetita Iuvant

https://databricks.com/glossary/what-are-spark-applications
![](https://databricks.com/wp-content/uploads/2018/05/Spark-Applications.png)

The driver process:

- runs your main() function

- sits on a node in the cluster

- is responsible for three things: 
   1. maintaining information about the Spark Application;
   2. responding to a user’s program or input; 
   3. and analyzing, distributing, and scheduling work across the executors (defined momentarily). 

The driver process is absolutely essential: it is the heart of a Spark Application and maintains all relevant information during the lifetime of the application.

## Spark Context 

The Spark Context represents the application's connection to the cluster. It is created by the Spark driver for each Spark application when it is first submitted by the user, and it exists throughout the entire life of the application.

It allows the Spark driver to access the cluster through a cluster resource manager, and it can be used to create RDDs, accumulators and broadcast variables on the cluster. The Spark Context also keeps track of live executors through regular heartbeat messages.

By convention it is bound to the variable name sc.

The Spark Context terminates once the spark application completes. Only one Spark Context can be active per JVM. You must stop() the active Spark Context before creating a new one.

![](https://mallikarjuna_g.gitbooks.io/spark/diagrams/sparkcontext-createtaskscheduler.png)

In [11]:
sc.stop()

## Example in Yarn

 https://luminousmen.com/post/spark-anatomy-of-spark-application

![](https://luminousmen.com/media/spark-yarn-architecture.jpg)

# Deploy

The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. 

It can use all of Spark’s supported cluster managers through a uniform interface so you don’t have to configure your application specially for each one.

# Package

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime. 

Once you have an assembled jar, you can call the bin/spark-submit script as shown in the Launch section below, passing your jar.

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
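Building the .zip can be scripted with the standard library. The sketch below assumes your helper modules live in a single package directory. The function name `build_py_files_zip` and the directory name `helpers` used in the note after the code are hypothetical.

```python
import zipfile
from pathlib import Path

def build_py_files_zip(package_dir, zip_path):
    """Zip every .py file under package_dir, keeping the package folder as the top-level entry."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for source in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the parent so that `import <package>` works from the zip.
            archive.write(source, source.relative_to(package_dir.parent).as_posix())
    return zip_path
```

After `build_py_files_zip("helpers", "helpers.zip")`, the archive can be passed along with the application, for example `spark-submit --py-files helpers.zip simpleapp.py`.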

# Launch

Once a user application is bundled, it can be launched using the bin/spark-submit script. 

This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:
```bash
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```

Some of the commonly used options are:

* --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
* --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
* --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
* --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes.
* application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
* application-arguments: Arguments passed to the main method of your main class, if any

```bash
# Run application locally on 8 cores
C:\Dev\spark-2.4.5-bin-hadoop2.7\bin>spark-submit.cmd --class org.apache.spark.examples.SparkPi --master local[8] ..\examples\jars\spark-examples_2.11-2.4.5.jar 10000
```
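When spark-submit is driven from a script, it can help to assemble the argument list programmatically. The helper below is a hypothetical sketch that covers only the options listed above. The jar path and the `spark.executor.memory` value are sample inputs, not recommendations.

```python
import shlex

def build_spark_submit(app, master, deploy_mode="client", main_class=None, conf=None, app_args=()):
    """Return the spark-submit argument list for the given options."""
    argv = ["./bin/spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if main_class:
        argv += ["--class", main_class]  # only needed for JVM applications
    for key, value in (conf or {}).items():
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)                       # the application jar or .py file
    argv += [str(a) for a in app_args]     # arguments passed to the application itself
    return argv

argv = build_spark_submit(
    app="examples/jars/spark-examples_2.11-2.4.5.jar",
    master="local[8]",
    main_class="org.apache.spark.examples.SparkPi",
    conf={"spark.executor.memory": "2g"},
    app_args=[10000],
)
print(shlex.join(argv))  # shell-quoted form, handy for logging
```

Keeping the arguments as a list means they can be handed to `subprocess.run(argv)` without worrying about shell quoting of values such as `local[8]`.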

More in https://spark.apache.org/docs/latest/submitting-applications.html

# Tap Spark

- Code in tap/spark/code is copied into docker 

- Dataset is inside spark and linked into tap root (check previous example), reference as spark/dataset...

- [Hint] Test on machine and then test on Docker (including dependencies)

- https://github.com/apache/spark/tree/master/examples/src/main/python is a good source

# Run Python Example in Docker
./sparkTap.sh simpleapp.py

# Biblio

- https://medium.com/@meenakshisundaramsekar/anatomy-of-a-spark-application-in-a-nutshell-2e542d5f334e
- https://medium.com/luckspark/scala-spark-tutorial-1-hello-world-7e66747faec
- https://luminousmen.com/post/spark-anatomy-of-spark-application
- https://medium.com/@goyalsaurabh66/spark-basics-rdds-stages-tasks-and-dag-8da0f52f0454
- http://cds.iisc.ac.in/wp-content/uploads/DS256.2017.L17.Spark_.Execution.pdf
- https://www.waitingforcode.com/apache-spark/apache-spark-blocks-explained/read

- https://towardsdatascience.com/a-neanderthals-guide-to-apache-spark-in-python-9ef1f156d427
- https://www.toptal.com/apache/apache-spark-streaming-twitter
- https://medium.com/codait/real-time-sentiment-analysis-of-twitter-hashtags-with-spark-7ee6ca5c1585
- https://towardsdatascience.com/youtube-data-analysis-using-pyspark-85b7cd07216f
- https://medium.com/@aieeshashafique/exploratory-data-analysis-using-pyspark-dataframe-in-python-bd55c02a2852
- https://github.com/ramyananth/Music-Recommender-System-using-ALS-Algorithm-with-Apache-Spark-and-Python/blob/master/recommender_ALS_Spark_Python.ipynb
- https://github.com/moorissa/audiorecommender