 <img src="uva_seal.png"> 

## Running Spark on a Cluster

### University of Virginia
### DS 5110: Big Data Systems
### Last Updated: Feb 16, 2021

---  

### SOURCES 

1. Learning Spark, Chapter 7: Running on a Cluster

### OBJECTIVES
- Learn how to run distributed Spark
- Learn about some of the common deployment environments


### CONCEPTS AND FUNCTIONS
- Cluster manager (Hadoop YARN, Apache Mesos, Standalone)
- Driver and worker/executor
- Spark application
- Directed acyclic graph (DAG)
- Build tool
- Assembly JAR

---  

### Spark Architecture

One benefit of Spark is the ability to scale computation by adding more machines and running in cluster mode.

The *driver* is in charge of coordinating the workers.

The *workers* / *executors* receive code and data and do the processing, sending results back to the driver.

Driver + Workers = Spark application

### Driver

`main()` method of program runs on driver

Converts the program into a logical *directed acyclic graph* (DAG) of operations

Converts the DAG into tasks

Coordinates scheduling of tasks on executors (like a manager)
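
To make the idea of a DAG of operations concrete, here is a toy sketch in plain Python. It is not Spark's scheduler, and the operation names are hypothetical; it only shows how a dependency graph can be turned into an order in which every operation runs after the operations it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical operations; each maps to the set of operations it depends on.
dag = {
    "read_lines": set(),
    "split_words": {"read_lines"},
    "read_stopwords": set(),
    "drop_stopwords": {"split_words", "read_stopwords"},
    "count_by_word": {"drop_stopwords"},
}

# static_order() yields the operations so that dependencies always come first.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Operations with no dependency between them (here `read_lines` and `read_stopwords`) may appear in either order, which is what allows independent work to be scheduled in parallel.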

### Executors

Run the individual tasks

Launch at start of application and run for lifetime of app

Provide in-memory (RAM) storage for RDDs

### Cluster Manager

External service that Spark relies on to launch executors (and, in some deployments, the driver).  

Spark is packaged with the Standalone cluster manager.

Manages the resources between Spark applications.  
Can manage queues if there is more demand than resources for executors.
 
### Launching a Program
`spark-submit` is called to launch a Spark app

**Run in local mode using single core**

`$ bin\spark-submit --master local python_scripts\textAnalysis1.py`

**Run in local mode using 4 cores**

`$ bin\spark-submit --master local[4] python_scripts\textAnalysis1.py`

**Run in local mode using all cores**

`$ bin\spark-submit --master local[*] python_scripts\textAnalysis1.py`

**Run on Spark Standalone cluster at default port**

`$ bin\spark-submit --master spark://host:7077 python_scripts\textAnalysis1.py`

**Run on Spark Standalone cluster at default port, specifying memory to allocate**

`$ bin\spark-submit --master spark://host:7077 --executor-memory 10g python_scripts\textAnalysis1.py`

**Generic Form to run Spark App**

`$ bin\spark-submit [options] <app jar | python file> [app options]`

Flags can be given in short or long format, `-shortflag` and 
`--longflag` respectively  

The flags control scheduling information and dependencies such as libraries and files

For a list of all flags issue:  
`bin\spark-submit --help`
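
The generic form above can be sketched as a small Python helper that assembles the argument list in the order `[options] <app> [app options]`. The function name `build_submit_command` and its parameters are hypothetical, and the sketch only builds the command; it does not run it.

```python
import shlex

def build_submit_command(app, master="local[*]", options=None, app_args=None,
                         spark_submit="bin/spark-submit"):
    """Return an argv list: spark-submit [options] <app> [app options]."""
    argv = [spark_submit, "--master", master]
    # Options go before the application file, as --flag value pairs.
    for flag, value in (options or {}).items():
        argv += [f"--{flag}", str(value)]
    argv.append(app)
    # Anything after the application file is passed to the application itself.
    argv += list(app_args or [])
    return argv

cmd = build_submit_command(
    "python_scripts/textAnalysis1.py",
    master="spark://host:7077",
    options={"executor-memory": "10g"},
)
print(shlex.join(cmd))  # shell-quoted form of the command
```

Keeping the command as a list avoids quoting problems with values such as `local[*]`, which a shell would otherwise try to expand.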


### Spark Web UI

Spark comes with a built-in Web UI.  There are several tabs such as `Jobs` and `Stages` which provide details about the running application.  Useful information such as resources used at each stage of the computation is available here.

When running jobs locally (*local mode*), you should be able to view the UI at this URL:  
http://localhost:4040/jobs/


<img src="spark_app_mgr.png">  
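
The same information shown in the Web UI is also exposed as JSON through Spark's monitoring REST API, which is not covered in these notes. As a minimal sketch, assuming an application is running locally with the UI on port 4040, the helpers below (hypothetical names) build the endpoint URL and return the application list, or `None` if nothing is listening.

```python
import json
from urllib.request import urlopen

def ui_api_url(host="localhost", port=4040):
    """URL of the REST endpoint that lists applications known to this UI."""
    return f"http://{host}:{port}/api/v1/applications"

def list_applications(host="localhost", port=4040, timeout=2.0):
    """Return the parsed JSON list, or None if the UI is not reachable."""
    try:
        with urlopen(ui_api_url(host, port), timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection errors; ValueError covers invalid JSON.
        return None
```

Calling `list_applications()` while a local Spark job is running should return a list of entries describing the application; when no job is running it returns `None`.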

### Packaging Code and Dependencies  

**Python**  
PySpark uses the Python installation on the worker machines, so libraries can be installed there with `pip`  
Can also submit libraries using the `--py-files` argument to `spark-submit`  
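
`--py-files` accepts `.py`, `.zip`, and `.egg` files. As a rough sketch of preparing such an archive, the snippet below zips a throwaway package so that it can be imported by name on the workers. The package name `helpers`, the module `cleaning.py`, and the function `zip_python_package` are all made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

def zip_python_package(package_dir, zip_path):
    """Zip every .py file under package_dir, keeping the package folder name."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w") as zf:
        for py_file in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the parent so imports use "helpers.<module>".
            zf.write(py_file, str(py_file.relative_to(package_dir.parent)))
    return zip_path

# Demo with a throwaway package in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "helpers"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "cleaning.py").write_text("def lower(s):\n    return s.lower()\n")
    archive = zip_python_package(pkg, Path(tmp) / "helpers.zip")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

The resulting archive would then be passed along with the job, for example `spark-submit --py-files helpers.zip python_scripts\textAnalysis1.py`.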

**Java and Scala**   
Users will submit individual JAR files using `--jars`  
For a large set of dependencies, it is better to use a build tool (`sbt` or `Maven`) to package all dependencies into one JAR called the *assembly JAR*.

`Maven` reads a pom.xml file containing the build definition.

A *Project Object Model* or *POM* is the fundamental unit of work in `Maven`. It is an XML file that contains information about the project and configuration details used by `Maven` to build the project. It contains default values for most projects.

https://maven.apache.org/guides/introduction/introduction-to-the-pom.html

Packaging a Spark application built with `Maven` is straightforward.    

**Build the assembly JAR and submit it**


---  
```
$ mvn package          # create the assembly JAR

# The assembly JAR will be placed in the target directory
$ bin\spark-submit --master local … target\name_of_assembly.jar
```
---  

### Hadoop YARN

**Y**et **A**nother **R**esource **N**egotiator 

`YARN` is a cluster manager introduced in `Hadoop 2.0`  
It does the following:
- allocates system resources to various applications running in a `Hadoop` cluster.  
- schedules tasks to be executed on different cluster nodes  

`YARN` is typically installed on the same nodes as `HDFS`, making it quicker to access data.  

To use `YARN` in Spark, set an environment variable that points to the `Hadoop` config directory, then submit jobs to a special master URL with `spark-submit`.

---  
```
export HADOOP_CONF_DIR="..."
spark-submit --master yarn appname
```
---  

By default, Spark on `YARN` uses 2 executors, so you will likely need to change the setting with the flag:  
`--num-executors`
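
The two steps (set `HADOOP_CONF_DIR`, then submit to the `yarn` master) can also be driven from Python. This is a sketch only: the function name `yarn_submit_plan` is hypothetical, `/path/to/hadoop/conf` is a placeholder rather than a real location, and nothing is executed unless the commented-out `subprocess.run` line is used.

```python
import os

def yarn_submit_plan(app, hadoop_conf_dir, num_executors=2):
    """Return (env, argv) for submitting app to YARN; nothing is executed here."""
    # Copy the current environment and point Spark at the Hadoop config directory.
    env = dict(os.environ, HADOOP_CONF_DIR=hadoop_conf_dir)
    argv = ["spark-submit", "--master", "yarn",
            "--num-executors", str(num_executors), app]
    return env, argv

# Placeholder path and app name; replace with values for your cluster.
env, argv = yarn_submit_plan("appname", "/path/to/hadoop/conf", num_executors=8)
# To actually launch: subprocess.run(argv, env=env, check=True)
print(argv)
```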

<img src="yarn.png">  

`Resource Manager` accepts jobs from users, schedules them and allocates resources  

`Node Manager` monitors the node and provides reporting

`Application Master` is created for each application to negotiate for resources and work with the `Node Manager` to execute and monitor tasks.

`Containers` are controlled by the `Node Manager` and assigned system resources.


### Amazon EC2 (Elastic Compute Cloud)

Spark 1.x (the version covered in Learning Spark) ships with a built-in script to launch clusters on EC2: `spark-ec2`. Since Spark 2.0 the script is no longer bundled with Spark.

You will need an Amazon Web Services (AWS) account  
Export the *access key ID* and *secret access key* as environment variables  
By default, launching the cluster produces one master and one worker  
Storage: Spark EC2 clusters include two installations of `HDFS`  
See Learning Spark, p. 136, for details

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

**AWS**  
If you are interested in learning about the AWS products - which are comprehensive across the cloud computing space - there is an AWS Free Tier.

Please refer here for details:  
https://aws.amazon.com/free/?all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc

From AWS:  
*Explore more than 60 products and start building on AWS using the free tier. Three different types of free offers are available depending on the product used.*

- Always free
- 12 months free
- Trials

There will be a separate course document outlining some optional work in AWS.

**Amazon Elastic MapReduce (EMR)**

`Amazon EMR` provides a managed `Hadoop` framework on AWS for parallel, distributed, elastic processing of large amounts of data. `EMR` leverages `S3`, Amazon's elastic, highly reliable cloud storage product (covered later in the course). 
  
Here is a very short overview (1 min) of EMR:  
https://www.youtube.com/watch?v=AM8WZb2Xj2g

As a separate assignment, you will watch a video providing a deeper dive into `Amazon EMR`.