# Appendix A: Apache Spark @ NERSC

Author: **Julien Peloton** [@JulienPeloton](https://github.com/astrolabsoftware/spark-tutorials/issues/new?body=@JulienPeloton)  
Last Verified to Run: **2018-10-25**  

This notebook is for [DESC](http://www.lsst-desc.org) members who want to run Apache Spark at [NERSC](https://www.nersc.gov).

__Learning objectives__

- Apache Spark and HPC machines
- Batch jobs @ NERSC
- JupyterLab @ NERSC

## Apache Spark and HPC machines

Spark's standalone mode and file-system-agnostic approach also make it a candidate for processing data stored in HPC-style shared file systems such as Lustre.
The use of Spark on HPC systems was explored, for example, in [1, 2] in the context of the HDF5 and FITS data formats. In a future tutorial, I might present the similarities and differences between using Spark on an HTC cluster and on an HPC supercomputer.

[1] Liu, Jialin and Racah, Evan and Koziol, Quincey and Canon, Richard Shane, H5Spark: bridging the I/O gap between Spark and scientific data formats on HPC systems, Cray User Group (2016).  
[2] Peloton, Julien and Arnault, Christian and Plaszczynski, Stéphane, FITS Data Source for Apache Spark, Computing and Software for Big Science (arXiv:1804.07501).
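
As a minimal sketch of what this file-system-agnostic approach can look like in practice, the function below reads one HDU of a FITS file into a Spark DataFrame through the spark-fits connector of [2]. It assumes an active `SparkSession` started with the spark-fits package, and that the connector is registered under the data source name `fits` with an `hdu` reader option. The function name `load_fits_hdu` and the example path are hypothetical.

```python
def load_fits_hdu(spark, path, hdu=1):
    """Read one HDU of a FITS file into a Spark DataFrame.

    Assumptions (not stated in this notebook): `spark` is an active
    SparkSession started with the spark-fits package, which is assumed to
    expose the data source name "fits" and the reader option "hdu".
    """
    return (
        spark.read
        .format("fits")      # data source name assumed from spark-fits
        .option("hdu", hdu)  # index of the HDU to load
        .load(path)          # path on the shared file system
    )


# Hypothetical usage in a notebook where `spark` already exists:
#   df = load_fits_hdu(spark, "file:///path/to/catalog.fits", hdu=1)
#   df.printSchema()
```

The same call should work whether the path points to a local disk or to a shared file system mounted on the compute nodes, which is the point of the approach described above.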

## Apache Spark batch jobs @ NERSC

NERSC provides support to run Spark at scale. For Spark versions 2.3.0 and later, Spark runs inside of [Shifter](http://www.nersc.gov/research-and-development/user-defined-images/). Complete information is available at [spark-distributed-analytic-framework](http://www.nersc.gov/users/data-analytics/data-analytics-2/spark-distributed-analytic-framework/). You can use both the Scala and Python APIs. For the Scala API, you will also need to load the `sbt` module to compile your libraries.

## JupyterLab & Apache Spark @ NERSC

We provide kernels to work with Apache Spark and DESC. To get a DESC Python + Apache Spark kernel, follow these steps:

```bash
# Clone the repo
git clone https://github.com/astrolabsoftware/spark-kernel-nersc.git
cd spark-kernel-nersc

# Where the Spark logs will be stored.
# Logs can then be browsed from the Spark UI.
LOGDIR="${SCRATCH}/spark/event_logs"
mkdir -p "${LOGDIR}"

# Resource to use. Here we will use 4 threads.
RESOURCE=local[4]

# Extra libraries (comma separated if many) to use.
SPARKFITS=com.github.astrolabsoftware:spark-fits_2.11:0.7.1

# Create the kernel - it will be stored under
# ${HOME}/.ipython/kernels/<kernelname>
python makekernel.py \
  -kernelname desc-pyspark --desc \
  -pyspark_args "--master ${RESOURCE} \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file://${LOGDIR} \
  --conf spark.history.fs.logDirectory=file://${LOGDIR} \
  --packages ${SPARKFITS}"
```
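
To make the structure of the `-pyspark_args` string explicit, in particular the comma-separated `--packages` list, here is a small sketch in Python that assembles the same options. `build_pyspark_args` is a hypothetical helper for illustration only; it is not part of `spark-kernel-nersc`, and the log directory below is a placeholder.

```python
def build_pyspark_args(master, log_dir, packages=()):
    """Return the option string given to `-pyspark_args` in the shell example.

    Hypothetical helper for illustration; not part of spark-kernel-nersc.
    """
    log_uri = f"file://{log_dir}"
    args = [
        f"--master {master}",
        "--conf spark.eventLog.enabled=true",
        f"--conf spark.eventLog.dir={log_uri}",
        f"--conf spark.history.fs.logDirectory={log_uri}",
    ]
    if packages:
        # Several Maven coordinates are joined with commas, without spaces.
        args.append("--packages " + ",".join(packages))
    return " ".join(args)


print(build_pyspark_args(
    master="local[4]",
    log_dir="/path/to/scratch/spark/event_logs",  # placeholder path
    packages=["com.github.astrolabsoftware:spark-fits_2.11:0.7.1"],
))
```

Adding a second library only means appending another Maven coordinate to the `packages` list.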

Once the kernel is created, select `desc-pyspark` in the JupyterLab [interface](https://jupyter-dev.nersc.gov/).
More information can be found at [spark-kernel-nersc](https://github.com/astrolabsoftware/spark-kernel-nersc).
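
If the kernel does not show up, one quick check is to look at the kernel specification written to disk. The sketch below assumes that `makekernel.py` produces a standard Jupyter `kernel.json` file in the kernel directory mentioned above; `read_kernel_spec` is a hypothetical helper, not part of `spark-kernel-nersc`.

```python
import json
from pathlib import Path


def read_kernel_spec(kernelname, kernels_dir=None):
    """Load the kernel.json of a Jupyter kernel and return it as a dict.

    By default the spec is looked up under ~/.ipython/kernels/<kernelname>,
    the location given in the kernel creation step (assumed layout).
    """
    if kernels_dir is None:
        kernels_dir = Path.home() / ".ipython" / "kernels"
    spec_file = Path(kernels_dir) / kernelname / "kernel.json"
    if not spec_file.is_file():
        raise FileNotFoundError(f"No kernel spec found at {spec_file}")
    with spec_file.open() as f:
        return json.load(f)


# Hypothetical usage, after the kernel has been created:
#   spec = read_kernel_spec("desc-pyspark")
#   print(spec.get("display_name"))
#   print(spec.get("env", {}))  # environment variables set for the kernel, if any
```

A `FileNotFoundError` here suggests the kernel was created under a different name or location than expected.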