<img src="https://swan.web.cern.ch/sites/swan.web.cern.ch/files/pictures/logo_swan_letters.png" alt="SWAN" style="float: left; width: 25%; margin-right: 15%; margin-left: 15%; margin-bottom: 2.0em;">
<img src="http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png" alt="Apache Spark" style="float: left; width: 25%; margin-right: 10%; margin-bottom: 2.0em;">
<p style="clear: both;">
# **Integration of SWAN with Spark clusters**
<hr style="border-top-width: 4px; border-top-color: #34609b;">

This notebook demonstrates how a SWAN server lets you offload computations to an external Spark cluster. The code below works with Spark version 2.2.0 or higher. We will connect to the Spark cluster that was previously selected in the SWAN web form.

We will first acquire the necessary credentials to access the Spark cluster.

In [1]:
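import getpass
import os, sys
import subprocess

print("Please enter your password")
# Send the password to kinit on stdin without going through a shell,
# so that quotes, $ or backticks in the password are not interpreted
ret = subprocess.run(["kinit"], input=getpass.getpass() + "\n",
                     universal_newlines=True).returncode

if ret == 0:
    print("Credentials created successfully")
else:
    sys.stderr.write('Error creating credentials, return code: %s\n' % ret)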

Please enter your password
········
Credentials created successfully
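
Before moving on, it can be useful to confirm that a valid Kerberos ticket is actually in the credential cache. The following sketch is not part of the original notebook. It assumes a `klist` client is installed and relies on its `-s` flag, which prints nothing and reports the ticket status through the exit code. The helper name `has_valid_ticket` is hypothetical.

```python
import subprocess

def has_valid_ticket(klist_cmd="klist"):
    """Return True if `klist -s` reports a valid ticket, False otherwise."""
    try:
        # -s: silent mode; the exit status is 0 only when a valid ticket exists
        return subprocess.call([klist_cmd, "-s"]) == 0
    except OSError:
        # klist is not installed or not on PATH (assumption: treat as "no ticket")
        return False

if has_valid_ticket():
    print("Valid ticket found")
else:
    print("No valid ticket found, run kinit again")
```

If the check fails, rerunning the credentials cell above should refresh the ticket.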


Next, some Spark imports.

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

Now we will create the SparkContext, where we will configure the Spark driver to connect to the previously selected cluster. Note that this cell will take some time to run since it triggers the first connection to the cluster.

It is worth pointing out that **this configuration step will be automated**: a graphical interface will hide most of these configuration options and allow setting other parameters, such as the number of executors.

In [3]:
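conf = SparkConf()
# Host and ports on which the driver (this SWAN session) is reachable from the cluster
conf.set('spark.driver.host', os.environ['SERVER_HOSTNAME'])
conf.set('spark.driver.port', os.environ['SPARK_PORT_1'])
conf.set('spark.blockManager.port', os.environ['SPARK_PORT_2'])
conf.set('spark.ui.port', os.environ['SPARK_PORT_3'])
# Use YARN as the cluster manager
conf.set('spark.master', 'yarn')
# Enable authentication and encryption for Spark's internal connections
conf.set('spark.authenticate', True)
conf.set('spark.network.crypto.enabled', True)
conf.set('spark.authenticate.enableSaslEncryption', True)
# This call triggers the first connection to the cluster
sc = SparkContext(conf=conf)
spark = SparkSession(sc)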
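
The cell above reads four environment variables directly, so a missing variable surfaces as a bare `KeyError`. As a rough sketch, and not something the notebook or SWAN provides, the lookups could be collected in a small helper that reports all missing variables at once. The names `REQUIRED_ENV` and `spark_settings_from_env` are hypothetical, and the example values are made up.

```python
import os

# Spark property -> environment variable, as used in the configuration cell above
REQUIRED_ENV = {
    'spark.driver.host': 'SERVER_HOSTNAME',
    'spark.driver.port': 'SPARK_PORT_1',
    'spark.blockManager.port': 'SPARK_PORT_2',
    'spark.ui.port': 'SPARK_PORT_3',
}

def spark_settings_from_env(env=None):
    """Build a {property: value} dict, failing early if a variable is missing."""
    env = os.environ if env is None else env
    missing = sorted(var for var in REQUIRED_ENV.values() if var not in env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {prop: env[var] for prop, var in REQUIRED_ENV.items()}

# Illustration with made-up values instead of the real SWAN environment
example_env = {
    'SERVER_HOSTNAME': 'swan-host.example.org',
    'SPARK_PORT_1': '5001',
    'SPARK_PORT_2': '5002',
    'SPARK_PORT_3': '5003',
}
settings = spark_settings_from_env(example_env)
print(settings['spark.driver.host'], settings['spark.driver.port'])
```

The resulting dictionary could then be applied in one call with `conf.setAll(list(settings.items()))`.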

Once Spark is configured, we can try a simple map-reduce example that calculates the sum of squares of a list of numbers.

In [4]:
myNums = sc.parallelize(range(10))
myNums.map(lambda x: x*x).reduce(lambda x,y: x+y)
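
To check the result, the same computation can be run locally without Spark. The sketch below is a single-process analogue that uses Python's built-in `map` and `functools.reduce` in place of the RDD methods. It only illustrates the map-reduce pattern and does not touch the cluster.

```python
from functools import reduce

nums = range(10)

# "Map" step: square every element
squares = map(lambda x: x * x, nums)

# "Reduce" step: combine the squares pairwise into a single sum
total = reduce(lambda x, y: x + y, squares)

print(total)  # 285
```

The Spark version applies the same two functions, but evaluates the map step in parallel on the partitions of the RDD and merges the partial sums in the reduce step.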