# RADICAL-Pilot, Spark, Dask Throughput

Use the examples below to familiarize yourself with RADICAL-Pilot and to compare its task throughput with Spark and Dask.

We will:
* Modify settings (environment variables) if needed
* Modify the example to print out the hostname of the machine that runs the Pilot


**Always close the session with `session.close()` before terminating the notebook.**

For local testing, set `export SPARK_LOCAL_IP=127.0.0.1`.
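
The same setting can also be applied from inside the notebook, before Spark is started. A minimal sketch, assuming the variable only needs to be visible to this Python process and the processes it launches; `ensure_local_spark_ip` is a hypothetical helper name:

```python
import os

def ensure_local_spark_ip(ip="127.0.0.1", env=None):
    """Set SPARK_LOCAL_IP unless it is already defined; return the effective value."""
    env = os.environ if env is None else env
    # setdefault keeps a value exported in the shell and only fills in a missing one
    return env.setdefault("SPARK_LOCAL_IP", ip)

print(ensure_local_spark_ip())
```

Using `setdefault` rather than plain assignment means a value exported on a cluster node is not silently overwritten by the local-testing default.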

Start MongoDB (on Wrangler):
        
        mongod --dbpath=/gpfs/flash/users/tg804093/mongo

## RADICAL-Pilot Setup

Documentation: http://radicalpilot.readthedocs.org/en/latest/machconf.html#preconfigured-resources

First, we will import the necessary dependencies and define some helper functions.

In [1]:
%matplotlib inline
import os, sys
import commands
import radical.pilot as rp
import random
import pandas as pd
import ast
import seaborn as sns

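def print_details(detail_object):
    """Render a dict (or the string form of a dict) as a one-column DataFrame."""
    if isinstance(detail_object, str):
        # Accept the string representation of a dict as well
        detail_object = ast.literal_eval(detail_object)
    for i in detail_object:
        detail_object[i]=str(detail_object[i])
    return pd.DataFrame(detail_object.values(), 
             index=detail_object.keys(), 
             columns=["Value"])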

os.environ["RADICAL_PILOT_VERBOSE"]="ERROR"
os.environ["RADICAL_SAGA_PTY_VERBOSE"]="ERROR" 
#os.environ["RADICAL_PILOT_DBURL"]="mongodb://mongo.radical-cybertools.org:24242/sc15-test000"
os.environ["RADICAL_PILOT_DBURL"]="mongodb://localhost:27017/sc15-test000"
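
Because every later cell depends on `RADICAL_PILOT_DBURL`, it can help to check that the MongoDB endpoint is reachable before creating a session. A minimal sketch using only the standard library; `split_dburl` and `mongo_reachable` are hypothetical helpers, not part of RADICAL-Pilot, and a successful TCP connect only shows that something is listening on that port:

```python
import socket

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def split_dburl(dburl):
    """Return (host, port, database) for a mongodb:// URL."""
    parsed = urlparse(dburl)
    # 27017 is MongoDB's default port when the URL does not name one
    return parsed.hostname, parsed.port or 27017, parsed.path.lstrip("/")

def mongo_reachable(dburl, timeout=2.0):
    """True if a TCP connection to the MongoDB host and port succeeds."""
    host, port, _ = split_dburl(dburl)
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except (socket.error, socket.timeout):
        return False

print(split_dburl("mongodb://localhost:27017/sc15-test000"))
```

Calling `mongo_reachable(os.environ["RADICAL_PILOT_DBURL"])` before `rp.Session()` gives a quick yes/no answer without waiting for the session to fail.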



## Local Pilot Example

This example shows how to execute a task using a Pilot-Job running on the local machine. In this case, the Pilot-Job is started using **ssh** on the edge node of the Hadoop cluster (which runs JupyterHub, the IPython notebook server).

### 2.1 Create a new Session, Pilot Manager, and Unit Manager

In [2]:
%%time
session = rp.Session()
pmgr = rp.PilotManager(session=session)
umgr = rp.UnitManager (session=session,
                       scheduler=rp.SCHED_ROUND_ROBIN)
print "Session id: %s Pilot Manager: %s" % (session.uid, str(pmgr.as_dict()))

Session id: rp.session.c251-116.wrangler.tacc.utexas.edu.tg804093.017252.0002 Pilot Manager: {'uid': 'pmgr.0000'}
CPU times: user 128 ms, sys: 35.2 ms, total: 163 ms
Wall time: 201 ms


In [3]:
print_details(umgr.as_dict())

| | Value |
|---|---|
| uid | umgr.0000 |
| scheduler | RoundRobinScheduler |
| scheduler_details | NO SCHEDULER DETAILS (Not Implemented) |


### 2.2 Submit Pilot and add to Unit Manager

In [4]:
pdesc = rp.ComputePilotDescription()
pdesc.resource = "local.localhost_anaconda"  # NOTE: This is a "label", not a hostname
pdesc.runtime  = 120 # minutes
pdesc.cores    = 48
pdesc.cleanup  = False
pilot = pmgr.submit_pilots(pdesc)
umgr.add_pilots(pilot)

In [5]:
print_details(pilot.as_dict())

| | Value |
|---|---|
| uid | pilot.0000 |
| stdout | |
| start_time | |
| resource_detail | {'cores_per_node': None, 'nodes': None, 'lm_de... |
| submission_time | 1490581667.5 |
| logfile | |
| resource | local.localhost_anaconda |
| log | [] |
| sandbox | file://localhost/home/01131/tg804093/radical.p... |
| state | Launching |


### 2.3 Submit Compute Units

Create a description of the compute unit, which specifies the details of the task to be executed.

In [None]:
cudesc_list=[]
cudesc = rp.ComputeUnitDescription()
cudesc.executable  = "/bin/sleep"
cudesc.arguments   = ['0']
cudesc.cores       = 1
cudesc_list.append(cudesc)

Submit the previously created ComputeUnit descriptions to the UnitManager. This triggers the selected scheduler (here, the round-robin scheduler) to start assigning ComputeUnits to the ComputePilots.

In [None]:
print "Submit Compute Units to Unit Manager ..."
cu_set = umgr.submit_units(cudesc_list)
print "Waiting for CUs to complete ..."
umgr.wait_units()
print "All CUs completed successfully!"
cu_results = cu_set[0]
details=cu_results.as_dict()
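
To make the scheduling step concrete: a round-robin policy hands units to pilots in rotation. The sketch below illustrates only that idea with plain Python lists; it is not RADICAL-Pilot's scheduler implementation, and `round_robin_assign` and the unit and pilot labels are hypothetical.

```python
from itertools import cycle

def round_robin_assign(units, pilots):
    """Assign each unit to the next pilot in rotation; return {pilot: [units]}."""
    if not pilots:
        raise ValueError("at least one pilot is required")
    assignment = {p: [] for p in pilots}
    # cycle() repeats the pilot list endlessly; zip stops when the units run out
    for unit, pilot in zip(units, cycle(pilots)):
        assignment[pilot].append(unit)
    return assignment

print(round_robin_assign(["cu.0", "cu.1", "cu.2", "cu.3", "cu.4"],
                         ["pilot.0000", "pilot.0001"]))
```

With a single pilot, as in this notebook, every unit ends up on that pilot; the rotation only matters once several pilots are added to the Unit Manager.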

---
The next command shows the state and other details of the ComputeUnit.

In [None]:
print_details(details)

The execution details of the ComputeUnit:

In [None]:
print_details(details["execution_details"])

Print the standard output of the CU:

In [None]:
print cu_results.stdout.strip()

### 2.4 Exercise

Write a task (i.e., a ComputeUnit) that prints out the hostname of the machine!

Answer: In the example above, set `cudesc.executable` to `hostname` instead of `/bin/sleep` and remove `cudesc.arguments`.

### Performance Analysis

In the examples below we will show how RADICAL-Pilot can be used for interactive analytics. We will plot and analyze the execution times of a set of ComputeUnits.

In [6]:
def get_runtime(compute_unit):
    details=compute_unit.as_dict()
    execution_details=details['execution_details']
    state_details=execution_details["statehistory"]
    results = {}
    for i in state_details:
        results[i["state"]]=i["timestamp"]
    #print str(results)
    start = results["New"]
    end = results["Done"]
    runtime = end-start
    return runtime
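
To see what `get_runtime` computes, the same lookup can be applied to a hand-written state history. The sketch below assumes the structure used above (a list of entries with `state` and `timestamp` keys); `runtime_from_history` is a hypothetical name, and the intermediate state and all timestamps are made up for illustration.

```python
def runtime_from_history(state_history, first="New", last="Done"):
    """Seconds between the first and last state in a list of state records."""
    # Map each state name to its timestamp; a later duplicate overwrites an earlier one
    timestamps = {entry["state"]: entry["timestamp"] for entry in state_history}
    return timestamps[last] - timestamps[first]

# Hypothetical history; the timestamps are illustrative, not measured
history = [
    {"state": "New", "timestamp": 100.0},
    {"state": "Executing", "timestamp": 103.5},
    {"state": "Done", "timestamp": 104.25},
]
print(runtime_from_history(history))
```

A unit that never reaches `Done` (for example, one that failed) has no such entry, so the lookup raises `KeyError`; that is worth handling before aggregating runtimes over many units.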

In [None]:
import time

scenarios = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]
#scenarios = [16]    

for n in scenarios:
    cudesc_list = []
    for i in range(n):
        cudesc = rp.ComputeUnitDescription()
        cudesc.executable  = "/bin/date"
        #cudesc.environment = {'CU_NO': i}
        #cudesc.arguments   = ['$(CU_NO)']
        cudesc.cores       = 1
        cudesc_list.append(cudesc)
    
    start_time = time.time()
    cu_set = umgr.submit_units(cudesc_list)
    states = umgr.wait_units()
    end_time= time.time()
    #time.sleep(6)
    print("RP-0.45.1, %d, Runtime, %.4f"%(n, (end_time-start_time)))
    #runtimes=[]
    #for compute_unit in cu_set:
    #    task_runtime = get_runtime(compute_unit)
    #    print "RP-0.45.1, %d, Task_Runtime, %.4f"%(n, task_runtime)

RP-0.45.1, 1, Runtime, 11.5234
RP-0.45.1, 2, Runtime, 6.0129
RP-0.45.1, 4, Runtime, 6.0175
RP-0.45.1, 8, Runtime, 6.0256
RP-0.45.1, 16, Runtime, 6.0570
RP-0.45.1, 32, Runtime, 6.0736
RP-0.45.1, 64, Runtime, 6.1393
RP-0.45.1, 128, Runtime, 5.7766
RP-0.45.1, 256, Runtime, 6.0505
RP-0.45.1, 512, Runtime, 9.2991
RP-0.45.1, 1024, Runtime, 15.0897
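
Each result line has the form `label, n, Runtime, seconds`, so throughput in tasks per second follows directly as `n / seconds`. A minimal sketch that parses such lines; `throughput` is a hypothetical helper, and the sample lines are copied from the output above:

```python
def throughput(line):
    """Parse 'label, n, Runtime, seconds' and return (n, tasks per second)."""
    label, n, _, seconds = [field.strip() for field in line.split(",")]
    return int(n), int(n) / float(seconds)

for line in ["RP-0.45.1, 1, Runtime, 11.5234",
             "RP-0.45.1, 256, Runtime, 6.0505",
             "RP-0.45.1, 1024, Runtime, 15.0897"]:
    n, rate = throughput(line)
    print("%5d tasks: %7.2f tasks/s" % (n, rate))
```

For the 1024-task run this works out to roughly 68 tasks per second. The nearly constant runtime of about 6 seconds for 2 to 256 tasks suggests a fixed per-submission cost rather than a per-task one in that range.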


In [None]:
runtimes = []
for compute_unit in cu_set:
    task_runtime = get_runtime(compute_unit)
    runtimes.append(task_runtime)  # collected for the plot below
    print "RP-0.45.1, %d, Task_Runtime, %.4f"%(n, task_runtime)

We plot the distribution of the runtimes (from state `New` to state `Done`) of the ComputeUnits submitted in the last scenario using [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/). See the [distplot documentation](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html?highlight=distplot).

In [None]:
plot=sns.distplot(runtimes, kde=False, axlabel="Runtime")

### Close and Delete Session 

In [None]:
session.close()
del session
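
To honor the reminder at the top of this notebook even when a cell raises, the work can be wrapped so that `close()` always runs. A minimal sketch using `contextlib.closing`, which works with any object that has a `close()` method; `FakeSession` is a hypothetical stand-in so the example runs without RADICAL-Pilot:

```python
from contextlib import closing

class FakeSession(object):
    """Hypothetical stand-in for rp.Session; only records whether close() ran."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

session = FakeSession()
try:
    with closing(session):
        # Any exception raised here still triggers session.close()
        raise RuntimeError("simulated failure while submitting units")
except RuntimeError:
    pass

print(session.closed)
```

In a notebook the session usually spans several cells, so this pattern fits best when the whole workflow lives in one function or script.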

# Spark


In [1]:
import os, sys, time

SPARK_HOME="/home/01131/tg804093/work/spark-2.1.0-bin-hadoop2.7" 
os.environ["SPARK_HOME"]=SPARK_HOME
print "Init Spark: " + SPARK_HOME

os.environ["PYSPARK_PYTHON"]="/home/01131/tg804093/anaconda2/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"]="ipython"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"]="notebook"
os.environ["PYTHONPATH"]= os.path.join(SPARK_HOME, "python")+":" + os.path.join(SPARK_HOME, "python/lib/py4j-0.10.4-src.zip")
    
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
#sys.path.insert(0, os.path.join(SPARK_HOME, 'python/lib/py4j-0.9-src.zip')) 
sys.path.insert(0, os.path.join(SPARK_HOME, 'python/lib/py4j-0.10.4-src.zip')) 
sys.path.insert(0, os.path.join(SPARK_HOME, 'bin') )

# import Spark Libraries
from pyspark import SparkContext, SparkConf, Accumulator, AccumulatorParam
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.mllib.linalg import Vector

import pandas as pd
from IPython.display import HTML

Init Spark: /home/01131/tg804093/work/spark-2.1.0-bin-hadoop2.7


In [2]:
conf = SparkConf().setAppName("SparkTest").setMaster("spark://c251-116.wrangler.tacc.utexas.edu:7077")
sc = SparkContext(conf=conf)

In [15]:
import subprocess
scenarios = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]
#scenarios = [16]    

for n in scenarios:
    rdd = sc.parallelize(range(n))
    start_time = time.time()
    rdd.map(lambda a: subprocess.check_output(["/bin/date"])).saveAsTextFile("/gpfs/flash/users/tg804093/spark-out-%d"%n)
    end_time= time.time()
    print("Spark-2.1.0, %d, Runtime, %.4f"%(n, (end_time-start_time)))

Spark-2.1.0, 1, Runtime, 0.1919
Spark-2.1.0, 2, Runtime, 0.2762
Spark-2.1.0, 4, Runtime, 0.2528
Spark-2.1.0, 8, Runtime, 0.3092
Spark-2.1.0, 16, Runtime, 0.2794
Spark-2.1.0, 32, Runtime, 0.2631
Spark-2.1.0, 64, Runtime, 0.3021
Spark-2.1.0, 128, Runtime, 0.2612
Spark-2.1.0, 256, Runtime, 0.3131
Spark-2.1.0, 512, Runtime, 0.3343
Spark-2.1.0, 1024, Runtime, 0.3351
Spark-2.1.0, 2048, Runtime, 0.5081
Spark-2.1.0, 4096, Runtime, 0.7267
Spark-2.1.0, 8192, Runtime, 1.2800
Spark-2.1.0, 16384, Runtime, 2.2863
Spark-2.1.0, 32768, Runtime, 4.3769
Spark-2.1.0, 65536, Runtime, 9.6486
Spark-2.1.0, 131072, Runtime, 17.3124
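
Rerunning the benchmark cell fails once the `spark-out-<n>` directories exist, because `saveAsTextFile` by default refuses to write to an existing path. A minimal sketch of a cleanup helper for output on a locally mounted filesystem; `fresh_output_path` is a hypothetical name, and the base directory is a parameter rather than the path used above:

```python
import os
import shutil
import tempfile

def fresh_output_path(base_dir, n, prefix="spark-out"):
    """Return base_dir/<prefix>-<n>, deleting any directory left by an earlier run."""
    path = os.path.join(base_dir, "%s-%d" % (prefix, n))
    if os.path.isdir(path):
        shutil.rmtree(path)  # irreversible: removes the previous results
    return path

# Demonstration in a temporary directory
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "spark-out-16"))
path = fresh_output_path(base, 16)
print(os.path.exists(path))
shutil.rmtree(base)
```

The returned path could then be passed to `saveAsTextFile`. The deletion should happen before `start_time` is taken so that it does not count toward the measured runtime.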


In [14]:
import subprocess
subprocess.check_output(["/bin/date"])

'Sun Mar 26 20:12:39 CDT 2017\n'

# Dask

In [7]:
import dask.array as da
from dask import delayed
import dask
from dask import multiprocessing
from dask.multiprocessing import get
import numpy as np
import time 
import subprocess


@delayed
def output_date(n):
    with open("/gpfs/flash/users/tg804093/dask-out-%d.txt"%n, "w") as f:
        f.write(subprocess.check_output(["/bin/date"]))
    
scenarios = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]
#scenarios = [16]    

for n in scenarios:
    out = []
    for i in range(n):
        out.append(output_date(i))
    
    start_time = time.time()
    delayed(out).compute()
    end_time= time.time()
    print("Dask-%s, %d, Runtime, %.4f"%(dask.__version__, n, (end_time-start_time)))


Dask-0.14.1, 1, Runtime, 0.0070
Dask-0.14.1, 2, Runtime, 0.0083
Dask-0.14.1, 4, Runtime, 0.0112
Dask-0.14.1, 8, Runtime, 0.0197
Dask-0.14.1, 16, Runtime, 0.0443
Dask-0.14.1, 32, Runtime, 0.0890
Dask-0.14.1, 64, Runtime, 0.1914
Dask-0.14.1, 128, Runtime, 0.3775
Dask-0.14.1, 256, Runtime, 0.7861
Dask-0.14.1, 512, Runtime, 1.5812
Dask-0.14.1, 1024, Runtime, 3.1460
Dask-0.14.1, 2048, Runtime, 6.4133
Dask-0.14.1, 4096, Runtime, 13.3216
Dask-0.14.1, 8192, Runtime, 28.4429
Dask-0.14.1, 16384, Runtime, 61.9768
Dask-0.14.1, 32768, Runtime, 149.0796
Dask-0.14.1, 65536, Runtime, 378.4265
Dask-0.14.1, 131072, Runtime, 1100.8267
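
To compare the three frameworks on a common basis, the measurements at one task count can be converted to an average per-task cost. The sketch below uses the runtimes reported above for n = 1024, the largest size for which all three have results. The workloads and deployments differ (for example, the Spark and Dask runs also write output files), so the figures are indicative only.

```python
# Wall-clock seconds for n = 1024 tasks, copied from the outputs above
N = 1024
runtimes = {
    "RP-0.45.1": 15.0897,
    "Spark-2.1.0": 0.3351,
    "Dask-0.14.1": 3.1460,
}

def per_task_ms(seconds, n):
    """Average wall-clock cost of one task in milliseconds."""
    return 1000.0 * seconds / n

# Print the frameworks from fastest to slowest
for name in sorted(runtimes, key=runtimes.get):
    print("%-12s %8.3f ms/task" % (name, per_task_ms(runtimes[name], N)))
```

The same function applied to the largest Dask run (`per_task_ms(1100.8267, 131072)`) shows whether the per-task cost stays constant as the number of tasks grows.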
