## Example: using on-demand Spark to run a computation and read from and write to a remote secure HDFS


## First, get the HDFS delegation token

Spark communicates with HDFS securely using an HDFS delegation token. The token is requested from the HDFS service, which identifies the client by its Kerberos ticket (run `klist` to see the ticket details in your cache).

The token must be accessible to both the driver and the executors, so we place it in the project's dataset (mounted under `/mnt/data/`). Ideally, for security reasons, each user would have a dataset that only that user can access, and the token would be kept there.
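
One way to approximate the per-user restriction today is to keep the token in a directory that only its owner can read. The sketch below is an assumption about layout, not a Domino feature: `user_token_path` and the `tokens/<user>` directory are hypothetical names, and POSIX permission bits only help if the dataset's filesystem honours them.

```python
import os
import stat
import tempfile


def user_token_path(base_dir, user):
    """Return a token path inside a directory only the owner can access.

    `base_dir` and the `tokens/<user>` layout are illustrative assumptions.
    """
    token_dir = os.path.join(base_dir, "tokens", user)
    os.makedirs(token_dir, exist_ok=True)
    # 0o700: read, write and execute for the owner only
    os.chmod(token_dir, stat.S_IRWXU)
    return os.path.join(token_dir, "hdfsdt.token")


# Demonstration against a throwaway directory rather than /mnt/data
demo_base = tempfile.mkdtemp()
demo_path = user_token_path(demo_base, "dominospark")
print(demo_path)
```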

**Better still, this would be supported within Spark by a side-car container in both the workspace and the Spark nodes (via Domsed or natively), with the token kept in a transient location such as the `/tmp/` folder.**

Set the project environment variable `HADOOP_TOKEN_FILE_LOCATION` to the path of this token file.
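
Because both the driver and the executors depend on this variable, a quick check once the token has been fetched can catch a missing or empty file before Spark starts. This is a minimal sketch; `check_token_file` is a hypothetical helper, not part of Hadoop or Spark.

```python
import os


def check_token_file(env=None):
    """Return the token path if HADOOP_TOKEN_FILE_LOCATION names a non-empty file."""
    env = os.environ if env is None else env
    path = env.get("HADOOP_TOKEN_FILE_LOCATION")
    if not path:
        raise RuntimeError("HADOOP_TOKEN_FILE_LOCATION is not set")
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        raise RuntimeError(f"No delegation token found at {path}")
    return path
```

Calling `check_token_file()` just before building the Spark session turns a confusing authentication failure later on into an immediate, readable error.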


In [1]:
!klist

Ticket cache: FILE:/tmp/krb5cc_12574
Default principal: dominospark@KDCDOMINO.COM

Valid starting       Expires              Service principal
08/18/2021 12:29:44  08/18/2021 22:29:44  krbtgt/KDCDOMINO.COM@KDCDOMINO.COM
	renew until 08/19/2021 12:29:44
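
Since the delegation token can only be fetched while the Kerberos ticket is valid, it can help to read the ticket lifetime programmatically. The sketch below parses `klist` output of the shape shown above; the timestamp format is an assumption (it varies between Kerberos implementations and locales), and `ticket_lifetimes` is a hypothetical helper.

```python
from datetime import datetime

KLIST_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed; klist formats vary


def ticket_lifetimes(klist_output):
    """Yield (service_principal, valid_from, expires) for each ticket line."""
    for line in klist_output.splitlines():
        parts = line.split()
        # Ticket lines look like: <date> <time> <date> <time> <principal>
        if len(parts) != 5:
            continue
        try:
            start = datetime.strptime(" ".join(parts[0:2]), KLIST_TIME_FORMAT)
            end = datetime.strptime(" ".join(parts[2:4]), KLIST_TIME_FORMAT)
        except ValueError:
            continue  # header or other non-ticket line
        yield parts[4], start, end


sample = """Valid starting       Expires              Service principal
08/18/2021 12:29:44  08/18/2021 22:29:44  krbtgt/KDCDOMINO.COM@KDCDOMINO.COM
    renew until 08/19/2021 12:29:44"""

for principal, start, end in ticket_lifetimes(sample):
    print(principal, "valid for", end - start)
```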


In [2]:
import os
# Generate an HDFS delegation token and store it in the project dataset
!/mnt/code/scripts/my_hdfs.sh fetchdt --renewer null /mnt/data/$DOMINO_PROJECT_NAME/hdfsdt.token

# Point Hadoop clients at the token file.
# The hard-coded path assumes DOMINO_PROJECT_NAME is ON-DEMAND-SPARK.
%env HADOOP_TOKEN_FILE_LOCATION=/mnt/data/ON-DEMAND-SPARK/hdfsdt.token

hdfs_dt_token_path = os.environ['HADOOP_TOKEN_FILE_LOCATION']



2021-08-18 16:13:49,825 INFO hdfs.DFSClient: Created token for dominospark: HDFS_DELEGATION_TOKEN owner=dominospark@KDCDOMINO.COM, renewer=null, realUser=, issueDate=1629303229812, maxDate=1629908029812, sequenceNumber=125, masterKeyId=6 on 10.0.123.114:8020
env: HADOOP_TOKEN_FILE_LOCATION=/mnt/data/ON-DEMAND-SPARK/hdfsdt.token
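
The `issueDate` and `maxDate` fields in the log line above are epoch timestamps in milliseconds. The following sketch decodes them to show how long the token can live; `token_validity` is a hypothetical helper, and `maxDate` is only the upper bound, since a token that is not renewed may expire earlier.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def token_validity(log_line):
    """Extract issueDate and maxDate (epoch milliseconds) from a fetchdt log line."""
    fields = dict(re.findall(r"(issueDate|maxDate)=(\d+)", log_line))
    issued = EPOCH + timedelta(milliseconds=int(fields["issueDate"]))
    max_date = EPOCH + timedelta(milliseconds=int(fields["maxDate"]))
    return issued, max_date


line = ("Created token for dominospark: HDFS_DELEGATION_TOKEN "
        "owner=dominospark@KDCDOMINO.COM, renewer=null, realUser=, "
        "issueDate=1629303229812, maxDate=1629908029812, "
        "sequenceNumber=125, masterKeyId=6 on 10.0.123.114:8020")
issued, max_date = token_validity(line)
print("issued:", issued.isoformat())
print("maximum lifetime:", max_date - issued)
```

For the token above, the two dates are exactly seven days apart.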


Next, configure the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON`. This is unnecessary if the workspace and the executors both have Python at the same path (`/usr/bin/python3`). But if your workspace uses `/opt/conda/bin/python3`, the job will fail in the executors. You can set both variables to `/opt/conda/bin/python3` to see how it fails.
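
Outside a notebook, where the `%env` magic is not available, the same two variables can be set from plain Python before the session is created. This is a sketch; `use_interpreter` is a hypothetical helper, and it can only verify that the interpreter exists on the driver side, not on the executors.

```python
import os
import shutil


def use_interpreter(path, env=None):
    """Point both PySpark interpreter variables at the same executable."""
    env = os.environ if env is None else env
    env["PYSPARK_PYTHON"] = path
    env["PYSPARK_DRIVER_PYTHON"] = path
    # Only the local (driver-side) path can be checked from here
    return shutil.which(path) is not None


demo_env = {}
found_locally = use_interpreter("/usr/bin/python3", env=demo_env)
print(demo_env, "found locally:", found_locally)
```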

Also read the HDFS endpoint from the project environment variable `HDFS_ENDPOINT`. In this example the HDFS NameNode is running at `hdfs://10.0.123.114:8020`.
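
A malformed endpoint tends to surface only as a connection error deep inside Spark, so it may be worth validating the value first. The sketch below uses only the standard library; `parse_hdfs_endpoint` is a hypothetical helper.

```python
from urllib.parse import urlparse


def parse_hdfs_endpoint(endpoint):
    """Split an hdfs:// URI into (host, port), raising on anything else."""
    parsed = urlparse(endpoint)
    if parsed.scheme != "hdfs" or not parsed.hostname:
        raise ValueError(f"Not an hdfs:// endpoint: {endpoint!r}")
    return parsed.hostname, parsed.port


print(parse_hdfs_endpoint("hdfs://10.0.123.114:8020"))
```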

In [3]:
%env PYSPARK_PYTHON /usr/bin/python3
%env PYSPARK_DRIVER_PYTHON /usr/bin/python3
hdfs_endpoint=os.environ['HDFS_ENDPOINT']

env: PYSPARK_PYTHON=/usr/bin/python3
env: PYSPARK_DRIVER_PYTHON=/usr/bin/python3


In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, DoubleType, IntegerType
import random

Now create the Spark session. Note the config `spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION`, which passes the token file path to the executors as an environment variable.

In [5]:
# On-demand Spark with a delegation token.
# Note: fs.default.name is the older (deprecated) name for fs.defaultFS.
sparkSession = SparkSession.builder.appName("Calculate Pi using OnDemand Spark") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("fs.default.name", hdfs_endpoint) \
    .config("spark.driver.extraClassPath", "/opt/hadoop/etc/hadoop:/usr/lib/hadoop-lzo/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/hdfs/*") \
    .config("spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION", hdfs_dt_token_path) \
    .getOrCreate()
sc = sparkSession.sparkContext
# Not set in this run; the matching executor class path option would be:
#.config("spark.executor.extraClassPath", "/opt/hadoop/etc/hadoop:/usr/lib/hadoop-lzo/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/hdfs/*") \
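
To make the role of each setting easier to see, the core options from the builder above can be collected in a plain dictionary. This is only a sketch: `on_demand_spark_conf` is a hypothetical helper, and the literal endpoint and token path are the values from this example.

```python
def on_demand_spark_conf(hdfs_endpoint, token_path):
    """Collect the core settings from the builder above in one dictionary."""
    return {
        # Fixed executor count for the on-demand cluster
        "spark.dynamicAllocation.enabled": "false",
        # Default filesystem, so bare paths resolve against the remote HDFS
        "fs.default.name": hdfs_endpoint,
        # spark.executorEnv.<NAME> sets environment variable NAME on executors
        "spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION": token_path,
    }


conf = on_demand_spark_conf("hdfs://10.0.123.114:8020",
                            "/mnt/data/ON-DEMAND-SPARK/hdfsdt.token")
for key, value in conf.items():
    print(key, "=", value)
```

Each pair can then be applied to the builder with `.config(key, value)` in a loop.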

In [6]:
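def inside(p):
    # p is the (unused) sample index; draw a random point in the unit square
    # and report whether it falls inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x*x + y*y < 1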


In [7]:


columns = StructType([StructField("name", StringType(), True),
                      StructField("value", DoubleType(), True)])

# Count how many of 1000 random points fall inside the quarter circle
count = sc.parallelize(range(0, 1000), 1) \
          .filter(inside).count()
# The hit fraction approximates pi/4
data = [("Pi", 4.0 * count / 1000)]

df = sparkSession.createDataFrame(data=data, schema=columns)

df.show()

+----+-----+
|name|value|
+----+-----+
|  Pi|3.156|
+----+-----+
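
The Spark job above is a Monte Carlo estimate: the fraction of random points in the unit square that land inside the quarter circle approaches pi/4. The same calculation in plain Python, as a sketch for comparison (`estimate_pi` is a hypothetical name), shows why 1000 samples give only about two correct digits.

```python
import random


def estimate_pi(samples, seed=None):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            hits += 1
    return 4.0 * hits / samples


print(estimate_pi(1000))
print(estimate_pi(100_000))
```

The error shrinks roughly with the square root of the sample count, so a hundred times more samples buys about one more digit.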



Now we write to secure (Kerberized) HDFS. This uses the HDFS token file we generated in the first step.

In [8]:
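# Write the DataFrame to HDFS, removing output from any previous run first
ds_path = '/user/dominospark/my-pi'
!/mnt/code/scripts/my_hdfs.sh dfs -rmr '/user/dominospark/my-pi*'
df.write.csv(ds_path)
# Read it back (no header was written, so columns come back as _c0, _c1)
sparkSession.read.csv(ds_path).show()
# List the files Spark created under the output directory
!/mnt/code/scripts/my_hdfs.sh dfs -ls /user/dominospark/my-pi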

rmr: DEPRECATED: Please use '-rm -r' instead.
Deleted /user/dominospark/my-pi
+---+-----+
|_c0|  _c1|
+---+-----+
| Pi|3.156|
+---+-----+

Found 3 items
-rw-r--r--   3 dominospark dominospark          0 2021-08-18 16:15 /user/dominospark/my-pi/_SUCCESS
-rw-r--r--   3 dominospark dominospark          0 2021-08-18 16:15 /user/dominospark/my-pi/part-00000-ab57bb64-0aed-4b51-aad3-624e442436ca-c000.csv
-rw-r--r--   3 dominospark dominospark          9 2021-08-18 16:15 /user/dominospark/my-pi/part-00003-ab57bb64-0aed-4b51-aad3-624e442436ca-c000.csv
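
Spark writes a directory rather than a single file: a `_SUCCESS` marker plus one `part-*` file per written partition, and in the listing above one of the two part files is empty. A small sketch for picking out the files that actually hold data, assuming the eight-column `-ls` layout shown here (`parse_ls_line` is a hypothetical helper):

```python
def parse_ls_line(line):
    """Parse one file line of `hdfs dfs -ls` output into (size_in_bytes, path)."""
    parts = line.split(None, 7)
    # Expected fields: perms, replication, owner, group, size, date, time, path
    if len(parts) != 8:
        return None
    try:
        return int(parts[4]), parts[7]
    except ValueError:
        return None  # not a file line after all


listing = """Found 3 items
-rw-r--r--   3 dominospark dominospark          0 2021-08-18 16:15 /user/dominospark/my-pi/_SUCCESS
-rw-r--r--   3 dominospark dominospark          0 2021-08-18 16:15 /user/dominospark/my-pi/part-00000-ab57bb64-0aed-4b51-aad3-624e442436ca-c000.csv
-rw-r--r--   3 dominospark dominospark          9 2021-08-18 16:15 /user/dominospark/my-pi/part-00003-ab57bb64-0aed-4b51-aad3-624e442436ca-c000.csv"""

entries = [e for e in map(parse_ls_line, listing.splitlines()) if e]
non_empty = [path for size, path in entries if size > 0]
print(non_empty)
```

The single non-empty file is 9 bytes, which matches the one row `Pi,3.156` plus a newline.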


Now we do more of the same: read a larger dataset from HDFS, filter its records on a simple criterion, write the filtered result to both HDFS and the local project dataset, and then read it back.

In [9]:
# Source data in HDFS, and the two destinations for the filtered rows
hdfs_src_path = '/user/dominospark/largedata/'
hdfs_dest_path = '/user/dominospark/small-data-100/'
local_dest_path = 'file:///mnt/data/ON-DEMAND-SPARK/small-data-100'
filter_criteria = 100  # keep rows with id < 100

# Remove output from any previous run; write.csv fails if the path exists
!/mnt/code/scripts/my_hdfs.sh dfs -rmr '/user/dominospark/small-data-100/'
!rm -rf /mnt/data/ON-DEMAND-SPARK/small-data-100

# The Spark session created earlier is reused here

columns = StructType([StructField("id", IntegerType(), True),
                      StructField("v1", IntegerType(), True),
                      StructField("v2", IntegerType(), True),
                      StructField("v3", IntegerType(), True)])

df_load = sparkSession.read.csv(hdfs_src_path, columns)
df_load_filtered = df_load.where(df_load.id < filter_criteria)
df_load_filtered.write.csv(hdfs_dest_path)
df_load_filtered.write.csv(local_dest_path)


rmr: DEPRECATED: Please use '-rm -r' instead.
Deleted /user/dominospark/small-data-100
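
For readers without a cluster at hand, the filter step can be mimicked in plain Python. The rows below are a hypothetical stand-in for the HDFS source data (an `id` plus three random integers between 0 and 100, which is an assumption about how that data was produced), and `filter_rows` is an illustrative name.

```python
import csv
import io
import random


def filter_rows(rows, max_id):
    """Keep rows whose first column (id) is below max_id."""
    return [row for row in rows if row[0] < max_id]


# Hypothetical stand-in for the source data: id plus three random integers
rng = random.Random(0)
rows = [(i, rng.randint(0, 100), rng.randint(0, 100), rng.randint(0, 100))
        for i in range(1000)]

small = filter_rows(rows, 100)

# Write header-less CSV, the shape Spark's CSV writer produces by default
buffer = io.StringIO()
csv.writer(buffer).writerows(small)
print(len(small), "rows kept")
```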


In [10]:
df_load_read = sparkSession.read.csv(hdfs_dest_path,columns)
df_load_read.show()

+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
|  0| 63| 94| 50|
|  1| 25| 26| 73|
|  2| 84| 84| 84|
|  3| 47| 19|  8|
|  4| 24| 31|  6|
|  5| 75| 17| 11|
|  6| 49| 38| 57|
|  7| 56| 31| 90|
|  8|100|  3|  3|
|  9| 43| 72| 34|
| 10| 18| 64| 57|
| 11| 63| 75| 80|
| 12| 82| 85| 28|
| 13| 31|  8| 42|
| 14| 20| 80|  3|
| 15| 27| 91| 86|
| 16| 55| 70| 42|
| 17| 69|  3|  5|
| 18| 65| 28| 28|
| 19| 57|  8| 69|
+---+---+---+---+
only showing top 20 rows



In [11]:
df_load_read = sparkSession.read.csv(local_dest_path,columns)
df_load_read.show()

+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
|  0| 63| 94| 50|
|  1| 25| 26| 73|
|  2| 84| 84| 84|
|  3| 47| 19|  8|
|  4| 24| 31|  6|
|  5| 75| 17| 11|
|  6| 49| 38| 57|
|  7| 56| 31| 90|
|  8|100|  3|  3|
|  9| 43| 72| 34|
| 10| 18| 64| 57|
| 11| 63| 75| 80|
| 12| 82| 85| 28|
| 13| 31|  8| 42|
| 14| 20| 80|  3|
| 15| 27| 91| 86|
| 16| 55| 70| 42|
| 17| 69|  3|  5|
| 18| 65| 28| 28|
| 19| 57|  8| 69|
+---+---+---+---+
only showing top 20 rows



In [13]:
sparkSession.stop()

In [13]:
!/mnt/code/scripts/my_hdfs.sh dfs -ls

Found 12 items
drwxr-xr-x   - dominospark dominospark          0 2021-08-18 12:59 .sparkStaging
drwxr-xr-x   - dominospark dominospark          0 2021-08-18 12:59 example
drwxr-xr-x   - dominospark dominospark          0 2021-08-13 21:45 large-data
drwxr-xr-x   - dominospark dominospark          0 2021-08-17 15:47 large-data-livy
drwxr-xr-x   - dominospark dominospark          0 2021-08-14 18:55 largedata
drwxr-xr-x   - dominospark dominospark          0 2021-08-13 21:57 ld-10
drwxr-xr-x   - dominospark dominospark          0 2021-08-16 21:13 livy-large-data
drwxr-xr-x   - dominospark dominospark          0 2021-08-18 13:05 my-pi
drwxr-xr-x   - dominospark dominospark          0 2021-08-18 12:56 mypi
drwxr-xr-x   - dominospark dominospark          0 2021-08-13 21:58 sd-5
drwxr-xr-x   - dominospark dominospark          0 2021-08-18 13:06 small-data-100
drwxr-xr-x   - dominospark dominospark          0 2021-08-14 23:45 smalldata-1000
