## Test secure HDFS access from EMR via Spark in client mode
This is straightforward. The only thing to be aware of is that the environment variables `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` must be configured correctly. The workspace (driver) Python and the executor-side Python must be the same Python version, even though their paths differ: the workspace uses one path and the EMR nodes use another.
**Remember that the DRIVER program runs in the Workspace. The rest runs in the Executors on EMR.**

In [None]:
import os

# The HDFS endpoint comes from the workspace environment
hdfs_endpoint = os.environ['HDFS_ENDPOINT']
# hdfs_endpoint = 'hdfs://10.0.123.114:8020'

# Setting PYSPARK_PYTHON for EMR is crucial because /opt/conda/bin does not exist on EMR worker nodes
%env PYSPARK_PYTHON /usr/bin/python3
# The driver runs in the workspace, so it uses the workspace's conda Python
%env PYSPARK_DRIVER_PYTHON /opt/conda/bin/python

In [None]:
# Remove output from a previous run (-rmr is deprecated; -f ignores a missing path)
!hdfs dfs -rm -r -f /user/dominospark/mypi/

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
import random

In [None]:
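# Build the session; the driver runs here in the workspace (client mode)
# Note: fs.default.name is the legacy name of the Hadoop property fs.defaultFS
sparkSession = SparkSession.builder.appName("Calculate Pi using EMR Spark") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("fs.default.name", hdfs_endpoint) \
    .getOrCreate()
sc = sparkSession.sparkContext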
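
The requirement that the driver and the executors run the same Python version can be checked right after the session is created. The cell below is a minimal sketch, not part of the original notebook: the helper names `major_minor`, `executor_python_versions`, and `versions_match` are hypothetical, and the check assumes `sc` is the SparkContext created above. Note that PySpark itself raises an error when the worker and driver minor versions differ, so on a mismatch the small job may fail with that error before the comparison is reached. Either outcome surfaces the problem before the real workload runs.

```python
import sys


def major_minor(version_info):
    """Return 'X.Y' for a version tuple such as sys.version_info."""
    return "%d.%d" % (version_info[0], version_info[1])


def executor_python_versions(spark_context, num_tasks=4):
    """Run a tiny job and return the distinct 'X.Y' versions seen by its tasks."""
    def report(_):
        # Imported inside the task so it is evaluated on the executor
        import sys
        return "%d.%d" % sys.version_info[:2]

    versions = spark_context.parallelize(range(num_tasks), num_tasks) \
        .map(report).collect()
    return sorted(set(versions))


def versions_match(driver_version, executor_versions):
    """True only if at least one executor reported and all agree with the driver."""
    return bool(executor_versions) and all(
        v == driver_version for v in executor_versions
    )


# Only runs when a SparkContext named sc already exists in the notebook
if "sc" in globals():
    driver = major_minor(sys.version_info)
    executors = executor_python_versions(sc)
    print("driver:", driver, "executors:", executors)
    assert versions_match(driver, executors), "Check PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON"
```

If the check fails, the usual cause is that `PYSPARK_PYTHON` points to an interpreter on the EMR nodes whose version differs from the workspace interpreter named by `PYSPARK_DRIVER_PYTHON`.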

In [None]:
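def inside(p):
    # p is the (unused) sample index; each call draws one random point in the unit square
    x, y = random.random(), random.random()
    # True when the point falls inside the quarter circle of radius 1
    return x*x + y*y < 1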


In [None]:
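columns = StructType([
    StructField("name", StringType(), True),
    StructField("value", DoubleType(), True),
])

# Sample points in a single partition and count those inside the quarter circle
num_samples = 1000
count = sc.parallelize(range(0, num_samples), 1) \
          .filter(inside).count()
# The fraction of points inside approximates pi/4
data = [("Pi", 4.0 * count / num_samples)]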

df = sparkSession.createDataFrame(data=data, schema=columns)

df.show()

In [None]:
outpath = '/user/dominospark/mypi/'
df.write.csv(outpath)

In [None]:
df_load = sparkSession.read.csv(outpath, schema=columns)
df_load.show()

In [None]:
sc.stop()

In [None]:
!hdfs dfs -ls /user/dominospark/mypi/