Examples for Spark-LAMA can be found in examples/spark/. These examples can be run both locally and remotely on a cluster.
To run the examples locally, one only needs to ensure that the data files are in the expected location, typically the /opt/spark_data directory. (Data for the examples can be found in examples/data.)
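Before a local run, it can help to confirm that the data directory is actually populated. The following is a minimal sketch, not part of the project: `check_data_dir` is a hypothetical helper, and the `DATA_DIR` variable is an assumption that defaults to the /opt/spark_data path mentioned above.

```shell
# Hypothetical helper: succeeds if the directory exists and contains at least one entry.
check_data_dir() {
    dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Assumed default location, taken from the text above; override DATA_DIR if yours differs.
DATA_DIR="${DATA_DIR:-/opt/spark_data}"

if check_data_dir "$DATA_DIR"; then
    echo "data directory looks ready: $DATA_DIR"
else
    echo "data directory is missing or empty: $DATA_DIR" >&2
fi
```

If the check fails, copying the contents of examples/data into that directory is the likely fix.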
To run the examples remotely on a cluster managed by Kubernetes, one needs to have the kubectl utility installed and configured.
This step is necessary so that the script file (e.g. the Spark LAMA executable) can be uploaded to a location that is accessible from anywhere on the cluster. This file will be used by the Spark driver, which is also submitted to the cluster. Once this is configured, set an appropriate value for spark.kubernetes.file.upload.path in ./bin/slamactl.sh, or mount the location at /mnt/nfs on the localhost.
The examples require two PVCs to function (defined in slamactl.sh, in the spark-submit arguments):
- spark-lama-data - provides the driver and executors with access to the data
- mnt-nfs - provides the driver and executors with access to the upload directory mentioned above
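One way to confirm that both claims exist is to query them with `kubectl get pvc`. The sketch below only prints the commands (a dry run) so they can be reviewed first. `pvc_check_commands` is a hypothetical helper, and the fallback namespace spark-lama-exps is an assumption taken from the example value used elsewhere in this document.

```shell
# Hypothetical helper: prints one `kubectl get pvc` command per claim name.
# Usage: pvc_check_commands <namespace> <pvc-name>...
pvc_check_commands() {
    ns="$1"
    shift
    for pvc in "$@"; do
        echo "kubectl get pvc $pvc --namespace $ns"
    done
}

# Dry run: print the commands for the two claims named above.
pvc_check_commands "${KUBE_NAMESPACE:-spark-lama-exps}" spark-lama-data mnt-nfs
```

Piping the output to `sh` would execute the commands against the currently configured cluster.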
Define the required environment variables so that the appropriate Kubernetes namespace and a remote Docker repository accessible from anywhere in the cluster are used:
export KUBE_NAMESPACE=spark-lama-exps
export REPO=node2.bdcl:5000
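Since later steps presumably depend on these two variables, a quick guard can catch a forgotten export early. This is a minimal sketch: `require_env` is a hypothetical helper and is not something slamactl.sh is known to provide.

```shell
# Hypothetical helper: fails if any of the named environment variables is unset or empty.
require_env() {
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            return 1
        fi
    done
}

if require_env KUBE_NAMESPACE REPO; then
    echo "namespace=$KUBE_NAMESPACE repo=$REPO"
fi
```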
At this step, use the slamactl.sh utility to build the dependencies and Docker images:
./bin/slamactl.sh build-dist
It will:
- compile jars containing Scala-based components (currently only LAMLStringIndexer, which is required for LE-family transformers)
- download the Spark distribution and use its Dockerfiles to build base PySpark images (and push these images to the remote Docker repository)
- compile the lama wheel (including the spark subpackage) and build a Docker image based on the PySpark images mentioned above (this image is also pushed to the remote repository)
To run an example on the remote cluster, use the following command:
./bin/slamactl.sh submit-job ./examples/spark/tabular-preset-automl.py
The command submits a driver pod (using spark-submit) to the cluster; the driver then creates the executor pods.
The utility provides a command to set up port forwarding for the running example:
./bin/slamactl.sh port-forward ./examples/spark/tabular-preset-automl.py
The driver's 4040 port will be forwarded to http://localhost:9040.
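Once the forward is active, the Spark web UI and its monitoring REST API should respond on the local port. The sketch below queries the standard `/api/v1/applications` endpoint. `spark_api_url` is a hypothetical helper, and the check assumes `curl` is installed.

```shell
# Hypothetical helper: builds the URL of Spark's monitoring REST API for a local port.
spark_api_url() {
    port="${1:-9040}"
    echo "http://localhost:${port}/api/v1/applications"
}

# Query the forwarded UI; prints a JSON list of applications if the forward is active.
curl --silent --fail --max-time 5 "$(spark_api_url 9040)" \
    || echo "Spark UI is not reachable on port 9040" >&2
```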