sequila-recipes


SeQuiLa recipes, examples and other cloud-related content demonstrating how to run SeQuiLa jobs in the cloud. For most tasks we use Terraform as the main IaC (Infrastructure as Code) tool.


Disclaimer

These are NOT production-ready examples. The Terraform modules and Docker images are scanned/linted with tools such as checkov, tflint and tfsec, but some security tweaks have been disabled for the sake of simplicity. Some cloud deployment best practices have been intentionally skipped as well. Check the code comments for details.
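If you want to reproduce those scans locally, a minimal sketch (assuming checkov, tflint and tfsec are installed), run against one of the modules, e.g. the GCP one:

## static analysis of a Terraform module (GCP shown as an example)
cd cloud/gcp
checkov -d .
tflint --init && tflint
tfsec .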

Demo scenario

  1. The presented scenario can be deployed on one of the main cloud providers: Azure (Microsoft), AWS (Amazon) or GCP (Google).
  2. For each cloud two options are presented: deployment on a managed Hadoop ecosystem (Azure - HDInsight, AWS - EMR, GCP - Dataproc) or on a managed Kubernetes service (Azure - AKS, AWS - EKS, GCP - GKE).
  3. The scenario includes the following steps:
    1. set up distributed object storage
    2. copy test data
    3. set up the computing environment
    4. run a test PySeQuiLa job with PySpark, using YARN or the spark-on-k8s-operator
  4. We assume that:
    • on AWS: an account is created
    • on GCP: a project is created and attached to a billing account
    • on Azure: a subscription is created (a Google Cloud project is conceptually similar to an Azure subscription in terms of billing, quotas and limits)

Set SeQuiLa and PySeQuiLa versions

Support matrix

Cloud | Service             | Release         | Spark | SeQuiLa | PySeQuiLa | Image tag*
------|---------------------|-----------------|-------|---------|-----------|-----------
GCP   | GKE                 | 1.23.8-gke.1900 | 3.2.2 | 1.1.0   | 0.4.1     | docker.io/biodatageeks/spark-py:pysequila-0.4.1-gke-latest
GCP   | Dataproc            | 2.0.27-ubuntu18 | 3.1.3 | 1.0.0   | 0.3.3     | -
GCP   | Dataproc Serverless | 1.0.21          | 3.2.2 | 1.1.0   | 0.4.1     | gcr.io/${TF_VAR_project_name}/spark-py:pysequila-0.4.1-dataproc-latest
Azure | AKS                 | 1.23.12         | 3.2.2 | 1.1.0   | 0.4.1     | docker.io/biodatageeks/spark-py:pysequila-0.4.1-aks-latest
Azure | HDInsight           | 5.0.300.1       | 3.2.2 | 1.1.0   | 0.4.1     | -
AWS   | EKS                 | 1.23.9          | 3.2.2 | 1.1.0   | 0.4.1     | docker.io/biodatageeks/spark-py:pysequila-0.4.1-eks-latest
AWS   | EMR Serverless      | emr-6.7.0       | 3.2.1 | 1.1.0   | 0.4.1     | -

Based on the above table, set the software versions and Docker images accordingly, e.g.: 💡 These environment variables need to be set prior to launching the SeQuiLa-cli container.

### All clouds
export TF_VAR_pysequila_version=0.4.1
export TF_VAR_sequila_version=1.1.0
## GCP only
export TF_VAR_pysequila_image_gke=docker.io/biodatageeks/spark-py:pysequila-${TF_VAR_pysequila_version}-gke-latest
export TF_VAR_pysequila_image_dataproc=docker.io/biodatageeks/spark-py:pysequila-${TF_VAR_pysequila_version}-dataproc-latest
## Azure only
export TF_VAR_pysequila_image_aks=docker.io/biodatageeks/spark-py:pysequila-${TF_VAR_pysequila_version}-aks-latest
## AWS only
export TF_VAR_pysequila_image_eks=docker.io/biodatageeks/spark-py:pysequila-${TF_VAR_pysequila_version}-eks-latest

Using SeQuiLa cli Docker image

💡 It is strongly recommended to use the biodatageeks/sequila-cloud-cli:latest image to run all the commands. This image contains all the tools required to set up the infrastructure and run the SeQuiLa demo jobs.

Using SeQuiLa cli Docker image for GCP

## change to your project and region/zone
export TF_VAR_project_name=tbd-tbd-devel
export TF_VAR_region=europe-west2
export TF_VAR_zone=europe-west2-b
##
docker pull biodatageeks/sequila-cloud-cli:latest
docker run --rm -it \
    -e TF_VAR_project_name=${TF_VAR_project_name} \
    -e TF_VAR_region=${TF_VAR_region} \
    -e TF_VAR_zone=${TF_VAR_zone} \
    -e TF_VAR_pysequila_version=${TF_VAR_pysequila_version} \
    -e TF_VAR_sequila_version=${TF_VAR_sequila_version} \
    -e TF_VAR_pysequila_image_gke=${TF_VAR_pysequila_image_gke} \
    biodatageeks/sequila-cloud-cli:latest

💡 The rest of the commands in this demo should be executed in the container.

cd git && git clone https://github.com/biodatageeks/sequila-cloud-recipes.git && \
cd sequila-cloud-recipes && \
cd cloud/gcp
terraform init

Using SeQuiLa cli Docker image for Azure

export TF_VAR_region=westeurope
docker pull biodatageeks/sequila-cloud-cli:latest
docker run --rm -it \
    -e TF_VAR_region=${TF_VAR_region} \
    -e TF_VAR_pysequila_version=${TF_VAR_pysequila_version} \
    -e TF_VAR_sequila_version=${TF_VAR_sequila_version} \
    -e TF_VAR_pysequila_image_aks=${TF_VAR_pysequila_image_aks} \
    biodatageeks/sequila-cloud-cli:latest 

💡 The rest of the commands in this demo should be executed in the container.

cd git && git clone https://github.com/biodatageeks/sequila-cloud-recipes.git && \
cd sequila-cloud-recipes && \
cd cloud/azure
terraform init

Using SeQuiLa cli Docker image for AWS

docker pull biodatageeks/sequila-cloud-cli:latest
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e TF_VAR_pysequila_version=${TF_VAR_pysequila_version} \
    -e TF_VAR_sequila_version=${TF_VAR_sequila_version} \
    -e TF_VAR_pysequila_image_eks=${TF_VAR_pysequila_image_eks} \
    biodatageeks/sequila-cloud-cli:latest

💡 The rest of the commands in this demo should be executed in the container.

cd git && git clone https://github.com/biodatageeks/sequila-cloud-recipes.git && \
cd sequila-cloud-recipes && \
cd cloud/aws
terraform init

Module statuses

(Terraform module status badges for GCP, Azure and AWS)

AWS

Login

There are a few authentication methods available. Pick the one that is most convenient for you, e.g. set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_REGION environment variables.

export AWS_ACCESS_KEY_ID="anaccesskey"
export AWS_SECRET_ACCESS_KEY="asecretkey"
export AWS_REGION="eu-west-1"

💡 The above-mentioned user/service account should have admin privileges to manage EKS/EMR and S3 resources.

EKS

Deploy

  1. Ensure you are in the right subfolder:
echo $PWD | rev | cut -f1,2 -d'/' | rev
cloud/aws
  2. Run:
terraform apply -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-eks.tfvars

Run

  1. Connect to the K8S cluster, e.g.:
## Fetch configuration
aws eks update-kubeconfig --region eu-west-1 --name sequila
## Verify
kubectl get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-1-241.eu-west-1.compute.internal   Ready    <none>   36m   v1.23.9-eks-ba74326
  2. Use sparkctl (recommended, available in the sequila-cli image) or kubectl (see the sketch below) to deploy a SeQuiLa job:
sparkctl create ../../jobs/aws/eks/pysequila.yaml

After a while you will be able to check the logs:

sparkctl log -f pysequila

💡 Alternatively, you can use the k9s tool (available in the image) to check the Spark driver's standard output.
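If you prefer plain kubectl over sparkctl, a rough equivalent (assuming the spark-on-k8s-operator's default <app>-driver pod naming):

## submit the SparkApplication manifest directly
kubectl apply -f ../../jobs/aws/eks/pysequila.yaml
## follow the driver logs
kubectl logs -f pysequila-driver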

Cleanup

sparkctl delete pysequila
terraform destroy -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-eks.tfvars

EMR Serverless

Deploy

Unlike GCP Dataproc Serverless, which supports custom Docker images for the Spark driver and executors, AWS EMR Serverless requires both preparing a tarball of a Python virtual environment (using venv-pack or conda-pack) and copying extra jar files to an S3 bucket. Both steps are automated by the emr-serverless module; more info can be found here. Starting from EMR release 6.7.0 it is possible to specify extra jars using the --packages option, but this requires an additional VPC NAT setup. 💡 This is why it may take some time (depending on your network bandwidth) to prepare and upload the additional dependencies to the S3 bucket, so please be patient.
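For reference, a rough sketch of what the emr-serverless module automates for the Python part (the archive and bucket names below are illustrative, not the module's actual values):

## build a relocatable virtual environment containing PySeQuiLa
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pysequila venv-pack
## pack it into a tarball that EMR Serverless can consume via --archives
venv-pack -o pyspark_pysequila-${TF_VAR_pysequila_version}.tar.gz
## upload the archive to the job bucket
aws s3 cp pyspark_pysequila-${TF_VAR_pysequila_version}.tar.gz s3://<your-bucket>/venv/pysequila/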

terraform apply -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-emr.tfvars

Run

In the output of the above command you will find a rendered command (including the environment variable exports) that you can use to launch a sample job:

Apply complete! Resources: 178 added, 0 changed, 0 destroyed.

Outputs:

emr_server_exec_role_arn = "arn:aws:iam::927478350239:role/sequila-role"
emr_serverless_command = <<EOT

export APPLICATION_ID=00f5c6prgt01190p
export JOB_ROLE_ARN=arn:aws:iam::927478350239:role/sequila-role

aws emr-serverless start-job-run \
  --application-id $APPLICATION_ID \
  --execution-role-arn $JOB_ROLE_ARN \
  --job-driver '{
      "sparkSubmit": {
          "entryPoint": "s3://sequilabhp8knyc/jobs/pysequila/sequila-pileup.py",
          "entryPointArguments": ["pyspark_pysequila-0.4.1.tar.gz"],
          "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.driver.memory=2g --conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.executor.instances=1 --archives=s3://sequilabhp8knyc/venv/pysequila/pyspark_pysequila-0.4.1.tar.gz#environment --jars s3://sequilabhp8knyc/jars/sequila/1.1.0/* --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.files=s3://sequilabhp8knyc/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta,s3://sequilabhp8knyc/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai"
      }
  }'

EOT
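The start-job-run call returns a job run id; as a sketch (the id below is a placeholder), you can poll the state of the submitted job with the AWS CLI:

## check the status of the EMR Serverless job run
aws emr-serverless get-job-run \
  --application-id $APPLICATION_ID \
  --job-run-id <JOB_RUN_ID>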

Cleanup

terraform destroy -var-file=../../env/aws.tfvars -var-file=../../env/_all.tfvars -var-file=../../env/aws-emr.tfvars

Azure

Login

Install the Azure CLI and set the default subscription:

az login
az account set --subscription "Azure subscription 1"

HDInsight

💡 According to the release notes, HDInsight 5.0 comes with Apache Spark 3.1.2. Unfortunately, in fact it is 3.0.2.


Since HDInsight is in fact a full-fledged Hadoop cluster, we were able to add support for Apache Spark 3.2.2 to the Terraform module using the script action mechanism.
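For illustration only, a script action can also be triggered manually with the Azure CLI; the resource group, cluster name and script URI below are placeholders rather than the module's actual values:

## run a custom install script on the head and worker nodes
az hdinsight script-action execute \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name install-spark-3-2-2 \
  --script-uri https://<storage-account>.blob.core.windows.net/scripts/install-spark.sh \
  --roles headnode workernode \
  --persist-on-success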

Deploy

export TF_VAR_hdinsight_gateway_password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16 ; echo '')
export TF_VAR_hdinsight_ssh_password=$(tr -dc A-Za-z0-9 </dev/urandom | head -c 16 ; echo '')
terraform apply -var-file=../../env/azure.tfvars -var-file=../../env/azure-hdinsight.tfvars -var-file=../../env/_all.tfvars

Check the Terraform output variables for the SSH connection string, credentials and the spark-submit command, e.g.:

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:

hdinsight_gateway_password = "w8aN6oVSJobq7eu4"
hdinsight_ssh_password = "wun6RzBBPWD9z9ke"
pysequila_submit_command = <<EOT
export SPARK_HOME=/opt/spark
spark-submit \
--master yarn \
--packages org.biodatageeks:sequila_2.12:1.1.0 \
--conf spark.pyspark.python=/usr/bin/miniforge/envs/py38/bin/python3 \
--conf spark.driver.cores=1 \
--conf spark.driver.memory=1g \
--conf spark.executor.cores=1 \
--conf spark.executor.memory=3g \
--conf spark.executor.instances=1 \
--conf spark.files=wasb://sequila@sequilai1aayxsd.blob.core.windows.net/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta,wasb://sequila@sequilai1aayxsd.blob.core.windows.net/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai \
wasb://sequila@sequilai1aayxsd.blob.core.windows.net/jobs/pysequila/sequila-pileup.py

EOT
ssh_command = "ssh sequila@sequila-6lsnhqtc-ssh.azurehdinsight.net"
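💡 If you need these values again later, they can be re-printed without re-applying (assuming the outputs are plain strings):

terraform output -raw pysequila_submit_command
terraform output -raw ssh_command
terraform output -raw hdinsight_ssh_password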

Run

  1. Use ssh_command and hdinsight_ssh_password to connect to the head node.
  2. Run the pysequila_submit_command command.


Cleanup

terraform destroy -var-file=../../env/azure.tfvars -var-file=../../env/azure-hdinsight.tfvars -var-file=../../env/_all.tfvars

AKS

Deploy

  1. Ensure you are in the right subfolder:
echo $PWD | rev | cut -f1,2 -d'/' | rev
cloud/azure
  2. Run:
terraform apply -var-file=../../env/azure.tfvars -var-file=../../env/azure-aks.tfvars -var-file=../../env/_all.tfvars

Run

  1. Connect to the K8S cluster, e.g.:
## Fetch configuration
az aks get-credentials --resource-group sequila-resources --name sequila-aks1
# check connectivity
kubectl get nodes
NAME                              STATUS   ROLES   AGE   VERSION
aks-default-37875945-vmss000002   Ready    agent   59m   v1.20.9
aks-default-37875945-vmss000003   Ready    agent   59m   v1.20.9
  2. Use sparkctl (recommended, available in the sequila-cli image) or kubectl to deploy a SeQuiLa job:
sparkctl create ../../jobs/azure/aks/pysequila.yaml

After a while you will be able to check the logs:

sparkctl log -f pysequila

💡 Alternatively, you can use the k9s tool (available in the image) to check the Spark driver's standard output.

Cleanup

sparkctl delete pysequila
terraform destroy -var-file=../../env/azure.tfvars -var-file=../../env/azure-aks.tfvars -var-file=../../env/_all.tfvars

GCP

Login

  1. Install Cloud SDK
  2. Authenticate
gcloud auth application-default login
# set default project
gcloud config set project $TF_VAR_project_name

General GCP setup

  1. Set the GCP project-related env variables, e.g.: 💡 If you use our image, all the env variables are already set.
export TF_VAR_project_name=tbd-tbd-devel
export TF_VAR_region=europe-west2
export TF_VAR_zone=europe-west2-b

The above variables are necessary for both the Dataproc and GKE setups.

  2. Ensure you are in the right subfolder:

echo $PWD | rev | cut -f1,2 -d'/' | rev
cloud/gcp

Dataproc

Deploy

terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars

Run

gcloud dataproc workflow-templates instantiate pysequila-workflow --region ${TF_VAR_region}

Waiting on operation [projects/tbd-tbd-devel/regions/europe-west2/operations/36cbc4dc-783c-336c-affd-147d24fa014c].
WorkflowTemplate [pysequila-workflow] RUNNING
Creating cluster: Operation ID [projects/tbd-tbd-devel/regions/europe-west2/operations/ef2869b4-d1eb-49d8-ba56-301c666d385b].
Created cluster: tbd-tbd-devel-cluster-s2ullo6gjaexa.
Job ID tbd-tbd-devel-job-s2ullo6gjaexa RUNNING
Job ID tbd-tbd-devel-job-s2ullo6gjaexa COMPLETED
Deleting cluster: Operation ID [projects/tbd-tbd-devel/regions/europe-west2/operations/0bff879e-1204-4971-ae9e-ccbf9c642847].
WorkflowTemplate [pysequila-workflow] DONE
Deleted cluster: tbd-tbd-devel-cluster-s2ullo6gjaexa.

or from the GCP UI Console.


Cleanup

terraform destroy -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars

Dataproc serverless

Deploy

  1. Prepare the infrastructure, including a container registry (see point 2):
terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars
  2. According to the documentation, Dataproc Serverless cannot fetch containers from registries other than GCP's own (in particular not from docker.io). This is why you need to pull the required image from docker.io and push it to your project's GCR (Google Container Registry), e.g.:
gcloud auth configure-docker
docker tag  biodatageeks/spark-py:pysequila-0.4.1-dataproc-b3c836e  $TF_VAR_pysequila_image_dataproc
docker push $TF_VAR_pysequila_image_dataproc

Run

gcloud dataproc batches submit pyspark gs://${TF_VAR_project_name}-staging/jobs/pysequila/sequila-pileup.py \
  --batch=pysequila \
  --region=${TF_VAR_region} \
  --container-image=${TF_VAR_pysequila_image_dataproc} \
  --version=1.0.21 \
  --files gs://bigdata-datascience-staging/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta,gs://bigdata-datascience-staging/data/Homo_sapiens_assembly18_chr1_chrM.small.fasta.fai

Batch [pysequila] submitted.
Pulling image gcr.io/bigdata-datascience/spark-py:pysequila-0.3.4-dataproc-b3c836e
Image is up to date for sha256:30b836594e0a768211ab209ad02ad3ad0fb1c40c0578b3503f08c4fadbab7c81
Waiting for container log creation
PYSPARK_PYTHON=/usr/bin/python3.9
JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64
SPARK_EXTRA_CLASSPATH=/opt/spark/.ivy2/jars/*
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/.ivy2/jars/org.slf4j_slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
:: loading settings :: file = /etc/spark/conf/ivysettings.xml
+------+---------+-------+---------+--------+--------+-----------+----+-----+
|contig|pos_start|pos_end|      ref|coverage|countRef|countNonRef|alts|quals|
+------+---------+-------+---------+--------+--------+-----------+----+-----+
|     1|       34|     34|        C|       1|       1|          0|null| null|
|     1|       35|     35|        C|       2|       2|          0|null| null|
|     1|       36|     37|       CT|       3|       3|          0|null| null|
|     1|       38|     40|      AAC|       4|       4|          0|null| null|
|     1|       41|     49|CCTAACCCT|       5|       5|          0|null| null|
+------+---------+-------+---------+--------+--------+-----------+----+-----+
only showing top 5 rows

Batch [pysequila] finished.
metadata:
  '@type': type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata
  batch: projects/bigdata-datascience/locations/europe-west2/batches/pysequila
  batchUuid: c798a09f-c690-4bc8-9dc8-6be5d1e565e0
  createTime: '2022-11-04T08:37:17.627022Z'
  description: Batch
  operationType: BATCH
name: projects/bigdata-datascience/regions/europe-west2/operations/a746a63b-61ed-3cca-816b-9f2a4ccae2f8


Cleanup

  1. Remove the Dataproc Serverless batch:
gcloud dataproc batches delete pysequila --region=${TF_VAR_region}
  2. Destroy the infrastructure:
terraform destroy -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-dataproc.tfvars -var-file=../../env/_all.tfvars

GKE

Deploy

terraform apply -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-gke.tfvars -var-file=../../env/_all.tfvars

Run

  1. Connect to the K8S cluster, e.g.:
## Fetch configuration
gcloud container clusters get-credentials ${TF_VAR_project_name}-cluster --zone ${TF_VAR_zone} --project ${TF_VAR_project_name}
# check connectivity
kubectl get nodes
NAME                                                  STATUS   ROLES    AGE   VERSION
gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-8wqh   Ready    <none>   25m   v1.21.5-gke.1302
gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-dlr1   Ready    <none>   25m   v1.21.5-gke.1302
gke-tbd-tbd-devel-cl-tbd-tbd-devel-la-cb515767-r5l3   Ready    <none>   25m   v1.21.5-gke.1302
  2. Use sparkctl (recommended, available in the sequila-cli image) or kubectl to deploy a SeQuiLa job:
sparkctl create ../../jobs/gcp/gke/pysequila.yaml

After a while you will be able to check the logs:

sparkctl log -f pysequila

💡 Alternatively, you can use the k9s tool (available in the image) to check the Spark driver's standard output.

Cleanup

sparkctl delete pysequila
terraform destroy -var-file=../../env/gcp.tfvars -var-file=../../env/gcp-gke.tfvars -var-file=../../env/_all.tfvars

Development and contribution

Setup pre-commit checks

  1. Activate the pre-commit integration:
pre-commit install
  2. Install the pre-commit hook dependencies:
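The exact dependency list is not spelled out here; as an assumption, pre-commit can bootstrap the hook environments itself and then run all checks:

## install the hook environments declared in .pre-commit-config.yaml
pre-commit install --install-hooks
## optionally run all hooks against the whole repository
pre-commit run --all-files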