EKS Deep Learning Benchmark Utility
The EKS Deep Learning Benchmark Utility is an automated tool for machine learning benchmarking on Kubernetes clusters.
Features
- Automated end-to-end benchmarking workflow from cluster creation to cluster tear down
- Supports highly configurable Kubernetes cluster setups
- Supports different backend storage systems including Amazon Elastic File System and Amazon FSx for Lustre
- Uses S3 to read benchmark configs and write back experiment results
- Backed by Kubeflow operators and kubebench
- Supports multiple frameworks including:
- TensorFlow
- TensorFlow + Horovod + OpenMPI
- PyTorch
- MXNet
- Exit handlers to copy intermediate results and automatically tear down the cluster
- Run multiple experiments in parallel
High Level Design
Prerequisites for running benchmarks
To successfully run benchmarks automatically, you need to:
- Set up NFS
- Install Argo Workflows
- Configure AWS credentials
- Configure your GitHub token
- Set up S3 buckets for your benchmark results and (optionally) your training data
- Configure your Kubernetes cluster
Set up NFS
Each benchmark has many steps and needs a file system to sync its status. We set up an NFS server to store the benchmark configuration, required source files, and benchmark results. All files are synced to the S3 bucket after the experiment completes.
Note: This is not a real NFS server; it is a web frontend server that emulates NFS. Please check the source for details.
kubectl create -f deploy/benchmark-nfs-svc.yaml
kubectl get svc benchmark-nfs-svc -o=jsonpath='{.spec.clusterIP}'
# Replace the IP in `deploy/benchmark-nfs-volume.yaml` with the value printed above before running the next step
kubectl create -f deploy/benchmark-nfs-volume.yaml
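The manual IP replacement can also be scripted. The following is a minimal sketch and not part of the repository: it assumes the volume manifest contains exactly one `server:` field holding the NFS address, as a standard Kubernetes NFS PersistentVolume does, but this has not been verified against `deploy/benchmark-nfs-volume.yaml`. The function name `set_nfs_server` is hypothetical.

```python
import re


def set_nfs_server(manifest_text: str, cluster_ip: str) -> str:
    """Return manifest_text with the value of its `server:` field set to cluster_ip.

    Assumption: the manifest has exactly one `server:` line (the NFS server).
    """
    # Match only spaces/tabs around the key so the pattern never spans lines.
    pattern = re.compile(r"^([ \t]*server:[ \t]*).*$", flags=re.MULTILINE)
    updated, count = pattern.subn(lambda m: m.group(1) + cluster_ip, manifest_text)
    if count != 1:
        raise ValueError(f"expected exactly one 'server:' field, found {count}")
    return updated
```

The input for `cluster_ip` would be the value printed by the `kubectl get svc` command, and the returned text would be written back to the manifest before running `kubectl create`.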
Install Argo Workflows
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Each benchmark experiment is an Argo workflow, and we use Argo to orchestrate and manage our jobs.
kubectl create ns argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.2.1/manifests/install.yaml
# Optionally, forward a port to localhost to view the Argo UI
kubectl port-forward deployment/argo-ui 8001:8001 -n argo
Set up AWS credentials
Replace YOUR_AWS_ACCESS_KEY_ID and YOUR_AWS_SECRET_ACCESS_KEY in deploy/aws-secret.yaml with your own AWS credentials.
This account is used during the experiment to create the EKS cluster, set up data storage such as EFS or FSx for Lustre, and write to S3 buckets, so it needs at least the permissions for those operations.
kubectl apply -f deploy/aws-secret.yaml
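For readers who prefer to generate the secret rather than edit the file by hand, here is a hedged sketch. It assumes the secret is named `aws-secret` and uses the keys `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which match the `s3SecretName` and key-name parameters in this document; whether `deploy/aws-secret.yaml` uses the `data` field exactly this way is an assumption. The function name `build_aws_secret` is hypothetical.

```python
import base64
import json


def build_aws_secret(access_key_id: str, secret_access_key: str,
                     name: str = "aws-secret") -> dict:
    """Build a Kubernetes Secret manifest holding AWS credentials."""
    def encode(value: str) -> str:
        # Values under `data` in a Secret must be base64-encoded strings.
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            "AWS_ACCESS_KEY_ID": encode(access_key_id),
            "AWS_SECRET_ACCESS_KEY": encode(secret_access_key),
        },
    }


if __name__ == "__main__":
    # Placeholder values only; never commit real credentials.
    manifest = build_aws_secret("YOUR_AWS_ACCESS_KEY_ID", "YOUR_AWS_SECRET_ACCESS_KEY")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output could be saved to a file and applied with `kubectl apply -f`.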
Set up GitHub token
Replace YOUR_GITHUB_TOKEN in deploy/github-token.yaml with your GitHub token. The token is used by ksonnet; without it, the experiment will quickly run into GitHub API rate limits.
kubectl apply -f deploy/github-token.yaml
Set up S3 buckets
First, create a bucket for benchmark results. The copy-result step syncs results to the location given by s3ResultPath in your configuration.
If you would like to use real storage for testing, create another S3 bucket and upload your training files there. Then set s3DatasetPath and storageBackend in the configuration. The workflow will automatically create the backend storage (Amazon Elastic File System or Amazon FSx for Lustre) and sync the files under s3DatasetPath to it. During training, the storage is mounted as a Persistent Volume in the worker pods.
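Since the result and dataset locations are given as S3 URIs, it can help to check them before submitting a run. The sketch below is an illustration only; `split_s3_uri` is a hypothetical helper, not something the benchmark tool provides.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    # The path keeps its trailing slash; only the leading one is dropped.
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_s3_uri('s3://eks-dl-benchmark/imagenet/')` yields the bucket `eks-dl-benchmark` and the prefix `imagenet/`, while a value without the `s3://` scheme raises `ValueError`.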
Cluster configuration
Kubernetes & Worker Node:
- clusterConfig: 's3://kubeflow-pipeline-data/benchmark/cluster_config.yaml'
Cluster configuration example
# For details, please check the eksctl documentation or API specs.
# https://github.com/weaveworks/eksctl/blob/master/pkg/apis/eksctl.io/v1alpha4/types.go
apiVersion: eksctl.io/v1alpha4
kind: ClusterConfig
metadata:
name: YOUR_EKS_CLUSTER_NAME
region: us-west-2
version: '1.12'
# If your region has multiple availability zones, you can specify 3 of them.
availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]
# NodeGroup holds all configuration attributes that are specific to a nodegroup
# You can have several node groups in your cluster.
nodeGroups:
- name: training
instanceType: p3.16xlarge
desiredCapacity: 1
minSize: 0
maxSize: 2
volumeSize: 30
availabilityZones: ["us-west-2a"]
iam:
withAddonPolicies:
efs: true
fsx: true
# Node Group AMI Id
# ami: xxxxx
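A few plausibility checks on a cluster configuration can be sketched as follows. This is an illustration under stated assumptions, not eksctl's own validation: it takes the configuration as an already-parsed dictionary, and the two rules (node group sizes must be ordered, and node group zones should be among the cluster's zones) are checks chosen for this example. The function name `check_cluster_config` is hypothetical.

```python
def check_cluster_config(config: dict) -> list:
    """Return a list of human-readable problems found in an eksctl-style config dict."""
    problems = []
    cluster_zones = set(config.get("availabilityZones", []))
    for group in config.get("nodeGroups", []):
        name = group.get("name", "<unnamed>")
        sizes = [group.get(key) for key in ("minSize", "desiredCapacity", "maxSize")]
        # Only compare sizes when all three fields are present.
        if None not in sizes and not sizes[0] <= sizes[1] <= sizes[2]:
            problems.append(
                f"{name}: need minSize <= desiredCapacity <= maxSize, got {sizes}"
            )
        extra = set(group.get("availabilityZones", [])) - cluster_zones
        if cluster_zones and extra:
            problems.append(
                f"{name}: zones {sorted(extra)} are not in the cluster's availabilityZones"
            )
    return problems
```

Run against the example configuration, with `desiredCapacity: 1`, `minSize: 0`, `maxSize: 2`, and the node group in `us-west-2a`, the function returns an empty list.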
Training model:
- storageBackend: 'fsx' | 'efs'
- s3DatasetPath: 's3://eks-dl-benchmark/imagenet/'
- s3ResultPath: 's3://eks-dl-benchmark/benchmark/'
- experiments:
- experiment: 'experiment-20190424-gpu-16',
- trainingJobConfig: 's3://kubeflow-pipeline-data/benchmark/mpi-job-imagenet.yaml',
- trainingJobPkg: 'mpi-job',
- trainingJobPrototype: 'mpi-job-custom',
- trainingJobRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow'
Training job configuration:
args: --batch_size=256,--model=resnet50,--num_batches=100,--fp16,--display_every=50,--lr_decay_mode=poly,--intra_op_parallelism_threads=2,--inter_op_parallelism_threads=8,--num_parallel_calls=8,--data_dir=/kubebench/data/imagenet/train
command: mpirun,-mca,btl_tcp_if_exclude,lo,-mca,pml,ob1,-mca,btl,^openib,--bind-to,none,-map-by,slot,-x,LD_LIBRARY_PATH,-x,PATH,-x,NCCL_DEBUG=INFO,-x,NCCL_MIN_NRINGS=4,-x,HOROVOD_FUSION_THRESHOLD=16777216,-x,HOROVOD_HIERARCHICAL_ALLREDUCE=1,python,models/resnet/tensorflow/train_imagenet_resnet_hvd.py
gpusPerReplica: 1
image: seedjeffwan/eks-dl-benchmark:cuda10-tf1.13.1-hvd0.16.0-py3.5
name: resnset-aws-imagenet
replicas: 1
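The `command` and `args` fields above are comma-separated strings. Assuming the job template splits each field on commas to form the container's command line (the format suggests this, but this document does not state it), the resulting argument list can be previewed with a sketch like the one below. `build_command_line` is a hypothetical helper.

```python
def build_command_line(command: str, args: str) -> list:
    """Split the comma-separated `command` and `args` fields into one argv list."""
    def split(field: str) -> list:
        # Drop empty items caused by stray or trailing commas.
        return [part for part in field.split(",") if part]

    return split(command) + split(args)
```

One consequence of this format, if the assumption holds, is that an individual argument value cannot itself contain a comma.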
Run the benchmark jobs
You have two ways to configure your benchmark jobs.
- Update your workflow settings using the ks command
ks param set workflows storageBackend fsx
- Update benchmark workflow manifest directly
vim ks-app/components/params.libsonnet
Here's an example of a full configuration in ks-app/components/params.libsonnet:
s3ResultPath: 's3://kubeflow-pipeline-data/benchmark/',
s3DatasetPath: 's3://eks-dl-benchmark/imagenet/',
clusterConfig: 's3://kubeflow-pipeline-data/benchmark/cluster_config.yaml',
experiments: [{
experiment: 'experiment-20190415-01',
trainingJobConfig: 's3://kubeflow-pipeline-data/benchmark/mpi-job-imagenet.yaml',
trainingJobPkg: 'mpi-job',
trainingJobPrototype: 'mpi-job-custom',
// Change to upstream once https://github.com/kubeflow/kubeflow/pull/3062 is merged
trainingJobRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow',
}],
githubSecretName: 'github-token',
githubSecretTokenKeyName: 'GITHUB_TOKEN',
image: 'seedjeffwan/benchmark-runner:20190424',
name: '20190424-00',
namespace: 'default',
nfsVolume: 'benchmark-pv',
nfsVolumeClaim: 'benchmark-pvc',
region: 'us-west-2',
trainingDatasetVolume: 'dataset-claim',
s3SecretName: 'aws-secret',
s3SecretAccesskeyidKeyName: 'AWS_ACCESS_KEY_ID',
s3SecretSecretaccesskeyKeyName: 'AWS_SECRET_ACCESS_KEY',
storageBackend: 'fsx',
kubeflowRegistry: 'github.com/jeffwan/kubeflow/tree/make_kubebench_reporter_optional/kubeflow'
For clusterConfig and trainingJobConfig, please check config for examples. Be sure to change the name value for every benchmark.
Once you are done, run ks show default -c workflows > workflow.yaml. If your input is valid, you will see workflow.yaml in your folder.
This is an Argo workflow, and you can submit it to your cluster with kubectl apply -f workflow.yaml.
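Because the name value has to change for every benchmark, generating it can avoid accidental reuse. The sketch below assumes a date-plus-counter scheme modeled on the example value '20190424-00'; the scheme and the function name `benchmark_name` are illustrative choices, not a requirement of the tool.

```python
from datetime import date


def benchmark_name(day: date, sequence: int) -> str:
    """Return a name such as '20190424-00' from a date and a per-day counter."""
    if not 0 <= sequence <= 99:
        raise ValueError("sequence must be between 0 and 99")
    return f"{day:%Y%m%d}-{sequence:02d}"
```

The result could then be set in params.libsonnet, or presumably with ks param set in the same way as the storageBackend example.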
Benchmark Workflow
Experiment Outputs
Experiment outputs are synced to S3 after the experiment finishes. There you can check the configuration of your cluster, storage, and experiments. The most important outputs are the training logs and metrics, which you can find under experiments/${experiment_id}/output/${launcher_pod}
├── eksctl-cluster-config.yaml
├── storage-config.yaml
├── experiments
│ └── mpi-job-imagenet-201904251700-sszd
│ ├── config
│ │ ├── kf-job-manifest.yaml
│ │ └── mpi-job-imagenet.yaml
│ └── output
│ └── mpi-job-imagenet-201904251700-sszd-launcher-6b69v (training logs)
├── ks-app
├── kubeconfig
└── logs
└── start_cluster.log
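Once the results have been copied from S3 to a local directory, the training logs can be located by following the layout shown above. This is a minimal sketch that assumes the local copy preserves that layout and that launcher log names contain the word "launcher", as in the example tree. `find_launcher_logs` is a hypothetical helper.

```python
from pathlib import Path


def find_launcher_logs(results_root: str) -> dict:
    """Map each experiment directory name to its sorted launcher log paths."""
    logs = {}
    for output_dir in sorted(Path(results_root).glob("experiments/*/output")):
        experiment_id = output_dir.parent.name
        logs[experiment_id] = sorted(
            path for path in output_dir.iterdir() if "launcher" in path.name
        )
    return logs
```

The returned dictionary is keyed by experiment directory name, so results from several experiments in one run can be inspected side by side.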
Optimizations
We have compiled a list of performance optimizations that can improve the results of your deep learning jobs. Apply these optimizations and re-run the benchmark to see if they affect your results.
Sample workload
We provide sample scripts, which you can run yourself, for training deep learning models optimized to run well on Amazon Elastic Container Service for Kubernetes.
Contributing Guidance
See our contributing guidance.
Test Python module locally
export PYTHONPATH=${YOUR_PATH_TO}/kubeflow/testing/py:${YOUR_PATH_TO}/aws-eks-deep-learning-benchmark/src
python -m benchmark.test.install_storage_backend --storage_backend=fsx --experiment_id=001 --s3_import_path=s3://eks-dl-benchmark
Security disclosures
If you think you’ve found a potential security issue, please do not post it in the Issues. Instead, please follow the instructions here or email AWS security directly.
Acknowledgements
Thanks to Xinyuan Huang from the Cisco AI team for the help and support with the kubebench integration. We also want to acknowledge the Kubeflow community; we reuse some of the logic and utilities from Test infrastructure and tooling for Kubeflow.