amazon-eks-machine-learning-with-terraform-and-kubeflow

Distributed TensorFlow training using Kubeflow on Amazon EKS

For a detailed walkthrough, please refer to this blog

Prerequisites

  1. Create and activate an AWS Account
  2. Select your AWS Region. For the tutorial below, we assume the region to be us-west-2
  3. Manage your service limits for GPU-enabled EC2 instances. We recommend setting the service limits to at least 4 instances each for p3.16xlarge, p3dn.24xlarge, p4d.24xlarge, and g5.48xlarge for training, and 2 instances each for g4dn.xlarge and g5.xlarge for testing.

Build machine

For the build machine, we need the AWS CLI and Docker installed. The AWS CLI must be configured for the Administrator job function. You may use your laptop as the build machine if it has the AWS CLI and Docker installed, or you may launch an EC2 instance for the build machine, as described below.
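You can confirm that the AWS CLI is configured with the expected credentials and AWS Region by running:

aws sts get-caller-identity
aws configure get region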

(Optional) Launch EC2 instance for the build machine

To launch an EC2 instance for the build machine, you will need Administrator job function access to the AWS Management Console. In the console, execute the following steps:

  1. Create an Amazon EC2 key pair in your selected AWS region, if you do not already have one

  2. Create an AWS service role for an EC2 instance, and attach the AWS managed policy for Administrator access to this IAM role.

  3. Launch an m5.xlarge instance from the Amazon Linux 2 AMI using the IAM role created in the previous step. Use 100 GB for the root volume size.

  4. After the instance state is Running, connect to your Linux instance as ec2-user. On the Linux instance, install the required software tools as described below:

     sudo yum install -y docker git
     sudo systemctl enable docker.service
     sudo systemctl start docker.service
     sudo usermod -aG docker ec2-user
     exit
    

Now, reconnect to your Linux instance. All steps described under the Step by step tutorial section below must be executed on the build machine.
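After reconnecting, you can verify that Docker is running and that ec2-user can use it without sudo:

docker info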

Step by step tutorial

While the solution described in this tutorial is general, and can be used to train and test any deep neural network (DNN) model, we will make the tutorial concrete by focusing on distributed TensorFlow training for the TensorPack Mask/Faster-RCNN and AWS Mask-RCNN models. The high-level outline of the solution is as follows:

  1. Setup build environment on the build machine
  2. Use Terraform to create infrastructure
  3. Stage training data on EFS, or FSx for Lustre shared file-system
  4. Use Helm charts to launch training jobs in the EKS cluster
  5. Use Jupyter notebook to test the trained model

For this tutorial, we assume the region to be us-west-2. You may need to adjust the commands below if you use a different AWS Region.

Setup build environment on build machine

Clone git repository

Clone this git repository on the build machine using the following commands:

cd ~
git clone https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow.git

Install Kubectl

To install kubectl on Linux, execute the following commands:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
./eks-cluster/install-kubectl-linux.sh

For non-Linux machines, install and configure kubectl for EKS, install aws-iam-authenticator, and make sure the command aws-iam-authenticator help works.

Install Terraform

Install Terraform. Terraform configuration files in this repository are consistent with Terraform v1.1.4 syntax, but may work with other Terraform versions as well.

Install Helm

Helm is the package manager for Kubernetes. It uses a package format named charts. A Helm chart is a collection of files that define Kubernetes resources. Install Helm.
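Before proceeding, you can verify that the required tools are available on your PATH:

kubectl version --client
terraform version
helm version
# For non-Linux machines, also confirm: aws-iam-authenticator help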

Use Terraform to create infrastructure

We recommend the quick start option for first-time walk-through.

Quick start option

This option creates an Amazon EKS cluster, three managed node groups (system, inference, and training), and Amazon EFS and Amazon FSx for Lustre shared file-systems. The system node group size is fixed, and it runs the pods in the kube-system namespace. The EKS cluster uses Cluster Autoscaler to automatically scale the inference and training node groups up and down as needed. To complete the quick start, execute the commands below:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster-and-nodegroup
terraform init

terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'
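If kubectl is not already configured for the new cluster once terraform apply completes, you can update your kubeconfig with the AWS CLI (a quick sketch; adjust the region and cluster name if you changed the variables above):

aws eks update-kubeconfig --region us-west-2 --name my-eks-cluster
kubectl get nodes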

Advanced option

The advanced option separates the creation of the EKS cluster from EKS managed node group for training.

Create EKS cluster

To create the EKS cluster, two managed node groups (system and inference), and the Amazon EFS and Amazon FSx for Lustre shared file-systems, execute:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster
terraform init

terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'

Save the output of this command for creating the EKS managed node group.
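If you need these values again later, you can re-display them from the same directory (assuming the module exposes them as Terraform outputs):

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster
terraform output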

Create EKS managed node group

This step creates an EKS managed node group for training. Use the output of the previous command to specify node_role_arn and subnet_ids below. Specify a unique value for the nodegroup_name variable. To create the node group, execute:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-nodegroup
terraform init

terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster"  -var="node_role_arn=" -var="nodegroup_name=" -var="subnet_ids="

Attach to shared file-systems

The infrastructure created above includes an EFS file-system and an FSx for Lustre file-system.

Start the attach-pvc container for access to the EFS shared file-system by executing the following commands:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster
kubectl apply -f attach-pvc.yaml  -n kubeflow

Start the attach-pvc-fsx container for access to the FSx for Lustre shared file-system by executing the following commands:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster
kubectl apply -f attach-pvc-fsx.yaml  -n kubeflow
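To confirm that both containers are available, list the pods in the kubeflow namespace and check that attach-pvc and attach-pvc-fsx show a Running status:

kubectl get pods -n kubeflow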

Build and Upload Docker Images to Amazon Elastic Container Registry (ECR)

Below, we will build and push all the Docker images to Amazon ECR. Replace aws-region with your AWS Region, and execute:

  cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
  ./build-ecr-images.sh aws-region
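For example, with the us-west-2 region assumed in this tutorial:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
./build-ecr-images.sh us-west-2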

Stage COCO 2017 Dataset

To download the COCO 2017 dataset to your build machine and upload it to an Amazon S3 bucket, customize the eks-cluster/prepare-s3-bucket.sh script to specify your S3 bucket in the S3_BUCKET variable, and execute eks-cluster/prepare-s3-bucket.sh.
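For example (my-coco-bucket is a hypothetical bucket name; use a bucket you own):

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
# Edit the script and set the S3_BUCKET variable to my-coco-bucket (hypothetical name)
vi eks-cluster/prepare-s3-bucket.sh
./eks-cluster/prepare-s3-bucket.sh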

Next, we stage the data on EFS and FSx file-systems, so you have the option to use either one in training.

Stage Data on EFS, and FSx for Lustre

To stage data on EFS, customize the S3_BUCKET variable in eks-cluster/stage-data.yaml and execute:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster
kubectl apply -f stage-data.yaml -n kubeflow

To stage data on FSx for Lustre, customize the S3_BUCKET variable in eks-cluster/stage-data-fsx.yaml and execute:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster
kubectl apply -f stage-data-fsx.yaml -n kubeflow

Execute kubectl get pods -n kubeflow to check the status of the two Pods. Once both Pods show a Completed status, the data has been staged successfully.
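If you prefer to wait from the command line, a simple sketch that polls the staging Pods (the pod name prefix stage-data is an assumption based on the manifest file names):

watch -n 30 'kubectl get pods -n kubeflow | grep stage-data'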

Install Helm charts for model training

Install mpijob chart

To deploy the Kubeflow MPIJob CustomResourceDefinition using the mpijob chart, execute:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug mpijob ./mpijob/

Install Mask-RCNN charts

You have two Helm charts available for training Mask-RCNN models. Both these Helm charts use the same Kubernetes namespace, namely kubeflow. Do not install both Helm charts at the same time.

To train the TensorPack Mask-RCNN model, customize charts/maskrcnn/values.yaml, as described below:

  1. Set shared_fs and data_fs to efs, or fsx, as applicable. Set shared_pvc to the name of the respective persistent-volume-claim, which is tensorpack-efs-gp-bursting for efs, and tensorpack-fsx for fsx.
  2. Use AWS check ip to get the public IP of your web browser client, and use this public IP to set global.source_cidr as a /32 CIDR (see the sketch after this list). This restricts Internet access to the Jupyter notebook and TensorBoard services to your public IP.
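One way to obtain this /32 CIDR from the command line is to query the AWS check ip endpoint (a minimal sketch):

# Prints your public IP as a /32 CIDR, e.g. 203.0.113.25/32
echo "$(curl -s https://checkip.amazonaws.com)/32"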

To password protect TensorBoard, you must set htpasswd in charts/maskrcnn/charts/jupyter/values.yaml to a quoted MD5 password hash (see the sketch below).
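One hedged way to generate such a hash on the build machine is with OpenSSL (the password below is a placeholder; depending on how the chart validates passwords, the Apache variant openssl passwd -apr1 may be required instead):

# Produces an MD5-based crypt hash, e.g. $1$...; paste it, quoted, as the htpasswd value
openssl passwd -1 'your-password'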

To install the maskrcnn chart, execute:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug maskrcnn ./maskrcnn/

To train AWS Mask-RCNN optimized model, customize charts/maskrcnn-optimized/values.yaml, as described below:

  1. Set shared_fs and data_fs to efs, or fsx, as applicable. Set shared_pvc to the name of the respective persistent-volume-claim, which is tensorpack-efs-gp-bursting for efs, and tensorpack-fsx for fsx.
  2. Set global.source_cidr to your public source CIDR.

To password protect TensorBoard, you must set htpasswd in charts/maskrcnn-optimized/charts/jupyter/values.yaml to a quoted MD5 password hash, as described for the maskrcnn chart above.

To install the maskrcnn-optimized chart, execute:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug maskrcnn-optimized ./maskrcnn-optimized/

Monitor training

Note that this solution uses the Kubernetes Cluster Autoscaler to automatically scale the EKS managed node group used for training up from zero nodes and back down to zero nodes. So, if your training node group currently has zero nodes, it may take several minutes for the GPU nodes to become Ready and for the training pods to reach the Running state. During this time, the maskrcnn-launcher-xxxxx pod may crash and restart automatically several times; this is expected behavior. Once the maskrcnn-launcher-xxxxx pod is in the Running state, replace xxxxx with your launcher pod suffix and execute:

kubectl logs -f maskrcnn-launcher-xxxxx -n kubeflow

This will show the live training log from the launcher pod.
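While you wait for the cluster to scale up, you can watch the nodes and pods from separate terminals:

# Watch GPU nodes join the cluster as the training node group scales up
kubectl get nodes -w
# Watch the training pods transition to the Running state
kubectl get pods -n kubeflow -o wide -w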

Training logs

Model checkpoints and all training logs are also available on the shared_fs file-system set in values.yaml, i.e. efs or fsx.

If you configured your shared_fs file-system to be efs, you can access your training logs by going inside the attach-pvc container as follows:

kubectl exec --tty --stdin -n kubeflow attach-pvc -- /bin/bash
cd /efs
ls -ltr maskrcnn-*

Type exit to exit from the attach-pvc container.

If you configured your shared_fs file-system to be fsx, you can access your training logs by going inside the attach-pvc-fsx container as follows:

kubectl exec --tty --stdin -n kubeflow attach-pvc-fsx -- /bin/bash
cd /fsx
ls -ltr maskrcnn-*

Type exit to exit from the attach-pvc-fsx container.

Test trained model

Execute kubectl logs -f jupyter-xxxxx -n kubeflow -c jupyter to display the Jupyter log. At the beginning of the Jupyter log, note the security token required to access the Jupyter service in a browser.

Execute kubectl get services -n kubeflow to get the service DNS address. To test the trained model using a Jupyter notebook, access the service in a browser on port 443 using the service DNS and the security token. Your URL to access the Jupyter service should look similar to the example below:

https://xxxxxxxxxxxxxxxxxxxxxxxxx.elb.xx-xxxx-x.amazonaws.com?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Because the service endpoint in this tutorial uses a self-signed certificate, accessing the Jupyter service in a browser will display a browser warning. If you deem it appropriate, proceed to access the service. Open the notebook and run it to test the trained model. Note that while training is in progress, a trained model checkpoint may not yet be available at any given time.

Visualize TensorBoard summaries

To access TensorBoard via web, use the service DNS address noted above. Your URL to access the TensorBoard service should look similar to the example below:

https://xxxxxxxxxxxxxxxxxxxxxxxxx.elb.xx-xxxx-x.amazonaws.com:6443/

Accessing TensorBoard service in a browser will display a browser warning, because the service endpoint uses a self-signed certificate. If you deem it appropriate, proceed to access the service. When prompted for authentication, use the default username tensorboard, and your password.

Delete Helm charts after training

When training is complete, you may delete an installed chart by executing helm delete followed by the release name, for example helm delete maskrcnn. This will destroy all pods used in training and testing, including the TensorBoard and Jupyter service pods. However, the logs and trained models will be preserved on the shared file-system used for training. When you delete all the Helm charts, the Kubernetes Cluster Autoscaler may scale the inference and training node groups down to zero nodes.
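For example, to list the installed releases and remove them (release names match the helm install commands above):

helm list
helm delete maskrcnn     # or maskrcnn-optimized, whichever you installed
helm delete mpijob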

Use Terraform to destroy infrastructure

When you are done with this tutorial, you can destroy all the infrastructure, including the shared EFS and FSx for Lustre file-systems. If you want to preserve the data on the shared file-systems, first upload it to Amazon S3 (see the sketch below).
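One hedged way to preserve the data is to copy it to the build machine with kubectl cp and then upload it to an S3 bucket you own (the local directory and bucket name below are placeholders; for large checkpoints, syncing directly from inside the attach-pvc container may be preferable if the AWS CLI is available there):

# Copy training artifacts from the EFS-backed pod to the build machine
kubectl cp kubeflow/attach-pvc:/efs ./efs-backup
# Upload the backup to your own S3 bucket (my-backup-bucket is a placeholder)
aws s3 sync ./efs-backup s3://my-backup-bucket/efs-backup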

Quick start option

If you used the quick start option above, execute the following commands:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster-and-nodegroup

terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'

Advanced option

If you used the advanced option above, execute the following commands:

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-nodegroup

terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster"  -var="node_role_arn=" -var="nodegroup_name=" -var="subnet_ids="

cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster

terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'