This example shows how to use the pytorch-distributed Helm chart to pre-train a GPT-NEOX model on the Wikicorpus dataset with the Neuronx-Distributed library, using distributed data parallelism, tensor parallelism, and ZeRO-1.
The example also shows how to use the data-process Helm chart to pre-process the Hugging Face Wikicorpus dataset for use with the GPT-NEOX 6.9B model.
Before proceeding, complete the Prerequisites and Getting started. In particular, you must Apply Terraform with the variable neuron_az specified, so that trn1.32xlarge instances with AWS Elastic Fabric Adapter (EFA) can be launched automatically.
See What is in the YAML file to understand the common fields in the Helm values files. Some fields are specific to a particular machine learning chart.
The following variables are implicitly defined by the pytorch-distributed Helm chart for use with Torch distributed run:
PET_NNODES
: Maps to nnodes
PET_NPROC_PER_NODE
: Maps to nproc_per_node
PET_NODE_RANK
: Maps to node_rank
PET_MASTER_ADDR
: Maps to master_addr
PET_MASTER_PORT
: Maps to master_port
We define the runtime for pre-processing the dataset in the wikicorpus.yaml values file.
To launch the data processing job, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug nxd-gpt-neox-6-9b \
charts/machine-learning/data-prep/data-process \
-f examples/neuronx-distributed/gpt_neox_6.9b/wikicorpus.yaml -n kubeflow-user-example-com
To monitor the logs, execute:
kubectl logs -f data-process-nxd-gpt-neox-6-9b -n kubeflow-user-example-com
Uninstall the Helm chart at completion:
helm uninstall nxd-gpt-neox-6-9b -n kubeflow-user-example-com
We define the runtime for the compile job in the compile.yaml values file.
To launch the compile job, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug nxd-gpt-neox-6-9b \
charts/machine-learning/training/pytorchjob-distributed \
-f examples/neuronx-distributed/gpt_neox_6.9b/compile.yaml -n kubeflow-user-example-com
To monitor the logs, execute:
kubectl logs -f pytorchjob-nxd-gpt-neox-6-9b-master-0 -n kubeflow-user-example-com
Uninstall the Helm chart at completion:
helm uninstall nxd-gpt-neox-6-9b -n kubeflow-user-example-com
We define the runtime for the pre-training job in the pretrain.yaml values file.
To launch the pre-training job, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug nxd-gpt-neox-6-9b \
charts/machine-learning/training/pytorchjob-distributed \
-f examples/neuronx-distributed/gpt_neox_6.9b/pretrain.yaml -n kubeflow-user-example-com
To monitor the logs, execute:
kubectl logs -f pytorchjob-nxd-gpt-neox-6-9b-master-0 -n kubeflow-user-example-com
Uninstall the Helm chart at completion:
helm uninstall nxd-gpt-neox-6-9b -n kubeflow-user-example-com
To access the output stored on the EFS and FSx for Lustre file-systems, execute the following commands:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
kubectl apply -f eks-cluster/utils/attach-pvc.yaml -n kubeflow
kubectl exec -it -n kubeflow attach-pvc -- /bin/bash
This will put you in a pod attached to the EFS and FSx for Lustre file-systems, mounted at /efs and /fsx, respectively. Type exit to exit the pod.
Pre-processed data is available in the /fsx/home/nxd-gpt-neox-6-9b/examples_datasets/ folder.
Pre-training logs are available in the /efs/home/nxd-gpt-neox-6-9b/logs folder.
Pre-training checkpoints, if any, are available in the /fsx/home/nxd-gpt-neox-6-9b/checkpoints folder.
Any content stored under /fsx is automatically backed up to your configured S3 bucket.
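The output locations above can be summarized in a small sketch, parameterized by the Helm release name used throughout this example so the same pattern applies if you choose a different release name:

```shell
# Sketch: output locations for this example, keyed by the Helm release name.
# The paths are those listed in the text above.
RELEASE=nxd-gpt-neox-6-9b
DATASET_DIR=/fsx/home/${RELEASE}/examples_datasets
LOG_DIR=/efs/home/${RELEASE}/logs
CKPT_DIR=/fsx/home/${RELEASE}/checkpoints
echo "datasets:    $DATASET_DIR"
echo "logs:        $LOG_DIR"
echo "checkpoints: $CKPT_DIR"
```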