
Pre-train GPT NeoX 6.9B on the Wikicorpus dataset using the Neuronx Distributed library

This example shows how to use the pytorchjob-distributed Helm chart to pre-train the GPT NeoX 6.9B model on the Wikicorpus dataset with the Neuronx Distributed library, using distributed data parallelism, tensor parallelism, and ZeRO-1.

The example also shows how to use the data-process Helm chart to pre-process the Hugging Face Wikicorpus dataset for use with the GPT NeoX 6.9B model.

Prerequisites

Before proceeding, complete the Prerequisites and Getting started. In particular, you must Apply Terraform by specifying the variable neuron_az so you can automatically launch trn1.32xlarge instances with AWS Elastic Fabric Adapter (EFA).

See What is in the YAML file to understand the common fields in the Helm values files. Some fields are specific to a particular machine-learning chart.

Implicitly defined environment variables

The following variables are implicitly defined by the pytorchjob-distributed Helm chart for use with torchrun (torch.distributed.run):

  1. PET_NNODES: Maps to nnodes
  2. PET_NPROC_PER_NODE: Maps to nproc_per_node
  3. PET_NODE_RANK: Maps to node_rank
  4. PET_MASTER_ADDR: Maps to master_addr
  5. PET_MASTER_PORT: Maps to master_port
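
torchrun reads these PET_-prefixed variables directly, so the chart does not need to repeat them as command-line flags. As an illustration only, with made-up values (the chart injects the real per-pod values), this sketch composes the equivalent explicit command line:

```shell
# Illustrative values only; in the chart these are injected per pod.
PET_NNODES=4
PET_NPROC_PER_NODE=32
PET_NODE_RANK=0
PET_MASTER_ADDR=master-0.local   # hypothetical master host
PET_MASTER_PORT=23456

# torchrun picks the PET_* variables up automatically; this shows the
# equivalent explicit invocation for comparison. 'train.py' is a
# placeholder for the entry point configured in the values file.
CMD="torchrun --nnodes=${PET_NNODES} --nproc_per_node=${PET_NPROC_PER_NODE} \
--node_rank=${PET_NODE_RANK} --master_addr=${PET_MASTER_ADDR} \
--master_port=${PET_MASTER_PORT} train.py"
echo "${CMD}"
```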

Pre-process Wikicorpus dataset

We define the runtime for pre-processing the dataset in wikicorpus.yaml values file.

To launch the data processing job, execute:

```shell
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug nxd-gpt-neox-6-9b \
    charts/machine-learning/data-prep/data-process \
    -f examples/neuronx-distributed/gpt_neox_6.9b/wikicorpus.yaml -n kubeflow-user-example-com
```
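
Before tailing the logs, you can confirm that the job's pod exists and is running. A minimal sketch; it assumes kubectl is configured for the cluster and simply prints the command when no cluster is reachable:

```shell
NS=kubeflow-user-example-com
POD=data-process-nxd-gpt-neox-6-9b

# Query the pod when a cluster is reachable; otherwise just show the command.
if command -v kubectl >/dev/null 2>&1 && kubectl cluster-info >/dev/null 2>&1; then
  kubectl get pod "${POD}" -n "${NS}" -o wide
else
  echo "run: kubectl get pod ${POD} -n ${NS} -o wide"
fi
```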

To monitor the logs, execute:

```shell
kubectl logs -f data-process-nxd-gpt-neox-6-9b -n kubeflow-user-example-com
```

When the job completes, uninstall the Helm chart:

```shell
helm uninstall nxd-gpt-neox-6-9b -n kubeflow-user-example-com
```

Compile

We define the runtime for the compile job in compile.yaml values file.

To launch the compile job, execute:

```shell
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug nxd-gpt-neox-6-9b \
    charts/machine-learning/training/pytorchjob-distributed \
    -f examples/neuronx-distributed/gpt_neox_6.9b/compile.yaml -n kubeflow-user-example-com
```

To monitor the logs, execute:

```shell
kubectl logs -f pytorchjob-nxd-gpt-neox-6-9b-master-0 -n kubeflow-user-example-com
```

To uninstall the Helm chart:

```shell
helm uninstall nxd-gpt-neox-6-9b -n kubeflow-user-example-com
```

Pre-train

We define the runtime for the pre-training job in the pretrain.yaml values file.

To launch the pre-training job:

```shell
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
helm install --debug nxd-gpt-neox-6-9b \
    charts/machine-learning/training/pytorchjob-distributed \
    -f examples/neuronx-distributed/gpt_neox_6.9b/pretrain.yaml -n kubeflow-user-example-com
```

To monitor the logs, execute:

```shell
kubectl logs -f pytorchjob-nxd-gpt-neox-6-9b-master-0 -n kubeflow-user-example-com
```

To uninstall the Helm chart:

```shell
helm uninstall nxd-gpt-neox-6-9b -n kubeflow-user-example-com
```

Output

To access the output stored on the EFS and FSx for Lustre file systems, execute the following commands:

```shell
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
kubectl apply -f eks-cluster/utils/attach-pvc.yaml -n kubeflow
kubectl exec -it -n kubeflow attach-pvc -- /bin/bash
```

This puts you in a pod attached to the EFS and FSx for Lustre file systems, mounted at /efs and /fsx, respectively. Type exit to leave the pod.

Data

Pre-processed data is available in the /fsx/home/nxd-gpt-neox-6-9b/examples_datasets/ folder.

Logs

Pre-training logs are available in the /efs/home/nxd-gpt-neox-6-9b/logs folder.

Checkpoints

Pre-training checkpoints, if any, are available in the /fsx/home/nxd-gpt-neox-6-9b/checkpoints folder.
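
All three locations above hang off the Helm release name, so from inside the attach-pvc pod they can be inspected in one pass. A small sketch (the release name is the one used throughout this example; swap echo for ls -la once inside the pod):

```shell
RELEASE=nxd-gpt-neox-6-9b

# Artifact roots from this guide; run inside the attach-pvc pod and
# replace 'echo' with 'ls -la' to inspect the actual contents.
for DIR in \
  "/fsx/home/${RELEASE}/examples_datasets/" \
  "/efs/home/${RELEASE}/logs" \
  "/fsx/home/${RELEASE}/checkpoints"
do
  echo "${DIR}"
done
```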

S3 Backup

Any content stored under /fsx is automatically backed up to your configured S3 bucket.