- AWS ParallelCluster
- SLURM (installed with ParallelCluster)
- Composer and LLM Foundry
- Create a key pair
- Create the VPC with the web UI (a NAT gateway is needed)
    - Use the cluster name as the project name
    - Check "Enable auto-assign public IPv4 address" in the public subnet
- Create two S3 buckets (example commands below)
    - common: for data, checkpoints, and Python source files
    - mlflow: for MLFlow logging
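If creating them from the CLI, something like the following should work; the `-common` suffix matches the S3 path used later in this document, while the `-mlflow` bucket name and the region are assumptions.

```bash
# Bucket names follow an assumed <cluster name>-<purpose> convention.
aws s3 mb s3://<cluster name>-common --region <region>
aws s3 mb s3://<cluster name>-mlflow --region <region>
```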
- Modify the cluster config YAML with the following (a trimmed sketch follows this list):
    - KeyName
    - Common S3 bucket name
    - Public subnet ID for HeadNode
    - Private subnet ID for SlurmQueues
    - Number of p4d.24xlarge instances
    - Capacity reservation ID
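The exact layout depends on the ParallelCluster version; the following is a minimal, hedged sketch of where those fields sit in a ParallelCluster 3-style config. Every ID, name, and count is a placeholder; the `queue1`/`p4d24xlarge` names only mirror the names that appear in the logs further down, and the head node type and OS are assumptions.

```bash
# Sketch only -- not a complete or validated config.
cat > cluster-config.yaml <<'EOF'
Region: <region>
Image:
  Os: ubuntu2004                          # assumed; python3.8-venv in the setup step suggests Ubuntu 20.04
HeadNode:
  InstanceType: c5.4xlarge                # placeholder head node type
  Networking:
    SubnetId: <public subnet ID>          # HeadNode sits in the public subnet
  Ssh:
    KeyName: <key pair name>
  Iam:
    S3Access:
      - BucketName: <cluster name>-common # common S3 bucket
        EnableWriteAccess: true
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      Networking:
        SubnetIds:
          - <private subnet ID>           # compute nodes sit in the private subnet
      ComputeResources:
        - Name: p4d24xlarge
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: <number of p4d.24xlarge instances>
          CapacityReservationTarget:
            CapacityReservationId: <capacity reservation ID>
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 7200
      DeploymentType: PERSISTENT_2
      PerUnitStorageThroughput: 125
EOF
```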
- Launch with the `pcluster` package (command sketch below)
    - Wait for it to finish (around 30+ min)
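Assuming the config above was saved as `cluster-config.yaml`, the launch and status check look roughly like this:

```bash
pcluster create-cluster --cluster-name <cluster name> --cluster-configuration cluster-config.yaml
# Re-run describe-cluster until clusterStatus reports CREATE_COMPLETE
pcluster describe-cluster --cluster-name <cluster name>
```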
- Manually upload a copy of `llm-foundry-<commit hash>.zip` to `s3://<cluster name>-common/source_files`
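For example (run from the directory containing the archive; the local path is an assumption):

```bash
aws s3 cp llm-foundry-<commit hash>.zip s3://<cluster name>-common/source_files/
```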
- SSH into the head node
  ```bash
  pcluster ssh -i </path/to/key> -n <cluster name>
  ```
- Take ownership and change permissions of `/fsx/mlflow`
  ```bash
  sudo chown -R $USER:$USER /fsx/mlflow
  chmod 777 /fsx/mlflow
  ```
- Clone `nscc_working` to `/fsx` and check out the `aws-ec2-3b` branch
  ```bash
  cd /fsx
  git clone https://github.com/aisingapore/nscc_working.git
  cd nscc_working
  git checkout aws-ec2-3b
  ```
- Start an interactive session to install dependencies
    - It takes some time to spin up a compute node
    - It takes some time to build `flash-attn`
  ```bash
  srun --nodes 1 --ntasks-per-node 1 --cpus-per-task 96 --gres=gpu:8 --time 12:00:00 --pty bash
  sudo apt-get update && sudo apt-get install python3.8-venv
  python3 -m venv /fsx/envs/mosaicml
  source /fsx/envs/mosaicml/bin/activate
  cd </path/to/nscc_working/engr/mosaicml_workspace>
  PYTHONPATH=$(pwd) bash scripts/setup.sh
  ```
- Launch the training job
  ```bash
  sbatch launch.slurm
  ```
- Wait for the job to start running, then create the MLFlow server instance (in a separate local terminal)
  ```bash
  cd </path/to/nscc_working/engr/mosaicml_workspace>
  python scripts/python/create_mlflow_instance.py -n <cluster name> --instance-type <instance type>
  ```
  Note: the instance type must be available in the cluster's availability zone.
- Copy the content of `fstab_entry` to `/etc/fstab` and reboot (see the sketch below)
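A minimal sketch of that step; the location of `fstab_entry` is an assumption.

```bash
# Append the prepared mount entry to /etc/fstab, then reboot so it takes effect.
cat fstab_entry | sudo tee -a /etc/fstab
sudo reboot
```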
- Start the MLFlow server
  ```bash
  source /fsx/envs/mosaicml/bin/activate
  mlflow server -h 0.0.0.0 --backend-store-uri file:///fsx/mlflow/<model size>-multi-node-sharded/mlruns --no-serve-artifacts
  ```
  Note: this needs to be done on both the head node and the MLFlow server instance.
- Obtain a public key
    - Either convert the AWS key pair `.pem` file
      ```bash
      ssh-keygen -f </path/to/keypair.pem> -y > </path/to/key.pub>
      ```
    - Or generate a new key pair with a third-party tool
- Copy the content of the public key file and append it to `$HOME/.ssh/authorized_keys` (see the sketch below)
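A sketch, assuming the key should grant SSH access to the current user on this machine:

```bash
cat </path/to/key.pub> >> $HOME/.ssh/authorized_keys
chmod 600 $HOME/.ssh/authorized_keys
```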
- Modify `launch.slurm`
    - Change the job name to the appropriate model size so that log files are named properly
- Modify `launch.sh` (see the sketch after this list)
    - Change `model_size` to the appropriate model size
    - Ensure `load_path='null'` and `autoresume='false'`
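For a fresh run, the relevant `launch.sh` lines would look roughly like this; the `3b` value is only a guess from the branch name and should be treated as a placeholder.

```bash
# launch.sh -- fresh run (placeholder values)
model_size='3b'
load_path='null'
autoresume='false'
```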
- Launch with
  ```bash
  sbatch launch.slurm
  ```
- Modify `launch.slurm` (see the sketch after these steps)
    - Export `MLFLOW_CONCAT_RUN_ID` with the content of `MLFLOW_RUN_ID` found in the base directory. Alternatively, get the "Run ID" from the MLFlow web UI.
- Modify `launch.sh`
    - Change `autoresume='true'`
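A sketch of the resume edits, assuming `MLFLOW_RUN_ID` is a plain-text file in the run's base directory (its exact path is an assumption):

```bash
# launch.slurm -- reuse the original run's MLFlow run ID so logging continues in the same run
export MLFLOW_CONCAT_RUN_ID=$(cat /fsx/<base directory>/MLFLOW_RUN_ID)

# launch.sh -- resume from the latest saved checkpoint
autoresume='true'
```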
- Modify `MLFLOW_CONCAT_RUN_ID` in `launch.slurm`
- Modify `launch.sh`
    - Change `autoresume='false'`
    - Change `load_path=${S3_BUCKET}/checkpoint/${run_name}/<epoch>-<batch>/<epoch>-<batch>-rank{rank}.pt` (an illustrative value follows)
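For illustration only: a hypothetical `load_path`, assuming Composer's default `ep{epoch}-ba{batch}-rank{rank}.pt` checkpoint naming (run name, epoch, and batch are made up):

```bash
load_path="${S3_BUCKET}/checkpoint/my-3b-run/ep0-ba2000/ep0-ba2000-rank{rank}.pt"
```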
Useful commands for checking cluster and job status:
- `sinfo`: general queue status
- `squeue`: running/queued jobs in the queues
- `scontrol show nodes`: display compute node info. `State` is useful for telling why a job isn't running yet, e.g., powering up, down, not responding
- `cat /var/log/parallelcluster/clustermgtd`: view cluster management logs
Capacity Reservations in cluster placement groups: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cr-cpg.html
FSx Lustre `StorageCapacity`: 1200 -> 7200
Left `PerUnitStorageThroughput` untouched, as 7.2 * 125 = 900 MB/s is more than the throughput spike observed in the 8-instance trial.
Seeing spikes above 900 MB/s, will increase `PerUnitStorageThroughput` to 250.
Validation warning from `pcluster`:
```json
{
  "level": "WARNING",
  "type": "PlacementGroupCapacityReservationValidator",
  "message": "When using an open or targeted capacity reservation with an unrelated placement group, insufficient capacity errors may occur due to placement constraints outside of the reservation even if the capacity reservation has remaining capacity. Please consider either not using a placement group for the compute resource or creating a new capacity reservation in a related placement group."
}
```
After a job failed due to write permissions (did not chown `/fsx/mlflow`), encountered the following timeout message:
```
2023-08-07 05:11:35,692 - [slurm_plugin.clustermgtd:_reset_timeout_expired_compute_resources] - INFO - The following compute resources are in down state due to insufficient capacity: {'queue1': {'p4d24xlarge': ComputeResourceFailureEvent(timestamp=datetime.datetime(2023, 8, 7, 5, 2, 29, 692383, tzinfo=datetime.timezone.utc), error_code='InsufficientInstanceCapacity')}}, compute resources will be reset after insufficient capacity timeout (600.0 seconds) expired
```
The suggested solution seems to be to just wait it out: https://repost.aws/knowledge-center/ec2-insufficient-capacity-errors