# Using LLaMA-Factory to fine-tune on SageMaker - HyperPod cluster
# 5. Preparing the HyperPod cluster environment

In [103]:
!pip install -Uq sagemaker boto3

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
awscli 1.32.101 requires botocore==1.34.101, but you have botocore 1.34.123 which is incompatible.

In [104]:
import boto3
import sagemaker
import os

In [105]:
role = sagemaker.get_execution_role()  # execution role, reused below as the cluster ExecutionRole
sess = sagemaker.session.Session()  # SageMaker session for interacting with AWS APIs
bucket = sess.default_bucket()  # bucket to house lifecycle scripts and training artifacts
region = sess.boto_region_name  # region of the current SageMaker notebook environment


print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

sagemaker role arn: arn:aws:iam::434444145045:role/notebook-hyperpod-ExecutionRole-xHaRX2L05qHQ
sagemaker bucket: sagemaker-us-west-2-434444145045
sagemaker session region: us-west-2
boto3 version: 1.34.122
sagemaker version: 2.222.0


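## 1. Create the HyperPod cluster
### 1.1 Upload the lifecycle configuration files to an S3 bucket. During cluster creation, HyperPod runs them on each instance group.

- Write a Slurm configuration file and save it as provisioning_parameters.json. In the file, specify the basic Slurm configuration parameters so that Slurm nodes are correctly assigned to the cluster instance groups.
- This tutorial sets up two instance groups, as shown in the provisioning_parameters.json example that follows:
    - my-controller-group
    - worker-group-1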



In [106]:
local_code_dir = 'lifecycle-scripts'
!rm -rf {local_code_dir}
!mkdir -p {local_code_dir}

In [107]:
%%writefile {local_code_dir}/provisioning_parameters.json
{
    "version": "1.0.0",
    "workload_manager": "slurm",
    "controller_group": "my-controller-group",
    "worker_groups": [
        {
            "instance_group_name": "worker-group-1",
            "partition_name": "partition-1"
        }
    ]
}

Writing lifecycle-scripts/provisioning_parameters.json


In [108]:
!rm -rf {local_code_dir}/.ipynb_checkpoints

### 1.2 Upload the sample lifecycle scripts from awsome-distributed-training/ and provisioning_parameters.json to the S3 prefix

In [109]:
!aws s3 cp --recursive awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://{bucket}/hyperpod/LifecycleScripts/
!aws s3 cp --recursive lifecycle-scripts/ s3://{bucket}/hyperpod/LifecycleScripts/

upload: awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py to s3://sagemaker-us-west-2-434444145045/hyperpod/LifecycleScripts/config.py
upload: awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LifecycleScripts/add_users.sh
upload: awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/apply_hotfix.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LifecycleScripts/apply_hotfix.sh
upload: awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/hotfix/hold-lustre-client.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LifecycleScripts/hotfix/hold-lustre-client.sh
upload: awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/initsmhp/fix-profile.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LifecycleScr

### 1.3 Prepare a create_cluster.json file to pass to CreateCluster

In [110]:
import json

In [126]:
cluster_name = "hyperpod-cluster-1"
SourceS3Uri = f"s3://{bucket}/hyperpod/LifecycleScripts"
worker_instance = "ml.c5.2xlarge"
worker_count = 2
create_cluster = \
{
    "ClusterName": cluster_name,
    "InstanceGroups": [
        {
            "InstanceGroupName": "my-controller-group",
            "InstanceType": "ml.c5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
              "SourceS3Uri": SourceS3Uri,
              "OnCreate": "on_create.sh"
            },
            "ExecutionRole": role,
            "ThreadsPerCore": 1
        },
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": worker_instance,
            "InstanceCount": worker_count,
            "LifeCycleConfig": {
              "SourceS3Uri": SourceS3Uri,
              "OnCreate": "on_create.sh"
            },
            "ExecutionRole": role,
            "ThreadsPerCore": 1
        }
    ]
}

In [127]:
with open("create_cluster.json","w") as f:
    json.dump(create_cluster,f)
create_cluster

{'ClusterName': 'hyperpod-cluster-1',
 'InstanceGroups': [{'InstanceGroupName': 'my-controller-group',
   'InstanceType': 'ml.c5.xlarge',
   'InstanceCount': 1,
   'LifeCycleConfig': {'SourceS3Uri': 's3://sagemaker-us-west-2-434444145045/hyperpod/LifecycleScripts',
    'OnCreate': 'on_create.sh'},
   'ExecutionRole': 'arn:aws:iam::434444145045:role/notebook-hyperpod-ExecutionRole-xHaRX2L05qHQ',
   'ThreadsPerCore': 1},
  {'InstanceGroupName': 'worker-group-1',
   'InstanceType': 'ml.c5.2xlarge',
   'InstanceCount': 2,
   'LifeCycleConfig': {'SourceS3Uri': 's3://sagemaker-us-west-2-434444145045/hyperpod/LifecycleScripts',
    'OnCreate': 'on_create.sh'},
   'ExecutionRole': 'arn:aws:iam::434444145045:role/notebook-hyperpod-ExecutionRole-xHaRX2L05qHQ',
   'ThreadsPerCore': 1}]}

### 1.4 Validate the JSON configuration files before creating a Slurm cluster on HyperPod

In [128]:
# !python3 awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/validate-config.py --cluster-config create_cluster.json --provisioning-parameters {local_code_dir}/provisioning_parameters.json
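
If the validation script is skipped, a quick local check can still catch the most common mistake, which is a group name in provisioning_parameters.json that does not match any `InstanceGroupName` in create_cluster.json. The sketch below is not a replacement for validate-config.py; the helper names `check_group_names` and `check_files` are hypothetical and only compare the names used in this notebook's two files.

```python
import json


def check_group_names(cluster_config, provisioning):
    """Return a list of problems; an empty list means the group names line up."""
    cluster_groups = {g["InstanceGroupName"] for g in cluster_config["InstanceGroups"]}
    problems = []

    # The Slurm controller group must be one of the cluster's instance groups.
    controller = provisioning["controller_group"]
    if controller not in cluster_groups:
        problems.append(f"controller_group {controller!r} is not an instance group")

    # Every Slurm worker group must also exist in the cluster request.
    for worker in provisioning["worker_groups"]:
        name = worker["instance_group_name"]
        if name not in cluster_groups:
            problems.append(f"worker group {name!r} is not an instance group")

    return problems


def check_files(cluster_path, provisioning_path):
    """Load both JSON files from disk and compare their group names."""
    with open(cluster_path) as f:
        cluster_config = json.load(f)
    with open(provisioning_path) as f:
        provisioning = json.load(f)
    return check_group_names(cluster_config, provisioning)
```

In this notebook it could be called as `check_files("create_cluster.json", f"{local_code_dir}/provisioning_parameters.json")`, and an empty list would indicate that both files refer to the same groups.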

### 1.5 Run the following command to create the cluster.

In [129]:
!aws sagemaker create-cluster --cli-input-json file://~/SageMaker/llm_finetune/create_cluster.json

{
    "ClusterArn": "arn:aws:sagemaker:us-west-2:434444145045:cluster/lufsrbfh2k78"
}
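
The same request can also be sent from Python, which avoids hard-coding the absolute path to create_cluster.json. This is a minimal sketch, assuming `sm_client` is a `boto3.client("sagemaker")` object and that the JSON file contains only CreateCluster request keys; the helper name `create_cluster_from_file` is hypothetical.

```python
import json


def create_cluster_from_file(sm_client, config_path):
    """Create a HyperPod cluster from a JSON request file and return its ARN.

    sm_client is expected to be boto3.client("sagemaker").
    """
    with open(config_path) as f:
        request = json.load(f)
    # The JSON keys mirror the CreateCluster request, so they map onto keyword arguments.
    response = sm_client.create_cluster(**request)
    return response["ClusterArn"]
```

A call such as `create_cluster_from_file(boto3.client("sagemaker"), "create_cluster.json")` would stand in for the CLI command above.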


### 1.6 Poll the cluster deployment status

In [130]:
import time
sm_client = boto3.client("sagemaker")  # client to interact with SageMaker

In [131]:

resp = sm_client.describe_cluster(
    ClusterName=cluster_name
)
status = resp["ClusterStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_cluster(
        ClusterName=cluster_name
    )
    status = resp["ClusterStatus"]
    print("Status: " + status)

print("Arn: " + resp["ClusterArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:434444145045:cluster/lufsrbfh2k78
Status: InService
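
The polling loop above exits on any status other than `Creating`, so a failed deployment would pass silently, and a cluster stuck in `Creating` would be polled forever. A slightly more defensive variant is sketched below. The function name `wait_for_cluster`, the default timeout, and the injectable `sleep` parameter are assumptions for illustration; `sm_client` is expected to be a `boto3.client("sagemaker")` object.

```python
import time


def wait_for_cluster(sm_client, cluster_name, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll DescribeCluster until the cluster leaves 'Creating' or the timeout elapses.

    Returns the final DescribeCluster response when the cluster is 'InService'.
    Raises RuntimeError for any other end state and TimeoutError if it never settles.
    """
    waited = 0
    while True:
        resp = sm_client.describe_cluster(ClusterName=cluster_name)
        status = resp["ClusterStatus"]
        print("Status: " + status)
        if status == "InService":
            return resp
        if status != "Creating":
            # For example "Failed" or "RollingBack"; FailureMessage may be absent.
            raise RuntimeError(f"cluster ended in {status}: {resp.get('FailureMessage', '')}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still Creating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the function easy to exercise without waiting in real time; in the notebook the defaults would normally be left alone.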


### 1.7 List the cluster nodes

In [132]:
!aws sagemaker list-clusters
!aws sagemaker list-cluster-nodes --cluster-name {cluster_name} --region us-west-2

{
    "ClusterSummaries": [
        {
            "ClusterArn": "arn:aws:sagemaker:us-west-2:434444145045:cluster/lufsrbfh2k78",
            "ClusterName": "hyperpod-cluster-1",
            "CreationTime": 1718116220.42,
            "ClusterStatus": "InService"
        }
    ]
}
{
    "ClusterNodeSummaries": [
        {
            "InstanceGroupName": "my-controller-group",
            "InstanceId": "i-00a9b6904d395124d",
            "InstanceType": "ml.c5.xlarge",
            "LaunchTime": 1718116227.781,
            "InstanceStatus": {
                "Status": "Running",
                "Message": ""
            }
        },
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceId": "i-076e276ccf232112e",
            "InstanceType": "ml.c5.2xlarge",
            "LaunchTime": 1718116229.03,
            "InstanceStatus": {
                "Status": "Running",
                "Message": ""
            }
        },
        {
            "InstanceGroupName": 

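The JSON above gets long once a cluster has more than a few nodes. As a rough sketch, the same information can be condensed into a per-group count. The helper name `summarize_nodes` is hypothetical, and it reads only the fields visible in the output above. It also looks at a single response page, so a large cluster would need the `NextToken` pagination handled as well.

```python
from collections import Counter


def summarize_nodes(response):
    """Count nodes per (instance group, status) from one ListClusterNodes response."""
    counts = Counter()
    for node in response["ClusterNodeSummaries"]:
        key = (node["InstanceGroupName"], node["InstanceStatus"]["Status"])
        counts[key] += 1
    return dict(counts)
```

With the client from step 1.6, this could be called as `summarize_nodes(sm_client.list_cluster_nodes(ClusterName=cluster_name))`.
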
### 1.8 Log in quickly with awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh

1. Open a terminal in the notebook environment
2. Install the SSM plugin
```bash
sudo yum install -y https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm

```

3. Copy easy-ssh.sh to the current user's directory
```bash
cp awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh ./
```

4. Log in to a worker node
```bash
chmod +x easy-ssh.sh 
cluster_name=hyperpod-cluster-1
group=worker-group-1
./easy-ssh.sh $cluster_name --controller-group $group
```

5. After logging in, switch to the ubuntu user and run sinfo to check the current cluster state
```bash
sudo su ubuntu
sinfo
```

## 2. Upload the training scripts to the S3 bucket. The bucket is later mounted on every cluster node, so all compute nodes can access the training code and data.


In [102]:
!./s5cmd sync ./LLaMA-Factory s3://{bucket}/hyperpod/
!aws s3 cp --recursive hyperpod-scripts/ s3://{bucket}/hyperpod/LLaMA-Factory/

upload: hyperpod-scripts/.ipynb_checkpoints/llama_factory_setup-checkpoint.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/.ipynb_checkpoints/llama_factory_setup-checkpoint.sh
upload: hyperpod-scripts/train_batch.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/train_batch.sh
upload: hyperpod-scripts/llama_factory_setup.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/llama_factory_setup.sh
upload: hyperpod-scripts/train_multi_ds.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/train_multi_ds.sh
upload: hyperpod-scripts/train_single_lora.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/train_single_lora.sh
upload: hyperpod-scripts/.ipynb_checkpoints/train_single_lora-checkpoint.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/.ipynb_checkpoints/train_single_lora-checkpoint.sh
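
The upload log above shows that Jupyter's `.ipynb_checkpoints` copies were uploaded alongside the real scripts. One way to avoid that is to walk the directory in Python and skip those folders. The sketch below assumes `s3_client` is a `boto3.client("s3")` object; the helper names `iter_uploads` and `upload_dir` are hypothetical.

```python
import os


def iter_uploads(local_dir, key_prefix, skip_dirs=(".ipynb_checkpoints",)):
    """Yield (local_path, s3_key) pairs for files under local_dir, skipping skip_dirs."""
    for root, dirs, files in os.walk(local_dir):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in sorted(files):
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            yield local_path, f"{key_prefix.rstrip('/')}/{rel}"


def upload_dir(s3_client, local_dir, bucket, key_prefix):
    """Upload every file yielded by iter_uploads to s3://bucket/key_prefix/."""
    for local_path, key in iter_uploads(local_dir, key_prefix):
        s3_client.upload_file(local_path, bucket, key)
```

For this notebook the call would look like `upload_dir(boto3.client("s3"), "hyperpod-scripts", bucket, "hyperpod/LLaMA-Factory")`. The `--exclude` option of `aws s3 cp` is another way to filter out the checkpoint folders.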


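## 3. Mount S3
- HyperPod clusters are well suited to large-scale distributed training. Because they expose the underlying IaaS infrastructure, popular distributed frameworks such as Accelerate and DeepSpeed can be used directly.
- As with EC2 instances, shared storage such as EFS, Lustre, or S3 can be mounted on HyperPod cluster instances. This section uses mount-s3 as the example.
- Example script for installing mount-s3 and mounting the bucket, run while still connected to the cluster over SSH: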
```bash
# Download and install mount-s3 on both worker nodes
cd ~

srun -N2 wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
srun -N2 sudo apt-get install -y ./mount-s3.deb


# Create the mount point ~/mnt. This walkthrough uses 2 worker nodes, hence -N2; set -N to your own node count.
srun -N2 sudo mkdir /home/ubuntu/mnt

# Mount the bucket on every node. Replace the region and account ID in the bucket name with your own.
srun -N2 sudo mount-s3 --allow-other --allow-overwrite sagemaker-us-west-2-434444145045 /home/ubuntu/mnt

# Unmount S3
# srun -N2 sudo umount /home/ubuntu/mnt
```
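
Because the bucket name and node count differ between accounts and clusters, it can help to generate these command lines from the notebook rather than editing them by hand. The sketch below only builds the strings; it does not run anything. The helper name `build_mount_commands` is hypothetical, and the `-p` flag on `mkdir` is an addition that makes the step safe to repeat.

```python
import shlex


def build_mount_commands(bucket, nodes, mount_dir="/home/ubuntu/mnt"):
    """Return the srun command lines that create mount_dir and mount bucket on `nodes` nodes."""
    prefix = ["srun", f"-N{nodes}"]
    commands = [
        # -p is an addition to the original command: no error if the directory already exists.
        prefix + ["sudo", "mkdir", "-p", mount_dir],
        prefix + ["sudo", "mount-s3", "--allow-other", "--allow-overwrite", bucket, mount_dir],
    ]
    # shlex.join quotes an argument only when it contains characters the shell would interpret.
    return [shlex.join(cmd) for cmd in commands]
```

Printing `build_mount_commands(bucket, worker_count)` in the notebook gives lines that can be pasted into the SSH session.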

## 4. Install LLaMA-Factory on the cluster

1. Stay logged in to the cluster node. If the session has expired, log in again using the method above:
```bash
chmod +x easy-ssh.sh 
cluster_name=hyperpod-cluster-1
group=worker-group-1
./easy-ssh.sh $cluster_name --controller-group $group
```

2. Copy the code from the mounted S3 bucket directory to a local directory
```bash
sudo su ubuntu
cd ~
srun -N2 "cp" "-r" "mnt/hyperpod/LLaMA-Factory" "LLaMA-Factory"
```

3. Run the installation script
```bash
cd LLaMA-Factory
srun -N2 "rm" "-rf" "../miniconda3"
# srun does not expand globs itself, so run rm through a shell on each node
srun -N2 bash -c 'rm -rf Miniconda3-latest*'
srun -N2 "bash" "llama_factory_setup.sh" 
```


In [67]:
# Delete the cluster

# !aws sagemaker delete-cluster --cluster-name hyperpod-cluster-1