# Fine-tuning with LLaMA-Factory on SageMaker: HyperPod cluster
# 6. Submitting training jobs on the HyperPod cluster: single-node jobs

## 6.1. Single-node, single-GPU QLoRA training

#### Prerequisite: complete the data and YAML configuration preparation in 01.llama_factory_finetune_on_SageMaker_QLora-Local-Notebook

In [1]:
import boto3
import sagemaker
import os
role = sagemaker.get_execution_role()  # IAM execution role of this notebook
sess = sagemaker.session.Session()  # SageMaker session for interacting with AWS APIs
bucket = sess.default_bucket()  # default bucket that houses datasets, scripts, and model artifacts
region = sess.boto_region_name  # region of the current SageMaker session


print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::434444145045:role/notebook-hyperpod-ExecutionRole-xHaRX2L05qHQ
sagemaker bucket: sagemaker-us-west-2-434444145045
sagemaker session region: us-west-2
boto3 version: 1.34.123
sagemaker version: 2.222.0


### Prepare the LLaMA-Factory training configuration YAML file
- Copy llama3_lora_sft_awq.yaml from the LLaMA-Factory/examples/train_qlora/ directory and modify it

In [7]:
#load template
import yaml
file_name = './LLaMA-Factory/examples/train_qlora/llama3_lora_sft_awq.yaml'
with open(file_name) as f:
    doc = yaml.safe_load(f)

In [8]:
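# Save the fine-tuned model to a local directory
# When running on SageMaker, use the following model path
doc['output_dir'] = '/home/ubuntu/finetuned_model'
doc['per_device_train_batch_size'] = 1
doc['gradient_accumulation_steps'] = 8
# doc['lora_target'] = 'all'
doc['cutoff_len'] = 2048
doc['num_train_epochs'] = 5.0
# Note: in Hugging Face TrainingArguments, a non-zero warmup_steps overrides the template's warmup_ratio
doc['warmup_steps'] = 10

# To keep the experiment short, train on at most 200 samples per dataset
doc['max_samples'] = 200
# Datasets
doc['dataset'] = 'identity,ruozhiba'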

In [9]:
# Save as the training configuration file
sg_config = 'sg_config_qlora.yaml'
with open(f'./LLaMA-Factory/{sg_config}', 'w') as f:
    yaml.safe_dump(doc, f)
doc

{'model_name_or_path': 'TechxGenus/Meta-Llama-3-8B-Instruct-AWQ',
 'stage': 'sft',
 'do_train': True,
 'finetuning_type': 'lora',
 'lora_target': 'all',
 'dataset': 'identity,ruozhiba',
 'template': 'llama3',
 'cutoff_len': 2048,
 'max_samples': 200,
 'overwrite_cache': True,
 'preprocessing_num_workers': 16,
 'output_dir': '/home/ubuntu/finetuned_model',
 'logging_steps': 10,
 'save_steps': 500,
 'plot_loss': True,
 'overwrite_output_dir': True,
 'per_device_train_batch_size': 1,
 'gradient_accumulation_steps': 8,
 'learning_rate': 0.0001,
 'num_train_epochs': 5.0,
 'lr_scheduler_type': 'cosine',
 'warmup_ratio': 0.1,
 'fp16': True,
 'ddp_timeout': 180000000,
 'val_size': 0.1,
 'per_device_eval_batch_size': 1,
 'eval_strategy': 'steps',
 'eval_steps': 500,
 'warmup_steps': 10}
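
The batch settings in this configuration interact: on a single GPU, the effective batch size is `per_device_train_batch_size` multiplied by `gradient_accumulation_steps`. The following is a minimal sketch of how the number of optimizer steps could be estimated. It assumes one GPU and a hypothetical training-set size of 260 samples; the real count depends on the two datasets and on `val_size`, and the trainer's exact rounding may differ.

```python
import math

# Values copied from the configuration printed above.
config = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 5.0,
}

def effective_batch_size(cfg, num_gpus=1):
    """Samples consumed per optimizer step across all GPUs."""
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * num_gpus)

def estimated_optimizer_steps(cfg, num_train_samples, num_gpus=1):
    """Rough estimate of total optimizer steps over all epochs."""
    per_epoch = math.ceil(num_train_samples / effective_batch_size(cfg, num_gpus))
    return math.ceil(per_epoch * cfg["num_train_epochs"])

# 260 is a hypothetical sample count used only for illustration.
print(effective_batch_size(config))
print(estimated_optimizer_steps(config, num_train_samples=260))
```

If the real step count is also well below 500, then with `save_steps: 500` and `eval_steps: 500` the run would likely finish before any intermediate checkpoint or step-based evaluation is triggered.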

- Prepare the training launch script. Note: replace the S3 bucket with the bucket in your own account.

In [10]:
%%writefile hyperpod-scripts/train_single_lora.sh
#!/bin/bash
source  ../miniconda3/bin/activate

conda activate py310

chmod +x ./s5cmd

#download training dataset (the wildcard is quoted so that s5cmd, not the shell, expands it)
./s5cmd sync 's3://sagemaker-us-west-2-434444145045/dataset-for-training/train/*' /home/ubuntu/LLaMA-Factory/data/

#start train
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train sg_config_qlora.yaml


#upload the fine-tuned model to S3
./s5cmd sync /home/ubuntu/finetuned_model s3://sagemaker-us-west-2-434444145045/hyperpod/llama3-8b-qlora/

Overwriting hyperpod-scripts/train_single_lora.sh
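
Because the bucket name is hardcoded in the launch script, it is easy to forget to replace it. One possible alternative, sketched here, is to render the script from a template in Python. The `__BUCKET__` placeholder, the `render_script` helper, and the example bucket name are all assumptions made for this illustration; they are not part of LLaMA-Factory or s5cmd.

```python
# Template for the launch script; __BUCKET__ is a placeholder chosen for this sketch.
SCRIPT_TEMPLATE = """#!/bin/bash
source  ../miniconda3/bin/activate
conda activate py310
chmod +x ./s5cmd

#download training dataset
./s5cmd sync 's3://__BUCKET__/dataset-for-training/train/*' /home/ubuntu/LLaMA-Factory/data/

#start train
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train sg_config_qlora.yaml

#upload the fine-tuned model to S3
./s5cmd sync /home/ubuntu/finetuned_model s3://__BUCKET__/hyperpod/llama3-8b-qlora/
"""

def render_script(bucket_name):
    """Return the launch script with the placeholder replaced by bucket_name."""
    return SCRIPT_TEMPLATE.replace("__BUCKET__", bucket_name)

# "my-example-bucket" is hypothetical; in this notebook, pass the `bucket` variable.
script_text = render_script("my-example-bucket")
print(script_text)
```

In the notebook, the rendered text could then be written out with `pathlib.Path("hyperpod-scripts/train_single_lora.sh").write_text(render_script(bucket))` in place of the `%%writefile` cell.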


#### Upload the training scripts to the S3 bucket. The bucket is then mounted on all cluster nodes, so every compute node can access the training code.

In [11]:
!./s5cmd sync ./LLaMA-Factory s3://{bucket}/hyperpod/
!aws s3 cp --recursive hyperpod-scripts/ s3://{bucket}/hyperpod/LLaMA-Factory/

cp LLaMA-Factory/sg_config_qlora.yaml s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/sg_config_qlora.yaml
cp LLaMA-Factory/data/dataset_info.json s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/data/dataset_info.json
upload: hyperpod-scripts/train_batch.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/train_batch.sh
upload: hyperpod-scripts/llama_factory_setup.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/llama_factory_setup.sh
upload: hyperpod-scripts/train_multi_ds.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/train_multi_ds.sh
upload: hyperpod-scripts/train_single_lora.sh to s3://sagemaker-us-west-2-434444145045/hyperpod/LLaMA-Factory/train_single_lora.sh
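
Before moving to the cluster, it can be worth confirming that the files the job depends on actually landed in S3. The sketch below lists the objects under the upload prefix and reports any expected file that is missing. The helper name `missing_keys` and the `EXPECTED` list are assumptions made for this example; the client argument is expected to behave like a boto3 S3 client.

```python
def missing_keys(s3_client, bucket_name, prefix, expected_names):
    """Return the expected file names that are not found under prefix."""
    found = set()
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Strip the prefix so keys compare against relative file names.
            found.add(obj["Key"][len(prefix):])
    return sorted(set(expected_names) - found)

# Files the single-node job relies on, relative to the upload prefix.
EXPECTED = ["sg_config_qlora.yaml", "train_single_lora.sh", "data/dataset_info.json"]
PREFIX = "hyperpod/LLaMA-Factory/"
```

In the notebook this could be called as `missing_keys(boto3.client("s3"), bucket, PREFIX, EXPECTED)`; an empty list means every expected file is present.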


#### Copy the updated code from the mounted S3 bucket directory to the local directory
```bash
sudo su ubuntu
cd ~/LLaMA-Factory
srun -N1 "cp" "-r" "../mnt/hyperpod/LLaMA-Factory/data/dataset_info.json" "./data/dataset_info.json"
srun -N1 "cp" "-r" "../mnt/hyperpod/LLaMA-Factory/train_single_lora.sh" "./train_single_lora.sh"
```

#### Submit the training job
```bash
srun -N1 "bash" "train_single_lora.sh"
```
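
If the job needs to be submitted from a Python script instead of an interactive shell, the same `srun` invocation can be assembled as an argument list. This is a minimal sketch; the `srun_command` helper is hypothetical and only mirrors the `-N1` usage shown above.

```python
import shlex

def srun_command(args, nodes=1):
    """Build the argv list that runs args on `nodes` nodes via srun."""
    return ["srun", f"-N{nodes}"] + list(args)

submit = srun_command(["bash", "train_single_lora.sh"])
# Print the shell-equivalent command line for inspection.
print(shlex.join(submit))
```

On a node with Slurm available, `subprocess.run(submit, check=True, cwd="/home/ubuntu/LLaMA-Factory")` would run the job and raise `CalledProcessError` if `srun` exits with a non-zero status. The working directory is an assumption based on the `cd ~/LLaMA-Factory` step above.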