## YOLOv6 SageMaker Training

### Setup (Optional) / Run the bash script train_local.sh

Download the dataset

For sample training we will use the Underwater Trash Dataset provided by LearnOpenCV.  
Reference: https://learnopencv.com/yolov6-custom-dataset-training/

In [None]:
!mkdir -p container/local_test/test_dir/model
!mkdir -p container/local_test/test_dir/output
!mkdir -p container/local_test/test_dir/input/data/cfg
!mkdir -p container/local_test/test_dir/input/data/weights

In [None]:
!wget "https://www.dropbox.com/s/lbji5ho8b1m3op1/reduced_label_yolov6.zip?dl=1" -O dataset.zip
!unzip dataset.zip "reduced_label_yolov6/images/*" -d .
!unzip dataset.zip "reduced_label_yolov6/labels/*" -d .
!mv reduced_label_yolov6/* container/local_test/test_dir/input/data/
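Before training, it can help to check that the extracted images and labels line up. The sketch below assumes the dataset root contains `images/<split>/` and `labels/<split>/` with one `.txt` label file per image stem, which is inferred from the unzip commands and the data YAML rather than stated by the dataset authors. `find_unlabeled_images` is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: these are the image extensions used by the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unlabeled_images(root, split):
    """Return names of images in images/<split> that have no labels/<split>/<stem>.txt."""
    image_dir = Path(root) / "images" / split
    label_dir = Path(root) / "labels" / split
    missing = []
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not an image
        if not (label_dir / f"{image_path.stem}.txt").exists():
            missing.append(image_path.name)
    return missing
```

Calling it with the directory that holds `images/` and `labels/` and a split name such as `"train"` returns a list of image files without a label file, so you can decide whether those are intended background images or a packaging problem.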


Download the weights

In [None]:
!wget https://github.com/meituan/YOLOv6/releases/download/0.3.0/yolov6s.pt
!mv yolov6s.pt container/local_test/test_dir/input/data/weights/

Download the yolov6s fine-tuning config file

In [None]:
!wget https://raw.githubusercontent.com/meituan/YOLOv6/main/configs/yolov6s_finetune.py
!mv yolov6s_finetune.py container/local_test/test_dir/input/data/cfg/yolov6s_finetune.py

Create/Edit the data yaml

In [3]:
with open("container/local_test/test_dir/input/data/cfg/underwater_trash.yaml", 'w') as fp:
    fp.write(
"""
train: '/opt/ml/input/data/images/train' # train images
val: '/opt/ml/input/data/images/valid' # val images
 
# whether it is coco dataset, only coco dataset should be set to True.
is_coco: False
# Classes
nc: 4  # number of classes
names: [
    'animal',
    'plant',
    'rov',
    'trash'
]  # class names
""")

Create/Edit the train_args

In [None]:
with open("container/local_test/test_dir/input/data/cfg/train-args.json", 'w') as fp:
    fp.write("""
{
    "DATASET": "/opt/ml/input/data/cfg/underwater_trash.yaml",
    "CFG_PATH": "/opt/ml/input/data/cfg/yolov6s_finetune.py",
    "NAME": "tutorial",
    "IMG_SIZE": "640",
    "EPOCHS": "2",
    "BATCH": "32",
    "DEVICE": 0
}""")
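Hand-written JSON is easy to break with a stray comma or quote. As an alternative, the same file can be produced from a Python dict with `json.dump`. This is a minimal sketch: the key list mirrors the cell above, while `dump_train_args` and `REQUIRED_KEYS` are hypothetical names, and how the container's `train` script consumes the file is not shown in this notebook.

```python
import json

# Keys taken from the train-args.json cell above.
REQUIRED_KEYS = ("DATASET", "CFG_PATH", "NAME", "IMG_SIZE", "EPOCHS", "BATCH", "DEVICE")

train_args = {
    "DATASET": "/opt/ml/input/data/cfg/underwater_trash.yaml",
    "CFG_PATH": "/opt/ml/input/data/cfg/yolov6s_finetune.py",
    "NAME": "tutorial",
    "IMG_SIZE": "640",
    "EPOCHS": "2",
    "BATCH": "32",
    "DEVICE": 0,
}


def dump_train_args(args, path):
    """Write args as JSON after checking that every expected key is present."""
    missing = [key for key in REQUIRED_KEYS if key not in args]
    if missing:
        raise ValueError(f"missing training arguments: {missing}")
    with open(path, "w") as fp:
        json.dump(args, fp, indent=2)
```

For example, `dump_train_args(train_args, "container/local_test/test_dir/input/data/cfg/train-args.json")` writes a file equivalent to the one created by hand.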

### Local Host Training

#### Build the container

In [None]:
%cd container

In [None]:
# Before building the image, install and configure the AWS CLI, then run the command below.
# It authenticates Docker with the ECR registry that hosts the AWS PyTorch Deep Learning Containers:
# https://github.com/aws/deep-learning-containers/blob/master/available_images.md

!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

In [6]:
# Build the Docker image
!docker build . -t yolov6-sagemaker-training:pt2.0.0-gpu-py310-cu118-ubuntu20.04-ec2

Sending build context to Docker daemon  911.1MB
Step 1/14 : ARG BASE_IMG=763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-ec2
Step 2/14 : ARG BASE_IMG=${BASE_IMG}
Step 3/14 : FROM ${BASE_IMG}
 ---> 06de75fa23a2
Step 4/14 : ENV PATH="/opt/code:${PATH}"
 ---> Using cache
 ---> 73072cea5b35
Step 5/14 : WORKDIR /opt/code
 ---> Using cache
 ---> 36b85ed4c909
Step 6/14 : RUN apt-get update && apt-get upgrade -y --no-install-recommends
 ---> Using cache
 ---> 5b870d7c43b9
Step 7/14 : RUN apt-get install jq -y
 ---> Using cache
 ---> 99dc6a26f0a7
Step 8/14 : RUN ldconfig -v
 ---> Using cache
 ---> 5e800240dae9
Step 9/14 : RUN cd /opt && git clone https://github.com/meituan/YOLOv6.git
 ---> Using cache
 ---> a4d78bcadb85
Step 10/14 : ENV PATH="/opt/YOLOv6:${PATH}"
 ---> Using cache
 ---> 03124f483861
Step 11/14 : RUN pip3 install -r /opt/YOLOv6/requirements.txt  --no-cache-dir
 ---> Using cache
 ---> e595a325c951
Step 12/14 : WORKDIR /opt/YOLOv6
 -

In [3]:
# Run the training container locally
%cd local_test/

!docker run -it --ipc=host --gpus=all -v $(pwd)/test_dir:/opt/ml yolov6-sagemaker-training:pt2.0.0-gpu-py310-cu118-ubuntu20.04-ec2 train

/home/ubuntu/yolov6-sagemaker/container/local_test

Dataset path: /opt/ml/input/data/cfg/underwater_trash.yaml
Model Configuration: /opt/ml/input/data/cfg/yolov6s_finetune.py
Training device: 0
Image Size: 640
Batch size: 32
Number of training epochs: 2
Experiment name: tutorial

Initiating Training...
Using 1 GPU for training... 
training args are: Namespace(data_path='/opt/ml/input/data/cfg/underwater_trash.yaml', conf_file='/opt/ml/input/data/cfg/yolov6s_finetune.py', img_size=640, rect=False, batch_size=32, epochs=2, workers=8, device='0', eval_interval=20, eval_final_only=False, heavy_eval_range=50, check_images=False, check_labels=False, output_dir='./runs/train', name='tutorial', dist_url='env://', gpu_count=0, local_rank=-1, resume=False, write_trainbatch_tb=False, stop_aug_last_n_epoch=15, save_ckpt_on_last_n_epoch=-1, distill=False, distill_feat=False, quant=False, calib=False, teacher_model_path=None, temperature=20, fuse_ab=False, bs_per_gpu=32, specific_shape=False, height

In [4]:
## Show local training results
!ls ./test_dir/model/

best_ckpt.pt


In [6]:
!ls ./test_dir/output/tutorial

args.yaml					  predictions.json
events.out.tfevents.1684777964.331cc08c060d.28.0  weights


### SageMaker Training

In [8]:
# Push the contents of test_dir to the S3 bucket
import boto3
region = boto3.session.Session().region_name
bucket = 'yolo-sagemaker-traning-202305'
#!aws s3api create-bucket --bucket {bucket} --create-bucket-configuration LocationConstraint={region}

In [12]:
!aws s3 cp --recursive local_test/test_dir/ s3://{bucket}

upload: local_test/test_dir/input/data/cfg/train-args.json to s3://yolo-sagemaker-traning-202305/input/data/cfg/train-args.json
upload: local_test/test_dir/input/data/cfg/yolov6s_finetune.py to s3://yolo-sagemaker-traning-202305/input/data/cfg/yolov6s_finetune.py
upload: local_test/test_dir/input/data/cfg/underwater_trash.yaml to s3://yolo-sagemaker-traning-202305/input/data/cfg/underwater_trash.yaml
upload: local_test/test_dir/input/config/resourceconfig.json to s3://yolo-sagemaker-traning-202305/input/config/resourceconfig.json
upload: local_test/test_dir/input/data/images/train/vid_000003_frame0000010.jpg to s3://yolo-sagemaker-traning-202305/input/data/images/train/vid_000003_frame0000010.jpg
upload: local_test/test_dir/input/data/images/train/vid_000003_frame0000011.jpg to s3://yolo-sagemaker-traning-202305/input/data/images/train/vid_000003_frame0000011.jpg
upload: local_test/test_dir/input/data/images/train/vid_000003_frame0000012.jpg to s3://yolo-sagemaker-traning-202305/input/

In [13]:
# ECR URI
account=boto3.client('sts').get_caller_identity().get('Account')
repositoryUri="{}.dkr.ecr.{}.amazonaws.com/yolov6s-sagemaker-training".format(account, region)
repositoryUri

'292776702400.dkr.ecr.eu-west-1.amazonaws.com/yolov6s-sagemaker-training'
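The notebook does not show how the locally built image reaches this ECR repository. One plausible sequence is to log Docker in to your own registry, tag the local image with the repository URI, and push it. The sketch below only builds the command strings. It assumes the ECR repository already exists and that the `latest` tag is wanted (the URI above carries no tag, so Docker's default tag would be pulled). `ecr_push_commands` is a hypothetical helper, and the account ID in the example is a placeholder.

```python
def ecr_push_commands(account, region, repository, local_image, tag="latest"):
    """Return the shell commands to log in to ECR, tag a local image and push it."""
    registry = f"{account}.dkr.ecr.{region}.amazonaws.com"
    remote_image = f"{registry}/{repository}:{tag}"
    return [
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker tag {local_image} {remote_image}",
        f"docker push {remote_image}",
    ]


# Example with a placeholder account ID; substitute the values from the cell above.
for command in ecr_push_commands(
    account="123456789012",
    region="eu-west-1",
    repository="yolov6s-sagemaker-training",
    local_image="yolov6-sagemaker-training:pt2.0.0-gpu-py310-cu118-ubuntu20.04-ec2",
):
    print(command)
```

Each printed line can be run in a terminal or prefixed with `!` in a notebook cell.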

In [14]:
import time

In [15]:
# define the paths
cfg='s3://{}/input/data/cfg/'.format(bucket)
images='s3://{}/input/data/images/'.format(bucket)
weights='s3://{}/input/data/weights/'.format(bucket)
labels='s3://{}/input/data/labels/'.format(bucket)
outpath='s3://{}/output/'.format(bucket)

In [16]:
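import sagemaker
from sagemaker import get_execution_role
# The execution role ARN is hard-coded here; inside a SageMaker notebook, get_execution_role() returns it instead
role = "arn:aws:iam::292776702400:role/service-role/AmazonSageMaker-ExecutionRole-20230519T124798"
sm = boto3.client('sagemaker')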

In [17]:
job_name = "yolov6s-training" + time.strftime("%Y-%b-%d-%H-%M-%S", time.gmtime())

response = sm.create_training_job(
      TrainingJobName=job_name,
      AlgorithmSpecification={
          'TrainingImage': repositoryUri,
          'TrainingInputMode': 'File',
      },
      RoleArn=role,
      InputDataConfig=[
          {
              'ChannelName': 'cfg',
              'DataSource': {
                  'S3DataSource': {
                      'S3DataType': 'S3Prefix',
                      'S3Uri': cfg,
                      'S3DataDistributionType': 'FullyReplicated',
                  },
              },
              'InputMode': 'File'
          },
          {
              'ChannelName': 'images',
              'DataSource': {
                  'S3DataSource': {
                      'S3DataType': 'S3Prefix',                      
                      'S3Uri': images,
                      'S3DataDistributionType': 'FullyReplicated',
                  },
              },
              'InputMode': 'File'
          },
          {
              'ChannelName': 'labels',
              'DataSource': {
                  'S3DataSource': {
                      'S3DataType': 'S3Prefix',                      
                      'S3Uri': labels,
                      'S3DataDistributionType': 'FullyReplicated',
                  },
              },
              'InputMode': 'File'
          },
          {
              'ChannelName': 'weights',
              'DataSource': {
                  'S3DataSource': {
                      'S3DataType': 'S3Prefix',                      
                      'S3Uri': weights,
                      'S3DataDistributionType': 'FullyReplicated',
                  },
              },
              'InputMode': 'File'
          }
      ],
      OutputDataConfig={
          'S3OutputPath': outpath
      },
      ResourceConfig={
          'InstanceType': 'ml.p3.2xlarge',
          'InstanceCount': 1,
          'VolumeSizeInGB': 20,
      },
      StoppingCondition={
        'MaxRuntimeInSeconds': 60*60*5,
      }
  )
response

{'TrainingJobArn': 'arn:aws:sagemaker:eu-west-1:292776702400:training-job/yolov6s-training2023-May-22-18-21-23',
 'ResponseMetadata': {'RequestId': '0c3cb9bc-8737-4607-8eae-0ef045909942',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0c3cb9bc-8737-4607-8eae-0ef045909942',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '111',
   'date': 'Mon, 22 May 2023 18:21:24 GMT'},
  'RetryAttempts': 0}}
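The four channel entries in `InputDataConfig` differ only in name and S3 prefix, so they can be generated. In File mode, SageMaker makes each channel available inside the container at `/opt/ml/input/data/<ChannelName>`, which is why the channel names match the paths used in the data YAML and `train-args.json`. A minimal sketch, with `s3_channel` and `build_input_data_config` as hypothetical helper names:

```python
def s3_channel(name, s3_uri):
    """Build one fully replicated, File-mode input channel for create_training_job."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            },
        },
        "InputMode": "File",
    }


def build_input_data_config(bucket, channels=("cfg", "images", "labels", "weights")):
    """Map each channel name to s3://<bucket>/input/data/<name>/, as in this notebook."""
    return [s3_channel(name, f"s3://{bucket}/input/data/{name}/") for name in channels]


input_data_config = build_input_data_config("yolo-sagemaker-traning-202305")
```

The resulting list can be passed as `InputDataConfig=input_data_config` in place of the literal dictionaries.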

In [18]:
# Poll until the job reaches a terminal state (Completed, Failed or Stopped)
while True:
    description = sm.describe_training_job(TrainingJobName=job_name)
    job_status = description['TrainingJobStatus']
    status_message = description['SecondaryStatusTransitions'][-1].get('StatusMessage', '')
    print(f"Training Job status: '{job_status}', Message: '{status_message}'", flush=True)
    if job_status in ('Completed', 'Failed', 'Stopped'):
        break
    time.sleep(20)
print(f"Training Job finished with status: {job_status}")

Training Job status: 'InProgress', Message: 'Starting the training job'
Training Job status: 'InProgress', Message: 'Starting the training job'
Training Job status: 'InProgress', Message: 'Preparing the instances for training'
Training Job status: 'InProgress', Message: 'Preparing the instances for training'
Training Job status: 'InProgress', Message: 'Preparing the instances for training'
Training Job status: 'InProgress', Message: 'Downloading input data'
Training Job status: 'InProgress', Message: 'Downloading input data'
Training Job status: 'InProgress', Message: 'Downloading input data'
Training Job status: 'InProgress', Message: 'Downloading input data'
Training Job status: 'InProgress', Message: 'Downloading input data'
Training Job status: 'InProgress', Message: 'Downloading the training image'
Training Job status: 'InProgress', Message: 'Downloading the training image'
Training Job status: 'InProgress', Message: 'Downloading the training image'
Training Job status: 'InProgres

In [21]:
model_uri = sm.describe_training_job(TrainingJobName=job_name)['ModelArtifacts']['S3ModelArtifacts']
model_uri

's3://yolo-sagemaker-traning-202305/output/yolov6s-training2023-May-22-18-21-23/output/model.tar.gz'

In [22]:
!aws s3 cp {model_uri} .
!tar -xvf ./model.tar.gz

download: s3://yolo-sagemaker-traning-202305/output/yolov6s-training2023-May-22-18-21-23/output/model.tar.gz to ./model.tar.gz
best_ckpt.pt
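SageMaker packs whatever the container wrote to `/opt/ml/model` into `model.tar.gz`. If you prefer to inspect the archive from Python before extracting it, the standard-library `tarfile` module is enough. This is a small sketch; `list_model_artifacts` is a hypothetical helper name.

```python
import tarfile


def list_model_artifacts(archive_path):
    """Return the sorted names of regular files stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(member.name for member in archive.getmembers() if member.isfile())
```

For the job above, `list_model_artifacts("model.tar.gz")` should list the checkpoint shown in the `tar` output.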
