## SageMaker Training for DDA

### Pre-requisites

1. Note: This notebook contains elements that render correctly only in the Jupyter interface. Open it from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role you use has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience with **Amazon SageMaker** is assumed.
1. To use this algorithm successfully, ensure that either:
   
   A. Your IAM role has these three permissions, and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
   
        a. aws-marketplace:ViewSubscriptions
        b. aws-marketplace:Unsubscribe
        c. aws-marketplace:Subscribe
   
   B. Or your AWS account already has a subscription to [Computer Vision Defect Detection Model](https://aws.amazon.com/marketplace/pp/prodview-j72hhmlt6avp6).
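
For option A, the three permissions can be written as a single IAM policy statement. The following is a minimal sketch: the helper name `marketplace_policy` is hypothetical, and `"Resource": "*"` is an assumption that you may want to tighten for your account.

```python
import json

# The three AWS Marketplace actions listed in the pre-requisites.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]


def marketplace_policy():
    """Return an IAM policy document (as a dict) allowing the three actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": MARKETPLACE_ACTIONS,
                # Assumption: all resources; narrow this if your organisation requires it.
                "Resource": "*",
            }
        ],
    }


# Print the JSON so it can be pasted into the IAM console.
print(json.dumps(marketplace_policy(), indent=2))
```

The printed JSON can be attached to the role as an inline policy by an administrator; this sketch only builds the document and makes no AWS calls.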

### Subscribe to the algorithm

To subscribe to the algorithm:

1. Open the algorithm listing page: [Computer Vision Defect Detection Model](https://aws.amazon.com/marketplace/pp/prodview-j72hhmlt6avp6).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **Accept Offer** if you agree with them.
1. Click the **Continue to configuration** button and choose a region; you will then see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN and assign it to `algorithm_name` in the following cell.

In [1]:
# TODO: replace with the ARN of the SageMaker algorithm you subscribed to
algorithm_name = "arn:aws:sagemaker:us-east-1:865070037744:algorithm/lfv-public-algorithm-2025-02-0-422b42886fa13174a28ac6ffbf8fc874"
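
Because the algorithm ARN embeds a region and an account ID, it can be worth sanity-checking the value before starting a job. The sketch below splits an ARN into its parts; the helper `parse_algorithm_arn` and the regular expression are illustrative assumptions, not part of the SageMaker SDK.

```python
import re

# Assumed layout: arn:<partition>:sagemaker:<region>:<account>:algorithm/<name>
ALGORITHM_ARN_RE = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):algorithm/(?P<name>[A-Za-z0-9-]+)$"
)


def parse_algorithm_arn(arn):
    """Split a SageMaker algorithm ARN into partition, region, account and name."""
    match = ALGORITHM_ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"Not a SageMaker algorithm ARN: {arn!r}")
    return match.groupdict()


parts = parse_algorithm_arn(
    "arn:aws:sagemaker:us-east-1:865070037744:algorithm/"
    "lfv-public-algorithm-2025-02-0-422b42886fa13174a28ac6ffbf8fc874"
)
print(parts["region"], parts["name"])
```

Comparing `parts["region"]` with the session region used later in the notebook is one way to catch a subscription that was configured for a different region.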

### Set Up

In [2]:
import boto3
import sagemaker
import json

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()
# Project name would be used as part of s3 output path
project = "LFV-public-test"

In [27]:

# Reuse the bucket and project defined in the previous cell
bucket = session.default_bucket()
project = "LFV-public-test"

# Create S3 client
s3_client = boto3.client('s3')

# Create main project folder
folder_key = f"{project}/"
s3_client.put_object(Bucket=bucket, Key=folder_key)
print(f"Created folder: s3://{bucket}/{folder_key}")

# Create output folder
output_folder_key = f"{project}/output/"
s3_client.put_object(Bucket=bucket, Key=output_folder_key)
print(f"Created folder: s3://{bucket}/{output_folder_key}")

# Create compilation_output folder
compilation_output_folder_key = f"{project}/compilation_output/"
s3_client.put_object(Bucket=bucket, Key=compilation_output_folder_key)
print(f"Created folder: s3://{bucket}/{compilation_output_folder_key}")

# Define the S3 paths for later use
output_path = f's3://{bucket}/{project}/output'
compilation_output_path = f's3://{bucket}/{project}/compilation_output'

print(f"Output path: {output_path}")
print(f"Compilation output path: {compilation_output_path}")



Created folder: s3://sagemaker-us-east-1-164152369890/LFV-public-test/
Created folder: s3://sagemaker-us-east-1-164152369890/LFV-public-test/output/
Created folder: s3://sagemaker-us-east-1-164152369890/LFV-public-test/compilation_output/
Output path: s3://sagemaker-us-east-1-164152369890/LFV-public-test/output
Compilation output path: s3://sagemaker-us-east-1-164152369890/LFV-public-test/compilation_output


## Getting the cookie dataset from GitHub


In [28]:
!git clone https://github.com/aws-samples/amazon-lookout-for-vision.git
!rm -rf cookie-dataset  # Remove existing
!cp -r amazon-lookout-for-vision/computer-vision-defect-detection/cookie-dataset ./
!rm -rf amazon-lookout-for-vision
!ls -la cookie-dataset/  # Verify dataset contents


Cloning into 'amazon-lookout-for-vision'...
remote: Enumerating objects: 520, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 520 (delta 28), reused 27 (delta 19), pack-reused 468 (from 2)
Receiving objects: 100% (520/520), 534.39 MiB | 88.72 MiB/s, done.
Resolving deltas: 100% (75/75), done.
total 28
drwxrwxr-x  4 ec2-user ec2-user 4096 Sep 30 02:09 .
drwxr-xr-x 13 ec2-user ec2-user 4096 Sep 30 02:09 ..
drwxrwxr-x  5 ec2-user ec2-user 4096 Sep 30 02:09 dataset-files
-rw-rw-r--  1 ec2-user ec2-user  561 Sep 30 02:09 dummy_anomaly_mask.png
-rw-rw-r--  1 ec2-user ec2-user 5399 Sep 30 02:09 getting_started.py
drwxrwxr-x  2 ec2-user ec2-user 4096 Sep 30 02:09 test-images


## Uploading manifest files to S3

In [29]:
import os
import shutil
import json
# boto3 is already imported in the Set Up cell

# Step 1: AWS profile setup
# Note: this overwrites any existing ~/.aws/config and ~/.aws/credentials files.
aws_dir = os.path.expanduser('~/.aws')
os.makedirs(aws_dir, exist_ok=True)

config_content = """[default]
region = us-east-1

[profile lookoutvision-access]
region = us-east-1
"""

credentials_content = """[default]

[lookoutvision-access]
"""

with open(f'{aws_dir}/config', 'w') as f:
    f.write(config_content)
with open(f'{aws_dir}/credentials', 'w') as f:
    f.write(credentials_content)

print("✅ Created AWS profile configuration")

# Step 2: Copy dataset-files from cookie-dataset to current directory
if os.path.exists('dataset-files'):
    shutil.rmtree('dataset-files')
shutil.copytree('cookie-dataset/dataset-files', 'dataset-files')
print("✅ Copied dataset-files to current directory")

# Step 3: Copy dummy_anomaly_mask.png into mask-images
if os.path.exists('cookie-dataset/dummy_anomaly_mask.png'):
    shutil.copy2('cookie-dataset/dummy_anomaly_mask.png', 'dataset-files/mask-images/dummy_anomaly_mask.png')
    print("✅ Copied dummy_anomaly_mask.png to mask-images")

# Step 4: Verify the structure is ready
print("✅ Dataset structure verification:")
print(f"Training images: {len(os.listdir('dataset-files/training-images'))} files")
print(f"Mask images: {len(os.listdir('dataset-files/mask-images'))} files")
print(f"Template manifest exists: {os.path.exists('dataset-files/manifests/template.manifest')}")

# Step 5: Execute the script
bucket = session.default_bucket()
project = "LFV-public-test"
s3_uri = f"s3://{bucket}/{project}/"

print(f"Executing getting_started.py with S3 URI: {s3_uri}")
!python cookie-dataset/getting_started.py {s3_uri}

# Step 6: Save training manifest S3 URI
training_manifest_s3_uri = f"{s3_uri}manifests/train.manifest"
print(f"Training manifest S3 URI: {training_manifest_s3_uri}")

# Step 7: Download and update segmentation manifest from GitHub
print("Downloading segmentation manifest from GitHub...")
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-lookout-for-vision/d4002d64b1ba395d332b994a0c268342ac62b1ed/computer-vision-defect-detection/train_segmentation.manifest

# Step 8: Update segmentation manifest with your S3 bucket paths
segmentation_lines = []

with open('train_segmentation.manifest', 'r') as f:
    for line in f:
        data = json.loads(line.strip())
        
        # Update source-ref to your S3 bucket
        if 'source-ref' in data:
            source_ref = data['source-ref']
            if source_ref.startswith('s3://lookoutvision-us-east-1-0e205be246/getting-started/'):
                # Extract the relative path and update to your bucket
                relative_path = source_ref.replace('s3://lookoutvision-us-east-1-0e205be246/getting-started/', '')
                data['source-ref'] = s3_uri + relative_path
        
        # Update anomaly-mask-ref to your S3 bucket
        if 'anomaly-mask-ref' in data:
            mask_ref = data['anomaly-mask-ref']
            if mask_ref.startswith('s3://lookoutvision-us-east-1-0e205be246/getting-started/'):
                # Extract the relative path and update to your bucket
                relative_path = mask_ref.replace('s3://lookoutvision-us-east-1-0e205be246/getting-started/', '')
                data['anomaly-mask-ref'] = s3_uri + relative_path
        
        segmentation_lines.append(json.dumps(data))

# Step 9: Save updated segmentation manifest locally
segmentation_manifest_path = 'dataset-files/manifests/train_segmentation.manifest'
with open(segmentation_manifest_path, 'w') as f:
    for line in segmentation_lines:
        f.write(line + '\n')

print(f"✅ Created segmentation manifest with {len(segmentation_lines)} entries")

# Step 10: Upload segmentation manifest to S3
s3_client = boto3.client('s3')
bucket_name = bucket
s3_key = f"{project}/manifests/train_segmentation.manifest"

s3_client.upload_file(segmentation_manifest_path, bucket_name, s3_key)

# Create segmentation manifest S3 URI variable
segmentation_manifest_s3_uri = f"s3://{bucket_name}/{s3_key}"
print(f"Segmentation manifest S3 URI: {segmentation_manifest_s3_uri}")

# Clean up downloaded file
os.remove('train_segmentation.manifest')

print("\n🎉 Complete Setup Summary:")
print(f"📁 Training images: {s3_uri}training-images/")
print(f"🎭 Mask images: {s3_uri}mask-images/")
print(f"📋 Training manifest: {training_manifest_s3_uri}")
print(f"🔍 Segmentation manifest: {segmentation_manifest_s3_uri}")
print("📊 Total files in S3: 98 files")


✅ Created AWS profile configuration
✅ Copied dataset-files to current directory
✅ Copied dummy_anomaly_mask.png to mask-images
✅ Dataset structure verification:
Training images: 63 files
Mask images: 33 files
Template manifest exists: True
Executing getting_started.py with S3 URI: s3://sagemaker-us-east-1-164152369890/LFV-public-test/
Copying getting started files to s3://sagemaker-us-east-1-164152369890/LFV-public-test/
INFO: Creating manifest file from /home/ec2-user/SageMaker/dataset-files.
INFO: Destination: s3://sagemaker-us-east-1-164152369890/LFV-public-test/
INFO: Writing json line: {"source-ref": "s3://sagemaker-us-east-1-164152369890/LFV-public-test/training-images/anomaly-1.jpg", "anomaly-label-metadata": {"job-name": "anomaly-label", "class-name": "anomaly", "human-annotated": "yes", "creation-date": "2022-08-22T20:52:51.851Z", "type": "groundtruth/image-classification"}, "anomaly-label": 1, "anomaly-mask-ref-metadata": {"internal-color-map": {"0": {"class-name": "c
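
Each line of the manifest is a standalone JSON object, and the training jobs later in this notebook list the attribute names they read from it. A quick local check that every line carries those attributes can surface a malformed manifest before a job is started. The sketch below is illustrative: `missing_attributes` is a hypothetical helper and the sample record is a shortened stand-in, not real output.

```python
import json

# Attribute names the classification training job reads from each manifest line.
REQUIRED_CLASSIFICATION_ATTRS = (
    "source-ref",
    "anomaly-label-metadata",
    "anomaly-label",
)


def missing_attributes(manifest_line, required=REQUIRED_CLASSIFICATION_ATTRS):
    """Return the required attribute names that are absent from one JSON line."""
    record = json.loads(manifest_line)
    return [name for name in required if name not in record]


# Shortened, made-up record in the same shape as the manifest lines above.
sample = json.dumps({
    "source-ref": "s3://example-bucket/LFV-public-test/training-images/anomaly-1.jpg",
    "anomaly-label-metadata": {"class-name": "anomaly"},
    "anomaly-label": 1,
})
print(missing_attributes(sample))
```

For the segmentation manifest, the same helper can be called with `required` extended by `anomaly-mask-ref-metadata` and `anomaly-mask-ref`.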

### Use existing IAM role

In [32]:
import sagemaker

# Get the current execution role
sm_role_arn = sagemaker.get_execution_role()
print(f"Current SageMaker execution role ARN: {sm_role_arn}")

# Now you can use sm_role_arn in your SageMaker operations



Current SageMaker execution role ARN: arn:aws:iam::164152369890:role/SageMakerExecutionRole


## Classification Model
Start the training job for the classification model.

In [37]:
import datetime
# Note: this rebinds the name `sagemaker` from the SDK module to a boto3 SageMaker client.
sagemaker = boto3.Session(region_name=region).client("sagemaker")
classification_training_job_name = 'LFV-classification'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

To use the robust model feature for the classification model, set the hyperparameters as follows:
```
HyperParameters={
    'ModelType': 'classification-robust',
    'TestInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label',
    'TrainingInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label'
},
```

In [40]:
response = sagemaker.create_training_job(
    TrainingJobName=classification_training_job_name,
    HyperParameters={
        'ModelType': 'classification',
        'TestInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label',
        'TrainingInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label'
    },
    AlgorithmSpecification={
        'AlgorithmName': algorithm_name,
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': False
    },
    RoleArn=sm_role_arn,
    InputDataConfig=[
        {
            'ChannelName': 'training',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'AugmentedManifestFile',
                    'S3Uri': training_manifest_s3_uri,
                    'S3DataDistributionType': 'ShardedByS3Key',
                    'AttributeNames': [
                        'source-ref',
                        'anomaly-label-metadata',
                        'anomaly-label'
                    ],
                }
            },
            'CompressionType': 'None',
            'RecordWrapperType': 'RecordIO',
            'InputMode': 'Pipe'
        },
    ],
    OutputDataConfig={'S3OutputPath': output_path},
    ResourceConfig={
        'InstanceType': 'ml.g4dn.4xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 20
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 7200
    },
    EnableNetworkIsolation=True
)

In [41]:
import time
while True:
    training_response = sagemaker.describe_training_job(
        TrainingJobName=classification_training_job_name
    )
    if training_response['TrainingJobStatus'] == 'InProgress':
        print(".", end='')
    elif training_response['TrainingJobStatus'] == 'Completed':
        print("Completed")
        break
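    elif training_response['TrainingJobStatus'] in ('Failed', 'Stopped'):
        # 'Stopped' is also a terminal status; without it the loop would never exit
        print(training_response['TrainingJobStatus'])
        break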
    else:
        print("?", end='')
    time.sleep(60)

.........Completed
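
The same polling pattern is repeated for every training and compilation job in this notebook. As a sketch, it can be factored into one helper that also stops on any terminal status and gives up after a timeout. The names `wait_for_status` and `TERMINAL_STATUSES` are hypothetical, and the fake `describe` function below stands in for a real `describe_training_job` call so that the example runs without AWS access.

```python
import time

# Terminal TrainingJobStatus values; compilation jobs use upper-case equivalents.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_status(describe, status_key, terminal=TERMINAL_STATUSES,
                    poll_seconds=60, timeout_seconds=7200, sleep=time.sleep):
    """Call describe() until response[status_key] is terminal, then return the response."""
    waited = 0
    while True:
        response = describe()
        status = response[status_key]
        if status in terminal:
            return response
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for the service: reports InProgress once, then Completed.
_fake_responses = iter([
    {"TrainingJobStatus": "InProgress"},
    {"TrainingJobStatus": "Completed"},
])
result = wait_for_status(lambda: next(_fake_responses), "TrainingJobStatus",
                         sleep=lambda seconds: None)
print(result["TrainingJobStatus"])
```

With the client from this notebook, the call would look like `wait_for_status(lambda: sagemaker.describe_training_job(TrainingJobName=classification_training_job_name), "TrainingJobStatus")`. boto3 also offers a built-in `training_job_completed_or_stopped` waiter, which may be preferable when no custom progress output is needed.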


******************

## Segmentation Model

Start the training job for the segmentation model.

In [42]:
segmentation_training_job_name = 'LFV-segmentation-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

In [43]:
sagemaker = boto3.Session(region_name=region).client("sagemaker")
response = sagemaker.create_training_job(
    TrainingJobName=segmentation_training_job_name,
    HyperParameters={
        # To use robust model feature, change "ModelType" to "segmentation-robust"
        'ModelType': 'segmentation',
        'TestInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label,anomaly-mask-ref-metadata,anomaly-mask-ref',
        'TrainingInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label,anomaly-mask-ref-metadata,anomaly-mask-ref'
    },
    AlgorithmSpecification={
        'AlgorithmName': algorithm_name,
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': False
    },
    RoleArn=sm_role_arn,
    InputDataConfig=[
        {
            'ChannelName': 'training',
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'AugmentedManifestFile',
                    'S3Uri': segmentation_manifest_s3_uri,
                    'S3DataDistributionType': 'ShardedByS3Key',
                    'AttributeNames': [
                        'source-ref',
                        'anomaly-label-metadata',
                        'anomaly-label',
                        'anomaly-mask-ref-metadata',
                        'anomaly-mask-ref'
                    ],
                }
            },
            'CompressionType': 'None',
            'RecordWrapperType': 'RecordIO',
            'InputMode': 'Pipe'
        },
    ],
    OutputDataConfig={'S3OutputPath':output_path},
    ResourceConfig={
        'InstanceType': 'ml.g4dn.4xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 20
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 7200
    },
    EnableNetworkIsolation=True
)
print(response)

{'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:164152369890:training-job/LFV-segmentation-2025-09-30-02-27-58', 'ResponseMetadata': {'RequestId': '2757c7d0-7739-4ca2-b04c-759dc5f2b7fe', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '2757c7d0-7739-4ca2-b04c-759dc5f2b7fe', 'content-type': 'application/x-amz-json-1.1', 'content-length': '111', 'date': 'Tue, 30 Sep 2025 02:28:09 GMT'}, 'RetryAttempts': 0}}


In [44]:
import time
while True:
    training_response = sagemaker.describe_training_job(
        TrainingJobName=segmentation_training_job_name
    )
    if training_response['TrainingJobStatus'] == 'InProgress':
        print(".", end='')
    elif training_response['TrainingJobStatus'] == 'Completed':
        print("Completed")
        break
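    elif training_response['TrainingJobStatus'] in ('Failed', 'Stopped'):
        # 'Stopped' is also a terminal status; without it the loop would never exit
        print(training_response['TrainingJobStatus'])
        break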
    else:
        print("?", end='')
    time.sleep(60)

.............Completed


To use the segmentation head only, set the hyperparameters as follows:
```
HyperParameters={
    'ModelType': 'segmentation',
    'TestInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label,anomaly-mask-ref-metadata,anomaly-mask-ref',
    'TrainingInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label,anomaly-mask-ref-metadata,anomaly-mask-ref',
    'classification_logic': 'seg_head'
},
```

To enable the robust model feature for the segmentation model, set the hyperparameters as follows:
```
HyperParameters={
    'ModelType': 'segmentation-robust',
    'TestInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label,anomaly-mask-ref-metadata,anomaly-mask-ref',
    'TrainingInputDataAttributeNames': 'source-ref,anomaly-label-metadata,anomaly-label,anomaly-mask-ref-metadata,anomaly-mask-ref'
},
```
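
The hyperparameter variants shown in this notebook differ only in `ModelType`, the attribute list, and the optional `classification_logic` key, so they can be generated from one helper. This is a sketch under that assumption; `build_hyperparameters` is a hypothetical name, and the notebook does not say whether the robust and segmentation-head-only options may be combined.

```python
CLASSIFICATION_ATTRS = ["source-ref", "anomaly-label-metadata", "anomaly-label"]
SEGMENTATION_ATTRS = CLASSIFICATION_ATTRS + [
    "anomaly-mask-ref-metadata",
    "anomaly-mask-ref",
]


def build_hyperparameters(task, robust=False, seg_head_only=False):
    """Build the HyperParameters dict for a classification or segmentation job."""
    if task not in ("classification", "segmentation"):
        raise ValueError(f"Unknown task: {task!r}")
    if seg_head_only and task != "segmentation":
        raise ValueError("seg_head_only applies to segmentation models only")
    attrs = SEGMENTATION_ATTRS if task == "segmentation" else CLASSIFICATION_ATTRS
    joined = ",".join(attrs)
    params = {
        "ModelType": f"{task}-robust" if robust else task,
        "TestInputDataAttributeNames": joined,
        "TrainingInputDataAttributeNames": joined,
    }
    if seg_head_only:
        # Key and value taken from the segmentation-head example in this notebook.
        params["classification_logic"] = "seg_head"
    return params


print(build_hyperparameters("segmentation", robust=True)["ModelType"])
```

The returned dict can be passed as `HyperParameters=` to `create_training_job`; the `AttributeNames` list in `InputDataConfig` should use the same attribute names.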

***********

## Compilation job - Classification

After the training job completes, we create a SageMaker compilation job. In the compilation job we specify the target device that the model will run on together with the DDA edge application.

Because a SageMaker compilation job expects exactly one PyTorch model file, the training job's output artifact cannot be used directly.

Prepare the model for compilation:
1. Download the trained model.
2. Extract the archive and repackage the mochi.pt file on its own into classification.tar.gz.
3. Upload the new archive to S3.
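
The repackaging in step 2 can also be done without extracting to disk, by copying a single member from one archive into another. The sketch below demonstrates this on a throwaway archive built in a temporary directory; `repackage_single_file` is a hypothetical helper, and the placeholder bytes stand in for real model files.

```python
import os
import tarfile
import tempfile


def repackage_single_file(src_tar_gz, file_name, dst_tar_gz):
    """Copy the member whose base name is file_name into a new .tar.gz archive."""
    with tarfile.open(src_tar_gz, "r:gz") as src:
        matches = [m for m in src.getmembers()
                   if m.isfile() and os.path.basename(m.name) == file_name]
        if not matches:
            raise FileNotFoundError(f"{file_name} not found in {src_tar_gz}")
        member = matches[0]
        # Store the file at the archive root, whatever its original path was.
        info = tarfile.TarInfo(name=file_name)
        info.size = member.size
        info.mtime = member.mtime
        with tarfile.open(dst_tar_gz, "w:gz") as dst:
            dst.addfile(info, src.extractfile(member))


with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in model.tar.gz containing two placeholder files.
    src_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(src_path, "w:gz") as tar:
        for name in ("mochi.pt", "mochi.json"):
            file_path = os.path.join(tmp, name)
            with open(file_path, "wb") as f:
                f.write(b"placeholder")
            tar.add(file_path, arcname=name)

    dst_path = os.path.join(tmp, "classification.tar.gz")
    repackage_single_file(src_path, "mochi.pt", dst_path)
    with tarfile.open(dst_path, "r:gz") as tar:
        repacked_names = tar.getnames()

print(repacked_names)
```

The cells that follow use the extract-then-repack approach instead, which also leaves mochi.json on disk so that the input shape can be read from it.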

In [45]:
res_class = sagemaker.describe_training_job(TrainingJobName=classification_training_job_name)
output_model_path = res_class['ModelArtifacts']['S3ModelArtifacts']
print(output_model_path)

s3://sagemaker-us-east-1-164152369890/LFV-public-test/output/LFV-classification2025-09-30-02-14-49/output/model.tar.gz


In [46]:
from urllib.parse import urlparse

parsed_url = urlparse(output_model_path)
output_bucket = parsed_url.netloc
output_key = parsed_url.path.lstrip('/')
print(output_bucket)
print(output_key)

sagemaker-us-east-1-164152369890
LFV-public-test/output/LFV-classification2025-09-30-02-14-49/output/model.tar.gz
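
The bucket and key split above, and the later derivation of a key next to model.tar.gz, can be wrapped in two small helpers. This is a sketch; `split_s3_uri` and `sibling_key` are hypothetical names, and the URI is the one printed by the previous cell.

```python
import posixpath
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key), rejecting other schemes."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def sibling_key(key, file_name):
    """Return a key under the same prefix as key, with a different file name."""
    return posixpath.join(posixpath.dirname(key), file_name)


example_bucket, example_key = split_s3_uri(
    "s3://sagemaker-us-east-1-164152369890/LFV-public-test/output/"
    "LFV-classification2025-09-30-02-14-49/output/model.tar.gz"
)
print(example_bucket)
print(sibling_key(example_key, "classification.tar.gz"))
```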


In [47]:
import tarfile
import os
import fnmatch
from pathlib import Path

s3_client = boto3.client('s3')
path = "./classification"
Path(path).mkdir(parents=True, exist_ok=True)

# Download the .tar.gz file from S3
input_tar_gz = os.path.join(path, 'model.tar.gz')
s3_client.download_file(output_bucket, output_key, input_tar_gz)

# Extract the contents of the .tar.gz file
extract_dir = os.path.join(path, 'extracted')
Path(extract_dir).mkdir(parents=True, exist_ok=True)
with tarfile.open(input_tar_gz, 'r:gz') as tar:
    tar.extractall(path=extract_dir)
print(f"Extracted {input_tar_gz} to {extract_dir}.")

# Locate mochi.pt in the extracted archive
model_file = os.path.join(extract_dir, 'mochi.pt')
if not os.path.isfile(model_file):
    raise FileNotFoundError("No mochi.pt file found.")

print(f"Found model file: {model_file}")

# Extract the input_shape from mochi.json
mochi_path = os.path.join(extract_dir, 'mochi.json')
if not os.path.isfile(mochi_path):
    raise FileNotFoundError("No mochi.json file found.")

print(f"Found Mochi file: {mochi_path}")
with open(mochi_path, 'r') as f:
    mochi_data = json.load(f)
    input_shape = mochi_data['stages'][0]['input_shape']
    print(f"Extracted input_shape: {input_shape}")
    
    # Extract height and width
    height = input_shape[2]  # 768
    width = input_shape[3]   # 576
    
    # Build tensor shape for DataInputConfig
    tensor_shape = [1, 3, height, width]
    data_input_config = json.dumps({"input_shape": tensor_shape})
    
    print(f"Height: {height}, Width: {width}")
    print(f"DataInputConfig: {data_input_config}")


# Create a new .tar.gz file containing only mochi.pt
output_tar_gz = os.path.join(path, 'classification.tar.gz')
with tarfile.open(output_tar_gz, "w:gz") as tar:
    tar.add(model_file, arcname=os.path.basename(model_file))
print(f"Created tar.gz file {output_tar_gz} with {model_file}.")

# Upload the new .tar.gz file to S3
target_key = output_key.rsplit('/', 1)[0] + '/classification.tar.gz'
s3_client.upload_file(output_tar_gz, output_bucket, target_key)
print(f"Uploaded {output_tar_gz} to bucket {output_bucket} with key {target_key}.")

Extracted ./classification/model.tar.gz to ./classification/extracted.
Found model file: ./classification/extracted/mochi.pt
Found Mochi file: ./classification/extracted/mochi.json
Extracted input_shape: [1, 3, 768, 576]
Height: 768, Width: 576
DataInputConfig: {"input_shape": [1, 3, 768, 576]}
Created tar.gz file ./classification/classification.tar.gz with ./classification/extracted/mochi.pt.
Uploaded ./classification/classification.tar.gz to bucket sagemaker-us-east-1-164152369890 with key LFV-public-test/output/LFV-classification2025-09-30-02-14-49/output/classification.tar.gz.
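
The `DataInputConfig` string is derived from the first stage's `input_shape` in mochi.json, which the output above shows in batch, channels, height, width order. A sketch of that derivation as a reusable function follows; `data_input_config_from_mochi` is a hypothetical name, and unlike the cell above it takes the channel count from the file instead of hard-coding 3.

```python
import json


def data_input_config_from_mochi(mochi_data):
    """Build the DataInputConfig JSON string from parsed mochi.json contents."""
    shape = mochi_data["stages"][0]["input_shape"]
    if len(shape) != 4:
        raise ValueError(f"Expected a 4-element input_shape, got {shape}")
    _, channels, height, width = shape
    # Batch size is fixed to 1, as in the notebook cell.
    return json.dumps({"input_shape": [1, channels, height, width]})


example_mochi = {"stages": [{"input_shape": [1, 3, 768, 576]}]}
print(data_input_config_from_mochi(example_mochi))
# → {"input_shape": [1, 3, 768, 576]}
```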


### Target Device: Jetson Xavier (JetPack 4)

In [48]:
compilation_job_name = "classification-xavier-gpu-"+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

In [49]:
compressed_model_path = f"s3://{output_bucket}/{target_key}"
print(f"Compressed model path {compressed_model_path}")

Compressed model path s3://sagemaker-us-east-1-164152369890/LFV-public-test/output/LFV-classification2025-09-30-02-14-49/output/classification.tar.gz


In [50]:
create_response = sagemaker.create_compilation_job(
    CompilationJobName=compilation_job_name,
    RoleArn=sm_role_arn,
    InputConfig={
        'S3Uri': compressed_model_path,
        'DataInputConfig': data_input_config,
        'Framework': 'PYTORCH',
        'FrameworkVersion': '1.8'
    },
    OutputConfig={
        'S3OutputLocation': compilation_output_path,
        'TargetPlatform': {
            'Os': 'LINUX',
            'Arch': 'ARM64',
            'Accelerator': 'NVIDIA'
        },
        'CompilerOptions': '{"cuda-ver": "10.2", "gpu-code": "sm_72", "trt-ver": "8.2.1", "max-workspace-size": "2147483648", "precision-mode": "fp16", "jetson-platform": "xavier"}'
    },
    StoppingCondition={'MaxRuntimeInSeconds': 3600},
    Tags=[
        {'Key': 'Platform', 'Value': 'jetson-xavier'},
        {'Key': 'Architecture', 'Value': 'ARM64-CUDA'},
        {'Key': 'TensorRT', 'Value': '8.2.1'}
    ]
)


In [51]:
while True:
    compile_response = sagemaker.describe_compilation_job(
        CompilationJobName=compilation_job_name
    )
    if compile_response['CompilationJobStatus'] == 'INPROGRESS':
        print(".", end='')
    elif compile_response['CompilationJobStatus'] == 'STARTING':
        print("*", end='')
    elif compile_response['CompilationJobStatus'] == 'COMPLETED':
        print("Completed")
        break
    elif compile_response['CompilationJobStatus'] == 'FAILED':
        print("Failed")
        print(compile_response['FailureReason'])
        break
    else:
        print("?", end='')
    time.sleep(60)

***.Completed
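
The `CompilerOptions` value for the Jetson target is a JSON string, and writing it by hand makes quoting mistakes easy. A sketch that builds the same string with `json.dumps` is shown below; `jetson_compiler_options` is a hypothetical helper, and its defaults simply mirror the values used in the cell above.

```python
import json


def jetson_compiler_options(cuda_ver="10.2", gpu_code="sm_72", trt_ver="8.2.1",
                            precision_mode="fp16",
                            max_workspace_size=2 * 1024 ** 3,
                            jetson_platform="xavier"):
    """Return the CompilerOptions JSON string for a Jetson compilation job."""
    return json.dumps({
        "cuda-ver": cuda_ver,
        "gpu-code": gpu_code,
        "trt-ver": trt_ver,
        # The notebook passes the workspace size (2 GiB in bytes) as a string.
        "max-workspace-size": str(max_workspace_size),
        "precision-mode": precision_mode,
        "jetson-platform": jetson_platform,
    })


print(jetson_compiler_options())
```

The result can be passed directly as `'CompilerOptions': jetson_compiler_options()` in `OutputConfig`.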


### Target Device: x86 CPU

In [52]:
compilation_job_name = "classification-x86-cpu-"+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

In [53]:
create_response = sagemaker.create_compilation_job(
    CompilationJobName=compilation_job_name,
    RoleArn=sm_role_arn,
    InputConfig={
        'S3Uri': compressed_model_path,
        'DataInputConfig': data_input_config,
        'Framework': 'PYTORCH',
        'FrameworkVersion': '1.8'
    },
    OutputConfig={
        'S3OutputLocation': compilation_output_path,
        'TargetPlatform': {
            'Os': 'LINUX',
            'Arch': 'X86_64'
        }
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 3600
    },
)


In [54]:
import time
while True:
    compile_response = sagemaker.describe_compilation_job(
        CompilationJobName=compilation_job_name
    )
    if compile_response['CompilationJobStatus'] == 'INPROGRESS':
        print(".", end='')
    elif compile_response['CompilationJobStatus'] == 'STARTING':
        print("*", end='')
    elif compile_response['CompilationJobStatus'] == 'COMPLETED':
        print("Completed")
        break
    elif compile_response['CompilationJobStatus'] == 'FAILED':
        print("Failed")
        print(compile_response['FailureReason'])
        break
    else:
        print("?", end='')
    time.sleep(60)

**..Completed


### Target Device: ARM CPU

In [55]:
compilation_arm_cpu = "classification-arm-cpu-"+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

In [56]:
create_arm_response = sagemaker.create_compilation_job(
    CompilationJobName=compilation_arm_cpu,
    RoleArn=sm_role_arn,
    InputConfig={
        'S3Uri': compressed_model_path,
        'DataInputConfig': data_input_config,
        'Framework': 'PYTORCH',
        'FrameworkVersion': '1.8'
    },
    OutputConfig={
        'S3OutputLocation': compilation_output_path,
        'TargetPlatform': {
            'Os': 'LINUX',
            'Arch': 'ARM64'
        }
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 3600
    },
)


In [57]:
import time
while True:
    create_arm_response = sagemaker.describe_compilation_job(
        CompilationJobName=compilation_arm_cpu
    )
    if create_arm_response['CompilationJobStatus'] == 'INPROGRESS':
        print(".", end='')
    elif create_arm_response['CompilationJobStatus'] == 'STARTING':
        print("*", end='')
    elif create_arm_response['CompilationJobStatus'] == 'COMPLETED':
        print("Completed")
        break
    elif create_arm_response['CompilationJobStatus'] == 'FAILED':
        print("Failed")
        print(create_arm_response['FailureReason'])
        break
    else:
        print("?", end='')
    time.sleep(60)

**..Completed
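
The three compilation jobs above differ only in the job name and the `OutputConfig`. As a sketch, the target platforms can be kept in one table and the per-target arguments generated from it. `TARGETS`, `compilation_job_name`, and `output_config` are hypothetical names; the platform values are copied from the cells above, and the bucket in the example is a made-up placeholder.

```python
import datetime

# TargetPlatform settings used for the three targets in this notebook.
TARGETS = {
    "xavier-gpu": {"Os": "LINUX", "Arch": "ARM64", "Accelerator": "NVIDIA"},
    "x86-cpu": {"Os": "LINUX", "Arch": "X86_64"},
    "arm-cpu": {"Os": "LINUX", "Arch": "ARM64"},
}


def compilation_job_name(model, target, now=None):
    """Build a timestamped job name such as classification-x86-cpu-2025-09-30-02-14-49."""
    now = now or datetime.datetime.now()
    return f"{model}-{target}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"


def output_config(target, s3_output_location, compiler_options=None):
    """Build the OutputConfig argument for create_compilation_job."""
    config = {
        "S3OutputLocation": s3_output_location,
        "TargetPlatform": dict(TARGETS[target]),
    }
    if compiler_options is not None:
        # Only the GPU target in this notebook passes CompilerOptions.
        config["CompilerOptions"] = compiler_options
    return config


print(compilation_job_name("classification", "x86-cpu",
                           datetime.datetime(2025, 9, 30, 2, 14, 49)))
print(output_config("arm-cpu", "s3://example-bucket/compilation_output"))
```

Looping over `TARGETS` and calling `create_compilation_job` once per entry would replace the repeated cells, at the cost of making each target's output harder to inspect separately.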


## Compilation job - Segmentation

In [58]:
seg_training = segmentation_training_job_name

In [59]:
res_seg = sagemaker.describe_training_job(TrainingJobName=seg_training)
seg_output_model_path = res_seg['ModelArtifacts']['S3ModelArtifacts']
print(seg_output_model_path)

s3://sagemaker-us-east-1-164152369890/LFV-public-test/output/LFV-segmentation-2025-09-30-02-27-58/output/model.tar.gz


Prepare the model for compilation:
1. Download the trained model.
2. Extract the archive and repackage the mochi.pt file on its own into segmentation.tar.gz.
3. Upload the new archive to S3.

In [60]:
from urllib.parse import urlparse

parsed_url = urlparse(seg_output_model_path)
output_bucket = parsed_url.netloc
output_key = parsed_url.path.lstrip('/')
print(output_bucket)
print(output_key)

sagemaker-us-east-1-164152369890
LFV-public-test/output/LFV-segmentation-2025-09-30-02-27-58/output/model.tar.gz


In [61]:
import tarfile
import os
import fnmatch
from pathlib import Path

s3_client = boto3.client('s3')
path = "./segmentation"
Path(path).mkdir(parents=True, exist_ok=True)

# Download the .tar.gz file from S3
input_tar_gz = os.path.join(path, 'model.tar.gz')
s3_client.download_file(output_bucket, output_key, input_tar_gz)

# Extract the contents of the .tar.gz file
extract_dir = os.path.join(path, 'extracted')
Path(extract_dir).mkdir(parents=True, exist_ok=True)
with tarfile.open(input_tar_gz, 'r:gz') as tar:
    tar.extractall(path=extract_dir)
print(f"Extracted {input_tar_gz} to {extract_dir}.")

# Locate mochi.pt in the extracted archive
model_file = os.path.join(extract_dir, 'mochi.pt')
if not os.path.isfile(model_file):
    raise FileNotFoundError("No mochi.pt file found.")

print(f"Found model file: {model_file}")

# Extract the input_shape from mochi.json
mochi_path = os.path.join(extract_dir, 'mochi.json')
if not os.path.isfile(mochi_path):
    raise FileNotFoundError("No mochi.json file found.")

print(f"Found Mochi file: {mochi_path}")
with open(mochi_path, 'r') as f:
    mochi_data = json.load(f)
    input_shape = mochi_data['stages'][0]['input_shape']
    print(f"Extracted input_shape: {input_shape}")
    
    # Extract height and width
    height = input_shape[2]  # 768
    width = input_shape[3]   # 576
    
    # Build tensor shape for DataInputConfig
    tensor_shape = [1, 3, height, width]
    data_input_config = json.dumps({"input_shape": tensor_shape})
    
    print(f"Height: {height}, Width: {width}")
    print(f"DataInputConfig: {data_input_config}")

# Create a new .tar.gz file containing only mochi.pt
output_tar_gz = os.path.join(path, 'segmentation.tar.gz')
with tarfile.open(output_tar_gz, "w:gz") as tar:
    tar.add(model_file, arcname=os.path.basename(model_file))
print(f"Created tar.gz file {output_tar_gz} with {model_file}.")

# Upload the new .tar.gz file to S3
target_key = output_key.rsplit('/', 1)[0] + '/segmentation.tar.gz'
s3_client.upload_file(output_tar_gz, output_bucket, target_key)
print(f"Uploaded {output_tar_gz} to bucket {output_bucket} with key {target_key}.")

Extracted ./segmentation/model.tar.gz to ./segmentation/extracted.
Found model file: ./segmentation/extracted/mochi.pt
Found Mochi file: ./segmentation/extracted/mochi.json
Extracted input_shape: [1, 3, 768, 576]
Height: 768, Width: 576
DataInputConfig: {"input_shape": [1, 3, 768, 576]}
Created tar.gz file ./segmentation/segmentation.tar.gz with ./segmentation/extracted/mochi.pt.
Uploaded ./segmentation/segmentation.tar.gz to bucket sagemaker-us-east-1-164152369890 with key LFV-public-test/output/LFV-segmentation-2025-09-30-02-27-58/output/segmentation.tar.gz.


## Segmentation x86 CPU

In [62]:
compilation_job = "segmentation-x86-cpu-"+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

In [63]:
model_path = f"s3://{output_bucket}/{target_key}"
print(f"Compressed model path {model_path}")

Compressed model path s3://sagemaker-us-east-1-164152369890/LFV-public-test/output/LFV-segmentation-2025-09-30-02-27-58/output/segmentation.tar.gz


In [64]:
seg_x86_response = sagemaker.create_compilation_job(
    CompilationJobName=compilation_job,
    RoleArn=sm_role_arn,
    InputConfig={
        'S3Uri': model_path,
        'DataInputConfig': data_input_config,
        'Framework': 'PYTORCH',
        'FrameworkVersion': '1.8'
    },
    OutputConfig={
        'S3OutputLocation': compilation_output_path,
        'TargetPlatform': {
            'Os': 'LINUX',
            'Arch': 'X86_64'
        }
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 3600
    },
)


In [65]:
while True:
    create_response = sagemaker.describe_compilation_job(
        CompilationJobName=compilation_job
    )
    if create_response['CompilationJobStatus'] == 'INPROGRESS':
        print(".", end='')
    elif create_response['CompilationJobStatus'] == 'STARTING':
        print("*", end='')
    elif create_response['CompilationJobStatus'] == 'COMPLETED':
        print("Completed")
        break
    elif create_response['CompilationJobStatus'] == 'FAILED':
        print("Failed")
        print(create_response['FailureReason'])
        break
    else:
        print("?", end='')
    time.sleep(60)

**..Completed


## Segmentation Jetson Xavier

In [66]:
compilation_job = "segmentation-Jetson-xavier-"+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

In [67]:
model_path = f"s3://{output_bucket}/{target_key}"
print(f"Compressed model path {model_path}")

Compressed model path s3://sagemaker-us-east-1-164152369890/LFV-public-test/output/LFV-segmentation-2025-09-30-02-27-58/output/segmentation.tar.gz


In [68]:
create_response = sagemaker.create_compilation_job(
    CompilationJobName=compilation_job,
    RoleArn=sm_role_arn,
    InputConfig={
        'S3Uri': model_path,
        'DataInputConfig': data_input_config,
        'Framework': 'PYTORCH',
        'FrameworkVersion': '1.8'
    },
    OutputConfig={
        'S3OutputLocation': compilation_output_path,
        'TargetPlatform': {
            'Os': 'LINUX',
            'Arch': 'ARM64',
            'Accelerator': 'NVIDIA'
        },
        'CompilerOptions': '{"cuda-ver": "10.2", "gpu-code": "sm_72", "trt-ver": "8.2.1", "max-workspace-size": "2147483648", "precision-mode": "fp16", "jetson-platform": "xavier", "aux-inputs": "{\\"batch_size\\": [1, 4, 8], \\"sequence_length\\": [128, 256, 512]}"}'
    },
    StoppingCondition={'MaxRuntimeInSeconds': 3600},
    Tags=[
        {'Key': 'Platform', 'Value': 'jetson-xavier'},
        {'Key': 'Architecture', 'Value': 'ARM64-CUDA'},
        {'Key': 'CudaVersion', 'Value': '10.2'}
    ]
)

In [69]:
while True:
    create_response = sagemaker.describe_compilation_job(
        CompilationJobName=compilation_job
    )
    if create_response['CompilationJobStatus'] == 'INPROGRESS':
        print(".", end='')
    elif create_response['CompilationJobStatus'] == 'STARTING':
        print("*", end='')
    elif create_response['CompilationJobStatus'] == 'COMPLETED':
        print("Completed")
        break
    elif create_response['CompilationJobStatus'] == 'FAILED':
        print("Failed")
        print(create_response['FailureReason'])
        break
    else:
        print("?", end='')
    time.sleep(60)

**..Completed


## Segmentation ARM CPU

In [70]:
compilation_arm_cpu = "segmentation-arm-cpu-"+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

In [71]:
model_path = f"s3://{output_bucket}/{target_key}"
print(f"Compressed model path {model_path}")

Compressed model path s3://sagemaker-us-east-1-164152369890/LFV-public-test/output/LFV-segmentation-2025-09-30-02-27-58/output/segmentation.tar.gz


In [72]:
create_arm_response = sagemaker.create_compilation_job(
    CompilationJobName=compilation_arm_cpu,
    RoleArn=sm_role_arn,
    InputConfig={
        'S3Uri': model_path,
        'DataInputConfig': data_input_config,
        'Framework': 'PYTORCH',
        'FrameworkVersion': '1.8'
    },
    OutputConfig={
        'S3OutputLocation': compilation_output_path,
        'TargetPlatform': {
            'Os': 'LINUX',
            'Arch': 'ARM64'
        }
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 3600
    },
)

In [74]:
while True:
    create_response = sagemaker.describe_compilation_job(
        CompilationJobName=compilation_arm_cpu
    )
    if create_response['CompilationJobStatus'] == 'INPROGRESS':
        print(".", end='')
    elif create_response['CompilationJobStatus'] == 'STARTING':
        print("*", end='')
    elif create_response['CompilationJobStatus'] == 'COMPLETED':
        print("Completed")
        break
    elif create_response['CompilationJobStatus'] == 'FAILED':
        print("Failed")
        print(create_response['FailureReason'])
        break
    else:
        print("?", end='')
    time.sleep(60)

*..Completed
