### AWS Account Info

In [None]:
import sagemaker
import boto3

# User input
role_name = "Developer" 

# Sagemaker session
sagemaker_session = sagemaker.Session()
try:
    # Run notebook from Sagemaker studio
    role = sagemaker.get_execution_role()
except ValueError:
    # Run notebook from local machine
    iam = boto3.client('iam')
    role = iam.get_role(RoleName=role_name)['Role']['Arn']
    print("Role retrieved successfully")
account = sagemaker_session.boto_session.client('sts').get_caller_identity()['Account']
region = sagemaker_session.boto_session.region_name

### Build & Push Docker Image

**Description**
- This section defines the variables for the Docker image, which is pushed to the Elastic Container Registry (ECR) after it is built.
- Usually there is no need to build the Docker image more than once, because the source code is packed and sent to S3 separately. Therefore, keep `is_build` set to `"false"` unless the Docker image is not yet available on ECR, or `sagemaker_main.py` or `sagemaker_utils.py` has been modified.
- Changes to the rest of the source code do not affect the Docker image.
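
One way to decide the value of `is_build` is to ask ECR whether the `latest` tag already exists. The sketch below is an assumption and not part of `sagemaker_utils`: `image_exists_on_ecr` is a hypothetical helper built on boto3's `describe_images` call, and it assumes the ECR repository carries the same name as `image`.

```python
def image_exists_on_ecr(ecr_client, repository, tag="latest"):
    """Return True if `repository:tag` exists on ECR (hypothetical helper)."""
    try:
        ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
        return True
    except Exception as err:
        # botocore's ClientError exposes the service error code in err.response
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code in ("ImageNotFoundException", "RepositoryNotFoundException"):
            return False
        raise  # anything else (permissions, network) should surface to the user


# Possible usage in the notebook (requires AWS credentials):
#   ecr = boto3.client("ecr", region_name=region)
#   is_build = "false" if image_exists_on_ecr(ecr, image) else "true"
```

This only covers the "image is missing" case; a rebuild is still needed by hand after editing `sagemaker_main.py` or `sagemaker_utils.py`.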

**Before Running**
- Please change `bucket_name`, and make sure Docker Desktop is running.

#### Variables for Docker Image

In [None]:
from sagemaker_utils import create_bucket_if_not_exists
image = 'cog_verse'
base_job_name = 'cog-verse-training'
bucket_name = "cog-verse"
is_build = "false"  # set to "true" to build the Docker image
instance_type = 'ml.m5.4xlarge'

# Create the S3 bucket if it does not exist yet
create_bucket_if_not_exists(bucket_name=bucket_name, region=region)

In [None]:
import os
os.environ["image"] = str(image)
os.environ["account"] = str(account)
os.environ["region"] = str(region)
os.environ["bucket_name"] = str(bucket_name)
os.environ["base_job_name"] = str(base_job_name)

#### Build Image

In [None]:
import os
os.chdir("..")
!bash ./cloud/build_and_push.sh {image} {is_build}
os.chdir("cloud")

#### Push Image to ECR

In [None]:
if is_build == "true":
    # Use IPython's {var} expansion consistently, as in the build step above
    !docker push {account}.dkr.ecr.{region}.amazonaws.com/{image}:latest

### Pack Source Code

- This section packs only the code needed to run on SageMaker.
- The archive is uploaded to a predetermined location on S3.
- When the run starts, SageMaker downloads the source code and saves it to the main directory.
- Cache and hidden files are ignored during packing.
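
The packing rules above can be sketched with the standard `tarfile` module. This is an illustration only and not the actual `pack_archive` from `sagemaker_utils`; `pack_sources` and `_keep` are hypothetical names, and the exact ignore rules (hidden entries, `__pycache__`, `.pyc` files, listed folders) are assumptions.

```python
import os
import tarfile


def _keep(tarinfo, ignore_folders):
    """Return the entry to keep it, or None to drop it from the archive."""
    for part in tarinfo.name.split("/"):
        if part.startswith(".") or part == "__pycache__" or part in ignore_folders:
            return None
    if tarinfo.name.endswith(".pyc"):
        return None
    return tarinfo


def pack_sources(project_dir, source_names, archive_path, ignore_folders=()):
    """Pack the listed files and folders of `project_dir` into a .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in source_names:
            path = os.path.join(project_dir, name)
            if not os.path.exists(path):
                continue  # skip entries that are missing from this checkout
            # Returning None for a directory also skips everything inside it
            tar.add(path, arcname=name,
                    filter=lambda ti: _keep(ti, ignore_folders))
    return archive_path
```

Storing each entry under its `arcname` keeps paths relative, so the archive unpacks into the same layout on the SageMaker side.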

In [None]:
import os
from sagemaker_utils import pack_archive, upload_to_s3, delete_archive
current_dir = os.getcwd()  # Get current directory
project_dir = os.path.dirname(current_dir)  # Get parent directory

source_dir_names = [
    "actors",
    "cogment_verse",
    "config",
    "environments",
    "runs",
    "tests",
    "main.py",
    "simple_mlflow.py",
]
ignore_folders = ["node_modules"]
archive_name = "source_code.tar.gz"

# Pack all source code to run cogment verse
pack_archive(project_dir=project_dir, 
             main_dir=current_dir, 
             output_path=project_dir, 
             source_dir_names=source_dir_names, 
             ignore_folders=ignore_folders, 
             archive_name=archive_name)

# Upload to S3
s3_key = f"{image}/input/data/{archive_name}"
print(project_dir)
upload_to_s3(local_path=f"{project_dir}/{archive_name}", bucket=bucket_name, s3_key=s3_key)

# Delete packed source code after uploading to S3
delete_archive(archive_path=f"{project_dir}/{archive_name}")


### Training

#### User Inputs
- `main_args` is the experiment to run, i.e. the arguments passed to `python -m main <main_args>` (see README.md)
- `s3_bucket` is the bucket where the SageMaker instance pushes and fetches all data relevant to the run
- `repo` is the top-level folder inside `s3_bucket` that stores the source code as well as the model parameters
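
For context, SageMaker hands the `hyperparameters` dictionary to a custom container as a JSON file of string values at `/opt/ml/input/config/hyperparameters.json`. The sketch below shows how an entry point might turn `main_args` into the `python -m main <main_args>` command. It is an assumption about how `sagemaker_main.py` could work, not a copy of it; `load_hyperparameters` and `build_main_command` are hypothetical names.

```python
import json
import shlex
import sys

# Location where SageMaker writes the estimator's hyperparameters in the container
HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read the hyperparameters passed to the training job."""
    with open(path) as f:
        return json.load(f)


def build_main_command(hyperparameters):
    """Build the `python -m main <main_args>` command as an argument list."""
    main_args = shlex.split(hyperparameters["main_args"])
    return [sys.executable, "-m", "main", *main_args]
```

With the values used in this notebook, `build_main_command` would yield the interpreter path followed by `-m main +experiment=ppo_atari_pz/pong_pz`.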

In [None]:
hyperparameters = {'main_args': "+experiment=ppo_atari_pz/pong_pz", 's3_bucket': bucket_name, "repo": image}
run_local_test = False

#### Local Test

This step is important to ensure that the Docker image has been built correctly and can run smoothly on your local machine before deploying it to an AWS instance.

In [None]:
if run_local_test:
    # Training setup
    output_path = f"s3://{bucket_name}/{image}/output"
    input_path = f"s3://{bucket_name}/{image}/input/data"
    image_name = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:latest"

    estimator = sagemaker.estimator.Estimator(image_uri=image_name,
                        base_job_name=base_job_name,
                        role=role, 
                        instance_count=1, 
                        output_path=output_path,
                        instance_type='local',
                        hyperparameters=hyperparameters)
    estimator.fit(inputs={"training": input_path})

    # Verification
    print(f"input_path: {input_path}")
    print(f"output_path: {output_path}")
    print(f"image_name: {image_name}")

#### AWS Run

**Feature**
- To monitor training progress with mlflow, run `python -m simple_mlflow` in the terminal as usual
- Before you finish, double-check that your SageMaker training job has ended to avoid additional charges, because cog-verse sometimes does not terminate properly
- The model registry folder is uploaded to S3
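
As a safeguard against jobs that keep running, the check can be scripted with the boto3 SageMaker client. The following is a minimal sketch under stated assumptions: `stop_running_jobs` is a hypothetical helper, it only looks at the first page of `list_training_jobs` results, and it stops every matching job, so use a name filter that is specific to your own runs.

```python
def stop_running_jobs(sm_client, name_contains):
    """Stop in-progress training jobs whose name contains `name_contains`."""
    response = sm_client.list_training_jobs(
        NameContains=name_contains,
        StatusEquals="InProgress",
    )
    stopped = []
    for summary in response["TrainingJobSummaries"]:
        job_name = summary["TrainingJobName"]
        sm_client.stop_training_job(TrainingJobName=job_name)
        stopped.append(job_name)
    return stopped


# Possible usage in the notebook (requires AWS credentials):
#   sm_client = sagemaker_session.boto_session.client("sagemaker")
#   print(stop_running_jobs(sm_client, base_job_name))
```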

**Limitation**
- We do not have the capability to store historical data for mlflow runs yet. This means that each new run will overwrite the previous run's data
- Current setup does not automatically synchronize the model registry from S3 to the local machine. However, users can set up this process according to their needs
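
For users who want to set up the model registry sync themselves, one possible approach is to download everything under an S3 prefix with boto3. This is a sketch, not an existing utility: `download_prefix` is a hypothetical helper, and the prefix holding the model registry is whatever `sagemaker_main.py` uploads to, which is not shown in this notebook.

```python
import os


def download_prefix(s3_client, bucket, prefix, local_dir):
    """Download every object under `prefix` into `local_dir`, keeping sub-paths."""
    prefix = prefix.rstrip("/") + "/"  # treat the prefix as a folder
    downloaded = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip "folder" placeholder objects
            relative = key[len(prefix):]
            target = os.path.join(local_dir, *relative.split("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3_client.download_file(bucket, key, target)
            downloaded.append(target)
    return downloaded


# Possible usage (the prefix below is a placeholder, not the real location):
#   s3 = boto3.client("s3")
#   download_prefix(s3, bucket_name, f"{image}/<model registry folder>", "./model_registry")
```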

**Note**
- The SageMaker instance keeps running even if you turn off your computer or close the notebook

In [None]:
import os
from sagemaker_utils import download_and_extract_data_from_s3
import time

# Training setup
output_path = f"s3://{bucket_name}/{image}/output"
input_path = f"s3://{bucket_name}/{image}/input/data"
image_name = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:latest"
tag_name = [{'Key': 'cog-verse', 'Value': 'cog-verse-training'}]
base_job_name = 'cog-verse-training'

# Run the sagemaker without waiting 
estimator = sagemaker.estimator.Estimator(image_uri=image_name,
                       base_job_name=base_job_name,
                       role=role, 
                       instance_count=1, 
                       instance_type=instance_type,
                       tags=tag_name,
                       output_path=output_path,
                       sagemaker_session=sagemaker_session,
                       hyperparameters=hyperparameters)
estimator.fit(inputs={"training": input_path}, wait=False)

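# Wait for the training job to start before syncing mlflow data
while True:
    training_job_info = estimator.latest_training_job.describe()
    status = training_job_info['TrainingJobStatus']
    if status == 'InProgress':
        time.sleep(60)  # give the job time to upload its first mlflow archive
        break
    if status in ['Completed', 'Failed', 'Stopped']:
        break  # job already ended; the sync loop below exits immediately
    time.sleep(10)  # any other status: poll again without hammering the API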

# Sync mlflow data from S3 to local machine
cwd_dir = os.getcwd()  # Get current directory
project_dir = os.path.dirname(cwd_dir)  # Get parent directory
mlflow_archive_name = "mlflow_db.tar.gz" # this name is set in sagemaker_main.py
mlflow_s3_folder = f"{image}/mlflow/{mlflow_archive_name}" # this name is set in sagemaker_main.py
unpack_path = f"{project_dir}/.cogment_verse"

print("Syncing mlflow data...")
while True:
    # Get training job info
    training_job_info = estimator.latest_training_job.describe()

    # Stop syncing process when the job is done running
    if training_job_info["TrainingJobStatus"] in ['Completed', 'Failed', 'Stopped']:
        break

    # Sync mlflow data from S3 to local machine
    download_and_extract_data_from_s3(bucket=bucket_name, 
                                    s3_key=mlflow_s3_folder, 
                                    download_path=cwd_dir, 
                                    archive_name=mlflow_archive_name, 
                                    unpack_path=unpack_path)
    time.sleep(60)  # pause between syncs instead of polling S3 in a tight loop
print("Done.")

If your computer disconnects or shuts down during a run, the sync loop above stops while the training job keeps running. To resume tracking MLflow metrics, set `only_mlflow_sync = True` and run the following cell. Note that this only works while the SageMaker training job is still running.

In [None]:
import os
import time
from sagemaker_utils import download_and_extract_data_from_s3
only_mlflow_sync = False

if only_mlflow_sync:
    cwd_dir = os.getcwd()  # Get current directory
    project_dir = os.path.dirname(cwd_dir)  # Get parent directory
    mlflow_archive_name = "mlflow_db.tar.gz" # this name is set in sagemaker_main.py
    mlflow_s3_folder = f"{image}/mlflow/{mlflow_archive_name}" # this name is set in sagemaker_main.py
    unpack_path = f"{project_dir}/.cogment_verse"  # same location as in the AWS run cell
    print("Syncing mlflow data...")
    try:
        while True:
            # Sync mlflow data from S3 to local machine
            download_and_extract_data_from_s3(bucket=bucket_name,
                                              s3_key=mlflow_s3_folder,
                                              download_path=cwd_dir,
                                              archive_name=mlflow_archive_name,
                                              unpack_path=unpack_path)
            time.sleep(60)  # pause between syncs
    except KeyboardInterrupt:
        # The loop has no exit condition; interrupt the kernel to stop syncing
        print("Done.")
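
The resume cell above has no information about the training job, so it cannot tell when the job ends. One way to let it stop by itself is to poll the job status by name. The sketch below is an assumption rather than part of the current setup: `sync_until_done` is a hypothetical helper, and the status and sync callables are injected so the loop can be tried without AWS access.

```python
import time

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def sync_until_done(get_status, sync_once, poll_seconds=60, sleep=time.sleep):
    """Call `sync_once` until `get_status()` reports a terminal state.

    Returns the number of syncs performed, including one final sync
    after the job has ended.
    """
    syncs = 0
    while get_status() not in TERMINAL_STATES:
        sync_once()
        syncs += 1
        sleep(poll_seconds)
    sync_once()  # final sync to pick up the last uploaded archive
    return syncs + 1


# Possible usage (job_name is a placeholder for your training job's name):
#   sm_client = boto3.client("sagemaker")
#   sync_until_done(
#       get_status=lambda: sm_client.describe_training_job(
#           TrainingJobName=job_name)["TrainingJobStatus"],
#       sync_once=lambda: download_and_extract_data_from_s3(
#           bucket=bucket_name, s3_key=mlflow_s3_folder, download_path=cwd_dir,
#           archive_name=mlflow_archive_name, unpack_path=unpack_path),
#   )
```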