This notebook screens whether the current environment can submit a Python script as a *training* job and as a *processing* job, each reading its data through an **input manifest file**. It is designed to run in one go without a kernel restart, so it submits only short training and processing jobs, each of which runs for 3+ minutes.
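
As a rough illustration of the manifest format used here, the sketch below builds the same kind of JSON array that the Setup cell uploads (a `prefix` object followed by relative keys) and expands it back into full S3 URIs. The helper names `build_manifest` and `resolve_manifest` and the bucket `s3://my-bucket` are hypothetical and not part of this notebook or the SageMaker SDK.

```python
import json
from typing import List


def build_manifest(prefix: str, keys: List[str]) -> str:
    """Serialize a manifest: one {"prefix": ...} object, then the relative keys."""
    return json.dumps([{"prefix": prefix}, *keys])


def resolve_manifest(manifest_text: str) -> List[str]:
    """Expand a manifest into the full S3 URIs it refers to."""
    entries = json.loads(manifest_text)
    prefix = entries[0]["prefix"]
    return [prefix + key for key in entries[1:]]


# Hypothetical bucket; the notebook derives its prefix from smconfig.s3_bucket.
manifest = build_manifest(
    "s3://my-bucket/screening/entrypoint-input/",
    ["input-01.txt", "input-03.txt"],
)
print(manifest)
print(resolve_manifest(manifest))
```

Because the manifest lists only `input-01.txt` and `input-03.txt`, a job that honors it should presumably not see `input-02.txt`, even though that file sits under the same prefix.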

Steps:
- **Action**: click *Kernel* -> *Restart Kernel and Run All Cells...* 
- **Expected outcome**: no exception seen.

# Setup

Before you run the next cell, open `smconfig.py`, review the mandatory SageMaker `kwargs`, and then disable the `NotImplementedException` on the last line.
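
The real `smconfig.py` is not shown in this notebook, so the following is only a hypothetical stand-in inferred from how the cells below use it: a module-level `s3_bucket` and an `SmKwargs` object that exposes `train` and `processing` dictionaries. The attribute contents and the placeholder bucket are assumptions; your actual file may carry additional settings.

```python
# Hypothetical stand-in for smconfig.py; the real module may differ.
from typing import Any, Dict

# Assumed to include the s3:// scheme, because the notebook passes values
# derived from it straight to `aws s3 cp`. Placeholder value.
s3_bucket = "s3://my-screening-bucket"


class SmKwargs:
    """Group the keyword arguments that each SageMaker job must receive."""

    def __init__(self, role: str) -> None:
        # `role` is a parameter accepted by both PyTorch(...) and
        # SKLearnProcessor(...); anything else here would be site-specific.
        self.train: Dict[str, Any] = {"role": role}
        self.processing: Dict[str, Any] = {"role": role}


# Placeholder ARN, for illustration only.
example = SmKwargs("arn:aws:iam::123456789012:role/ExampleRole")
print(example.train)
```

Keeping the two dictionaries separate lets the notebook splat them with `**sm_kwargs.train` and `**sm_kwargs.processing` even when the two job types need different arguments.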

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import sagemaker as sm
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.pytorch.estimator import PyTorch
from sagemaker.sklearn.processing import SKLearnProcessor

import smconfig

# Configuration of this screening test.
sess = sm.Session()
# sm_kwargs = smconfig.SmKwargs(sm.get_execution_role())
sm_kwargs = smconfig.SmKwargs(
    "arn:aws:iam::484597657167:role/service-role/AmazonSageMaker-ExecutionRole-20180516T132752"
)
s3_input_path = f"{smconfig.s3_bucket}/screening/entrypoint-input"
s3_input_manifest = f"{s3_input_path}/input-manifest.txt"
s3_sagemaker_path = f"{smconfig.s3_bucket}/screening/sagemaker"

# Enforce blocking API to validate permissions to Describe{Training,Processing}Job.
block_notebook_while_training = True

# Propagate to env vars of the whole notebook, for usage by ! or %%.
%set_env BUCKET=$smconfig.s3_bucket
%set_env S3_INPUT_PATH=$s3_input_path
%set_env S3_INPUT_MANIFEST=$s3_input_manifest
%set_env S3_SAGEMAKER_PATH=$s3_sagemaker_path

# Create dummy input files
!echo "Dummy input file 01" | aws s3 cp - $S3_INPUT_PATH/input-01.txt
!echo "Dummy input file 02" | aws s3 cp - $S3_INPUT_PATH/input-02.txt
!echo "Dummy input file 03" | aws s3 cp - $S3_INPUT_PATH/input-03.txt

# Create manifest file
!echo "[{\"prefix\": \"$S3_INPUT_PATH/\"}, \"input-01.txt\", \"input-03.txt\"]" | aws s3 cp - $S3_INPUT_MANIFEST
!aws s3 cp $S3_INPUT_MANIFEST -

# Training job

In [None]:
estimator = PyTorch(
    entry_point="screen.py",
    source_dir="./sourcedir_screen",
    framework_version="1.8.0",
    py_version="py3",
    hyperparameters={"module": "torch"},
    # sourcedir.tar.gz and output use pre-defined bucket.
    code_location=s3_sagemaker_path,
    output_path=s3_sagemaker_path,
    instance_count=1,
    instance_type="ml.m5.large",
    sagemaker_session=sess,
    **sm_kwargs.train,
)

# Submit a training job.
estimator.fit(
    {"train": TrainingInput(s3_input_manifest, s3_data_type="ManifestFile")},
    wait=block_notebook_while_training,
)

# Track the job name for subsequent CloudWatch CLI operations.
train_job_name = estimator.latest_training_job.name
%set_env TRAIN_JOB_NAME=$train_job_name

# Probe output
!aws s3 cp $S3_SAGEMAKER_PATH/$TRAIN_JOB_NAME/output/output.tar.gz - | tar --to-stdout -xzf - screenings.jsonl

# Processing job

In [None]:
processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_count=1,
    instance_type="ml.m5.large",
    sagemaker_session=sess,
    **sm_kwargs.processing,
)

# Manually upload the code to a specific S3 bucket, otherwise SageMaker SDK
# always uploads to default_bucket() `s3://sagemaker-{}-{}/`.
!aws s3 cp sourcedir_screen/screen.py $S3_SAGEMAKER_PATH/processing-code/screen.py

# Generate job name and track it. We need to do this to set the S3 output path
# to s3://mybucket/...../jobname/output/....
#
# See: https://github.com/aws/sagemaker-python-sdk/blob/570c67806f4f85f954d836d01c6bb06a24b939ee/src/sagemaker/processing.py#L315
processing_job_name = processor._generate_current_job_name()
%set_env PROCESSING_JOB_NAME=$processing_job_name

# Submit a processing job.
processor.run(
    job_name=processing_job_name,
    code=f"{s3_sagemaker_path}/processing-code/screen.py",
    inputs=[
        ProcessingInput(
            source=s3_input_manifest,
            s3_data_type="ManifestFile",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output",
            destination=f"{s3_sagemaker_path}/{processing_job_name}/output",
        )
    ],
    arguments=["--module", "sklearn"],
    wait=block_notebook_while_training,
)

# Probe output
!aws s3 cp $S3_SAGEMAKER_PATH/$PROCESSING_JOB_NAME/output/screenings.jsonl -

# Appendix: CloudWatch log events

You can retrieve the training logs using `awscli` after this notebook is unblocked. This will be a good test to verify that this notebook's role has sufficient permissions to read CloudWatch logs.

Assuming the job name is stored in an environment variable `TRAIN_JOB_NAME`, run these CLI commands:

```bash
# Find out the log-stream name; should look like TRAIN_JOB_NAME/xxx.
aws logs describe-log-streams \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name-prefix $TRAIN_JOB_NAME \
    | jq -r '.logStreams[].logStreamName'


# Get the log events
aws logs get-log-events \
    --log-group-name /aws/sagemaker/TrainingJobs \
    --log-stream-name <LOG_STREAM_NAME>
```

For a processing job, the log group name is `/aws/sagemaker/ProcessingJobs`, and the log-stream prefix is the processing job name (stored in `PROCESSING_JOB_NAME`).
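
The same two CLI steps can also be sketched in Python. The helpers below are hypothetical (`list_log_streams` and `iter_log_messages` are not part of any library) and take a CloudWatch Logs client as an argument; with `boto3` that would be `boto3.client("logs")`, whose `describe_log_streams` and `get_log_events` operations mirror the CLI commands above. The pagination loop assumes the documented behavior that `get_log_events` returns the same forward token once the end of the stream is reached.

```python
from typing import Any, Dict, Iterator, List


def list_log_streams(logs_client: Any, log_group: str, job_name: str) -> List[str]:
    """Return the log-stream names whose prefix is the given job name."""
    resp = logs_client.describe_log_streams(
        logGroupName=log_group,
        logStreamNamePrefix=job_name,
    )
    return [stream["logStreamName"] for stream in resp["logStreams"]]


def iter_log_messages(logs_client: Any, log_group: str, stream_name: str) -> Iterator[str]:
    """Yield log messages from oldest to newest, following forward tokens."""
    kwargs: Dict[str, Any] = {
        "logGroupName": log_group,
        "logStreamName": stream_name,
        "startFromHead": True,
    }
    while True:
        resp = logs_client.get_log_events(**kwargs)
        for event in resp["events"]:
            yield event["message"]
        token = resp.get("nextForwardToken")
        # Stop once the service hands back the token we just sent.
        if token == kwargs.get("nextToken"):
            break
        kwargs["nextToken"] = token


# Usage against AWS (requires boto3 and valid credentials):
#   import os
#   import boto3
#   logs = boto3.client("logs")
#   group = "/aws/sagemaker/TrainingJobs"
#   for name in list_log_streams(logs, group, os.environ["TRAIN_JOB_NAME"]):
#       for message in iter_log_messages(logs, group, name):
#           print(message)
```

Passing the client in, rather than creating it inside the helpers, keeps the sketch usable with a stub client when AWS access is not available.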