Skip to content

How to fix error Exception while invoking preprocessor script. during scheduled model-quality-monitor job. #2665

@naveenmarthala

Description

@naveenmarthala

Describe the bug
I have scheduled model-quality-monitoring jobs with hourly frequency. As part of which, the ground-truth-merge job completes without issues and model-quality-monitoring job exits out with code 1 with this error:

Exception in thread "main" com.amazonaws.sagemaker.dataanalyzer.exception.Error: Exception while invoking preprocessor script.

I have included print statements in my preprocessor.py script and I presume the script was never run, because nothing was printed to my cloudwatch logs. So, I think container (or processing job) failed even before the script was run. So, what's stopping the container from using the script and how do I fix this error.

To reproduce
I have created an model-quality-monitoring schedule as mentioned in this notebook from sagemaker-examples-repo, with few changes to suit my use case. I have a regression problem(or target variable is continuous) and here is all the code, relevant files and screenshots.

Here's the code I am using to schedule the monitoring job and also create alarm in CloudWatch:
## code is almost entirely from the example notebook from sagemaker-examples-repository linked above.
import boto3
from sagemaker import get_execution_role, Session
from sagemaker.model_monitor import ModelQualityMonitor
from sagemaker.model_monitor import EndpointInput
# create clients
sm_boto_client = boto3.client('sagemaker')
sm_session = Session()
cw_boto_client = boto3.Session().client("cloudwatch")

# initiating relevant variables
endpoint_name = 'ch-mqm-ep-try31'
try_num = 34

# make a Monitoring schedule name
ch_monitor_schedule_name = f"ch-mqm-sched-try{try_num}-sor"
# Create an enpointInput instance
endpointInput = EndpointInput(endpoint_name=endpoint_name,
                              destination="/opt/ml/processing/input_data",
                              inference_attribute='data')

# create ModelQualityMonitor instance
ch_mqm_try34_sor = ModelQualityMonitor(
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=30,
    max_runtime_in_seconds=1800,
    sagemaker_session=sm_session)

# create the schedule now
mon_sched_suffix = f"ch-mqm-ep-try{try_num}-sor"
ch_mqm_try34_sor.create_monitoring_schedule(
    monitor_schedule_name=ch_monitor_schedule_name,
    endpoint_input=endpointInput,
    record_preprocessor_script=preprocessor_s3_uri,
    output_s3_uri=f's3://api-trial/{mon_sched_suffix}_monitoring_schedule_outputs',
    problem_type="Regression",
    ground_truth_input=f's3://api-trial/{endpoint_name}_groundtruths',
    constraints='s3://api-trial/ch-mqm-ep-25-5_mqm_artifacts/constraints.json',
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True)

# creating cloudwatch alarm now
alarm_name = f"MODEL_QUALITY_R2_try{try_num}_sor"

alarm_desc = (
    "Trigger an CloudWatch alarm when the rmse score drifts away from the baseline constraints"
)

mdoel_quality_r2_drift_threshold = (
    0.7238122  ##Setting this threshold purposefully low to see the alarm quickly.
)

metric_name = "r2"
namespace = f"aws/sagemaker/Endpoints/model-metrics-try{try_num}-sor"

# put the alarm in cloudwatch now
cw_boto_client.put_metric_alarm(
    AlarmName=alarm_name,
    AlarmDescription=alarm_desc,
    ActionsEnabled=True,
    MetricName=metric_name,
    Namespace=namespace,
    Statistic="Average",
    Dimensions=[
        {"Name": "Endpoint", "Value": endpoint_name},
        {"Name": "MonitoringSchedule",
         "Value": ch_monitor_schedule_name}
    ],
    Period=600,
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=mdoel_quality_r2_drift_threshold,
    ComparisonOperator="LessThanOrEqualToThreshold",
    TreatMissingData="breaching",
)
Here's my "preprocessor.py" file:
from json import loads as json_loads
from base64 import b64decode as base64_b64decode

def preprocess_handler(inference_record):
    # input_data = json_loads(inference_record.endpoint_input.data)
    # input_data = {f"feature{str(i).zfill(2)}": val for i, val in enumerate(input_data)}
    print(inference_record)
    output_data = json_loads(inference_record.endpoint_output.data)
    output_data = base64_b64decode(output_data.encode('utf-8')).decode('utf-8')
    output_data = {"prediction0": output_data}
    print(output_data, '\n')
    return {**output_data}
Here's a couple of lines form the merged ground truth file:
{"eventVersion":"0","groundTruthData":{"data":170300.0,"encoding":"CSV"},"captureData":{"endpointOutput":{"data":"MjAyNTkyLjA=","encoding":"BASE64","mode":"OUTPUT","observedContentType":"text/html; charset=utf-8"}},"eventMetadata":{"eventId":"0faaafcf-5b6c-4dd9-8831-f734eed9aaa0","inferenceId":"60b48ef7-4609-4322-ac5d-e72ccd5d9492","inferenceTime":"2021-09-26T06:40:40Z"}}
{"eventVersion":"0","groundTruthData":{"data":152600.0,"encoding":"CSV"},"captureData":{"endpointOutput":{"data":"MTQ5NzU2LjA=","encoding":"BASE64","mode":"OUTPUT","observedContentType":"text/html; charset=utf-8"}},"eventMetadata":{"eventId":"7400f3e7-0b00-47c2-b39e-efd9cc155a45","inferenceId":"907f0ada-a61b-476e-8a6c-a6d284a249da","inferenceTime":"2021-09-26T06:40:42Z"}}

Expected behavior
Expected behaviour from that the job is that it use the preprocessing.py script I provided while creating the schedule, calculates R2(or R^2) metric and accordingly update the alarm in CloudWatch.

Screenshots or logs
Here's the screenshot of both the jobs:
image

Here's the screenshot of log of the failed model-quality-monitoring job:
image
Please tell me if any other logs are needed.

System information
A description of your system. Please provide:
(I am doing all this from SageMaker Studio, scheduling the jobs and running all the scripts. And I am also not using any custom docker images.)

  • SageMaker Python SDK version: 2.59.3 (on sagemaker studio)
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): sagemaker-model-monitor-analyzer (for a scheduled model-quality-monitoring job)
  • Framework version: (I don't know the version of the container, it had no docker tag)
  • Python version: 3.7.10 (default, Jun 4 2021, 14:48:32) (sys.version printed this on studio)
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context

  1. Since the error is about preprocessing script note being run, here it is in S3 with the code I have shown above:
    image
  2. I have also tried another scheduling job with naming my script preprocessing.py, and that generated same error too.
  3. in the details of job, on sagemaker console, you can see the script being fetched for scheduled-monitoring job here:
    image

How do I fix this error?

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions