Describe the bug
I am trying to run a Processing job and the job fails part way with this error message:
Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
When I check the job's logs in CloudWatch, the only entries are the print statements I added to help debug the issue. The last statement in the logs comes partway through a loop, and there are no other messages or errors. The only error I see is the InternalServerError reported when I tail the logs from where I launch the Processing job.
This has been happening over and over and I am not sure how to figure out what is wrong with the job.
To reproduce
Run processing job.
Expected behavior
The job completes without errors, or it shows where it failed with a helpful stack trace, instead of repeatedly reporting InternalServerError: We encountered an internal error. Please try again.
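To rule out an exception in my own code being lost, the entry point of the container script could be wrapped along these lines. This is a minimal sketch, not what /app/process.py currently does: `run_with_diagnostics` and `MESSAGE_PATH` are hypothetical names, and it assumes SageMaker surfaces the contents of /opt/ml/output/message as the job's ExitMessage. It would not capture failures that happen outside the Python process, such as the container being killed or an error on the service side.

```python
import sys
import traceback
from pathlib import Path

# Assumption: SageMaker reports the contents of this file as ExitMessage.
MESSAGE_PATH = Path("/opt/ml/output/message")


def run_with_diagnostics(main, message_path=MESSAGE_PATH):
    """Run main(); on failure, log the full traceback and return exit code 1."""
    try:
        main()
        return 0
    except Exception:
        trace = traceback.format_exc()
        # Flush explicitly so the trace reaches the log stream before exit.
        print(trace, file=sys.stderr, flush=True)
        try:
            message_path.parent.mkdir(parents=True, exist_ok=True)
            # Keep only the tail, assuming the exit message is length-limited.
            message_path.write_text(trace[-1024:])
        except OSError:
            pass  # best effort: the path may not be writable outside SageMaker
        return 1
```

In the script this would be called as `sys.exit(run_with_diagnostics(main))`, where `main` is whatever function holds the processing loop. Adding `flush=True` to the existing debug print statements would also help confirm whether the last line in CloudWatch is really the last line that ran.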
Screenshots or logs
» ./run-job.py # helper script to launch the Processing job
<sagemaker.processing.Processor object at 0x1132a2e50>
12345.dkr.ecr.us-east-1.amazonaws.com/container-image /app/process.py some-args
Job Name: job-name-2021-06-30-18-45-25-969
Inputs: [{'InputName': 'processing-input', 'S3Input': {'S3Uri': 's3://bucket/output/model.tar.gz', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs: [{'OutputName': 'processing-output', 'S3Output': {'S3Uri': 's3://bucket/job-name-2021-06-30-18-45-25-969/output/processing-output', 'LocalPath': '/opt/ml/processing/output/', 'S3UploadMode': 'Continuous'}}]
..............................................
Traceback (most recent call last):
  File "./bin/run-job.py", line 94, in <module>
    etl_processor.run(outputs=[po], inputs=inputs, wait=not args.no_wait, logs=not args.no_wait)
  File "/Users/path/venv/lib/python3.7/site-packages/sagemaker/processing.py", line 164, in run
    self.latest_job.wait(logs=logs)
  File "/Users/path/venv/lib/python3.7/site-packages/sagemaker/processing.py", line 728, in wait
    self.sagemaker_session.logs_for_processing_job(self.job_name, wait=True)
  File "/Users/path/venv/lib/python3.7/site-packages/sagemaker/session.py", line 3134, in logs_for_processing_job
    self._check_job_status(job_name, description, "ProcessingJobStatus")
  File "/Users/path/venv/lib/python3.7/site-packages/sagemaker/session.py", line 2638, in _check_job_status
    actual_status=status,
sagemaker.exceptions.UnexpectedStatusException: Error for Processing job job-name-2021-06-30-18-45-25-969: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.47.2
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): N/A
- Framework version: N/A
- Python version: 3.8
- CPU or GPU: CPU
- Custom Docker image (Y/N): Y
Additional context
Running on an ml.m5.2xlarge instance (metrics show plenty of headroom in CPU, disk, and memory usage) with a custom container. The job takes a shelve file, does some data processing, and then generates many small files that are written to disk. Data is read and written through paths backed by ProcessingInput and ProcessingOutput.
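For anyone trying to gather more detail than the SDK exception shows, the job description can be queried directly after a failure. The following is a rough sketch using the boto3 SageMaker client's `describe_processing_job` call; `summarize_failure` and `describe_job` are hypothetical helper names, and the region default is taken from the ECR URI above. Fields such as `ExitMessage` are only present in the response when the service has a value for them, so the helper skips missing keys.

```python
def summarize_failure(description):
    """Pick the failure-related fields out of a DescribeProcessingJob response."""
    keys = (
        "ProcessingJobStatus",
        "FailureReason",
        "ExitMessage",
        "ProcessingEndTime",
    )
    # Only include keys that the service actually returned.
    return {key: description[key] for key in keys if key in description}


def describe_job(job_name, region_name="us-east-1"):
    """Fetch a processing job's description and return its failure summary."""
    # Imported lazily so summarize_failure can be used without boto3 installed.
    import boto3

    client = boto3.client("sagemaker", region_name=region_name)
    response = client.describe_processing_job(ProcessingJobName=job_name)
    return summarize_failure(response)
```

Calling `describe_job("job-name-2021-06-30-18-45-25-969")` with valid credentials would return whatever the service recorded for that job. In my case I would expect `FailureReason` to repeat the InternalServerError text, but an `ExitMessage`, if present, might narrow things down.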
