giuseppeporcelli
Contributor

@giuseppeporcelli giuseppeporcelli commented Apr 30, 2020

…odel mode.

Issue #, if available:

Description of changes:
I have fixed the handler service so that it adds the 'code' dir (where user modules are stored) to the Python path. This is required for importing custom user modules when the container is used in multi-model mode.
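For illustration, here is a minimal sketch of the kind of change described above. It assumes that user modules live in a `code` subdirectory of each model's directory. The helper name `add_code_dir_to_path` is hypothetical, and this is not the code from the PR:

```python
import importlib
import os
import sys
import tempfile


def add_code_dir_to_path(model_dir):
    """Prepend <model_dir>/code to sys.path so user modules become importable.

    Hypothetical helper: assumes user modules are stored in a 'code'
    subdirectory of the model directory.
    """
    code_dir = os.path.join(model_dir, "code")
    # Only add an existing directory, and only once, so repeated calls for
    # the same model do not keep growing sys.path.
    if os.path.isdir(code_dir) and code_dir not in sys.path:
        sys.path.insert(0, code_dir)
    return code_dir


if __name__ == "__main__":
    # Build a throwaway model dir containing code/user_module.py.
    with tempfile.TemporaryDirectory() as model_dir:
        os.makedirs(os.path.join(model_dir, "code"))
        with open(os.path.join(model_dir, "code", "user_module.py"), "w") as f:
            f.write("def greet():\n    return 'hello from user code'\n")
        add_code_dir_to_path(model_dir)
        importlib.invalidate_caches()  # the module file was created at runtime
        user_module = importlib.import_module("user_module")
        print(user_module.greet())  # hello from user code
```

The directory is inserted at the front of `sys.path` so that a user module takes precedence over any same-named module installed elsewhere. The membership check keeps repeated calls for the same model from adding duplicate entries.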

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@ajaykarpur
Contributor

@giuseppeporcelli Retried the sagemaker-pytorch-inference build and it looks like the same test timed out again:

=================================== FAILURES ===================================
________________________________ test_mnist_cpu ________________________________

sagemaker_session = <sagemaker.session.Session object at 0x7f4b80346320>
image_uri = '142577830533.dkr.ecr.us-west-2.amazonaws.com/sagemaker-test:1.4.0-pytorch-sagemaker-pytorch-inference-04131d15-2e47-4fe3-83da-1bf0e5551b62'
instance_type = 'ml.c4.xlarge'

    @pytest.mark.cpu_test
    def test_mnist_cpu(sagemaker_session, image_uri, instance_type):
        instance_type = instance_type or 'ml.c4.xlarge'
>       _test_mnist_distributed(sagemaker_session, image_uri, instance_type, model_cpu_tar, mnist_cpu_script)

test-toolkit/integration/sagemaker/test_mnist.py:28: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test-toolkit/integration/sagemaker/test_mnist.py:65: in _test_mnist_distributed
    endpoint_name=endpoint_name)
.tox/py36/lib/python3.6/site-packages/sagemaker/model.py:515: in deploy
    data_capture_config_dict=data_capture_config_dict,
.tox/py36/lib/python3.6/site-packages/sagemaker/session.py:2872: in endpoint_from_production_variants
    return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
.tox/py36/lib/python3.6/site-packages/sagemaker/session.py:2404: in create_endpoint
    self.wait_for_endpoint(endpoint_name)
.tox/py36/lib/python3.6/site-packages/sagemaker/session.py:2651: in wait_for_endpoint
    desc = _wait_until(lambda: _deploy_done(self.sagemaker_client, endpoint), poll)
.tox/py36/lib/python3.6/site-packages/sagemaker/session.py:3602: in _wait_until
    time.sleep(poll)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

signum = 14, frame = <frame object at 0x7f4b7b551238>

    def handler(signum, frame):
>       raise TimeoutError('timed out after {} seconds'.format(limit))
E       integration.sagemaker.timeout.TimeoutError: timed out after 1800 seconds

test-toolkit/integration/sagemaker/timeout.py:44: TimeoutError
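The traceback shows a common pattern for a hard test timeout. A handler for signal 14 (`SIGALRM` on Linux) raises a timeout error, which interrupts `time.sleep(poll)` inside `wait_for_endpoint` once the 1800-second limit is reached. The sketch below illustrates that pattern. It is not the actual `test-toolkit/integration/sagemaker/timeout.py`, and the names `timeout` and `OperationTimeoutError` are hypothetical. Because it relies on `signal.alarm`, it works only on Unix and only in the main thread:

```python
import signal
import time
from contextlib import contextmanager


class OperationTimeoutError(Exception):
    """Raised when the guarded block runs longer than the limit (hypothetical name)."""


@contextmanager
def timeout(seconds):
    """Raise OperationTimeoutError if the body runs longer than `seconds`."""

    def handler(signum, frame):
        # Runs when SIGALRM is delivered; raising here aborts the blocked call.
        raise OperationTimeoutError("timed out after {} seconds".format(seconds))

    previous = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the previous handler


if __name__ == "__main__":
    try:
        with timeout(1):
            time.sleep(5)  # stands in for a long wait such as wait_for_endpoint
    except OperationTimeoutError as exc:
        print(exc)  # timed out after 1 seconds
```

Because the handler raises, the exception propagates out of whatever call was blocked when the alarm fired. This is why the traceback ends in `time.sleep` inside the SageMaker SDK rather than in the test body itself.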

@giuseppeporcelli
Contributor Author

I'm not able to replicate the issue locally. Could I get access to the logs of the endpoint being created, so I can see why the deployment is not working? Thanks.

Contributor

@ajaykarpur ajaykarpur left a comment

Looks like it was just a flaky test. The builds are now passing; approved.

@ajaykarpur ajaykarpur merged commit f16ba1a into aws:master May 1, 2020

3 participants