Unable to update existing endpoint with newly trained model #101

professoroakz · 2018-03-19T10:47:55Z

Hello!

I am investigating the Sagemaker API for use in production (without notebooks). I am able to train a model, create an endpoint and delete the endpoint without any problems with the API.

However, in a very common situation where I have a newly trained model on new data, I would like to be able to update/change the model that is currently serving in the specified endpoint and not have to update other services. In production, I would like to update the model serving without any downtime.

Currently, when I try to do this by simply training a new model and deploying it to the existing endpoint with deploy:

    def deploy(self):
        self.estimator.deploy(
            initial_instance_count=1000,
            instance_type="ml.c4.xlarge",  # instance type must be a string
            endpoint_name="iris"
        )

I get the following error:

    botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: Cannot create already existing endpoint "arn:aws:sagemaker:eu-west-1:166488713907:endpoint/iris".

Am I missing something here? Do I have to / can I do this operation manually with the boto3 api instead?

Thank you

winstonaws · 2018-03-19T16:32:02Z

Unfortunately, the feature for allowing you to update existing endpoints directly with .deploy is still on our backlog. We'll look again at its prioritization. In the meantime, you can try the workaround described in this issue: #58
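
As a rough sketch of how the CreateEndpoint collision can be avoided when calling the boto3 SageMaker client directly: check whether the endpoint already exists, then call UpdateEndpoint or CreateEndpoint accordingly. The helper names `endpoint_exists` and `create_or_update_endpoint` are made up for illustration, and `client` is assumed to be something like `boto3.client("sagemaker")`. The endpoint configuration named by `endpoint_config_name` is assumed to exist already.

```python
def endpoint_exists(client, endpoint_name):
    """Return True if an endpoint with exactly this name is listed.

    `client` is assumed to be a boto3 SageMaker client.
    """
    # NameContains is a substring filter, so an exact comparison is still
    # needed. Only the first page of results (up to 100) is inspected here.
    response = client.list_endpoints(NameContains=endpoint_name, MaxResults=100)
    return any(
        item["EndpointName"] == endpoint_name
        for item in response.get("Endpoints", [])
    )


def create_or_update_endpoint(client, endpoint_name, endpoint_config_name):
    """Call UpdateEndpoint for an existing endpoint, CreateEndpoint otherwise."""
    if endpoint_exists(client, endpoint_name):
        client.update_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name,
        )
        return "updated"
    client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
    return "created"
```

Passing the client in as an argument keeps the helpers easy to exercise with a stub object before pointing them at a real account.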

professoroakz · 2018-04-15T20:21:46Z

I've made this work; the implementation is pretty straightforward. I'd contribute it if I weren't busy building our ML infra around SageMaker. Here's what I did:

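    # Methods of a deployment class; `import botocore.exceptions` is needed
    # at module level for the except clause in update().
    def deploy(self):
        """ Deploy a new model """
        self.logger.info(
            'Deploying a new model with name %s to new endpoint with name %s' %
            (self.config.model_name, self.config.endpoint_name)
        )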

        if self.config.update_endpoint:
            self.update()
            return

        try:
            self.estimator.deploy(
                initial_instance_count=self.config.initial_instance_count,
                instance_type=self.config.instance_type,
                endpoint_name=self.config.endpoint_name
            )
        except RuntimeError:
            self.logger.info(
                '%s %s %s' % (
                    'raise RuntimeError: Estimator has not been fit yet,',
                    'AWS Expects to train & deploy in same step.',
                    'Please copy job name from AWS Sagemaker and set in model config.'
                )
            )

    def update(self):
        """ Deploy a new model to existing endpoint """
        try:
            self.create_endpoint_configuration()
        except botocore.exceptions.ClientError:
            pass

        self.update_endpoint()

    def postdeploy(self):
        """ Deploy a trained model, create corresponding endpoint configuration and endpoint """
        self.create_model_from_job()
        self.create_endpoint_configuration()
        self.session.create_endpoint(
            endpoint_name=self.config.endpoint_name,
            config_name=self.endpoint_config_name,
        )

    def create_model_from_job(self):
        """ Create a model from the trained Tensorflow Model """
        self.logger.info(
            'Creating a new model with name %s from training job %s' %
            (self.config.model_name, self.training_job_name)
        )

        self.session.create_model_from_job(
            training_job_name=self.training_job_name,
            name=self.config.model_name,
            role=self.config.role
        )

    def create_endpoint_configuration(self):
        self.logger.info(
            'Creating new endpoint config with name: %s, instance count: %d instance_type: %s' % (
                self.endpoint_name,
                self.config.initial_instance_count,
                self.config.instance_type,
            )
        )

        self.endpoint_config_name = self.session.create_endpoint_config(
            name=self.endpoint_name,
            model_name=self.config.model_name,
            initial_instance_count=self.config.initial_instance_count,
            instance_type=self.config.instance_type
        )

    def update_endpoint(self):
        """ Updates an existing endpoint with EndpointName
            and updates its corresponding Endpoint
            configuration with a new EndpointConfigName
        """
        self.logger.info(
            'Updating endpoint with endpoint name: %s with train job name: %s' %
            (self.config.endpoint_name, self.training_job_name)
        )
        self.client.update_endpoint(
            EndpointName=self.config.endpoint_name, EndpointConfigName=self.endpoint_config_name
        )
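
The update path above boils down to two API calls: create a new endpoint configuration that points at the new model, then repoint the existing endpoint at it. A minimal standalone sketch with a plain boto3 SageMaker client might look like the following. The function names and the `"AllTraffic"` variant name are illustrative choices, not taken from the code above, and the model named by `model_name` is assumed to have been created already.

```python
def build_production_variant(model_name, instance_type, instance_count):
    """Describe a single variant that receives all traffic for the endpoint."""
    return {
        "VariantName": "AllTraffic",  # illustrative name, any valid name works
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": instance_count,
    }


def swap_endpoint_model(client, endpoint_name, config_name, model_name,
                        instance_type, instance_count=1):
    """Create a new endpoint config for `model_name` and switch the endpoint to it.

    `client` is assumed to be a boto3 SageMaker client, and `config_name`
    must not already be in use by another endpoint configuration.
    """
    client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            build_production_variant(model_name, instance_type, instance_count)
        ],
    )
    client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
    return config_name
```

Unlike the `update()` method above, this sketch does not swallow a failed CreateEndpointConfig call, so the endpoint is never pointed at a configuration that was not created.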

ChoiByungWook · 2019-02-13T21:17:42Z

This feature was added in PR #606. To update an existing endpoint, pass update_endpoint=True to the deploy method. A usage example can be found here: example

ygcao · 2020-02-17T18:27:10Z

@ChoiByungWook what is the availability impact of the in-place update? Also, the example link appears to be broken. Thanks!
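
The thread does not answer the availability question. As far as the UpdateEndpoint API documentation describes it, SageMaker provisions the new configuration before switching over and only then removes the old resources, with the endpoint reporting an `Updating` status in the meantime. One way to observe when the switch has finished is to poll DescribeEndpoint, sketched below. The function name, the default intervals, and the injectable `sleep` parameter are illustrative assumptions, and `client` is assumed to be a boto3 SageMaker client.

```python
import time

# Statuses during which the endpoint is still changing state.
TRANSITIONAL_STATUSES = {"Creating", "Updating", "SystemUpdating", "RollingBack"}


def wait_for_endpoint(client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll DescribeEndpoint until the endpoint leaves a transitional status.

    Returns the final status string (for example "InService" or "Failed").
    Raises TimeoutError if the endpoint is still changing after the timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status not in TRANSITIONAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Endpoint %s still %s after %s seconds"
                % (endpoint_name, status, timeout_seconds)
            )
        sleep(poll_seconds)
```

The caller should still check the returned status, since the loop also stops on `Failed`.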

itsderek23 · 2020-04-01T17:10:19Z

I'm still seeing this error w/1.55.0 when trying to deploy a PyTorch Model:

    botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateEndpointConfig operation: Cannot create already existing endpoint configuration

Example code:

    from sagemaker.pytorch import PyTorchModel

    # `env` is the poster's own settings helper (not shown).
    pytorch_model = PyTorchModel(
        model_data=env.setting('model_data_path'),
        name=env.setting('model_name'),
        framework_version='1.4.0',
        role=env.setting("aws_role"),
        env={"DEPLOY_ENV": env.current_env()},
        entry_point='deploy/sagemaker/serve.py',
    )

    predictor = pytorch_model.deploy(
        instance_type=env.setting('instance_type'),
        update_endpoint=True,
        initial_instance_count=1,
    )

laurenyu · 2020-04-01T17:29:39Z

can you try specifying endpoint_name to be something else in the deploy call?

itsderek23 · 2020-04-01T18:10:10Z

Hi @laurenyu - seems like I get the same error including a new endpoint_name in the call:

    predictor = pytorch_model.deploy(
        endpoint_name=env.setting('model_name') + "-1",
        instance_type=env.setting('instance_type'),
        update_endpoint=True,
        initial_instance_count=1,
    )
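
The error reported here comes from CreateEndpointConfig rather than CreateEndpoint: an endpoint configuration with the requested name already exists. When creating configurations directly through boto3, one possible way to sidestep the collision is to give every configuration a fresh, timestamped name. The helper below is a hypothetical sketch and not part of the SageMaker SDK. It assumes the 63-character limit on SageMaker resource names and does not change what the SDK's own deploy call does.

```python
from datetime import datetime, timezone

MAX_NAME_LENGTH = 63  # assumed length limit for endpoint configuration names


def unique_config_name(base_name, now=None):
    """Return `base_name` plus a UTC timestamp suffix with millisecond precision.

    The base is truncated if needed so the result stays within the limit.
    """
    now = now or datetime.now(timezone.utc)
    # %f yields microseconds (6 digits); dropping the last 3 leaves milliseconds.
    suffix = now.strftime("%Y-%m-%d-%H-%M-%S-%f")[:-3]
    max_base_length = MAX_NAME_LENGTH - len(suffix) - 1  # 1 for the hyphen
    return "%s-%s" % (base_name[:max_base_length], suffix)
```

Two calls within the same millisecond would still collide, so a random suffix may be a better fit for highly concurrent deployments.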

laurenyu · 2020-04-01T20:01:36Z

could you open a new issue in this repo? (sorry for the inconvenience, but it'll help with our internal tracking and making sure we respond)

hubtub2 · 2020-09-02T07:04:30Z

Still not working on my side. I get the same error as in the original bug report, even when using update_endpoint=True. Did anyone open a new issue?

laurenyu · 2020-09-02T22:14:08Z

@hubtub2 the behavior around this changed with v2.0+, so it's probably best if you open a new issue and include your specific code and Python SDK version

professoroakz changed the title ~~Unable to update existing endpoint with new trained model~~ Unable to update existing endpoint with newly trained model Mar 19, 2018

ChoiByungWook added the feature request label Mar 21, 2018

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018

remove credentials import (aws#101)

4e495bc

ChoiByungWook closed this as completed Feb 13, 2019

dberenbaum mentioned this issue Sep 28, 2023

example-get-started-experiments: redeploy to same endpoint iterative/example-repos-dev#265

Open
