This repository has been archived by the owner on May 23, 2024. It is now read-only.

Wait tfs before starting gunicorn #192

Merged

liangma8712 merged 4 commits into aws:master from liangma8712:master

Mar 23, 2021

Contributor

liangma8712 commented Mar 19, 2021

Issue #, if available:

Description of changes:
Wait for TensorFlow Serving (TFS) to be ready before starting gunicorn.

Tested with a batch transform job using the notebook below.
https://aws.amazon.com/blogs/machine-learning/performing-batch-inference-with-tensorflow-serving-in-amazon-sagemaker/

Set SAGEMAKER_TFS_WAIT_TIME to adjust the wait time if needed.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


          Wait tfs before starting gunicorn

a30e3f1

liangma8712 requested a review from schenqian

March 19, 2021 22:09

sagemaker-bot commented Mar 19, 2021

AWS CodeBuild CI Report

CodeBuild project: sagemaker-tensorflow-serving-container-pr
Commit ID: a30e3f1
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository


          Fix build issue

2bc0626

sagemaker-bot commented Mar 20, 2021

AWS CodeBuild CI Report

CodeBuild project: sagemaker-tensorflow-serving-container-pr
Commit ID: 2bc0626
Result: FAILED
Build Logs (available for 30 days)

schenqian requested a review from mitroviv

March 20, 2021 02:15

mitroviv reviewed

docker/build_artifacts/sagemaker/serve.py Outdated

@@ -60,6 +62,7 @@ def __init__(self):
                       # Use this to specify memory that is needed to initialize CUDA/cuDNN and other GPU libraries
                       self._tfs_gpu_margin = float(os.environ.get("SAGEMAKER_TFS_FRACTIONAL_GPU_MEM_MARGIN", 0.2))
                       self._tfs_instance_count = int(os.environ.get("SAGEMAKER_TFS_INSTANCE_COUNT", 1))
+                      self._tfs_wait_time = int(os.environ.get("SAGEMAKER_TFS_WAIT_TIME", 600))

mitroviv Mar 21, 2021

Please make the time unit part of the environment variable name and the field name.

Contributor Author

liangma8712 Mar 22, 2021

Sure, will add it in the next revision.
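A minimal sketch of what the reviewer's suggestion could look like, with the unit carried in both the environment variable and the returned value's name. The variable name `SAGEMAKER_TFS_WAIT_TIME_SECONDS` and the helper `read_wait_time_seconds` are hypothetical names chosen for illustration; the default of 600 is taken from the diff above.

```python
import os


def read_wait_time_seconds(environ=os.environ, default=600):
    """Read the TFS wait budget, in seconds, from a unit-suffixed variable."""
    # SAGEMAKER_TFS_WAIT_TIME_SECONDS is a hypothetical name, used here only
    # to show the unit being part of the configuration key.
    wait_time_seconds = int(environ.get("SAGEMAKER_TFS_WAIT_TIME_SECONDS", default))
    if wait_time_seconds < 0:
        raise ValueError("wait time must be non-negative")
    return wait_time_seconds
```

With the unit in the name, a caller setting the variable to `45` cannot mistake it for minutes or milliseconds.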

docker/build_artifacts/sagemaker/serve.py Outdated

+                      while True:
+                          try:
+                              tfs_ready_count = 0
+                              for i in range(self._tfs_instance_count):

mitroviv Mar 21, 2021

nitpick (feel free to ignore):
Please consider using a more descriptive variable name than i (perhaps tfs_index or tfs_ordinal).

docker/build_artifacts/sagemaker/serve.py Outdated

Comment on lines 336 to 337

		tfs_url = "http://localhost:{}/v1/models/{}/metadata" \
		.format(self._tfs_rest_port[i], self._tfs_default_model_name)

mitroviv Mar 21, 2021

This looks suspicious - shouldn't there be a list/dict of corresponding model names (potentially different) for each TF server?

Contributor Author

liangma8712 Mar 22, 2021

All of these TF servers use the same model here. For a multi-model endpoint, the TensorFlow server is not started during container initialization.

docker/build_artifacts/sagemaker/serve.py Outdated

Comment on lines 339 to 340

		response = requests.get(tfs_url)
		logging.info(response)

mitroviv Mar 21, 2021

Unless the response already includes server/endpoint metadata, please consider logging additional information to help customers identify which server returned which response.

Contributor Author

liangma8712 Mar 22, 2021

We logged the server info on line 338 for this purpose.
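As an illustration of the reviewer's point, one option is to put the server identity and the response status in the same log record, so that each line can be read on its own. This is only a sketch: `format_tfs_response` and `log_tfs_response` are hypothetical helpers, not functions from serve.py.

```python
import logging

logger = logging.getLogger(__name__)


def format_tfs_response(instance_index, url, status_code):
    """Build a log message that names the server alongside its response."""
    return "TFS instance {} ({}) returned HTTP {}".format(instance_index, url, status_code)


def log_tfs_response(instance_index, url, status_code):
    # Hypothetical helper: emits one self-identifying record per probe.
    logger.info(format_tfs_response(instance_index, url, status_code))
```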

docker/build_artifacts/sagemaker/serve.py Outdated

Comment on lines 345 to 346

		except requests.exceptions.ConnectionError:
			time.sleep(30)

mitroviv Mar 21, 2021

Please consider making this configurable (with the relevant time unit in the environment variable and field names); otherwise it's just a hard-coded magic number.

Contributor Author

liangma8712 Mar 22, 2021

Sure, will add it in the next revision.
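The review threads above could be combined into a polling loop along the following lines. This is a sketch under stated assumptions, not the code that was merged: the function names, the `poll_interval_seconds` parameter, and the use of the standard library's `urllib` in place of `requests` are all choices made for this example. The metadata URL format and the 600 and 30 second defaults come from the snippets quoted in this review.

```python
import time
import urllib.request


def tfs_is_ready(port, model_name, timeout_seconds=5):
    """Return True if the TFS REST metadata endpoint answers with HTTP 200."""
    url = "http://localhost:{}/v1/models/{}/metadata".format(port, model_name)
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError, so this covers
        # refused connections, timeouts, and non-2xx responses.
        return False


def wait_for_tfs(ports, model_name, wait_time_seconds=600, poll_interval_seconds=30,
                 probe=tfs_is_ready, sleep=time.sleep, clock=time.monotonic):
    """Poll every TFS instance until all are ready or the wait budget runs out."""
    deadline = clock() + wait_time_seconds
    pending = list(ports)
    while True:
        # Keep only the instances that are still not answering.
        pending = [port for port in pending if not probe(port, model_name)]
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_seconds)
```

Passing `probe`, `sleep`, and `clock` as parameters is a design choice for this sketch: it lets the loop be exercised in tests without a running server or real delays.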

sagemaker-bot commented Mar 22, 2021

AWS CodeBuild CI Report

CodeBuild project: sagemaker-tensorflow-serving-container-pr
Commit ID: 2bc0626
Result: FAILED
Build Logs (available for 30 days)


          refactor wait_for_model and use it for single model endpoint

cd3fc1f

sagemaker-bot commented Mar 23, 2021

AWS CodeBuild CI Report

CodeBuild project: sagemaker-tensorflow-serving-container-pr
Commit ID: cd3fc1f
Result: SUCCEEDED
Build Logs (available for 30 days)

mitroviv approved these changes


          Merge branch 'master' into master

438f4e4

sagemaker-bot commented Mar 23, 2021

AWS CodeBuild CI Report

CodeBuild project: sagemaker-tensorflow-serving-container-pr
Commit ID: 438f4e4
Result: FAILED
Build Logs (available for 30 days)

sagemaker-bot commented Mar 23, 2021

AWS CodeBuild CI Report

CodeBuild project: sagemaker-tensorflow-serving-container-pr
Commit ID: 438f4e4
Result: SUCCEEDED
Build Logs (available for 30 days)

liangma8712 merged commit 2921d8a into aws:master

liangma8712 mentioned this pull request

Batch Transform function starts sending image inference requests before model is actually loaded #189

Closed
