This repository has been archived by the owner on May 23, 2024. It is now read-only.

Wait tfs before starting gunicorn #192

Merged
merged 4 commits into from
Mar 23, 2021

Conversation

liangma8712
Contributor

Issue #, if available:

Description of changes:
Wait for TFS before starting gunicorn.

Tested the batch transform job using the notebook below.
https://aws.amazon.com/blogs/machine-learning/performing-batch-inference-with-tensorflow-serving-in-amazon-sagemaker/

Use SAGEMAKER_TFS_WAIT_TIME to adjust wait time if needed.
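
A minimal sketch of the waiting behavior described above, not the code from this PR: it polls each server URL until every one answers or a deadline passes. The names `http_ok` and `wait_until_ready` and all parameters are hypothetical, and it uses the standard library's `urllib` in place of the `requests` calls shown in the diff.

```python
import time
import urllib.error
import urllib.request


def http_ok(url):
    """Return True if a GET on url answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(urls, timeout_seconds, interval_seconds,
                     probe=http_ok, sleep=time.sleep, clock=time.monotonic):
    """Poll every URL until all of them are ready or the deadline passes.

    Returns True when all URLs are ready, False on timeout.
    probe, sleep, and clock are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout_seconds
    while True:
        ready_count = sum(1 for url in urls if probe(url))
        if ready_count == len(urls):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

A caller would start the web server only when `wait_until_ready(...)` returns True, and fail fast otherwise rather than waiting forever.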

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-tensorflow-serving-container-pr
  • Commit ID: a30e3f1
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-tensorflow-serving-container-pr
  • Commit ID: 2bc0626
  • Result: FAILED
  • Build Logs (available for 30 days)

@schenqian requested a review from mitroviv March 20, 2021 02:15
@@ -60,6 +62,7 @@ def __init__(self):
        # Use this to specify memory that is needed to initialize CUDA/cuDNN and other GPU libraries
        self._tfs_gpu_margin = float(os.environ.get("SAGEMAKER_TFS_FRACTIONAL_GPU_MEM_MARGIN", 0.2))
        self._tfs_instance_count = int(os.environ.get("SAGEMAKER_TFS_INSTANCE_COUNT", 1))
        self._tfs_wait_time = int(os.environ.get("SAGEMAKER_TFS_WAIT_TIME", 600))

Please make the time unit part of the environment variable name and the field name.

Contributor Author

Sure, will add it in the next revision.
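
One way the naming request above could look, as a sketch only: the helper `read_seconds` and the variable name `SAGEMAKER_TFS_WAIT_TIME_SECONDS` are hypothetical, and treating the default of 600 as seconds is an assumption, since the diff does not state the unit.

```python
import os


def read_seconds(name, default_seconds, environ=None):
    """Read a non-negative whole number of seconds from an environment variable."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default_seconds
    seconds = int(raw)  # raises ValueError on non-numeric input
    if seconds < 0:
        raise ValueError("{} must be >= 0, got {}".format(name, seconds))
    return seconds


# Hypothetical variable name with the unit spelled out.
# 600 mirrors the default in the diff; reading it as seconds is an assumption.
tfs_wait_time_seconds = read_seconds("SAGEMAKER_TFS_WAIT_TIME_SECONDS", 600)
```

Carrying the unit in both the environment variable and the field name means nobody has to open the source to learn whether 600 means seconds or milliseconds.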

while True:
    try:
        tfs_ready_count = 0
        for i in range(self._tfs_instance_count):

Nitpick (feel free to ignore):
Please consider using a more descriptive variable name instead of i (perhaps tfs_index or tfs_ordinal).

Comment on lines 336 to 337
tfs_url = "http://localhost:{}/v1/models/{}/metadata" \
    .format(self._tfs_rest_port[i], self._tfs_default_model_name)

This looks suspicious - shouldn't there be a list/dict of corresponding model names (potentially different) for each TF server?

Contributor Author

All of these TF servers use the same model here. For a multi-model endpoint, the TensorFlow server is not started during container initialization.
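
To illustrate the point in this thread, here is a small sketch of building one metadata URL per server while sharing a single model name. The function name `metadata_urls` is hypothetical; the URL pattern is the one quoted from the diff.

```python
def metadata_urls(rest_ports, model_name):
    """Build one TFS model-metadata URL per REST port, all for the same model."""
    return [
        "http://localhost:{}/v1/models/{}/metadata".format(port, model_name)
        for port in rest_ports
    ]


# The ports and model name below are made-up example values.
urls = metadata_urls([8501, 8502], "my_model")
```

Only the port varies between entries, which matches the author's statement that every server in this code path loads the same model.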

Comment on lines 339 to 340
response = requests.get(tfs_url)
logging.info(response)

Unless the response already includes server/endpoint metadata, please consider logging some additional information to help customers identify which server returned which response.

Contributor Author

We logged the server info on line 338 for this purpose.
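
For comparison, a sketch of a single log line that carries the server identity together with the probe result, so the two cannot be separated in interleaved logs. The function `log_probe_result` and its message format are hypothetical and are not what the PR logs.

```python
import logging

logger = logging.getLogger(__name__)


def log_probe_result(tfs_index, rest_port, status_code):
    """Log a readiness probe result tagged with the server it came from."""
    message = "TFS instance {} (REST port {}) metadata probe returned HTTP {}".format(
        tfs_index, rest_port, status_code)
    logger.info(message)
    return message
```

Returning the message is only a convenience for testing; the logging call is the part that matters.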

Comment on lines 345 to 346
except requests.exceptions.ConnectionError:
    time.sleep(30)

Please consider making this configurable (including the relevant time unit in the environment variable and field names); otherwise it's just a hard-coded magic number.

Contributor Author

Sure, will add it in the next revision.

@sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-tensorflow-serving-container-pr
  • Commit ID: cd3fc1f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-tensorflow-serving-container-pr
  • Commit ID: 438f4e4
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-bot

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-tensorflow-serving-container-pr
  • Commit ID: 438f4e4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

4 participants