Conversation
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
@@ -60,6 +62,7 @@ def __init__(self):
         # Use this to specify memory that is needed to initialize CUDA/cuDNN and other GPU libraries
         self._tfs_gpu_margin = float(os.environ.get("SAGEMAKER_TFS_FRACTIONAL_GPU_MEM_MARGIN", 0.2))
         self._tfs_instance_count = int(os.environ.get("SAGEMAKER_TFS_INSTANCE_COUNT", 1))
+        self._tfs_wait_time = int(os.environ.get("SAGEMAKER_TFS_WAIT_TIME", 600))
Please make the time unit part of the environment variable name and the field name.
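A minimal sketch of that suggestion (the `_SECONDS` suffix shown here is hypothetical, not taken from the PR): spelling the unit out in both the environment variable name and the attribute name keeps the value from being misread as milliseconds or minutes.

```python
import os


class ServiceManager:
    def __init__(self):
        # Hypothetical renaming: unit is explicit in both the env var name
        # and the field name, per the review comment.
        self._tfs_wait_time_seconds = int(
            os.environ.get("SAGEMAKER_TFS_WAIT_TIME_SECONDS", 600)
        )
```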
Sure, will add it in the next revision.
while True:
    try:
        tfs_ready_count = 0
        for i in range(self._tfs_instance_count):
nitpick (feel free to ignore):
Please consider using a bit more descriptive variable name instead of i (perhaps tfs_index or tfs_ordinal etc.)
tfs_url = "http://localhost:{}/v1/models/{}/metadata" \
    .format(self._tfs_rest_port[i], self._tfs_default_model_name)
This looks suspicious - shouldn't there be a list/dict of corresponding model names (potentially different) for each TF server?
All of these TF servers use the same model here. For a multi-model endpoint, the TensorFlow server is not started during container initialization.
response = requests.get(tfs_url)
logging.info(response)
Unless response already includes server/endpoint metadata please consider logging some additional information to help customers identify which server returned which response.
We logged the server info on line 338 for this purpose.
except requests.exceptions.ConnectionError:
    time.sleep(30)
Please consider adding some configuration for this (including relevant time unit name in the (env) variable / field names) - otherwise it's just a hard-coded magic number.
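A hedged sketch of that suggestion: the retry interval comes from an environment variable whose name spells out the unit, instead of a hard-coded 30-second sleep. `SAGEMAKER_TFS_RETRY_INTERVAL_SECONDS` and `wait_for_tfs` are hypothetical names, not part of the PR; the PR itself polls with `requests`, while stdlib `urllib` is used here so the sketch stays self-contained.

```python
import logging
import os
import time
import urllib.error
import urllib.request

# Hypothetical env var name with the time unit made explicit.
RETRY_INTERVAL_SECONDS = int(
    os.environ.get("SAGEMAKER_TFS_RETRY_INTERVAL_SECONDS", "30")
)


def wait_for_tfs(metadata_urls, timeout_seconds):
    """Poll every metadata URL until all respond or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            for url in metadata_urls:
                with urllib.request.urlopen(url) as response:
                    # Log which server answered, not just the response object.
                    logging.info("TFS at %s returned HTTP %d", url, response.status)
            return True  # every server answered in this pass
        except urllib.error.URLError:
            # A server is not up yet; back off for the configured interval.
            time.sleep(RETRY_INTERVAL_SECONDS)
    return False  # deadline passed without all servers responding
```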
Sure, will add it in the next revision.
Issue #, if available:
Description of changes:
Wait for TFS before starting gunicorn.
Tested the batch transform job using the notebook below:
https://aws.amazon.com/blogs/machine-learning/performing-batch-inference-with-tensorflow-serving-in-amazon-sagemaker/
Use SAGEMAKER_TFS_WAIT_TIME to adjust wait time if needed.
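As a hedged illustration of how a caller might raise the wait time for a batch transform job: the request below follows the shape of the SageMaker `CreateModel` API, which would be invoked through boto3's `sagemaker` client. The model name, image URI, model data path, and role ARN are placeholders, not values from the PR.

```python
# Sketch of a CreateModel request payload that overrides SAGEMAKER_TFS_WAIT_TIME
# in the serving container's environment. Placeholder values throughout.
create_model_request = {
    "ModelName": "tfs-batch-model",  # placeholder model name
    "PrimaryContainer": {
        "Image": "<tensorflow-serving-container-image-uri>",
        "ModelDataUrl": "s3://<bucket>/<prefix>/model.tar.gz",
        "Environment": {
            # Wait up to 20 minutes for TF Serving before starting gunicorn.
            "SAGEMAKER_TFS_WAIT_TIME": "1200",
        },
    },
    "ExecutionRoleArn": "<execution-role-arn>",
}

# A real call would then be:
# boto3.client("sagemaker").create_model(**create_model_request)
```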
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.