Deploy Sagemaker "Serverless" option fails with error "Image size 13279248421 is greater than supported" #691

Closed
jeve7 opened this issue Feb 14, 2024 · 7 comments

@jeve7

jeve7 commented Feb 14, 2024

Describe the bug
I am trying to deploy the latest version (5.5.0) in a DEV environment, so I am selecting "Serverless" for Sagemaker (SagemakerInitialInstanceCount = 0). The deployment fails with the message "Image size 13279248421 is greater than supported size 10737418240" while it is creating the Sagemaker endpoint. I guess this problem is new in 5.5.0, since I did the same thing previously with 5.4.5 and it worked fine.

To Reproduce

  • Start new QnABot deployment.
  • In CloudFormation parameters configure SagemakerInitialInstanceCount with 0 (Please note that the CloudFormation parameter label says: "Optional: If EmbeddingsApi is SAGEMAKER, provide initial instance count. Set to '0' to enable Serverless Inference (for cold-start delay tolerant deployments only).").
  • Configure "Stack failure options" as "Preserve successfully provisioned resources" (To avoid deletion of failed SagemakerEmbeddingStack stack).
  • Start deployment.
  • After some time the stack deployment will fail with this error: "Image size 13279248421 is greater than supported size 10737418240"
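For anyone who prefers to script the reproduction, the steps above can be sketched as the arguments one might pass to boto3's CloudFormation `create_stack` call. This is only a sketch under assumptions: the stack name and template URL are hypothetical placeholders, `DisableRollback=True` is assumed to be the API equivalent of the console's "Preserve successfully provisioned resources" option, and a real deployment needs further arguments (other required template parameters, capabilities) that are omitted here.

```python
def build_create_stack_kwargs(stack_name, template_url, initial_instance_count=0):
    """Build keyword arguments in the shape boto3's create_stack accepts.

    Only the two parameters discussed in this issue are set; everything
    else a real deployment needs is intentionally left out.
    """
    return {
        "StackName": stack_name,          # hypothetical name, pick your own
        "TemplateURL": template_url,      # hypothetical URL, use the published template
        "Parameters": [
            {"ParameterKey": "EmbeddingsApi", "ParameterValue": "SAGEMAKER"},
            {
                # 0 requests Serverless Inference according to the parameter label
                "ParameterKey": "SagemakerInitialInstanceCount",
                "ParameterValue": str(initial_instance_count),
            },
        ],
        # Assumed equivalent of "Preserve successfully provisioned resources",
        # so the failed nested stack stays around for inspection.
        "DisableRollback": True,
    }


kwargs = build_create_stack_kwargs(
    "qnabot-dev",                                 # hypothetical
    "https://example.com/qnabot-template.json",   # hypothetical
)
```

The resulting dictionary could then be passed as `boto3.client("cloudformation", region_name="ca-central-1").create_stack(**kwargs)` once the omitted arguments are filled in.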

Expected behavior
Deployment should succeed with Sagemaker configured as Serverless (same as in 5.4.5). Otherwise, the parameter label and documentation should be updated to state that Serverless is no longer an option, meaning 1 Sagemaker instance is the smallest available footprint.

Please complete the following information about the solution:

  • Version: 5.5
  • Region: ca-central-1
  • Was the solution modified from the version published on this repository? No
  • If the answer to the previous question was yes, are the changes available on GitHub? N/A
  • Have you checked your service quotas for the services this solution uses? N/A
  • Were there any errors in the CloudWatch Logs? There are errors in CloudFormation Stack (Must configure rollback to not delete on fail).

Screenshots
CloudFormation message:

[Screenshot: QnABot-SM-Exception]

jeve7 added the bug label Feb 14, 2024
@dougtoppin
Member

@jeve7 thanks for your report, we will take a look at it and get back to you

@bios6
Member

bios6 commented Feb 15, 2024

Hi @jeve7 ,

I just deployed v5.5.0 by cloning the GitHub repo, and the SagemakerEmbeddingsStack deployment succeeded for me. Are you referencing the model here?

ModelDataUrl: { 'Fn::Sub': 's3://${BootstrapBucket}/${BootstrapPrefix}/ml_model/e5-large.tar.gz' },

Also, I believe in a previous issue you mentioned you were migrating from v5.4.5, so could it be that something you modified is causing this? I would recommend trying a fresh deployment, following the readme, to see if that succeeds for you. That should show whether the issue comes from changes on your side.

@jeve7
Author

jeve7 commented Feb 15, 2024

Interesting... thanks for the info. I did a brand new deployment twice using the public template from here: https://docs.aws.amazon.com/solutions/latest/qnabot-on-aws/step-1-launch-the-stack.html. I clicked the "Launch" link, which opens CloudFormation, changed the region to "ca-central-1" and moved forward. It failed twice with the same message when I selected "SagemakerInitialInstanceCount = 0". The second time I changed the rollback setting to be able to see the real error in the Sagemaker stack.
It did work when I changed to "SagemakerInitialInstanceCount = 1".
The template is already "cooked" and everything appears to be in a bucket somewhere. I didn't touch the repo in any way.

@rpilic

rpilic commented Feb 15, 2024

I also have experienced the same issue. @bios6 it wasn't clear from your response, but did you try setting SagemakerInitialInstanceCount = 0? The ability to set the qnabot to serverless mode is important for cost savings in a non-production environment.

@fhoueto-amz
Member

Hi @jeve7
The latest update to the embedding model image has a size greater than 10GB, which is the image size limit for Sagemaker serverless containers. Our current recommendation is to use an instance rather than serverless. We are reviewing this to determine our way forward.
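To put the two numbers from the error message side by side, here is a small arithmetic sketch. The byte values are copied from the error text in this issue; the helper names are made up for illustration, and treating an image exactly at the limit as acceptable is an assumption, since the message only says "greater than".

```python
GIB = 2**30  # bytes per gibibyte

# Values quoted in the CloudFormation error message in this issue.
REPORTED_IMAGE_BYTES = 13279248421
SERVERLESS_LIMIT_BYTES = 10737418240


def to_gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / GIB


def fits_serverless(image_bytes, limit_bytes=SERVERLESS_LIMIT_BYTES):
    """Return True if the image is within the limit (assumes 'at the limit' is allowed)."""
    return image_bytes <= limit_bytes


excess_bytes = REPORTED_IMAGE_BYTES - SERVERLESS_LIMIT_BYTES

print(f"limit:  {to_gib(SERVERLESS_LIMIT_BYTES):.2f} GiB")
print(f"image:  {to_gib(REPORTED_IMAGE_BYTES):.2f} GiB")
print(f"excess: {to_gib(excess_bytes):.2f} GiB")
print("fits serverless:", fits_serverless(REPORTED_IMAGE_BYTES))
```

The limit in the message is exactly 10 GiB, and the reported image is roughly 12.37 GiB, so it exceeds the limit by about 2.37 GiB.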

@jeve7
Author

jeve7 commented Feb 16, 2024

Sounds good, thanks for the info @fhoueto-amz.

@bios6
Member

bios6 commented Mar 28, 2024

Closing this as Serverless is not deployable. Our documentation in the next release will be updated to mention that.

bios6 closed this as completed Mar 28, 2024