Feature request: make it possible to keep docker container warm #239

Open · jandockx opened this issue Dec 22, 2017 · 60 comments

@jandockx commented Dec 22, 2017

I understand from other issues that a new Docker container is started for each request. This makes some experiments and automated tests impractical: SAM Local is much too slow in any context where more than one request has to be handled.

I suspect that hot reloading depends on this fresh-container-per-request behavior.

I think it would be a good idea, as this project evolves, to let users choose to forgo hot reloading and instead keep the Docker container warm.

Something like

sam local start-api -p <PORT> --profile <AWS PROFILE> --keep-it-warm

This would broaden the applicability of sam local enormously.

Thank you for considering this suggestion. This looks like an awesome project.

@aldegoeij commented Dec 28, 2017

+1. The Python container takes too long to start for simple debugging...

@ghost commented Jan 5, 2018

+1. This currently makes local automated testing painful at best.

Thanks for the continued work on this project!

@dannymcpherson commented Feb 1, 2018

Have there been any eyes on this? The benefit would be so huge.

@cagoi commented Apr 19, 2018

+1

@hobotroid commented Apr 27, 2018

+1

@daveykane commented Jun 4, 2018

+1

@doitadrian commented Jun 16, 2018

+1

@CRogers commented Jun 16, 2018

+1, even a simple hello-world java8 Lambda takes 3–4 seconds per request!

@CRogers commented Jun 18, 2018

My sketch proposal for making warm containers work while keeping all the existing niceties around them (hot reload, memory-usage reporting, etc.):

Currently, the container is simply run with the handler as its argument and the event passed in via an environment variable. The container's logs are then piped to the console's stdout/stderr, and SAM just records how much memory was used.

Instead, we can start the container with bash as the entrypoint and -c "sleep infinity" as the argument, so it effectively runs nothing and the container stays alive. We record the container id in an (expiring) dict so we can reuse it. When we want to run the lambda, we docker exec the previously used lambda entrypoint with the correct environment. Since we still run one lambda per container, we can still record memory usage. If we key the running containers by the version of the lambda code we're running, hot reload still works. As always with caches, invalidation is the interesting part: you probably want to kill out-of-date containers, and kill all containers when the tool exits.
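A minimal sketch of this idea against the docker Python SDK (docker-py) might look like the following. The helper and its parameters are illustrative, not SAM CLI's actual internals; expiry, cleanup on exit, and memory accounting are omitted:

# Minimal sketch of the warm-container proposal above (docker-py).
# Names and parameters are hypothetical.
import docker

client = docker.from_env()
warm_containers = {}  # (function_name, code_version) -> Container

def invoke_warm(function_name, code_version, image, code_dir, cmd, env):
    """Run `cmd` in a long-lived container, creating it on first use."""
    key = (function_name, code_version)
    container = warm_containers.get(key)
    if container is None:
        # The entrypoint does nothing; it only keeps the container alive.
        container = client.containers.run(
            image,
            entrypoint="bash",
            command=["-c", "sleep infinity"],
            volumes={code_dir: {"bind": "/var/task", "mode": "ro"}},
            detach=True,
        )
        warm_containers[key] = container
    # Re-run just the runtime entrypoint inside the warm container.
    exit_code, output = container.exec_run(cmd, environment=env)
    return exit_code, output

Keying on code_version is what would let hot reload keep working: a changed function gets a fresh container, an unchanged one reuses the warm one.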

@monofonik commented Jul 8, 2018

+1

@luisvsm commented Aug 15, 2018

+1. Very interested in this feature.

@luketn commented Aug 24, 2018

+1 Yes please!

@nodeit commented Sep 6, 2018

+1, throwing my hat in the ring on this too

@jfuss (Contributor) commented Sep 6, 2018

As a note: please use the reaction feature on the top comment. We do look at issues sorted by thumbs-up (as well as other reactions). Commenting +1 does no good there and just adds noise to the issue.

@scoates commented Sep 6, 2018

@jfuss I agree (and had done this). Any feedback from your team would be helpful here, though. The closest thing we had to knowing whether this is on your radar (before your comment) was duplicate-issue consolidation and labeling.

@ejoncas commented Sep 24, 2018

+1, this would be very beneficial for people using Java + Spring Boot.

@thoratou commented Oct 6, 2018

+1; it's around 1 s per invoke in the Go case.

@kevanpng commented Oct 11, 2018

I did an experiment with container reuse, just with a Python Lambda; I'm developing on Ubuntu 16.04. In summary, spinning up the Docker container only adds about a second, so a container-reuse feature may not be worth it. Link to my code: https://github.com/kevanpng/aws-sam-local .

For a fixed query, both my colleague and I see 4 s invocation times on sam local (his is a Windows machine). With the profile flag and container reuse, it goes down to 2.5 s on my Ubuntu machine.

My colleague running on a Mac tried the same query with Lambda reuse and the profile flag, and it still took 11–14 seconds.

Maybe Docker is just slow on Mac?

@peachepe commented Oct 11, 2018

One second makes a world of difference when you're building an API and expect to serve more than one request.

I think the feature is well worth it.

@sanathkr (Contributor) commented Oct 11, 2018

@kevanpng Hey, I was looking through your code to understand exactly what you did. Basically, you create the container once with a fixed name, run the function, and on the next invocation you look up the container by that name and simply call container.exec_run instead of creating it from scratch again. Is my summary correct?

I am super surprised that Docker container creation makes this big a difference. We can certainly look deeper into this if it is becoming a usability blocker.

@scoates commented Oct 11, 2018

@sanathkr Thanks for looking at this. FWIW, it's a huge usability blocker for me:

~/src/faculty/buildshot$ time curl -s http://127.0.0.1:3000/ >/dev/null # SAM container via Docker

real	0m6.891s
user	0m0.012s
sys	0m0.021s
~/src/faculty/buildshot$ time curl -s http://127.0.0.1:5000/ >/dev/null # regular python app via flask dev/debug server (slow)

real	0m0.039s
user	0m0.012s
sys	0m0.019s

And the Instancing.. step is quick. It's Docker (and the way Docker is used here) that's slow. The (slow) werkzeug-based dev server is ~175x faster than waiting around for Docker, and this cost applies to every request, not just startup. (And yes, this is on my Mac.)

@sanathkr (Contributor) commented Oct 11, 2018

@scoates Thanks for the comparison. It's not apples-to-apples to compare vanilla Flask to a Docker-based app, but a 6-second duration with SAM CLI is definitely not what I would expect.

  • Did you already have the Docker image downloaded?
  • Also, can you start SAM CLI with the --skip-pull-image flag? This prevents the CLI from asking Docker for the latest image version on every invoke. Do share your numbers again with this flag set.

Thinking ahead:
I think we need to add more instrumentation to the SAM CLI codebase to understand which parts contribute to the high latency. It would be cool to run the instrumented code in a Travis build on every PR so we can assess the performance impact of new code changes. We also need to run this on a variety of platforms to understand the real difference between Mac and Ubuntu.

@sanathkr (Contributor) commented Oct 11, 2018

I did some more profiling by crudely commenting out parts of the codebase. Also, this was not run multiple times, so the numbers are ballpark estimates. I ran sam init and then sam local start-api on the simple HelloWorld Lambda function created by the init template.

Platform: MacOSX
Docker version: 18.06.0

WARNING: Very crude measurements.

Total execution time (sam local start-api): 2.67 seconds
Skip pull images (sam local start-api --skip-pull-image): 1.45 seconds
Create container, run it, and return immediately without waiting for function terminate: 1.05 seconds
Create container, don't run it: 0.2 seconds
SAM CLI code overhead (don't create container at all): 0.045 seconds

Based on the above numbers, I arrived at a rough estimate for each step of the invoke path by assuming:

Total execution = SAM CLI overhead + Docker Image pull + Create container + Run Container + Run function

Then, here is how much each steps took:

SAM CLI Overhead: 0.045 seconds
Docker Image Pull Check: 1.3 seconds
Create Container: 0.15 seconds
Run container: 0.85 seconds
Run function: 0.45 seconds
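Spelling out the arithmetic behind those estimates (successive subtraction of the crude measurements above; the small mismatches are just the measurement slack):

# Per-step estimates derived from the measurements above (seconds).
total        = 2.67   # sam local start-api
skip_pull    = 1.45   # with --skip-pull-image
no_wait      = 1.05   # create + run, return without waiting on the function
create_only  = 0.20   # create container, don't run it
cli_overhead = 0.045  # don't create a container at all

print(total - skip_pull)           # ~1.22 -> "Docker Image Pull Check: 1.3"
print(skip_pull - no_wait)         # ~0.40 -> "Run function: 0.45"
print(no_wait - create_only)       # ~0.85 -> "Run container: 0.85"
print(create_only - cli_overhead)  # ~0.155 -> "Create Container: 0.15"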

The most interesting comparison is Create vs. Run container: Run is roughly 5x Create, so we would get more out of optimizing the Run duration.

If we were to do a warm start, we would save some fraction of the 0.85 seconds it takes to run the container. We would need to keep the runtime process up and running inside the container and re-run just the function in place; otherwise we aren't going to save much.

@scoates commented Oct 17, 2018

Hi. Sorry for the late reply. I was traveling last week and forgot to get to this when I returned.

I agree absolutely that apigw and flask aren't apples-to-apples, and crude measurements are definitely where we're at right now.

With --skip-pull-image, I still get request starts in the 5+ second range. It's entirely possible there's slow stuff in my code (though it's small, so I'm not sure where that would come from; it really does seem like Docker). Here are the relevant bits of a request (on a warm start; this is several requests into sam local start-api --skip-pull-image):

[ 0.00] 2018-10-16 20:18:44 Starting new HTTP connection (1): 169.254.169.254
[ 1.01] 2018-10-16 20:18:45 Requested to skip pulling images ...
[ 0.00]
[ 0.00] 2018-10-16 20:18:45 Mounting /Users/sean/src/faculty/buildshot/buildshot/build as /var/task:ro inside runtime container
[!5.32] START RequestId: 13e564e9-1160-4c0e-b1e2-b31bbadd899a Version: $LATEST
[ 0.00] Instancing..
[ 0.00] [DEBUG]	2018-10-17T00:18:50.714Z	13e564e9-1160-4c0e-b1e2-b31bbadd899a	Zappa Event: {'body': None, 'httpMethod': 'GET', 'resource': '/', 'queryStringParameters': None, 'requestContext': {'httpMethod': 'GET', 'requestId': 'c6af9ac6-7b61-11e6-9a41-93e8deadbeef', 'path': '/', 'extendedRequestId': None, 'resourceId': '123456', 'apiId': '1234567890', 'stage': 'prod', 'resourcePath': '/', 'identity': {'accountId': None, 'apiKey': None, 'userArn': None, 'cognitoAuthenticationProvider': None, 'cognitoIdentityPoolId': None, 'userAgent': 'Custom User Agent String', 'caller': None, 'cognitoAuthenticationType': None, 'sourceIp': '127.0.0.1', 'user': None}, 'accountId': '123456789012'}, 'headers': {'X-Forwarded-Port': '3000', 'Host': 'localhost:3000', 'X-Forwarded-Proto': 'http', 'Accept': '*/*', 'User-Agent': 'curl/7.54.0'}, 'stageVariables': None, 'path': '/', 'pathParameters': None, 'isBase64Encoded': True}
[ 0.00]
[ 0.00] [INFO]	2018-10-17T00:18:50.731Z	13e564e9-1160-4c0e-b1e2-b31bbadd899a	127.0.0.1 - - [17/Oct/2018:00:18:50 +0000] "GET / HTTP/1.1" 200 15 "" "curl/7.54.0" 0/16.916
[ 0.00]
[ 0.00] END RequestId: 13e564e9-1160-4c0e-b1e2-b31bbadd899a
[ 0.00] REPORT RequestId: 13e564e9-1160-4c0e-b1e2-b31bbadd899a Duration: 4684 ms Billed Duration: 4700 ms Memory Size: 128 MB Max Memory Used: 42 MB
[ 0.58] 2018-10-16 20:18:51 127.0.0.1 - - [16/Oct/2018 20:18:51] "GET / HTTP/1.1" 200 -

The [ 0.xx] prefix comes from a util I have that shows the elapsed time between stdout lines. Here's the important part, I think:

[!5.32] START RequestId: 13e564e9-1160-4c0e-b1e2-b31bbadd899a Version: $LATEST
[ 0.00] Instancing..

I acknowledge that Instancing.. might just not be output until it's complete, so that by itself isn't a valid measurement point. Just wanted to pass on that I'm seeing 5s of lag in my requests.

I'm not sure how to measure much deeper than that.
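(The util itself isn't shown here; hypothetically, a filter producing those [ 0.xx] prefixes is only a few lines of Python:)

#!/usr/bin/env python3
# Hypothetical sketch of an elapsed-time filter like the one described
# above: prefix each stdin line with seconds elapsed since the last line.
import sys
import time

prev = time.monotonic()
for line in sys.stdin:
    now = time.monotonic()
    sys.stdout.write("[%5.2f] %s" % (now - prev, line))
    sys.stdout.flush()
    prev = now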

More info:

$ docker --version
Docker version 18.06.1-ce, build e68fc7
$ uname -a
Darwin sarcosm.local 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64 i386 MacBookPro11,4 Darwin
$ sam --version
SAM CLI, version 0.5.0

I also agree that if I can get this down to sub-1 s request times, it's probably usable. Still, 5 s+ is painful.

(Edit: adding in case anyone looking for Zappa info stumbles on this. I'm using an experimental fork of the Zappa handler runtime. This doesn't really apply to Zappa-actual. At least not right now.)

@hoang-innomizetech commented Jun 10, 2019

Does anyone have a workaround to speed up the execution time?

@humank commented Jul 19, 2019

+1. Consider Lambda functions that invoke one another locally: each call goes through the entire Lambda runtime creation life-cycle, and that takes a long time.

@vtirta commented Jul 19, 2019

+1

@jfuss (Contributor) commented Jul 30, 2019

Hey all,

I just opened a PR that should really ease the pain of 'slow'/'warm' invokes. I would love for people to chime in on the PR (I go deeper into the caveats and approach in the description of #1305). For anyone willing to give it a try and install the PR from source, I would love your feedback. We feel it is important for this change to be tested with real workflows to make sure we are hitting the needs of you all. Thanks! 🎉

@frankh commented Aug 7, 2019

I've also made an attempt at this (#1319). It differs in that it keeps the container running constantly and invokes via docker exec instead of starting/stopping the container each time.

It lowers API Gateway Lambda latency from ~3 s to ~0.2 s.

@thecaddy commented Aug 10, 2019

+1

@kastork commented Aug 15, 2019

The speed/time discussions here are all right on point: Micronaut has largely solved the startup-time issues that traditionally made Lambdas running JVM functions somewhat impractical, but the fresh container per invocation still has a large impact.

But there's another consideration I've recently run into: the case where your Lambda caches data for reuse across warm starts.

For example, my app makes AWS SDK calls to Secrets Manager and other services that I want to avoid when possible. So on a cold start I make those calls, but on a warm start I don't. The current situation makes testing such things locally impossible.
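For concreteness, the pattern in question is module-level caching in the handler; a sketch (the handler shape and secret id are made up):

# Sketch of the cold/warm caching pattern described above; the secret id
# is illustrative. Module-level state survives warm invokes because the
# runtime process (and container) is reused.
import boto3

_secret = None  # populated once per container

def handler(event, context):
    global _secret
    if _secret is None:  # cold start: pay for the SDK call once
        sm = boto3.client("secretsmanager")
        _secret = sm.get_secret_value(SecretId="my-app/config")["SecretString"]
    # Warm starts reach here without another network call.
    return {"statusCode": 200, "body": "ok"}

With a fresh container per invoke, the cold-start branch runs every time, so the warm path can never be exercised locally.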

@hoang-innomizetech commented Aug 22, 2019

+100

@ivan589 commented Aug 23, 2019

+1

@fondberg commented Sep 11, 2019

+1

@mhart (Contributor) commented Oct 15, 2019

FWIW, I'm going to be pushing out some changes to https://github.com/lambci/docker-lambda in the next week that should support this out of the box.

Basically, each container can be configured (via an env var) to expose an HTTP interface, specifically the Lambda API, which means that (after the first invoke) each invoke can hit a warm Lambda. Current testing on my Mac gives around 433 req/sec (2.3 ms/req).

Assuming you have an index.js with a handler() in your current directory, you can try it out with:

docker run --rm -v $PWD:/var/task -e STAY_OPEN=1 -p 3000:3000 lambci/lambda:nodejs8.10-beta index.handler

(it won't print anything, but if it stays running... it's working)

Then you can invoke it (multiple times) using:

aws lambda invoke --endpoint http://localhost:3000 --no-sign-request --function-name doesnotmatter --payload '{}' output.json

Or simply just:

curl -d '{}' http://localhost:3000/2015-03-31/functions/myfunction/invocations
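The same endpoint works from Python test code too; a sketch with requests (subject to the same caveats about names and ports below):

# Sketch: invoking the STAY_OPEN beta endpoint from Python test code.
import requests

resp = requests.post(
    "http://localhost:3000/2015-03-31/functions/doesnotmatter/invocations",
    json={},  # the Lambda event payload
)
print(resp.status_code, resp.text)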

Very likely that the env var name will change, and probably the default port too – so don't get attached to those.

Will update you when I've pushed everything up 👍

@dgergely commented Nov 11, 2019

@mhart Hi! Any update on this? Thanks!

@mhart (Contributor) commented Nov 11, 2019

@dgergely yup, plenty – 1.2k lines of Go, C#, Java, Python and JS to be precise: lambci/docker-lambda#218

@mhart (Contributor) commented Nov 13, 2019

Alrighty, 1.8k lines later, all runtimes are supported. I still want to clean up some code and make sure stdout/stderr are supported correctly in the new model.

I've pushed all images to -beta tags, e.g. lambci/lambda:ruby2.5-beta, lambci/lambda:dotnetcore2.1-beta, etc.

The env var to use is DOCKER_LAMBDA_STAY_OPEN and the port by default is 9001 – going to make this configurable soon.

So basically:

docker run --rm \
  -e DOCKER_LAMBDA_STAY_OPEN=1 \
  -p 9001:9001 \
  -v $PWD:/var/task \
  lambci/lambda:ruby2.5-beta \
  lambda_function.lambda_handler

You should then see:

Lambda API listening on port 9001...

And then you can invoke using:

curl -d '{}' http://localhost:9001/2015-03-31/functions/myfunction/invocations

OR

aws lambda invoke --endpoint http://localhost:9001 --no-sign-request --function-name myfunction --payload '{}' output.json
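From Python, the aws CLI invoke above translates to boto3 like this (a sketch; an unsigned client stands in for --no-sign-request):

# Sketch: the `aws lambda invoke --no-sign-request` call above, via boto3.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

client = boto3.client(
    "lambda",
    endpoint_url="http://localhost:9001",
    region_name="us-east-1",  # boto3 requires a region; assumed ignored locally
    config=Config(signature_version=UNSIGNED),
)
resp = client.invoke(FunctionName="myfunction", Payload=b"{}")
print(resp["Payload"].read().decode())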

If you don't supply DOCKER_LAMBDA_STAY_OPEN, everything should function as it currently does (there are some very minor changes for some runtimes, but none of them should be breaking).

@mhart (Contributor) commented Nov 13, 2019

Due to the way I've implemented it (sharing the API server among all the runtimes), it's slower than the custom implementation from my earlier test. I now get around 70 req/s. But this is still much faster than cold-starting each time. It would be possible to optimize this, but potentially at the cost of maintainability.

@mhart (Contributor) commented Nov 17, 2019

Support for warm invokes has been pushed to all docker-lambda runtimes 🎉

The documentation above still stands, i.e. invoke with:

docker run --rm \
  -e DOCKER_LAMBDA_STAY_OPEN=1 \
  -p 9001:9001 \
  -v $PWD:/var/task \
  lambci/lambda:ruby2.5 \
  lambda_function.lambda_handler

All runtimes also have support for X-Amz-Log-Type: Tail (--log-type Tail if invoking from the aws CLI), as well as X-Amz-Invocation-Type: DryRun (--invocation-type DryRun) and X-Amz-Invocation-Type: Event (--invocation-type Event).

All old images are available at lambci/lambda:20191117-<runtime>, e.g. lambci/lambda:20191117-dotnetcore2.1, in case people encounter issues with the new images. However, I tried hard to ensure that sam local invoke still functions largely as it does today.

@mhart (Contributor) commented Nov 17, 2019

So, all that's left now is support for these warm invokes in aws-sam-cli 😸

@ranjan-purbey commented Nov 22, 2019

Our team is just getting started with Lambda, but one of the first roadblocks we hit was the slow response from the local API Gateway instance created with aws-sam-cli. On a system with 16 GB of memory, each invocation takes ~7 seconds. This makes development really painful.
Any estimate of how long before this feature is integrated into SAM CLI?

@ranjan-purbey commented Nov 22, 2019

@mhart When running the Docker container directly with the command you suggested above, it has to be restarted after every code change for the changes to take effect. Any workarounds?

@mhart (Contributor) commented Nov 22, 2019

@ranjan-purbey Use something like https://facebook.github.io/watchman/ – just restart the process whenever one of your files changes.

@mhart (Contributor) commented Nov 23, 2019

@ranjan-purbey I added some documentation for developing and restarting whenever there are changes to your code: https://github.com/lambci/docker-lambda/#developing-in-stay-open-mode

@mhart (Contributor) commented Dec 1, 2019

I've actually added a watch mode to docker-lambda itself, so you no longer need to rely on external file watchers to do the job. Just pass DOCKER_LAMBDA_WATCH=1 to activate it.

Updated documentation here: https://github.com/lambci/docker-lambda#developing-in-stay-open-mode

You can also manually reload the handler by sending SIGHUP to the container.

@langboost commented Dec 19, 2019

While waiting on the fix, it's helpful to know that the Docker image pull (as @sanathkr mentioned above) can be skipped, and it accounts for about half of the waiting.

My personal experience with local API Gateway testing is that I can shave about 5 seconds off per request simply by passing --skip-pull-image on launch:

sam local start-api --skip-pull-image

That's a very simple change you can make to your dev workflow to save some pain for now. Thanks @sanathkr!

@literakl commented Feb 5, 2020

@ranjan-purbey I added some documentation for developing and restarting whenever there are changes to your code: https://github.com/lambci/docker-lambda/#developing-in-stay-open-mode

Great work! I don't understand whether this is intended for a single API method or a complete API, though. Do I have to start a Docker container for every API method?
