
Fix socket timeouts during local debugging #422

Merged
merged 3 commits into aws:develop on Jul 2, 2018

Conversation

felixbuenemann (Contributor)

I was running into a problem where sam local invoke --debug-port 9222 myFunction or sam local start-api --debug-port 9222 would crash with a socket.timeout about 60 seconds after attaching to the debug session if there was no log output from the container.

This PR works around the problem by ignoring socket.timeout in the Docker attach code, which is essential to make debugging via the debug port usable, at least on macOS.

Description of changes:

When attaching to a Docker container, socket.timeout is now ignored so that the attach loop keeps running during remote debug sessions.
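The idea can be sketched as follows (a minimal illustration, not the actual SAM CLI attach code; read_header and read_payload are hypothetical stand-ins for the frame-reading helpers in attach_api.py):

```python
import socket

def read_frames(read_header, read_payload):
    """Poll the attach stream forever, treating socket.timeout as
    'no output from the container yet' rather than a fatal error."""
    while True:
        try:
            header = read_header()
        except socket.timeout:
            # The socket is still healthy; the container simply produced
            # no output within the timeout window. Keep waiting.
            continue
        if header is None:
            return  # stream closed by the daemon
        yield read_payload(header)
```

The key point is that the except clause turns a crash into another iteration of the loop.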

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

This fixes socket.timeout errors about 60 seconds after attaching to a node.js remote debug session.
felixbuenemann (Contributor, Author)

The patch was tested locally with SAM CLI, version 0.3.0 on macOS 10.13.4 (17E202) with Python 2.7.15 from homebrew.

sanathkr (Contributor)

Thanks for the change @felixbuenemann. The problem statement definitely makes sense, but I don't know if ignoring the timeout will fix the problem. If the underlying socket timed out, wouldn't you need to re-connect somehow?

sanathkr (Contributor) commented May 31, 2018

Ideally, you would set the timeout value when creating the Docker client to something large; would ~10 minutes be sufficient?

Turns out we create the Docker client in three places (only one is actually necessary): https://github.com/awslabs/aws-sam-cli/search?q=from_env&unscoped_q=from_env. We should refactor this to create the Docker client only in invoke_context.py and pass it to container_manager.py, which will pass it to downstream classes. When creating the client with docker.from_env(), you should set the timeout to something large when debugging is enabled. If _debug_port is set, then you can assume that debugging is enabled.

This sounds like a lot, but it is actually pretty simple. Do you think you can make this change? I can guide you through the process. This will not only fix your problem but also remove the minor tech debt we incurred by creating the Docker client in three places.
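The suggested refactor could look roughly like this (an illustrative sketch only; client_timeout, DEFAULT_TIMEOUT, and DEBUG_TIMEOUT are made-up names, though docker-py's default client timeout really is 60 seconds):

```python
DEFAULT_TIMEOUT = 60      # seconds; docker-py's default client timeout
DEBUG_TIMEOUT = 10 * 60   # ~10 minutes while a debugger is attached

def client_timeout(debug_port):
    """Pick the Docker client timeout based on whether debugging is enabled."""
    return DEBUG_TIMEOUT if debug_port is not None else DEFAULT_TIMEOUT

# The single creation point in invoke_context.py would then be roughly:
#   docker_client = docker.from_env(timeout=client_timeout(debug_port))
```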

felixbuenemann (Contributor, Author) commented Jun 1, 2018

> Thanks for the change @felixbuenemann. The problem statement definitely makes sense, but I don't know if ignoring the timeout will fix the problem. If the underlying socket timed out, wouldn't you need to re-connect somehow?

The statement is inside a while True endless loop, so it will simply keep trying until the docker daemon responds. This worked fine for me.

I'm not sure how a 10 minute timeout will help, because a debugging session could also take an hour, so any arbitrary value would likely be wrong.

felixbuenemann (Contributor, Author)

Btw, instead of recovering from socket timeouts (which control how long we wait on blocking operations, not network errors), we could also call socket.settimeout(None) before entering the while loop.
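This alternative is a one-liner, since settimeout(None) puts the socket back into fully blocking mode (make_blocking below is a hypothetical helper, not SAM CLI code):

```python
import socket

def make_blocking(sock):
    """Disable the read timeout so recv() blocks until data arrives."""
    sock.settimeout(None)  # None means block indefinitely
    return sock
```

After this call, recv() never raises socket.timeout; the trade-off is that a hung daemon would block the attach loop forever.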

oniice commented Jun 1, 2018

I've been getting a timeout error when running my Lambda, but it oddly goes away if I add something like a console.log (before that, the Lambda produced no console output).

Manually added this patch on my machine, and it fixes the issue!

Here is the error thrown when running:

```
Traceback (most recent call last):
  File "/usr/local/bin/sam", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/samcli/commands/local/invoke/cli.py", line 44, in cli
    docker_network, log_file, skip_pull_image, profile)  # pragma: no cover
  File "/usr/local/lib/python2.7/site-packages/samcli/commands/local/invoke/cli.py", line 76, in do_cli
    stderr=context.stderr)
  File "/usr/local/lib/python2.7/site-packages/samcli/commands/local/lib/local_lambda.py", line 80, in invoke
    stdout=stdout, stderr=stderr)
  File "/usr/local/lib/python2.7/site-packages/samcli/local/lambdafn/runtime.py", line 98, in invoke
    container.wait_for_logs(stdout=stdout, stderr=stderr)
  File "/usr/local/lib/python2.7/site-packages/samcli/local/docker/container.py", line 187, in wait_for_logs
    self._write_container_output(logs_itr, stdout=stdout, stderr=stderr)
  File "/usr/local/lib/python2.7/site-packages/samcli/local/docker/container.py", line 206, in _write_container_output
    for frame_type, data in output_itr:
  File "/usr/local/lib/python2.7/site-packages/samcli/local/docker/attach_api.py", line 98, in _read_socket
    payload_type, payload_size = _read_header(socket)
  File "/usr/local/lib/python2.7/site-packages/samcli/local/docker/attach_api.py", line 170, in _read_header
    data = read_exactly(socket, 8)
  File "/usr/local/lib/python2.7/site-packages/docker/utils/socket.py", line 44, in read_exactly
    next_data = read(socket, n - len(data))
  File "/usr/local/lib/python2.7/site-packages/docker/utils/socket.py", line 30, in read
    return socket.recv(n)
socket.timeout: timed out
```

felixbuenemann (Contributor, Author) commented Jun 1, 2018

@oniice Yeah, that's a good point. The container could simply not be writing to stdout or stderr because it's busy and not logging anything, so making the timeout dependent on whether debug_port is set, as suggested by @sanathkr, is a bad idea.

felixbuenemann (Contributor, Author) commented Jun 1, 2018

I looked at the docker-py codebase, and it looks like docker.from_env(timeout=None) should work. Then again, the current solution localizes the fix to only the problematic region instead of setting a higher timeout everywhere.

So while setting the timeout through docker.from_env() would decouple us from the internal implementation, it would mean a higher timeout on all API calls. Since the current code already hijacks the socket from the Docker client through private methods, it is already tightly coupled to the implementation.

@sanathkr Let me know what you think!

felixbuenemann (Contributor, Author)

@oniice Could you please reformat your backtrace to use triple-backticks (fenced code block) instead of single-backticks so it becomes more readable?

benprew commented Jun 15, 2018

I see the same issue for long-running Lambdas that don't output any log messages. My backtrace is the same as @oniice's, and manually applying this patch fixes it.

sanathkr (Contributor) commented Jun 28, 2018

@felixbuenemann

Sorry for the late reply, and thanks for clarifying. A timeout actually does not close the socket; it simply reports that no new data is available. It makes sense, then, to swallow the timeout and keep polling.

Also, from other reports, I understand that this can happen even in the non-debugging case when your function runs for several minutes without printing anything to the console. Without this patch, such invocations will time out and fail.

As much as I don't like the unbounded infinite loop, there is value in doing it. This is especially okay for a CLI tool that isn't used in a production environment.

Can we just add a LOG.debug call to the except timeout block, so we have visibility into timeout issues for future debugging?

felixbuenemann (Contributor, Author)

@sanathkr:

> Can we just add a LOG.debug to the except timeout block, so we can have visibility into timeout issues for future debugging?

I've added a LOG.debug call in e2a96f5.
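The merged fix then has roughly this shape (an illustrative sketch, not the exact SAM CLI source; next_header is a hypothetical name, though Docker attach-stream headers really are 8 bytes):

```python
import logging
import socket

LOG = logging.getLogger(__name__)

def next_header(read_exactly, sock):
    """Read the next 8-byte stream header, swallowing socket.timeout
    but leaving a debug-level breadcrumb for troubleshooting."""
    while True:
        try:
            return read_exactly(sock, 8)
        except socket.timeout:
            LOG.debug("Ignoring socket.timeout while waiting for container output")
```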

sanathkr (Contributor) left a review

Fantastic! Thanks for making this change and sticking through all the discussion 🎉 A one line solution is usually the hardest to reason about :)

Sheep it 🐑

@sanathkr added the stage/accepted (Accepted and will be fixed) and area/docker labels and removed stage/in-review on Jul 2, 2018
@jfuss jfuss merged commit c8344b4 into aws:develop Jul 2, 2018
jfuss (Contributor) commented Jul 2, 2018

@felixbuenemann Thanks for keeping with this! Congrats on the first PR :)

@felixbuenemann felixbuenemann deleted the patch-1 branch July 2, 2018 13:50