
Fix socket timeouts during local debugging #422

Merged
merged 3 commits into aws:develop on Jul 2, 2018

Conversation

felixbuenemann (Contributor)

I was running into a problem where sam local invoke --debug-port 9222 myFunction or sam local start-api --debug-port 9222 would crash with a socket.timeout about 60 seconds after attaching to the debug session if there was no log output from the container.

This PR works around the problem by ignoring socket.timeout in the Docker attach code, which is essential to make debugging via the debug port usable, at least on macOS.

Description of changes:

When attaching to a Docker container, socket.timeout is now ignored so that the attach loop keeps running during remote debug sessions.
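The idea can be sketched as follows (a minimal illustration, not the actual SAM CLI attach code; read_header and read_payload are hypothetical stand-ins for the frame-reading helpers in attach_api.py):

```python
import socket

def read_frames(read_header, read_payload):
    """Poll the attach stream forever, treating socket.timeout as
    'no output from the container yet' rather than a fatal error."""
    while True:
        try:
            header = read_header()
        except socket.timeout:
            # The socket is still healthy; the container simply produced
            # no output within the timeout window. Keep waiting.
            continue
        if header is None:
            return  # stream closed by the daemon
        yield read_payload(header)
```

The key point is that the except clause turns a crash into another iteration of the loop.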

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

This fixes socket.timeout errors about 60 seconds after attaching to a node.js remote debug session.
felixbuenemann (Contributor, Author)

The patch was tested locally with SAM CLI, version 0.3.0 on macOS 10.13.4 (17E202) with Python 2.7.15 from homebrew.

sanathkr (Contributor)

Thanks for the change @felixbuenemann. The problem statement definitely makes sense, but I don't know if ignoring the timeout will fix the problem. If the underlying socket timed out, wouldn't you need to re-connect somehow?

sanathkr (Contributor) commented May 31, 2018

Ideally, you would set the timeout value when creating the Docker client to something large; would ~10 minutes be sufficient?

Turns out we create the Docker client in three places (only one is actually necessary): https://github.com/awslabs/aws-sam-cli/search?q=from_env&unscoped_q=from_env. We should refactor this to create the Docker client only in invoke_context.py and pass it to container_manager.py, which will pass it to downstream classes. When creating the client with docker.from_env(), you should set the timeout to something large when debugging is enabled. If _debug_port is set, then you can assume that debugging is enabled.

This sounds like a lot, but it is actually pretty simple. Do you think you can make this change? I can guide you through the process. This will not only fix your problem but also remove the minor tech debt we incurred by creating the Docker client in three places.
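The suggested refactor could look roughly like this (an illustrative sketch only; client_timeout, DEFAULT_TIMEOUT, and DEBUG_TIMEOUT are made-up names, though docker-py's default client timeout really is 60 seconds):

```python
DEFAULT_TIMEOUT = 60      # seconds; docker-py's default client timeout
DEBUG_TIMEOUT = 10 * 60   # ~10 minutes while a debugger is attached

def client_timeout(debug_port):
    """Pick the Docker client timeout based on whether debugging is enabled."""
    return DEBUG_TIMEOUT if debug_port is not None else DEFAULT_TIMEOUT

# The single creation point in invoke_context.py would then be roughly:
#   docker_client = docker.from_env(timeout=client_timeout(debug_port))
```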

felixbuenemann (Contributor, Author) commented Jun 1, 2018

> Thanks for the change @felixbuenemann. The problem statement definitely makes sense, but I don't know if ignoring the timeout will fix the problem. If the underlying socket timed out, wouldn't you need to re-connect somehow?

The statement is inside a while True endless loop, so it will simply keep trying until the docker daemon responds. This worked fine for me.

I'm not sure how a 10 minute timeout will help, because a debugging session could also take an hour, so any arbitrary value would likely be wrong.

felixbuenemann (Contributor, Author)

Btw, instead of recovering from socket timeouts (which control how long we wait on blocking operations, not network errors), we could also call socket.settimeout(None) before entering the while loop.
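This alternative is a one-liner, since settimeout(None) puts the socket back into fully blocking mode (make_blocking below is a hypothetical helper, not SAM CLI code):

```python
import socket

def make_blocking(sock):
    """Disable the read timeout so recv() blocks until data arrives."""
    sock.settimeout(None)  # None means block indefinitely
    return sock
```

After this call, recv() never raises socket.timeout; the trade-off is that a hung daemon would block the attach loop forever.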

oniice commented Jun 1, 2018

I've been getting a timeout error when running my Lambda, but it oddly goes away if I add something like a console.log (before that, the Lambda produced no console output).

Manually added this patch on my machine, and it fixes the issue!

Here is the error thrown when running:

```
Traceback (most recent call last):
  File "/usr/local/bin/sam", line 11, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args[1:], **kwargs)
  File "/usr/local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/samcli/commands/local/invoke/cli.py", line 44, in cli
    docker_network, log_file, skip_pull_image, profile)  # pragma: no cover
  File "/usr/local/lib/python2.7/site-packages/samcli/commands/local/invoke/cli.py", line 76, in do_cli
    stderr=context.stderr)
  File "/usr/local/lib/python2.7/site-packages/samcli/commands/local/lib/local_lambda.py", line 80, in invoke
    stdout=stdout, stderr=stderr)
  File "/usr/local/lib/python2.7/site-packages/samcli/local/lambdafn/runtime.py", line 98, in invoke
    container.wait_for_logs(stdout=stdout, stderr=stderr)
  File "/usr/local/lib/python2.7/site-packages/samcli/local/docker/container.py", line 187, in wait_for_logs
    self._write_container_output(logs_itr, stdout=stdout, stderr=stderr)
  File "/usr/local/lib/python2.7/site-packages/samcli/local/docker/container.py", line 206, in _write_container_output
    for frame_type, data in output_itr:
  File "/usr/local/lib/python2.7/site-packages/samcli/local/docker/attach_api.py", line 98, in _read_socket
    payload_type, payload_size = _read_header(socket)
  File "/usr/local/lib/python2.7/site-packages/samcli/local/docker/attach_api.py", line 170, in _read_header
    data = read_exactly(socket, 8)
  File "/usr/local/lib/python2.7/site-packages/docker/utils/socket.py", line 44, in read_exactly
    next_data = read(socket, n - len(data))
  File "/usr/local/lib/python2.7/site-packages/docker/utils/socket.py", line 30, in read
    return socket.recv(n)
socket.timeout: timed out
```

felixbuenemann (Contributor, Author) commented Jun 1, 2018

@oniice Yeah, that's a good point. The container could simply not be writing to stdout or stderr because it's busy and not logging anything, so making the timeout dependent on whether debug_port is set, as suggested by @sanathkr, is a bad idea.

felixbuenemann (Contributor, Author) commented Jun 1, 2018

I looked at the docker-py codebase, and it looks like docker.from_env(timeout=None) should work. Then again, the current solution localizes the fix to only the problematic region instead of setting a higher timeout everywhere.

So while setting the timeout through docker.from_env() would decouple us from the internal implementation, it would mean a higher timeout on all API calls. Since the current code already hijacks the socket from the Docker client through private methods, it is already tightly coupled to the implementation.

@sanathkr Let me know what you think!

felixbuenemann (Contributor, Author)

@oniice Could you please reformat your backtrace to use triple-backticks (fenced code block) instead of single-backticks so it becomes more readable?

benprew commented Jun 15, 2018

I see the same issue for long-running Lambdas that don't output any log messages. My backtrace is the same as @oniice's, and manually applying this patch fixes it.

sanathkr (Contributor) commented Jun 28, 2018

@felixbuenemann

Sorry for the late reply, and thanks for clarifying. A timeout actually does not close the socket; it simply reports that no new data is available. It makes sense, then, to swallow the timeout and keep polling.

Also, from other reports, I understand that this can happen even in the non-debugging case when your function runs for several minutes without printing anything to the console. Without this patch, such invocations will time out and fail.

As much as I don't like the unbounded infinite loop, there is value in doing it. This is especially okay for a CLI tool that isn't used in a production environment.

Can we just add a LOG.debug call to the except timeout block, so we have visibility into timeout issues for future debugging?

felixbuenemann (Contributor, Author)

@sanathkr:

> Can we just add a LOG.debug to the except timeout block, so we can have visibility into timeout issues for future debugging?

I've added a LOG.debug call in e2a96f5.
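The merged fix then has roughly this shape (an illustrative sketch, not the exact SAM CLI source; next_header is a hypothetical name, though Docker attach-stream headers really are 8 bytes):

```python
import logging
import socket

LOG = logging.getLogger(__name__)

def next_header(read_exactly, sock):
    """Read the next 8-byte stream header, swallowing socket.timeout
    but leaving a debug-level breadcrumb for troubleshooting."""
    while True:
        try:
            return read_exactly(sock, 8)
        except socket.timeout:
            LOG.debug("Ignoring socket.timeout while waiting for container output")
```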

sanathkr (Contributor) left a review

Fantastic! Thanks for making this change and sticking through all the discussion 🎉 A one line solution is usually the hardest to reason about :)

Sheep it 🐑

@sanathkr added the stage/accepted (Accepted and will be fixed) and area/docker labels and removed stage/in-review on Jul 2, 2018
@jfuss jfuss merged commit c8344b4 into aws:develop Jul 2, 2018
jfuss (Contributor) commented Jul 2, 2018

@felixbuenemann Thanks for keeping with this! Congrats on the first PR :)

@felixbuenemann felixbuenemann deleted the patch-1 branch July 2, 2018 13:50