Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(pebble): catch socket.timeout exception in pebble.Client.exec() #1155

Merged
merged 3 commits into from Mar 26, 2024

Conversation

IronCore864
Copy link
Contributor

@IronCore864 IronCore864 commented Mar 20, 2024

Tried to run some container.exec in a test charm with sleep, with some other long-running commands like apt-update, with or without timeout, but could not reproduce the issue.

Changed according to John's opinion: catch socket.timeout exception and continue polling till the end of the deadline.

Also updated docstring for model.Container.exec() and pebble.Client.exec().

Closes #826.

@IronCore864 IronCore864 changed the title fix(pebble): catch socket.timeout exception in pebble.Client.exec() and Container.exec() fix(pebble): catch socket.timeout exception in pebble.Client.exec() Mar 20, 2024
@IronCore864 IronCore864 marked this pull request as ready for review March 20, 2024 09:49
Copy link
Collaborator

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the investigation. Can we also add a regression test for this? For example, alongside test_wait_change_timeout in test_pebble.py, which makes the response raise socket.timeout -- which should fail before this change and succeed after? Thanks!

ops/pebble.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple of minor suggestions.

CHANGES.md Outdated Show resolved Hide resolved
ops/pebble.py Show resolved Hide resolved
test/test_pebble.py Outdated Show resolved Hide resolved
Copy link
Contributor

@tonyandrewmeyer tonyandrewmeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also some other issues raised in #826 that I think we should either comment on somewhere (there or in this PR) or adjust. In particular:

  1. The Unix socket has code for providing a timeout, but I don't think there's any way for a charm (or even the Pebble client) to provide that, so it's unclear why it's there. Am I missing some way that it can be set, or some reason to provide it even though it's never used?
  2. The Pebble client has a timeout that's used for (I think) everything other than exec, but Charm's basically can't use this, because it's not exposed to the model. Is that deliberate? Should a charm be able to change this timeout?
  3. It seems like in the original report that the timeout was happening after ~13s, but it should presumably have been 5s. Have we checked that we do see a timeout after 5s and not a longer time?

ops/pebble.py Outdated Show resolved Hide resolved
@IronCore864
Copy link
Contributor Author

There are also some other issues raised in #826 that I think we should either comment on somewhere (there or in this PR) or adjust. In particular:

  1. The Unix socket has code for providing a timeout, but I don't think there's any way for a charm (or even the Pebble client) to provide that, so it's unclear why it's there. Am I missing some way that it can be set, or some reason to provide it even though it's never used?
  2. The Pebble client has a timeout that's used for (I think) everything other than exec, but Charm's basically can't use this, because it's not exposed to the model. Is that deliberate? Should a charm be able to change this timeout?
  3. It seems like in the original report that the timeout was happening after ~13s, but it should presumably have been 5s. Have we checked that we do see a timeout after 5s and not a longer time?

To answer the 3rd question first, the 5s timeout is pebble.Client.timeout, and as John mentioned in the issue, the intent for it is "If we have sent a request to pebble, and it hasn't responded within 5s, loop around again and poll some more."

It's not the timeout for executing the command created by Container.exec().wait(). If we run container.exec(["sleep", "60"]).wait() without a timeout, it will run for 60s for sure. The way to set timeout for the actual command is container.exec(["sleep", "60"], timeout=3).wait(), and this will throw some exception like ops.pebble.ChangeError: cannot perform the following tasks: exec command "sleep" (timed out after 3s: context deadline exceeded).

The code in the original issue doesn't set timeout for exec and is waiting forever, but socket.timeout is raised, that is why we added except socket.timeout. As for why it's 13 seconds, I'm not sure, seems weird, and I could not reproduce socket.timeout.


John did mention that the 5s timeout to pebble can't be configured and I checked, it's true, it can't be set in the entire OF code:

When creating a model, metadata is used, then passed to Unit, which is created by model, using _ModelCache. Then unit creates containers, container initializes pebble client. However, no where in the whole chain is pebble timeout set, so it's always the default 5s.

I think if we want to expose pebble timeout as a configuration item, one way to do so is to add something like pebble-timeout in [ContainerMeta](https://github.com/canonical/operator/blob/main/ops/charm.py#L1688) and tweak the code where it creates containers, so that charmers can set pebble timeout in charmcraft.yaml, containers section.

I don't know if this change is worth it, because A, 5s seems to be an OK value, and B, users probably only care about container.exec(timeout=xxx) which is supported, I don't know why they want to tweak the pebble timeout.

Is it worth adding this feature? Need your input here @benhoyt @tonyandrewmeyer.


Another thing John mentioned was the _UnixSocketConnection timeout issue that we set it even if charmers don't ask for it explicity, but I don't think it's true anymore, we only set it if it's provided. Should we comment in the issue?

ops/pebble.py Show resolved Hide resolved
@tonyandrewmeyer
Copy link
Contributor

tonyandrewmeyer commented Mar 22, 2024

To answer the 3rd question first, the 5s timeout is pebble.Client.timeout, and as John mentioned in the issue, the intent for it is "If we have sent a request to pebble, and it hasn't responded within 5s, loop around again and poll some more."

This is badly named, in my opinion, but we can't do anything about that now.

If I understand correctly, if I do Client(timeout=5).autostart_services(timeout=30), then:

  • A request is sent to Pebble, with a timeout of 5s saying "do this change"
  • A request is sent to Pebble, with a timeout of 5s saying "wait for that change to finish" - this also sends a timeout of 4s to Pebble's wait API. If the request gives a 503, then we try again (until we have used up 30s waiting for the change to happen, in which case we raise TimeoutError - the last attempt might wait less than 4s if there isn't that much left to get to 30s), otherwise we raise an APIError.

The wait API is a bit odd in that we are basically polling, but it seems to have been added to avoid polling.

If I do Client(timeout=5).exec(timeout=30, ...) then:

  • A request is sent to Pebble, with a timeout of 5s saying "run this command" and that includes the 30s timeout as a parameter for Pebble's context.
  • (Bunch of websocket stuff for handling I/O)
  • A request is sent to Pebble, with a timeout of 5s saying "wait for that command to finish" - this also sends a timeout of 4s to Pebble's wait API. If the request gives a 503, then we try again (until we have used up 31s waiting for the change to happen, in which case we raise TimeoutError - the last attempt might wait less than 4s if there isn't that much left to get to 31s), otherwise we raise an APIError.

So they are basically the same underneath.

It's not the timeout for executing the command created by Container.exec().wait(). If we run container.exec(["sleep", "60"]).wait() without a timeout, it will run for 60s for sure.

Yeah, I get that there are a bunch of different timeouts here, the pebble.Client.timeout is mostly an implementation detail around how granular the timeout checks are, and the methods that are expected to be slow (exec, autostart_services, restart_services, etc) all have their own timeout that's passed when they are run (and that's a timeout waiting for Pebble to do whatever you've asked).

Although the ticket summary is about the leaked exception, I think the broader problem (of the two issues combined) is "Pebble timeouts are confusing", and it'd be good to solve that, rather than just the specific leaking exception.

As for why it's 13 seconds, I'm not sure, seems weird, and I could not reproduce socket.timeout.

I would guess that the 13 comes from three attempts - we tried 2 times to get the change result from Pebble, waiting 4s each time, and that all worked as designed, and then on the 3rd attempt instead of the Pebble call timing out, the socket connection timed out after 5s (4+4+5=13).

If so I guess this is fine, because now we'll be catching that socket error and it'll time out after 30s.

I think if we want to expose pebble timeout as a configuration item,

I think we probably do not want to expose this, since it's not what users would think of a timeout. I think we do want to explain in the pebble.Client docs what that timeout attribute actually is.

Another thing John mentioned was the _UnixSocketConnection timeout issue that we set it even if charmers don't ask for it explicity, but I don't think it's true anymore, we only set it if it's provided. Should we comment in the issue?

I looked at this a bit more closely and it's clearer to me now. _UnixSocketConnection takes the timeout argument because it's a subclass of http.client.HTTPConnection, and that has a timeout argument. We do set the UNIX socket to use this timeout (but do it later than we should - you're really meant to set it before connect()). It's passed to _UnixSocketConnection via the urllib.request handling mechanism (that's how the 5s value gets everywhere), so it is always set by Pebble.

@IronCore864
Copy link
Contributor Author

This is badly named, in my opinion, but we can't do anything about that now.

Although the ticket summary is about the leaked exception, I think the broader problem (of the two issues combined) is "Pebble timeouts are confusing", and it'd be good to solve that, rather than just the specific leaking exception.

I agree that "Pebble timeouts are confusing" -- at least this is what I was thinking when I was reading the code.

I think it's too complicated to be completely refactored, also I don't know what the impact circle would be. Still, maybe we can do some minimal refactoring to make it better, even just slightly so?

Off the top of my head, I have the following ideas:

  1. Remove timeout: float = 5.0 from pebble.Client.__init__() because it's not configurable anyway. Make it a hard-coded constant value.
  2. Rename pebble.Client.timeout to something more meaningful, maybe pebble.Client.connection_timeout or pebble.Client.socket_timeout, to make it clear that it's not "the timeout for executing a command in the container". I think this parameter is used for urllib.request.OpenerDirector.open, which says "The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt." So I think maybe pebble.Client.connection_timeout is a better name to be consistent with urllib.request?
  3. Refactor the _wait_change_using_wait method a bit to make it more readable.

@tonyandrewmeyer
Copy link
Contributor

  1. Remove timeout: float = 5.0 from pebble.Client.__init__() because it's not configurable anyway. Make it a hard-coded constant value.

Even though this isn't accessible directly by charms, this would break backwards compatibility so we can't really do it. I don't think there's much we can do other than make the documentation clearer on what this value is (unless we do want to expose it to charms, but as we discussed above, I don't think so).

  1. Rename pebble.Client.timeout to something more meaningful, maybe pebble.Client.connection_timeout or pebble.Client.socket_timeout, to make it clear that it's not "the timeout for executing a command in the container". I think this parameter is used for urllib.request.OpenerDirector.open, which says "The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt." So I think maybe pebble.Client.connection_timeout is a better name to be consistent with urllib.request?

I think a different name would have been great, but this would again break backwards compatibility, so we can't really do it (in ops 2.x).

@benhoyt
Copy link
Collaborator

benhoyt commented Mar 25, 2024

Yeah, and can see how the "Client.timeout" name is confusing. I agree with Tony's assessment above:

I think if we want to expose pebble timeout as a configuration item,

I think we probably do not want to expose this, since it's not what users would think of a timeout. I think we do want to explain in the pebble.Client docs what that timeout attribute actually is.

I don't think we should expose the ability for charmers to change this (no one's asked for that, and 5s has worked well). I'm on board with explaining in the pebble.Client docstring what it means though!

@IronCore864
Copy link
Contributor Author

IronCore864 commented Mar 25, 2024

According to our discussion, I have added a little explanation in pebble.Client docstring. Please take another look.

Copy link
Collaborator

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Let's just remove the 3 extraneous files added under docs (do we need to add something more to .gitignore?)

ops/pebble.py Outdated Show resolved Hide resolved
@IronCore864 IronCore864 force-pushed the container-exec-socket-timeout branch from da85952 to 2d84ce7 Compare March 25, 2024 23:48
@IronCore864 IronCore864 force-pushed the container-exec-socket-timeout branch from 2d84ce7 to bbd134b Compare March 25, 2024 23:58
@IronCore864
Copy link
Contributor Author

Looks good to me. Let's just remove the 3 extraneous files added under docs (do we need to add something more to .gitignore?)

Sorry, it was a bad merge; the history is too complicated so I rebased everything to fix it.

ops/model.py Outdated Show resolved Hide resolved
Copy link
Contributor

@tonyandrewmeyer tonyandrewmeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on this!

@IronCore864 IronCore864 merged commit e8305f6 into canonical:main Mar 26, 2024
28 checks passed
@IronCore864 IronCore864 deleted the container-exec-socket-timeout branch March 26, 2024 03:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExecProcess.wait_output can leak a 'socket.timeout' exception
4 participants