Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make gears.object:emit_signal fail proof (Fix #3667) #3669

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Aire-One
Copy link
Member

@Aire-One Aire-One commented Aug 6, 2022

I guess we need to do something similar in the CAPI, but I'm not sure how 🤷

@codecov
Copy link

codecov bot commented Aug 6, 2022

Codecov Report

Merging #3669 (84712ba) into master (b7bac1d) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #3669   +/-   ##
=======================================
  Coverage   90.65%   90.65%           
=======================================
  Files         854      854           
  Lines       54774    54788   +14     
=======================================
+ Hits        49653    49666   +13     
- Misses       5121     5122    +1     
Flag Coverage Δ
gcov 90.65% <100.00%> (+<0.01%) ⬆️
luacov 93.45% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
lib/gears/object.lua 88.18% <100.00%> (+0.09%) ⬆️
spec/gears/object_spec.lua 92.37% <100.00%> (+0.94%) ⬆️
tests/examples/screen/template.lua 96.52% <0.00%> (-0.22%) ⬇️

@ghost
Copy link

ghost commented Aug 7, 2022

I guess we need to do something similar in the CAPI

I think this is already done. The CAPI wraps function calls in pcall via luaA_dofunction when emitting a signal. I believe it achieve the same as what you're doing here, am I wrong?

@sclu1034
Copy link
Contributor

sclu1034 commented Aug 8, 2022

The premise for this is (from #3667 (comment))

this "user level fix" wouldn't work if this is the callback itself that throw the error. So the proposed PR is still required to correctly mitigate the issue.

But it's not correct that only a change in the library could fix this. The user can wrap "the callback itself" in a pcall, just how you wrapped test in your example.

And I really don't think this should be done by Awesome, especially when it brings something heavy like gears.protected_call into the UI pipeline. gears.protected_call can take more than 5x as long compared to a plain function call, depending on the Lua version. And that will rack up if it's done for every single signal handler.

env LUA=lua5.1 RUNS=10000000 bash run.sh

"print('Hello')"
real    0m3.451s
user    0m3.423s
sys     0m0.007s

"pcall(print, 'Hello')"
real    0m4.521s
user    0m4.501s
sys     0m0.007s

"protected_call(print, 'Hello')"
real    0m17.791s
user    0m17.697s
sys     0m0.011s

For the internal stuff, we have tests (and plenty users running the system) to make sure we don't panic. For user-provided callbacks, like the above, the user can wrap them (only) when they need to. But it gives the user the choice between a performance penalty or potential panics.

@Aire-One Aire-One changed the title Fix 3667 Make gears.object:emit_signal fail proof (Fix #3667) Aug 9, 2022
@Elv13
Copy link
Member

Elv13 commented Oct 2, 2022

@Aire-One What's your take on this? Should we close this PR? Or should we make a new variant of emit signal with the extra pcalls? There are definitively some places where an error in a signal handler can cause a transaction state to be messed up. However should we just wrap those emit_signal in a pcall rather than have a new method?

@sclu1034
Copy link
Contributor

sclu1034 commented Oct 3, 2022

There are definitively some places where an error in a signal handler can cause a transaction state to be messed up.

Continuing despite failure can mess things up just as much, and the messed up transaction from the swallowed error could silently run along and fail somewhere else, making debugging much harder.
Panicking immediately has the benefit that it can't be missed or ignored.

And since it would be the user's callback that panics, they can wrap it themselves, if that is acceptable for their use case.

@actionless
Copy link
Member

i believe it's smth what should be rather better-explained in documentation, not fixed in code

@Aire-One
Copy link
Member Author

Aire-One commented Oct 4, 2022

should we make a new variant of emit signal with the extra pcalls?

No, it wouldn't solve the initial issue.

And I really don't think this should be done by Awesome, especially when it brings something heavy like gears.protected_call into the UI pipeline. gears.protected_call can take more than 5x as long compared to a plain function call, depending on the Lua version.

Yeah, I understand your concern.

I personally would rather have this performance drop than a silent breakage. As pointed out earlier, we already do the same at the C level with luaA_dofunction wrapping the Lua function with a pcall and I don't think anyone ever mentioned a performance issue because of that. (*)

And as you stated, this performance cost is the worst with Lua 5.1 because of the extra business to get around xpcall limitations. So I'm taking these numbers with a grain of salt.
(I'm not sure the impact it actually has on the result, but I'm curious to see the benchmark for other Lua versions.)

There are definitively some places where an error in a signal handler can cause a transaction state to be messed up.

Continuing despite failure can mess things up just as much, and the messed up transaction from the swallowed error could silently run along and fail somewhere else, making debugging much harder.
Panicking immediately has the benefit that it can't be missed or ignored.

I disagree with this. Just as the initial issue shown it: there is no panic. It was even worse, since the supposedly one-time timer was called indefinitely.

since it would be the user's callback that panics, they can wrap it themselves, if that is acceptable for their use case.

i believe it's smth what should be rather better-explained in documentation, not fixed in code

I guess it's ok-ish, but not idiot-proof, even for us, we could introduce a badly written signal callback that silently breaks something.
(Plus, it still can lead to something that looks like a bug from our API (again, see the initial issue))

(*) The fact that the C API already does it for years consolidate my idea we have to do it in Lua as well.

@Elv13
Copy link
Member

Elv13 commented Oct 5, 2022

swallowed error could silently run along and fail somewhere else, making debugging much harder.

Right now, when we do this, we don't swallow the errors, the red popup with stacktrace is still shown, we just make sure the cleanup code isn't skipped.

And since it would be the user's callback that panics, they can wrap it themselves, if that is acceptable for their use case.

The problem is that it causes a lot of problems. Things like dbus messages should have a reply or the client can get locked. Database transaction have similar problem, where things pending GC because they were not properly finished can affect the new transactions. You can in theory exhaust file descriptors if during an animation. It can also cause memory leaks:

local global_cache = {}

client.connect_signal("unmanage", function(c)
    module.emit_signal("request::something", "context", {client=c})
    global_cache[c] = nil
end)

So there are cases where you do want such pcall, it's not universally bad.

As for if we should do it everywhere or not, I don't have a super strong opinion.

@sclu1034
Copy link
Contributor

sclu1034 commented Oct 8, 2022

swallowed error could silently run along and fail somewhere else, making debugging much harder.

Right now, when we do this, we don't swallow the errors, the red popup with stacktrace is still shown, we just make sure the cleanup code isn't skipped.

No, it does not. gears.protected_call does not create a notification, it only prints to stderr. And from what I can tell from interacting with users, the majority doesn't monitor that (most don't even know it's a thing).

And the linked issue even exhibits the "fail somewhere else" component: The callback is the one that panics, but single_shot breaks, only due to the implementation detail of the ordering of the signal handlers.

It can also cause memory leaks:

local global_cache = {}

client.connect_signal("unmanage", function(c)
    module.emit_signal("request::something", "context", {client=c})
    global_cache[c] = nil
end)

That can be fixed easily:

-    module.emit_signal("request::something", "context", {client=c})
+    gears.protected_call(module.emit_signal, module, "request::something", "context", {client=c})

The same can be done within the D-Bus messaging or any database query code.
And only this way, by putting the pcall at the same level as the clean-up code, can actually be guaranteed that the clean-up always runs. Because in most cases there will be more code than just a call to emit_signal, and that other code might panic, too.
Unless you want to go through our entire code base, eliminate/wrap all our asserts and any place where we choose to panic. But that would be a huge change to the entire error handling strategy.

There are definitively some places where an error in a signal handler can cause a transaction state to be messed up.

Continuing despite failure can mess things up just as much, and the messed up transaction from the swallowed error could silently run along and fail somewhere else, making debugging much harder.
Panicking immediately has the benefit that it can't be missed or ignored.

I disagree with this. Just as the initial issue shown it: there is no panic. It was even worse, since the supposedly one-time timer was called indefinitely.

AwesomeWM doesn't crash there, because gears.timer already wraps emit_signal in protected_call.

If the order of these two blocks was reversed, ensuring that the single_shot handler always runs first, a panic in callback would have never been able to affect the unrelated single_shot:

awesome/lib/gears/timer.lua

Lines 207 to 216 in b16f628

if args.callback then
if args.call_now then
args.callback()
end
ret:connect_signal("timeout", args.callback)
end
if args.single_shot then
ret:connect_signal("timeout", function() ret:stop() end)
end

And as you stated, this performance cost is the worst with Lua 5.1 because of the extra business to get around xpcall limitations. So I'm taking these numbers with a grain of salt.
(I'm not sure the impact it actually has on the result, but I'm curious to see the benchmark for other Lua versions.)

protected_call performs much better on later versions, but those versions as a whole perform worse than 5.1:

Lua print('hello') pcall(print, 'hello') protected_call(print, 'hello')
5.1 1.154 1.489 5.1
5.2 6.752 7.017 8.063
5.3 6.318 6.683 7.428
5.4 5.984 6.268 7.163

(different system than the previous benchmarks)

As pointed out earlier, we already do the same at the C level with luaA_dofunction wrapping the Lua function with a pcall and I don't think anyone ever mentioned a performance issue because of that.

I'm not talking about performance "issues", but more like a "nuisance". Adding a bunch of pcalls obviously wouldn't make things come to a crawl. But it can result in slight latency increases, especially when done with O(n) when a O(1) could suffice.
I.e. adding pcalls only where needed, as outlined above, should be negligible. But just throwing it out there "just in case" and separately for every handler will add up to what may become a slight, but tangible change.
Input latency is especially sensitive in that regard.

Additionally, the bigger picture in this context is the fact that we do not have anything to track performance in the long run. Therefore, when changes like this come months or even years apart, they would be considered and decided in isolation. But even though when each change is justifiable in itself, the results will add up, unnoticed.

That can either be remedied by occasionally doing performance optimizations (except I'm not aware of that every being done in recent times, or any profiling in the code base that the optimization could be based on) or by making sure the change is only done where absolutely necessary in the first place (which I'm arguing to do here).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

single_shot in gears.timer not working with a lua error() in the callback
4 participants