Proposal: stop trying to handle OOM #493

cole-miller · 2023-04-12T03:57:35Z

Currently, dqlite and raft make an effort to handle allocation failures. I would like to explore the idea of unconditionally doing pthread_exit when this happens, or at least something closer to that extreme in the space of possible error-handling strategies. I am not militant about this, but I think it would have some real benefits and in any case it would be good to get clear on what value RAFT_NOMEM et al. are providing (if only for my benefit 🙂).

Possible advantages of trying less hard to handle allocation failure

The allocation failure handling code is a pain to test (though not impossible), and I wouldn't be at all surprised if there are bugs there. The simpler our overall error-handling strategy is, the easier it is for us to implement it consistently and without introducing subtle bugs, and the better we can describe it to users of dqlite.

Possible disadvantages

I'm pretty sure no current user of dqlite depends on it gracefully handling allocation failure. Anybody who's using go-dqlite has already signed up for unrecoverable errors when memory is exhausted. But if dqlite gets more users in the future, via the C client, that have different requirements, then "OOM -> pthread_exit" might become a problem. And it would be pretty annoying to have to backtrack from the more brutal error-handling strategy to the more graceful one.

I could be convinced either way. Thoughts?

freeekanayaka · 2023-04-12T07:06:29Z

Currently, dqlite and raft make an effort to handle allocation failures. I would like to explore the idea of unconditionally doing pthread_exit when this happens, or at least something closer to that extreme in the space of possible error-handling strategies.

Maybe we could offer that as an option, e.g. raft_set_abort_upon_oom()? However, I'd not remove the possibility for the user to handle it gracefully, see below.

I am not militant about this, but I think it would have some real benefits and in any case it would be good to get clear on what value RAFT_NOMEM et al. are providing (if only for my benefit slightly_smiling_face).

If we report an OOM failure (like any other failure) then the user as the choice of deciding what to do. If they want to abort using pthread_exit, that's easy to do. However they might also want do handle the error gracefully, for example in an embedded device with limited memory that should ideally not go down, or maybe because some malicious client is trying to attack the server in some way.

Like all other errors, leaving the choice to the user offers the most flexibility, especially for a low-level piece like libraft.

Possible advantages of trying less hard to handle allocation failure

The allocation failure handling code is a pain to test (though not impossible)

The current testing approach for OOM is basically to have parameterized tests that inject OOM failures progressively. You run the same test multiple times, but each time the memory allocator injects a failure at a different spot (e.g. in the first run the first call to malloc fails right way, in the second run the first call to malloc succeeds but the second fails, etc). That should already cover quite some ground, but we can surely improve it.

and I wouldn't be at all surprised if there are bugs there.

Yeah, we should definitely improve this. However, in most (if not all) cases I believe the bug wouldn't be related to the type of failure (OOM), but to the fact that a failure occurs at all and is not properly handled. In other words if we find such a bug, it's worth being fixed independently from the type of failure, because if later on we modify that particular buggy code in ways that it can legitimately produce other types of failures (non OOM), then the bug is almost certainly going to be still there.

The simpler our overall error-handling strategy is, the easier it is for us to implement it consistently and without introducing subtle bugs, and the better we can describe it to users of dqlite.

Perhaps until we feel more confident about the robustness of our error handling for this particular failure, we could turn on the abort-upon-oom option (raft_set_abort_upon_oom) in the projects for which we are direct raft users, e.g. dqlite and LXD.

Possible disadvantages

I'm pretty sure no current user of dqlite depends on it gracefully handling allocation failure.

Agreed. But that might change in the future, plus dqlite is not the only consumer of libraft.

Anybody who's using go-dqlite has already signed up for unrecoverable errors when memory is exhausted. But if dqlite gets more users in the future, via the C client, that have different requirements, then "OOM -> pthread_exit" might become a problem. And it would be pretty annoying to have to backtrack from the more brutal error-handling strategy to the more graceful one.

Right. I'd leave the door open and make abort-upon-oom an opt-in options, so we can (eventually) meet the needs of both audiences.

MathieuBordere added the Feature New feature, not a bug label Jun 12, 2023

cole-miller closed this as completed Jul 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: stop trying to handle OOM #493

Proposal: stop trying to handle OOM #493

cole-miller commented Apr 12, 2023 •

edited

Loading

freeekanayaka commented Apr 12, 2023

Possible advantages of trying less hard to handle allocation failure

Possible disadvantages

Proposal: stop trying to handle OOM #493

Proposal: stop trying to handle OOM #493

Comments

cole-miller commented Apr 12, 2023 • edited Loading

Possible advantages of trying less hard to handle allocation failure

Possible disadvantages

freeekanayaka commented Apr 12, 2023

Possible advantages of trying less hard to handle allocation failure

Possible disadvantages

cole-miller commented Apr 12, 2023 •

edited

Loading