Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execution of failpoint should not block deactivation #64

Closed
serathius opened this issue Apr 1, 2024 · 6 comments
Closed

Execution of failpoint should not block deactivation #64

serathius opened this issue Apr 1, 2024 · 6 comments
Assignees

Comments

@serathius
Copy link
Member

Useful for etcd-io/etcd#17680

If I use a http and setup a failpoint with a long sleep (e.g. 1s), deactivating it runs into a problem.
Execution and failpoint change run under the same lock. Deactivation will be really delayed as it's racing against execution for the same lock. For example in etcd-io/etcd#17680, I want to inject a sleep into watch write loop, results in continuous execution of the endpoint to sleep. As result it can take a long time to deactivate it.

I would expect that failpoint execution only uses lock to copy the failpoint term, and then execution is done outside of critical section, allowing deactivation to be immediate.

@ahrtr
Copy link
Member

ahrtr commented Apr 8, 2024

@henrybear327 are you interested and have bandwidth to take care of this issue? Thanks

@ahrtr
Copy link
Member

ahrtr commented Apr 8, 2024

I noticed this issue long time ago, but just did not get time to have a deep dive on it.

@henrybear327
Copy link
Contributor

@henrybear327 are you interested and have bandwidth to take care of this issue? Thanks

@ahrtr please assign it to me as I would like to give it a try! Thank you!

@henrybear327
Copy link
Contributor

I will be putting up a PoC in the next couple of days :)

henrybear327 added a commit to henrybear327/gofail that referenced this issue Apr 18, 2024
There are 2 main flows of the gofail library: namely enable/disable and
execution (`Acquire`) of the failpoints.

Currently, a mutex is protecting both flows, thus only one action can
make progress at a time.

This PR proposes a fine-grained mutex, as each failpoint is protected under
a dedicated `RWMutex`.

The existing `failpointsMu` will only be protecting the main shared
data structures, such as `failpoints` map.

Notice that in our current implementation, the execution of the same
failpoint is still sequential (there is a lock within `eval` on the
term being executed)

Reference:
- etcd-io#64
henrybear327 added a commit to henrybear327/gofail that referenced this issue Apr 18, 2024
There are 2 main flows of the gofail library: namely enable/disable and
execution (`Acquire`) of the failpoints.

Currently, a mutex is protecting both flows, thus only one action can
make progress at a time.

This PR proposes a fine-grained mutex, as each failpoint is protected under
a dedicated `RWMutex`.

The existing `failpointsMu` will only be protecting the main shared
data structures, such as `failpoints` map.

Notice that in our current implementation, the execution of the same
failpoint is still sequential (there is a lock within `eval` on the
term being executed)

Reference:
- etcd-io#64

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
henrybear327 added a commit to henrybear327/gofail that referenced this issue May 3, 2024
There are 2 main flows of the gofail library: namely enable/disable and
execution (`Acquire`) of the failpoints.

Currently, a mutex is protecting both flows, thus only one action can
make progress at a time.

This PR proposes a fine-grained mutex, as each failpoint is protected under
a dedicated `RWMutex`.

The existing `failpointsMu` will only be protecting the main shared
data structures, such as `failpoints` map.

Notice that in our current implementation, the execution of the same
failpoint is still sequential (there is a lock within `eval` on the
term being executed)

Reference:
- etcd-io#64

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
henrybear327 added a commit to henrybear327/gofail that referenced this issue May 3, 2024
There are 2 main flows of the gofail library: namely enable/disable and
execution (`Acquire`) of the failpoints.

Currently, a mutex is protecting both flows, thus only one action can
make progress at a time.

This PR proposes a fine-grained mutex, as each failpoint is protected under
a dedicated `RWMutex`.

The existing `failpointsMu` will only be protecting the main shared
data structures, such as `failpoints` map.

Notice that in our current implementation, the execution of the same
failpoint is still sequential (there is a lock within `eval` on the
term being executed)

Reference:
- etcd-io#64

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
henrybear327 added a commit to henrybear327/gofail that referenced this issue May 14, 2024
There are 2 main flows of the gofail library: namely enable/disable and
execution (`Acquire`) of the failpoints.

Currently, a mutex is protecting both flows, thus only one action can
make progress at a time.

This PR proposes a fine-grained mutex, as each failpoint is protected under
a dedicated `RWMutex`.

The existing `failpointsMu` will only be protecting the main shared
data structures, such as `failpoints` map.

Notice that in our current implementation, the execution of the same
failpoint is still sequential (there is a lock within `eval` on the
term being executed)

Reference:
- etcd-io#64

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
henrybear327 added a commit to henrybear327/etcd that referenced this issue May 15, 2024
Reference:
- https://github.com/etcd-io/etcd/pull/17719/files
- etcd-io/gofail#64

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
henrybear327 added a commit to henrybear327/etcd that referenced this issue May 15, 2024
Reference:
- etcd-io#17719
- etcd-io/gofail#64

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
henrybear327 added a commit to henrybear327/etcd that referenced this issue May 17, 2024
Because of etcd-io/gofail#64 being merged,
we can rollback the change.

Reference:
- etcd-io#17719

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
@henrybear327
Copy link
Contributor

@ahrtr I think we can close this issue!

@ahrtr
Copy link
Member

ahrtr commented May 17, 2024

@ahrtr I think we can close this issue!

Thanks.

@ahrtr ahrtr closed this as completed May 17, 2024
henrybear327 added a commit to henrybear327/etcd that referenced this issue Jun 8, 2024
Because of etcd-io/gofail#64 being merged,
we can rollback the change.

Reference:
- etcd-io#17719

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
henrybear327 added a commit to henrybear327/etcd that referenced this issue Jun 9, 2024
Because of etcd-io/gofail#64 being merged,
we can rollback the change.

Reference:
- etcd-io#17719

Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants