Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aardvark,commit: acquire fs lock when performing commit to avoid race across parallel invocations. #375

Merged

Conversation

flouthoc
Copy link
Collaborator

@flouthoc flouthoc commented Aug 9, 2022

  • We should avoid overriding configs when another instance of aardvark
    is trying to commit configs on the same path.

  • On certain system a race exists where more than one aardvark instance
    are started in the frame where one instance has not yet completed
    updating its aardvark.pid causing more than one instance to get
    started and eventually causing conflits on the requested ports.

Some conditions are reported on low power hardware which tries to start a significant amount of containers and it looks like the case matches with what is described in the second point.

* We should avoid overriding configs when another instance of aardvark
  is trying to commit configs on the same path.

* On certain system a race exists where more than one aardvark instance
  are started in the frame where one instance has not yet completed
  updating its `aardvark.pid` causing more than one instance to get
  started and eventually causing conflits on the requested ports.

Some conditions are reported on `low power hardware which tries to start
a significant amount of containers` and it looks like the case matches
with what is described in the second point.

Signed-off-by: Aditya R <arajan@redhat.com>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 9, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flouthoc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Aug 9, 2022
// Acquire fs lock to ensure other instance of aardvark cannot commit
// or start aardvark instance till already running instance has not
// completed its `commit` phase.
let lockfile_path = Path::new(&self.config)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path for lock file is intentionally <run-root-base>/networks/aardvark.lock instead of <run-root-base>/networks/aardvark-dns/aardvark.lock to make this patch backward compatible with older aardvark versions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @Luap99 @mheon just fyi we will have a lock file now for aardvark commits, but we can change path later if there is any disagreement on this.

@flouthoc
Copy link
Collaborator Author

flouthoc commented Aug 9, 2022

@baude @mheon PTAL, could we ask the folks to try this patch on top of netavark v1.1.0 and aardvark v1.1.0.

I have played while trying to run multiple containers in parallel but it is exactly like what is mentioned in the issue that race is very rare and seen on low power hardware ( I think on the edge platforms ) for which the issue was reported against.

So while I was not able to reproduce the actual issue but I think following patch should close it.

/hold till we get ackd.

@baude
Copy link
Member

baude commented Aug 9, 2022

/hold

@baude
Copy link
Member

baude commented Aug 27, 2022

Confirmed this fixed the issue reported ... ready for merging.

LGTM

@baude
Copy link
Member

baude commented Aug 27, 2022

/hold cancel

@baude
Copy link
Member

baude commented Aug 27, 2022

@mheon @Luap99 please review and merge

@TomSweeneyRedHat
Copy link
Member

Addresses: https://bugzilla.redhat.com/show_bug.cgi?id=2116481 Once this BZ merges, please assign the BZ to Jindrich.

@mheon
Copy link
Member

mheon commented Aug 30, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Aug 30, 2022
@openshift-merge-robot openshift-merge-robot merged commit c080296 into containers:main Aug 30, 2022
16 checks passed
@Luap99
Copy link
Member

Luap99 commented Sep 5, 2022

Podman already locks all operations in the network backend so there should never be two different netavark processes at any given time.

I think the problem here is that we do not wait for aardvark setup. Currently we return before the child writes its pid. So the actual fix is to make aardvark parent wait until the child wrote the pid file.
I don't think this can fix the double start issue at all, it just made it less likely.

@baude
Copy link
Member

baude commented Sep 5, 2022

@flouthoc please consider @Luap99 feedback ... and if in agreement, submit a pr

@flouthoc
Copy link
Collaborator Author

flouthoc commented Sep 5, 2022

@Luap99 I think we can verify the pid here , just after we get ackd from ready_pipe_read that should have a similar effect https://github.com/containers/aardvark-dns/blob/main/src/commands/run.rs#L40 will open a PR for this, although a lock for commit just adds safety.

@flouthoc
Copy link
Collaborator Author

flouthoc commented Sep 6, 2022

@Luap99 Lets discuss this here: containers/aardvark-dns#220

@baude
Copy link
Member

baude commented Sep 28, 2022

/cherry-pick v1.1.0-rhel

@openshift-cherrypick-robot

@baude: new pull request created: #421

In response to this:

/cherry-pick v1.1.0-rhel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mheon
Copy link
Member

mheon commented Sep 28, 2022

/cherry-pick v1.0.1-rhel

@openshift-cherrypick-robot

@mheon: #375 failed to apply on top of branch "v1.0.1-rhel":

Applying: aardvark,commit: acquire fs lock when performing commit
Using index info to reconstruct a base tree...
M	Cargo.lock
M	Cargo.toml
M	src/dns/aardvark.rs
Falling back to patching base and 3-way merge...
Auto-merging src/dns/aardvark.rs
CONFLICT (content): Merge conflict in src/dns/aardvark.rs
Auto-merging Cargo.toml
CONFLICT (content): Merge conflict in Cargo.toml
Auto-merging Cargo.lock
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 aardvark,commit: acquire fs lock when performing commit
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick v1.0.1-rhel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants