
Conversation


@QiWang19 QiWang19 commented Nov 24, 2025

WIP PR: cancelable reader and timeout goroutine. We'd like to package a binary with these changes for the customer to test.

@github-actions github-actions bot added the image Related to "image" package label Nov 24, 2025
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 24, 2025
@podmanbot

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6533

Contributor

@mtrmac mtrmac left a comment


(For reference, compare the earlier #421.)

Testing and research are always good to do.

Do note that we know a bit more than at the time when #421 was filed: there seems to be something wrong with tracking the state of “the write end of the --command-fd pipe”. That pipe is not directly driven by caller-provided inputs to GPGME; so if the hypothesis is correct, and if the state tracking causes incorrectly reused / closed file descriptors, that would still affect, and probably break, CRI-O (compare “the write ends of all of these pipes are owned by conmon” in Jira).

```go
mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(data)
if err != nil {
	return sarRejected, nil, err
}
// Import the keys with a 60s timeout to avoid hanging indefinitely. See issues.redhat.com/browse/OCPBUGS-57893
```
Contributor


I really quite strongly don’t want this complexity in core policy enforcement code.

Also, I think the core complexity of the cleanup attempt here comes from separating the new…Mechanism initialization and defer mech.Close() into different goroutines. I think running all of the PolicyContext.IsRunningImageAllowed within a goroutine with a timeout would automatically handle cleanup (… or not, and leak a C thread+goroutine). In what respect is this approach better?


This is really orthogonal to the other part of the PR: if the approach of terminating GPGME via a failing callback works, this moving of the code into a concurrent goroutine does not need to exist (and continues to have the issues of #421).

Member Author

@QiWang19 QiWang19 Nov 25, 2025


I agree about the complexity, and the hang is still in the C library. For debugging purposes, I'll drop the select waiting on a timeout from the core policy code, so we can debug the cancelable reader.

Thanks for the note.

@mtrmac mtrmac marked this pull request as draft November 25, 2025 03:33
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 25, 2025
```diff
 // FIXME: move this to per-context initialization
-mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(data)
+// Pass the context so the cancelable reader can detect cancellation and potentially abort GPGME operations
+mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(ctx, data)
```
Member Author


Suggested change

```diff
-mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(ctx, data)
+importCtx, cancel := context.WithTimeout(ctx, importKeyTimeOut)
+defer cancel()
+mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(importCtx, data)
```

@mtrmac Updated the code. Do we still need to pass a timeout context down for the cancelable reader to work? If we don't hardcode a timeout here, will the caller cancel the context?

Contributor


I’m afraid I can’t parse the question. Are you asking whether the CRI-O server already has a timeout / request cancellation logic of some kind (maybe triggering when/if the Kubelet client gives up)? That would be a CRI-O question.

And if CRI-O did pass to c/image/copy.Image a context that would in some situations be canceled, yes, I think such a context should be forwarded all the way to this code.


I would, by default, expect the timeout, if any, to be set at a pretty high level; I was thinking one for the whole RPC operation served by CRI-O … except that that doesn’t work well for pulls, which, on slow links, may be making steady progress over many minutes, so there is no good timeout that can be set for the whole pull by default.

So, I guess, the caller of IsRunningImageAllowed in c/image would be at least a better place than here in prSignedBy; IsRunningImageAllowed is not really constant-time, but it is fairly bounded, and it would mean less new code in c/image/signature.

[There’s also CRI-O’s PullProgressTimeout which, last time I looked at it, made unwarranted assumptions about how c/image reports progress, and thus was ~unusable. But if it, hypothetically, did not trigger on false positives where c/image does not report progress but may spend a lot of time, it would trigger cancellation of the context and that should be visible in the GPGME mechanism.]

Signed-off-by: Qi Wang <qiwan@redhat.com>
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 25, 2025