
Proxied update connections are broken in F38 #1477

Closed
fifofonix opened this issue Apr 19, 2023 · 18 comments

@fifofonix

Describe the bug

F37 servers configured behind a corporate proxy can no longer apply rpm-ostree updates once they have upgraded to F38. Newly provisioned F38 servers are similarly affected.

Related to: https://bugzilla.redhat.com/show_bug.cgi?id=2185433

Reproduction steps

  1. Configure a server as described here: https://docs.fedoraproject.org/en-US/fedora-coreos/proxy/
  2. Zincati detects available updates correctly via the proxy (sudo systemctl status zincati)
  3. rpm-ostreed times out when attempting to stage the release (sudo systemctl status rpm-ostreed)
rpm-ostree[52378]: libostree HTTP error from remote fedora for <https://ostree.fedoraproject.org/mirrorlist>: Timeout was reached
rpm-ostree[52378]: Txn Deploy on /org/projectatomic/rpmostree1/fedora_coreos failed: While pulling bfbc0cd30068bd5a7eaac5bac2f0420d01652f073fee64c8d2b0b37868c801e7: While fetching mirrorlist 'https://ostree.fedoraproject.org/mirrorlist'
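
For anyone triaging a node in this state, a rough way to confirm the symptoms from a shell (a sketch: it assumes the proxy was configured via systemd drop-ins for zincati.service and rpm-ostreed.service, as in the linked docs):

# Confirm the proxy drop-ins are in effect for both services:
systemctl cat zincati.service rpm-ostreed.service | grep -i proxy

# Zincati should report that an update was detected, while rpm-ostreed logs the mirrorlist timeout:
sudo systemctl status zincati rpm-ostreed --no-pager
sudo journalctl -u rpm-ostreed --since "1 hour ago" | grep -iE 'mirrorlist|timeout'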

Expected behavior

OS updates should continue to be applied unattended as previously.

Actual behavior

The node is stuck, unable to apply any future updates.

It is possible to roll back to F37 on nodes that have upgraded. However, critically, watch out for: #1473
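
For reference, the rollback on an upgraded node would look roughly like this (standard rpm-ostree commands, shown as a sketch; see #1473 before relying on it):

rpm-ostree status                  # check that the previous F37 deployment is still present
sudo rpm-ostree rollback --reboot  # make it the default deployment again and reboot into it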

System details

  • Nodes behind a corporate proxy
  • VMware (version 38.20230414.1.0 (2023-04-14T10:19:13Z))

Butane or Ignition config

No response

Additional information

No response

@travier
Member

travier commented Apr 19, 2023

We paused the F38 rollout: coreos/fedora-coreos-streams#700

@fifofonix
Author

Not at all sure yet what the root cause behind the BZ is.

But I wonder whether some of the historical means of configuring proxies might work, to avoid having to roll back or re-provision a server (as a temporary solution): coreos/rpm-ostree#762

@jmarrero
Member

Thank you so much for the detailed bug report @fifofonix!

It looks like the issue is with the curl/libcurl packages: downgrading to libcurl/curl 7.86.0-4 solves the issue, and upgrading to 8.0.1-2 works as expected too.

I can reproduce the issue consistently with the first curl/libcurl 7.87 build (7.87.0-1) and with the last two F38 builds, 7.87.0-6 and 7.87.0-7.

I have reached out to the curl maintainer in the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=2185433) and we are trying to pinpoint the fix that introduced the regression in the curl code.
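
To check which build a given node is running (a sketch; curl and libcurl-minimal are the package names used elsewhere in this thread):

rpm -q curl libcurl-minimal   # 7.87.0-* builds show the problem; 7.86.0-4 and 8.0.1-2 reportedly do not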

@travier
Member

travier commented Apr 21, 2023

@jmarrero
Member

Verified that the update works for me with:

rpm-ostree override replace https://bodhi.fedoraproject.org/updates/FEDORA-2023-eec1379708

@dustymabe
Member

@dustymabe dustymabe changed the title OS Updates initiated from behind a Corporate Proxy Stop Working With F38 (at least up to 38.20230417.1.0) Proxied update connections are broken in F38 Apr 21, 2023
@dustymabe dustymabe added status/pending-testing-release Fixed upstream. Waiting on a testing release. status/pending-next-release Fixed upstream. Waiting on a next release. labels Apr 21, 2023
@jlebon
Member

jlebon commented Apr 21, 2023

So how do we want to tackle the semi-broken testing release (38.20230414.2.0)? Presumably, there are nodes on there that use a proxy and will stay stuck there. Should we send a coreos-status email with instructions to either roll back, reprovision, or do an override replace with the fixed libcurl?

We should also consider marking the release as a deadend some time after the rollout for the fixed testing release has finished. Nodes that aren't stuck will have upgraded. Nodes that are stuck will get the MOTD. (We could mark as a deadend immediately; IIRC it doesn't prevent upgrades from happening, but the MOTD would be incorrect on nodes that aren't actually stuck.)

@dustymabe
Member

The fix for this went into testing stream release 38.20230414.2.1. Please try out the new release and report issues.

@dustymabe dustymabe removed the status/pending-testing-release Fixed upstream. Waiting on a testing release. label Apr 21, 2023
@dustymabe
Member

dustymabe commented Apr 21, 2023

So how do we want to tackle the semi-broken testing release (38.20230414.2.0)? Presumably, there are nodes on there that use a proxy and will stay stuck there. Should we send a coreos-status email with instructions to either roll back, reprovision, or do an override replace with the fixed libcurl?

Yeah, we should probably send a status post. I hesitate to instruct people to roll back at this point because it's a major rollback and there might be some things that don't work as a result (even if they were just going to immediately re-upgrade).

override replace sounds the nicest, but we'd have to make sure they removed the override after upgrading. I guess we could just give them instructions to make it ephemeral so the override wouldn't have to be removed later.
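
For completeness, the non-ephemeral variant and the cleanup it would need later might look like this (a sketch built around the override replace command posted above; run the reset only once a release containing the fix is available):

# Persistent override: keeps the fixed curl/libcurl build across deployments:
sudo rpm-ostree override replace https://bodhi.fedoraproject.org/updates/FEDORA-2023-eec1379708
# Later, after a fixed release is out, drop all active overrides and upgrade normally:
sudo rpm-ostree override reset --all
sudo rpm-ostree upgrade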

We should also consider marking the release as a deadend some time after the rollout for the fixed testing release has finished. Nodes that aren't stuck will have upgraded. Nodes that are stuck will get the MOTD. (We could mark as a deadend immediately; IIRC it doesn't prevent upgrades from happening, but the MOTD would be incorrect on nodes that aren't actually stuck.)

If the MOTD is all we'd get from that, I think I'd vote not to do this. We'd also have to dead-end every F38 next release so far.

I imagine nodes behind a proxy aren't a huge part of our user base, or else we would have had a user report this problem before it got to testing? Though maybe none of them are running next like @fifofonix is (thank you @fifofonix!).

@fifofonix
Author

Interested in bringing a handful of next nodes (and one testing node) back into the fold. What is the ephemeral way to do this rpm-ostree override replace? I presume one advantage of this method is that it would also address newly provisioned nodes on the now-defunct release, as well as nodes that have upgraded into this state (of which I have a few)?

@dustymabe
Member

dustymabe commented Apr 24, 2023

On x86_64, maybe try something like this (untested):

sudo systemctl stop zincati
sudo rpm-ostree usroverlay
sudo rpm -Uvh https://kojipkgs.fedoraproject.org//packages/curl/7.87.0/8.fc38/x86_64/curl-7.87.0-8.fc38.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/curl/7.87.0/8.fc38/x86_64/libcurl-minimal-7.87.0-8.fc38.x86_64.rpm
sudo systemctl start zincati

The update rollout window for the new release starts this morning so you might not see an update happen immediately.

@fifofonix
Author

A slight modification to the above did yield an update on a 'stuck' node. For the time being, pending the start of the new rollout, the update was to an equally 'stuck' 38.20230417.1.0, but the point is that this proves how to move forward.

sudo systemctl stop zincati
sudo rpm-ostree usroverlay
sudo rpm -Uvh https://kojipkgs.fedoraproject.org//packages/curl/7.87.0/8.fc38/x86_64/curl-7.87.0-8.fc38.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/curl/7.87.0/8.fc38/x86_64/libcurl-minimal-7.87.0-8.fc38.x86_64.rpm
sudo systemctl restart rpm-ostreed
sudo systemctl start zincati

@dustymabe
Member

For the time being, pending the start of the new rollout, the update was to an equally 'stuck' 38.20230417.1.0

Indeed. We haven't rolled out this fix to next yet.

@fifofonix
Author

fifofonix commented Apr 25, 2023

Note that for some reason this solution did not work for me on the sole testing node that I had let upgrade to the latest 'stuck' version. The sudo rpm step just hung as if it was having connection issues. I ended up rolling back the node via the GRUB prompt, after which the server promptly upgraded to the latest testing version without issue. A side effect of the rollback/upgrade is that it validates the closure of #1473.

@dustymabe
Member

dustymabe commented Apr 25, 2023

The sudo rpm step just hung as if it was having connection issues.

This could make sense if you require a proxy to get to the internet. I guess we'd need to modify the steps to say "download and copy over x,y RPMs to the affected nodes" before running the rpm -Uvh.
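
In practice, that modified procedure might look roughly like this (a sketch, untested: core@affected-node is a placeholder, and the Koji URLs are the ones from the earlier comments):

# On a machine that can still reach the internet, fetch the fixed packages:
curl -LO https://kojipkgs.fedoraproject.org//packages/curl/7.87.0/8.fc38/x86_64/curl-7.87.0-8.fc38.x86_64.rpm
curl -LO https://kojipkgs.fedoraproject.org//packages/curl/7.87.0/8.fc38/x86_64/libcurl-minimal-7.87.0-8.fc38.x86_64.rpm

# Copy them to the affected node (core@affected-node is a placeholder):
scp curl-7.87.0-8.fc38.x86_64.rpm libcurl-minimal-7.87.0-8.fc38.x86_64.rpm core@affected-node:

# Then, on the affected node, install from the local files instead of over the network:
sudo systemctl stop zincati
sudo rpm-ostree usroverlay
sudo rpm -Uvh ./curl-7.87.0-8.fc38.x86_64.rpm ./libcurl-minimal-7.87.0-8.fc38.x86_64.rpm
sudo systemctl restart rpm-ostreed
sudo systemctl start zincati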

@dustymabe
Member

The fix for this went into next stream release 38.20230430.1.0. Please try out the new release and report issues.

@dustymabe
Member

This issue never affected our stable stream.

@dustymabe dustymabe removed the status/pending-next-release Fixed upstream. Waiting on a next release. label May 3, 2023
@fifofonix
Author

Note that if trying to fix/patch an affected next server, you may need to follow the steps outlined above more than once. For example, upgrading from 38.20230322.1.0 will yield the broken 38.20230417.1.0, but repeating the steps above will then reach the fixed 38.20230430.1.0.
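
A quick way to tell whether another pass is needed is to check where the node landed after each upgrade (standard commands, shown as a sketch):

rpm-ostree status               # lists the booted and staged deployments with their versions
sudo systemctl status zincati   # confirm Zincati is active and waiting for the next rollout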
