no cloud agents: gcp #67

Closed
dustymabe opened this issue Oct 25, 2018 · 21 comments
Labels
cloud* related to public/private clouds

Comments

@dustymabe
Member

dustymabe commented Oct 25, 2018

In #12 we decided that we'd like to try not to ship cloud agents. This ticket will document the investigation and strategy for shipping without a cloud agent on Google Cloud Platform.

See also #41 for a discussion of how to ship cloud specific bits using ignition.

@jdoss
Contributor

jdoss commented Oct 30, 2018

See https://pagure.io/cloud-sig/issue/292#comment-538459 for information about how GCE needs the google-compute-engine-oslogin package installed from the Google Cloud Compute Repo to start the network. This most likely will impact shipping Fedora CoreOS on GCP without this package installed.

@ajeddeloh
Contributor

GCE wanted google-oslogin added to CL to remain at primary tier OS, so we implemented that. It shouldn't be needed for networking and the rest of the GCE stuff we can do in a container like we do on CL (although it might be worth revisiting it since we haven't in a while).

oslogin itself, though, needs to be implemented in the OS since it messes with nsswitch, pam, and sshd. We'll need to conditionally enable it on GCE (could be done with Ignition, and 3.0.0 will make it easier to optionally disable it).

@dustymabe
Member Author

FYI: rpm package reviews for oslogin rpm: https://pagure.io/fedora-server/issue/5#comment-538460

The current discussion in today's meeting was that we would possibly include the oslogin rpm and just conditionally enable it on gce.

@ajeddeloh
Contributor

A little background on oslogin:
On mutable distros there's this script which the agent uses to toggle oslogin on and off. For CL we decided we didn't want to ship that script (seems somewhat brittle if a user modifies those files themselves) and instead enable via a systemd oneshot that runs early on first boot.*

"Normal" Fedora probably wants the google_oslogin_control script. I don't know if we want that for FCOS though (for similar reasons to why we don't ship it in CL). This means we'd need two separate rpms unless dnf/rpm has something like Gentoo's INSTALL_MASK functionality.

*We should be able to do it all with Ignition with spec 3.0.0 (no systemd unit necessary). Trying to do it with the 2.x.y spec is what led me to discover that files, directories, and links are not declarative.

@dustymabe
Member Author

:(

so how do we implement that functionality without the google_oslogin_control script? are we going to have to continuously manage our version of the implementation? Could we somehow convince google to change the script to be more compatible with what we need?

I guess it's worth asking.. Do we need to ship google_oslogin at all or can we get by without it (which is the topic of this ticket anyway, right?)?

@ajeddeloh
Contributor

> so how do we implement that functionality without the google_oslogin_control script?

On CL we don't; we say "you shouldn't be toggling host bits other than when provisioning". I don't know if that's the path we want to take for FCOS or not.

There's also the question of what is the default configuration and what does that look like with a managed /etc. If we ship with oslogin disabled but enable it with Ignition by default, that'll show up as a change in /etc. I don't know if we want that or not.

> Do we need to ship google_oslogin

That's something we need to discuss with the GCE folks. For CL they said it was a requirement to be a first tier OS.

@dustymabe
Member Author

> FYI: rpm package reviews for oslogin rpm: https://pagure.io/fedora-server/issue/5#comment-538460

reviews were approved.. packages should make their way into Fedora soon. Thanks @Conan-Kudo

@Conan-Kudo

@dustymabe dustymabe added the cloud* related to public/private clouds label Dec 13, 2018
@dustymabe dustymabe added the jira for syncing to jira label Jan 9, 2019
@smarterclayton

smarterclayton commented Jan 30, 2019

The biggest blocker I've hit for using this with OpenShift so far is forwarded IPs (set from instance metadata https://github.com/GoogleCloudPlatform/compute-image-packages/blob/master/google_compute_engine/distro_lib/ip_forwarding_utils.py#L78). We have to use NLB for the front ends of our masters, so unless the route is read from instance metadata and then set, NLB health checks never go green.

E.g. for a forwarding rule the above reads curl -H "Metadata-Flavor:Google" "http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/forwarded-ips/0" and sets:

ip route add to local 35.222.92.223 dev eth0 proto 66 scope host
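
As a rough illustration of what that agent step amounts to, here is a minimal Python sketch, not the agent's actual code. It reads one forwarded IP from the metadata URL quoted above and builds the same `ip route` command. The function names and the `eth0` default are assumptions made for this example, and the fetch only works from inside a GCE instance.

```python
import subprocess
import urllib.request

# URL pattern quoted in the comment above; {index} is the position of the
# forwarded IP on the first network interface.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/network-interfaces/0/forwarded-ips/{index}"
)


def fetch_forwarded_ip(index=0, timeout=5):
    """Read one forwarded IP from the metadata server (GCE instances only)."""
    request = urllib.request.Request(
        METADATA_URL.format(index=index),
        headers={"Metadata-Flavor": "Google"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode().strip()


def local_route_command(ip, device="eth0"):
    """Build the `ip route` invocation from the comment for one forwarded IP."""
    return ["ip", "route", "add", "to", "local", ip,
            "dev", device, "proto", "66", "scope", "host"]


def install_forwarded_route(index=0, device="eth0"):
    """Fetch the forwarded IP and add the local route (hypothetical helper, needs root)."""
    ip = fetch_forwarded_ip(index)
    subprocess.run(local_route_command(ip, device), check=True)
```

Only `local_route_command` is pure; `install_forwarded_route` would have to run as root on the instance, for example from a privileged container or a oneshot unit.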

@bgilbert
Contributor

Enabling OS Login requires modifying several monolithic files (nsswitch.conf, /etc/pam.d/sshd, sshd_config) only on GCP, which is inconvenient.

The sshd_config changes are specifically to add an AuthorizedKeysCommand. The current plan for #139 is to implement our own AuthorizedKeysCommand to read authorized_keys.d fragments, and that command could chain to a script which conditionally runs the OS Login AuthorizedKeysCommand when enabled on GCP.
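
The chaining idea could look roughly like the Python sketch below. This is only an illustration under assumptions, not the design settled in #139: the fragment directory layout, the function names, and the way the OS Login command is passed in are all hypothetical. It collects keys from authorized_keys.d fragments first and appends whatever the OS Login command prints, but only when a command is supplied.

```python
import subprocess
from pathlib import Path


def read_fragments(fragment_dir):
    """Return key lines from every file in fragment_dir, in sorted filename order."""
    keys = []
    directory = Path(fragment_dir)
    if not directory.is_dir():
        return keys
    for fragment in sorted(directory.iterdir()):
        if not fragment.is_file():
            continue
        for line in fragment.read_text().splitlines():
            line = line.strip()
            # Skip blank lines and comments, keep everything else verbatim.
            if line and not line.startswith("#"):
                keys.append(line)
    return keys


def authorized_keys(user, fragment_dir, oslogin_command=None):
    """Combine fragment keys with the output of an optional chained command.

    oslogin_command is a hypothetical argv list for the OS Login
    AuthorizedKeysCommand; pass None when OS Login is not enabled.
    """
    keys = read_fragments(fragment_dir)
    if oslogin_command is not None:
        result = subprocess.run(
            list(oslogin_command) + [user],
            capture_output=True, text=True, check=False,
        )
        # A failing chained command should not discard the local keys.
        if result.returncode == 0:
            keys.extend(l for l in result.stdout.splitlines() if l.strip())
    return keys
```

A real AuthorizedKeysCommand would print the resulting keys to stdout, one per line, since that is what sshd reads.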

@bgilbert bgilbert changed the title no cloud agents: gce no cloud agents: gcp Mar 20, 2019
@cevich

cevich commented Apr 2, 2019

What about the google-startup-script (and shutdown script) "agents"? I depend on an external integrated VM management service that uses it. That service downloads and starts its own agent by injecting runtime-specific details through the google-startup-script service. It can't work through cloud-init or similar; it's specific to that google-startup-script API. Is this use-case covered?

@bgilbert
Contributor

bgilbert commented Apr 2, 2019

@cevich We don't plan to support those startup/shutdown scripts. Fedora CoreOS should be configured by passing an Ignition config in userdata. That config can download and install your agent. Or, if you'd prefer to continue using your existing script, the Ignition config can install a ConditionFirstBoot systemd unit which runs the script.
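
To make the second option concrete, here is a hedged Python sketch that generates such an Ignition config. It assumes the spec 3.0.0 layout mentioned earlier in this thread. The script path, unit name, and `build_ignition_config` helper are made up for the example, and the unit contents are one plausible way to run a script once, not a recommended template.

```python
import json
from urllib.parse import quote


def build_ignition_config(script_text,
                          script_path="/usr/local/bin/provision.sh",
                          unit_name="run-provision-script.service"):
    """Return an Ignition config (as a dict) that writes a script and runs it on first boot."""
    unit = "\n".join([
        "[Unit]",
        "Description=Run provisioning script on first boot",
        # systemd skips the unit on every boot except the first one.
        "ConditionFirstBoot=yes",
        "Wants=network-online.target",
        "After=network-online.target",
        "",
        "[Service]",
        "Type=oneshot",
        "RemainAfterExit=yes",
        f"ExecStart={script_path}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ])
    return {
        "ignition": {"version": "3.0.0"},
        "storage": {"files": [{
            "path": script_path,
            "mode": 0o755,  # serialized as decimal 493 in JSON
            # Inline the script as a percent-encoded data URL.
            "contents": {"source": "data:," + quote(script_text)},
        }]},
        "systemd": {"units": [{
            "name": unit_name,
            "enabled": True,
            "contents": unit,
        }]},
    }


if __name__ == "__main__":
    print(json.dumps(build_ignition_config("#!/bin/sh\necho provisioned\n"), indent=2))
```

The printed JSON is what would be passed as userdata. Whether an existing startup script runs unmodified depends on what it expects from the host.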

@cevich

cevich commented Apr 2, 2019

You're assuming an open-source service, and that they care about supporting a one-off special case for a non-standard OS, on a major cloud platform. As a user of a third-party service like this, that makes my choice effectively:

  • Ask them nicely and pray
  • Re-implement my entire stack for the sake of a "new OS"

There's nearly zero incentive for third parties to add the required special-case support when every other OS happily plays along, whether or not agent-services are a good idea. It doesn't even have to be GCE-specific; there are plenty of other third parties which require host-agents. If the OS makes it difficult to integrate, the OS will simply be placed on the "not supported" list and lose out in the long run. I don't think that's the desire of the community here, whatever the specific philosophy over "agents" is.

@bgilbert
Contributor

bgilbert commented Apr 2, 2019

@cevich Fedora CoreOS is continuing the Container Linux philosophy of providing an opinionated, minimal, and reasonably legacy-free OS for running containers. Part of being opinionated is that not everyone will agree with our opinions, and that's okay. If you want flexibility beyond what Fedora CoreOS is prepared to provide, other distros (including Fedora Cloud Base) could be a great choice.

Fedora CoreOS doesn't support cloud-init, whose design has unfixable race conditions. It strongly discourages installing software in the host, in favor of running all user software in containers. It favors immutable infrastructure and reprovisioning rather than configuration management. So existing tooling will already need to be adapted to work well with Fedora CoreOS.

As to this bug, the principle is that provisioning setup for a machine should always be encapsulated into the Ignition config, rather than passed via a platform-specific agent.

@cevich

cevich commented Apr 3, 2019

Thanks for the details and explanation. For me that means I won't use this for testing container run-times and related tooling...ironic as that is. That said, being SO strongly opinionated seems (IMHO) to make this OS overly difficult to use. History provides plenty of examples where difficult-to-use inventions are simply not used and therefore ultimately fail. I think the principles here are "cool", and would like to see it be successful. IMHO, that probably necessitates additional flexibility of opinions.

@bgilbert
Contributor

bgilbert commented Apr 3, 2019

> That said, being SO strongly opinionated seems (IMHO) to make this OS overly difficult to use.

I hope you'll give Fedora CoreOS a try when we're a little further along; it's easier to use than you might think. 😃 (We don't have much documentation right now, which is a problem, but we're working on that.)

Fedora CoreOS's opinions are pretty closely aligned with Container Linux, and they've served that community pretty well for several years now. We're trying to make things easier, not harder, honest.

@cevich

cevich commented Apr 3, 2019

I know you are. I'm just thinking of all the "agents" out there which must run with privileges, on the host, and may not be conducive to being written into container images. This will especially be a problem in cases where the software or service is closed-source/proprietary. The sad fact is, many environments are like this, especially in government and health-care. Requiring little bits of "internal malware" if you will...because management always knows best.

Possibly not an issue for Fedora, but as that rolls down into CentOS/elsewhere it will become a monumental obstacle to adoption. As in my case, the user's choice may literally be: "Ask nice and pray" 😞 We have exactly zero control/influence with what third-parties do, especially with cloud APIs that we also have no control over.

@darkmuggle
Contributor

I recently did a deep dive on the GCE agent for some work getting OpenShift to run in GCE. The use-case that I needed to solve was L4 load balancing (aka Network Load Balancers).

I came up with a proof-of-concept [1] which only runs the Network Configuration and the Clock Skew daemon.
podman run -d --privileged=true --net=host quay.io/behoward/gce-container works as expected.

@dustymabe dustymabe removed the jira for syncing to jira label Sep 5, 2019
@bgilbert
Contributor

For the record, the new Go-based GCP agent is here and the new OSLogin repo is here.

@Conan-Kudo

Yeah, I found out they were reworking this so I had stopped my rebase work. I guess it's ready now to be put into Fedora...

Not that I like Go at all for this (I really, really, really don't), but at least this means it's shippable for FCOS.

@dustymabe
Member Author

I broke the OS Login part out into #648. I'm going to close this ticket since we've got a GCP image now and no agent seems to be going fine. We can start new discussions in new tickets.
