Split operator to separate out cloud provider specific code #9920

Closed
tgraf opened this issue Jan 21, 2020 · 20 comments · Fixed by #10972
Labels: area/operator (Impacts the cilium-operator component) · kind/feature (This introduces new functionality.)
@tgraf
Member

tgraf commented Jan 21, 2020

Summary

The cilium-operator currently imports cloud-provider specific SDKs, which bloat the binary size and add unnecessary complexity when the operator runs outside the context of that cloud provider.

Details

TBD

tgraf created this issue from a note in 1.8.0 (Core Agent) Jan 21, 2020
tgraf added the kind/feature (This introduces new functionality.) and area/operator (Impacts the cilium-operator component) labels Jan 21, 2020
tgraf mentioned this issue Feb 6, 2020
tklauser moved this from Core Agent to Load-balancing in 1.8.0 Feb 24, 2020
tklauser moved this from Load-balancing to Core Agent in 1.8.0 Feb 24, 2020
@errordeveloper
Contributor

Per discussion in the meeting today, the first task here will be to define more details. Just a few initial thoughts below.

Two paths I can think of so far are these (I think we mentioned this on the call):

  1. provider-specific code runs in a separate process and there is some kind of an API, perhaps gRPC
  • packaging: could be a separate pod/service, or a sidecar
  • general considerations:
    • API needs to be defined (a rough sketch of what such a contract might cover follows below)
      • even if it were just a mapping of the Go library we have now, it presumably becomes a contract that we will need to maintain going forward
    • API access needs to be secure
    • remote API would introduce loose coupling, thereby making it potentially more extensible and enabling separate release cycles
  2. provider-specific code is part of the main operator binary and is called as a Go library
  • packaging: a separate version of the binary and image are built for each provider, the right image is picked during installation
  • general considerations:
    • current Go library API can be used (possibly with some minor changes)
    • it can arguably also be seen as a contract, but I believe it's easier for us to evolve since it's tightly coupled
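To make either path a bit more concrete, here is a rough sketch of the kind of provider-facing contract that would need to be pinned down; the package, interface and method names below are hypothetical and do not reflect the current operator code:

```go
// Hypothetical contract a cloud provider implementation would satisfy,
// whether it is linked in as a Go library (path 2) or fronted by a
// gRPC service (path 1). Purely illustrative.
package allocator

import "context"

// Provider abstracts the cloud-specific IPAM logic the operator needs.
type Provider interface {
	// Init sets up SDK clients and any caches needed for the cluster.
	Init(ctx context.Context) error
	// AllocateIPs reserves count IPs for the given node and returns them.
	AllocateIPs(ctx context.Context, nodeName string, count int) ([]string, error)
	// ReleaseIPs hands previously allocated IPs back to the provider.
	ReleaseIPs(ctx context.Context, nodeName string, ips []string) error
}
```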

@errordeveloper
Contributor

From the first look at the operator code, I think it would make sense to separate out some packages, as it's currently a single (and relatively large) main package.

@errordeveloper
Contributor

I also noted that the operator currently depends on pkg/probe, which uses bpf.Map, and that appears to be unintentional.

@errordeveloper
Contributor

@tgraf do we mostly care about the size of the binary and the image, or might we also want to isolate the dependencies?
(I don't know if it is possible to have separate go.mod files in one repo, but even if we had to extract the operator's main package into a provider-specific repo, that could still be doable. Just wondering if that matters at all.)

I am mostly leaning towards the route where there are just separate binaries/images, it just seems like the path of least resistance to start with.

Since this was referenced from #10056, I assume the binary size matters to us primarily for the impact on memory usage, and not so much for image download time. If that's the case, we can probably keep the image the same, just have separate binaries, which would make the build changes simpler.

@errordeveloper
Contributor

Thinking more about the route with separate binaries, it looks like there are two ways:

  1. separate main packages
  2. one main package, different allocators are included based on build tags (to illustrate: `for provider in none aws azure; do go build -tags "$provider" -o "cilium-operator-${provider}" ; done`); a file-level sketch of this route follows below
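As a purely illustrative sketch of the build-tag route (the file name, tag and function below are made up, not the real operator layout):

```go
// File: operator/allocator_aws.go
// Compiled only when building with `go build -tags aws`, so the AWS SDK
// ends up in the aws-flavoured binary and in nothing else.
// +build aws

package main

// startAllocator would wire up the AWS ENI allocator; the default build
// would provide a no-op version of this function in a sibling file
// guarded by `// +build !aws,!azure`.
func startAllocator() error {
	// ... AWS-specific setup (and AWS SDK imports) would live here ...
	return nil
}
```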

@tklauser
Member

Since this was referenced from #10056, I assume the binary size matters to us primarily for the impact on memory usage, and not so much for image download time. If that's the case, we can probably keep the image the same, just have separate binaries, which would make the build changes simpler.

Exactly, afaics binary size mainly matters because it increases the RSS of the process. Regarding image size, I think the operator size is not the main factor, and we also have some other leverage there to reduce size; see e.g. #10542

@tklauser
Member

2. one `main` package, different allocators are included based on build tags (to illustrate: `for provider in none aws azure; do go build -tags "$provider" -o "cilium-operator-${provider}" ; done`)

I'm not very familiar with the operator yet, but I'd guess this way would be easier to implement given the current build system. It would probably also need less duplicated code than the separate main packages solution.

@aanm
Member

aanm commented Mar 20, 2020

@tgraf do we mostly care about the size of the binary and the image, or might we also want to isolate the dependencies?
(I don't know if it is possible to have separate go.mod files in one repo, but even if we had to extract the operator's main package into a provider-specific repo, that could still be doable. Just wondering if that matters at all.)

I am mostly leaning towards the route where there are just separate binaries/images, it just seems like the path of least resistance to start with.

Since this was referenced from #10056, I assume the binary size matters to us primarily for the impact on memory usage, and not so much for image download time. If that's the case, we can probably keep the image the same, just have separate binaries, which would make the build changes simpler.

It's simpler to have a single go.mod file in the entire repo; the main concern is around the binary size and the dependency isolation for each binary. For example, a cilium-operator dedicated to GCE does not need to import any packages from AWS.

@errordeveloper
Contributor

errordeveloper commented Mar 20, 2020

It's simpler to have a single go.mod file in the entire repo; the main concern is around the binary size and the dependency isolation for each binary. For example, a cilium-operator dedicated to GCE does not need to import any packages from AWS.

Sure, thanks for clarification! I just wanted to double-check, as sometimes dependency sub-tree conflicts can be an issue, but I guess SDKs do maintain good dependency hygiene and don't introduce conflicting versions of k8s.io/client-go 😉

@errordeveloper
Contributor

These are the numbers I got for the 64-bit macOS binaries (du operator / du -h operator):

  • with AWS & with Azure: 169016 / 83M
  • with AWS & without Azure: 164472 / 80M
  • without AWS & with Azure: 123608 / 60M
  • without AWS & without Azure: 118208 / 58M

@errordeveloper
Contributor

errordeveloper commented Mar 23, 2020

I must say that refactoring the operator main package into a few sub-packages is not very easy, as we are using quite a few global variables, and I am not sure whether to get rid of some of those globals or do something else... E.g. it would be possible to move code around inside the main package without breaking it apart, or maybe create a sub-package with global variables?

Aside from the IP allocators inside the operator main package, there is also one additional offender:

package groups

// Empty imports to register providers
import (
    _ "github.com/cilium/cilium/pkg/policy/groups/aws" // AWS import to be able to register the provider.
)

It appears that this pkg/policy/groups is only used by the operator main package, although pkg/policy and its other sub-packages are imported in various places.

@aanm
Member

aanm commented Mar 23, 2020

I must say that refactoring the operator main package into a few sub-packages is not very easy, as we are using quite a few global variables, and I am not sure whether to get rid of some of those globals or do something else... E.g. it would be possible to move code around inside the main package without breaking it apart, or maybe create a sub-package with global variables?

Which global variables?

Aside from the IP allocators inside the operator main package, there is also one additional offender:

package groups

// Empty imports to register providers
import (
    _ "github.com/cilium/cilium/pkg/policy/groups/aws" // AWS import to be able to register the provider.
)

It appears that this pkg/policy/groups is only used by the operator main package, although pkg/policy and its other sub-packages are imported in various places.

@errordeveloper I guess we need to look at it case-by-case. For example, that file can easily be fixed by adding a Go build tag to it, so that only the cilium-operator-aws build would have it set.
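A minimal sketch of that fix, assuming a hypothetical file name inside the operator's main package; the `operator_aws` build tag matches the one added by the commit referenced further down in this thread:

```go
// File: operator/policy_groups_aws.go
// Excluded from the operator unless it is built with -tags operator_aws.
// +build operator_aws

package main

import (
	// Blank import solely so the AWS policy groups provider registers itself;
	// non-AWS operator builds drop this dependency entirely.
	_ "github.com/cilium/cilium/pkg/policy/groups/aws"
)
```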

@errordeveloper
Contributor

Which global variables?

It's the nodeManager and its ciliumNodeUpdateImplementation that puzzled me yesterday evening.

var nodeManager *ipam.NodeManager

type ciliumNodeUpdateImplementation struct{}

nodeManager, err = ipam.NewNodeManager(instances, &ciliumNodeUpdateImplementation{}, iMetrics, option.Config.ParallelAllocWorkers,

I'll have another go at these now, I have an idea.

@tgraf
Member Author

tgraf commented Mar 24, 2020

@errordeveloper An upcoming refactoring in the GCP IPAM PR will remove the global nodeManager for the node IPAM. We can defer this work until that has happened. Same for ciliumNodeUpdateImplementation; that global variable exists out of laziness.

Are there any other blockers that you have identified? How complex is the work overall?

@errordeveloper
Contributor

@tgraf is that refactoring work on a branch yet? I started having a go at it; I am not far from finishing refactoring operator/node_manager.go as operator/tasks/node_manager/node_manager.go. The way the node manager is done is the only tricky thing here, the rest looks fairly trivial. I think I should be able to open a PR later today.

The intention so far is to have operator/task/aws_allocator and operator/task/azure_allocator, and introduce build tags. It makes sense to have two iterations: the first iteration would keep the current behaviour and packaging and just introduce the packages and build tags. In the second iteration I'd change behaviour and packaging: start building 3 binaries and remove the IPAM flag. The reason I'd rather have two iterations is that we'd need to make a decision about defaults and compatibility.

@errordeveloper
Contributor

errordeveloper commented Mar 24, 2020

@tgraf I have got to a stage where operator/node_manager.go and operator/k8s_node.go are moved under operator/tasks/node_manager and somewhat tidied. My branch mostly compiles, but I realised it's a bit more convoluted now; I'll wait for your PR and would be very keen to review it :)

errordeveloper added a commit that referenced this issue Mar 24, 2020
- Move AWS policy groups provider registration into operator,
  as it's not used anywhere else
- Add `operator_aws` build tag to exclude this dependency in
  non-AWS builds (see #9920)

Signed-off-by: Ilya Dmitrichenko <errordeveloper@gmail.com>
tgraf pushed a commit that referenced this issue Mar 24, 2020 (same commit as above)
@errordeveloper
Contributor

errordeveloper commented Apr 3, 2020

Since #10758 was merged, we need to discuss how we are going to ship the operator.

One option would be to keep shipping the large binary that we ship now, with the smaller binaries alongside it. We can make Helm select the small binaries and eventually deprecate the large binary.

Another, slightly different option would be to have a wrapper binary (perhaps a shell script) that calls the right binary based on what flag was passed in.
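A minimal sketch of such a dispatcher, written in Go rather than shell for consistency with the other sketches here; the --ipam flag values and binary paths are assumptions, not a settled interface:

```go
// Hypothetical wrapper: inspects an --ipam flag and replaces itself with
// the matching provider-specific operator binary via exec.
package main

import (
	"flag"
	"log"
	"os"
	"syscall"
)

func main() {
	ipam := flag.String("ipam", "none", "IPAM mode: none, eni (AWS) or azure")
	flag.Parse()

	// Assumed install paths for the per-provider binaries.
	binaries := map[string]string{
		"none":  "/usr/bin/cilium-operator-generic",
		"eni":   "/usr/bin/cilium-operator-aws",
		"azure": "/usr/bin/cilium-operator-azure",
	}
	binary, ok := binaries[*ipam]
	if !ok {
		log.Fatalf("unknown --ipam mode %q", *ipam)
	}

	// Hand over to the real operator, forwarding any remaining arguments.
	args := append([]string{binary}, flag.Args()...)
	if err := syscall.Exec(binary, args, os.Environ()); err != nil {
		log.Fatalf("exec %s: %v", binary, err)
	}
}
```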

Alternatively we can just start shipping the new small binaries, namely 3 of them, and make sure the Helm chart does the right thing. We would have to document this in the changelog and upgrade guide as a breaking change for those who aren't using the official Helm chart or the manifests that we ship in the repo.

From a UX perspective, it seems nicer if we can avoid an instant breaking change and provide a window of 1 or 2 releases, but in a way this can be considered a low-level detail and it would only affect those who aren't using the Helm chart, so I wouldn't be too worried if we prefer the quicker route.

@errordeveloper
Contributor

After thinking about it for a while, I think that personally I'd probably prefer to go with the approach where the fat operator binary is the default for non-Helm installs, so the image will have the old large binary as well as the new smaller ones, and the smaller ones will have to have the IPAM flag removed. Helm installs will make use of the new smaller binaries.

@joestringer
Member

Which non-Helm installs are you thinking about? I can think of the quick-install.yaml and that's all. AFAIK everything else (including any cloud integration) relies on Helm today.

@tgraf
Member Author

tgraf commented Apr 7, 2020

After thinking about it for a while, I think that personally I'd probably prefer to go with the approach where the fat operator binary is the default for non-Helm installs, so the image will have the old large binary as well as the new smaller ones, and the smaller ones will have to have the IPAM flag removed. Helm installs will make use of the new smaller binaries.

Yes, agreed. This also ensures we are not breaking anything.

One option would be to keep shipping the large binary that we ship now, with the smaller binaries alongside it. We can make Helm select the small binaries and eventually deprecate the large binary.

This makes sense to me. We already have Helm options for several cloud providers (azure.enabled=true, eni=true) to do the right thing by default. This can simply optimize the container image for the operator as well.
