
[RA2 Core]: build a consensus on Kubernetes cluster management and document the result #449

Closed
CsatariGergely opened this issue Oct 23, 2019 · 45 comments · Fixed by #570

@CsatariGergely
Collaborator

In the discussions of #383 there was disagreement on two topics in the area of Kubernetes cluster management:

  • Whether the CaaS Manager is part of the NFVO, and whether the answer matters for RA2
  • The scope of one Kubernetes cluster: one VNF, a group of VNFs from the same vendor, or a group of VNFs

The resolution of this issue shall clarify these points and document the result.

@ASawwaf
Collaborator

ASawwaf commented Oct 23, 2019

@CsatariGergely, I like the three issues that you opened, but we are repeating ourselves (sorry for that).
We first need to define CaaS, the CaaS manager and the CIM, and the role of their functions and components, in order to decide whether they are part of the NFVO or part of MANO.

From my side, we can consider the CIM part of ETSI MANO; for the positioning of CaaS and the CaaS manager, I see this as just a conceptual stack.

...

@CsatariGergely
Collaborator Author

@CsatariGergely, I like the three issues that you opened, but we are repeating ourselves (sorry for that).
We first need to define CaaS, the CaaS manager and the CIM, and the role of their functions and components, in order to decide whether they are part of the NFVO or part of MANO.

From my side, we can consider the CIM part of ETSI MANO; for the positioning of CaaS and the CaaS manager, I see this as just a conceptual stack.

...

With these issues I just list the topics where we had disagreements in the discussion in #383. If you would like to work on any of them, please volunteer in a comment.

@tomkivlin
Collaborator

My thoughts...
The original CNTT scope was the interfaces/capabilities from the NFVI that are consumed by VNFs/applications, plus the interfaces/capabilities provided by VIM that are consumed by VNFMs/NFVOs. (Link)

Extending this to Kubernetes suggests, I think, the following:

  1. The interfaces/capabilities that are consumed by VNFs/applications are now provided by the container runtime (CIS in ETSI terms) rather than the NFVI - the VNFs/applications should never be interacting with the Kubernetes cluster level directly (e.g. the control plane, k8s API server, etc.).
  2. The interfaces/capabilities that are consumed by VNFMs and NFVOs (for when indirect mode is used for VNF management) are provided by the Kubernetes control plane (CISM in ETSI terms, I think?) plus maybe some other services such as logging/monitoring??
  3. The interfaces/capabilities that are consumed by NFVOs (for purposes other than indirect VNF management) are only for Kubernetes cluster LCM I reckon - i.e. to the "CaaS Manager" (no ETSI equivalence) or whatever capability manages the lifecycle of the CISM.

I would suggest that number 3 in the list above is not in scope of RA2, but that 1 and 2 are. Which I think follows the proposed chapter 3/4 structure outlined by @CsatariGergely?

This would also mean that chapter 7 takes the approach of documenting the requirements on a CISM/CIS to support the automated LCM that a "CaaS Manager" might provide, rather than documenting how the LCM happens.

@tomkivlin
Collaborator

I will try and build a similar diagram to the one I linked to above, to see if we can get agreement on the scope of RA2 - I think that will address this issue then?

@tomkivlin
Collaborator

Here's my first stab - let me know what you think. Don't worry about the interface names, we can change as needed.
https://github.com/cntt-n/CNTT/blob/tomkivlin-patch-2/doc/ref_arch/kubernetes/figures/ch01_cntt_scope_k8s.png

FYI if you want to modify, the PowerPoint is here: https://github.com/cntt-n/CNTT/blob/tomkivlin-patch-2/doc/ref_arch/kubernetes/figures/k8s-ref-arch-figures.pptx

@pgoyal01
Collaborator

@tomkivlin IMHO, the Container Infrastructure Service Instance (CISI), and the VM/BM, are part of the NFVI. CISIs are abstractions that host containers.

The Container Infrastructure Service (CIS) provides the infrastructure resources managed by the CISM; the resources are exposed using APIs: CRI, CSI and CNI (for Kubernetes). See IFA029 Section 6.2.2.

Alternate Diagram:
image

@tomkivlin
Collaborator

Hi Pankaj, CIS and CISI being part of the NFVI, yes, I understand that and it makes sense. I strongly disagree about containers being part of the NFVI, though. The definition of container that we're currently using even specifies that it is a running instance of a piece of software with all its dependencies. For me a container is part of the application, not the infrastructure. Would you be OK with the NFVI including the CISI and CIS but not the pods/containers themselves?

@pgoyal01
Collaborator

Tom, thanks. A CISI, as you know, is a Kubernetes node that hosts the pods and containers. My attempt was to keep the diagram consistent with the ETSI ones, where only the VNFs and VNFCs are shown separately even though the software is executing in the VM. But looking at it from the perspective where the workload software is shown separately from the machine it is executing on, I would have to agree with you to move the pods/containers out of the NFVI. Sending you the PPT file in an email.

@CsatariGergely
Collaborator Author

We have three options: pre-IFA029, and the two selected alternatives from IFA029.
Here is the figure that I would use for the pre-IFA029 option:
image

And here are the two alternatives from IFA029:
7.2.4.2
image
7.2.4.4
image

More descriptions of these alternatives can be found in IFA029.

@pgoyal01
Collaborator

@CsatariGergely There are two others: 7.2.4.3 and 7.2.4.5. Maybe we need a discussion session on choosing one for CNTT, keeping in mind that the decision has an impact on the VNFM (e.g., ONAP).

7.2.4.3.
image

7.2.4.5
image

@tomkivlin
Collaborator

I don't think we should be trying to answer the question of whether the CISM is part of the VIM or not; that's a procurement decision, not an architectural one, in my opinion. The capability is distinct, so let's draw it as such. We can then clearly delineate what is in scope and what is not.

@CsatariGergely
Collaborator Author

@CsatariGergely There are two others: 7.2.4.3 and 7.2.4.5. Maybe we need a discussion session on choosing one for CNTT, keeping in mind that the decision has an impact on the VNFM (e.g., ONAP).

7.2.4.3.
image

7.2.4.5
image

"Option 2 and 4 are determined to be excluded from the target architecture." Option 2 is 7.2.4.3 and option 4 is 7.2.4.5.

@pgoyal01
Collaborator

@tomkivlin we all agree that the CISM functionality is distinct from, say, a VIM that manages only VMs as virtual resources. But, IMHO, it is a valid architectural discussion how capabilities are distributed to components, and whether an existing component's capabilities need to be enhanced or the component needs to be replaced with another, etc.

@tomkivlin
Collaborator

tomkivlin commented Oct 30, 2019

@pgoyal01 good point.

I wonder if it would help to list the capabilities I think are in scope of RA2? We can then draw them into a diagram without needing to get into the discussion of whether or not k8s is part of the VIM, VNFM, etc. I've deliberately used non-IFA029 terms in the table so that we / others (or just I) don't get confused, as there seem to be different terms used in some of the diagrams you've provided above.

Component/Interface | In Scope of RA2?
--- | ---
Virtual or physical compute, storage and network infrastructure used by Kubernetes nodes (NFVI) | Yes
Virtual or physical management of the above infrastructure (VIM / +PIM?) | Yes
Kubernetes node OS | Yes
Kubernetes container runtime | Yes
Kubernetes worker node services (kubelet, kube-proxy) | Yes
Kubernetes configuration store (etcd) | Yes
Kubernetes master node services (API server, controller-managers, DNS, CNI, etc.) | Yes
Kubernetes master nodes (Kubernetes control plane, etcd, DNS, CNI, etc.) | Yes
Multi-cluster lifecycle management capability | No*
Kubernetes objects (pods, config maps, volumes, etc.) | No**

* The operations and lifecycle management chapter will address the requirements on the cluster for being able to perform lifecycle management, but won't document the management capability itself.
** The pods, associated containers and other Kubernetes objects are considered application constructs for the purposes of RA2 (following the definition of a container)
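
To illustrate the split in the last two rows, here is a minimal sketch (Python, assuming the community "kubernetes" client and a reachable cluster via kubeconfig; the namespace name is invented) of the difference between the cluster-level API surface RA2 would specify and the application-level objects a CNF vendor creates through that same API:

```python
# Minimal sketch, assuming the community "kubernetes" Python client.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# In scope: the master node services expose the cluster API, e.g. to
# enumerate worker nodes and the capabilities they report.
for node in core.list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)

# Out of scope (application construct): the Pods, ConfigMaps, etc. that a
# CNF vendor creates through the same API, e.g.:
#   core.list_namespaced_pod("my-cnf-namespace")   # namespace is invented
```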

What do you think?

@pgoyal01
Collaborator

@tomkivlin Agree largely with the Components that you have listed. The question is w.r.t. the application constructs -- the Kubernetes objects. Others can chime in here, but I think we need these as part of RA-2.

Are we going to discuss multi-cluster support (not LCM)?
Architectural support for both CNFs and VNFs?

@petorre
Collaborator

petorre commented Oct 31, 2019

@tomkivlin Agree. Generally, in scope should be what is in the Platform (other Kubernetes services or more functionality to build an MVP PaaS), and out of scope what comes from the Application(s)' function and control (which should not duplicate platform functionality, but beyond that we don't need to be prescriptive on the How).

@tomkivlin
Collaborator

From the 31.10 meeting: continue to describe the scope / architecture in Kubernetes terms before mapping back to ETSI terms.

@tomkivlin
Collaborator

tomkivlin commented Nov 4, 2019

Here is my proposal for the scope of RA2.

image

Regarding a couple of the contentious areas: the purpose of CNTT is to aid the consistency of infrastructure platforms, to make the verification and certification of network software simpler and more efficient. So I think we need to be careful not to include things just because they may well be in any other Kubernetes Reference Architecture, but to be sure they are right for CNTT RA2.

Other reasons are:

  • CNF Components (Pods, Volumes, ConfigMaps, etc.): The terminology in chapter 1 currently states that a container is a "lightweight and portable executable image that contains software and all of its dependencies". I would argue that because this "image" is defined / managed by the software vendor and not by the "infrastructure" provider, it is not in scope of RA2. My thinking is then that these Kubernetes objects are logical constructs and instead we should include the actual infrastructure interfaces in the scope, rather than the API objects that are used to manage them. Note - the API spec (and therefore the spec of the Pod object, for example) is in scope, as that is part of the Kubernetes Master Node Services capability.
  • Kubernetes Application Package Management: my preference is that this is not included in the RA2 specification, as Helm doesn't consist of any infrastructure elements and I don't believe it is analogous to OpenStack Heat (which I see more like a closed-loop controller-manager in Kubernetes - it takes a desired-state Heat Orchestration Template, compares that to the observed state, calculates the delta(s) and then passes instructions to the relevant OpenStack schedulers). Helm, however, is not a closed-loop orchestration engine in and of itself - at its simplest, it is a client application that consumes the Kubernetes API and is therefore something that the software vendors might choose to use for the management of their software, or not. For me this is more analogous to a component of the NFVO or VNFM that is combining HOTs to manage complex "packages". I don't believe it aids the CNTT purpose to include this. This bullet links to [RA2 Core]: Build a consensus about the usage of Helm and document the result #451.
  • Kubernetes cluster lifecycle management: As per my comment above I think that this LCM capability - the creating, updating, upgrading and deleting of Kubernetes clusters - is not a capability that is in scope of RA2 (which is why I've made it a dotted line), as it is not a capability that would be used in the management or execution of a CNF or any other application. It would only be used by an entity (NFVO, manual, other) to create Kubernetes clusters ready for a VNFM to consume for the running of the software.

@TamasZsiros
Collaborator

On the Kubernetes cluster lifecycle management not being part of RA2:
My understanding is that the RM describes generic infra LCM:
https://cntt-n.github.io/CNTT/doc/ref_model/chapters/chapter09.html

If we view it as CNTT's high-level goal to ensure the consistency of infrastructure platforms, AND for various reasons (multi-tenancy, separation, edge) we see an increased number of clusters (compared to e.g. OpenStack), then perhaps it would be better to include it. Otherwise vendors will present different NFVI stacks with differing LCM capabilities, and since the CNFs will depend on separation and multi-tenancy capabilities, they will also implicitly depend on infra LCM capabilities.

I understand this is additional complexity, and perhaps we should draw a line as to what extent we describe this, but at least I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API).
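
For reference, a hedged sketch of what "comply with an interface such as Cluster API" could look like from a management client's point of view; the `cluster.x-k8s.io` group comes from the Cluster API project, but the version and any cluster names are assumptions, not something agreed in this thread:

```python
# Hedged sketch: clusters as declarative custom resources in a management
# cluster, which an NFVO / "CaaS Manager" would create, patch and delete.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

clusters = custom.list_cluster_custom_object(
    group="cluster.x-k8s.io",   # Cluster API group; version is an assumption
    version="v1alpha2",
    plural="clusters",
)
for c in clusters["items"]:
    print(c["metadata"]["name"], c.get("status", {}).get("phase"))
```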

@tomkivlin
Collaborator

tomkivlin commented Nov 5, 2019

I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API)

I am tending towards this position too - I see the cluster LCM as being increasingly important (I'm aware that's a difference from my above comment!)

@TamasZsiros
Collaborator

Leaving Helm out would just mean that the VNF Manager needs to contend with a lower abstraction level (K8s API vs. Helm chart) for describing the target state/configuration of the CNF. I would argue that this would result in the VNF Manager getting more complex (compared to what it could be), which does not align well with the general trend of pushing functionality down from the VNFM to K8s (for example, scaling).

So in my view having a package manager in the CaaS (or, to be maybe more exact, supporting an entity in the CaaS that operates on a compact, declarative descriptor as opposed to K8s API calls) brings us closer to the "ideal world" where the VNFM is very slim or non-existent.
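
A rough sketch of the abstraction gap being described, assuming the community "kubernetes" Python client and invented names (`my-cnf`, `vendor-repo`): without a package manager the VNFM has to assemble every Kubernetes object through API calls, whereas a Helm chart collapses the whole package into one declarative release.

```python
# Rough sketch of a VNFM driving the Kubernetes API directly.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="my-cnf"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "my-cnf"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "my-cnf"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(name="my-cnf",
                                   image="registry.example/my-cnf:1.0"),
            ]),
        ),
    ),
)
apps.create_namespaced_deployment(namespace="my-cnf", body=deployment)
# ...repeated for Services, ConfigMaps, PVCs, NetworkPolicies, and so on.

# With a package manager the same intent is a single declarative release:
#   helm install my-cnf vendor-repo/my-cnf --version 1.0.0
```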

@pgoyal01
Collaborator

pgoyal01 commented Nov 5, 2019

@tomkivlin Is the intent only to support CNFs or both VNFs and CNFs as is the likely scenario for the foreseeable future?

@TamasZsiros
Collaborator

"Perhaps a very slim generic VNFM which is essentially a Helm client?"
@tomkivlin so how about suggesting an [optional] Helm v3 client in VNFM (which is typically proprietary anyway)? This way the CaaS stays clean, and a vendor can decide for or against using Helm in the VNFM?

@tomkivlin
Collaborator

@TamasZsiros that would be my preference, yes.

@tomkivlin
Collaborator

@tomkivlin Is the intent only to support CNFs or both VNFs and CNFs as is the likely scenario for the foreseeable future?

Within this RA2 it is CNFs only - it's a Kubernetes Reference Architecture. I think there is a discussion to be had within CNTT about how we want to deal with the following scenarios:

  • Reference Architecture to support VNFs and CNFs at the same time
  • Reference Architecture to support VM-based and Kubernetes-based CNFs at the same time (no particular reason why a CNF couldn't include some VM-based components)
  • etc.

But I think that's out of the scope of RA2.

@peterwoerndle
Collaborator

@tomkivlin the new figure addresses my comments, thanks.

@pgoyal01
Collaborator

pgoyal01 commented Nov 5, 2019

@tomkivlin Maybe we need a discussion about the scope of RA-2 at the Technical Steering Committee. I see RA-1 supporting VNFs while RA-2 supporting both VNFs and CNFs and migrating to CNFs in the future.

@peterwoerndle
Collaborator

@pgoyal01 are you referring to a VNF in the sense of a VM-based application? Generally the term "Kubernetes-based application / VNF" would not prevent deploying a VM-based VNF on top of RA2, as long as Kubernetes is used to manage the workload. My preference would be to start with the established container management in Kubernetes and add support for VMs using kubevirt, virtlets, RancherVM, ... in a later revision. From a northbound interface point of view it should not make a major difference in the RA.

@pgoyal01
Collaborator

pgoyal01 commented Nov 5, 2019

@peterwoerndle Agree on "..as long as Kubernetes is used to manage the workload. "
Since we would be in the hybrid world (VNFs and CNFs) with VNFs dominating initially, may I suggest that we include "support for VMs using kubevirt, virtlets, RancherVM, .." from the start.

@tomkivlin
Collaborator

@pgoyal01 @peterwoerndle I'm comfortable including the management of VMs through Kubernetes in RA2, but I worry that there isn't a mature, production-ready option available today that we can standardise on. Another option for the future might be the use of the Operator framework and Custom Resources (similar to Cluster API, but not just for managing Kubernetes clusters).

I also think, as I mentioned above, there needs to be a distinction between VMs managed by Kubernetes (for me, that is a CNF that uses VMs) and VNFs that use VMs. If we are suggesting that VMs are managed through the Kubernetes API for VNFs, are we suggesting Kubernetes becomes a VIM in ETSI NFV v3?? That feels like a lot of change, compared to allowing CNFs to use VMs and whatever we suggest becoming part of ETSI NFV v4 (in time)...
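
As a hedged illustration of the "VMs managed through the Kubernetes API" pattern mentioned here (KubeVirt-style), VMs show up as just another custom resource that an operator reconciles onto worker nodes. The group/version and namespace below are assumptions based on the KubeVirt project, not something agreed in this issue.

```python
# Hedged sketch: listing KubeVirt-style VirtualMachine custom resources.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

vms = custom.list_namespaced_custom_object(
    group="kubevirt.io",        # KubeVirt API group; version is an assumption
    version="v1alpha3",
    namespace="vnf-a",          # invented namespace
    plural="virtualmachines",
)
for vm in vms["items"]:
    print(vm["metadata"]["name"], vm.get("status", {}).get("created"))
```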

@CsatariGergely
Collaborator Author

On the Kubernetes cluster lifecycle management not being part of RA2:
My understanding is that the RM describes generic infra LCM:
https://cntt-n.github.io/CNTT/doc/ref_model/chapters/chapter09.html

If we view it as CNTT's high-level goal to ensure the consistency of infrastructure platforms, AND for various reasons (multi-tenancy, separation, edge) we see an increased number of clusters (compared to e.g. OpenStack), then perhaps it would be better to include it. Otherwise vendors will present different NFVI stacks with differing LCM capabilities, and since the CNFs will depend on separation and multi-tenancy capabilities, they will also implicitly depend on infra LCM capabilities.

I understand this is additional complexity, and perhaps we should draw a line as to what extent we describe this, but at least I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API).

I do not see how the LCM of the infra is visible to a VNF. Somehow I feel that adding the LCM part is too big a problem domain for the first release...

@CsatariGergely
Collaborator Author

"Perhaps a very slim generic VNFM which is essentially a Helm client?"
@tomkivlin so how about suggesting an [optional] Helm v3 client in VNFM (which is typically proprietary anyway)? This way the CaaS stays clean, and a vendor can decide for or against using Helm in the VNFM?

I would not include the VNFM in the RA.

@tomkivlin
Collaborator

I would not include the VNFM in the RA.

Nor would I. I think the suggestion is that we don't include Helm in this RA and instead have a statement that it is a VNFM component and up to the VNFM vendor to decide whether they include it or not.

@tomkivlin
Collaborator

I do not see how the LCM of the infra is visible to a VNF.

Yes you're right, I'm changing my mind again back to my original position. We just need to be sure we address the points Tamas has made about multitenancy etc.

@tomkivlin
Collaborator

tomkivlin commented Nov 6, 2019

I've added in "Kubernetes-based Application Artefact Storage" to cover:

  • Container Registry
  • Helm Chart Repository

image

@tomkivlin
Collaborator

From Technical Steering Meeting 6/11/19: VM management by Kubernetes is in scope. I will clarify in the diagram.

@tomkivlin
Collaborator

Here's the update following today's steering meeting. If there are no objections I will draft a PR updating chapter 1 based on this diagram and discussions that have been had.

To clarify, I have added an interface between the Kubernetes Master Node Services and the NFVI - this is to cover an example such as kubevirt that uses Custom Resources to interact with libvirt on nodes. I have also added "or custom controller (e.g. CRDs, operators)" to the interface between the Kubernetes Master Node Services and the VIM - to cover those examples that would use a provider that communicates with a VIM, rather than a lower-level hypervisor service.

image

@CsatariGergely
Collaborator Author

From Technical Steering Meeting 6/11/19: VM management by Kubernetes is in scope. I will clarify in the diagram.

I would not add kubevirt/virtlet or anything similar to RA2 in this release yet. I think it is enough if we sort out containers first.

@CsatariGergely
Collaborator Author

Here's the update following today's steering meeting. If there are no objections I will draft a PR updating chapter 1 based on this diagram and discussions that have been had.

To clarify, I have added an interface between the Kubernetes Master Node Services and the NFVI - this is to cover an example such as kubevirt that uses Custom Resources to interact with libvirt on nodes. I have also added "or custom controller (e.g. CRDs, operators)" to the interface between the Kubernetes Master Node Services and the VIM - to cover those examples that would use a provider that communicates with a VIM, rather than a lower-level hypervisor service.

image

Even if we add hypervisors with a CRI interface (what is a good name for these in general?), I think the interface is not from the master node to the NFVI.
According to my understanding:

  • The control communication is between the master node and the worker node (which is needed also in the case of containers with a CRI interface)
  • A hypervisor with a CRI interface will communicate with the libvirt of the Kubernetes Worker Machine

@tomkivlin
Collaborator

I would not add kubevirt/virtlet or anything similar to RA2 in this release yet. I think it is enough if we sort out containers first.

Let's add it as a header and placeholder, but agreed it's not a priority item.

@tomkivlin
Collaborator

Even if we add hypervisors with a CRI interface (what is a good name for these in general?), I think the interface is not from the master node to the NFVI.
According to my understanding:

  • The control communication is between the master node and the worker node (which is needed also in the case of containers with a CRI interface)
  • A hypervisor with a CRI interface will communicate with the libvirt of the Kubernetes Worker Machine

That's one type, and you're probably right about the communication channels - I will double check and update when I raise a PR for chapter 1 (will start on that today - let's move some of this more detailed discussion to a PR).

@peterwoerndle
Collaborator

I agree with @CsatariGergely's comments with regard to the CRI. @tomkivlin having a dedicated PR on this may also help to schedule it properly for a version of the document (if we decide not to take it into the first version of RA2).

tomkivlin added the Archive Item label on Feb 11, 2020