
Consider implementing a validator for Configurations #180

Closed
DazWilkin opened this issue Dec 17, 2020 · 36 comments · Fixed by #206
Assignees
Labels
enhancement New feature or request

Comments

@DazWilkin
Contributor

DazWilkin commented Dec 17, 2020

Is your feature request related to a problem? Please describe.

kubectl validates (some?) specs when these are applied.

Configurations (the Akri CRD) are not validated, which permits user error and other mistakes to slip through.

Is your feature request related to a way you would like Akri extended? Please describe.

During development of Zeroconf, I inadvertently broke the YAML spec for the Broker:

apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: zeroconf
spec:
  protocol:
    zeroconf:
      kind: "_rust._tcp"
      port: 8888
      txtRecords:
        project: akri
        protocol: zeroconf
        component: avahi-publish
  capacity: 1
  brokerPodSpec:
    imagePullSecrets: # Container Registry secret
      - name: ghcr
    containers:
      - name: zeroconf-broker
        image: ghcr.io/dazwilkin/zeroconf-broker@sha256:993e5b8d....
    resources: <----------------------------- INCORRECTLY INDENTED AFTER EDIT
              limits: 
                "{{PLACEHOLDER}}": "1"

The result was that, while Brokers were deployed, environment variables were not mounted correctly in the Pod.

It was only after lengthy debugging that we uncovered the error:

kubectl get pod/akri-zeroconf-9e1938-pod --output=yaml

spec:
  containers:
  - image: ghcr.io/dazwilkin/zeroconf-broker@sha256:993e5b8d...
    imagePullPolicy: IfNotPresent
    name: zeroconf-broker
    resources: {}

This may (!?) have been mitigated had the Configuration been validated upon apply.
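For reference, the corrected Configuration later in this thread nests `resources` under the container entry:

```yaml
    containers:
      - name: zeroconf-broker
        image: ghcr.io/dazwilkin/zeroconf-broker@sha256:993e5b8d....
        resources:
          limits:
            "{{PLACEHOLDER}}": "1"
```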

Describe the solution you'd like

kubectl apply --filename=./some-akri-config.yaml

Would yield e.g.:

error: error parsing ./some-akri-config.yaml: line XX: something wrong described

Describe alternatives you've considered

Being less stupid ;-)

This is the convention with Kubernetes and would be a useful addition.

Additional context

See: #177

@DazWilkin
Contributor Author

I've since learned that CRDs should be validated by dint of the OpenAPI spec...

Hmmm...In my example, I'd inadvertently added .spec.brokerPodSpec.resources instead of .spec.brokerPodSpec.containers[0].resources so it's unclear why this passed muster.

@bfjelds
Collaborator

bfjelds commented Dec 18, 2020

I've since learned that CRDs should be validated by dint of the OpenAPI spec...

Hmmm...In my example, I'd inadvertently added .spec.brokerPodSpec.resources instead of .spec.brokerPodSpec.containers[0].resources so it's unclear why this passed muster.

Our Configuration CRD has some validation bits in it .. but no validation for the pod or service specs. Hopefully, someday Kubernetes will make it easier to embed (with validation) existing resources.

In the meantime, maybe we need to do something to help.
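For illustration only (this is not Akri's actual CRD, and embedding the full PodSpec schema remains the hard part), structural checks could be added to the CRD's OpenAPI schema along these lines:

```yaml
# apiextensions v1beta1-style fragment; illustrative only, not Akri's CRD
validation:
  openAPIV3Schema:
    type: object
    required: ["spec"]
    properties:
      spec:
        type: object
        required: ["capacity", "protocol"]
        properties:
          capacity:
            type: integer
            minimum: 1
          brokerPodSpec:
            # Left unvalidated: embedding PodSpec's schema is the gap
            # this thread discusses.
            type: object
```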

@DazWilkin
Contributor Author

DazWilkin commented Dec 18, 2020 via email

@DazWilkin
Contributor Author

Had a play (!) with this today:

https://github.com/DazWilkin/kubectl-akri

The concept is a kubectl plugin that would permit e.g. kubectl akri apply --filename=./something.yaml and get errors

But, I think it's not going to be appropriate as it's turtles all the way down; I'd need to implement the entire schema tree below e.g. PodSpec.

@DazWilkin
Contributor Author

As I drove to the dog camp, I realized that it may be sufficient after all.

I can simply validate the core Container spec and assume that e.g. PodSpec is valid. Perhaps an invalid PodSpec would be caught?

In my case, I'd inadvertently moved resources from ... container into the Configuration root, and this type of error (incorrect Configuration entries) would be caught.

@bfjelds
Collaborator

bfjelds commented Dec 19, 2020

An approach that would help (but would be tied to an API version) would be to add validation to the CRD for the pod and service specs. Prometheus worked on a similar issue: kubernetes/kubernetes#54579 (comment). Perhaps we could follow their lead until Kubernetes simplifies this.

Note: I also see someone suggesting using a ValidatingAdmissionWebhook ... we could add this to the controller maybe? https://stackoverflow.com/questions/56044319/validating-a-crd-with-an-embedded-core-v1-podspec

@bfjelds bfjelds added the enhancement New feature or request label Dec 19, 2020
@DazWilkin
Contributor Author

The Webhook appears (!) trivial.

Do you support my giving that a whirl?

@bfjelds
Collaborator

bfjelds commented Dec 20, 2020

The Webhook appears (!) trivial.

Do you support my giving that a whirl?

Definitely! Validation would add a lot of value!!

@DazWilkin
Contributor Author

I spent some time learning more about Kubernetes' webhook today and the work is non-trivial.

I'll give it a little more time today, have another commitment tomorrow and will report back Wednesday.

@DazWilkin
Contributor Author

Oh boy! 😃

K8s webhooks are poorly documented and involved but I've made progress.

I can publish a ValidatingWebhook and it gets invoked. There's a bunch of infrastructural work to get this far.

I'm slightly perplexed: I was expecting to be able to filter the webhook to e.g. apiVersion: akri.sh/v0 mutations but, to get it to work when kubectl apply'ing an Akri Configuration, I'm having to use a super-wide filter (basically *). Will spend more time later this week on it.

Once I can scope the webhook down to these, I must then write the validator for an Akri Configuration (but that should be the simplest part).
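For what it's worth, the registration's rules should allow scoping the webhook to only Akri Configurations rather than a wildcard; an illustrative (all names are placeholders) ValidatingWebhookConfiguration:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: akri-configuration-validator   # illustrative name
webhooks:
  - name: validate.akri.sh             # illustrative name
    rules:
      - apiGroups: ["akri.sh"]
        apiVersions: ["v0"]
        operations: ["CREATE", "UPDATE"]
        resources: ["configurations"]
    clientConfig:
      service:
        name: akri-webhook             # illustrative Service
        namespace: default
        path: /validate
      caBundle: <base64 CA>
    admissionReviewVersions: ["v1"]
    sideEffects: None
```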

It's here: https://github.com/DazWilkin/akri-webhook

@DazWilkin
Contributor Author

DazWilkin commented Dec 23, 2020

OK, I got it working.... boy-oh-boy.... ValidatingWebhooks need better documentation and a blog post.

more ./zeroconf.yaml 
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: zeroconf
spec:
  protocol:
    zeroconf:
      kind: "_rust._tcp"
      port: 8888
      txtRecords:
        project: akri
        protocol: zeroconf
        component: avahi-publish
  capacity: 1
  brokerPodSpec:
    imagePullSecrets: # Container Registry secret
      - name: ghcr
    containers:
      - name: zeroconf-broker
        image: ghcr.io/dazwilkin/zeroconf-broker@sha256:69810b62...
    resources: <-------------------- MISPLACED
          limits:
            "{{PLACEHOLDER}}": "1"

kubectl apply --filename=./zeroconf.yaml 
Error from server: error when creating "./zeroconf.yaml": admission webhook "wednesday.akri.sh" denied the request: Configuration does not include `resources.limits`

NOTE The Error is being thrown by the webhook because the Configuration is invalid.

NOTE The invalid part of the Configuration is in the PodSpec (!)

One oversight|consideration is that the incorrect (!) Configuration above is valid YAML and so, even after writing the webhook and getting it to work, the invalid Configuration passed.

What I've had to do is manually add checks for e.g. the existence of .spec.container[*].resources.limits

I think an alternative would be to write a custom parser that invalidates additional (!) entries rather than permitting them.
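In Go at least, rejecting (rather than silently dropping) superfluous entries doesn't need a custom parser: encoding/json's Decoder.DisallowUnknownFields does exactly that. A minimal sketch, with stand-in types (not Akri's real schema):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Minimal stand-ins for the Configuration types; illustrative only.
type Container struct {
	Name      string          `json:"name"`
	Image     string          `json:"image"`
	Resources json.RawMessage `json:"resources,omitempty"`
}

type BrokerPodSpec struct {
	Containers []Container `json:"containers"`
}

type Spec struct {
	BrokerPodSpec BrokerPodSpec `json:"brokerPodSpec"`
	Capacity      int           `json:"capacity"`
}

// strictParse rejects any field not declared on the target type, so a
// misplaced `resources` at the brokerPodSpec level becomes an error
// instead of being silently dropped.
func strictParse(data string, v interface{}) error {
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	return dec.Decode(v)
}

func main() {
	// `resources` is (incorrectly) a sibling of `containers`.
	bad := `{"brokerPodSpec":{"containers":[{"name":"b","image":"i"}],"resources":{"limits":{"{{PLACEHOLDER}}":"1"}}},"capacity":1}`
	var s Spec
	err := strictParse(bad, &s)
	fmt.Println(err != nil) // true: the misplaced `resources` is rejected
}
```

The check applies recursively to the whole document, so the error is reported even though the stray field is nested.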

@bfjelds
Collaborator

bfjelds commented Dec 24, 2020

Awesome...regarding the validation, I was mulling that over when I first read about the webhook. I wondered if we couldn't deserialize and serialize the input yaml, then use the original yaml to construct some kind of verification jsonpath-like queries. As I type it though, that sounds complex.

@DazWilkin
Contributor Author

That's not far from what I have.

I'm currently unmarshaling it into a struct and have code to validate, but using an existing schema would be better.

I wrote the handler in Golang but it should be straightforward to rewrite in Rust. I'm more familiar with Go, the examples that exist are in Go and all the types are, by definition, available in Go.

@bfjelds
Collaborator

bfjelds commented Dec 24, 2020

I would love to take a look, but I got a 404 error when trying out the link to your repo. Is it public?

@DazWilkin
Contributor Author

Apologies! It's now public

@DazWilkin
Contributor Author

OK. It works. See testing

It currently only checks for {.spec.brokerPodSpec.containers[*].resources.limits} and this should possibly be {.spec.brokerPodSpec.containers[*].resources.limits.\{\{PLACEHOLDER\}\}} and then the result cast to an int.

More importantly, there should be other tests for validity but I'm not entirely sure what they should all be.

  • There could be multiple containers
  • I think only 1 container should (!?) include {{PLACEHOLDER}}
  • More than 1 container could include {.spec.brokerPodSpec.containers[*].resources}
  • There should probably always be one (or more?) {.spec.protocol}

I'd like to make the Webhook dynamic on a set of (flagged) templates. See: DazWilkin/akri-webhook#4

One outstanding challenge is that JSONPath supports filters, but it's unclear to me how to make these boolean functions.

In the above examples, I'd like: {IsInt(.spec.brokerPodSpec.containers[*].resources.limits.\{\{PLACEHOLDER\}\})}

Then, each filter could be a Boolean function and the Webhook could find the intersection of a set of such functions to determine whether the filters succeed or fail: test-1 && test-2 && .... && test-n
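The intersection idea can be sketched with plain Go functions standing in for the JSONPath filters (the names and the two checks are illustrative, not Akri's real rules):

```go
package main

import (
	"errors"
	"fmt"
)

// A Check is one boolean validation over the (already parsed) Configuration;
// the webhook admits the request only if every Check passes:
// test-1 && test-2 && ... && test-n.
type Check func(cfg map[string]interface{}) error

// runAll returns the first failure, or nil if all checks pass.
func runAll(cfg map[string]interface{}, checks ...Check) error {
	for _, c := range checks {
		if err := c(cfg); err != nil {
			return err
		}
	}
	return nil
}

func hasSpec(cfg map[string]interface{}) error {
	if _, ok := cfg["spec"].(map[string]interface{}); !ok {
		return errors.New("Configuration does not include `spec`")
	}
	return nil
}

func hasCapacity(cfg map[string]interface{}) error {
	spec, _ := cfg["spec"].(map[string]interface{})
	if spec == nil {
		return errors.New("Configuration does not include `spec`")
	}
	if _, ok := spec["capacity"].(float64); !ok { // JSON numbers decode to float64
		return errors.New("Configuration does not include numeric `capacity`")
	}
	return nil
}

func main() {
	cfg := map[string]interface{}{"spec": map[string]interface{}{"capacity": float64(1)}}
	fmt.Println(runAll(cfg, hasSpec, hasCapacity)) // <nil>
}
```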

@bfjelds
Collaborator

bfjelds commented Dec 28, 2020

that is super! rather than deciding what should be set and shouldn't (though that is a good idea!!), i was thinking something more dynamic ... just check what the user is trying to add. i tried a real general approach here: https://play.golang.org/p/10OniB-QR5u. what do you think about that?

also, is there a way to host this in the existing controller?

@DazWilkin
Contributor Author

I explained myself poorly; set-not-set was a poor attempt to explain what you've achieved in your example.

In my code, the additional extra: 1 (as JSON) would (I think) parse because the JSON parser accepts the valid JSON and drops the superfluous extra.

Somehow (don't yet understand your code), the back-and-forth, addresses this which is clever!

I'll add your approach.

Would you like to merge this into the controller code? Or do you mean combine with the controller Deployment?

I'm going to try to fix what I think is another issue with the WebThings Rust SDK today and then blog about Kubernetes' Webhooks because the documentation is woeful and it will be good for me to explain it (to myself if no-one else).

...But I'll be back on this tomorrow.

@bfjelds
Collaborator

bfjelds commented Dec 30, 2020

cool! the back and forth should allow extra: 1 to be ignored on the deserialization, serializing the resulting struct should apply any valid settings. then, using the original yaml as a "map" to walk the deserialized-then-serialized tree should allow us to pick up any missing settings (extra: 1), mistyped (if we want) values, and unexpected set values.
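A sketch of that back-and-forth in Go, with stand-in types (not Akri's real schema): parse typed, re-serialize, then walk the user's original document to find paths that were silently dropped:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the Configuration types; illustrative only.
type Container struct {
	Name      string                 `json:"name"`
	Image     string                 `json:"image"`
	Resources map[string]interface{} `json:"resources,omitempty"`
}

type BrokerPodSpec struct {
	Containers []Container `json:"containers"`
}

type Spec struct {
	BrokerPodSpec BrokerPodSpec `json:"brokerPodSpec"`
	Capacity      int           `json:"capacity"`
}

// missingKeys walks the user's original document and reports any path
// that did not survive the typed parse-then-serialize round trip.
func missingKeys(path string, orig, parsed interface{}) []string {
	var missing []string
	switch o := orig.(type) {
	case map[string]interface{}:
		p, ok := parsed.(map[string]interface{})
		if !ok {
			return []string{path}
		}
		for k, v := range o {
			pv, ok := p[k]
			if !ok {
				missing = append(missing, path+"."+k)
				continue
			}
			missing = append(missing, missingKeys(path+"."+k, v, pv)...)
		}
	case []interface{}:
		p, ok := parsed.([]interface{})
		if !ok || len(p) != len(o) {
			return []string{path}
		}
		for i, v := range o {
			missing = append(missing, missingKeys(fmt.Sprintf("%s[%d]", path, i), v, p[i])...)
		}
	}
	return missing
}

// roundTripCheck parses raw into the typed Spec, serializes it again,
// and diffs the result against the original document.
func roundTripCheck(raw []byte) ([]string, error) {
	var spec Spec
	if err := json.Unmarshal(raw, &spec); err != nil {
		return nil, err
	}
	rendered, err := json.Marshal(spec)
	if err != nil {
		return nil, err
	}
	var orig, parsed interface{}
	if err := json.Unmarshal(raw, &orig); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(rendered, &parsed); err != nil {
		return nil, err
	}
	return missingKeys("$", orig, parsed), nil
}

func main() {
	// `resources` is (incorrectly) a sibling of `containers`, so the
	// typed parse drops it and the diff reports it.
	bad := []byte(`{"brokerPodSpec":{"containers":[{"name":"b","image":"i"}],"resources":{"limits":{"x":"1"}}},"capacity":1}`)
	missing, _ := roundTripCheck(bad)
	fmt.Println(missing) // [$.brokerPodSpec.resources]
}
```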

my ideal would be to have this be rust code that existed in the controller binary ... we could add the service definition into the helm deployment code for the controller.

@bfjelds
Collaborator

bfjelds commented Dec 30, 2020

I truly love today's languages and their playgrounds ... here's a quick and dirty rust version: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=f11cc741f40e72b6a40f9fa29a319783

@DazWilkin
Contributor Author

I agree; they're awesome tools! Talking of which, I assume you're all using the alternative rust-analyzer

My plan today is to incorporate the YAML parser into the existing Golang handler.

Last night I was Googling around the k8s_openapi. IIUC the AdmissionReview and related types aren't currently exposed through OpenAPI and so I need to get my head around rust-swagger and the swagger files referenced in that issue to generate the structs to wrap your code (thank you very much that saved me a bunch of time) into a Rust-based handler.

Yes, to merging this into the controller binary and incorporating into the Helm chart.

@DazWilkin
Contributor Author

OK, a much improved (thank you!) version of the Golang webhook now uses the @bfjelds "twice-baked" check

I'm continuing to use JSON rather than YAML.

Now, my bad YAML example results in:

Error from server: error when creating "./zeroconf.webthings.yaml":
admission webhook "wednesday.yirella.svc" denied the request: Input index (spec) is not parsed correctly:
Input index (brokerPodSpec) is not parsed correctly:
Input index (resources) is not parsed correctly:
Input (map[limits:map[{{PLACEHOLDER}}:1]]) is not consistent with expected value

I was missing something obvious though, so I recreated (!) the akri.sh/v0 Configuration schema 😃 in Golang:

package main

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type Configuration struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Metadata   *metav1.ObjectMeta `json:"metadata"`
	Spec       Spec               `json:"spec"`
}
type Spec struct {
	BrokerPodSpec            *v1.PodSpec            `json:"brokerPodSpec,omitempty"`
	Capacity                 int                    `json:"capacity"`
	ConfigurationServiceSpec *v1.ServiceSpec        `json:"configurationServiceSpec,omitempty"`
	InstanceServiceSpec      *v1.ServiceSpec        `json:"instanceServiceSpec,omitempty"`
	Properties               map[string]string      `json:"properties,omitempty"`
	Protocol                 map[string]interface{} `json:"protocol"`
	Units                    string                 `json:"units"`
}

I used the Rust Configuration as the guide.

Q: Should these types be defined by a Swagger spec?

I see that there are additional constraints on the ServiceSpecs of at most 1 per...; these constraints aren't currently being enforced through the YAML.

@DazWilkin
Contributor Author

Making good progress with the Rust variant:

https://github.com/DazWilkin/akri-webhook-rust

It's still a webhook but it's almost in Rust 😄

Will file an issue tomorrow about externalizing types to make them more easily accessible from other projects.

@DazWilkin
Contributor Author

Such a doofus, I spent an hour banging my head against the wall trying to understand why serde was unwilling to unmarshal my test configuration.yaml into an Akri Configuration.

Eventually, I realized that I needed to use KubeAkriConfig 😄

So, it's working. Building a container image and will test it.

@DazWilkin
Contributor Author

Frustrating day... I broke the Golang implementation, I think by upgrading from `v1beta` to `v1`.

The Rust implementation balks on creationTimestamp and resources and I'm unsure why.

If I exclude them, the tests pass:

serde_json::Value::Object(o) => {
    for (key, value) in o {
        println!("[check] key: {}", key);
        // Skip the problem keys; note that a `return Ok(())` here would
        // also skip every remaining key in the object, not just this one.
        if key == "creationTimestamp" || key == "resources" {
            println!("[check] {} deserialized: {:?}", key, deserialized);
            continue;
        }
        if let Err(e) = check(&value, &deserialized[key]) {
            return Err(format!(
                "input key ({:?}) not equal to parsed: ({:?})",
                key, e
            )
            .into());
        }
    }
    Ok(())
}

In both cases, the deserialized object does not contain the entities:

The missing creationTimestamp:

Object({
        "annotations": 
            Object({
                    "kubectl.kubernetes.io/last-applied-configuration": 
                        String("{ ... }"  )
            }), 
        "generation": Number(1.0),
        "name": String("webthings"),
        "namespace": String("default"),
        "uid": String("b61d7604-97f0-46a3-adb2-adc0dedfb0c9")
})

The missing resources:

Object({
        "containers": 
            Array([
                Object({
                        "image": String("ghcr.io/..."),
                        "name": String("zeroconf-broker")
                })
            ]),
            "imagePullSecrets": Array([
                Object({
                        "name": String("ghcr")
                })
            ])
    }
)

I'm unclear as to why.

Perhaps @bfjelds you have an inkling into the issue?

@bfjelds
Collaborator

bfjelds commented Jan 2, 2021

I've seen missing properties in what Rust kube exposes...I think I stumbled across missing create timestamps as well.

I wouldn't think that it would be a problem for this case though...seems like an odd property for someone to include in a Configuration (i would expect that to be a read only generated property).

Not sure about resources...

@bfjelds
Collaborator

bfjelds commented Jan 2, 2021

PodSpec resources should be exposed ... akri is accessing and modifying them here. Curious that it is not making it through the parse/unparse/reparse. I am curious at what point it is lost (is resources present after the initial typed parse? Is it in the deserialized string?)

@DazWilkin
Contributor Author

I was having difficulties parsing the debug output; the entire tree(s) appear to be lazily evaluated by the debugger and I was unsure how to force its hand.

I'll spend more time with it this morning.

@kate-goldenring kate-goldenring added this to Triage needed in Akri Roadmap Jan 4, 2021
@DazWilkin
Contributor Author

User (my) error on resources. I was testing valid and invalid JSON at the same time and the resources error was correctly arising from the misplaced resources section in the (intentionally) incorrect JSON.

However, there remain issues with creationTimestamp and managedFields (see below) and I have a hunch it's because akri-shared uses a (really) old version of kube (0.23.0 December 2019 ;-)) which corresponds to this commit when kube appeared to contain its own definition of Kubernetes' ObjectMeta:

https://github.com/clux/kube-rs/blob/023c785c0c451c7c8b1a3f37e37d3f937ec44467/src/api/metadata.rs#L38

In more recent versions, this uses k8s_openapi implementation:

https://github.com/clux/kube-rs/blob/54e74fb0b2321e41e9e0d8f9165639d07519379a/kube/src/api/metadata.rs#L19

I suspect this is the cause of both issues although I'm unable to explain the *Timestamp problem.

@bfjelds
Collaborator

bfjelds commented Jan 5, 2021

I'm not really familiar with ManagedFields (though from this documentation it sounds like something we can document as unvalidated for now) ... i think that creationTimestamp is something that people should not be setting in their Configuration. So maybe, for the sake of better-but-not-perfect, we can say that those properties are both either errors or unvalidated.

in the meantime, it seems like a worthwhile issue to point out how old kube is.

@DazWilkin
Contributor Author

DazWilkin commented Jan 5, 2021

I am unfamiliar with ManagedFields but the section appears to be generated by the CRD|Operator process.

Edit: I just realized from your comment that ManagedFields are created by the Webhook process.

I'm not setting creationTimestamp in the Configuration spec; it appears that this becomes fully 'materialized' before it's shipped to the validator.

Here's my input:

dazwilkin@akri:~$ more ./zeroconf.webthings.yaml 
apiVersion: akri.sh/v0
kind: Configuration
metadata:
  name: webthings
spec:
  protocol:
    zeroconf:
      kind: "_webthing._tcp"
      port: 8888
  capacity: 1
  brokerPodSpec:
    imagePullSecrets: # Container Registry secret
      - name: ghcr
    containers:
      - name: zeroconf-broker
        image: ghcr.io/dazwilkin/zeroconf-broker@sha256:69810b62...
        resources:
          limits:
            "{{PLACEHOLDER}}": "1"

And here's the AdmissionReview that the Webhook receives. The Akri Configuration is .request.object:

kind: AdmissionReview
apiVersion: admission.k8s.io/v1
request:
  uid: 2b752327-a529-4ffd-b2e2-478455e80a0d
  kind:
    group: akri.sh
    version: v0
    kind: Configuration
  resource:
    group: akri.sh
    version: v0
    resource: configurations
  requestKind:
    group: akri.sh
    version: v0
    kind: Configuration
  requestResource:
    group: akri.sh
    version: v0
    resource: configurations
  name: webthings
  namespace: default
  operation: CREATE
  userInfo:
    username: admin
    uid: admin
    groups:
      - system:masters
      - system:authenticated
  object:
    apiVersion: akri.sh/v0
    kind: Configuration
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: '{"apiVersion":"akri.sh/v0",...'
      creationTimestamp: "2020-12-30T17:47:26Z"
      generation: 1
      managedFields:
        - apiVersion: akri.sh/v0
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                ".": {}
                f:kubectl.kubernetes.io/last-applied-configuration: {}
            f:spec:
              ".": {}
              f:brokerPodSpec:
                ".": {}
                f:containers: {}
                f:imagePullSecrets: {}
              f:capacity: {}
              f:protocol:
                ".": {}
                f:zeroconf:
                  ".": {}
                  f:kind: {}
                  f:port: {}
          manager: kubectl
          operation: Update
          time: "2020-12-30T17:47:26Z"
      name: webthings
      namespace: default
      uid: b61d7604-97f0-46a3-adb2-adc0dedfb0c9
    spec:
      brokerPodSpec:
        containers:
          - image: ghcr.io/dazwilkin/zeroconf-broker@sha256:69810b62...
            name: zeroconf-broker
            resources:
              limits:
                "{{PLACEHOLDER}}": "1"
        imagePullSecrets:
          - name: ghcr
      capacity: 1
      protocol:
        zeroconf:
          kind: _webthing._tcp
          port: 8888
  oldObject:
  dryRun: false
  options:
    kind: CreateOptions
    apiVersion: meta.k8s.io/v1

@DazWilkin
Contributor Author

my ideal would be to have this be rust code that existed in the controller binary ... we could add the service definition into the helm deployment code for the controller.

I mentioned to @kate-goldenring that I upgraded to K8s 1.20 and broke the certificate process in the Webhook. I decided this morning to try cert-manager (it is excellent). I have that working as a self-signing CA generating certs for the Webhook service; it would be trivial for others to swap out self-signing and swap in e.g. Let's Encrypt. Using cert-manager makes the deployment process for the Webhook more streamlined (mostly Kubernetes specs) and should be straightforward to add to the Helm chart for Akri.
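A sketch of the cert-manager pieces involved, assuming illustrative names for the Issuer, Certificate, Secret, and Service:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned            # illustrative name
  namespace: default
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: akri-webhook-tls      # illustrative name
  namespace: default
spec:
  secretName: akri-webhook-tls
  dnsNames:
    - akri-webhook.default.svc                # illustrative Service name
    - akri-webhook.default.svc.cluster.local
  issuerRef:
    name: selfsigned
```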

IIUC the Controller is not an HTTP daemon. Obviously, the Webhook is and it requires TLS. The only shared dependency with Akri is the shared member (for KubeAkriConfig etc.). I propose (respectfully and acknowledging your ideal) that

  • the Webhook be standalone in Akri (or standalone where it is)
  • Possibly located in ./webhooks/... possibly ./webhooks/validating/configuration?
  • Possibly part of a new member project webhooks

Kubernetes Webhooks are distinct processes and out-of-band; they're registered as callbacks in the API server. Their development is distinct too; the resources they use aren't part of the K8s core and may evolve out-of-sync too. I'm still taking a dependency for the Rust solution on a self-generated 'crate' from an OpenAPI spec published in an issue. The requirement for TLS adds complexity and it may be easier to keep it all out of the existing "core".

Ultimately, it's your project and I'll do as you wish.

@bfjelds
Collaborator

bfjelds commented Jan 9, 2021

I'm totally open to separating this from controller if it makes sense. No need to force a square peg into a round whatever :)

@bfjelds
Collaborator

bfjelds commented Jan 9, 2021

I appreciate your thought and energy ... am really looking forward to this!!

@Britel Britel moved this from Triage needed to In progress in Akri Roadmap Jan 11, 2021
@Britel Britel assigned Britel and DazWilkin and unassigned Britel Jan 11, 2021
@DazWilkin
Contributor Author

The Helm Chart is working: https://github.com/DazWilkin/akri-webhook-helm

Tomorrow, I'll begin merging the Rust-based Webhook and the Helm chart into Akri proper.

@bfjelds
Collaborator

bfjelds commented Jan 13, 2021

The Helm Chart is working: https://github.com/DazWilkin/akri-webhook-helm

Tomorrow, I'll begin merging the Rust-based Webhook and the Helm chart into Akri proper.

I wish i could give more than one thumbs up :)
