
API call controller #3598

Closed
anyasabo opened this issue Aug 10, 2020 · 23 comments
Assignees
Labels
>feature Adds or discusses adding a feature to the product

Comments

@anyasabo
Contributor

This issue will be updated with current status as it progresses.

Goals

We would like to be able to declaratively configure (create/update) common configuration operations in Elasticsearch that are currently API-only, such as ILM, index settings, and index templates. This is helpful for organizations that have more than one Elasticsearch cluster but want them configured identically: for instance, separate dev, staging, and production environments, or an environment per geography. It is helpful to be able to store these settings in configuration management to ensure all environments are identical (see the "infrastructure as code" philosophy). Similar functionality exists for users and roles today.

Today to have a "fully functional" cluster, users need to create their ES cluster, wait for it to become available, then interact with the API to create the extra settings. If this could be handled by ECK it would make the process much simpler.

Initial thoughts

We may want to have generic support for essentially any endpoint that supports a PUT. Users could provide a URL and the body, and we could GET/PUT that URL as necessary. Even idempotent PUTs cause extra load on Elasticsearch, so it would be better for us to do a comparatively cheap GET first rather than blindly sending PUTs.
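
A rough sketch of that GET-before-PUT flow, assuming a plain Go HTTP client against the Elasticsearch REST API (the helper name and the naive DeepEqual comparison are only illustrative, not the eventual implementation):

package esconfig

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"reflect"
)

// ensureResource GETs the resource at url and only PUTs the desired body if the
// current state differs. The comparison is deliberately naive here; the caveats
// below about non-strict comparison apply to exactly this step.
func ensureResource(ctx context.Context, client *http.Client, baseURL, url string, desiredBody []byte) error {
	var desired, current map[string]interface{}
	if err := json.Unmarshal(desiredBody, &desired); err != nil {
		return err
	}

	resp, err := client.Get(baseURL + url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusOK {
		raw, err := io.ReadAll(resp.Body)
		if err != nil {
			return err
		}
		if err := json.Unmarshal(raw, &current); err != nil {
			return err
		}
		if reflect.DeepEqual(desired, current) {
			return nil // already in the desired state, skip the PUT
		}
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPut, baseURL+url, bytes.NewReader(desiredBody))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	putResp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer putResp.Body.Close()
	if putResp.StatusCode >= 300 {
		return fmt.Errorf("PUT %s returned %d", url, putResp.StatusCode)
	}
	return nil
}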

This resource could be modeled as a configmap, a CRD, or as part of the Elasticsearch manifest. My gut leans towards a CRD to start. Adding fields to the Elasticsearch CRD, which is already v1, is challenging because you cannot revise them easily. A configmap works well for a key/value list, which is all we really need, but it cannot be versioned easily, whereas with CRDs we at least have the option to use webhook conversion (though we do not use the feature today and there are challenges with deploying it in all environments).

Caveat: some PUTs are not idempotent (e.g. indexing documents revs the version). SLM policies include execution status information. This may be something where we do not do a strict comparison before setting -- for instance, if a user specifies a doc to be indexed, as long as all the fields the user specified match, then we do not consider it as needing an update. This has similar problems that we have experienced in the past in that it makes it difficult to remove fields previously set by the user.

Another option would be to simply leave this as a potential footgun, and to document that users shouldn't use non-idempotent API operations.

Caveat cont'd: some operations have dependencies. For instance, a snapshot repository must be created before an SLM policy referencing it can be created. This can likely be supported by executing operations sequentially within a given resource.

Proposed CRD example

apiVersion: elasticsearchconfig.k8s.elastic.co/v1alpha1
kind: ElasticsearchConfig
metadata:
  name: quickstart-config
spec:
  operations:
  - url: "/_snapshot/my_repository"
    body: |-
      {
            "type": "fs",
            "settings": {
                "location": "my_backup_location"
            }
      }
  - url: "/_slm/policy/nightly-snapshots"
    body: |-
      {
        "schedule": "0 30 1 * * ?", 
        "name": "<nightly-snap-{now/d}>", 
        "repository": "my_repository", 
        "config": { 
          "indices": ["*"] 
        },
        "retention": { 
          "expire_after": "30d", 
          "min_count": 5, 
          "max_count": 50 
        }
      }
  elasticsearchRef:
    name: quickstart

Current status

I'll do a POC of a CRD that looks like that example.

Related

ILM CRD issue
Kibana dashboard issue (ECK repo)
Kibana dashboard issue (Kibana repo)

Elasticsearch has rest api tests that function similarly here

@anyasabo anyasabo added the >feature Adds or discusses adding a feature to the product label Aug 10, 2020
@anyasabo anyasabo self-assigned this Aug 10, 2020
@this-is-r-gaurav

this-is-r-gaurav commented Aug 24, 2020

Hi, is there any progress on this? I need to configure an ILM policy with the operator but am not able to find anything helpful. The only solution left to me is a wrapper command around kubectl that first deploys the operator and then makes a PUT call to create that policy.

@xavigpich

xavigpich commented Sep 8, 2020

I would love to see this going upstream. In my org we have multiple Elasticsearch clusters with similar configuration. Having the ability to configure ILM and SLM policies as well as RBAC via configuration files would be of great help! Hopefully ECK will handle this shortly. Thanks for this

@shubhaat
Contributor

@uric, @bleskes - @anyasabo and I discussed this today, and to reiterate the concerns around this (to make sure I heard this right):

  1. We are afraid that if we support some of the config API for Elasticsearch in ECK, we would be on a constant treadmill to support future functionality.
  2. We are afraid that the implementation might look different in ECE in the future (if ECE ever plans to support this?)

The proposed solution POC (#3775) resolves concern 1 by having a generic implementation that can be extended to any API (as long as the PUTs are idempotent).

I am not sure whether concern 2 applies; I wanted to check on that and on whether there are other concerns. From our conversations with the Elasticsearch team, we decided to implement this in ECK and not the stack because it seemed significantly lower in terms of effort.

@bleskes

bleskes commented Oct 2, 2020

I love the creativity in coming up with the "Sequence of API calls" direction. I have a ton of questions, out of curiosity, but I trust the team will figure it out and I'm looking forward to hearing the solutions. There is also an interesting relationship to the package manager being developed (which may conceptually include things like SLM/ILM policies), but I can see how this API based approach can serve as a foundation for many creative use cases. I do have one ask - can we maybe think about another name for it that better describes the scope/foundational character of the feature? There's a big gap between Declarative Config and a series of API calls you can issue against a cluster. I actually had to see the examples to understand what it is about. I think naming it right will help people understand what they can do with it and how to think about it, which will result in wider adoption. Just to give an idea of what I mean: "API sequences", "API fixtures", etc.

@sebgl
Contributor

sebgl commented Oct 2, 2020

A few things I'm wondering while looking at the draft PR:

  • Is it right for the api calls specification to live in a new CRD? I understand how it could be useful for reuse across different clusters, but so far the entire Elasticsearch spec (topology, config, plugins, keystore, etc.) lives in the Elasticsearch CRD. Should we do the same here? In general we're trying to avoid introducing many CRDs. Would a configMap also work?
  • I'm not fond of the relationship direction going from the api calls resource to the elasticsearch resource (elasticsearchRef: <name>). It feels more natural to me to have the Elasticsearch resource point to the api calls spec (apiCalls: <configMapRef>)? Similar to how it points to the keystore for example.
  • Do we want an additional controller to handle the API calls? It feels natural to me for the Elasticsearch controller to handle API calls, since it already does for orchestration needs.
  • The current approach is that, for each call, we first GET the ES rest endpoint, diff with what we expect, then PUT the update. Since what we get is sometimes a superset of what we want (ES stores additional metadata), the diff is not a strict comparison. Are we confident that this can match all/most APIs people may want to deal with here?
  • I'm thinking about an alternative approach: instead of the GET/diff/PUT approach, we could always PUT what the user specified, and store the success status of that call, with a hash of its payload, somewhere (in the status of the api calls resource?). So we don't do the call over and over again once it has been successful, and we don't need to GET/diff. If the user changes the call payload, its hash changes so we know we need to PUT again. If the call reports a user error (eg. bad request), we also know there's no need to do it again. Otherwise, we retry doing it at the next reconciliation. One difference with this approach is that it doesn't modify what users may have done with the Elasticsearch API behind ECK's back.

@anyasabo
Contributor Author

anyasabo commented Oct 2, 2020

For the first three, that's definitely valid and I tried to address them in the initial issue. My main hesitation with adding it to the ES CRD is the amount of uncertainty around how we want to model this feature. Modifying the ES CRD is hard at this point as it is v1. We can add things to it, but if we need to remove or modify them later it's a Big Deal. That was my main motivation for making it a separate, alpha resource. Once we've proved out the design in practice we can still add it to the ES CRD later if we think it makes sense. If it weren't for this concern I'd have been in favor of including it in the ES CRD.

A configmap works (since this is essentially a list of k/v pairs) but has at least two downsides:

  • cannot realistically add admission validation. Webhooks would be the only option (we can't add OpenAPI validation for core types) and many cluster administrators would be hesitant to trust us to validate a core type. Since dropping JSON payloads into yaml is a bit awkward, I think the validation will be pretty useful for users.
  • cannot use built-in versioning mechanisms. This is less of an issue since we do not use mutation webhooks currently, but it is at least an option for CRDs. We would have to roll our own versioning scheme if we used configmaps.

The current approach is that, for each call, we first GET the ES rest endpoint, diff with what we expect, then PUT the update. Since what we get is sometimes a superset of what we want (ES stores additional metadata), the diff is not a strict comparison. Are we confident that this can match all/most APIs people may want to deal with here?

🤷‍♀️ I contacted the ES team to take a look but haven't heard back yet. I'll ask again on Monday.

I'm thinking about an alternative approach: instead of the GET/diff/PUT approach, we could always PUT what the user specified, and store the success status of that call, with a hash of its payload, somewhere (in the status of the api calls resource?)

That's valid. It feels a bit like an anti-pattern to store all of the state indefinitely in the CR, whereas most controllers reconstruct their state via observation whenever possible and only store it in etcd when that is not possible. I think I'd be open to it if we find there are more severe problems with this approach and we need to look for alternatives, but as is I prefer retrieving the state from ES.

@pebrc
Collaborator

pebrc commented Oct 13, 2020

  1. I think I am in favour of keeping the CRD separate, for the same reasons that @anyasabo has mentioned. We are a bit uncertain about the future of this feature and a separate CRD allows us to give it a lifecycle independent from the main Elasticsearch CRD.

  2. The same argument applies to the question of the direction of the reference: from the Elasticsearch cluster to the API operations resource (apiCalls as suggested by @sebgl) or from the API operations resource to an Elasticsearch cluster via elasticsearchRef. The latter allows us to keep the Elasticsearch resource completely unaware of the existence of the API operations resource. The only immediate drawback that comes to mind is that it makes it possible for users to create multiple API operation resources that all point to the same Elasticsearch cluster, all with potentially conflicting configs.

  3. As to the method of GET/diff/PUT, I have doubts that the current approach will work for many APIs. This is mentioned already in the OP with respect to indexing documents and SLM. Consider also creating an ILM policy:

{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "10d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

On create, many APIs take an "anonymous" object. The name is communicated via the path in PUT _ilm/policy/my_policy

On GET, however, the structure is different: the object is returned keyed under its name and with additional metadata at the top level:

{
  "my_policy": {
    "version": 1, 
    "modified_date": 82392349, 
    "policy": {
      "phases": {
        "warm": {
          "min_age": "10d",
          "actions": {
            "forcemerge": {
              "max_num_segments": 1
            }
          }
        },
        "delete": {
          "min_age": "30d",
          "actions": {
            "delete": {}
          }
        }
      }
    }
  }
}

If I am understanding the suggested approach correctly, this means the diff would always be non-empty and we would PUT this policy in an infinite cycle. The problem here seems twofold. First, the request is not idempotent (a version counter is incremented on each update), but the object structure of the GET is also not the same as that of the PUT.

I wonder how many Elasticsearch APIs are truly idempotent? With SLM and ILM and thus two major APIs already excluded, maybe we should consider the hashing approach @sebgl suggested after all to make this feature more useful?

PS: An alternative to hashing (which admittedly has the drawback of accumulating all the state in annotations) would be to make this feature not fully generic.

Instead, we would encode knowledge about the structure of the major APIs we would support. Based on this ILM example, we could have a map of API to JSON-pointer-ish path that allows the controller to extract the relevant portion of the JSON object returned on GET. So in this case (pseudo-code):

map[string][]string{
    "_ilm/policy/\1": []string{"\1", "policy"}
}

Note that the path contains a dynamic component in the form of the policy name, which I suggest matching from the last path segment of the API.

If the user specs an unknown API, we would fall back to the current behaviour from the draft PR and assume GET == PUT.
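
A sketch of what that lookup could give in Go (the map contents, the $name placeholder, and the fallback behaviour are illustrative assumptions based on the pseudo-code above):

package esconfig

import (
	"encoding/json"
	"path"
	"strings"
)

// knownAPIs maps an API prefix to the keys under which a GET response nests the
// object that was PUT. "$name" stands for the last path segment (the resource name).
var knownAPIs = map[string][]string{
	"/_ilm/policy/": {"$name", "policy"},
	"/_slm/policy/": {"$name", "policy"},
}

// extractComparable pulls out the sub-object of a GET response that should be
// diffed against the PUT body. For unknown APIs it falls back to the whole
// response, i.e. the current behaviour of the draft PR.
func extractComparable(url string, getResponse []byte) (map[string]interface{}, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(getResponse, &doc); err != nil {
		return nil, err
	}

	name := path.Base(url)
	for prefix, keys := range knownAPIs {
		if !strings.HasPrefix(url, prefix) {
			continue
		}
		for _, key := range keys {
			if key == "$name" {
				key = name
			}
			nested, ok := doc[key].(map[string]interface{})
			if !ok {
				return doc, nil // unexpected shape, fall back to the full document
			}
			doc = nested
		}
		return doc, nil
	}
	return doc, nil
}

Applied to the ILM example above, this would strip the my_policy key, version and modified_date from the GET response before diffing against the PUT body.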

@barkbay
Contributor

barkbay commented Oct 14, 2020

I was wondering how we should handle deletion. What if the user wants to delete something? (Creating objects at scale is great; being able to fix a mistake at scale is great too.)
I agree that some "objects" like remote clusters can be deleted with a PUT, but some others like ILM require a DELETE operation.

Is it out of the scope of this feature?

Also, I think I agree with @bleskes about the naming. Modern, declarative-language-based tools like Terraform have advanced management of object lifecycles. It might be hard to offer a similar experience if the user has to provide a URL and not a desired state.

@anyasabo
Contributor Author

I was wondering how we should handle deletion. What if the user wants to delete something? (Creating objects at scale is great; being able to fix a mistake at scale is great too.)
I agree that some "objects" like remote clusters can be deleted with a PUT, but some others like ILM require a DELETE operation.

Is it out of the scope of this feature?

My initial thinking was that yes it is out of scope (at least for now) because it is hard. I think it may be possible though to extend the operation object with an http method like so (where the default would be PUT):

spec:
  operations:
  - url: "/_snapshot/my_repository"
    body: |-
      {
            "type": "fs",
            "settings": {
                "location": "my_backup_location"
            }
      }
  - url: "/_ilm/policy/my_policy"
    method: DELETE

I think we need to be explicit about deletes -- just removing an item from the list should not trigger deleting the resource IMO.

@anyasabo
Contributor Author

One other extension point I thought of is adding another field containing query parameters to use only on PUTs. For instance, component templates have a flag allowing you to update them. You can't put the parameters in the URL because that would break the GETs. I suppose alternatively we could strip any query parameters when doing GETs, but that feels less obvious to the end user.

It doesn't seem like something that needs to be added on launch, but could make sense later.
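
For what it's worth, stripping the query string for the GET would be a one-liner with net/url (a sketch; whether to strip on GET or to add a separate put-parameters field is the open question above):

package esconfig

import "net/url"

// getURL drops any query parameters from an operation URL so that flags meant
// only for the PUT (like the component template flag mentioned above) do not
// leak into the read-back GET.
func getURL(operationURL string) (string, error) {
	u, err := url.Parse(operationURL)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	return u.String(), nil
}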

@anyasabo
Contributor Author

anyasabo commented Nov 3, 2020

In an out of band discussion, we decided to put this approach on pause for the time being. It is feasible, but we would need to add support for specific APIs one by one as we cannot depend on consistency between different Elasticsearch APIs. This could have been caught sooner as an issue with the approach and that's on me.

We decided to investigate an alternative option of having an example k8s job that can simply send PUTs until a success response code is returned from the API. This is more of a "bootstrap" use case than a configuration mgmt use case, but it should meet a lot of needs for not that much effort.

I'm not sure there is an approach that can satisfy the "reconciliation loop" model without explicitly supporting various APIs. I left my branch https://github.com/anyasabo/cloud-on-k8s/tree/esdeclare up in case we decide to revisit this approach though. We would just need to weigh the benefits against the large maintenance costs of supporting such a feature. It may also be worth checking with the ES clients team to see how they have resolved similar issues. My initial look suggested they mostly have not, but it was very preliminary and could use further investigation.

The current state of the branch has unit tests for some specific API responses (pulled from an actual ES server) and their transformations, and the framework for e2e tests but without any e2e tests yet.

@shubhaat
Contributor

shubhaat commented Nov 12, 2020

We decided not to call it "Declarative ES Config" and a more intelligent name is TBD.

@shubhaat changed the title from "Declarative config for ES" to "API call controller [name might change]" Nov 12, 2020
@charith-elastic
Contributor

charith-elastic commented Nov 23, 2020

If we consider the differences between having an operator vs. using Helm/Terraform/other tool to orchestrate an Elasticsearch cluster, the advantage of the operator is that it is able to constantly ensure that the cluster is in the desired state and prevent configuration drift (immediate feedback vs. delayed feedback). Ensuring that all changes to a cluster are documented (e.g. as Git commits) and having the confidence that re-creating the cluster via the last manifest brings up the same cluster are quite desirable traits from an ops perspective.

As Anya's work illustrates, building such a feature externally is a brittle proposition because it is impossible to do without intimate knowledge of the Elasticsearch API between different versions and playing catch-up constantly to keep abreast of developments in new versions. Furthermore, the need to employ polling techniques to ensure that the cluster is in the desired state is inefficient and could even end up overloading busy clusters. Therefore, I think the best place to implement this feature is in Elasticsearch itself.

I experimented a bit with the "fire-and-forget" alternative.

  • Add a subcommand to the operator called postprovision (name TBD) that takes a YAML or JSON containing a set of API calls to execute. This can be executed as a Kubernetes Job.
  • When the API calls are executed successfully, the Elasticsearch cluster is annotated to indicate that state.
  • Users can define an optional readiness gate in the Elasticsearch definition. This makes the cluster unavailable to clients until the job has completed successfully.

This work is available in the https://github.com/charith-elastic/cloud-on-k8s/tree/feature/api-conf branch.

---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: hulk
spec:
  version: 7.10.0
  nodeSets:
    - name: default
      count: 3
      config:
        node.store.allow_mmap: false
      podTemplate:
        spec:
          readinessGates:
            - conditionType: eck.k8s.elastic.co/post-provision
---
apiVersion: batch/v1
kind: Job
metadata:
  name: hulk-post-provision
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: postprovision
          args:
            - postprovision
            - --jobdef=/jobs/job.yaml
            - --log-verbosity=1
          image: docker.elastic.co/eck-dev/eck-operator-cell:1.4.0-SNAPSHOT-5851b163
          volumeMounts:
            - mountPath: /jobs
              name: jobs
              readOnly: true
      securityContext:
        runAsNonRoot: true
      serviceAccountName: hulk-post-provision
      volumes:
        - name: jobs
          configMap:
            name: hulk-post-provision
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: hulk-post-provision
data:
  job.yaml: |-
    target:
      kind: Elasticsearch
      namespace: default
      name: hulk
    clientConf:
      requestTimeout: 30s
      retryAttempts: 3
      retryBackoff: 10s
      retryMaxDuration: 150s
    apiCalls:
      - method: PUT
        path: /_ilm/policy/my_policy
        payload: {
          "policy": {
            "phases": {
              "hot": {
                "actions": {
                  "rollover": {
                    "max_size": "25GB"
                  }
                }
              },
              "delete": {
                "min_age": "30d",
                "actions": {
                  "delete": {}
                }
              }
            }
          }
        }
        successCodes: [200]
        retry: true
      - method: PUT
        path: /_cluster/settings
        payload: {
          "transient": {
            "indices.lifecycle.poll_interval": "3m"
          }
        }
        successCodes: [200]
        retry: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hulk-post-provision
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hulk-post-provision
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - secrets
    verbs:
      - get
      - list
  - apiGroups:
      - elasticsearch.k8s.elastic.co
    resources:
      - elasticsearches
    verbs:
      - get
      - list
      - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hulk-post-provision
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hulk-post-provision
subjects:
  - kind: ServiceAccount
    name: hulk-post-provision
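
For illustration, the execution loop such a postprovision job implies could be as simple as the following sketch (types and field names are assumed from the example job.yaml above, not taken from the linked branch):

package postprovision

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

// APICall mirrors one entry of the apiCalls list in the job definition above.
type APICall struct {
	Method       string
	Path         string
	Payload      []byte
	SuccessCodes []int
	Retry        bool
}

// execute sends the call and retries with a fixed backoff on failure, roughly
// matching the clientConf block (retryAttempts / retryBackoff) in the example.
func execute(ctx context.Context, client *http.Client, baseURL string, call APICall, attempts int, backoff time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequestWithContext(ctx, call.Method, baseURL+call.Path, bytes.NewReader(call.Payload))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/json")

		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			for _, code := range call.SuccessCodes {
				if resp.StatusCode == code {
					return nil // success: fire-and-forget, nothing else to reconcile
				}
			}
			lastErr = fmt.Errorf("%s %s returned %d", call.Method, call.Path, resp.StatusCode)
		} else {
			lastErr = err
		}
		if !call.Retry {
			break
		}
		time.Sleep(backoff)
	}
	return lastErr
}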

The issues with this approach are:

  • Job definition is too verbose due to requiring the proper RBAC declarations
  • The Job declaration is a separate entity from the Elasticsearch declaration. The connection between them is not obvious.
  • Version upgrades could potentially require re-running the job -- which the users must remember to do.
  • Version upgrades could break the API calls defined in the job.

We could potentially introduce a new CRD or a new field to the existing CRD to alleviate some of the above problems. The question is whether we are comfortable with that level of deep integration for a brittle feature like this.

@pebrc changed the title from "API call controller [name might change]" to "API call controller" Dec 1, 2020
@sebgl
Contributor

sebgl commented Sep 10, 2021

I'd like to expand a bit on the hashing approach I proposed some time ago.

CRD

First, what the CRD would look like (nothing new here) - I think this would be a section in the existing Elasticsearch CRD:

api_requests:
- path: /_security/role_mapping/sso-kibana-123
  method: PUT
  body: |
    {
      "rules": {
          "all": [
              {
                  "field": {
                      "realm.name": "cloud-saml-kibana-abcdeaad866da82f61ed1"
                  }
              },
              {
                  "field": {
                      "groups": "superuser"
                  }
              }
          ]
      },
      "enabled": true,
      "roles": ["superuser"],
      "metadata": {}
    }
- path: /_snapshot/cloud-snapshots
  method: PUT
  body: |
    {
      "cloud-snapshots": {
        "type": "gcs",
        "settings": {
          "bucket": "abcdfd3f246639bfbaa5b01fb0",
          "client": "client-a9fc50",
          "base_path": "snapshots/abcdfd3f246639bfbaa5b01fb0",
          "region": "us-central1",
          "email": "abcd@cloud-email.com",
          "aws_account": "abcd-us-central1"
        }
      }
    }
- path: /_slm/policy/cloud-snapshot-policy
  verb: PUT
  body: |
    {
      "name": "<cloud-snapshot-{now/d}>",
      "schedule": "0 */30 * * * ?",
      "repository": "cloud-snapshots",
      "config": {
        "partial": true
      },
      "retention": {
        "expire_after": "259200s",
        "min_count": 10,
        "max_count": 100
      }
    }
- path: /_cluster/settings
  verb: PUT
  body: |
    {
    "persistent": {
        "action": {
          "auto_create_index": "true"
        },
        "cluster": {
          "routing": {
            "allocation": {
              "disk": {
                "threshold_enabled": "true"
              }
            }
          },
          "indices": {
            "close": {
              "enable": "false"
            }
          }
        }
  }
}
- path: /_cluster/settings
  verb: PUT
  body: |
    {
      "persistent.snapshot.max_concurrent_operations": null
    }
- path: /_template/.cloud-hot-warm-allocation
  verb: PUT
  body: |
    {
      ".cloud-hot-warm-allocation-0": {
      "index_patterns": [
        "*"
      ],
      "mappings": {},
      "aliases": {},
      "order": 0,
      "settings": {
        "index": {
          "routing": {
            "allocation": {
              "require": {
                "data": "hot"
              }
            }
          }
        }
      }
      }
    }
- path: /_template/.another-deprecated-template
  verb: DELETE

What the feature is about

I think the general philosophy would be to:

  • offer a way to declare http api requests the operator should run once Elasticsearch is available
  • retry each request until it succeeds - likely at the next reconciliation
  • once successful, don't execute the same http request again
  • if the content of an http request changes (e.g. new body), consider it a new http request to execute

Persisting hashes of successful http calls

Although many Elasticsearch requests materialize as idempotent PUTs, I think it would be a shame to execute the same request over and over again. In most cases users just want to configure things once.
In order for this to work, we (unfortunately) need to persist which requests have already been successfully executed.
Each http request is identified by a hash of its content (everything that makes up the request: path, method, body, etc.), so what we need to persist is a set of hashes for all successful api calls.

For example, the sha1 hash of {"path": "/_snapshot/cloud-snapshots", "method": "PUT", "body": "{\"foo\": \"bar\"}"} is b3b07b553ff32cd0ad9767eebc3bb70b6a4206ca.
Those hashes can be persisted in an annotation of the Elasticsearch resource:

metadata:
  annotations:
    elasticsearch.k8s.elastic.co/successful_http_requests: "[\"b3b07b553ff32cd0ad9767eebc3bb70b6a4206ca\"]"

We only need to persist hashes matching API requests specified in the specification. If the user removes the snapshot repository request from the specification, we can also remove its hash from the annotation.
If the user modifies the content of the http request (e.g. a new value for settings.bucket in the snapshot repository request body), that request has a new hash value and hence will be executed again (with its new content) by the controller.
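
A minimal sketch of that hashing step (the struct and its JSON serialization are illustrative; any stable serialization of path, method and body would do):

package esconfig

import (
	"crypto/sha1"
	"encoding/hex"
	"encoding/json"
)

// APIRequest is one entry of the api_requests section (illustrative type).
type APIRequest struct {
	Path   string `json:"path"`
	Method string `json:"method"`
	Body   string `json:"body"`
}

// hash identifies a request by the sha1 of its full content, so that any change
// to the path, method or body yields a new hash and triggers re-execution.
func (r APIRequest) hash() string {
	raw, _ := json.Marshal(r) // field order is fixed by the struct, so this is stable
	sum := sha1.Sum(raw)
	return hex.EncodeToString(sum[:])
}

The resulting hex strings are what would end up in the successful_http_requests annotation shown above.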

At least once http calls, not declarative configuration management

If a user patches the snapshot repository through the Elasticsearch API directly, bypassing the yaml manifest, then it's considered outside the operator's control. The operator won't try to execute the same http request again, since it has already succeeded.
The only way to have the exact same request executed again is to remove its hash from the persisted successful http requests annotation. As such, a way to have all http requests executed again is to remove the annotation containing all hashes.

This feature is "just" about executing http calls at least once. It does not allow a user to declaratively manage all objects of a given type. For example, let's imagine a user sets up 2 index templates in the specification: removing an index template from the spec does not lead the controller to notice it should also be removed in Elasticsearch. That removal can only be made explicitly, through a DELETE http call specified in the spec:

- path: /_template/.cloud-hot-warm-allocation
  verb: PUT
  body: |
    {
      ".cloud-hot-warm-allocation-0": {
      "index_patterns": [
        "*"
      ],
      "mappings": {},
      "aliases": {},
      "order": 0,
      "settings": {
        "index": {
          "routing": {
            "allocation": {
              "require": {
                "data": "hot"
              }
            }
          }
        }
      }
      }
    }
- path: /_template/.another-deprecated-template
  verb: DELETE

Another example: remote clusters. They can be configured through the _cluster/settings API, which is an additive API:

PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
         "my-remote-cluster-1": {
         "mode":"proxy",
         "proxy_address": "a542184a7a7d45b88b83f95392f450ab.192.168.44.10.ip.es.io:9400",
         "server_name": "a542184a7a7d45b88b83f95392f450ab.192.168.44.10.ip.es.io"
        }
      }
    }
  }
}

In order to remove my-remote-cluster-1, the user must explicitly null-out the previously configured cluster:

PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
         "my-remote-cluster-1": {
         "mode": null,
         "proxy_address": null,
         "server_name": null
        }
      }
    }
  }
}

Those cases may be better handled by an explicit remoteClusters feature in ECK (well, we have it already :)), rather than through that generic http requests mechanism.

Status reporting

In addition to the internal annotation persisting successful request hashes, I think it would be useful to provide a way for users to know whether all requests have been successfully executed, which naturally falls into the status subresource scope:

status:
  health: green
  phase: Ready
  version: 7.15.0
  api_requests: 12/12 # 12 requests in the spec, 12 matching hashes in the annotation

I think it makes sense to keep this status very lightweight, and not go in a direction where the result of each call would be serialized. Rather, if a user notices api_requests: 11/12, they would inspect the events associated with the Elasticsearch resource to get more details about the failing http request. We care about failing requests more than successful ones in that case.

Success criteria

All Elasticsearch API calls I know about can be considered successful under the following conditions:

  • 2xx http response code to a PUT request
  • 2xx or 404 (resource does not exist) response code to a DELETE request
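
Expressed as a predicate (a sketch matching the successful() check in the pseudo-code further down):

package esconfig

import "net/http"

// successful implements the criteria above: any 2xx response is a success,
// and a 404 on a DELETE also counts since the resource is already gone.
func successful(method string, statusCode int) bool {
	if statusCode >= 200 && statusCode < 300 {
		return true
	}
	return method == http.MethodDelete && statusCode == http.StatusNotFound
}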

Sequential ordering of the requests

There is an order in which requests may need to be executed. For example, a snapshot repository must be created before an SLM policy that refers to it. Therefore all http calls defined in the spec must be executed in the sequential order in which they are defined. If request 3/12 fails, then request 4/12 does not even get executed. This provides an additional way for users to notice which call is likely failing (status says 3/12: call number 4 hasn't succeeded yet!). I don't think an optimization where we would execute http requests in parallel for performance reasons would matter much here.

Additional settings

We could provide more options in the specification for users to tweak how requests get executed. However I think in practice the following works just fine (so I'd rather not add those options for now):

  • request timeout: use the same as we already use for other http requests in the controller
  • retry policy: retry at the next reconciliation iteration, relying on the controller reconciliation retry backoff mechanism. No need for a max number of retries, retry forever as we do for other errors.
  • retry condition: I think the success criteria based on http status code defined above would work well with all APIs I know about
  • no hash persistence: are there cases where users would rather have the operator perform the same http request again and again at each reconciliation, just in case things drift on Elasticsearch side? In which case we may want to couple this with a retryPeriod. I tend to consider that's not required for most use cases.

Caching issues with the successful hashes annotation

If we execute all requests, then update the annotation, chances are that the annotation may not be visible yet at the next reconciliation attempt. That's because the controller works with a cached version of the Elasticsearch resource, and the cache has not yet been populated with the updated annotation. We could consider that's not a problem: run the same idempotent requests again. Or (my preference) rely on the existing expectations mechanism: check the updated resourceVersion, and only reconcile again once the cache has a higher resourceVersion.

Pseudo-code implementation

execute_api_requests():
  if !elasticsearch_responds_to_requests:
    return and requeue

  prior_successful_hashes = es.annotations["successful_hashes"]
  new_successful_hashes = []

  for request in es.spec.api_requests:
    request_hash = sha1(request)

    if request_hash in prior_successful_hashes:
      # already successful in the past, move to the next
      new_successful_hashes.append(request_hash)
      continue

    # not successful in the past, execute now
    response, err = http_client.call(request)
    if err || !successful(request, response):
      # request failed, don't execute the next one
      break
    
    # request successful
    new_successful_hashes.append(request_hash)
  
  # we're done, update the annotation and the status
  es.annotations["successful_hashes"] = new_successful_hashes
  es.status.api_requests = len(new_successful_hashes) + "/" + len(es.spec.api_requests)
  k8s_client.update(es)
  reconciliation_cache.expect(es.metadata.resourceVersion)

  if len(new_successful_hashes) != len(es.spec.api_requests):
    # some requests have failed
    return and requeue
  
  # all good
  return

@arieh-HelloHeart

To solve this minimally, we have implemented a small utility that can be used as a sidecar / job to configure things against the ES/Kibana API:
https://github.com/Hello-Heart/elasticsearch-api-configurator

@mtparet

mtparet commented Feb 2, 2022

(sorry for the duplicate) I think it is interesting to see that Elastic made an official Terraform provider so users can natively use Terraform. I hope we soon get the same UX using the k8s-native way (CRDs).

https://www.elastic.co/blog/streamline-configuration-processes-with-official-elastic-stack-terraform-provider

@baurmatt

We would also be very grateful if this could be implemented. Our current hack is to inject a sidecar:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
spec:
  nodeSets:
    - name: default
      podTemplate:
        spec:
          containers:
            - name: elasticsearch-s3-configuration
              image: curlimages/curl
              command:
                - sh
                - -c
                - |
                  echo "machine localhost login $USERNAME password $PASSWORD" > $HOME/elastic-credentials
                  set -x
                  while [ "$(curl -k -s -o /dev/null -w '%{http_code}\n' --netrc-file $HOME/elastic-credentials https://localhost:9200/_snapshot/s3_snapshots)" != "200" ]; do
                    curl -k -X PUT --netrc-file $HOME/elastic-credentials https://localhost:9200/_snapshot/s3_snapshots -H 'Content-Type: application/json' -d'
                    {
                      "type": "s3",
                      "settings": {
                        "bucket": "es-backup",
                        "endpoint": "minio.example.org"
                      }
                    }'
                    sleep 10
                  done
                  sleep infinity
              env:
                - name: USERNAME
                  value: elastic
                - name: PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: logserver-es-elastic-user
                      key: elastic

@rtkkroland

About to implement a hack for this as well. Mergey MacMergeface! MVP! 😄

@genofire

I believe a direct resource controller would be nicer, a better solution.
Maybe Elastic should reimplement/merge this controller:
https://github.com/xco-sk/eck-custom-resources

@Kushmaro

Kushmaro commented May 10, 2023

Closing this issue for now, as it describes a solution to a now-solved problem: the StackConfigPolicy CRD in ECK allows you to configure what people here envisioned the "API call controller" doing.

@aravindh-odyssey

Hello, is there a standalone tool that supports declarative ES API calls? While this thread is mainly about k8s integration, we are a small company that uses managed ES. We have a similar situation where we would like all operations on ES to be done via configuration, so that changes can be checked into source control, providing consistency across environments and auditability. What are our options if we don't use k8s?

@thbkrkr
Contributor

thbkrkr commented Oct 9, 2023

@aravindh-odyssey, as mentioned in #3598 (comment), Elastic supports an official Elastic Stack Terraform provider that allows you to manage and configure the Elastic Stack as code using Terraform.

@aravindh-odyssey

@thbkrkr Thank you for pointing me towards that. This helps.
