
feat: support static tokens to be used for agent enrollment #2654

Merged: 7 commits into main from statis_tokens on Jul 24, 2023

Conversation


@olegsu olegsu commented Jun 1, 2023

What is the problem this PR solves?

This PR adds support for static tokens for agent enrollment.
The reason we need this capability is to support a flow for serverless security projects where an agent should be installed at the same time as the other project services (Kibana, Elasticsearch, Fleet).
To be able to install an agent at that point, we need a predictable token that both the agent and the fleet server accept.

How does this PR solve the problem?

By accepting static tokens (well, just the key part of the token), fleet server will authenticate a request whose enrollment token contains this key, and will send back the API key used to communicate with Elasticsearch. For example, if the agent is passed a token like base64(0123456789:abcdefg), fleet server will accept it if the configuration includes the following:

inputs:
  - type: fleet-server
    policy.id: "${FLEET_SERVER_POLICY_ID:fleet-server-policy}"
    server:
      auth: static
      static_policy_tokens:
        enabled: true
        policy_tokens:
          # token.key: policy_id
          abcdefg: 901b70f0-fefa-11ed-aa5e-5974b0535e80
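
For illustration, the lookup described above amounts to something like the following Go sketch. The function and variable names here are illustrative, not the actual fleet-server code; only the token format and the key-to-policy mapping come from the description above.

package main

import (
    "encoding/base64"
    "fmt"
    "strings"
)

// policyTokens mirrors the policy_tokens map from the config above:
// token key -> policy ID.
var policyTokens = map[string]string{
    "abcdefg": "901b70f0-fefa-11ed-aa5e-5974b0535e80",
}

// resolveStaticToken decodes a base64(id:key) enrollment token and looks up
// the policy ID configured for its key part.
func resolveStaticToken(token string) (string, bool) {
    raw, err := base64.StdEncoding.DecodeString(token)
    if err != nil {
        return "", false
    }
    // Only the key part after the colon participates in the static lookup.
    _, key, ok := strings.Cut(string(raw), ":")
    if !ok {
        return "", false
    }
    policyID, found := policyTokens[key]
    return policyID, found
}

func main() {
    token := base64.StdEncoding.EncodeToString([]byte("0123456789:abcdefg"))
    if policyID, ok := resolveStaticToken(token); ok {
        fmt.Println("enroll against policy:", policyID)
    }
}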

Today's flow looks something like this:

[flow diagram image]

And this PR will support the following flow:

[flow diagram image]

How to test this PR locally

  1. Create a local stack: elastic-package stack up -s "elasticsearch,kibana,package-registry" --version 8.8.0 -v -d
  2. Build and run the fleet-server local image (attached to the network of the stack)
  3. Run the Elastic Agent docker image (attached to the same network)
  4. See that the Elastic Agent is healthy and the spawned Cloudbeat is sending findings.

Steps 2 and 3 look simple, but they require more configuration and changes (especially if you want to debug the fleet-server process).
First, building the docker image for the server will not work on ARM architecture, as make build-docker compiles it for amd64 (I guess I was missing something).
The command to run the image looks like:

docker run -it --rm \
    --network elastic-package-stack_default \
    -v $(pwd)/fleet-server.yml:/etc/fleet-server.yml \
    -v $HOME/.elastic-package/profiles/default/certs/fleet-server:/etc/ssl/certs \
    -v $HOME/.elastic-package/profiles/default/certs/ca-cert.pem:/etc/ssl/certs/elastic-package.pem \
    -e FLEET_TOKEN_POLICY_NAME="Fleet Server (elastic-package)" \
    -e FLEET_URL=https://fleet-server:8220 \
    -e KIBANA_HOST=https://kibana:5601 \
    -e ELASTIC_CONTAINER=true \
    -e FLEET_SERVER_SERVICE_TOKEN=**REDACTED** \
    -e FLEET_SERVER_ENABLE=1 \
    -e ELASTICSEARCH_HOSTS=https://elasticsearch:9200 \
    -e KIBANA_FLEET_SERVICE_TOKEN=**REDACTED** \
    -e KIBANA_FLEET_HOST=https://kibana:5601 \
    -e FLEET_SERVER_ELASTICSEARCH_HOST=https://elasticsearch:9200 \
    -e FLEET_SERVER_HOST=0.0.0.0 \
    -e FLEET_SERVER_CERT_KEY=/etc/ssl/certs/key.pem \
    -e FLEET_SERVER_CERT=/etc/ssl/certs/cert.pem \
    -e KIBANA_FLEET_SETUP=1 \
    -p 8220:8220 \
    --name fleet-server \
    docker.elastic.co/fleet-server/fleet-server:8.9.0

Where the change in fleet-server.yml was to update inputs[0].server with:

      static_policy_tokens:
        enabled: true
        policy_tokens:
          # token.key: policy_id
          abcdefg: 901b70f0-fefa-11ed-aa5e-5974b0535e80

The command to run the Elastic Agent looks like:

docker run -it \
    --network elastic-package-stack_default \
    -v $HOME/.elastic-package/profiles/default/certs/ca-cert.pem:/etc/ssl/certs/elastic-package.pem \
    -e FLEET_ENROLLMENT_TOKEN=MDEyMzQ6YWJjZGVmZw== \
    -e FLEET_ENROLL=1 \
    -e FLEET_URL=https://fleet-server:8220 \
    -e KIBANA_HOST=https://kibana:5601 \
    -e ELASTIC_CONTAINER=true \
    docker.elastic.co/elastic-agent/elastic-agent-complete:8.8.0

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@olegsu olegsu added the enhancement New feature or request label Jun 1, 2023
@olegsu olegsu requested a review from a team as a code owner June 1, 2023 07:19
@olegsu olegsu force-pushed the statis_tokens branch 2 times, most recently from 647d485 to a0c9c7f on June 1, 2023 07:21
@olegsu olegsu changed the title feat: support statis tokens to be used for agent enrollment feat: support static tokens to be used for agent enrollment Jun 1, 2023

elasticmachine commented Jun 1, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-07-24T07:09:04.430+0000

  • Duration: 42 min 37 sec

Test stats 🧪

Test Results
Failed 0
Passed 751
Skipped 1
Total 752

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@olegsu olegsu force-pushed the statis_tokens branch 2 times, most recently from 8f99d7e to 9e482fc on June 1, 2023 07:36

@joshdover joshdover left a comment


Hi @olegsu, exciting to see this up already. I have a few notes & questions on an initial look:

  • What is the behavior on checkin if the specified policy ID does not actually exist yet?
    • Because the policies are initialized by Kibana asynchronously from Fleet Server startup, I think we absolutely need to handle this case. Probably the agent could receive some default / mostly empty policy that doesn't do much, but I'm not sure what this would break on the Agent side, for instance if there are no inputs and no output.
    • We will need to have e2e tests that include Agent and Fleet Server for this scenario because it will be important not to break it
  • I think we will want to limit who can enroll agents using a static token.
    • I'd suggest we do this by IP address / range so we can lock this down to only agents deployed within our infrastructure internally.
    • This doesn't necessarily need to be implemented in the first PR, but we should think about what the final config schema should be so we can add this config later without making a breaking change to the structure.
  • This definitely needs unit and e2e tests before we'll be able to merge the first PR.


olegsu commented Jun 1, 2023

Thank you for the quick feedback, @joshdover

  • What is the behavior on checkin if the specified policy ID does not actually exist yet?

    • Because the policies are initialized by Kibana asynchronously from Fleet Server startup, I think we absolutely need to handle this case. Probably the agent could receive some default / mostly empty policy that doesn't do much, but I'm not sure what this would break on the Agent side, for instance if there are no inputs and no output.

This is a good point. I agree we need to validate that the policy has already been created. I see two additional ways this can be achieved:

  1. On the check-in flow, check if the policy exists and return an error to the agent if not. This will cause the agent to exit and k8s will restart it. Eventually, Kibana will create the policy (my suggestion).
  2. On the Fleet setup flow, check that all the policies exist and either wait or exit if at least one does not. Same as before; eventually it will be added.

On the other hand, sending an empty policy to the agent would also force us to update it once the real policy has been created, which I am not sure is something we want to handle at this point.

  • I think we will want to limit who can enroll agents using a static token.

    • I'd suggest we do this by IP address / range so we can lock this down to only agents deployed within our infrastructure internally.
    • This doesn't necessarily need to be implemented in the first PR, but we should think about what the final config schema should be so we can add this config later without making a breaking change to the structure.

Right, we need to have some ability to do that, not sure how though.

  • This definitely needs unit and e2e tests before we'll be able to merge the first PR.

Thank you, will try to do that.


joshdover commented Jun 1, 2023

This is a good point. I agree we need to validate that the policy has already been created. I see two additional ways this can be achieved:

1. On the check-in flow, check if the policy exists and return an error to the agent if not. This will cause the agent to exit and k8s will restart it. Eventually, Kibana will create the policy (my suggestion).

2. On the Fleet setup flow, check that all the policies exist and either wait or exit if at least one does not. Same as before; eventually it will be added.

On the other hand, sending an empty policy to the agent would also force us to update it once the real policy has been created, which I am not sure is something we want to handle at this point.

Thinking about this more, I think the enrollment should reject until the policy is created. This way we avoid having agents in the .fleet-agents index that point to a policy_id that doesn't exist yet, which could break several things. Having k8s restart the agent until it exists seems acceptable for now.
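
A minimal sketch of that agreed behavior, with policyExists standing in for whatever lookup fleet-server actually uses against Elasticsearch; all names here are illustrative, not the merged implementation:

package enroll

import (
    "context"
    "errors"
)

// ErrPolicyNotFound is returned while the policy referenced by a static
// token has not been created by Kibana yet.
var ErrPolicyNotFound = errors.New("policy for static enrollment token does not exist yet")

// PolicyChecker is a stand-in for the real policy lookup.
type PolicyChecker func(ctx context.Context, policyID string) (bool, error)

// guardStaticEnroll rejects the enroll request until the policy exists, so
// no agent document is written to .fleet-agents with a dangling policy_id.
// The agent exits on the error and Kubernetes restarts it, retrying enrollment.
func guardStaticEnroll(ctx context.Context, exists PolicyChecker, policyID string) error {
    ok, err := exists(ctx, policyID)
    if err != nil {
        return err
    }
    if !ok {
        return ErrPolicyNotFound
    }
    return nil
}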

Right, we need to have some ability to do that, not sure how though.

Yeah, that's fine; let's just make sure the config will allow us to add it later. If we needed to have different allowed IPs per policy_token, that would be harder to add without a breaking change to the config schema you proposed. Maybe we should change policy_tokens to an array so we can add options later. I like this more too, since it makes the config more self-describing:

inputs:
  - type: fleet-server
    policy.id: "${FLEET_SERVER_POLICY_ID:fleet-server-policy}"
    server:
      auth: static
      static_policy_tokens:
        enabled: true
        allowed_ips: [10.0.0.0/24] # allowlist for all tokens
        policy_tokens:
          - policy_id: 901b70f0-fefa-11ed-aa5e-5974b0535e80
            token: abcdefg
            allowed_ips: [10.0.0.0/24] # allowlist for just this token

A couple other notes:

  • Let's call these enrollment_tokens to be more consistent with other naming
  • We should enforce some minimum length on these (I'd suggest 32 characters)
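
For illustration, the array-based schema above could map onto Go config structs like these, with the minimum-length check from the second note folded in. This is a sketch of the proposal, not the merged fleet-server structs:

package config

import "fmt"

// PolicyToken mirrors one entry of the proposed policy_tokens array.
type PolicyToken struct {
    PolicyID   string   `config:"policy_id"`
    Token      string   `config:"token"`
    AllowedIPs []string `config:"allowed_ips"` // allowlist for just this token
}

// StaticPolicyTokens mirrors the proposed static_policy_tokens block.
type StaticPolicyTokens struct {
    Enabled      bool          `config:"enabled"`
    AllowedIPs   []string      `config:"allowed_ips"` // allowlist for all tokens
    PolicyTokens []PolicyToken `config:"policy_tokens"`
}

const minTokenLen = 32 // suggested minimum token length

// Validate enforces the suggested minimum length on every configured token.
func (s StaticPolicyTokens) Validate() error {
    for _, t := range s.PolicyTokens {
        if len(t.Token) < minTokenLen {
            return fmt.Errorf("static token for policy %s is shorter than %d characters", t.PolicyID, minTokenLen)
        }
    }
    return nil
}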

@eyalkraft

Great work @olegsu!

@joshdover Thanks for your comments.

  • I think we will want to limit who can enroll agents using a static token.
    • I'd suggest we do this by IP address / range so we can lock this down to only agents deployed within our infrastructure internally.

Could you elaborate on why you think this limitation is required?
I get why we'd only want "internal" agents using static enrollment tokens. My question is: doesn't the static enrollment token itself ensure that? Where would an external agent/user get a static enrollment token from, given that static tokens would be generated by the project controller upon project creation?
We can choose whatever key length gives us the required cryptographic strength.

If we still want to defend against a scenario where an external user somehow gets hold of a static enrollment token, and only allow "internal" agents to enroll using one, I'd argue against basing this limitation on an IP address or range.
Basing this limitation on IP would require the project controller to configure that IP on the fleet servers it creates, which adds complexity and mixes responsibility domains (for example, I don't know how IP-aware the project controller even is). Even if we take the most permissive path of configuring all the fleet servers with the same static IP range list covering all MKI clusters, IP ranges change, and sooner or later this mechanism will break or require updates. It seems very fragile.
The better design here, IMO, would be a cryptographic challenge of some sort, like requiring the token to be cryptographically signed with a certificate only the project controller has, or a cluster-scoped certificate available only to the fleet server and agents running on that cluster. This goes back to the question: what is our fear here exactly? What are we protecting against?
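
To make the alternative concrete: the challenge could be as simple as an HMAC over the token with a secret shared only between the project controller and the cluster's fleet servers. A rough illustration of the idea, not a proposed design:

package statictoken

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
)

// signToken returns hex(HMAC-SHA256(secret, token)). The project controller
// would hand the agent both the token and this signature.
func signToken(secret, token []byte) string {
    mac := hmac.New(sha256.New, secret)
    mac.Write(token)
    return hex.EncodeToString(mac.Sum(nil))
}

// verifyToken lets fleet-server check the pair in constant time, without
// any IP-based allowlisting.
func verifyToken(secret, token []byte, sigHex string) bool {
    want, err := hex.DecodeString(sigHex)
    if err != nil {
        return false
    }
    mac := hmac.New(sha256.New, secret)
    mac.Write(token)
    return hmac.Equal(mac.Sum(nil), want)
}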

@joshdover

@eyalkraft To be clear: I wasn't proposing that we implement IP range restrictions in this PR; it was just an example. My main point was that we need to make the config flexible enough to add more options later if we need to.

The token itself may be good enough; my thinking here was just to add an additional layer of protection in case this token is compromised or we use a really bad default (0000) for some reason. Happy to have a deeper conversation about this, but I don't think we need to have this problem solved now in order to move forward with the current PR and unblock validation of this feature.


olegsu commented Jun 1, 2023

Thinking about this more, I think the enrollment should reject until the policy is created. This way we avoid having agents in the .fleet-agents index that point to a policy_id that doesn't exist yet, which could break several things. Having k8s restart the agent until it exists seems acceptable for now.

Sounds good, will add that.

Yeah, that's fine; let's just make sure the config will allow us to add it later. If we needed to have different allowed IPs per policy_token, that would be harder to add without a breaking change to the config schema you proposed. Maybe we should change policy_tokens to an array so we can add options later. I like this more too, since it makes the config more self-describing:

inputs:
  - type: fleet-server
    policy.id: "${FLEET_SERVER_POLICY_ID:fleet-server-policy}"
    server:
      auth: static
      static_policy_tokens:
        enabled: true
        allowed_ips: [10.0.0.0/24] # allowlist for all tokens
        policy_tokens:
          - policy_id: 901b70f0-fefa-11ed-aa5e-5974b0535e80
            token: abcdefg
            allowed_ips: [10.0.0.0/24] # allowlist for just this token

A couple other notes:

  • Let's call these enrollment_tokens to be more consistent with other naming
  • We should enforce some minimum length on these (I'd suggest 32 characters)

I will update the names, sure.
Having the tokens as an array may have runtime implications later if one fleet server gets a lot of static tokens for some reason. I have updated the code so that token.key is the key of the map, so the lookup is immediate. To extend the object, I can suggest something like:

tokens:
  abcd:
    policy: some-id
    # ... more props ...

Do you still think it's better to have an array?
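
(For comparison, the map-keyed shape sketched above would look roughly like this in Go; the value struct leaves room for more options later without a schema break. Names are illustrative only.)

package config

// TokenOptions is the value stored per token key.
type TokenOptions struct {
    Policy string `config:"policy"`
    // more per-token props can be added here later
}

// StaticPolicyTokens keys the tokens by their key part for O(1) lookup.
type StaticPolicyTokens struct {
    Enabled bool                    `config:"enabled"`
    Tokens  map[string]TokenOptions `config:"tokens"`
}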


@michel-laterman michel-laterman left a comment


I agree with what @joshdover has stated, we should not accept enroll requests if the policy does not exist.

I know that we're adding this to support our cloud efforts, but it's also likely that this feature will be used in infrastructure-as-code deployments by customers, so we'll need to add user-facing documentation and guidelines as well.

Bulk               ServerBulk         `config:"bulk"`
GC                 GC                 `config:"gc"`
Instrumentation    Instrumentation    `config:"instrumentation"`
StaticPolicyTokens StaticPolicyTokens `config:"static_policy_tokens"`
Contributor

I think that having enrollment tokens as a list as @joshdover suggested is the most straightforward.
And we should add a description in fleet-server.reference.yml.

I don't think we need to be concerned with list/map performance implications in our initial implementation; it can be changed later if we need to.

Contributor Author

I think that having enrollment tokens as a list as @joshdover suggested is the most straightforward.
And we should add a description in fleet-server.reference.yml.

Thank you for the feedback, added the reference to fleet-server.reference.yml

I don't think we need to be concerned with list/map performance implications in our initial implementation; it can be changed later if we need to.

I don't think it would be that easy to change in the future, but I have updated it to hold a slice.

if et.cfg.StaticPolicyTokens.Enabled {
    // Validate that an enrollment record exists for a key with this id.
    if policy, ok := et.cfg.StaticPolicyTokens.PolicyTokens[enrollmentAPIKey.Key]; ok {
        enrollAPI = &model.EnrollmentAPIKey{
Contributor

We probably want a debug message at this point as well.

- func (et *EnrollerT) processRequest(zlog zerolog.Logger, w http.ResponseWriter, r *http.Request, rb *rollback.Rollback, enrollmentAPIKeyID, ver string) (*EnrollResponse, error) {
+ func (et *EnrollerT) processRequest(zlog zerolog.Logger, w http.ResponseWriter, r *http.Request, rb *rollback.Rollback, enrollmentAPIKey *apikey.APIKey, ver string) (*EnrollResponse, error) {
      var enrollAPI *model.EnrollmentAPIKey
      if et.cfg.StaticPolicyTokens.Enabled {
Contributor

I think we should try to keep our authentication handling in handleEnroll (currently line 76), we may need to move some existing logic out of processRequest to do so


@olegsu olegsu Jun 11, 2023


This is done to prevent the call to Elasticsearch to fetch the token, since the token does not really exist there.
Instead, I think this would be a good place to check whether the policy exists and return an error if it does not (as suggested here).

@olegsu olegsu force-pushed the statis_tokens branch 7 times, most recently from 7e2543f to b1065b4 on June 13, 2023 07:04

mergify bot commented Jun 15, 2023

This pull request now has conflicts. Could you fix it @olegsu? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b statis_tokens upstream/statis_tokens
git merge upstream/main
git push upstream statis_tokens

@olegsu olegsu force-pushed the statis_tokens branch 4 times, most recently from 8683e7b to 95b896f on June 19, 2023 07:20
@olegsu olegsu force-pushed the statis_tokens branch 2 times, most recently from 845a569 to 5527a6f on June 20, 2023 09:56

@michel-laterman michel-laterman left a comment


Thanks for adding an e2e test!
We just need a changelog fragment for this as well.

Otherwise, I mostly have nitpicks.

@@ -245,3 +245,53 @@ func (suite *StandAloneSuite) TestClientAPI() {
    bCancel()
    cmd.Wait()
}

func (suite *StandAloneSuite) TestStaticTokenAuthentication() {
Contributor

👍

Comment on lines 136 to 140

func (et *EnrollerT) fetchStaticTokenPolicy(ctx context.Context, zlog zerolog.Logger, enrollmentAPIKey *apikey.APIKey) (*model.EnrollmentAPIKey, error) {
Contributor

Can you add a comment to this func? Returning nil, nil is not very common across this codebase.
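
A doc comment along these lines would address it (illustrative wording only, not the committed comment):

// fetchStaticTokenPolicy resolves an enrollment key against the configured
// static policy tokens. It returns (nil, nil) when static tokens are
// disabled or the key does not match any configured token, signalling the
// caller to fall back to the regular enrollment API key lookup in
// Elasticsearch.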

internal/pkg/api/handleEnroll_test.go: several inline review threads (resolved)
    return nil, nil
}

zlog.Info().Msgf("Checking static enrollment token %s", enrollmentAPIKey.Key)
Contributor

(nit) Change this to a debug log, and emit an info log if a static token is found.
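
One way to apply the nit, keeping the same zerolog logger; this is a sketch of the suggested change, not the committed code, and found/policyID are placeholders for whatever the lookup yields:

zlog.Debug().Msgf("Checking static enrollment token %s", enrollmentAPIKey.Key)
// ... look the key up in the static token config ...
if found {
    zlog.Info().Msgf("Static enrollment token matched policy %s", policyID)
}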

@olegsu olegsu force-pushed the statis_tokens branch 7 times, most recently from 890f433 to a40e0a2 on June 25, 2023 10:22
@olegsu olegsu force-pushed the statis_tokens branch 3 times, most recently from ed3fea7 to f43421d on July 2, 2023 07:10

@joshdover joshdover left a comment


Changes LGTM. My only concern right now is that adding the caching in the same PR seems unnecessary. I'd prefer to split that change out separately so we can revert either change independently of the other, if necessary.


olegsu commented Jul 5, 2023

Changes LGTM. My only concern right now is that adding the caching in the same PR seems unnecessary. I'd prefer to split that change out separately so we can revert either change independently of the other, if necessary.

Thanks for the review, I will remove the caching


@eyalkraft eyalkraft left a comment


LGTM, good job @olegsu!

@olegsu olegsu requested a review from jkakavas July 6, 2023 10:16

olegsu commented Jul 24, 2023

Merging this after a discussion we had with @jkakavas, @michel-laterman, and @eyalkraft.
There are things that we need to keep in mind for future releases:

  1. As @joshdover suggested, come up with an additional strategy to restrict usage of the tokens to CIDR ranges.
  2. Use a cache to reduce the number of calls to ES during enrollment.

@olegsu olegsu merged commit 2222e3f into main Jul 24, 2023
18 checks passed
@olegsu olegsu deleted the statis_tokens branch July 24, 2023 08:07