
feat: support static tokens to be used for agent enrollment #2654

Merged: 7 commits into main from statis_tokens on Jul 24, 2023

Conversation


@olegsu olegsu commented Jun 1, 2023

What is the problem this PR solves?

This PR adds support for static tokens for agent enrollment.
The reason we need this capability is to support a flow for serverless security projects where an agent should be installed at the same time as the other project services (Kibana, Elasticsearch, Fleet).
To be able to install an agent at that point, we need a predictable token that both the agent and the fleet server accept.

How does this PR solve the problem?

By accepting static tokens (well, just the key part of the token), fleet server will authenticate a request whose enrollment token contains this key, and will send back the API key used to communicate with Elasticsearch. For example, if the agent is passed a token like base64(0123456789:abcdefg), fleet server will accept it if the configuration includes the following:

inputs:
  - type: fleet-server
    policy.id: "${FLEET_SERVER_POLICY_ID:fleet-server-policy}"
    server:
      auth: static
      static_policy_tokens:
        enabled: true
        policy_tokens:
          # token.key: policy_id
          abcdefg: 901b70f0-fefa-11ed-aa5e-5974b0535e80
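
For illustration, the lookup described above amounts to something like the following Go sketch. The function and variable names here are illustrative, not the actual fleet-server code; only the token format and the key-to-policy mapping come from the description above.

package main

import (
    "encoding/base64"
    "fmt"
    "strings"
)

// policyTokens mirrors the policy_tokens map from the config above:
// token key -> policy ID.
var policyTokens = map[string]string{
    "abcdefg": "901b70f0-fefa-11ed-aa5e-5974b0535e80",
}

// resolveStaticToken decodes a base64(id:key) enrollment token and looks up
// the policy ID configured for its key part.
func resolveStaticToken(token string) (string, bool) {
    raw, err := base64.StdEncoding.DecodeString(token)
    if err != nil {
        return "", false
    }
    // Only the key part after the colon participates in the static lookup.
    _, key, ok := strings.Cut(string(raw), ":")
    if !ok {
        return "", false
    }
    policyID, found := policyTokens[key]
    return policyID, found
}

func main() {
    token := base64.StdEncoding.EncodeToString([]byte("0123456789:abcdefg"))
    if policyID, ok := resolveStaticToken(token); ok {
        fmt.Println("enroll against policy:", policyID)
    }
}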

Today's flow looks something like this:

[flow diagram image]

And this PR will support the following flow:

[flow diagram image]

How to test this PR locally

  1. Create a local stack: elastic-package stack up -s "elasticsearch,kibana,package-registry" --version 8.8.0 -v -d
  2. Build and run the fleet-server local image (attached to the network of the stack)
  3. Run the Elastic Agent docker image (attached to the same network)
  4. See that the Elastic Agent is healthy and the spawned Cloudbeat is sending findings.

Steps 2 and 3 look simple, but they require more configuration and changes (especially if you want to debug the fleet-server process).
First, building the docker image for the server will not work on ARM architecture, as make build-docker compiles it for amd64 (I guess I was missing something).
The command to run the image looks like:

docker run -it --rm \
    --network elastic-package-stack_default \
    -v $(pwd)/fleet-server.yml:/etc/fleet-server.yml \
    -v $HOME/.elastic-package/profiles/default/certs/fleet-server:/etc/ssl/certs \
    -v $HOME/.elastic-package/profiles/default/certs/ca-cert.pem:/etc/ssl/certs/elastic-package.pem \
    -e FLEET_TOKEN_POLICY_NAME="Fleet Server (elastic-package)" \
    -e FLEET_URL=https://fleet-server:8220 \
    -e KIBANA_HOST=https://kibana:5601 \
    -e ELASTIC_CONTAINER=true \
    -e FLEET_SERVER_SERVICE_TOKEN=**REDACTED** \
    -e FLEET_SERVER_ENABLE=1 \
    -e ELASTICSEARCH_HOSTS=https://elasticsearch:9200 \
    -e KIBANA_FLEET_SERVICE_TOKEN=**REDACTED** \
    -e KIBANA_FLEET_HOST=https://kibana:5601 \
    -e FLEET_SERVER_ELASTICSEARCH_HOST=https://elasticsearch:9200 \
    -e FLEET_SERVER_HOST=0.0.0.0 \
    -e FLEET_SERVER_CERT_KEY=/etc/ssl/certs/key.pem \
    -e FLEET_SERVER_CERT=/etc/ssl/certs/cert.pem \
    -e KIBANA_FLEET_SETUP=1 \
    -p 8220:8220 \
    --name fleet-server \
    docker.elastic.co/fleet-server/fleet-server:8.9.0

Where the change in fleet-server.yml was to update inputs[0].server with:

      static_policy_tokens:
        enabled: true
        policy_tokens:
          # token.key: policy_id
          abcdefg: 901b70f0-fefa-11ed-aa5e-5974b0535e80

The command to run the Elastic Agent looks like:

docker run -it \
    --network elastic-package-stack_default \
    -v $HOME/.elastic-package/profiles/default/certs/ca-cert.pem:/etc/ssl/certs/elastic-package.pem \
    -e FLEET_ENROLLMENT_TOKEN=MDEyMzQ6YWJjZGVmZw== \
    -e FLEET_ENROLL=1 \
    -e FLEET_URL=https://fleet-server:8220 \
    -e KIBANA_HOST=https://kibana:5601 \
    -e ELASTIC_CONTAINER=true \
    docker.elastic.co/elastic-agent/elastic-agent-complete:8.8.0

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@olegsu olegsu added the enhancement New feature or request label Jun 1, 2023
@olegsu olegsu requested a review from a team as a code owner June 1, 2023 07:19
@olegsu olegsu force-pushed the statis_tokens branch 2 times, most recently from 647d485 to a0c9c7f on June 1, 2023 07:21
@olegsu olegsu changed the title feat: support statis tokens to be used for agent enrollment feat: support static tokens to be used for agent enrollment Jun 1, 2023

elasticmachine commented Jun 1, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-07-24T07:09:04.430+0000

  • Duration: 42 min 37 sec

Test stats 🧪

Test Results
Failed 0
Passed 751
Skipped 1
Total 752

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@olegsu olegsu force-pushed the statis_tokens branch 2 times, most recently from 8f99d7e to 9e482fc on June 1, 2023 07:36

@joshdover joshdover left a comment


Hi @olegsu, exciting to see this up already. I have a few notes & questions on an initial look:

  • What is the behavior on checkin if the specified policy ID does not actually exist yet?
    • Because the policies are initialized by Kibana asynchronously from Fleet Server startup, I think we absolutely need to handle this case. Probably the agent could receive some default / mostly empty policy that doesn't do much, but I'm not sure what this would break on the Agent side, for instance if there are no inputs and no output.
    • We will need to have e2e tests that include Agent and Fleet Server for this scenario because it will be important not to break it
  • I think we will want to limit who can enroll agents using a static token.
    • I'd suggest we do this by IP address / range so we can lock this down to only agents deployed within our infrastructure internally.
    • This doesn't necessarily need to be implemented in the first PR, but we should think about what the final config schema should be so we can add this config later without making a breaking change to the structure.
  • This definitely needs unit and e2e tests before we'll be able to merge the first PR.


olegsu commented Jun 1, 2023

Thank you for the quick feedback, @joshdover

  • What is the behavior on checkin if the specified policy ID does not actually exist yet?

    • Because the policies are initialized by Kibana asynchronously from Fleet Server startup, I think we absolutely need to handle this case. Probably the agent could receive some default / mostly empty policy that doesn't do much, but I'm not sure what this would break on the Agent side, for instance if there are no inputs and no output.

This is a good point. I agree we need to validate that the policy has already been created. I see two additional ways this can be achieved:

  1. On the check-in flow, check if the policy exists and return an error to the agent if not. This will cause the agent to exit and k8s will restart it. Eventually, Kibana will create the policy (my suggestion).
  2. On the Fleet setup flow, check that all the policies exist and either wait or exit if at least one does not. Same as before; eventually it will be added.

On the other hand, sending an empty policy to the agent would also force us to update it once the real policy has been created, which I am not sure is something we want to handle at this point.

  • I think we will want to limit who can enroll agents using a static token.

    • I'd suggest we do this by IP address / range so we can lock this down to only agents deployed within our infrastructure internally.
    • This doesn't necessarily need to be implemented in the first PR, but we should think about what the final config schema should be so we can add this config later without making a breaking change to the structure.

Right, we need to have some ability to do that, not sure how though.

  • This definitely needs unit and e2e tests before we'll be able to merge the first PR.

Thank you, will try to do that.


joshdover commented Jun 1, 2023

This is a good point. I agree we need to validate that the policy has already been created. I see two additional ways this can be achieved:

1. On the check-in flow, check if the policy exists and return an error to the agent if not. This will cause the agent to exit and k8s will restart it. Eventually, Kibana will create the policy (my suggestion).

2. On the Fleet setup flow, check that all the policies exist and either wait or exit if at least one does not. Same as before; eventually it will be added.

On the other hand, sending an empty policy to the agent would also force us to update it once the real policy has been created, which I am not sure is something we want to handle at this point.

Thinking about this more, I think the enrollment should reject until the policy is created. This way we avoid having agents in the .fleet-agents index that point to a policy_id that doesn't exist yet, which could break several things. Having k8s restart the agent until it exists seems acceptable for now.
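
A minimal sketch of that agreed behavior, with policyExists standing in for whatever lookup fleet-server actually uses against Elasticsearch; all names here are illustrative, not the merged implementation:

package enroll

import (
    "context"
    "errors"
)

// ErrPolicyNotFound is returned while the policy referenced by a static
// token has not been created by Kibana yet.
var ErrPolicyNotFound = errors.New("policy for static enrollment token does not exist yet")

// PolicyChecker is a stand-in for the real policy lookup.
type PolicyChecker func(ctx context.Context, policyID string) (bool, error)

// guardStaticEnroll rejects the enroll request until the policy exists, so
// no agent document is written to .fleet-agents with a dangling policy_id.
// The agent exits on the error and Kubernetes restarts it, retrying enrollment.
func guardStaticEnroll(ctx context.Context, exists PolicyChecker, policyID string) error {
    ok, err := exists(ctx, policyID)
    if err != nil {
        return err
    }
    if !ok {
        return ErrPolicyNotFound
    }
    return nil
}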

Right, we need to have some ability to do that, not sure how though.

Yeah, that's fine; let's just make sure the config will allow us to add it later. If we needed to have different allowed IPs per policy_token, that would be harder to add without a breaking change to the config schema you proposed. Maybe we should change policy_tokens to an array so we can add options later. I like this more too, since it makes the config more self-describing:

inputs:
  - type: fleet-server
    policy.id: "${FLEET_SERVER_POLICY_ID:fleet-server-policy}"
    server:
      auth: static
      static_policy_tokens:
        enabled: true
        allowed_ips: [10.0.0.0/24] # allowlist for all tokens
        policy_tokens:
          - policy_id: 901b70f0-fefa-11ed-aa5e-5974b0535e80
            token: abcdefg
            allowed_ips: [10.0.0.0/24] # allowlist for just this token

A couple other notes:

  • Let's call these enrollment_tokens to be more consistent with other naming
  • We should enforce some minimum length on these (I'd suggest 32 characters)
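
For illustration, the array-based schema above could map onto Go config structs like these, with the minimum-length check from the second note folded in. This is a sketch of the proposal, not the merged fleet-server structs:

package config

import "fmt"

// PolicyToken mirrors one entry of the proposed policy_tokens array.
type PolicyToken struct {
    PolicyID   string   `config:"policy_id"`
    Token      string   `config:"token"`
    AllowedIPs []string `config:"allowed_ips"` // allowlist for just this token
}

// StaticPolicyTokens mirrors the proposed static_policy_tokens block.
type StaticPolicyTokens struct {
    Enabled      bool          `config:"enabled"`
    AllowedIPs   []string      `config:"allowed_ips"` // allowlist for all tokens
    PolicyTokens []PolicyToken `config:"policy_tokens"`
}

const minTokenLen = 32 // suggested minimum token length

// Validate enforces the suggested minimum length on every configured token.
func (s StaticPolicyTokens) Validate() error {
    for _, t := range s.PolicyTokens {
        if len(t.Token) < minTokenLen {
            return fmt.Errorf("static token for policy %s is shorter than %d characters", t.PolicyID, minTokenLen)
        }
    }
    return nil
}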

@eyalkraft

Great work @olegsu!

@joshdover Thanks for your comments.

  • I think we will want to limit who can enroll agents using a static token.
    • I'd suggest we do this by IP address / range so we can lock this down to only agents deployed within our infrastructure internally.

Could you elaborate on why you think this limitation is required?
I get why we'd only want "internal" agents using static enrollment tokens. My question is: doesn't the static enrollment token itself ensure that? Where would an external agent/user get a static enrollment token from, given that static tokens would be generated by the project controller upon project creation?
We can choose whatever key length gives us the required cryptographic strength.

If we still want to defend against a scenario where an external user somehow gets hold of a static enrollment token, and only allow "internal" agents to enroll using one, I'd argue against basing this limitation on an IP address or range.
Basing this limitation on IP would require the project controller to configure that IP on the fleet servers it creates, which adds complexity and mixes responsibility domains (for example, I don't know how IP-aware the project controller even is). Even if we take the most permissive path of configuring all the fleet servers with the same static IP range list covering all MKI clusters, IP ranges change, and sooner or later this mechanism will break or require updates. It seems very fragile.
The better design here, IMO, would be a cryptographic challenge of some sort, like requiring the token to be cryptographically signed with a certificate only the project controller has, or a cluster-scoped certificate available only to the fleet server and agents running on that cluster. This goes back to the question: what is our fear here exactly? What are we protecting against?
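
To make the alternative concrete: the challenge could be as simple as an HMAC over the token with a secret shared only between the project controller and the cluster's fleet servers. A rough illustration of the idea, not a proposed design:

package statictoken

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
)

// signToken returns hex(HMAC-SHA256(secret, token)). The project controller
// would hand the agent both the token and this signature.
func signToken(secret, token []byte) string {
    mac := hmac.New(sha256.New, secret)
    mac.Write(token)
    return hex.EncodeToString(mac.Sum(nil))
}

// verifyToken lets fleet-server check the pair in constant time, without
// any IP-based allowlisting.
func verifyToken(secret, token []byte, sigHex string) bool {
    want, err := hex.DecodeString(sigHex)
    if err != nil {
        return false
    }
    mac := hmac.New(sha256.New, secret)
    mac.Write(token)
    return hmac.Equal(mac.Sum(nil), want)
}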

@joshdover

@eyalkraft To be clear: I wasn't proposing that we implement IP range restrictions in this PR; it was just an example. My main point was that we need to make the config flexible enough to add more options later if we need to.

The token itself may be good enough; my thinking here was just to add an additional layer of protection in case this token is compromised or we use a really bad default (0000) for some reason. Happy to have a deeper conversation about this, but I don't think we need to have this problem solved now in order to move forward with the current PR and unblock validation of this feature.


olegsu commented Jun 1, 2023

Thinking about this more, I think the enrollment should reject until the policy is created. This way we avoid having agents in the .fleet-agents index that point to a policy_id that doesn't exist yet, which could break several things. Having k8s restart the agent until it exists seems acceptable for now.

Sounds good, will add that.

Yeah, that's fine; let's just make sure the config will allow us to add it later. If we needed to have different allowed IPs per policy_token, that would be harder to add without a breaking change to the config schema you proposed. Maybe we should change policy_tokens to an array so we can add options later. I like this more too, since it makes the config more self-describing:

inputs:
  - type: fleet-server
    policy.id: "${FLEET_SERVER_POLICY_ID:fleet-server-policy}"
    server:
      auth: static
      static_policy_tokens:
        enabled: true
        allowed_ips: [10.0.0.0/24] # allowlist for all tokens
        policy_tokens:
          - policy_id: 901b70f0-fefa-11ed-aa5e-5974b0535e80
            token: abcdefg
            allowed_ips: [10.0.0.0/24] # allowlist for just this token

A couple other notes:

  • Let's call these enrollment_tokens to be more consistent with other naming
  • We should enforce some minimum length on these (I'd suggest 32 characters)

I will update the names, sure.
Having the tokens as an array may have runtime implications later if one fleet server gets a lot of static tokens for some reason. I have updated the code so that token.key is the key of the map, so the lookup is immediate. To extend the object, I can suggest something like:

tokens:
  abcd:
    policy: some-id
    # ... more props ...

Do you still think it's better to have an array?
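
(For comparison, the map-keyed shape sketched above would look roughly like this in Go; the value struct leaves room for more options later without a schema break. Names are illustrative only.)

package config

// TokenOptions is the value stored per token key.
type TokenOptions struct {
    Policy string `config:"policy"`
    // more per-token props can be added here later
}

// StaticPolicyTokens keys the tokens by their key part for O(1) lookup.
type StaticPolicyTokens struct {
    Enabled bool                    `config:"enabled"`
    Tokens  map[string]TokenOptions `config:"tokens"`
}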


@michel-laterman michel-laterman left a comment


I agree with what @joshdover has stated, we should not accept enroll requests if the policy does not exist.

I know that we're adding this to support our cloud efforts, but it's also likely that this feature will be used in infrastructure-as-code deployments by customers, so we'll need to add user-facing documentation and guidelines as well.

Bulk               ServerBulk         `config:"bulk"`
GC                 GC                 `config:"gc"`
Instrumentation    Instrumentation    `config:"instrumentation"`
StaticPolicyTokens StaticPolicyTokens `config:"static_policy_tokens"`
Contributor

I think that having enrollment tokens as a list as @joshdover suggested is the most straightforward.
And we should add a description in fleet-server.reference.yml.

I don't think we need to be concerned with list/map performance implications in our initial implementation; it can be changed later if we need to.

Contributor Author

I think that having enrollment tokens as a list as @joshdover suggested is the most straightforward.
And we should add a description in fleet-server.reference.yml.

Thank you for the feedback, added the reference to fleet-server.reference.yml

I don't think we need to be concerned with list/map performance implications in our initial implementation; it can be changed later if we need to.

I don't think it would be that easy to change in the future, but I have updated it to hold a slice.

if et.cfg.StaticPolicyTokens.Enabled {
    // Validate that an enrollment record exists for a key with this id.
    if policy, ok := et.cfg.StaticPolicyTokens.PolicyTokens[enrollmentAPIKey.Key]; ok {
        enrollAPI = &model.EnrollmentAPIKey{
Contributor

We probably want a debug message at this point as well.

- func (et *EnrollerT) processRequest(zlog zerolog.Logger, w http.ResponseWriter, r *http.Request, rb *rollback.Rollback, enrollmentAPIKeyID, ver string) (*EnrollResponse, error) {
+ func (et *EnrollerT) processRequest(zlog zerolog.Logger, w http.ResponseWriter, r *http.Request, rb *rollback.Rollback, enrollmentAPIKey *apikey.APIKey, ver string) (*EnrollResponse, error) {
      var enrollAPI *model.EnrollmentAPIKey
      if et.cfg.StaticPolicyTokens.Enabled {
Contributor

I think we should try to keep our authentication handling in handleEnroll (currently line 76), we may need to move some existing logic out of processRequest to do so


@olegsu olegsu Jun 11, 2023


This is done to prevent the call to Elasticsearch to fetch the token, since the token does not really exist there.
Instead, I think this would be a good place to check whether the policy exists and return an error if it does not (as suggested here).

@olegsu olegsu force-pushed the statis_tokens branch 7 times, most recently from 7e2543f to b1065b4 on June 13, 2023 07:04

mergify bot commented Jun 15, 2023

This pull request now has conflicts. Could you fix it @olegsu? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b statis_tokens upstream/statis_tokens
git merge upstream/main
git push upstream statis_tokens

@olegsu olegsu force-pushed the statis_tokens branch 4 times, most recently from 8683e7b to 95b896f on June 19, 2023 07:20
@olegsu olegsu force-pushed the statis_tokens branch 2 times, most recently from 845a569 to 5527a6f on June 20, 2023 09:56

@michel-laterman michel-laterman left a comment


Thanks for adding an e2e test!
We just need a changelog fragment for this as well.

Otherwise, I mostly have nitpicks.

@@ -245,3 +245,53 @@ func (suite *StandAloneSuite) TestClientAPI() {
    bCancel()
    cmd.Wait()
}

func (suite *StandAloneSuite) TestStaticTokenAuthentication() {
Contributor

👍

Comment on lines 136 to 140

func (et *EnrollerT) fetchStaticTokenPolicy(ctx context.Context, zlog zerolog.Logger, enrollmentAPIKey *apikey.APIKey) (*model.EnrollmentAPIKey, error) {
Contributor

Can you add a comment to this func? Returning nil, nil is not very common across this codebase.
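
A doc comment along these lines would address it (illustrative wording only, not the committed comment):

// fetchStaticTokenPolicy resolves an enrollment key against the configured
// static policy tokens. It returns (nil, nil) when static tokens are
// disabled or the key does not match any configured token, signalling the
// caller to fall back to the regular enrollment API key lookup in
// Elasticsearch.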

internal/pkg/api/handleEnroll_test.go: several inline review threads (resolved)
    return nil, nil
}

zlog.Info().Msgf("Checking static enrollment token %s", enrollmentAPIKey.Key)
Contributor

(nit) Change this to a debug log, and emit an info log if a static token is found.
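
One way to apply the nit, keeping the same zerolog logger; this is a sketch of the suggested change, not the committed code, and found/policyID are placeholders for whatever the lookup yields:

zlog.Debug().Msgf("Checking static enrollment token %s", enrollmentAPIKey.Key)
// ... look the key up in the static token config ...
if found {
    zlog.Info().Msgf("Static enrollment token matched policy %s", policyID)
}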

@olegsu olegsu force-pushed the statis_tokens branch 7 times, most recently from 890f433 to a40e0a2 on June 25, 2023 10:22
@olegsu olegsu force-pushed the statis_tokens branch 3 times, most recently from ed3fea7 to f43421d on July 2, 2023 07:10

@joshdover joshdover left a comment


Changes LGTM. My only concern right now is that adding the caching in the same PR seems unnecessary. I'd prefer to split that change out separately so we can revert either change independently of the other, if necessary.


olegsu commented Jul 5, 2023

Changes LGTM. My only concern right now is that adding the caching in the same PR seems unnecessary. I'd prefer to split that change out separately so we can revert either change independently of the other, if necessary.

Thanks for the review, I will remove the caching


@eyalkraft eyalkraft left a comment


LGTM, good job @olegsu!

@olegsu olegsu requested a review from jkakavas July 6, 2023 10:16

olegsu commented Jul 24, 2023

Merging this after a discussion we had with @jkakavas, @michel-laterman, and @eyalkraft.
There are things that we need to keep in mind for future releases:

  1. As @joshdover suggested, come up with an additional strategy to restrict usage of the tokens to CIDR ranges.
  2. Use a cache to reduce the number of calls to ES during enrollment.

@olegsu olegsu merged commit 2222e3f into main Jul 24, 2023
18 checks passed
@olegsu olegsu deleted the statis_tokens branch July 24, 2023 08:07