Allow Fleet Server to serve the PGP key when stored on disk #2887

Closed
jlind23 opened this issue Aug 17, 2023 · 23 comments · Fixed by #2977

@jlind23
Contributor

jlind23 commented Aug 17, 2023

If Elastic Agents are unable to reach the Elastic public URL to retrieve the PGP key, they will fall back to a Fleet Server URL where the public key can also be hosted.

In Fleet Server, we must define a fixed route for hosting the GPG key, downloads/signing/key.pub, to mirror the public URL, and have Fleet Server read the GPG key from its configuration using the same secure mechanisms we use to read the TLS private key.
We should mandate HTTPS on this endpoint and let Elastic Agent call it whenever needed.

Related Elastic Agent issue - elastic/elastic-agent#3264

jlind23 added the Team:Fleet label on Aug 17, 2023
@joshdover
Contributor

joshdover commented Aug 17, 2023

One key check is that Fleet Server should serve this file only if it is writable by the root user alone on the filesystem.

We need to agree on a URL path now so that the Agent side can be implemented. For maximum flexibility in the future, I'd suggest we include the Agent version in the path so that if there is a key rotation that only applies to upgrades of a specific version, we can later support this on the server side.

My suggested route is: GET /api/agents/upgrades/<major>.<minor>.<patch>/pgp-public-key. For now Fleet Server should always serve the same file and ignore the version number in the path.

The file on disk should probably be something like /Elastic/Agent/elastic-agent-upgrade-keys/default.pgp. We should include default in the filename to indicate this is the default that will be provided unless there is a more specific version file (again we're not implementing this now, but we may in the future).

@cmacknz
Member

cmacknz commented Aug 17, 2023

My suggested route is: GET /api/agents/upgrades/<major>.<minor>.<patch>/pgp-public-key. For now Fleet Server should always serve the same file and ignore the version number in the path.

👍 great idea let's do this.

@lucabelluccini
Contributor

Fleet Server

Fleet Server needs to be updated first.
Let's suppose the new key is used on 8.Y.

We first upgrade Fleet Server from a version in [8.9, 8.Y) to a version in [8.Y, +Inf)

  • On ECE, ESS and ECK, there is no issue as we redeploy containers
  • On prem with tar.gz, bare metal:
    • If Fleet Server has public internet access, when upgrading from versions [8.9, 8.Y), it will attempt the embedded key (fails) and then the first fallback URL (success) 🟢
    • If Fleet Server is air-gapped, when upgrading from version [8.9, 8.Y), it will attempt the embedded key (fails) and then the first fallback URL
      • If the user deployed their webserver with TLS to serve the file and updated the DNS to resolve the URL to the webserver, it will be successful 🟢
      • Else, it will fall back to the second URL, but as it is a Fleet Server itself, it doesn't have a Fleet Server to download the key from ❓ - maybe it should load the key from a local file?

Elastic Agents connected to the Fleet Server

  • On ECE, ESS, ECK, there is no issue as we redeploy containers
  • On prem with tar.gz, bare metal:
    • If Elastic Agent has public internet access, when upgrading from versions [8.9, 8.Y), it will attempt the embedded key (fails) and then the first fallback URL (success) 🟢
    • If Elastic Agent is air-gapped, when upgrading from version [8.9, 8.Y), it will attempt the embedded key (fails) and then the first fallback URL
      • If the user deployed their webserver with TLS to serve the file and updated the DNS to resolve the URL to the webserver, it will be successful 🟢
      • Else, it will fall back to the second URL, which is served by Fleet Server. As the user already has something in place to communicate with Fleet Server, it will be successful 🟢

Is this step-by-step review of the flow correct?

As a side note, it would be nice to include default.pgp in the Docker images of Fleet Server/Elastic Agent, to support Elastic Agent users attempting to upgrade when connected to Fleet Servers running on ECK and ECE (within air-gapped environments).

@cmacknz
Member

cmacknz commented Aug 22, 2023

That looks correct to me. Thanks for the breakdown; it highlights an edge case that was omitted from the problem description:

  • If Fleet Server is air-gapped, when upgrading from version [8.9, 8.Y), it will attempt the embedded key (fails) and then the first fallback URL
    • If the user deployed their webserver with TLS to serve the file and updated the DNS to resolve the URL to the webserver, it will be successful 🟢
    • Else, it will fall back to the second URL, but as it is a Fleet Server itself, it doesn't have a Fleet Server to download the key from ❓ - maybe it should load the key from a local file?

In the case of an air gapped Fleet server there are two cases, one of which we can handle easily and one we can't:

  1. The problem we can handle is an upgrade of an air gapped Fleet Server after the first release that allows Fleet Server to host the GPG key. In this case the Elastic Agent supervising the Fleet Server does the download + verification, so it will have to reach out to the Fleet Server it is supervising to get the GPG key. It should be able to do this over a unix socket, the same way we handle Fleet checkins of the agent that runs Fleet Server.

  2. The problem we don't handle is the case where a user is upgrading an air gapped Fleet Server from a version before the hosted GPG key was introduced. In this case I think they have to do this upgrade before the embedded GPG key expires, or they'd have to re-install. I believe we already have this problem in the agent, where users need to upgrade to the version that includes the fallback GPG URL before the key expires. This is just a different version of the same problem, although reinstalling Fleet Servers is probably more feasible than reinstalling possibly 1000s of agents.

@matthiasledergerber

We are currently having this issue. Our Fleet Servers are on 8.9.2 and our Elastic Agents are on 8.9.0. We have air-gapped systems and use custom repositories that work reliably.

When trying to Upgrade the Elastic Agent from 8.9.0 to 8.9.2:

[elastic_agent][info] starting upgrade to version 8.9.2 in background
[elastic_agent][info] Upgrading agent
[elastic_agent][info] download from http://nexus.local.tld/repository/proxy-raw-elasticagent/beats/elastic-agent/elastic-agent-8.9.2-windows-x86_64.zip completed in Less than a second @ +InfYBps
[elastic_agent][info] download from http://nexus.local.tld/repository/proxy-raw-elasticagent/beats/elastic-agent/elastic-agent-8.9.2-windows-x86_64.zip.sha512 completed in Less than a second @ +InfYBps
[elastic_agent][info] Default PGP being appended
[elastic_agent][info] Default PGP being appended
[elastic_agent][error] upgrade to version 8.9.2 failed: failed verification of agent binary: 2 errors occurred:
	* Get "https://artifacts.elastic.co/GPG-KEY-elastic-agent": dial tcp 34.120.127.130:443: connectex: No connection could be made because the target machine actively refused it.
	* Get "https://artifacts.elastic.co/GPG-KEY-elastic-agent": dial tcp 34.120.127.130:443: connectex: No connection could be made because the target machine actively refused it.

@jlind23
Contributor Author

jlind23 commented Sep 7, 2023

@cmacknz please keep me honest here but @matthiasledergerber your agent should first try to validate the binary with the signature that is bundled in the installed agent. Do you have any signature verification failure in logs?

@jlind23
Contributor Author

jlind23 commented Sep 7, 2023

As a reference, this is the PR where this was introduced: elastic/elastic-agent#2980

@matthiasledergerber

To add further information that may be useful for debugging:

I've tried upgrading from 8.9.0 to 8.9.2 with multiple Agents (Windows, Linux), all running in air-gapped environments. I've also cleared the cache of our custom repository. The upgrade procedure used is the one from Fleet. In the firewall logs I saw that the connection to https://artifacts.elastic.co/GPG-KEY-elastic-agent was blocked (air-gapped). I did not see any signature failure logs when setting the log level to debug on the failing agent.

Note: I've redacted the ip and hostnames of my own systems.

Windows, 8.9.0 -> 8.9.2, connection to https://artifacts.elastic.co/GPG-KEY-elastic-agent blocked, Upgrade process fails

[elastic_agent][debug] save state on disk : {action:0xc00071fc80 ackToken:f67c366d-d1dc-4b53-b13c-8dc015f95685 queue:[]}
[elastic_agent][warn] Skipping addition to action-queue, issue gathering start time from action id 45ffb72a-4777-4a25-847a-7809d4684b30: action has no start time
[elastic_agent][debug] Gathered 0 actions from queue, 0 actions expired
[elastic_agent][debug] Expired actions: []
[elastic_agent][debug] save state on disk : {action:0xc00071fc80 ackToken:f67c366d-d1dc-4b53-b13c-8dc015f95685 queue:[]}
[elastic_agent][debug] Dispatch 1 actions of types: *fleetapi.ActionUpgrade
[elastic_agent][debug] handlerUpgrade: action 'action_id: 45ffb72a-4777-4a25-847a-7809d4684b30, type: UPGRADE' received
[elastic_agent][debug] Successfully dispatched action: 'action_id: 45ffb72a-4777-4a25-847a-7809d4684b30, type: UPGRADE'
[elastic_agent][info] starting upgrade to version 8.9.2 in background
[elastic_agent][info] Upgrading agent
[elastic_agent][debug] Cleaning up non-matching downloaded versions
[elastic_agent][debug] Downloading upgrade artifact
[elastic_agent][debug] download attempt 1
[elastic_agent][info] download from http://nexus.domain.tld/repository/proxy-raw-elasticagent/beats/elastic-agent/elastic-agent-8.9.2-windows-x86_64.zip completed in Less than a second @ +InfYBps
[elastic_agent][info] download from http://nexus.domain.tld/repository/proxy-raw-elasticagent/beats/elastic-agent/elastic-agent-8.9.2-windows-x86_64.zip.sha512 completed in Less than a second @ +InfYBps
[elastic_agent][debug] FleetGateway calling Checkin API
[elastic_agent][debug] Checking started
[elastic_agent][debug] using previously saved ack token: f67c366d-d1dc-4b53-b13c-8dc015f95685
[elastic_agent][debug] Request method: POST, path: /api/fleet/agents/8b27661e-017f-4c63-8ebb-941d2f92761d/checkin, reqID: 01H9QP46C5AQVNQZT9QAH6Q5Y6
[elastic_agent][debug] Creating new request to request URL https://1.1.1.1:8220/api/fleet/agents/8b27661e-017f-4c63-8ebb-941d2f92761d/checkin?
[elastic_agent][info] Default PGP being appended
[elastic_agent][info] Default PGP being appended
[elastic_agent][debug] Cleaning up non-matching downloaded versions
[elastic_agent][error] upgrade to version 8.9.2 failed: failed verification of agent binary: 2 errors occurred:
	* Get "https://artifacts.elastic.co/GPG-KEY-elastic-agent": dial tcp 34.120.127.130:443: connectex: No connection could be made because the target machine actively refused it.
	* Get "https://artifacts.elastic.co/GPG-KEY-elastic-agent": dial tcp 34.120.127.130:443: connectex: No connection could be made because the target machine actively refused it.
[elastic_agent][debug] appending action with id '45ffb72a-4777-4a25-847a-7809d4684b30' to the queue
[elastic_agent][debug] lazy acker: ack batch: [action_id: 45ffb72a-4777-4a25-847a-7809d4684b30, type: UPGRADE]
[elastic_agent][debug] fleet acker: ackbatch, actions: []fleetapi.Action{(*fleetapi.ActionUpgrade)(0xc00090ab00)}
[elastic_agent][debug] fleet acker: ackbatch, events: []fleetapi.AckEvent{fleetapi.AckEvent{EventType:"ACTION_RESULT", SubType:"ACKNOWLEDGED", Timestamp:"2023-09-07T13:23:38.00024+02:00", ActionID:"45ffb72a-4777-4a25-847a-7809d4684b30", AgentID:"8b27661e-017f-4c63-8ebb-941d2f92761d", Message:"Action \"45ffb72a-4777-4a25-847a-7809d4684b30\" of type \"UPGRADE\" acknowledged.", Payload:json.RawMessage(nil), Data:json.RawMessage(nil), ActionInputType:"", ActionData:json.RawMessage(nil), ActionResponse:map[string]interface {}(nil), StartedAt:"", CompletedAt:"", Error:""}}
[elastic_agent][debug] 1 actions with ids '45ffb72a-4777-4a25-847a-7809d4684b30' acknowledging
[elastic_agent][debug] Request method: POST, path: /api/fleet/agents/8b27661e-017f-4c63-8ebb-941d2f92761d/acks, reqID: 01H9QP49MGHCDPMRPTGDPEBP5T
[elastic_agent][debug] Creating new request to request URL https://1.1.1.1:8220/api/fleet/agents/8b27661e-017f-4c63-8ebb-941d2f92761d/acks?
[elastic_agent][debug] observed check-in for endpoint service: token:"e4b15a74-5d5d-4781-ac68-e5c57bb47b43" units:{id:"endpoint-default-e54eaa95-af04-405b-b018-d38ee7fccd38" config_state_idx:4 state:HEALTHY message:"Applied policy {e54eaa95-af04-405b-b018-d38ee7fccd38}" payload:{fields:{key:"error" value:{struct_value:{fields:{key:"code" value:{number_value:0}} fields:{key:"message" value:{string_value:"Success"}}}}}}} units:{id:"endpoint-default" type:OUTPUT config_state_idx:1 state:HEALTHY message:"Applied policy {e54eaa95-af04-405b-b018-d38ee7fccd38}" payload:{fields:{key:"error" value:{struct_value:{fields:{key:"code" value:{number_value:0}} fields:{key:"message" value:{string_value:"Success"}}}}}}} version_info:{name:"Endpoint" version:"8.9.0"} features:{source:{fields:{key:"agent" value:{struct_value:{fields:{key:"features" value:{struct_value:{fields:{key:"fqdn" value:{struct_value:{fields:{key:"enabled" value:{bool_value:false}}}}}}}}}}}} fqdn:{}} features_idx:1

Linux, 8.9.0 -> 8.9.2, wget https://artifacts.elastic.co/GPG-KEY-elastic-agent works from the system, Upgrade Process works

[elastic_agent][info] starting upgrade to version 8.9.2 in background
[elastic_agent][info] Upgrading agent
[elastic_agent][info] download from http://nexus.domain.tld/repository/proxy-raw-elasticagent/beats/elastic-agent/elastic-agent-8.9.2-linux-x86_64.tar.gz completed in 3 seconds @ 178.7MBps
[elastic_agent][info] download from http://nexus.domain.tld/repository/proxy-raw-elasticagent/beats/elastic-agent/elastic-agent-8.9.2-linux-x86_64.tar.gz.sha512 completed in Less than a second @ +InfYBps
[elastic_agent][info] Default PGP being appended
[elastic_agent][info] Using 2 PGP keys
[elastic_agent][info] Default PGP being appended
[elastic_agent][info] Using 2 PGP keys
[elastic_agent][info] Verification with PGP[0] successful
[elastic_agent][info] Unpacked upgrade artifact
[elastic_agent][info] Copying run directory
[elastic_agent][info] Changing symlink
[elastic_agent][info] Writing upgrade marker file
[elastic_agent][info] Updating active commit
[elastic_agent][info] APM instrumentation disabled
[elastic_agent][info] Gathered system information
[elastic_agent][info] Detected available inputs and outputs
[elastic_agent][info] Capabilities file not found in /opt/Elastic/Agent/capabilities.yml
[elastic_agent][info] Determined allowed capabilities
[elastic_agent][debug] exiting, lock already exists
[elastic_agent][info] Parsed configuration and determined agent is managed by Fleet
[elastic_agent][info] Starting stats endpoint
[elastic_agent][info] Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)
[elastic_agent][info] Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
[elastic_agent][info] restoring current policy from disk
[elastic_agent][info] Source URI changed from "https://artifacts.elastic.co/downloads/" to "http://nexus.domain.tld/repository/proxy-raw-elasticagent/"
[elastic_agent][info] Updating running component model


@jlind23
Contributor Author

jlind23 commented Sep 7, 2023

@matthiasledergerber the blocked connection will indeed cause a problem in the agent upgrade process.
@pierrehilbert will document a workaround that helps air-gapped users manually download the key https://artifacts.elastic.co/GPG-KEY-elastic-agent and host it behind the same URL in their air-gapped environment, most probably by changing the hosts file (/etc/hosts) on the agent machines.

Would that work for you?

@cmacknz
Member

cmacknz commented Sep 7, 2023

The embedded GPG key is still valid; treating a failed download of the fallback key from the public URL as a fatal error is a bug. elastic/elastic-agent#3368

@matthiasledergerber

@matthiasledergerber the blocked connection will indeed cause a problem in the agent upgrade process. @pierrehilbert will document a workaround that helps air-gapped users manually download the key https://artifacts.elastic.co/GPG-KEY-elastic-agent and host it behind the same URL in their air-gapped environment, most probably by changing the hosts file (/etc/hosts) on the agent machines.

Would that work for you?

Yes, we can try.

We have about 100 agents. Our repositories are HTTP-only for other reasons, so we cannot use a workaround that requires encrypted connections and certificates (it would require a PKI infrastructure for the user, deployment options for self-signed certificates, etc.).

But in the end, if we are required to do DNS redirection / HTTP redirection on every host, we can just as well redeploy the agents.

There is no pressure for us on upgrading the agents currently.

@jlind23
Contributor Author

jlind23 commented Sep 7, 2023

The fix that Craig linked will not help in your case.
The upgrade functionality you are using is that of the installed agent, which will always require this DNS redirection until you have upgraded to a release that contains the fix.
So redeploying the agents would probably be worthwhile only once the release containing the patch has shipped.

@pierrehilbert
Contributor

I just merged PR elastic/elastic-agent#3375 to add a doc with the workaround.
@kilfoyle also added a known issue.

@michalpristas
Contributor

Just a quick question: considering the path suggested by Josh, are we taking version qualifiers, snapshot flags, etc. into consideration?
Or are we sticking with a simple major.minor.patch?

@jlind23
Contributor Author

jlind23 commented Sep 20, 2023

I believe we should use major.minor.patch without the snapshot flag which means that both 8.10.0 and 8.10.0-SNAPSHOT will have their key available at the same path.
@michel-laterman is this the path you took in your PR?
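
As an illustration of that rule, the sketch below reduces a full version string to major.minor.patch so that 8.10.0 and 8.10.0-SNAPSHOT map to the same path segment. The function `baseVersion` and its regular expression are assumptions made for this example, not code from either repository.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRe captures major.minor.patch and tolerates a trailing qualifier
// such as -SNAPSHOT or +build metadata.
var versionRe = regexp.MustCompile(`^(\d+\.\d+\.\d+)(?:[-+].*)?$`)

// baseVersion strips snapshot flags and other qualifiers from a version,
// returning only major.minor.patch. It reports false for malformed input.
func baseVersion(v string) (string, bool) {
	m := versionRe.FindStringSubmatch(v)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	for _, v := range []string{"8.10.0", "8.10.0-SNAPSHOT", "8.10"} {
		base, ok := baseVersion(v)
		fmt.Printf("%q -> %q %v\n", v, base, ok)
	}
}
```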

@michel-laterman
Contributor

Yes it is.

Currently all version numbers will result in the same (default) key path being used.

This key is retrieved from a single "upstream" source if it's not available on disk.

@michalpristas
Contributor

michalpristas commented Sep 21, 2023

@michel-laterman do you have a branch I can work with?
I have something that should work, but I'd rather test it with a 'real' server.

@defensivedepth

Side note for Elastic team - this bug would have been caught if there was some testing for Air-Gapped Agent upgrades. Is this something that can be added to your automated testing regime?

@joshdover
Contributor

Hi @defensivedepth - I think you're referring to this bug elastic/elastic-agent#3368 which is related. This ticket is more about making it simpler for air gapped upgrades if/when our code signing key is rotated, but the change in Agent that introduced the bug shouldn't have failed upgrades before this ticket was completed.

We have scheduled work to implement tests for air gapped to prevent this from happening again: elastic/elastic-agent#3403

@defensivedepth

Fantastic, thanks @joshdover

@michel-laterman
Contributor

@michalpristas, #2977 has the work; implementation is pretty much complete, i'm just fixing the e2e tests

@thethirstyturtle

Adding this comment for others that find this post.

The workaround that relies on DNS redirection is not really workable for us, as it also requires generating a TLS certificate for the spoofed domain, which we are not able to do.

We were able to upgrade Fleet and the agents using the command below. All of our hosts are RHEL7/8 with agents installed via tarball. We tried passing the pgp-path and uri flags, but neither worked; the upgrade still attempted to reach the public Elastic Artifacts URL.

AGENT-BINARY-DIR/elastic-agent upgrade <version> --skip-verify

On RHEL7/8 this was the command we used
/opt/Elastic/Agent/elastic-agent upgrade 8.10.4 --skip-verify

@lucabelluccini
Contributor

Hello @thethirstyturtle - sorry I didn't manage to update this issue.
We have a knowledge article in the support portal which reports your suggested workaround.
