Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy Java attacher jar to a tmp directory #8803

Merged
merged 29 commits into from Sep 7, 2022

Conversation

eyalkoren
Copy link
Contributor

@eyalkoren eyalkoren commented Aug 4, 2022

Motivation/summary

Fixing a bug discovered while testing this end-to-end: when Elastic Agent is ran through ./elastic-agent install, it creates a file tree that is not readable for all users on Linux by default.
Therefore, when running the attacher as a non-root user, it can't access the attacher jar.
The fix is to create a tmp directory with write permissions only to root and copy the attacher jar to it with 0755 permissions to make it accessible for all users for read and execute.

Fixe verified on:

  • Amazon Linux 2
  • Ubuntu 22.04

Note for reviewer:

  • pay extra care to security concerns I may have overlooked
  • currently, the agent is downloaded and verified for each attachment, but we should consider ways to prevent that
  • we currently assume the default temp dir would work, but in some OSs we may need to have a tmp dir per user
  • the new test may fail on Windows, where file modes are handled differently, we'll take care of that later

Checklist

How to manually test these changes

  1. setup a stack and install a matching agent on the tested Linux host
  2. set Elastic agent log level to debug
  3. run a Java app on the tested host as a limited-privileges user. It is recommended that this app will have a way to identify it specifically through matching rules, for example, use -Delastic.apm.attach=true in the command line that runs it
  4. within the Fleet UI, edit the corresponding agent's policy to include the APM integration. In the APM integration, within the APM agent tab, enable the Java agent auto-attachment option and make sure the matching rule includes a hint that will specifically identify your Java app process
  5. apply the policy and inspect the logs. It should contain indications that attachment took place. In addition, it should contain the path of the temp dir to which the attacher was copied
  6. login to the Linux host and look for the tmp dir indicated by logs - it should have 0700 mode and be owned by the same user that runs the Java app. It should contain the attacher jar with 0600 mode and be owned by the same user. If attachment was successful, it should contain an agent folder with internal structure that includes some PGP-utility libraries and a directory per agent version as requested within the integration.
  7. verify that APM data is being traced and available in Kibana
  8. restart the Java app, invoke some traffic on it and verify that the same agent jar was used and that APM data is available
  9. stop Elastic agent and verify that the attacher temp dir is removed

Related issues

#8718
#8660
#6969
#7832

@eyalkoren eyalkoren added bug backport-8.4 Automated backport with mergify labels Aug 4, 2022
@apmmachine
Copy link
Collaborator

apmmachine commented Aug 4, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-09-07T11:07:05.100+0000

  • Duration: 28 min 50 sec

Test stats 🧪

Test Results
Failed 0
Passed 131
Skipped 0
Total 131

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate and publish the docker images.

  • /test windows : Build & tests on Windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@apmmachine
Copy link
Collaborator

apmmachine commented Aug 4, 2022

📚 Go benchmark report

Diff with the main branch

name                                                                                              old time/op    new time/op    delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/errors.ndjson-12                                                                   68.6µs ±15%    86.1µs ±15%  +25.58%  (p=0.032 n=5+5)
BackendProcessor/invalid-metadata-2.ndjson-12                                                       7.94µs ± 8%    7.14µs ± 8%  -10.05%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/errors_rum.ndjson-12                    7.89µs ± 9%    6.97µs ± 4%  -11.72%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/invalid-json-event.ndjson-12            3.67µs ± 2%    3.79µs ± 2%   +3.20%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/minimal-service.ndjson-12               3.50µs ± 4%    3.72µs ± 5%   +6.30%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/invalid-json-event.ndjson-12            2.67µs ± 7%    2.45µs ± 7%   -8.38%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/otel-bridge.ndjson-12                   7.58µs ± 4%    6.86µs ± 7%   -9.55%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/errors_rum.ndjson-12                    2.34µs ± 1%    2.37µs ± 1%   +1.46%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event-type.ndjson-12           784ns ± 3%     757ns ± 2%   -3.41%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12          1.06µs ± 1%    1.04µs ± 0%   -1.14%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12           490ns ± 1%     472ns ± 1%   -3.64%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12             498ns ± 1%     476ns ± 2%   -4.49%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12      1.58µs ± 1%    1.56µs ± 2%   -1.52%  (p=0.032 n=5+5)
ReadBatch/errors_rum.ndjson-12                                                                      33.3µs ±14%    24.5µs ±43%  -26.34%  (p=0.016 n=5+5)
ReadBatch/transactions_spans_rum.ndjson-12                                                          16.6µs ±18%    22.8µs ±18%  +37.20%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

name                                                                                              old alloc/op   new alloc/op   delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/transactions.ndjson-12                                                             26.3kB ± 1%    26.7kB ± 1%   +1.55%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/transactions_spans_rum.ndjson-12        6.20kB ± 1%    6.27kB ± 2%   +1.15%  (p=0.048 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12          4.60kB ± 1%    4.53kB ± 1%   -1.40%  (p=0.032 n=5+5)
ReadBatch/transactions-huge_traces.ndjson-12                                                        13.8kB ± 0%    13.8kB ± 0%   +0.05%  (p=0.024 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
ShardedWriteTransactionUncontended-12                                                               3.32kB ± 1%    3.37kB ± 0%   +1.57%  (p=0.016 n=5+4)
ReadEvents/json_codec/1_events-12                                                                   8.28kB ± 0%    8.28kB ± 0%   -0.04%  (p=0.032 n=5+5)
ReadEvents/nop_codec/399_events-12                                                                   899kB ± 0%     896kB ± 0%   -0.28%  (p=0.040 n=5+5)

name                                                                                              old allocs/op  new allocs/op  delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/decoder goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/transactions.ndjson-12                     527 ± 0%       526 ± 0%   -0.19%  (p=0.029 n=4+4)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

name                                                                                              old speed      new speed      delta
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/errors.ndjson-12                                                                 93.1MB/s ±14%  74.5MB/s ±14%  -20.00%  (p=0.032 n=5+5)
BackendProcessor/invalid-metadata-2.ndjson-12                                                     55.0MB/s ± 8%  61.3MB/s ± 9%  +11.49%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/errors_rum.ndjson-12                   241MB/s ± 8%   272MB/s ± 4%  +13.07%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/invalid-json-event.ndjson-12           160MB/s ± 2%   155MB/s ± 2%   -3.11%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/minimal-service.ndjson-12              121MB/s ± 4%   114MB/s ± 5%   -5.90%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/invalid-json-event.ndjson-12           220MB/s ± 7%   240MB/s ± 7%   +9.26%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/otel-bridge.ndjson-12                  248MB/s ± 4%   275MB/s ± 7%  +10.60%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/errors_rum.ndjson-12                   812MB/s ± 1%   800MB/s ± 1%   -1.43%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event-type.ndjson-12         499MB/s ± 3%   517MB/s ± 2%   +3.52%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12         556MB/s ± 1%   563MB/s ± 0%   +1.18%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata-2.ndjson-12         891MB/s ± 1%   924MB/s ± 1%   +3.77%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-metadata.ndjson-12           895MB/s ± 1%   937MB/s ± 2%   +4.71%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12     730MB/s ± 1%   741MB/s ± 2%   +1.55%  (p=0.032 n=5+5)
ReadBatch/errors_rum.ndjson-12                                                                    57.3MB/s ±15%  83.8MB/s ±62%  +46.31%  (p=0.016 n=5+5)
ReadBatch/transactions_spans_rum.ndjson-12                                                        70.9MB/s ±17%  51.2MB/s ±20%  -27.77%  (p=0.032 n=5+5)

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

@eyalkoren
Copy link
Contributor Author

/test windows

internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
magefile.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
@eyalkoren
Copy link
Contributor Author

/test windows

@eyalkoren eyalkoren added backport-skip Skip notification from the automated backport with mergify and removed backport-8.4 Automated backport with mergify labels Aug 10, 2022
@axw axw added the v8.5.0 label Aug 11, 2022
@axw axw marked this pull request as draft August 11, 2022 05:12
@eyalkoren
Copy link
Contributor Author

eyalkoren commented Aug 11, 2022

Status update:

  • Windows: it appears that on Windows, a JVM that runs as administrator can attach to a JVM that is ran by another user. Therefore, copying to a temp folder may not even be required and the bundled attacher can be used directly.

  • Linux: if we copy the attacher jar directly to the root temp dir, it would be accessible to all users with 0644 mode. Once a JVM is matched against an include rule, the attacher is invoked in the context of the target JVM user. It then downloads and verifies an agent jar and store it locally. Since the current directory in this context (root temp dir) is writable for that user, it will use it to cache the agent jars by creating a <TEMP-ROOT>/agent directory, owned by that user and with 0755 mode. Subsequent attachments by this user can use cached agents. This is safe since the agent folder is not writable to other users. However, if the attacher integration tries to attach to a JVM as another user, the existing cache won't be accessible for it.

Following is how the proposed implementation should work, until we discover something new when testing that:

  • If running on Windows - no need to copy to temp, run the embedded jar as administrator
  • On Linux, keep a sub temp attacher dir for each user that we want to attach to, following these guidelines:
    • user-attacher-temp-subdir creation sequence:
      • create a temp dir using os.MkdirTemp, which atomically creates it with access mode 0700
      • copy the attacher jar into it
      • os.Chmod attacher jar to 0600
      • os.Chown attacher jar to target user
      • osChown temp dir to target user
    • upon any error with executing the above, cache the result and don't try again for the same user
    • at javaattacher.Run() exit, all tmp dirs should be removed. This may rely on os.RemoveAll, but better to recursively traverse all temp subtrees, so that if there are non-removable files (e.g. help by still running attacher process), everything else will be cleared - decided to rely on os.RemoveAll as is
  • macOS - needs more investigation

Placing here an offline discussion for reference- the above solution for Linux relies on us properly maintaining directory tree within a root that has a 777 access mode (the root temp dir). If we instead copy to anywhere with the user home directory, for example the .cache or equivalent caching folder, then all files will be inherently protected by the OS access default settings. However, we decided to got with temp based on the following considerations:

  • getting the temp dir based on os.TempDir is generic and properly maintained
  • in contrary, getting the user home dir and specific subdir (like caching dir) requires a per-OS error-prone discovery logic
  • temp directory is the proper place to store temp files, such that should be removed on process termination, especially where there is risk that we fail to remove files on exit

@eyalkoren eyalkoren added the ci:windows Enable builds in the CI on Windows label Aug 23, 2022
@eyalkoren eyalkoren marked this pull request as ready for review August 23, 2022 13:03
@eyalkoren
Copy link
Contributor Author

Verified through manual testing on Linux. Eventual tmp directory tree is as follows:

[ec2-user@ip-172-31-20-154 bin]$ ls -l /tmp/
total 0
drwx------ 3 ec2-user ec2-user 44 Aug 23 08:38 elasticapmagent-114732219
drwxr-xr-x 2 ec2-user ec2-user 18 Aug 23 08:39 hsperfdata_ec2-user
[ec2-user@ip-172-31-20-154 bin]$ ls -l /tmp/elasticapmagent-114732219/
total 8340
drwxr-xr-x 3 ec2-user ec2-user      20 Aug 23 08:38 agent
-rw------- 1 ec2-user ec2-user 8538845 Aug 23 08:38 java-attacher.jar
[ec2-user@ip-172-31-20-154 bin]$ ls -l /tmp/elasticapmagent-114732219/agent/
total 0
drwxr-xr-x 3 ec2-user ec2-user 53 Aug 23 08:38 1_33_0
[ec2-user@ip-172-31-20-154 bin]$ ls -l /tmp/elasticapmagent-114732219/agent/1_33_0/
total 9408
-rw-r--r-- 1 ec2-user ec2-user 9630190 Aug 23 08:38 elastic-apm-agent-1.33.0.jar
drwxr-xr-x 2 ec2-user ec2-user      97 Aug 23 08:38 lib

Summary:

  • tmp dir owner is app user and access mode is 0700
  • attacher jar owner is app user and access mode is 0600
  • agent downloaded and cached within tmp dir that is only accessible to app user

One comment remained unresolved, waiting for input.

In addition, I am pretty sure there are better ways to exclude tests on a specific OS than what I did and maybe there is a better way to do OS-polymorphism than this os_utils and os_utils_windows method.

@eyalkoren eyalkoren requested a review from axw August 23, 2022 15:34
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
@eyalkoren eyalkoren requested a review from axw August 24, 2022 12:59
@eyalkoren eyalkoren requested a review from marclop August 25, 2022 14:07
@eyalkoren
Copy link
Contributor Author

I won't be available for some time to merge this.
If you feel this is close enough, please bring to completion and merge.
I will update elastic/apm-agent-java#2196, #8660 and #8718 with the state of things assuming this PR is merged more-or-less as is now.

@eyalkoren
Copy link
Contributor Author

@elastic/apm-server - anything else still required?
AFAIK I addressed all review comments and updated other issues in what's still required for Windows and macOS

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A handful of minor comments, and what looks like a bug with error handling in copying the attacher jar.

internal/beater/java_attacher/java_attacher.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
internal/beater/java_attacher/os_utils.go Outdated Show resolved Hide resolved
@eyalkoren eyalkoren requested a review from axw September 7, 2022 06:05
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. LGTM.

@eyalkoren
Copy link
Contributor Author

Thanks a lot @axw and @marclop for your assistance! 🙏
Anything else before merging? Should we add something to changelog? I am not sure, as this is still partial (Linux support mostly) and in technical preview. If so, I'd appreciate your proposal for that.

@axw
Copy link
Member

axw commented Sep 7, 2022

@eyalkoren no worries. Yes, I think a changelog entry is worthwhile. Please also update the PR description to use the PR template, explaining steps to be taken for manual testing. Otherwise I think it's good to merge.

@eyalkoren
Copy link
Contributor Author

Done, please just see that the changelog addition is done as expected.

changelogs/head.asciidoc Outdated Show resolved Hide resolved
Co-authored-by: Andrew Wilkins <axwalk@gmail.com>
@eyalkoren eyalkoren enabled auto-merge (squash) September 7, 2022 09:48
@eyalkoren eyalkoren merged commit e561b1b into elastic:main Sep 7, 2022
@eyalkoren eyalkoren deleted the java-attacher-copy-to-temp branch September 7, 2022 11:47
@axw axw added the test-plan label Sep 29, 2022
@SylvainJuge
Copy link
Member

Tested with 8.5.0

Fleet agent used https://github.com/elastic/elastic-agent/commits/2b60ccf8da06fca193cb37a1186c2baa5c3b1f28
Ubuntu 22.04 LTS

Works as expected, is there any os-specific variant that requires further testing ? For example is it required to also test on Amazon Linux 2 ?

@axw
Copy link
Member

axw commented Sep 30, 2022

@SylvainJuge thanks! To the best of my knowledge there's no need to test on any specific Linux distros. If anyone knows otherwise, it would be @eyalkoren

I'll go ahead and mark this tested then.

@SylvainJuge
Copy link
Member

FYI @eyalkoren is on PTO, but I remember reproducing the bug before this change was applied, thus if it used to not work at all and now at least works on one of Linux distributions, so we already have some positive progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify bug ci:windows Enable builds in the CI on Windows test-plan test-plan-ok v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants