Data is generated with host name under Discover tab if agent is directly installed with policy having FQDN host name already selected. #2697

Closed
amolnater-qasource opened this issue May 16, 2023 · 11 comments · Fixed by elastic/beats#35736
Labels: bug, impact:medium, QA:Validated, Team:Elastic-Agent

Comments

@amolnater-qasource

Kibana Build details:

VERSION: 8.8 BC4 Kibana cloud environment
BUILD: 63052
COMMIT: ecb9826ceec5457214bec760fcf049f5614ad3d0

Host OS and Browser version: All, All

Preconditions:

  1. 8.8 BC4 Kibana cloud environment should be available.

Steps to reproduce:

  1. Navigate to the Fleet tab and create an agent policy with the FQDN host name type already selected.
  2. Install an agent with this policy.
  3. Observe that initial data under the Discover tab is generated with the short host name and only later continues with the FQDN host name.
    • This happens for both the logs-* and metrics-* indices.

Expected:
When an agent is installed directly with a policy that already has the FQDN host name type selected, data under the Discover tab should only be generated with the FQDN host name.

Note:

  • More data with the short host name (not the FQDN) is generated for Windows hosts than for Linux agents.

Similar Issue:

Screen Recording:

Discover.-.Elastic.-.Google.Chrome.2023-05-16.12-47-08.mp4
Discover.-.Elastic.-.Google.Chrome.2023-05-16.13-10-19.mp4

Logs:
elastic-agent-diagnostics-2023-05-16T07-18-13Z-00.zip

@amolnater-qasource added the bug, Team:Elastic-Agent-Control-Plane, and impact:medium labels May 16, 2023
@amolnater-qasource
Author

@manishgupta-qasource Please review.

@manishgupta-qasource

Secondary review for this ticket is done.

@amolnater-qasource
Author

JFI @ycombinator

@cmacknz added the Team:Elastic-Agent label and removed the Team:Elastic-Agent-Control-Plane label May 17, 2023
@fearful-symmetry
Contributor

Alright, trying to reproduce this.
On a new 8.8.0 cluster and an agent on a debian GCP node:

On logs-*:

  • host.hostname will sometimes be debian and sometimes be the non-fqdn hostname. Presumably depends on the data stream the document is coming from. However, I don't think this is the field that fqdn should be reporting on.

  • host.name is consistently the fqdn.

On metrics-*

  • host.name is just the FQDN.

Not sure if there's some kind of race going on that makes this work less than 100% of the time. Will keep investigating.

@fearful-symmetry
Contributor

@amolnater-qasource still trying to reproduce this. Were you able to get this issue to reproduce at all on Linux? Or was it just Windows? Did you manage to reproduce it 100% of the time?

@fearful-symmetry
Contributor

Also, @amolnater-qasource, while I work on reproducing this, can you get another diagnostic dump with debug-level logs after the issue appears? Might help.

@fearful-symmetry
Contributor

Progress! I was able to reproduce this on my Linux box by having add_host_metadata generate fake hostnames depending on whether features.FQDN() was set, and by adding some extra log lines.

grep "ADD HOST METADATA" data/elastic-agent-92c669/logs/elastic-agent-20230607.ndjson | grep "metricbeat" | head -n 100 | jq .message
"ADD HOST METADATA: in loadData, updating config with fqdn disabled"
"ADD HOST METADATA: created metadata object with name 'shoebill.nest.no-fqdn'"
"ADD HOST METADATA: called loadData in new, fqdn: false"
"ADD HOST METADATA: updating metadata callback in New()"
"ADD HOST METADATA: in loadData, updating config with fqdn disabled"
"ADD HOST METADATA: created metadata object with name 'shoebill.nest.no-fqdn'"
"ADD HOST METADATA: called loadData in new, fqdn: false"
"ADD HOST METADATA: updating metadata callback in New()"
"ADD HOST METADATA: in loadData, updating config with fqdn disabled"
"ADD HOST METADATA: created metadata object with name 'shoebill.nest.no-fqdn'"
"ADD HOST METADATA: called loadData in new, fqdn: false"
"ADD HOST METADATA: updating metadata callback in New()"
"ADD HOST METADATA CALLED expireCache"
"ADD HOST METADATA: CALLED RUN(), FQDN: true"
"ADD HOST METADATA: CALLED RUN(), FQDN: true"
"ADD HOST METADATA: in loadData, cache is not expired, fqdn: true"
"ADD HOST METADATA: updated event with data: shoebill.nest.no-fqdn"
"ADD HOST METADATA: CALLED RUN(), FQDN: true"
"ADD HOST METADATA: in loadData, cache is not expired, fqdn: true"
"ADD HOST METADATA: updated event with data: shoebill.nest.no-fqdn"
"ADD HOST METADATA: CALLED RUN(), FQDN: true"
"ADD HOST METADATA: in loadData, cache is not expired, fqdn: true"
"ADD HOST METADATA: updated event with data: shoebill.nest.no-fqdn"
"ADD HOST METADATA: in loadData, updating config with fqdn enabled"
"ADD HOST METADATA: created metadata object with name 'shoebill.nest.fqdn'"
"ADD HOST METADATA: updated event with data: shoebill.nest.fqdn"
"ADD HOST METADATA CALLED expireCache"
"ADD HOST METADATA: CALLED RUN(), FQDN: true"
"ADD HOST METADATA: in loadData, updating config with fqdn enabled"
"ADD HOST METADATA: created metadata object with name 'shoebill.nest.fqdn'"
"ADD HOST METADATA: updated event with data: shoebill.nest.fqdn"

This is the interesting part:

"ADD HOST METADATA: updating metadata callback in New()"
"ADD HOST METADATA CALLED expireCache"
"ADD HOST METADATA: CALLED RUN(), FQDN: true"
"ADD HOST METADATA: CALLED RUN(), FQDN: true"
"ADD HOST METADATA: in loadData, cache is not expired, fqdn: true"
"ADD HOST METADATA: updated event with data: shoebill.nest.no-fqdn"

I'm not quite sure of the why/how yet. We call expireCache(), which is what happens when the V2 manager gets the updated config with the feature, but the next call to Run() decides the cache is not expired, and we get an event with shoebill.nest.no-fqdn.

Not sure if there are multiple unit states/processors that are conflicting in some way, or if there's an issue with the cache. Will investigate more tomorrow.

@amolnater-qasource
Author

Hi @fearful-symmetry

Thank you for looking into this issue.

Were you able to get this issue to reproduce at all on linux? Or was it just Windows? Did you manage to reproduce it 100% of the time?

Yes, the issue is reproducible every time for us. However, less data is generated with the short host name (when FQDN is set) for the Linux agent than for the Windows agent.

Linux host used: Ubuntu 22.04

We have revalidated this issue on the latest 8.9.0 SNAPSHOT and it is still reproducible.

  • Data is generated with host name under Discover tab if agent is directly installed with policy having FQDN host name already selected.

Build details:
VERSION: 8.9.0 SNAPSHOT
BUILD: 63857
COMMIT: 6cf0c8c5642f4f2f41cd7db524f55c2a35a5bbfb

Debug level Logs:
elastic-agent-diagnostics-2023-06-08T04-37-22Z-00.zip

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-06-08.09-55-42.mp4

Please let us know if anything else is required from our end.
Thanks!

@fearful-symmetry
Contributor

fearful-symmetry commented Jun 8, 2023

So I think what's happening is this:

  1. The add_host_metadata processor starts up before everything else and sets the hostname value without FQDN.
  2. The config manager gets a unit update with FQDN: true and sets the global FQDN flag.
  3. The config manager calls expireCache() on add_host_metadata, telling the processor to update the FQDN value the next time it gets an event.
  4. A new event reaches the processor and calls loadData(), which checks whether the cache is expired. It is, so expired() returns true and resets the cache update time.
  5. Another event comes in and calls loadData() again. loadData() calls expired(), which now returns false, so loadData() returns without updating the FQDN. The processor returns an event with the non-FQDN value.
  6. Step 5 repeats 1-3 times.
  7. The original event that first triggered the cache time reset finally loads the FQDN, setting the add_host_metadata processor's hostname to the correct FQDN.

@fearful-symmetry
Contributor

Fix here: elastic/beats#35736

@amolnater-qasource added the QA:Ready For Testing label Jun 14, 2023
@amolnater-qasource
Author

Hi Team,

We have revalidated this issue on the latest 8.9.0 BC2 Kibana cloud environment and found it fixed.

Observations:

  • Data is only generated with FQDN host name under Discover tab if agent is directly installed with policy having FQDN host name already selected.

Screen Recording:

Discover.-.Elastic.-.Google.Chrome.2023-07-06.10-42-53.mp4
Discover.-.Elastic.-.Google.Chrome.2023-07-06.11-30-14.mp4

Build details:
BUILD: 64459
COMMIT: 6950a2b8207d8388ee8c842d6c0e2b1e1031fd36
Artifact Link: https://staging.elastic.co/8.9.0-826fd88d/downloads/beats/elastic-agent/elastic-agent-8.9.0-windows-x86_64.zip

Hence, we are marking this issue as QA:Validated.

Thanks

@amolnater-qasource added the QA:Validated label and removed the QA:Ready For Testing label Jul 6, 2023