Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modelindexer: improve observability #6630

Merged
merged 6 commits into from Nov 17, 2021
Merged

Conversation

axw
Copy link
Member

@axw axw commented Nov 16, 2021

Motivation/summary

Improve observability of modelindexer:

  • Count number of events rejected due to Elasticsearch queue saturation (libbeat.output.events.toomany)
  • Count number of events successfully indexed (libbeat.output.events.acked)
  • Count number of bulk requests (libbeat.output.events.batches)
  • Add tracing of bulk requests, similar to the libbeat output. Request-level errors are captured.
  • Inject trace/transaction ID into logs.

Checklist

How to test these changes

  1. Run apm-server with -e -E output.elasticsearch.experimental=true -E instrumentation.enabled=true -E monitoring.enabled=true
  2. Send events to server
  3. Check logs for "bulk request completed". This entry should have trace.id and transaciton.id fields, which should correspond to a trace and transaction with the same IDs.
  4. Check stack monitoring docs contain the added metrics

Related issues

#6360

@axw axw added v8.0.0 backport-8.0 Automated backport with mergify labels Nov 16, 2021
@apmmachine
Copy link
Collaborator

apmmachine commented Nov 16, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-11-17T12:21:42.964+0000

  • Duration: 48 min 42 sec

  • Commit: 42751bd

Test stats 🧪

Test Results
Failed 0
Passed 5622
Skipped 19
Total 5641

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /hey-apm : Run the hey-apm benchmark.

  • /package : Generate and publish the docker images.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@axw axw marked this pull request as ready for review November 16, 2021 09:03
@axw axw requested a review from a team November 16, 2021 09:03
@mergify
Copy link
Contributor

mergify bot commented Nov 16, 2021

This pull request is now in conflicts. Could you fix it @axw? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b modelindexer-errors upstream/modelindexer-errors
git merge upstream/master
git push upstream modelindexer-errors

Copy link
Contributor

@marclop marclop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great

@axw axw enabled auto-merge (squash) November 17, 2021 11:32
@axw axw merged commit 3064a5d into elastic:master Nov 17, 2021
mergify bot pushed a commit that referenced this pull request Nov 17, 2021
* model/modelindexer: capture more metrics

* model/modelindexer: trace flush operations

* model/modelindexer: inject trace ID into logs

* Update changelog

(cherry picked from commit 3064a5d)
axw added a commit that referenced this pull request Nov 18, 2021
* modelindexer: improve observability (#6630)

* model/modelindexer: capture more metrics

* model/modelindexer: trace flush operations

* model/modelindexer: inject trace ID into logs

* Update changelog

(cherry picked from commit 3064a5d)

* make update

Co-authored-by: Andrew Wilkins <axw@elastic.co>
@marclop marclop self-assigned this Dec 21, 2021
@marclop
Copy link
Contributor

marclop commented Dec 21, 2021

Verified using 9bb5e4c, with one caveat, I had to enable debug logging since that's the level those logs use:

{"log.level":"debug","@timestamp":"2021-12-21T15:21:10.672+0800","log.logger":"modelindexer","log.origin":{"file.name":"modelindexer/indexer.go","file.line":400},"message":"bulk request completed: 4 indexed, 0 failed (0 exceeded capacity)","service.name":"apm-server","trace.id":"05aae453c235fa7efbcd3b72f3666339","transaction.id":"05aae453c235fa7e","ecs.version":"1.6.0"}

Metrics are correctly added to the monitoring docs:

{"log.level":"debug","@timestamp":"2021-12-21T15:21:18.966+0800","log.logger":"monitoring","log.origin":{"file.name":"processing/processors.go","file.line":203},"message":"Publish event: {\n  \"@timestamp\": \"2021-12-21T07:21:18.964Z\",\n  \"@metadata\": {\n    \"beat\": \"apm-server\",\n    \"type\": \"_doc\",\n    \"version\": \"8.0.0\",\n    \"params\": {\n      \"interval\": \"10s\"\n    },\n    \"cluster_uuid\": \"nE_NVzmNRAWVnk5BzWVSJw\",\n    \"type\": \"beats_stats\",\n    \"interval_ms\": 10000\n  },\n  \"metrics\": {\n    \"beat\": {\n      \"memstats\": {\n        \"memory_total\": 39181400,\n        \"memory_alloc\": 24287288,\n        \"memory_sys\": 38142984,\n        \"gc_next\": 31763216,\n        \"rss\": 38535168\n      },\n      \"cpu\": {\n        \"total\": {\n          \"ticks\": 2,\n          \"time\": {\n            \"ms\": 97\n          },\n          \"value\": 2\n        },\n        \"user\": {\n          \"time\": {\n            \"ms\": 54\n          },\n          \"ticks\": 1\n        },\n        \"system\": {\n          \"time\": {\n            \"ms\": 43\n          },\n          \"ticks\": 1\n        }\n      },\n      \"runtime\": {\n        \"goroutines\": 51\n      },\n      \"info\": {\n        \"uptime\": {\n          \"ms\": 30032\n        },\n        \"ephemeral_id\": \"93e72de4-f0c8-4fb9-b3f3-3beb681456db\",\n        \"version\": \"8.0.0\"\n      }\n    },\n    \"apm-server\": {\n      \"jaeger\": {\n        \"grpc\": {\n          \"sampling\": {\n            \"response\": {\n              \"count\": 0,\n              \"errors\": {\n                \"count\": 0\n              },\n              \"valid\": {\n                \"count\": 0\n              }\n            },\n            \"event\": {\n              \"received\": {\n                \"count\": 0\n              }\n            },\n            \"request\": {\n              \"count\": 0\n            }\n          },\n          \"collect\": {\n            \"request\": {\n              \"count\": 0\n            },\n            \"response\": {\n              \"errors\": {\n                \"count\": 0,\n                \"ratelimit\": 0,\n                \"timeout\": 0,\n                \"unauthorized\": 0\n              },\n              \"valid\": {\n                \"count\": 0\n              },\n              \"count\": 0\n            }\n          }\n        }\n      },\n      \"server\": {\n        \"request\": {\n          \"count\": 4\n        },\n        \"response\": {\n          \"count\": 3,\n          \"errors\": {\n            \"decode\": 0,\n            \"unavailable\": 0,\n            \"closed\": 0,\n            \"validate\": 0,\n            \"notfound\": 0,\n            \"method\": 0,\n            \"toolarge\": 0,\n            \"unauthorized\": 0,\n            \"internal\": 0,\n            \"timeout\": 0,\n            \"count\": 0,\n            \"queue\": 0,\n            \"ratelimit\": 0,\n            \"forbidden\": 0,\n            \"invalidquery\": 0\n          },\n          \"valid\": {\n            \"count\": 3,\n            \"accepted\": 3,\n            \"ok\": 0,\n            \"notmodified\": 0\n          }\n        },\n        \"unset\": 0\n      },\n      \"decoder\": {\n        \"uncompressed\": {\n          \"content-length\": 0,\n          \"count\": 0\n        },\n        \"reader\": {\n          \"count\": 4\n        },\n        \"missing-content-length\": {\n          \"count\": 4\n        },\n        \"deflate\": {\n          \"content-length\": 0,\n          \"count\": 0\n        },\n        \"gzip\": {\n          \"content-length\": 0,\n          \"count\": 0\n        }\n      },\n      \"root\": {\n        \"request\": {\n          \"count\": 0\n        },\n        \"response\": {\n          \"count\": 0,\n          \"errors\": {\n            \"unauthorized\": 0,\n            \"toolarge\": 0,\n            \"invalidquery\": 0,\n            \"closed\": 0,\n            \"notfound\": 0,\n            \"queue\": 0,\n            \"timeout\": 0,\n            \"decode\": 0,\n            \"internal\": 0,\n            \"method\": 0,\n            \"ratelimit\": 0,\n            \"count\": 0,\n            \"forbidden\": 0,\n            \"validate\": 0,\n            \"unavailable\": 0\n          },\n          \"valid\": {\n            \"count\": 0,\n            \"notmodified\": 0,\n            \"accepted\": 0,\n            \"ok\": 0\n          }\n        },\n        \"unset\": 0\n      },\n      \"profile\": {\n        \"request\": {\n          \"count\": 0\n        },\n        \"response\": {\n          \"errors\": {\n            \"closed\": 0,\n            \"count\": 0,\n            \"unavailable\": 0,\n            \"timeout\": 0,\n            \"queue\": 0,\n            \"decode\": 0,\n            \"method\": 0,\n            \"invalidquery\": 0,\n            \"validate\": 0,\n            \"toolarge\": 0,\n            \"forbidden\": 0,\n            \"ratelimit\": 0,\n            \"internal\": 0,\n            \"notfound\": 0,\n            \"unauthorized\": 0\n          },\n          \"valid\": {\n            \"accepted\": 0,\n            \"count\": 0,\n            \"ok\": 0,\n            \"notmodified\": 0\n          },\n          \"count\": 0\n        },\n        \"unset\": 0\n      },\n      \"otlp\": {\n        \"grpc\": {\n          \"metrics\": {\n            \"request\": {\n              \"count\": 0\n            },\n            \"response\": {\n              \"valid\": {\n                \"count\": 0\n              },\n              \"count\": 0,\n              \"errors\": {\n                \"unauthorized\": 0,\n                \"count\": 0,\n                \"ratelimit\": 0,\n                \"timeout\": 0\n              }\n            },\n            \"consumer\": {\n              \"unsupported_dropped\": 0\n            }\n          },\n          \"traces\": {\n            \"response\": {\n              \"count\": 0,\n              \"errors\": {\n                \"timeout\": 0,\n                \"unauthorized\": 0,\n                \"count\": 0,\n                \"ratelimit\": 0\n              },\n              \"valid\": {\n                \"count\": 0\n              }\n            },\n            \"request\": {\n              \"count\": 0\n            }\n          },\n          \"logs\": {\n            \"request\": {\n              \"count\": 0\n            },\n            \"response\": {\n              \"valid\": {\n                \"count\": 0\n              },\n              \"count\": 0,\n              \"errors\": {\n                \"ratelimit\": 0,\n                \"timeout\": 0,\n                \"unauthorized\": 0,\n                \"count\": 0\n              }\n            }\n          }\n        }\n      },\n      \"sampling\": {\n        \"transactions_dropped\": 1\n      },\n      \"acm\": {\n        \"request\": {\n          \"count\": 1\n        },\n        \"response\": {\n          \"errors\": {\n            \"notfound\": 0,\n            \"closed\": 0,\n            \"timeout\": 0,\n            \"unauthorized\": 0,\n            \"validate\": 0,\n            \"internal\": 0,\n            \"decode\": 0,\n            \"ratelimit\": 0,\n            \"method\": 0,\n            \"queue\": 0,\n            \"count\": 1,\n            \"invalidquery\": 0,\n            \"unavailable\": 0,\n            \"toolarge\": 0,\n            \"forbidden\": 1\n          },\n          \"valid\": {\n            \"notmodified\": 0,\n            \"accepted\": 0,\n            \"count\": 0,\n            \"ok\": 0\n          },\n          \"count\": 1\n        },\n        \"unset\": 0\n      },\n      \"processor\": {\n        \"transaction\": {\n          \"transformations\": 5\n        },\n        \"metric\": {\n          \"transformations\": 5\n        },\n        \"stream\": {\n          \"accepted\": 18,\n          \"errors\": {\n            \"invalid\": 0,\n            \"toolarge\": 0\n          }\n        },\n        \"span\": {\n          \"transformations\": 9\n        }\n      },\n      \"aggregation\": {\n        \"txmetrics\": {\n          \"active_groups\": 3,\n          \"overflowed\": 0\n        }\n      }\n    },\n    \"libbeat\": {\n      \"config\": {\n        \"scans\": 0,\n        \"reloads\": 0,\n        \"module\": {\n          \"starts\": 0,\n          \"stops\": 0,\n          \"running\": 0\n        }\n      },\n      \"output\": {\n        \"write\": {\n          \"bytes\": 0,\n          \"errors\": 0\n        },\n        \"read\": {\n          \"errors\": 0,\n          \"bytes\": 0\n        },\n        \"type\": \"elasticsearch\",\n        \"events\": {\n          \"toomany\": 0,\n          \"total\": 18,\n          \"acked\": 11,\n          \"active\": 7,\n          \"batches\": 2,\n          \"failed\": 0\n        }\n      },\n      \"pipeline\": {\n        \"events\": {\n          \"dropped\": 0,\n          \"retry\": 0,\n          \"active\": 0,\n          \"total\": 0,\n          \"filtered\": 0,\n          \"published\": 0,\n          \"failed\": 0\n        },\n        \"queue\": {\n          \"max_events\": 4096,\n          \"acked\": 0\n        },\n        \"clients\": 1\n      }\n    },\n    \"system\": {\n      \"cpu\": {\n        \"cores\": 8\n      },\n      \"load\": {\n        \"5\": 5.6587,\n        \"15\": 6.3672,\n        \"norm\": {\n          \"1\": 0.5795,\n          \"5\": 0.7073,\n          \"15\": 0.7959\n        },\n        \"1\": 4.6357\n      }\n    }\n  },\n  \"beat\": {\n    \"name\": \"Marcs-MacBook-Pro-2.local\",\n    \"host\": \"Marcs-MacBook-Pro-2.local\",\n    \"uuid\": \"bf2a6d26-c5ed-4b41-a507-feab5b75c095\",\n    \"type\": \"apm-server\",\n    \"version\": \"8.0.0\"\n  }\n}","service.name":"apm-server","ecs.version":"1.6.0"}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants