Added Spring Boot Actuator and metrics endpoints #1673

Merged
merged 4 commits into aehrc:main on Dec 1, 2023

Conversation

@chgl (Collaborator) commented Sep 3, 2023

Closes #1671.

This adds the spring-boot-starter-actuator and micrometer-registry-prometheus dependencies, which expose health and metrics endpoints.
In the default configuration, the health endpoints are exposed at `:8080/actuator/health` and the metrics at `:8080/actuator/prometheus`.
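
For reference, exposing these endpoints comes down to a small amount of standard Spring Boot configuration, roughly like the following application.yaml sketch (illustrative only, not the actual diff in this PR):

# Expose only the health and Prometheus endpoints over HTTP.
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus

The /actuator/prometheus endpoint itself is auto-configured by Spring Boot once micrometer-registry-prometheus is on the classpath.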

@johngrimes (Member)

I'm thinking that perhaps the default port for this interface should be different from the default port used for the FHIR interface.

This is because you would usually want a different level of visibility for this service, e.g. it might only need to be accessible by a monitoring component within the cluster rather than exposed externally.

@johngrimes (Member)

I've made this change in 280fef1; let me know what you think.

@chgl (Collaborator, Author) commented Sep 5, 2023

That's definitely a great change - I've defaulted to the same behavior in the HAPI FHIR Helm chart as well: https://github.com/hapifhir/hapi-fhir-jpaserver-starter/blob/master/charts/hapi-fhir-jpaserver/templates/deployment.yaml#L98.

One more thing (we can of course always override any of these properties at runtime): what do you think about the add-additional-paths setting (https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html#actuator.endpoints.kubernetes-probes)? It would expose the metrics on a dedicated port (8081) while keeping the liveness and readiness probes at :8080/livez and :8080/readyz. Alternatively, to keep the interface FHIR-native, we could introduce a $readyz/$livez operation at some later point.
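
In property form, this proposal would look roughly like the following (a sketch of the standard Spring Boot settings, not code from this branch):

management:
  server:
    port: 8081                     # dedicated management port for /actuator/*
  endpoint:
    health:
      probes:
        enabled: true              # liveness/readiness health groups (auto-enabled on Kubernetes)
        add-additional-paths: true # also expose /livez and /readyz on the main port (8080)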

@johngrimes (Member)

The probes could be on the management port as well, right? I'm not aware of any requirement to have the probes available on the same port as the application.

I would imagine that something like this in the container spec would work well:

startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8081
  periodSeconds: 5
  failureThreshold: 36
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8081
  periodSeconds: 5
  failureThreshold: 12
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8081
  periodSeconds: 5
  failureThreshold: 3

@chgl (Collaborator, Author) commented Sep 5, 2023

Absolutely. The rationale given for making the probes available on the same port as the main application:

If your Actuator endpoints are deployed on a separate management context, the endpoints do not use the same web infrastructure (port, connection pools, framework components) as the main application. In this case, a probe check could be successful even if the main application does not work properly (for example, it cannot accept new connections). For this reason, it is a good idea to make the liveness and readiness health groups available on the main server port. This can be done by setting the following property: management.endpoint.health.probes.add-additional-paths=true

https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html#actuator.endpoints.kubernetes-probes

@johngrimes (Member)

Ok, that makes sense.

I'm not worried about the FHIR thing, as this would be outside the FHIR endpoint anyway (/fhir).

Would this be exposing only the liveness and readiness endpoints, or also the Prometheus endpoint? Are there any security implications to having this accessible externally?

@chgl (Collaborator, Author) commented Sep 5, 2023

The Prometheus endpoint will only be exposed on port 8081 at /actuator/prometheus; only the health probes will also be available on :8080. There are some docs on security: https://docs.spring.io/spring-boot/docs/current/reference/html/actuator.html#actuator.endpoints.security. Since we explicitly expose only the health and metrics endpoints, there shouldn't be anything sensitive in there.
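
On the scraping side, a monitoring component inside the cluster would then point at the management port only, e.g. with a plain Prometheus scrape configuration along these lines (the job name and target host below are illustrative):

scrape_configs:
  - job_name: pathling                  # illustrative job name
    metrics_path: /actuator/prometheus  # Actuator path rather than the Prometheus default /metrics
    static_configs:
      - targets: ["pathling:8081"]      # hypothetical service host; only the management port is scraped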

Edit: for completeness, here's the response from the /actuator/prometheus endpoint. I POSTed some resources and then queried them using ?_query=fhirPath&filter=Patient.id.exists(), just to make sure that query parameters don't show up in the metrics:

# HELP jvm_classes_loaded_classes The number of classes that are currently loaded in the Java virtual machine
# TYPE jvm_classes_loaded_classes gauge
jvm_classes_loaded_classes 32095.0
# HELP jvm_classes_unloaded_classes_total The total number of classes unloaded since the Java virtual machine has started execution
# TYPE jvm_classes_unloaded_classes_total counter
jvm_classes_unloaded_classes_total 1277.0
# HELP tomcat_sessions_created_sessions_total  
# TYPE tomcat_sessions_created_sessions_total counter
tomcat_sessions_created_sessions_total 0.0
# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0
jvm_buffer_total_capacity_bytes{id="direct",} 71359.0
# HELP jvm_gc_max_data_size_bytes Max size of long-lived heap memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 2.147483648E9
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
# TYPE jvm_memory_max_bytes gauge
jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 1.22028032E8
jvm_memory_max_bytes{area="heap",id="G1 Survivor Space",} -1.0
jvm_memory_max_bytes{area="heap",id="G1 Old Gen",} 2.147483648E9
jvm_memory_max_bytes{area="nonheap",id="Metaspace",} 4.194304E8
jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 7598080.0
jvm_memory_max_bytes{area="heap",id="G1 Eden Space",} -1.0
jvm_memory_max_bytes{area="nonheap",id="Compressed Class Space",} 4.11041792E8
jvm_memory_max_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 1.22032128E8
# HELP tomcat_sessions_active_current_sessions  
# TYPE tomcat_sessions_active_current_sessions gauge
tomcat_sessions_active_current_sessions 0.0
# HELP jvm_gc_memory_promoted_bytes_total Count of positive increases in the size of the old generation memory pool before GC to after GC
# TYPE jvm_gc_memory_promoted_bytes_total counter
jvm_gc_memory_promoted_bytes_total 1.98886144E8
# HELP jvm_gc_pause_seconds Time spent in GC pause
# TYPE jvm_gc_pause_seconds summary
jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Humongous Allocation",} 14.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="G1 Humongous Allocation",} 0.143
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Metadata GC Threshold",} 3.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Metadata GC Threshold",} 0.032
jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause",} 71.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="G1 Evacuation Pause",} 0.608
# HELP jvm_gc_pause_seconds_max Time spent in GC pause
# TYPE jvm_gc_pause_seconds_max gauge
jvm_gc_pause_seconds_max{action="end of minor GC",cause="G1 Humongous Allocation",} 0.051
jvm_gc_pause_seconds_max{action="end of minor GC",cause="Metadata GC Threshold",} 0.02
jvm_gc_pause_seconds_max{action="end of minor GC",cause="G1 Evacuation Pause",} 0.033
# HELP log4j2_events_total Number of fatal level log events
# TYPE log4j2_events_total counter
log4j2_events_total{level="warn",} 0.0
log4j2_events_total{level="debug",} 0.0
log4j2_events_total{level="error",} 0.0
log4j2_events_total{level="trace",} 0.0
log4j2_events_total{level="fatal",} 0.0
log4j2_events_total{level="info",} 0.0
# HELP tomcat_sessions_active_max_sessions  
# TYPE tomcat_sessions_active_max_sessions gauge
tomcat_sessions_active_max_sessions 0.0
# HELP system_cpu_usage The "recent cpu usage" of the system the application is running in
# TYPE system_cpu_usage gauge
system_cpu_usage 0.05811693863908422
# HELP process_uptime_seconds The uptime of the Java virtual machine
# TYPE process_uptime_seconds gauge
process_uptime_seconds 100.266
# HELP process_start_time_seconds Start time of the process since unix epoch.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.69392353888E9
# HELP executor_active_threads The approximate number of threads that are actively executing tasks
# TYPE executor_active_threads gauge
executor_active_threads{name="applicationTaskExecutor",} 0.0
# HELP jvm_gc_live_data_size_bytes Size of long-lived heap memory pool after reclamation
# TYPE jvm_gc_live_data_size_bytes gauge
jvm_gc_live_data_size_bytes 3.04591352E8
# HELP system_cpu_count The number of processors available to the Java virtual machine
# TYPE system_cpu_count gauge
system_cpu_count 24.0
# HELP disk_total_bytes Total space for path
# TYPE disk_total_bytes gauge
disk_total_bytes{path="/.",} 1.081101176832E12
# HELP process_files_open_files The open file descriptor count
# TYPE process_files_open_files gauge
process_files_open_files 442.0
# HELP tomcat_sessions_expired_sessions_total  
# TYPE tomcat_sessions_expired_sessions_total counter
tomcat_sessions_expired_sessions_total 0.0
# HELP jvm_threads_live_threads The current number of live threads including both daemon and non-daemon threads
# TYPE jvm_threads_live_threads gauge
jvm_threads_live_threads 345.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads 338.0
# HELP executor_queued_tasks The approximate number of tasks that are queued for execution
# TYPE executor_queued_tasks gauge
executor_queued_tasks{name="applicationTaskExecutor",} 0.0
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 6.698432E7
jvm_memory_used_bytes{area="heap",id="G1 Survivor Space",} 2.7262976E7
jvm_memory_used_bytes{area="heap",id="G1 Old Gen",} 3.35000056E8
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 2.44581256E8
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 1832448.0
jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 6.27048448E8
jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 3.2076944E7
jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 2.687808E7
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="mapped",} 0.0
jvm_buffer_count_buffers{id="direct",} 26.0
# HELP process_files_max_files The maximum file descriptor count
# TYPE process_files_max_files gauge
process_files_max_files 1048576.0
# HELP executor_completed_tasks_total The approximate total number of tasks that have completed execution
# TYPE executor_completed_tasks_total counter
executor_completed_tasks_total{name="applicationTaskExecutor",} 3.0
# HELP disk_free_bytes Usable space for path
# TYPE disk_free_bytes gauge
disk_free_bytes{path="/.",} 5.03406247936E11
# HELP executor_pool_core_threads The core number of threads for the pool
# TYPE executor_pool_core_threads gauge
executor_pool_core_threads{name="applicationTaskExecutor",} 8.0
# HELP application_started_time_seconds Time taken to start the application
# TYPE application_started_time_seconds gauge
application_started_time_seconds{main_application_class="au.csiro.pathling.PathlingServer",} 4.726
# HELP executor_queue_remaining_tasks The number of additional elements that this queue can ideally accept without blocking
# TYPE executor_queue_remaining_tasks gauge
executor_queue_remaining_tasks{name="applicationTaskExecutor",} 2.147483647E9
# HELP tomcat_sessions_rejected_sessions_total  
# TYPE tomcat_sessions_rejected_sessions_total counter
tomcat_sessions_rejected_sessions_total 0.0
# HELP tomcat_sessions_alive_max_seconds  
# TYPE tomcat_sessions_alive_max_seconds gauge
tomcat_sessions_alive_max_seconds 0.0
# HELP application_ready_time_seconds Time taken for the application to be ready to service requests
# TYPE application_ready_time_seconds gauge
application_ready_time_seconds{main_application_class="au.csiro.pathling.PathlingServer",} 4.731
# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 345.0
# HELP jvm_gc_memory_allocated_bytes_total Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next
# TYPE jvm_gc_memory_allocated_bytes_total counter
jvm_gc_memory_allocated_bytes_total 2.1690843136E10
# HELP logback_events_total Number of log events that were enabled by the effective log level
# TYPE logback_events_total counter
logback_events_total{level="warn",} 14.0
logback_events_total{level="debug",} 0.0
logback_events_total{level="error",} 0.0
logback_events_total{level="trace",} 0.0
logback_events_total{level="info",} 6.0
# HELP jvm_buffer_memory_used_bytes An estimate of the memory that the Java virtual machine is using for this buffer pool
# TYPE jvm_buffer_memory_used_bytes gauge
jvm_buffer_memory_used_bytes{id="mapped",} 0.0
jvm_buffer_memory_used_bytes{id="direct",} 71360.0
# HELP process_cpu_usage The "recent cpu usage" for the Java Virtual Machine process
# TYPE process_cpu_usage gauge
process_cpu_usage 0.05555728292187468
# HELP executor_pool_size_threads The current number of threads in the pool
# TYPE executor_pool_size_threads gauge
executor_pool_size_threads{name="applicationTaskExecutor",} 3.0
# HELP executor_pool_max_threads The maximum allowed number of threads in the pool
# TYPE executor_pool_max_threads gauge
executor_pool_max_threads{name="applicationTaskExecutor",} 2.147483647E9
# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 0.017546886
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="UNKNOWN",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="UNKNOWN",} 0.703383726
http_server_requests_seconds_count{exception="None",method="POST",outcome="SUCCESS",status="200",uri="root",} 1.0
http_server_requests_seconds_sum{exception="None",method="POST",outcome="SUCCESS",status="200",uri="root",} 16.796287862
http_server_requests_seconds_count{exception="None",method="GET",outcome="CLIENT_ERROR",status="400",uri="UNKNOWN",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="CLIENT_ERROR",status="400",uri="UNKNOWN",} 0.133944759
http_server_requests_seconds_count{exception="None",method="GET",outcome="CLIENT_ERROR",status="404",uri="/**",} 2.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="CLIENT_ERROR",status="404",uri="/**",} 0.025861885
# HELP http_server_requests_seconds_max Duration of HTTP server request handling
# TYPE http_server_requests_seconds_max gauge
http_server_requests_seconds_max{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/actuator/prometheus",} 0.017546886
http_server_requests_seconds_max{exception="None",method="GET",outcome="SUCCESS",status="200",uri="UNKNOWN",} 0.703383726
http_server_requests_seconds_max{exception="None",method="POST",outcome="SUCCESS",status="200",uri="root",} 16.796287862
http_server_requests_seconds_max{exception="None",method="GET",outcome="CLIENT_ERROR",status="400",uri="UNKNOWN",} 0.133944759
http_server_requests_seconds_max{exception="None",method="GET",outcome="CLIENT_ERROR",status="404",uri="/**",} 0.023286038
# HELP jvm_gc_overhead_percent An approximation of the percent of CPU time used by GC activities over the last lookback period or since monitoring began, whichever is shorter, in the range [0..1]
# TYPE jvm_gc_overhead_percent gauge
jvm_gc_overhead_percent 0.007988737420196745
# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 6.7043328E7
jvm_memory_committed_bytes{area="heap",id="G1 Survivor Space",} 2.7262976E7
jvm_memory_committed_bytes{area="heap",id="G1 Old Gen",} 4.39353344E8
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",} 2.60399104E8
jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 3801088.0
jvm_memory_committed_bytes{area="heap",id="G1 Eden Space",} 6.9206016E8
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",} 3.8428672E7
jvm_memory_committed_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 2.6935296E7
# HELP jvm_threads_states_threads The current number of threads
# TYPE jvm_threads_states_threads gauge
jvm_threads_states_threads{state="runnable",} 21.0
jvm_threads_states_threads{state="blocked",} 0.0
jvm_threads_states_threads{state="waiting",} 59.0
jvm_threads_states_threads{state="timed-waiting",} 265.0
jvm_threads_states_threads{state="new",} 0.0
jvm_threads_states_threads{state="terminated",} 0.0
# HELP system_load_average_1m The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time
# TYPE system_load_average_1m gauge
system_load_average_1m 1.02
# HELP jvm_memory_usage_after_gc_percent The percentage of long-lived heap pool used after the last GC event, in the range [0..1]
# TYPE jvm_memory_usage_after_gc_percent gauge
jvm_memory_usage_after_gc_percent{area="heap",pool="long-lived",} 0.14183640107512474

@johngrimes (Member)

Thanks for this!

I think we should go ahead and change the default configuration as you suggest, to expose the health endpoints on 8080 and the other actuator information on 8081.

@johngrimes (Member) left a comment

I think this looks good, and we can roll it into the next release.

Thanks @chgl!

@johngrimes johngrimes merged commit de45dd1 into aehrc:main Dec 1, 2023
1 check passed
@chgl chgl deleted the added-spring-boot-actuator branch December 1, 2023 08:44
@johngrimes (Member)

@chgl Just letting you know that this change is now in v6.4.0 - check it out and let us know what you think.

@chgl (Collaborator, Author) commented Dec 6, 2023

@johngrimes, awesome, thanks! Works great! I've already updated my Helm chart to use the new endpoints for liveness/readiness and metrics.
