reliability: consolidate monitoring, migrate JDBC, fix JPA, swap Kafka #603

Merged
yasithdev merged 5 commits into master from worktree-reliability+server-hardening
Mar 30, 2026

yasithdev commented Mar 30, 2026

Summary

Server reliability improvements: consolidate redundant infrastructure, harden startup and shutdown, and ensure the server connects cleanly to all of its dependencies.

Monitoring consolidation

  • Removed 4 redundant per-service MonitoringServer instances (ports 9093-9096)
  • All metrics share CollectorRegistry.defaultRegistry — single endpoint on :9097 serves everything
  • Removed 12 unused monitoring properties and Ansible config
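The consolidation works because every component in the JVM writes to the same registry, so one HTTP endpoint can expose all of it. A minimal plain-Java sketch of that idea follows. `SharedMetrics` and the counter names are hypothetical stand-ins, not the Prometheus client API or the project's `MonitoringServer`.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a process-wide metrics registry; not the Prometheus client API.
final class SharedMetrics {
    private static final Map<String, LongAdder> COUNTERS = new ConcurrentSkipListMap<>();

    private SharedMetrics() {}

    // Any component in the JVM increments counters in the same registry.
    static void increment(String name) {
        COUNTERS.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    // Render every registered counter as "name value" lines, sorted by name.
    static String render() {
        StringBuilder out = new StringBuilder();
        COUNTERS.forEach((name, value) ->
                out.append(name).append(' ').append(value.sum()).append('\n'));
        return out.toString();
    }

    // One HTTP endpoint exposes everything; port 0 asks the OS for a free port.
    static HttpServer serve(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Because the registry is shared, a second server on another port would expose exactly the same data, which is why the per-service instances were redundant.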

JDBC config migration

  • Created SpringSettingsBridge to inject spring.datasource.* into ApplicationSettings overrides
  • Legacy code reading airavata.jdbc.* transparently receives Spring-managed values
  • application.yml is now the single source of truth for DB credentials
  • Removed duplicate airavata.jdbc.* from all properties files and Ansible configs
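The bridge idea can be sketched as a key translation from Spring's datasource properties to legacy keys. This is an illustration under assumptions: the class name `SettingsBridgeSketch` and the specific `airavata.jdbc.*` key names below are guesses, and the real `SpringSettingsBridge` and `ApplicationSettings` APIs may look different.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the real SpringSettingsBridge and ApplicationSettings APIs may differ.
final class SettingsBridgeSketch {
    // Assumed key pairs (Spring key -> legacy key). The exact airavata.jdbc.* names are guesses.
    private static final Map<String, String> KEY_MAP = Map.of(
            "spring.datasource.url", "airavata.jdbc.url",
            "spring.datasource.username", "airavata.jdbc.user",
            "spring.datasource.password", "airavata.jdbc.password",
            "spring.datasource.driver-class-name", "airavata.jdbc.driver");

    private SettingsBridgeSketch() {}

    // Copy each Spring-managed value that is present to its legacy key.
    static Map<String, String> toLegacyOverrides(Map<String, String> springProps) {
        Map<String, String> overrides = new LinkedHashMap<>();
        KEY_MAP.forEach((springKey, legacyKey) -> {
            String value = springProps.get(springKey);
            if (value != null) {
                overrides.put(legacyKey, value);
            }
        });
        return overrides;
    }
}
```

Legacy callers keep reading their old keys, while the values originate from a single place.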

Spring Boot auto-configured JPA

  • Removed manual LocalContainerEntityManagerFactoryBean from AiravataServerMain
  • Using @EntityScan + PhysicalNamingStrategyStandardImpl in application.yml
  • Resolves Hibernate 6.6 DuplicateMappingException on shared @Id/@JoinColumn columns
  • @DependsOn("springSettingsBridge") on AiravataServerHandler ensures the JDBC config is available during construction
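As background on the naming-strategy setting: Spring Boot's default physical naming strategy rewrites identifiers, whereas PhysicalNamingStrategyStandardImpl uses them as declared. The sketch below approximates the two behaviours in plain Java. `NamingSketch` is a hypothetical helper that calls neither Hibernate nor Spring, and the PR itself does not state why the strategy was chosen.

```java
import java.util.Locale;

// Approximates two physical naming behaviours in plain Java; does not call Hibernate or Spring.
final class NamingSketch {
    private NamingSketch() {}

    // Standard behaviour: the identifier is used exactly as declared in the mapping.
    static String standard(String name) {
        return name;
    }

    // Approximation of the Spring Boot default: dots become underscores, an underscore is
    // inserted before an uppercase letter that sits between two lowercase letters, and the
    // result is lowercased.
    static String camelCaseToUnderscores(String name) {
        StringBuilder b = new StringBuilder(name.replace('.', '_'));
        for (int i = 1; i < b.length() - 1; i++) {
            if (Character.isLowerCase(b.charAt(i - 1))
                    && Character.isUpperCase(b.charAt(i))
                    && Character.isLowerCase(b.charAt(i + 1))) {
                b.insert(i++, '_');
            }
        }
        return b.toString().toLowerCase(Locale.ROOT);
    }
}
```

With the standard strategy, a column declared as `EXPERIMENT_ID` (an illustrative name) stays exactly that, so mappings written for the manual factory bean resolve to the same physical names.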

Kafka image

  • Replaced unmaintained wurstmeister/kafka (last updated 2021) with official apache/kafka:3.9.0
  • KRaft mode (no ZooKeeper dependency for Kafka; ZK retained for Helix)

Reliability hardening

  • Compose healthchecks on all 5 services (MariaDB, RabbitMQ, ZooKeeper, Kafka, Keycloak)
  • Graceful shutdown (server.shutdown=graceful, 30s timeout) — gRPC, Tomcat, Kafka producer, JPA all close properly
  • HikariCP resilience (initialization-fail-timeout=-1 skips the initial connection attempt, so the pool starts even if the DB is slow to come up)
  • Infrastructure health indicator: /actuator/health now reports real RabbitMQ/Kafka/ZK connectivity status
  • setup.sh waits for compose healthchecks instead of ad-hoc polling
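To illustrate what "real connectivity status" can mean, here is a minimal plain-Java sketch that probes each dependency with a TCP connect and aggregates the results. `InfraHealthSketch` is a hypothetical stand-in. The actual InfrastructureHealthIndicator presumably plugs into Spring Boot Actuator and may probe at the protocol level rather than the socket level.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical plain-Java stand-in; the real InfrastructureHealthIndicator may probe differently.
final class InfraHealthSketch {
    private InfraHealthSketch() {}

    // A dependency counts as reachable if a TCP connection opens within the timeout.
    static boolean reachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Probe each named endpoint; the map values are "UP" or "DOWN".
    static Map<String, String> check(Map<String, InetSocketAddress> endpoints, int timeoutMillis) {
        Map<String, String> details = new LinkedHashMap<>();
        endpoints.forEach((name, addr) -> details.put(
                name,
                reachable(addr.getHostString(), addr.getPort(), timeoutMillis) ? "UP" : "DOWN"));
        return details;
    }

    // Overall status is UP only when every dependency is UP.
    static String overall(Map<String, String> details) {
        return details.values().stream().allMatch("UP"::equals) ? "UP" : "DOWN";
    }
}
```

Reporting per-dependency details alongside the aggregate makes it clear which service is down, not just that something is.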

Lifecycle test results

| Test | Result |
| --- | --- |
| Cold start (from scratch) | UP in 10s, all health checks green |
| Graceful shutdown (SIGTERM) | Clean exit in 3s, all 4 ports released |
| Warm restart (infra running) | UP in 10s, reconnects to all services |
| Final shutdown | Clean exit in 2s, all ports released |
| Actuator health | Reports db, infrastructure (rabbitmq/kafka/zk), diskSpace, ssl |
| Redundant ports 9093-9096 | All closed (consolidated to 9097) |

Test plan

  • mvn spotless:check passes
  • mvn test -T4 — 179 tests, 0 failures
  • Full cold start → shutdown → warm start lifecycle verified
  • All compose services reach healthy status with healthchecks
  • /actuator/health shows real infrastructure status
  • Graceful shutdown releases all ports cleanly

yasithdev force-pushed the worktree-reliability+server-hardening branch from 80e235c to f133a03 on March 30, 2026 16:13
…itoringServer on :9097

Remove 4 redundant MonitoringServer instances (ports 9093-9096) from GlobalParticipant,
PreWorkflowManager, PostWorkflowManager, and ParserWorkflowManager. All metrics already
share CollectorRegistry.defaultRegistry, so a single endpoint on :9097 serves everything.
Remove 12 monitoring properties from airavata-server.properties and Ansible configs.
…plicate properties

Create SpringSettingsBridge that injects spring.datasource.* values into
ApplicationSettings overrides so legacy code reading airavata.jdbc.* transparently
receives Spring-managed datasource values. application.yml is now the single source
of truth for database credentials. Remove airavata.jdbc.* from all properties files
and Ansible configs.
…tityManagerFactory

Replace the manual LocalContainerEntityManagerFactoryBean with Spring Boot auto-config
using @EntityScan and PhysicalNamingStrategyStandardImpl (set via application.yml).
This resolves the Hibernate 6.6 DuplicateMappingException on shared @Id/@JoinColumn
columns. Add @DependsOn on AiravataServerHandler to ensure SpringSettingsBridge runs first.

Switch to the official Apache Kafka image using KRaft mode (no ZooKeeper dependency
for Kafka; ZooKeeper is retained for Helix). Also apply spotless formatting fixes.
yasithdev force-pushed the worktree-reliability+server-hardening branch from f133a03 to 6517e8a on March 30, 2026 17:43
…ence, infra health indicator

- Add healthchecks to all 5 compose services (MariaDB, RabbitMQ, ZooKeeper, Kafka, Keycloak)
- Configure Spring Boot graceful shutdown (server.shutdown=graceful, 30s timeout)
- Add HikariCP resilience (initialization-fail-timeout=-1 allows lazy pool init)
- Add InfrastructureHealthIndicator: /actuator/health now reports real RabbitMQ/Kafka/ZK status
- Show health details in actuator (management.endpoint.health.show-details=always)
- Update setup.sh to wait for compose healthchecks instead of ad-hoc polling

Verified: cold start (10s), graceful shutdown (3s, all ports released), warm restart (10s).
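The graceful-shutdown behaviour in this commit follows a bounded-drain pattern: stop accepting new work, let in-flight work finish within a grace period, then force the rest. A minimal plain-Java sketch of that pattern is below. `GracefulStopSketch` is hypothetical and uses an `ExecutorService` as a stand-in for the real server components.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Plain-Java illustration of a bounded drain; not the project's shutdown code.
final class GracefulStopSketch {
    private GracefulStopSketch() {}

    // Stop accepting new work, wait up to the timeout for in-flight tasks, then force-cancel.
    // Returns true when everything finished within the grace period.
    static boolean stop(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            pool.shutdownNow();
            return false;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

A call such as `GracefulStopSketch.stop(pool, 30, TimeUnit.SECONDS)` mirrors the 30s timeout configured here: a fast drain returns early, and only a stuck task hits the limit.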
yasithdev merged commit dbb1ec5 into master on Mar 30, 2026
8 checks passed