diff --git a/README.md b/README.md index a5e4a8c..c2e6940 100644 --- a/README.md +++ b/README.md @@ -1,122 +1,38 @@ # Spring Boot Security & Observability Lab -This repository is a hands-on lab designed to demonstrate the architectural evolution of a modern Java application. We will build a system from the ground up, starting with a secure monolith and progressively refactoring it into a fully observable, distributed system using cloud-native best practices. +This repository is an advanced, hands-on lab demonstrating the architectural evolution of a modern Java application. We will build a system from the ground up, starting with a secure monolith and progressively refactoring it into a fully observable, distributed system using cloud-native best practices. --- -## Lab Progress: Phase 6 - Proactive Alerting with Alertmanager +## Workshop Guide: The Evolutionary Phases -The `main` branch currently represents the completed state of **Phase 6**. +This lab is structured in distinct, self-contained phases. The `main` branch always represents the latest completed phase. To explore a previous phase's code and detailed documentation, use the links below. -* **Git Tag for this Phase:** `v6.0-proactive-alerting` - -### Objective - -The goal of this phase was to transition our monitoring strategy from passive (dashboards) to **proactive**. We have integrated the Prometheus Alertmanager into our stack to create a system that can automatically detect and route notifications about problems, without requiring a human to be watching a screen. This demonstrates the completion of a production-grade monitoring feedback loop. - -### Key Concepts Demonstrated - -* **Prometheus Alerting Pipeline:** Understanding the distinct roles of Prometheus (which evaluates rules and generates alerts) and Alertmanager (which receives, de-duplicates, groups, and routes alerts). 
-* **Declarative Alerting Rules:** Defining alerting conditions as code using PromQL expressions in a version-controlled YAML file. -* **Alerting on Technical & Security Metrics:** Creating two distinct types of alerts: - 1. A **technical alert** (`ApiServerErrorRateHigh`) that fires on infrastructure-level signals like a spike in 5xx server errors. - 2. A **security alert** (`UnauthorizedAdminAccessSpike`) that fires on application-level signals, such as an abnormal rate of `4xx` errors on a privileged endpoint. -* **Alert Lifecycle:** Observing the full lifecycle of an alert: `Inactive` -> `Pending` -> `Firing` -> `Resolved`. -* **UI-Driven Test Harness:** Building a dedicated "Alerting Test Panel" in our web application to reliably trigger alert conditions on demand, proving the entire pipeline works end-to-end. - -### Architecture Overview - -Phase 6 introduces Alertmanager and connects it to our existing Prometheus instance. The data flow for alerting is now a core part of our observability stack. - -```mermaid -graph TD - subgraph "Application Services" - RS[Resource Server] - WC[Web Client] - end - - subgraph "Observability Stack" - Prom[Prometheus] -->|1. Scrapes Metrics| RS - Prom -->|1. Scrapes Metrics| WC - - subgraph "Alerting Pipeline" - Rules[alerts.yml] -->|2. Evaluates| Prom - Prom -->|3. Sends Firing Alerts| AM[Alertmanager] - end - - G[Grafana] - end - - subgraph "Operators / External Systems" - AM -->|4. Routes Notifications| Notif[Email, Slack, etc.] - Ops[Operator] -->|Views & Manages Alerts| AM - Ops -->|Views Dashboards| G - end -``` - -1. **[Prometheus](config/prometheus/prometheus.yml):** Its role is expanded. It is now configured to load a [rule file](config/prometheus/alerts.yml) and to send any alerts that become "Firing" to the Alertmanager service. The `--web.external-url` flag is set to ensure backlinks are generated with a browser-resolvable hostname. -2. 
**[Alertmanager](config/alertmanager/alertmanager.yml):** The new central hub for all alerts. It receives alerts from Prometheus, groups them to reduce noise, and would (in a production setup) route them to configured receivers. For this lab, we use a "null" receiver. +| Phase | Description & Key Concepts | Code & Docs (at tag) | Key Pull Requests | +|:-----------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| **1. The Secure Monolith** | A standalone service that issues and validates its own JWTs. Concepts: `AuthenticationManager`, custom `JwtAuthenticationFilter`, `jjwt` library, and a foundational CI pipeline. | [`v1.0-secure-monolith`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v1.0-secure-monolith) | [#2](https://github.com/apenlor/spring-boot-security-observability-lab/pull/2), [#3](https://github.com/apenlor/spring-boot-security-observability-lab/pull/3), [#4](https://github.com/apenlor/spring-boot-security-observability-lab/pull/4) | +| **2. Observing the Monolith** | The service is containerized and orchestrated via `docker-compose`. Concepts: Micrometer, Prometheus, Grafana, custom metrics, and automated dashboard provisioning. 
| [`v2.0-observable-monolith`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v2.0-observable-monolith) | [#6](https://github.com/apenlor/spring-boot-security-observability-lab/pull/6) | +| **3. Evolving to Federated Identity** | The system is refactored into a multi-service architecture with an external IdP. Concepts: Keycloak, OIDC, OAuth2 Client (`web-client`) vs. Resource Server, Traefik reverse proxy, service-to-service security. | [`v3.0-federated-identity`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v3.0-federated-identity) | [#8](https://github.com/apenlor/spring-boot-security-observability-lab/pull/8) | +| **4. Tracing a Distributed System** | Services are instrumented with the OpenTelemetry agent to generate traces. Concepts: Tempo, agent-based instrumentation, W3C Trace Context, Service Graphs, and a hybrid PUSH/PULL metrics architecture. | [`v4.0-distributed-tracing`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v4.0-distributed-tracing) | [#10](https://github.com/apenlor/spring-boot-security-observability-lab/pull/10) | +| **5. Correlated Logs & Access Auditing** | The three pillars of observability are complete (metrics, traces, logs). Alloy is the unified collection agent. Concepts: Loki, Grafana Alloy, Docker service discovery, structured JSON logs, AOP-based auditing, trace-to-log correlation, and detailed audit metrics. | [`v5.0-correlated-logs-auditing`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v5.0-correlated-logs-auditing) | [#12](https://github.com/apenlor/spring-boot-security-observability-lab/pull/12) | +| **6. Proactive Alerting** | The system transitions from passive to proactive monitoring. Concepts: Alertmanager, declarative PromQL alert rules, alerting on technical vs. security metrics, and a UI-driven test harness. 
| [`v6.0-proactive-alerting`](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v6.0-proactive-alerting) | [#14](https://github.com/apenlor/spring-boot-security-observability-lab/pull/14) | +| **7. Continuous Security Integration** | _Upcoming..._ | - | - | +| **8. Advanced Secret Management** | _Upcoming..._ | - | - | --- -### Key Configuration Details +## How to Follow This Lab -#### 1. Prometheus Alert Rules - -The core of this phase is the [alerts.yml](config/prometheus/alerts.yml) file. We have defined two rules that are specifically tailored for our application and optimized for a lab environment with short `for` durations for rapid testing. - -* **`ApiServerErrorRateHigh`:** This rule fires when the rate of `5xx` status codes from the `resource-server` exceeds 0 for a continuous period. It is designed to be triggered by our `ChaosController`. -* **`UnauthorizedAdminAccessSpike`:** This security-focused rule fires when the rate of `4xx` status codes on the specific `/api/secure/admin` endpoint exceeds 0. This is more robust than checking for just `403` as it captures any client-side error on this privileged endpoint, signaling a potential issue. - -#### 2. UI-Driven Test Harness - -To validate the entire alerting pipeline, we implemented a dedicated "Alerting Test Panel" in the `web-client`. -* The `ChaosController` in the `resource-server` was enhanced with a guaranteed-failure endpoint (`/api/chaos/error`). -* The `WebController` in the `web-client` was updated with two new `POST` endpoints that call the backend to generate `5xx` and `4xx` errors. - ---- - -## Local Development & Quick Start - -The prerequisites and setup are the same as in previous phases. - -1. **Configure Local Hostnames (One-Time Setup, if not already done):** - Edit your local `hosts` file to add: - ``` - 127.0.0.1 keycloak.local - ``` -2. 
**Create and Configure Your Environment File:** - ```bash - cp .env.example .env - # ...then edit .env to add your WEB_CLIENT_SECRET from Keycloak. - ``` -3. **Build and run the entire stack:** - ```bash - docker-compose up --build -d - ``` -4. **Access the Services:** - * **Web Client Application:** [http://localhost:8082](http://localhost:8082) (Login with `lab-user`/`lab-user` or `lab-admin`/`lab-admin`) - * **Keycloak Admin Console:** [http://keycloak.local](http://keycloak.local) (Login with `admin`/`admin`) - * **Prometheus UI:** [http://localhost:9090](http://localhost:9090) - * **Alertmanager UI:** [http://localhost:9093](http://localhost:9093) - * **Grafana UI:** [http://localhost:3000](http://localhost:3000) +1. **Start with the `main` branch** to see the latest state of the project. +2. To go back in time, use the **"Code & Docs" link** for a specific phase. This will show you the `README.md` for that phase, which contains the specific instructions and examples for that version of the code. +3. To understand the *"why"* behind the changes, review the **Key Pull Requests** for each phase. --- -## Validating the New Alerting Features - -1. **Confirm Rules are Loaded:** - * Navigate to the Prometheus UI's "Alerts" tab ([http://localhost:9090/alerts](http://localhost:9090/alerts)). - * Verify that both new alerts are present and in the green "Inactive" state. +## Running the Project -2. **Trigger the Alerts via the UI:** - * Log in to the Web Client as **`lab-user` / `lab-user`**. - * In the "Alerting Test Panel", repeatedly click the buttons to generate `403` and `5xx` errors. - * Watch the Prometheus Alerts UI. The alerts will transition from `Inactive` to `Pending` (yellow) and then to `Firing` (red). - * Once firing, the alerts will appear in the Alertmanager UI. +To run the application and see usage examples for the **current phase**, please refer to the detailed instructions in its tagged `README.md` file. 
-#### Stop the Environment +**[>> Go to instructions for the current phase: `v6.0-proactive-alerting` <<](https://github.com/apenlor/spring-boot-security-observability-lab/tree/v6.0-proactive-alerting?tab=readme-ov-file#local-development--quick-start)** -```bash -docker-compose down -v -``` \ No newline at end of file +As the lab progresses, this link will always be updated to point to the latest completed phase. \ No newline at end of file
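
Note on the added "How to Follow This Lab" steps: going back to a previous phase can also be done from a local clone rather than through the "Code & Docs" links. The following is a minimal sketch, assuming `git` is installed and that the tags named in the phase table exist in the clone. `checkout_phase` is a hypothetical helper written for illustration; it is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository):
# switch a local clone to the commit tagged for a given phase.
checkout_phase() {
  local repo_dir="$1" tag="$2"

  # Fail early with a clear message if the tag is not present in this clone.
  if ! git -C "$repo_dir" rev-parse -q --verify "refs/tags/$tag" >/dev/null; then
    echo "unknown phase tag: $tag" >&2
    return 1
  fi

  # Check out the tagged commit in detached-HEAD state, leaving branches untouched.
  git -C "$repo_dir" checkout --quiet --detach "$tag"
}
```

After cloning `https://github.com/apenlor/spring-boot-security-observability-lab`, a call such as `checkout_phase spring-boot-security-observability-lab v6.0-proactive-alerting` would place the working tree at the Phase 6 state, where that phase's `README.md` holds the run instructions. Running `git checkout main` inside the clone returns to the latest phase.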