An end-to-end, production-ready cloud infrastructure designed for high availability and deep observability. This platform leverages Infrastructure as Code (IaC) to provision a resilient environment on AWS, featuring a self-healing microservice architecture.
The full architecture includes a Multi-AZ VPC, managed RDS (Postgres) and ElastiCache (Redis), and an EKS cluster tuned for higher pod density.
Cloud Provider: AWS (VPC, EKS, RDS, ElastiCache, IAM)
Provisioning: Terraform (IaC)
Orchestration: Kubernetes (Amazon EKS)
Deployment: ArgoCD (GitOps Workflow)
Application: FastAPI (Python 3.11)
Observability: Prometheus Operator & Grafana
GitOps CD Workflow
Using ArgoCD, the platform follows a strict GitOps model. Any change pushed to the /k8s directory of the GitHub repository is automatically synchronized to the EKS cluster, so the cluster state stays aligned with what is committed in Git.
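As a rough illustration of the sync model described above, the sketch below builds an Argo CD Application manifest as a Python dictionary and prints it as JSON, which kubectl accepts. The application name, namespaces, and repository URL are placeholders rather than values from this project; only the k8s path comes from the description above.

```python
import json


def build_argocd_application(repo_url: str, path: str = "k8s") -> dict:
    """Return an Argo CD Application manifest with automated sync enabled."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Hypothetical name; "argocd" is the namespace Argo CD is commonly installed in.
        "metadata": {"name": "reliability-app", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": "HEAD",
                "path": path,
            },
            "destination": {
                # In-cluster API server address used when Argo CD deploys to its own cluster.
                "server": "https://kubernetes.default.svc",
                "namespace": "default",  # assumed target namespace
            },
            # prune removes resources deleted from Git; selfHeal reverts manual changes.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    # Placeholder repository URL, not the project's real one.
    manifest = build_argocd_application("https://github.com/example-org/example-repo.git")
    print(json.dumps(manifest, indent=2))
```

With automated sync, selfHeal is the setting that reverts manual changes made directly in the cluster, which is what keeps drift in check.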
Self-Healing & Resiliency
The application runs with replicas: 3 and Kubernetes liveness probes. During testing, manually terminated pods were replaced by the Kubernetes ReplicaSet controller while the remaining replicas kept serving traffic, so no downtime was observed.
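A minimal sketch of what such a Deployment could look like, expressed as a Python dictionary that can be dumped to JSON. The names, port 8000, and the /healthz probe path are assumptions for illustration; the project's actual manifests may differ.

```python
import json


def build_deployment(image: str, replicas: int = 3) -> dict:
    """Return an apps/v1 Deployment with an HTTP liveness probe."""
    labels = {"app": "fastapi-app"}  # hypothetical label
    container = {
        "name": "api",
        "image": image,
        "ports": [{"containerPort": 8000}],  # assumed application port
        "livenessProbe": {
            # Assumed health endpoint; the kubelet restarts the container
            # after failureThreshold consecutive failed checks.
            "httpGet": {"path": "/healthz", "port": 8000},
            "initialDelaySeconds": 5,
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "fastapi-app", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    # Placeholder image reference.
    print(json.dumps(build_deployment("example.invalid/fastapi-app:latest"), indent=2))
```

The two mechanisms are complementary: the liveness probe lets the kubelet restart a container that has stopped responding, while the ReplicaSet controller creates a new pod whenever the running count drops below the requested replicas.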
Deep Observability
The FastAPI application is instrumented to export real-time metrics. A custom Grafana dashboard tracks:
Traffic: Total requests and Requests Per Second (RPS).
Latency: Average and P99 response times.
Errors: HTTP response rates by status class (2xx/4xx/5xx), with 4xx/5xx as the error rate.
Saturation: CPU and Memory utilization per pod.
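To show why the dashboard tracks both average and P99 latency, here is a small standalone sketch using made-up sample data. It uses the nearest-rank percentile method on raw samples; this is an illustration only, since Prometheus typically estimates quantiles from histogram buckets rather than from raw values.

```python
from statistics import fmean


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in seconds: 98 fast requests and 2 slow ones.
    latencies = [0.05] * 98 + [2.0] * 2
    print("average:", fmean(latencies))
    print("p99:", percentile_nearest_rank(latencies, 99))
```

For this sample the average is about 0.089 s while the P99 is 2.0 s: a small share of slow requests barely moves the mean but dominates the tail.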
The "Pod Density" hurdle
- Problem: On t3.small nodes, the default ENI limit capped pod capacity at 11 per node, preventing the heavy Prometheus stack from scheduling.
- Solution: Implemented VPC Prefix Delegation and scaled the node group to 4 managed nodes to provide enough IP headroom for the monitoring stack and application.
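The 11-pod figure follows from the usual VPC CNI arithmetic, sketched below. The ENI and per-ENI address counts for t3.small (3 and 4) match the limit quoted above; the /28 prefix size (16 addresses) and the cap of 110 pods reflect AWS guidance for smaller instance types as I understand it, and should be treated as assumptions to check against current AWS documentation.

```python
from typing import Optional


def max_pods(
    enis: int,
    ips_per_eni: int,
    prefix_delegation: bool = False,
    cap: Optional[int] = 110,
) -> int:
    """Estimate the pod capacity of an EKS node using the VPC CNI.

    One address per ENI is reserved for the interface itself, and 2 is
    added for pods that use host networking.
    """
    slots = enis * (ips_per_eni - 1)
    if prefix_delegation:
        # Each slot holds a /28 prefix (16 addresses) instead of a single IP.
        slots *= 16
    total = slots + 2
    # Assumed recommended ceiling; pass cap=None to see the raw figure.
    return total if cap is None else min(total, cap)


if __name__ == "__main__":
    print("t3.small, default:", max_pods(3, 4))
    print("t3.small, prefix delegation:", max_pods(3, 4, prefix_delegation=True))
```

Without prefix delegation this gives 3 * (4 - 1) + 2 = 11. With it, the raw figure is 3 * (4 - 1) * 16 + 2 = 146, which the assumed cap reduces to 110.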
Persistent Storage for Metrics
- Problem: Prometheus requires persistent storage to keep metric history across restarts.
- Solution: Configured the AWS EBS CSI Driver with an IAM role bound through the cluster's OpenID Connect (OIDC) provider, so that EBS volumes are provisioned dynamically for PersistentVolumeClaims (PVCs).
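One way the storage request might be wired into the monitoring stack is through the kube-prometheus-stack Helm values. The sketch below builds such a values fragment and prints it as JSON. The storage class name gp3 and the 20Gi size are placeholder assumptions, not values taken from this project.

```python
import json


def prometheus_storage_values(storage_class: str, size: str) -> dict:
    """Return Helm values asking Prometheus to store data on a PVC."""
    return {
        "prometheus": {
            "prometheusSpec": {
                "storageSpec": {
                    "volumeClaimTemplate": {
                        "spec": {
                            "storageClassName": storage_class,
                            # EBS volumes attach to a single node at a time.
                            "accessModes": ["ReadWriteOnce"],
                            "resources": {"requests": {"storage": size}},
                        }
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical storage class and size.
    print(json.dumps(prometheus_storage_values("gp3", "20Gi"), indent=2))
```

Since JSON is a subset of YAML, the output could be saved to a file and passed to Helm with -f, assuming a StorageClass backed by the EBS CSI Driver exists under the chosen name.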
- API Response: Confirmed connectivity between FastAPI, RDS Postgres, and ElastiCache Redis.
- ArgoCD Synchronization
- Grafana Metrics
Infrastructure:
cd terraform && terraform init && terraform apply
Access EKS:
aws eks update-kubeconfig --name reliability-cluster
Metrics:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack
