GitHub - aashiruu/reliability-platform: Enterprise-grade SRE platform featuring GitOps (ArgoCD), self-healing EKS, and SLO-based observability.

Reliability-First Cloud Platform (AWS EKS + GitOps + Observability)

An end-to-end, production-ready cloud infrastructure designed for high availability and deep observability. This platform leverages Infrastructure as Code (IaC) to provision a resilient environment on AWS, featuring a self-healing microservice architecture.

System Architecture

The full architecture includes a Multi-AZ VPC, managed RDS/Redis, and an EKS cluster with optimized pod density.

Cloud Provider: AWS (VPC, EKS, RDS, ElastiCache, IAM)

Provisioning: Terraform (IaC)

Orchestration: Kubernetes (Amazon EKS)

Deployment: ArgoCD (GitOps Workflow)

Application: FastAPI (Python 3.11)

Observability: Prometheus Operator & Grafana

Key Features

GitOps CD Workflow

Using ArgoCD, the platform follows a strict GitOps model. Any change pushed to the /k8s directory in GitHub is automatically synchronized to the EKS cluster, so the live cluster state continually converges on what is declared in Git and configuration drift is corrected rather than accumulated.

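To make the idea of drift concrete, here is a minimal conceptual sketch in plain Python. It compares a desired manifest (as declared in Git) with the live object and reports the fields that differ. The function name `find_drift` and the sample manifests are hypothetical, and ArgoCD's real diffing is considerably more involved (it normalizes manifests and can ignore server-defaulted fields), so treat this as an illustration of the principle only.

```python
def find_drift(desired, live):
    """Return {dotted_path: (desired_value, live_value)} for each differing field.

    Both arguments are nested dicts standing in for parsed Kubernetes manifests.
    """
    drift = {}
    for key in desired.keys() | live.keys():
        want, have = desired.get(key), live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            # Recurse into nested sections and prefix their paths with this key.
            for sub_path, pair in find_drift(want, have).items():
                drift[f"{key}.{sub_path}"] = pair
        elif want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: someone scaled the Deployment by hand.
desired = {"spec": {"replicas": 3, "image": "api:1.2"}}
live = {"spec": {"replicas": 2, "image": "api:1.2"}}
drift = find_drift(desired, live)
```

In a GitOps workflow, a non-empty result like this is what marks an application as out of sync; an automated sync policy would then reapply the declared state.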
Self-Healing & Resiliency

The application is configured with replicas: 3 and native Kubernetes liveness probes. During testing, manually terminated pods were replaced automatically by the Kubernetes ReplicaSet controller while the remaining replicas kept serving traffic, with no downtime observed.

Deep Observability

The FastAPI application is instrumented to export real-time metrics. A custom Grafana dashboard tracks:

Traffic: Total requests and Requests Per Second (RPS).

Latency: Average and P99 response times.

Errors: Response rates by HTTP status class (2xx/4xx/5xx), with 4xx and 5xx tracked as errors.

Saturation: CPU and Memory utilization per pod.
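The traffic, latency, and error signals above can be illustrated with a small, self-contained calculation. This sketch assumes a hypothetical list of `(status_code, latency_seconds)` samples collected over a fixed window; it is not the project's actual instrumentation or dashboard queries. It uses a nearest-rank P99 on raw samples, whereas Prometheus typically estimates quantiles from histogram buckets.

```python
from collections import Counter


def summarize(samples, window_seconds):
    """Summarize request samples given as (status_code, latency_seconds) tuples."""
    if not samples:
        raise ValueError("need at least one sample")
    total = len(samples)
    latencies = sorted(latency for _, latency in samples)
    # Nearest-rank P99: smallest value with at least 99% of samples at or below it.
    rank = (99 * total + 99) // 100  # integer ceil(0.99 * total)
    classes = Counter(f"{status // 100}xx" for status, _ in samples)
    return {
        "total_requests": total,
        "rps": total / window_seconds,
        "avg_latency": sum(latencies) / total,
        "p99_latency": latencies[rank - 1],
        "status_share": {cls: count / total for cls, count in classes.items()},
    }


# Hypothetical window: 98 fast successes, one 404, one slow 500, over 10 seconds.
samples = [(200, 0.01)] * 98 + [(404, 0.02), (500, 1.0)]
summary = summarize(samples, window_seconds=10)
```

Note how the single slow request barely moves the average but is exactly what a tail percentile is meant to surface, which is why the dashboard tracks both.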

Technical Challenges & Solutions

The "Pod Density" hurdle

Problem: On t3.small nodes, the default ENI limit capped pod capacity at 11 per node, preventing the heavy Prometheus stack from scheduling.
Solution: Implemented VPC Prefix Delegation and scaled the node group to 4 managed nodes to provide enough IP headroom for the monitoring stack and application.
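The 11-pod ceiling follows from the VPC CNI's address arithmetic. The sketch below reproduces it, assuming AWS's published limits for t3.small (3 ENIs, 4 IPv4 addresses per ENI) and the commonly used max-pods formula; the function name `max_pods` is illustrative. With prefix delegation, each secondary address slot holds a /28 prefix (16 addresses) instead of a single IP.

```python
def max_pods(enis, ipv4_per_eni, prefix_delegation=False):
    """Estimate the pod IP capacity of one node under the AWS VPC CNI."""
    # One address on each ENI is its primary IP and is not handed to pods.
    slots = enis * (ipv4_per_eni - 1)
    if prefix_delegation:
        # Each slot is assigned a /28 prefix, i.e. 16 addresses.
        slots *= 16
    # +2 accounts for pods on the host network (e.g. aws-node, kube-proxy).
    return slots + 2


# t3.small values assumed from AWS's per-instance ENI limits.
default_limit = max_pods(enis=3, ipv4_per_eni=4)
delegated_limit = max_pods(enis=3, ipv4_per_eni=4, prefix_delegation=True)
```

The delegated figure is an IP-address ceiling only. In practice the kubelet's max-pods setting and the node's CPU and memory determine how many pods actually fit, which is why adding nodes was still needed alongside prefix delegation.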

Persistent Storage for Metrics

Problem: Prometheus requires persistent storage to keep metric history across restarts.
Solution: Configured the AWS EBS CSI Driver with an IAM role bound to its service account through the cluster's IAM OpenID Connect (OIDC) provider, allowing EBS volumes to be provisioned dynamically to satisfy PersistentVolumeClaims (PVCs).

Proof of Life

API Response

Confirmed connection between FastAPI, RDS Postgres, and ElastiCache Redis.

ArgoCD Synchronization

Grafana Metrics

How to Run (Local)

Infrastructure:

cd terraform && terraform init && terraform apply

Access EKS:

aws eks update-kubeconfig --name reliability-cluster

Metrics:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack
