diff --git a/README.fr.md b/README.fr.md index afc3e60..fe3178d 100644 --- a/README.fr.md +++ b/README.fr.md @@ -2,10 +2,7 @@ ![PyPI](https://img.shields.io/pypi/v/cleancloud) ![Python Versions](https://img.shields.io/pypi/pyversions/cleancloud) -![Docker Pulls](https://img.shields.io/docker/pulls/getcleancloud/cleancloud) ![License](https://img.shields.io/badge/License-MIT-yellow.svg) -[![Security Scanning](https://github.com/cleancloud-io/cleancloud/actions/workflows/security-scan.yml/badge.svg)](https://github.com/cleancloud-io/cleancloud/actions/workflows/security-scan.yml) -![GitHub stars](https://img.shields.io/github/stars/cleancloud-io/cleancloud?style=social) **Languages / Langues :** 🇬🇧 [English](README.md) | 🇫🇷 [Français](README.fr.md) @@ -33,7 +30,7 @@ C'est CleanCloud. Scannez vos environnements AWS, Azure et GCP, obtenez des find | Hygiène multi-comptes / multi-abonnements / multi-projets | ❌ | ✅ | ✅ | | Application planifiée et CI/CD (codes de sortie) | ❌ | ❌ | ✅ | -- **30 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe +- **31 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe. 
Les règles IA/ML (SageMaker) sont opt-in via `--category ai` - **Gouvernance et application de politique (opt-in) :** `--fail-on-confidence HIGH` ou `--fail-on-cost 100` — appliquer des seuils de gaspillage sur un planning, géré par les équipes platform ou FinOps - **Scan multi-comptes (AWS) :** scannez des AWS Organizations entières en une exécution — fichier de config, IDs inline, ou auto-découverte via `--org` - **Scan multi-abonnements (Azure) :** scannez tous les abonnements Azure en parallèle — auto-découverte via Management Group, détail des coûts par abonnement inclus @@ -63,18 +60,88 @@ Toutes les opérations sont en lecture seule. Sûr pour les comptes de productio - Gouvernance d'hygiène planifiée — job hebdomadaire qui détecte les nouveaux gaspillages et applique les seuils sur tous les comptes - Rapports pré-revue — exportez les findings en markdown avant une revue trimestrielle des coûts ou un board meeting +## Exemple de résultat détaillé + ``` -6 problèmes d'hygiène détectés : +6 problèmes détectés : + +1. [AWS] Instance RDS inactive (aucune connexion depuis 21 jours) + Risque : Élevé + Confiance : High + Ressource : aws.rds.instance → db-prod-analytics + Région : us-east-1 + Règle : aws.rds.instance.idle + Raison : Instance RDS sans connexion depuis 21 jours + Détails : + - instance_class: db.r5.large + - engine: postgres 15.4 + - estimated_monthly_cost: ~$380/mois + +2. [AWS] Volume EBS non attaché + Risque : Faible + Confiance : High + Ressource : aws.ebs.volume → vol-0a1b2c3d4e5f67890 + Région : us-east-1 + Règle : aws.ebs.volume.unattached + Raison : Volume non attaché depuis 47 jours + Détails : + - size_gb: 500 + - state: available + - tags: {"Project": "legacy-api", "Owner": "platform"} + +3. 
[AWS] NAT Gateway inactive + Risque : Moyen + Confiance : Medium + Ressource : aws.ec2.nat_gateway → nat-0abcdef1234567890 + Région : us-west-2 + Règle : aws.ec2.nat_gateway.idle + Raison : Aucun trafic détecté depuis 21 jours + Détails : + - name: staging-nat + - total_bytes_out: 0 + - estimated_monthly_cost: ~$32/mois + +4. [AWS] Load Balancer inactif (aucune cible saine) + Risque : Moyen + Confiance : High + Ressource : aws.elbv2.load_balancer → alb-staging-api + Région : us-east-1 + Règle : aws.elbv2.load_balancer.idle + Raison : Load balancer sans cible saine depuis 30 jours + Détails : + - type: application + - estimated_monthly_cost: ~$18/mois + +5. [AWS] Elastic IP non attachée + Risque : Faible + Confiance : High + Ressource : aws.ec2.elastic_ip → eipalloc-0a1b2c3d4e5f6 + Région : eu-west-1 + Règle : aws.ec2.elastic_ip.unattached + Raison : Elastic IP non associée à aucune instance ou ENI (ancienneté : 92 jours) -1. [AWS] Volume EBS non attaché — $40/mois -2. [AWS] NAT Gateway inactive — $32.40/mois -3. [AWS] Elastic IP non attachée — $0/mois -... +6. [AWS] Ancien snapshot EBS (438 jours) + Risque : Faible + Confiance : High + Ressource : aws.ebs.snapshot → snap-0a1b2c3d4e5f67890 + Région : us-west-2 + Règle : aws.ebs.snapshot.old + Raison : Snapshot âgé de 438 jours sans activité récente + Détails : + - size_gb: 200 + - estimated_monthly_cost: ~$10/mois -Gaspillage mensuel estimé : ~$147 -Régions scannées : us-east-1, us-west-2, eu-west-1 +--- Résumé du scan --- +Total findings : 6 +Par risque : faible: 3 moyen: 2 élevé: 1 +Par confiance : high: 5 medium: 1 +Gaspillage minimum estimé : ~$480/mois +(5 findings sur 6 chiffrés) +Régions scannées : us-east-1, us-west-2, eu-west-1 (auto-détectées) ``` +Pas encore de compte cloud ? `cleancloud demo` affiche un exemple de sortie sans aucun credential. 
+ ## Mentionné dans la presse - [Korben](https://korben.info/cleancloud-nettoyeur-cloud-aws-azure.html) 🇫🇷 — Grand média tech français @@ -104,6 +171,7 @@ Régions scannées : us-east-1, us-west-2, eu-west-1 | Flag | Fonction | |---|---| | `--provider aws\|azure\|gcp` | Fournisseur cloud à scanner *(obligatoire)* | +| `--category hygiene\|ai\|all` | Catégorie de règles : `hygiene` (défaut), `ai` (SageMaker, AWS uniquement) ou `all` (hygiene + IA) | | `--region REGION` | Scanner une seule région | | `--all-regions` | Toutes les régions actives — AWS/Azure uniquement | | **AWS multi-comptes** | | @@ -221,54 +289,6 @@ pip uninstall cleancloud && pipx install cleancloud && pipx ensurepath --- -## Exemple de résultat détaillé - -``` -6 problèmes d'hygiène détectés : - -1. [AWS] Volume EBS non attaché - Risque : Faible - Confiance : High - Ressource : aws.ebs.volume → vol-0a1b2c3d4e5f67890 - Région : us-east-1 - Règle : aws.ebs.volume.unattached - Raison : Volume non attaché depuis 47 jours - Détails : - - size_gb: 500 - - state: available - - tags: {"Project": "legacy-api", "Owner": "platform"} - -2. [AWS] NAT Gateway inactive - Risque : Moyen - Confiance : Medium - Ressource : aws.ec2.nat_gateway → nat-0abcdef1234567890 - Région : us-west-2 - Règle : aws.ec2.nat_gateway.idle - Raison : Aucun trafic détecté depuis 21 jours - Détails : - - name: staging-nat - - total_bytes_out: 0 - - estimated_monthly_cost_usd: 32.40 - -3. [AWS] Elastic IP non attachée - Risque : Faible - Confiance : High - Ressource : aws.ec2.elastic_ip → eipalloc-0a1b2c3d4e5f6 - Région : eu-west-1 - Règle : aws.ec2.elastic_ip.unattached - Raison : Elastic IP non associée à aucune instance ou ENI (ancienneté : 92 jours) - ---- Résumé du scan --- -Total findings : 6 -Par risque : faible: 5 moyen: 1 -Par confiance : high: 2 medium: 4 -Gaspillage minimum estimé : ~$147/mois -(4 findings sur 6 chiffrés) -Régions scannées : us-east-1, us-west-2, eu-west-1 (auto-détectées) -``` - -Pas encore de compte cloud ? 
`cleancloud demo` affiche un exemple de sortie sans aucun credential. - ### Rapport markdown partageable ```bash @@ -317,6 +337,7 @@ Pour des exemples de sortie complets incluant `doctor`, JSON, CSV et markdown : - Plateforme : instances RDS inactives (HIGH) - Observabilité : logs CloudWatch à rétention infinie - Gouvernance : ressources sans tags, security groups inutilisés +- IA/ML *(opt-in : `--category ai`)* : endpoints SageMaker inactifs avec zéro invocations depuis 14+ jours — endpoints GPU flaggés risque HIGH ($500–$23K/mois) **Azure :** - Compute : VMs arrêtées (non désallouées) (HIGH) @@ -562,7 +583,9 @@ Guide complet : [Configuration GCP →](docs/gcp.md) **Policy-as-code** — `cleancloud.yaml` avec packs de règles, exceptions par équipe, et seuils de coût en config — la principale demande de gouvernance FinOps pour 2025/2026 -**Plus de règles AWS** — lacunes de cycle de vie S3, gaspillage IA/GPU (endpoints SageMaker inactifs, instances GPU orphelines), Redshift inactif +**Plus de règles IA/ML** — clusters de calcul Azure ML inactifs, endpoints Vertex AI inactifs, instances de notebook SageMaker inutilisées, artefacts d'entraînement orphelins + +**Plus de règles AWS** — lacunes de cycle de vie S3, Redshift inactif, fuite de coût NAT Gateway (services internes routant via NAT au lieu de VPC endpoints — S3, DynamoDB, ECR, SSM), VPC endpoints inutilisés **Plus de règles Azure** — Azure Firewall inactif, pools de nœuds AKS inactifs, pools Azure Batch inutilisés diff --git a/README.md b/README.md index 19ff987..89b2d0e 100644 --- a/README.md +++ b/README.md @@ -2,10 +2,7 @@ ![PyPI](https://img.shields.io/pypi/v/cleancloud) ![Python Versions](https://img.shields.io/pypi/pyversions/cleancloud) -![Docker Pulls](https://img.shields.io/docker/pulls/getcleancloud/cleancloud) ![License](https://img.shields.io/badge/License-MIT-yellow.svg) -[![Security 
Scanning](https://github.com/cleancloud-io/cleancloud/actions/workflows/security-scan.yml/badge.svg)](https://github.com/cleancloud-io/cleancloud/actions/workflows/security-scan.yml) -![GitHub stars](https://img.shields.io/github/stars/cleancloud-io/cleancloud?style=social) **Languages / Langues :** 🇬🇧 [English](README.md) | 🇫🇷 [Français](README.fr.md) @@ -33,7 +30,7 @@ That's CleanCloud. Scan your AWS, Azure, and GCP environments, get specific acti | Multi-account / multi-subscription / multi-project | ❌ | ✅ | ✅ | | CI/CD and scheduled enforcement (exit codes) | ❌ | ❌ | ✅ | -- **30 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate +- **31 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate. AI/ML rules (SageMaker) are opt-in via `--category ai` - **Governance enforcement (opt-in):** `--fail-on-confidence HIGH` or `--fail-on-cost 100` — enforce waste thresholds on a schedule, owned by platform or FinOps teams - **Multi-account scanning (AWS):** scan entire AWS Organizations in one run — config file, inline IDs, or auto-discovery via `--org` - **Multi-subscription scanning (Azure):** scan all Azure subscriptions in parallel — auto-discovery via Management Group, per-subscription cost breakdown included @@ -63,18 +60,88 @@ All operations are read-only. Safe for production accounts, air-gapped environme - Scheduled hygiene governance — weekly job that catches new waste and enforces thresholds across all accounts - Pre-review reports — export findings to markdown before a quarterly cost review or board meeting +## What It Looks Like + ``` Found 6 hygiene issues: -1. [AWS] Unattached EBS Volume — $40/month -2. 
[AWS] Idle NAT Gateway — $32.40/month -3. [AWS] Unattached Elastic IP — $0/month -... +1. [AWS] Idle RDS Instance (No Connections for 21 Days) + Risk : High + Confidence : High + Resource : aws.rds.instance → db-prod-analytics + Region : us-east-1 + Rule : aws.rds.instance.idle + Reason : RDS instance has had zero connections for 21 days + Details: + - instance_class: db.r5.large + - engine: postgres 15.4 + - estimated_monthly_cost: ~$380/month + +2. [AWS] Unattached EBS Volume + Risk : Low + Confidence : High + Resource : aws.ebs.volume → vol-0a1b2c3d4e5f67890 + Region : us-east-1 + Rule : aws.ebs.volume.unattached + Reason : Volume has been unattached for 47 days + Details: + - size_gb: 500 + - state: available + - tags: {"Project": "legacy-api", "Owner": "platform"} -Estimated monthly waste: ~$147 -Regions scanned: us-east-1, us-west-2, eu-west-1 +3. [AWS] Idle NAT Gateway + Risk : Medium + Confidence : Medium + Resource : aws.ec2.nat_gateway → nat-0abcdef1234567890 + Region : us-west-2 + Rule : aws.ec2.nat_gateway.idle + Reason : No traffic detected for 21 days + Details: + - name: staging-nat + - total_bytes_out: 0 + - estimated_monthly_cost: ~$32/month + +4. [AWS] Idle Load Balancer (No Healthy Targets) + Risk : Medium + Confidence : High + Resource : aws.elbv2.load_balancer → alb-staging-api + Region : us-east-1 + Rule : aws.elbv2.load_balancer.idle + Reason : Load balancer has no healthy targets for 30 days + Details: + - type: application + - estimated_monthly_cost: ~$18/month + +5. [AWS] Unattached Elastic IP + Risk : Low + Confidence : High + Resource : aws.ec2.elastic_ip → eipalloc-0a1b2c3d4e5f6 + Region : eu-west-1 + Rule : aws.ec2.elastic_ip.unattached + Reason : Elastic IP not associated with any instance or ENI (age: 92 days) + +6. 
[AWS] Old EBS Snapshot (438 Days) + Risk : Low + Confidence : High + Resource : aws.ebs.snapshot → snap-0a1b2c3d4e5f67890 + Region : us-west-2 + Rule : aws.ebs.snapshot.old + Reason : Snapshot is 438 days old with no recent activity + Details: + - size_gb: 200 + - estimated_monthly_cost: ~$10/month + +--- Scan Summary --- +Total findings: 6 +By risk: low: 3 medium: 2 high: 1 +By confidence: high: 5 medium: 1 +Minimum estimated waste: ~$480/month +(5 of 6 findings costed) +Regions scanned: us-east-1, us-west-2, eu-west-1 (auto-detected) ``` +No cloud account yet? `cleancloud demo` shows sample output without any credentials. + ## As featured in - [Korben](https://korben.info/cleancloud-nettoyeur-cloud-aws-azure.html) 🇫🇷 — Major French tech publication @@ -135,13 +202,22 @@ cleancloud scan --provider azure cleancloud scan --provider gcp --all-projects ``` -**Not sure if your credentials have the right permissions?** Run `cleancloud doctor --provider aws`, `cleancloud doctor --provider azure`, or `cleancloud doctor --provider gcp` first. +**Not sure if your credentials have the right permissions?** + +Run: + +`cleancloud doctor --provider aws`, + +`cleancloud doctor --provider azure`, or + +`cleancloud doctor --provider gcp` first. ### Scan flags: | Flag | What it does | |---|---| | `--provider aws\|azure\|gcp` | Cloud provider to scan *(required)* | +| `--category hygiene\|ai\|all` | Rule category: `hygiene` (default), `ai` (SageMaker, AWS-only), or `all` (hygiene + AI) | | `--region REGION` | Scan a single region | | `--all-regions` | Scan all active regions — AWS/Azure only | | **AWS multi-account** | | @@ -223,54 +299,6 @@ pip uninstall cleancloud && pipx install cleancloud && pipx ensurepath --- -## What It Looks Like - -``` -Found 6 hygiene issues: - -1. 
[AWS] Unattached EBS Volume - Risk : Low - Confidence : High - Resource : aws.ebs.volume → vol-0a1b2c3d4e5f67890 - Region : us-east-1 - Rule : aws.ebs.volume.unattached - Reason : Volume has been unattached for 47 days - Details: - - size_gb: 500 - - state: available - - tags: {"Project": "legacy-api", "Owner": "platform"} - -2. [AWS] Idle NAT Gateway - Risk : Medium - Confidence : Medium - Resource : aws.ec2.nat_gateway → nat-0abcdef1234567890 - Region : us-west-2 - Rule : aws.ec2.nat_gateway.idle - Reason : No traffic detected for 21 days - Details: - - name: staging-nat - - total_bytes_out: 0 - - estimated_monthly_cost_usd: 32.40 - -3. [AWS] Unattached Elastic IP - Risk : Low - Confidence : High - Resource : aws.ec2.elastic_ip → eipalloc-0a1b2c3d4e5f6 - Region : eu-west-1 - Rule : aws.ec2.elastic_ip.unattached - Reason : Elastic IP not associated with any instance or ENI (age: 92 days) - ---- Scan Summary --- -Total findings: 6 -By risk: low: 5 medium: 1 -By confidence: high: 2 medium: 4 -Minimum estimated waste: ~$147/month -(4 of 6 findings costed) -Regions scanned: us-east-1, us-west-2, eu-west-1 (auto-detected) -``` - -No cloud account yet? `cleancloud demo` shows sample output without any credentials. - ### Shareable markdown report ```bash @@ -310,7 +338,7 @@ For full output examples including `doctor`, JSON, CSV, and markdown: [`docs/exa ## What CleanCloud Detects -30 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments. +31 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments. 
**AWS:** - Compute: stopped instances 30+ days (EBS charges continue) @@ -319,6 +347,7 @@ For full output examples including `doctor`, JSON, CSV, and markdown: [`docs/exa - Platform: idle RDS instances (HIGH) - Observability: infinite retention CloudWatch Logs - Governance: untagged resources, unused security groups +- AI/ML *(opt-in: `--category ai`)*: idle SageMaker endpoints with zero invocations 14+ days — GPU-backed endpoints flagged HIGH risk ($500–$23K/month) **Azure:** - Compute: stopped (not deallocated) VMs (HIGH) @@ -564,7 +593,9 @@ Full setup guide: [GCP setup →](docs/gcp.md) **Policy-as-code** — `cleancloud.yaml` with rule packs, per-team exceptions, and cost thresholds in config — the top FinOps governance ask for 2025/2026 -**More AWS rules** — S3 lifecycle gaps, AI/GPU waste (idle SageMaker endpoints, orphaned GPU instances), Redshift idle, idle load balancers (ALB/NLB with zero traffic), NAT Gateway cost leakage (internal services routing through NAT instead of VPC endpoints — S3, DynamoDB, ECR, SSM), unused VPC endpoints +**More AI/ML waste rules** — Azure ML compute clusters idle, Vertex AI endpoints idle, SageMaker notebook instances running unused, orphaned training artifacts + +**More AWS rules** — S3 lifecycle gaps, Redshift idle, NAT Gateway cost leakage (internal services routing through NAT instead of VPC endpoints — S3, DynamoDB, ECR, SSM), unused VPC endpoints **More Azure rules** — Azure Firewall idle, AKS node pool idle, Azure Batch unused pools diff --git a/cleancloud/demo/command.py b/cleancloud/demo/command.py index fd3cd61..c72e113 100644 --- a/cleancloud/demo/command.py +++ b/cleancloud/demo/command.py @@ -3,7 +3,13 @@ import click -from cleancloud.demo.findings import ALL_FINDINGS, AWS_FINDINGS, AZURE_FINDINGS, GCP_FINDINGS +from cleancloud.demo.findings import ( + ALL_FINDINGS, + AWS_AI_FINDINGS, + AWS_FINDINGS, + AZURE_FINDINGS, + GCP_FINDINGS, +) from cleancloud.output.human import print_human from cleancloud.output.summary 
import _print_summary, build_summary @@ -15,7 +21,13 @@ default=None, help="Show demo findings for a specific provider (default: all)", ) -def demo(provider: Optional[str]): +@click.option( + "--category", + type=click.Choice(["hygiene", "ai"]), + default="hygiene", + help="Rule category to demo: hygiene (default) or ai (SageMaker)", +) +def demo(provider: Optional[str], category: str): """Show realistic sample findings without cloud credentials.""" click.echo() click.echo("=" * 60) @@ -23,7 +35,11 @@ def demo(provider: Optional[str]): click.echo(" Run 'cleancloud scan --provider aws' to scan your own environment") click.echo("=" * 60) - if provider == "aws": + if category == "ai": + findings = AWS_AI_FINDINGS + regions = ["us-east-1"] + region_mode = "explicit" + elif provider == "aws": findings = AWS_FINDINGS regions = ["us-east-1", "us-west-2", "eu-west-1"] region_mode = "all-regions" @@ -44,7 +60,7 @@ def demo(provider: Optional[str]): summary = build_summary(findings) summary["scanned_at"] = datetime.now(timezone.utc).isoformat() - summary["provider"] = provider or "mixed" + summary["provider"] = provider or ("aws" if category == "ai" else "mixed") summary["regions_scanned"] = regions _print_summary(summary, region_selection_mode=region_mode) diff --git a/cleancloud/demo/findings.py b/cleancloud/demo/findings.py index 5b88f0b..889d27a 100644 --- a/cleancloud/demo/findings.py +++ b/cleancloud/demo/findings.py @@ -614,3 +614,51 @@ ] ALL_FINDINGS: List[Finding] = AWS_FINDINGS + AZURE_FINDINGS + GCP_FINDINGS + +AWS_AI_FINDINGS: List[Finding] = [ + Finding( + provider="aws", + rule_id="aws.sagemaker.endpoint.idle", + resource_type="aws.sagemaker.endpoint", + resource_id="llm-inference-prod", + region="us-east-1", + title="Idle SageMaker Endpoint (No Invocations for 21 Days)", + summary=( + "SageMaker endpoint 'llm-inference-prod' has received zero invocations " + "for 21 days but remains InService, incurring continuous GPU charges." 
+ ), + reason="SageMaker endpoint has zero invocations for 21 days", + risk=RiskLevel.HIGH, + confidence=ConfidenceLevel.HIGH, + detected_at=_NOW, + details={ + "endpoint_name": "llm-inference-prod", + "instance_type": "ml.g5.2xlarge", + "is_gpu": True, + "variant_count": 1, + "total_instances": 1, + "age_days": 21, + "idle_window_days": 21, + "idle_days_threshold": 14, + "estimated_monthly_cost": "~$1,008/month", + }, + evidence=Evidence( + signals_used=[ + "Zero recorded invocations for 21 days (CloudWatch metric)", + "Endpoint state: InService", + "Endpoint age: 21 days", + "Total running instances (DesiredInstanceCount): 1", + "Instance type: ml.g5.2xlarge", + "GPU-backed instance — high hourly cost", + ], + signals_not_checked=[ + "Scheduled or batch invocation patterns", + "Internal health-check invocations", + "Planned future usage", + "Shadow mode / canary deployments", + ], + time_window="21 days", + ), + estimated_monthly_cost_usd=1008.0, + ), +] diff --git a/cleancloud/doctor/aws.py b/cleancloud/doctor/aws.py index 296da17..adf951a 100644 --- a/cleancloud/doctor/aws.py +++ b/cleancloud/doctor/aws.py @@ -239,8 +239,9 @@ def run_aws_doctor(profile: Optional[str], region: Optional[str] = None) -> None info(" s3:GetBucketTagging") info(" sts:GetCallerIdentity") info("") - info("Copy the ready-to-use IAM policy from:") - info(" security/aws-readonly-policy.json (in this repo)") + info("Copy the ready-to-use IAM policies from:") + info(" security/aws/base-readonly.json (required for all scans)") + info(" security/aws/hygiene-readonly.json (hygiene rules, default)") info(" or: docs/aws.md (full setup guide with OIDC)") fail("No credentials — configure AWS access and re-run doctor") @@ -584,6 +585,110 @@ def run_aws_doctor(profile: Optional[str], region: Optional[str] = None) -> None warn("AWS ENVIRONMENT READY (partial coverage)") else: success("AWS ENVIRONMENT READY FOR CLEANCLOUD") + info("") + info("Tip: To also validate AI/ML permissions (SageMaker rules), 
run:") + info(" cleancloud doctor --provider aws --category ai") + info("=" * 70) + info("") + + +def run_aws_ai_doctor(profile: Optional[str], region: Optional[str] = None) -> None: + """Validate AWS permissions for --category ai (SageMaker rules).""" + if region is None: + region = "us-east-1" + + info("") + info("=" * 70) + info("AWS AI/ML PERMISSION VALIDATION") + info("=" * 70) + info("") + info("Validating permissions for: cleancloud scan --provider aws --category ai") + info(f"Region: {region}") + info("") + + try: + session = create_aws_session(profile=profile, region=region) + except Exception as e: + fail(f"Failed to create AWS session: {e}") + return + + # Verify identity + try: + sts = session.client("sts") + identity = sts.get_caller_identity() + success(f"Account: {identity['Account']} ({identity['Arn']})") + except Exception as e: + fail(f"AWS identity verification failed: {e}") + return + + info("") + info("Permission Checks") + info("-" * 70) + + permissions_tested = [] + permissions_failed = [] + + from datetime import datetime, timedelta, timezone + + sagemaker = session.client("sagemaker", region_name=region) + + try: + sagemaker.list_endpoints(MaxResults=1) + permissions_tested.append("sagemaker:ListEndpoints") + success("sagemaker:ListEndpoints") + except Exception as e: + permissions_failed.append(("sagemaker:ListEndpoints", str(e))) + warn(f"sagemaker:ListEndpoints - {e}") + + try: + # Describe the first InService endpoint if one exists; otherwise permission is assumed + # present since ListEndpoints already confirmed SageMaker access. 
+ endpoints = sagemaker.list_endpoints(MaxResults=1, StatusEquals="InService") + endpoint_list = endpoints.get("Endpoints", []) + if endpoint_list: + sagemaker.describe_endpoint(EndpointName=endpoint_list[0]["EndpointName"]) + permissions_tested.append("sagemaker:DescribeEndpoint") + success("sagemaker:DescribeEndpoint") + except Exception as e: + permissions_failed.append(("sagemaker:DescribeEndpoint", str(e))) + warn(f"sagemaker:DescribeEndpoint - {e}") + + try: + cloudwatch = session.client("cloudwatch", region_name=region) + now = datetime.now(timezone.utc) + cloudwatch.get_metric_statistics( + Namespace="AWS/SageMaker", + MetricName="Invocations", + Dimensions=[], + StartTime=now - timedelta(hours=1), + EndTime=now, + Period=3600, + Statistics=["Sum"], + ) + permissions_tested.append("cloudwatch:GetMetricStatistics") + success("cloudwatch:GetMetricStatistics") + except Exception as e: + permissions_failed.append(("cloudwatch:GetMetricStatistics", str(e))) + warn(f"cloudwatch:GetMetricStatistics - {e}") + + info("") + info("=" * 70) + total = len(permissions_tested) + len(permissions_failed) + info(f"Permissions: {len(permissions_tested)}/{total} passed") + + if permissions_failed: + info("") + for perm, _ in permissions_failed: + warn(f" missing: {perm}") + info("") + info("Attach security/aws/ai-readonly.json to your IAM role/user.") + info("Then re-run: cleancloud doctor --provider aws --category ai") + info("") + warn("AWS AI/ML PERMISSIONS INCOMPLETE") + else: + info("") + success("AWS AI/ML PERMISSIONS READY") + info("Run: cleancloud scan --provider aws --category ai") info("=" * 70) info("") diff --git a/cleancloud/doctor/command.py b/cleancloud/doctor/command.py index fa2d3e7..dd6050b 100644 --- a/cleancloud/doctor/command.py +++ b/cleancloud/doctor/command.py @@ -21,6 +21,12 @@ default=None, help="GCP project ID to probe permissions against (GCP only)", ) +@click.option( + "--category", + default="hygiene", + type=click.Choice(["hygiene", "ai", 
"all"]), + help="Permission set to validate: hygiene (default), ai (SageMaker), or all", +) @click.option( "--config", type=click.Path(exists=True), @@ -44,6 +50,7 @@ def doctor( region: str, profile: Optional[str], project: Optional[str], + category: str, config: Optional[str], multi_account_file: Optional[str], role_name: str, @@ -51,6 +58,12 @@ def doctor( click.echo("Running CleanCloud doctor") click.echo() + if category == "ai" and provider not in (None, "aws"): + raise click.UsageError( + "--category ai is only supported with --provider aws (SageMaker rules). " + "AI/ML rules for Azure and GCP are on the roadmap." + ) + if multi_account_file: if provider != "aws" and provider is not None: click.echo("Error: --multi-account is only supported with --provider aws") @@ -66,7 +79,9 @@ def doctor( run_aws_multi_account_doctor(ma_config, profile=profile, region=region) return - run_doctor(provider=provider, profile=profile, region=region, project=project) + run_doctor( + provider=provider, profile=profile, region=region, project=project, category=category + ) try: cfg = CleanCloudConfig.empty() diff --git a/cleancloud/doctor/runner.py b/cleancloud/doctor/runner.py index 3ea4891..2785177 100644 --- a/cleancloud/doctor/runner.py +++ b/cleancloud/doctor/runner.py @@ -1,7 +1,7 @@ import sys from typing import Optional -from cleancloud.doctor.aws import run_aws_doctor +from cleancloud.doctor.aws import run_aws_ai_doctor, run_aws_doctor from cleancloud.doctor.azure import run_azure_doctor from cleancloud.doctor.common import DoctorError, info, success from cleancloud.doctor.gcp import run_gcp_doctor @@ -12,6 +12,7 @@ def run_doctor( profile: Optional[str] = None, region: Optional[str] = None, project: Optional[str] = None, + category: str = "hygiene", ) -> None: # Validate provider valid_providers = ["aws", "azure", "gcp"] @@ -47,7 +48,13 @@ def run_doctor( for p in providers_to_check: try: if p == "aws": - run_aws_doctor(profile=profile, region=region) + if category == 
"ai": + run_aws_ai_doctor(profile=profile, region=region) + elif category == "all": + run_aws_doctor(profile=profile, region=region) + run_aws_ai_doctor(profile=profile, region=region) + else: + run_aws_doctor(profile=profile, region=region) results[p] = {"status": "passed", "error": None} elif p == "azure": diff --git a/cleancloud/output/human.py b/cleancloud/output/human.py index 1a1495a..ff3f587 100644 --- a/cleancloud/output/human.py +++ b/cleancloud/output/human.py @@ -5,10 +5,10 @@ def print_human(findings: List[Finding]): if not findings: - print("No hygiene issues detected") + print("No issues detected") return - print(f"\nFound {len(findings)} hygiene issues:\n") + print(f"\nFound {len(findings)} issue(s):\n") for i, f in enumerate(findings, start=1): print(f"{i}. [{f.provider.upper()}] {f.title}") diff --git a/cleancloud/output/markdown.py b/cleancloud/output/markdown.py index 8dddf8c..3ef82af 100644 --- a/cleancloud/output/markdown.py +++ b/cleancloud/output/markdown.py @@ -48,12 +48,16 @@ def write_markdown( subscriptions = summary.get("subscriptions_scanned", []) accounts_scanned = summary.get("accounts_scanned") + projects_scanned = summary.get("projects_scanned", []) + lines.append(f"**Provider:** {provider} ") if subscriptions: lines.append(f"**Subscriptions:** {', '.join(subscriptions)} ") elif accounts_scanned is not None: lines.append(f"**Accounts:** {accounts_scanned} ") lines.append(f"**Regions:** {regions_str} ") + elif projects_scanned: + lines.append(f"**Projects:** {', '.join(projects_scanned)} ") else: lines.append(f"**Regions:** {regions_str} ") @@ -68,7 +72,7 @@ def write_markdown( # Findings table total = summary.get("total_findings", 0) if total == 0: - lines.append("**No hygiene issues detected.**") + lines.append("**No issues detected.**") else: lines.append(f"**Total findings:** {total}") lines.append("") @@ -109,6 +113,21 @@ def write_markdown( lines.append(f"| {label} ({r['id']}) | {r['findings']}{status_str} | {cost_str} |") 
lines.append("") + # GCP multi-project breakdown + per_project = summary.get("per_project") + if per_project: + lines.append("**Per-project breakdown:**") + lines.append("") + lines.append("| Project | Findings | Est. Monthly Cost |") + lines.append("|---------|--------:|------------------:|") + for r in per_project: + label = r["name"] if r["name"] != r["id"] else r["id"] + cost = r.get("estimated_monthly_cost_usd", 0) + cost_str = f"~${cost:,.0f}" if cost else "—" + status_str = f" _{r['status']}_" if r["status"] != "success" else "" + lines.append(f"| {label} ({r['id']}) | {r['findings']}{status_str} | {cost_str} |") + lines.append("") + # Azure multi-subscription breakdown per_sub = summary.get("per_subscription") if per_sub: @@ -126,7 +145,7 @@ def write_markdown( # Footer lines.append( "> Generated by [CleanCloud](https://github.com/cleancloud-io/cleancloud) — " - "read-only cloud hygiene scanner for AWS and Azure." + "read-only cloud cost scanner for AWS, Azure, and GCP." ) output = "\n".join(lines) diff --git a/cleancloud/output/summary.py b/cleancloud/output/summary.py index 1ddb942..a62396e 100644 --- a/cleancloud/output/summary.py +++ b/cleancloud/output/summary.py @@ -155,7 +155,7 @@ def _print_summary(summary: dict, region_selection_mode: str = None, multi_accou click.echo("To enable skipped rules, update your IAM policy/role to the latest version:") if has_aws: click.echo( - " AWS: https://github.com/cleancloud-io/cleancloud/blob/main/security/aws-readonly-policy.json" + " AWS: https://github.com/cleancloud-io/cleancloud/tree/main/security/aws/" ) click.echo( " Run 'cleancloud doctor --provider aws' to validate permissions after updating." 
@@ -169,8 +169,7 @@ def _print_summary(summary: dict, region_selection_mode: str = None, multi_accou ) if has_gcp: click.echo( - " GCP: ensure your service account has roles/compute.viewer, " - "roles/cloudsql.viewer, and roles/monitoring.viewer" + " GCP: https://github.com/cleancloud-io/cleancloud/blob/main/security/gcp-readonly-roles.json" ) click.echo( " Run 'cleancloud doctor --provider gcp --project PROJECT_ID' to validate permissions after updating." diff --git a/cleancloud/providers/aws/multi_account.py b/cleancloud/providers/aws/multi_account.py index eea3b42..0b4d4a6 100644 --- a/cleancloud/providers/aws/multi_account.py +++ b/cleancloud/providers/aws/multi_account.py @@ -1,7 +1,7 @@ import time from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait from dataclasses import dataclass, field -from typing import List, Optional +from typing import Callable, List, Optional import botocore.exceptions import click @@ -35,6 +35,7 @@ def scan_account( region: Optional[str], external_id: Optional[str], regions_override: Optional[List[str]] = None, + rules: Optional[List[Callable]] = None, ) -> AccountScanResult: start = time.monotonic() click.echo(f" {account.name} ({account.id}) starting...") @@ -63,7 +64,7 @@ def scan_account( ) findings, skipped_rules, regions_failed = scan_aws_regions_with_session( - assumed_session, regions_to_scan + assumed_session, regions_to_scan, rules=rules ) for f in findings: @@ -121,6 +122,7 @@ def scan_multiple_accounts( profile: Optional[str], max_concurrent: int = 5, per_account_regions: bool = False, + rules: Optional[List[Callable]] = None, ) -> List[AccountScanResult]: hub_session = create_aws_session(profile=profile, region=region or "us-east-1") @@ -159,6 +161,7 @@ def scan_multiple_accounts( region, config.external_id, regions_override, + rules, ): account for account in config.accounts } diff --git a/cleancloud/providers/aws/rules/sagemaker_endpoint_idle.py 
b/cleancloud/providers/aws/rules/sagemaker_endpoint_idle.py new file mode 100644 index 0000000..cb93781 --- /dev/null +++ b/cleancloud/providers/aws/rules/sagemaker_endpoint_idle.py @@ -0,0 +1,285 @@ +from datetime import datetime, timedelta, timezone +from typing import List, Optional, Tuple + +import boto3 +from botocore.exceptions import ClientError + +from cleancloud.core.confidence import ConfidenceLevel +from cleancloud.core.evidence import Evidence +from cleancloud.core.finding import Finding +from cleancloud.core.risk import RiskLevel + +RULE_METADATA = { + "id": "aws.sagemaker.endpoint.idle", + "category": "ai", + "service": "sagemaker", + "cost_impact": "high", +} + +# GPU/accelerator instance families — significantly more expensive than CPU +_GPU_FAMILIES = ( + "ml.g4dn", + "ml.g5", + "ml.p2", + "ml.p3", + "ml.p4d", + "ml.p4de", + "ml.p5", + "ml.trn1", + "ml.inf1", + "ml.inf2", +) + +# Approximate monthly cost by instance type (on-demand, us-east-1) +_MONTHLY_COST_BY_FAMILY = { + "ml.g4dn.xlarge": 531.0, + "ml.g4dn.2xlarge": 941.0, + "ml.g4dn.4xlarge": 1_633.0, + "ml.g4dn.12xlarge": 2_817.0, + "ml.g5.xlarge": 600.0, + "ml.g5.2xlarge": 1_008.0, + "ml.g5.4xlarge": 1_838.0, + "ml.g5.12xlarge": 5_443.0, + "ml.p3.2xlarge": 2_754.0, + "ml.p3.8xlarge": 11_016.0, + "ml.p4d.24xlarge": 23_596.0, + "ml.m5.large": 94.0, + "ml.m5.xlarge": 188.0, + "ml.m5.2xlarge": 376.0, + "ml.c5.large": 85.0, + "ml.c5.xlarge": 171.0, + "ml.t2.medium": 40.0, + "ml.t3.medium": 42.0, +} +_DEFAULT_MONTHLY_COST = 150.0 + + +def find_idle_sagemaker_endpoints( + session: boto3.Session, + region: str, + days_idle: int = 14, +) -> List[Finding]: + """ + Find SageMaker inference endpoints with zero invocations. + + SageMaker endpoints incur continuous charges while InService, regardless + of whether they receive any traffic. GPU-backed endpoints cost $500–$23K/month. + Endpoints deployed for experiments or demos are frequently abandoned. 
+ + Detection logic: + - Endpoint is in InService state + - Endpoint has at least one running instance (DesiredInstanceCount > 0) + - Zero Invocations over the effective idle window (CloudWatch AWS/SageMaker) + + Confidence: + - HIGH: Zero invocations over the full idle window (age >= days_idle) + - MEDIUM: Zero invocations, age >= 75% of days_idle threshold + + IAM permissions: + - sagemaker:ListEndpoints + - sagemaker:DescribeEndpoint + - cloudwatch:GetMetricStatistics + """ + sagemaker = session.client("sagemaker", region_name=region) + cloudwatch = session.client("cloudwatch", region_name=region) + now = datetime.now(timezone.utc) + findings: List[Finding] = [] + + try: + paginator = sagemaker.get_paginator("list_endpoints") + + for page in paginator.paginate(StatusEquals="InService"): + for endpoint in page.get("Endpoints", []): + endpoint_name = endpoint["EndpointName"] + + # Calculate age — normalize to UTC to handle timezone-naive timestamps + # from older boto3 versions, which would otherwise raise TypeError + create_time = endpoint.get("CreationTime") + age_days = 0 + if create_time: + if create_time.tzinfo is None: + create_time = create_time.replace(tzinfo=timezone.utc) + age_days = (now - create_time).days + + # Skip endpoints younger than half the idle threshold — + # too new to reliably classify as abandoned + if age_days < max(days_idle // 2, 7): + continue + + # Describe endpoint — get cost, GPU flag, variant info + monthly_cost, is_gpu, variant_count, total_instances, primary_instance_type = ( + _describe_endpoint(sagemaker, endpoint_name) + ) + + # Skip scaled-to-zero endpoints — no running instances, no compute cost + if total_instances == 0: + continue + + # Use effective window: can't look back further than the endpoint's age + effective_window = min(days_idle, age_days) + + # Skip if effective window is too small to draw a reliable conclusion + if effective_window < 3: + continue + + # Check invocations over the effective window + 
has_invocations = _check_invocations(cloudwatch, endpoint_name, effective_window) + if has_invocations: + continue + + # Confidence based on age relative to idle threshold + if age_days >= days_idle: + confidence = ConfidenceLevel.HIGH + elif age_days >= int(days_idle * 0.75): + confidence = ConfidenceLevel.MEDIUM + else: + continue # too borderline for a confident finding + + risk = RiskLevel.HIGH if is_gpu else RiskLevel.MEDIUM + + signals = [ + f"Zero recorded invocations for {effective_window} days (CloudWatch metric)", + "Endpoint state: InService", + f"Endpoint age: {age_days} days", + f"Total running instances (DesiredInstanceCount): {total_instances}", + ] + if primary_instance_type: + signals.append(f"Instance type: {primary_instance_type}") + if is_gpu: + signals.append("GPU-backed instance — high hourly cost") + if variant_count > 1: + signals.append(f"Production variants: {variant_count}") + + evidence = Evidence( + signals_used=signals, + signals_not_checked=[ + "Scheduled or batch invocation patterns", + "Internal health-check invocations", + "Planned future usage", + "Shadow mode / canary deployments", + ], + time_window=f"{effective_window} days", + ) + + findings.append( + Finding( + provider="aws", + rule_id="aws.sagemaker.endpoint.idle", + resource_type="aws.sagemaker.endpoint", + resource_id=endpoint_name, + region=region, + estimated_monthly_cost_usd=monthly_cost, + title=f"Idle SageMaker Endpoint (No Invocations for {effective_window} Days)", + summary=( + f"SageMaker endpoint '{endpoint_name}' has received zero invocations " + f"for {effective_window} days but remains InService, incurring continuous charges." 
+ ), + reason=f"SageMaker endpoint has zero invocations for {effective_window} days", + risk=risk, + confidence=confidence, + detected_at=now, + evidence=evidence, + details={ + "endpoint_name": endpoint_name, + "instance_type": primary_instance_type, + "is_gpu": is_gpu, + "variant_count": variant_count, + "total_instances": total_instances, + "age_days": age_days, + "idle_window_days": effective_window, + "idle_days_threshold": days_idle, + "estimated_monthly_cost": f"~${monthly_cost:,.0f}/month", + }, + ) + ) + + except ClientError as e: + code = e.response["Error"]["Code"] + if code in ("UnauthorizedOperation", "AccessDenied", "AccessDeniedException"): + raise PermissionError( + "Missing required IAM permissions: " + "sagemaker:ListEndpoints, sagemaker:DescribeEndpoint, " + "cloudwatch:GetMetricStatistics" + ) from e + raise + + return findings + + +def _check_invocations(cloudwatch, endpoint_name: str, days: int) -> bool: + """Return True if the endpoint received any invocations in the past `days` days. + + CloudWatch only publishes SageMaker Invocations data when invocations actually occur. + An endpoint that has never been called will have no metric series — empty datapoints. + This function is only called after the age guard ensures the endpoint is ≥7 days old, + so empty datapoints reliably indicate a genuinely idle endpoint. + + Returns True (assume active) only on CloudWatch API failure (ClientError). + """ + now = datetime.now(timezone.utc) + start_time = now - timedelta(days=max(days, 1)) + + try: + response = cloudwatch.get_metric_statistics( + Namespace="AWS/SageMaker", + MetricName="Invocations", + Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}], + StartTime=start_time, + EndTime=now, + Period=3600, # 1 hour — more datapoints, more reliable for sparse traffic + Statistics=["Sum"], + ) + datapoints = response.get("Datapoints", []) + # Empty datapoints: CloudWatch never recorded any invocations. 
+ # The age guard guarantees this endpoint is ≥7 days old, so "no data" means idle. + if not datapoints: + return False + return any(dp.get("Sum", 0) > 0 for dp in datapoints) + + except ClientError: + # Metrics unavailable — assume active to avoid false positives + return True + + +def _describe_endpoint( + sagemaker, endpoint_name: str +) -> Tuple[float, bool, int, int, Optional[str]]: + """Return (monthly_cost, is_gpu, variant_count, total_instances, primary_instance_type). + + Cost is computed per-variant using DesiredInstanceCount × per-instance cost, summed + across all variants. GPU flag is True if any variant uses an accelerator instance. + + Returns (0.0, False, 0, 0, None) on ClientError so the caller skips the endpoint + rather than flagging it with an assumed cost and instance count. + """ + try: + response = sagemaker.describe_endpoint(EndpointName=endpoint_name) + variants = response.get("ProductionVariants", []) + if not variants: + return _DEFAULT_MONTHLY_COST, False, 0, 0, None + + total_monthly_cost = 0.0 + is_gpu = False + total_instances = 0 + primary_instance_type: Optional[str] = None + + for i, v in enumerate(variants): + itype = v.get("CurrentInstanceType") + count = v.get("DesiredInstanceCount") or 0 + total_instances += count + + if i == 0: + primary_instance_type = itype + + cost_per_instance = _MONTHLY_COST_BY_FAMILY.get(itype, _DEFAULT_MONTHLY_COST) + total_monthly_cost += cost_per_instance * count + + if itype and any(itype.startswith(fam) for fam in _GPU_FAMILIES): + is_gpu = True + + return total_monthly_cost, is_gpu, len(variants), total_instances, primary_instance_type + + except ClientError: + # Unknown state — return zero instances so the endpoint is skipped rather + # than flagged with assumed cost and instance count. 
+ return 0.0, False, 0, 0, None diff --git a/cleancloud/providers/aws/scan.py b/cleancloud/providers/aws/scan.py index 02e0bef..25cb6ad 100644 --- a/cleancloud/providers/aws/scan.py +++ b/cleancloud/providers/aws/scan.py @@ -23,6 +23,9 @@ from cleancloud.providers.aws.rules.nat_gateway_idle import find_idle_nat_gateways from cleancloud.providers.aws.rules.rds_idle import find_idle_rds_instances from cleancloud.providers.aws.rules.rds_snapshot_old import find_old_rds_snapshots +from cleancloud.providers.aws.rules.sagemaker_endpoint_idle import ( + find_idle_sagemaker_endpoints, +) from cleancloud.providers.aws.rules.untagged_resources import ( find_untagged_resources as find_aws_untagged_resources, ) @@ -45,9 +48,18 @@ find_old_rds_snapshots, ] +# AI/ML waste rules — not run by default; use --category ai or --category all +AWS_AI_RULES: List[Callable] = [ + find_idle_sagemaker_endpoints, +] + def scan_aws_with_region_selection( - *, profile: Optional[str], region: Optional[str], all_regions: bool + *, + profile: Optional[str], + region: Optional[str], + all_regions: bool, + rules: Optional[List[Callable]] = None, ) -> Tuple[str, List[Finding], List[str], List[dict]]: validate_region_params(region, all_regions) @@ -89,7 +101,7 @@ def scan_aws_with_region_selection( click.echo() - findings, skipped_rules = scan_aws_regions(profile, regions_to_scan) + findings, skipped_rules = scan_aws_regions(profile, regions_to_scan, rules=rules) regions_scanned = regions_to_scan return region_selection_mode, findings, regions_scanned, skipped_rules @@ -244,7 +256,9 @@ def _get_all_aws_regions(session) -> List[str]: def scan_aws_regions( profile: Optional[str], regions_to_scan: List[str], + rules: Optional[List[Callable]] = None, ) -> Tuple[List[Finding], List[dict]]: + active_rules = rules if rules is not None else AWS_RULES findings: List[Finding] = [] all_skipped_rules: List[dict] = [] @@ -256,7 +270,7 @@ def scan_aws_regions( ) as bar: with ThreadPoolExecutor(max_workers=min(5, 
len(regions_to_scan))) as executor: futures = { - executor.submit(_scan_aws_region, profile, region): region + executor.submit(_scan_aws_region, profile, region, active_rules): region for region in regions_to_scan } @@ -286,19 +300,21 @@ def scan_aws_regions( def scan_aws_regions_with_session( session, regions_to_scan: List[str], + rules: Optional[List[Callable]] = None, ) -> Tuple[List[Finding], List[dict], List[str]]: """ Scan a list of regions using a pre-existing boto3 session (e.g. assumed role). Used by multi-account scanning. No progress bars — account-level logging handles UX. Returns (findings, skipped_rules, failed_regions). """ + active_rules = rules if rules is not None else AWS_RULES findings: List[Finding] = [] all_skipped_rules: List[dict] = [] failed_regions: List[str] = [] with ThreadPoolExecutor(max_workers=min(5, len(regions_to_scan))) as executor: futures = { - executor.submit(_scan_aws_region_with_session, session, region): region + executor.submit(_scan_aws_region_with_session, session, region, active_rules): region for region in regions_to_scan } for future in as_completed(futures): @@ -316,7 +332,9 @@ def scan_aws_regions_with_session( return findings, all_skipped_rules, failed_regions -def _scan_aws_region_with_session(session, region: str) -> Tuple[List[Finding], List[dict]]: +def _scan_aws_region_with_session( + session, region: str, rules: List[Callable] +) -> Tuple[List[Finding], List[dict]]: """ Scan a single region using a pre-existing session. Used for assumed-role (multi-account) scans where the session is already scoped to the target account. 
@@ -324,8 +342,8 @@ def _scan_aws_region_with_session(session, region: str) -> Tuple[List[Finding], findings: List[Finding] = [] skipped_rules: List[dict] = [] - with ThreadPoolExecutor(max_workers=min(4, len(AWS_RULES))) as executor: - futures = {executor.submit(rule, session, region): rule for rule in AWS_RULES} + with ThreadPoolExecutor(max_workers=min(4, len(rules))) as executor: + futures = {executor.submit(rule, session, region): rule for rule in rules} for future in as_completed(futures): rule = futures[future] @@ -347,7 +365,9 @@ def _scan_aws_region_with_session(session, region: str) -> Tuple[List[Finding], return findings, skipped_rules -def _scan_aws_region(profile: Optional[str], region: str) -> Tuple[List[Finding], List[dict]]: +def _scan_aws_region( + profile: Optional[str], region: str, rules: List[Callable] +) -> Tuple[List[Finding], List[dict]]: session = create_aws_session(profile=profile, region=region) findings: List[Finding] = [] skipped_rules: List[dict] = [] @@ -356,13 +376,13 @@ def _scan_aws_region(profile: Optional[str], region: str) -> Tuple[List[Finding] endpoint_errors = 0 with click.progressbar( - length=len(AWS_RULES), + length=len(rules), label=f"Scanning AWS rules in {region}", show_eta=True, show_percent=True, ) as bar: - with ThreadPoolExecutor(max_workers=min(4, len(AWS_RULES))) as executor: - futures = {executor.submit(rule, session, region): rule for rule in AWS_RULES} + with ThreadPoolExecutor(max_workers=min(4, len(rules))) as executor: + futures = {executor.submit(rule, session, region): rule for rule in rules} for future in as_completed(futures): rule = futures[future] diff --git a/cleancloud/scan/command.py b/cleancloud/scan/command.py index dc378a9..df8fe6b 100644 --- a/cleancloud/scan/command.py +++ b/cleancloud/scan/command.py @@ -44,7 +44,7 @@ discover_org_accounts, scan_multiple_accounts, ) -from cleancloud.providers.aws.scan import AWS_RULES, scan_aws_with_region_selection +from cleancloud.providers.aws.scan import 
AWS_AI_RULES, AWS_RULES, scan_aws_with_region_selection from cleancloud.providers.azure.scan import AZURE_RULES, scan_azure_with_region_selection from cleancloud.providers.gcp.scan import ( GCP_RULES, @@ -186,6 +186,13 @@ default=False, help="Disable post-scan feedback prompt (recommended for CI/CD runs)", ) +@click.option( + "--category", + type=click.Choice(["hygiene", "ai", "all"]), + default="hygiene", + show_default=True, + help="Rule category to run: hygiene (default), ai (AI/ML waste), or all", +) def scan( provider: str, region: Optional[str], @@ -204,6 +211,7 @@ def scan( config: Optional[str], ignore_tag: List[str], no_feedback: bool, + category: str, multi_account_file: Optional[str], accounts_inline: Optional[str], scan_org: bool, @@ -251,10 +259,29 @@ def scan( if used: raise click.UsageError(f"{flag} is only supported with --provider gcp") + if category == "ai" and provider != "aws": + raise click.UsageError( + "--category ai is only supported with --provider aws (SageMaker rules). " + "AI/ML rules for Azure and GCP are on the roadmap." 
+ ) + + # Build the AWS rule list based on --category + if provider == "aws": + if category == "hygiene": + aws_rules_to_run = AWS_RULES + elif category == "ai": + aws_rules_to_run = AWS_AI_RULES + else: # all + aws_rules_to_run = AWS_RULES + AWS_AI_RULES + else: + aws_rules_to_run = AWS_RULES # unused for non-AWS but keeps type consistent + click.echo() click.echo("Starting CleanCloud scan...") click.echo() click.echo(f"Provider: {provider}") + if provider == "aws" and category != "hygiene": + click.echo(f"Category: {category}") click.echo() try: @@ -318,6 +345,7 @@ def scan( profile=profile, max_concurrent=concurrency, per_account_regions=per_account_regions, + rules=aws_rules_to_run, ) # Aggregate findings and metadata from all accounts @@ -335,7 +363,10 @@ def scan( elif provider == "aws": region_selection_mode, findings, regions_scanned, skipped_rules = ( scan_aws_with_region_selection( - profile=profile, region=region, all_regions=all_regions + profile=profile, + region=region, + all_regions=all_regions, + rules=aws_rules_to_run, ) ) @@ -453,7 +484,7 @@ def scan( # Add provider-specific fields if provider == "aws": summary["region_selection_mode"] = region_selection_mode - summary["total_rules"] = len(AWS_RULES) + summary["total_rules"] = len(aws_rules_to_run) elif provider == "azure": summary["total_rules"] = len(AZURE_RULES) summary["subscription_selection_mode"] = subscription_selection_mode @@ -541,6 +572,12 @@ def scan( else: print_human(findings) _print_summary(summary, region_selection_mode, multi_account_results or None) + if provider == "aws" and category == "hygiene": + click.echo( + "Tip: Run AI/ML cost checks with: " + "cleancloud scan --provider aws --category ai" + ) + click.echo() # Community prompt (all output modes) click.echo() diff --git a/docs/aws.md b/docs/aws.md index 01263f6..3c28e8d 100644 --- a/docs/aws.md +++ b/docs/aws.md @@ -475,11 +475,24 @@ aws iam create-role \ aws iam put-role-policy \ --role-name CleanCloudReadOnlyRole \ - 
--policy-name CleanCloudReadOnly \ - --policy-document file://security/aws-readonly-policy.json + --policy-name CleanCloudBase \ + --policy-document file://security/aws/base-readonly.json + +aws iam put-role-policy \ + --role-name CleanCloudReadOnlyRole \ + --policy-name CleanCloudHygiene \ + --policy-document file://security/aws/hygiene-readonly.json ``` -> `security/aws-readonly-policy.json` is in the [CleanCloud repo](https://github.com/cleancloud-io/cleancloud/blob/main/security/aws-readonly-policy.json). Download it first or clone the repo, then run the commands above. +> The policy files are in the [CleanCloud repo](https://github.com/cleancloud-io/cleancloud/tree/main/security/aws). Download or clone the repo first, then run the commands above. +> +> To also enable AI/ML rules (`--category ai`), attach the AI policy: +> ```bash +> aws iam put-role-policy \ +> --role-name CleanCloudReadOnlyRole \ +> --policy-name CleanCloudAI \ +> --policy-document file://security/aws/ai-readonly.json +> ``` Repeat for each spoke account — takes under 30 seconds per account. 
diff --git a/docs/ci.md b/docs/ci.md index eee0354..3f7b3d8 100644 --- a/docs/ci.md +++ b/docs/ci.md @@ -166,6 +166,7 @@ The simplest way to add CleanCloud to GitHub Actions — one step, no pip instal | Input | Description | AWS | Azure | GCP | |---|---|:---:|:---:|:---:| | `provider` | `aws`, `azure`, or `gcp` (required) | ✓ | ✓ | ✓ | +| `category` | `hygiene` (default), `ai` (SageMaker, AWS-only), or `all` | ✓ | — | — | | `region` | Single region/location filter | ✓ | ✓ | — | | `all-regions` | Scan all active regions | ✓ | — | — | | `org` | Auto-discover all AWS Organization accounts | ✓ | — | — | @@ -173,13 +174,13 @@ The simplest way to add CleanCloud to GitHub Actions — one step, no pip instal | `multi-account` | Path to accounts config YAML | ✓ | — | — | | `role-name` | Cross-account role name (default: `CleanCloudReadOnlyRole`) | ✓ | — | — | | `external-id` | External ID for cross-account role assumption | ✓ | — | — | -| `concurrency` | Parallel account scan limit | ✓ | — | — | -| `timeout` | Total scan timeout in seconds | ✓ | — | — | +| `concurrency` | Parallel account scan limit | ✓ | — | ✓ | +| `timeout` | Total scan timeout in seconds | ✓ | — | ✓ | | `per-account-regions` | Detect active regions per account (slower, more accurate) | ✓ | — | — | | `subscription` | Comma-separated subscription IDs | — | ✓ | — | | `management-group` | Management Group ID for subscription discovery | — | ✓ | — | -| `project` | Comma-separated GCP project IDs | — | — | ✓ | | `all-projects` | Scan all accessible GCP projects | — | — | ✓ | +| `project` | GCP project ID (repeatable) | — | — | ✓ | | `fail-on-confidence` | Fail on `LOW`, `MEDIUM`, or `HIGH` confidence findings | ✓ | ✓ | ✓ | | `fail-on-cost` | Fail if estimated waste exceeds this USD amount | ✓ | ✓ | ✓ | | `fail-on-findings` | Fail on any finding | ✓ | ✓ | ✓ | @@ -475,6 +476,61 @@ jobs: retention-days: 30 ``` +### AWS AI/ML Scan (SageMaker) + +Run SageMaker idle endpoint detection separately — requires the 
`ai-readonly.json` policy on your IAM role. + +```yaml +name: CleanCloud AI/ML Scan + +on: + schedule: + - cron: '0 9 * * 1' # Weekly on Monday at 9 AM + +permissions: + id-token: write + contents: read + +jobs: + cleancloud-ai: + runs-on: ubuntu-latest + + steps: + - uses: actions/checkout@v4 + + - name: Configure AWS credentials (OIDC) + uses: aws-actions/configure-aws-credentials@v4 + with: + role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/CleanCloudCIReadOnly + aws-region: us-east-1 + + - name: Install CleanCloud + run: pip install cleancloud + + - name: Validate AI permissions + run: cleancloud doctor --provider aws --category ai + + - name: Run AI/ML scan + run: | + cleancloud scan \ + --provider aws \ + --category ai \ + --all-regions \ + --output json \ + --output-file ai-scan-results.json \ + --fail-on-confidence HIGH + + - name: Upload scan results + if: always() + uses: actions/upload-artifact@v4 + with: + name: cleancloud-ai-scan-results + path: ai-scan-results.json + retention-days: 30 +``` + +> To run hygiene and AI rules together, use `--category all`. + ### Azure with OIDC (Recommended) ```yaml diff --git a/docs/infosec-readiness.md b/docs/infosec-readiness.md index f5df311..439cd86 100644 --- a/docs/infosec-readiness.md +++ b/docs/infosec-readiness.md @@ -266,33 +266,46 @@ The IAM Proof Pack includes: --- -#### 1. AWS IAM Policy (Read-Only) +#### 1. 
AWS IAM Policies (Read-Only) -**Policy:** [`security/aws-readonly-policy.json`](../security/aws-readonly-policy.json) +AWS permissions are split across three composable policy files under [`security/aws/`](../security/aws/): + +| File | Purpose | Required for | +|------|---------|--------------| +| [`base-readonly.json`](../security/aws/base-readonly.json) | STS identity + CloudWatch metrics | All scans | +| [`hygiene-readonly.json`](../security/aws/hygiene-readonly.json) | EC2, RDS, ELB, S3, logs | `--category hygiene` (default) | +| [`ai-readonly.json`](../security/aws/ai-readonly.json) | SageMaker | `--category ai` | **Verification:** ```bash -# Download policy -curl -o aws-readonly-policy.json \ - https://raw.githubusercontent.com/cleancloud-io/cleancloud/main/security/aws-readonly-policy.json +# Download and verify each policy +for f in base-readonly.json hygiene-readonly.json ai-readonly.json; do + curl -o $f \ + https://raw.githubusercontent.com/cleancloud-io/cleancloud/main/security/aws/$f + echo "=== $f ===" && cat $f | jq '.Statement[].Action[]' \ + | grep -iE '(delete|put|create|update|tag)' || echo "PASS: no write permissions" +done +``` -# Verify no write/delete/tag permissions -cat aws-readonly-policy.json | jq '.Statement[].Action[]' | grep -iE '(delete|put|create|update|tag)' -# Expected: No results (exit code 1) +**Attach to OIDC role (hygiene scan):** -# Create IAM policy (optional - for testing) -aws iam create-policy --policy-name CleanCloudReadOnly \ - --policy-document file://aws-readonly-policy.json +```bash +aws iam put-role-policy --role-name CleanCloudCIReadOnly \ + --policy-name CleanCloudBase \ + --policy-document file://base-readonly.json + +aws iam put-role-policy --role-name CleanCloudCIReadOnly \ + --policy-name CleanCloudHygiene \ + --policy-document file://hygiene-readonly.json ``` -**Attach to OIDC role:** +**Additionally, for AI/ML scans (`--category ai`):** ```bash -# Attach to existing role -aws iam attach-role-policy \ - 
--role-name CleanCloudCIReadOnly \ - --policy-arn arn:aws:iam::123456789012:policy/CleanCloudReadOnly +aws iam put-role-policy --role-name CleanCloudCIReadOnly \ + --policy-name CleanCloudAI \ + --policy-document file://ai-readonly.json ``` --- @@ -344,21 +357,22 @@ az role assignment create \ # File: verify-aws-policy.sh # Verifies AWS IAM policy is read-only -POLICY_FILE="aws-readonly-policy.json" +for POLICY_FILE in base-readonly.json hygiene-readonly.json ai-readonly.json; do -echo " Verifying AWS IAM Policy: $POLICY_FILE" + echo " Verifying AWS IAM Policy: $POLICY_FILE" -# Check for forbidden actions -FORBIDDEN=$(cat $POLICY_FILE | jq -r '.Statement[].Action[]?' | grep -iE '(delete|put|create|update|tag|modify|terminate|reboot|stop|start)') + # Check for forbidden actions + FORBIDDEN=$(cat $POLICY_FILE | jq -r '.Statement[].Action[]?' | grep -iE '(delete|put|create|update|tag|modify|terminate|reboot|stop|start)') -if [ -z "$FORBIDDEN" ]; then - echo " PASS: No write/delete/tag permissions found" - exit 0 -else - echo " FAIL: Found forbidden permissions:" - echo "$FORBIDDEN" - exit 1 -fi + if [ -z "$FORBIDDEN" ]; then + echo " PASS: No write/delete/tag permissions found" + else + echo " FAIL: Found forbidden permissions:" + echo "$FORBIDDEN" + exit 1 + fi + +done ``` **Usage:** @@ -551,16 +565,20 @@ git clone https://github.com/cleancloud-io/cleancloud.git cd cleancloud # IAM Proof Pack files: -# - security/aws-readonly-policy.json (AWS IAM policy) +# - security/aws/base-readonly.json (AWS base policy — all scans) +# - security/aws/hygiene-readonly.json (AWS hygiene rules policy) +# - security/aws/ai-readonly.json (AWS AI/ML rules policy) # - docs/infosec-readiness.md (this document) # - tests/cleancloud/safety/ (automated safety tests) ``` -**Quick Download (AWS IAM Policy):** +**Quick Download (AWS IAM Policies):** ```bash -curl -o aws-readonly-policy.json \ - https://raw.githubusercontent.com/cleancloud-io/cleancloud/main/security/aws-readonly-policy.json 
+for f in base-readonly.json hygiene-readonly.json ai-readonly.json; do + curl -o $f \ + https://raw.githubusercontent.com/cleancloud-io/cleancloud/main/security/aws/$f +done ``` --- @@ -1065,7 +1083,7 @@ CleanCloud output may contain: #### AWS IAM Policy -The minimum required permissions are **read-only** — see [`security/aws-readonly-policy.json`](../security/aws-readonly-policy.json) for the canonical policy. +The minimum required permissions are **read-only** — see [`security/aws/`](../security/aws/) for the canonical policy files (`base-readonly.json` + `hygiene-readonly.json` for default scans; add `ai-readonly.json` for `--category ai`). **Characteristics:** - Zero `Delete*`, `Put*`, `Create*`, `Update*`, or `Tag*` permissions diff --git a/docs/rules.md b/docs/rules.md index 14010df..9592b3a 100644 --- a/docs/rules.md +++ b/docs/rules.md @@ -1,6 +1,6 @@ # CleanCloud Rules -Complete reference for all 30 hygiene rules implemented by CleanCloud. +Complete reference for all 31 rules implemented by CleanCloud (30 hygiene + 1 AI/ML). --- @@ -50,6 +50,7 @@ Every finding includes a confidence level: | `aws.rds.snapshot.old` | Storage | Manual RDS snapshots older than 90 days | | `aws.cloudwatch.logs.infinite_retention` | Observability | Log groups with no retention policy | | `aws.resource.untagged` | Governance | EC2/S3/CloudWatch resources with zero tags | +| `aws.sagemaker.endpoint.idle` | AI/ML | SageMaker endpoints with zero invocations 14+ days *(opt-in: `--category ai`)* | **Azure:** @@ -696,6 +697,44 @@ Confidence thresholds and signal weighting are documented in [confidence.md](con - `s3:GetBucketTagging` - `logs:DescribeLogGroups` +### AI/ML Waste + +#### Idle SageMaker Endpoints + +**Rule ID:** `aws.sagemaker.endpoint.idle` + +**Category:** `ai` + +**What it detects:** SageMaker inference endpoints in `InService` state with zero invocations over 14+ days. 
GPU-backed endpoints (`ml.g4dn`, `ml.g5`, `ml.p3`, `ml.p4d`, `ml.p5`, Inferentia, Trainium) are flagged as HIGH risk due to significantly higher hourly cost. + +**Confidence:** +- **HIGH:** Zero invocations for the full 14-day window (endpoint age ≥ 14 days) +- **MEDIUM:** Zero invocations but endpoint is 10–13 days old (at least 75% of the 14-day threshold); younger endpoints are not reported + +**Risk:** +- **HIGH:** GPU/accelerator-backed instance (`ml.g4dn.*`, `ml.g5.*`, `ml.p3.*`, `ml.p4d.*`, etc.) +- **MEDIUM:** CPU-backed instance + +**Why this matters:** +- SageMaker endpoints accrue charges continuously while `InService`, regardless of traffic +- GPU-backed endpoints cost $500–$23K/month depending on instance type +- Endpoints deployed for experiments or demos are frequently abandoned after initial testing +- Multi-variant endpoints multiply the cost per variant + +**Estimated monthly cost:** +- `ml.g4dn.xlarge` — ~$531/month +- `ml.g5.xlarge` — ~$600/month +- `ml.p3.2xlarge` — ~$2,754/month +- `ml.p4d.24xlarge` — ~$23,596/month +- `ml.m5.xlarge` — ~$188/month + +**Required permissions:** +- `sagemaker:ListEndpoints` +- `sagemaker:DescribeEndpoint` +- `cloudwatch:GetMetricStatistics` + +> **Not run by default.** AI/ML rules are opt-in to avoid surprising users who don't use these services. Run with `cleancloud scan --provider aws --category ai` (or `--category all` to combine with hygiene rules). If the permissions above are not granted, the rule is gracefully skipped and reported in the skipped rules section — it will not fail the scan. Attach [`security/aws/ai-readonly.json`](../security/aws/ai-readonly.json) to your IAM role to enable this rule. + --- ## Azure Rules @@ -1479,8 +1518,14 @@ This guarantees trust for long-running CI/CD integrations. 
## Coming Soon +**AI/ML (all providers):** +- Azure ML compute clusters idle (min_instances > 0, no jobs running) +- Vertex AI endpoints with zero predictions (GCP) +- SageMaker notebook instances left running unused (AWS) +- Orphaned SageMaker training artifacts in S3 (AWS) + **AWS:** -- S3 lifecycle gaps, idle SageMaker endpoints, Redshift idle +- S3 lifecycle gaps, Redshift idle, NAT Gateway routing waste **Azure:** - Azure Firewall idle, AKS node pool idle, Azure Batch unused pools diff --git a/docs/safety.md b/docs/safety.md index 9de2c9d..d935b08 100644 --- a/docs/safety.md +++ b/docs/safety.md @@ -60,7 +60,10 @@ cleancloud/ │ ├── test_gcp_runtime_readonly.py │ └── test_gcp_iam_roles_readonly.py ├── security/ # Canonical IAM policies, role definitions, and verification scripts -│ ├── aws-readonly-policy.json +│ ├── aws/ +│ │ ├── base-readonly.json # STS + CloudWatch (required for all scans) +│ │ ├── hygiene-readonly.json # EC2, RDS, ELB, S3, logs (--category hygiene) +│ │ └── ai-readonly.json # SageMaker etc. (--category ai) │ ├── azure-readonly-role.json │ ├── gcp-readonly-roles.json │ ├── verify-aws-policy.sh @@ -70,7 +73,7 @@ cleancloud/ - **`cleancloud/safety/`** → allowlists defining permitted read-only SDK methods - **`tests/cleancloud/safety/`** → static, runtime, and policy/role safety tests -- **`security/`** → canonical AWS IAM policy, Azure role definition, and verification scripts +- **`security/aws/`** → split AWS IAM policies: base (all scans), hygiene, ai; plus Azure and GCP definitions --- @@ -92,9 +95,9 @@ cleancloud/ ### IAM Policy Test - File: `tests/cleancloud/safety/aws/test_aws_iam_policy_readonly.py` -- Purpose: Ensure `security/aws-readonly-policy.json` grants **read-only permissions only**. -- Checks: No `Delete*`, `Put*`, `Update*`, `Create*` actions. -- CI failure occurs if policy grants unsafe actions. +- Purpose: Ensure all policies under `security/aws/` grant **read-only permissions only**. 
+- Checks: No `Delete*`, `Put*`, `Update*`, `Create*` actions in any of the three policy files. +- CI failure occurs if any policy grants unsafe actions. --- diff --git a/security/README.md b/security/README.md index 934aef9..decea2b 100644 --- a/security/README.md +++ b/security/README.md @@ -14,10 +14,12 @@ This directory contains the **IAM Proof Pack** - a collection of artifacts that | File | Description | |------|-------------| -| `aws-readonly-policy.json` | AWS IAM policy with minimum required read-only permissions | +| `aws/base-readonly.json` | AWS IAM policy — STS + CloudWatch (required for all scans) | +| `aws/hygiene-readonly.json` | AWS IAM policy — EC2, RDS, ELB, S3, logs (`--category hygiene`, default) | +| `aws/ai-readonly.json` | AWS IAM policy — SageMaker etc. (`--category ai`) | | `azure-readonly-role.json` | Azure custom role definition with minimum required read-only permissions | | `gcp-readonly-roles.json` | GCP predefined IAM roles required for read-only scanning | -| `verify-aws-policy.sh` | Script to verify AWS IAM policy contains no write/delete permissions | +| `verify-aws-policy.sh` | Script to verify AWS IAM policies contain no write/delete permissions | | `verify-azure-role.sh` | Script to verify Azure role is read-only | | `verify-gcp-roles.sh` | Script to verify GCP service account has only read-only roles | @@ -27,52 +29,32 @@ This directory contains the **IAM Proof Pack** - a collection of artifacts that ### 1. AWS IAM Policy Verification -**Verify the policy file:** +**Verify all three policy files:** ```bash -./verify-aws-policy.sh aws-readonly-policy.json +./verify-aws-policy.sh ``` -**Expected output:** -``` -Verifying AWS IAM Policy: aws-readonly-policy.json +The script iterates over `aws/base-readonly.json`, `aws/hygiene-readonly.json`, and `aws/ai-readonly.json` automatically. 
-PASS: No write/delete/tag permissions found +**Attach to IAM role (hygiene scan — default):** -Allowed actions: -cloudwatch:GetMetricStatistics -ec2:DescribeAddresses -ec2:DescribeImages -ec2:DescribeInstances -ec2:DescribeNatGateways -ec2:DescribeNetworkInterfaces -ec2:DescribeRegions -ec2:DescribeSecurityGroups -ec2:DescribeSnapshots -ec2:DescribeVolumes -elasticloadbalancing:DescribeLoadBalancers -elasticloadbalancing:DescribeTargetGroups -elasticloadbalancing:DescribeTargetHealth -logs:DescribeLogGroups -rds:DescribeDBInstances -rds:DescribeDBSnapshots -s3:GetBucketTagging -s3:ListAllMyBuckets -sts:GetCallerIdentity +```bash +aws iam put-role-policy --role-name CleanCloudCIReadOnly \ + --policy-name CleanCloudBase \ + --policy-document file://aws/base-readonly.json + +aws iam put-role-policy --role-name CleanCloudCIReadOnly \ + --policy-name CleanCloudHygiene \ + --policy-document file://aws/hygiene-readonly.json ``` -**Create IAM policy in AWS:** +**Additionally, for AI/ML scans (`--category ai`):** ```bash -# Create IAM policy -aws iam create-policy \ - --policy-name CleanCloudReadOnly \ - --policy-document file://aws-readonly-policy.json - -# Attach to existing role (e.g., OIDC role) -aws iam attach-role-policy \ - --role-name CleanCloudCIReadOnly \ - --policy-arn arn:aws:iam::123456789012:policy/CleanCloudReadOnly +aws iam put-role-policy --role-name CleanCloudCIReadOnly \ + --policy-name CleanCloudAI \ + --policy-document file://aws/ai-readonly.json ``` --- diff --git a/security/aws/ai-readonly.json b/security/aws/ai-readonly.json new file mode 100644 index 0000000..0303717 --- /dev/null +++ b/security/aws/ai-readonly.json @@ -0,0 +1,14 @@ +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "SageMakerReadOnly", + "Effect": "Allow", + "Action": [ + "sagemaker:ListEndpoints", + "sagemaker:DescribeEndpoint" + ], + "Resource": "*" + } + ] +} diff --git a/security/aws/base-readonly.json b/security/aws/base-readonly.json new file mode 100644 index 
0000000..1a6640e --- /dev/null +++ b/security/aws/base-readonly.json @@ -0,0 +1,18 @@ +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "STSIdentity", + "Effect": "Allow", + "Action": "sts:GetCallerIdentity", + "Resource": "*" + }, + { + "Sid": "CloudWatchMetrics", + "Effect": "Allow", + "Action": "cloudwatch:GetMetricStatistics", + "Resource": "*", + "Comment": "Required by both hygiene rules (NAT gateway, RDS) and AI rules (SageMaker invocations)" + } + ] +} diff --git a/security/aws-readonly-policy.json b/security/aws/hygiene-readonly.json similarity index 81% rename from security/aws-readonly-policy.json rename to security/aws/hygiene-readonly.json index 09adc83..158a478 100644 --- a/security/aws-readonly-policy.json +++ b/security/aws/hygiene-readonly.json @@ -37,12 +37,9 @@ "Resource": "*" }, { - "Sid": "CloudWatchReadOnly", + "Sid": "CloudWatchLogsReadOnly", "Effect": "Allow", - "Action": [ - "logs:DescribeLogGroups", - "cloudwatch:GetMetricStatistics" - ], + "Action": "logs:DescribeLogGroups", "Resource": "*" }, { @@ -53,12 +50,6 @@ "s3:GetBucketTagging" ], "Resource": "*" - }, - { - "Sid": "STSIdentity", - "Effect": "Allow", - "Action": "sts:GetCallerIdentity", - "Resource": "*" } ] } diff --git a/security/verify-aws-policy.sh b/security/verify-aws-policy.sh index 01d6245..1b07048 100755 --- a/security/verify-aws-policy.sh +++ b/security/verify-aws-policy.sh @@ -1,33 +1,43 @@ #!/bin/bash # File: verify-aws-policy.sh -# Verifies AWS IAM policy is read-only -# Usage: ./verify-aws-policy.sh [policy-file] +# Verifies all AWS IAM policies under security/aws/ are read-only +# Usage: ./verify-aws-policy.sh set -e -POLICY_FILE="${1:-aws-readonly-policy.json}" - -if [ ! -f "$POLICY_FILE" ]; then - echo "❌ ERROR: Policy file not found: $POLICY_FILE" - exit 1 -fi - -echo "🔍 Verifying AWS IAM Policy: $POLICY_FILE" -echo "" - -# Check for forbidden actions (excluding read operations) -FORBIDDEN=$(cat "$POLICY_FILE" | jq -r '.Statement[].Action[]?' 
2>/dev/null | grep -iE '^[^:]+:(Delete|Put|Create|Update|Modify|Terminate|Reboot|Stop|Start|Attach|Detach|Tag|Untag)[A-Z]' || true) - -if [ -z "$FORBIDDEN" ]; then - echo "✅ PASS: No write/delete/tag permissions found" - echo "" - echo "Allowed actions:" - cat "$POLICY_FILE" | jq -r '.Statement[].Action[]?' 2>/dev/null | sort | uniq - echo "" - exit 0 -else - echo "❌ FAIL: Found forbidden permissions:" - echo "$FORBIDDEN" - echo "" - exit 1 -fi +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +POLICY_FILES=( + "$SCRIPT_DIR/aws/base-readonly.json" + "$SCRIPT_DIR/aws/hygiene-readonly.json" + "$SCRIPT_DIR/aws/ai-readonly.json" +) + +FAILED=0 + +for POLICY_FILE in "${POLICY_FILES[@]}"; do + POLICY_NAME=$(basename "$POLICY_FILE") + + if [ ! -f "$POLICY_FILE" ]; then + echo "❌ ERROR: Policy file not found: $POLICY_FILE" + FAILED=1 + continue + fi + + echo "🔍 Verifying AWS IAM Policy: $POLICY_NAME" + + FORBIDDEN=$(cat "$POLICY_FILE" | jq -r '.Statement[] | (.Action // empty) | if type == "array" then .[] else . end' 2>/dev/null | grep -iE '^[^:]+:(Delete|Put|Create|Update|Modify|Terminate|Reboot|Stop|Start|Attach|Detach|Tag|Untag)[A-Z]' || true) + + if [ -z "$FORBIDDEN" ]; then + echo "✅ PASS: No write/delete/tag permissions found" + echo " Allowed actions:" + cat "$POLICY_FILE" | jq -r '.Statement[] | (.Action // empty) | if type == "array" then .[] else . end'
2>/dev/null | sort | uniq | sed 's/^/ /' + echo "" + else + echo "❌ FAIL: Found forbidden permissions:" + echo "$FORBIDDEN" + echo "" + FAILED=1 + fi +done + +exit $FAILED diff --git a/tests/cleancloud/cli/test_cli_graceful_degradation.py b/tests/cleancloud/cli/test_cli_graceful_degradation.py index 6379e74..74a01e7 100644 --- a/tests/cleancloud/cli/test_cli_graceful_degradation.py +++ b/tests/cleancloud/cli/test_cli_graceful_degradation.py @@ -84,7 +84,7 @@ def test_cli_shows_skipped_rules_in_output(monkeypatch): """Scan output includes skipped rules section when permissions are missing.""" monkeypatch.setattr( "cleancloud.providers.aws.scan._scan_aws_region", - lambda profile, region: ([_fake_finding()], _SKIPPED), + lambda profile, region, rules: ([_fake_finding()], _SKIPPED), ) runner = CliRunner() @@ -104,7 +104,7 @@ def test_cli_exits_0_with_skipped_rules_and_no_policy_flags(monkeypatch): """Skipped rules alone do not trigger a non-zero exit code.""" monkeypatch.setattr( "cleancloud.providers.aws.scan._scan_aws_region", - lambda profile, region: ([], _SKIPPED), + lambda profile, region, rules: ([], _SKIPPED), ) runner = CliRunner() @@ -120,7 +120,7 @@ def test_cli_findings_still_reported_alongside_skipped_rules(monkeypatch): """Findings from successful rules are still reported when other rules are skipped.""" monkeypatch.setattr( "cleancloud.providers.aws.scan._scan_aws_region", - lambda profile, region: ([_fake_finding("vol-99")], _SKIPPED), + lambda profile, region, rules: ([_fake_finding("vol-99")], _SKIPPED), ) runner = CliRunner() @@ -138,7 +138,7 @@ def test_cli_no_skipped_section_when_all_rules_pass(monkeypatch): """No skipped rules section shown when all rules complete successfully.""" monkeypatch.setattr( "cleancloud.providers.aws.scan._scan_aws_region", - lambda profile, region: ([_fake_finding()], []), + lambda profile, region, rules: ([_fake_finding()], []), ) runner = CliRunner() diff --git a/tests/cleancloud/cli/test_cli_tag_precedence.py 
b/tests/cleancloud/cli/test_cli_tag_precedence.py index 301ca53..3c6d375 100644 --- a/tests/cleancloud/cli/test_cli_tag_precedence.py +++ b/tests/cleancloud/cli/test_cli_tag_precedence.py @@ -76,7 +76,7 @@ def test_cli_ignore_tag_overrides_yaml(monkeypatch, tmp_path): # Patch AWS scan to return fixed findings monkeypatch.setattr( "cleancloud.providers.aws.scan._scan_aws_region", - lambda profile, region: (findings, []), + lambda profile, region, rules: (findings, []), ) runner = CliRunner() diff --git a/tests/cleancloud/cli/test_scan_exit_codes.py b/tests/cleancloud/cli/test_scan_exit_codes.py deleted file mode 100644 index 21e0064..0000000 --- a/tests/cleancloud/cli/test_scan_exit_codes.py +++ /dev/null @@ -1,49 +0,0 @@ -from unittest.mock import patch - -import pytest -from click.testing import CliRunner - -import cleancloud.cli as cli_module -from cleancloud.cli import cli - -pytest.skip( - "Azure tests disabled for now", - allow_module_level=True, -) - - -class FakeResult: - def __init__(self, confidence: str): - self.confidence = confidence - - def to_dict(self): - return {"confidence": self.confidence} - - -def test_cli_exits_2_on_high_confidence(): - runner = CliRunner() - fake_results = [FakeResult("High")] - - with patch.object(cli_module, "run_scan", return_value=fake_results): - result = runner.invoke(cli, ["scan"]) - - assert result.exit_code == 2 - - -def test_cli_does_not_fail_on_medium_confidence(): - runner = CliRunner() - fake_results = [FakeResult("Medium")] - - with patch.object(cli_module, "run_scan", return_value=fake_results): - result = runner.invoke(cli, ["scan"]) - - assert result.exit_code == 0 - - -def test_cli_no_findings_exit_0(): - runner = CliRunner() - - with patch.object(cli_module, "run_scan", return_value=[]): - result = runner.invoke(cli, ["scan"]) - - assert result.exit_code == 0 diff --git a/tests/cleancloud/cli/test_scan_provider_flag_validation.py b/tests/cleancloud/cli/test_scan_provider_flag_validation.py index 
ae74959..66d9e97 100644 --- a/tests/cleancloud/cli/test_scan_provider_flag_validation.py +++ b/tests/cleancloud/cli/test_scan_provider_flag_validation.py @@ -181,3 +181,20 @@ def test_per_account_regions_rejected_for_gcp(runner): assert result.exit_code != 0 assert "--per-account-regions" in result.output assert "aws" in result.output.lower() + + +# --------------------------------------------------------------------------- +# --category ai rejected for non-AWS providers +# --------------------------------------------------------------------------- + + +def test_category_ai_rejected_for_gcp(runner): + result = runner.invoke(cli, ["scan", "--provider", "gcp", "--category", "ai"]) + assert result.exit_code != 0 + assert "aws" in result.output.lower() + + +def test_category_ai_rejected_for_azure(runner): + result = runner.invoke(cli, ["scan", "--provider", "azure", "--category", "ai"]) + assert result.exit_code != 0 + assert "aws" in result.output.lower() diff --git a/tests/cleancloud/providers/aws/test_aws_sagemaker_endpoint_idle.py b/tests/cleancloud/providers/aws/test_aws_sagemaker_endpoint_idle.py new file mode 100644 index 0000000..042ffd1 --- /dev/null +++ b/tests/cleancloud/providers/aws/test_aws_sagemaker_endpoint_idle.py @@ -0,0 +1,557 @@ +from datetime import datetime, timedelta, timezone +from unittest.mock import MagicMock + +from cleancloud.providers.aws.rules.sagemaker_endpoint_idle import ( + find_idle_sagemaker_endpoints, +) + + +def _make_session(sagemaker, cloudwatch): + session = MagicMock() + + def client_side_effect(service_name, *args, **kwargs): + if service_name == "sagemaker": + return sagemaker + if service_name == "cloudwatch": + return cloudwatch + raise ValueError(f"Unexpected service: {service_name}") + + session.client.side_effect = client_side_effect + return session + + +def _make_endpoint(name="test-endpoint", age_days=30): + now = datetime.now(timezone.utc) + return { + "EndpointName": name, + "EndpointArn": 
f"arn:aws:sagemaker:us-east-1:123456789012:endpoint/{name}", + "CreationTime": now - timedelta(days=age_days), + "LastModifiedTime": now - timedelta(days=age_days), + "EndpointStatus": "InService", + } + + +def _make_describe_response( + instance_type="ml.m5.xlarge", variant_count=1, desired_instance_count=1 +): + """Build a describe_endpoint response with DesiredInstanceCount on each variant.""" + variants = [ + { + "VariantName": f"variant-{i}", + "CurrentInstanceType": instance_type, + "CurrentInstanceCount": desired_instance_count, + "DesiredInstanceCount": desired_instance_count, + } + for i in range(variant_count) + ] + return {"ProductionVariants": variants} + + +def _no_invocations(): + """CloudWatch returns no datapoints for an idle endpoint. + + SageMaker only publishes the Invocations metric when invocations actually occur. + An endpoint with zero invocations has no metric series — empty datapoints. + """ + return {"Datapoints": []} + + +def _has_invocations(): + return {"Datapoints": [{"Sum": 500.0, "Timestamp": datetime.now(timezone.utc)}]} + + +# --------------------------------------------------------------------------- +# Core detection +# --------------------------------------------------------------------------- + + +def test_idle_cpu_endpoint_detected(): + """Idle CPU endpoint with zero invocations should be flagged as MEDIUM risk, HIGH confidence.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.m5.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 1 + f = findings[0] + assert f.rule_id == "aws.sagemaker.endpoint.idle" + assert f.resource_type == 
"aws.sagemaker.endpoint" + assert f.resource_id == "test-endpoint" + assert f.confidence.value == "high" + assert f.risk.value == "medium" + assert f.details["is_gpu"] is False + assert f.details["instance_type"] == "ml.m5.xlarge" + assert f.details["age_days"] == 30 + assert f.details["total_instances"] == 1 + + +def test_idle_gpu_endpoint_detected_high_risk(): + """Idle GPU endpoint should be flagged as HIGH risk with elevated cost estimate.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.p3.2xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 1 + f = findings[0] + assert f.risk.value == "high" + assert f.details["is_gpu"] is True + assert f.details["instance_type"] == "ml.p3.2xlarge" + assert f.estimated_monthly_cost_usd == 2754.0 + + +def test_active_endpoint_skipped(): + """Endpoint with invocations should NOT be flagged.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + cloudwatch.get_metric_statistics.return_value = _has_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 0 + + +def test_young_endpoint_skipped(): + """Endpoint younger than minimum threshold should NOT be flagged.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=3)]}] + cloudwatch.get_metric_statistics.return_value = _no_invocations() 
+ + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 0 + + +def test_timezone_naive_creation_time_handled(): + """boto3 may return timezone-naive CreationTime; endpoint should still be correctly aged.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + # Use a naive datetime (no tzinfo) — older boto3 behaviour + naive_create_time = datetime.now() - timedelta(days=30) + assert naive_create_time.tzinfo is None + endpoint = _make_endpoint(age_days=30) + endpoint["CreationTime"] = naive_create_time + + paginator.paginate.return_value = [{"Endpoints": [endpoint]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.m5.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + # Should be flagged — age correctly computed despite naive timestamp + assert len(findings) == 1 + assert findings[0].details["age_days"] >= 29 # allow off-by-one from sub-second timing + + +def test_no_endpoints_returns_empty(): + """No endpoints should return empty findings.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": []}] + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert findings == [] + + +# --------------------------------------------------------------------------- +# Scaled-to-zero endpoints +# --------------------------------------------------------------------------- + + +def test_scaled_to_zero_endpoint_skipped(): + """Endpoint with all variants scaled to zero instances should NOT be flagged (no compute cost).""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = 
sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response( + "ml.m5.xlarge", desired_instance_count=0 + ) + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 0 + + +def test_missing_desired_instance_count_treated_as_zero(): + """DesiredInstanceCount missing from API response should be treated as 0, not 1.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + # No DesiredInstanceCount key — AWS response omits it + sagemaker.describe_endpoint.return_value = { + "ProductionVariants": [{"VariantName": "v1", "CurrentInstanceType": "ml.m5.xlarge"}] + } + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + # total_instances = 0 (missing treated as 0) → scaled-to-zero, no cost → skip + assert len(findings) == 0 + + +def test_partial_scaled_to_zero_still_flagged(): + """Multi-variant endpoint where only some variants are zero should still be flagged if total > 0.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + # 2 variants: one with 1 instance, one with 0 + sagemaker.describe_endpoint.return_value = { + "ProductionVariants": [ + {"VariantName": "v1", "CurrentInstanceType": "ml.m5.xlarge", "DesiredInstanceCount": 1}, + {"VariantName": "v2", "CurrentInstanceType": "ml.m5.xlarge", "DesiredInstanceCount": 0}, + ] + } + cloudwatch.get_metric_statistics.return_value = 
_no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 1 + assert findings[0].details["total_instances"] == 1 + + +# --------------------------------------------------------------------------- +# Empty datapoints — unknown state +# --------------------------------------------------------------------------- + + +def test_empty_datapoints_on_old_endpoint_treated_as_idle(): + """Old endpoint with no CloudWatch data (metric series never initialized) should be flagged. + + CloudWatch only publishes SageMaker Invocations when invocations occur. An endpoint + that has never been called has no metric series. Since the age guard already filters + out endpoints < 7 days old, empty datapoints on a 30-day-old endpoint means idle. + """ + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.m5.xlarge") + cloudwatch.get_metric_statistics.return_value = {"Datapoints": []} + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 1 + assert findings[0].confidence.value == "high" + + +# --------------------------------------------------------------------------- +# Effective window (age vs idle period) +# --------------------------------------------------------------------------- + + +def test_effective_window_capped_to_age(): + """For an endpoint younger than days_idle, the effective window is capped to age.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + # age=12, days_idle=14 → effective_window=12 + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=12)]}] + sagemaker.describe_endpoint.return_value = 
_make_describe_response("ml.m5.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 1 + assert findings[0].details["idle_window_days"] == 12 + assert findings[0].details["idle_days_threshold"] == 14 + + +def test_very_small_effective_window_skipped(): + """Effective window < 3 days is too narrow for a reliable conclusion — skip. + + Setup: days_idle=2, age=8 + - Early age guard: age=8 >= max(2//2=1, 7) = 7 → passes + - effective_window = min(2, 8) = 2 < 3 → skipped by the window guard + """ + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=8)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.m5.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1", days_idle=2) + + assert len(findings) == 0 + + +# --------------------------------------------------------------------------- +# Confidence levels +# --------------------------------------------------------------------------- + + +def test_high_confidence_for_old_endpoint(): + """Endpoint older than days_idle should be HIGH confidence.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.m5.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert findings[0].confidence.value == "high" + + +def 
test_medium_confidence_for_borderline_age(): + """Endpoint at 75% of idle threshold should be MEDIUM confidence.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + # age_days=11, int(14 * 0.75)=10 → 11 >= 10 → MEDIUM + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=11)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.m5.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 1 + assert findings[0].confidence.value == "medium" + + +def test_borderline_age_below_threshold_skipped(): + """Endpoint below the 75% confidence threshold should be skipped.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + # age_days=8, int(14 * 0.75)=10 → 8 < 10 → skip + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=8)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.m5.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 0 + + +# --------------------------------------------------------------------------- +# GPU family detection +# --------------------------------------------------------------------------- + + +def test_g4dn_instance_detected_as_gpu(): + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.g4dn.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, 
cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert findings[0].details["is_gpu"] is True + assert findings[0].risk.value == "high" + assert findings[0].estimated_monthly_cost_usd == 531.0 + + +def test_inf1_instance_detected_as_gpu(): + """Inferentia instances should be classified as GPU-class (accelerator).""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response("ml.inf1.xlarge") + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert findings[0].details["is_gpu"] is True + + +# --------------------------------------------------------------------------- +# Multi-variant cost scaling +# --------------------------------------------------------------------------- + + +def test_multi_variant_cost_scaled(): + """Cost estimate should sum across all production variants × DesiredInstanceCount.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = _make_describe_response( + "ml.m5.xlarge", variant_count=3 + ) + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 1 + # ml.m5.xlarge = $188/month × 3 variants × 1 instance each + assert findings[0].estimated_monthly_cost_usd == 188.0 * 3 + assert findings[0].details["variant_count"] == 3 + assert findings[0].details["total_instances"] == 3 + + +def test_multi_variant_mixed_instance_types_cost(): + """Cost should sum 
per-variant when variants have different instance types.""" + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.return_value = { + "ProductionVariants": [ + { + "VariantName": "cpu", + "CurrentInstanceType": "ml.m5.xlarge", + "DesiredInstanceCount": 2, + }, + { + "VariantName": "gpu", + "CurrentInstanceType": "ml.g4dn.xlarge", + "DesiredInstanceCount": 1, + }, + ] + } + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 1 + f = findings[0] + # ml.m5.xlarge × 2 + ml.g4dn.xlarge × 1 = 188×2 + 531×1 = 907 + assert f.estimated_monthly_cost_usd == 188.0 * 2 + 531.0 * 1 + # GPU because second variant is GPU + assert f.details["is_gpu"] is True + assert f.details["total_instances"] == 3 + + +# --------------------------------------------------------------------------- +# Resilience +# --------------------------------------------------------------------------- + + +def test_cloudwatch_failure_treated_as_active(): + """If CloudWatch metrics fail, endpoint should NOT be flagged (avoid false positives).""" + from botocore.exceptions import ClientError + + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + cloudwatch.get_metric_statistics.side_effect = ClientError( + {"Error": {"Code": "InternalError", "Message": "Service error"}}, + "GetMetricStatistics", + ) + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 0 + + +def test_describe_endpoint_failure_skips_endpoint(): + """If describe_endpoint fails, endpoint is skipped (unknown 
state — don't accuse).""" + from botocore.exceptions import ClientError + + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.return_value = [{"Endpoints": [_make_endpoint(age_days=30)]}] + sagemaker.describe_endpoint.side_effect = ClientError( + {"Error": {"Code": "ValidationException", "Message": "Not found"}}, + "DescribeEndpoint", + ) + cloudwatch.get_metric_statistics.return_value = _no_invocations() + + session = _make_session(sagemaker, cloudwatch) + findings = find_idle_sagemaker_endpoints(session, "us-east-1") + + assert len(findings) == 0 + + +def test_permission_error_raised(): + """AccessDenied on ListEndpoints should raise PermissionError.""" + from botocore.exceptions import ClientError + + sagemaker = MagicMock() + cloudwatch = MagicMock() + + paginator = sagemaker.get_paginator.return_value + paginator.paginate.side_effect = ClientError( + {"Error": {"Code": "AccessDenied", "Message": "Access denied"}}, + "ListEndpoints", + ) + + session = _make_session(sagemaker, cloudwatch) + + try: + find_idle_sagemaker_endpoints(session, "us-east-1") + assert False, "Expected PermissionError" + except PermissionError as e: + assert "sagemaker:ListEndpoints" in str(e) + + +# --------------------------------------------------------------------------- +# RULE_METADATA +# --------------------------------------------------------------------------- + + +def test_rule_metadata_present(): + """Rule must expose RULE_METADATA with correct fields.""" + from cleancloud.providers.aws.rules.sagemaker_endpoint_idle import RULE_METADATA + + assert RULE_METADATA["id"] == "aws.sagemaker.endpoint.idle" + assert RULE_METADATA["category"] == "ai" + assert RULE_METADATA["service"] == "sagemaker" + assert RULE_METADATA["cost_impact"] == "high" diff --git a/tests/cleancloud/providers/aws/test_aws_scan_graceful_degradation.py b/tests/cleancloud/providers/aws/test_aws_scan_graceful_degradation.py index 
f91e289..f819cdc 100644 --- a/tests/cleancloud/providers/aws/test_aws_scan_graceful_degradation.py +++ b/tests/cleancloud/providers/aws/test_aws_scan_graceful_degradation.py @@ -49,11 +49,9 @@ def _error_rule(session, region): @patch("cleancloud.providers.aws.scan.create_aws_session") def test_permission_error_goes_to_skipped_not_failed(mock_session): """A rule raising PermissionError is recorded in skipped_rules, not as a scan failure.""" - with patch( - "cleancloud.providers.aws.scan.AWS_RULES", - [_good_rule, _permission_error_rule], - ): - findings, skipped_rules = _scan_aws_region(profile=None, region="us-east-1") + findings, skipped_rules = _scan_aws_region( + profile=None, region="us-east-1", rules=[_good_rule, _permission_error_rule] + ) assert len(findings) == 1 assert len(skipped_rules) == 1 @@ -64,11 +62,11 @@ def test_permission_error_goes_to_skipped_not_failed(mock_session): @patch("cleancloud.providers.aws.scan.create_aws_session") def test_all_permission_errors_returns_empty_no_exception(mock_session): """All rules failing with PermissionError returns ([], skipped_all) without RuntimeError.""" - with patch( - "cleancloud.providers.aws.scan.AWS_RULES", - [_permission_error_rule, _another_permission_error_rule], - ): - findings, skipped_rules = _scan_aws_region(profile=None, region="us-east-1") + findings, skipped_rules = _scan_aws_region( + profile=None, + region="us-east-1", + rules=[_permission_error_rule, _another_permission_error_rule], + ) assert findings == [] assert len(skipped_rules) == 2 @@ -80,11 +78,9 @@ def test_all_permission_errors_returns_empty_no_exception(mock_session): @patch("cleancloud.providers.aws.scan.create_aws_session") def test_mixed_success_and_permission_error(mock_session): """Findings from successful rules are returned alongside skipped rule info.""" - with patch( - "cleancloud.providers.aws.scan.AWS_RULES", - [_good_rule, _permission_error_rule], - ): - findings, skipped_rules = _scan_aws_region(profile=None, 
region="us-east-1") + findings, skipped_rules = _scan_aws_region( + profile=None, region="us-east-1", rules=[_good_rule, _permission_error_rule] + ) assert len(findings) == 1 assert findings[0].resource_id == "vol-1" @@ -94,11 +90,9 @@ def test_mixed_success_and_permission_error(mock_session): @patch("cleancloud.providers.aws.scan.create_aws_session") def test_non_permission_error_still_counted_as_failure(mock_session): """Non-PermissionError exceptions are still counted as rule failures (existing behavior).""" - with patch( - "cleancloud.providers.aws.scan.AWS_RULES", - [_good_rule, _error_rule], - ): - findings, skipped_rules = _scan_aws_region(profile=None, region="us-east-1") + findings, skipped_rules = _scan_aws_region( + profile=None, region="us-east-1", rules=[_good_rule, _error_rule] + ) # Good rule still returns findings assert len(findings) == 1 @@ -109,15 +103,12 @@ def test_non_permission_error_still_counted_as_failure(mock_session): @patch("cleancloud.providers.aws.scan.create_aws_session") def test_skipped_rules_deduplicated_across_regions(mock_session): """scan_aws_regions deduplicates the same skipped rule reported by multiple regions.""" - with patch( - "cleancloud.providers.aws.scan.AWS_RULES", - [_good_rule, _permission_error_rule], - ): - # Both regions will skip _permission_error_rule — should only appear once in summary - findings, skipped_rules = scan_aws_regions( - profile=None, - regions_to_scan=["us-east-1", "eu-west-1"], - ) + # Both regions will skip _permission_error_rule — should only appear once in summary + findings, skipped_rules = scan_aws_regions( + profile=None, + regions_to_scan=["us-east-1", "eu-west-1"], + rules=[_good_rule, _permission_error_rule], + ) assert len(findings) == 2 # one finding per region # Same rule skipped in both regions — deduplicated to one entry @@ -128,10 +119,10 @@ def test_skipped_rules_deduplicated_across_regions(mock_session): @patch("cleancloud.providers.aws.scan.create_aws_session") def 
test_region_still_stamped_on_findings(mock_session): """Region is correctly set on findings even when some rules are skipped.""" - with patch( - "cleancloud.providers.aws.scan.AWS_RULES", - [_good_rule, _permission_error_rule], - ): - findings, _ = _scan_aws_region(profile=None, region="ap-southeast-1") + findings, _ = _scan_aws_region( + profile=None, + region="ap-southeast-1", + rules=[_good_rule, _permission_error_rule], + ) assert all(f.region == "ap-southeast-1" for f in findings) diff --git a/tests/cleancloud/providers/aws/test_multi_account.py b/tests/cleancloud/providers/aws/test_multi_account.py index eed258f..1fb15c4 100644 --- a/tests/cleancloud/providers/aws/test_multi_account.py +++ b/tests/cleancloud/providers/aws/test_multi_account.py @@ -136,7 +136,7 @@ def test_scan_account_all_regions_uses_assumed_session( # Region discovery must use the assumed session, not hub mock_get_regions.assert_called_once_with(assumed_session) - mock_scan.assert_called_once_with(assumed_session, ["us-east-1", "eu-west-1"]) + mock_scan.assert_called_once_with(assumed_session, ["us-east-1", "eu-west-1"], rules=None) @patch("cleancloud.providers.aws.multi_account.scan_aws_regions_with_session") @@ -160,7 +160,7 @@ def test_scan_account_all_regions_falls_back_to_us_east_1_when_none_detected( regions_override=None, # triggers per-account discovery ) - mock_scan.assert_called_once_with(mock_assume.return_value, ["us-east-1"]) + mock_scan.assert_called_once_with(mock_assume.return_value, ["us-east-1"], rules=None) assert result.regions_scanned == ["us-east-1"] diff --git a/tests/cleancloud/safety/aws/test_aws_iam_policy_readonly.py b/tests/cleancloud/safety/aws/test_aws_iam_policy_readonly.py index a38433e..f2cd983 100644 --- a/tests/cleancloud/safety/aws/test_aws_iam_policy_readonly.py +++ b/tests/cleancloud/safety/aws/test_aws_iam_policy_readonly.py @@ -5,16 +5,21 @@ from cleancloud.safety.aws.allowlist import FORBIDDEN_AWS_API_PREFIXES -POLICY_PATH = 
Path("security/aws-readonly-policy.json") +POLICY_FILES = [ + Path("security/aws/base-readonly.json"), + Path("security/aws/hygiene-readonly.json"), + Path("security/aws/ai-readonly.json"), +] @pytest.mark.safety @pytest.mark.aws -def test_aws_iam_policy_is_strictly_read_only(): +@pytest.mark.parametrize("policy_path", POLICY_FILES, ids=lambda p: p.name) +def test_aws_iam_policy_is_strictly_read_only(policy_path): """ - Ensure the published AWS IAM policy never grants mutating permissions. + Ensure all published AWS IAM policies never grant mutating permissions. """ - policy = json.loads(POLICY_PATH.read_text()) + policy = json.loads(policy_path.read_text()) for statement in policy.get("Statement", []): actions = statement.get("Action", []) @@ -23,4 +28,6 @@ def test_aws_iam_policy_is_strictly_read_only(): for action in actions: for forbidden in FORBIDDEN_AWS_API_PREFIXES: - assert forbidden not in action, f"Forbidden IAM action detected: {action}" + assert ( + forbidden not in action + ), f"Forbidden IAM action in {policy_path.name}: {action}" diff --git a/tests/cleancloud/test_output_markdown.py b/tests/cleancloud/test_output_markdown.py index e97ad2a..853ebc9 100644 --- a/tests/cleancloud/test_output_markdown.py +++ b/tests/cleancloud/test_output_markdown.py @@ -50,7 +50,7 @@ def test_no_findings_output(): summary = _make_summary([]) summary["total_findings"] = 0 output = write_markdown([], summary) - assert "No hygiene issues detected" in output + assert "No issues detected" in output def test_findings_table_present(): diff --git a/tests/e2e/aws/test_aws_ai_rules_smoke.py b/tests/e2e/aws/test_aws_ai_rules_smoke.py new file mode 100644 index 0000000..164f0e2 --- /dev/null +++ b/tests/e2e/aws/test_aws_ai_rules_smoke.py @@ -0,0 +1,44 @@ +from datetime import datetime + +import boto3 +import pytest + +from cleancloud.core.finding import Finding +from cleancloud.providers.aws.rules.sagemaker_endpoint_idle import ( + find_idle_sagemaker_endpoints, +) + + 
+@pytest.mark.e2e
+@pytest.mark.aws
+def test_aws_ai_rules_run_without_error():
+    session = boto3.Session()
+    region = "us-east-1"
+
+    rules = [
+        find_idle_sagemaker_endpoints,
+    ]
+
+    all_results = []
+    skipped = []
+    for rule in rules:
+        try:
+            rule_results = rule(session, region)
+        except PermissionError:
+            skipped.append(rule.__name__)
+            continue
+        assert isinstance(
+            rule_results, list
+        ), f"{rule.__name__} returned {type(rule_results)} instead of list"
+        all_results.extend(rule_results)
+
+    if skipped:
+        pytest.skip(f"Rules skipped (missing permissions): {', '.join(skipped)}")
+
+    for f in all_results:
+        assert isinstance(f, Finding)
+        assert f.provider == "aws"
+        assert f.rule_id.startswith("aws.sagemaker.")
+        assert f.resource_id
+        assert f.region
+        assert f.detected_at and isinstance(f.detected_at, datetime)
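
Reviewer note: the multi-variant unit test earlier in this diff asserts a per-variant cost sum (ml.m5.xlarge × 2 + ml.g4dn.xlarge × 1 = 188×2 + 531×1 = 907), a total instance count of 3, and `is_gpu` being true because one variant is a GPU type. The sketch below shows one way that aggregation could work. It is not the rule's actual implementation: `summarize_variants`, the price table, and the GPU family list are hypothetical names introduced here, and only the two prices are taken from the test comment.

```python
# Assumed monthly prices in USD; the two values come from the test comment,
# the table itself is a hypothetical stand-in for the rule's real pricing data.
ASSUMED_MONTHLY_PRICE_USD = {
    "ml.m5.xlarge": 188.0,
    "ml.g4dn.xlarge": 531.0,
}

# Assumed GPU instance families; the real rule may detect GPUs differently.
ASSUMED_GPU_FAMILIES = ("ml.g4dn", "ml.g5", "ml.p3", "ml.p4d")


def summarize_variants(variants):
    """Aggregate cost, instance count and GPU flag across production variants."""
    total_cost = 0.0
    total_instances = 0
    is_gpu = False
    for variant in variants:
        instance_type = variant["CurrentInstanceType"]
        count = variant["DesiredInstanceCount"]
        # Cost is computed per variant, so mixed instance types are priced correctly.
        total_cost += ASSUMED_MONTHLY_PRICE_USD[instance_type] * count
        total_instances += count
        # "ml.g4dn.xlarge" -> "ml.g4dn"; one GPU variant marks the whole endpoint.
        family = instance_type.rsplit(".", 1)[0]
        if family in ASSUMED_GPU_FAMILIES:
            is_gpu = True
    return {
        "estimated_monthly_cost_usd": total_cost,
        "total_instances": total_instances,
        "is_gpu": is_gpu,
    }


# Same variant shape as the mocked describe_endpoint response in the unit test.
example = summarize_variants(
    [
        {"VariantName": "cpu", "CurrentInstanceType": "ml.m5.xlarge", "DesiredInstanceCount": 2},
        {"VariantName": "gpu", "CurrentInstanceType": "ml.g4dn.xlarge", "DesiredInstanceCount": 1},
    ]
)
print(example)
```

Summing per variant, rather than multiplying one instance price by the total count, is what the test's expected value of 907 relies on.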