# 🌀 Week 11-12 · Notebook 08 · Advanced Orchestration with Kubernetes (GKE)

This notebook explores deploying the **Manufacturing Copilot** to Google Kubernetes Engine (GKE) for maximum control, advanced deployment strategies, and complex networking scenarios. This represents the most advanced deployment option for our capstone project.

> ⭐ **IMPORTANT**: The capstone project now includes **production-ready Kubernetes deployment** with complete Helm charts and kubectl manifests!
> 
> **✅ What's Already Implemented:**
> - 📦 Complete Helm chart in `../capstone_project/charts/manufacturing-copilot/`
> - 📋 Plain Kubernetes manifests in `../capstone_project/kubernetes/`
> - 🔧 Deployment scripts: `deploy-k8s.sh` and `deploy-k8s.ps1`
> - 📖 Comprehensive guide: `../capstone_project/KUBERNETES_DEPLOYMENT.md`
> - 🎯 Features: HPA, PDB, NetworkPolicy, Ingress, Prometheus, Security Contexts
> 
> This notebook teaches **advanced concepts** (Argo Rollouts, Service Mesh) that build upon the production deployment.

## 🎯 Learning Objectives

- **Package with Helm:** Use Helm, the Kubernetes package manager, to create a reusable and configurable chart for deploying the copilot application.
- **Configure Advanced Autoscaling:** Implement the Horizontal Pod Autoscaler (HPA) to automatically scale the number of application pods based on CPU utilization or custom metrics.
- **Implement Canary Rollouts:** Use a progressive delivery controller like Argo Rollouts to safely release new versions by gradually shifting traffic and running automated analysis.
- **Secure with a Service Mesh:** Introduce Anthos Service Mesh (ASM) or Istio to enforce mutual TLS (mTLS) for secure pod-to-pod communication and to gain deep traffic telemetry.


## 🧩 Scenario: High-Availability, Multi-Service Deployment

As the Manufacturing Copilot becomes business-critical, the requirements for its deployment have become more stringent. Cloud Run is excellent, but the operations team needs more control.

**New Requirements:**
1.  **Complex Service Topologies:** The application is no longer a single container. It now includes a sidecar container for real-time retrieval, which must be deployed alongside the main API container in the same pod.
2.  **Zero-Downtime, Risk-Managed Rollouts:** New versions must be rolled out to a small subset of users first (a "canary" release). The system must automatically monitor key metrics (like hallucination rate) and roll back if they degrade.
3.  **Strict Security:** All network traffic *between* services inside the Kubernetes cluster must be encrypted (mTLS).
4.  **Cost-Effective Scaling:** The application must scale based on real-time demand, but the team also needs to prevent service disruption during voluntary maintenance (e.g., node upgrades), which requires a `PodDisruptionBudget`.
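
To make the `PodDisruptionBudget` requirement concrete, here is a minimal sketch of the arithmetic behind a budget that uses `minAvailable`. The function name and the replica numbers are illustrative assumptions, not values taken from the production chart. Keep in mind that a PDB only limits voluntary evictions (such as a node drain); it does not protect against node crashes or direct pod deletion.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PDB with `minAvailable` would permit right now."""
    return max(0, healthy_pods - min_available)


# Hypothetical numbers: 3 healthy replicas with minAvailable: 2
print(allowed_disruptions(healthy_pods=3, min_available=2))  # 1
# With only 2 healthy replicas, a node drain has to wait
print(allowed_disruptions(healthy_pods=2, min_available=2))  # 0
```

This is why running a single replica with `minAvailable: 1` blocks node upgrades entirely: no eviction is ever allowed.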


## 🧱 Packaging with Helm

Helm allows us to package all our Kubernetes manifests (`Deployment`, `Service`, `HPA`, etc.) into a single, versioned, and configurable "chart."

**✅ Production Helm Chart (Already Implemented!):**

The Manufacturing Copilot already has a complete, production-ready Helm chart at:

```
capstone_project/charts/manufacturing-copilot/
├── Chart.yaml                     # Metadata (v1.0.0)
├── values.yaml                    # Default production values
├── values-dev.yaml                # Development environment overrides
├── values-prod.yaml               # Production environment overrides
└── templates/
    ├── _helpers.tpl               # Template helpers
    ├── deployment.yaml            # Main app deployment with security contexts
    ├── service.yaml               # ClusterIP service
    ├── ingress.yaml               # Ingress with TLS support
    ├── hpa.yaml                   # Horizontal Pod Autoscaler (2-10 replicas)
    ├── pdb.yaml                   # Pod Disruption Budget
    ├── configmap.yaml             # Application configuration
    ├── secret.yaml                # Secrets + PVC for ChromaDB
    ├── serviceaccount.yaml        # Kubernetes ServiceAccount
    ├── networkpolicy.yaml         # Network security policies
    └── servicemonitor.yaml        # Prometheus monitoring integration
```

**🚀 Deploy the Production Chart:**

```bash
# Quick deploy (development)
./capstone_project/scripts/kubernetes/deploy-k8s.sh dev

# Production deployment
helm install copilot ./capstone_project/charts/manufacturing-copilot \
  --namespace manufacturing-copilot \
  --create-namespace \
  --values ./capstone_project/charts/manufacturing-copilot/values-prod.yaml \
  --set secrets.huggingfaceToken="hf_your_token_here"
```

**📚 See Full Documentation:** `../capstone_project/KUBERNETES_DEPLOYMENT.md`
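
The production command above layers three sources of configuration: the chart's `values.yaml`, the `--values` file, and the `--set` flag, with later sources taking precedence. The sketch below imitates that layering in plain Python to show which value wins. The helper `merge_values` and all the sample values are hypothetical, and the sketch ignores Helm's special handling of `null`.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`.

    Nested dicts are merged key by key; any other value (lists included)
    is replaced wholesale, which mirrors how Helm treats lists.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical values for illustration, not the chart's real contents
chart_defaults = {"replicaCount": 2, "config": {"logLevel": "INFO", "plantId": "plant-01"}}
values_prod = {"replicaCount": 3, "config": {"logLevel": "WARNING"}}
set_flags = {"secrets": {"huggingfaceToken": "hf_your_token_here"}}

# Precedence: values.yaml < --values file < --set
effective = merge_values(merge_values(chart_defaults, values_prod), set_flags)
print(effective)
```

Note that `config.plantId` survives from the defaults because the override file only touched `config.logLevel`. To see the real merged result, `helm template` or `helm install --dry-run` renders the chart without deploying it.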

In [None]:
# 📖 Reference: Production values.yaml from Manufacturing Copilot
# The actual production chart has much more comprehensive configuration!

# Let's examine the real production values.yaml
from pathlib import Path

# Path to the actual production Helm chart
production_chart = Path("../capstone_project/charts/manufacturing-copilot")

if production_chart.exists():
    print("✅ Production Helm Chart Found!\n")
    print("=" * 70)
    
    # Read the actual values.yaml
    values_file = production_chart / "values.yaml"
    if values_file.exists():
        print("📄 Production values.yaml (first 60 lines):\n")
        # Read the file once, then print a 60-line preview
        all_lines = values_file.read_text(encoding="utf-8").splitlines()
        print("\n".join(all_lines[:60]))
        remaining = len(all_lines) - 60
        if remaining > 0:
            print(f"\n... and {remaining} more lines")
        print("\n💡 Full file at: capstone_project/charts/manufacturing-copilot/values.yaml")
    
    print("\n" + "=" * 70)
    print("\n🎯 Key Features of Production Chart:")
    print("  • Image: abhayra12/manufacturing-copilot (configurable)")
    print("  • Replicas: 2-10 with HPA (CPU & memory metrics)")
    print("  • Security: Non-root, read-only FS, dropped capabilities")
    print("  • Storage: 10Gi PVC for ChromaDB persistence")
    print("  • Monitoring: Prometheus ServiceMonitor")
    print("  • Networking: Ingress with TLS, NetworkPolicy")
    print("  • HA: Pod Disruption Budget, anti-affinity")
    
    print("\n🌍 Environment-Specific Values:")
    print("  • values-dev.yaml: 1 replica, lower resources, DEBUG logs")
    print("  • values-prod.yaml: 3-20 replicas, HA, GKE Workload Identity")
    
else:
    print("⚠️ Production chart not found at expected location")
    print("📍 Expected: ../capstone_project/charts/manufacturing-copilot/")
    print("\n💡 This notebook teaches concepts - the full implementation")
    print("   is in the capstone project directory!")

In [None]:
# 🔍 Explore the Production Helm Templates
# Let's examine what templates are included in the production chart

from pathlib import Path

production_chart = Path("../capstone_project/charts/manufacturing-copilot")
templates_dir = production_chart / "templates"

if templates_dir.exists():
    print("📦 Production Helm Chart Templates:\n")
    print("=" * 70)
    
    templates = sorted(templates_dir.glob("*.yaml"))
    for i, template in enumerate(templates, 1):
        print(f"{i}. {template.name}")
        
        # Show first few lines of each template
        with open(template, 'r') as f:
            lines = f.readlines()[:5]
            for line in lines:
                print(f"   {line.rstrip()}")
        print()
    
    print("=" * 70)
    print(f"\n✅ Total Templates: {len(templates)}")
    print("\n🎯 Quick Deploy Commands:")
    print("\n# Development (Minikube/Local):")
    print("cd ../capstone_project")
    print("./scripts/kubernetes/deploy-k8s.sh dev")
    print("\n# Production (GKE/EKS/AKS):")
    print("helm install copilot ./charts/manufacturing-copilot \\")
    print("  --namespace manufacturing-copilot \\")
    print("  --create-namespace \\")
    print("  --values ./charts/manufacturing-copilot/values-prod.yaml \\")
    print('  --set secrets.huggingfaceToken="hf_your_token"')
    
else:
    print("⚠️ Templates directory not found")
    print("💡 Make sure you're running this from the week-11-12-production-capstone directory")

### 🧩 `deployment.yaml` Template with a Sidecar

This Helm template defines the main `Deployment` resource. It uses values from `values.yaml` (like `{{ .Values.image.repository }}`) to make the manifest configurable.

Note the two containers defined: `api` (our main FastAPI app) and `retriever-sidecar`.

```yaml
# --- charts/manufacturing-copilot/templates/deployment.yaml ---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-copilot
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-copilot
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-copilot
    spec:
      containers:
        # Main application container
        - name: api
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          env:
            - name: LOG_LEVEL
              value: {{ .Values.config.logLevel | quote }}
            - name: PLANT_ID
              value: {{ .Values.config.plantId | quote }}
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          resources:
            {{- toYaml .Values.resources | nindent 12 }}

        # Sidecar container for retrieval
        - name: retriever-sidecar
          image: "us-docker.pkg.dev/vertex-ai/vector-search-sidecar:latest"
          args:
            - "--db-uri=$(CHROMA_DB_URI)"
            - "--port=8081"
          env:
            - name: CHROMA_DB_URI
              valueFrom:
                secretKeyRef:
                  name: chroma-db-secret
                  key: uri
          ports:
            - name: grpc
              containerPort: 8081
```
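
Because both containers run in the same pod, they share one network namespace, so the `api` container can reach the sidecar on `localhost:8081`. As a rough sketch (the function name is hypothetical, and a plain TCP connect only shows that something is listening, not that the gRPC service is healthy), the API could check for the sidecar at startup like this:

```python
import socket


def sidecar_reachable(port: int = 8081, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the sidecar port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or otherwise unreachable
        return False


print("retriever-sidecar reachable:", sidecar_reachable())
```

Outside a pod this will normally print `False`, since nothing listens on port 8081 locally.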


## 🔄 Progressive Delivery with Argo Rollouts

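A standard Kubernetes `Deployment` offers only the `RollingUpdate` and `Recreate` strategies, with no traffic weighting, metric-driven analysis, or automated abort. We will use **Argo Rollouts**, a Kubernetes controller that provides advanced deployment strategies like canary and blue-green.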

**Key Features of our Canary Strategy:**
-   **Phased Rollout:** The new version first receives only 10% of traffic. Without a traffic router (such as a service mesh or ingress integration), Argo Rollouts approximates this weight through the ratio of canary to stable pods.
-   **Automated Analysis:** After a 15-minute waiting period, the system automatically queries Prometheus to check our `copilot_hallucination_rate` metric.
-   **Automatic Rollback:** If the analysis fails, for example because the canary's hallucination rate is higher than the stable version's, Argo Rollouts automatically aborts the rollout and scales the new version down to zero.
-   **Manual Promotion:** If the analysis passes, traffic increases to 50% and the rollout then pauses for manual approval before proceeding to 100%.


In [None]:
# --- templates/rollout.yaml (for Argo Rollouts) ---
from pathlib import Path
from textwrap import dedent

# Assumption: a scratch chart directory for this exercise. `helm_dir` is not
# defined earlier in this notebook, so point it at your own chart as needed.
helm_dir = Path("charts/manufacturing-copilot")

argo_rollout_yaml = dedent("""
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: copilot-rollout
spec:
  replicas: {{ .Values.replicaCount }}
  strategy:
    canary:
      # Defines the two services that manage traffic to stable and canary pods
      canaryService: copilot-canary
      stableService: copilot-stable
      steps:
      - setWeight: 10  # 1. Send 10% of traffic to the new version
      - pause: { duration: 15m } # 2. Wait 15 minutes for metrics to collect

      # 3. Run an automated analysis against Prometheus
      - analysis:
          templates:
            - templateName: hallucination-check
          args:
            - name: service-name
              value: copilot-canary

      - setWeight: 50 # 4. If analysis passes, increase traffic to 50%
      - pause: {} # 5. Pause indefinitely for manual promotion

  # ... the rest of the pod template (selector, spec, etc.) is the same as the Deployment ...
""")

# parents=True so the cell also works when the chart directory does not exist yet
(helm_dir / "templates").mkdir(parents=True, exist_ok=True)
(helm_dir / "templates" / "rollout.yaml").write_text(argo_rollout_yaml, encoding="utf-8")

print("--- Argo Rollout Snippet ---")
print(argo_rollout_yaml)
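
The Rollout's analysis step refers to a template named `hallucination-check`, which has to exist as a separate `AnalysisTemplate` resource. Below is a minimal sketch of what it might look like, built as a Python dict and printed as JSON (which `kubectl apply -f` also accepts). Only the template name, the `service-name` argument, and the `copilot_hallucination_rate` metric come from this notebook. The Prometheus address, the PromQL query, the label name, and the 5% threshold are assumptions to replace with your own. For simplicity it checks an absolute threshold rather than comparing canary against stable.

```python
import json

# Assumed placeholders: Prometheus address, query, `service` label, 0.05 threshold.
analysis_template = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "AnalysisTemplate",
    "metadata": {"name": "hallucination-check"},
    "spec": {
        # Filled in by the Rollout's analysis step (value: copilot-canary)
        "args": [{"name": "service-name"}],
        "metrics": [
            {
                "name": "hallucination-rate",
                "interval": "1m",       # take one measurement per minute
                "count": 5,             # stop after five measurements
                "failureLimit": 1,      # tolerate a single failed measurement
                "successCondition": "result[0] < 0.05",
                "provider": {
                    "prometheus": {
                        "address": "http://prometheus.monitoring.svc.cluster.local:9090",
                        "query": 'avg(copilot_hallucination_rate{service="{{args.service-name}}"})',
                    }
                },
            }
        ],
    },
}

print(json.dumps(analysis_template, indent=2))
```

The `{{args.service-name}}` placeholder is Argo Rollouts syntax. If you place this manifest inside a Helm chart's `templates/` directory, Helm will try to interpret the braces itself, so either escape them or apply the manifest directly with `kubectl`.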


## 🛡️ Zero-Trust Security with a Service Mesh

A service mesh like **Anthos Service Mesh (ASM)** or **Istio** provides a dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable.

**How we will use it:**
1.  **Automatic mTLS:** Once sidecar injection is enabled for the namespace, the service mesh injects a proxy sidecar into each of our pods. This proxy encrypts and decrypts all incoming and outgoing traffic, ensuring mutual TLS (mTLS) without any changes to our application code.
2.  **Strict mTLS Enforcement:** We will apply a `PeerAuthentication` policy in `STRICT` mode, so workloads in the mesh reject unencrypted traffic. Fine-grained access control (which service may call which) is a separate concern, handled by `AuthorizationPolicy` resources.
3.  **Traffic Telemetry:** The proxies collect detailed metrics, logs, and traces for all traffic, giving us deep visibility into how our services are communicating. This data can be visualized in tools like Kiali or the GCP console.


## 🧮 Final Capstone Readiness Review

This notebook concludes the core technical topics of the course. Before starting the final capstone project, let's review the production-readiness of our proposed system.

| Area                        | Technology Stack                               | Status & Confidence Level |
| --------------------------- | ---------------------------------------------- | ------------------------- |
| **1. MLOps Foundation**     | MLflow                                         | ✅ High                   |
| **2. API Service**          | FastAPI, Pydantic                              | ✅ High                   |
| **3. Containerization**     | Docker (Multi-stage), Trivy                    | ✅ High                   |
| **4. Monitoring**           | Prometheus, Grafana                            | ✅ High                   |
| **5. CI/CD Pipeline**       | GitHub Actions                                 | ✅ High                   |
| **6. IaC (Infrastructure)** | Terraform                                      | ✅ High                   |
| **7. Deployment (Simple)**  | GCP Cloud Run                                  | ✅ High                   |
| **8. Deployment (Advanced)**| GKE, Helm, Argo Rollouts, Service Mesh         | ⚠️ Medium (Complex)        |

**Conclusion:** We have covered all the necessary components to build and deploy a production-grade GenAI application. The choice between Cloud Run (simpler, serverless) and GKE (more control, more complex) for the final deployment will depend on the specific requirements of the capstone project.


## 🧪 Lab Assignment: Deploy Manufacturing Copilot to Kubernetes

### Part 1: Deploy the Production Chart (Recommended First)

**Use the actual Manufacturing Copilot Helm chart:**

1.  **Choose Your Kubernetes Environment:**
    
    **Option A: Minikube (Local Testing)**
    ```bash
    # Start minikube
    minikube start --cpus=4 --memory=8192
    minikube addons enable ingress
    
    # Deploy with the script
    cd ../capstone_project
    ./scripts/kubernetes/deploy-k8s.sh dev
    
    # Or manually with Helm
    helm install copilot ./charts/manufacturing-copilot \
      --namespace manufacturing-copilot \
      --create-namespace \
      --values ./charts/manufacturing-copilot/values-dev.yaml \
      --set secrets.huggingfaceToken="hf_your_token_here"
    
    # Access the API
    minikube tunnel  # In a separate terminal
    # Or: kubectl port-forward svc/manufacturing-copilot 8080:8080 -n manufacturing-copilot
    ```
    
    **Option B: GKE (Production)**
    ```bash
    # Create GKE cluster (regional: --num-nodes, --min-nodes and --max-nodes apply per zone)
    gcloud container clusters create copilot-cluster \
      --region=us-central1 \
      --num-nodes=2 \
      --machine-type=e2-standard-4 \
      --enable-autoscaling \
      --min-nodes=2 \
      --max-nodes=10
    
    # Get credentials
    gcloud container clusters get-credentials copilot-cluster --region=us-central1
    
    # Deploy production
    cd ../capstone_project
    helm install copilot ./charts/manufacturing-copilot \
      --namespace manufacturing-copilot \
      --create-namespace \
      --values ./charts/manufacturing-copilot/values-prod.yaml \
      --set secrets.huggingfaceToken="hf_your_token_here" \
      --set image.tag="v1.0.0"
    ```
    
    **📖 Full Deployment Guide:** `../capstone_project/KUBERNETES_DEPLOYMENT.md`

2.  **Verify the Deployment:**
    ```bash
    # Check pods
    kubectl get pods -n manufacturing-copilot
    
    # Check HPA
    kubectl get hpa -n manufacturing-copilot
    
    # Check PDB
    kubectl get pdb -n manufacturing-copilot
    
    # View logs
    kubectl logs -n manufacturing-copilot -l app=manufacturing-copilot --tail=50
    
    # Test the API
    kubectl port-forward svc/manufacturing-copilot 8080:8080 -n manufacturing-copilot
    # Open http://localhost:8080/docs
    ```
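
If you prefer to script the verification instead of reading `kubectl` output by eye, the pod check can be automated. The sketch below parses the JSON that `kubectl get pods -o json` produces and counts pods whose `Ready` condition is `True`. The helper name and the sample payload are hypothetical; the payload is a trimmed stand-in for real output.

```python
import json


def count_ready_pods(pod_list: dict) -> tuple[int, int]:
    """Return (ready, total) from the JSON produced by `kubectl get pods -o json`."""
    pods = pod_list.get("items", [])
    ready = 0
    for pod in pods:
        conditions = pod.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions):
            ready += 1
    return ready, len(pods)


# Trimmed, hypothetical payload standing in for real kubectl output
sample = json.loads("""
{"items": [
  {"metadata": {"name": "copilot-a"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "copilot-b"},
   "status": {"conditions": [{"type": "Ready", "status": "False"}]}}
]}
""")

ready, total = count_ready_pods(sample)
print(f"{ready}/{total} pods ready")
```

Against a live cluster, you could feed it the output of `subprocess.run(["kubectl", "get", "pods", "-n", "manufacturing-copilot", "-o", "json"], capture_output=True, text=True, check=True).stdout` passed through `json.loads`.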

### Part 2: Advanced Concepts (Optional - Requires Argo Rollouts & Service Mesh)

**After mastering the production deployment, explore advanced topics:**

3.  **Install Argo Rollouts:**
    ```bash
    kubectl create namespace argo-rollouts
    kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
    
    # Install kubectl plugin
    curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
    chmod +x kubectl-argo-rollouts-linux-amd64
    sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
    ```

4.  **Convert Deployment to Rollout:**
    - Modify `deployment.yaml` to use the Argo Rollouts `Rollout` CRD (`apiVersion: argoproj.io/v1alpha1`, `kind: Rollout`)
    - Implement canary strategy with the example from this notebook
    - Test automated rollbacks with metrics analysis

5.  **Enable Service Mesh (GKE with Anthos):**
    ```bash
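    # Verify this flag against the current service mesh docs before running;
    # managed mesh is typically enabled through the fleet commands (gcloud container fleet mesh ...)
    gcloud container clusters update copilot-cluster \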
      --enable-managed-service-mesh \
      --region=us-central1
    
    # Apply mTLS policy
    kubectl apply -f - <<EOF
    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: manufacturing-copilot
    spec:
      mtls:
        mode: STRICT
    EOF
    ```

### Part 3: Production Readiness Validation

6.  **Test High Availability:**
    ```bash
    # Trigger pod failure
    kubectl delete pod -n manufacturing-copilot -l app=manufacturing-copilot --grace-period=0
    
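    # Verify the Deployment's ReplicaSet recreates the deleted pods automatically
    kubectl get pods -n manufacturing-copilot -w

    # Note: direct deletion bypasses the PDB, which only limits voluntary evictions.
    # To see the PDB in action, drain a node instead:
    # kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data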
    ```

7.  **Test Autoscaling:**
    ```bash
    # Generate load
    kubectl run -it --rm load-generator --image=busybox --restart=Never -- /bin/sh -c \
      "while true; do wget -q -O- http://manufacturing-copilot.manufacturing-copilot.svc.cluster.local:8080/health; done"
    
    # Watch HPA scale up
    kubectl get hpa -n manufacturing-copilot -w
    ```

8.  **Review Monitoring:**
    ```bash
    # Check Prometheus metrics (if enabled)
    kubectl port-forward -n monitoring svc/prometheus 9090:9090
    # Open http://localhost:9090
    # Query: rate(http_requests_total{namespace="manufacturing-copilot"}[5m])
    ```
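
When you watch the HPA in step 7, it helps to know how it arrives at a replica count. The Kubernetes documentation gives the core formula as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance (0.1 by default) of 1.0. The sketch below applies that formula and clamps the result to the chart's 2-10 replica range. The 60% CPU target and the readings are assumptions for illustration, and the real controller adds further behavior (stabilization windows, pod readiness handling) that is not modeled here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 2, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Apply the documented HPA formula, then clamp to the replica bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical readings against an assumed 60% CPU target
print(desired_replicas(2, current_metric=90, target_metric=60))  # 3 (scale up)
print(desired_replicas(4, current_metric=20, target_metric=60))  # 2 (scale down, floor at min)
print(desired_replicas(3, current_metric=62, target_metric=60))  # 3 (within tolerance)
```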

### Cleanup

```bash
# Uninstall Helm release
helm uninstall copilot --namespace manufacturing-copilot

# Delete namespace
kubectl delete namespace manufacturing-copilot

# Delete GKE cluster (if using GKE)
gcloud container clusters delete copilot-cluster --region=us-central1
```

## ✅ Checklist for this Notebook

### Production Deployment (Required)
- [X] ✅ Production-ready Helm chart created in `../capstone_project/charts/manufacturing-copilot/`
- [X] ✅ Plain Kubernetes manifests available in `../capstone_project/kubernetes/`
- [X] ✅ Deployment automation scripts (Bash + PowerShell) created
- [X] ✅ HPA configured for auto-scaling (2-10 replicas)
- [X] ✅ PodDisruptionBudget ensures high availability
- [X] ✅ NetworkPolicy for security implemented
- [X] ✅ Prometheus ServiceMonitor for monitoring
- [X] ✅ Security contexts (non-root, read-only FS) configured
- [X] ✅ Persistent storage for ChromaDB with PVC
- [X] ✅ Multi-cloud support (GKE, EKS, AKS annotations)
- [X] ✅ Comprehensive deployment guide: `../capstone_project/KUBERNETES_DEPLOYMENT.md`

### Hands-On Lab (Recommended)
- [ ] **TODO:** Deploy to local Minikube or cloud GKE cluster
- [ ] **TODO:** Verify all components (pods, services, HPA, PDB)
- [ ] **TODO:** Test the API endpoints
- [ ] **TODO:** Trigger pod failures and observe self-healing
- [ ] **TODO:** Generate load and watch HPA auto-scale

### Advanced Concepts (Optional)
- [ ] **OPTIONAL:** Install and configure Argo Rollouts for canary deployments
- [ ] **OPTIONAL:** Convert Deployment to Rollout with canary strategy
- [ ] **OPTIONAL:** Enable service mesh (Istio/Anthos) for mTLS
- [ ] **OPTIONAL:** Implement automated rollback based on metrics analysis

## 📚 References and Further Reading

### Production Deployment Documentation
-   **[Manufacturing Copilot Kubernetes Deployment Guide](../capstone_project/KUBERNETES_DEPLOYMENT.md)** - ⭐ **Start Here!**
-   **[Capstone Project README](../capstone_project/README.md)** - Deployment options overview
-   **[Implementation Guide](../capstone_project/IMPLEMENTATION_GUIDE.md)** - K8s architecture details

### Official Kubernetes & Helm Documentation
-   [Helm Documentation](https://helm.sh/docs/)
-   [Google Kubernetes Engine (GKE) Documentation](https://cloud.google.com/kubernetes-engine/docs)
-   [Amazon EKS Documentation](https://docs.aws.amazon.com/eks/)
-   [Azure AKS Documentation](https://docs.microsoft.com/en-us/azure/aks/)
-   [Kubernetes PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
-   [Kubernetes HPA](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)

### Advanced Topics (Argo Rollouts & Service Mesh)
-   [Argo Rollouts Documentation](https://argo-rollouts.readthedocs.io/en/stable/)
-   [Istio Documentation](https://istio.io/latest/docs/)
-   [Anthos Service Mesh (ASM)](https://cloud.google.com/anthos/service-mesh)

### Production Best Practices
-   [Production Checklist for Kubernetes](https://learnk8s.io/production-best-practices)
-   [Kubernetes Security Best Practices](https://kubernetes.io/docs/concepts/security/security-best-practices/)
-   [Helm Best Practices](https://helm.sh/docs/chart_best_practices/)