# 🔄 Cluster Update Automation System
## Automated Instance Type Updates with Complete Backup & Rollback Capability

---

## 📋 Overview

This notebook automates the process of updating Databricks cluster instance types based on optimization recommendations from the cost analysis system. It includes comprehensive safety features, complete configuration backup, and dashboard-friendly change tracking.

### Key Capabilities

✅ **Safe Updates**: Dry-run mode, validation checks, 10-second countdown  
✅ **Complete Backup**: Full cluster configuration snapshots (before/after)  
✅ **Rollback Ready**: JSON configs for reverting any cluster  
✅ **Dashboard Friendly**: User-readable change summaries and impact analysis  
✅ **Batch Tracking**: Execution labels for easy filtering and audit trails  
✅ **Cross-Workspace**: Updates clusters across multiple workspaces  
✅ **Setting Preservation**: Maintains policies, tags, configs, scripts, security  

---

## 🎯 Use Cases

1. **Cost Optimization**: Apply recommended instance type downsizing
2. **Standardization**: Migrate clusters to approved instance types
3. **Policy Enforcement**: Update clusters while preserving governance policies
4. **Change Tracking**: Build dashboards showing what changed and why
5. **Rollback**: Revert clusters to previous configurations if needed

---

## 📊 Input/Output Tables

### Input Table
* **`cluster_opportunities`** - Optimization recommendations with suggested instance types

### Output Tables
* **`cluster_update_log`** - Execution tracking (validation, status, errors, savings)
* **`cluster_config_backup`** - Complete before/after configs + dashboard fields

---

## ⚙️ Prerequisites

### Required Permissions
* **Workspace Access**: Ability to connect to target workspaces
* **Cluster Edit**: Permission to modify cluster configurations
* **Table Access**: Read from opportunities table, write to log tables
* **Token Authentication**: Valid Databricks token in notebook context

### Required Data
* **cluster_opportunities table** must exist with recommendations
* **Clusters must be STOPPED** (reported as `TERMINATED` by the Clusters API; this notebook does not update running clusters)
* **Non-pooled clusters only** (instance pools define their own types)

### Dependencies
* **databricks-sdk** Python package (installed in cell 1)
* **System tables**: `system.access.workspaces_latest` for workspace list

---

## 🚀 Quick Start Guide

### Step 1: Configure Parameters
Set the widgets at the top:
* **dry_run**: `true` (preview) or `false` (live update)
* **workspaces**: Specific workspace names or "All"
* **catalog**: Target catalog name
* **schema**: Target schema name

### Step 2: Review Preview
* Check cluster breakdown by workspace
* Review potential savings
* Verify configuration settings

### Step 3: Execute
* **Dry-run first**: Always test with `dry_run=true`
* **Review results**: Check validation and update status
* **Go live**: Set `dry_run=false` and re-run

### Step 4: Monitor Results
* View execution summary and status breakdown
* Check backup table for change details
* Use sample queries for dashboard building

---

## 🛡️ Safety Features

### Validation Checks
* ✅ Cluster must be STOPPED (not RUNNING or PENDING)
* ✅ Cluster must NOT use instance pools
* ✅ Current config must match expected state (prevents double-updates)
* ✅ Workspace must be accessible (handles 403 errors gracefully)
* ✅ Deployment name must exist
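
As a rough illustration, the first two checks could be expressed as a pure function over the configuration dictionary built for each cluster. The name `check_update_preconditions` is hypothetical, and the sketch assumes a stopped cluster is reported with the state `TERMINATED`; the notebook's own validation may be structured differently.

```python
from typing import Optional, Tuple

# Assumption: the Clusters API reports a stopped cluster as TERMINATED.
STOPPED_STATES = {"TERMINATED"}


def check_update_preconditions(config: dict) -> Tuple[bool, Optional[str]]:
    """Return (ok, reason); reason is None when the cluster may be updated.

    `config` is assumed to use the keys "state", "instance_pool_id",
    and "driver_instance_pool_id".
    """
    if "error" in config:
        return False, f"Failed to fetch cluster: {config['error']}"
    state = config.get("state")
    if state not in STOPPED_STATES:
        return False, f"Cluster in {state} state"
    if config.get("instance_pool_id") or config.get("driver_instance_pool_id"):
        return False, "Cluster uses instance pools"
    return True, None


print(check_update_preconditions({"state": "RUNNING"}))
print(check_update_preconditions({"state": "TERMINATED", "instance_pool_id": "pool-1"}))
print(check_update_preconditions({"state": "TERMINATED"}))
```

Returning a reason string alongside the boolean keeps the failure message available for logging without raising an exception per cluster.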

### Dry-Run Mode
* Preview all changes without modifying clusters
* Creates backup entries showing what WOULD change
* Validates all checks without calling clusters.edit()
* Safe to run multiple times
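
A minimal sketch of the dry-run guard, assuming a hypothetical `apply_change` helper and a caller-supplied `edit_fn` that stands in for the real update call. It shows the same record being produced in both modes, with the mutating call skipped when `dry_run` is true. The `DRY_RUN` status label is illustrative.

```python
from typing import Callable


def apply_change(cluster_id: str, new_type: str,
                 edit_fn: Callable[[str, str], None], dry_run: bool = True) -> dict:
    """Record the intended change; only invoke edit_fn when dry_run is False."""
    record = {
        "cluster_id": cluster_id,
        "after_worker_instance": new_type,
        "mode": "DRY_RUN" if dry_run else "LIVE_UPDATE",
    }
    if dry_run:
        record["update_status"] = "DRY_RUN"  # hypothetical status label
        return record
    edit_fn(cluster_id, new_type)  # the only line that mutates anything
    record["update_status"] = "SUCCESS"
    return record


calls = []


def fake_edit(cluster_id: str, new_type: str) -> None:
    """Stand-in for the real API call; it only records its arguments."""
    calls.append((cluster_id, new_type))


print(apply_change("c-1", "m5.xlarge", fake_edit, dry_run=True))
print(calls)  # still empty: the stand-in is not invoked in dry-run mode
print(apply_change("c-1", "m5.xlarge", fake_edit, dry_run=False))
```

Because the guard sits directly in front of the single mutating call, repeated dry runs have no side effects on clusters.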

### Countdown Timer
* 10-second countdown before execution
* Clear warning about DRY_RUN vs LIVE_UPDATE mode
* Opportunity to cancel if needed

### Error Handling
* Cross-workspace connectivity issues
* Certificate validation failures
* API rate limiting (pauses every 10 clusters)
* Detailed error logging with stack traces
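
The batch behavior described here (pause every 10 clusters, log errors without stopping) might look roughly like the loop below. `process_batch`, `update_fn`, and the `pause_seconds` default are illustrative names and values, not taken from the notebook.

```python
import time
import traceback
from typing import Callable, Iterable, List


def process_batch(cluster_ids: Iterable[str], update_fn: Callable[[str], None],
                  pause_every: int = 10, pause_seconds: float = 1.0) -> List[dict]:
    """Run update_fn per cluster, pausing periodically and never aborting the batch."""
    results = []
    for index, cluster_id in enumerate(cluster_ids, start=1):
        try:
            update_fn(cluster_id)
            results.append({"cluster_id": cluster_id, "status": "SUCCESS",
                            "error_details": None})
        except Exception:
            # Keep the stack trace so it can be stored in an error_details column.
            results.append({"cluster_id": cluster_id, "status": "FAILED",
                            "error_details": traceback.format_exc()})
        if index % pause_every == 0:
            time.sleep(pause_seconds)  # crude rate limiting between groups of clusters
    return results


def flaky_update(cluster_id: str) -> None:
    """Illustrative update function that fails for one cluster."""
    if cluster_id == "c-2":
        raise RuntimeError("403 Forbidden")


for result in process_batch(["c-1", "c-2", "c-3"], flaky_update, pause_seconds=0):
    print(result["cluster_id"], result["status"])
```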

---

## 📈 Dashboard Building

The `cluster_config_backup` table includes dashboard-friendly fields:

### Change Analysis
* `change_summary` - Human-readable description
* `change_impact` - MAJOR, MODERATE, MINOR, or NONE (UNKNOWN when the analysis cannot run)
* `change_categories` - What changed (instance_type, policy, etc.)
* `total_changes_count` - Number of settings modified

### Boolean Filters
* `instance_type_changed`, `policy_changed`, `spark_config_changed`
* `tags_changed`, `init_scripts_changed`, `autotermination_changed`
* `runtime_engine_changed`, `security_mode_changed`, `autoscale_changed`

### Before/After Metrics
* Instance types, policy IDs, autoscale settings
* Spark config counts, custom tag counts, init script counts
* Runtime engine, security mode, autotermination minutes

**See Cell 17** for 7 sample dashboard queries

---

## 🔄 Rollback Process

### Using the Backup Table

1. **Query backups**:
```sql
SELECT backup_id, cluster_name, before_config, backup_timestamp
FROM {catalog}.{schema}.cluster_config_backup
WHERE cluster_name = 'my-cluster'
  AND update_status = 'SUCCESS'
  AND is_reverted = false
ORDER BY backup_timestamp DESC
LIMIT 1
```

2. **Parse JSON config**:
```python
import json
config = json.loads(backup.before_config)
```

3. **Apply config** using Databricks SDK (in separate revert notebook)

4. **Mark as reverted**:
```sql
UPDATE {catalog}.{schema}.cluster_config_backup
SET is_reverted = true,
    revert_timestamp = current_timestamp(),
    reverted_by_user = current_user()
WHERE backup_id = '<backup_id>'
```
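
Step 3 is left to a separate notebook. As one possible shape for it, the helper below turns a `before_config` JSON string into keyword arguments for a cluster edit call. `build_revert_kwargs` is a hypothetical name, the key mapping assumes the backup uses the same keys as this notebook's config dictionary, and only a subset of settings is shown.

```python
import json

# Backup key -> Clusters API field name (assumed mapping; extend as needed).
FIELD_MAP = {
    "cluster_name": "cluster_name",
    "spark_version": "spark_version",
    "driver_instance_type": "driver_node_type_id",
    "worker_instance_type": "node_type_id",
    "policy_id": "policy_id",
    "spark_conf": "spark_conf",
    "custom_tags": "custom_tags",
    "autotermination_minutes": "autotermination_minutes",
}


def build_revert_kwargs(before_config_json: str) -> dict:
    """Translate a stored before_config into edit-call keyword arguments."""
    config = json.loads(before_config_json)
    if "error" in config:
        raise ValueError(f"Backup holds an error, not a config: {config['error']}")
    kwargs = {api: config[key] for key, api in FIELD_MAP.items()
              if config.get(key) is not None}
    # Autoscaling and fixed-size clusters are mutually exclusive.
    if config.get("min_workers") is not None and config.get("max_workers") is not None:
        kwargs["autoscale"] = {"min_workers": config["min_workers"],
                               "max_workers": config["max_workers"]}
    elif config.get("num_workers") is not None:
        kwargs["num_workers"] = config["num_workers"]
    return kwargs


sample = json.dumps({
    "cluster_name": "my-cluster", "spark_version": "14.3.x-scala2.12",
    "driver_instance_type": "m5.2xlarge", "worker_instance_type": "m5.2xlarge",
    "min_workers": 2, "max_workers": 8, "num_workers": None,
})
print(build_revert_kwargs(sample))
```

The resulting dictionary holds plain Python values. Before passing it to the Databricks SDK, nested settings such as `autoscale` would likely need converting to the SDK's own types, and the exact edit call should be checked against the SDK version in use.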

---

## 📝 Execution Workflow

```
1. Setup & Configuration (Cells 1-5)
   ├─ Install SDK
   ├─ Create widgets
   ├─ Display cluster breakdown
   ├─ Review configuration
   └─ 10-second countdown

2. Core Functions (Cells 6-9)
   ├─ Import libraries
   ├─ Define update functions
   ├─ Define dashboard helpers
   └─ Main orchestration function

3. Testing (Cell 10)
   └─ Unit tests for dry-run safety

4. Table Setup (Cells 11-14)
   ├─ Create update log table
   ├─ Create backup table
   └─ Enhance with dashboard fields

5. Execution (Cells 15-16)
   ├─ Execute updates
   └─ Save results to tables

6. Results & Documentation (Cells 17-24)
   ├─ Sample dashboard queries
   ├─ Documentation
   └─ Interactive result viewers
```

---

## ⚠️ Important Notes

### Before Running
* ✅ Always test with `dry_run=true` first
* ✅ Ensure clusters are STOPPED
* ✅ Verify workspace access permissions
* ✅ Review cluster breakdown and savings
* ✅ Check that the `cluster_opportunities` table is up to date

### During Execution
* ⏱️ Processing time: ~1-2 seconds per cluster
* 🔄 Rate limiting: Pauses every 10 clusters
* 📊 Progress displayed for each cluster
* ❌ Errors are logged but do not stop the batch

### After Execution
* 📈 Review execution summary
* ✅ Check validation and update status
* 💾 Backup entries created for all attempts
* 🔍 Use execution_label for filtering

---

## 🆘 Troubleshooting

### Common Issues

**"Cluster in RUNNING state"**
* Stop the cluster before updating
* Cannot update running or pending clusters

**"Cluster uses instance pools"**
* Instance pools define node types
* Cannot change instance types for pooled clusters

**"Cluster config mismatch"**
* Cluster was already updated
* Current config doesn't match expected state
* Re-run cost analysis to get fresh recommendations

**"Connectivity failed to workspace"**
* Cross-workspace access denied
* Check token permissions
* Verify workspace is accessible

**"No deployment name found"**
* Workspace metadata incomplete
* Check system.access.workspaces_latest table
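
The deployment name is derived from each workspace URL with a regular expression, so a URL that does not follow the expected pattern produces an empty value. The sketch below reproduces that extraction in plain Python; `extract_deployment_name` and the sample URLs are illustrative.

```python
import re

# Same pattern the notebook applies to workspace_url.
DEPLOYMENT_PATTERN = re.compile(r"https://([^.]+)\.cloud\.databricks\.com")


def extract_deployment_name(workspace_url: str) -> str:
    """Return the deployment name, or an empty string when the URL does not match."""
    match = DEPLOYMENT_PATTERN.search(workspace_url or "")
    return match.group(1) if match else ""


for url in ["https://acme-prod.cloud.databricks.com",
            "https://adb-1234567890.12.azuredatabricks.net",
            ""]:
    print(repr(url), "->", repr(extract_deployment_name(url)))
```

Running a check like this against the `workspace_url` values in the system table can show quickly which workspaces fall outside the pattern.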

---

## 📞 Support

For issues or questions:
1. Check execution logs in `cluster_update_log` table
2. Review error_details column for stack traces
3. Use execution_label to filter specific runs
4. Check backup table for configuration snapshots

---

**Version**: 2.0  
**Last Updated**: 2025-12-11  
**Maintained By**: Platform Engineering Team

In [0]:
%pip install databricks-sdk --quiet
dbutils.library.restartPython()

In [0]:
# Get list of workspaces from the system table (available in all workspaces)
workspaces_list = spark.table("system.access.workspaces_latest") \
    .select("workspace_name") \
    .distinct() \
    .orderBy("workspace_name") \
    .toPandas()['workspace_name'].tolist()

# Add "All" as the first option
workspace_options = ["All"] + workspaces_list

# Create widgets for configuration parameters
dbutils.widgets.dropdown("dry_run", "true", ["true", "false"], "Dry Run Mode")
dbutils.widgets.dropdown("workspaces", "All", workspace_options, "Target Workspaces")
dbutils.widgets.text("catalog", "ex_dash_temp", "Catalog Name")
dbutils.widgets.text("schema", "billing_forecast", "Schema Name")

# Get widget values and create variables for use in all subsequent cells
catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")
full_schema = f"{catalog}.{schema}"

displayHTML(f"""
<div style="padding: 15px; background-color: #e8f5e9; border-left: 5px solid #4caf50; margin: 10px 0;">
    <h3 style="margin-top: 0; color: #2e7d32;">✓ Configuration Widgets Created</h3>
    <ul style="color: #1b5e20;">
        <li><strong>Dry Run Mode:</strong> Set to 'true' for preview only, 'false' for actual updates</li>
        <li><strong>Target Workspaces:</strong> Select specific workspace or 'All' to process all workspaces</li>
        <li><strong>Catalog Name:</strong> Output catalog for tables (default: ex_dash_temp)</li>
        <li><strong>Schema Name:</strong> Output schema for tables (default: billing_forecast)</li>
    </ul>
    <p style="margin-top: 10px; color: #1565c0; font-style: italic;">✓ Using system.access.workspaces_latest (available in all workspaces - no cloning needed!)</p>
    <p style="margin-top: 10px; color: #6a1b9a; font-weight: bold;">📊 All tables will be read/written to: <code style="background-color: #e1bee7; padding: 3px 8px; border-radius: 3px;">{full_schema}</code></p>
</div>
""")

In [0]:
from pyspark.sql import functions as F

# Get widget values first
dry_run_str = dbutils.widgets.get("dry_run")
workspaces_str = dbutils.widgets.get("workspaces")

# Convert to usable variables
DRY_RUN = dry_run_str.lower() == "true"
WORKSPACES_TO_UPDATE = workspaces_str.strip() if workspaces_str != "All" else ""

# Load opportunities data using full_schema variable
opportunities_df = spark.table(f"{full_schema}.cluster_opportunities")

# Use system.access.workspaces_latest and extract deployment_name from workspace_url
# This table is available in ALL workspaces - no need to clone!
workspaces_df = spark.table("system.access.workspaces_latest") \
    .withColumn(
        "deployment_name",
        F.regexp_extract(F.col("workspace_url"), r"https://([^.]+)\.cloud\.databricks\.com", 1)
    ) \
    .select(
        F.col("workspace_id").cast("long").alias("workspace_id"),
        "workspace_name",
        "deployment_name"
    )

# Join to get workspace details
cluster_data = opportunities_df.join(
    workspaces_df,
    on="workspace_name",
    how="left"
)

# Filter by workspace if specified
filtered_cluster_data = cluster_data
if WORKSPACES_TO_UPDATE:
    workspace_list = [ws.strip() for ws in WORKSPACES_TO_UPDATE.split(",")]
    filtered_cluster_data = cluster_data.filter(F.col("workspace_name").isin(workspace_list))

# Get cluster counts by workspace
workspace_counts = (
    filtered_cluster_data
    .groupBy("workspace_name")
    .agg(
        F.count("*").alias("cluster_count"),
        F.sum("validated_savings").alias("total_savings")
    )
    .orderBy(F.col("cluster_count").desc())
    .collect()
)

total_clusters = sum([row.cluster_count for row in workspace_counts])
total_savings = sum([float(row.total_savings) if row.total_savings else 0 for row in workspace_counts])

# Build HTML table for workspace breakdown
workspace_table_html = f"""
<div style="border: 2px solid #1976d2; padding: 20px; margin: 20px 0; background-color: #e3f2fd; border-radius: 8px;">
    <h3 style="color: #0d47a1; margin-top: 0;">📋 Cluster Update Plan by Workspace</h3>
    <p style="margin: 5px 0; color: #1565c0; font-size: 13px;">💾 Source: {full_schema}.cluster_opportunities</p>
    <table style="width: 100%; border-collapse: collapse; margin-top: 15px;">
        <tr style="background-color: #90caf9;">
            <th style="padding: 12px; text-align: left; border: 1px solid #64b5f6; color: #0d47a1;">Workspace</th>
            <th style="padding: 12px; text-align: center; border: 1px solid #64b5f6; color: #0d47a1;">Clusters to Update</th>
            <th style="padding: 12px; text-align: right; border: 1px solid #64b5f6; color: #0d47a1;">Potential Savings (USD)</th>
        </tr>
"""

for row in workspace_counts:
    savings_display = f"${row.total_savings:,.2f}" if row.total_savings else "$0.00"
    workspace_table_html += f"""
        <tr style="background-color: #ffffff;">
            <td style="padding: 10px; border: 1px solid #64b5f6; font-weight: bold;">{row.workspace_name}</td>
            <td style="padding: 10px; border: 1px solid #64b5f6; text-align: center; font-size: 18px; color: #1976d2;">{row.cluster_count}</td>
            <td style="padding: 10px; border: 1px solid #64b5f6; text-align: right; font-weight: bold; color: #2e7d32;">{savings_display}</td>
        </tr>
    """

# Add total row
workspace_table_html += f"""
        <tr style="background-color: #bbdefb; font-weight: bold;">
            <td style="padding: 12px; border: 1px solid #64b5f6; color: #0d47a1;">TOTAL</td>
            <td style="padding: 12px; border: 1px solid #64b5f6; text-align: center; font-size: 20px; color: #0d47a1;">{total_clusters}</td>
            <td style="padding: 12px; border: 1px solid #64b5f6; text-align: right; font-size: 18px; color: #1b5e20;">${total_savings:,.2f}</td>
        </tr>
    </table>
</div>
"""

displayHTML(workspace_table_html)

In [0]:
# Get widget values
dry_run_str = dbutils.widgets.get("dry_run")
workspaces_str = dbutils.widgets.get("workspaces")

# Convert dry_run to boolean
DRY_RUN = dry_run_str.lower() == "true"
WORKSPACES_TO_UPDATE = workspaces_str.strip() if workspaces_str != "All" else ""

# Determine mode styling
if DRY_RUN:
    mode_color = "#4caf50"
    mode_bg = "#e8f5e9"
    mode_icon = "🟢"
    mode_text = "DRY RUN (PREVIEW ONLY - NO CHANGES)"
else:
    mode_color = "#f44336"
    mode_bg = "#ffebee"
    mode_icon = "🔴"
    mode_text = "LIVE UPDATE (WILL MODIFY CLUSTERS)"

# Determine workspace display
if WORKSPACES_TO_UPDATE:
    workspace_display = WORKSPACES_TO_UPDATE
else:
    workspace_display = "ALL WORKSPACES"

# Create HTML display
html_content = f"""
<div style="border: 3px solid {mode_color}; padding: 20px; margin: 20px 0; background-color: {mode_bg}; border-radius: 8px;">
    <h2 style="text-align: center; color: {mode_color}; margin-top: 0;">
        ⚠️ CLUSTER UPDATE CONFIGURATION ⚠️
    </h2>
    <hr style="border: 1px solid {mode_color}; margin: 20px 0;">
    
    <div style="font-size: 16px; line-height: 2;">
        <div style="margin: 15px 0;">
            <strong style="font-size: 18px;">🔧 EXECUTION MODE:</strong>
            <span style="font-size: 20px; font-weight: bold; color: {mode_color}; margin-left: 20px;">
                {mode_icon} {mode_text}
            </span>
        </div>
        
        <div style="margin: 15px 0;">
            <strong style="font-size: 18px;">🌐 TARGET WORKSPACES:</strong>
            <span style="font-size: 18px; font-weight: bold; color: #1976d2; margin-left: 20px;">
                {workspace_display}
            </span>
        </div>
    </div>
    
    <hr style="border: 1px solid {mode_color}; margin: 20px 0;">
    
    <div style="padding: 15px; background-color: white; border-radius: 5px; margin-top: 15px;">
"""

if not DRY_RUN:
    html_content += """
        <div style="color: #d32f2f; font-weight: bold; font-size: 16px;">
            ⚠️ WARNING: Live update mode is enabled!<br>
            ⚠️ Clusters will be ACTUALLY MODIFIED after validation.
        </div>
    """
else:
    html_content += """
        <div style="color: #388e3c; font-weight: bold; font-size: 16px;">
            ✓ Safe mode: Dry run will only preview changes without modifying clusters.
        </div>
    """

html_content += """
    </div>
    
    <div style="text-align: center; margin-top: 20px; font-size: 18px; font-weight: bold; color: #ff6f00;">
        ⏱️ Starting in 10 seconds...
    </div>
</div>
"""

displayHTML(html_content)

In [0]:
import time

# 10-second countdown
for i in range(10, 0, -1):
    displayHTML(f"""
    <div style="text-align: center; padding: 20px; background-color: #fff3e0; border: 2px solid #ff9800; border-radius: 8px;">
        <h2 style="color: #e65100; margin: 0;">
            ⏱️ Starting in <span style="font-size: 36px; color: #ff6f00;">{i}</span> seconds...
        </h2>
        <p style="color: #bf360c; margin-top: 10px; font-size: 14px;">(Press Stop to cancel)</p>
    </div>
    """)
    time.sleep(1)

displayHTML("""
<div style="text-align: center; padding: 25px; background-color: #e3f2fd; border: 3px solid #2196f3; border-radius: 8px;">
    <h1 style="color: #0d47a1; margin: 0;">
        🚀 STARTING CLUSTER UPDATE PROCESS
    </h1>
</div>
""")

In [0]:
from databricks.sdk import WorkspaceClient
from databricks.sdk.core import Config
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
import json
import time

# Get authentication token from notebook context
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

print("✓ Authentication token loaded successfully")

In [0]:
from databricks.sdk import WorkspaceClient
import json

def get_workspace_client(deployment_name, token):
    """Create authenticated workspace client for a specific workspace"""
    host = f"https://{deployment_name}.cloud.databricks.com"
    
    # Use token authentication
    return WorkspaceClient(
        host=host,
        token=token
    )

def serialize_cluster_config(config):
    """Convert cluster config to JSON string for storage"""
    if "error" in config:
        return json.dumps({"error": config["error"]})
    
    # Create a serializable version of the config
    serializable_config = {}
    for key, value in config.items():
        if key == "cluster_object":
            continue  # Skip the full object
        elif value is None:
            serializable_config[key] = None
        elif hasattr(value, 'as_dict'):
            # Databricks SDK objects with as_dict method
            serializable_config[key] = value.as_dict()
        elif isinstance(value, (str, int, float, bool)):
            serializable_config[key] = value
        elif isinstance(value, dict):
            serializable_config[key] = value
        elif isinstance(value, list):
            serializable_config[key] = [item.as_dict() if hasattr(item, 'as_dict') else item for item in value]
        else:
            serializable_config[key] = str(value)
    
    return json.dumps(serializable_config, indent=2)

def analyze_config_changes(before_config, after_config):
    """Analyze changes between before and after configs for dashboard display
    
    Returns dict with change analysis fields
    """
    if "error" in before_config or not after_config:
        return {
            "change_categories": None,
            "total_changes_count": 0,
            "change_impact": "UNKNOWN",
            "instance_type_changed": False,
            "policy_changed": False,
            "spark_config_changed": False,
            "tags_changed": False,
            "init_scripts_changed": False,
            "autotermination_changed": False,
            "runtime_engine_changed": False,
            "security_mode_changed": False,
            "autoscale_changed": False,
            "change_summary": "No changes detected",
            "change_details": ""
        }
    
    changes = []
    change_details = []
    change_count = 0
    
    # Check instance type changes
    instance_changed = (
        before_config.get("driver_instance_type") != after_config.get("driver_instance_type") or
        before_config.get("worker_instance_type") != after_config.get("worker_instance_type")
    )
    if instance_changed:
        changes.append("instance_type")
        change_count += 1
        change_details.append(
            f"Instance Types: Driver {before_config.get('driver_instance_type')} → {after_config.get('driver_instance_type')}, "
            f"Worker {before_config.get('worker_instance_type')} → {after_config.get('worker_instance_type')}"
        )
    
    # Check policy changes
    policy_changed = before_config.get("policy_id") != after_config.get("policy_id")
    if policy_changed:
        changes.append("policy")
        change_count += 1
        change_details.append(f"Policy: {before_config.get('policy_id')} → {after_config.get('policy_id')}")
    
    # Check Spark config changes
    before_spark = before_config.get("spark_conf") or {}
    after_spark = after_config.get("spark_conf") or {}
    spark_changed = before_spark != after_spark
    if spark_changed:
        changes.append("spark_config")
        change_count += 1
        change_details.append(f"Spark Configs: {len(before_spark)} → {len(after_spark)} settings")
    
    # Check custom tags changes
    before_tags = before_config.get("custom_tags") or {}
    after_tags = after_config.get("custom_tags") or {}
    tags_changed = before_tags != after_tags
    if tags_changed:
        changes.append("tags")
        change_count += 1
        change_details.append(f"Custom Tags: {len(before_tags)} → {len(after_tags)} tags")
    
    # Check init scripts changes (compare contents, not just counts,
    # so replacing one script with another is also detected)
    before_scripts = before_config.get("init_scripts") or []
    after_scripts = after_config.get("init_scripts") or []
    scripts_changed = before_scripts != after_scripts
    if scripts_changed:
        changes.append("init_scripts")
        change_count += 1
        change_details.append(f"Init Scripts: {len(before_scripts)} → {len(after_scripts)} scripts")
    
    # Check autotermination changes
    autoterm_changed = before_config.get("autotermination_minutes") != after_config.get("autotermination_minutes")
    if autoterm_changed:
        changes.append("autotermination")
        change_count += 1
        change_details.append(
            f"Autotermination: {before_config.get('autotermination_minutes')} → {after_config.get('autotermination_minutes')} minutes"
        )
    
    # Check runtime engine changes
    # (the key is always present but may be None, so fall back with `or`)
    before_engine = str(before_config.get("runtime_engine") or "STANDARD").upper()
    after_engine = str(after_config.get("runtime_engine") or "STANDARD").upper()
    engine_changed = before_engine != after_engine
    if engine_changed:
        changes.append("runtime_engine")
        change_count += 1
        change_details.append(f"Runtime Engine: {before_engine} → {after_engine}")
    
    # Check security mode changes
    security_changed = before_config.get("data_security_mode") != after_config.get("data_security_mode")
    if security_changed:
        changes.append("security_mode")
        change_count += 1
        change_details.append(
            f"Security Mode: {before_config.get('data_security_mode')} → {after_config.get('data_security_mode')}"
        )
    
    # Check autoscale changes
    before_min = before_config.get("min_workers")
    after_min = after_config.get("min_workers")
    before_max = before_config.get("max_workers")
    after_max = after_config.get("max_workers")
    autoscale_changed = before_min != after_min or before_max != after_max
    if autoscale_changed:
        changes.append("autoscale")
        change_count += 1
        change_details.append(f"Autoscale: {before_min}-{before_max} → {after_min}-{after_max} workers")
    
    # Determine impact level
    if instance_changed or policy_changed or security_changed:
        impact = "MAJOR"
    elif change_count >= 3:
        impact = "MODERATE"
    elif change_count > 0:
        impact = "MINOR"
    else:
        impact = "NONE"
    
    # Build change summary
    if not changes:
        summary = "No changes detected"
    else:
        summary = ", ".join(change_details[:3])  # First 3 changes
        if len(change_details) > 3:
            summary += f" (+{len(change_details)-3} more)"
    
    return {
        "change_categories": ",".join(changes) if changes else None,
        "total_changes_count": change_count,
        "change_impact": impact,
        "instance_type_changed": instance_changed,
        "policy_changed": policy_changed,
        "spark_config_changed": spark_changed,
        "tags_changed": tags_changed,
        "init_scripts_changed": scripts_changed,
        "autotermination_changed": autoterm_changed,
        "runtime_engine_changed": engine_changed,
        "security_mode_changed": security_changed,
        "autoscale_changed": autoscale_changed,
        "change_summary": summary,
        "change_details": " | ".join(change_details)
    }

def get_cluster_current_config(ws_client, cluster_id):
    """Fetch current cluster configuration with ALL settings to preserve"""
    try:
        cluster = ws_client.clusters.get(cluster_id=cluster_id)
        return {
            # Basic configuration
            "driver_instance_type": cluster.driver_node_type_id,
            "worker_instance_type": cluster.node_type_id,
            "min_workers": cluster.autoscale.min_workers if cluster.autoscale else None,
            "max_workers": cluster.autoscale.max_workers if cluster.autoscale else None,
            "num_workers": cluster.num_workers if not cluster.autoscale else None,
            "state": cluster.state.value if cluster.state else None,
            "spark_version": cluster.spark_version,
            "cluster_name": cluster.cluster_name,
            "autoscale": cluster.autoscale,
            
            # CRITICAL: Preserve all cluster settings
            "policy_id": cluster.policy_id,
            "spark_conf": cluster.spark_conf,
            "custom_tags": cluster.custom_tags,
            "init_scripts": cluster.init_scripts,
            "cluster_log_conf": cluster.cluster_log_conf,
            "ssh_public_keys": cluster.ssh_public_keys,
            "aws_attributes": cluster.aws_attributes,
            "spark_env_vars": cluster.spark_env_vars,
            "enable_elastic_disk": cluster.enable_elastic_disk,
            "enable_local_disk_encryption": cluster.enable_local_disk_encryption,
            "instance_pool_id": cluster.instance_pool_id,
            "driver_instance_pool_id": cluster.driver_instance_pool_id,
            "data_security_mode": cluster.data_security_mode,
            "runtime_engine": cluster.runtime_engine,
            "autotermination_minutes": cluster.autotermination_minutes,
            "single_user_name": cluster.single_user_name,
            "docker_image": cluster.docker_image,
            
            "cluster_object": cluster  # Store full cluster object for reference
        }
    except Exception as e:
        return {"error": str(e)}

def validate_cluster_config(current_config, expected_driver, expected_worker, expected_min, expected_max):
    """Validate if current cluster config matches expected values from opportunities table"""
    if "error" in current_config:
        return False, f"Failed to fetch cluster: {current_config['error']}"
    
    # Check if instance types match (before update)
    driver_matches = current_config.get("driver_instance_type") == expected_driver
    worker_matches = current_config.get("worker_instance_type") == expected_worker
    
    # For validation, we check if current config matches what we expect to change FROM
    # This prevents updating clusters that have already been modified
    validation_msg = []
    if not driver_matches:
        validation_msg.append(f"Driver mismatch: current={current_config.get('driver_instance_type')}, expected={expected_driver}")
    if not worker_matches:
        validation_msg.append(f"Worker mismatch: current={current_config.get('worker_instance_type')}, expected={expected_worker}")
    
    is_valid = driver_matches and worker_matches
    msg = "; ".join(validation_msg) if validation_msg else "Configuration matches expected state"
    
    return is_valid, msg

print("✓ Cluster update functions defined using Databricks SDK with token authentication")
print("✓ FIXED: Now capturing and preserving ALL cluster settings during updates")
print("✓ NEW: Added config serialization for backup table")
print("✓ NEW: Added change analysis function for dashboard-friendly fields")
print("  - Policy, Spark config, custom tags, init scripts, log config")
print("  - SSH keys, AWS attributes, environment variables, disk settings")
print("  - Instance pools, security mode, runtime engine, autotermination")

In [0]:
def populate_dashboard_fields(backup_entry, before_config, after_config):
    """Populate dashboard-friendly fields in backup entry
    
    Args:
        backup_entry: Dict with basic backup fields
        before_config: Before configuration dict
        after_config: After configuration dict (can be None for dry-run/failed)
    
    Returns:
        Updated backup_entry with all dashboard fields populated
    """
    # Extract counts from before config
    before_spark_conf = before_config.get("spark_conf") or {}
    before_tags = before_config.get("custom_tags") or {}
    before_scripts = before_config.get("init_scripts") or []
    
    backup_entry["before_spark_config_count"] = len(before_spark_conf)
    backup_entry["before_custom_tags_count"] = len(before_tags)
    backup_entry["before_init_scripts_count"] = len(before_scripts)
    backup_entry["before_autotermination_minutes"] = before_config.get("autotermination_minutes")
    backup_entry["before_runtime_engine"] = str(before_config.get("runtime_engine", "STANDARD")).upper() if before_config.get("runtime_engine") else None
    backup_entry["before_data_security_mode"] = str(before_config.get("data_security_mode")) if before_config.get("data_security_mode") else None
    
    # Extract counts from after config (if available)
    if after_config and "error" not in after_config:
        after_spark_conf = after_config.get("spark_conf") or {}
        after_tags = after_config.get("custom_tags") or {}
        after_scripts = after_config.get("init_scripts") or []
        
        backup_entry["after_spark_config_count"] = len(after_spark_conf)
        backup_entry["after_custom_tags_count"] = len(after_tags)
        backup_entry["after_init_scripts_count"] = len(after_scripts)
        backup_entry["after_autotermination_minutes"] = after_config.get("autotermination_minutes")
        backup_entry["after_runtime_engine"] = str(after_config.get("runtime_engine", "STANDARD")).upper() if after_config.get("runtime_engine") else None
        backup_entry["after_data_security_mode"] = str(after_config.get("data_security_mode")) if after_config.get("data_security_mode") else None
        
        # Analyze changes
        change_analysis = analyze_config_changes(before_config, after_config)
    else:
        # No after config (dry-run or failed) - use before config values
        backup_entry["after_spark_config_count"] = len(before_spark_conf)
        backup_entry["after_custom_tags_count"] = len(before_tags)
        backup_entry["after_init_scripts_count"] = len(before_scripts)
        backup_entry["after_autotermination_minutes"] = before_config.get("autotermination_minutes")
        backup_entry["after_runtime_engine"] = backup_entry["before_runtime_engine"]
        backup_entry["after_data_security_mode"] = backup_entry["before_data_security_mode"]
        
        # For dry-run, simulate the instance type change
        simulated_after = before_config.copy()
        simulated_after["driver_instance_type"] = backup_entry.get("after_driver_instance")
        simulated_after["worker_instance_type"] = backup_entry.get("after_worker_instance")
        change_analysis = analyze_config_changes(before_config, simulated_after)
    
    # Add change analysis fields
    backup_entry["change_categories"] = change_analysis["change_categories"]
    backup_entry["total_changes_count"] = change_analysis["total_changes_count"]
    backup_entry["change_impact"] = change_analysis["change_impact"]
    backup_entry["instance_type_changed"] = change_analysis["instance_type_changed"]
    backup_entry["policy_changed"] = change_analysis["policy_changed"]
    backup_entry["spark_config_changed"] = change_analysis["spark_config_changed"]
    backup_entry["tags_changed"] = change_analysis["tags_changed"]
    backup_entry["init_scripts_changed"] = change_analysis["init_scripts_changed"]
    backup_entry["autotermination_changed"] = change_analysis["autotermination_changed"]
    backup_entry["runtime_engine_changed"] = change_analysis["runtime_engine_changed"]
    backup_entry["security_mode_changed"] = change_analysis["security_mode_changed"]
    backup_entry["autoscale_changed"] = change_analysis["autoscale_changed"]
    backup_entry["change_summary"] = change_analysis["change_summary"]
    backup_entry["change_details"] = change_analysis["change_details"]
    
    return backup_entry

print("✓ Dashboard field population helper function defined")
print("  - Extracts setting counts from configs")
print("  - Analyzes changes between before/after")
print("  - Generates user-friendly summaries")
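
The dry-run branch of the helper above derives its change analysis from a copy of the before-config in which only the two instance-type keys are overwritten. The sketch below illustrates that idea in isolation. `simulate_instance_change` and `diff_keys` are hypothetical names introduced here; `diff_keys` is only a stand-in for `analyze_config_changes`, whose real implementation is defined elsewhere in this notebook, and the sample config values are made up.

```python
def simulate_instance_change(before_config, driver_type, worker_type):
    """Return a shallow copy of before_config with only the instance types replaced."""
    simulated = dict(before_config)  # shallow copy, equivalent to before_config.copy()
    simulated["driver_instance_type"] = driver_type
    simulated["worker_instance_type"] = worker_type
    return simulated


def diff_keys(before, after):
    """Hypothetical stand-in for analyze_config_changes: list keys whose values differ."""
    return sorted(k for k in set(before) | set(after) if before.get(k) != after.get(k))


# Illustrative config only; not taken from a real cluster
before = {
    "driver_instance_type": "m5.2xlarge",
    "worker_instance_type": "m5.2xlarge",
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},
    "autotermination_minutes": 60,
}
after = simulate_instance_change(before, "m5.xlarge", "m5.xlarge")
changed = diff_keys(before, after)
print(changed)
```

Because the copy is shallow, nested values such as `spark_conf` are shared between the two dicts. That is fine for a read-only comparison, but a simulation that mutated nested settings would need `copy.deepcopy` instead.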

In [0]:
def update_cluster_with_recommendation(row, dry_run=True, batch_metadata=None):
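    """Process a single cluster update based on an optimization recommendation.
    
    Args:
        row: Cluster data row (recommendation record with current and suggested instance types)
        dry_run: If True, only preview changes without updating the cluster
        batch_metadata: Dict containing batch tracking info (batch_id, start_time, etc.)
    
    Returns:
        Tuple of (log_entry, backup_entry). backup_entry is None when the cluster
        is skipped or an error occurs before a backup entry is built.
    """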
    # Initialize batch metadata if not provided
    if batch_metadata is None:
        batch_metadata = {
            "batch_id": "unknown",
            "execution_label": "unknown",
            "batch_start_time": datetime.now(),
            "batch_end_time": None,
            "execution_mode": "DRY_RUN" if dry_run else "LIVE_UPDATE",
            "workspace_filter_applied": "",
            "total_clusters_in_batch": 0,
            "executed_by_user": "unknown"
        }
    
    log_entry = {
        # Batch metadata
        "batch_id": batch_metadata["batch_id"],
        "execution_label": batch_metadata.get("execution_label", "unknown"),
        "batch_start_time": batch_metadata["batch_start_time"],
        "batch_end_time": batch_metadata.get("batch_end_time"),
        "execution_mode": batch_metadata["execution_mode"],
        "workspace_filter_applied": batch_metadata["workspace_filter_applied"],
        "total_clusters_in_batch": batch_metadata["total_clusters_in_batch"],
        "executed_by_user": batch_metadata["executed_by_user"],
        
        # Individual cluster details
        "log_id": f"{row.cluster_id}_{int(time.time() * 1000)}",
        "cluster_id": row.cluster_id,
        "cluster_name": row.cluster_name,
        "workspace_name": row.workspace_name,
        "workspace_id": row.workspace_id,
        "deployment_url": f"https://{row.deployment_name}.cloud.databricks.com" if row.deployment_name else None,
        "action_type": row.action_item,
        "recommendation": row.recommendation,
        "current_driver_instance": row.driver_instance_type,
        "current_worker_instance": row.worker_instance_type,
        "suggested_driver_instance": row.suggested_driver_instance,
        "suggested_worker_instance": row.suggested_worker_instance,
        "current_min_workers": row.min_workers,
        "current_max_workers": row.max_workers,
        "validation_status": "PENDING",
        "validation_message": "",
        "update_status": "PENDING",
        "update_message": "",
        "dry_run": dry_run,
        "validated_savings": row.validated_savings,
        "execution_timestamp": datetime.now(),
        "error_details": None,
        "implementation_notes": row.implementation_notes if hasattr(row, 'implementation_notes') else None
    }
    
    # Initialize backup entry (will be populated if update proceeds)
    backup_entry = None
    
    try:
        # Skip if no deployment name
        if not row.deployment_name:
            log_entry["validation_status"] = "FAILED"
            log_entry["validation_message"] = "No deployment name found for workspace"
            log_entry["update_status"] = "SKIPPED"
            log_entry["update_message"] = "Skipped: No deployment name found for workspace"
            return log_entry, None
        
        # Create workspace client using token authentication
        try:
            ws_client = get_workspace_client(row.deployment_name, token)
        except Exception as client_error:
            error_msg = str(client_error)
            log_entry["validation_status"] = "FAILED"
            if "403" in error_msg or "Cert validation failed" in error_msg or "certificate" in error_msg.lower():
                log_entry["validation_message"] = f"Connectivity failed to workspace '{row.workspace_name}' - Cross-workspace access denied or certificate validation failed"
                log_entry["update_message"] = f"Skipped: Cannot connect to workspace '{row.workspace_name}' - Cross-workspace access issue"
            else:
                log_entry["validation_message"] = f"Failed to create workspace client: {error_msg[:200]}"
                log_entry["update_message"] = f"Skipped: Failed to create workspace client - {error_msg[:100]}"
            log_entry["update_status"] = "SKIPPED"
            log_entry["error_details"] = error_msg
            return log_entry, None
        
        # Get current cluster configuration (BEFORE state)
        before_config = get_cluster_current_config(ws_client, row.cluster_id)
        
        # Check for connectivity errors
        if "error" in before_config:
            error_msg = before_config["error"]
            log_entry["validation_status"] = "FAILED"
            if "403" in error_msg or "Cert validation failed" in error_msg or "certificate" in error_msg.lower():
                log_entry["validation_message"] = f"Connectivity failed to workspace '{row.workspace_name}' - Cross-workspace access denied or certificate validation failed"
                log_entry["update_message"] = f"Skipped: Cannot fetch cluster from workspace '{row.workspace_name}' - Cross-workspace access issue"
            else:
                log_entry["validation_message"] = f"Failed to fetch cluster: {error_msg[:200]}"
                log_entry["update_message"] = f"Skipped: Failed to fetch cluster - {error_msg[:100]}"
            log_entry["update_status"] = "SKIPPED"
            log_entry["error_details"] = error_msg
            return log_entry, None
        
        # Extract cluster creator/owner from cluster object
        cluster_creator = None
        if "cluster_object" in before_config and before_config["cluster_object"]:
            cluster_obj = before_config["cluster_object"]
            cluster_creator = getattr(cluster_obj, "creator_user_name", None)
        
        # Check if cluster uses instance pools - cannot update instance types for pooled clusters
        uses_instance_pool = before_config.get("instance_pool_id") or before_config.get("driver_instance_pool_id")
        if uses_instance_pool:
            log_entry["validation_status"] = "FAILED"
            log_entry["validation_message"] = "Cluster uses instance pools - cannot change instance types (pools define the node types)"
            log_entry["update_status"] = "SKIPPED"
            log_entry["update_message"] = "Skipped: Cluster uses instance pools - instance types are controlled by the pool configuration"
            return log_entry, None
        
        # Check if cluster is in RUNNING or PENDING state
        cluster_state = before_config.get("state")
        if cluster_state in ["RUNNING", "PENDING"]:
            log_entry["validation_status"] = "FAILED"
            log_entry["validation_message"] = f"Cluster is in {cluster_state} state - cannot update running or starting clusters"
            log_entry["update_status"] = "SKIPPED"
            log_entry["update_message"] = f"Skipped: Cluster in {cluster_state} state - stop cluster before updating"
            return log_entry, None
        
        # Validate cluster configuration matches expected state
        is_valid, validation_msg = validate_cluster_config(
            before_config,
            row.driver_instance_type,
            row.worker_instance_type,
            row.min_workers,
            row.max_workers
        )
        
        log_entry["validation_status"] = "PASSED" if is_valid else "FAILED"
        log_entry["validation_message"] = validation_msg
        
        if not is_valid:
            log_entry["update_status"] = "SKIPPED"
            log_entry["update_message"] = f"Skipped: Cluster config mismatch - {validation_msg[:150]}"
            return log_entry, None
        
        # Proceed with update if validation passed
        if dry_run:
            log_entry["update_status"] = "DRY_RUN"
            preserved_settings = []
            if before_config.get('policy_id'):
                preserved_settings.append(f"Policy: {before_config['policy_id']}")
            if before_config.get('spark_conf'):
                preserved_settings.append(f"Spark configs: {len(before_config['spark_conf'])} settings")
            if before_config.get('init_scripts'):
                preserved_settings.append(f"Init scripts: {len(before_config['init_scripts'])}")
            if before_config.get('custom_tags'):
                preserved_settings.append(f"Custom tags: {len(before_config['custom_tags'])}")
            
            settings_note = f" | Preserving: {', '.join(preserved_settings)}" if preserved_settings else " | No additional settings to preserve"
            log_entry["update_message"] = f"Would update: Driver {row.driver_instance_type}→{row.suggested_driver_instance}, Worker {row.worker_instance_type}→{row.suggested_worker_instance}{settings_note}"
            
            # Create backup entry for dry run (for preview)
            backup_entry = {
                "backup_id": f"{row.cluster_id}_{int(time.time() * 1000)}_dryrun",
                "batch_id": batch_metadata["batch_id"],
                "execution_label": batch_metadata.get("execution_label", "unknown"),
                "backup_timestamp": datetime.now(),
                "cluster_id": row.cluster_id,
                "cluster_name": row.cluster_name,
                "cluster_creator": cluster_creator,  # ADDED: Cluster owner
                "workspace_name": row.workspace_name,
                "workspace_id": row.workspace_id,
                "deployment_url": log_entry["deployment_url"],
                "update_status": "DRY_RUN",
                "update_reason": row.recommendation,
                "updated_by_user": batch_metadata["executed_by_user"],
                "before_config": serialize_cluster_config(before_config),
                "before_driver_instance": before_config.get("driver_instance_type"),
                "before_worker_instance": before_config.get("worker_instance_type"),
                "before_policy_id": before_config.get("policy_id"),
                "before_autoscale_min": before_config.get("min_workers"),
                "before_autoscale_max": before_config.get("max_workers"),
                "before_num_workers": before_config.get("num_workers"),
                "after_config": None,  # Not updated in dry run
                "after_driver_instance": row.suggested_driver_instance,
                "after_worker_instance": row.suggested_worker_instance,
                "after_policy_id": before_config.get("policy_id"),  # Would be preserved
                "after_autoscale_min": before_config.get("min_workers"),
                "after_autoscale_max": before_config.get("max_workers"),
                "after_num_workers": before_config.get("num_workers"),
                "is_reverted": False,
                "revert_timestamp": None,
                "revert_batch_id": None,
                "reverted_by_user": None
            }
            # Populate dashboard fields
            backup_entry = populate_dashboard_fields(backup_entry, before_config, None)
            
        else:
            # Perform actual update with all required parameters
            try:
                # Build edit parameters with all required fields
                edit_params = {
                    "cluster_id": row.cluster_id,
                    "cluster_name": before_config["cluster_name"],
                    "spark_version": before_config["spark_version"],
                    "node_type_id": row.suggested_worker_instance,
                    "driver_node_type_id": row.suggested_driver_instance
                }
                
                # CRITICAL FIX: Preserve ALL cluster settings
                if before_config.get("policy_id"):
                    edit_params["policy_id"] = before_config["policy_id"]
                
                if before_config.get("spark_conf"):
                    edit_params["spark_conf"] = before_config["spark_conf"]
                
                if before_config.get("custom_tags"):
                    edit_params["custom_tags"] = before_config["custom_tags"]
                
                if before_config.get("init_scripts"):
                    edit_params["init_scripts"] = before_config["init_scripts"]
                
                if before_config.get("cluster_log_conf"):
                    edit_params["cluster_log_conf"] = before_config["cluster_log_conf"]
                
                if before_config.get("ssh_public_keys"):
                    edit_params["ssh_public_keys"] = before_config["ssh_public_keys"]
                
                if before_config.get("aws_attributes"):
                    edit_params["aws_attributes"] = before_config["aws_attributes"]
                
                if before_config.get("spark_env_vars"):
                    edit_params["spark_env_vars"] = before_config["spark_env_vars"]
                
                if before_config.get("enable_elastic_disk") is not None:
                    edit_params["enable_elastic_disk"] = before_config["enable_elastic_disk"]
                
                if before_config.get("enable_local_disk_encryption") is not None:
                    edit_params["enable_local_disk_encryption"] = before_config["enable_local_disk_encryption"]
                
                # NOTE: Do NOT set instance_pool_id or driver_instance_pool_id here
                # Instance pools and node_type_id are mutually exclusive
                # We already filtered out pooled clusters above
                
                if before_config.get("data_security_mode"):
                    edit_params["data_security_mode"] = before_config["data_security_mode"]
                
                if before_config.get("runtime_engine"):
                    edit_params["runtime_engine"] = before_config["runtime_engine"]
                
                if before_config.get("autotermination_minutes") is not None:
                    edit_params["autotermination_minutes"] = before_config["autotermination_minutes"]
                
                if before_config.get("single_user_name"):
                    edit_params["single_user_name"] = before_config["single_user_name"]
                
                if before_config.get("docker_image"):
                    edit_params["docker_image"] = before_config["docker_image"]
                
                # Add autoscale or num_workers based on current configuration
                if before_config.get("autoscale"):
                    edit_params["autoscale"] = before_config["autoscale"]
                else:
                    edit_params["num_workers"] = before_config["num_workers"]
                
                # PERFORM THE UPDATE
                ws_client.clusters.edit(**edit_params)
                
                # Get AFTER configuration
                time.sleep(2)  # Brief pause to let API update
                after_config = get_cluster_current_config(ws_client, row.cluster_id)
                
                # Build success message with preserved settings
                preserved_items = []
                if before_config.get('policy_id'):
                    preserved_items.append("policy")
                if before_config.get('spark_conf'):
                    preserved_items.append("spark configs")
                if before_config.get('init_scripts'):
                    preserved_items.append("init scripts")
                if before_config.get('custom_tags'):
                    preserved_items.append("custom tags")
                if before_config.get('autotermination_minutes') is not None:
                    preserved_items.append("autotermination")
                
                preserved_note = f" | Preserved: {', '.join(preserved_items)}" if preserved_items else ""
                log_entry["update_status"] = "SUCCESS"
                log_entry["update_message"] = f"Updated: Driver {row.driver_instance_type}→{row.suggested_driver_instance}, Worker {row.worker_instance_type}→{row.suggested_worker_instance}{preserved_note}"
                
                # Create backup entry with BEFORE and AFTER configs
                backup_entry = {
                    "backup_id": f"{row.cluster_id}_{int(time.time() * 1000)}",
                    "batch_id": batch_metadata["batch_id"],
                    "execution_label": batch_metadata.get("execution_label", "unknown"),
                    "backup_timestamp": datetime.now(),
                    "cluster_id": row.cluster_id,
                    "cluster_name": row.cluster_name,
                    "cluster_creator": cluster_creator,  # ADDED: Cluster owner
                    "workspace_name": row.workspace_name,
                    "workspace_id": row.workspace_id,
                    "deployment_url": log_entry["deployment_url"],
                    "update_status": "SUCCESS",
                    "update_reason": row.recommendation,
                    "updated_by_user": batch_metadata["executed_by_user"],
                    "before_config": serialize_cluster_config(before_config),
                    "before_driver_instance": before_config.get("driver_instance_type"),
                    "before_worker_instance": before_config.get("worker_instance_type"),
                    "before_policy_id": before_config.get("policy_id"),
                    "before_autoscale_min": before_config.get("min_workers"),
                    "before_autoscale_max": before_config.get("max_workers"),
                    "before_num_workers": before_config.get("num_workers"),
                    "after_config": serialize_cluster_config(after_config),
                    "after_driver_instance": after_config.get("driver_instance_type"),
                    "after_worker_instance": after_config.get("worker_instance_type"),
                    "after_policy_id": after_config.get("policy_id"),
                    "after_autoscale_min": after_config.get("min_workers"),
                    "after_autoscale_max": after_config.get("max_workers"),
                    "after_num_workers": after_config.get("num_workers"),
                    "is_reverted": False,
                    "revert_timestamp": None,
                    "revert_batch_id": None,
                    "reverted_by_user": None
                }
                # Populate dashboard fields
                backup_entry = populate_dashboard_fields(backup_entry, before_config, after_config)
                
            except Exception as update_error:
                error_msg = str(update_error)
                log_entry["update_status"] = "FAILED"
                if "403" in error_msg or "Cert validation failed" in error_msg or "certificate" in error_msg.lower():
                    log_entry["update_message"] = f"Failed: Connectivity issue to workspace '{row.workspace_name}' during update"
                else:
                    log_entry["update_message"] = f"Failed: {error_msg[:200]}"
                log_entry["error_details"] = error_msg
                
                # Create backup entry for failed update (still save before config)
                backup_entry = {
                    "backup_id": f"{row.cluster_id}_{int(time.time() * 1000)}_failed",
                    "batch_id": batch_metadata["batch_id"],
                    "execution_label": batch_metadata.get("execution_label", "unknown"),
                    "backup_timestamp": datetime.now(),
                    "cluster_id": row.cluster_id,
                    "cluster_name": row.cluster_name,
                    "cluster_creator": cluster_creator,  # ADDED: Cluster owner
                    "workspace_name": row.workspace_name,
                    "workspace_id": row.workspace_id,
                    "deployment_url": log_entry["deployment_url"],
                    "update_status": "FAILED",
                    "update_reason": row.recommendation,
                    "updated_by_user": batch_metadata["executed_by_user"],
                    "before_config": serialize_cluster_config(before_config),
                    "before_driver_instance": before_config.get("driver_instance_type"),
                    "before_worker_instance": before_config.get("worker_instance_type"),
                    "before_policy_id": before_config.get("policy_id"),
                    "before_autoscale_min": before_config.get("min_workers"),
                    "before_autoscale_max": before_config.get("max_workers"),
                    "before_num_workers": before_config.get("num_workers"),
                    "after_config": None,
                    "after_driver_instance": None,
                    "after_worker_instance": None,
                    "after_policy_id": None,
                    "after_autoscale_min": None,
                    "after_autoscale_max": None,
                    "after_num_workers": None,
                    "is_reverted": False,
                    "revert_timestamp": None,
                    "revert_batch_id": None,
                    "reverted_by_user": None
                }
                # Populate dashboard fields
                backup_entry = populate_dashboard_fields(backup_entry, before_config, None)
            
    except Exception as e:
        error_msg = str(e)
        log_entry["validation_status"] = "ERROR"
        log_entry["update_status"] = "FAILED"
        log_entry["error_details"] = error_msg
        if "403" in error_msg or "Cert validation failed" in error_msg or "certificate" in error_msg.lower():
            log_entry["validation_message"] = f"Connectivity failed to workspace '{row.workspace_name}' - Cross-workspace access denied or certificate validation failed"
            log_entry["update_message"] = f"Failed: Cannot connect to workspace '{row.workspace_name}'"
        else:
            log_entry["validation_message"] = f"Error: {error_msg[:200]}"
            log_entry["update_message"] = f"Failed: {error_msg[:200]}"
    
    return log_entry, backup_entry

print("✓ Main orchestration function defined with token authentication and implementation_notes support")
print("✓ PRODUCTION-READY: Now capturing cluster_creator (owner) in all backup entries")
print("✓ FIXED: Now preserving ALL cluster settings during updates")
print("✓ FIXED: Instance pool clusters are now properly detected and skipped")
print("✓ NEW: Now capturing before/after configs for rollback capability")
print("✓ NEW: Dashboard-friendly fields automatically populated for all backups")
print("  - Policy, Spark configs, custom tags, init scripts, log config")
print("  - SSH keys, AWS attributes, environment variables, disk settings")
print("  - Security mode, runtime engine, autotermination")
print("  - Change summaries, impact levels, and user-friendly descriptions")
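
The live-update branch above copies each preserved setting into `edit_params` with its own `if` statement. As a rough sketch of the same rule in compact form, the settings can be split into two groups: those copied only when truthy, and those where `False` or `0` is a meaningful value and only `None` is skipped. `build_edit_params`, `TRUTHY_KEYS`, and `NOT_NONE_KEYS` are hypothetical names introduced for illustration, and the sample cluster values are made up.

```python
# Settings copied only when they hold a truthy value
TRUTHY_KEYS = (
    "policy_id", "spark_conf", "custom_tags", "init_scripts", "cluster_log_conf",
    "ssh_public_keys", "aws_attributes", "spark_env_vars", "data_security_mode",
    "runtime_engine", "single_user_name", "docker_image",
)
# Settings where False or 0 is meaningful, so only None is skipped
NOT_NONE_KEYS = (
    "enable_elastic_disk", "enable_local_disk_encryption", "autotermination_minutes",
)


def build_edit_params(cluster_id, before_config, driver_type, worker_type):
    """Build edit parameters that change instance types and carry over other settings."""
    params = {
        "cluster_id": cluster_id,
        "cluster_name": before_config["cluster_name"],
        "spark_version": before_config["spark_version"],
        "node_type_id": worker_type,
        "driver_node_type_id": driver_type,
    }
    for key in TRUTHY_KEYS:
        if before_config.get(key):
            params[key] = before_config[key]
    for key in NOT_NONE_KEYS:
        if before_config.get(key) is not None:
            params[key] = before_config[key]
    # Instance pool IDs are deliberately never copied: pooled clusters are skipped earlier.
    # Carry over autoscale when present, otherwise the fixed worker count.
    if before_config.get("autoscale"):
        params["autoscale"] = before_config["autoscale"]
    else:
        params["num_workers"] = before_config.get("num_workers")
    return params


# Illustrative values only
before = {
    "cluster_name": "etl-nightly",
    "spark_version": "13.3.x-scala2.12",
    "policy_id": "ABC123",
    "spark_conf": {},
    "autotermination_minutes": 0,
    "enable_elastic_disk": False,
    "autoscale": None,
    "num_workers": 4,
}
params = build_edit_params("0101-000000-abcdefgh", before, "m5.xlarge", "m5.large")
print(sorted(params))
```

Keeping the two key groups as data makes the truthy versus not-None distinction explicit, which matters for settings such as `autotermination_minutes = 0` that must not be dropped from the edit request.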

In [0]:
# Create logging table schema using full_schema variable

spark.sql(f"""
CREATE TABLE IF NOT EXISTS {full_schema}.cluster_update_log (
  -- Batch Identification
  batch_id STRING COMMENT 'Unique identifier for this execution batch',
  execution_label STRING COMMENT 'Human-readable label: YYYY-MM-DD_HH-MM_MODE_WORKSPACES for easy filtering',
  batch_start_time TIMESTAMP COMMENT 'When this batch execution started',
  batch_end_time TIMESTAMP COMMENT 'When this batch execution completed',
  execution_mode STRING COMMENT 'DRY_RUN or LIVE_UPDATE',
  workspace_filter_applied STRING COMMENT 'Workspace filter used (empty = all workspaces)',
  total_clusters_in_batch LONG COMMENT 'Total number of clusters processed in this batch',
  executed_by_user STRING COMMENT 'User who executed this batch',
  
  -- Individual Cluster Details
  log_id STRING COMMENT 'Unique identifier for this specific cluster update',
  cluster_id STRING,
  cluster_name STRING,
  workspace_name STRING,
  workspace_id LONG,
  deployment_url STRING,
  action_type STRING,
  recommendation STRING,
  current_driver_instance STRING,
  current_worker_instance STRING,
  suggested_driver_instance STRING,
  suggested_worker_instance STRING,
  current_min_workers LONG,
  current_max_workers LONG,
  validation_status STRING,
  validation_message STRING,
  update_status STRING,
  update_message STRING,
  dry_run BOOLEAN,
  validated_savings DECIMAL(35,2),
  execution_timestamp TIMESTAMP COMMENT 'When this specific cluster was processed',
  error_details STRING,
  implementation_notes STRING COMMENT 'Notes about instance type constraints or requirements'
)
USING DELTA
COMMENT 'Log table for cluster update activities with batch tracking for filtering and analysis'
""")

displayHTML(f"""
<div style="padding: 15px; background-color: #e8f5e9; border-left: 5px solid #4caf50; margin: 10px 0;">
    <h3 style="margin-top: 0; color: #2e7d32;">✓ Logging table ready</h3>
    <p style="margin: 5px 0; color: #1b5e20;"><strong>Table:</strong> <code style="background-color: #c8e6c9; padding: 2px 6px; border-radius: 3px;">{full_schema}.cluster_update_log</code></p>
    <p style="margin: 5px 0; color: #1b5e20;"><strong>New:</strong> Added implementation_notes column for tracking constraints</p>
</div>
""")
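
Each row written to `cluster_update_log` carries an `update_status` and a `validated_savings` value, so a batch can be summarized before or after it is persisted. The sketch below works on plain Python dicts shaped like `log_entry`. `summarize_batch` is a hypothetical helper, the sample rows are made up, and counting savings only for `SUCCESS` and `DRY_RUN` rows is an assumption about what a summary should include.

```python
from collections import Counter
from decimal import Decimal


def summarize_batch(log_entries):
    """Count entries per update_status and total savings for SUCCESS and DRY_RUN rows."""
    status_counts = Counter(entry["update_status"] for entry in log_entries)
    savings = sum(
        (
            Decimal(str(entry["validated_savings"] or 0))
            for entry in log_entries
            if entry["update_status"] in ("SUCCESS", "DRY_RUN")
        ),
        Decimal("0"),
    )
    return dict(status_counts), savings


# Illustrative rows only
sample_entries = [
    {"update_status": "DRY_RUN", "validated_savings": Decimal("120.50")},
    {"update_status": "DRY_RUN", "validated_savings": Decimal("79.50")},
    {"update_status": "SKIPPED", "validated_savings": Decimal("300.00")},
]
counts, total_savings = summarize_batch(sample_entries)
print(counts, total_savings)
```

`Decimal` is used rather than `float` to match the `DECIMAL(35,2)` column type and avoid rounding drift when totals are compared against the table.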

In [0]:
# Create cluster configuration backup table for rollback capability
# Production-ready: Works in new workspaces, includes all critical fields

try:
    # Create table with all fields
    spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {full_schema}.cluster_config_backup (
      -- Backup Identification
      backup_id STRING COMMENT 'Unique identifier for this backup record',
      batch_id STRING COMMENT 'Links to cluster_update_log batch_id',
      execution_label STRING COMMENT 'Links to cluster_update_log execution_label',
      backup_timestamp TIMESTAMP COMMENT 'When this backup was created',
      
      -- Cluster Identification
      cluster_id STRING COMMENT 'Databricks cluster ID',
      cluster_name STRING COMMENT 'Cluster name at time of backup',
      cluster_creator STRING COMMENT 'Cluster owner/creator username - CRITICAL for ownership tracking',
      workspace_name STRING COMMENT 'Workspace containing the cluster',
      workspace_id LONG COMMENT 'Workspace ID - CRITICAL for cross-workspace operations',
      deployment_url STRING COMMENT 'Full workspace URL',
      
      -- Update Context
      update_status STRING COMMENT 'SUCCESS, FAILED, DRY_RUN, or SKIPPED - links to cluster_update_log',
      update_reason STRING COMMENT 'Why the update was performed',
      updated_by_user STRING COMMENT 'User who performed the update',
      
      -- BEFORE Configuration (Complete JSON)
      before_config STRING COMMENT 'Complete cluster configuration BEFORE update (JSON format)',
      before_driver_instance STRING COMMENT 'Driver instance type before update',
      before_worker_instance STRING COMMENT 'Worker instance type before update',
      before_policy_id STRING COMMENT 'Cluster policy ID before update',
      before_autoscale_min INT COMMENT 'Min workers before update',
      before_autoscale_max INT COMMENT 'Max workers before update',
      before_num_workers INT COMMENT 'Fixed workers before update',
      before_spark_config_count INT COMMENT 'Number of Spark configs before update',
      before_custom_tags_count INT COMMENT 'Number of custom tags before update',
      before_init_scripts_count INT COMMENT 'Number of init scripts before update',
      before_autotermination_minutes INT COMMENT 'Autotermination setting before update',
      before_runtime_engine STRING COMMENT 'Runtime engine before update',
      before_data_security_mode STRING COMMENT 'Data security mode before update',
      
      -- AFTER Configuration (Complete JSON)
      after_config STRING COMMENT 'Complete cluster configuration AFTER update (JSON format)',
      after_driver_instance STRING COMMENT 'Driver instance type after update',
      after_worker_instance STRING COMMENT 'Worker instance type after update',
      after_policy_id STRING COMMENT 'Cluster policy ID after update',
      after_autoscale_min INT COMMENT 'Min workers after update',
      after_autoscale_max INT COMMENT 'Max workers after update',
      after_num_workers INT COMMENT 'Fixed workers after update',
      after_spark_config_count INT COMMENT 'Number of Spark configs after update',
      after_custom_tags_count INT COMMENT 'Number of custom tags after update',
      after_init_scripts_count INT COMMENT 'Number of init scripts after update',
      after_autotermination_minutes INT COMMENT 'Autotermination setting after update',
      after_runtime_engine STRING COMMENT 'Runtime engine after update',
      after_data_security_mode STRING COMMENT 'Data security mode after update',
      
      -- Change Analysis (Dashboard-friendly)
      change_categories STRING COMMENT 'Comma-separated list of what changed',
      total_changes_count INT COMMENT 'Total number of configuration changes',
      change_impact STRING COMMENT 'MINOR, MODERATE, MAJOR, or NONE',
      instance_type_changed BOOLEAN COMMENT 'Did instance types change?',
      policy_changed BOOLEAN COMMENT 'Did cluster policy change?',
      spark_config_changed BOOLEAN COMMENT 'Did Spark configs change?',
      tags_changed BOOLEAN COMMENT 'Did custom tags change?',
      init_scripts_changed BOOLEAN COMMENT 'Did init scripts change?',
      autotermination_changed BOOLEAN COMMENT 'Did autotermination setting change?',
      runtime_engine_changed BOOLEAN COMMENT 'Did runtime engine change?',
      security_mode_changed BOOLEAN COMMENT 'Did security mode change?',
      autoscale_changed BOOLEAN COMMENT 'Did autoscale settings change?',
      change_summary STRING COMMENT 'Human-readable summary of changes',
      change_details STRING COMMENT 'Detailed list of all changes',
      
      -- Rollback Tracking
      is_reverted BOOLEAN COMMENT 'Has this cluster been reverted to before_config?',
      revert_timestamp TIMESTAMP COMMENT 'When the cluster was reverted (if applicable)',
      revert_batch_id STRING COMMENT 'Batch ID of the revert operation',
      reverted_by_user STRING COMMENT 'User who performed the revert'
    )
    USING DELTA
    COMMENT 'Backup table storing complete before/after cluster configurations for rollback capability'
    """)
    
    # For existing tables, ensure cluster_creator column exists (added in v2.0)
    # This handles upgrades from older versions
    try:
        existing_columns = spark.sql(f"DESCRIBE TABLE {full_schema}.cluster_config_backup").toPandas()
        has_cluster_creator = 'cluster_creator' in existing_columns['col_name'].values
        
        if not has_cluster_creator:
            print("⚠️  Upgrading existing table: Adding cluster_creator column...")
            spark.sql(f"""
                ALTER TABLE {full_schema}.cluster_config_backup 
                ADD COLUMN cluster_creator STRING 
                COMMENT 'Cluster owner/creator username - CRITICAL for ownership tracking'
                AFTER cluster_name
            """)
            print("✅ Successfully added cluster_creator column to existing table")
    except Exception as alter_error:
        if "already exists" in str(alter_error).lower():
            pass  # Column already exists, no action needed
        else:
            print(f"⚠️  Warning during column check: {str(alter_error)}")
    
    # Verify table was created successfully
    try:
        table_check = spark.sql(f"DESCRIBE TABLE {full_schema}.cluster_config_backup")
        table_exists = table_check.count() > 0
    except Exception:
        table_exists = False
    
    if table_exists:
        # Get table info for verification
        table_info = spark.sql(f"DESCRIBE TABLE {full_schema}.cluster_config_backup")
        column_count = table_info.count()
        
        # Verify critical fields
        columns_df = table_info.toPandas()
        critical_fields = ['cluster_creator', 'workspace_id', 'cluster_id', 'cluster_name']
        missing_fields = [f for f in critical_fields if f not in columns_df['col_name'].values]
        
        if missing_fields:
            displayHTML(f"""
            <div style="padding: 15px; background-color: #fff3e0; border-left: 5px solid #ff9800; margin: 10px 0;">
                <h3 style="margin-top: 0; color: #e65100;">⚠️ Warning: Missing Critical Fields</h3>
                <p style="margin: 5px 0; color: #bf360c;"><strong>Table:</strong> <code style="background-color: #ffe0b2; padding: 2px 6px; border-radius: 3px;">{full_schema}.cluster_config_backup</code></p>
                <p style="margin: 5px 0; color: #bf360c;"><strong>Missing Fields:</strong> {', '.join(missing_fields)}</p>
                <p style="margin: 5px 0; color: #bf360c;"><strong>Action:</strong> Table may need manual schema update</p>
            </div>
            """)
        else:
            displayHTML(f"""
            <div style="padding: 15px; background-color: #e8f5e9; border-left: 5px solid #4caf50; margin: 10px 0;">
                <h3 style="margin-top: 0; color: #2e7d32;">✓ Cluster Config Backup Table Ready</h3>
                <p style="margin: 5px 0; color: #1b5e20;"><strong>Table:</strong> <code style="background-color: #c8e6c9; padding: 2px 6px; border-radius: 3px;">{full_schema}.cluster_config_backup</code></p>
                <p style="margin: 5px 0; color: #1b5e20;"><strong>Columns:</strong> {column_count} fields defined</p>
                <p style="margin: 5px 0; color: #1b5e20;"><strong>Status:</strong> Production-ready ✓</p>
                <p style="margin: 5px 0; color: #1b5e20;"><strong>Critical Fields Verified:</strong></p>
                <ul style="color: #1b5e20; margin: 5px 0; padding-left: 20px;">
                    <li><strong>cluster_creator</strong> ✓ - Cluster owner/creator for ownership tracking</li>
                    <li><strong>workspace_id</strong> ✓ - Workspace ID for cross-workspace operations</li>
                    <li><strong>cluster_id</strong> ✓ - Unique cluster identifier</li>
                    <li><strong>cluster_name</strong> ✓ - Human-readable cluster name</li>
                </ul>
                <p style="margin: 5px 0; color: #1b5e20;"><strong>Features:</strong></p>
                <ul style="color: #1b5e20; margin: 5px 0; padding-left: 20px;">
                    <li>Complete JSON configuration snapshots (before/after)</li>
                    <li>Links to update logs via batch_id and execution_label</li>
                    <li>Rollback tracking (is_reverted, revert_timestamp)</li>
                    <li>Key fields extracted for easy querying</li>
                    <li>Dashboard-ready change analysis fields (14 fields)</li>
                </ul>
            </div>
            """)
    else:
        raise Exception(f"Table creation appeared to succeed but table not found in {full_schema}")
        
except Exception as e:
    displayHTML(f"""
    <div style="padding: 15px; background-color: #ffebee; border-left: 5px solid #f44336; margin: 10px 0;">
        <h3 style="margin-top: 0; color: #c62828;">✗ Error Creating Backup Table</h3>
        <p style="margin: 5px 0; color: #b71c1c;"><strong>Error:</strong> {str(e)}</p>
        <p style="margin: 5px 0; color: #b71c1c;"><strong>Table:</strong> {full_schema}.cluster_config_backup</p>
        <p style="margin: 5px 0; color: #b71c1c;"><strong>Action Required:</strong> Check catalog/schema permissions and try again</p>
    </div>
    """)
    raise
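
The upgrade check in the cell above handles a single column. If further columns are added in later versions, the same pattern could be generalized. The sketch below is a minimal, Spark-free illustration under stated assumptions: `build_add_column_statements` and `REQUIRED_COLUMNS` are hypothetical names that do not appear in this notebook, and the function only builds `ALTER TABLE ... ADD COLUMN` strings. Running them through `spark.sql` is left to the caller.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping of column name -> SQL type for columns added after v1.
REQUIRED_COLUMNS: Dict[str, str] = {"cluster_creator": "STRING"}


def build_add_column_statements(
    table: str,
    existing_columns: Iterable[str],
    required: Dict[str, str] = REQUIRED_COLUMNS,
) -> List[str]:
    """Return one ALTER TABLE statement per required column that is missing."""
    # Compare case-insensitively, since column names may differ only in case
    existing = {name.lower() for name in existing_columns}
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}"
        for name, sql_type in required.items()
        if name.lower() not in existing
    ]
```

In the notebook, `existing_columns` could be the `col_name` values already read from `DESCRIBE TABLE`.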

In [0]:
import uuid

# Configuration is loaded from widgets (see cells above)
# DRY_RUN, WORKSPACES_TO_UPDATE, and cluster_data are already set

mode_text = "DRY RUN (Preview Only)" if DRY_RUN else "LIVE UPDATE"
mode_color = "#4caf50" if DRY_RUN else "#f44336"

# Generate batch metadata
batch_start_time = datetime.now()
batch_id = str(uuid.uuid4())
current_user = spark.sql("SELECT current_user() as user").collect()[0]["user"]

# Create execution label for easy filtering
# Format: YYYY-MM-DD_HH-MM_MODE_WORKSPACES
date_str = batch_start_time.strftime('%Y-%m-%d')
time_str = batch_start_time.strftime('%H-%M')
mode_str = "DRY-RUN" if DRY_RUN else "LIVE"
workspace_str = "+".join(ws.strip() for ws in WORKSPACES_TO_UPDATE.split(",")) if WORKSPACES_TO_UPDATE else "ALL"
execution_label = f"{date_str}_{time_str}_{mode_str}_{workspace_str}"

displayHTML(f"""
<div style="padding: 15px; background-color: #e3f2fd; border-left: 5px solid #2196f3; margin: 10px 0;">
    <h3 style="margin-top: 0; color: #1565c0;">Execution Mode: <span style="color: {mode_color};">{mode_text}</span></h3>
    <p style="margin: 5px 0; color: #0d47a1;"><strong>Execution Label:</strong> <code style="background-color: #bbdefb; padding: 2px 6px; border-radius: 3px;">{execution_label}</code></p>
    <p style="margin: 5px 0; color: #0d47a1;"><strong>Batch ID:</strong> {batch_id}</p>
    <p style="margin: 5px 0; color: #0d47a1;"><strong>Executed By:</strong> {current_user}</p>
    <p style="margin: 5px 0; color: #0d47a1;"><strong>Start Time:</strong> {batch_start_time.strftime('%Y-%m-%d %H:%M:%S')}</p>
</div>
""")

# Filter by workspace if specified (cluster_data already loaded in previous cell)
filtered_cluster_data = cluster_data
if WORKSPACES_TO_UPDATE:
    workspace_list = [ws.strip() for ws in WORKSPACES_TO_UPDATE.split(",")]
    filtered_cluster_data = cluster_data.filter(F.col("workspace_name").isin(workspace_list))

# Collect cluster data
cluster_rows = filtered_cluster_data.collect()
total_clusters = len(cluster_rows)

# Create batch metadata dictionary
batch_metadata = {
    "batch_id": batch_id,
    "execution_label": execution_label,
    "batch_start_time": batch_start_time,
    "batch_end_time": None,  # Will be set after processing
    "execution_mode": "DRY_RUN" if DRY_RUN else "LIVE_UPDATE",
    "workspace_filter_applied": WORKSPACES_TO_UPDATE if WORKSPACES_TO_UPDATE else "ALL",
    "total_clusters_in_batch": total_clusters,
    "executed_by_user": current_user
}

# Initialize result lists up front so later cells still work when no clusters match
log_entries = []
backup_entries = []

if not cluster_rows:
    displayHTML("""
    <div style="padding: 15px; background-color: #ffebee; border-left: 5px solid #f44336; margin: 10px 0;">
        <h3 style="margin: 0; color: #c62828;">⚠ No clusters found matching the filter criteria</h3>
    </div>
    """)
else:
    displayHTML(f"""
    <div style="padding: 15px; background-color: #fff3e0; border-left: 5px solid #ff9800; margin: 10px 0;">
        <h3 style="margin: 0; color: #e65100;">🔄 Processing {len(cluster_rows)} clusters...</h3>
    </div>
    """)
    
    # Process each cluster
    for idx, row in enumerate(cluster_rows, 1):
        print(f"Processing {idx}/{len(cluster_rows)}: {row.cluster_name} ({row.cluster_id}) in {row.workspace_name}")
        log_entry, backup_entry = update_cluster_with_recommendation(row, dry_run=DRY_RUN, batch_metadata=batch_metadata)
        log_entries.append(log_entry)
        if backup_entry:  # Only add if backup was created
            backup_entries.append(backup_entry)
        
        # Brief pause to avoid rate limiting
        if idx % 10 == 0:
            time.sleep(1)
    
    # Update batch end time
    batch_end_time = datetime.now()
    batch_metadata["batch_end_time"] = batch_end_time
    
    # Update all log entries with batch end time
    for entry in log_entries:
        entry["batch_end_time"] = batch_end_time
    
    duration_seconds = (batch_end_time - batch_start_time).total_seconds()
    
    displayHTML(f"""
    <div style="padding: 15px; background-color: #e8f5e9; border-left: 5px solid #4caf50; margin: 10px 0;">
        <h3 style="margin: 0; color: #2e7d32;">✓ Processed {len(log_entries)} clusters</h3>
        <p style="margin: 5px 0; color: #1b5e20;"><strong>Batch Duration:</strong> {duration_seconds:.2f} seconds</p>
        <p style="margin: 5px 0; color: #1b5e20;"><strong>Execution Label:</strong> <code style="background-color: #c8e6c9; padding: 2px 6px; border-radius: 3px;">{execution_label}</code></p>
        <p style="margin: 5px 0; color: #1b5e20;"><strong>Batch ID:</strong> {batch_id}</p>
        <p style="margin: 5px 0; color: #1b5e20;"><strong>Config Backups Created:</strong> {len(backup_entries)}</p>
    </div>
    """)
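
The execution label built in the cell above follows the `YYYY-MM-DD_HH-MM_MODE_WORKSPACES` format. As a minimal sketch, the same logic can be isolated into a helper that is testable without a cluster. `build_execution_label` is a hypothetical name, and stripping whitespace around workspace names is a choice made in this sketch.

```python
from datetime import datetime


def build_execution_label(start: datetime, dry_run: bool, workspaces: str = "") -> str:
    """Build a label in the YYYY-MM-DD_HH-MM_MODE_WORKSPACES format."""
    mode = "DRY-RUN" if dry_run else "LIVE"
    # Strip whitespace around each workspace name so "dev, prod" becomes "dev+prod"
    names = [ws.strip() for ws in workspaces.split(",") if ws.strip()]
    workspace_part = "+".join(names) if names else "ALL"
    return f"{start.strftime('%Y-%m-%d')}_{start.strftime('%H-%M')}_{mode}_{workspace_part}"
```

For example, a dry run started at 15:30 on 2024-11-20 against the `prod` workspace would be labeled `2024-11-20_15-30_DRY-RUN_prod`.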

In [0]:
# Convert log entries to DataFrame
log_schema = StructType([
    # Batch metadata
    StructField("batch_id", StringType(), True),
    StructField("execution_label", StringType(), True),
    StructField("batch_start_time", TimestampType(), True),
    StructField("batch_end_time", TimestampType(), True),
    StructField("execution_mode", StringType(), True),
    StructField("workspace_filter_applied", StringType(), True),
    StructField("total_clusters_in_batch", LongType(), True),
    StructField("executed_by_user", StringType(), True),
    
    # Individual cluster details
    StructField("log_id", StringType(), True),
    StructField("cluster_id", StringType(), True),
    StructField("cluster_name", StringType(), True),
    StructField("workspace_name", StringType(), True),
    StructField("workspace_id", LongType(), True),
    StructField("deployment_url", StringType(), True),
    StructField("action_type", StringType(), True),
    StructField("recommendation", StringType(), True),
    StructField("current_driver_instance", StringType(), True),
    StructField("current_worker_instance", StringType(), True),
    StructField("suggested_driver_instance", StringType(), True),
    StructField("suggested_worker_instance", StringType(), True),
    StructField("current_min_workers", LongType(), True),
    StructField("current_max_workers", LongType(), True),
    StructField("validation_status", StringType(), True),
    StructField("validation_message", StringType(), True),
    StructField("update_status", StringType(), True),
    StructField("update_message", StringType(), True),
    StructField("dry_run", BooleanType(), True),
    StructField("validated_savings", DecimalType(35, 2), True),
    StructField("execution_timestamp", TimestampType(), True),
    StructField("error_details", StringType(), True),
    StructField("implementation_notes", StringType(), True)
])

log_df = spark.createDataFrame(log_entries, schema=log_schema)

# Write to logging table using full_schema variable
log_df.write.mode("append").saveAsTable(f"{full_schema}.cluster_update_log")

log_count = log_df.count()

# Save backup entries if any exist
backup_count = 0
if backup_entries:
    backup_schema = StructType([
        # Basic identification
        StructField("backup_id", StringType(), True),
        StructField("batch_id", StringType(), True),
        StructField("execution_label", StringType(), True),
        StructField("backup_timestamp", TimestampType(), True),
        StructField("cluster_id", StringType(), True),
        StructField("cluster_name", StringType(), True),
        StructField("cluster_creator", StringType(), True),  # Cluster owner/creator username
        StructField("workspace_name", StringType(), True),
        StructField("workspace_id", LongType(), True),
        StructField("deployment_url", StringType(), True),
        StructField("update_status", StringType(), True),
        StructField("update_reason", StringType(), True),
        StructField("updated_by_user", StringType(), True),
        
        # Before configuration
        StructField("before_config", StringType(), True),
        StructField("before_driver_instance", StringType(), True),
        StructField("before_worker_instance", StringType(), True),
        StructField("before_policy_id", StringType(), True),
        StructField("before_autoscale_min", IntegerType(), True),
        StructField("before_autoscale_max", IntegerType(), True),
        StructField("before_num_workers", IntegerType(), True),
        
        # After configuration
        StructField("after_config", StringType(), True),
        StructField("after_driver_instance", StringType(), True),
        StructField("after_worker_instance", StringType(), True),
        StructField("after_policy_id", StringType(), True),
        StructField("after_autoscale_min", IntegerType(), True),
        StructField("after_autoscale_max", IntegerType(), True),
        StructField("after_num_workers", IntegerType(), True),
        
        # Revert tracking
        StructField("is_reverted", BooleanType(), True),
        StructField("revert_timestamp", TimestampType(), True),
        StructField("revert_batch_id", StringType(), True),
        StructField("reverted_by_user", StringType(), True),
        
        # Dashboard-friendly fields - Change summary
        StructField("change_categories", StringType(), True),
        StructField("total_changes_count", IntegerType(), True),
        StructField("change_impact", StringType(), True),
        
        # Dashboard-friendly fields - Extracted settings counts
        StructField("before_spark_config_count", IntegerType(), True),
        StructField("after_spark_config_count", IntegerType(), True),
        StructField("before_custom_tags_count", IntegerType(), True),
        StructField("after_custom_tags_count", IntegerType(), True),
        StructField("before_init_scripts_count", IntegerType(), True),
        StructField("after_init_scripts_count", IntegerType(), True),
        StructField("before_autotermination_minutes", IntegerType(), True),
        StructField("after_autotermination_minutes", IntegerType(), True),
        StructField("before_runtime_engine", StringType(), True),
        StructField("after_runtime_engine", StringType(), True),
        StructField("before_data_security_mode", StringType(), True),
        StructField("after_data_security_mode", StringType(), True),
        
        # Dashboard-friendly fields - Change flags
        StructField("instance_type_changed", BooleanType(), True),
        StructField("policy_changed", BooleanType(), True),
        StructField("spark_config_changed", BooleanType(), True),
        StructField("tags_changed", BooleanType(), True),
        StructField("init_scripts_changed", BooleanType(), True),
        StructField("autotermination_changed", BooleanType(), True),
        StructField("runtime_engine_changed", BooleanType(), True),
        StructField("security_mode_changed", BooleanType(), True),
        StructField("autoscale_changed", BooleanType(), True),
        
        # Dashboard-friendly fields - User-friendly summaries
        StructField("change_summary", StringType(), True),
        StructField("change_details", StringType(), True)
    ])
    
    backup_df = spark.createDataFrame(backup_entries, schema=backup_schema)
    backup_df.write.mode("append").saveAsTable(f"{full_schema}.cluster_config_backup")
    backup_count = backup_df.count()

displayHTML(f"""
<div style="padding: 15px; background-color: #e8f5e9; border-left: 5px solid #4caf50; margin: 10px 0;">
    <h3 style="margin: 0; color: #2e7d32;">✓ Saved results to logging tables</h3>
    <p style="margin: 5px 0; color: #1b5e20;"><strong>Update Logs:</strong> {log_count} entries saved to <code style="background-color: #c8e6c9; padding: 2px 6px; border-radius: 3px;">{full_schema}.cluster_update_log</code></p>
    <p style="margin: 5px 0; color: #1b5e20;"><strong>Config Backups:</strong> {backup_count} entries saved to <code style="background-color: #c8e6c9; padding: 2px 6px; border-radius: 3px;">{full_schema}.cluster_config_backup</code></p>
    <p style="margin: 5px 0; color: #1b5e20;"><strong>Dashboard Fields:</strong> Change summaries, impact levels, and user-friendly descriptions included</p>
    <p style="margin: 5px 0; color: #1b5e20;"><strong>Execution Label:</strong> <code style="background-color: #c8e6c9; padding: 2px 6px; border-radius: 3px;">{execution_label}</code></p>
    <p style="margin: 5px 0; color: #1b5e20;"><strong>Batch ID:</strong> {batch_id}</p>
</div>
""")
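
Both writes above pass Python dictionaries to `createDataFrame` with an explicit schema. Assuming that keys absent from a dictionary are stored as nulls rather than rejected (worth confirming on your runtime), a misspelled key could silently produce an empty column. The sketch below shows one way to catch that before writing. `find_schema_mismatches` is a hypothetical helper, not part of this notebook.

```python
from typing import Any, Dict, Iterable, List


def find_schema_mismatches(
    entries: Iterable[Dict[str, Any]], expected_fields: Iterable[str]
) -> List[Dict[str, Any]]:
    """Report missing and unexpected keys for each entry that deviates from the schema."""
    expected = set(expected_fields)
    problems = []
    for index, entry in enumerate(entries):
        keys = set(entry)
        missing = expected - keys
        extra = keys - expected
        if missing or extra:
            problems.append({"index": index, "missing": missing, "extra": extra})
    return problems
```

In the notebook this could be called as `find_schema_mismatches(log_entries, log_schema.fieldNames())` before the DataFrame is created.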

In [0]:
# Sample queries for building dashboards with cluster_config_backup table

displayHTML("""
<div style="padding: 15px; background-color: #e3f2fd; border-left: 5px solid #2196f3; margin: 10px 0;">
    <h3 style="margin-top: 0; color: #1565c0;">📊 Sample Dashboard Queries</h3>
    <p style="margin: 5px 0; color: #0d47a1;">Use these queries to build user-friendly dashboards showing cluster changes</p>
</div>
""")

print("\n" + "="*80)
print("QUERY 1: User-Friendly Change Summary")
print("="*80)
print(f"""
SELECT 
  execution_label,
  cluster_name,
  workspace_name,
  change_summary,
  change_impact,
  total_changes_count,
  backup_timestamp,
  updated_by_user
FROM {full_schema}.cluster_config_backup
WHERE update_status = 'SUCCESS'
  AND is_reverted = false
ORDER BY backup_timestamp DESC
LIMIT 50
""")

print("\n" + "="*80)
print("QUERY 2: Changes by Category (for filtering)")
print("="*80)
print(f"""
SELECT 
  cluster_name,
  workspace_name,
  CASE 
    WHEN instance_type_changed THEN 'Instance Type Changed'
    WHEN policy_changed THEN 'Policy Changed'
    WHEN spark_config_changed THEN 'Spark Config Changed'
    WHEN tags_changed THEN 'Tags Changed'
    ELSE 'Other Changes'
  END as change_type,
  before_driver_instance,
  after_driver_instance,
  before_worker_instance,
  after_worker_instance,
  change_details,
  backup_timestamp
FROM {full_schema}.cluster_config_backup
WHERE update_status = 'SUCCESS'
ORDER BY backup_timestamp DESC
""")

print("\n" + "="*80)
print("QUERY 3: Change Impact Analysis")
print("="*80)
print(f"""
SELECT 
  change_impact,
  COUNT(*) as cluster_count,
  COUNT(DISTINCT workspace_name) as workspace_count,
  AVG(total_changes_count) as avg_changes_per_cluster
FROM {full_schema}.cluster_config_backup
WHERE update_status = 'SUCCESS'
  AND is_reverted = false
GROUP BY change_impact
ORDER BY 
  CASE change_impact 
    WHEN 'MAJOR' THEN 1 
    WHEN 'MODERATE' THEN 2 
    WHEN 'MINOR' THEN 3 
    ELSE 4 
  END
""")

print("\n" + "="*80)
print("QUERY 4: Before/After Comparison for Specific Cluster")
print("="*80)
print(f"""
SELECT 
  cluster_name,
  workspace_name,
  -- Instance Types
  before_driver_instance,
  after_driver_instance,
  before_worker_instance,
  after_worker_instance,
  -- Policy
  before_policy_id,
  after_policy_id,
  -- Configurations
  before_spark_config_count,
  after_spark_config_count,
  before_custom_tags_count,
  after_custom_tags_count,
  before_init_scripts_count,
  after_init_scripts_count,
  -- Runtime
  before_runtime_engine,
  after_runtime_engine,
  before_autotermination_minutes,
  after_autotermination_minutes,
  -- Summary
  change_summary,
  backup_timestamp
FROM {full_schema}.cluster_config_backup
WHERE cluster_name = '<YOUR_CLUSTER_NAME>'
ORDER BY backup_timestamp DESC
LIMIT 1
""")

print("\n" + "="*80)
print("QUERY 5: Changes by Workspace (for workspace owners)")
print("="*80)
print(f"""
SELECT 
  workspace_name,
  COUNT(*) as total_changes,
  SUM(CASE WHEN instance_type_changed THEN 1 ELSE 0 END) as instance_changes,
  SUM(CASE WHEN policy_changed THEN 1 ELSE 0 END) as policy_changes,
  SUM(CASE WHEN spark_config_changed THEN 1 ELSE 0 END) as config_changes,
  SUM(CASE WHEN change_impact = 'MAJOR' THEN 1 ELSE 0 END) as major_changes,
  MAX(backup_timestamp) as last_change_date
FROM {full_schema}.cluster_config_backup
WHERE update_status = 'SUCCESS'
  AND is_reverted = false
GROUP BY workspace_name
ORDER BY total_changes DESC
""")

print("\n" + "="*80)
print("QUERY 6: Recent Changes Timeline")
print("="*80)
print(f"""
SELECT 
  DATE(backup_timestamp) as change_date,
  COUNT(*) as changes_count,
  COUNT(DISTINCT cluster_id) as clusters_affected,
  COUNT(DISTINCT workspace_name) as workspaces_affected,
  SUM(CASE WHEN change_impact = 'MAJOR' THEN 1 ELSE 0 END) as major_changes,
  SUM(CASE WHEN change_impact = 'MODERATE' THEN 1 ELSE 0 END) as moderate_changes,
  SUM(CASE WHEN change_impact = 'MINOR' THEN 1 ELSE 0 END) as minor_changes
FROM {full_schema}.cluster_config_backup
WHERE update_status = 'SUCCESS'
  AND backup_timestamp >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY DATE(backup_timestamp)
ORDER BY change_date DESC
""")

print("\n" + "="*80)
print("QUERY 7: Clusters Ready for Revert (if needed)")
print("="*80)
print(f"""
SELECT 
  backup_id,
  cluster_name,
  workspace_name,
  change_summary,
  change_impact,
  backup_timestamp,
  DATEDIFF(CURRENT_DATE, DATE(backup_timestamp)) as days_since_change
FROM {full_schema}.cluster_config_backup
WHERE update_status = 'SUCCESS'
  AND is_reverted = false
  AND change_impact IN ('MAJOR', 'MODERATE')
ORDER BY backup_timestamp DESC
""")

displayHTML("""
<div style="padding: 15px; background-color: #e8f5e9; border-left: 5px solid #4caf50; margin: 10px 0;">
    <h3 style="margin-top: 0; color: #2e7d32;">✓ Dashboard Query Examples Ready</h3>
    <p style="margin: 5px 0; color: #1b5e20;">Copy these queries to your dashboard or BI tool</p>
    <p style="margin: 5px 0; color: #1b5e20;"><strong>Key Features:</strong></p>
    <ul style="color: #1b5e20; margin: 5px 0;">
        <li>User-friendly change summaries</li>
        <li>Filterable by change type, impact, workspace</li>
        <li>Before/after comparisons</li>
        <li>Timeline analysis</li>
        <li>Revert candidates identification</li>
    </ul>
</div>
""")

# 📋 Cluster Config Backup Table - Complete Guide

## ✅ ONE Table Serves TWO Purposes

The `cluster_config_backup` table serves two purposes:
1. **Reverting clusters** to previous configurations
2. **Building dashboards** to show users what changed

---

## 🔄 Purpose 1: Revert Capability

### Complete JSON Configs
* `before_config` - Full cluster configuration before update (JSON)
* `after_config` - Full cluster configuration after update (JSON)
* Contains ALL settings: policy, Spark configs, tags, init scripts, security, etc.

### Revert Tracking
* `is_reverted` - Has this cluster been reverted?
* `revert_timestamp` - When was it reverted?
* `revert_batch_id` - Which batch reverted it?
* `reverted_by_user` - Who performed the revert?
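
After a successful revert, these four columns need to be filled in for the backup row that was restored. The sketch below builds an `UPDATE` statement for that, under stated assumptions: `build_mark_reverted_sql` is a hypothetical helper, and values containing quotes or backslashes are rejected outright rather than escaped, to avoid depending on dialect-specific escaping rules.

```python
def _literal(value: str) -> str:
    """Wrap a value in single quotes, rejecting characters that would need escaping."""
    if "'" in value or "\\" in value:
        raise ValueError(f"Unsupported character in SQL literal: {value!r}")
    return f"'{value}'"


def build_mark_reverted_sql(table: str, backup_id: str, revert_batch_id: str, user: str) -> str:
    """Build an UPDATE that flags one backup row as reverted."""
    return (
        f"UPDATE {table} "
        "SET is_reverted = true, "
        "revert_timestamp = current_timestamp(), "
        f"revert_batch_id = {_literal(revert_batch_id)}, "
        f"reverted_by_user = {_literal(user)} "
        # The is_reverted guard keeps an already-reverted row from being overwritten
        f"WHERE backup_id = {_literal(backup_id)} AND is_reverted = false"
    )
```

The resulting string could be passed to `spark.sql` once the cluster edit itself has succeeded.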

### Usage in Revert Notebook
```python
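import json

# Fetch the most recent successful, un-reverted backup for the cluster
rows = spark.sql(f"""
  SELECT backup_id, before_config, cluster_id
  FROM {catalog}.{schema}.cluster_config_backup
  WHERE cluster_name = 'my-cluster'
    AND update_status = 'SUCCESS'
    AND is_reverted = false
  ORDER BY backup_timestamp DESC
  LIMIT 1
""").collect()

# Fail with a clear message instead of an IndexError when nothing matches
if not rows:
    raise ValueError("No un-reverted backup found for 'my-cluster'")
backup = rows[0]

# Parse the stored JSON snapshot to get the configuration to restore
config = json.loads(backup.before_config)
# Use Databricks SDK to apply config...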
```
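
The stored `before_config` is a full snapshot, so it may contain runtime fields that describe cluster state rather than configuration. The sketch below shows one way to reduce the snapshot to an edit payload before handing it to the SDK or REST API. `build_revert_payload` is a hypothetical helper, and `READ_ONLY_FIELDS` is an assumed, non-exhaustive list. Check it against the Clusters API documentation for your workspace before relying on it.

```python
import json

# Assumed, non-exhaustive set of state fields that describe a running cluster
# rather than its editable configuration.
READ_ONLY_FIELDS = {
    "state", "state_message", "start_time", "terminated_time",
    "last_state_loss_time", "last_activity_time", "last_restarted_time",
    "cluster_memory_mb", "cluster_cores", "default_tags", "creator_user_name",
    "driver", "executors", "spark_context_id", "jdbc_port", "cluster_source",
    "termination_reason",
}


def build_revert_payload(before_config_json: str, cluster_id: str) -> dict:
    """Turn a stored before_config JSON string into a candidate edit payload."""
    config = json.loads(before_config_json)
    payload = {key: value for key, value in config.items() if key not in READ_ONLY_FIELDS}
    # Always target the cluster ID from the backup row, not the one inside the snapshot
    payload["cluster_id"] = cluster_id
    return payload
```

With the query above, this would be called as `build_revert_payload(backup.before_config, backup.cluster_id)`.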

---

## 📊 Purpose 2: Dashboard Building

### User-Friendly Fields

**Change Summary:**
* `change_summary` - Human-readable description (e.g., "Instance Types: Driver m5.xlarge → m5.large, Worker m5.xlarge → m5.large")
* `change_details` - Detailed breakdown of all changes
* `change_categories` - Comma-separated list (e.g., "instance_type,policy,spark_config")
* `total_changes_count` - Number of settings changed
* `change_impact` - MINOR, MODERATE, MAJOR, or NONE
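
To illustrate the `change_summary` format shown in the example above, here is a minimal sketch that produces the instance-type portion of the summary. `format_instance_summary` is a hypothetical helper. The notebook's own summary logic is not shown in this section and may cover more settings.

```python
from typing import Optional


def format_instance_summary(
    before_driver: str, after_driver: str, before_worker: str, after_worker: str
) -> Optional[str]:
    """Describe driver/worker instance type changes, or return None if nothing changed."""
    parts = []
    if before_driver != after_driver:
        parts.append(f"Driver {before_driver} → {after_driver}")
    if before_worker != after_worker:
        parts.append(f"Worker {before_worker} → {after_worker}")
    if not parts:
        return None
    return "Instance Types: " + ", ".join(parts)
```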

**Boolean Flags (for filtering):**
* `instance_type_changed`
* `policy_changed`
* `spark_config_changed`
* `tags_changed`
* `init_scripts_changed`
* `autotermination_changed`
* `runtime_engine_changed`
* `security_mode_changed`
* `autoscale_changed`
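
As a rough sketch of how these flags could be derived, the code below compares two parsed configuration dictionaries. The mapping in `FLAG_KEYS` is an assumption based on Clusters API field names, both function names are hypothetical, and counting one change per flag is a simplification. The notebook may count individual settings instead.

```python
from typing import Dict, Tuple

# Assumed mapping from each flag column to the config keys it watches.
FLAG_KEYS: Dict[str, Tuple[str, ...]] = {
    "instance_type_changed": ("node_type_id", "driver_node_type_id"),
    "policy_changed": ("policy_id",),
    "spark_config_changed": ("spark_conf",),
    "tags_changed": ("custom_tags",),
    "init_scripts_changed": ("init_scripts",),
    "autotermination_changed": ("autotermination_minutes",),
    "runtime_engine_changed": ("runtime_engine",),
    "security_mode_changed": ("data_security_mode",),
    "autoscale_changed": ("autoscale",),
}


def compute_change_flags(before: dict, after: dict) -> Dict[str, bool]:
    """Set each flag to True when any of its watched keys differs."""
    return {
        flag: any(before.get(key) != after.get(key) for key in keys)
        for flag, keys in FLAG_KEYS.items()
    }


def summarize_flags(flags: Dict[str, bool]) -> Tuple[str, int]:
    """Return (change_categories, total_changes_count) derived from the flags."""
    changed = [flag[: -len("_changed")] for flag, is_changed in flags.items() if is_changed]
    return ",".join(changed), len(changed)
```

With this naming, a run that changes instance types, the policy, and Spark configs yields the category string `instance_type,policy,spark_config`.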

**Extracted Counts (before/after):**
* `before_spark_config_count` / `after_spark_config_count`
* `before_custom_tags_count` / `after_custom_tags_count`
* `before_init_scripts_count` / `after_init_scripts_count`
* `before_autotermination_minutes` / `after_autotermination_minutes`
* `before_runtime_engine` / `after_runtime_engine`
* `before_data_security_mode` / `after_data_security_mode`
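
These extracted fields can all be read from the stored JSON snapshot. The following sketch assumes the snapshot uses the Clusters API key names (`spark_conf`, `custom_tags`, `init_scripts`, and so on). `extract_dashboard_fields` is a hypothetical helper, not the notebook's implementation.

```python
import json


def extract_dashboard_fields(config_json: str, prefix: str) -> dict:
    """Extract count and setting fields from a config snapshot; prefix is 'before' or 'after'."""
    config = json.loads(config_json)
    return {
        # "or {}" / "or []" treats an explicit null the same as a missing key
        f"{prefix}_spark_config_count": len(config.get("spark_conf") or {}),
        f"{prefix}_custom_tags_count": len(config.get("custom_tags") or {}),
        f"{prefix}_init_scripts_count": len(config.get("init_scripts") or []),
        f"{prefix}_autotermination_minutes": config.get("autotermination_minutes"),
        f"{prefix}_runtime_engine": config.get("runtime_engine"),
        f"{prefix}_data_security_mode": config.get("data_security_mode"),
    }
```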

**Key Instance Fields:**
* `before_driver_instance` / `after_driver_instance`
* `before_worker_instance` / `after_worker_instance`
* `before_policy_id` / `after_policy_id`
* `before_autoscale_min/max` / `after_autoscale_min/max`

### Dashboard Use Cases

1. **Change Timeline** - Show when clusters were updated
2. **Impact Analysis** - Filter by MAJOR/MODERATE/MINOR changes
3. **Workspace View** - Show changes per workspace for owners
4. **Change Type Filter** - Filter by instance type, policy, config changes
5. **Before/After Comparison** - Side-by-side view of settings
6. **Audit Trail** - Who changed what and when
7. **Revert Candidates** - Identify clusters that might need reverting

---

## 🎯 Key Benefits

### Single Source of Truth
✅ No data duplication  
✅ Automatic sync (one write operation)  
✅ Easier maintenance  
✅ Lower storage costs  

### Complete Information
✅ Full JSON for technical revert operations  
✅ User-friendly fields for business dashboards  
✅ Change tracking and audit trail  
✅ Protection against duplicate reverts (`is_reverted` flag)  

### Flexible Querying
✅ Filter by change type, impact, workspace  
✅ Aggregate by date, user, workspace  
✅ Join with other tables for enrichment  
✅ Time-series analysis of changes  

---

## 📝 Next Steps

1. **Run cluster updates** (dry-run or live) to populate the table
2. **Build dashboards** using the sample queries provided
3. **Create revert notebook** for rollback capability
4. **Set up alerts** for MAJOR impact changes
5. **Share with workspace owners** for transparency

---

## 🔗 Related Tables

* `cluster_update_log` - Execution logs and status
* `cluster_opportunities` - Optimization recommendations
* `cluster_config_backup` - **This table** (backup + dashboard)

---

**Table Location:** `{catalog}.{schema}.cluster_config_backup`  
**Updated:** Automatically on every cluster update  
**Retention:** Permanent (until manually deleted)

In [0]:
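# Verify row counts (the counts can differ when a workspace filter was applied,
# because only clusters matching the filter are logged)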
opportunities_count = spark.table(f"{full_schema}.cluster_opportunities").count()
log_count = log_df.count()

row_match = opportunities_count == log_count
match_color = "#4caf50" if row_match else "#f44336"
match_icon = "✓" if row_match else "✗"

mode_display = "DRY RUN" if DRY_RUN else "LIVE UPDATE"
mode_color = "#4caf50" if DRY_RUN else "#f44336"

displayHTML(f"""
<div style="border: 2px solid #2196f3; padding: 20px; margin: 20px 0; background-color: #e3f2fd; border-radius: 8px;">
    <h2 style="text-align: center; color: #1565c0; margin-top: 0;">
        📊 EXECUTION SUMMARY
    </h2>
    <hr style="border: 1px solid #2196f3; margin: 15px 0;">
    
    <div style="font-size: 16px; line-height: 2; padding: 10px;">
        <div style="margin: 10px 0;">
            <strong>Mode:</strong> 
            <span style="color: {mode_color}; font-weight: bold; font-size: 18px;">{mode_display}</span>
        </div>
        <div style="margin: 10px 0;">
            <strong>Opportunities table rows:</strong> 
            <span style="font-size: 18px; color: #1976d2;">{opportunities_count}</span>
        </div>
        <div style="margin: 10px 0;">
            <strong>Log table rows (this run):</strong> 
            <span style="font-size: 18px; color: #1976d2;">{log_count}</span>
        </div>
        <div style="margin: 10px 0;">
            <strong>Row count match:</strong> 
            <span style="color: {match_color}; font-weight: bold; font-size: 18px;">{match_icon} {'YES' if row_match else 'NO'}</span>
        </div>
    </div>
</div>
""")

# Summary by status
displayHTML("""
<div style="padding: 10px; background-color: #fff3e0; border-left: 4px solid #ff9800; margin: 15px 0;">
    <h3 style="margin-top: 0; color: #e65100;">Status Breakdown:</h3>
</div>
""")
display(log_df.groupBy("validation_status", "update_status").count().orderBy("count", ascending=False))

displayHTML("""
<div style="padding: 10px; background-color: #e8f5e9; border-left: 4px solid #4caf50; margin: 15px 0;">
    <h3 style="margin-top: 0; color: #2e7d32;">Potential Savings Summary:</h3>
</div>
""")
display(log_df.groupBy("update_status").agg(
    F.count("*").alias("cluster_count"),
    F.sum("validated_savings").alias("total_savings_usd")
).orderBy("total_savings_usd", ascending=False))

In [0]:
# Display detailed results
displayHTML("""
<div style="padding: 10px; background-color: #e3f2fd; border-left: 4px solid #2196f3; margin: 15px 0;">
    <h3 style="margin-top: 0; color: #1565c0;">Detailed Log Entries:</h3>
</div>
""")

# Build the column list once; implementation_notes is dropped when absent (old schema)
detail_columns = [
    "cluster_name",
    "workspace_name",
    "validation_status",
    "update_status",
    "current_driver_instance",
    "suggested_driver_instance",
    "current_worker_instance",
    "suggested_worker_instance",
    "implementation_notes",
    "validated_savings",
    "update_message"
]
if 'implementation_notes' not in log_df.columns:
    detail_columns.remove("implementation_notes")

display(log_df.select(*detail_columns).orderBy(F.col("validated_savings").desc()))

In [0]:
# Query batch execution history
displayHTML("""
<div style="padding: 15px; background-color: #e3f2fd; border-left: 5px solid #2196f3; margin: 15px 0;">
    <h3 style="margin-top: 0; color: #1565c0;">📋 Batch Execution History</h3>
    <p style="margin: 5px 0; color: #0d47a1;">Use execution_label for easy filtering in dashboards</p>
</div>
""")

# Get batch summary with execution_label
batch_summary = spark.sql(f"""
    SELECT 
        execution_label,
        batch_id,
        execution_mode,
        workspace_filter_applied,
        executed_by_user,
        batch_start_time,
        batch_end_time,
        ROUND((UNIX_TIMESTAMP(batch_end_time) - UNIX_TIMESTAMP(batch_start_time)), 2) as duration_seconds,
        total_clusters_in_batch,
        COUNT(*) as clusters_processed,
        SUM(CASE WHEN update_status = 'SUCCESS' THEN 1 ELSE 0 END) as successful_updates,
        SUM(CASE WHEN update_status = 'DRY_RUN' THEN 1 ELSE 0 END) as dry_run_previews,
        SUM(CASE WHEN update_status = 'SKIPPED' THEN 1 ELSE 0 END) as skipped_clusters,
        SUM(CASE WHEN update_status = 'FAILED' THEN 1 ELSE 0 END) as failed_updates,
        SUM(validated_savings) as total_potential_savings
    FROM {full_schema}.cluster_update_log
    GROUP BY 
        execution_label,
        batch_id,
        execution_mode,
        workspace_filter_applied,
        executed_by_user,
        batch_start_time,
        batch_end_time,
        total_clusters_in_batch
    ORDER BY batch_start_time DESC
    LIMIT 20
""")

display(batch_summary)

print("\n" + "="*80)
print("HOW TO USE EXECUTION LABEL FILTERING:")
print("="*80)
print("\n1. Copy an execution_label from the table above (e.g., '2024-11-20_15-30_DRY-RUN_prod')")
print("\n2. Query specific execution details:")
print(f"   SELECT * FROM {full_schema}.cluster_update_log")
print("   WHERE execution_label = '<your-execution-label>'")
print("\n3. Filter by pattern (useful in dashboards):")
print("   WHERE execution_label LIKE '2024-11-20%'  -- All runs on Nov 20")
print("   WHERE execution_label LIKE '%DRY-RUN%'    -- All dry runs")
print("   WHERE execution_label LIKE '%prod%'       -- All prod workspace runs")
print("\n4. Use in dashboard filters:")
print("   - Add execution_label as a dropdown filter")
print("   - Users can easily select specific runs")
print("="*80)
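
The `LIKE` patterns printed above rely on the label carrying a date, a time, a mode, and a workspace. As a minimal sketch, assuming labels always follow the `<date>_<time>_<mode>_<workspace>` shape of the example `2024-11-20_15-30_DRY-RUN_prod`, a label can be split into named parts for use in dashboards. The `ExecutionLabel` type and `parse_execution_label` helper are hypothetical names and are not part of this notebook.

```python
from typing import NamedTuple


class ExecutionLabel(NamedTuple):
    """Parts of an execution label (assumed layout, see lead-in)."""
    run_date: str   # e.g. "2024-11-20"
    run_time: str   # e.g. "15-30"
    mode: str       # e.g. "DRY-RUN"
    workspace: str  # e.g. "prod"


def parse_execution_label(label: str) -> ExecutionLabel:
    """Split a label of the assumed form <date>_<time>_<mode>_<workspace>."""
    # maxsplit=3 keeps any underscores inside the workspace part intact.
    parts = label.split("_", 3)
    if len(parts) != 4:
        raise ValueError(f"Unexpected execution_label format: {label!r}")
    return ExecutionLabel(*parts)


parsed = parse_execution_label("2024-11-20_15-30_DRY-RUN_prod")
print(parsed.mode, parsed.workspace)
```

If the real labels use a different separator or put underscores in the mode, the split would need to change accordingly.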

In [0]:
# Get list of execution labels from the log table
execution_labels_df = spark.sql(f"""
    SELECT DISTINCT execution_label
    FROM {full_schema}.cluster_update_log
    WHERE execution_label IS NOT NULL
    ORDER BY execution_label DESC
""")

execution_labels = (
    execution_labels_df
    .toPandas()['execution_label']
    .tolist()
)

if execution_labels:
    # Create widget for execution label selection
    dbutils.widgets.dropdown(
        "selected_execution_label", 
        execution_labels[0],
        execution_labels,
        "Select Execution Run"
    )
    
    displayHTML("""
    <div style="padding: 15px; background-color: #fff3e0; border-left: 5px solid #ff9800; margin: 10px 0;">
        <h3 style="margin-top: 0; color: #e65100;">🔍 Interactive Filter</h3>
        <p style="margin: 5px 0; color: #bf360c;">Select an execution run from the dropdown above to view its details</p>
    </div>
    """)
else:
    displayHTML("""
    <div style="padding: 15px; background-color: #ffebee; border-left: 5px solid #f44336; margin: 10px 0;">
        <h3 style="margin: 0; color: #c62828;">⚠ No execution logs found</h3>
        <p style="margin: 5px 0; color: #b71c1c;">Run the cluster update process first to generate logs</p>
    </div>
    """)

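When a selected label is placed in a `WHERE` clause, interpolating it into the SQL text means a label containing a quote character would break the query. A hedged alternative, assuming a runtime where `spark.sql` accepts an `args` mapping for named parameter markers (PySpark 3.4 or later), is to bind the label instead. The helper name `build_execution_log_query` and the schema `main.cost` are hypothetical.

```python
def build_execution_log_query(full_schema: str, selected_label: str):
    """Return (query, args) for a label lookup using a named parameter marker."""
    # Table identifiers cannot be bound with a plain marker, so full_schema is
    # still interpolated; only the user-selected value goes through args.
    query = f"""
        SELECT *
        FROM {full_schema}.cluster_update_log
        WHERE execution_label = :label
        ORDER BY execution_timestamp
    """
    return query, {"label": selected_label}


# "main.cost" is a placeholder schema used only for illustration.
query, args = build_execution_log_query("main.cost", "2024-11-20_15-30_DRY-RUN_prod")
# In the notebook this would be executed as: spark.sql(query, args=args)
print(args)
```
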
In [0]:
# Get catalog and schema from parameters
catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")
full_schema = f"{catalog}.{schema}"

# Get selected execution label
try:
    selected_label = dbutils.widgets.get("selected_execution_label")
    
    displayHTML(f"""
    <div style="padding: 15px; background-color: #e3f2fd; border-left: 5px solid #2196f3; margin: 15px 0;">
        <h3 style="margin-top: 0; color: #1565c0;">📊 Viewing Execution: <code style="background-color: #bbdefb; padding: 2px 6px; border-radius: 3px;">{selected_label}</code></h3>
    </div>
    """)
    
    # Query logs for selected execution
    selected_logs = spark.sql(f"""
        SELECT 
            execution_label,
            execution_mode,
            cluster_name,
            workspace_name,
            validation_status,
            update_status,
            current_driver_instance,
            suggested_driver_instance,
            current_worker_instance,
            suggested_worker_instance,
            validated_savings,
            update_message,
            execution_timestamp
        FROM {full_schema}.cluster_update_log
        WHERE execution_label = '{selected_label}'
        ORDER BY execution_timestamp
    """)
    
    # Display summary
    summary = spark.sql(f"""
        SELECT 
            COUNT(*) as total_clusters,
            SUM(CASE WHEN update_status = 'SUCCESS' THEN 1 ELSE 0 END) as successful,
            SUM(CASE WHEN update_status = 'DRY_RUN' THEN 1 ELSE 0 END) as dry_run,
            SUM(CASE WHEN update_status = 'SKIPPED' THEN 1 ELSE 0 END) as skipped,
            SUM(CASE WHEN update_status = 'FAILED' THEN 1 ELSE 0 END) as failed,
            SUM(validated_savings) as total_savings
        FROM {full_schema}.cluster_update_log
        WHERE execution_label = '{selected_label}'
    """).collect()[0]
    
    displayHTML(f"""
    <div style="border: 2px solid #2196f3; padding: 15px; margin: 15px 0; background-color: #e3f2fd; border-radius: 8px;">
        <h4 style="margin-top: 0; color: #1565c0;">Summary Statistics</h4>
        <div style="display: grid; grid-template-columns: repeat(3, 1fr); gap: 10px;">
            <div style="padding: 10px; background-color: white; border-radius: 5px;">
                <strong>Total Clusters:</strong> {summary.total_clusters}
            </div>
            <div style="padding: 10px; background-color: #e8f5e9; border-radius: 5px;">
                <strong>Successful:</strong> {summary.successful}
            </div>
            <div style="padding: 10px; background-color: #e3f2fd; border-radius: 5px;">
                <strong>Dry Run:</strong> {summary.dry_run}
            </div>
            <div style="padding: 10px; background-color: #fff3e0; border-radius: 5px;">
                <strong>Skipped:</strong> {summary.skipped}
            </div>
            <div style="padding: 10px; background-color: #ffebee; border-radius: 5px;">
                <strong>Failed:</strong> {summary.failed}
            </div>
            <div style="padding: 10px; background-color: #e8f5e9; border-radius: 5px;">
                <strong>Total Savings:</strong> ${(summary.total_savings or 0):,.2f}
            </div>
        </div>
    </div>
    """)
    
    # Display detailed logs
    displayHTML("""
    <div style="padding: 10px; background-color: #e8f5e9; border-left: 4px solid #4caf50; margin: 15px 0;">
        <h4 style="margin-top: 0; color: #2e7d32;">Detailed Cluster Updates:</h4>
    </div>
    """)
    display(selected_logs)
    
except Exception as e:
    displayHTML(f"""
    <div style="padding: 15px; background-color: #ffebee; border-left: 5px solid #f44336; margin: 15px 0;">
        <h3 style="margin: 0; color: #c62828;">⚠ Error</h3>
        <p style="margin: 5px 0; color: #b71c1c;">{str(e)}</p>
        <p style="margin: 5px 0; color: #b71c1c;">Make sure to select an execution label from the dropdown above</p>
    </div>
    """)