# Runbook: Count MLflow Experiments from Last 3 Months

This runbook queries the cloud MLflow tracking server and counts the runs started in the last 3 months (approximated as 90 days), broken down by experiment, status, and week.

## Prerequisites
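- `gcloud` CLI installed and authenticated (Step 0 uses it to fetch cluster credentials)
- `kubectl` installed and able to reach the GKE cluster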
- Access to the cluster where MLflow is running
- Python environment with `mlflow` and `pandas` packages installed

## How It Works
1. Automatically authenticates with GKE cluster
2. Discovers MLflow service using kubectl
3. Sets up port-forwarding to MLflow service
4. Queries and analyzes experiment data

## Step 0: Configure GKE Connection and Discover MLflow Service

Authenticate with GKE and discover the MLflow service configuration.

In [None]:
import subprocess
import json
import time
import socket
import os

# GKE Configuration
CLOUD_PROJECT = "mtrx-hub-dev-3of"
CLOUD_REGION = "us-central1"
CLOUD_CLUSTER = "compute-cluster"

# Port configuration
LOCAL_PORT = 5001

print("=" * 60)
print("STEP 0: GKE Authentication and Service Discovery")
print("=" * 60)

# Authenticate with GKE
print("\n1. Authenticating with GKE cluster...")
auth_cmd = [
    "gcloud", "container", "clusters", "get-credentials",
    CLOUD_CLUSTER,
    "--region", CLOUD_REGION,
    "--project", CLOUD_PROJECT
]

try:
    result = subprocess.run(auth_cmd, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        print(f"✗ Failed to authenticate: {result.stderr}")
        raise Exception("GKE authentication failed")
    print(f"✓ Successfully authenticated with GKE cluster: {CLOUD_CLUSTER}")
except subprocess.TimeoutExpired:
    print("✗ Authentication timed out")
    raise

# Discover MLflow service using kubectl
print("\n2. Discovering MLflow service...")
search_namespaces = ["mlflow"]
mlflow_service_found = False
MLFLOW_SERVICE_NAME = None
MLFLOW_NAMESPACE = None
MLFLOW_SERVICE_PORT = None

for namespace in search_namespaces:
    try:
        # Get services in namespace
        get_svc_cmd = ["kubectl", "get", "svc", "-n", namespace, "-o", "json"]
        result = subprocess.run(get_svc_cmd, capture_output=True, text=True, timeout=10)
        
        if result.returncode == 0:
            services = json.loads(result.stdout)
            
            # Look for MLflow service
            for svc in services.get("items", []):
                svc_name = svc["metadata"]["name"]
                if "mlflow-tracking" in svc_name.lower():
                    MLFLOW_SERVICE_NAME = svc_name
                    MLFLOW_NAMESPACE = namespace
                    
                    # Get the service port
                    ports = svc["spec"].get("ports", [])
                    if ports:
                        MLFLOW_SERVICE_PORT = ports[0]["port"]
                    
                    mlflow_service_found = True
                    print(f"✓ Found MLflow service: {MLFLOW_SERVICE_NAME}")
                    print(f"  Namespace: {MLFLOW_NAMESPACE}")
                    print(f"  Port: {MLFLOW_SERVICE_PORT}")
                    break
            
            if mlflow_service_found:
                break
    except Exception as e:
        print(f"  Warning: Could not check namespace {namespace}: {e}")
        continue

if not mlflow_service_found:
    print("✗ Could not find MLflow service in the searched namespaces")
    print(f"  Searched: {', '.join(search_namespaces)}")
    print("\n  Please set manually:")
    print("    MLFLOW_SERVICE_NAME = 'your-mlflow-service-name'")
    print("    MLFLOW_NAMESPACE = 'your-namespace'")
    print("    MLFLOW_SERVICE_PORT = 5000")
    raise Exception("MLflow service not found")

print(f"\n✓ Configuration complete!")
print(f"  Service: {MLFLOW_SERVICE_NAME}")
print(f"  Namespace: {MLFLOW_NAMESPACE}")
print(f"  Service Port: {MLFLOW_SERVICE_PORT}")
print(f"  Local Port: {LOCAL_PORT}")
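
The service-matching logic in the cell above can be pulled out into a pure function, which makes it easy to check without a cluster. The sketch below is illustrative only: `find_service` and the sample payload are hypothetical names, and the payload merely mimics the shape of `kubectl get svc -o json` output.

```python
import json
from typing import Optional, Tuple


def find_service(services_json: str, name_fragment: str) -> Optional[Tuple[str, int]]:
    """Return (service name, first declared port) for the first match, else None."""
    services = json.loads(services_json)
    for svc in services.get("items", []):
        name = svc["metadata"]["name"]
        ports = svc.get("spec", {}).get("ports", [])
        # Skip matches without a declared port instead of returning a None port
        if name_fragment in name.lower() and ports:
            return name, ports[0]["port"]
    return None


# Hypothetical payload shaped like `kubectl get svc -o json` output
sample = json.dumps({
    "items": [
        {"metadata": {"name": "postgres"}, "spec": {"ports": [{"port": 5432}]}},
        {"metadata": {"name": "mlflow-tracking"}, "spec": {"ports": [{"port": 5000}]}},
    ]
})

print(find_service(sample, "mlflow-tracking"))
print(find_service(sample, "missing"))
```

Unlike the cell above, this variant ignores a matching service that declares no ports, so a later port-forward never receives `None` as the target port.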

## Step 1: Setup Port-Forward to MLflow

Establish port-forwarding to the MLflow service.

In [None]:
def check_port_open(port, host="127.0.0.1"):
    """Check if a port is open and listening."""
    # The context manager closes the socket even if connect_ex raises
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        try:
            return sock.connect_ex((host, port)) == 0
        except OSError:
            return False

print("=" * 60)
print("STEP 1: Port-Forward Setup")
print("=" * 60)

# Check whether something is already listening on the local port
if check_port_open(LOCAL_PORT):
    print(f"✓ Port {LOCAL_PORT} is already accepting connections (assuming an existing port-forward)")
    MLFLOW_ENDPOINT = f"http://127.0.0.1:{LOCAL_PORT}"
else:
    # Start port-forward in background
    print(f"\nStarting kubectl port-forward...")
    print(f"  Forwarding: 127.0.0.1:{LOCAL_PORT} -> {MLFLOW_SERVICE_NAME}:{MLFLOW_SERVICE_PORT}")
    
    port_forward_cmd = [
        "kubectl", "port-forward",
        f"service/{MLFLOW_SERVICE_NAME}",
        f"{LOCAL_PORT}:{MLFLOW_SERVICE_PORT}",
        "-n", MLFLOW_NAMESPACE
    ]
    
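    # Start in background. stdout is discarded so an unread pipe cannot fill up
    # and stall kubectl; stderr is kept for error reporting.
    process = subprocess.Popen(
        port_forward_cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True
    )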
    
    # Wait for port-forward to establish
    print("  Waiting for connection to establish...")
    for i in range(15):
        time.sleep(1)
        if check_port_open(LOCAL_PORT):
            print(f"✓ Port-forward established successfully!")
            print(f"  Process PID: {process.pid}")
            print(f"  To stop later: kill {process.pid}")
            break
        print(f"    ... waiting ({i+1}/15)")
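    else:
        print("✗ Port-forward did not establish within 15 seconds")
        # Stop the process first; reading stderr from a live process would block
        process.terminate()
        try:
            _, stderr_output = process.communicate(timeout=5)
        except subprocess.TimeoutExpired:
            process.kill()
            _, stderr_output = process.communicate()
        print(f"  Error: {stderr_output}")
        raise Exception("Failed to establish port-forward")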
    
    MLFLOW_ENDPOINT = f"http://127.0.0.1:{LOCAL_PORT}"

print(f"\n✓ MLflow endpoint ready: {MLFLOW_ENDPOINT}")
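
The wait loop in the cell above polls once per second for a fixed number of attempts. A minimal sketch of the same idea with a deadline and a shorter polling interval follows. `wait_for_port` and its parameters are hypothetical names, and the demo uses a throwaway local listener in place of `kubectl port-forward`.

```python
import socket
import time


def wait_for_port(port: int, host: str = "127.0.0.1",
                  timeout_s: float = 15.0, interval_s: float = 0.2) -> bool:
    """Poll until a TCP connection to host:port succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


# Demo: a throwaway listener on an OS-assigned port stands in for the port-forward
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
demo_port = server.getsockname()[1]

listener_ready = wait_for_port(demo_port, timeout_s=2.0)
server.close()
print(listener_ready)
```

Using `time.monotonic()` keeps the deadline unaffected by system clock adjustments during the wait.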

## Step 2: Connect to MLflow and Verify

Connect to the MLflow tracking server and verify the connection.

In [None]:
from datetime import datetime, timedelta
from typing import Dict, List

import mlflow
from mlflow.tracking import MlflowClient
import pandas as pd

print("=" * 60)
print("STEP 2: MLflow Connection")
print("=" * 60)

# Configure MLflow
mlflow.set_tracking_uri(MLFLOW_ENDPOINT)
print(f"\nMLflow Tracking URI: {MLFLOW_ENDPOINT}")

# Test the connection
print("\nTesting connection...")
try:
    client = MlflowClient()
    test_experiments = client.search_experiments(max_results=1)
    print(f"✓ Connected successfully to MLflow!")
    print(f"  Server is responding")
except Exception as e:
    print(f"✗ Connection failed: {e}")
    print("\nTroubleshooting:")
    print("1. Verify port-forward is still running")
    print("2. Check MLflow pod is running: kubectl get pods -n", MLFLOW_NAMESPACE)
    print("3. Check MLflow logs: kubectl logs -n", MLFLOW_NAMESPACE, "-l app=mlflow")
    raise

## Step 3: Define Time Range

Compute the start and end timestamps for the query window. Three months is approximated as 90 days.

In [None]:
# Calculate date 3 months ago
end_date = datetime.now()
start_date = end_date - timedelta(days=90)  # Approximately 3 months

# Convert to Unix timestamp in milliseconds (MLflow format)
start_timestamp_ms = int(start_date.timestamp() * 1000)
end_timestamp_ms = int(end_date.timestamp() * 1000)

print(f"Start Date: {start_date.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"End Date: {end_date.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"\nQuerying runs between {start_date.date()} and {end_date.date()}")
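
If a calendar-accurate window matters more than the 90-day approximation, whole months can be subtracted with the standard library alone. This is a sketch under stated assumptions: `months_ago` and `to_epoch_ms` are hypothetical helpers, and the example uses timezone-aware UTC datetimes rather than the naive local time used in the cell above.

```python
import calendar
from datetime import datetime, timezone


def months_ago(moment: datetime, months: int) -> datetime:
    """Subtract whole calendar months, clamping the day to the target month's length."""
    month_index = moment.year * 12 + (moment.month - 1) - months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def to_epoch_ms(moment: datetime) -> int:
    """Convert a datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


end = datetime(2024, 5, 31, 12, 0, tzinfo=timezone.utc)
start = months_ago(end, 3)

print(start.isoformat())
print((end - start).days)
```

In this example the window from the end of February to the end of May spans 92 days, so the 90-day approximation would start two days later.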

## Step 4: Query All Experiments

List all experiments in the MLflow tracking server.

In [None]:
client = MlflowClient()

# Get all experiments (active and deleted), up to max_results
all_experiments = client.search_experiments(
    view_type=mlflow.entities.ViewType.ALL,
    max_results=50000  # High cap intended to cover every experiment
)

print(f"Total experiments found: {len(all_experiments)}")
print("\nExperiment List:")
for exp in all_experiments[:10]:  # Show first 10
    print(f"  - {exp.name} (ID: {exp.experiment_id})")
if len(all_experiments) > 10:
    print(f"  ... and {len(all_experiments) - 10} more")

## Step 5: Query Runs from Last 3 Months

Search for all runs across all experiments that were created in the last 3 months.

In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed

# Build filter query for runs in the last 3 months
# MLflow uses milliseconds since epoch for timestamps
filter_string = f"attributes.start_time >= {start_timestamp_ms}"

def query_experiment_runs(experiment):
    """Query runs for a single experiment."""
    try:
        runs = client.search_runs(
            experiment_ids=[experiment.experiment_id],
            filter_string=filter_string,
            max_results=50000  # High cap; an experiment with more runs is truncated
        )
        return experiment, runs, None
    except Exception as e:
        return experiment, [], str(e)

print("Querying experiments in parallel...")
print(f"Total experiments to query: {len(all_experiments)}")

all_runs = []
errors = []

# Use ThreadPoolExecutor for parallel queries
# Cap the pool at 32 threads; keep at least 1 so an empty experiment list does not raise
max_workers = max(1, min(32, len(all_experiments)))

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    # Submit all queries
    future_to_exp = {
        executor.submit(query_experiment_runs, exp): exp 
        for exp in all_experiments
    }
    
    # Process results as they complete
    completed = 0
    for future in as_completed(future_to_exp):
        completed += 1
        experiment, runs, error = future.result()
        
        if error:
            errors.append(f"{experiment.name}: {error}")
        else:
            all_runs.extend(runs)
        
        # Progress indicator
        if completed % 10 == 0 or completed == len(all_experiments):
            print(f"  Progress: {completed}/{len(all_experiments)} experiments processed")

print(f"\n✓ Query complete!")
print(f"  Total runs found in last 3 months: {len(all_runs)}")

if errors:
    print(f"\n⚠ Warnings ({len(errors)} experiments had errors):")
    for error in errors[:5]:  # Show first 5 errors
        print(f"  - {error}")
    if len(errors) > 5:
        print(f"  ... and {len(errors) - 5} more errors")

## Step 6: Analyze Results

Break down the results by experiment, status, and time period.

In [None]:
# Organize runs by experiment
runs_by_experiment: Dict[str, List] = {}
runs_by_status: Dict[str, int] = {}

for run in all_runs:
    exp_id = run.info.experiment_id
    if exp_id not in runs_by_experiment:
        runs_by_experiment[exp_id] = []
    runs_by_experiment[exp_id].append(run)
    
    # Count by status
    status = run.info.status
    runs_by_status[status] = runs_by_status.get(status, 0) + 1

# Create summary DataFrame
experiment_summary = []
for exp_id, runs in runs_by_experiment.items():
    experiment = next((e for e in all_experiments if e.experiment_id == exp_id), None)
    exp_name = experiment.name if experiment else f"Unknown ({exp_id})"
    experiment_summary.append({
        "Experiment": exp_name,
        "Experiment ID": exp_id,
        "Run Count": len(runs)
    })

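# Name the columns explicitly so sorting still works when no runs were found
df_summary = pd.DataFrame(
    experiment_summary, columns=["Experiment", "Experiment ID", "Run Count"]
).sort_values("Run Count", ascending=False)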

print("\n" + "="*60)
print("SUMMARY: Runs by Experiment (Last 3 Months)")
print("="*60)
print(df_summary.to_string(index=False))

print("\n" + "="*60)
print("SUMMARY: Runs by Status")
print("="*60)
for status, count in sorted(runs_by_status.items(), key=lambda x: x[1], reverse=True):
    print(f"  {status}: {count}")
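
The per-experiment and per-status tallies above can also be expressed with `collections.Counter`. This is a small sketch on made-up data: `RunInfo` is a hypothetical stand-in for the `run.info` object, carrying only the two fields used here.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for run.info with only the fields needed for counting
RunInfo = namedtuple("RunInfo", ["experiment_id", "status"])

sample_runs = [
    RunInfo("1", "FINISHED"),
    RunInfo("1", "FAILED"),
    RunInfo("2", "FINISHED"),
]

runs_per_experiment = Counter(info.experiment_id for info in sample_runs)
runs_per_status = Counter(info.status for info in sample_runs)

# most_common() returns (key, count) pairs sorted by descending count
for status, count in runs_per_status.most_common():
    share = count / len(sample_runs) * 100
    print(f"{status}: {count} ({share:.1f}%)")
```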

## Step 7: Detailed Statistics

Get more detailed statistics including temporal distribution.

In [None]:
# Create a DataFrame with all run details
run_details = []
for run in all_runs:
    experiment = next(
        (e for e in all_experiments if e.experiment_id == run.info.experiment_id),
        None
    )
    exp_name = experiment.name if experiment else "Unknown"
    
    start_time = datetime.fromtimestamp(run.info.start_time / 1000)
    end_time = (
        datetime.fromtimestamp(run.info.end_time / 1000)
        if run.info.end_time
        else None
    )
    duration = (
        (run.info.end_time - run.info.start_time) / 1000
        if run.info.end_time
        else None
    )
    
    run_details.append({
        "Experiment": exp_name,
        "Run ID": run.info.run_id,
        "Status": run.info.status,
        "Start Time": start_time,
        "End Time": end_time,
        "Duration (seconds)": duration,
        "User": run.info.user_id,
    })

df_runs = pd.DataFrame(run_details)

# Temporal distribution by week
df_runs["Week"] = pd.to_datetime(df_runs["Start Time"]).dt.to_period("W")
weekly_counts = df_runs.groupby("Week").size().reset_index(name="Run Count")

print("\n" + "="*60)
print("Runs per Week")
print("="*60)
print(weekly_counts.to_string(index=False))

# Average duration by experiment
avg_duration = df_runs.groupby("Experiment")["Duration (seconds)"].agg(
    ["mean", "median", "count"]
).round(2)
avg_duration.columns = ["Avg Duration (s)", "Median Duration (s)", "Count"]

print("\n" + "="*60)
print("Average Run Duration by Experiment")
print("="*60)
print(avg_duration.sort_values("Count", ascending=False).to_string())

## Step 8: Export Results (Optional)

Export the detailed results to CSV for further analysis.

In [None]:
# Export to CSV
output_file = f"/tmp/mlflow_runs_last_3_months_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df_runs.to_csv(output_file, index=False)

print(f"\nResults exported to: {output_file}")
print(f"Total rows: {len(df_runs)}")

## Summary

This runbook provides:
1. **Total run count** for the last 3 months
2. **Breakdown by experiment** showing which experiments are most active
3. **Status distribution** (finished, failed, running, etc.)
4. **Temporal distribution** showing runs per week
5. **Duration statistics** by experiment
6. **Exportable CSV** for further analysis

## Key Metrics at a Glance

In [None]:
print("\n" + "="*60)
print("KEY METRICS - LAST 3 MONTHS")
print("="*60)
print(f"Total Experiments: {len(all_experiments)}")
print(f"Experiments with Runs: {len(runs_by_experiment)}")
print(f"Total Runs: {len(all_runs)}")
if not df_summary.empty:
    top = df_summary.iloc[0]
    print(f"\nMost Active Experiment: {top['Experiment']} ({top['Run Count']} runs)")
else:
    print("\nMost Active Experiment: n/a (no runs in range)")
print(f"\nStatus Breakdown:")
for status, count in sorted(runs_by_status.items(), key=lambda x: x[1], reverse=True):
    percentage = (count / len(all_runs)) * 100
    print(f"  {status}: {count} ({percentage:.1f}%)")

if not df_runs.empty:
    avg_weekly = weekly_counts["Run Count"].mean()
    print(f"\nAverage Runs per Week: {avg_weekly:.1f}")
print("="*60)