# Runbook: No Release PR Created After Successful KG Release

This runbook helps diagnose and fix the case where a KG release workflow succeeds but the automated release PR is not created.

## Problem
The `auto-kg-release` workflow completes successfully, but the expected release PR is not automatically created in the repository.

## Root Cause
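This is typically caused by a break in the Argo Events chain that links the release workflow to GitHub. The chain has three parts, and a failure in any one of them prevents the PR from being created: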
1. **EventSource**: Watches for succeeded workflows with `trigger_release: "True"` label
2. **Sensor**: Creates a `distribute-data-release` workflow that triggers a GitHub repository dispatch event
3. **GitHub Action**: Creates the release PR
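As a quick check of the first link in this chain, the sketch below tests whether a workflow object (for example, the JSON from `kubectl get workflow NAME -n argo-workflows -o json`) matches the condition described above: phase `Succeeded` and label `trigger_release: "True"`. The function name `is_release_trigger` is hypothetical, and the check mirrors only the description in this runbook, not the actual EventSource filter definition.

```python
def is_release_trigger(workflow: dict) -> bool:
    """Return True if the workflow looks like one the EventSource should pick up.

    Assumption: the EventSource matches on the `trigger_release` label and a
    Succeeded phase, as described in this runbook. The real filter may differ.
    """
    labels = (workflow.get("metadata") or {}).get("labels") or {}
    phase = (workflow.get("status") or {}).get("phase")
    return labels.get("trigger_release") == "True" and phase == "Succeeded"


# Example with a minimal, made-up workflow object
example = {
    "metadata": {"labels": {"trigger_release": "True"}},
    "status": {"phase": "Succeeded"},
}
print(is_release_trigger(example))
```

If this returns `False` for the original workflow, the missing label is the likely reason the event never fired, and the manual trigger in Step 6 applies.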

## Prerequisites
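- `gcloud` authenticated with access to the `mtrx-hub-dev-3of` project (Step 1 fetches the cluster credentials)
- `kubectl` configured for the cluster
- `jq` installed for JSON processing (used in Step 6)
- Optional: GitHub CLI (`gh`) for checking Action runs in Step 9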

## Step 1: Identify the Affected Workflow

Find the name of the workflow that succeeded but did not trigger the release process.

In [None]:
# List recent workflows, sorted by creation time
! gcloud container clusters get-credentials compute-cluster --region us-central1 --project mtrx-hub-dev-3of
! kubectl get workflows -n argo-workflows --sort-by=.metadata.creationTimestamp | tail -20

# Example workflow name: auto-kg-release-v0-11-4-256629c0

## Step 2: Check EventSource and Sensor Status

Verify that the Argo Events components are running properly.

In [None]:
# Check EventSource status
!kubectl get eventsource -n data-release

# Check Sensor status
!kubectl get sensor -n data-release

# Check if pods are running
!kubectl get pods -n data-release
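To make the pod check less dependent on reading the table by eye, the sketch below filters a pod list for pods that are not in the `Running` phase. It assumes the input is the parsed JSON of `kubectl get pods -n data-release -o json`; the function name `pods_not_running` and the sample pod names are hypothetical.

```python
from typing import Dict, List


def pods_not_running(pod_list: dict) -> List[Dict[str, str]]:
    """Return name and phase for every pod whose phase is not 'Running'."""
    problems = []
    for pod in pod_list.get("items", []):
        phase = (pod.get("status") or {}).get("phase", "Unknown")
        if phase != "Running":
            problems.append({"name": pod["metadata"]["name"], "phase": phase})
    return problems


# Made-up sample data shaped like `kubectl get pods -o json` output
sample_pods = {
    "items": [
        {"metadata": {"name": "eventsource-pod"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "sensor-pod"}, "status": {"phase": "Pending"}},
    ]
}
print(pods_not_running(sample_pods))
```

An empty result means every pod reports `Running`. It does not prove the containers are healthy, so the log checks in Steps 3 and 4 are still needed.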

## Step 3: Check EventSource Logs

Look for errors or issues in the EventSource that watches for workflow completions.

In [None]:
# View EventSource logs (looking for errors or missed events)
!kubectl logs -n data-release -l eventsource-name=build-data-release-eventsource --tail=100

## Step 4: Check Sensor Logs

Check if the sensor received the event and attempted to trigger the distribute workflow.

In [None]:
# View Sensor logs
!kubectl logs -n data-release -l sensor-name=build-data-release-sensor-post-release --tail=100

## Step 5: Check if Distribute Workflow Was Created

The sensor should create a `distribute-data-release-*` workflow when triggered.

In [None]:
# Look for distribute-data-release workflows
! kubectl get workflows -n data-release --sort-by=.metadata.creationTimestamp

**Note**: If a `distribute-data-release-*` workflow exists and shows status `Succeeded`, skip to **Step 9** and check the GitHub Action. Steps 6-8 are needed only if the workflow was not created or failed.

## Step 6: Manual Trigger (Solution)

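If the event did not fire automatically, trigger it manually by creating a copy of the workflow that carries the trigger labels (`trigger_release`, `release_version`, and `git_sha`).

**Important**: Replace the value of `WORKFLOW_NAME` in the cell below with the workflow name found in Step 1 (e.g., `auto-kg-release-v0-11-4-256629c0`).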

In [None]:
# Set the workflow name here
WORKFLOW_NAME = "auto-kg-release-v0-12-0-35761b01"  # REPLACE THIS

# Extract version and git SHA from the workflow name
# Format: auto-kg-release-vX-Y-Z-GITSHA
import re
match = re.search(r'v(\d+-\d+-\d+)-([a-f0-9]+)', WORKFLOW_NAME)
if match:
    version = match.group(1).replace('-', '.')
    git_sha = match.group(2)
    print(f"Detected version: v{version}")
    print(f"Detected git SHA: {git_sha}")
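else:
    # Stop here rather than continue with placeholder values, which would
    # otherwise end up as release labels on the re-created workflow
    raise ValueError(f"Could not parse version and git SHA from workflow name: {WORKFLOW_NAME}")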
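The same parsing can be wrapped in a small function so it can be checked against known names before it is used for labels. This is a minimal sketch that assumes the `auto-kg-release-vX-Y-Z-GITSHA` format noted above; the function name `parse_workflow_name` is hypothetical, and the pattern is anchored to the end of the name as an extra safeguard.

```python
import re
from typing import Optional, Tuple

# Assumed name format: auto-kg-release-vX-Y-Z-GITSHA
_NAME_PATTERN = re.compile(r"v(\d+)-(\d+)-(\d+)-([a-f0-9]+)$")


def parse_workflow_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (version, git_sha), or None if the name does not match the format."""
    match = _NAME_PATTERN.search(name)
    if match is None:
        return None
    major, minor, patch, sha = match.groups()
    return f"{major}.{minor}.{patch}", sha


print(parse_workflow_name("auto-kg-release-v0-11-4-256629c0"))
print(parse_workflow_name("not-a-release-workflow"))
```

Returning `None` leaves it to the caller to decide how to handle an unexpected name, instead of labeling a workflow with a made-up version.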

In [None]:
# Recreate the workflow with trigger labels
# This uses kubectl + jq to copy the workflow and add the required labels
!kubectl get workflow {WORKFLOW_NAME} -n argo-workflows -o json | \
  jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp, .metadata.managedFields, .status) | \
      .metadata.generateName = "test-trigger-release-" | \
      del(.metadata.name) | \
      .metadata.labels["trigger_release"] = "True" | \
      .metadata.labels["release_version"] = "v{version}" | \
      .metadata.labels["git_sha"] = "{git_sha}"' | \
  kubectl create -f -
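For readers less familiar with `jq`, the sketch below reproduces the same transformation in plain Python on a parsed workflow object: it drops the server-populated metadata and `status`, switches from `name` to `generateName`, and adds the three labels. The function name `build_trigger_manifest` and the sample object are hypothetical; the result would still need to be written to a file and passed to `kubectl create -f`.

```python
import copy

# Metadata fields removed by the jq filter above, plus `name`, which is
# replaced by `generateName`
_REMOVED_METADATA = ("uid", "resourceVersion", "creationTimestamp", "managedFields", "name")


def build_trigger_manifest(workflow: dict, version: str, git_sha: str) -> dict:
    """Return a copy of the workflow prepared for re-creation with trigger labels."""
    manifest = copy.deepcopy(workflow)  # leave the caller's object untouched
    manifest.pop("status", None)
    metadata = manifest.setdefault("metadata", {})
    for field in _REMOVED_METADATA:
        metadata.pop(field, None)
    metadata["generateName"] = "test-trigger-release-"
    labels = metadata.get("labels") or {}
    labels["trigger_release"] = "True"
    labels["release_version"] = f"v{version}"
    labels["git_sha"] = git_sha
    metadata["labels"] = labels
    return manifest


# Made-up, heavily trimmed workflow object for illustration
original = {
    "metadata": {"name": "auto-kg-release-v0-12-0-35761b01", "uid": "abc", "labels": {"team": "kg"}},
    "spec": {"entrypoint": "main"},
    "status": {"phase": "Succeeded"},
}
manifest = build_trigger_manifest(original, "0.12.0", "35761b01")
print(manifest["metadata"])
```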

## Step 7: Verify the Trigger Worked

Watch for the new workflow to be created and monitor the event flow.

In [None]:
# Watch for the test workflow to complete
!kubectl get workflows -n argo-workflows | grep test-trigger-release

# Check if distribute workflow was created
!kubectl get workflows -n data-release | grep distribute-data-release | tail -5

## Step 8: Check Distribute Workflow Logs

If the distribute workflow was created, check its logs to see if the GitHub dispatch succeeded.

In [None]:
# Get the most recent distribute workflow name
distribute_workflow = !kubectl get workflows -n data-release --sort-by=.metadata.creationTimestamp | grep distribute-data-release | tail -1 | awk '{{print $1}}'

if distribute_workflow and distribute_workflow[0]:
    workflow_name = distribute_workflow[0]
    print(f"Checking logs for: {workflow_name}")
    !kubectl logs -n data-release -l workflows.argoproj.io/workflow={workflow_name} --tail=50
else:
    print("No distribute-data-release workflow found")
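As an alternative to the `grep`/`awk` pipeline, the most recent distribute workflow can be selected from JSON output, which also exposes its phase. The sketch below assumes the input is the parsed result of `kubectl get workflows -n data-release -o json`; the function name `latest_distribute_workflow` and the sample data are hypothetical.

```python
from typing import Optional


def latest_distribute_workflow(workflow_list: dict) -> Optional[dict]:
    """Return name and phase of the newest distribute-data-release workflow, if any."""
    candidates = [
        item for item in workflow_list.get("items", [])
        if item["metadata"]["name"].startswith("distribute-data-release")
    ]
    if not candidates:
        return None
    # Kubernetes serializes creationTimestamp as an RFC 3339 UTC string,
    # so comparing the strings orders the workflows by time
    newest = max(candidates, key=lambda item: item["metadata"]["creationTimestamp"])
    return {
        "name": newest["metadata"]["name"],
        "phase": (newest.get("status") or {}).get("phase"),
    }


# Made-up sample data shaped like `kubectl get workflows -o json` output
sample_list = {
    "items": [
        {"metadata": {"name": "distribute-data-release-aaaaa", "creationTimestamp": "2025-11-03T10:00:00Z"},
         "status": {"phase": "Succeeded"}},
        {"metadata": {"name": "distribute-data-release-bbbbb", "creationTimestamp": "2025-11-04T09:30:00Z"},
         "status": {"phase": "Failed"}},
        {"metadata": {"name": "other-workflow-ccccc", "creationTimestamp": "2025-11-05T00:00:00Z"},
         "status": {"phase": "Succeeded"}},
    ]
}
print(latest_distribute_workflow(sample_list))
```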

## Step 9: Check GitHub Action (If No Errors Found)

If the distribute workflow succeeded but no PR was created, check the downstream GitHub Action that creates the release PR.

The GitHub dispatch event triggers this workflow:  
**https://github.com/everycure-org/matrix/actions/workflows/create-release-pr.yml**

In [None]:
# Open the GitHub Actions page in your browser
import webbrowser

github_actions_url = "https://github.com/everycure-org/matrix/actions/workflows/create-release-pr.yml"
print(f"Opening GitHub Actions: {github_actions_url}")
print("\nThings to check:")
print("1. Look for recent workflow runs triggered by 'distribute-release' event")
print("2. Check if any runs failed or are still in progress")
print("3. Review logs for any errors in the PR creation process")
print("4. Verify the workflow was triggered with correct release_version and git_fingerprint")

# Uncomment the line below to auto-open in browser
# webbrowser.open(github_actions_url)

### Alternative: Use GitHub CLI to Check Action Runs

If you have the GitHub CLI (`gh`) installed, you can check recent workflow runs directly:

In [None]:
# Check recent runs of the create-release-pr workflow
!gh run list --repo everycure-org/matrix --workflow=create-release-pr.yml --limit 10

# To view logs of a specific run (replace RUN_ID with the actual ID from above)
# !gh run view RUN_ID --repo everycure-org/matrix --log

## Common Issues and Solutions

### Issue 1: EventSource not detecting workflows
**Symptom**: No logs in EventSource showing workflow detection  
**Solution**: Check RBAC permissions on the `data-release-service-account`

### Issue 2: Sensor not triggering workflow
**Symptom**: EventSource sees the workflow, but Sensor doesn't create distribute workflow  
**Solution**: Check Sensor logs for permission errors or missing parameters

### Issue 3: Distribute workflow fails
**Symptom**: Workflow created but GitHub dispatch fails  
**Solution**: Check that the `gh-password` secret contains a valid GitHub token

### Issue 4: GitHub Action doesn't create PR
**Symptom**: Dispatch succeeds but no PR created  
**Solution**: Check GitHub Actions tab in the repository for workflow errors
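The four issues above follow the order of the event chain, so the first broken link identifies which one applies. The sketch below encodes that decision order; the function name `diagnose` and its boolean parameters are hypothetical and stand for what was observed in Steps 3 to 9.

```python
def diagnose(event_detected: bool, distribute_created: bool,
             dispatch_succeeded: bool, pr_created: bool) -> str:
    """Map observations from the runbook steps to the matching issue above."""
    if pr_created:
        return "No issue: release PR exists"
    if not event_detected:
        return "Issue 1: EventSource not detecting workflows"
    if not distribute_created:
        return "Issue 2: Sensor not triggering workflow"
    if not dispatch_succeeded:
        return "Issue 3: Distribute workflow fails"
    return "Issue 4: GitHub Action doesn't create PR"


# Example: the EventSource logged the event, but no distribute workflow appeared
print(diagnose(event_detected=True, distribute_created=False,
               dispatch_succeeded=False, pr_created=False))
```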

## Next Steps

After running this runbook:
1. Monitor the GitHub repository for the release PR to appear
2. If no PR appears, check the GitHub Actions workflow runs (Step 9)
3. Consider investigating why the original workflow didn't have trigger labels

## Testing: Trigger Large Payload Test Workflow

To test NATS payload size limits and verify the EventBus configuration handles large workflow objects (e.g., after fixing `maxPayload` settings), you can create a test workflow with a ~1.3 MB payload.

This is useful for:
- Verifying NATS `maxPayload` configuration is applied correctly
- Testing EventSource `payloadFilter` is working
- Reproducing "nats: maximum payload exceeded" errors

In [None]:
# Generate a test workflow with ~1.3 MB payload (exceeds default 1MB NATS limit)
import json
import random

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {
        "generateName": f"test-large-payload-1-3mb-{random.randint(1000, 9999)}-",
        "namespace": "argo-workflows",
        "labels": {
            "trigger_release": "True",
            "workflows.argoproj.io/phase": "Succeeded",
            "release_version": "v0.99.99-test",
            "git_sha": "test123456"
        }
    },
    "spec": {
        "entrypoint": "main",
        "templates": [
            {
                "name": "main",
                "container": {
                    "image": "alpine:latest",
                    "command": ["echo", "Testing large payload"]
                }
            }
        ]
    },
    "status": {
        "phase": "Succeeded",
        "startedAt": "2025-11-04T00:00:00Z",
        "finishedAt": "2025-11-04T00:01:00Z",
        "storedTemplates": []
    }
}

# Add large dummy data to reach ~1.3 MB
dummy_data = "x" * 75000  # 75,000-byte chunks

for i in range(18):  # 18 * 75,000 bytes = 1,350,000 bytes (~1.29 MiB)
    template = {
        "name": f"large-template-{i}",
        "container": {
            "image": "dummy:latest",
            "command": ["sh", "-c"],
            "args": [dummy_data]
        }
    }
    workflow["status"]["storedTemplates"].append(template)

# Save to file
with open('/tmp/test-large-payload-workflow.json', 'w') as f:
    json.dump(workflow, f, indent=2)

import os
size = os.path.getsize('/tmp/test-large-payload-workflow.json')
print(f"Generated test workflow: {size:,} bytes ({size/1024/1024:.2f} MB)")
print(f"This is {size/1048576:.1f}x the default 1MB NATS limit")
print(f"\nFile saved to: /tmp/test-large-payload-workflow.json")
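The size arithmetic can also be checked without writing a file. The sketch below measures the serialized size of an object and compares it with the 1 MiB default NATS limit mentioned above. The helper names are hypothetical. The JSON size of the manifest is only an approximation of the event the EventSource actually publishes, which may be larger or smaller.

```python
import json

# Default NATS max payload (1 MiB), the limit this test is meant to exceed
NATS_DEFAULT_MAX_PAYLOAD = 1024 * 1024


def payload_size_bytes(obj: dict) -> int:
    """Return the size in bytes of the compact JSON encoding of obj."""
    return len(json.dumps(obj).encode("utf-8"))


def exceeds_limit(obj: dict, limit: int = NATS_DEFAULT_MAX_PAYLOAD) -> bool:
    """Return True if the JSON encoding of obj is larger than the limit."""
    return payload_size_bytes(obj) > limit


# Same shape as the dummy data above: 18 templates with 75,000-byte arguments
templates = [{"name": f"large-template-{i}", "args": ["x" * 75000]} for i in range(18)]
sample = {"status": {"storedTemplates": templates}}

size = payload_size_bytes(sample)
print(f"{size:,} bytes, {size / NATS_DEFAULT_MAX_PAYLOAD:.2f}x the 1 MiB limit")
print(exceeds_limit(sample))
```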

In [None]:
# Apply the test workflow to the cluster
!kubectl create -f /tmp/test-large-payload-workflow.json

# Note: This should succeed in creating the workflow
# The test is whether the EventSource can publish this large event to NATS

In [None]:
# Check EventSource logs for success or "maximum payload exceeded" error
!kubectl logs -n data-release -l eventsource-name=build-data-release-eventsource --tail=10

# Expected results:
# - If NATS limit is too small (default 1MB): "error":"failed after retries: nats: maximum payload exceeded"
# - If payloadFilter is working: "Succeeded to publish an event"
# - If maxPayload is properly configured: "Succeeded to publish an event"
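When the log output is long, the two messages listed above can be counted instead of read line by line. The sketch below matches only the two substrings quoted in this runbook; the function names and the sample log lines are hypothetical, and real log lines will contain additional fields.

```python
def classify_log_line(line: str) -> str:
    """Classify one EventSource log line by the messages quoted in this runbook."""
    if "maximum payload exceeded" in line:
        return "payload_too_large"
    if "Succeeded to publish an event" in line:
        return "published"
    return "other"


def summarize_logs(log_text: str) -> dict:
    """Count how many log lines fall into each category."""
    counts = {"payload_too_large": 0, "published": 0, "other": 0}
    for line in log_text.splitlines():
        counts[classify_log_line(line)] += 1
    return counts


# Made-up log excerpt for illustration only
sample_logs = "\n".join([
    '{"msg":"Succeeded to publish an event"}',
    '{"error":"failed after retries: nats: maximum payload exceeded"}',
    '{"msg":"unrelated message"}',
])
print(summarize_logs(sample_logs))
```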

### Cleanup Test Workflow

After testing, clean up the test workflow:

In [None]:
# Delete test workflows
!kubectl delete workflow -n argo-workflows -l release_version=v0.99.99-test

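# Also clean up any distribute workflows triggered by the test (if any were created).
# Field selectors do not support wildcards, so a selector such as metadata.name!=auto-*
# excludes nothing and the delete could remove real distribute workflows.
# List the recent workflows first, then delete the test-triggered ones by name.
!kubectl get workflows -n data-release --sort-by=.metadata.creationTimestamp | grep distribute-data-release | tail -5

# Replace DISTRIBUTE_WORKFLOW_NAME with a name from the list above, then uncomment
# !kubectl delete workflow -n data-release DISTRIBUTE_WORKFLOW_NAME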