# Troubleshooting Production Issues

This notebook serves as an operational runbook for diagnosing and resolving common issues in the production Self-Critique pipeline. It provides code examples and strategies for handling errors, timeouts, and performance degradation.

## Learning Objectives

- **Error Handling**: Implement robust error handling for API calls and data parsing.
- **Retry Strategies**: Use exponential backoff to handle transient issues like rate limiting.
- **Failure Classification**: Categorize errors to understand root causes.
- **Debugging Techniques**: Learn how to diagnose issues using logs and metrics.

---


## Section 1: Common Failure Modes & Recovery

| Failure Mode | Cause | Recovery Strategy |
| :--- | :--- | :--- |
| API Rate Limiting | Exceeding allowed requests/minute | Implement exponential backoff and retry. |
| API 5xx Errors | Server-side issues at the provider | Retry with backoff; have a fallback model if possible. |
| XML Parsing Failure | Malformed XML in LLM response | Add a retry loop that asks the model to fix the XML. |
| Token Limit Exceeded | Input text is too long | Truncate input or use a model with a larger context window. |
| Request Timeout | Network latency or slow model response | Increase client timeout; implement async processing. |
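The failure modes in the table can be told apart in code before a recovery path is chosen. The following is a minimal sketch, assuming the caller has an HTTP status code, a caught exception, or both. `FailureMode`, `TokenLimitExceeded`, `classify_failure`, and `RETRYABLE` are hypothetical names rather than part of the pipeline, and the built-in `TimeoutError` stands in for whatever timeout exception your HTTP client raises.

```python
import xml.etree.ElementTree as ET
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    RATE_LIMITED = "API Rate Limiting"
    SERVER_ERROR = "API 5xx Errors"
    XML_PARSE = "XML Parsing Failure"
    TOKEN_LIMIT = "Token Limit Exceeded"
    TIMEOUT = "Request Timeout"
    UNKNOWN = "Unknown"


class TokenLimitExceeded(Exception):
    """Hypothetical exception raised by the caller when the input is too long."""


def classify_failure(
    status_code: Optional[int] = None,
    exc: Optional[BaseException] = None,
) -> FailureMode:
    """Map a status code and/or exception to a row of the table above."""
    if isinstance(exc, ET.ParseError):
        return FailureMode.XML_PARSE
    if isinstance(exc, TokenLimitExceeded):
        return FailureMode.TOKEN_LIMIT
    if isinstance(exc, TimeoutError):  # assumption: stand-in for the client's timeout error
        return FailureMode.TIMEOUT
    if status_code == 429:  # HTTP 429 Too Many Requests
        return FailureMode.RATE_LIMITED
    if status_code is not None and 500 <= status_code <= 599:
        return FailureMode.SERVER_ERROR
    return FailureMode.UNKNOWN


# Transient modes that are worth retrying with backoff.
RETRYABLE = {FailureMode.RATE_LIMITED, FailureMode.SERVER_ERROR, FailureMode.TIMEOUT}

print(classify_failure(status_code=429).value)
print(classify_failure(exc=TimeoutError()) in RETRYABLE)
```

Keeping the retryable set separate from the classifier means the retry policy can change without touching the classification logic.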


## Section 2: Rate Limiting & Retry Strategies (Exponential Backoff)

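When an API call fails for a transient reason (rate limiting, a 5xx response, or a timeout), retry with a delay that grows after each attempt. Adding random jitter to each delay keeps many clients from retrying in lockstep.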
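To see what "an increasing delay" means in numbers, the sketch below computes the wait before each retry. It assumes the delay doubles on every attempt, is capped at a maximum, and has uniform random jitter added on top. `backoff_schedule` and its parameters (`base`, `cap`, `jitter`) are illustrative names and defaults, not values taken from the pipeline.

```python
import random
from typing import List, Optional


def backoff_schedule(
    max_retries: int = 5,
    base: float = 1.0,
    cap: float = 30.0,
    jitter: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (in seconds) before each retry."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential growth: base, 2*base, 4*base, ... limited by `cap`.
        delay = min(cap, base * 2 ** attempt)
        # Jitter is added per wait; it does not feed into the next delay.
        delays.append(delay + rng.uniform(0, jitter))
    return delays


# With jitter disabled the schedule is deterministic.
print(backoff_schedule(jitter=0.0))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

The cap matters in practice: without it, a long retry sequence can leave a worker sleeping for minutes on a single request.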


In [None]:
import time
import random
import httpx

def call_api_with_retry(url: str, max_retries: int = 5, initial_delay: float = 1.0):
    """Calls an API with exponential backoff and jitter."""
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            print(f"Attempt {attempt + 1}/{max_retries}...")
            # Simulate an API call that might fail (`url` is unused in this simulation)
            if random.random() < 0.8:  # 80% chance of failure
                raise httpx.ReadTimeout("Request timed out")

            response = {"status": 200, "data": "Success!"}
            print("API call successful.")
            return response

        except httpx.ReadTimeout as e:
            if attempt == max_retries - 1:
                break  # Out of attempts: don't sleep before giving up
            # Jitter applies to this wait only, so it doesn't compound into the base delay
            sleep_for = delay + random.uniform(0, 1)
            print(f"Error: {e}. Retrying in {sleep_for:.2f} seconds...")
            time.sleep(sleep_for)
            delay *= 2  # Double the base delay

    raise RuntimeError(f"API call failed after {max_retries} attempts.")

# Example usage
try:
    result = call_api_with_retry("http://example.com/api")
except Exception as e:
    print(e)


## Section 3: Handling XML Parsing Failures

Sometimes, the LLM may return malformed XML. We can create a recovery loop that sends the faulty XML back to the model and asks it to correct it.


In [None]:
import xml.etree.ElementTree as ET

def parse_xml_with_recovery(llm_output: str, max_retries: int = 2) -> ET.Element:
    """Tries to parse XML, and on failure, asks the LLM to fix it."""
    # One initial parse plus up to `max_retries` recovery attempts, so the
    # output of the final recovery is always parsed before giving up.
    for attempt in range(max_retries + 1):
        try:
            return ET.fromstring(llm_output)
        except ET.ParseError as e:
            if attempt == max_retries:
                break  # No recovery attempts left
            print(f"XML parsing failed: {e}. Attempting recovery...")
            # In a real app, this would be an LLM call
            llm_output = f"<root><fixed>{llm_output.replace('<', '')}</fixed></root>"  # Simulate fixing
            print("Simulated LLM call to fix XML.")

    raise ValueError(f"Failed to parse XML after {max_retries} recovery attempts.")

malformed_xml = "<root><item>One</item><item>Two</malformed></root>"
try:
    parsed = parse_xml_with_recovery(malformed_xml)
    print("Successfully parsed XML after recovery.")
except ValueError as e:
    print(e)
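The simulated fix above can be swapped for a real model call by passing the repair step in as a function. This is a minimal sketch, assuming the model call can be wrapped as a function that takes a prompt string and returns the model's text. `build_repair_prompt`, `parse_with_llm_repair`, `repair_fn`, and the prompt wording are all hypothetical, and `fake_repair` stands in for the LLM so the example runs offline.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def build_repair_prompt(bad_xml: str, error: str) -> str:
    """Build a prompt that shows the model its output and the parser error."""
    return (
        "The following XML failed to parse.\n"
        f"Parser error: {error}\n"
        "Return only the corrected, well-formed XML.\n\n"
        f"{bad_xml}"
    )


def parse_with_llm_repair(
    llm_output: str,
    repair_fn: Callable[[str], str],
    max_repairs: int = 2,
) -> ET.Element:
    """Parse XML; on failure, ask `repair_fn` for a corrected version and re-parse."""
    current = llm_output
    for attempt in range(max_repairs + 1):
        try:
            return ET.fromstring(current)
        except ET.ParseError as e:
            if attempt == max_repairs:
                raise ValueError(
                    f"Failed to parse XML after {max_repairs} repair attempts."
                ) from e
            current = repair_fn(build_repair_prompt(current, str(e)))
    raise AssertionError("unreachable")


# Stand-in for a real LLM call: always returns a known-good document.
def fake_repair(prompt: str) -> str:
    return "<root><item>One</item><item>Two</item></root>"


root = parse_with_llm_repair(
    "<root><item>One</item><item>Two</malformed></root>", fake_repair
)
print([item.text for item in root.findall("item")])  # ['One', 'Two']
```

Including the parser error in the prompt gives the model the line and column of the failure, which is usually more useful than asking it to "fix the XML" with no context.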


## Section 4: Diagnosing Quality Degradation

When quality scores drop, use the following checklist to diagnose the cause:

1.  **Check for Data Drift**: Has the input data changed? Use the `advanced_monitoring_drift_detection.ipynb` notebook to compare the distribution of recent inputs against a baseline.
    - **Query** (PostgreSQL-style interval syntax): `SELECT AVG(paper_length) FROM logs WHERE time > now() - INTERVAL '1 day'`

2.  **Review Failing Examples**: Look at specific inputs that are receiving low quality scores. Is there a pattern?
    - **Query**: `SELECT paper_text, final_summary, critique FROM pipeline_results WHERE overall_quality < 7 ORDER BY timestamp DESC LIMIT 10`

3.  **Check for Prompt Injection**: Have users started submitting inputs designed to subvert the prompts?

4.  **A/B Test a Fix**: Once you have a hypothesis (e.g., "the prompt doesn't handle bullet points well"), create a new prompt version and test it against the old one using the `model_evaluation_qa.ipynb` notebook.
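Step 1 of the checklist above can be approximated with a quick numeric check before opening the drift notebook. The sketch below assumes `paper_length` values have already been fetched for a baseline window and a recent window, and it reports how many standard errors the recent mean sits from the baseline mean. `length_drift_zscore` and the sample numbers are illustrative only. This is a rough screen, not the method used in `advanced_monitoring_drift_detection.ipynb`.

```python
from statistics import mean, stdev
from typing import Sequence


def length_drift_zscore(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """z-score of the recent mean paper length relative to the baseline."""
    if len(baseline) < 2 or len(recent) < 1:
        raise ValueError("need at least 2 baseline values and 1 recent value")
    base_sd = stdev(baseline)
    if base_sd == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    # Standard error of a mean of len(recent) samples drawn from the baseline.
    standard_error = base_sd / len(recent) ** 0.5
    return (mean(recent) - mean(baseline)) / standard_error


# Made-up sample data for illustration.
baseline_lengths = [1000, 1100, 900, 1050, 950]
recent_lengths = [2000, 2100, 1900, 2050, 1950]

z = length_drift_zscore(baseline_lengths, recent_lengths)
print(f"z = {z:.1f}, drift suspected: {abs(z) > 3}")
```

A threshold of 3 is a common rule of thumb rather than a tuned value. Pick one that matches how noisy your input lengths normally are.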


## Section 5: Performance Anomaly Root Cause Analysis

When latency spikes or throughput drops, use this checklist:

1.  **Check External API Status**: Is the LLM provider (e.g., Anthropic) reporting an incident? Check their status page.

2.  **Analyze Latency by Stage**: Use distributed tracing (OpenTelemetry) to see which stage of the pipeline is causing the slowdown.

3.  **Check for Large Inputs/Outputs**: Are you processing unusually large papers or generating very long summaries? Correlate latency with token counts.
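    - **Prometheus Query** (mean request latency over the last 5 minutes; compare it against token counts for the same window): `rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])`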

4.  **Check Infrastructure Resources**: Are your pods being CPU-throttled or running close to their memory limits? Check your Kubernetes dashboard for resource utilization.
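For step 3 of the checklist above, "correlate latency with token counts" can be made concrete with a Pearson correlation over per-request samples. The sketch below assumes per-request token counts and latencies have already been exported, for example from logs. `pearson` is a hand-written helper and the sample numbers are made up for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-request samples: total tokens and end-to-end latency in seconds.
token_counts = [1200, 3400, 800, 5600, 2500]
latencies_s = [2.1, 5.0, 1.6, 8.3, 3.9]

r = pearson(token_counts, latencies_s)
print(f"r = {r:.2f}")
```

A coefficient near 1 suggests the slowdown tracks input or output size. A value near 0 points away from payload size and toward the provider or infrastructure checks in the list.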
