# 🔁 Week 07-08 · Notebook 04 · Runnable Sequences & Workflow Resilience

Coordinate sequential, parallel, and fallback runnables to meet maintenance SLAs while respecting safety policies.

## 🎯 Learning Objectives
- Build runnable sequences that orchestrate retrieval, reasoning, and translation steps.
- Apply error handling strategies (retry, fallback, circuit-breaker) for plant operations.
- Monitor per-stage latency to enforce a sub-400 ms SLA for on-call technicians.
- Embed governance triggers that escalate to SMEs when automated responses fail.

## 🧩 Scenario
The night shift in Pune receives vibration alarms and needs fast guidance. If retrieval fails, the system must fall back to cached recipes and page the reliability engineer.
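
Before the LangChain version, the scenario's control flow can be sketched in plain Python. This is a framework-agnostic illustration, not the notebook's implementation: `guidance_with_fallback`, `offline_fetch`, `cached_recipe`, and the `pages` list standing in for a paging service are all hypothetical names, and the single bounded retry is an assumption chosen to keep guidance fast.

```python
from typing import Callable


def guidance_with_fallback(
    fetch: Callable[[str], str],
    cached: Callable[[str], str],
    page: Callable[[str], None],
    context_key: str,
    retries: int = 1,
) -> str:
    """Try `fetch` up to `retries + 1` times; if every attempt fails, page and use the cache."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return fetch(context_key)
        except ConnectionError as exc:
            last_error = exc  # remember the most recent failure for the page message
    page(f"Retrieval failed for {context_key}: {last_error}")
    return cached(context_key)


# Hypothetical stand-ins for the vector store, the cache, and the paging service.
def offline_fetch(key: str) -> str:
    raise ConnectionError("Vector DB is offline")


def cached_recipe(key: str) -> str:
    return f"Cached recipe for {key}: consult GTM-001."


pages: list[str] = []  # collects page messages instead of calling a real pager
answer = guidance_with_fallback(
    offline_fetch, cached_recipe, pages.append, "vibration_alarm_press_12"
)
print(answer)
print(pages)
```

The paging call happens only after the retry budget is spent, so a single transient error does not wake the reliability engineer.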

In [None]:
import time
from langchain_core.runnables import RunnableLambda, RunnableParallel
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# --- Step 1: Define Individual Components ---

# Simulate a retrieval function that might fail
def fetch_retrieval(inputs: dict) -> str:
    """
    Simulates fetching a document from a vector store.
    Receives the full workflow input dict and reads its "context_key".
    Set to fail for this demonstration.
    """
    context_key = inputs["context_key"]
    print(f"Attempting to retrieve document for: {context_key}")
    # Simulate failure
    raise ConnectionError("Retrieval failed: Vector DB is offline.")
    # In a success case, it would return:
    # return "SOP-455 states that vibration above 5 mm/s requires immediate shutdown and inspection."

# A fallback function that provides a cached or default response
def fallback_cache(inputs: dict) -> str:
    """
    Provides a safe, cached response when the primary retrieval fails.
    """
    print("--- Primary retrieval failed. Using fallback cache. ---")
    issue = inputs.get("issue", "the reported issue")
    return f"Cached Guidance: For {issue}, consult the general troubleshooting manual (GTM-001) and notify the shift supervisor."

# --- Step 2: Build the Runnable Sequence using LCEL ---

# The main prompt for the LLM
prompt = ChatPromptTemplate.from_template(
    "Context: {context}\n\nIssue: {issue}\n\nAnswer succinctly with SOP citations if available."
)

# The LLM to use for generation
llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)

# Define the primary generation chain
generation_chain = prompt | llm | StrOutputParser()

# Define the retrieval chain with a fallback
# .with_fallbacks() creates a resilient chain that tries the primary runnable first
# and executes the fallback if the primary one fails.
retrieval_with_fallback = RunnableLambda(fetch_retrieval).with_fallbacks(
    fallbacks=[RunnableLambda(fallback_cache)]
)

# --- Step 3: Compose the final workflow ---

# The final workflow uses a RunnableParallel to structure the input for the generation_chain.
# It runs the retrieval_with_fallback and passes the 'issue' straight through.
workflow = (
    RunnableParallel(
        context=retrieval_with_fallback,
        issue=lambda inputs: inputs["issue"]
    )
    | generation_chain
)


# --- Step 4: Execute and Monitor ---

start = time.perf_counter()

# The input dictionary for the workflow
run_input = {
    "context_key": "vibration_alarm_press_12",
    "issue": "High vibration alarm on Press 12"
}

# Invoke the workflow
final_answer = workflow.invoke(run_input)

latency_ms = (time.perf_counter() - start) * 1000

print("\n--- Final Answer ---")
print(final_answer)
print(f'\nLatency: {latency_ms:.1f} ms')
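
The cell above times the workflow end to end, which cannot show which stage consumed the budget. A minimal sketch of per-stage timing follows, assuming the stages can be called one after another as plain functions. `run_timed`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the sleeps only simulate work. The 400 ms figure is the SLA from the learning objectives.

```python
import time
from typing import Any, Callable

SLA_MS = 400.0  # latency budget from the learning objectives


def run_timed(stages: list[tuple[str, Callable[[Any], Any]]], value: Any):
    """Run stages in order; return the final value and each stage's latency in ms."""
    timings: dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)  # each stage's output feeds the next stage
        timings[name] = (time.perf_counter() - start) * 1000
    return value, timings


# Hypothetical stand-in stages; the sleeps simulate retrieval and generation time.
def fake_retrieve(issue: str) -> dict:
    time.sleep(0.02)
    return {"issue": issue, "context": "cached guidance"}


def fake_generate(inputs: dict) -> str:
    time.sleep(0.01)
    return f"{inputs['issue']}: follow {inputs['context']}"


result, timings = run_timed(
    [("retrieve", fake_retrieve), ("generate", fake_generate)],
    "High vibration alarm on Press 12",
)
total_ms = sum(timings.values())
for name, ms in timings.items():
    print(f"{name}: {ms:.1f} ms")
print(f"total: {total_ms:.1f} ms, within SLA: {total_ms < SLA_MS}")
```

The same per-stage dictionary could later be exported to a metrics backend. Here it is only printed.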

### ⚠️ Circuit Breaker Pattern
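Track failure counts: if more than 3 failures occur within 10 minutes, disable automated responses and notify SMEs. The demo cell uses a threshold of 3 and a 60-second window so that it runs quickly. Because the class compares with `>=`, the breaker opens on the third failure; set `failure_threshold=4` to match "more than 3".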

In [None]:
import time
from datetime import datetime, timedelta

class CircuitBreaker:
    """
    A simple circuit breaker implementation to prevent repeated failures.
    """
    def __init__(self, failure_threshold: int, recovery_timeout_seconds: int):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = timedelta(seconds=recovery_timeout_seconds)
        self.failures = []
        self.is_open = False
        self.last_opened_time = None

    def record_failure(self):
        """Records a failure and opens the circuit if the threshold is met."""
        now = datetime.now()
        self.failures.append(now)
        
        # Drop failures older than the window (the recovery timeout doubles as the window length)
        self.failures = [t for t in self.failures if now - t < self.recovery_timeout]
        
        if len(self.failures) >= self.failure_threshold:
            self.is_open = True
            self.last_opened_time = now
            print(f"CIRCUIT BREAKER OPENED at {now}. Further calls will be blocked.")

    def can_execute(self) -> bool:
        """Checks if the circuit is closed or if the recovery timeout has passed."""
        if not self.is_open:
            return True
        
        if datetime.now() - self.last_opened_time > self.recovery_timeout:
            self.reset()
            print("CIRCUIT BREAKER RESET. Calls are now permitted.")
            return True
            
        print("Circuit breaker is open. Call is blocked.")
        return False

    def reset(self):
        """Resets the circuit breaker to a closed state."""
        self.is_open = False
        self.failures = []
        self.last_opened_time = None

# --- Example Usage ---
breaker = CircuitBreaker(failure_threshold=3, recovery_timeout_seconds=60)

for i in range(5):
    if breaker.can_execute():
        print(f"Attempt {i+1}: Executing the operation...")
        # Simulate a failure
        breaker.record_failure()
        time.sleep(1) # Simulate time between calls
    else:
        print(f"Attempt {i+1}: Operation blocked by circuit breaker.")
        # In a real app, you would wait or redirect here
        time.sleep(10)

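# Check that the breaker resets after the timeout. The wait is commented out
# to keep the demo short; uncomment the three lines below to observe the reset.
print("\n--- Recovery check (skipped by default) ---")
# time.sleep(61)
# if breaker.can_execute():
#     print("System recovered and is operational again.")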
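
The `CircuitBreaker` class above uses one duration for both the failure window and the recovery timeout, and it does not notify anyone when it opens. The variant below is a sketch under stated assumptions, not a replacement for the class above: it separates the 10-minute window from the cooldown, opens on more than 3 failures, and calls a notification hook. `WindowedBreaker`, its parameters, and the `alerts` list standing in for an SME notification channel are hypothetical. The injectable `clock` exists only so the demo runs without real waiting.

```python
import time
from collections import deque
from typing import Callable


class WindowedBreaker:
    """Opens when more than `max_failures` failures occur within `window_s` seconds."""

    def __init__(self, max_failures: int, window_s: float, cooldown_s: float,
                 notify: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.notify = notify  # hypothetical hook, e.g. a pager or chat alert
        self.clock = clock
        self.failures: deque = deque()
        self.opened_at = None

    def record_failure(self) -> None:
        now = self.clock()
        self.failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if self.opened_at is None and len(self.failures) > self.max_failures:
            self.opened_at = now
            self.notify(f"{len(self.failures)} failures within {self.window_s} s; "
                        "automated responses disabled")

    def can_execute(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.reset()  # cooldown elapsed, close the breaker again
            return True
        return False

    def reset(self) -> None:
        """Close the breaker; also usable as a manual SME override."""
        self.failures.clear()
        self.opened_at = None


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
alerts: list[str] = []
windowed = WindowedBreaker(max_failures=3, window_s=600, cooldown_s=60,
                           notify=alerts.append, clock=lambda: now[0])
for _ in range(4):
    windowed.record_failure()
    now[0] += 1.0
print(windowed.can_execute())  # False: the fourth failure opened the breaker
windowed.reset()               # manual override
print(windowed.can_execute())  # True
print(alerts)
```

Using a monotonic clock by default avoids surprises if the wall clock is adjusted while the breaker is open.
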

## 🧪 Lab Assignment
1. Add a parallel branch that translates outputs into Spanish using `RunnableParallel`.
2. Extend the fallback to fetch last known guidance from Redis when retrieval fails.
3. Log per-node latency and completion status to Prometheus.
4. Demo the circuit breaker resetting after a manual override by an SME.

## ✅ Checklist
- [ ] Runnable sequence deployed
- [ ] Failure handling strategy documented
- [ ] Latency metrics instrumented
- [ ] Lab deliverables verified with ops team

## 📚 References
- LangChain Runnables Guide
- Site Reliability Engineering: Circuit Breakers
- Week 09-10 Monitoring Notebook