@jy-tan jy-tan commented Jan 14, 2026

Summary

Adds resilience and observability to the Python SDK's span export pipeline.

Changes

Resilience (drift/core/resilience.py)

  • Retry with exponential backoff: Automatic retries for transient failures with configurable delays and jitter
  • Circuit breaker: Fail-fast pattern to prevent cascading failures when backend is unavailable
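For readers unfamiliar with the two patterns, here is a minimal sketch of how they could look. It is not the code in drift/core/resilience.py: the names `backoff_delay`, `CircuitBreaker`, and `CircuitOpenError`, and all default values, are assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.1, factor=2.0, max_delay=5.0, jitter=0.1):
    """Delay before retry number `attempt` (0-based): exponential, capped, jittered."""
    delay = min(base * (factor ** attempt), max_delay)
    spread = delay * jitter
    # Jitter spreads retries out so many clients do not retry in lockstep.
    return max(0.0, delay + random.uniform(-spread, spread))


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of hitting an unavailable backend.
                raise CircuitOpenError("backend marked unavailable")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A retry loop would sleep for `backoff_delay(attempt)` between attempts and route each attempt through `CircuitBreaker.call`, so that repeated failures stop producing traffic until the reset timeout has passed.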

SDK Metrics (drift/core/metrics.py)

  • Event-driven warnings at WARN level (no background thread):
    • High span drop rate (>5%)
    • High export failure rate (>10%)
    • Queue nearing capacity (>80%)
    • Circuit breaker state changes
  • Programmatic access via get_sdk_metrics() for power users
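To illustrate what "event-driven" means here, the sketch below checks the drop rate each time a span is recorded rather than polling from a background thread. The class name `DropRateMonitor`, the logger name, and the `min_samples` guard are hypothetical and are not taken from drift/core/metrics.py; only the 5% threshold comes from the list above.

```python
import logging

logger = logging.getLogger("drift.metrics.sketch")  # hypothetical logger name


class DropRateMonitor:
    """Warns once when dropped / total exceeds a threshold; checked on every event."""

    def __init__(self, threshold=0.05, min_samples=100):
        self.threshold = threshold
        self.min_samples = min_samples  # assumed guard against noisy early ratios
        self.exported = 0
        self.dropped = 0
        self._warned = False

    def drop_rate(self):
        total = self.exported + self.dropped
        return self.dropped / total if total else 0.0

    def record(self, dropped):
        if dropped:
            self.dropped += 1
        else:
            self.exported += 1
        self._check()

    def _check(self):
        total = self.exported + self.dropped
        if self._warned or total < self.min_samples:
            return
        if self.drop_rate() > self.threshold:
            self._warned = True  # warn once, not on every subsequent event
            logger.warning("High span drop rate: %.1f%%", self.drop_rate() * 100)
```

The one-shot flag keeps the log quiet after the first warning, which is the usual trade-off with this approach: the flag has to be re-armed somewhere, or later anomalies go unreported.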

API Adapter (drift/core/tracing/adapters/api.py)

  • Integrated retry and circuit breaker for export operations
  • Optional gzip compression (disabled by default, matches Node SDK)
  • Improved error handling: 5xx errors are retryable, 4xx are not
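A rough sketch of the two adapter behaviours listed above, optional gzip and status-based retry classification. The function names, exception names, and JSON payload shape are assumptions for illustration, not the adapter's actual API; treating non-2xx, non-5xx responses other than 4xx as non-retryable is also an assumption.

```python
import gzip
import json


class RetryableExportError(Exception):
    """Hypothetical: server-side failure (5xx) that may succeed on retry."""


class NonRetryableExportError(Exception):
    """Hypothetical: client-side failure (4xx) that retrying cannot fix."""


def encode_payload(spans, compress=False):
    """Serialize spans to JSON; gzip the body only when explicitly enabled."""
    body = json.dumps({"spans": spans}).encode("utf-8")  # payload shape is assumed
    headers = {"Content-Type": "application/json"}
    if compress:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return body, headers


def check_status(status):
    """Return None on 2xx; raise a retryable error on 5xx, non-retryable otherwise."""
    if 200 <= status < 300:
        return None
    if 500 <= status < 600:
        raise RetryableExportError(f"server error {status}")
    raise NonRetryableExportError(f"request rejected with status {status}")
```

Using two exception types lets the retry layer decide purely on the exception class, without re-inspecting the HTTP response.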

Batch Processor (drift/core/batch_processor.py)

  • Integrated metrics collection (spans exported/dropped/failed, latency)
  • Integrated trace blocking for oversized spans (matches Node SDK behavior)

Cleanup

  • Removed unused patch_instances_via_gc (expensive gc.get_objects() scan)

Testing

  • 191 unit tests passing
  • E2E tests verified (Flask, FastAPI, Requests)

@cubic-dev-ai cubic-dev-ai bot left a comment

3 issues found across 7 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="drift/core/tracing/adapters/api.py">

<violation number="1" location="drift/core/tracing/adapters/api.py:236">
P2: 4xx HTTP responses are retried even though they should fail fast, because `_do_export` raises the same generic `Exception` for both 5xx and 4xx, and `retry_async` retries on any `Exception`. Introduce a distinct non-retryable exception (or avoid raising) for 4xx responses so they fail immediately while keeping 5xx retryable.</violation>
</file>
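One possible shape for the fix suggested above, as a sketch only: a marker exception that the retry helper re-raises immediately. `NonRetryableError` and `retry_async_sketch` are hypothetical names and do not reflect the signature of the SDK's `retry_async`.

```python
import asyncio


class NonRetryableError(Exception):
    """Hypothetical marker for failures that retrying cannot fix (e.g. 4xx)."""


async def retry_async_sketch(op, max_attempts=3, base_delay=0.01):
    """Await `op()` up to `max_attempts` times, but never retry NonRetryableError."""
    for attempt in range(max_attempts):
        try:
            return await op()
        except NonRetryableError:
            raise  # fail fast: a 4xx will not improve on retry
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted, surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt))
```

The order of the `except` clauses matters: the specific non-retryable case must come before the generic `Exception` handler, otherwise it would be swallowed and retried.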

<file name="drift/core/metrics.py">

<violation number="1" location="drift/core/metrics.py:270">
P2: `MetricsCollector.reset()` does not clear `_warned_*` flags, so post-reset anomalies never log warnings.</violation>
</file>
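A sketch of what re-arming the warning flags in `reset()` might look like. `MetricsCollectorSketch` and its attribute names are stand-ins; the only detail taken from the review is that the flags share a `_warned_` prefix.

```python
class MetricsCollectorSketch:
    """Hypothetical stand-in for the collector; only the reset logic matters here."""

    def __init__(self):
        self.spans_exported = 0
        self.spans_dropped = 0
        self._warned_drop_rate = False
        self._warned_failure_rate = False

    def reset(self):
        self.spans_exported = 0
        self.spans_dropped = 0
        # Re-arm every one-shot warning so post-reset anomalies are logged again.
        for name in list(vars(self)):
            if name.startswith("_warned_"):
                setattr(self, name, False)
```

Clearing by prefix means a newly added flag is covered automatically; listing the flags explicitly would be more readable but easier to forget.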

<file name="drift/core/batch_processor.py">

<violation number="1" location="drift/core/batch_processor.py:127">
P2: `_dropped_spans` is incremented outside the processor lock when a span is blocked, so concurrent calls can lose updates and under-report dropped spans.</violation>
</file>
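The lost-update problem described above can be avoided by doing the increment under the same lock that guards the rest of the processor state. The sketch below isolates that idea; `DropCounter` is a hypothetical class, not part of drift/core/batch_processor.py.

```python
import threading


class DropCounter:
    """Hypothetical counter whose read-modify-write is guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._dropped_spans = 0

    def record_drop(self, n=1):
        # `+=` is a read, an add, and a write; the lock makes the three atomic together.
        with self._lock:
            self._dropped_spans += n

    @property
    def dropped(self):
        with self._lock:
            return self._dropped_spans
```

In the real processor the simpler fix is probably to move the existing increment inside the existing lock rather than introduce a second one.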

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@jy-tan jy-tan merged commit f11664e into main Jan 14, 2026
14 checks passed
@jy-tan jy-tan deleted the improve-code-quality-2 branch January 14, 2026 23:42