# Sprint 2 Planning — Observability, Reliability, and Private Repo Automation

This notebook confirms repo status after Sprints 0–1, audits tools and automation, and proposes an incremental Sprint 2 plan that minimizes human intervention and preserves safety, policy compliance, and observability.

## Repository status (post Sprint 1)

- Core modules: `auth.py`, `x_client.py`, `rate_limiter.py`, `budget.py`, `scheduler.py`, `actions.py`, `learn.py`, `storage.py`, `logger.py`, plus `config_schema.py` and `user_filter.py`.
- Logging: secret redaction and UTC timestamps, plus a params stringifier with boolean normalization.
- Storage: indexes and dedup logic in place.
- Tests: cover `auth`, `x_client`, `storage`, `budget`, and `logger`.
- Dry-run remains the default for all X write actions.

Developer experience and CI:
- `.vscode/` editor settings and recommended extensions (Copilot, Pylance, Ruff, Coverage Gutters, etc.).
- `.editorconfig`; pre-commit hooks for Ruff (lint/format), MyPy, detect-secrets (with baseline), and mdformat.
- CI matrix: Ubuntu and Windows on Python 3.11/3.12, running Ruff, MyPy, and pytest (coverage ≥ 80%); artifacts uploaded.
- CodeQL workflow; Dependabot for pip + actions (weekly).
- Devcontainer (Python 3.12) installs deps and pre-commit.

Private repo automation:
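- The Sprint 0 bootstrap workflow (labels, milestone, issues, PR) uses `GITHUB_TOKEN`. For private repos, Settings → Actions → General → Workflow permissions must be set to **Read and write**, and **Allow GitHub Actions to create and approve pull requests** must be enabled for the workflow to open PRs. Auto-merge additionally depends on the repo's **Allow auto-merge** setting.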

Archives:
- `_archive/` is retained and lint-cleaned; it can stay with a NOTICE or move to a separate branch/tag.

## Tools & extensions audit

- Collaboration/AI: Copilot + Chat, GitLens, GitHub PRs & Issues — recommended and listed.
- Python: Python + Pylance, Ruff, Pytest, MyPy — integrated and CI-enforced.
- Security/quality: detect-secrets baseline; CodeQL workflow; Dependabot weekly. Bandit can be added as advisory (non-blocking).
- Docs/authoring: Markdown All in One; mdformat via pre-commit; Jupyter used for planning.

## Risks & gaps

1) Private-repo automation fails if the `GITHUB_TOKEN` workflow permission is read-only, and auto-merge may be disabled at the repo level.
2) CodeQL alert visibility may depend on Security & analysis settings and licensing (code scanning on private repos requires GitHub Advanced Security).
3) Observability relies mainly on logs; tracing and metrics (OpenTelemetry) are not yet implemented.
4) Reliability patterns (timeouts, retries with exponential backoff and jitter, idempotency) should be standardized in `x_client`.
5) Governance: enforce branch protection on `main` (PR required, CI must pass) and block direct commits.
6) Archive strategy: keep `_archive/` with a NOTICE or move it to a `legacy-archive` branch/tag.

## Needed assistance (one-time, smallest toggles)

- Settings → Actions → General → Workflow permissions: set `GITHUB_TOKEN` to **Read and write** and enable **Allow GitHub Actions to create and approve pull requests**.
- Pull requests: enable **Allow auto-merge** (optional).
- Branch protection on `main`: require a PR and passing CI checks (Ruff, MyPy, pytest + coverage); CodeQL optional.
- Security & analysis: enable **Code scanning (CodeQL)** for alert visibility.
- Archives: choose whether to keep `_archive/` with a NOTICE or move it to `legacy-archive`.
- No change needed for dry-run: X write actions stay dry-run by default.

## Sprint 2 goals (4–6 days)

- Observability: Add OpenTelemetry tracing baseline and correlate trace IDs in logs.
- Reliability: Standardize timeouts and retries (exponential backoff with jitter); simulate idempotency in dry-run.
- Private repo automation: bootstrap auto-PRs using GITHUB_TOKEN; document required settings.
- Optional: archive cleanup; advisory Bandit job.
- Maintain CI green and ≥ 80% coverage across the matrix.

## Incremental PR plan (small, focused PRs)

PR1 — OpenTelemetry baseline and log correlation
- Add `src/telemetry.py` with minimal tracing helper (console exporter by default; OTLP optional via env/config).
- Instrument scheduler loops and `x_client` API calls; annotate retries and budget checks.
- Inject `trace_id` / `span_id` into structured logs.
- DoD: spans visible in logs; tests for wrappers and log correlation; docs added (OBSERVABILITY.md or MONITORING.md).
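A minimal sketch of the log-correlation step, assuming the standard `logging` module and the OpenTelemetry API package. `TraceContextFilter`, `configure_logging`, and the `x_agent` logger name are hypothetical; the real `logger.py` may emit structured JSON rather than a format string. The filter reads whichever span is current when the record is handled, so spans need to be started as the *current* span (e.g. via `start_as_current_span`) for IDs to show up; outside any span, or without OpenTelemetry installed, it writes all-zero IDs.

```python
import logging

try:
    from opentelemetry import trace as otel_trace
except ImportError:  # OpenTelemetry not installed: fall back to all-zero IDs
    otel_trace = None  # type: ignore[assignment]

_ZERO_TRACE = "0" * 32  # W3C trace IDs are 16 bytes -> 32 hex chars
_ZERO_SPAN = "0" * 16   # span IDs are 8 bytes -> 16 hex chars


class TraceContextFilter(logging.Filter):
    """Attach the current trace_id/span_id (hex) to every LogRecord (hypothetical helper)."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = _ZERO_TRACE, _ZERO_SPAN
        if otel_trace is not None:
            ctx = otel_trace.get_current_span().get_span_context()
            if ctx.is_valid:  # False when no span is active
                record.trace_id = format(ctx.trace_id, "032x")
                record.span_id = format(ctx.span_id, "016x")
        return True  # never drop records; only enrich them


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Wire the filter into a stream handler on the (hypothetical) 'x_agent' logger."""
    handler = logging.StreamHandler()
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    logger = logging.getLogger("x_agent")
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger
```

Attaching the filter to the handler rather than the logger means records propagated from child loggers such as `x_agent.scheduler` are enriched as well.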

PR2 — Reliability: timeouts, retries (jitter/backoff), idempotency
- Standardize request timeouts and the retry policy in `x_client`.
- Define an error taxonomy (retryable vs. fail-fast).
- Simulate idempotency in dry-run with idempotency keys derived by hashing request params; document the production strategy.
- DoD: tests for timeout/retry/backoff; coverage ≥ 80%.
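One possible shape for the error taxonomy and dry-run idempotency keys, sketched with the standard library only. `RETRYABLE_STATUSES`, `is_retryable`, `idempotency_key`, and `DryRunLedger` are hypothetical names, and the retryable status set is a common HTTP convention, not a documented X API contract. The key hashes a canonical JSON encoding (sorted keys, fixed separators), so the same action with the same params yields the same key regardless of dict ordering.

```python
import hashlib
import json
from typing import Any

# Hypothetical taxonomy: transient HTTP statuses are retried, everything else fails fast
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 504})


def is_retryable(status: int) -> bool:
    return status in RETRYABLE_STATUSES


def idempotency_key(action: str, params: dict[str, Any]) -> str:
    """Stable key for one logical write: same action + same params -> same key."""
    canonical = json.dumps(
        {"action": action, "params": params},
        sort_keys=True,          # applies recursively, so nested dict order doesn't matter
        separators=(",", ":"),   # no whitespace variation
        default=str,             # stringify non-JSON values (e.g. datetimes)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DryRunLedger:
    """In-memory record of simulated writes; a repeated key is reported as a duplicate."""

    def __init__(self) -> None:
        self._seen: dict[str, dict[str, Any]] = {}

    def simulate(self, action: str, params: dict[str, Any]) -> tuple[str, bool]:
        key = idempotency_key(action, params)
        duplicate = key in self._seen
        if not duplicate:
            self._seen[key] = {"action": action, "params": params}
        return key, duplicate
```

Because the key ignores time, keeping volatile fields (timestamps, nonces) out of `params` is what lets a retried write collapse onto the same key; a production strategy would also need a persisted store and an expiry window.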

PR3 — Private-repo automation polish
- Add a workflow that uses `GITHUB_TOKEN` to auto-open PRs for `feat/*` branches with default labels and a templated body.
- Update CONTRIBUTING.md to document required repo settings and safeguards.
- DoD: feature branch pushes auto-create PRs; settings doc shipped.
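A hedged sketch of what the auto-PR step could call, using only `urllib` and the GitHub REST endpoints `POST /repos/{owner}/{repo}/pulls` and `POST /repos/{owner}/{repo}/issues/{issue_number}/labels` (labels go through the Issues API because every PR is also an issue). The helper names and the branch-to-title convention are hypothetical. `GITHUB_REPOSITORY` and `GITHUB_REF_NAME` are set by Actions, while `GITHUB_TOKEN` has to be exported into the step's `env:`; the calls only succeed once the repo lets the token write and create pull requests.

```python
import json
import os
import urllib.request

API = "https://api.github.com"


def pr_title_from_branch(branch: str) -> str:
    """Hypothetical convention: 'feat/sprint-2-observability' -> 'feat: sprint 2 observability'."""
    prefix, _, rest = branch.partition("/")
    return f"{prefix}: {rest.replace('-', ' ')}" if rest else branch


def _github_post(url: str, token: str, payload: dict[str, object]) -> urllib.request.Request:
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
        method="POST",
    )


def build_pr_request(repo: str, branch: str, token: str,
                     base: str = "main", body: str = "") -> urllib.request.Request:
    """Request that opens a PR from `branch` into `base`; `repo` is 'owner/name'."""
    payload = {"title": pr_title_from_branch(branch), "head": branch, "base": base, "body": body}
    return _github_post(f"{API}/repos/{repo}/pulls", token, payload)


def build_labels_request(repo: str, pr_number: int, token: str,
                         labels: list[str]) -> urllib.request.Request:
    """Request that adds labels to an existing PR via the Issues API."""
    return _github_post(f"{API}/repos/{repo}/issues/{pr_number}/labels", token, {"labels": labels})


def open_pr_from_env(labels: list[str], body: str = "") -> int:
    """Run inside an Actions job triggered by a push to a feat/* branch."""
    repo = os.environ["GITHUB_REPOSITORY"]   # set by Actions, e.g. "owner/repo"
    branch = os.environ["GITHUB_REF_NAME"]   # set by Actions, e.g. "feat/x"
    token = os.environ["GITHUB_TOKEN"]       # must be passed via `env:` in the step
    with urllib.request.urlopen(build_pr_request(repo, branch, token, body=body)) as resp:
        number = int(json.load(resp)["number"])
    urllib.request.urlopen(build_labels_request(repo, number, token, labels)).close()
    return number
```

A second push to the same branch is expected to fail with HTTP 422 because a PR for that head already exists, so the step should treat that response as a no-op. Events created with `GITHUB_TOKEN` also do not start new workflow runs, so CI results on these PRs would need to come from a `push`-triggered run (or from a different token, such as a GitHub App token).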

PR4 — Archive cleanup
- Move `_archive/` to `legacy-archive` branch/tag; remove from `main`; add README link and notice.
- DoD: repo clarity improved; CI unaffected.

PR5 — Bandit advisory scan
- Add Bandit as a non-blocking CI job that reports findings to the job log.
- DoD: results visible in CI; gating can be tightened later.

## Definition of Done (Sprint 2)

- Tracing: spans for scheduler/API calls; logs contain `trace_id`/`span_id`.
- Reliability: timeouts, retries with jitter/backoff; clear error taxonomy; idempotency simulated in dry-run.
- Automation: the planning auto-PR works, followed by feature auto-PRs for `feat/*` branches.
- Tests & coverage: ≥ 80% across matrix; CI green.
- Docs: observability and repo settings documented.

In [None]:
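# Minimal OpenTelemetry helper; the SDK is an optional dependency
from collections.abc import Iterator
from contextlib import contextmanager
from typing import Any

try:
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    _OTEL = True
except ImportError:  # tracing becomes a no-op when the SDK is not installed
    _OTEL = False


def init_tracing(service_name: str = "x-agent", enable_console: bool = True) -> None:
    """Install the global TracerProvider; call once at process start."""
    if not _OTEL:
        return
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    if enable_console:
        # SimpleSpanProcessor exports synchronously: fine for dev; prefer BatchSpanProcessor in prod
        provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)


@contextmanager
def start_span(name: str, **attributes: Any) -> Iterator[Any]:
    """Open a span and make it current, so log records can pick up its trace/span IDs."""
    if not _OTEL:
        yield None
        return
    tracer = trace.get_tracer(__name__)
    # start_as_current_span ends the span on exit and records any exception raised inside it
    with tracer.start_as_current_span(name, attributes=attributes) as span:
        yield span


# Usage (span name and attribute are illustrative):
# init_tracing()
# with start_span("x_client.request", endpoint="users/me") as span:
#     ...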

In [None]:
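# Retry with capped exponential backoff plus additive jitter
import random
import time
from collections.abc import Callable
from typing import Any


def retry_with_jitter(fn: Callable[[], Any], attempts: int = 3, base: float = 0.5,
                      cap: float = 4.0, retry_on: tuple[type[Exception], ...] = (Exception,)) -> Any:
    """Call fn; retry only on `retry_on` errors and re-raise the last one when attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                raise  # out of attempts: skip the sleep and surface the original error
            # delay doubles each attempt (capped at `cap`) plus up to 200 ms of random jitter
            time.sleep(min(cap, base * (2**i)) + random.uniform(0, 0.2))
    raise AssertionError("unreachable")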

## Owner checklist for private-repo automation

- [ ] Settings → Actions → General → Workflow permissions: set `GITHUB_TOKEN` to **Read and write** and enable **Allow GitHub Actions to create and approve pull requests**.
- [ ] Pull Requests: **Allow auto-merge** (optional).
- [ ] Branch protection on `main`: require PR + CI checks; optional CodeQL.
- [ ] Security & analysis: enable **Code scanning (CodeQL)**.
- [ ] Decide `_archive/` strategy: keep with NOTICE or move to `legacy-archive`.

## Next steps

1) Confirm repo settings above (one-time).
2) Merge this planning PR (the bootstrap workflow opens it automatically).
3) Start implementation on `feat/sprint-2-observability` following PR1–PR5.