feat(connectors): add Apache Doris sink connector #3215

Open
ryankert01 wants to merge 5 commits into apache:master from ryankert01:feat/doris-sink-connector

@ryankert01
Member

@ryankert01 ryankert01 commented May 5, 2026

Which issue does this PR close?

Closes #3112

Rationale

Adds an Apache Doris sink so Iggy streams can be written into Doris for analytical querying.

What changed?

Iggy had no path to land messages in Apache Doris. A new iggy_connector_doris_sink crate consumes JSON payloads and writes them via Doris's HTTP Stream Load API (PUT /api/{db}/{table}/_stream_load).

The non-obvious bits the connector handles: re-attaching Authorization across the FE→BE 307 redirect (which reqwest strips by default), parsing the JSON Status body to classify success / Label Already Exists / transient (Publish Timeout, 5xx) / permanent (Fail, 4xx, unknown), and emitting a deterministic per-batch label so replays are deduplicated by Doris's label-keep window. v1 is sink-only, JSON-only, HTTP Basic auth only, and assumes pre-created tables — no DDL.
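
The Status-body classification described above can be sketched roughly as follows. This is a minimal illustration, not the connector's code: `LoadOutcome` and `classify_status` are hypothetical names, and the real crate maps these cases onto its own error types (`CannotStoreData`, `PermanentHttpError`).

```rust
/// Outcome of one Stream Load attempt (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum LoadOutcome {
    /// The batch was stored, or Doris had already seen this label.
    Ok,
    /// Worth retrying, e.g. "Publish Timeout".
    Transient,
    /// Retrying will not help; the batch should go to the DLQ.
    Permanent,
}

/// Classify the `Status` string taken from the Stream Load response body.
pub fn classify_status(status: &str) -> LoadOutcome {
    match status {
        "Success" | "Label Already Exists" => LoadOutcome::Ok,
        "Publish Timeout" => LoadOutcome::Transient,
        // "Fail" and any unrecognised status are treated as permanent,
        // so an unknown answer cannot cause an endless retry loop.
        _ => LoadOutcome::Permanent,
    }
}
```

Treating unknown statuses as permanent is the conservative choice the description implies: a batch that cannot be understood is parked rather than replayed forever.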

Local Execution

  • Passed
  • Pre-commit hooks ran. Pre-push C#/Java hooks skipped (no dotnet/JDK locally; contribution is Rust-only).

AI Usage

  1. Claude Code (Anthropic).
  2. Crate scaffolding against quickwit_sink / influxdb_sink, testcontainer fixture, and iteration on the Stream Load redirect + Status-body classification.
  3. 14 unit tests + 6 integration tests against a real apache/doris:doris-all-in-one-2.1.0 container, covering happy path, 1k-row bulk, max_filter_ratio, label-replay dedupe, missing-target-table, and columns derived expressions; row state verified via the MySQL frontend. Claude also helped write the docs.
  4. Yes.

@ryankert01 ryankert01 marked this pull request as draft May 5, 2026 17:58
@ryankert01 ryankert01 changed the title from "feat(connectors): add Apache Doris sink connector (#2753)" to "feat(connectors): add Apache Doris sink connector" May 5, 2026
@ryankert01 ryankert01 force-pushed the feat/doris-sink-connector branch from a9f3652 to b5434dd Compare May 5, 2026 18:17
@codecov

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 87.68473% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.74%. Comparing base (239d7ee) to head (470c85c).
⚠️ Report is 16 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| core/connectors/sinks/doris_sink/src/lib.rs | 87.68% | 39 Missing and 11 partials ⚠️ |

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3215       +/-   ##
=============================================
- Coverage     74.46%   51.74%   -22.72%     
  Complexity      943      943               
=============================================
  Files          1188     1187        -1     
  Lines        106543    93393    -13150     
  Branches      83560    70428    -13132     
=============================================
- Hits          79332    48325    -31007     
- Misses        24459    42484    +18025     
+ Partials       2752     2584      -168     

| Components | Coverage Δ |
|---|---|
| Rust Core | 45.43% <87.68%> (-30.29%) ⬇️ |
| Java SDK | 60.14% <ø> (ø) |
| C# SDK | 69.13% <ø> (-0.33%) ⬇️ |
| Python SDK | 81.43% <ø> (ø) |
| Node SDK | 91.53% <ø> (+0.12%) ⬆️ |
| Go SDK | 39.80% <ø> (ø) |

| Files with missing lines | Coverage Δ |
|---|---|
| core/connectors/sinks/doris_sink/src/lib.rs | 87.68% <87.68%> (ø) |

... and 307 files with indirect coverage changes

@ryankert01 ryankert01 force-pushed the feat/doris-sink-connector branch 5 times, most recently from 7e263c9 to 03ed46c Compare May 11, 2026 12:58
Sink connector that writes Iggy messages to Apache Doris via the HTTP
Stream Load API. v1 scope: JSON payloads only, HTTP Basic auth,
pre-created tables only (no DDL).

Behaviour:
- Manual 307/308 redirect following (capped at 5) so the Authorization
  header survives the FE -> BE hop, which reqwest strips by default.
- Deterministic per-batch label
  ({prefix}-{stream}-{topic}-{partition}-{first_offset}-{last_offset})
  so replays are deduplicated by Doris within label_keep_max_second.
- Response body Status field drives error classification: Success and
  "Label Already Exists" -> Ok; Publish Timeout -> CannotStoreData
  (transient); Fail or any unknown status -> PermanentHttpError so the
  runtime DLQs the batch instead of looping.
- Optional columns / where / max_filter_ratio / batch_size / timeout
  forwarded as Stream Load headers.
- Password held as secrecy::SecretString; auth header wrapped in
  SecretString so Debug derivation never leaks the base64 credential.
- Client built in open() with InitError on failure; fe_url validated
  there too so a bad config fails at startup rather than first batch.
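
As a rough illustration of the deterministic label listed above, the format can be written as a pure function of the batch coordinates. `batch_label_v1` is a hypothetical name used only here, and the parameter types are assumptions.

```rust
/// Build the per-batch label in the form
/// `{prefix}-{stream}-{topic}-{partition}-{first_offset}-{last_offset}`.
/// The function name and the integer types are assumptions for this sketch.
pub fn batch_label_v1(
    prefix: &str,
    stream: &str,
    topic: &str,
    partition: u32,
    first_offset: u64,
    last_offset: u64,
) -> String {
    // No clock or random component: the same batch always yields the same label,
    // which is what lets Doris recognise a replay within its label-keep window.
    format!("{prefix}-{stream}-{topic}-{partition}-{first_offset}-{last_offset}")
}
```

For example, `batch_label_v1("iggy", "orders", "events", 0, 100, 199)` yields `iggy-orders-events-0-100-199`, and replaying that batch produces the identical string.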

Tests: 6 integration tests under core/integration/tests/connectors/doris
backed by an apache/doris all-in-one testcontainer (FE HTTP + FE MySQL).
Coverage includes happy path, 1k-row bulk, max_filter_ratio skip path,
label-replay dedupe, missing-target-table (proves no auto-create), and
the columns derived-expression header. The container must bind host:8040
1:1 because the FE 307-redirects to 127.0.0.1:8040; tests are serialized
via a 'doris' nextest test-group (max-threads = 1) so concurrent test
processes don't race for that port.
Addresses review feedback on the Doris sink connector before merge.

Correctness:
- Label format now appends an 8-hex blake3 of the *raw* stream/topic names,
  so streams that sanitize identically (e.g. `events.v1` vs `events_v1`)
  can no longer collide and silently dedupe against each other in Doris.
  Each variable-length segment is also truncated; total label is bounded
  under Doris's 128-char cap regardless of input length.
- `build_label` is now a pure `pub` free function. The integration test's
  manual label construction (used to verify server-side dedupe) now calls
  it directly, so the test cannot drift from the production format.
- `consume` tracks the *most severe* error across chunks via `record_error`:
  permanent shadows transient. The previous first-error strategy let a
  transient error from chunk N hide a permanent error from chunk M and
  caused the runtime to retry forever instead of routing to DLQ.
- HTTP 408 (Request Timeout) and 429 (Too Many Requests) classified as
  `CannotStoreData` (transient). They are 4xx but recoverable; the old
  code lumped them with all 4xx and DLQ'd retryable conditions.
- Parse failures on the response body now return `PermanentHttpError`.
  An unparseable 200-OK is almost always a Doris bug or proxy interference
  — retrying the same bytes won't help.
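
The "most severe error wins" rule and the 408/429 carve-out above might look like the sketch below. The `Severity` type, the signature given to `record_error`, and `classify_http_error` are assumptions made for illustration; the crate's actual error types differ.

```rust
/// Error severity, declared in increasing order so the derived `Ord`
/// makes `Permanent` compare greater than `Transient`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Transient,
    Permanent,
}

/// Remember the worst error seen so far across chunks (hypothetical signature).
pub fn record_error(worst: &mut Option<Severity>, new: Severity) {
    *worst = Some(match *worst {
        Some(prev) => prev.max(new), // permanent shadows transient
        None => new,
    });
}

/// Classify an HTTP error status (assumed to be >= 400).
pub fn classify_http_error(code: u16) -> Severity {
    match code {
        // 408 and 429 are 4xx but recoverable; 5xx is treated as transient too.
        408 | 429 | 500..=599 => Severity::Transient,
        _ => Severity::Permanent,
    }
}
```

With this ordering, a permanent failure in any chunk decides the batch's fate regardless of the order in which chunk errors arrive.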

Security:
- `open()` rejects `database`/`table` values outside `[A-Za-z0-9_]+`.
  Doris would reject them server-side anyway, but rejecting at config-load
  also prevents path traversal in the `/api/{db}/{table}/_stream_load` URL.
- `open()` emits a `warn!` when `fe_url` is `http://` and the host is
  not loopback. README's new "Security notes" section spells out the
  trust boundary the manual-redirect-following implies (a compromised FE
  could exfiltrate credentials via a hostile `Location` header).
- Response body truncated to 4 KB at a UTF-8 boundary before being
  formatted into errors or logs, so a misbehaving proxy that returns a
  giant body cannot OOM the connector or flood logs.
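
Truncating a body at a UTF-8 boundary, as the last item describes, can be done with the standard library alone. A minimal sketch follows; `MAX_BODY_BYTES` and `truncate_at_char_boundary` are hypothetical names, not the connector's.

```rust
/// 4 KB cap applied before a response body is put into errors or logs.
pub const MAX_BODY_BYTES: usize = 4 * 1024;

/// Return at most `max_bytes` of `body`, never splitting a multi-byte character.
pub fn truncate_at_char_boundary(body: &str, max_bytes: usize) -> &str {
    if body.len() <= max_bytes {
        return body;
    }
    let mut end = max_bytes;
    // Step back until `end` sits on a character boundary.
    // Index 0 is always a boundary, so the loop terminates.
    while !body.is_char_boundary(end) {
        end -= 1;
    }
    &body[..end]
}
```

Slicing a `&str` at an arbitrary byte index would panic when the index falls inside a multi-byte character, which is why the boundary check matters.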

Robustness:
- Explicit `connect_timeout` (5 s) so an unreachable FE fails fast
  instead of consuming the full request timeout on the handshake alone.
- `send_stream_load` takes `bytes::Bytes`; clones inside the redirect
  loop are now refcount bumps instead of full `Vec<u8>` copies.
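
The redirect loop referred to above can be sketched without any HTTP library by passing the transport in as a closure. Everything here (`Response`, `send_following_redirects`, the closure shape) is hypothetical; the real connector uses reqwest and `bytes::Bytes`.

```rust
/// Maximum number of 307/308 hops to follow.
pub const MAX_REDIRECTS: usize = 5;

/// The two response fields the loop needs (hypothetical type).
pub struct Response {
    pub status: u16,
    pub location: Option<String>,
}

/// Send a request and follow 307/308 redirects manually, passing the same
/// Authorization header to every hop so it survives the FE -> BE redirect.
/// `send` stands in for the real HTTP call and receives (url, auth_header).
pub fn send_following_redirects<F>(
    start_url: &str,
    auth_header: &str,
    mut send: F,
) -> Result<Response, String>
where
    F: FnMut(&str, &str) -> Response,
{
    let mut url = start_url.to_string();
    // One initial request plus at most MAX_REDIRECTS follow-ups.
    for _ in 0..=MAX_REDIRECTS {
        let resp = send(&url, auth_header);
        match (resp.status, resp.location) {
            (307 | 308, Some(next)) => url = next,
            // Anything else, including a redirect without Location, is returned as-is.
            (status, location) => return Ok(Response { status, location }),
        }
    }
    Err(format!("gave up after {MAX_REDIRECTS} redirects"))
}
```

Keeping the hop count bounded means a misconfigured or hostile `Location` chain ends in an error rather than an unbounded loop.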

Observability:
- `warn!` when Doris reports `number_filtered_rows > 0` — schema drift
  in upstream messages was previously logged at `info!` and easy to miss.
- Per-batch success log demoted from `info!` to `debug!`.
- README documents `Expect: 100-continue`, `label_keep_max_second`
  guidance, and the filtered-row alert.

Tests: 21 unit tests pass (was 13, added 8 covering hash-suffix label
collision resistance, label length cap, severity ordering, identifier
validation, and log truncation). All 6 testcontainer integration tests
pass against a real Doris all-in-one image.
@ryankert01 ryankert01 force-pushed the feat/doris-sink-connector branch from 03ed46c to 9fd85d6 Compare May 11, 2026 12:59
@ryankert01 ryankert01 marked this pull request as ready for review May 11, 2026 13:01
@ryankert01 ryankert01 marked this pull request as draft May 11, 2026 13:18
…308 redirects

Three small README gaps surfaced during a re-read against the post-review
code:

- `database` / `table` must match `[A-Za-z0-9_]+`. The connector rejects
  anything else at startup with `Error::InvalidConfigValue` — surface the
  constraint where operators look for it (Requirements + Configuration
  table).
- Non-JSON payloads are dropped with `warn!` and the offset advances past
  them. That is silent data loss, not a recoverable skip, so the README
  now spells it out instead of glossing it as "skipped with a warning".
- `308 Permanent Redirect` is followed in addition to `307` (defensive),
  and the redirect cap of 5 is documented.
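
The `[A-Za-z0-9_]+` constraint on `database` / `table` can be checked without a regex. The sketch below uses hypothetical helper names and a plain `String` error in place of the connector's `Error::InvalidConfigValue`.

```rust
/// True if `s` is non-empty and consists only of ASCII letters, digits, or `_`.
pub fn is_valid_identifier(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'_')
}

/// Build the Stream Load path, rejecting identifiers that could alter the URL.
pub fn stream_load_path(database: &str, table: &str) -> Result<String, String> {
    for (name, value) in [("database", database), ("table", table)] {
        if !is_valid_identifier(value) {
            return Err(format!("invalid {name}: {value:?}"));
        }
    }
    Ok(format!("/api/{database}/{table}/_stream_load"))
}
```

Because `/` and `.` are outside the allowed set, a value such as `../admin` is rejected before it can reach the URL.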
@ryankert01 ryankert01 marked this pull request as ready for review May 11, 2026 13:50
@ryankert01
Member Author

There's a possible performance optimization (it may or may not help) that I'd like to leave for a follow-up PR:
multi-output-format support. The sink currently uses JSON, but JSON seems to be slow in Doris, so I can test CSV and Parquet.

Four pre-merge check failures from the previous commit, all mechanical:

- typos: `unparseable` → `unparsable` (1 in README, 2 in lib.rs comments).
- markdown lint MD013: README's label-format bullet was 583 chars; split
  into a parent bullet + 3 sub-bullets, all within the 500-char cap.
- rustfmt: trailing blank line in the integration test after the recent
  removal of the local `sanitize` helper.
- cargo sort: `iggy_connector_doris_sink` was added under
  `iggy_connector_sdk` in core/integration/Cargo.toml; reordered so the
  dependency list stays alphabetical.

No behavior change. 21 unit tests still pass; `cargo fmt --check` and
`cargo sort --workspace --check` both clean locally.
@hubcio
Contributor

hubcio commented May 12, 2026

We'll check this in the next 2-3 days.

@hubcio
Contributor

hubcio commented May 14, 2026

/ready

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label May 14, 2026

Labels

S-waiting-on-review PR is waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Apache Doris connector

2 participants