shipyard: phase 4 — CI wiring for TaskScheduler integration tests#115
Merged
…ation tests) Archives prior code-coverage PROJECT.md/ROADMAP.md and starts a new initiative: fix NetMQ lock contention in the TaskScheduler module, publish 0.4.0, and add end-to-end integration tests in DNQ. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1 targets the sibling TaskScheduler repo. 4 plans across 4 waves:
PLAN-1.1 (wave 1): NetMQ API probe + SetCountMsg struct
PLAN-1.2 (wave 2): Poller infrastructure refactor (3 sequential tasks)
PLAN-1.3 (wave 3): Async-friendly Start() + Dispose cleanup
PLAN-2.1 (wave 4): Concurrency + state + lifecycle test suite
Also archives the prior code-coverage phases/1-5/ to phases-archive-code-coverage/ so phase 1 can restart fresh for the new project.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1 shipped on sibling repo branch phase-1-lock-fix (12 commits). _lockSocket = 0, 9/9 tests green (concurrency 5/5 in loop), Release clean.
All gates PASS: Verification DONE, Audit CLEAN, Simplifier production-clean (2 deferred), Documenter CHANGELOG draft ready for Phase 2.
Deferred to ISSUES.md: ISSUE-025..028 (RunPoller race, probe test cleanup, test-helper DRY, Start() remarks XML doc).
Next: /shipyard:plan 2 for NuGet 0.4.0 release in sibling repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 plans across 5 waves targeting the sibling TaskScheduler repo:
PLAN-1.1 (wave 1): Pre-flight — verify 0.3.0 Symbols badge on
nuget.org + capture release commit/tag format
PLAN-2.1 (wave 2): Merge phase-1-lock-fix -> master
PLAN-2.2 (wave 2): [CONDITIONAL] Fix CI workflow to use combined
deploy/*.nupkg push form (only if 0.3.0 Symbols
were red)
PLAN-3.1 (wave 3): Release commit — version bump, CHANGELOG,
README, Start() remarks XML doc (closes
ISSUE-028)
PLAN-4.1 (wave 4): Local clean pack + .nupkg inspection (pre-tag
sanity)
PLAN-5.1 (wave 5): Tag v0.4.0, user pushes tag, GH Actions
publishes, user verifies on nuget.org
Publish path: tag-triggered GitHub Actions (mirrors 0.3.0), NOT
local dotnet nuget push. API key stays in GH secrets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0.4.0 is live on nuget.org, published via GH Actions tag-triggered workflow. Sibling TaskScheduler repo released with:
commit: b904ac3 (5 files, +23/-2)
tag: v0.4.0 (annotated, unsigned, mirrors v0.3.0)
push jobs: both .nupkg and .snupkg green on GH Actions run 24423676631
verified: user confirmed all green badges on nuget.org
Phase 2 success criteria (ROADMAP lines 148-156): all 4 PASS.
Deferred non-verification gates (audit/simplifier/documenter) by explicit user decision: the Phase 2 diff is 5-file version/text/doc, with no code logic changes and no new deps.
Closed: ISSUE-028 (Start() <remarks> XML doc landed in release commit).
Opened: ISSUE-029 (GH Actions Node.js 20 deprecation advisory).
Next: /shipyard:plan 3 for DotNetWorkQueue integration test project referencing 0.4.0 from nuget.org.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 plans across 3 waves. New test project in this repo:
Source/DotNetWorkQueue.TaskScheduling.Distributed.TaskScheduler.Integration.Tests/
PLAN-1.1 (wave 1): Scaffold csproj + CPM entry for 0.4.0 + SLN
wiring + AssemblyInit + TestHelpers (port
seeds 50000/55000/60000, Linux-aware beacon)
PLAN-2.1 (wave 2): EndToEndSchedulingTests (real DNQ jobs)
PLAN-2.2 (wave 2): ConcurrencyRegressionTests (30s deadlock guard)
PLAN-2.3 (wave 2): NodeDiscoveryTests (two-node UDP beacon)
PLAN-3.1 (wave 3): Full 5x loop + Release solution build
Opened ISSUE-030: upstream README uses udpBroadcastPort: (wrong name)
Workaround: Phase 3 tests use positional args.
Orchestrator inline fixes to PLAN-2.2 and PLAN-2.3: verifier
confirmed ITaskSchedulerJobCountSync is public in 0.4.0, stripped
pivot-to-ATaskScheduler fallback prose; added mandatory
sync.Start() rationale (without Start() the _outbound null-safe
guard from Phase 1 makes the concurrency test a false positive).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r TaskScheduler 0.4.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etWorkQueue.sln Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Helpers.cs with port/beacon constants Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
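A minimal sketch of what the port constants and a `NextPort(ref counter)` helper could look like, assuming the helper simply bumps a per-class counter atomically. Only the seed values 50000/55000/60000 and the names `ConcurrencyPortBase`, `NodeDiscoveryPortBase`, and `NextPort` come from the commit messages on this page; `TestHelpersSketch`, `EndToEndPortBase`, and the `Interlocked` body are assumptions for illustration, not the project's actual code.

```csharp
using System.Threading;

public static class TestHelpersSketch
{
    // Hypothetical name; only the 50000 seed appears in the plan text.
    public const int EndToEndPortBase = 50000;
    public const int ConcurrencyPortBase = 55000;
    public const int NodeDiscoveryPortBase = 60000;

    // Assumed implementation: atomically increment the caller's counter
    // and return the new value, so parallel tests never share a port.
    public static int NextPort(ref int counter) => Interlocked.Increment(ref counter);
}
```

Each test class would seed its own counter from its base, which keeps the three port ranges disjoint as long as no class requests more than 5000 ports.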
The paper spec said net10.0;net8.0 but the CONTEXT-3 rationale "matches the rest of DNQ's test projects" was factually wrong — Memory integration tests and every other DNQ integration test project are net10.0-only, and Jenkins CI only runs net10.0 on ubuntu-latest (CLAUDE.md). User chose Option B: relax the spec to single-target net10.0 to match reality. Updated ROADMAP, CONTEXT-3, VERIFICATION, and all five PLAN files to remove stale net8.0 references.
Cross-repo regression guard for Phase 1's TaskSchedulerJobCountSync lock fix. Uses the IContainer closure pattern to resolve ITaskSchedulerJobCountSync after InjectDistributedTaskScheduler registers it, then hammers IncreaseCurrentTaskCount/DecreaseCurrentTaskCount from 12 threads with 5000 iterations each. 30s deadlock detector via Task.WaitAll; final count asserted == 0 via FluentAssertions. Start() is called before spawning threads so _outbound is initialized and the real concurrency path (not the null-safe no-op guard) is exercised — without Start(), the test would pass even if Phase 1's lock fix were reverted. Disjoint port counter via TestHelpers.NextPort(ref _portCounter) with ConcurrencyPortBase=55000. Positional args to InjectDistributedTaskScheduler (ISSUE-030 workaround). Test passed green in 5/5 consecutive full-suite runs locally.
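The shape of that hammer test can be sketched with BCL types only. This is an illustration under stated assumptions, not the test in the PR: `FakeJobCountSync` is a hypothetical stand-in for the real `ITaskSchedulerJobCountSync` implementation (no sockets, no `Start()`), and `HammerSketch` is an invented name. What it does show is the deadlock detector: `Task.WaitAll` with a timeout returns `false` instead of hanging.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in: models only the increase/decrease counting contract.
public sealed class FakeJobCountSync
{
    private long _count;
    public long Current => Interlocked.Read(ref _count);
    public void IncreaseCurrentTaskCount() => Interlocked.Increment(ref _count);
    public void DecreaseCurrentTaskCount() => Interlocked.Decrement(ref _count);
}

public static class HammerSketch
{
    // Returns the final count; throws if the workers miss the deadline.
    public static long Run(FakeJobCountSync sync, int threads, int iterations, TimeSpan deadline)
    {
        var workers = new Task[threads];
        for (var t = 0; t < threads; t++)
        {
            workers[t] = Task.Factory.StartNew(() =>
            {
                for (var i = 0; i < iterations; i++)
                {
                    sync.IncreaseCurrentTaskCount();
                    sync.DecreaseCurrentTaskCount();
                }
            }, TaskCreationOptions.LongRunning);
        }

        // WaitAll(Task[], TimeSpan) returns false on timeout: the deadlock guard.
        if (!Task.WaitAll(workers, deadline))
            throw new TimeoutException("workers did not finish; possible deadlock");

        return sync.Current;
    }
}
```

With the values from the commit message, a call would be `HammerSketch.Run(sync, 12, 5000, TimeSpan.FromSeconds(30))`, followed by an assertion that the result is 0.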
…sses (PLAN-2.1)
SharedClasses.cs: cloned from DotNetWorkQueue.Transport.Memory.Integration.Tests
(provides Helpers + VerifyQueueRecordCount). Added explicit
`using DotNetWorkQueue.Transport.Memory;` because namespace walk-up resolves
IDataStorage in the Memory test project (namespace
DotNetWorkQueue.Transport.Memory.Integration.Tests) but not in this project
(namespace DotNetWorkQueue.TaskScheduling.Distributed.TaskScheduler.Integration.Tests).
This is the same shadowing issue documented in CLAUDE.md for IConfiguration
and Metrics.Metrics — different symbol, same root cause.
EndToEndSchedulingTests: scope-reduced from the original plan after discovering
that DotNetWorkQueue.IntegrationTests.Shared.Consumer.Implementation.SimpleConsumer.Run<>()
exposes `Action<TTransportCreate> setOptions` for transport options, not
`Action<IContainer> registerService`. There is no seam in the shared runner
through which a test can both reuse the shared producer/consumer flow AND
inject the distributed task scheduler into the consumer's IContainer.
Additionally, Memory transport's in-process data storage lives per
QueueContainer instance, so a naive two-container hand-roll where the producer
creates the queue in one container and the consumer reads from another never
sees the produced messages (the producer's 50 messages stay pending, the
consumer sees an empty store).
The smoke test proves the critical claim: InjectDistributedTaskScheduler
integrates cleanly into a real DNQ consumer's SimpleInjector container —
QueueContainer<MemoryMessageQueueInit>(...) construction runs SimpleInjector
Verify() over every binding introduced by the injection, so a broken
registration would throw ActivationException at the `using (var creator = ...)`
line. Phase 3's critical cross-repo regression guard (Phase 1's lock fix)
is covered by ConcurrencyRegressionTests which exercises the real injection
path end-to-end via IContainer.GetInstance<ITaskSchedulerJobCountSync>().
Uses positional args to InjectDistributedTaskScheduler (ISSUE-030 workaround).
Queue name is `"q" + Guid.NewGuid().ToString("N")` because DNQ queue names
must be alphanumeric/underscore/dot and the default Guid format contains
hyphens.
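The queue-name point can be checked in isolation. In the sketch below, `QueueNameSketch` and the regular expression are assumptions that paraphrase the rule as stated above (alphanumeric, underscore, dot); the real DNQ validation is not reproduced here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QueueNameSketch
{
    // "N" formats a Guid as 32 hex digits with no hyphens or braces.
    public static string NewQueueName() => "q" + Guid.NewGuid().ToString("N");

    // Assumed rule, paraphrased from the commit message, not taken from DNQ source.
    public static bool IsValid(string name) => Regex.IsMatch(name, @"^[A-Za-z0-9_.]+$");
}
```

The default `Guid.ToString()` uses the "D" format, which inserts four hyphens, so `"q" + Guid.NewGuid()` would fail the same check.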
Two SchedulerContainer instances share one UDP beacon port and verify:
1. DiscoverEachOther_RemoteCountConverges: node B observes RemoteCountChanged after node A bumps its local task count, within a 10s deadline.
2. NodeStop_RemoteCountDecays: after node A is disposed, node B's aggregate count drops below the pre-disposal snapshot within 15s (RemovedNode decay path).
Uses the IContainer closure pattern via a private `Node` helper class that captures the container during the SchedulerContainer(registerService) callback, then triggers build via CreateTaskScheduler() and resolves ITaskSchedulerJobCountSync. This is the same pattern ConcurrencyRegressionTests uses and is the ONLY way to reach the sync instance in 0.4.0: SchedulerContainer exposes CreateTaskScheduler() / CreateTaskFactory() but no GetInstance<T>(). An earlier revision of this file (from the parallel-wave builder) used a nonexistent SchedulerContainer.GetInstance<ITaskSchedulerJobCountSync>() and produced 10 compile errors; it was rewritten during build recovery using the pattern discovered by PLAN-2.2's builder.
Disjoint port counter via TestHelpers.NextPort(ref _portCounter) with NodeDiscoveryPortBase=60000. BeaconInterface goes through TestHelpers.BeaconInterface (never hardcoded "loopback").
Both test methods passed green in 5/5 consecutive full-suite runs locally.
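The closure pattern described in this commit can be illustrated without the real libraries. Every type in the sketch is a hypothetical stand-in: `FakeContainer` and `FakeSchedulerContainer` are not the DNQ or TaskScheduler API, and a `string` plays the role of the sync instance. The point is only the ordering: capture the container in the registration callback, trigger the build, then resolve.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a DI container.
public sealed class FakeContainer
{
    private readonly Dictionary<Type, object> _instances = new Dictionary<Type, object>();
    public void Register<T>(T instance) where T : class => _instances[typeof(T)] = instance;
    public T GetInstance<T>() where T : class => (T)_instances[typeof(T)];
}

// Hypothetical stand-in: exposes a factory method but no GetInstance<T>().
public sealed class FakeSchedulerContainer
{
    private readonly Action<FakeContainer> _registerService;

    public FakeSchedulerContainer(Action<FakeContainer> registerService) =>
        _registerService = registerService;

    // Building creates the container and runs the callback; the container is never returned.
    public object CreateTaskScheduler()
    {
        _registerService(new FakeContainer());
        return new object();
    }
}

public sealed class Node
{
    private FakeContainer _captured;
    public string Sync { get; }

    public Node()
    {
        var creator = new FakeSchedulerContainer(c =>
        {
            c.Register<string>("sync-instance"); // stands in for the injection call
            _captured = c;                        // capture the container via closure
        });
        creator.CreateTaskScheduler();            // triggers the build and the callback
        Sync = _captured.GetInstance<string>();   // resolve only after the build
    }
}
```

Resolving before `CreateTaskScheduler()` would dereference a null `_captured`, which is why the build call has to come first in this arrangement.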
Phase 3 (DNQ integration test project for TaskScheduler 0.4.0) is complete. All 5 ROADMAP success criteria satisfied:
- #1 Debug build clean on net10.0 (single-target, spec corrected mid-build)
- #2 Full suite 5x flakiness loop: 5/5 green, 4 tests, ~26s per run
- #3 NuGet 0.4.0 resolution verified (no project reference)
- #4 Release build -p:CI=true: 0 errors (2 pre-existing SYSLIB0012 warnings in LiteDB/SQLite ConnectionString.cs; not regressions)
- #5 Pre-existing DNQ tests pass: 896/896 core + 57/57 Memory integration
Test classes:
- ConcurrencyRegressionTests (PLAN-2.2): Phase 1 lock-fix cross-repo regression guard. 12 threads × 5000 iters, 30s deadlock detector. Start() called before threads to exercise the real concurrency path.
- NodeDiscoveryTests (PLAN-2.3): UDP beacon discovery + disposal decay. Uses IContainer closure pattern (SchedulerContainer has no GetInstance<T>); rewritten during build recovery.
- EndToEndSchedulingTests (PLAN-2.1): scope-reduced to a SimpleInjector Verify() smoke test after three blockers surfaced: shared SimpleConsumer runner has no Action<IContainer> seam, Memory transport storage is per-container, SharedSetup/VerifyMetrics/Metrics.Metrics are internal to the shared test project. Scope reduction user-approved during build.
Build spec corrected mid-phase: TargetFrameworks changed from net10.0;net8.0 to net10.0 single-target. The plan's rationale ("matches the rest of DNQ's test projects") was factually wrong: Memory.Integration.Tests is net10.0-only, and Jenkins CI only runs net10.0 on ubuntu-latest (CLAUDE.md).
Phase gates:
- Verification: PASS (Mode A plan-time + Mode B build-time rollup)
- Audit: PASS (0 critical, 0 high, 0 medium, 2 informational)
- Simplification: PASS_NO_ACTION (no duplication severe enough to extract)
- Documentation: PASS_NO_ACTION (inline docs present, CLAUDE.md lessons deferred to /shipyard:ship)
Ready for /shipyard:plan 4 (Jenkins + GitHub Actions CI wiring).
Wires the Phase 3 DNQ integration test project into both CI surfaces:
- .github/workflows/ci.yml: append a new dotnet test step after the last unit test step (first integration test in ci.yml; existing jobs are unit-test-only).
- Jenkinsfile: append a 14th parallel stage after the Dashboard stage with sleep(time: 65, unit: 'SECONDS') matching the (n-1)*5 stagger formula.
5 locked CONTEXT-4 decisions guide the build:
1. Optimistic UDP multicast: no pre-emptive skip mechanism, no --network=host, no [TestCategory]. If NetMQ beacon fails on Jenkins Docker on the feature-branch run, open a follow-up issue and add the skip in a separate PR.
2. Feature-branch-first validation before merging to master.
3. Exclude new stage from Codecov: no Coverlet --collect, no --settings, no .trx upload. Amends ROADMAP success criterion #2.
4. Append to Jenkins parallel block at the END (not alphabetical, not adjacent to Memory).
5. Add to existing unit+integration job in ci.yml (no new job).
Correction captured in CONTEXT-4 post-research: use -c Debug, NOT -c Release -p:CI=true (the latter is a NuGet packaging flag, not a test-run flag, and all 13 existing stages use -c Debug).
Plan structure: 1 wave, 2 plans, strictly disjoint file sets (PLAN-1.1 touches only ci.yml; PLAN-1.2 touches only Jenkinsfile) so they can run in parallel.
- PLAN-1.1 (risk: low): GitHub Actions step, uses --no-build to match the 11 existing unit-test steps.
- PLAN-1.2 (risk: medium due to UDP multicast unknown): Jenkins stage with sleep(time: 65), explicit "NO Coverlet" guardrails via scoped awk+grep verification.
Verification: Mode A PASS (coverage + structure), Mode B READY (feasibility stress test: files exist, line references accurate, stagger formula verified, verify commands runnable, no forward references or hidden dependencies).
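The stagger arithmetic is small enough to verify directly. The helper below is a hypothetical illustration written in C# for consistency with the rest of the project; the Jenkinsfile itself hardcodes the value in `sleep(time: 65, unit: 'SECONDS')`, and nothing on this page suggests the formula is computed at runtime.

```csharp
public static class StaggerSketch
{
    // n is the 1-based position of the stage in the parallel block;
    // the first stage starts immediately, each later one waits 5s more.
    public static int DelaySeconds(int n) => (n - 1) * 5;
}
```

For the 14th stage this gives (14 - 1) * 5 = 65 seconds, matching the sleep value in the plan; a hypothetical 15th stage would land on 70.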
# Conflicts:
#	.shipyard/HISTORY.md
#	.shipyard/ROADMAP.md
#	.shipyard/STATE.json
#	.shipyard/phases/5/plans/PLAN-1.1.md
Phase 4 (CI wiring for TaskScheduler integration tests) build is complete on branch phase-4-ci-wiring. Two plans, one wave, two atomic commits:
- 5bdcf84 PLAN-1.1 .github/workflows/ci.yml: append dotnet test step
- 6ecd8e8 PLAN-1.2 Jenkinsfile: append 14th parallel stage (sleep 65s)
All 5 CONTEXT-4 locked decisions honored:
1. Optimistic UDP (no skip/network=host/TestCategory)
2. Feature branch first (commits on phase-4-ci-wiring, not master)
3. Exclude from Codecov (no --collect/--settings/--results-directory)
4. Append at end of Jenkins parallel block (formula (n-1)*5 holds)
5. Same ci.yml job (no new job)
Phase gates:
- Verification: PASS (structural; runtime deferred to post-push CI)
- Audit: PASS (0 critical, 0 high, 0 medium, 2 informational)
- Simplification: PASS_NO_ACTION (15-line additive diff, no surface)
- Documentation: PASS_NO_ACTION (deferred to /shipyard:ship)
Success criteria #1 (GitHub Actions run) and #2 (Jenkins stage run) are deferred to the first push of phase-4-ci-wiring; those are enforced by the live CI systems, not local verification. Per CONTEXT-4 decision #1, if NetMQ beacon discovery fails on Jenkins Docker, the fallback is a follow-up issue + skip mechanism in a separate PR.
Plan data error captured for lessons: PLAN-1.2 verify check #8 expected unstash count = 13 but actual pre-edit was 14. Reviewer confirmed the builder's diff is purely additive with no touch to the Coverage Report stage. Plan research missed one unstash.
Agent reliability pattern: Phase 4 builds were done inline from the main thread rather than via builder agent dispatches, because Phase 3 showed 4/5 builder agents hit turn budget before completing SUMMARYs. For 1-task plans with exact-match anchors, direct edits are faster and more reliable than agent dispatches.
b8a521c to b77b988
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #115      +/-   ##
==========================================
- Coverage   88.55%   88.50%   -0.06%
==========================================
  Files         994      994
  Lines       29298    29298
  Branches     2380     2380
==========================================
- Hits        25944    25929      -15
- Misses       2475     2484       +9
- Partials      879      885       +6
blehnen added a commit that referenced this pull request Apr 15, 2026
Milestone shipped via PR #115 (merged to master as commit 190f122). All 4 phases complete across 2 repositories:
- Phase 1 (sibling TaskScheduler repo): Lock contention fix. NetMQPoller owns _actor, outbound SetCount routed through NetMQQueue, _lockSocket eliminated. 9/9 unit tests green, 5/5 concurrency loop green.
- Phase 2 (sibling TaskScheduler repo): NuGet 0.4.0 published to nuget.org via tag-triggered GH Actions. Symbols + deterministic Source Link green.
- Phase 3 (this repo): New net10.0 integration test project consuming the 0.4.0 NuGet via CPM. 3 test classes: ConcurrencyRegressionTests (real cross-repo regression guard), NodeDiscoveryTests, EndToEndSchedulingTests (smoke). 4/4 green in 5/5 flakiness loop. 896/896 core + 57/57 Memory integration tests still green.
- Phase 4 (this repo): CI wiring. 1 new GH Actions step + 14th Jenkins parallel stage (sleep 65s, (n-1)*5 stagger preserved). 15-line additive diff. Optimistic UDP multicast on Docker bridge worked on first Jenkins run, no beacon-skip fallback needed.
Ship-time artifacts:
- MILESTONE-REPORT.md: aggregated phase summaries, key decisions, known issues, metrics
- LESSONS.md: 15-entry milestone section covering process lessons (Jenkins PR-triggered, git fetch discipline, agent turn budgets), code/API discoveries (SchedulerContainer closure pattern, Start() invariant, queue name format, Memory per-container storage, cross-namespace walk-up), and CI config clarifications
- CLAUDE.md: 9 new Lessons Learned entries, plus the "13 → 14 parallel integration test stages" update and the Jenkins PR-trigger note added to the Conventions CI section
- STATE.json: status=shipped, position="TaskScheduler 0.4.0 milestone shipped via PR #115"
Worktree .worktrees/phase-4-ci-wiring removed; local phase-4-ci-wiring branch deleted; remote branch auto-deleted by GitHub on merge.
30 issues remain open in ISSUES.md, preserved for the next milestone. Notable: ISSUE-030 upstream README positional-args workaround still applies; pre-existing SYSLIB0012 warnings in LiteDB/SQLite ConnectionString.cs tracked for a future cleanup phase.
Summary
Wires the Phase 3 `DotNetWorkQueue.TaskScheduling.Distributed.TaskScheduler.Integration.Tests` project into both CI surfaces. This is the final phase of the `TaskScheduler 0.4.0` milestone (issue #6 lock-contention fix + NuGet release + DNQ-side integration tests + CI wiring).
- `.github/workflows/ci.yml`: one new `dotnet test` step appended after `Unit Tests - Memory`. First integration test in `ci.yml`; existing job was unit-only.
- `Jenkinsfile`: one new 14th parallel stage (`TaskScheduler Distributed`) appended after `stage('Dashboard')`, with `sleep(time: 65, unit: 'SECONDS')` following the `(n-1)*5` stagger formula.
Change set
- `.github/workflows/ci.yml`
- `Jenkinsfile`
Total: 15 insertions, 0 deletions, 2 files. Strictly additive; no existing stage or step is touched.
CONTEXT-4 locked decisions honored
- `--network=host`, no `[TestCategory]`, no beacon-skip env var, no `--filter` on the new Jenkins stage. If NetMQ beacon discovery fails on the Docker bridge network, a follow-up issue + skip mechanism lands as a second commit on this same PR.
- `--collect "XPlat Code Coverage"`, no `--settings`, no `--results-directory`, no `stash`. The new project tests an external NuGet; core DLLs are already covered by the other 13 Jenkins stages.
- `-c Debug` (not `-c Release -p:CI=true`; the latter is a NuGet packaging flag, not a test-run flag).
- `ci.yml` job. No new job, no new matrix, no new trigger. Appended as one step to the existing `build-and-test` job.
Hard gates (this PR enforces)
- `Integration Tests - TaskScheduler Distributed` step green on `ubuntu-latest` / `net10.0`.
- `TaskScheduler Distributed` parallel stage green on the Docker agent. This is the UDP-multicast reality check.
- unstashed coverage reports (the new stage contributes no stash).
Known risks
- `NodeDiscoveryTests` may or may not work on Jenkins' Docker agent. If it fails with a discovery-event timeout, the fallback plan is:
  - `[TestCategory("BeaconRequired")]` to `NodeDiscoveryTests.cs` on this branch
  - `--filter "TestCategory!=BeaconRequired"` to the Jenkinsfile stage
- `(n-1)*5` formula. If a 15th stage is added in a future phase, the stagger formula or the upper bound will need to be revisited.
Shipyard artifacts (for reviewers)
Under `.shipyard/phases/4/` on this branch:
Test plan