Parallel Add Ordering: A Research Note And Proposal #153

gwokhou · 2026-06-30T05:31:54Z

This document is an opinionated research note. I am using the current OpenKB parallel-add implementation as the grounding point, then proposing how batch ordering should evolve when LLM- or index-generated Markdown quality is affected by write order.

The core idea is that ordering should be treated as a product and architecture decision, not as an incidental side effect of thread scheduling. Prepare work can be parallel; official commits still need a deliberate, reviewable order.

Positioning: this is not a full roadmap. It is a research-backed design proposal that contributes a direction, evaluates alternatives, and leaves an implementation path that can later be split into a narrower spec.

Context

This proposal assumes the parallel architecture roadmap has reached the point where the prepare/commit split is real:

prepare workers are allowed to run concurrently;
prepare workers write only private .openkb/staging/prepare/... output;
workers do not acquire the mutation lock or publish official KB state;
a serial mutation owner resolves final names, snapshots rollback-critical paths, publishes staged artifacts, compiles wiki output, updates the registry, and marks the commit point.

In the current implementation, the calling thread holds the exclusive KB ingest lock across both prepare and commit. That means these strategies improve throughput inside one directory-add batch; they do not make multiple commands mutate the same KB concurrently.

Under that architecture, mutation safety is mostly a solved boundary problem: workers can prepare disposable data, and official KB writes remain serial. The interesting design question is not "can workers write concurrently?" They cannot. The question I want to answer is which prepared document the serial owner should commit next, and what tradeoff that choice makes visible to users.

Commit order matters because each commit observes the official KB state left by the commits before it. OpenKB's compile path does not only generate isolated per-file Markdown; it also reads and updates shared wiki context. The final contents of generated Markdown files can therefore be order-sensitive, especially for:

wiki/summaries/
wiki/concepts/
wiki/entities/
wiki/index.md

The main order-sensitive mechanisms are:

Visible concept/entity context: the concepts plan reads existing concept/entity briefs. A document committed later can update or link to pages created by earlier commits; the same document committed earlier cannot.
Create-vs-update decisions: the LLM may create a concept/entity page when no matching page exists, or update an existing page when a previous commit already introduced it.
Wikilink whitelist: summary rewrites and generated concept/entity pages are constrained by the set of valid wiki targets visible at that commit. Different commit orders produce different valid link sets.
Backlinks and index entries: each commit appends summary links, related concept/entity links, and index.md entries against the wiki state it sees.
Final naming and deduplication: final doc_name resolution and registry deduplication happen under the serial owner. Order can affect which source first claims a clean name and which later source receives a deterministic suffix when names collide.
Long-document artifacts: PageIndex summaries and source JSON are written inside the serial mutation window, and later global wiki generation can see or reference them.
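
A toy illustration of the create-vs-update mechanism, not OpenKB's compile path: each commit sees only the concept pages left by earlier commits, so swapping two documents changes which one creates a shared page. The function name and the sample documents are invented for the example.

```python
def simulate_commits(order, doc_concepts):
    """Replay commits in `order`; return concept owners and per-commit decisions."""
    created_by = {}   # concept -> document that first introduced it
    decisions = []    # (document, concept, "create" | "update")
    for doc in order:
        for concept in sorted(doc_concepts[doc]):
            if concept in created_by:
                decisions.append((doc, concept, "update"))
            else:
                created_by[concept] = doc
                decisions.append((doc, concept, "create"))
    return created_by, decisions


doc_concepts = {
    "design.md": {"mutation lock", "staging"},
    "readme.md": {"staging"},
}

owners_ab, _ = simulate_commits(["design.md", "readme.md"], doc_concepts)
owners_ba, _ = simulate_commits(["readme.md", "design.md"], doc_concepts)
print(owners_ab["staging"])  # design.md
print(owners_ba["staging"])  # readme.md
```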

My goal is not simply to commit whichever file finishes first. The design target is stable, complete, low-redundancy Markdown while preserving deterministic and recoverable mutation semantics.

Working Thesis

My long-term bet is batch/global consolidation rather than reliance on a single lucky file commit order.
For default and reproducible modes, I would keep commit order deterministic and independent of worker completion order.
For quality-oriented modes, I would optimize coverage, diversity, dependency order, and long-context position effects.
I would keep path order as the default because it is the most legible reproducibility contract; quality ordering should be opt-in and experimentally evaluated.
I would allow completion-order commit only as an explicit throughput experiment where the user accepts non-reproducible official KB state.

Research Threads Behind The Proposal

GraphRAG: Build Global Structure Before Generating Global Markdown

GraphRAG's key lesson is to avoid treating global summaries as an incidental side effect of per-document order. Instead, extract entities and relationships, build a graph, cluster it, and then generate summaries from that global structure.

This is directly relevant to OpenKB outputs such as concepts, entities, and index.md.

Reference:

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Potential OpenKB flow:

prepare each document
  -> extract local headings, entities, links, embeddings
  -> build batch-level entity/link graph
  -> resolve clusters and communities
  -> generate consolidated concepts/entities/index
  -> commit final Markdown deterministically
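
A minimal sketch of the "build batch-level entity/link graph" and "resolve clusters" steps, under loud assumptions: entities are already extracted as plain strings, and plain connected components stand in for the community detection a GraphRAG-style pipeline would use. `entity_clusters` is a hypothetical helper, not an OpenKB function.

```python
from collections import defaultdict


def entity_clusters(doc_entities):
    """Group documents that share at least one entity (connected components)."""
    by_entity = defaultdict(set)
    for doc, entities in doc_entities.items():
        for entity in entities:
            by_entity[entity].add(doc)

    # Two documents are neighbors when they mention the same entity.
    neighbors = defaultdict(set)
    for docs in by_entity.values():
        for doc in docs:
            neighbors[doc] |= docs - {doc}

    clusters, seen = [], set()
    for start in sorted(doc_entities):        # sorted for deterministic output
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            doc = stack.pop()
            component.append(doc)
            for nxt in sorted(neighbors[doc]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


batch = {
    "a.md": {"OpenKB", "registry"},
    "b.md": {"registry"},
    "c.md": {"GraphRAG"},
}
print(entity_clusters(batch))  # [['a.md', 'b.md'], ['c.md']]
```

Because the clusters depend only on the batch contents, any global page generated per cluster is independent of the order in which the documents were prepared.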

MMR And Submodular Summarization: Coverage With Low Redundancy

MMR and submodular summarization optimize two competing objectives:

include important, representative information;
avoid repeating information already covered.

This is useful for ordering documents in a batch when many files overlap.

Example scoring model:

score(doc) =
  centrality(doc)
  + coverage_gain(doc, selected)
  - redundancy(doc, selected)
  + metadata_priority(doc)

Possible feature definitions:

centrality: similarity to the batch centroid.
coverage_gain: new topics, entities, or headings introduced by the document.
redundancy: overlap with already selected documents.
metadata_priority: deterministic boosts for README, overview, spec, date, or other known high-signal markers.

Tie-breaks should always be deterministic, such as normalized path order.
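
A runnable sketch of the scoring model above, with stand-in features: entity sets replace embeddings, and centrality is approximated as the share of batch entities a document mentions rather than true centroid similarity. The input shape and `greedy_order` are assumptions for illustration only.

```python
def greedy_order(docs):
    """docs: {path: {"entities": set, "priority": float}} -> ordered list of paths."""
    all_entities = set().union(*(d["entities"] for d in docs.values()))
    path_rank = {path: rank for rank, path in enumerate(sorted(docs))}
    selected, covered = [], set()
    remaining = set(docs)

    def score(path):
        entities = docs[path]["entities"]
        centrality = len(entities) / max(len(all_entities), 1)
        coverage_gain = len(entities - covered)   # entities not yet seen
        redundancy = len(entities & covered)      # entities already covered
        return centrality + coverage_gain - redundancy + docs[path]["priority"]

    while remaining:
        # Highest score wins; ties fall back to sorted path order.
        best = max(remaining, key=lambda p: (score(p), -path_rank[p]))
        selected.append(best)
        covered |= docs[best]["entities"]
        remaining.remove(best)
    return selected


docs = {
    "README.md": {"entities": {"kb", "add", "lock"}, "priority": 1.0},
    "a_dup.md": {"entities": {"kb", "add"}, "priority": 0.0},
    "b_new.md": {"entities": {"graph", "cluster"}, "priority": 0.0},
}
print(greedy_order(docs))  # ['README.md', 'b_new.md', 'a_dup.md']
```

The overlapping `a_dup.md` drops to last place once `README.md` has covered its entities, which is the low-redundancy behavior the scoring model is meant to produce.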

Content Ordering: Respect Natural Narrative And Dependency Structure

Some corpora have a natural order: background before design, design before implementation, implementation before results. Content ordering research captures this idea and is useful for tutorials, papers, specs, and logs.

Reference:

Catching the Drift: Probabilistic Content Models

Practical OpenKB heuristics:

overview / README / abstract
  -> requirements / background
  -> architecture / design
  -> implementation details
  -> experiments / results
  -> appendix / logs

Engineering signals:

filename: README, overview, intro, architecture, design, spec;
headings: Introduction, Background, Method, Results;
timestamps for logs or meeting notes;
links and references between files;
repository/module topology.
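
The filename signal can be sketched as a stage lookup with a path tie-break. The keyword table is hypothetical, substring matching is deliberately naive (for example, "log" would also match "catalog"), and placing unknown files last is an assumption rather than a recommendation.

```python
from pathlib import PurePosixPath

# Hypothetical stage keywords, checked in order; earlier stages commit first.
STAGES = [
    ("overview", ("readme", "overview", "intro", "abstract")),
    ("background", ("requirement", "background")),
    ("design", ("architecture", "design", "spec")),
    ("implementation", ("impl",)),
    ("results", ("experiment", "result")),
    ("appendix", ("appendix", "log")),
]


def stage_rank(path):
    """Return the index of the first stage whose keyword appears in the file stem."""
    stem = PurePosixPath(path).stem.lower()
    for rank, (_, keywords) in enumerate(STAGES):
        if any(keyword in stem for keyword in keywords):
            return rank
    return len(STAGES)  # unknown files sort after every known stage


def narrative_order(paths):
    """Sort by stage, then by path so the result is deterministic."""
    return sorted(paths, key=lambda p: (stage_rank(p), p))


paths = ["docs/results.md", "docs/design.md", "README.md",
         "docs/notes.md", "docs/appendix-log.md"]
print(narrative_order(paths))
# ['README.md', 'docs/design.md', 'docs/results.md',
#  'docs/appendix-log.md', 'docs/notes.md']
```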

Long-Context Position Effects

When several documents or summaries are placed into one prompt, model behavior depends on position. Work on long-context usage shows that information in the middle can be underused compared with information near the beginning or end.

OpenKB implication:

Put global context and high-signal overview material near the beginning.
Put constraints or recap material near the end.
Put low-priority, repetitive, or appendix-like material in the middle.

Prompt layout pattern:

[core overview]
[high-coverage representative documents]
[ordinary supporting documents]
[low-priority or redundant documents]
[final high-signal recap / constraints]

Proposed OpenKB Strategy

Current Baseline: Collect-All Prepare, Canonical Input/Path-Order Commit

The current --jobs > 1 local directory implementation is:

parallel prepare all documents
  -> collect prepared results by input index
  -> serially commit input 0, 1, 2, ...

This is implemented by openkb.parallel_add.add_directory_with_jobs():

prepare workers run through ThreadPoolExecutor and are collected with as_completed;
results are stored by input index;
the commit loop iterates range(len(files));
commit_prepared_document() requires the lock-owning caller thread and commits through the serial mutation path.
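
The shape described above can be sketched as follows. This is not the body of add_directory_with_jobs(); `prepare` and `commit` are stand-in callables, and lock handling, staging cleanup, and rollback are left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def add_directory_collect_all(files, prepare, commit, jobs=2):
    """Prepare every file in parallel, then commit serially in input order."""
    results = [None] * len(files)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(prepare, path): index
                   for index, path in enumerate(files)}
        for future in as_completed(futures):       # completion order
            index = futures[future]
            try:
                results[index] = ("prepared", future.result())
            except Exception as exc:               # prepare failure for this slot
                results[index] = ("failed", exc)

    outcomes = []
    for index in range(len(files)):                # commit order = input order
        status, payload = results[index]
        outcomes.append(commit(payload) if status == "prepared" else "failed")
    return outcomes


committed = []


def fake_prepare(path):
    time.sleep(0.05 if path == "a.md" else 0.0)    # a.md finishes last
    return path.upper()


def fake_commit(prepared):
    committed.append(prepared)
    return "added"


print(add_directory_collect_all(["a.md", "b.md"], fake_prepare, fake_commit))
print(committed)  # ['A.MD', 'B.MD'] even though b.md was prepared first
```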

This baseline is reliable and reproducible, but it leaves throughput on the table because commit does not start until all prepare work has completed.

The CLI serial --jobs=1 directory path still loops over add_single_file() and does not use the PreparedDocument handoff. Existing parallel-add tests compare the internal prepared add_directory_with_jobs(..., jobs=1) and jobs=2 paths, but they do not yet cover the real CLI --jobs=1 versus --jobs>1 split.

Default Ordering Policy: Stable Path Order

My proposed default ordering policy is intentionally conservative:

sorted(path)

Why I would keep it:

reproducible;
easy to debug;
matches current serial expectations;
does not require embeddings or extra LLM calls.

This is an ordering policy, not a scheduling requirement. OpenKB can still execute the policy with a better schedule, such as an input-order streaming queue, as long as the visible commit order remains sorted(path).

Proposed Near-Term Schedule: Input-Order Streaming Queue

My preferred near-term improvement is to turn prepare into a producer pool and commit into a single input-ordered consumer. This gives us earlier commits without changing which document is allowed to commit next:

start bounded prepare workers
next_to_commit = 0

while work remains:
  collect completed prepare results or prepare failures
  while result[next_to_commit] is terminal:
    if result[next_to_commit] is prepared:
      outcome = serial_commit(result[next_to_commit])
      record added / skipped / failed outcome
    else:
      record prepare failure
    next_to_commit += 1

A prepare failure is a terminal input slot for ordering purposes. It must advance next_to_commit after recording "failed"; otherwise one early failed prepare would permanently block later successful prepared documents.

A normal commit failure is also a handled terminal outcome. If the serial mutation rolls back cleanly and returns "failed", the scheduler should record that outcome and continue to the next input slot. Only dirty rollback, KeyboardInterrupt, SystemExit, or another propagated BaseException should abort the streaming scheduler.
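
A minimal Python sketch of this scheduler, assuming Python 3.9+ for `cancel_futures`. `prepare`, `commit`, and `DirtyRollback` are hypothetical stand-ins; real staging cleanup and rollback of an active mutation are omitted.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class DirtyRollback(Exception):
    """Hypothetical signal that a commit could not roll back cleanly."""


def add_directory_streaming(files, prepare, commit, jobs=2):
    """Commit in input order as soon as the next input slot is terminal."""
    outcomes = [None] * len(files)
    ready = {}             # index -> ("prepared", doc) | ("failed", exc)
    next_to_commit = 0
    pool = ThreadPoolExecutor(max_workers=jobs)
    try:
        futures = {pool.submit(prepare, path): index
                   for index, path in enumerate(files)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                ready[index] = ("prepared", future.result())
            except Exception as exc:
                ready[index] = ("failed", exc)
            # Drain every consecutive terminal slot starting at the gate.
            while next_to_commit in ready:
                status, payload = ready.pop(next_to_commit)
                if status == "prepared":
                    # commit() returns "added", "skipped", or "failed";
                    # DirtyRollback propagates and aborts the batch.
                    outcomes[next_to_commit] = commit(payload)
                else:
                    outcomes[next_to_commit] = "failed"
                next_to_commit += 1
    finally:
        # Drops queued prepare work on abort; nothing is left to cancel after a normal run.
        pool.shutdown(wait=True, cancel_futures=True)
    return outcomes


log = []


def fake_prepare(path):
    if path == "b.md":
        raise ValueError("cannot convert")
    return path


def fake_commit(doc):
    log.append(doc)
    return "failed" if doc == "c.md" else "added"


result = add_directory_streaming(["a.md", "b.md", "c.md"], fake_prepare, fake_commit)
print(result)  # ['added', 'failed', 'failed']
print(log)     # ['a.md', 'c.md']
```

The failed prepare for `b.md` advances the gate, so `c.md` still reaches the serial commit path, and its clean commit failure is recorded without stopping the batch.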

Streaming outcome contract:

| Event | Scheduler action | Recorded outcome | Continue batch? | Official state expectation |
| --- | --- | --- | --- | --- |
| Prepare succeeds for current input | Store `PreparedDocument`; commit when `next_to_commit` reaches it | commit result | Yes, unless commit raises | Official writes happen only inside serial commit |
| Prepare raises normal `Exception` | Mark the input slot terminal and advance the gate | `failed` | Yes | No official writes for that input |
| Serial commit returns `added` | Record outcome and clean prepared staging | `added` | Yes | Mutation is committed and visible |
| Serial commit returns `skipped` | Record outcome and clean prepared staging | `skipped` | Yes | Duplicate was revalidated at commit time |
| Serial commit returns `failed` after clean rollback | Record outcome and clean prepared staging | `failed` | Yes | Failed mutation effects are rolled back |
| Dirty rollback raises | Cancel remaining work and clean uncommitted staging | command failure | No | Retained journal owns recovery on next lock acquisition |
| `KeyboardInterrupt`, `SystemExit`, or propagated `BaseException` | Cancel futures, roll back active mutation best-effort, clean staging | command interruption | No | Earlier committed mutations remain committed |

For completed runs with the same successful inputs, this preserves the existing official-state contract:

--jobs=N should match --jobs=1 for the same successful inputs;
LLM compilation observes a stable wiki context order;
final doc names are still resolved under the serial owner;
worker completion order still cannot affect official KB state.

It does not eliminate head-of-line blocking. If input 0 is slow and inputs 1-9 are ready, commits still wait for input 0. That is an explicit reproducibility tradeoff rather than a correctness problem.

It also changes interrupt behavior. The current collect-all implementation does not commit anything until all prepare work has finished; a prepare-phase interrupt can therefore leave no official commits from that batch. A streaming implementation can commit earlier documents while later prepare work is still running. If the command later fails or is interrupted, the earlier successful commits should remain committed. That is a valid partial-progress contract, but it must be documented and tested separately from the current collect-all behavior.

Implementing streaming is an orchestration change, not just a different final for index in ... loop. The executor lifecycle, ready-result buffer, next_to_commit gate, future cancellation, staging cleanup, and dirty-rollback abort path must be coordinated in one scheduler.

Experimental Throughput Idea: Completion Order

If we want a mode that maximizes throughput rather than reproducibility, I would make it explicit:

openkb add docs/ --jobs 4 --order completion

In this mode, whichever prepare finishes first enters the serial commit path first:

for prepared in completed_prepare_order:
    serial_commit(prepared)

Only the lock-owning main thread may run serial_commit. Completion-order mode must not let worker callbacks commit directly; workers still only produce private prepared output.
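
A sketch of that constraint, with hypothetical `prepare` and `commit` callables: the `as_completed` loop runs on the calling thread, so commits happen there even though their order follows worker completion.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def add_directory_completion_order(files, prepare, commit, jobs=4):
    """Commit on the calling thread in whatever order prepares finish."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(prepare, path): path for path in files}
        for future in as_completed(futures):       # non-reproducible order
            path = futures[future]
            try:
                prepared = future.result()
            except Exception:
                outcomes[path] = "failed"
                continue
            # This loop body runs on the caller's thread, never in a worker.
            outcomes[path] = commit(prepared)
    return outcomes


commit_threads = set()


def fake_commit(doc):
    commit_threads.add(threading.get_ident())
    return "added"


outcomes = add_directory_completion_order(
    ["a.md", "b.md", "c.md"], lambda path: path, fake_commit
)
print(commit_threads == {threading.get_ident()})  # True
```

The outcomes are keyed by path rather than returned as an ordered list, because the commit order itself is not part of this mode's contract.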

This can reduce head-of-line blocking and improve batch throughput. However, it makes official KB state depend on worker scheduling, machine performance, file sizes, and transient LLM/indexing latency. That includes generated Markdown, registry ownership for duplicate hashes, final doc_name ownership for name collisions, and raw/source artifact ownership. It matters because each commit can read and rewrite shared global wiki state such as:

wiki/concepts/;
wiki/entities/;
wiki/index.md;
the current wikilink whitelist;
concept/entity briefs used by the LLM planning prompt.

My view is that completion order is acceptable only as an explicit opt-in mode for users who prefer throughput over reproducibility. It should not be the default, and its implementation cost is not just the code change: it also requires a new contract, test expectations, evaluation baselines, and documentation for the fact that --jobs=N may no longer match --jobs=1 official KB state.

Experimental Quality Idea: Quality Order

I would add an opt-in quality ordering mode only after deterministic path order is stable:

openkb add docs/ --jobs 4 --order quality

Suggested pipeline:

Prepare workers extract metadata into private staging.
Build ordering features:
- path;
- title;
- headings;
- timestamp;
- links;
- content hash;
- embedding;
- entities.
Apply dependency/topological ordering when explicit dependencies exist.
Apply coverage/diversity greedy selection for unordered documents.
Use path order as the final deterministic tie-break.
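
The dependency step with a path tie-break can be sketched with Kahn's algorithm and a heap, assuming unique paths and an explicit `depends_on` mapping. Where such a mapping would come from (links, manifest, frontmatter) is left open; the function name is hypothetical.

```python
import heapq


def dependency_order(paths, depends_on):
    """Topological order; among ready documents the smallest path goes first.

    depends_on maps a path to the paths that must be committed before it.
    Dependencies outside the batch are ignored.
    """
    batch = set(paths)
    pending = {p: set(depends_on.get(p, ())) & batch for p in paths}
    dependents = {p: [] for p in paths}
    for path, deps in pending.items():
        for dep in deps:
            dependents[dep].append(path)

    ready = [p for p, deps in pending.items() if not deps]
    heapq.heapify(ready)
    ordered = []
    while ready:
        path = heapq.heappop(ready)
        ordered.append(path)
        for child in dependents[path]:
            pending[child].discard(path)
            if not pending[child]:
                heapq.heappush(ready, child)
    if len(ordered) != len(paths):
        raise ValueError("dependency cycle among input documents")
    return ordered


order = dependency_order(
    ["impl.md", "design.md", "overview.md", "notes.md"],
    {"impl.md": ["design.md"], "design.md": ["overview.md"]},
)
print(order)  # ['notes.md', 'overview.md', 'design.md', 'impl.md']
```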

This requires extending the prepare handoff. Today PreparedDocument carries only the input index, source path, private staging path, and conversion result. Quality and cost modes need an explicit metadata schema for titles, headings, timestamps, links, hashes, embeddings, entities, estimated cost, and any errors or unavailable fields. Manifest mode mostly needs a parser, validation, and coverage reporting, but should share the same dry-run order inspection surface. The dry-run output should expose the relevant metadata and the computed order so users can inspect why a batch will commit in that order.

Prepare-time metadata is advisory. Prepare currently reads the registry for duplicate detection and produces only a candidate document name; a streaming batch can commit earlier documents while later prepared results still carry registry observations from before those commits. Therefore duplicate/skip status, final doc_name, and official artifact ownership must be revalidated inside the serial commit owner.
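
A sketch of commit-time revalidation, assuming the official registry and name set can be modeled as in-memory sets owned by the serial committer. The `-2`, `-3` suffix format is an assumption; the real deterministic suffix may differ.

```python
def revalidate_at_commit(file_hash, candidate_name, registry_hashes, taken_names):
    """Return ("skipped", None) or ("added", final_name) against current state.

    registry_hashes and taken_names stand in for official KB state that only
    the serial commit owner reads and mutates.
    """
    if file_hash in registry_hashes:
        return "skipped", None            # duplicate committed since prepare

    final_name, suffix = candidate_name, 1
    while final_name in taken_names:      # suffix format is an assumption
        suffix += 1
        final_name = f"{candidate_name}-{suffix}"

    registry_hashes.add(file_hash)
    taken_names.add(final_name)
    return "added", final_name


registry, names = set(), set()
first = revalidate_at_commit("hash-1", "notes", registry, names)
second = revalidate_at_commit("hash-2", "notes", registry, names)
repeat = revalidate_at_commit("hash-1", "notes", registry, names)
print(first, second, repeat)
# ('added', 'notes') ('added', 'notes-2') ('skipped', None)
```

Both prepared documents carried the candidate name `notes`; only the commit owner, looking at current state, can decide which one keeps it.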

Expensive metadata such as embeddings or entity extraction should remain opt-in. The default path-order mode should not require extra LLM calls.

Minimum metadata schema:

| Field | Required for | Source | Notes |
| --- | --- | --- | --- |
| `input_index` | all modes | batch enumeration | Preserves stable tie-break and diagnostics |
| `source_path` | all modes | input scanner | Store normalized path relative to the user target when possible |
| `file_hash` | all modes | prepare conversion | Advisory until commit revalidates duplicate state |
| `doc_name_candidate` | all modes | prepare conversion | Advisory only; commit resolves final `doc_name` |
| `title` | quality, dry-run | document parser | Optional; missing title falls back deterministically |
| `headings` | quality, dry-run | document parser | Should preserve document order |
| `links` | quality, manifest validation | parser or Markdown converter | Normalize paths/targets before scoring |
| `timestamp` | quality, manifest validation | file metadata or parsed frontmatter | Optional; never the sole tie-break |
| `estimated_cost` | cost | local estimator | Must be deterministic and cheap |
| `embedding` | quality | opt-in model call | Expensive; never required by default path order |
| `entities` | quality, global consolidation | opt-in extractor | Expensive; store extractor/model version |
| `metadata_errors` | all experimental modes | prepare worker | Missing fields should be visible in dry-run output |

Dry-run output should include input_index, normalized path, selected rank, mode-specific scoring inputs, unavailable metadata fields, and the final tie-break that decided each adjacent ordering decision.
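
One possible shape for the schema and the dry-run rows, as a sketch. `PreparedMetadata` and `dry_run_rows` are hypothetical names, field types are guesses, and mode-specific scoring inputs and tie-break explanations are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreparedMetadata:
    """Hypothetical advisory metadata attached to a prepared document."""
    input_index: int
    source_path: str
    file_hash: str
    doc_name_candidate: str
    title: Optional[str] = None
    headings: tuple = ()
    links: tuple = ()
    timestamp: Optional[str] = None
    estimated_cost: Optional[float] = None
    embedding: Optional[tuple] = None
    entities: Optional[tuple] = None
    metadata_errors: tuple = ()

    def unavailable_fields(self):
        """Names of optional fields that prepare could not fill."""
        optional = ("title", "timestamp", "estimated_cost", "embedding", "entities")
        return [name for name in optional if getattr(self, name) is None]


def dry_run_rows(ordered):
    """One inspectable row per document, in the computed commit order."""
    return [
        {"rank": rank, "input_index": meta.input_index,
         "path": meta.source_path, "unavailable": meta.unavailable_fields()}
        for rank, meta in enumerate(ordered)
    ]


meta = PreparedMetadata(0, "docs/a.md", "hash-1", "a", title="A")
rows = dry_run_rows([meta])
print(rows[0]["unavailable"])
# ['timestamp', 'estimated_cost', 'embedding', 'entities']
```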

Pseudo-code:

selected = []
remaining = list(docs)  # copy so the caller's batch list is not mutated

while remaining:
    # Highest score wins; on equal scores the lowest path rank wins.
    best = max(
        remaining,
        key=lambda doc: (
            centrality(doc)
            + coverage_gain(doc, selected)
            - redundancy(doc, selected)
            + metadata_priority(doc),
            -path_rank(doc),
        ),
    )
    selected.append(best)
    remaining.remove(best)

Experimental User-Control Idea: Explicit Order

Some corpora have an authoritative order that OpenKB cannot infer reliably, such as course chapters, meeting logs, design histories, or paper collections. For those cases, I think a manifest-driven mode is the right escape hatch:

openkb add docs/ --jobs 4 --order manifest --order-file manifest.txt

The manifest must be validated strictly:

every listed path must belong to the input set;
duplicate entries must be rejected;
unlisted inputs must either fail the run or be appended by a documented deterministic fallback;
the final commit loop must remain serial.
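
The validation rules above can be sketched as a single function. The one-path-per-line manifest format and the `append_unlisted` switch are assumptions, not an existing OpenKB format or flag.

```python
def resolve_manifest_order(manifest_lines, input_paths, append_unlisted=False):
    """Validate a manifest against the input set and return the commit order."""
    inputs = set(input_paths)
    ordered, seen = [], set()
    for line in manifest_lines:
        path = line.strip()
        if not path:
            continue                           # ignore blank lines
        if path not in inputs:
            raise ValueError(f"manifest path is not in the input set: {path}")
        if path in seen:
            raise ValueError(f"duplicate manifest entry: {path}")
        seen.add(path)
        ordered.append(path)

    unlisted = sorted(inputs - seen)
    if unlisted and not append_unlisted:
        raise ValueError(f"inputs missing from manifest: {unlisted}")
    return ordered + unlisted                  # fallback: sorted path order


inputs = ["a.md", "b.md", "c.md"]
print(resolve_manifest_order(["b.md", "a.md"], inputs, append_unlisted=True))
# ['b.md', 'a.md', 'c.md']
```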

This mode trades automation for user control while keeping reproducibility.

Experimental Latency Idea: Deterministic Cost Order

A deterministic cost-oriented mode could commit the work estimated to be cheapest first:

order by estimated cost:
  short markdown / text
  short converted documents
  long PDFs / PageIndex work
  path tie-break

This may improve perceived throughput, but it changes wiki generation order and its cost estimates can be wrong. I would keep it lower priority than input-order streaming and quality order.
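
For concreteness, a sketch of a deterministic cost order. The per-suffix weights are invented placeholders; a real estimator would need measurement against actual indexing and LLM latency.

```python
from pathlib import PurePosixPath

# Hypothetical per-byte weights by file type; unknown types get a middle weight.
WEIGHTS = {".md": 1, ".txt": 1, ".pdf": 20}


def estimate_cost(path, size_bytes):
    """Cheap, deterministic cost estimate from file type and size only."""
    suffix = PurePosixPath(path).suffix.lower()
    return size_bytes * WEIGHTS.get(suffix, 5)


def cost_order(files):
    """files: iterable of (path, size_bytes); cheapest first, path tie-break."""
    return [path for path, size in
            sorted(files, key=lambda f: (estimate_cost(f[0], f[1]), f[0]))]


print(cost_order([("b.md", 2000), ("a.pdf", 1000), ("a.md", 2000)]))
# ['a.md', 'b.md', 'a.pdf']
```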

Longer-Term Design Bets

The following ideas change more than commit ordering. They alter the boundary between local artifacts, global Markdown generation, registry visibility, and rollback semantics. I would treat them as architecture bets, not simple --order strategies.

Long-Term Bet: Batch/Global Consolidation

My strongest long-term bet is to reduce order sensitivity by separating local document processing from global Markdown generation.

Phase 1: per-document local generation
  - source markdown
  - local summary
  - local entities
  - local concepts
  - local links

Phase 2: batch/global consolidation
  - merge concepts
  - dedupe entities
  - rebuild index
  - regenerate global summaries

I expect this to produce more stable and globally coherent Markdown than relying on incremental per-file updates.

Long-Term Bet: Parallel Local Candidates, Serial Global Merge

Another direction I think is worth exploring is to run expensive LLM/index candidate generation in parallel while keeping official writes serial:

parallel prepare
  -> local summaries / concepts / entities / link candidates in private staging
  -> serial validation and merge against current KB state
  -> official Markdown write

This can expose more parallelism than prepare-only execution. It is reliable only if worker output remains private and the merge step validates candidates against the current official KB state before publishing.

Structural Alternative: Local Artifacts First, Global Compile Later

A more structural alternative is to split local artifact publication from global wiki generation:

Phase 1: publish raw/source/local metadata with serial mutation safety
Phase 2: generate or regenerate global wiki outputs from the batch

This can reduce per-document ordering effects, but it requires a new commit point and failure contract. For example, OpenKB must decide whether a document with committed raw/source artifacts but failed global compilation is visible, retryable, or rolled back.

It also changes registry semantics. Today the add path registers a hash only after successful compilation. A two-phase design must explicitly define:

when a document becomes visible to duplicate detection;
whether openkb remove can remove a document whose raw/source artifacts are committed but whose global wiki output failed;
whether retry reuses the existing registry entry or creates a new mutation;
how failed global compile state is recorded and recovered.

Candidate Ordering Modes

| Mode | Core idea | Throughput | Reproducibility | Markdown quality effect | Implementation cost | Role |
| --- | --- | --- | --- | --- | --- | --- |
| Current: collect-all + canonical input/path-order commit | Prepare all files in parallel, then commit canonical input/path order | Low-medium | High | Matches current serial path-order semantics | Implemented | Baseline |
| Input-order streaming queue | Commit the next input as soon as its prepare is ready | Medium within one batch | High for completed runs; different partial-progress behavior on interrupt | Same intended semantics as current baseline | Low-medium | Recommended default improvement |
| Completion-order serial commit | Commit whichever prepare finishes first | High within one batch | Low | Depends on completion order and current wiki/registry state | Code: low-medium; contract/evaluation: medium-high | Explicit throughput mode |
| Deterministic quality order | Compute a stable quality-oriented batch order | Medium | High | Potentially improves coverage and reduces redundancy | Medium | Experimental quality mode |
| User-selected explicit order | Commit according to a validated manifest | Medium | High | User controls natural or dependency order | Medium | Advanced batch mode |
| Deterministic cost order | Commit estimated cheap work first | Medium-high | High | Changes semantic order; benefit depends on estimates | Medium | Lower-priority experiment |

For streaming input order, "High" reproducibility means completed runs produce the same official state as canonical input/path-order commit for the same successful inputs. Failure and interrupt behavior is a separate contract because streaming can validly leave earlier input-order commits in place.

CLI Contract I Would Make Explicit

Before implementing optional ordering modes, I would make the CLI contract explicit instead of treating ordering as an internal scheduler flag:

| CLI form | Meaning | Reproducibility contract | Notes |
| --- | --- | --- | --- |
| `openkb add docs/ --jobs N` | Default path/input-order commit | Matches CLI `--jobs=1` for completed runs with the same successful inputs | No extra metadata or LLM calls |
| `openkb add docs/ --jobs N --order path` | Explicit default order | Same as default | Useful for scripts that want a stable named mode |
| `openkb add docs/ --jobs N --order completion` | Completion-order commit | Not reproducible across runs or machines | Must print or document non-reproducible official KB state |
| `openkb add docs/ --jobs N --order quality` | Deterministic quality order | Reproducible if metadata extractors and model versions are fixed | Requires dry-run inspection before broad use |
| `openkb add docs/ --jobs N --order manifest --order-file manifest.txt` | User-specified order | Reproducible if manifest is valid | Reject duplicate or out-of-input paths |
| `openkb add docs/ --jobs N --order cost` | Deterministic cost order | Reproducible if cost estimator version is fixed | Lower priority experiment |

I would scope the initial implementation to local directory ingestion, because that is where --jobs already applies. Single-file, URL, and PageIndex Cloud ingest should keep their current behavior unless a separate batch API is introduced.

Candidate Architecture Modes

| Mode | Core idea | Throughput | Reproducibility | Markdown quality effect | Implementation cost | Role |
| --- | --- | --- | --- | --- | --- | --- |
| Two-phase local artifacts then global compile | Commit local artifacts first, compile global Markdown later | Medium-high | High | Can make global output more stable | High | Medium/long-term architecture |
| Parallel local candidates + serial merge | Generate private candidates in parallel, validate and merge serially | High | Medium-high | High potential, but stale candidates need validation | High | Medium/long-term performance |
| Batch/global consolidation | Build global graph/structure, then generate consolidated Markdown | Medium-high | High | Highest potential for global coherence | High | Long-term target |

These architecture modes require coordinator-level changes. The current add mutation body publishes staged artifacts, runs compile/index work, and writes the registry entry inside one AddMutationPlan. Splitting those phases changes the commit point and cannot be implemented as a local ordering policy alone.

Rejected Ordering/Write Strategies

The following strategies are not reliable under the current mutation model:

Fully parallel official commits from multiple workers.
Worker writes directly into official wiki/, raw/, .openkb/, registry, journal, PageIndex DB, concepts, entities, or index.md state.
Prepare-time final doc_name reservation in official state.
Sharded official commits that split raw/source writes from shared global wiki writes without first redesigning the coordinator and commit point.

These designs conflict with the current serial mutation owner model. They also make rollback dangerous because an earlier failed mutation can snapshot and restore shared paths over later committed work.
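
The rollback hazard can be shown with a toy model in which official state is a dict and a mutation snapshots shared paths before writing. Nothing here mirrors the real journal or AddMutationPlan; it only demonstrates the interleaving.

```python
import copy


def run_interleaved(official):
    """Show a failed mutation's rollback erasing a later committed mutation."""
    snapshot_a = copy.deepcopy(official)       # mutation A snapshots shared paths
    official["wiki/index.md"].append("a.md")   # A starts writing

    official["wiki/index.md"].append("b.md")   # B commits while A is in flight

    # A fails and restores its snapshot over the shared path.
    official.clear()
    official.update(snapshot_a)
    return official


state = run_interleaved({"wiki/index.md": ["existing.md"]})
print(state["wiki/index.md"])  # ['existing.md']
```

The committed entry for `b.md` is gone after A's rollback, which is why official writes stay behind one serial owner.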

Implications For Parallel Architecture

Prepare workers may compute ordering features, but must not decide official KB state.

Prepare workers may:

hash files;
parse headings;
extract metadata;
compute embeddings;
extract local entities;
write private staging.

Prepare workers must not:

write official wiki/;
write official raw/;
write official .openkb/ state outside private staging;
decide final doc_name;
update concepts, entities, or index.md.

Commit remains serial:

batch_order = compute_order(prepared_docs)
for doc in batch_order:
    serial_commit(doc)

Worker completion order must never define commit order unless the user has explicitly selected a non-reproducible completion-order throughput mode.

Risks I Would Keep Visible

LLM output may still be nondeterministic.
Embedding or model upgrades can change quality order.
Completion order can make official KB state depend on worker scheduling and transient runtime conditions, including Markdown, registry ownership, and final artifact ownership.
Quality order can be harder for users to predict than path order.
Cost order can be wrong if estimates do not match actual indexing or LLM latency.
Batch/global consolidation increases rollback surface.
Global consolidation requires stronger evaluation than path-order ingest.

This leads to the proposal map I would use:

current = collect-all prepare + canonical input/path-order commit
near-term default schedule = input-order streaming queue
experimental throughput = completion order
experimental deterministic = quality / manifest / cost order
long-term = serial merge or batch/global consolidation

Proposed Implementation Path

Keep current collect-all canonical input/path-order implementation as the regression baseline.
Add input-order streaming commit so commit can begin before every prepare has completed while preserving the same visible commit order for completed runs.
Add metadata extraction during prepare.

Add dry-run order inspection:

openkb add docs/ --jobs 4 --order quality --dry-run-order

Add deterministic opt-in order modes first: quality, manifest, and possibly cost.
Add completion only as an explicitly non-reproducible throughput mode.
Later add batch-level concepts / entities / index.md consolidation.

Streaming must include explicit tests for partial-progress behavior:

earlier input-order commits remain committed when a later prepare fails;
an early prepare-failed slot advances next_to_commit so later prepared documents can commit;
a cleanly rolled-back commit failure records "failed" and advances next_to_commit;
KeyboardInterrupt during prepare or commit leaves no dirty staging;
dirty rollback aborts the remaining batch;
duplicate hashes and name collisions are revalidated at serial commit time.

Stage A should also add a true CLI-level regression test comparing directory ingest through --jobs=1 and --jobs>1, because the former currently uses add_single_file() while the latter enters the PreparedDocument path.

How I Would Evaluate This

Use these metrics to compare path order, streaming input order, completion order, quality order, manifest order, cost order, and global consolidation:

concept/entity deduplication rate;
index coverage;
dead link count;
cross-reference count;
LLM judge score for summary completeness;
human review samples;
repeated-run Markdown diff stability;
downstream query answer quality;
wall-clock ingest time;
time to first committed document;
time spent blocked behind earlier unfinished prepare work;
partial-progress recovery behavior after prepare failure, commit interrupt, and dirty rollback.
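
Two of these metrics are cheap to compute mechanically. The sketch below assumes wikilinks use a `[[target]]` or `[[target|label]]` form and that a run can be represented as a mapping from file name to Markdown text; both are assumptions about the KB layout, and the function names are hypothetical.

```python
import difflib
import re

WIKILINK = re.compile(r"\[\[([^\]|#]+)")   # captures the target of [[target...]]


def dead_link_count(pages):
    """pages: {page_name: markdown_text}; count links to pages that do not exist."""
    return sum(
        1
        for text in pages.values()
        for target in WIKILINK.findall(text)
        if target.strip() not in pages
    )


def diff_stability(run_a, run_b):
    """Mean similarity ratio over files seen in either run (1.0 = identical)."""
    names = sorted(set(run_a) | set(run_b))
    if not names:
        return 1.0
    ratios = [
        difflib.SequenceMatcher(None, run_a.get(n, ""), run_b.get(n, "")).ratio()
        for n in names
    ]
    return sum(ratios) / len(ratios)


pages = {
    "index": "See [[concepts/lock]] and [[missing|label]]",
    "concepts/lock": "Back to [[index]]",
}
print(dead_link_count(pages))          # 1
print(diff_stability(pages, pages))    # 1.0
```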

Minimum acceptance gates:

| Gate | Required evidence |
| --- | --- |
| Baseline equivalence | CLI `--jobs=1` and CLI `--jobs=N` produce the same official state for deterministic fixtures with the same successful inputs |
| Streaming scheduler correctness | Outcome contract tests cover prepare success, prepare failure, `added`, `skipped`, clean `failed`, dirty rollback, and interrupt |
| Staging hygiene | No private prepare staging remains after success, clean failure, interrupt, or dirty rollback tests |
| Commit-owner invariant | Workers never call `serial_commit` and never write official `wiki/`, `raw/`, or `.openkb/` state outside private staging |
| Default reproducibility | Repeated default/path-order runs produce stable official state under mocked LLM/index outputs |
| Completion mode isolation | Completion order is opt-in, documented as non-reproducible, and still commits only on the lock-owning thread |
| Metadata transparency | Experimental deterministic modes expose dry-run order, metadata inputs, missing fields, and tie-breaks |
| Performance signal | Streaming demonstrates earlier time to first committed document on a fixture with slow later prepare work, without changing completed-run state |

Proposed Roadmap

Stage A: current collect-all canonical input/path order
  -> ensure CLI --jobs=N matches CLI --jobs=1 for official KB state

Stage B: input-order streaming queue
  -> begin serial commits as soon as the next canonical input is prepared
  -> document and test partial-progress behavior on failure or interrupt

Stage C: metadata extraction
  -> title, date, links, embeddings, entities

Stage D: deterministic optional order modes
  -> coverage + diversity + dependency + stable tie-break
  -> manifest order for user-controlled corpora
  -> optional deterministic cost order

Stage E: explicit completion-order throughput mode
  -> opt-in only; document non-reproducible official KB state

Stage F: parallel local candidates with serial merge
  -> private LLM/index candidates validated before official write

Stage G: batch/global consolidation
  -> GraphRAG-style entity graph and community summaries

Stage H: evaluation
  -> compare baseline vs streaming vs completion vs quality vs global consolidation

My Recommendation

I would treat OpenKB ordering as four layers:

Correctness: official commit order must be deterministic unless the user explicitly opts into completion order.
Throughput: input-order streaming should improve utilization without changing official semantics for completed runs, while adopting a documented partial-progress contract on failure or interrupt; completion order can exist only as opt-in.
Quality: optional quality order can optimize coverage, diversity, dependency structure, and position effects.
Architecture: long-term global consolidation should reduce dependence on document commit order.

My pragmatic path is:

Keep current collect-all canonical input/path order as the baseline.
Move the default schedule to input-order streaming for better throughput.
Offer completion order only as an explicit non-reproducible throughput mode.
Offer deterministic quality/manifest order as experimental modes.
Move global Markdown toward GraphRAG-style batch consolidation over time.

gwokhou · 2026-06-30T06:06:27Z

@KylinMountain Hi Kylin, I hope this proposal helps address your concern about "does concurrency earn its keep?"
