@KylinMountain Hi Kylin, I hope this proposal helps with your concern about "does concurrency earn its keep?"
Parallel Add Ordering: A Research Note And Proposal
This document is an opinionated research note. I am using the current OpenKB parallel-add implementation as the grounding point, then proposing how batch ordering should evolve when LLM- or index-generated Markdown quality is affected by write order.
The core idea is that ordering should be treated as a product and architecture decision, not as an incidental side effect of thread scheduling. Prepare work can be parallel; official commits still need a deliberate, reviewable order.
Positioning: this is not a full roadmap. It is a research-backed design proposal that contributes a direction, evaluates alternatives, and leaves an implementation path that can later be split into a narrower spec.
Context
This proposal assumes the parallel architecture roadmap has reached the point where the prepare/commit split is real:
- .openkb/staging/prepare/... output;

In the current implementation, the calling thread holds the exclusive KB ingest lock across both prepare and commit. That means these strategies improve throughput inside one directory-add batch; they do not make multiple commands mutate the same KB concurrently.
Under that architecture, mutation safety is mostly a solved boundary problem: workers can prepare disposable data, and official KB writes remain serial. The interesting design question is not "can workers write concurrently?" They cannot. The question I want to answer is which prepared document the serial owner should commit next, and what tradeoff that choice makes visible to users.
Commit order matters because each commit observes the official KB state left by the commits before it. OpenKB's compile path does not generate isolated per-file Markdown only; it reads and updates shared wiki context. The final contents of generated Markdown files can therefore be order-sensitive, especially for:
- wiki/summaries/
- wiki/concepts/
- wiki/entities/
- wiki/index.md

The main order-sensitive mechanisms are:
- index.md entries against the wiki state it sees.
- doc_name resolution and registry deduplication happen under the serial owner. Order can affect which source first claims a clean name and which later source receives a deterministic suffix when names collide.

My goal is not simply to commit whichever file finishes first. The design target is stable, complete, low-redundancy Markdown while preserving deterministic and recoverable mutation semantics.
Working Thesis
Research Threads Behind The Proposal
GraphRAG: Build Global Structure Before Generating Global Markdown
GraphRAG's key lesson is to avoid treating global summaries as an incidental side effect of per-document order. Instead, extract entities and relationships, build a graph, cluster it, and then generate summaries from that global structure.
This is directly relevant to OpenKB outputs such as concepts, entities, and index.md.
Reference:
Potential OpenKB flow:
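One way to picture a graph-first flow is the sketch below. It assumes a hypothetical mapping from document path to extracted entity names, and it uses plain connected components as a simple stand-in for the community detection step that GraphRAG describes. None of these function names are OpenKB APIs.

```python
from collections import defaultdict
from itertools import combinations


def build_entity_graph(doc_entities):
    """Link two entities whenever they co-occur in the same document."""
    graph = defaultdict(set)
    for entities in doc_entities.values():
        for entity in entities:
            graph[entity]  # register isolated entities too
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def cluster_entities(graph):
    """Connected components, used here in place of real community detection."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Hypothetical per-document entity lists, e.g. collected during prepare.
doc_entities = {
    "b.md": ["commit", "lock"],
    "a.md": ["prepare", "commit"],
    "c.md": ["embedding"],
}
clusters = cluster_entities(build_entity_graph(doc_entities))
```

Because the clusters are derived from the whole batch, they do not depend on the order in which documents were processed; global pages could then be generated once per cluster.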
MMR And Submodular Summarization: Coverage With Low Redundancy
MMR and submodular summarization optimize two competing objectives:
This is useful for ordering documents in a batch when many files overlap.
References:
Example scoring model:
Possible feature definitions:
- centrality: similarity to the batch centroid.
- coverage_gain: new topics, entities, or headings introduced by the document.
- redundancy: overlap with already selected documents.
- metadata_priority: deterministic boosts for README, overview, spec, date, or other known high-signal markers.

Tie-breaks should always be deterministic, such as normalized path order.
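A minimal sketch of a greedy, MMR-style ordering built from these features follows. The weights, the set-based topic representation, and the function name order_batch are assumptions for illustration; centrality is left out because it would need embeddings.

```python
def order_batch(docs, w_cov=1.0, w_red=0.5):
    """Greedy MMR-style ordering.

    docs maps a normalized path to (topics, metadata_priority), where topics
    is a set of topic/entity/heading labels. Weights are hypothetical.
    """
    remaining = dict(docs)
    covered, order = set(), []
    while remaining:
        def score(path):
            topics, priority = remaining[path]
            coverage_gain = len(topics - covered)   # labels not seen yet
            redundancy = len(topics & covered)      # labels already covered
            return w_cov * coverage_gain - w_red * redundancy + priority

        # Highest score wins; ties fall back to normalized path order.
        best = min(remaining, key=lambda p: (-score(p), p))
        order.append(best)
        covered |= remaining.pop(best)[0]
    return order


batch = {
    "docs/zeta.md": ({"locks", "staging"}, 0.0),
    "docs/alpha.md": ({"locks", "staging"}, 0.0),
    "docs/readme.md": ({"overview"}, 2.0),
    "docs/rollback.md": ({"rollback", "locks"}, 0.0),
}
ordered = order_batch(batch)
# ordered == ["docs/readme.md", "docs/alpha.md", "docs/rollback.md", "docs/zeta.md"]
```

The README wins the first slot through its metadata boost, the alpha/zeta tie is settled by path, and zeta drops to the end once its topics are already covered.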
Content Ordering: Respect Natural Narrative And Dependency Structure
Some corpora have a natural order: background before design, design before implementation, implementation before results. Content ordering research captures this idea and is useful for tutorials, papers, specs, and logs.
Reference:
Practical OpenKB heuristics:
Engineering signals:
- README, overview, intro, architecture, design, spec;
- Introduction, Background, Method, Results;

Long-Context Position Effects
When several documents or summaries are placed into one prompt, model behavior depends on position. Work on long-context usage shows that information in the middle can be underused compared with information near the beginning or end.
References:
OpenKB implication:
Prompt layout pattern:
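One possible layout, sketched under the assumption that items arrive already ranked from most to least important, is to alternate them between the front and the back of the prompt so the weakest items land in the middle. The function name edge_first_layout is hypothetical.

```python
def edge_first_layout(ranked_items):
    """Alternate items between front and back so low-ranked ones sit mid-prompt."""
    front, back = [], []
    for position, item in enumerate(ranked_items):
        (front if position % 2 == 0 else back).append(item)
    return front + back[::-1]


# "s1" is the most important summary, "s5" the least.
layout = edge_first_layout(["s1", "s2", "s3", "s4", "s5"])
# layout == ["s1", "s3", "s5", "s4", "s2"]
```

The two highest-ranked items occupy the first and last positions, which is where the long-context findings above suggest information is used best.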
Proposed OpenKB Strategy
Current Baseline: Collect-All Prepare, Canonical Input/Path-Order Commit
The current --jobs > 1 local directory implementation is:
This is implemented by openkb.parallel_add.add_directory_with_jobs():
- ThreadPoolExecutor and are collected with as_completed;
- range(len(files));
- commit_prepared_document() requires the lock-owning caller thread and commits through the serial mutation path.

This baseline is reliable and reproducible, but it leaves throughput on the table because commit does not start until all prepare work has completed.
The CLI serial --jobs=1 directory path still loops over add_single_file() and does not use the PreparedDocument handoff. Existing parallel-add tests compare the internal prepared add_directory_with_jobs(..., jobs=1) and jobs=2 paths, but they do not yet cover the real CLI --jobs=1 versus --jobs>1 split.
Default Ordering Policy: Stable Path Order
My proposed default ordering policy is intentionally conservative:
Why I would keep it:
This is an ordering policy, not a scheduling requirement. OpenKB can still execute the policy with a better schedule, such as an input-order streaming queue, as long as the visible commit order remains sorted(path).
Proposed Near-Term Schedule: Input-Order Streaming Queue
My preferred near-term improvement is to turn prepare into a producer pool and commit into a single input-ordered consumer. This gives us earlier commits without changing which document is allowed to commit next:
A prepare failure is a terminal input slot for ordering purposes. It must advance next_to_commit after recording "failed"; otherwise one early failed prepare would permanently block later successful prepared documents.
A normal commit failure is also a handled terminal outcome. If the serial mutation rolls back cleanly and returns "failed", the scheduler should record that outcome and continue to the next input slot. Only dirty rollback, KeyboardInterrupt, SystemExit, or another propagated BaseException should abort the streaming scheduler.
Streaming outcome contract:
- Prepare succeeds: buffer the PreparedDocument; commit when next_to_commit reaches it.
- Prepare raises Exception: record failed and advance.
- Commit returns added: record added.
- Commit returns skipped: record skipped.
- Commit returns failed after clean rollback: record failed and continue.
- Dirty rollback, KeyboardInterrupt, SystemExit, or propagated BaseException: abort the scheduler.

For completed runs with the same successful inputs, this preserves the existing official-state contract:
- --jobs=N should match --jobs=1 for the same successful inputs;

It does not eliminate head-of-line blocking. If input 0 is slow and inputs 1-9 are ready, commits still wait for input 0. That is an explicit reproducibility tradeoff rather than a correctness problem.
It also changes interrupt behavior. The current collect-all implementation does not commit anything until all prepare work has finished; a prepare-phase interrupt can therefore leave no official commits from that batch. A streaming implementation can commit earlier documents while later prepare work is still running. If a later prepare or user interrupt fails the command, the earlier successful commits should remain committed. That is a valid partial-progress contract, but it must be documented and tested separately from the current collect-all behavior.
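The input-order gate can be sketched as follows. This is a simplified illustration, not OpenKB's scheduler: prepare and commit are hypothetical callables, commit is assumed to return "added", "skipped", or "failed", and future cancellation, staging cleanup, and dirty-rollback handling are deliberately left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def stream_in_input_order(files, prepare, commit, jobs=4):
    """Prepare in parallel; commit on the calling thread, strictly in input order."""
    outcomes = [None] * len(files)
    ready = {}              # input index -> prepared result waiting for its turn
    next_to_commit = 0

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(prepare, path): i for i, path in enumerate(files)}
        for future in as_completed(futures):
            index = futures[future]
            try:
                ready[index] = future.result()
            except Exception:
                # A failed prepare is a terminal slot; it must not block later inputs.
                outcomes[index] = "failed"
            # Drain every slot that is now unblocked, in input order.
            while next_to_commit < len(files) and (
                next_to_commit in ready or outcomes[next_to_commit] is not None
            ):
                if next_to_commit in ready:
                    outcomes[next_to_commit] = commit(ready.pop(next_to_commit))
                next_to_commit += 1
    return outcomes


def fake_prepare(path):
    if path == "b.md":
        raise ValueError("conversion failed")
    time.sleep(0.05 if path == "a.md" else 0.0)   # a.md finishes last
    return path


committed = []


def fake_commit(prepared):
    committed.append(prepared)
    return "added"


outcomes = stream_in_input_order(["a.md", "b.md", "c.md"], fake_prepare, fake_commit, jobs=3)
```

Even though c.md is prepared before a.md, it stays buffered until a.md has committed and the failed b.md slot has been skipped, so the visible commit order is a.md and then c.md.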
Implementing streaming is an orchestration change, not just a different final for index in ... loop. The executor lifecycle, ready-result buffer, next_to_commit gate, future cancellation, staging cleanup, and dirty-rollback abort path must be coordinated in one scheduler.
Experimental Throughput Idea: Completion Order
If we want a mode that maximizes throughput rather than reproducibility, I would make it explicit:
In this mode, whichever prepare finishes first enters the serial commit path first:
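A rough sketch of that loop, assuming hypothetical prepare and serial_commit callables rather than OpenKB's real functions, could look like this. Workers only run prepare; every commit happens on the thread that iterates the completed futures.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def commit_in_completion_order(files, prepare, serial_commit, jobs=4):
    """Commit each document as soon as its prepare finishes (order not reproducible)."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = {pool.submit(prepare, path): path for path in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                prepared = future.result()
            except Exception:
                outcomes[path] = "failed"
                continue
            # Runs on the calling thread; workers never call serial_commit.
            outcomes[path] = serial_commit(prepared)
    return outcomes


commit_threads = []


def record_commit(prepared):
    commit_threads.append(threading.get_ident())
    return "added"


result = commit_in_completion_order(["a.md", "b.md"], lambda p: p.upper(), record_commit)
```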
Only the lock-owning main thread may run serial_commit. Completion-order mode must not let worker callbacks commit directly; workers still only produce private prepared output.
This can reduce head-of-line blocking and improve batch throughput. However, it makes official KB state depend on worker scheduling, machine performance, file sizes, and transient LLM/indexing latency. That includes generated Markdown, registry ownership for duplicate hashes, final doc_name ownership for name collisions, and raw/source artifact ownership. It matters because each commit can read and rewrite shared global wiki state such as:
- wiki/concepts/;
- wiki/entities/;
- wiki/index.md;

My view is that completion order is acceptable only as an explicit opt-in mode for users who prefer throughput over reproducibility. It should not be the default, and its implementation cost is not just the code change: it also requires a new contract, test expectations, evaluation baselines, and documentation for the fact that --jobs=N may no longer match --jobs=1 official KB state.
Experimental Quality Idea: Quality Order
I would add an opt-in quality ordering mode only after deterministic path order is stable:
Suggested pipeline:
This requires extending the prepare handoff. Today PreparedDocument carries only the input index, source path, private staging path, and conversion result. Quality and cost modes need an explicit metadata schema for titles, headings, timestamps, links, hashes, embeddings, entities, estimated cost, and any errors or unavailable fields. Manifest mode mostly needs parser, validation, and coverage reporting, but should share the same dry-run order inspection surface. The dry-run output should expose the relevant metadata and the computed order so users can inspect why a batch will commit in that order.
Prepare-time metadata is advisory. Prepare currently reads the registry for duplicate detection and produces only a candidate document name; a streaming batch can commit earlier documents while later prepared results still carry registry observations from before those commits. Therefore duplicate/skip status, final doc_name, and official artifact ownership must be revalidated inside the serial commit owner.
Expensive metadata such as embeddings or entity extraction should remain opt-in. The default path-order mode should not require extra LLM calls.
Minimum metadata schema:
input_index, source_path, file_hash, doc_name_candidate, doc_name, title, headings, links, timestamp, estimated_cost, embedding, entities, metadata_errors
Dry-run output should include input_index, normalized path, selected rank, mode-specific scoring inputs, unavailable metadata fields, and the final tie-break that decided each adjacent ordering decision.
Pseudo-code:
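As a sketch only, the metadata record and a dry-run ordering report might look like the following. PreparedMetadata, dry_run_order, and the heading-count priority are hypothetical; the record carries a subset of the schema fields, and it holds only doc_name_candidate on the assumption that the final doc_name is decided at commit time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreparedMetadata:
    """Advisory prepare-time metadata (subset of the schema fields)."""
    input_index: int
    source_path: str                 # assumed to be already normalized
    file_hash: str
    doc_name_candidate: str          # final doc_name is resolved at commit
    title: Optional[str] = None
    headings: tuple = ()
    estimated_cost: Optional[float] = None
    metadata_errors: tuple = ()


def dry_run_order(items, priority):
    """Return (rank, path, decided_by) rows for a deterministic order."""
    ordered = sorted(items, key=lambda m: (-priority(m), m.source_path))
    rows = []
    for rank, meta in enumerate(ordered):
        if rank == 0:
            decided_by = "first"
        elif priority(ordered[rank - 1]) != priority(meta):
            decided_by = "priority"
        else:
            decided_by = "path tie-break"
        rows.append((rank, meta.source_path, decided_by))
    return rows


items = [
    PreparedMetadata(0, "docs/b.md", "h-b", "b", headings=("Intro", "Design")),
    PreparedMetadata(1, "docs/a.md", "h-a", "a", headings=("Intro", "Design")),
    PreparedMetadata(2, "docs/c.md", "h-c", "c", metadata_errors=("headings unavailable",)),
]
rows = dry_run_order(items, priority=lambda m: len(m.headings))
```

Each row records whether the score or the path tie-break placed a document after its predecessor, which is the kind of explanation the dry-run surface is meant to expose.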
Experimental User-Control Idea: Explicit Order
Some corpora have an authoritative order that OpenKB cannot infer reliably, such as course chapters, meeting logs, design histories, or paper collections. For those cases, I think a manifest-driven mode is the right escape hatch:
The manifest must be validated strictly:
This mode trades automation for user control while keeping reproducibility.
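A strict validator could be sketched along these lines. The three rules shown (no duplicates, no entries outside the batch, no batch files missing from the manifest) and the one-path-per-line format are assumptions for illustration, as is the name validate_manifest.

```python
def validate_manifest(manifest_lines, batch_paths):
    """Return the manifest order, or raise ValueError listing every problem."""
    entries = [line.strip() for line in manifest_lines if line.strip()]
    batch = set(batch_paths)
    problems, seen = [], set()
    for entry in entries:
        if entry in seen:
            problems.append(f"duplicate entry: {entry}")
        elif entry not in batch:
            problems.append(f"not in batch: {entry}")
        seen.add(entry)
    # Coverage check: every file in the batch must be ordered explicitly.
    for missing in sorted(batch - seen):
        problems.append(f"missing from manifest: {missing}")
    if problems:
        raise ValueError("; ".join(problems))
    return entries


order = validate_manifest(["ch2.md\n", "ch1.md\n", "\n"], ["ch1.md", "ch2.md"])
# order == ["ch2.md", "ch1.md"]
```

Reporting all problems at once, instead of stopping at the first, keeps the failure actionable for long manifests.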
Experimental Latency Idea: Deterministic Cost Order
A deterministic cost-oriented mode could commit estimated cheap work first:
This may improve perceived throughput, but it changes wiki generation order and its cost estimates can be wrong. I would keep it lower priority than input-order streaming and quality order.
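For illustration, a deterministic cost order reduces to a sort on the estimate with the normalized path as tie-break. The cost numbers below are made up, and cost_order is a hypothetical helper.

```python
def cost_order(estimated_costs):
    """Cheapest first; equal estimates fall back to normalized path order."""
    return sorted(estimated_costs, key=lambda path: (estimated_costs[path], path))


order = cost_order({"docs/big.pdf": 120.0, "docs/b.md": 3.0, "docs/a.md": 3.0})
# order == ["docs/a.md", "docs/b.md", "docs/big.pdf"]
```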
Longer-Term Design Bets
The following ideas change more than commit ordering. They alter the boundary between local artifacts, global Markdown generation, registry visibility, and rollback semantics. I would treat them as architecture bets, not simple --order strategies.
Long-Term Bet: Batch/Global Consolidation
My strongest long-term bet is to reduce order sensitivity by separating local document processing from global Markdown generation.
I expect this to produce more stable and globally coherent Markdown than relying on incremental per-file updates.
Long-Term Bet: Parallel Local Candidates, Serial Global Merge
Another direction I think is worth exploring is to run expensive LLM/index candidate generation in parallel while keeping official writes serial:
This can expose more parallelism than prepare-only execution. It is reliable only if worker output remains private and the merge step validates candidates against the current official KB state before publishing.
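The validation step can be pictured with a small sketch in which official state is modeled as two in-memory sets. The data shapes, the function name merge_candidate, and the "-2" suffix format are all assumptions, not OpenKB behavior.

```python
def merge_candidate(official, candidate):
    """Publish one privately prepared candidate after revalidating it.

    official:  {"hashes": set of committed file hashes, "names": set of doc names}
    candidate: {"file_hash": str, "doc_name_candidate": str}
    Must run on the serial owner, against the current official state.
    """
    if candidate["file_hash"] in official["hashes"]:
        return "skipped", None           # same content was committed earlier
    base = candidate["doc_name_candidate"]
    name, suffix = base, 1
    while name in official["names"]:
        suffix += 1
        name = f"{base}-{suffix}"        # hypothetical deterministic suffix
    official["hashes"].add(candidate["file_hash"])
    official["names"].add(name)
    return "added", name


official = {"hashes": set(), "names": set()}
candidates = [
    {"file_hash": "h1", "doc_name_candidate": "notes"},
    {"file_hash": "h2", "doc_name_candidate": "notes"},
    {"file_hash": "h1", "doc_name_candidate": "copy"},
]
results = [merge_candidate(official, c) for c in candidates]
# results == [("added", "notes"), ("added", "notes-2"), ("skipped", None)]
```

The duplicate and name-collision decisions are made at merge time, so stale observations carried by a candidate from before earlier commits cannot leak into official state.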
Structural Alternative: Local Artifacts First, Global Compile Later
A more structural alternative is to split local artifact publication from global wiki generation:
This can reduce per-document ordering effects, but it requires a new commit point and failure contract. For example, OpenKB must decide whether a document with committed raw/source artifacts but failed global compilation is visible, retryable, or rolled back.
It also changes registry semantics. Today the add path registers a hash only after successful compilation. A two-phase design must explicitly define:
- openkb remove can remove a document whose raw/source artifacts are committed but whose global wiki output failed;

Candidate Ordering Modes
For streaming input order, "High" reproducibility means completed runs produce the same official state as canonical input/path-order commit for the same successful inputs. Failure and interrupt behavior is a separate contract because streaming can validly leave earlier input-order commits in place.
CLI Contract I Would Make Explicit
Before implementing optional ordering modes, I would make the CLI contract explicit instead of treating ordering as an internal scheduler flag:
- openkb add docs/ --jobs N: matches --jobs=1 for completed runs with the same successful inputs
- openkb add docs/ --jobs N --order path
- openkb add docs/ --jobs N --order completion
- openkb add docs/ --jobs N --order quality
- openkb add docs/ --jobs N --order manifest --order-file manifest.txt
- openkb add docs/ --jobs N --order cost

I would scope the initial implementation to local directory ingestion, because that is where --jobs already applies. Single-file, URL, and PageIndex Cloud ingest should keep their current behavior unless a separate batch API is introduced.
Candidate Architecture Modes
These architecture modes require coordinator-level changes. The current add mutation body publishes staged artifacts, runs compile/index work, and writes the registry entry inside one AddMutationPlan. Splitting those phases changes the commit point and cannot be implemented as a local ordering policy alone.
Rejected Ordering/Write Strategies
The following strategies are not reliable under the current mutation model:
- wiki/, raw/, .openkb/, registry, journal, PageIndex DB, concepts, entities, or index.md state.
- doc_name reservation in official state.

These designs conflict with the current serial mutation owner model. They also make rollback dangerous because an earlier failed mutation can snapshot and restore shared paths over later committed work.
Implications For Parallel Architecture
Prepare workers may compute ordering features, but must not decide official KB state.
Prepare workers may:
Prepare workers must not:
- wiki/;
- raw/;
- .openkb/ state outside private staging;
- doc_name;
- concepts, entities, or index.md.

Commit remains serial:
Worker completion order must never define commit order unless the user has explicitly selected a non-reproducible completion-order throughput mode.
Risks I Would Keep Visible
- quality order.

This leads to the proposal map I would use:
Proposed Implementation Path
Keep current collect-all canonical input/path-order implementation as the regression baseline.
Add input-order streaming commit so commit can begin before every prepare has completed while preserving the same visible commit order for completed runs.
Add metadata extraction during prepare.
Add dry-run order inspection:
Add deterministic opt-in order modes first: quality, manifest, and possibly cost.
Add completion only as an explicitly non-reproducible throughput mode.
Later add batch-level concepts/entities/index.md consolidation.
Streaming must include explicit tests for partial-progress behavior:
- next_to_commit so later prepared documents can commit;
- "failed" and advances next_to_commit;
- KeyboardInterrupt during prepare or commit leaves no dirty staging;

Stage A should also add a true CLI-level regression test comparing directory ingest through --jobs=1 and --jobs>1, because the former currently uses add_single_file() while the latter enters the PreparedDocument path.
How I Would Evaluate This
Use these to compare path order, streaming input order, completion order, quality order, manifest order, cost order, and global consolidation:
Minimum acceptance gates:
- --jobs=1 and CLI --jobs=N produce the same official state for deterministic fixtures with the same successful inputs
- added, skipped, clean failed, dirty rollback, and interrupt
- serial_commit and never write official wiki/, raw/, or .openkb/ state outside private staging

Proposed Roadmap
My Recommendation
I would treat OpenKB ordering as four layers:
My pragmatic path is: